2023-12-18 10:51:27

by Ryan Roberts

Subject: [PATCH v4 00/16] Transparent Contiguous PTEs for User Mappings

Hi All,

This is a series to opportunistically and transparently use contpte mappings
(set the contiguous bit in ptes) for user memory when those mappings meet the
requirements. It is part of a wider effort to improve performance by allocating
and mapping variable-sized blocks of memory (folios). One aim is for the 4K
kernel to approach the performance of the 16K kernel, but without breaking
compatibility and without the associated increase in memory use. Another aim is to
benefit the 16K and 64K kernels by enabling 2M THP, since this is the contpte
size for those kernels. We have good performance data that demonstrates both
aims are being met (see below).
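
For readers new to the feature, here is a minimal sketch of the idea, using
arm64's 4K-page geometry as an assumption (16 consecutive ptes covering a
naturally aligned 64K block may carry the contiguous hint); this is
illustrative only, not code from the series:

/*
 * Illustrative only: mark a naturally aligned, physically contiguous
 * run of 16 ptes with the contiguous hint (PTE_CONT on arm64) so the
 * TLB can cover the whole 64K block with a single entry.
 */
#define CONT_PTES	16

static void contpte_map_block(pte_t *ptep, pte_t pte)
{
	int i;

	for (i = 0; i < CONT_PTES; i++, ptep++) {
		set_pte(ptep, pte_mkcont(pte));
		pte_val(pte) += PAGE_SIZE;	/* advance to the next 4K page */
	}
}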

Of course this is only one half of the change. We require the mapped physical
memory to be the correct size and alignment for this to actually be useful (i.e.
64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are solving this
problem for us. Filesystems that support it (XFS, AFS, EROFS, tmpfs, ...) will
allocate large folios up to the PMD size today, and more filesystems are coming.
And the other half of my work, to enable "multi-size THP" (large folios) for
anonymous memory, makes contpte sized folios prevalent for anonymous memory too
[4].

Note that the first 3 patches are for core-mm and provide the refactoring
needed to make some crucial optimizations possible - these are then
implemented in patches 15 and 16. The remaining patches are arm64-specific.


Testing
=======

I've tested this series together with multi-size THP [4] on both Ampere Altra
(bare metal) and Apple M2 (VM):
- mm selftests (inc new tests written for multi-size THP); no regressions
- Speedometer JavaScript benchmark in the Chromium web browser; no issues
- Kernel compilation; no issues
- Various tests under high memory pressure with swap enabled; no issues


Performance
===========

High Level Use Cases
~~~~~~~~~~~~~~~~~~~~

First some high level use cases (kernel compilation and speedometer JavaScript
benchmarks). These are running on Ampere Altra (I've seen similar improvements
on Android/Pixel 6).

baseline:                  mm-unstable (inc mTHP but switched off)
mTHP:                      enable 16K, 32K, 64K mTHP sizes "always"
mTHP + contpte:            + this series
mTHP + contpte + exefolio: + PoC patch to always read executable memory from
                           file into a 64K folio to enable contpte-mapping
                           the text

Kernel Compilation with -j8 (negative is faster):

| kernel                    | real-time | kern-time | user-time |
|---------------------------|-----------|-----------|-----------|
| baseline                  |      0.0% |      0.0% |      0.0% |
| mTHP                      |     -4.6% |    -38.0% |     -0.4% |
| mTHP + contpte            |     -5.4% |    -37.7% |     -1.3% |
| mTHP + contpte + exefolio |     -7.4% |    -39.5% |     -3.5% |

Kernel Compilation with -j80 (negative is faster):

| kernel                    | real-time | kern-time | user-time |
|---------------------------|-----------|-----------|-----------|
| baseline                  |      0.0% |      0.0% |      0.0% |
| mTHP                      |     -4.9% |    -36.1% |     -0.2% |
| mTHP + contpte            |     -5.8% |    -36.0% |     -1.2% |
| mTHP + contpte + exefolio |     -6.8% |    -37.0% |     -3.1% |

Speedometer (positive is faster):

| kernel                    | runs_per_min |
|:--------------------------|--------------|
| baseline                  |         0.0% |
| mTHP                      |         1.5% |
| mTHP + contpte            |         3.7% |
| mTHP + contpte + exefolio |         4.9% |

Micro Benchmarks
~~~~~~~~~~~~~~~~

Additionally for this version, I've done a significant amount of
microbenchmarking (and fixes!) to ensure that the performance of fork(),
madvise(DONTNEED) and munmap() does not regress. Thanks to David for
sharing his benchmarks.

baseline:    mm-unstable (inc mTHP but switched off)
contpte-dis: + this series with ARM64_CONTPTE disabled at compile-time
             (to show the impact of the core-mm changes)
contpte-ena: + ARM64_CONTPTE enabled at compile-time
             (to show the impact of the arm64-specific changes)

I'm showing the collated results summary here. See individual patch commit logs
for commentary:

| Apple M2 VM   |       fork        |     dontneed      |      munmap       |
| order-0       |-------------------|-------------------|-------------------|
| (pte-map)     |  mean   |  stdev  |  mean   |  stdev  |  mean   |  stdev  |
|---------------|---------|---------|---------|---------|---------|---------|
| baseline      |    0.0% |    1.1% |    0.0% |    7.5% |    0.0% |    3.8% |
| contpte-dis   |   -1.0% |    2.0% |   -9.6% |    3.1% |   -1.9% |    0.2% |
| contpte-ena   |    2.6% |    1.7% |  -10.2% |    1.6% |    1.9% |    0.7% |

| Apple M2 VM   |       fork        |     dontneed      |      munmap       |
| order-9       |-------------------|-------------------|-------------------|
| (pte-map)     |  mean   |  stdev  |  mean   |  stdev  |  mean   |  stdev  |
|---------------|---------|---------|---------|---------|---------|---------|
| baseline      |    0.0% |    1.2% |    0.0% |    7.9% |    0.0% |    6.4% |
| contpte-dis   |   -0.1% |    1.1% |   -4.9% |    8.1% |   -4.7% |    0.8% |
| contpte-ena   |  -25.4% |    1.9% |   -9.9% |    0.9% |   -6.0% |    1.4% |

| Ampere Altra  |       fork        |     dontneed      |      munmap       |
| order-0       |-------------------|-------------------|-------------------|
| (pte-map)     |  mean   |  stdev  |  mean   |  stdev  |  mean   |  stdev  |
|---------------|---------|---------|---------|---------|---------|---------|
| baseline      |    0.0% |    1.0% |    0.0% |    0.1% |    0.0% |    0.9% |
| contpte-dis   |   -0.1% |    1.2% |   -0.2% |    0.1% |   -0.2% |    0.6% |
| contpte-ena   |    1.8% |    0.7% |    1.3% |    0.0% |    2.0% |    0.4% |

| Ampere Altra  |       fork        |     dontneed      |      munmap       |
| order-9       |-------------------|-------------------|-------------------|
| (pte-map)     |  mean   |  stdev  |  mean   |  stdev  |  mean   |  stdev  |
|---------------|---------|---------|---------|---------|---------|---------|
| baseline      |    0.0% |    0.1% |    0.0% |    0.0% |    0.0% |    0.1% |
| contpte-dis   |   -0.1% |    0.1% |   -0.1% |    0.0% |   -3.2% |    0.2% |
| contpte-ena   |   -6.7% |    0.1% |   14.1% |    0.0% |   -0.6% |    0.2% |

Misc
~~~~

John Hubbard at Nvidia has reported dramatic 10x performance improvements
for some workloads at [5], when using a 64K base-page kernel.

---
All dependencies listed against v1 are now resolved; this series applies
cleanly against v6.7-rc1 and against mm-unstable as of a few days ago
(3ecae30dda24).

Changes since v3 [3]
====================

- Added v3#1 to batch set_ptes() when splitting a huge pmd to ptes; avoids
  need to fold contpte blocks for perf improvement
- Separated the clear_ptes() fast path into its own inline function (Alistair)
- Reworked core-mm changes to copy_present_ptes() and zap_pte_range() to
  remove overhead when memory is all order-0 folios (for arm64 and !arm64)
- Significant optimization of arm64 backend fork operations (set_ptes_full()
  and set_wrprotects()) to ensure no regression when memory is order-0 folios
- Fixed local variable declarations to be reverse xmas tree
- Added documentation for the new backend APIs (pte_batch_remaining(),
  set_ptes_full(), clear_ptes(), ptep_set_wrprotects())
- Renamed tlb_get_guaranteed_space() -> tlb_reserve_space() and pass requested
  number of slots. Avoids allocating memory when not needed; perf improvement.


Changes since v2 [2]
====================

- Removed contpte_ptep_get_and_clear_full() optimisation for exit() (v2#14),
  and replaced with a batch-clearing approach using a new arch helper,
  clear_ptes() (v3#2 and v3#15) (Alistair and Barry)
- (v2#1 / v3#1)
    - Fixed folio refcounting so that refcount >= mapcount always (DavidH)
    - Reworked batch demarcation to avoid pte_pgprot() (DavidH)
    - Reverted return semantic of copy_present_page() and instead fix it up
      in copy_present_ptes() (Alistair)
    - Removed page_cont_mapped_vaddr() and replaced with simpler logic
      (Alistair)
    - Made batch accounting clearer in copy_pte_range() (Alistair)
- (v2#12 / v3#13)
    - Renamed contpte_fold() -> contpte_convert() and hoisted setting/
      clearing CONT_PTE bit to higher level (Alistair)


Changes since v1 [1]
====================

- Export contpte_* symbols so that modules can continue to call inline
  functions (e.g. ptep_get) which may now call the contpte_* functions
  (thanks to JohnH)
- Use pte_valid() instead of pte_present() where sensible (thanks to Catalin)
- Factor out (pte_valid() && pte_cont()) into new pte_valid_cont() helper
  (thanks to Catalin)
- Fixed bug in contpte_ptep_set_access_flags() where TLBIs were missed
  (thanks to Catalin)
- Added ARM64_CONTPTE expert Kconfig (enabled by default) (thanks to Anshuman)
- Simplified contpte_ptep_get_and_clear_full()
- Improved various code comments


[1] https://lore.kernel.org/linux-arm-kernel/[email protected]/
[2] https://lore.kernel.org/linux-arm-kernel/[email protected]/
[3] https://lore.kernel.org/linux-arm-kernel/[email protected]/
[4] https://lore.kernel.org/linux-arm-kernel/[email protected]/
[5] https://lore.kernel.org/linux-mm/[email protected]/


Thanks,
Ryan

Ryan Roberts (16):
mm: thp: Batch-collapse PMD with set_ptes()
mm: Batch-copy PTE ranges during fork()
mm: Batch-clear PTE ranges during zap_pte_range()
arm64/mm: set_pte(): New layer to manage contig bit
arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
arm64/mm: pte_clear(): New layer to manage contig bit
arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
arm64/mm: ptep_get(): New layer to manage contig bit
arm64/mm: Split __flush_tlb_range() to elide trailing DSB
arm64/mm: Wire up PTE_CONT for user mappings
arm64/mm: Implement new helpers to optimize fork()
arm64/mm: Implement clear_ptes() to optimize exit, munmap, dontneed

arch/arm64/Kconfig | 10 +-
arch/arm64/include/asm/pgtable.h | 384 +++++++++++++++++++++---
arch/arm64/include/asm/tlbflush.h | 13 +-
arch/arm64/kernel/efi.c | 4 +-
arch/arm64/kernel/mte.c | 2 +-
arch/arm64/kvm/guest.c | 2 +-
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/contpte.c | 480 ++++++++++++++++++++++++++++++
arch/arm64/mm/fault.c | 12 +-
arch/arm64/mm/fixmap.c | 4 +-
arch/arm64/mm/hugetlbpage.c | 40 +--
arch/arm64/mm/kasan_init.c | 6 +-
arch/arm64/mm/mmu.c | 16 +-
arch/arm64/mm/pageattr.c | 6 +-
arch/arm64/mm/trans_pgd.c | 6 +-
include/asm-generic/tlb.h | 11 +
include/linux/pgtable.h | 123 ++++++++
mm/huge_memory.c | 59 ++--
mm/memory.c | 156 ++++++----
mm/mmu_gather.c | 15 +
20 files changed, 1182 insertions(+), 168 deletions(-)
create mode 100644 arch/arm64/mm/contpte.c

--
2.25.1



2023-12-18 10:51:41

by Ryan Roberts

Subject: [PATCH v4 01/16] mm: thp: Batch-collapse PMD with set_ptes()

Refactor __split_huge_pmd_locked() so that a present PMD can be
collapsed to PTEs in a single batch using set_ptes(). It also provides a
future opportunity to batch-add the folio to the rmap using David's new
batched rmap APIs.

This should improve performance a little bit, but the real motivation is
to remove the need for the arm64 backend to fold the contpte entries.
Instead, since the ptes are set as a batch, the contpte blocks can be
initially set up pre-folded (once the arm64 contpte support is added in
the next few patches). This leads to a noticeable performance improvement
during split.
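
Condensed, the non-migration path after this patch builds one template
entry, adds the rmap per page, and installs the whole range with a single
call; this is just the shape, see the diff below for the full detail:

entry = mk_pte(page, READ_ONCE(vma->vm_page_prot));
/* write/young/dirty/soft-dirty/uffd-wp bits are applied to entry once */

for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE)
	page_add_anon_rmap(page + i, vma, addr, RMAP_NONE);

set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);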

Signed-off-by: Ryan Roberts <[email protected]>
---
mm/huge_memory.c | 59 ++++++++++++++++++++++++++++--------------------
1 file changed, 34 insertions(+), 25 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6be1a380a298..fbf7e95ea983 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2535,15 +2535,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,

pte = pte_offset_map(&_pmd, haddr);
VM_BUG_ON(!pte);
- for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
- pte_t entry;
- /*
- * Note that NUMA hinting access restrictions are not
- * transferred to avoid any possibility of altering
- * permissions across VMAs.
- */
- if (freeze || pmd_migration) {
+
+ /*
+ * Note that NUMA hinting access restrictions are not transferred to
+ * avoid any possibility of altering permissions across VMAs.
+ */
+ if (freeze || pmd_migration) {
+ for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+ pte_t entry;
swp_entry_t swp_entry;
+
if (write)
swp_entry = make_writable_migration_entry(
page_to_pfn(page + i));
@@ -2562,28 +2563,36 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
entry = pte_swp_mksoft_dirty(entry);
if (uffd_wp)
entry = pte_swp_mkuffd_wp(entry);
- } else {
- entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
- if (write)
- entry = pte_mkwrite(entry, vma);
+
+ VM_WARN_ON(!pte_none(ptep_get(pte + i)));
+ set_pte_at(mm, addr, pte + i, entry);
+ }
+ } else {
+ pte_t entry;
+
+ entry = mk_pte(page, READ_ONCE(vma->vm_page_prot));
+ if (write)
+ entry = pte_mkwrite(entry, vma);
+ if (!young)
+ entry = pte_mkold(entry);
+ /* NOTE: this may set soft-dirty too on some archs */
+ if (dirty)
+ entry = pte_mkdirty(entry);
+ if (soft_dirty)
+ entry = pte_mksoft_dirty(entry);
+ if (uffd_wp)
+ entry = pte_mkuffd_wp(entry);
+
+ for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
if (anon_exclusive)
SetPageAnonExclusive(page + i);
- if (!young)
- entry = pte_mkold(entry);
- /* NOTE: this may set soft-dirty too on some archs */
- if (dirty)
- entry = pte_mkdirty(entry);
- if (soft_dirty)
- entry = pte_mksoft_dirty(entry);
- if (uffd_wp)
- entry = pte_mkuffd_wp(entry);
page_add_anon_rmap(page + i, vma, addr, RMAP_NONE);
+ VM_WARN_ON(!pte_none(ptep_get(pte + i)));
}
- VM_BUG_ON(!pte_none(ptep_get(pte)));
- set_pte_at(mm, addr, pte, entry);
- pte++;
+
+ set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);
}
- pte_unmap(pte - 1);
+ pte_unmap(pte);

if (!pmd_migration)
page_remove_rmap(page, vma, true);
--
2.25.1


2023-12-18 10:51:54

by Ryan Roberts

Subject: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

Convert copy_pte_range() to copy a batch of ptes in one go. A given
batch is determined by the architecture with the new helper,
pte_batch_remaining(), and maps a physically contiguous block of memory,
all belonging to the same folio. A pte batch is then write-protected in
one go in the parent using the new helper, ptep_set_wrprotects() and is
set in one go in the child using the new helper, set_ptes_full().

The primary motivation for this change is to reduce the number of tlb
maintenance operations that the arm64 backend has to perform during
fork, as it is about to add transparent support for the "contiguous bit"
in its ptes. By write-protecting the parent using the new
ptep_set_wrprotects() (note the 's' at the end) function, the backend
can avoid having to unfold contig ranges of PTEs, which is expensive,
when all ptes in the range are being write-protected. Similarly, by
using set_ptes_full() rather than set_pte_at() to set up ptes in the
child, the backend does not need to fold a contiguous range once they
are all populated - they can be initially populated as a contiguous
range in the first place.

This code is very performance sensitive, and a significant amount of
effort has been put into not regressing performance for the order-0
folio case. By default, pte_batch_remaining() is a compile-time constant
1, which enables the compiler to simplify the extra loops that are added
for batching and produce code that is equivalent to (and as performant
as) the previous implementation.

This change addresses the core-mm refactoring only and a separate change
will implement pte_batch_remaining(), ptep_set_wrprotects() and
set_ptes_full() in the arm64 backend to realize the performance
improvement as part of the work to enable contpte mappings.
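
To illustrate the kind of override an architecture might provide, a sketch
is shown below; this is not the arm64 patch itself, CONT_PTES is an assumed
16-entry contpte block size, and the checks that the pte actually maps a
suitably sized and aligned folio are omitted:

#define CONT_PTES	16	/* assumed contpte block size for 4K pages */

static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long addr,
					unsigned long end)
{
	unsigned long next = ALIGN(addr + PAGE_SIZE, CONT_PTES * PAGE_SIZE);

	/* Pages remaining to the next contpte boundary, clamped to end. */
	return (min(next, end) - addr) >> PAGE_SHIFT;
}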

To ensure the arm64 implementation is performant once added, this change
is very careful to only call ptep_get() once per pte batch.

The following microbenchmark results demonstrate that there is no
significant performance change after this patch. Fork is called in a
tight loop in a process with 1G of populated memory and the time for the
function to execute is measured. 100 iterations per run, 8 runs
performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests were
performed for the case where the 1G of memory is composed of order-0
folios and for the case where it is composed of pte-mapped order-9
folios. Negative is faster, positive is slower, compared to the baseline
upon which the series is based:

| Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
| fork          |-------------------|-------------------|
| microbench    |  mean   |  stdev  |  mean   |  stdev  |
|---------------|---------|---------|---------|---------|
| baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
| after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |

| Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
| fork          |-------------------|-------------------|
| microbench    |  mean   |  stdev  |  mean   |  stdev  |
|---------------|---------|---------|---------|---------|
| baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
| after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |

Tested-by: John Hubbard <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
mm/memory.c | 92 ++++++++++++++++++++++++++---------------
2 files changed, 139 insertions(+), 33 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index af7639c3b0a3..db93fb81465a 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
#define arch_flush_lazy_mmu_mode() do {} while (0)
#endif

+#ifndef pte_batch_remaining
+/**
+ * pte_batch_remaining - Number of pages from addr to next batch boundary.
+ * @pte: Page table entry for the first page.
+ * @addr: Address of the first page.
+ * @end: Batch ceiling (e.g. end of vma).
+ *
+ * Some architectures (arm64) can efficiently modify a contiguous batch of ptes.
+ * In such cases, this function returns the remaining number of pages to the end
+ * of the current batch, as defined by addr. This can be useful when iterating
+ * over ptes.
+ *
+ * May be overridden by the architecture, else batch size is always 1.
+ */
+static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long addr,
+ unsigned long end)
+{
+ return 1;
+}
+#endif
+
#ifndef set_ptes

#ifndef pte_next_pfn
@@ -246,6 +267,34 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
#endif
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

+#ifndef set_ptes_full
+/**
+ * set_ptes_full - Map consecutive pages to a contiguous range of addresses.
+ * @mm: Address space to map the pages into.
+ * @addr: Address to map the first page at.
+ * @ptep: Page table pointer for the first entry.
+ * @pte: Page table entry for the first page.
+ * @nr: Number of pages to map.
+ * @full: True if systematically setting all ptes and their previous values
+ * were known to be none (e.g. part of fork).
+ *
+ * Some architectures (arm64) can optimize the implementation if copying ptes
+ * batch-by-batch from the parent, where a batch is defined by
+ * pte_batch_remaining().
+ *
+ * May be overridden by the architecture, else full is ignored and call is
+ * forwarded to set_ptes().
+ *
+ * Context: The caller holds the page table lock. The pages all belong to the
+ * same folio. The PTEs are all in the same PMD.
+ */
+static inline void set_ptes_full(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr, int full)
+{
+ set_ptes(mm, addr, ptep, pte, nr);
+}
+#endif
+
#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
extern int ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
@@ -622,6 +671,37 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
}
#endif

+#ifndef ptep_set_wrprotects
+struct mm_struct;
+/**
+ * ptep_set_wrprotects - Write protect a consecutive set of pages.
+ * @mm: Address space that the pages are mapped into.
+ * @address: Address of first page to write protect.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of pages to write protect.
+ * @full: True if systematically write protecting all ptes (e.g. part of fork).
+ *
+ * Some architectures (arm64) can optimize the implementation if
+ * write-protecting ptes batch-by-batch, where a batch is defined by
+ * pte_batch_remaining().
+ *
+ * May be overridden by the architecture, else implemented as a loop over
+ * ptep_set_wrprotect().
+ *
+ * Context: The caller holds the page table lock. The PTEs are all in the same
+ * PMD.
+ */
+static inline void ptep_set_wrprotects(struct mm_struct *mm,
+ unsigned long address, pte_t *ptep,
+ unsigned int nr, int full)
+{
+ unsigned int i;
+
+ for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
+ ptep_set_wrprotect(mm, address, ptep);
+}
+#endif
+
/*
* On some architectures hardware does not set page access bit when accessing
* memory page, it is responsibility of software setting this bit. It brings
diff --git a/mm/memory.c b/mm/memory.c
index 809746555827..111f8feeb56e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -929,42 +929,60 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
}

/*
- * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
- * is required to copy this pte.
+ * Copy set of contiguous ptes. Returns number of ptes copied if succeeded
+ * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
+ * first pte.
*/
static inline int
-copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
- pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
- struct folio **prealloc)
+copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
+ pte_t *dst_pte, pte_t *src_pte, pte_t pte,
+ unsigned long addr, unsigned long end,
+ int *rss, struct folio **prealloc)
{
struct mm_struct *src_mm = src_vma->vm_mm;
unsigned long vm_flags = src_vma->vm_flags;
- pte_t pte = ptep_get(src_pte);
struct page *page;
struct folio *folio;
+ int nr, i, ret;
+
+ nr = pte_batch_remaining(pte, addr, end);

page = vm_normal_page(src_vma, addr, pte);
- if (page)
+ if (page) {
folio = page_folio(page);
+ folio_ref_add(folio, nr);
+ }
if (page && folio_test_anon(folio)) {
- /*
- * If this page may have been pinned by the parent process,
- * copy the page immediately for the child so that we'll always
- * guarantee the pinned page won't be randomly replaced in the
- * future.
- */
- folio_get(folio);
- if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
- /* Page may be pinned, we have to copy. */
- folio_put(folio);
- return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
- addr, rss, prealloc, page);
+ for (i = 0; i < nr; i++, page++) {
+ /*
+ * If this page may have been pinned by the parent
+ * process, copy the page immediately for the child so
+ * that we'll always guarantee the pinned page won't be
+ * randomly replaced in the future.
+ */
+ if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
+ if (i != 0)
+ break;
+ /* Page may be pinned, we have to copy. */
+ folio_ref_sub(folio, nr);
+ ret = copy_present_page(dst_vma, src_vma,
+ dst_pte, src_pte, addr,
+ rss, prealloc, page);
+ return ret == 0 ? 1 : ret;
+ }
+ VM_BUG_ON(PageAnonExclusive(page));
}
- rss[MM_ANONPAGES]++;
+
+ if (unlikely(i < nr)) {
+ folio_ref_sub(folio, nr - i);
+ nr = i;
+ }
+
+ rss[MM_ANONPAGES] += nr;
} else if (page) {
- folio_get(folio);
- page_dup_file_rmap(page, false);
- rss[mm_counter_file(page)]++;
+ for (i = 0; i < nr; i++)
+ page_dup_file_rmap(page + i, false);
+ rss[mm_counter_file(page)] += nr;
}

/*
@@ -972,10 +990,9 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
* in the parent and the child
*/
if (is_cow_mapping(vm_flags) && pte_write(pte)) {
- ptep_set_wrprotect(src_mm, addr, src_pte);
+ ptep_set_wrprotects(src_mm, addr, src_pte, nr, true);
pte = pte_wrprotect(pte);
}
- VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));

/*
* If it's a shared mapping, mark it clean in
@@ -988,8 +1005,8 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
if (!userfaultfd_wp(dst_vma))
pte = pte_clear_uffd_wp(pte);

- set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
- return 0;
+ set_ptes_full(dst_vma->vm_mm, addr, dst_pte, pte, nr, true);
+ return nr;
}

static inline struct folio *folio_prealloc(struct mm_struct *src_mm,
@@ -1030,6 +1047,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
int rss[NR_MM_COUNTERS];
swp_entry_t entry = (swp_entry_t){0};
struct folio *prealloc = NULL;
+ int nr_ptes;

again:
progress = 0;
@@ -1060,6 +1078,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
arch_enter_lazy_mmu_mode();

do {
+ nr_ptes = 1;
+
/*
* We are holding two locks at this point - either of them
* could generate latencies in another task on another CPU.
@@ -1095,16 +1115,21 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
* the now present pte.
*/
WARN_ON_ONCE(ret != -ENOENT);
+ ret = 0;
}
- /* copy_present_pte() will clear `*prealloc' if consumed */
- ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
- addr, rss, &prealloc);
+ /* copy_present_ptes() will clear `*prealloc' if consumed */
+ nr_ptes = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
+ ptent, addr, end, rss, &prealloc);
+
/*
* If we need a pre-allocated page for this pte, drop the
* locks, allocate, and try again.
*/
- if (unlikely(ret == -EAGAIN))
+ if (unlikely(nr_ptes == -EAGAIN)) {
+ ret = -EAGAIN;
break;
+ }
+
if (unlikely(prealloc)) {
/*
* pre-alloc page cannot be reused by next time so as
@@ -1115,8 +1140,9 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
folio_put(prealloc);
prealloc = NULL;
}
- progress += 8;
- } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+ progress += 8 * nr_ptes;
+ } while (dst_pte += nr_ptes, src_pte += nr_ptes,
+ addr += PAGE_SIZE * nr_ptes, addr != end);

arch_leave_lazy_mmu_mode();
pte_unmap_unlock(orig_src_pte, src_ptl);
--
2.25.1


2023-12-18 10:52:08

by Ryan Roberts

Subject: [PATCH v4 03/16] mm: Batch-clear PTE ranges during zap_pte_range()

Convert zap_pte_range() to clear a batch of ptes in one go. A given batch
is determined by the architecture (see pte_batch_remaining()), and maps
a physically contiguous block of memory, all belonging to the same
folio. A pte batch is cleared using the new helper, clear_ptes().

The primary motivation for this change is to reduce the number of tlb
maintenance operations that the arm64 backend has to perform during exit
and other syscalls that cause zap_pte_range() (e.g. munmap,
madvise(DONTNEED), etc.), as it is about to add transparent support for
the "contiguous bit" in its ptes. By clearing ptes using the new
clear_ptes() API, the backend doesn't have to perform an expensive
unfold operation when a PTE being cleared is part of a contpte block.
Instead it can just clear the whole block immediately.

This code is very performance sensitive, and a significant amount of
effort has been put into not regressing performance for the order-0
folio case. By default, pte_batch_remaining() returns a compile-time
constant 1, which enables the compiler to simplify the extra loops that
are added for batching and produce code that is equivalent to (and as
performant as) the previous implementation.

This change addresses the core-mm refactoring only, and introduces
clear_ptes() with a default implementation that calls
ptep_get_and_clear_full() for each pte in the range. Note that this API
returns the pte at the beginning of the batch, but with the dirty and
young bits set if ANY of the ptes in the cleared batch had those bits
set; this information is applied to the folio by the core-mm. Given the
batch is guaranteed to cover only a single folio, collapsing this state
does not lose any useful information.
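
For example, zap_pte_range() can then consume the collapsed young/dirty
state once per batch, roughly as follows (condensed from the diff below):

ptent = clear_ptes(mm, addr, pte, nr, tlb->fullmm);

if (!folio_test_anon(folio)) {
	if (pte_dirty(ptent))
		folio_mark_dirty(folio);	/* some pte in the batch was dirty */
	if (pte_young(ptent) && likely(vma_has_recency(vma)))
		folio_mark_accessed(folio);	/* some pte in the batch was young */
}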

A separate change will implement clear_ptes() in the arm64 backend to
realize the performance improvement as part of the work to enable
contpte mappings.

The following microbenchmark results demonstrate that there is no
madvise(dontneed) performance regression (and actually an improvement in
some cases) after this patch. madvise(dontneed) is called for each page
of a 1G populated mapping and the total time is measured. 100 iterations
per run, 8 runs performed on both Apple M2 (VM) and Ampere Altra (bare
metal). Tests were performed for the case where the 1G of memory is
composed of order-0 folios and for the case where it is composed of
pte-mapped order-9 folios. Negative is faster, positive is slower,
compared to the baseline upon which the series is based:

| Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
| dontneed      |-------------------|-------------------|
| microbench    |  mean   |  stdev  |  mean   |  stdev  |
|---------------|---------|---------|---------|---------|
| baseline      |    0.0% |    7.5% |    0.0% |    7.9% |
| after-change  |   -9.6% |    3.1% |   -4.9% |    8.1% |

| Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
| dontneed      |-------------------|-------------------|
| microbench    |  mean   |  stdev  |  mean   |  stdev  |
|---------------|---------|---------|---------|---------|
| baseline      |    0.0% |    0.1% |    0.0% |    0.0% |
| after-change  |   -0.2% |    0.1% |   -0.1% |    0.0% |

The following microbenchmark results demonstrate that there is no munmap
performance regression (and actually an improvement in some cases) after
this patch. munmap is called for a 1G populated mapping and the time to
execute the function is measured. 100 iterations per run, 8 runs
performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests were
performed for the case where the 1G of memory is composed of order-0
folios and for the case where it is composed of pte-mapped order-9
folios. Negative is faster, positive is slower, compared to the baseline
upon which the series is based:

| Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
| munmap        |-------------------|-------------------|
| microbench    |  mean   |  stdev  |  mean   |  stdev  |
|---------------|---------|---------|---------|---------|
| baseline      |    0.0% |    3.8% |    0.0% |    6.4% |
| after-change  |   -1.9% |    0.2% |   -4.7% |    0.8% |

| Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
| munmap        |-------------------|-------------------|
| microbench    |  mean   |  stdev  |  mean   |  stdev  |
|---------------|---------|---------|---------|---------|
| baseline      |    0.0% |    0.9% |    0.0% |    0.1% |
| after-change  |   -0.2% |    0.6% |   -3.2% |    0.2% |

Tested-by: John Hubbard <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
include/asm-generic/tlb.h | 11 +++++++
include/linux/pgtable.h | 43 ++++++++++++++++++++++++++
mm/memory.c | 64 ++++++++++++++++++++++++++++-----------
mm/mmu_gather.c | 15 +++++++++
4 files changed, 115 insertions(+), 18 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 129a3a759976..1b25929d2000 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -75,6 +75,10 @@
* boolean indicating if the queue is (now) full and a call to
* tlb_flush_mmu() is required.
*
+ * tlb_reserve_space() attempts to preallocate space for nr pages and returns
+ * the minimum guaranteed number of pages that can be queued without overflow,
+ * which may be more or less than requested.
+ *
* tlb_remove_page() and tlb_remove_page_size() imply the call to
* tlb_flush_mmu() when required and has no return value.
*
@@ -263,6 +267,7 @@ struct mmu_gather_batch {
extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
struct encoded_page *page,
int page_size);
+extern unsigned int tlb_reserve_space(struct mmu_gather *tlb, unsigned int nr);

#ifdef CONFIG_SMP
/*
@@ -273,6 +278,12 @@ extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
#endif

+#else
+static inline unsigned int tlb_reserve_space(struct mmu_gather *tlb,
+ unsigned int nr)
+{
+ return 1;
+}
#endif

/*
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index db93fb81465a..170925379534 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -601,6 +601,49 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
}
#endif

+#ifndef clear_ptes
+struct mm_struct;
+/**
+ * clear_ptes - Clear a consecutive range of ptes and return the previous value.
+ * @mm: Address space that the ptes map.
+ * @address: Address corresponding to the first pte to clear.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of ptes to clear.
+ * @full: True if systematically clearing all ptes for the address space.
+ *
+ * A batched version of ptep_get_and_clear_full(), which returns the old pte
+ * value for the first pte in the range, but with young and/or dirty set if any
+ * of the ptes in the range were young or dirty.
+ *
+ * May be overridden by the architecture, else implemented as a loop over
+ * ptep_get_and_clear_full().
+ *
+ * Context: The caller holds the page table lock. The PTEs are all in the same
+ * PMD.
+ */
+static inline pte_t clear_ptes(struct mm_struct *mm,
+ unsigned long address, pte_t *ptep,
+ unsigned int nr, int full)
+{
+ unsigned int i;
+ pte_t pte;
+ pte_t orig_pte = ptep_get_and_clear_full(mm, address, ptep, full);
+
+ for (i = 1; i < nr; i++) {
+ address += PAGE_SIZE;
+ ptep++;
+ pte = ptep_get_and_clear_full(mm, address, ptep, full);
+
+ if (pte_dirty(pte))
+ orig_pte = pte_mkdirty(orig_pte);
+
+ if (pte_young(pte))
+ orig_pte = pte_mkyoung(orig_pte);
+ }
+
+ return orig_pte;
+}
+#endif

/*
* If two threads concurrently fault at the same page, the thread that
diff --git a/mm/memory.c b/mm/memory.c
index 111f8feeb56e..81b023cf3182 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1447,6 +1447,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
pte_t *start_pte;
pte_t *pte;
swp_entry_t entry;
+ int nr;

tlb_change_page_size(tlb, PAGE_SIZE);
init_rss_vec(rss);
@@ -1459,6 +1460,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
do {
pte_t ptent = ptep_get(pte);
struct page *page;
+ int i;
+
+ nr = 1;

if (pte_none(ptent))
continue;
@@ -1467,43 +1471,67 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
break;

if (pte_present(ptent)) {
- unsigned int delay_rmap;
+ unsigned int delay_rmap = 0;
+ struct folio *folio;
+ bool full = false;
+
+ /*
+ * tlb_gather always has at least one slot so avoid call
+ * to tlb_reserve_space() when pte_batch_remaining() is
+ * a compile-time constant 1 (default).
+ */
+ nr = pte_batch_remaining(ptent, addr, end);
+ if (unlikely(nr > 1))
+ nr = min_t(int, nr, tlb_reserve_space(tlb, nr));

page = vm_normal_page(vma, addr, ptent);
if (unlikely(!should_zap_page(details, page)))
continue;
- ptent = ptep_get_and_clear_full(mm, addr, pte,
- tlb->fullmm);
+ ptent = clear_ptes(mm, addr, pte, nr, tlb->fullmm);
arch_check_zapped_pte(vma, ptent);
- tlb_remove_tlb_entry(tlb, pte, addr);
- zap_install_uffd_wp_if_needed(vma, addr, pte, details,
- ptent);
+
+ for (i = 0; i < nr; i++) {
+ unsigned long subaddr = addr + PAGE_SIZE * i;
+
+ tlb_remove_tlb_entry(tlb, &pte[i], subaddr);
+ zap_install_uffd_wp_if_needed(vma, subaddr,
+ &pte[i], details, ptent);
+ }
if (unlikely(!page)) {
ksm_might_unmap_zero_page(mm, ptent);
continue;
}

- delay_rmap = 0;
- if (!PageAnon(page)) {
+ folio = page_folio(page);
+ if (!folio_test_anon(folio)) {
if (pte_dirty(ptent)) {
- set_page_dirty(page);
+ folio_mark_dirty(folio);
if (tlb_delay_rmap(tlb)) {
delay_rmap = 1;
force_flush = 1;
}
}
if (pte_young(ptent) && likely(vma_has_recency(vma)))
- mark_page_accessed(page);
+ folio_mark_accessed(folio);
}
- rss[mm_counter(page)]--;
- if (!delay_rmap) {
- page_remove_rmap(page, vma, false);
- if (unlikely(page_mapcount(page) < 0))
- print_bad_pte(vma, addr, ptent, page);
+ rss[mm_counter(page)] -= nr;
+ for (i = 0; i < nr; i++, page++) {
+ if (!delay_rmap) {
+ page_remove_rmap(page, vma, false);
+ if (unlikely(page_mapcount(page) < 0))
+ print_bad_pte(vma, addr, ptent, page);
+ }
+
+ /*
+ * nr calculated based on available space, so
+ * can only be full on final iteration.
+ */
+ VM_WARN_ON(full);
+ full = __tlb_remove_page(tlb, page, delay_rmap);
}
- if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
+ if (unlikely(full)) {
force_flush = 1;
- addr += PAGE_SIZE;
+ addr += PAGE_SIZE * nr;
break;
}
continue;
@@ -1557,7 +1585,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
}
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
- } while (pte++, addr += PAGE_SIZE, addr != end);
+ } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);

add_mm_rss_vec(mm, rss);
arch_leave_lazy_mmu_mode();
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 4f559f4ddd21..39725756e6bf 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -47,6 +47,21 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
return true;
}

+unsigned int tlb_reserve_space(struct mmu_gather *tlb, unsigned int nr)
+{
+ struct mmu_gather_batch *batch = tlb->active;
+ unsigned int nr_alloc = batch->max - batch->nr;
+
+ while (nr_alloc < nr) {
+ if (!tlb_next_batch(tlb))
+ break;
+ nr_alloc += tlb->active->max;
+ }
+
+ tlb->active = batch;
+ return nr_alloc;
+}
+
#ifdef CONFIG_SMP
static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_struct *vma)
{
--
2.25.1


2023-12-18 10:52:23

by Ryan Roberts

Subject: [PATCH v4 04/16] arm64/mm: set_pte(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.
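
In outline, the pattern introduced here (and repeated for the other helpers
in the following patches) looks as below; the barriers in __set_pte() and
the eventual contpte handling in the public wrapper are omitted, so treat
this as a sketch rather than the final code:

/* Arch-private primitive: writes the pte exactly as given. */
static inline void __set_pte(pte_t *ptep, pte_t pte)
{
	WRITE_ONCE(*ptep, pte);
}

/*
 * Public API: a plain alias for now; a later patch in the series teaches
 * it to manage the contiguous bit transparently.
 */
#define set_pte	__set_pte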

Tested-by: John Hubbard <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 12 ++++++++----
arch/arm64/kernel/efi.c | 2 +-
arch/arm64/mm/fixmap.c | 2 +-
arch/arm64/mm/kasan_init.c | 4 ++--
arch/arm64/mm/mmu.c | 2 +-
arch/arm64/mm/pageattr.c | 2 +-
arch/arm64/mm/trans_pgd.c | 4 ++--
7 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b19a8aee684c..650d4f4bb6dc 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -93,7 +93,8 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))

#define pte_none(pte) (!pte_val(pte))
-#define pte_clear(mm,addr,ptep) set_pte(ptep, __pte(0))
+#define pte_clear(mm, addr, ptep) \
+ __set_pte(ptep, __pte(0))
#define pte_page(pte) (pfn_to_page(pte_pfn(pte)))

/*
@@ -261,7 +262,7 @@ static inline pte_t pte_mkdevmap(pte_t pte)
return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
}

-static inline void set_pte(pte_t *ptep, pte_t pte)
+static inline void __set_pte(pte_t *ptep, pte_t pte)
{
WRITE_ONCE(*ptep, pte);

@@ -350,7 +351,7 @@ static inline void set_ptes(struct mm_struct *mm,

for (;;) {
__check_safe_pte_update(mm, ptep, pte);
- set_pte(ptep, pte);
+ __set_pte(ptep, pte);
if (--nr == 0)
break;
ptep++;
@@ -534,7 +535,7 @@ static inline void __set_pte_at(struct mm_struct *mm,
{
__sync_cache_and_tags(pte, nr);
__check_safe_pte_update(mm, ptep, pte);
- set_pte(ptep, pte);
+ __set_pte(ptep, pte);
}

static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
@@ -1118,6 +1119,9 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t old_pte, pte_t new_pte);
+
+#define set_pte __set_pte
+
#endif /* !__ASSEMBLY__ */

#endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 0228001347be..44288a12fc6c 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -111,7 +111,7 @@ static int __init set_permissions(pte_t *ptep, unsigned long addr, void *data)
pte = set_pte_bit(pte, __pgprot(PTE_PXN));
else if (system_supports_bti_kernel() && spd->has_bti)
pte = set_pte_bit(pte, __pgprot(PTE_GP));
- set_pte(ptep, pte);
+ __set_pte(ptep, pte);
return 0;
}

diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index c0a3301203bd..51cd4501816d 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -121,7 +121,7 @@ void __set_fixmap(enum fixed_addresses idx,
ptep = fixmap_pte(addr);

if (pgprot_val(flags)) {
- set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
+ __set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
} else {
pte_clear(&init_mm, addr, ptep);
flush_tlb_kernel_range(addr, addr+PAGE_SIZE);
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 555285ebd5af..5eade712e9e5 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -112,7 +112,7 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr,
if (!early)
memset(__va(page_phys), KASAN_SHADOW_INIT, PAGE_SIZE);
next = addr + PAGE_SIZE;
- set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
+ __set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
} while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)));
}

@@ -266,7 +266,7 @@ static void __init kasan_init_shadow(void)
* so we should make sure that it maps the zero page read-only.
*/
for (i = 0; i < PTRS_PER_PTE; i++)
- set_pte(&kasan_early_shadow_pte[i],
+ __set_pte(&kasan_early_shadow_pte[i],
pfn_pte(sym_to_pfn(kasan_early_shadow_page),
PAGE_KERNEL_RO));

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 15f6347d23b6..e884279b268e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -178,7 +178,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
do {
pte_t old_pte = READ_ONCE(*ptep);

- set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
+ __set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));

/*
* After the PTE entry has been populated once, we
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 924843f1f661..a7996d8edf0a 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -41,7 +41,7 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
pte = clear_pte_bit(pte, cdata->clear_mask);
pte = set_pte_bit(pte, cdata->set_mask);

- set_pte(ptep, pte);
+ __set_pte(ptep, pte);
return 0;
}

diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 7b14df3c6477..230b607cf881 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -41,7 +41,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
* read only (code, rodata). Clear the RDONLY bit from
* the temporary mappings we use during restore.
*/
- set_pte(dst_ptep, pte_mkwrite_novma(pte));
+ __set_pte(dst_ptep, pte_mkwrite_novma(pte));
} else if ((debug_pagealloc_enabled() ||
is_kfence_address((void *)addr)) && !pte_none(pte)) {
/*
@@ -55,7 +55,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
*/
BUG_ON(!pfn_valid(pte_pfn(pte)));

- set_pte(dst_ptep, pte_mkpresent(pte_mkwrite_novma(pte)));
+ __set_pte(dst_ptep, pte_mkpresent(pte_mkwrite_novma(pte)));
}
}

--
2.25.1


2023-12-18 10:52:38

by Ryan Roberts

Subject: [PATCH v4 05/16] arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

set_pte_at() is a core macro that forwards to set_ptes() (with nr=1).
Instead of creating a __set_pte_at() internal macro, convert all arch
users to use set_ptes()/__set_ptes() directly, as appropriate. Callers
in hugetlb may benefit from calling __set_ptes() once for their whole
range rather than managing their own loop. This is left for future
improvement.
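
For reference, the generic definition being relied upon is simply the
following (also visible in the core-mm hunk of patch 2), so converting
callers to set_ptes()/__set_ptes() preserves behaviour:

#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)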

Tested-by: John Hubbard <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 10 +++++-----
arch/arm64/kernel/mte.c | 2 +-
arch/arm64/kvm/guest.c | 2 +-
arch/arm64/mm/fault.c | 2 +-
arch/arm64/mm/hugetlbpage.c | 10 +++++-----
5 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 650d4f4bb6dc..323ec91add60 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -342,9 +342,9 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
mte_sync_tags(pte, nr_pages);
}

-static inline void set_ptes(struct mm_struct *mm,
- unsigned long __always_unused addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
+static inline void __set_ptes(struct mm_struct *mm,
+ unsigned long __always_unused addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
{
page_table_check_ptes_set(mm, ptep, pte, nr);
__sync_cache_and_tags(pte, nr);
@@ -358,7 +358,6 @@ static inline void set_ptes(struct mm_struct *mm,
pte_val(pte) += PAGE_SIZE;
}
}
-#define set_ptes set_ptes

/*
* Huge pte definitions.
@@ -1067,7 +1066,7 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
#endif /* CONFIG_ARM64_MTE */

/*
- * On AArch64, the cache coherency is handled via the set_pte_at() function.
+ * On AArch64, the cache coherency is handled via the __set_ptes() function.
*/
static inline void update_mmu_cache_range(struct vm_fault *vmf,
struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
@@ -1121,6 +1120,7 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
pte_t old_pte, pte_t new_pte);

#define set_pte __set_pte
+#define set_ptes __set_ptes

#endif /* !__ASSEMBLY__ */

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index a41ef3213e1e..dcdcccd40891 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -67,7 +67,7 @@ int memcmp_pages(struct page *page1, struct page *page2)
/*
* If the page content is identical but at least one of the pages is
* tagged, return non-zero to avoid KSM merging. If only one of the
- * pages is tagged, set_pte_at() may zero or change the tags of the
+ * pages is tagged, __set_ptes() may zero or change the tags of the
* other page via mte_sync_tags().
*/
if (page_mte_tagged(page1) || page_mte_tagged(page2))
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index aaf1d4939739..629145fd3161 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -1072,7 +1072,7 @@ int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
} else {
/*
* Only locking to serialise with a concurrent
- * set_pte_at() in the VMM but still overriding the
+ * __set_ptes() in the VMM but still overriding the
* tags, hence ignoring the return value.
*/
try_page_mte_tagging(page);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 460d799e1296..a287c1dea871 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -205,7 +205,7 @@ static void show_pte(unsigned long addr)
*
* It needs to cope with hardware update of the accessed/dirty state by other
* agents in the system and can safely skip the __sync_icache_dcache() call as,
- * like set_pte_at(), the PTE is never changed from no-exec to exec here.
+ * like __set_ptes(), the PTE is never changed from no-exec to exec here.
*
* Returns whether or not the PTE actually changed.
*/
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index f5aae342632c..741cb53672fd 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -254,12 +254,12 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,

if (!pte_present(pte)) {
for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
- set_pte_at(mm, addr, ptep, pte);
+ __set_ptes(mm, addr, ptep, pte, 1);
return;
}

if (!pte_cont(pte)) {
- set_pte_at(mm, addr, ptep, pte);
+ __set_ptes(mm, addr, ptep, pte, 1);
return;
}

@@ -270,7 +270,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
clear_flush(mm, addr, ptep, pgsize, ncontig);

for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
- set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+ __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
}

pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -478,7 +478,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,

hugeprot = pte_pgprot(pte);
for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
- set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+ __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);

return 1;
}
@@ -507,7 +507,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
pfn = pte_pfn(pte);

for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
- set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+ __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
}

pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
--
2.25.1


2023-12-18 10:52:53

by Ryan Roberts

Subject: [PATCH v4 06/16] arm64/mm: pte_clear(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Tested-by: John Hubbard <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 3 ++-
arch/arm64/mm/fixmap.c | 2 +-
arch/arm64/mm/hugetlbpage.c | 2 +-
arch/arm64/mm/mmu.c | 2 +-
4 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 323ec91add60..1464e990580a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -93,7 +93,7 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))

#define pte_none(pte) (!pte_val(pte))
-#define pte_clear(mm, addr, ptep) \
+#define __pte_clear(mm, addr, ptep) \
__set_pte(ptep, __pte(0))
#define pte_page(pte) (pfn_to_page(pte_pfn(pte)))

@@ -1121,6 +1121,7 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,

#define set_pte __set_pte
#define set_ptes __set_ptes
+#define pte_clear __pte_clear

#endif /* !__ASSEMBLY__ */

diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index 51cd4501816d..bfc02568805a 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -123,7 +123,7 @@ void __set_fixmap(enum fixed_addresses idx,
if (pgprot_val(flags)) {
__set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
} else {
- pte_clear(&init_mm, addr, ptep);
+ __pte_clear(&init_mm, addr, ptep);
flush_tlb_kernel_range(addr, addr+PAGE_SIZE);
}
}
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 741cb53672fd..510b2d4b89a9 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -400,7 +400,7 @@ void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
ncontig = num_contig_ptes(sz, &pgsize);

for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
- pte_clear(mm, addr, ptep);
+ __pte_clear(mm, addr, ptep);
}

pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e884279b268e..080e9b50f595 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -859,7 +859,7 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
continue;

WARN_ON(!pte_present(pte));
- pte_clear(&init_mm, addr, ptep);
+ __pte_clear(&init_mm, addr, ptep);
flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
if (free_mapped)
free_hotplug_page_range(pte_page(pte),
--
2.25.1


2023-12-18 10:53:06

by Ryan Roberts

Subject: [PATCH v4 07/16] arm64/mm: ptep_get_and_clear(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Tested-by: John Hubbard <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 5 +++--
arch/arm64/mm/hugetlbpage.c | 6 +++---
2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 1464e990580a..994597a0bb0f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -941,8 +941,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

-#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
-static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
+static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
unsigned long address, pte_t *ptep)
{
pte_t pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
@@ -1122,6 +1121,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define set_pte __set_pte
#define set_ptes __set_ptes
#define pte_clear __pte_clear
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define ptep_get_and_clear __ptep_get_and_clear

#endif /* !__ASSEMBLY__ */

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 510b2d4b89a9..c2a753541d13 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -188,7 +188,7 @@ static pte_t get_clear_contig(struct mm_struct *mm,
unsigned long i;

for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
- pte_t pte = ptep_get_and_clear(mm, addr, ptep);
+ pte_t pte = __ptep_get_and_clear(mm, addr, ptep);

/*
* If HW_AFDBM is enabled, then the HW could turn on
@@ -236,7 +236,7 @@ static void clear_flush(struct mm_struct *mm,
unsigned long i, saddr = addr;

for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
- ptep_clear(mm, addr, ptep);
+ __ptep_get_and_clear(mm, addr, ptep);

flush_tlb_range(&vma, saddr, addr);
}
@@ -411,7 +411,7 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
pte_t orig_pte = ptep_get(ptep);

if (!pte_cont(orig_pte))
- return ptep_get_and_clear(mm, addr, ptep);
+ return __ptep_get_and_clear(mm, addr, ptep);

ncontig = find_num_contig(mm, addr, ptep, &pgsize);

--
2.25.1


2023-12-18 10:53:20

by Ryan Roberts

Subject: [PATCH v4 08/16] arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Tested-by: John Hubbard <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 18 +++++++-----------
1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 994597a0bb0f..9b4a9909fd5b 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -887,8 +887,9 @@ static inline bool pud_user_accessible_page(pud_t pud)
/*
* Atomic pte/pmd modifications.
*/
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
-static inline int __ptep_test_and_clear_young(pte_t *ptep)
+static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long address,
+ pte_t *ptep)
{
pte_t old_pte, pte;

@@ -903,18 +904,11 @@ static inline int __ptep_test_and_clear_young(pte_t *ptep)
return pte_young(pte);
}

-static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
- unsigned long address,
- pte_t *ptep)
-{
- return __ptep_test_and_clear_young(ptep);
-}
-
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
- int young = ptep_test_and_clear_young(vma, address, ptep);
+ int young = __ptep_test_and_clear_young(vma, address, ptep);

if (young) {
/*
@@ -937,7 +931,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address,
pmd_t *pmdp)
{
- return ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
+ return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

@@ -1123,6 +1117,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define pte_clear __pte_clear
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define ptep_get_and_clear __ptep_get_and_clear
+#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
+#define ptep_test_and_clear_young __ptep_test_and_clear_young

#endif /* !__ASSEMBLY__ */

--
2.25.1


2023-12-18 10:53:35

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v4 09/16] arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Tested-by: John Hubbard <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 9b4a9909fd5b..fc1005222ee4 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -138,7 +138,7 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
* so that we don't erroneously return false for pages that have been
* remapped as PROT_NONE but are yet to be flushed from the TLB.
* Note that we can't make any assumptions based on the state of the access
- * flag, since ptep_clear_flush_young() elides a DSB when invalidating the
+ * flag, since __ptep_clear_flush_young() elides a DSB when invalidating the
* TLB.
*/
#define pte_accessible(mm, pte) \
@@ -904,8 +904,7 @@ static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
return pte_young(pte);
}

-#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
-static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
+static inline int __ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
int young = __ptep_test_and_clear_young(vma, address, ptep);
@@ -1119,6 +1118,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define ptep_get_and_clear __ptep_get_and_clear
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define ptep_test_and_clear_young __ptep_test_and_clear_young
+#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
+#define ptep_clear_flush_young __ptep_clear_flush_young

#endif /* !__ASSEMBLY__ */

--
2.25.1


2023-12-18 10:53:46

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v4 10/16] arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Tested-by: John Hubbard <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 10 ++++++----
arch/arm64/mm/hugetlbpage.c | 2 +-
2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index fc1005222ee4..423cc32b2777 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -958,11 +958,11 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

/*
- * ptep_set_wrprotect - mark read-only while trasferring potential hardware
+ * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
* dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
*/
-#define __HAVE_ARCH_PTEP_SET_WRPROTECT
-static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep)
+static inline void __ptep_set_wrprotect(struct mm_struct *mm,
+ unsigned long address, pte_t *ptep)
{
pte_t old_pte, pte;

@@ -980,7 +980,7 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
static inline void pmdp_set_wrprotect(struct mm_struct *mm,
unsigned long address, pmd_t *pmdp)
{
- ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
+ __ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
}

#define pmdp_establish pmdp_establish
@@ -1120,6 +1120,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define ptep_test_and_clear_young __ptep_test_and_clear_young
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
#define ptep_clear_flush_young __ptep_clear_flush_young
+#define __HAVE_ARCH_PTEP_SET_WRPROTECT
+#define ptep_set_wrprotect __ptep_set_wrprotect

#endif /* !__ASSEMBLY__ */

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index c2a753541d13..952462820d9d 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -493,7 +493,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
pte_t pte;

if (!pte_cont(READ_ONCE(*ptep))) {
- ptep_set_wrprotect(mm, addr, ptep);
+ __ptep_set_wrprotect(mm, addr, ptep);
return;
}

--
2.25.1


2023-12-18 10:53:59

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v4 11/16] arm64/mm: ptep_set_access_flags(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Tested-by: John Hubbard <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 10 ++++++----
arch/arm64/mm/fault.c | 6 +++---
arch/arm64/mm/hugetlbpage.c | 2 +-
3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 423cc32b2777..85010c2d4dfa 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -312,7 +312,7 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,

/*
* Check for potential race with hardware updates of the pte
- * (ptep_set_access_flags safely changes valid ptes without going
+ * (__ptep_set_access_flags safely changes valid ptes without going
* through an invalid entry).
*/
VM_WARN_ONCE(!pte_young(pte),
@@ -842,8 +842,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
}

-#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
-extern int ptep_set_access_flags(struct vm_area_struct *vma,
+extern int __ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
pte_t entry, int dirty);

@@ -853,7 +852,8 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp,
pmd_t entry, int dirty)
{
- return ptep_set_access_flags(vma, address, (pte_t *)pmdp, pmd_pte(entry), dirty);
+ return __ptep_set_access_flags(vma, address, (pte_t *)pmdp,
+ pmd_pte(entry), dirty);
}

static inline int pud_devmap(pud_t pud)
@@ -1122,6 +1122,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define ptep_clear_flush_young __ptep_clear_flush_young
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define ptep_set_wrprotect __ptep_set_wrprotect
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+#define ptep_set_access_flags __ptep_set_access_flags

#endif /* !__ASSEMBLY__ */

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index a287c1dea871..7cebd9847aae 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -209,9 +209,9 @@ static void show_pte(unsigned long addr)
*
* Returns whether or not the PTE actually changed.
*/
-int ptep_set_access_flags(struct vm_area_struct *vma,
- unsigned long address, pte_t *ptep,
- pte_t entry, int dirty)
+int __ptep_set_access_flags(struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep,
+ pte_t entry, int dirty)
{
pteval_t old_pteval, pteval;
pte_t pte = READ_ONCE(*ptep);
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 952462820d9d..627a9717e98c 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -459,7 +459,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
pte_t orig_pte;

if (!pte_cont(pte))
- return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
+ return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);

ncontig = find_num_contig(mm, addr, ptep, &pgsize);
dpfn = pgsize >> PAGE_SHIFT;
--
2.25.1


2023-12-18 10:54:21

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v4 12/16] arm64/mm: ptep_get(): New layer to manage contig bit

Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

arm64 did not previously define an arch-specific ptep_get(), so override
the default version in the arch code, and also define the private
__ptep_get() version. Currently they both do the same thing that the
default version does (READ_ONCE()). Some arch users (hugetlb) were
already using ptep_get(), so convert those to the private API. Other
callsites were using READ_ONCE() directly, so convert those to the
appropriate (public or private) API too.

Tested-by: John Hubbard <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 12 +++++++++---
arch/arm64/kernel/efi.c | 2 +-
arch/arm64/mm/fault.c | 4 ++--
arch/arm64/mm/hugetlbpage.c | 18 +++++++++---------
arch/arm64/mm/kasan_init.c | 2 +-
arch/arm64/mm/mmu.c | 12 ++++++------
arch/arm64/mm/pageattr.c | 4 ++--
arch/arm64/mm/trans_pgd.c | 2 +-
8 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 85010c2d4dfa..6930c14f062f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -276,6 +276,11 @@ static inline void __set_pte(pte_t *ptep, pte_t pte)
}
}

+static inline pte_t __ptep_get(pte_t *ptep)
+{
+ return READ_ONCE(*ptep);
+}
+
extern void __sync_icache_dcache(pte_t pteval);
bool pgattr_change_is_safe(u64 old, u64 new);

@@ -303,7 +308,7 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
if (!IS_ENABLED(CONFIG_DEBUG_VM))
return;

- old_pte = READ_ONCE(*ptep);
+ old_pte = __ptep_get(ptep);

if (!pte_valid(old_pte) || !pte_valid(pte))
return;
@@ -893,7 +898,7 @@ static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
{
pte_t old_pte, pte;

- pte = READ_ONCE(*ptep);
+ pte = __ptep_get(ptep);
do {
old_pte = pte;
pte = pte_mkold(pte);
@@ -966,7 +971,7 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
{
pte_t old_pte, pte;

- pte = READ_ONCE(*ptep);
+ pte = __ptep_get(ptep);
do {
old_pte = pte;
pte = pte_wrprotect(pte);
@@ -1111,6 +1116,7 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t old_pte, pte_t new_pte);

+#define ptep_get __ptep_get
#define set_pte __set_pte
#define set_ptes __set_ptes
#define pte_clear __pte_clear
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 44288a12fc6c..9afcc690fe73 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -103,7 +103,7 @@ static int __init set_permissions(pte_t *ptep, unsigned long addr, void *data)
{
struct set_perm_data *spd = data;
const efi_memory_desc_t *md = spd->md;
- pte_t pte = READ_ONCE(*ptep);
+ pte_t pte = __ptep_get(ptep);

if (md->attribute & EFI_MEMORY_RO)
pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 7cebd9847aae..d63f3a0a7251 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -191,7 +191,7 @@ static void show_pte(unsigned long addr)
if (!ptep)
break;

- pte = READ_ONCE(*ptep);
+ pte = __ptep_get(ptep);
pr_cont(", pte=%016llx", pte_val(pte));
pte_unmap(ptep);
} while(0);
@@ -214,7 +214,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
pte_t entry, int dirty)
{
pteval_t old_pteval, pteval;
- pte_t pte = READ_ONCE(*ptep);
+ pte_t pte = __ptep_get(ptep);

if (pte_same(pte, entry))
return 0;
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 627a9717e98c..52fb767607e0 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -152,14 +152,14 @@ pte_t huge_ptep_get(pte_t *ptep)
{
int ncontig, i;
size_t pgsize;
- pte_t orig_pte = ptep_get(ptep);
+ pte_t orig_pte = __ptep_get(ptep);

if (!pte_present(orig_pte) || !pte_cont(orig_pte))
return orig_pte;

ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
for (i = 0; i < ncontig; i++, ptep++) {
- pte_t pte = ptep_get(ptep);
+ pte_t pte = __ptep_get(ptep);

if (pte_dirty(pte))
orig_pte = pte_mkdirty(orig_pte);
@@ -184,7 +184,7 @@ static pte_t get_clear_contig(struct mm_struct *mm,
unsigned long pgsize,
unsigned long ncontig)
{
- pte_t orig_pte = ptep_get(ptep);
+ pte_t orig_pte = __ptep_get(ptep);
unsigned long i;

for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
@@ -408,7 +408,7 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
{
int ncontig;
size_t pgsize;
- pte_t orig_pte = ptep_get(ptep);
+ pte_t orig_pte = __ptep_get(ptep);

if (!pte_cont(orig_pte))
return __ptep_get_and_clear(mm, addr, ptep);
@@ -431,11 +431,11 @@ static int __cont_access_flags_changed(pte_t *ptep, pte_t pte, int ncontig)
{
int i;

- if (pte_write(pte) != pte_write(ptep_get(ptep)))
+ if (pte_write(pte) != pte_write(__ptep_get(ptep)))
return 1;

for (i = 0; i < ncontig; i++) {
- pte_t orig_pte = ptep_get(ptep + i);
+ pte_t orig_pte = __ptep_get(ptep + i);

if (pte_dirty(pte) != pte_dirty(orig_pte))
return 1;
@@ -492,7 +492,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
size_t pgsize;
pte_t pte;

- if (!pte_cont(READ_ONCE(*ptep))) {
+ if (!pte_cont(__ptep_get(ptep))) {
__ptep_set_wrprotect(mm, addr, ptep);
return;
}
@@ -517,7 +517,7 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
size_t pgsize;
int ncontig;

- if (!pte_cont(READ_ONCE(*ptep)))
+ if (!pte_cont(__ptep_get(ptep)))
return ptep_clear_flush(vma, addr, ptep);

ncontig = find_num_contig(mm, addr, ptep, &pgsize);
@@ -550,7 +550,7 @@ pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr
* when the permission changes from executable to non-executable
* in cases where cpu is affected with errata #2645198.
*/
- if (pte_user_exec(READ_ONCE(*ptep)))
+ if (pte_user_exec(__ptep_get(ptep)))
return huge_ptep_clear_flush(vma, addr, ptep);
}
return huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 5eade712e9e5..5274c317d775 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -113,7 +113,7 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr,
memset(__va(page_phys), KASAN_SHADOW_INIT, PAGE_SIZE);
next = addr + PAGE_SIZE;
__set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
- } while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)));
+ } while (ptep++, addr = next, addr != end && pte_none(__ptep_get(ptep)));
}

static void __init kasan_pmd_populate(pud_t *pudp, unsigned long addr,
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 080e9b50f595..784f1e312447 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -176,7 +176,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,

ptep = pte_set_fixmap_offset(pmdp, addr);
do {
- pte_t old_pte = READ_ONCE(*ptep);
+ pte_t old_pte = __ptep_get(ptep);

__set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));

@@ -185,7 +185,7 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
* only allow updates to the permission attributes.
*/
BUG_ON(!pgattr_change_is_safe(pte_val(old_pte),
- READ_ONCE(pte_val(*ptep))));
+ pte_val(__ptep_get(ptep))));

phys += PAGE_SIZE;
} while (ptep++, addr += PAGE_SIZE, addr != end);
@@ -854,7 +854,7 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,

do {
ptep = pte_offset_kernel(pmdp, addr);
- pte = READ_ONCE(*ptep);
+ pte = __ptep_get(ptep);
if (pte_none(pte))
continue;

@@ -987,7 +987,7 @@ static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,

do {
ptep = pte_offset_kernel(pmdp, addr);
- pte = READ_ONCE(*ptep);
+ pte = __ptep_get(ptep);

/*
* This is just a sanity check here which verifies that
@@ -1006,7 +1006,7 @@ static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
*/
ptep = pte_offset_kernel(pmdp, 0UL);
for (i = 0; i < PTRS_PER_PTE; i++) {
- if (!pte_none(READ_ONCE(ptep[i])))
+ if (!pte_none(__ptep_get(&ptep[i])))
return;
}

@@ -1475,7 +1475,7 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte
* when the permission changes from executable to non-executable
* in cases where cpu is affected with errata #2645198.
*/
- if (pte_user_exec(READ_ONCE(*ptep)))
+ if (pte_user_exec(ptep_get(ptep)))
return ptep_clear_flush(vma, addr, ptep);
}
return ptep_get_and_clear(vma->vm_mm, addr, ptep);
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index a7996d8edf0a..0c4e3ecf989d 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -36,7 +36,7 @@ bool can_set_direct_map(void)
static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
{
struct page_change_data *cdata = data;
- pte_t pte = READ_ONCE(*ptep);
+ pte_t pte = __ptep_get(ptep);

pte = clear_pte_bit(pte, cdata->clear_mask);
pte = set_pte_bit(pte, cdata->set_mask);
@@ -245,5 +245,5 @@ bool kernel_page_present(struct page *page)
return true;

ptep = pte_offset_kernel(pmdp, addr);
- return pte_valid(READ_ONCE(*ptep));
+ return pte_valid(__ptep_get(ptep));
}
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 230b607cf881..5139a28130c0 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -33,7 +33,7 @@ static void *trans_alloc(struct trans_pgd_info *info)

static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
{
- pte_t pte = READ_ONCE(*src_ptep);
+ pte_t pte = __ptep_get(src_ptep);

if (pte_valid(pte)) {
/*
--
2.25.1


2023-12-18 10:54:43

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v4 14/16] arm64/mm: Wire up PTE_CONT for user mappings

With the ptep API sufficiently refactored, we can now introduce a new
"contpte" API layer, which transparently manages the PTE_CONT bit for
user mappings. Whenever it detects a set of PTEs that meet the
requirements for a contiguous range, the PTEs are re-painted with the
PTE_CONT bit. Use of contpte mappings is intended to be transparent to
the core-mm, which continues to interact with individual ptes.

Since a contpte block only has a single access and dirty bit, the
semantics here change slightly; when getting a pte (e.g. ptep_get())
that is part of a contpte mapping, the access and dirty information are
pulled from the block (so all ptes in the block return the same
access/dirty info). When changing the access/dirty info on a pte (e.g.
ptep_set_access_flags()) that is part of a contpte mapping, this change
will affect the whole contpte block. This works fine in practice
since we guarantee that only a single folio is mapped by a contpte
block, and the core-mm tracks access/dirty information per folio.

This initial change provides a baseline that can be optimized in future
commits. That said, fold/unfold operations (which imply tlb
invalidation) are avoided where possible with a few tricks for
access/dirty bit management. Write-protect modifications for contpte
mappings are currently non-optimal, and incur a regression in fork()
performance. This will be addressed in follow-up changes.

In order for the public functions, which used to be pure inline, to
continue to be callable by modules, export all the contpte_* symbols
that are now called by those public inline functions.

The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
at build time. It defaults to enabled as long as its dependency,
TRANSPARENT_HUGEPAGE, is also enabled. The core-mm depends upon
TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
enabled, then there is no chance of meeting the physical contiguity
requirement for contpte mappings.
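
As a hedged illustration of these semantics (not part of the patch; the
caller below is hypothetical and assumes a 4K kernel, where CONT_PTES is
16 and CONT_PTE_SIZE is 64K):

static void example_contpte_semantics(struct mm_struct *mm,
				      unsigned long addr,
				      pte_t *ptep, pte_t pte)
{
	pte_t a, b;

	/*
	 * addr, ptep and pte_pfn(pte) are all CONT_PTE_SIZE-aligned and
	 * the CONT_PTES entries map a single 64K folio, so the public
	 * set_ptes() paints the whole block with PTE_CONT internally.
	 */
	set_ptes(mm, addr, ptep, pte, CONT_PTES);

	/*
	 * The block shares a single access/dirty bit, so reading any
	 * constituent pte returns the same aggregated access/dirty state;
	 * likewise, setting access/dirty via the public API affects the
	 * whole block. Core-mm tracks that information per folio, so this
	 * is fine.
	 */
	a = ptep_get(ptep + 3);
	b = ptep_get(ptep + 7);
	WARN_ON(pte_dirty(a) != pte_dirty(b));
}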

Tested-by: John Hubbard <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/Kconfig | 10 +-
arch/arm64/include/asm/pgtable.h | 184 +++++++++++++++
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/contpte.c | 388 +++++++++++++++++++++++++++++++
4 files changed, 582 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/mm/contpte.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7b071a00425d..de76e484ff3a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2209,6 +2209,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
select UNWIND_TABLES
select DYNAMIC_SCS

+config ARM64_CONTPTE
+ bool "Contiguous PTE mappings for user memory" if EXPERT
+ depends on TRANSPARENT_HUGEPAGE
+ default y
+ help
+ When enabled, user mappings are configured using the PTE contiguous
+ bit, for any mappings that meet the size and alignment requirements.
+ This reduces TLB pressure and improves performance.
+
endmenu # "Kernel Features"

menu "Boot options"
@@ -2318,4 +2327,3 @@ endmenu # "CPU Power Management"
source "drivers/acpi/Kconfig"

source "arch/arm64/kvm/Kconfig"
-
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 6930c14f062f..e64120452301 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
*/
#define pte_valid_not_user(pte) \
((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
+/*
+ * Returns true if the pte is valid and has the contiguous bit set.
+ */
+#define pte_valid_cont(pte) (pte_valid(pte) && pte_cont(pte))
/*
* Could the pte be present in the TLB? We must check mm_tlb_flush_pending
* so that we don't erroneously return false for pages that have been
@@ -1116,6 +1120,184 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t old_pte, pte_t new_pte);

+#ifdef CONFIG_ARM64_CONTPTE
+
+/*
+ * The contpte APIs are used to transparently manage the contiguous bit in ptes
+ * where it is possible and makes sense to do so. The PTE_CONT bit is considered
+ * a private implementation detail of the public ptep API (see below).
+ */
+extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte);
+extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte);
+extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
+extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
+extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr);
+extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ pte_t entry, int dirty);
+
+static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ /*
+ * Only bother trying if both the virtual and physical addresses are
+ * aligned and correspond to the last entry in a contig range. The core
+ * code mostly modifies ranges from low to high, so this is the likely
+ * the last modification in the contig range, so a good time to fold.
+ * We can't fold special mappings, because there is no associated folio.
+ */
+
+ const unsigned long contmask = CONT_PTES - 1;
+ bool valign = (((unsigned long)ptep >> 3) & contmask) == contmask;
+ bool palign = (pte_pfn(pte) & contmask) == contmask;
+
+ if (valign && palign &&
+ pte_valid(pte) && !pte_cont(pte) && !pte_special(pte))
+ __contpte_try_fold(mm, addr, ptep, pte);
+}
+
+static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ if (pte_valid_cont(pte))
+ __contpte_try_unfold(mm, addr, ptep, pte);
+}
+
+/*
+ * The below functions constitute the public API that arm64 presents to the
+ * core-mm to manipulate PTE entries within their page tables (or at least this
+ * is the subset of the API that arm64 needs to implement). These public
+ * versions will automatically and transparently apply the contiguous bit where
+ * it makes sense to do so. Therefore any users that are contig-aware (e.g.
+ * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
+ * private versions, which are prefixed with double underscore. All of these
+ * APIs except for ptep_get_lockless() are expected to be called with the PTL
+ * held.
+ */
+
+#define ptep_get ptep_get
+static inline pte_t ptep_get(pte_t *ptep)
+{
+ pte_t pte = __ptep_get(ptep);
+
+ if (!pte_valid_cont(pte))
+ return pte;
+
+ return contpte_ptep_get(ptep, pte);
+}
+
+#define ptep_get_lockless ptep_get_lockless
+static inline pte_t ptep_get_lockless(pte_t *ptep)
+{
+ pte_t pte = __ptep_get(ptep);
+
+ if (!pte_valid_cont(pte))
+ return pte;
+
+ return contpte_ptep_get_lockless(ptep);
+}
+
+static inline void set_pte(pte_t *ptep, pte_t pte)
+{
+ /*
+ * We don't have the mm or vaddr so cannot unfold or fold contig entries
+ * (since it requires tlb maintenance). set_pte() is not used in core
+ * code, so this should never even be called. Regardless do our best to
+ * service any call and emit a warning if there is any attempt to set a
+ * pte on top of an existing contig range.
+ */
+ pte_t orig_pte = __ptep_get(ptep);
+
+ WARN_ON_ONCE(pte_valid_cont(orig_pte));
+ __set_pte(ptep, pte_mknoncont(pte));
+}
+
+#define set_ptes set_ptes
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
+{
+ pte = pte_mknoncont(pte);
+
+ if (nr == 1) {
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+ __set_ptes(mm, addr, ptep, pte, 1);
+ contpte_try_fold(mm, addr, ptep, pte);
+ } else
+ contpte_set_ptes(mm, addr, ptep, pte, nr);
+}
+
+static inline void pte_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+ __pte_clear(mm, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+ return __ptep_get_and_clear(mm, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
+static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+{
+ pte_t orig_pte = __ptep_get(ptep);
+
+ if (!pte_valid_cont(orig_pte))
+ return __ptep_test_and_clear_young(vma, addr, ptep);
+
+ return contpte_ptep_test_and_clear_young(vma, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
+static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+{
+ pte_t orig_pte = __ptep_get(ptep);
+
+ if (!pte_valid_cont(orig_pte))
+ return __ptep_clear_flush_young(vma, addr, ptep);
+
+ return contpte_ptep_clear_flush_young(vma, addr, ptep);
+}
+
+#define __HAVE_ARCH_PTEP_SET_WRPROTECT
+static inline void ptep_set_wrprotect(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+ __ptep_set_wrprotect(mm, addr, ptep);
+ contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
+}
+
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+static inline int ptep_set_access_flags(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ pte_t entry, int dirty)
+{
+ pte_t orig_pte = __ptep_get(ptep);
+
+ entry = pte_mknoncont(entry);
+
+ if (!pte_valid_cont(orig_pte))
+ return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+
+ return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+}
+
+#else /* CONFIG_ARM64_CONTPTE */
+
#define ptep_get __ptep_get
#define set_pte __set_pte
#define set_ptes __set_ptes
@@ -1131,6 +1313,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
#define ptep_set_access_flags __ptep_set_access_flags

+#endif /* CONFIG_ARM64_CONTPTE */
+
#endif /* !__ASSEMBLY__ */

#endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d..60454256945b 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -3,6 +3,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \
cache.o copypage.o flush.o \
ioremap.o mmap.o pgd.o mmu.o \
context.o proc.o pageattr.o fixmap.o
+obj-$(CONFIG_ARM64_CONTPTE) += contpte.o
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
obj-$(CONFIG_PTDUMP_DEBUGFS) += ptdump_debugfs.o
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
new file mode 100644
index 000000000000..69c36749dd98
--- /dev/null
+++ b/arch/arm64/mm/contpte.c
@@ -0,0 +1,388 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+
+#include <linux/mm.h>
+#include <linux/export.h>
+#include <asm/tlbflush.h>
+
+static inline bool mm_is_user(struct mm_struct *mm)
+{
+ /*
+ * Don't attempt to apply the contig bit to kernel mappings, because
+ * dynamically adding/removing the contig bit can cause page faults.
+ * These racing faults are ok for user space, since they get serialized
+ * on the PTL. But kernel mappings can't tolerate faults.
+ */
+ return mm != &init_mm;
+}
+
+static inline pte_t *contpte_align_down(pte_t *ptep)
+{
+ return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
+}
+
+static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, int nr)
+{
+ struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+ unsigned long start_addr = addr;
+ int i;
+
+ for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
+ __pte_clear(mm, addr, ptep);
+
+ __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
+}
+
+static bool ptep_any_valid(pte_t *ptep, int nr)
+{
+ int i;
+
+ for (i = 0; i < nr; i++, ptep++) {
+ if (pte_valid(__ptep_get(ptep)))
+ return true;
+ }
+
+ return false;
+}
+
+static void contpte_convert(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+ unsigned long start_addr;
+ pte_t *start_ptep;
+ int i;
+
+ start_ptep = ptep = contpte_align_down(ptep);
+ start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+ pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
+
+ for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
+ pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
+
+ if (pte_dirty(ptent))
+ pte = pte_mkdirty(pte);
+
+ if (pte_young(ptent))
+ pte = pte_mkyoung(pte);
+ }
+
+ __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
+
+ __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
+}
+
+void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ /*
+ * We have already checked that the virtual and pysical addresses are
+ * correctly aligned for a contpte mapping in contpte_try_fold() so the
+ * remaining checks are to ensure that the contpte range is fully
+ * covered by a single folio, and ensure that all the ptes are valid
+ * with contiguous PFNs and matching prots. We ignore the state of the
+ * access and dirty bits for the purpose of deciding if its a contiguous
+ * range; the folding process will generate a single contpte entry which
+ * has a single access and dirty bit. Those 2 bits are the logical OR of
+ * their respective bits in the constituent pte entries. In order to
+ * ensure the contpte range is covered by a single folio, we must
+ * recover the folio from the pfn, but special mappings don't have a
+ * folio backing them. Fortunately contpte_try_fold() already checked
+ * that the pte is not special - we never try to fold special mappings.
+ * Note we can't use vm_normal_page() for this since we don't have the
+ * vma.
+ */
+
+ unsigned long folio_saddr;
+ unsigned long folio_eaddr;
+ unsigned long cont_saddr;
+ unsigned long cont_eaddr;
+ struct folio *folio;
+ struct page *page;
+ unsigned long pfn;
+ pte_t *orig_ptep;
+ pgprot_t prot;
+ pte_t subpte;
+ int i;
+
+ if (!mm_is_user(mm))
+ return;
+
+ page = pte_page(pte);
+ folio = page_folio(page);
+ folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
+ folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
+ cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+ cont_eaddr = cont_saddr + CONT_PTE_SIZE;
+
+ if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
+ return;
+
+ pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
+ prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+ orig_ptep = ptep;
+ ptep = contpte_align_down(ptep);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
+ subpte = __ptep_get(ptep);
+ subpte = pte_mkold(pte_mkclean(subpte));
+
+ if (!pte_valid(subpte) ||
+ pte_pfn(subpte) != pfn ||
+ pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
+ return;
+ }
+
+ pte = pte_mkcont(pte);
+ contpte_convert(mm, addr, orig_ptep, pte);
+}
+EXPORT_SYMBOL(__contpte_try_fold);
+
+void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ /*
+ * We have already checked that the ptes are contiguous in
+ * contpte_try_unfold(), so just check that the mm is user space.
+ */
+
+ if (!mm_is_user(mm))
+ return;
+
+ pte = pte_mknoncont(pte);
+ contpte_convert(mm, addr, ptep, pte);
+}
+EXPORT_SYMBOL(__contpte_try_unfold);
+
+pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
+{
+ /*
+ * Gather access/dirty bits, which may be populated in any of the ptes
+ * of the contig range. We are guarranteed to be holding the PTL, so any
+ * contiguous range cannot be unfolded or otherwise modified under our
+ * feet.
+ */
+
+ pte_t pte;
+ int i;
+
+ ptep = contpte_align_down(ptep);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++) {
+ pte = __ptep_get(ptep);
+
+ if (pte_dirty(pte))
+ orig_pte = pte_mkdirty(orig_pte);
+
+ if (pte_young(pte))
+ orig_pte = pte_mkyoung(orig_pte);
+ }
+
+ return orig_pte;
+}
+EXPORT_SYMBOL(contpte_ptep_get);
+
+pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
+{
+ /*
+ * Gather access/dirty bits, which may be populated in any of the ptes
+ * of the contig range. We may not be holding the PTL, so any contiguous
+ * range may be unfolded/modified/refolded under our feet. Therefore we
+ * ensure we read a _consistent_ contpte range by checking that all ptes
+ * in the range are valid and have CONT_PTE set, that all pfns are
+ * contiguous and that all pgprots are the same (ignoring access/dirty).
+ * If we find a pte that is not consistent, then we must be racing with
+ * an update so start again. If the target pte does not have CONT_PTE
+ * set then that is considered consistent on its own because it is not
+ * part of a contpte range.
+ */
+
+ pgprot_t orig_prot;
+ unsigned long pfn;
+ pte_t orig_pte;
+ pgprot_t prot;
+ pte_t *ptep;
+ pte_t pte;
+ int i;
+
+retry:
+ orig_pte = __ptep_get(orig_ptep);
+
+ if (!pte_valid_cont(orig_pte))
+ return orig_pte;
+
+ orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
+ ptep = contpte_align_down(orig_ptep);
+ pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
+ pte = __ptep_get(ptep);
+ prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+
+ if (!pte_valid_cont(pte) ||
+ pte_pfn(pte) != pfn ||
+ pgprot_val(prot) != pgprot_val(orig_prot))
+ goto retry;
+
+ if (pte_dirty(pte))
+ orig_pte = pte_mkdirty(orig_pte);
+
+ if (pte_young(pte))
+ orig_pte = pte_mkyoung(orig_pte);
+ }
+
+ return orig_pte;
+}
+EXPORT_SYMBOL(contpte_ptep_get_lockless);
+
+void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
+{
+ unsigned long next;
+ unsigned long end;
+ unsigned long pfn;
+ pgprot_t prot;
+ pte_t orig_pte;
+
+ if (!mm_is_user(mm))
+ return __set_ptes(mm, addr, ptep, pte, nr);
+
+ end = addr + (nr << PAGE_SHIFT);
+ pfn = pte_pfn(pte);
+ prot = pte_pgprot(pte);
+
+ do {
+ next = pte_cont_addr_end(addr, end);
+ nr = (next - addr) >> PAGE_SHIFT;
+ pte = pfn_pte(pfn, prot);
+
+ if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
+ pte = pte_mkcont(pte);
+ else
+ pte = pte_mknoncont(pte);
+
+ /*
+ * If operating on a partial contiguous range then we must first
+ * unfold the contiguous range if it was previously folded.
+ * Otherwise we could end up with overlapping tlb entries.
+ */
+ if (nr != CONT_PTES)
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+
+ /*
+ * If we are replacing ptes that were contiguous or if the new
+ * ptes are contiguous and any of the ptes being replaced are
+ * valid, we need to clear and flush the range to prevent
+ * overlapping tlb entries.
+ */
+ orig_pte = __ptep_get(ptep);
+ if (pte_valid_cont(orig_pte) ||
+ (pte_cont(pte) && ptep_any_valid(ptep, nr)))
+ ptep_clear_flush_range(mm, addr, ptep, nr);
+
+ __set_ptes(mm, addr, ptep, pte, nr);
+
+ addr = next;
+ ptep += nr;
+ pfn += nr;
+
+ } while (addr != end);
+}
+EXPORT_SYMBOL(contpte_set_ptes);
+
+int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+{
+ /*
+ * ptep_clear_flush_young() technically requires us to clear the access
+ * flag for a _single_ pte. However, the core-mm code actually tracks
+ * access/dirty per folio, not per page. And since we only create a
+ * contig range when the range is covered by a single folio, we can get
+ * away with clearing young for the whole contig range here, so we avoid
+ * having to unfold.
+ */
+
+ int young = 0;
+ int i;
+
+ ptep = contpte_align_down(ptep);
+ addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+ young |= __ptep_test_and_clear_young(vma, addr, ptep);
+
+ return young;
+}
+EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
+
+int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+{
+ int young;
+
+ young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
+
+ if (young) {
+ /*
+ * See comment in __ptep_clear_flush_young(); same rationale for
+ * eliding the trailing DSB applies here.
+ */
+ addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+ __flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
+ PAGE_SIZE, true, 3);
+ }
+
+ return young;
+}
+EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
+
+int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ pte_t entry, int dirty)
+{
+ unsigned long start_addr;
+ pte_t orig_pte;
+ int i;
+
+ /*
+ * Gather the access/dirty bits for the contiguous range. If nothing has
+ * changed, its a noop.
+ */
+ orig_pte = pte_mknoncont(ptep_get(ptep));
+ if (pte_val(orig_pte) == pte_val(entry))
+ return 0;
+
+ /*
+ * We can fix up access/dirty bits without having to unfold/fold the
+ * contig range. But if the write bit is changing, we need to go through
+ * the full unfold/fold cycle.
+ */
+ if (pte_write(orig_pte) == pte_write(entry)) {
+ /*
+ * For HW access management, we technically only need to update
+ * the flag on a single pte in the range. But for SW access
+ * management, we need to update all the ptes to prevent extra
+ * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
+ * and instead flush the whole range at the end.
+ */
+ ptep = contpte_align_down(ptep);
+ start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+
+ for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+ __ptep_set_access_flags(vma, addr, ptep, entry, 0);
+
+ if (dirty)
+ __flush_tlb_range(vma, start_addr, addr,
+ PAGE_SIZE, true, 3);
+ } else {
+ __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
+ __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
+ contpte_try_fold(vma->vm_mm, addr, ptep, entry);
+ }
+
+ return 1;
+}
+EXPORT_SYMBOL(contpte_ptep_set_access_flags);
--
2.25.1


2023-12-18 10:54:57

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v4 13/16] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

Split __flush_tlb_range() into __flush_tlb_range_nosync() +
__flush_tlb_range(), in the same way as the existing flush_tlb_page()
arrangement. This allows calling __flush_tlb_range_nosync() to elide the
trailing DSB. Forthcoming "contpte" code will take advantage of this
when clearing the young bit from a contiguous range of ptes.
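
As a hedged sketch of the intended usage pattern (mirroring the
contpte_ptep_clear_flush_young() caller elsewhere in this series):

int young = contpte_ptep_test_and_clear_young(vma, addr, ptep);

if (young) {
	/*
	 * One ranged invalidation for the whole contpte block, without
	 * the trailing DSB; the rationale for eliding it is the same as
	 * in __ptep_clear_flush_young().
	 */
	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
	__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
				 PAGE_SIZE, true, 3);
}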

Tested-by: John Hubbard <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/tlbflush.h | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index bb2c2833a987..925ef3bdf9ed 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -399,7 +399,7 @@ do { \
#define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
__flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false)

-static inline void __flush_tlb_range(struct vm_area_struct *vma,
+static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
unsigned long stride, bool last_level,
int tlb_level)
@@ -431,10 +431,19 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
else
__flush_tlb_range_op(vae1is, start, pages, stride, asid, tlb_level, true);

- dsb(ish);
mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
}

+static inline void __flush_tlb_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ unsigned long stride, bool last_level,
+ int tlb_level)
+{
+ __flush_tlb_range_nosync(vma, start, end, stride,
+ last_level, tlb_level);
+ dsb(ish);
+}
+
static inline void flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
--
2.25.1


2023-12-18 10:55:18

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v4 15/16] arm64/mm: Implement new helpers to optimize fork()

With the core-mm changes in place to batch-copy ptes during fork, we can
take advantage of this in arm64 to greatly reduce the number of tlbis we
have to issue, and recover the lost fork performance incurred when adding
support for transparent contiguous ptes.

This optimization covers 2 cases:

1) The memory being CoWed is contpte-sized (or bigger) folios. We set
wrprotect in the parent and set the ptes in the child for a whole
contpte block in one hit. This means we can operate on the whole
block and don't need to unfold/fold.

2) The memory being CoWed is all order-0 folios. No folding or unfolding
occurs here, but the added cost of checking if we need to fold on
every pte adds up. Given we are forking, we are just copying the ptes
already in the parent, so we should be maintaining the single/contpte
state into the child anyway, and any check for folding will always be
false. Therefore, we can elide the fold check in set_ptes_full() and
ptep_set_wrprotects() when full=1.

The optimization to wrprotect a whole contpte block without unfolding is
possible thanks to the tightening of the Arm ARM with respect to the
definition and behaviour when 'Misprogramming the Contiguous bit'. See
section D21194 at https://developer.arm.com/documentation/102105/latest/
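
Roughly, the core-mm batching from the earlier patches in this series is
expected to drive the new helpers during fork() as sketched below; this
fragment is illustrative only and the src_*/dst_* names are hypothetical:

	/* nr covers a whole contpte block where possible, otherwise 1. */
	nr = pte_batch_remaining(ptent, addr, end);

	/*
	 * Wrprotect the whole batch in the parent; a fully covered contpte
	 * block is handled without unfolding (case 1 above).
	 */
	ptep_set_wrprotects(src_mm, addr, src_pte, nr, true);

	/*
	 * full=1 elides the fold checks when copying into the child, since
	 * fork just preserves the parent's single/contpte state (case 2).
	 */
	set_ptes_full(dst_mm, addr, dst_pte, ptent, nr, true);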

The following microbenchmark results demonstrate the recovered (and
overall improved) fork performance for large pte-mapped folios once this
patch is applied. Fork is called in a tight loop in a process with 1G of
populated memory and the time for the function to execute is measured.
100 iterations per run, 8 runs performed on both Apple M2 (VM) and
Ampere Altra (bare metal). Tests performed for case where 1G memory is
comprised of pte-mapped order-9 folios. Negative is faster, positive is
slower, compared to baseline upon which the series is based:

| fork | Apple M2 VM | Ampere Altra |
| order-9 |-------------------|-------------------|
| (pte-map) | mean | stdev | mean | stdev |
|---------------|---------|---------|---------|---------|
| baseline | 0.0% | 1.2% | 0.0% | 0.1% |
| before-change | 541.5% | 2.8% | 3654.4% | 0.0% |
| after-change | -25.4% | 1.9% | -6.7% | 0.1% |

Tested-by: John Hubbard <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 97 ++++++++++++++++++++++++++------
arch/arm64/mm/contpte.c | 47 ++++++++++++++++
2 files changed, 128 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e64120452301..d4805f73b9db 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -966,16 +966,12 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

-/*
- * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
- * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
- */
-static inline void __ptep_set_wrprotect(struct mm_struct *mm,
- unsigned long address, pte_t *ptep)
+static inline void ___ptep_set_wrprotect(struct mm_struct *mm,
+ unsigned long address, pte_t *ptep,
+ pte_t pte)
{
- pte_t old_pte, pte;
+ pte_t old_pte;

- pte = __ptep_get(ptep);
do {
old_pte = pte;
pte = pte_wrprotect(pte);
@@ -984,6 +980,26 @@ static inline void __ptep_set_wrprotect(struct mm_struct *mm,
} while (pte_val(pte) != pte_val(old_pte));
}

+/*
+ * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
+ * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
+ */
+static inline void __ptep_set_wrprotect(struct mm_struct *mm,
+ unsigned long address, pte_t *ptep)
+{
+ ___ptep_set_wrprotect(mm, address, ptep, __ptep_get(ptep));
+}
+
+static inline void __ptep_set_wrprotects(struct mm_struct *mm,
+ unsigned long address, pte_t *ptep,
+ unsigned int nr, int full)
+{
+ unsigned int i;
+
+ for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
+ __ptep_set_wrprotect(mm, address, ptep);
+}
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define __HAVE_ARCH_PMDP_SET_WRPROTECT
static inline void pmdp_set_wrprotect(struct mm_struct *mm,
@@ -1139,6 +1155,8 @@ extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep);
extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep);
+extern void contpte_set_wrprotects(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, unsigned int nr, int full);
extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t entry, int dirty);
@@ -1170,6 +1188,17 @@ static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
__contpte_try_unfold(mm, addr, ptep, pte);
}

+#define pte_batch_remaining pte_batch_remaining
+static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long addr,
+ unsigned long end)
+{
+ if (!pte_valid_cont(pte))
+ return 1;
+
+ return min(CONT_PTES - ((addr >> PAGE_SHIFT) & (CONT_PTES - 1)),
+ (end - addr) >> PAGE_SHIFT);
+}
+
/*
* The below functions constitute the public API that arm64 presents to the
* core-mm to manipulate PTE entries within their page tables (or at least this
@@ -1219,20 +1248,30 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
__set_pte(ptep, pte_mknoncont(pte));
}

-#define set_ptes set_ptes
-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
+#define set_ptes_full set_ptes_full
+static inline void set_ptes_full(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr,
+ int full)
{
pte = pte_mknoncont(pte);

if (nr == 1) {
- contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+ if (!full)
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
__set_ptes(mm, addr, ptep, pte, 1);
- contpte_try_fold(mm, addr, ptep, pte);
+ if (!full)
+ contpte_try_fold(mm, addr, ptep, pte);
} else
contpte_set_ptes(mm, addr, ptep, pte, nr);
}

+#define set_ptes set_ptes
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
+{
+ set_ptes_full(mm, addr, ptep, pte, nr, false);
+}
+
static inline void pte_clear(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
@@ -1272,13 +1311,38 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
return contpte_ptep_clear_flush_young(vma, addr, ptep);
}

+#define ptep_set_wrprotects ptep_set_wrprotects
+static inline void ptep_set_wrprotects(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep,
+ unsigned int nr, int full)
+{
+ if (nr == 1) {
+ /*
+ * Optimization: ptep_set_wrprotects() can only be called for
+ * present ptes so we only need to check contig bit as condition
+ * for unfold, and we can remove the contig bit from the pte we
+ * read to avoid re-reading. This speeds up fork() with is very
+ * sensitive for order-0 folios. Should be equivalent to
+ * contpte_try_unfold() for this case.
+ */
+ pte_t orig_pte = __ptep_get(ptep);
+
+ if (unlikely(pte_cont(orig_pte))) {
+ __contpte_try_unfold(mm, addr, ptep, orig_pte);
+ orig_pte = pte_mknoncont(orig_pte);
+ }
+ ___ptep_set_wrprotect(mm, addr, ptep, orig_pte);
+ if (!full)
+ contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
+ } else
+ contpte_set_wrprotects(mm, addr, ptep, nr, full);
+}
+
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
static inline void ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
- contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
- __ptep_set_wrprotect(mm, addr, ptep);
- contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
+ ptep_set_wrprotects(mm, addr, ptep, 1, false);
}

#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
@@ -1310,6 +1374,7 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
#define ptep_clear_flush_young __ptep_clear_flush_young
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define ptep_set_wrprotect __ptep_set_wrprotect
+#define ptep_set_wrprotects __ptep_set_wrprotects
#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
#define ptep_set_access_flags __ptep_set_access_flags

diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 69c36749dd98..72e672024785 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -339,6 +339,53 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
}
EXPORT_SYMBOL(contpte_ptep_clear_flush_young);

+void contpte_set_wrprotects(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, unsigned int nr, int full)
+{
+ unsigned long next;
+ unsigned long end;
+
+ if (!mm_is_user(mm))
+ return __ptep_set_wrprotects(mm, addr, ptep, nr, full);
+
+ end = addr + (nr << PAGE_SHIFT);
+
+ do {
+ next = pte_cont_addr_end(addr, end);
+ nr = (next - addr) >> PAGE_SHIFT;
+
+ /*
+ * If wrprotecting an entire contig range, we can avoid
+ * unfolding. Just set wrprotect and wait for the later
+ * mmu_gather flush to invalidate the tlb. Until the flush, the
+ * page may or may not be wrprotected. After the flush, it is
+ * guarranteed wrprotected. If its a partial range though, we
+ * must unfold, because we can't have a case where CONT_PTE is
+ * set but wrprotect applies to a subset of the PTEs; this would
+ * cause it to continue to be unpredictable after the flush.
+ */
+ if (nr != CONT_PTES)
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+
+ __ptep_set_wrprotects(mm, addr, ptep, nr, full);
+
+ addr = next;
+ ptep += nr;
+
+ /*
+ * If applying to a partial contig range, the change could have
+ * made the range foldable. Use the last pte in the range we
+ * just set for comparison, since contpte_try_fold() only
+ * triggers when acting on the last pte in the contig range.
+ */
+ if (nr != CONT_PTES)
+ contpte_try_fold(mm, addr - PAGE_SIZE, ptep - 1,
+ __ptep_get(ptep - 1));
+
+ } while (addr != end);
+}
+EXPORT_SYMBOL(contpte_set_wrprotects);
+
int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t entry, int dirty)
--
2.25.1


2023-12-18 10:55:32

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v4 16/16] arm64/mm: Implement clear_ptes() to optimize exit, munmap, dontneed

With the core-mm changes in place to batch-clear ptes during
zap_pte_range(), we can take advantage of this in arm64 to greatly
reduce the number of tlbis we have to issue, and recover the lost
performance in exit, munmap and madvise(DONTNEED) incurred when adding
support for transparent contiguous ptes.

If we are clearing a whole contpte range, we can elide first unfolding
that range and save the tlbis. We just clear the whole range.
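
As a hedged sketch of the caller side (the batched zap_pte_range() from
the earlier core-mm patches; variable names here are hypothetical):

	/* nr covers a whole contpte block where possible, otherwise 1. */
	nr = pte_batch_remaining(ptent, addr, end);

	/*
	 * A fully covered contpte block is cleared without unfolding it
	 * first; the returned pte accumulates the access/dirty state of
	 * the whole batch.
	 */
	ptent = clear_ptes(mm, addr, pte, nr, full);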

The following microbenchmark results demonstrate the effect of this
change on madvise(DONTNEED) performance for large pte-mapped folios.
madvise(dontneed) is called for each page of a 1G populated mapping and
the total time is measured. 100 iterations per run, 8 runs performed on
both Apple M2 (VM) and Ampere Altra (bare metal). Tests performed for
case where 1G memory is comprised of pte-mapped order-9 folios. Negative
is faster, positive is slower, compared to baseline upon which the
series is based:

| dontneed | Apple M2 VM | Ampere Altra |
| order-9 |-------------------|-------------------|
| (pte-map) | mean | stdev | mean | stdev |
|---------------|---------|---------|---------|---------|
| baseline | 0.0% | 7.9% | 0.0% | 0.0% |
| before-change | -1.3% | 7.0% | 13.0% | 0.0% |
| after-change | -9.9% | 0.9% | 14.1% | 0.0% |

The memory is initially all contpte-mapped and has to be unfolded (which
requires tlbi for the whole block) when the first page is touched (since
the test is madvise-ing 1 page at a time). Ampere Altra has high cost
for tlbi; this is why cost increases there.

The following microbenchmark results demonstrate the recovery (and
overall improvement) of munmap performance for large pte-mapped folios.
munmap is called for a 1G populated mapping and the function runtime is
measured. 100 iterations per run, 8 runs performed on both Apple M2 (VM)
and Ampere Altra (bare metal). Tests performed for case where 1G memory
is comprised of pte-mapped order-9 folios. Negative is faster, positive
is slower, compared to baseline upon which the series is based:

| munmap        |    Apple M2 VM    |   Ampere Altra    |
| order-9       |-------------------|-------------------|
| (pte-map)     |    mean |   stdev |    mean |   stdev |
|---------------|---------|---------|---------|---------|
| baseline      |    0.0% |    6.4% |    0.0% |    0.1% |
| before-change |   43.3% |    1.9% |  375.2% |    0.0% |
| after-change  |   -6.0% |    1.4% |   -0.6% |    0.2% |

Tested-by: John Hubbard <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++
arch/arm64/mm/contpte.c | 45 ++++++++++++++++++++++++++++++++
2 files changed, 87 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index d4805f73b9db..f5bf059291c3 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -953,6 +953,29 @@ static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
return pte;
}

+static inline pte_t __clear_ptes(struct mm_struct *mm,
+ unsigned long address, pte_t *ptep,
+ unsigned int nr, int full)
+{
+ pte_t orig_pte = __ptep_get_and_clear(mm, address, ptep);
+ unsigned int i;
+ pte_t pte;
+
+ for (i = 1; i < nr; i++) {
+ address += PAGE_SIZE;
+ ptep++;
+ pte = __ptep_get_and_clear(mm, address, ptep);
+
+ if (pte_dirty(pte))
+ orig_pte = pte_mkdirty(orig_pte);
+
+ if (pte_young(pte))
+ orig_pte = pte_mkyoung(orig_pte);
+ }
+
+ return orig_pte;
+}
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
@@ -1151,6 +1174,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, unsigned int nr);
+extern pte_t contpte_clear_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, unsigned int nr, int full);
extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep);
extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
@@ -1279,6 +1304,22 @@ static inline void pte_clear(struct mm_struct *mm,
__pte_clear(mm, addr, ptep);
}

+#define clear_ptes clear_ptes
+static inline pte_t clear_ptes(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep,
+ unsigned int nr, int full)
+{
+ pte_t pte;
+
+ if (nr == 1) {
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+ pte = __ptep_get_and_clear(mm, addr, ptep);
+ } else
+ pte = contpte_clear_ptes(mm, addr, ptep, nr, full);
+
+ return pte;
+}
+
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
@@ -1366,6 +1407,7 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
#define set_pte __set_pte
#define set_ptes __set_ptes
#define pte_clear __pte_clear
+#define clear_ptes __clear_ptes
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define ptep_get_and_clear __ptep_get_and_clear
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 72e672024785..6f2a15ac5163 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -293,6 +293,51 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
}
EXPORT_SYMBOL(contpte_set_ptes);

+pte_t contpte_clear_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
+ unsigned int nr, int full)
+{
+ /*
+ * If we cover a partial contpte block at the beginning or end of the
+ * batch, unfold if currently folded. This makes it safe to clear some
+ * of the entries while keeping others. contpte blocks in the middle of
+ * the range, which are fully covered, don't need to be unfolded because
+ * we will clear the full block.
+ */
+
+ unsigned int i;
+ pte_t pte;
+ pte_t tail;
+
+ if (!mm_is_user(mm))
+ return __clear_ptes(mm, addr, ptep, nr, full);
+
+ if (ptep != contpte_align_down(ptep) || nr < CONT_PTES)
+ contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+
+ if (ptep + nr != contpte_align_down(ptep + nr))
+ contpte_try_unfold(mm, addr + PAGE_SIZE * (nr - 1),
+ ptep + nr - 1,
+ __ptep_get(ptep + nr - 1));
+
+ pte = __ptep_get_and_clear(mm, addr, ptep);
+
+ for (i = 1; i < nr; i++) {
+ addr += PAGE_SIZE;
+ ptep++;
+
+ tail = __ptep_get_and_clear(mm, addr, ptep);
+
+ if (pte_dirty(tail))
+ pte = pte_mkdirty(pte);
+
+ if (pte_young(tail))
+ pte = pte_mkyoung(pte);
+ }
+
+ return pte;
+}
+EXPORT_SYMBOL(contpte_clear_ptes);
+
int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
--
2.25.1


2023-12-18 17:41:18

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 01/16] mm: thp: Batch-collapse PMD with set_ptes()

On 18.12.23 11:50, Ryan Roberts wrote:
> Refactor __split_huge_pmd_locked() so that a present PMD can be
> collapsed to PTEs in a single batch using set_ptes(). It also provides a
> future opportunity to batch-add the folio to the rmap using David's new
> batched rmap APIs.

I'd drop that sentence and rather just say "In the future, we might get
rid of the remaining manual loop by using rmap batching.".

>
> This should improve performance a little bit, but the real motivation is
> to remove the need for the arm64 backend to have to fold the contpte
> entries. Instead, since the ptes are set as a batch, the contpte blocks
> can be initially set up pre-folded (once the arm64 contpte support is
> added in the next few patches). This leads to noticeable performance
> improvement during split.
>
Acked-by: David Hildenbrand <[email protected]>

--
Cheers,

David / dhildenb


2023-12-18 17:50:51

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 18.12.23 11:50, Ryan Roberts wrote:
> Convert copy_pte_range() to copy a batch of ptes in one go. A given
> batch is determined by the architecture with the new helper,
> pte_batch_remaining(), and maps a physically contiguous block of memory,
> all belonging to the same folio. A pte batch is then write-protected in
> one go in the parent using the new helper, ptep_set_wrprotects() and is
> set in one go in the child using the new helper, set_ptes_full().
>
> The primary motivation for this change is to reduce the number of tlb
> maintenance operations that the arm64 backend has to perform during
> fork, as it is about to add transparent support for the "contiguous bit"
> in its ptes. By write-protecting the parent using the new
> ptep_set_wrprotects() (note the 's' at the end) function, the backend
> can avoid having to unfold contig ranges of PTEs, which is expensive,
> when all ptes in the range are being write-protected. Similarly, by
> using set_ptes_full() rather than set_pte_at() to set up ptes in the
> child, the backend does not need to fold a contiguous range once they
> are all populated - they can be initially populated as a contiguous
> range in the first place.
>
> This code is very performance sensitive, and a significant amount of
> effort has been put into not regressing performance for the order-0
> folio case. By default, pte_batch_remaining() is compile constant 1,
> which enables the compiler to simplify the extra loops that are added
> for batching and produce code that is equivalent (and equally
> performant) as the previous implementation.
>
> This change addresses the core-mm refactoring only and a separate change
> will implement pte_batch_remaining(), ptep_set_wrprotects() and
> set_ptes_full() in the arm64 backend to realize the performance
> improvement as part of the work to enable contpte mappings.
>
> To ensure the arm64 is performant once implemented, this change is very
> careful to only call ptep_get() once per pte batch.
>
> The following microbenchmark results demonstate that there is no
> significant performance change after this patch. Fork is called in a
> tight loop in a process with 1G of populated memory and the time for the
> function to execute is measured. 100 iterations per run, 8 runs
> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
> performed for case where 1G memory is comprised of order-0 folios and
> case where comprised of pte-mapped order-9 folios. Negative is faster,
> positive is slower, compared to baseline upon which the series is based:
>
> | Apple M2 VM | order-0 (pte-map) | order-9 (pte-map) |
> | fork |-------------------|-------------------|
> | microbench | mean | stdev | mean | stdev |
> |---------------|---------|---------|---------|---------|
> | baseline | 0.0% | 1.1% | 0.0% | 1.2% |
> | after-change | -1.0% | 2.0% | -0.1% | 1.1% |
>
> | Ampere Altra | order-0 (pte-map) | order-9 (pte-map) |
> | fork |-------------------|-------------------|
> | microbench | mean | stdev | mean | stdev |
> |---------------|---------|---------|---------|---------|
> | baseline | 0.0% | 1.0% | 0.0% | 0.1% |
> | after-change | -0.1% | 1.2% | -0.1% | 0.1% |
>
> Tested-by: John Hubbard <[email protected]>
> Reviewed-by: Alistair Popple <[email protected]>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
> mm/memory.c | 92 ++++++++++++++++++++++++++---------------
> 2 files changed, 139 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..db93fb81465a 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
> #define arch_flush_lazy_mmu_mode() do {} while (0)
> #endif
>
> +#ifndef pte_batch_remaining
> +/**
> + * pte_batch_remaining - Number of pages from addr to next batch boundary.
> + * @pte: Page table entry for the first page.
> + * @addr: Address of the first page.
> + * @end: Batch ceiling (e.g. end of vma).
> + *
> + * Some architectures (arm64) can efficiently modify a contiguous batch of ptes.
> + * In such cases, this function returns the remaining number of pages to the end
> + * of the current batch, as defined by addr. This can be useful when iterating
> + * over ptes.
> + *
> + * May be overridden by the architecture, else batch size is always 1.
> + */
> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long addr,
> + unsigned long end)
> +{
> + return 1;
> +}
> +#endif

It's a shame we now lose the optimization for all other architectures.

Was there no way to have some basic batching mechanism that doesn't
require arch specifics?

I'd have thought that something very basic would have worked like:

* Check if PTE is the same when setting the PFN to 0.
* Check that PFN is consecutive
* Check that all PFNs belong to the same folio
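
Something like the following rough sketch, perhaps (untested, and it
approximates "PTE the same except for the PFN" by comparing the individual
attribute bits rather than masking the PFN out of the raw value):

#include <linux/mm.h>
#include <linux/pgtable.h>

/*
 * Count how many consecutive ptes starting at ptep map consecutive pfns
 * that all belong to the same folio with matching attributes, capped at
 * max_nr. Returns at least 1.
 */
static int generic_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
			     int max_nr)
{
	unsigned long pfn = pte_pfn(pte);
	int nr = 1;

	while (nr < max_nr) {
		pte_t next = ptep_get(ptep + nr);

		/* PFNs must be present and consecutive ... */
		if (!pte_present(next) || pte_pfn(next) != pfn + nr)
			break;
		/* ... all pages must belong to the same folio ... */
		if (pfn_folio(pfn + nr) != folio)
			break;
		/* ... and the remaining attributes must match */
		if (pte_write(next) != pte_write(pte) ||
		    pte_dirty(next) != pte_dirty(pte) ||
		    pte_young(next) != pte_young(pte))
			break;
		nr++;
	}

	return nr;
}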

--
Cheers,

David / dhildenb


2023-12-19 08:19:16

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 01/16] mm: thp: Batch-collapse PMD with set_ptes()

On 18/12/2023 17:40, David Hildenbrand wrote:
> On 18.12.23 11:50, Ryan Roberts wrote:
>> Refactor __split_huge_pmd_locked() so that a present PMD can be
>> collapsed to PTEs in a single batch using set_ptes(). It also provides a
>> future opportunity to batch-add the folio to the rmap using David's new
>> batched rmap APIs.
>
> I'd drop that sentence and rather just say "In the future, we might get rid of
> the remaining manual loop by using rmap batching.".

OK fair enough. Will fix for next version.

>
>>
>> This should improve performance a little bit, but the real motivation is
>> to remove the need for the arm64 backend to have to fold the contpte
>> entries. Instead, since the ptes are set as a batch, the contpte blocks
>> can be initially set up pre-folded (once the arm64 contpte support is
>> added in the next few patches). This leads to noticeable performance
>> improvement during split.
>>
> Acked-by: David Hildenbrand <[email protected]>

Thanks!


2023-12-19 08:30:34

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 18/12/2023 17:47, David Hildenbrand wrote:
> On 18.12.23 11:50, Ryan Roberts wrote:
>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>> batch is determined by the architecture with the new helper,
>> pte_batch_remaining(), and maps a physically contiguous block of memory,
>> all belonging to the same folio. A pte batch is then write-protected in
>> one go in the parent using the new helper, ptep_set_wrprotects() and is
>> set in one go in the child using the new helper, set_ptes_full().
>>
>> The primary motivation for this change is to reduce the number of tlb
>> maintenance operations that the arm64 backend has to perform during
>> fork, as it is about to add transparent support for the "contiguous bit"
>> in its ptes. By write-protecting the parent using the new
>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>> when all ptes in the range are being write-protected. Similarly, by
>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>> child, the backend does not need to fold a contiguous range once they
>> are all populated - they can be initially populated as a contiguous
>> range in the first place.
>>
>> This code is very performance sensitive, and a significant amount of
>> effort has been put into not regressing performance for the order-0
>> folio case. By default, pte_batch_remaining() is compile constant 1,
>> which enables the compiler to simplify the extra loops that are added
>> for batching and produce code that is equivalent (and equally
>> performant) as the previous implementation.
>>
>> This change addresses the core-mm refactoring only and a separate change
>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>> set_ptes_full() in the arm64 backend to realize the performance
>> improvement as part of the work to enable contpte mappings.
>>
>> To ensure the arm64 is performant once implemented, this change is very
>> careful to only call ptep_get() once per pte batch.
>>
>> The following microbenchmark results demonstate that there is no
>> significant performance change after this patch. Fork is called in a
>> tight loop in a process with 1G of populated memory and the time for the
>> function to execute is measured. 100 iterations per run, 8 runs
>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>> performed for case where 1G memory is comprised of order-0 folios and
>> case where comprised of pte-mapped order-9 folios. Negative is faster,
>> positive is slower, compared to baseline upon which the series is based:
>>
>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>> | fork          |-------------------|-------------------|
>> | microbench    |    mean |   stdev |    mean |   stdev |
>> |---------------|---------|---------|---------|---------|
>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>
>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>> | fork          |-------------------|-------------------|
>> | microbench    |    mean |   stdev |    mean |   stdev |
>> |---------------|---------|---------|---------|---------|
>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>
>> Tested-by: John Hubbard <[email protected]>
>> Reviewed-by: Alistair Popple <[email protected]>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>>   include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
>>   mm/memory.c             | 92 ++++++++++++++++++++++++++---------------
>>   2 files changed, 139 insertions(+), 33 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index af7639c3b0a3..db93fb81465a 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>   #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>   #endif
>>   +#ifndef pte_batch_remaining
>> +/**
>> + * pte_batch_remaining - Number of pages from addr to next batch boundary.
>> + * @pte: Page table entry for the first page.
>> + * @addr: Address of the first page.
>> + * @end: Batch ceiling (e.g. end of vma).
>> + *
>> + * Some architectures (arm64) can efficiently modify a contiguous batch of ptes.
>> + * In such cases, this function returns the remaining number of pages to the end
>> + * of the current batch, as defined by addr. This can be useful when iterating
>> + * over ptes.
>> + *
>> + * May be overridden by the architecture, else batch size is always 1.
>> + */
>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long addr,
>> +                        unsigned long end)
>> +{
>> +    return 1;
>> +}
>> +#endif
>
> It's a shame we now lose the optimization for all other archtiectures.
>
> Was there no way to have some basic batching mechanism that doesn't require arch
> specifics?

I tried a bunch of things but ultimately the way I've done it was the only way
to reduce the order-0 fork regression to 0.

My original v3 posting was costing 5% extra and even my first attempt at an
arch-specific version that didn't resolve to a compile-time constant 1 still
cost an extra 3%.


>
> I'd have thought that something very basic would have worked like:
>
> * Check if PTE is the same when setting the PFN to 0.
> * Check that PFN is consecutive
> * Check that all PFNs belong to the same folio

I haven't tried this exact approach, but I'd be surprised if I can get the
regression under 4% with this. Further along the series I spent a lot of time
having to fiddle with the arm64 implementation; every conditional and every
memory read (even when in cache) was a problem. There is just so little in the
inner loop that every instruction matters. (At least on Ampere Altra and Apple M2).

Of course if you're willing to pay that 4-5% for order-0 then the benefit to
order-9 is around 10% in my measurements. Personally though, I'd prefer to play
safe and ensure the common order-0 case doesn't regress, as you previously
suggested.


2023-12-19 11:29:42

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 19.12.23 09:30, Ryan Roberts wrote:
> On 18/12/2023 17:47, David Hildenbrand wrote:
>> On 18.12.23 11:50, Ryan Roberts wrote:
>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>> batch is determined by the architecture with the new helper,
>>> pte_batch_remaining(), and maps a physically contiguous block of memory,
>>> all belonging to the same folio. A pte batch is then write-protected in
>>> one go in the parent using the new helper, ptep_set_wrprotects() and is
>>> set in one go in the child using the new helper, set_ptes_full().
>>>
>>> The primary motivation for this change is to reduce the number of tlb
>>> maintenance operations that the arm64 backend has to perform during
>>> fork, as it is about to add transparent support for the "contiguous bit"
>>> in its ptes. By write-protecting the parent using the new
>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>> when all ptes in the range are being write-protected. Similarly, by
>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>> child, the backend does not need to fold a contiguous range once they
>>> are all populated - they can be initially populated as a contiguous
>>> range in the first place.
>>>
>>> This code is very performance sensitive, and a significant amount of
>>> effort has been put into not regressing performance for the order-0
>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>> which enables the compiler to simplify the extra loops that are added
>>> for batching and produce code that is equivalent (and equally
>>> performant) as the previous implementation.
>>>
>>> This change addresses the core-mm refactoring only and a separate change
>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>> set_ptes_full() in the arm64 backend to realize the performance
>>> improvement as part of the work to enable contpte mappings.
>>>
>>> To ensure the arm64 is performant once implemented, this change is very
>>> careful to only call ptep_get() once per pte batch.
>>>
>>> The following microbenchmark results demonstate that there is no
>>> significant performance change after this patch. Fork is called in a
>>> tight loop in a process with 1G of populated memory and the time for the
>>> function to execute is measured. 100 iterations per run, 8 runs
>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>> performed for case where 1G memory is comprised of order-0 folios and
>>> case where comprised of pte-mapped order-9 folios. Negative is faster,
>>> positive is slower, compared to baseline upon which the series is based:
>>>
>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>> | fork          |-------------------|-------------------|
>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>> |---------------|---------|---------|---------|---------|
>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>
>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>> | fork          |-------------------|-------------------|
>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>> |---------------|---------|---------|---------|---------|
>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>
>>> Tested-by: John Hubbard <[email protected]>
>>> Reviewed-by: Alistair Popple <[email protected]>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>>   include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
>>>   mm/memory.c             | 92 ++++++++++++++++++++++++++---------------
>>>   2 files changed, 139 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index af7639c3b0a3..db93fb81465a 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>   #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>   #endif
>>>   +#ifndef pte_batch_remaining
>>> +/**
>>> + * pte_batch_remaining - Number of pages from addr to next batch boundary.
>>> + * @pte: Page table entry for the first page.
>>> + * @addr: Address of the first page.
>>> + * @end: Batch ceiling (e.g. end of vma).
>>> + *
>>> + * Some architectures (arm64) can efficiently modify a contiguous batch of ptes.
>>> + * In such cases, this function returns the remaining number of pages to the end
>>> + * of the current batch, as defined by addr. This can be useful when iterating
>>> + * over ptes.
>>> + *
>>> + * May be overridden by the architecture, else batch size is always 1.
>>> + */
>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long addr,
>>> +                        unsigned long end)
>>> +{
>>> +    return 1;
>>> +}
>>> +#endif
>>
>> It's a shame we now lose the optimization for all other archtiectures.
>>
>> Was there no way to have some basic batching mechanism that doesn't require arch
>> specifics?
>
> I tried a bunch of things but ultimately the way I've done it was the only way
> to reduce the order-0 fork regression to 0.

Let me give it a churn today. I think we should really focus on having
only a single folio_test_large() check on the fast path for order-0. And
not even try doing batching for anything that works on bare PFNs.

Off to prototyping ... :)

--
Cheers,

David / dhildenb


2023-12-19 17:30:57

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 19.12.23 09:30, Ryan Roberts wrote:
> On 18/12/2023 17:47, David Hildenbrand wrote:
>> On 18.12.23 11:50, Ryan Roberts wrote:
>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>> batch is determined by the architecture with the new helper,
>>> pte_batch_remaining(), and maps a physically contiguous block of memory,
>>> all belonging to the same folio. A pte batch is then write-protected in
>>> one go in the parent using the new helper, ptep_set_wrprotects() and is
>>> set in one go in the child using the new helper, set_ptes_full().
>>>
>>> The primary motivation for this change is to reduce the number of tlb
>>> maintenance operations that the arm64 backend has to perform during
>>> fork, as it is about to add transparent support for the "contiguous bit"
>>> in its ptes. By write-protecting the parent using the new
>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>> when all ptes in the range are being write-protected. Similarly, by
>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>> child, the backend does not need to fold a contiguous range once they
>>> are all populated - they can be initially populated as a contiguous
>>> range in the first place.
>>>
>>> This code is very performance sensitive, and a significant amount of
>>> effort has been put into not regressing performance for the order-0
>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>> which enables the compiler to simplify the extra loops that are added
>>> for batching and produce code that is equivalent (and equally
>>> performant) as the previous implementation.
>>>
>>> This change addresses the core-mm refactoring only and a separate change
>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>> set_ptes_full() in the arm64 backend to realize the performance
>>> improvement as part of the work to enable contpte mappings.
>>>
>>> To ensure the arm64 is performant once implemented, this change is very
>>> careful to only call ptep_get() once per pte batch.
>>>
>>> The following microbenchmark results demonstate that there is no
>>> significant performance change after this patch. Fork is called in a
>>> tight loop in a process with 1G of populated memory and the time for the
>>> function to execute is measured. 100 iterations per run, 8 runs
>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>> performed for case where 1G memory is comprised of order-0 folios and
>>> case where comprised of pte-mapped order-9 folios. Negative is faster,
>>> positive is slower, compared to baseline upon which the series is based:
>>>
>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>> | fork          |-------------------|-------------------|
>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>> |---------------|---------|---------|---------|---------|
>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>
>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>> | fork          |-------------------|-------------------|
>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>> |---------------|---------|---------|---------|---------|
>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>
>>> Tested-by: John Hubbard <[email protected]>
>>> Reviewed-by: Alistair Popple <[email protected]>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>>   include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
>>>   mm/memory.c             | 92 ++++++++++++++++++++++++++---------------
>>>   2 files changed, 139 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index af7639c3b0a3..db93fb81465a 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>   #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>   #endif
>>>   +#ifndef pte_batch_remaining
>>> +/**
>>> + * pte_batch_remaining - Number of pages from addr to next batch boundary.
>>> + * @pte: Page table entry for the first page.
>>> + * @addr: Address of the first page.
>>> + * @end: Batch ceiling (e.g. end of vma).
>>> + *
>>> + * Some architectures (arm64) can efficiently modify a contiguous batch of ptes.
>>> + * In such cases, this function returns the remaining number of pages to the end
>>> + * of the current batch, as defined by addr. This can be useful when iterating
>>> + * over ptes.
>>> + *
>>> + * May be overridden by the architecture, else batch size is always 1.
>>> + */
>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long addr,
>>> +                        unsigned long end)
>>> +{
>>> +    return 1;
>>> +}
>>> +#endif
>>
>> It's a shame we now lose the optimization for all other archtiectures.
>>
>> Was there no way to have some basic batching mechanism that doesn't require arch
>> specifics?
>
> I tried a bunch of things but ultimately the way I've done it was the only way
> to reduce the order-0 fork regression to 0.
>
> My original v3 posting was costing 5% extra and even my first attempt at an
> arch-specific version that didn't resolve to a compile-time constant 1 still
> cost an extra 3%.
>
>
>>
>> I'd have thought that something very basic would have worked like:
>>
>> * Check if PTE is the same when setting the PFN to 0.
>> * Check that PFN is consecutive
>> * Check that all PFNs belong to the same folio
>
> I haven't tried this exact approach, but I'd be surprised if I can get the
> regression under 4% with this. Further along the series I spent a lot of time
> having to fiddle with the arm64 implementation; every conditional and every
> memory read (even when in cache) was a problem. There is just so little in the
> inner loop that every instruction matters. (At least on Ampere Altra and Apple M2).
>
> Of course if you're willing to pay that 4-5% for order-0 then the benefit to
> order-9 is around 10% in my measurements. Personally though, I'd prefer to play
> safe and ensure the common order-0 case doesn't regress, as you previously
> suggested.
>

I just hacked something up, on top of my beloved rmap cleanup/batching
series. I implemented very generic and simple batching for large folios
(all PTE bits except the PFN have to match).

Some very quick testing (don't trust every last %) on Intel(R) Xeon(R)
Silver 4210R CPU.

order-0: 0.014210 -> 0.013969

-> Around 1.7 % faster

order-9: 0.014373 -> 0.009149

-> Around 36.3 % faster


But it's likely buggy, so don't trust the numbers just yet. If they
actually hold up, we should probably do something like that ahead of
time, before all the arm-specific cont-pte work.

I suspect you can easily extend that by arch hooks where reasonable.

The (3) patches on top of the rmap cleanups can be found at:

https://github.com/davidhildenbrand/linux/tree/fork-batching

--
Cheers,

David / dhildenb


2023-12-19 17:43:16

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 19/12/2023 17:22, David Hildenbrand wrote:
> On 19.12.23 09:30, Ryan Roberts wrote:
>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>> batch is determined by the architecture with the new helper,
>>>> pte_batch_remaining(), and maps a physically contiguous block of memory,
>>>> all belonging to the same folio. A pte batch is then write-protected in
>>>> one go in the parent using the new helper, ptep_set_wrprotects() and is
>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>
>>>> The primary motivation for this change is to reduce the number of tlb
>>>> maintenance operations that the arm64 backend has to perform during
>>>> fork, as it is about to add transparent support for the "contiguous bit"
>>>> in its ptes. By write-protecting the parent using the new
>>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>>> when all ptes in the range are being write-protected. Similarly, by
>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>>> child, the backend does not need to fold a contiguous range once they
>>>> are all populated - they can be initially populated as a contiguous
>>>> range in the first place.
>>>>
>>>> This code is very performance sensitive, and a significant amount of
>>>> effort has been put into not regressing performance for the order-0
>>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>>> which enables the compiler to simplify the extra loops that are added
>>>> for batching and produce code that is equivalent (and equally
>>>> performant) as the previous implementation.
>>>>
>>>> This change addresses the core-mm refactoring only and a separate change
>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>> improvement as part of the work to enable contpte mappings.
>>>>
>>>> To ensure the arm64 is performant once implemented, this change is very
>>>> careful to only call ptep_get() once per pte batch.
>>>>
>>>> The following microbenchmark results demonstate that there is no
>>>> significant performance change after this patch. Fork is called in a
>>>> tight loop in a process with 1G of populated memory and the time for the
>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>>> performed for case where 1G memory is comprised of order-0 folios and
>>>> case where comprised of pte-mapped order-9 folios. Negative is faster,
>>>> positive is slower, compared to baseline upon which the series is based:
>>>>
>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>> | fork          |-------------------|-------------------|
>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>> |---------------|---------|---------|---------|---------|
>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>
>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>> | fork          |-------------------|-------------------|
>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>> |---------------|---------|---------|---------|---------|
>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>
>>>> Tested-by: John Hubbard <[email protected]>
>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>> ---
>>>>    include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
>>>>    mm/memory.c             | 92 ++++++++++++++++++++++++++---------------
>>>>    2 files changed, 139 insertions(+), 33 deletions(-)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index af7639c3b0a3..db93fb81465a 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>    #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>    #endif
>>>>    +#ifndef pte_batch_remaining
>>>> +/**
>>>> + * pte_batch_remaining - Number of pages from addr to next batch boundary.
>>>> + * @pte: Page table entry for the first page.
>>>> + * @addr: Address of the first page.
>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>> + *
>>>> + * Some architectures (arm64) can efficiently modify a contiguous batch of
>>>> ptes.
>>>> + * In such cases, this function returns the remaining number of pages to
>>>> the end
>>>> + * of the current batch, as defined by addr. This can be useful when iterating
>>>> + * over ptes.
>>>> + *
>>>> + * May be overridden by the architecture, else batch size is always 1.
>>>> + */
>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long addr,
>>>> +                        unsigned long end)
>>>> +{
>>>> +    return 1;
>>>> +}
>>>> +#endif
>>>
>>> It's a shame we now lose the optimization for all other archtiectures.
>>>
>>> Was there no way to have some basic batching mechanism that doesn't require arch
>>> specifics?
>>
>> I tried a bunch of things but ultimately the way I've done it was the only way
>> to reduce the order-0 fork regression to 0.
>>
>> My original v3 posting was costing 5% extra and even my first attempt at an
>> arch-specific version that didn't resolve to a compile-time constant 1 still
>> cost an extra 3%.
>>
>>
>>>
>>> I'd have thought that something very basic would have worked like:
>>>
>>> * Check if PTE is the same when setting the PFN to 0.
>>> * Check that PFN is consecutive
>>> * Check that all PFNs belong to the same folio
>>
>> I haven't tried this exact approach, but I'd be surprised if I can get the
>> regression under 4% with this. Further along the series I spent a lot of time
>> having to fiddle with the arm64 implementation; every conditional and every
>> memory read (even when in cache) was a problem. There is just so little in the
>> inner loop that every instruction matters. (At least on Ampere Altra and Apple
>> M2).
>>
>> Of course if you're willing to pay that 4-5% for order-0 then the benefit to
>> order-9 is around 10% in my measurements. Personally though, I'd prefer to play
>> safe and ensure the common order-0 case doesn't regress, as you previously
>> suggested.
>>
>
> I just hacked something up, on top of my beloved rmap cleanup/batching series. I
> implemented very generic and simple batching for large folios (all PTE bits
> except the PFN have to match).
>
> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R) Silver
> 4210R CPU.
>
> order-0: 0.014210 -> 0.013969
>
> -> Around 1.7 % faster
>
> order-9: 0.014373 -> 0.009149
>
> -> Around 36.3 % faster

Well I guess that shows me :)

I'll do a review and run the tests on my HW to see if it concurs.

>
>
> But it's likely buggy, so don't trust the numbers just yet. If they actually
> hold up, we should probably do something like that ahead of time, before all the
> arm-specific cont-pte work.
>
> I suspect you can easily extend that by arch hooks where reasonable.
>
> The (3) patches on top of the rmap cleanups can be found at:
>
>     https://github.com/davidhildenbrand/linux/tree/fork-batching
>


2023-12-20 05:28:22

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v4 03/16] mm: Batch-clear PTE ranges during zap_pte_range()


Ryan Roberts <[email protected]> writes:

[...]

> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 4f559f4ddd21..39725756e6bf 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -47,6 +47,21 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
> return true;
> }
>
> +unsigned int tlb_reserve_space(struct mmu_gather *tlb, unsigned int nr)
> +{
> + struct mmu_gather_batch *batch = tlb->active;
> + unsigned int nr_alloc = batch->max - batch->nr;
> +
> + while (nr_alloc < nr) {
> + if (!tlb_next_batch(tlb))
> + break;
> + nr_alloc += tlb->active->max;
> + }
> +
> + tlb->active = batch;
> + return nr_alloc;
> +}

Agree this addresses my previous comment nicely, so you can add:

Reviewed-by: Alistair Popple <[email protected]>

> +
> #ifdef CONFIG_SMP
> static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_struct *vma)
> {


2023-12-20 05:32:55

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v4 16/16] arm64/mm: Implement clear_ptes() to optimize exit, munmap, dontneed


Ryan Roberts <[email protected]> writes:

> With the core-mm changes in place to batch-clear ptes during
> zap_pte_range(), we can take advantage of this in arm64 to greatly
> reduce the number of tlbis we have to issue, and recover the lost
> performance in exit, munmap and madvise(DONTNEED) incured when adding
> support for transparent contiguous ptes.
>
> If we are clearing a whole contpte range, we can elide first unfolding
> that range and save the tlbis. We just clear the whole range.
>
> The following microbenchmark results demonstate the effect of this
> change on madvise(DONTNEED) performance for large pte-mapped folios.
> madvise(dontneed) is called for each page of a 1G populated mapping and
> the total time is measured. 100 iterations per run, 8 runs performed on
> both Apple M2 (VM) and Ampere Altra (bare metal). Tests performed for
> case where 1G memory is comprised of pte-mapped order-9 folios. Negative
> is faster, positive is slower, compared to baseline upon which the
> series is based:
>
> | dontneed | Apple M2 VM | Ampere Altra |
> | order-9 |-------------------|-------------------|
> | (pte-map) | mean | stdev | mean | stdev |
> |---------------|---------|---------|---------|---------|
> | baseline | 0.0% | 7.9% | 0.0% | 0.0% |
> | before-change | -1.3% | 7.0% | 13.0% | 0.0% |
> | after-change | -9.9% | 0.9% | 14.1% | 0.0% |
>
> The memory is initially all contpte-mapped and has to be unfolded (which
> requires tlbi for the whole block) when the first page is touched (since
> the test is madvise-ing 1 page at a time). Ampere Altra has high cost
> for tlbi; this is why cost increases there.
>
> The following microbenchmark results demonstate the recovery (and
> overall improvement) of munmap performance for large pte-mapped folios.
> munmap is called for a 1G populated mapping and the function runtime is
> measured. 100 iterations per run, 8 runs performed on both Apple M2 (VM)
> and Ampere Altra (bare metal). Tests performed for case where 1G memory
> is comprised of pte-mapped order-9 folios. Negative is faster, positive
> is slower, compared to baseline upon which the series is based:
>
> | munmap | Apple M2 VM | Ampere Altra |
> | order-9 |-------------------|-------------------|
> | (pte-map) | mean | stdev | mean | stdev |
> |---------------|---------|---------|---------|---------|
> | baseline | 0.0% | 6.4% | 0.0% | 0.1% |
> | before-change | 43.3% | 1.9% | 375.2% | 0.0% |
> | after-change | -6.0% | 1.4% | -0.6% | 0.2% |
>
> Tested-by: John Hubbard <[email protected]>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> arch/arm64/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++
> arch/arm64/mm/contpte.c | 45 ++++++++++++++++++++++++++++++++
> 2 files changed, 87 insertions(+)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index d4805f73b9db..f5bf059291c3 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -953,6 +953,29 @@ static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
> return pte;
> }
>
> +static inline pte_t __clear_ptes(struct mm_struct *mm,
> + unsigned long address, pte_t *ptep,
> + unsigned int nr, int full)

Ping on my previous comment - why not just use the generic version
defined in patch 3 which is basically identical to this?

> +{
> + pte_t orig_pte = __ptep_get_and_clear(mm, address, ptep);
> + unsigned int i;
> + pte_t pte;
> +
> + for (i = 1; i < nr; i++) {
> + address += PAGE_SIZE;
> + ptep++;
> + pte = __ptep_get_and_clear(mm, address, ptep);
> +
> + if (pte_dirty(pte))
> + orig_pte = pte_mkdirty(orig_pte);
> +
> + if (pte_young(pte))
> + orig_pte = pte_mkyoung(orig_pte);
> + }
> +
> + return orig_pte;
> +}
> +
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
> static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
> @@ -1151,6 +1174,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, pte_t pte, unsigned int nr);
> +extern pte_t contpte_clear_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr, int full);
> extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep);
> extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> @@ -1279,6 +1304,22 @@ static inline void pte_clear(struct mm_struct *mm,
> __pte_clear(mm, addr, ptep);
> }
>
> +#define clear_ptes clear_ptes
> +static inline pte_t clear_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr, int full)
> +{
> + pte_t pte;
> +
> + if (nr == 1) {
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> + pte = __ptep_get_and_clear(mm, addr, ptep);
> + } else
> + pte = contpte_clear_ptes(mm, addr, ptep, nr, full);
> +
> + return pte;
> +}
> +
> #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep)
> @@ -1366,6 +1407,7 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
> #define set_pte __set_pte
> #define set_ptes __set_ptes
> #define pte_clear __pte_clear
> +#define clear_ptes __clear_ptes
> #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
> #define ptep_get_and_clear __ptep_get_and_clear
> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index 72e672024785..6f2a15ac5163 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -293,6 +293,51 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> }
> EXPORT_SYMBOL(contpte_set_ptes);
>
> +pte_t contpte_clear_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
> + unsigned int nr, int full)
> +{
> + /*
> + * If we cover a partial contpte block at the beginning or end of the
> + * batch, unfold if currently folded. This makes it safe to clear some
> + * of the entries while keeping others. contpte blocks in the middle of
> + * the range, which are fully covered don't need to be unfolded because
> + * we will clear the full block.
> + */
> +
> + unsigned int i;
> + pte_t pte;
> + pte_t tail;
> +
> + if (!mm_is_user(mm))
> + return __clear_ptes(mm, addr, ptep, nr, full);
> +
> + if (ptep != contpte_align_down(ptep) || nr < CONT_PTES)
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +
> + if (ptep + nr != contpte_align_down(ptep + nr))
> + contpte_try_unfold(mm, addr + PAGE_SIZE * (nr - 1),
> + ptep + nr - 1,
> + __ptep_get(ptep + nr - 1));
> +
> + pte = __ptep_get_and_clear(mm, addr, ptep);
> +
> + for (i = 1; i < nr; i++) {
> + addr += PAGE_SIZE;
> + ptep++;
> +
> + tail = __ptep_get_and_clear(mm, addr, ptep);
> +
> + if (pte_dirty(tail))
> + pte = pte_mkdirty(pte);
> +
> + if (pte_young(tail))
> + pte = pte_mkyoung(pte);
> + }
> +
> + return pte;
> +}
> +EXPORT_SYMBOL(contpte_clear_ptes);
> +
> int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep)
> {


2023-12-20 08:45:14

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 16/16] arm64/mm: Implement clear_ptes() to optimize exit, munmap, dontneed

On 20/12/2023 05:28, Alistair Popple wrote:
>
> Ryan Roberts <[email protected]> writes:
>
>> With the core-mm changes in place to batch-clear ptes during
>> zap_pte_range(), we can take advantage of this in arm64 to greatly
>> reduce the number of tlbis we have to issue, and recover the lost
>> performance in exit, munmap and madvise(DONTNEED) incured when adding
>> support for transparent contiguous ptes.
>>
>> If we are clearing a whole contpte range, we can elide first unfolding
>> that range and save the tlbis. We just clear the whole range.
>>
>> The following microbenchmark results demonstate the effect of this
>> change on madvise(DONTNEED) performance for large pte-mapped folios.
>> madvise(dontneed) is called for each page of a 1G populated mapping and
>> the total time is measured. 100 iterations per run, 8 runs performed on
>> both Apple M2 (VM) and Ampere Altra (bare metal). Tests performed for
>> case where 1G memory is comprised of pte-mapped order-9 folios. Negative
>> is faster, positive is slower, compared to baseline upon which the
>> series is based:
>>
>> | dontneed | Apple M2 VM | Ampere Altra |
>> | order-9 |-------------------|-------------------|
>> | (pte-map) | mean | stdev | mean | stdev |
>> |---------------|---------|---------|---------|---------|
>> | baseline | 0.0% | 7.9% | 0.0% | 0.0% |
>> | before-change | -1.3% | 7.0% | 13.0% | 0.0% |
>> | after-change | -9.9% | 0.9% | 14.1% | 0.0% |
>>
>> The memory is initially all contpte-mapped and has to be unfolded (which
>> requires tlbi for the whole block) when the first page is touched (since
>> the test is madvise-ing 1 page at a time). Ampere Altra has high cost
>> for tlbi; this is why cost increases there.
>>
>> The following microbenchmark results demonstate the recovery (and
>> overall improvement) of munmap performance for large pte-mapped folios.
>> munmap is called for a 1G populated mapping and the function runtime is
>> measured. 100 iterations per run, 8 runs performed on both Apple M2 (VM)
>> and Ampere Altra (bare metal). Tests performed for case where 1G memory
>> is comprised of pte-mapped order-9 folios. Negative is faster, positive
>> is slower, compared to baseline upon which the series is based:
>>
>> | munmap | Apple M2 VM | Ampere Altra |
>> | order-9 |-------------------|-------------------|
>> | (pte-map) | mean | stdev | mean | stdev |
>> |---------------|---------|---------|---------|---------|
>> | baseline | 0.0% | 6.4% | 0.0% | 0.1% |
>> | before-change | 43.3% | 1.9% | 375.2% | 0.0% |
>> | after-change | -6.0% | 1.4% | -0.6% | 0.2% |
>>
>> Tested-by: John Hubbard <[email protected]>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> arch/arm64/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++
>> arch/arm64/mm/contpte.c | 45 ++++++++++++++++++++++++++++++++
>> 2 files changed, 87 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index d4805f73b9db..f5bf059291c3 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -953,6 +953,29 @@ static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
>> return pte;
>> }
>>
>> +static inline pte_t __clear_ptes(struct mm_struct *mm,
>> + unsigned long address, pte_t *ptep,
>> + unsigned int nr, int full)
>
> Ping on my previous comment - why not just use the generic version
> defined in patch 3 which is basically identical to this?

Perhaps I misunderstood your original comment - I thought this was what you were
suggesting - i.e. move this code out of the arm64 clear_ptes() impl into its own
__clear_ptes() helper, and always define an arm64 clear_ptes(), even when
ARM64_CONTPTE is not enabled.

I can use (and was in v3) the generic version when ARM64_CONTPTE is disabled.
But I can't use it when its enabled, because then arm64 needs its own
implementation to manage the contpte bit. And once it defines its own version
by defining the macro clear_ptes(), the generic version is no longer defined,
so I can't call it as part of this implementation. Even if I could, that
would be recursive.

Or perhaps I'm still not understanding your suggestion?
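
To make the override mechanics concrete, here is a standalone (non-kernel)
mock-up of the pattern I mean, with toy names and bodies:

#include <stdio.h>

/* "arch header": defines its own helper plus the marker macro */
#define clear_thing clear_thing
static inline int clear_thing(int nr)
{
	return 2 * nr;			/* arch-specific behaviour */
}

/* "generic header": the fallback only exists if no arch override */
#ifndef clear_thing
static inline int clear_thing(int nr)
{
	return nr;			/* generic behaviour - never built here */
}
#endif

int main(void)
{
	/*
	 * Resolves to the arch override; the generic body was never
	 * compiled, so the override has nothing generic left to call
	 * (and naming clear_thing() inside it would just recurse).
	 */
	printf("%d\n", clear_thing(8));
	return 0;
}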

>
>> +{
>> + pte_t orig_pte = __ptep_get_and_clear(mm, address, ptep);
>> + unsigned int i;
>> + pte_t pte;
>> +
>> + for (i = 1; i < nr; i++) {
>> + address += PAGE_SIZE;
>> + ptep++;
>> + pte = __ptep_get_and_clear(mm, address, ptep);
>> +
>> + if (pte_dirty(pte))
>> + orig_pte = pte_mkdirty(orig_pte);
>> +
>> + if (pte_young(pte))
>> + orig_pte = pte_mkyoung(orig_pte);
>> + }
>> +
>> + return orig_pte;
>> +}
>> +
>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
>> static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
>> @@ -1151,6 +1174,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>> extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>> extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> pte_t *ptep, pte_t pte, unsigned int nr);
>> +extern pte_t contpte_clear_ptes(struct mm_struct *mm, unsigned long addr,
>> + pte_t *ptep, unsigned int nr, int full);
>> extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> unsigned long addr, pte_t *ptep);
>> extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> @@ -1279,6 +1304,22 @@ static inline void pte_clear(struct mm_struct *mm,
>> __pte_clear(mm, addr, ptep);
>> }
>>
>> +#define clear_ptes clear_ptes
>> +static inline pte_t clear_ptes(struct mm_struct *mm,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr, int full)
>> +{
>> + pte_t pte;
>> +
>> + if (nr == 1) {
>> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> + pte = __ptep_get_and_clear(mm, addr, ptep);
>> + } else
>> + pte = contpte_clear_ptes(mm, addr, ptep, nr, full);
>> +
>> + return pte;
>> +}
>> +
>> #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>> unsigned long addr, pte_t *ptep)
>> @@ -1366,6 +1407,7 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>> #define set_pte __set_pte
>> #define set_ptes __set_ptes
>> #define pte_clear __pte_clear
>> +#define clear_ptes __clear_ptes
>> #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>> #define ptep_get_and_clear __ptep_get_and_clear
>> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index 72e672024785..6f2a15ac5163 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -293,6 +293,51 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> }
>> EXPORT_SYMBOL(contpte_set_ptes);
>>
>> +pte_t contpte_clear_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
>> + unsigned int nr, int full)
>> +{
>> + /*
>> + * If we cover a partial contpte block at the beginning or end of the
>> + * batch, unfold if currently folded. This makes it safe to clear some
>> + * of the entries while keeping others. contpte blocks in the middle of
>> + * the range, which are fully covered don't need to be unfolded because
>> + * we will clear the full block.
>> + */
>> +
>> + unsigned int i;
>> + pte_t pte;
>> + pte_t tail;
>> +
>> + if (!mm_is_user(mm))
>> + return __clear_ptes(mm, addr, ptep, nr, full);
>> +
>> + if (ptep != contpte_align_down(ptep) || nr < CONT_PTES)
>> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +
>> + if (ptep + nr != contpte_align_down(ptep + nr))
>> + contpte_try_unfold(mm, addr + PAGE_SIZE * (nr - 1),
>> + ptep + nr - 1,
>> + __ptep_get(ptep + nr - 1));
>> +
>> + pte = __ptep_get_and_clear(mm, addr, ptep);
>> +
>> + for (i = 1; i < nr; i++) {
>> + addr += PAGE_SIZE;
>> + ptep++;
>> +
>> + tail = __ptep_get_and_clear(mm, addr, ptep);
>> +
>> + if (pte_dirty(tail))
>> + pte = pte_mkdirty(pte);
>> +
>> + if (pte_young(tail))
>> + pte = pte_mkyoung(pte);
>> + }
>> +
>> + return pte;
>> +}
>> +EXPORT_SYMBOL(contpte_clear_ptes);
>> +
>> int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> unsigned long addr, pte_t *ptep)
>> {
>


2023-12-20 09:18:31

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 19.12.23 18:42, Ryan Roberts wrote:
> On 19/12/2023 17:22, David Hildenbrand wrote:
>> On 19.12.23 09:30, Ryan Roberts wrote:
>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>> batch is determined by the architecture with the new helper,
>>>>> pte_batch_remaining(), and maps a physically contiguous block of memory,
>>>>> all belonging to the same folio. A pte batch is then write-protected in
>>>>> one go in the parent using the new helper, ptep_set_wrprotects() and is
>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>
>>>>> The primary motivation for this change is to reduce the number of tlb
>>>>> maintenance operations that the arm64 backend has to perform during
>>>>> fork, as it is about to add transparent support for the "contiguous bit"
>>>>> in its ptes. By write-protecting the parent using the new
>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>>>> child, the backend does not need to fold a contiguous range once they
>>>>> are all populated - they can be initially populated as a contiguous
>>>>> range in the first place.
>>>>>
>>>>> This code is very performance sensitive, and a significant amount of
>>>>> effort has been put into not regressing performance for the order-0
>>>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>>>> which enables the compiler to simplify the extra loops that are added
>>>>> for batching and produce code that is equivalent (and equally
>>>>> performant) as the previous implementation.
>>>>>
>>>>> This change addresses the core-mm refactoring only and a separate change
>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>> improvement as part of the work to enable contpte mappings.
>>>>>
>>>>> To ensure the arm64 is performant once implemented, this change is very
>>>>> careful to only call ptep_get() once per pte batch.
>>>>>
>>>>> The following microbenchmark results demonstrate that there is no
>>>>> significant performance change after this patch. Fork is called in a
>>>>> tight loop in a process with 1G of populated memory and the time for the
>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>>>> performed for case where 1G memory is comprised of order-0 folios and
>>>>> case where comprised of pte-mapped order-9 folios. Negative is faster,
>>>>> positive is slower, compared to baseline upon which the series is based:
>>>>>
>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>> | fork          |-------------------|-------------------|
>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>> |---------------|---------|---------|---------|---------|
>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>
>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>> | fork          |-------------------|-------------------|
>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>> |---------------|---------|---------|---------|---------|
>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>
>>>>> Tested-by: John Hubbard <[email protected]>
>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>> ---
>>>>>    include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
>>>>>    mm/memory.c             | 92 ++++++++++++++++++++++++++---------------
>>>>>    2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>> --- a/include/linux/pgtable.h
>>>>> +++ b/include/linux/pgtable.h
>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>    #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>    #endif
>>>>>    +#ifndef pte_batch_remaining
>>>>> +/**
>>>>> + * pte_batch_remaining - Number of pages from addr to next batch boundary.
>>>>> + * @pte: Page table entry for the first page.
>>>>> + * @addr: Address of the first page.
>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>> + *
>>>>> + * Some architectures (arm64) can efficiently modify a contiguous batch of
>>>>> ptes.
>>>>> + * In such cases, this function returns the remaining number of pages to
>>>>> the end
>>>>> + * of the current batch, as defined by addr. This can be useful when iterating
>>>>> + * over ptes.
>>>>> + *
>>>>> + * May be overridden by the architecture, else batch size is always 1.
>>>>> + */
>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long addr,
>>>>> +                        unsigned long end)
>>>>> +{
>>>>> +    return 1;
>>>>> +}
>>>>> +#endif
>>>>
>>>> It's a shame we now lose the optimization for all other architectures.
>>>>
>>>> Was there no way to have some basic batching mechanism that doesn't require arch
>>>> specifics?
>>>
>>> I tried a bunch of things but ultimately the way I've done it was the only way
>>> to reduce the order-0 fork regression to 0.
>>>
>>> My original v3 posting was costing 5% extra and even my first attempt at an
>>> arch-specific version that didn't resolve to a compile-time constant 1 still
>>> cost an extra 3%.
>>>
>>>
>>>>
>>>> I'd have thought that something very basic would have worked like:
>>>>
>>>> * Check if PTE is the same when setting the PFN to 0.
>>>> * Check that PFN is consecutive
>>>> * Check that all PFNs belong to the same folio
>>>
>>> I haven't tried this exact approach, but I'd be surprised if I can get the
>>> regression under 4% with this. Further along the series I spent a lot of time
>>> having to fiddle with the arm64 implementation; every conditional and every
>>> memory read (even when in cache) was a problem. There is just so little in the
>>> inner loop that every instruction matters. (At least on Ampere Altra and Apple
>>> M2).
>>>
>>> Of course if you're willing to pay that 4-5% for order-0 then the benefit to
>>> order-9 is around 10% in my measurements. Personally though, I'd prefer to play
>>> safe and ensure the common order-0 case doesn't regress, as you previously
>>> suggested.
>>>
>>
>> I just hacked something up, on top of my beloved rmap cleanup/batching series. I
>> implemented very generic and simple batching for large folios (all PTE bits
>> except the PFN have to match).
>>
>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R) Silver
>> 4210R CPU.
>>
>> order-0: 0.014210 -> 0.013969
>>
>> -> Around 1.7 % faster
>>
>> order-9: 0.014373 -> 0.009149
>>
>> -> Around 36.3 % faster
>
> Well I guess that shows me :)
>
> I'll do a review and run the tests on my HW to see if it concurs.


I pushed a simple compile fixup (we need pte_next_pfn()).

Note that we should probably handle "ptep_set_wrprotects" rather like set_ptes:

#ifndef wrprotect_ptes
static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
				pte_t *ptep, unsigned int nr)
{
	for (;;) {
		ptep_set_wrprotect(mm, addr, ptep);
		if (--nr == 0)
			break;
		ptep++;
		addr += PAGE_SIZE;
	}
}
#endif
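
An arch that can write-protect the whole range more cheaply would then provide
its own version and claim the name, so the generic loop above compiles away.
Purely to illustrate the pattern (the helper called below is made up):

static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
				pte_t *ptep, unsigned int nr)
{
	/* e.g. one walk over the range plus a single ranged TLB operation */
	arch_batched_wrprotect(mm, addr, ptep, nr);	/* made-up helper */
}
#define wrprotect_ptes wrprotect_ptes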


--
Cheers,

David / dhildenb


2023-12-20 09:51:19

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20/12/2023 09:17, David Hildenbrand wrote:
> On 19.12.23 18:42, Ryan Roberts wrote:
>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>> batch is determined by the architecture with the new helper,
>>>>>> pte_batch_remaining(), and maps a physically contiguous block of memory,
>>>>>> all belonging to the same folio. A pte batch is then write-protected in
>>>>>> one go in the parent using the new helper, ptep_set_wrprotects() and is
>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>
>>>>>> The primary motivation for this change is to reduce the number of tlb
>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>> fork, as it is about to add transparent support for the "contiguous bit"
>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>>>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>>>>> child, the backend does not need to fold a contiguous range once they
>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>> range in the first place.
>>>>>>
>>>>>> This code is very performance sensitive, and a significant amount of
>>>>>> effort has been put into not regressing performance for the order-0
>>>>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>>>>> which enables the compiler to simplify the extra loops that are added
>>>>>> for batching and produce code that is equivalent (and equally
>>>>>> performant) as the previous implementation.
>>>>>>
>>>>>> This change addresses the core-mm refactoring only and a separate change
>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>
>>>>>> To ensure the arm64 is performant once implemented, this change is very
>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>
>>>>>> The following microbenchmark results demonstrate that there is no
>>>>>> significant performance change after this patch. Fork is called in a
>>>>>> tight loop in a process with 1G of populated memory and the time for the
>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>>>>> performed for case where 1G memory is comprised of order-0 folios and
>>>>>> case where comprised of pte-mapped order-9 folios. Negative is faster,
>>>>>> positive is slower, compared to baseline upon which the series is based:
>>>>>>
>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>> | fork          |-------------------|-------------------|
>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>> |---------------|---------|---------|---------|---------|
>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>
>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>> | fork          |-------------------|-------------------|
>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>> |---------------|---------|---------|---------|---------|
>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>
>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>> ---
>>>>>>     include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
>>>>>>     mm/memory.c             | 92 ++++++++++++++++++++++++++---------------
>>>>>>     2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>> --- a/include/linux/pgtable.h
>>>>>> +++ b/include/linux/pgtable.h
>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>     #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>     #endif
>>>>>>     +#ifndef pte_batch_remaining
>>>>>> +/**
>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch boundary.
>>>>>> + * @pte: Page table entry for the first page.
>>>>>> + * @addr: Address of the first page.
>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>> + *
>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous batch of
>>>>>> ptes.
>>>>>> + * In such cases, this function returns the remaining number of pages to
>>>>>> the end
>>>>>> + * of the current batch, as defined by addr. This can be useful when
>>>>>> iterating
>>>>>> + * over ptes.
>>>>>> + *
>>>>>> + * May be overridden by the architecture, else batch size is always 1.
>>>>>> + */
>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long
>>>>>> addr,
>>>>>> +                        unsigned long end)
>>>>>> +{
>>>>>> +    return 1;
>>>>>> +}
>>>>>> +#endif
>>>>>
>>>>> It's a shame we now lose the optimization for all other architectures.
>>>>>
>>>>> Was there no way to have some basic batching mechanism that doesn't require
>>>>> arch
>>>>> specifics?
>>>>
>>>> I tried a bunch of things but ultimately the way I've done it was the only way
>>>> to reduce the order-0 fork regression to 0.
>>>>
>>>> My original v3 posting was costing 5% extra and even my first attempt at an
>>>> arch-specific version that didn't resolve to a compile-time constant 1 still
>>>> cost an extra 3%.
>>>>
>>>>
>>>>>
>>>>> I'd have thought that something very basic would have worked like:
>>>>>
>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>> * Check that PFN is consecutive
>>>>> * Check that all PFNs belong to the same folio
>>>>
>>>> I haven't tried this exact approach, but I'd be surprised if I can get the
>>>> regression under 4% with this. Further along the series I spent a lot of time
>>>> having to fiddle with the arm64 implementation; every conditional and every
>>>> memory read (even when in cache) was a problem. There is just so little in the
>>>> inner loop that every instruction matters. (At least on Ampere Altra and Apple
>>>> M2).
>>>>
>>>> Of course if you're willing to pay that 4-5% for order-0 then the benefit to
>>>> order-9 is around 10% in my measurements. Personally though, I'd prefer to play
>>>> safe and ensure the common order-0 case doesn't regress, as you previously
>>>> suggested.
>>>>
>>>
>>> I just hacked something up, on top of my beloved rmap cleanup/batching series. I
>>> implemented very generic and simple batching for large folios (all PTE bits
>>> except the PFN have to match).
>>>
>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R) Silver
>>> 4210R CPU.
>>>
>>> order-0: 0.014210 -> 0.013969
>>>
>>> -> Around 1.7 % faster
>>>
>>> order-9: 0.014373 -> 0.009149
>>>
>>> -> Around 36.3 % faster
>>
>> Well I guess that shows me :)
>>
>> I'll do a review and run the tests on my HW to see if it concurs.
>
>
> I pushed a simple compile fixup (we need pte_next_pfn()).

I've just been trying to compile and noticed this. Will take a look at your update.

But upon review, I've noticed the part that I think makes this difficult for
arm64 with the contpte optimization: you are calling ptep_get() for every pte in
the batch. While this is functionally correct, once arm64 has the contpte
changes, its ptep_get() has to read every pte in the contpte block in order to
gather the access and dirty bits. So if your batching function ends up walking
a 16-entry contpte block, that will cause 16 x 16 = 256 reads, which kills
performance. That's why I added the arch-specific pte_batch_remaining()
function; it allows the core-mm to skip to the end of the contpte block and
avoid ptep_get() for the 15 tail ptes. So we end up with 16 READ_ONCE()s instead
of 256.
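
To make that concrete, here is roughly what the contpte-aware ptep_get() ends
up doing (a simplified sketch, not the exact code in my series; assume
pte_valid_cont() is the check for the contiguous bit being set):

static inline pte_t contpte_aware_ptep_get(pte_t *ptep)
{
	pte_t pte = __ptep_get(ptep);
	pte_t *first;
	int i;

	if (!pte_valid_cont(pte))
		return pte;

	/* Access/dirty may have been set on any entry of the block. */
	first = contpte_align_down(ptep);
	for (i = 0; i < CONT_PTES; i++) {
		pte_t entry = __ptep_get(first + i);

		if (pte_dirty(entry))
			pte = pte_mkdirty(pte);
		if (pte_young(entry))
			pte = pte_mkyoung(pte);
	}

	return pte;
}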

I considered making a ptep_get_noyoungdirty() variant, which would avoid the bit
gathering. But we have a similar problem in zap_pte_range() and that function
needs the dirty bit to update the folio. So it doesn't work there. (see patch 3
in my series).

I guess you are going to say that we should combine both approaches, so that
your batching loop can skip forward an arch-provided number of ptes? That would
certainly work, but feels like an orthogonal change to what I'm trying to
achieve :). Anyway, I'll spend some time playing with it today.


>
> Note that we should probably handle "ptep_set_wrprotects" rather like set_ptes:
>
> #ifndef wrprotect_ptes
> static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>                pte_t *ptep, unsigned int nr)
> {
>        for (;;) {
>                ptep_set_wrprotect(mm, addr, ptep);
>                if (--nr == 0)
>                        break;
>                ptep++;
>                addr += PAGE_SIZE;
>        }
> }
> #endif
>
>


2023-12-20 09:58:07

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20.12.23 10:51, Ryan Roberts wrote:
> On 20/12/2023 09:17, David Hildenbrand wrote:
>> On 19.12.23 18:42, Ryan Roberts wrote:
>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of memory,
>>>>>>> all belonging to the same folio. A pte batch is then write-protected in
>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects() and is
>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>
>>>>>>> The primary motivation for this change is to reduce the number of tlb
>>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>>> fork, as it is about to add transparent support for the "contiguous bit"
>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>>>>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>>>>>> child, the backend does not need to fold a contiguous range once they
>>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>>> range in the first place.
>>>>>>>
>>>>>>> This code is very performance sensitive, and a significant amount of
>>>>>>> effort has been put into not regressing performance for the order-0
>>>>>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>>>>>> which enables the compiler to simplify the extra loops that are added
>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>> performant) as the previous implementation.
>>>>>>>
>>>>>>> This change addresses the core-mm refactoring only and a separate change
>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>
>>>>>>> To ensure the arm64 is performant once implemented, this change is very
>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>
>>>>>>> The following microbenchmark results demonstrate that there is no
>>>>>>> significant performance change after this patch. Fork is called in a
>>>>>>> tight loop in a process with 1G of populated memory and the time for the
>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>>>>>> performed for case where 1G memory is comprised of order-0 folios and
>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is faster,
>>>>>>> positive is slower, compared to baseline upon which the series is based:
>>>>>>>
>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>> | fork          |-------------------|-------------------|
>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>
>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>> | fork          |-------------------|-------------------|
>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>
>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>> ---
>>>>>>>     include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
>>>>>>>     mm/memory.c             | 92 ++++++++++++++++++++++++++---------------
>>>>>>>     2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>> --- a/include/linux/pgtable.h
>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>     #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>     #endif
>>>>>>>     +#ifndef pte_batch_remaining
>>>>>>> +/**
>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch boundary.
>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>> + * @addr: Address of the first page.
>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>> + *
>>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous batch of
>>>>>>> ptes.
>>>>>>> + * In such cases, this function returns the remaining number of pages to
>>>>>>> the end
>>>>>>> + * of the current batch, as defined by addr. This can be useful when
>>>>>>> iterating
>>>>>>> + * over ptes.
>>>>>>> + *
>>>>>>> + * May be overridden by the architecture, else batch size is always 1.
>>>>>>> + */
>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long
>>>>>>> addr,
>>>>>>> +                        unsigned long end)
>>>>>>> +{
>>>>>>> +    return 1;
>>>>>>> +}
>>>>>>> +#endif
>>>>>>
>>>>>> It's a shame we now lose the optimization for all other architectures.
>>>>>>
>>>>>> Was there no way to have some basic batching mechanism that doesn't require
>>>>>> arch
>>>>>> specifics?
>>>>>
>>>>> I tried a bunch of things but ultimately the way I've done it was the only way
>>>>> to reduce the order-0 fork regression to 0.
>>>>>
>>>>> My original v3 posting was costing 5% extra and even my first attempt at an
>>>>> arch-specific version that didn't resolve to a compile-time constant 1 still
>>>>> cost an extra 3%.
>>>>>
>>>>>
>>>>>>
>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>
>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>> * Check that PFN is consecutive
>>>>>> * Check that all PFNs belong to the same folio
>>>>>
>>>>> I haven't tried this exact approach, but I'd be surprised if I can get the
>>>>> regression under 4% with this. Further along the series I spent a lot of time
>>>>> having to fiddle with the arm64 implementation; every conditional and every
>>>>> memory read (even when in cache) was a problem. There is just so little in the
>>>>> inner loop that every instruction matters. (At least on Ampere Altra and Apple
>>>>> M2).
>>>>>
>>>>> Of course if you're willing to pay that 4-5% for order-0 then the benefit to
>>>>> order-9 is around 10% in my measurements. Personally though, I'd prefer to play
>>>>> safe and ensure the common order-0 case doesn't regress, as you previously
>>>>> suggested.
>>>>>
>>>>
>>>> I just hacked something up, on top of my beloved rmap cleanup/batching series. I
>>>> implemented very generic and simple batching for large folios (all PTE bits
>>>> except the PFN have to match).
>>>>
>>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R) Silver
>>>> 4210R CPU.
>>>>
>>>> order-0: 0.014210 -> 0.013969
>>>>
>>>> -> Around 1.7 % faster
>>>>
>>>> order-9: 0.014373 -> 0.009149
>>>>
>>>> -> Around 36.3 % faster
>>>
>>> Well I guess that shows me :)
>>>
>>> I'll do a review and run the tests on my HW to see if it concurs.
>>
>>
>> I pushed a simple compile fixup (we need pte_next_pfn()).
>
> I've just been trying to compile and noticed this. Will take a look at your update.
>
> But upon review, I've noticed the part that I think makes this difficult for
> arm64 with the contpte optimization; You are calling ptep_get() for every pte in
> the batch. While this is functionally correct, once arm64 has the contpte
> changes, its ptep_get() has to read every pte in the contpte block in order to
> gather the access and dirty bits. So if your batching function ends up walking
> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
> performance. That's why I added the arch-specific pte_batch_remaining()
> function; this allows the core-mm to skip to the end of the contpte block and
> avoid ptep_get() for the 15 tail ptes. So we end up with 16 READ_ONCE()s instead
> of 256.
>
> I considered making a ptep_get_noyoungdirty() variant, which would avoid the bit
> gathering. But we have a similar problem in zap_pte_range() and that function
> needs the dirty bit to update the folio. So it doesn't work there. (see patch 3
> in my series).
>
> I guess you are going to say that we should combine both approaches, so that
> your batching loop can skip forward an arch-provided number of ptes? That would
> certainly work, but feels like an orthogonal change to what I'm trying to
> achieve :). Anyway, I'll spend some time playing with it today.

You can overwrite the function or add special-casing internally, yes.

Right now, your patch is called "mm: Batch-copy PTE ranges during
fork()" and it doesn't do any of that besides preparing for some arm64 work.

--
Cheers,

David / dhildenb


2023-12-20 09:59:38

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20/12/2023 09:51, Ryan Roberts wrote:
> On 20/12/2023 09:17, David Hildenbrand wrote:
>> On 19.12.23 18:42, Ryan Roberts wrote:
>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of memory,
>>>>>>> all belonging to the same folio. A pte batch is then write-protected in
>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects() and is
>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>
>>>>>>> The primary motivation for this change is to reduce the number of tlb
>>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>>> fork, as it is about to add transparent support for the "contiguous bit"
>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>>>>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>>>>>> child, the backend does not need to fold a contiguous range once they
>>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>>> range in the first place.
>>>>>>>
>>>>>>> This code is very performance sensitive, and a significant amount of
>>>>>>> effort has been put into not regressing performance for the order-0
>>>>>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>>>>>> which enables the compiler to simplify the extra loops that are added
>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>> performant) as the previous implementation.
>>>>>>>
>>>>>>> This change addresses the core-mm refactoring only and a separate change
>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>
>>>>>>> To ensure the arm64 is performant once implemented, this change is very
>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>
>>>>>>> The following microbenchmark results demonstrate that there is no
>>>>>>> significant performance change after this patch. Fork is called in a
>>>>>>> tight loop in a process with 1G of populated memory and the time for the
>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>>>>>> performed for case where 1G memory is comprised of order-0 folios and
>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is faster,
>>>>>>> positive is slower, compared to baseline upon which the series is based:
>>>>>>>
>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>> | fork          |-------------------|-------------------|
>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>
>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>> | fork          |-------------------|-------------------|
>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>
>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>> ---
>>>>>>>     include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
>>>>>>>     mm/memory.c             | 92 ++++++++++++++++++++++++++---------------
>>>>>>>     2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>> --- a/include/linux/pgtable.h
>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>     #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>     #endif
>>>>>>>     +#ifndef pte_batch_remaining
>>>>>>> +/**
>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch boundary.
>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>> + * @addr: Address of the first page.
>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>> + *
>>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous batch of
>>>>>>> ptes.
>>>>>>> + * In such cases, this function returns the remaining number of pages to
>>>>>>> the end
>>>>>>> + * of the current batch, as defined by addr. This can be useful when
>>>>>>> iterating
>>>>>>> + * over ptes.
>>>>>>> + *
>>>>>>> + * May be overridden by the architecture, else batch size is always 1.
>>>>>>> + */
>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long
>>>>>>> addr,
>>>>>>> +                        unsigned long end)
>>>>>>> +{
>>>>>>> +    return 1;
>>>>>>> +}
>>>>>>> +#endif
>>>>>>
>>>>>> It's a shame we now lose the optimization for all other architectures.
>>>>>>
>>>>>> Was there no way to have some basic batching mechanism that doesn't require
>>>>>> arch
>>>>>> specifics?
>>>>>
>>>>> I tried a bunch of things but ultimately the way I've done it was the only way
>>>>> to reduce the order-0 fork regression to 0.
>>>>>
>>>>> My original v3 posting was costing 5% extra and even my first attempt at an
>>>>> arch-specific version that didn't resolve to a compile-time constant 1 still
>>>>> cost an extra 3%.
>>>>>
>>>>>
>>>>>>
>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>
>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>> * Check that PFN is consecutive
>>>>>> * Check that all PFNs belong to the same folio
>>>>>
>>>>> I haven't tried this exact approach, but I'd be surprised if I can get the
>>>>> regression under 4% with this. Further along the series I spent a lot of time
>>>>> having to fiddle with the arm64 implementation; every conditional and every
>>>>> memory read (even when in cache) was a problem. There is just so little in the
>>>>> inner loop that every instruction matters. (At least on Ampere Altra and Apple
>>>>> M2).
>>>>>
>>>>> Of course if you're willing to pay that 4-5% for order-0 then the benefit to
>>>>> order-9 is around 10% in my measurements. Personally though, I'd prefer to play
>>>>> safe and ensure the common order-0 case doesn't regress, as you previously
>>>>> suggested.
>>>>>
>>>>
>>>> I just hacked something up, on top of my beloved rmap cleanup/batching series. I
>>>> implemented very generic and simple batching for large folios (all PTE bits
>>>> except the PFN have to match).
>>>>
>>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R) Silver
>>>> 4210R CPU.
>>>>
>>>> order-0: 0.014210 -> 0.013969
>>>>
>>>> -> Around 1.7 % faster
>>>>
>>>> order-9: 0.014373 -> 0.009149
>>>>
>>>> -> Around 36.3 % faster
>>>
>>> Well I guess that shows me :)
>>>
>>> I'll do a review and run the tests on my HW to see if it concurs.
>>
>>
>> I pushed a simple compile fixup (we need pte_next_pfn()).
>
> I've just been trying to compile and noticed this. Will take a look at your update.

Took a look; there will still be arch work needed: arm64 doesn't define
PFN_PTE_SHIFT because it defines set_ptes(). I'm not sure whether there are
other arches that also don't define PFN_PTE_SHIFT (or pte_next_pfn(), if the
math is more complex) - it will need an audit.
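
For reference, I expect the generic fallback looks something like this (sketch
from memory, so treat the details as approximate); it only works where
PFN_PTE_SHIFT is defined:

#ifndef pte_next_pfn
static inline pte_t pte_next_pfn(pte_t pte)
{
	/* Advance the encoded PFN by one page. */
	return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
}
#endif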

>
> But upon review, I've noticed the part that I think makes this difficult for
> arm64 with the contpte optimization; You are calling ptep_get() for every pte in
> the batch. While this is functionally correct, once arm64 has the contpte
> changes, its ptep_get() has to read every pte in the contpte block in order to
> gather the access and dirty bits. So if your batching function ends up walking
> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
> performance. That's why I added the arch-specific pte_batch_remaining()
> function; this allows the core-mm to skip to the end of the contpte block and
> avoid ptep_get() for the 15 tail ptes. So we end up with 16 READ_ONCE()s instead
> of 256.
>
> I considered making a ptep_get_noyoungdirty() variant, which would avoid the bit
> gathering. But we have a similar problem in zap_pte_range() and that function
> needs the dirty bit to update the folio. So it doesn't work there. (see patch 3
> in my series).
>
> I guess you are going to say that we should combine both approaches, so that
> your batching loop can skip forward an arch-provided number of ptes? That would
> certainly work, but feels like an orthogonal change to what I'm trying to
> achieve :). Anyway, I'll spend some time playing with it today.
>
>
>>
>> Note that we should probably handle "ptep_set_wrprotects" rather like set_ptes:
>>
>> #ifndef wrprotect_ptes
>> static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>                pte_t *ptep, unsigned int nr)
>> {
>>        for (;;) {
>>                ptep_set_wrprotect(mm, addr, ptep);
>>                if (--nr == 0)
>>                        break;
>>                ptep++;
>>                addr += PAGE_SIZE;
>>        }
>> }
>> #endif

Yes, that's a much better name; I've also introduced clear_ptes() (in patch 3)
and set_ptes_full(), which takes a flag that allows arm64 to avoid trying to
fold a contpte block; that's needed to avoid regressing fork once the contpte
changes are present.
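
The generic fallback for clear_ptes() follows the same shape as your
wrprotect_ptes() above, but accumulates access/dirty across the range like the
arm64 contpte version does (rough sketch only, not the exact code in patch 3):

#ifndef clear_ptes
static inline pte_t clear_ptes(struct mm_struct *mm, unsigned long addr,
			pte_t *ptep, unsigned int nr, int full)
{
	pte_t pte = ptep_get_and_clear_full(mm, addr, ptep, full);
	unsigned int i;

	for (i = 1; i < nr; i++) {
		pte_t tail;

		addr += PAGE_SIZE;
		ptep++;
		tail = ptep_get_and_clear_full(mm, addr, ptep, full);

		/* Preserve access/dirty bits from every entry in the range. */
		if (pte_dirty(tail))
			pte = pte_mkdirty(pte);
		if (pte_young(tail))
			pte = pte_mkyoung(pte);
	}

	return pte;
}
#endif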

>>
>>
>


2023-12-20 10:00:24

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20.12.23 10:57, Ryan Roberts wrote:
> On 20/12/2023 09:51, Ryan Roberts wrote:
>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of memory,
>>>>>>>> all belonging to the same folio. A pte batch is then write-protected in
>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects() and is
>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>
>>>>>>>> The primary motivation for this change is to reduce the number of tlb
>>>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>>>> fork, as it is about to add transparent support for the "contiguous bit"
>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>>>>>>> child, the backend does not need to fold a contiguous range once they
>>>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>>>> range in the first place.
>>>>>>>>
>>>>>>>> This code is very performance sensitive, and a significant amount of
>>>>>>>> effort has been put into not regressing performance for the order-0
>>>>>>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>>>>>>> which enables the compiler to simplify the extra loops that are added
>>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>>> performant) as the previous implementation.
>>>>>>>>
>>>>>>>> This change addresses the core-mm refactoring only and a separate change
>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>
>>>>>>>> To ensure the arm64 is performant once implemented, this change is very
>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>
>>>>>>>> The following microbenchmark results demonstrate that there is no
>>>>>>>> significant performance change after this patch. Fork is called in a
>>>>>>>> tight loop in a process with 1G of populated memory and the time for the
>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>>>>>>> performed for case where 1G memory is comprised of order-0 folios and
>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is faster,
>>>>>>>> positive is slower, compared to baseline upon which the series is based:
>>>>>>>>
>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>
>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>
>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>> ---
>>>>>>>>     include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
>>>>>>>>     mm/memory.c             | 92 ++++++++++++++++++++++++++---------------
>>>>>>>>     2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>     #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>     #endif
>>>>>>>>     +#ifndef pte_batch_remaining
>>>>>>>> +/**
>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch boundary.
>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>> + * @addr: Address of the first page.
>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>> + *
>>>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous batch of
>>>>>>>> ptes.
>>>>>>>> + * In such cases, this function returns the remaining number of pages to
>>>>>>>> the end
>>>>>>>> + * of the current batch, as defined by addr. This can be useful when
>>>>>>>> iterating
>>>>>>>> + * over ptes.
>>>>>>>> + *
>>>>>>>> + * May be overridden by the architecture, else batch size is always 1.
>>>>>>>> + */
>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long
>>>>>>>> addr,
>>>>>>>> +                        unsigned long end)
>>>>>>>> +{
>>>>>>>> +    return 1;
>>>>>>>> +}
>>>>>>>> +#endif
>>>>>>>
>>>>>>> It's a shame we now lose the optimization for all other architectures.
>>>>>>>
>>>>>>> Was there no way to have some basic batching mechanism that doesn't require
>>>>>>> arch
>>>>>>> specifics?
>>>>>>
>>>>>> I tried a bunch of things but ultimately the way I've done it was the only way
>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>
>>>>>> My original v3 posting was costing 5% extra and even my first attempt at an
>>>>>> arch-specific version that didn't resolve to a compile-time constant 1 still
>>>>>> cost an extra 3%.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>
>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>> * Check that PFN is consecutive
>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>
>>>>>> I haven't tried this exact approach, but I'd be surprised if I can get the
>>>>>> regression under 4% with this. Further along the series I spent a lot of time
>>>>>> having to fiddle with the arm64 implementation; every conditional and every
>>>>>> memory read (even when in cache) was a problem. There is just so little in the
>>>>>> inner loop that every instruction matters. (At least on Ampere Altra and Apple
>>>>>> M2).
>>>>>>
>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the benefit to
>>>>>> order-9 is around 10% in my measurements. Personally though, I'd prefer to play
>>>>>> safe and ensure the common order-0 case doesn't regress, as you previously
>>>>>> suggested.
>>>>>>
>>>>>
>>>>> I just hacked something up, on top of my beloved rmap cleanup/batching series. I
>>>>> implemented very generic and simple batching for large folios (all PTE bits
>>>>> except the PFN have to match).
>>>>>
>>>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R) Silver
>>>>> 4210R CPU.
>>>>>
>>>>> order-0: 0.014210 -> 0.013969
>>>>>
>>>>> -> Around 1.7 % faster
>>>>>
>>>>> order-9: 0.014373 -> 0.009149
>>>>>
>>>>> -> Around 36.3 % faster
>>>>
>>>> Well I guess that shows me :)
>>>>
>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>
>>>
>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>
>> I've just been trying to compile and noticed this. Will take a look at your update.
>
> Took a look; there will still be arch work needed; arm64 doesn't define
> PFN_PTE_SHIFT because it defines set_ptes(). I'm not sure if there are other
> arches that also don't define PFN_PTE_SHIFT (or pte_next_pfn(), if the math is
> more complex) - it will need an audit.
>

Right, likely many that have their own set_ptes() implementation right now.

--
Cheers,

David / dhildenb


2023-12-20 10:16:36

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20/12/2023 09:54, David Hildenbrand wrote:
> On 20.12.23 10:51, Ryan Roberts wrote:
>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of memory,
>>>>>>>> all belonging to the same folio. A pte batch is then write-protected in
>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects() and is
>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>
>>>>>>>> The primary motivation for this change is to reduce the number of tlb
>>>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>>>> fork, as it is about to add transparent support for the "contiguous bit"
>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>>>>>>> child, the backend does not need to fold a contiguous range once they
>>>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>>>> range in the first place.
>>>>>>>>
>>>>>>>> This code is very performance sensitive, and a significant amount of
>>>>>>>> effort has been put into not regressing performance for the order-0
>>>>>>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>>>>>>> which enables the compiler to simplify the extra loops that are added
>>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>>> performant) as the previous implementation.
>>>>>>>>
>>>>>>>> This change addresses the core-mm refactoring only and a separate change
>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>
>>>>>>>> To ensure the arm64 is performant once implemented, this change is very
>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>
>>>>>>>> The following microbenchmark results demonstrate that there is no
>>>>>>>> significant performance change after this patch. Fork is called in a
>>>>>>>> tight loop in a process with 1G of populated memory and the time for the
>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>>>>>>> performed for case where 1G memory is comprised of order-0 folios and
>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is faster,
>>>>>>>> positive is slower, compared to baseline upon which the series is based:
>>>>>>>>
>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>
>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>
>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>> ---
>>>>>>>>      include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
>>>>>>>>      mm/memory.c             | 92 ++++++++++++++++++++++++++---------------
>>>>>>>>      2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>      #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>      #endif
>>>>>>>>      +#ifndef pte_batch_remaining
>>>>>>>> +/**
>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch boundary.
>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>> + * @addr: Address of the first page.
>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>> + *
>>>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous batch of
>>>>>>>> ptes.
>>>>>>>> + * In such cases, this function returns the remaining number of pages to
>>>>>>>> the end
>>>>>>>> + * of the current batch, as defined by addr. This can be useful when
>>>>>>>> iterating
>>>>>>>> + * over ptes.
>>>>>>>> + *
>>>>>>>> + * May be overridden by the architecture, else batch size is always 1.
>>>>>>>> + */
>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long
>>>>>>>> addr,
>>>>>>>> +                        unsigned long end)
>>>>>>>> +{
>>>>>>>> +    return 1;
>>>>>>>> +}
>>>>>>>> +#endif
>>>>>>>
>>>>>>> It's a shame we now lose the optimization for all other architectures.
>>>>>>>
>>>>>>> Was there no way to have some basic batching mechanism that doesn't require
>>>>>>> arch
>>>>>>> specifics?
>>>>>>
>>>>>> I tried a bunch of things but ultimately the way I've done it was the only
>>>>>> way
>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>
>>>>>> My original v3 posting was costing 5% extra and even my first attempt at an
>>>>>> arch-specific version that didn't resolve to a compile-time constant 1 still
>>>>>> cost an extra 3%.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>
>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>> * Check that PFN is consecutive
>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>
>>>>>> I haven't tried this exact approach, but I'd be surprised if I can get the
>>>>>> regression under 4% with this. Further along the series I spent a lot of time
>>>>>> having to fiddle with the arm64 implementation; every conditional and every
>>>>>> memory read (even when in cache) was a problem. There is just so little in
>>>>>> the
>>>>>> inner loop that every instruction matters. (At least on Ampere Altra and
>>>>>> Apple
>>>>>> M2).
>>>>>>
>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the benefit to
>>>>>> order-9 is around 10% in my measurements. Personally though, I'd prefer to
>>>>>> play
>>>>>> safe and ensure the common order-0 case doesn't regress, as you previously
>>>>>> suggested.
>>>>>>
>>>>>
>>>>> I just hacked something up, on top of my beloved rmap cleanup/batching
>>>>> series. I
>>>>> implemented very generic and simple batching for large folios (all PTE bits
>>>>> except the PFN have to match).
>>>>>
>>>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R) Silver
>>>>> 4210R CPU.
>>>>>
>>>>> order-0: 0.014210 -> 0.013969
>>>>>
>>>>> -> Around 1.7 % faster
>>>>>
>>>>> order-9: 0.014373 -> 0.009149
>>>>>
>>>>> -> Around 36.3 % faster
>>>>
>>>> Well I guess that shows me :)
>>>>
>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>
>>>
>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>
>> I've just been trying to compile and noticed this. Will take a look at your
>> update.
>>
>> But upon review, I've noticed the part that I think makes this difficult for
>> arm64 with the contpte optimization; You are calling ptep_get() for every pte in
>> the batch. While this is functionally correct, once arm64 has the contpte
>> changes, its ptep_get() has to read every pte in the contpte block in order to
>> gather the access and dirty bits. So if your batching function ends up walking
>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>> performance. That's why I added the arch-specific pte_batch_remaining()
>> function; this allows the core-mm to skip to the end of the contpte block and
>> avoid ptep_get() for the 15 tail ptes. So we end up with 16 READ_ONCE()s instead
>> of 256.
>>
>> I considered making a ptep_get_noyoungdirty() variant, which would avoid the bit
>> gathering. But we have a similar problem in zap_pte_range() and that function
>> needs the dirty bit to update the folio. So it doesn't work there. (see patch 3
>> in my series).
>>
>> I guess you are going to say that we should combine both approaches, so that
>> your batching loop can skip forward an arch-provided number of ptes? That would
>> certainly work, but feels like an orthogonal change to what I'm trying to
>> achieve :). Anyway, I'll spend some time playing with it today.
>
> You can overwrite the function or add special-casing internally, yes.
>
> Right now, your patch is called "mm: Batch-copy PTE ranges during fork()" and it
> doesn't do any of that besides preparing for some arm64 work.
>

Well, it allows an arch to opt in to batching. But I see your point.

How do you want to handle your patches? Do you want to clean them up and I'll
base my stuff on top? Or do you want me to take them and sort it all out?

As I see it at the moment, I would keep your folio_pte_batch() in core-mm, but
in a subsequent patch have it use pte_batch_remaining() (the arch function I
have in my series, which defaults to 1), roughly as sketched below. Then do a
similar thing in zap_pte_range() to what you have done for fork - also using
folio_pte_batch(). Then lay my series on top.
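
Something like this, just to show the shape I mean (names and checks are
simplified; your real folio_pte_batch() does more, e.g. the pte-bit and
same-folio checks):

static inline int folio_pte_batch_sketch(pte_t *ptep, pte_t pte,
			unsigned long addr, unsigned long end, int max_nr)
{
	unsigned long expected_pfn = pte_pfn(pte);
	int nr = 0;

	while (nr < max_nr && addr < end) {
		/* Arch-provided batch size; the generic default is 1. */
		int step = pte_batch_remaining(pte, addr, end);

		nr += step;
		expected_pfn += step;
		addr += (unsigned long)step * PAGE_SIZE;
		ptep += step;

		if (nr >= max_nr || addr >= end)
			break;

		/* One ptep_get() per batch step, not one per tail pte. */
		pte = ptep_get(ptep);
		if (!pte_present(pte) || pte_pfn(pte) != expected_pfn)
			break;
	}

	return min(nr, max_nr);
}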


2023-12-20 10:26:16

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20.12.23 11:11, Ryan Roberts wrote:
> On 20/12/2023 09:54, David Hildenbrand wrote:
>> On 20.12.23 10:51, Ryan Roberts wrote:
>>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of memory,
>>>>>>>>> all belonging to the same folio. A pte batch is then write-protected in
>>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects() and is
>>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>>
>>>>>>>>> The primary motivation for this change is to reduce the number of tlb
>>>>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>>>>> fork, as it is about to add transparent support for the "contiguous bit"
>>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>>>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>>>>>>>> child, the backend does not need to fold a contiguous range once they
>>>>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>>>>> range in the first place.
>>>>>>>>>
>>>>>>>>> This code is very performance sensitive, and a significant amount of
>>>>>>>>> effort has been put into not regressing performance for the order-0
>>>>>>>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>>>>>>>> which enables the compiler to simplify the extra loops that are added
>>>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>>>> performant) as the previous implementation.
>>>>>>>>>
>>>>>>>>> This change addresses the core-mm refactoring only and a separate change
>>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>>
>>>>>>>>> To ensure the arm64 is performant once implemented, this change is very
>>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>>
>>>>>>>>> The following microbenchmark results demonstrate that there is no
>>>>>>>>> significant performance change after this patch. Fork is called in a
>>>>>>>>> tight loop in a process with 1G of populated memory and the time for the
>>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>>>>>>>> performed for case where 1G memory is comprised of order-0 folios and
>>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is faster,
>>>>>>>>> positive is slower, compared to baseline upon which the series is based:
>>>>>>>>>
>>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>>
>>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>>
>>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>> ---
>>>>>>>>>      include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
>>>>>>>>>      mm/memory.c             | 92 ++++++++++++++++++++++++++---------------
>>>>>>>>>      2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>>      #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>>      #endif
>>>>>>>>>      +#ifndef pte_batch_remaining
>>>>>>>>> +/**
>>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch boundary.
>>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>>> + * @addr: Address of the first page.
>>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>>> + *
>>>>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous batch of
>>>>>>>>> ptes.
>>>>>>>>> + * In such cases, this function returns the remaining number of pages to
>>>>>>>>> the end
>>>>>>>>> + * of the current batch, as defined by addr. This can be useful when
>>>>>>>>> iterating
>>>>>>>>> + * over ptes.
>>>>>>>>> + *
>>>>>>>>> + * May be overridden by the architecture, else batch size is always 1.
>>>>>>>>> + */
>>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long
>>>>>>>>> addr,
>>>>>>>>> +                        unsigned long end)
>>>>>>>>> +{
>>>>>>>>> +    return 1;
>>>>>>>>> +}
>>>>>>>>> +#endif
>>>>>>>>
>>>>>>>> It's a shame we now lose the optimization for all other archtiectures.
>>>>>>>>
>>>>>>>> Was there no way to have some basic batching mechanism that doesn't require
>>>>>>>> arch
>>>>>>>> specifics?
>>>>>>>
>>>>>>> I tried a bunch of things but ultimately the way I've done it was the only
>>>>>>> way
>>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>>
>>>>>>> My original v3 posting was costing 5% extra and even my first attempt at an
>>>>>>> arch-specific version that didn't resolve to a compile-time constant 1 still
>>>>>>> cost an extra 3%.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>>
>>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>>> * Check that PFN is consecutive
>>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>>
>>>>>>> I haven't tried this exact approach, but I'd be surprised if I can get the
>>>>>>> regression under 4% with this. Further along the series I spent a lot of time
>>>>>>> having to fiddle with the arm64 implementation; every conditional and every
>>>>>>> memory read (even when in cache) was a problem. There is just so little in
>>>>>>> the
>>>>>>> inner loop that every instruction matters. (At least on Ampere Altra and
>>>>>>> Apple
>>>>>>> M2).
>>>>>>>
>>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the benefit to
>>>>>>> order-9 is around 10% in my measurements. Personally though, I'd prefer to
>>>>>>> play
>>>>>>> safe and ensure the common order-0 case doesn't regress, as you previously
>>>>>>> suggested.
>>>>>>>
>>>>>>
>>>>>> I just hacked something up, on top of my beloved rmap cleanup/batching
>>>>>> series. I
>>>>>> implemented very generic and simple batching for large folios (all PTE bits
>>>>>> except the PFN have to match).
>>>>>>
>>>>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R) Silver
>>>>>> 4210R CPU.
>>>>>>
>>>>>> order-0: 0.014210 -> 0.013969
>>>>>>
>>>>>> -> Around 1.7 % faster
>>>>>>
>>>>>> order-9: 0.014373 -> 0.009149
>>>>>>
>>>>>> -> Around 36.3 % faster
>>>>>
>>>>> Well I guess that shows me :)
>>>>>
>>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>>
>>>>
>>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>>
>>> I've just been trying to compile and noticed this. Will take a look at your
>>> update.
>>>
>>> But upon review, I've noticed the part that I think makes this difficult for
>>> arm64 with the contpte optimization; You are calling ptep_get() for every pte in
>>> the batch. While this is functionally correct, once arm64 has the contpte
>>> changes, its ptep_get() has to read every pte in the contpte block in order to
>>> gather the access and dirty bits. So if your batching function ends up wealking
>>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>>> performance. That's why I added the arch-specific pte_batch_remaining()
>>> function; this allows the core-mm to skip to the end of the contpte block and
>>> avoid ptep_get() for the 15 tail ptes. So we end up with 16 READ_ONCE()s instead
>>> of 256.
>>>
>>> I considered making a ptep_get_noyoungdirty() variant, which would avoid the bit
>>> gathering. But we have a similar problem in zap_pte_range() and that function
>>> needs the dirty bit to update the folio. So it doesn't work there. (see patch 3
>>> in my series).
>>>
>>> I guess you are going to say that we should combine both approaches, so that
>>> your batching loop can skip forward an arch-provided number of ptes? That would
>>> certainly work, but feels like an orthogonal change to what I'm trying to
>>> achieve :). Anyway, I'll spend some time playing with it today.
>>
>> You can overwrite the function or add special-casing internally, yes.
>>
>> Right now, your patch is called "mm: Batch-copy PTE ranges during fork()" and it
>> doesn't do any of that besides preparing for some arm64 work.
>>
>
> Well it allows an arch to opt-in to batching. But I see your point.
>
> How do you want to handle your patches? Do you want to clean them up and I'll
> base my stuff on top? Or do you want me to take them and sort it all out?

Whatever you prefer, it was mostly a quick prototype to see if we can
achieve decent performance.

I can fixup the arch thingies (most should be easy, some might require a
custom pte_next_pfn()) and you can focus on getting cont-pte sorted out
on top [I assume that's what you want to work on :) ].
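
For reference, a generic pte_next_pfn() fallback could look roughly like
this (just a sketch; it assumes the arch provides pfn_pte() and
pte_pgprot(), which is exactly why some architectures would need their own
version):

#ifndef pte_next_pfn
static inline pte_t pte_next_pfn(pte_t pte)
{
    /* Same attributes, with the PFN advanced by one page. */
    return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
}
#endif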

>
> As I see it at the moment, I would keep your folio_pte_batch() always core, but
> in subsequent patch, have it use pte_batch_remaining() (the arch function I have
> in my series, which defaults to one).

Just double-checking, how would it use pte_batch_remaining() ?

> Then do a similar thing to what you have
> done for fork in zap_pte_range() - also using folio_pte_batch(). Then lay my
> series on top.

Yes, we should probably try to handle the zapping part similarly: make
it benefit all archs first, then special-case on cont-pte. I can help
there as well.

--
Cheers,

David / dhildenb


2023-12-20 10:50:51

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20/12/2023 10:16, David Hildenbrand wrote:
> On 20.12.23 11:11, Ryan Roberts wrote:
>> On 20/12/2023 09:54, David Hildenbrand wrote:
>>> On 20.12.23 10:51, Ryan Roberts wrote:
>>>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of memory,
>>>>>>>>>> all belonging to the same folio. A pte batch is then write-protected in
>>>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects() and is
>>>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>>>
>>>>>>>>>> The primary motivation for this change is to reduce the number of tlb
>>>>>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>>>>>> fork, as it is about to add transparent support for the "contiguous bit"
>>>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>>>>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>>>>>>>>> child, the backend does not need to fold a contiguous range once they
>>>>>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>>>>>> range in the first place.
>>>>>>>>>>
>>>>>>>>>> This code is very performance sensitive, and a significant amount of
>>>>>>>>>> effort has been put into not regressing performance for the order-0
>>>>>>>>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>>>>>>>>> which enables the compiler to simplify the extra loops that are added
>>>>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>>>>> performant) as the previous implementation.
>>>>>>>>>>
>>>>>>>>>> This change addresses the core-mm refactoring only and a separate change
>>>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>>>
>>>>>>>>>> To ensure the arm64 is performant once implemented, this change is very
>>>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>>>
>>>>>>>>>> The following microbenchmark results demonstate that there is no
>>>>>>>>>> significant performance change after this patch. Fork is called in a
>>>>>>>>>> tight loop in a process with 1G of populated memory and the time for the
>>>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>>>>>>>>> performed for case where 1G memory is comprised of order-0 folios and
>>>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is faster,
>>>>>>>>>> positive is slower, compared to baseline upon which the series is based:
>>>>>>>>>>
>>>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>>>
>>>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>>>
>>>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>>> ---
>>>>>>>>>>       include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
>>>>>>>>>>       mm/memory.c             | 92
>>>>>>>>>> ++++++++++++++++++++++++++---------------
>>>>>>>>>>       2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>>>       #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>>>       #endif
>>>>>>>>>>       +#ifndef pte_batch_remaining
>>>>>>>>>> +/**
>>>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch
>>>>>>>>>> boundary.
>>>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>>>> + * @addr: Address of the first page.
>>>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>>>> + *
>>>>>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous
>>>>>>>>>> batch of
>>>>>>>>>> ptes.
>>>>>>>>>> + * In such cases, this function returns the remaining number of pages to
>>>>>>>>>> the end
>>>>>>>>>> + * of the current batch, as defined by addr. This can be useful when
>>>>>>>>>> iterating
>>>>>>>>>> + * over ptes.
>>>>>>>>>> + *
>>>>>>>>>> + * May be overridden by the architecture, else batch size is always 1.
>>>>>>>>>> + */
>>>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long
>>>>>>>>>> addr,
>>>>>>>>>> +                        unsigned long end)
>>>>>>>>>> +{
>>>>>>>>>> +    return 1;
>>>>>>>>>> +}
>>>>>>>>>> +#endif
>>>>>>>>>
>>>>>>>>> It's a shame we now lose the optimization for all other archtiectures.
>>>>>>>>>
>>>>>>>>> Was there no way to have some basic batching mechanism that doesn't
>>>>>>>>> require
>>>>>>>>> arch
>>>>>>>>> specifics?
>>>>>>>>
>>>>>>>> I tried a bunch of things but ultimately the way I've done it was the only
>>>>>>>> way
>>>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>>>
>>>>>>>> My original v3 posting was costing 5% extra and even my first attempt at an
>>>>>>>> arch-specific version that didn't resolve to a compile-time constant 1
>>>>>>>> still
>>>>>>>> cost an extra 3%.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>>>
>>>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>>>> * Check that PFN is consecutive
>>>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>>>
>>>>>>>> I haven't tried this exact approach, but I'd be surprised if I can get the
>>>>>>>> regression under 4% with this. Further along the series I spent a lot of
>>>>>>>> time
>>>>>>>> having to fiddle with the arm64 implementation; every conditional and every
>>>>>>>> memory read (even when in cache) was a problem. There is just so little in
>>>>>>>> the
>>>>>>>> inner loop that every instruction matters. (At least on Ampere Altra and
>>>>>>>> Apple
>>>>>>>> M2).
>>>>>>>>
>>>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the
>>>>>>>> benefit to
>>>>>>>> order-9 is around 10% in my measurements. Personally though, I'd prefer to
>>>>>>>> play
>>>>>>>> safe and ensure the common order-0 case doesn't regress, as you previously
>>>>>>>> suggested.
>>>>>>>>
>>>>>>>
>>>>>>> I just hacked something up, on top of my beloved rmap cleanup/batching
>>>>>>> series. I
>>>>>>> implemented very generic and simple batching for large folios (all PTE bits
>>>>>>> except the PFN have to match).
>>>>>>>
>>>>>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R)
>>>>>>> Silver
>>>>>>> 4210R CPU.
>>>>>>>
>>>>>>> order-0: 0.014210 -> 0.013969
>>>>>>>
>>>>>>> -> Around 1.7 % faster
>>>>>>>
>>>>>>> order-9: 0.014373 -> 0.009149
>>>>>>>
>>>>>>> -> Around 36.3 % faster
>>>>>>
>>>>>> Well I guess that shows me :)
>>>>>>
>>>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>>>
>>>>>
>>>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>>>
>>>> I've just been trying to compile and noticed this. Will take a look at your
>>>> update.
>>>>
>>>> But upon review, I've noticed the part that I think makes this difficult for
>>>> arm64 with the contpte optimization; You are calling ptep_get() for every
>>>> pte in
>>>> the batch. While this is functionally correct, once arm64 has the contpte
>>>> changes, its ptep_get() has to read every pte in the contpte block in order to
>>>> gather the access and dirty bits. So if your batching function ends up wealking
>>>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>>>> performance. That's why I added the arch-specific pte_batch_remaining()
>>>> function; this allows the core-mm to skip to the end of the contpte block and
>>>> avoid ptep_get() for the 15 tail ptes. So we end up with 16 READ_ONCE()s
>>>> instead
>>>> of 256.
>>>>
>>>> I considered making a ptep_get_noyoungdirty() variant, which would avoid the
>>>> bit
>>>> gathering. But we have a similar problem in zap_pte_range() and that function
>>>> needs the dirty bit to update the folio. So it doesn't work there. (see patch 3
>>>> in my series).
>>>>
>>>> I guess you are going to say that we should combine both approaches, so that
>>>> your batching loop can skip forward an arch-provided number of ptes? That would
>>>> certainly work, but feels like an orthogonal change to what I'm trying to
>>>> achieve :). Anyway, I'll spend some time playing with it today.
>>>
>>> You can overwrite the function or add special-casing internally, yes.
>>>
>>> Right now, your patch is called "mm: Batch-copy PTE ranges during fork()" and it
>>> doesn't do any of that besides preparing for some arm64 work.
>>>
>>
>> Well it allows an arch to opt-in to batching. But I see your point.
>>
>> How do you want to handle your patches? Do you want to clean them up and I'll
>> base my stuff on top? Or do you want me to take them and sort it all out?
>
> Whatever you prefer, it was mostly a quick prototype to see if we can achieve
> decent performance.

I'm about to run it on Altra and M2. But I assume it will show similar results.

>
> I can fixup the arch thingies (most should be easy, some might require a custom
> pte_next_pfn())

Well if you're happy to do that, great! I'm keen to get the contpte stuff into
v6.9 if at all possible, and I'm conscious that I'm introducing more dependencies
on you. And it's about to be holiday season...

> and you can focus on getting cont-pte sorted out on top [I
> assume that's what you want to work on :) ].

That's certainly what I'm focussed on. But I'm happy to do whatever is required
to get it over the line. I guess I'll start by finishing my review of your v1
rmap stuff.

>
>>
>> As I see it at the moment, I would keep your folio_pte_batch() always core, but
>> in subsequent patch, have it use pte_batch_remaining() (the arch function I have
>> in my series, which defaults to one).
>
> Just double-checking, how would it use pte_batch_remaining() ?

I think something like this would do it (untested):

static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
        pte_t *start_ptep, pte_t pte, int max_nr)
{
    unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
    pte_t expected_pte = pte_next_pfn(pte);
    pte_t *ptep = start_ptep;
    int nr;

    for (;;) {
        nr = min(max_nr, pte_batch_remaining());
        ptep += nr;
        max_nr -= nr;

        if (max_nr == 0)
            break;

        pte = ptep_get(ptep);

        /* Do all PTE bits match, and the PFN is consecutive? */
        if (!pte_same(pte, expected_pte))
            break;

        /*
         * Stop immediately once we reached the end of the folio. In
         * corner cases the next PFN might fall into a different
         * folio.
         */
        if (pte_pfn(pte) == folio_end_pfn - 1)
            break;

        expected_pte = pte_next_pfn(expected_pte);
    }

    return ptep - start_ptep;
}

Of course, if we have the concept of a "pte batch" in the core-mm, then we might
want to call the arch's thing something different; pte span? pte cont? pte cont
batch? ...


>
>> Then do a similar thing to what you have
>> done for fork in zap_pte_range() - also using folio_pte_batch(). Then lay my
>> series on top.
>
> Yes, we should probably try to handle the zapping part similarly: make it
> benefit all archs first, then special-case on cont-pte. I can help there as well.

OK great.


2023-12-20 11:01:27

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20.12.23 11:41, Ryan Roberts wrote:
> On 20/12/2023 10:16, David Hildenbrand wrote:
>> On 20.12.23 11:11, Ryan Roberts wrote:
>>> On 20/12/2023 09:54, David Hildenbrand wrote:
>>>> On 20.12.23 10:51, Ryan Roberts wrote:
>>>>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>>>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of memory,
>>>>>>>>>>> all belonging to the same folio. A pte batch is then write-protected in
>>>>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects() and is
>>>>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>>>>
>>>>>>>>>>> The primary motivation for this change is to reduce the number of tlb
>>>>>>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>>>>>>> fork, as it is about to add transparent support for the "contiguous bit"
>>>>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>>>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>>>>>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>>>>>>>>>> child, the backend does not need to fold a contiguous range once they
>>>>>>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>>>>>>> range in the first place.
>>>>>>>>>>>
>>>>>>>>>>> This code is very performance sensitive, and a significant amount of
>>>>>>>>>>> effort has been put into not regressing performance for the order-0
>>>>>>>>>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>>>>>>>>>> which enables the compiler to simplify the extra loops that are added
>>>>>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>>>>>> performant) as the previous implementation.
>>>>>>>>>>>
>>>>>>>>>>> This change addresses the core-mm refactoring only and a separate change
>>>>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>>>>
>>>>>>>>>>> To ensure the arm64 is performant once implemented, this change is very
>>>>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>>>>
>>>>>>>>>>> The following microbenchmark results demonstate that there is no
>>>>>>>>>>> significant performance change after this patch. Fork is called in a
>>>>>>>>>>> tight loop in a process with 1G of populated memory and the time for the
>>>>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>>>>>>>>>> performed for case where 1G memory is comprised of order-0 folios and
>>>>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is faster,
>>>>>>>>>>> positive is slower, compared to baseline upon which the series is based:
>>>>>>>>>>>
>>>>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>>>>
>>>>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>>>>
>>>>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>>>> ---
>>>>>>>>>>>       include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
>>>>>>>>>>>       mm/memory.c             | 92
>>>>>>>>>>> ++++++++++++++++++++++++++---------------
>>>>>>>>>>>       2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>>>>       #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>>>>       #endif
>>>>>>>>>>>       +#ifndef pte_batch_remaining
>>>>>>>>>>> +/**
>>>>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch
>>>>>>>>>>> boundary.
>>>>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>>>>> + * @addr: Address of the first page.
>>>>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>>>>> + *
>>>>>>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous
>>>>>>>>>>> batch of
>>>>>>>>>>> ptes.
>>>>>>>>>>> + * In such cases, this function returns the remaining number of pages to
>>>>>>>>>>> the end
>>>>>>>>>>> + * of the current batch, as defined by addr. This can be useful when
>>>>>>>>>>> iterating
>>>>>>>>>>> + * over ptes.
>>>>>>>>>>> + *
>>>>>>>>>>> + * May be overridden by the architecture, else batch size is always 1.
>>>>>>>>>>> + */
>>>>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long
>>>>>>>>>>> addr,
>>>>>>>>>>> +                        unsigned long end)
>>>>>>>>>>> +{
>>>>>>>>>>> +    return 1;
>>>>>>>>>>> +}
>>>>>>>>>>> +#endif
>>>>>>>>>>
>>>>>>>>>> It's a shame we now lose the optimization for all other archtiectures.
>>>>>>>>>>
>>>>>>>>>> Was there no way to have some basic batching mechanism that doesn't
>>>>>>>>>> require
>>>>>>>>>> arch
>>>>>>>>>> specifics?
>>>>>>>>>
>>>>>>>>> I tried a bunch of things but ultimately the way I've done it was the only
>>>>>>>>> way
>>>>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>>>>
>>>>>>>>> My original v3 posting was costing 5% extra and even my first attempt at an
>>>>>>>>> arch-specific version that didn't resolve to a compile-time constant 1
>>>>>>>>> still
>>>>>>>>> cost an extra 3%.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>>>>
>>>>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>>>>> * Check that PFN is consecutive
>>>>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>>>>
>>>>>>>>> I haven't tried this exact approach, but I'd be surprised if I can get the
>>>>>>>>> regression under 4% with this. Further along the series I spent a lot of
>>>>>>>>> time
>>>>>>>>> having to fiddle with the arm64 implementation; every conditional and every
>>>>>>>>> memory read (even when in cache) was a problem. There is just so little in
>>>>>>>>> the
>>>>>>>>> inner loop that every instruction matters. (At least on Ampere Altra and
>>>>>>>>> Apple
>>>>>>>>> M2).
>>>>>>>>>
>>>>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the
>>>>>>>>> benefit to
>>>>>>>>> order-9 is around 10% in my measurements. Personally though, I'd prefer to
>>>>>>>>> play
>>>>>>>>> safe and ensure the common order-0 case doesn't regress, as you previously
>>>>>>>>> suggested.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I just hacked something up, on top of my beloved rmap cleanup/batching
>>>>>>>> series. I
>>>>>>>> implemented very generic and simple batching for large folios (all PTE bits
>>>>>>>> except the PFN have to match).
>>>>>>>>
>>>>>>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R)
>>>>>>>> Silver
>>>>>>>> 4210R CPU.
>>>>>>>>
>>>>>>>> order-0: 0.014210 -> 0.013969
>>>>>>>>
>>>>>>>> -> Around 1.7 % faster
>>>>>>>>
>>>>>>>> order-9: 0.014373 -> 0.009149
>>>>>>>>
>>>>>>>> -> Around 36.3 % faster
>>>>>>>
>>>>>>> Well I guess that shows me :)
>>>>>>>
>>>>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>>>>
>>>>>>
>>>>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>>>>
>>>>> I've just been trying to compile and noticed this. Will take a look at your
>>>>> update.
>>>>>
>>>>> But upon review, I've noticed the part that I think makes this difficult for
>>>>> arm64 with the contpte optimization; You are calling ptep_get() for every
>>>>> pte in
>>>>> the batch. While this is functionally correct, once arm64 has the contpte
>>>>> changes, its ptep_get() has to read every pte in the contpte block in order to
>>>>> gather the access and dirty bits. So if your batching function ends up wealking
>>>>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>>>>> performance. That's why I added the arch-specific pte_batch_remaining()
>>>>> function; this allows the core-mm to skip to the end of the contpte block and
>>>>> avoid ptep_get() for the 15 tail ptes. So we end up with 16 READ_ONCE()s
>>>>> instead
>>>>> of 256.
>>>>>
>>>>> I considered making a ptep_get_noyoungdirty() variant, which would avoid the
>>>>> bit
>>>>> gathering. But we have a similar problem in zap_pte_range() and that function
>>>>> needs the dirty bit to update the folio. So it doesn't work there. (see patch 3
>>>>> in my series).
>>>>>
>>>>> I guess you are going to say that we should combine both approaches, so that
>>>>> your batching loop can skip forward an arch-provided number of ptes? That would
>>>>> certainly work, but feels like an orthogonal change to what I'm trying to
>>>>> achieve :). Anyway, I'll spend some time playing with it today.
>>>>
>>>> You can overwrite the function or add special-casing internally, yes.
>>>>
>>>> Right now, your patch is called "mm: Batch-copy PTE ranges during fork()" and it
>>>> doesn't do any of that besides preparing for some arm64 work.
>>>>
>>>
>>> Well it allows an arch to opt-in to batching. But I see your point.
>>>
>>> How do you want to handle your patches? Do you want to clean them up and I'll
>>> base my stuff on top? Or do you want me to take them and sort it all out?
>>
>> Whatever you prefer, it was mostly a quick prototype to see if we can achieve
>> decent performance.
>
> I'm about to run it on Altra and M2. But I assume it will show similar results.
>
>>
>> I can fixup the arch thingies (most should be easy, some might require a custom
>> pte_next_pfn())
>
> Well if you're happy to do that, great! I'm keen to get the contpte stuff into
> v6.9 if at all possible, and I'm conscious that I'm introducing more dependencies
> on you. And it's about to be holiday season...

There is still plenty of time for 6.9. I'll try to get the rmap cleanup
finished asap.

>
>> and you can focus on getting cont-pte sorted out on top [I
>> assume that's what you want to work on :) ].
>
> That's certainly what I'm focussed on. But I'm happy to do whatever is required
> to get it over the line. I guess I'll start by finishing my review of your v1
> rmap stuff.

I'm planning on sending out a new version today.

>
>>
>>>
>>> As I see it at the moment, I would keep your folio_pte_batch() always core, but
>>> in subsequent patch, have it use pte_batch_remaining() (the arch function I have
>>> in my series, which defaults to one).
>>
>> Just double-checking, how would it use pte_batch_remaining() ?
>
> I think something like this would do it (untested):
>
> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>         pte_t *start_ptep, pte_t pte, int max_nr)
> {
>     unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>     pte_t expected_pte = pte_next_pfn(pte);
>     pte_t *ptep = start_ptep;
>     int nr;
>
>     for (;;) {
>         nr = min(max_nr, pte_batch_remaining());
>         ptep += nr;
>         max_nr -= nr;
>
>         if (max_nr == 0)
>             break;
>

expected_pte would be messed up. We'd have to increment it a couple of
times to make it match the nr of pages we're skipping.

>         pte = ptep_get(ptep);
>
>         /* Do all PTE bits match, and the PFN is consecutive? */
>         if (!pte_same(pte, expected_pte))
>             break;
>
>         /*
>          * Stop immediately once we reached the end of the folio. In
>          * corner cases the next PFN might fall into a different
>          * folio.
>          */
>         if (pte_pfn(pte) == folio_end_pfn - 1)
>             break;
>
>         expected_pte = pte_next_pfn(expected_pte);
>     }
>
>     return ptep - start_ptep;
> }
>
> Of course, if we have the concept of a "pte batch" in the core-mm, then we might
> want to call the arch's thing something different; pte span? pte cont? pte cont
> batch? ...

So, you mean something like

/*
* The architecture might be able to tell us efficiently using cont-pte
* bits how many next PTEs are certainly compatible. So in that case,
* simply skip forward.
*/
nr = min(max_nr, nr_cont_ptes(ptep));
...

I wonder if something simple at the start of the function might be good
enough for arm with cont-pte as a first step:

nr = nr_cont_ptes(start_ptep)
if (nr != 1) {
    return min(max_nr, nr);
}

Which would get optimized out on other architectures.
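
Something like the following, perhaps (an untested sketch: nr_cont_ptes() is
the hypothetical arch hook we're discussing, shown here with a generic
fallback that always returns 1 so the early exit compiles away on
architectures without cont-pte):

#ifndef nr_cont_ptes
static inline unsigned int nr_cont_ptes(pte_t *ptep)
{
    return 1;
}
#endif

static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
        pte_t *start_ptep, pte_t pte, int max_nr)
{
    unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
    pte_t expected_pte = pte_next_pfn(pte);
    pte_t *ptep = start_ptep + 1;
    int nr;

    /*
     * If the architecture can tell us cheaply (e.g. from its cont-pte
     * bit) how many of the following PTEs are certainly compatible,
     * skip forward without reading each entry.
     */
    nr = nr_cont_ptes(start_ptep);
    if (nr != 1)
        return min(max_nr, nr);

    /* Otherwise walk the PTEs one by one. */
    while (ptep < start_ptep + max_nr) {
        /*
         * Stop once the previous entry mapped the last page of the
         * folio; the next PFN would fall into a different folio.
         */
        if (pte_pfn(pte) == folio_end_pfn - 1)
            break;

        pte = ptep_get(ptep);

        /* All PTE bits must match and the PFN must be consecutive. */
        if (!pte_same(pte, expected_pte))
            break;

        expected_pte = pte_next_pfn(expected_pte);
        ptep++;
    }

    return ptep - start_ptep;
}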


--
Cheers,

David / dhildenb


2023-12-20 11:28:58

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20/12/2023 10:56, David Hildenbrand wrote:
> On 20.12.23 11:41, Ryan Roberts wrote:
>> On 20/12/2023 10:16, David Hildenbrand wrote:
>>> On 20.12.23 11:11, Ryan Roberts wrote:
>>>> On 20/12/2023 09:54, David Hildenbrand wrote:
>>>>> On 20.12.23 10:51, Ryan Roberts wrote:
>>>>>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>>>>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>>>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of
>>>>>>>>>>>> memory,
>>>>>>>>>>>> all belonging to the same folio. A pte batch is then write-protected in
>>>>>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects() and is
>>>>>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>>>>>
>>>>>>>>>>>> The primary motivation for this change is to reduce the number of tlb
>>>>>>>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>>>>>>>> fork, as it is about to add transparent support for the "contiguous
>>>>>>>>>>>> bit"
>>>>>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>>>>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>>>>>>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>>>>>>>>>>> child, the backend does not need to fold a contiguous range once they
>>>>>>>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>>>>>>>> range in the first place.
>>>>>>>>>>>>
>>>>>>>>>>>> This code is very performance sensitive, and a significant amount of
>>>>>>>>>>>> effort has been put into not regressing performance for the order-0
>>>>>>>>>>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>>>>>>>>>>> which enables the compiler to simplify the extra loops that are added
>>>>>>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>>>>>>> performant) as the previous implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> This change addresses the core-mm refactoring only and a separate
>>>>>>>>>>>> change
>>>>>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>>>>>
>>>>>>>>>>>> To ensure the arm64 is performant once implemented, this change is very
>>>>>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>>>>>
>>>>>>>>>>>> The following microbenchmark results demonstate that there is no
>>>>>>>>>>>> significant performance change after this patch. Fork is called in a
>>>>>>>>>>>> tight loop in a process with 1G of populated memory and the time for
>>>>>>>>>>>> the
>>>>>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>>>>>>>>>>> performed for case where 1G memory is comprised of order-0 folios and
>>>>>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is faster,
>>>>>>>>>>>> positive is slower, compared to baseline upon which the series is
>>>>>>>>>>>> based:
>>>>>>>>>>>>
>>>>>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>>>>>
>>>>>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>>>>>
>>>>>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>>>>> ---
>>>>>>>>>>>>        include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>        mm/memory.c             | 92
>>>>>>>>>>>> ++++++++++++++++++++++++++---------------
>>>>>>>>>>>>        2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>>>>>        #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>>>>>        #endif
>>>>>>>>>>>>        +#ifndef pte_batch_remaining
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch
>>>>>>>>>>>> boundary.
>>>>>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>>>>>> + * @addr: Address of the first page.
>>>>>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous
>>>>>>>>>>>> batch of
>>>>>>>>>>>> ptes.
>>>>>>>>>>>> + * In such cases, this function returns the remaining number of
>>>>>>>>>>>> pages to
>>>>>>>>>>>> the end
>>>>>>>>>>>> + * of the current batch, as defined by addr. This can be useful when
>>>>>>>>>>>> iterating
>>>>>>>>>>>> + * over ptes.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * May be overridden by the architecture, else batch size is always 1.
>>>>>>>>>>>> + */
>>>>>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned
>>>>>>>>>>>> long
>>>>>>>>>>>> addr,
>>>>>>>>>>>> +                        unsigned long end)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    return 1;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +#endif
>>>>>>>>>>>
>>>>>>>>>>> It's a shame we now lose the optimization for all other archtiectures.
>>>>>>>>>>>
>>>>>>>>>>> Was there no way to have some basic batching mechanism that doesn't
>>>>>>>>>>> require
>>>>>>>>>>> arch
>>>>>>>>>>> specifics?
>>>>>>>>>>
>>>>>>>>>> I tried a bunch of things but ultimately the way I've done it was the
>>>>>>>>>> only
>>>>>>>>>> way
>>>>>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>>>>>
>>>>>>>>>> My original v3 posting was costing 5% extra and even my first attempt
>>>>>>>>>> at an
>>>>>>>>>> arch-specific version that didn't resolve to a compile-time constant 1
>>>>>>>>>> still
>>>>>>>>>> cost an extra 3%.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>>>>>
>>>>>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>>>>>> * Check that PFN is consecutive
>>>>>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>>>>>
>>>>>>>>>> I haven't tried this exact approach, but I'd be surprised if I can get
>>>>>>>>>> the
>>>>>>>>>> regression under 4% with this. Further along the series I spent a lot of
>>>>>>>>>> time
>>>>>>>>>> having to fiddle with the arm64 implementation; every conditional and
>>>>>>>>>> every
>>>>>>>>>> memory read (even when in cache) was a problem. There is just so
>>>>>>>>>> little in
>>>>>>>>>> the
>>>>>>>>>> inner loop that every instruction matters. (At least on Ampere Altra and
>>>>>>>>>> Apple
>>>>>>>>>> M2).
>>>>>>>>>>
>>>>>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the
>>>>>>>>>> benefit to
>>>>>>>>>> order-9 is around 10% in my measurements. Personally though, I'd
>>>>>>>>>> prefer to
>>>>>>>>>> play
>>>>>>>>>> safe and ensure the common order-0 case doesn't regress, as you
>>>>>>>>>> previously
>>>>>>>>>> suggested.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I just hacked something up, on top of my beloved rmap cleanup/batching
>>>>>>>>> series. I
>>>>>>>>> implemented very generic and simple batching for large folios (all PTE
>>>>>>>>> bits
>>>>>>>>> except the PFN have to match).
>>>>>>>>>
>>>>>>>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R)
>>>>>>>>> Silver
>>>>>>>>> 4210R CPU.
>>>>>>>>>
>>>>>>>>> order-0: 0.014210 -> 0.013969
>>>>>>>>>
>>>>>>>>> -> Around 1.7 % faster
>>>>>>>>>
>>>>>>>>> order-9: 0.014373 -> 0.009149
>>>>>>>>>
>>>>>>>>> -> Around 36.3 % faster
>>>>>>>>
>>>>>>>> Well I guess that shows me :)
>>>>>>>>
>>>>>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>>>>>
>>>>>>>
>>>>>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>>>>>
>>>>>> I've just been trying to compile and noticed this. Will take a look at your
>>>>>> update.
>>>>>>
>>>>>> But upon review, I've noticed the part that I think makes this difficult for
>>>>>> arm64 with the contpte optimization; You are calling ptep_get() for every
>>>>>> pte in
>>>>>> the batch. While this is functionally correct, once arm64 has the contpte
>>>>>> changes, its ptep_get() has to read every pte in the contpte block in
>>>>>> order to
>>>>>> gather the access and dirty bits. So if your batching function ends up
>>>>>> wealking
>>>>>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>>>>>> performance. That's why I added the arch-specific pte_batch_remaining()
>>>>>> function; this allows the core-mm to skip to the end of the contpte block and
>>>>>> avoid ptep_get() for the 15 tail ptes. So we end up with 16 READ_ONCE()s
>>>>>> instead
>>>>>> of 256.
>>>>>>
>>>>>> I considered making a ptep_get_noyoungdirty() variant, which would avoid the
>>>>>> bit
>>>>>> gathering. But we have a similar problem in zap_pte_range() and that function
>>>>>> needs the dirty bit to update the folio. So it doesn't work there. (see
>>>>>> patch 3
>>>>>> in my series).
>>>>>>
>>>>>> I guess you are going to say that we should combine both approaches, so that
>>>>>> your batching loop can skip forward an arch-provided number of ptes? That
>>>>>> would
>>>>>> certainly work, but feels like an orthogonal change to what I'm trying to
>>>>>> achieve :). Anyway, I'll spend some time playing with it today.
>>>>>
>>>>> You can overwrite the function or add special-casing internally, yes.
>>>>>
>>>>> Right now, your patch is called "mm: Batch-copy PTE ranges during fork()"
>>>>> and it
>>>>> doesn't do any of that besides preparing for some arm64 work.
>>>>>
>>>>
>>>> Well it allows an arch to opt-in to batching. But I see your point.
>>>>
>>>> How do you want to handle your patches? Do you want to clean them up and I'll
>>>> base my stuff on top? Or do you want me to take them and sort it all out?
>>>
>>> Whatever you prefer, it was mostly a quick prototype to see if we can achieve
>>> decent performance.
>>
>> I'm about to run it on Altra and M2. But I assume it will show similar results.

OK results in, not looking great, which aligns with my previous experience. That
said, I'm seeing some "BUG: Bad page state in process gmain pfn:12a094" so
perhaps these results are not valid...

100 iterations per run, 8 runs over 2 reboots. Positive is slower than baseline,
negative is faster:

Fork, order-0, Apple M2 VM:
| kernel                | mean_rel | std_rel |
|:----------------------|---------:|--------:|
| mm-unstable           |     0.0% |    0.8% |
| hugetlb-rmap-cleanups |     1.3% |    2.0% |
| fork-batching         |     3.5% |    1.2% |

Fork, order-9, Apple M2 VM:
| kernel                | mean_rel | std_rel |
|:----------------------|---------:|--------:|
| mm-unstable           |     0.0% |    0.8% |
| hugetlb-rmap-cleanups |     0.9% |    0.9% |
| fork-batching         |   -35.6% |    2.0% |

Fork, order-0, Ampere Altra:
| kernel                | mean_rel | std_rel |
|:----------------------|---------:|--------:|
| mm-unstable           |     0.0% |    0.7% |
| hugetlb-rmap-cleanups |     3.2% |    0.7% |
| fork-batching         |     5.5% |    1.1% |

Fork, order-9, Ampere Altra:
| kernel                | mean_rel | std_rel |
|:----------------------|---------:|--------:|
| mm-unstable           |     0.0% |    0.1% |
| hugetlb-rmap-cleanups |     0.5% |    0.1% |
| fork-batching         |   -10.3% |    0.1% |


>>
>>>
>>> I can fixup the arch thingies (most should be easy, some might require a custom
>>> pte_next_pfn())
>>
>> Well if you're happy to do that, great! I'm keen to get the contpte stuff into
>> v6.9 if at all possible, and I'm conscious that I'm introducing more dependencies
>> on you. And it's about to be holiday season...
>
> There is still plenty of time for 6.9. I'll try to get the rmap cleanup finished
> asap.
>
>>
>>> and you can focus on getting cont-pte sorted out on top [I
>>> assume that's what you want to work on :) ].
>>
>> That's certainly what I'm focussed on. But I'm happy to do whatever is required
>> to get it over the line. I guess I'll start by finishing my review of your v1
>> rmap stuff.
>
> I'm planning on sending out a new version today.
>
>>
>>>
>>>>
>>>> As I see it at the moment, I would keep your folio_pte_batch() always core, but
>>>> in subsequent patch, have it use pte_batch_remaining() (the arch function I
>>>> have
>>>> in my series, which defaults to one).
>>>
>>> Just double-checking, how would it use pte_batch_remaining() ?
>>
>> I think something like this would do it (untested):
>>
>> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>         pte_t *start_ptep, pte_t pte, int max_nr)
>> {
>>     unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>     pte_t expected_pte = pte_next_pfn(pte);
>>     pte_t *ptep = start_ptep;
>>     int nr;
>>
>>     for (;;) {
>>         nr = min(max_nr, pte_batch_remaining());
>>         ptep += nr;
>>         max_nr -= nr;
>>
>>         if (max_nr == 0)
>>             break;
>>
>
> expected_pte would be messed up. We'd have to increment it a couple of times to
> make it match the nr of pages we're skipping.

Ahh, good point.

>
>>         pte = ptep_get(ptep);
>>
>>         /* Do all PTE bits match, and the PFN is consecutive? */
>>         if (!pte_same(pte, expected_pte))
>>             break;
>>
>>         /*
>>          * Stop immediately once we reached the end of the folio. In
>>          * corner cases the next PFN might fall into a different
>>          * folio.
>>          */
>>         if (pte_pfn(pte) == folio_end_pfn - 1)
>>             break;
>>
>>         expected_pte = pte_next_pfn(expected_pte);
>>     }
>>
>>     return ptep - start_ptep;
>> }
>>
>> Of course, if we have the concept of a "pte batch" in the core-mm, then we might
>> want to call the arch's thing something different; pte span? pte cont? pte cont
>> batch? ...
>
> So, you mean something like
>
> /*
>  * The architecture might be able to tell us efficiently using cont-pte
>  * bits how many next PTEs are certainly compatible. So in that case,
>  * simply skip forward.
>  */
> nr = min(max_nr, nr_cont_ptes(ptep));
> ...
>
> I wonder if something simple at the start of the function might be good enough
> for arm with cont-pte as a first step:
>
> nr = nr_cont_ptes(start_ptep)
> if (nr != 1) {
>     return min(max_nr, nr);
> }

Yeah that would probably work. But we need to be careful for the case where
start_ptep is in the middle of a contpte block (which can happen - due to some
vma splitting operations, we can have a contpte block that spans 2 vmas). So
nr_cont_ptes() needs to either be spec'ed to only return the contpte size when
start_ptep points to the front of the block (and return 1 at all other times),
or to return the number of ptes remaining to the end of the block (as it does
in my v4).
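
For the second option, an arm64-flavoured sketch might look something like
this (untested; pte_cont() and CONT_PTES are the existing arm64 helpers, but
the function name and signature are just placeholders - taking the pte value
and address here rather than the pointer, purely for illustration):

static inline unsigned int nr_cont_ptes(pte_t pte, unsigned long addr)
{
    unsigned int index;

    /* Not part of a contiguous block: the batch is a single entry. */
    if (!pte_cont(pte))
        return 1;

    /*
     * addr may land in the middle of a contpte block (e.g. after a vma
     * split), so only report the entries from here to the end of the
     * block; the caller caps the result with max_nr.
     */
    index = (addr >> PAGE_SHIFT) % CONT_PTES;
    return CONT_PTES - index;
}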

But I guess we need to get to the bottom of my arm64 perf numbers first... I'll
debug those bugs and rerun.

>
> Which would get optimized out on other architectures.
>
>


2023-12-20 11:37:30

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20.12.23 12:28, Ryan Roberts wrote:
> On 20/12/2023 10:56, David Hildenbrand wrote:
>> On 20.12.23 11:41, Ryan Roberts wrote:
>>> On 20/12/2023 10:16, David Hildenbrand wrote:
>>>> On 20.12.23 11:11, Ryan Roberts wrote:
>>>>> On 20/12/2023 09:54, David Hildenbrand wrote:
>>>>>> On 20.12.23 10:51, Ryan Roberts wrote:
>>>>>>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>>>>>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>>>>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>>>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of
>>>>>>>>>>>>> memory,
>>>>>>>>>>>>> all belonging to the same folio. A pte batch is then write-protected in
>>>>>>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects() and is
>>>>>>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>>>>>>
>>>>>>>>>>>>> The primary motivation for this change is to reduce the number of tlb
>>>>>>>>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>>>>>>>>> fork, as it is about to add transparent support for the "contiguous
>>>>>>>>>>>>> bit"
>>>>>>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>>>>>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>>>>>>>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>>>>>>>>>>>> child, the backend does not need to fold a contiguous range once they
>>>>>>>>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>>>>>>>>> range in the first place.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This code is very performance sensitive, and a significant amount of
>>>>>>>>>>>>> effort has been put into not regressing performance for the order-0
>>>>>>>>>>>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>>>>>>>>>>>> which enables the compiler to simplify the extra loops that are added
>>>>>>>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>>>>>>>> performant) as the previous implementation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This change addresses the core-mm refactoring only and a separate
>>>>>>>>>>>>> change
>>>>>>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To ensure the arm64 is performant once implemented, this change is very
>>>>>>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The following microbenchmark results demonstate that there is no
>>>>>>>>>>>>> significant performance change after this patch. Fork is called in a
>>>>>>>>>>>>> tight loop in a process with 1G of populated memory and the time for
>>>>>>>>>>>>> the
>>>>>>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>>>>>>>>>>>> performed for case where 1G memory is comprised of order-0 folios and
>>>>>>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is faster,
>>>>>>>>>>>>> positive is slower, compared to baseline upon which the series is
>>>>>>>>>>>>> based:
>>>>>>>>>>>>>
>>>>>>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>>>>>>
>>>>>>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>        include/linux/pgtable.h | 80 +++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>        mm/memory.c             | 92
>>>>>>>>>>>>> ++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>        2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>>>>>>        #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>>>>>>        #endif
>>>>>>>>>>>>>        +#ifndef pte_batch_remaining
>>>>>>>>>>>>> +/**
>>>>>>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch
>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>>>>>>> + * @addr: Address of the first page.
>>>>>>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous
>>>>>>>>>>>>> batch of
>>>>>>>>>>>>> ptes.
>>>>>>>>>>>>> + * In such cases, this function returns the remaining number of
>>>>>>>>>>>>> pages to
>>>>>>>>>>>>> the end
>>>>>>>>>>>>> + * of the current batch, as defined by addr. This can be useful when
>>>>>>>>>>>>> iterating
>>>>>>>>>>>>> + * over ptes.
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * May be overridden by the architecture, else batch size is always 1.
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned
>>>>>>>>>>>>> long
>>>>>>>>>>>>> addr,
>>>>>>>>>>>>> +                        unsigned long end)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    return 1;
>>>>>>>>>>>>> +}
>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>
>>>>>>>>>>>> It's a shame we now lose the optimization for all other architectures.
>>>>>>>>>>>>
>>>>>>>>>>>> Was there no way to have some basic batching mechanism that doesn't
>>>>>>>>>>>> require
>>>>>>>>>>>> arch
>>>>>>>>>>>> specifics?
>>>>>>>>>>>
>>>>>>>>>>> I tried a bunch of things but ultimately the way I've done it was the
>>>>>>>>>>> only
>>>>>>>>>>> way
>>>>>>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>>>>>>
>>>>>>>>>>> My original v3 posting was costing 5% extra and even my first attempt
>>>>>>>>>>> at an
>>>>>>>>>>> arch-specific version that didn't resolve to a compile-time constant 1
>>>>>>>>>>> still
>>>>>>>>>>> cost an extra 3%.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>>>>>>
>>>>>>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>>>>>>> * Check that PFN is consecutive
>>>>>>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>>>>>>
>>>>>>>>>>> I haven't tried this exact approach, but I'd be surprised if I can get
>>>>>>>>>>> the
>>>>>>>>>>> regression under 4% with this. Further along the series I spent a lot of
>>>>>>>>>>> time
>>>>>>>>>>> having to fiddle with the arm64 implementation; every conditional and
>>>>>>>>>>> every
>>>>>>>>>>> memory read (even when in cache) was a problem. There is just so
>>>>>>>>>>> little in
>>>>>>>>>>> the
>>>>>>>>>>> inner loop that every instruction matters. (At least on Ampere Altra and
>>>>>>>>>>> Apple
>>>>>>>>>>> M2).
>>>>>>>>>>>
>>>>>>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the
>>>>>>>>>>> benefit to
>>>>>>>>>>> order-9 is around 10% in my measurements. Personally though, I'd
>>>>>>>>>>> prefer to
>>>>>>>>>>> play
>>>>>>>>>>> safe and ensure the common order-0 case doesn't regress, as you
>>>>>>>>>>> previously
>>>>>>>>>>> suggested.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I just hacked something up, on top of my beloved rmap cleanup/batching
>>>>>>>>>> series. I
>>>>>>>>>> implemented very generic and simple batching for large folios (all PTE
>>>>>>>>>> bits
>>>>>>>>>> except the PFN have to match).
>>>>>>>>>>
>>>>>>>>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R)
>>>>>>>>>> Silver
>>>>>>>>>> 4210R CPU.
>>>>>>>>>>
>>>>>>>>>> order-0: 0.014210 -> 0.013969
>>>>>>>>>>
>>>>>>>>>> -> Around 1.7 % faster
>>>>>>>>>>
>>>>>>>>>> order-9: 0.014373 -> 0.009149
>>>>>>>>>>
>>>>>>>>>> -> Around 36.3 % faster
>>>>>>>>>
>>>>>>>>> Well I guess that shows me :)
>>>>>>>>>
>>>>>>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>>>>>>
>>>>>>>>
>>>>>>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>>>>>>
>>>>>>> I've just been trying to compile and noticed this. Will take a look at your
>>>>>>> update.
>>>>>>>
>>>>>>> But upon review, I've noticed the part that I think makes this difficult for
>>>>>>> arm64 with the contpte optimization; You are calling ptep_get() for every
>>>>>>> pte in
>>>>>>> the batch. While this is functionally correct, once arm64 has the contpte
>>>>>>> changes, its ptep_get() has to read every pte in the contpte block in
>>>>>>> order to
>>>>>>> gather the access and dirty bits. So if your batching function ends up
>>>>>>> walking
>>>>>>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>>>>>>> performance. That's why I added the arch-specific pte_batch_remaining()
>>>>>>> function; this allows the core-mm to skip to the end of the contpte block and
>>>>>>> avoid ptep_get() for the 15 tail ptes. So we end up with 16 READ_ONCE()s
>>>>>>> instead
>>>>>>> of 256.
>>>>>>>
>>>>>>> I considered making a ptep_get_noyoungdirty() variant, which would avoid the
>>>>>>> bit
>>>>>>> gathering. But we have a similar problem in zap_pte_range() and that function
>>>>>>> needs the dirty bit to update the folio. So it doesn't work there. (see
>>>>>>> patch 3
>>>>>>> in my series).
>>>>>>>
>>>>>>> I guess you are going to say that we should combine both approaches, so that
>>>>>>> your batching loop can skip forward an arch-provided number of ptes? That
>>>>>>> would
>>>>>>> certainly work, but feels like an orthogonal change to what I'm trying to
>>>>>>> achieve :). Anyway, I'll spend some time playing with it today.
>>>>>>
>>>>>> You can overwrite the function or add special-casing internally, yes.
>>>>>>
>>>>>> Right now, your patch is called "mm: Batch-copy PTE ranges during fork()"
>>>>>> and it
>>>>>> doesn't do any of that besides preparing for some arm64 work.
>>>>>>
>>>>>
>>>>> Well it allows an arch to opt-in to batching. But I see your point.
>>>>>
>>>>> How do you want to handle your patches? Do you want to clean them up and I'll
>>>>> base my stuff on top? Or do you want me to take them and sort it all out?
>>>>
>>>> Whatever you prefer, it was mostly a quick prototype to see if we can achieve
>>>> decent performance.
>>>
>>> I'm about to run it on Altra and M2. But I assume it will show similar results.
>
> OK results in, not looking great, which aligns with my previous experience. That
> said, I'm seeing some "BUG: Bad page state in process gmain pfn:12a094" so
> perhaps these results are not valid...

I didn't see that so far on x86, maybe related to the PFN fixup?

>
> 100 iterations per run, 8 runs over 2 reboots. Positive is slower than baseline,
> negative is faster:
>
> Fork, order-0, Apple M2 VM:
> | kernel | mean_rel | std_rel |
> |:----------------------|-----------:|----------:|
> | mm-unstable | 0.0% | 0.8% |
> | hugetlb-rmap-cleanups | 1.3% | 2.0% |
> | fork-batching | 3.5% | 1.2% |
>
> Fork, order-9, Apple M2 VM:
> | kernel | mean_rel | std_rel |
> |:----------------------|-----------:|----------:|
> | mm-unstable | 0.0% | 0.8% |
> | hugetlb-rmap-cleanups | 0.9% | 0.9% |
> | fork-batching | -35.6% | 2.0% |
>
> Fork, order-0, Ampere Altra:
> | kernel | mean_rel | std_rel |
> |:----------------------|-----------:|----------:|
> | mm-unstable | 0.0% | 0.7% |
> | hugetlb-rmap-cleanups | 3.2% | 0.7% |
> | fork-batching | 5.5% | 1.1% |
>
> Fork, order-9, Ampere Altra:
> | kernel | mean_rel | std_rel |
> |:----------------------|-----------:|----------:|
> | mm-unstable | 0.0% | 0.1% |
> | hugetlb-rmap-cleanups | 0.5% | 0.1% |
> | fork-batching | -10.3% | 0.1% |

It's weird that an effective folio_test_large() should affect
performance that much. So far I haven't seen that behavior on x86; I
wonder why arm64 should behave here differently (also for the rmap
cleanups). Code layout/size?

I'll dig it up again and test on x86 once more.
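
For reference, the shape of what I hacked up is roughly the following (a
simplified sketch from memory, not the actual patch; the helper name and
exact checks are illustrative). The only requirements are that the folio is
large and that all pte bits apart from the PFN match:

/*
 * Count how many consecutive ptes, starting at start_ptep/pte, map
 * successive pages of the same (large) folio with otherwise identical
 * pte bits. max_nr is the cap computed by the caller (end of the vma
 * or of the page table).
 */
static inline int folio_pte_batch(struct folio *folio, pte_t *start_ptep,
				  pte_t pte, int max_nr)
{
	unsigned long end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
	pte_t expected = pte_next_pfn(pte);
	pte_t *ptep = start_ptep + 1;
	int nr = 1;

	while (nr < max_nr && pte_pfn(expected) < end_pfn) {
		if (!pte_same(ptep_get(ptep), expected))
			break;
		expected = pte_next_pfn(expected);
		ptep++;
		nr++;
	}
	return nr;
}

copy_pte_range() then copies and write-protects nr entries in one go instead
of looping per pte.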

[...]

>
> Yeah that would probably work. But we need to be careful for the case where
> start_ptep is in the middle of a contpte block (which can happen - due to some
> vma splitting operations, we can have a contpte block that spans 2 vmas). So
> nr_cont_ptes() needs to either be spec'ed to only return the contpte size if
> start_ptep is pointing to the front of the block, and all other times, return 1,
> or it needs to return the number of ptes remaining to the end of the block (as
> it does in my v4).
>
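
(For my own understanding, the semantics you describe for the arm64-provided
helper would be roughly the sketch below? This is only my reading of the
kernel-doc quoted further up, reusing the existing CONT_PTE_* constants; it
is not your actual v4 code.)

static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long addr,
					       unsigned long end)
{
	/*
	 * Non-contpte mappings batch as a single pte. For contpte
	 * mappings, return the number of ptes from addr to the end of
	 * the contpte block, clamped to the caller's end, so a
	 * start_ptep in the middle of a block yields a short batch.
	 */
	if (!pte_valid_cont(pte))
		return 1;

	return min((CONT_PTE_SIZE - (addr & ~CONT_PTE_MASK)) >> PAGE_SHIFT,
		   (end - addr) >> PAGE_SHIFT);
}
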
> But I guess we need to get to the bottom of my arm64 perf numbers first... I'll
> debug those bugs and rerun.

Yes, I'll dig into it on x86 once more.

--
Cheers,

David / dhildenb


2023-12-20 11:52:10

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20/12/2023 11:36, David Hildenbrand wrote:
> On 20.12.23 12:28, Ryan Roberts wrote:
>> On 20/12/2023 10:56, David Hildenbrand wrote:
>>> On 20.12.23 11:41, Ryan Roberts wrote:
>>>> On 20/12/2023 10:16, David Hildenbrand wrote:
>>>>> On 20.12.23 11:11, Ryan Roberts wrote:
>>>>>> On 20/12/2023 09:54, David Hildenbrand wrote:
>>>>>>> On 20.12.23 10:51, Ryan Roberts wrote:
>>>>>>>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>>>>>>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>>>>>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>>>>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>>>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>>>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of
>>>>>>>>>>>>>> memory,
>>>>>>>>>>>>>> all belonging to the same folio. A pte batch is then
>>>>>>>>>>>>>> write-protected in
>>>>>>>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects()
>>>>>>>>>>>>>> and is
>>>>>>>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The primary motivation for this change is to reduce the number of tlb
>>>>>>>>>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>>>>>>>>>> fork, as it is about to add transparent support for the "contiguous
>>>>>>>>>>>>>> bit"
>>>>>>>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>>>>>>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>>>>>>>>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>>>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>>>>>>>>>>>>> child, the backend does not need to fold a contiguous range once they
>>>>>>>>>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>>>>>>>>>> range in the first place.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This code is very performance sensitive, and a significant amount of
>>>>>>>>>>>>>> effort has been put into not regressing performance for the order-0
>>>>>>>>>>>>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>>>>>>>>>>>>> which enables the compiler to simplify the extra loops that are added
>>>>>>>>>>>>>> for batching and produce code that is equivalent to (and as
>>>>>>>>>>>>>> performant as) the previous implementation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This change addresses the core-mm refactoring only and a separate
>>>>>>>>>>>>>> change
>>>>>>>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To ensure the arm64 backend is performant once implemented, this change is
>>>>>>>>>>>>>> very
>>>>>>>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The following microbenchmark results demonstrate that there is no
>>>>>>>>>>>>>> significant performance change after this patch. Fork is called in a
>>>>>>>>>>>>>> tight loop in a process with 1G of populated memory and the time for
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>>>>>>>>>>>>> performed for case where 1G memory is comprised of order-0 folios and
>>>>>>>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is
>>>>>>>>>>>>>> faster,
>>>>>>>>>>>>>> positive is slower, compared to baseline upon which the series is
>>>>>>>>>>>>>> based:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>         include/linux/pgtable.h | 80
>>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>         mm/memory.c             | 92
>>>>>>>>>>>>>> ++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>         2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>>>>>>>         #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>>>>>>>         #endif
>>>>>>>>>>>>>>         +#ifndef pte_batch_remaining
>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch
>>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>>>>>>>> + * @addr: Address of the first page.
>>>>>>>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous
>>>>>>>>>>>>>> batch of
>>>>>>>>>>>>>> ptes.
>>>>>>>>>>>>>> + * In such cases, this function returns the remaining number of
>>>>>>>>>>>>>> pages to
>>>>>>>>>>>>>> the end
>>>>>>>>>>>>>> + * of the current batch, as defined by addr. This can be useful when
>>>>>>>>>>>>>> iterating
>>>>>>>>>>>>>> + * over ptes.
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * May be overridden by the architecture, else batch size is
>>>>>>>>>>>>>> always 1.
>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned
>>>>>>>>>>>>>> long
>>>>>>>>>>>>>> addr,
>>>>>>>>>>>>>> +                        unsigned long end)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +    return 1;
>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's a shame we now lose the optimization for all other architectures.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Was there no way to have some basic batching mechanism that doesn't
>>>>>>>>>>>>> require
>>>>>>>>>>>>> arch
>>>>>>>>>>>>> specifics?
>>>>>>>>>>>>
>>>>>>>>>>>> I tried a bunch of things but ultimately the way I've done it was the
>>>>>>>>>>>> only
>>>>>>>>>>>> way
>>>>>>>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>>>>>>>
>>>>>>>>>>>> My original v3 posting was costing 5% extra and even my first attempt
>>>>>>>>>>>> at an
>>>>>>>>>>>> arch-specific version that didn't resolve to a compile-time constant 1
>>>>>>>>>>>> still
>>>>>>>>>>>> cost an extra 3%.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>>>>>>>> * Check that PFN is consecutive
>>>>>>>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>>>>>>>
>>>>>>>>>>>> I haven't tried this exact approach, but I'd be surprised if I can get
>>>>>>>>>>>> the
>>>>>>>>>>>> regression under 4% with this. Further along the series I spent a
>>>>>>>>>>>> lot of
>>>>>>>>>>>> time
>>>>>>>>>>>> having to fiddle with the arm64 implementation; every conditional and
>>>>>>>>>>>> every
>>>>>>>>>>>> memory read (even when in cache) was a problem. There is just so
>>>>>>>>>>>> little in
>>>>>>>>>>>> the
>>>>>>>>>>>> inner loop that every instruction matters. (At least on Ampere Altra
>>>>>>>>>>>> and
>>>>>>>>>>>> Apple
>>>>>>>>>>>> M2).
>>>>>>>>>>>>
>>>>>>>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the
>>>>>>>>>>>> benefit to
>>>>>>>>>>>> order-9 is around 10% in my measurements. Personally though, I'd
>>>>>>>>>>>> prefer to
>>>>>>>>>>>> play
>>>>>>>>>>>> safe and ensure the common order-0 case doesn't regress, as you
>>>>>>>>>>>> previously
>>>>>>>>>>>> suggested.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I just hacked something up, on top of my beloved rmap cleanup/batching
>>>>>>>>>>> series. I
>>>>>>>>>>> implemented very generic and simple batching for large folios (all PTE
>>>>>>>>>>> bits
>>>>>>>>>>> except the PFN have to match).
>>>>>>>>>>>
>>>>>>>>>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R)
>>>>>>>>>>> Silver
>>>>>>>>>>> 4210R CPU.
>>>>>>>>>>>
>>>>>>>>>>> order-0: 0.014210 -> 0.013969
>>>>>>>>>>>
>>>>>>>>>>> -> Around 1.7 % faster
>>>>>>>>>>>
>>>>>>>>>>> order-9: 0.014373 -> 0.009149
>>>>>>>>>>>
>>>>>>>>>>> -> Around 36.3 % faster
>>>>>>>>>>
>>>>>>>>>> Well I guess that shows me :)
>>>>>>>>>>
>>>>>>>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>>>>>>>
>>>>>>>> I've just been trying to compile and noticed this. Will take a look at your
>>>>>>>> update.
>>>>>>>>
>>>>>>>> But upon review, I've noticed the part that I think makes this difficult
>>>>>>>> for
>>>>>>>> arm64 with the contpte optimization; You are calling ptep_get() for every
>>>>>>>> pte in
>>>>>>>> the batch. While this is functionally correct, once arm64 has the contpte
>>>>>>>> changes, its ptep_get() has to read every pte in the contpte block in
>>>>>>>> order to
>>>>>>>> gather the access and dirty bits. So if your batching function ends up
>>>>>>>> walking
>>>>>>>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>>>>>>>> performance. That's why I added the arch-specific pte_batch_remaining()
>>>>>>>> function; this allows the core-mm to skip to the end of the contpte
>>>>>>>> block and
>>>>>>>> avoid ptep_get() for the 15 tail ptes. So we end up with 16 READ_ONCE()s
>>>>>>>> instead
>>>>>>>> of 256.
>>>>>>>>
>>>>>>>> I considered making a ptep_get_noyoungdirty() variant, which would avoid
>>>>>>>> the
>>>>>>>> bit
>>>>>>>> gathering. But we have a similar problem in zap_pte_range() and that
>>>>>>>> function
>>>>>>>> needs the dirty bit to update the folio. So it doesn't work there. (see
>>>>>>>> patch 3
>>>>>>>> in my series).
>>>>>>>>
>>>>>>>> I guess you are going to say that we should combine both approaches, so
>>>>>>>> that
>>>>>>>> your batching loop can skip forward an arch-provided number of ptes? That
>>>>>>>> would
>>>>>>>> certainly work, but feels like an orthogonal change to what I'm trying to
>>>>>>>> achieve :). Anyway, I'll spend some time playing with it today.
>>>>>>>
>>>>>>> You can overwrite the function or add special-casing internally, yes.
>>>>>>>
>>>>>>> Right now, your patch is called "mm: Batch-copy PTE ranges during fork()"
>>>>>>> and it
>>>>>>> doesn't do any of that besides preparing for some arm64 work.
>>>>>>>
>>>>>>
>>>>>> Well it allows an arch to opt-in to batching. But I see your point.
>>>>>>
>>>>>> How do you want to handle your patches? Do you want to clean them up and I'll
>>>>>> base my stuff on top? Or do you want me to take them and sort it all out?
>>>>>
>>>>> Whatever you prefer, it was mostly a quick prototype to see if we can achieve
>>>>> decent performance.
>>>>
>>>> I'm about to run it on Altra and M2. But I assume it will show similar results.
>>
>> OK results in, not looking great, which aligns with my previous experience. That
>> said, I'm seeing some "BUG: Bad page state in process gmain  pfn:12a094" so
>> perhaps these results are not valid...
>
> I didn't see that so far on x86, maybe related to the PFN fixup?

All I've done is define PFN_PTE_SHIFT for arm64 on top of your latest patch:

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b19a8aee684c..9eb0fd693df9 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -359,6 +359,8 @@ static inline void set_ptes(struct mm_struct *mm,
}
#define set_ptes set_ptes

+#define PFN_PTE_SHIFT PAGE_SHIFT
+
/*
* Huge pte definitions.
*/


As an aside, I think there is a bug in arm64's set_ptes() for PA > 48-bit case. But that won't affect this.
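
For completeness, the generic helper I'm assuming your pte_next_pfn() fixup
adds is something along these lines (my guess at the shape based on the
PFN_PTE_SHIFT define, not your actual patch), plus the sort of arm64
override I'd eventually want so we don't rely on raw pte_val() arithmetic
for the PA > 48-bit layout (name illustrative, untested):

#ifndef pte_next_pfn
static inline pte_t pte_next_pfn(pte_t pte)
{
	/* Assumes the PFN is a plain bitfield starting at PFN_PTE_SHIFT. */
	return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
}
#endif

/*
 * Possible arm64 override: round-trip through pte_pfn()/pfn_pte() so any
 * PA bits above bit 47 get re-encoded into the right pte bits.
 */
static inline pte_t arm64_pte_next_pfn(pte_t pte)
{
	return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
}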


With VM_DEBUG on, this is the first warning I see during boot:


[ 0.278110] page:00000000c7ced4e8 refcount:12 mapcount:0 mapping:00000000b2f9739b index:0x1a8 pfn:0x1bff30
[ 0.278742] head:00000000c7ced4e8 order:2 entire_mapcount:0 nr_pages_mapped:2 pincount:0
[ 0.279247] memcg:ffff1a678008a000
[ 0.279518] aops:xfs_address_space_operations ino:b0f70c dentry name:"systemd"
[ 0.279746] flags: 0xbfffc0000008068(uptodate|lru|private|head|node=0|zone=2|lastcpupid=0xffff)
[ 0.280003] page_type: 0xffffffff()
[ 0.280110] raw: 0bfffc0000008068 fffffc699effcb08 fffffc699effcd08 ffff1a678980a6b0
[ 0.280338] raw: 00000000000001a8 ffff1a678a0f0200 0000000cffffffff ffff1a678008a000
[ 0.280564] page dumped because: VM_WARN_ON_FOLIO((_Generic((page + nr_pages - 1), const struct page *: (const struct folio *)_compound_head(page + nr_pages - 1), struct page *: (struct folio *)_compound_head(page + nr_pages - 1))) != folio)
[ 0.281196] ------------[ cut here ]------------
[ 0.281349] WARNING: CPU: 2 PID: 1 at include/linux/rmap.h:208 __folio_rmap_sanity_checks.constprop.0+0x168/0x188
[ 0.281650] Modules linked in:
[ 0.281752] CPU: 2 PID: 1 Comm: systemd Not tainted 6.7.0-rc4-00345-gdb45492bba9d #7
[ 0.281959] Hardware name: linux,dummy-virt (DT)
[ 0.282079] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 0.282260] pc : __folio_rmap_sanity_checks.constprop.0+0x168/0x188
[ 0.282421] lr : __folio_rmap_sanity_checks.constprop.0+0x168/0x188
[ 0.282583] sp : ffff80008007b9e0
[ 0.282670] x29: ffff80008007b9e0 x28: 0000aaaacbecb000 x27: fffffc699effccc0
[ 0.282872] x26: 00600001bff33fc3 x25: 0000000000000001 x24: ffff1a678a302228
[ 0.283062] x23: ffff1a678a326658 x22: 0000000000000000 x21: 0000000000000004
[ 0.283246] x20: fffffc699effccc0 x19: fffffc699effcc00 x18: 0000000000000000
[ 0.283435] x17: 3736613166666666 x16: 2066666666666666 x15: 0720072007200720
[ 0.283679] x14: 0720072007200720 x13: 0720072007200720 x12: 0720072007200720
[ 0.283933] x11: 0720072007200720 x10: ffffa89ecd79ba50 x9 : ffffa89ecab23054
[ 0.284214] x8 : ffffa89ecd743a50 x7 : ffffa89ecd79ba50 x6 : 0000000000000000
[ 0.284545] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000
[ 0.284875] x2 : 0000000000000000 x1 : ffff1a6781420000 x0 : 00000000000000e5
[ 0.285205] Call trace:
[ 0.285320] __folio_rmap_sanity_checks.constprop.0+0x168/0x188
[ 0.285594] copy_page_range+0x1180/0x1328
[ 0.285788] copy_process+0x1b04/0x1db8
[ 0.285933] kernel_clone+0x94/0x3f8
[ 0.286078] __do_sys_clone+0x58/0x88
[ 0.286247] __arm64_sys_clone+0x28/0x40
[ 0.286430] invoke_syscall+0x50/0x128
[ 0.286607] el0_svc_common.constprop.0+0x48/0xf0
[ 0.286826] do_el0_svc+0x24/0x38
[ 0.286983] el0_svc+0x34/0xb8
[ 0.287142] el0t_64_sync_handler+0xc0/0xc8
[ 0.287339] el0t_64_sync+0x190/0x198
[ 0.287514] ---[ end trace 0000000000000000 ]---


>
>>
>> 100 iterations per run, 8 runs over 2 reboots. Positive is slower than baseline,
>> negative is faster:
>>
>> Fork, order-0, Apple M2 VM:
>> | kernel                |   mean_rel |   std_rel |
>> |:----------------------|-----------:|----------:|
>> | mm-unstable           |       0.0% |      0.8% |
>> | hugetlb-rmap-cleanups |       1.3% |      2.0% |
>> | fork-batching         |       3.5% |      1.2% |
>>
>> Fork, order-9, Apple M2 VM:
>> | kernel                |   mean_rel |   std_rel |
>> |:----------------------|-----------:|----------:|
>> | mm-unstable           |       0.0% |      0.8% |
>> | hugetlb-rmap-cleanups |       0.9% |      0.9% |
>> | fork-batching         |     -35.6% |      2.0% |
>>
>> Fork, order-0, Ampere Altra:
>> | kernel                |   mean_rel |   std_rel |
>> |:----------------------|-----------:|----------:|
>> | mm-unstable           |       0.0% |      0.7% |
>> | hugetlb-rmap-cleanups |       3.2% |      0.7% |
>> | fork-batching         |       5.5% |      1.1% |
>>
>> Fork, order-9, Ampere Altra:
>> | kernel                |   mean_rel |   std_rel |
>> |:----------------------|-----------:|----------:|
>> | mm-unstable           |       0.0% |      0.1% |
>> | hugetlb-rmap-cleanups |       0.5% |      0.1% |
>> | fork-batching         |     -10.3% |      0.1% |
>
> It's weird that an effective folio_test_large() should affect performance that
> much. So far I haven't seen that behavior on x86; I wonder why arm64 should
> behave here differently (also for the rmap cleanups). Code layout/size?
>
> I'll dig it up again and test on x86 once more.
>
> [...]
>
>>
>> Yeah that would probably work. But we need to be careful for the case where
>> start_ptep is in the middle of a contpte block (which can happen - due to some
>> vma splitting operations, we can have a contpte block that spans 2 vmas). So
>> nr_cont_ptes() needs to either be spec'ed to only return the contpte size if
>> start_ptep is pointing to the front of the block, and all other times, return 1,
>> or it needs to return the number of ptes remaining to the end of the block (as
>> it does in my v4).
>>
>> But I guess we need to get to the bottom of my arm64 perf numbers first... I'll
>> debug those bugs and rerun.
>
> Yes, I'll dig into it on x86 once more.
>


2023-12-20 12:04:45

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20/12/2023 11:58, David Hildenbrand wrote:
> On 20.12.23 12:51, Ryan Roberts wrote:
>> On 20/12/2023 11:36, David Hildenbrand wrote:
>>> On 20.12.23 12:28, Ryan Roberts wrote:
>>>> On 20/12/2023 10:56, David Hildenbrand wrote:
>>>>> On 20.12.23 11:41, Ryan Roberts wrote:
>>>>>> On 20/12/2023 10:16, David Hildenbrand wrote:
>>>>>>> On 20.12.23 11:11, Ryan Roberts wrote:
>>>>>>>> On 20/12/2023 09:54, David Hildenbrand wrote:
>>>>>>>>> On 20.12.23 10:51, Ryan Roberts wrote:
>>>>>>>>>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>>>>>>>>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>>>>>>>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>>>>>>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>>>>>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>>>>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>>>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>>>>>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>>>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of
>>>>>>>>>>>>>>>> memory,
>>>>>>>>>>>>>>>> all belonging to the same folio. A pte batch is then
>>>>>>>>>>>>>>>> write-protected in
>>>>>>>>>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects()
>>>>>>>>>>>>>>>> and is
>>>>>>>>>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The primary motivation for this change is to reduce the number
>>>>>>>>>>>>>>>> of tlb
>>>>>>>>>>>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>>>>>>>>>>>> fork, as it is about to add transparent support for the "contiguous
>>>>>>>>>>>>>>>> bit"
>>>>>>>>>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>>>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>>>>>>>>>>> backend
>>>>>>>>>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is
>>>>>>>>>>>>>>>> expensive,
>>>>>>>>>>>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>>>>>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> child, the backend does not need to fold a contiguous range once
>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>>>>>>>>>>>> range in the first place.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This code is very performance sensitive, and a significant
>>>>>>>>>>>>>>>> amount of
>>>>>>>>>>>>>>>> effort has been put into not regressing performance for the order-0
>>>>>>>>>>>>>>>> folio case. By default, pte_batch_remaining() is compile
>>>>>>>>>>>>>>>> constant 1,
>>>>>>>>>>>>>>>> which enables the compiler to simplify the extra loops that are
>>>>>>>>>>>>>>>> added
>>>>>>>>>>>>>>>> for batching and produce code that is equivalent to (and as
>>>>>>>>>>>>>>>> performant as) the previous implementation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This change addresses the core-mm refactoring only and a separate
>>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>>>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>>>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To ensure the arm64 backend is performant once implemented, this change is
>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The following microbenchmark results demonstrate that there is no
>>>>>>>>>>>>>>>> significant performance change after this patch. Fork is called
>>>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>> tight loop in a process with 1G of populated memory and the time
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>>>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal).
>>>>>>>>>>>>>>>> Tests
>>>>>>>>>>>>>>>> performed for case where 1G memory is comprised of order-0
>>>>>>>>>>>>>>>> folios and
>>>>>>>>>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is
>>>>>>>>>>>>>>>> faster,
>>>>>>>>>>>>>>>> positive is slower, compared to baseline upon which the series is
>>>>>>>>>>>>>>>> based:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>>>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>>>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>>>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>          include/linux/pgtable.h | 80
>>>>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>>          mm/memory.c             | 92
>>>>>>>>>>>>>>>> ++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>>>          2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>>>>>>>>>          #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>>>>>>>>>          #endif
>>>>>>>>>>>>>>>>          +#ifndef pte_batch_remaining
>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch
>>>>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>>>>>>>>>> + * @addr: Address of the first page.
>>>>>>>>>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous
>>>>>>>>>>>>>>>> batch of
>>>>>>>>>>>>>>>> ptes.
>>>>>>>>>>>>>>>> + * In such cases, this function returns the remaining number of
>>>>>>>>>>>>>>>> pages to
>>>>>>>>>>>>>>>> the end
>>>>>>>>>>>>>>>> + * of the current batch, as defined by addr. This can be useful
>>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>> iterating
>>>>>>>>>>>>>>>> + * over ptes.
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * May be overridden by the architecture, else batch size is
>>>>>>>>>>>>>>>> always 1.
>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned
>>>>>>>>>>>>>>>> long
>>>>>>>>>>>>>>>> addr,
>>>>>>>>>>>>>>>> +                        unsigned long end)
>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>> +    return 1;
>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's a shame we now lose the optimization for all other
>>>>>>>>>>>>>>> architectures.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Was there no way to have some basic batching mechanism that doesn't
>>>>>>>>>>>>>>> require
>>>>>>>>>>>>>>> arch
>>>>>>>>>>>>>>> specifics?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tried a bunch of things but ultimately the way I've done it was the
>>>>>>>>>>>>>> only
>>>>>>>>>>>>>> way
>>>>>>>>>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My original v3 posting was costing 5% extra and even my first attempt
>>>>>>>>>>>>>> at an
>>>>>>>>>>>>>> arch-specific version that didn't resolve to a compile-time
>>>>>>>>>>>>>> constant 1
>>>>>>>>>>>>>> still
>>>>>>>>>>>>>> cost an extra 3%.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>>>>>>>>>> * Check that PFN is consecutive
>>>>>>>>>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I haven't tried this exact approach, but I'd be surprised if I can
>>>>>>>>>>>>>> get
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> regression under 4% with this. Further along the series I spent a
>>>>>>>>>>>>>> lot of
>>>>>>>>>>>>>> time
>>>>>>>>>>>>>> having to fiddle with the arm64 implementation; every conditional and
>>>>>>>>>>>>>> every
>>>>>>>>>>>>>> memory read (even when in cache) was a problem. There is just so
>>>>>>>>>>>>>> little in
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> inner loop that every instruction matters. (At least on Ampere Altra
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> Apple
>>>>>>>>>>>>>> M2).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the
>>>>>>>>>>>>>> benefit to
>>>>>>>>>>>>>> order-9 is around 10% in my measurements. Personally though, I'd
>>>>>>>>>>>>>> prefer to
>>>>>>>>>>>>>> play
>>>>>>>>>>>>>> safe and ensure the common order-0 case doesn't regress, as you
>>>>>>>>>>>>>> previously
>>>>>>>>>>>>>> suggested.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I just hacked something up, on top of my beloved rmap cleanup/batching
>>>>>>>>>>>>> series. I
>>>>>>>>>>>>> implemented very generic and simple batching for large folios (all PTE
>>>>>>>>>>>>> bits
>>>>>>>>>>>>> except the PFN have to match).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R)
>>>>>>>>>>>>> Silver
>>>>>>>>>>>>> 4210R CPU.
>>>>>>>>>>>>>
>>>>>>>>>>>>> order-0: 0.014210 -> 0.013969
>>>>>>>>>>>>>
>>>>>>>>>>>>> -> Around 1.7 % faster
>>>>>>>>>>>>>
>>>>>>>>>>>>> order-9: 0.014373 -> 0.009149
>>>>>>>>>>>>>
>>>>>>>>>>>>> -> Around 36.3 % faster
>>>>>>>>>>>>
>>>>>>>>>>>> Well I guess that shows me :)
>>>>>>>>>>>>
>>>>>>>>>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>>>>>>>>>
>>>>>>>>>> I've just been trying to compile and noticed this. Will take a look at
>>>>>>>>>> your
>>>>>>>>>> update.
>>>>>>>>>>
>>>>>>>>>> But upon review, I've noticed the part that I think makes this difficult
>>>>>>>>>> for
>>>>>>>>>> arm64 with the contpte optimization; You are calling ptep_get() for every
>>>>>>>>>> pte in
>>>>>>>>>> the batch. While this is functionally correct, once arm64 has the contpte
>>>>>>>>>> changes, its ptep_get() has to read every pte in the contpte block in
>>>>>>>>>> order to
>>>>>>>>>> gather the access and dirty bits. So if your batching function ends up
>>>>>>>>>> walking
>>>>>>>>>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>>>>>>>>>> performance. That's why I added the arch-specific pte_batch_remaining()
>>>>>>>>>> function; this allows the core-mm to skip to the end of the contpte
>>>>>>>>>> block and
>>>>>>>>>> avoid ptep_get() for the 15 tail ptes. So we end up with 16 READ_ONCE()s
>>>>>>>>>> instead
>>>>>>>>>> of 256.
>>>>>>>>>>
>>>>>>>>>> I considered making a ptep_get_noyoungdirty() variant, which would avoid
>>>>>>>>>> the
>>>>>>>>>> bit
>>>>>>>>>> gathering. But we have a similar problem in zap_pte_range() and that
>>>>>>>>>> function
>>>>>>>>>> needs the dirty bit to update the folio. So it doesn't work there. (see
>>>>>>>>>> patch 3
>>>>>>>>>> in my series).
>>>>>>>>>>
>>>>>>>>>> I guess you are going to say that we should combine both approaches, so
>>>>>>>>>> that
>>>>>>>>>> your batching loop can skip forward an arch-provided number of ptes? That
>>>>>>>>>> would
>>>>>>>>>> certainly work, but feels like an orthogonal change to what I'm trying to
>>>>>>>>>> achieve :). Anyway, I'll spend some time playing with it today.
>>>>>>>>>
>>>>>>>>> You can overwrite the function or add special-casing internally, yes.
>>>>>>>>>
>>>>>>>>> Right now, your patch is called "mm: Batch-copy PTE ranges during fork()"
>>>>>>>>> and it
>>>>>>>>> doesn't do any of that besides preparing for some arm64 work.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Well it allows an arch to opt-in to batching. But I see your point.
>>>>>>>>
>>>>>>>> How do you want to handle your patches? Do you want to clean them up and
>>>>>>>> I'll
>>>>>>>> base my stuff on top? Or do you want me to take them and sort it all out?
>>>>>>>
>>>>>>> Whatever you prefer, it was mostly a quick prototype to see if we can
>>>>>>> achieve
>>>>>>> decent performance.
>>>>>>
>>>>>> I'm about to run it on Altra and M2. But I assume it will show similar
>>>>>> results.
>>>>
>>>> OK results in, not looking great, which aligns with my previous experience.
>>>> That
>>>> said, I'm seeing some "BUG: Bad page state in process gmain  pfn:12a094" so
>>>> perhaps these results are not valid...
>>>
>>> I didn't see that so far on x86, maybe related to the PFN fixup?
>>
>> All I've done is define PFN_PTE_SHIFT for arm64 on top of your latest patch:
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index b19a8aee684c..9eb0fd693df9 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -359,6 +359,8 @@ static inline void set_ptes(struct mm_struct *mm,
>>   }
>>   #define set_ptes set_ptes
>>   +#define PFN_PTE_SHIFT          PAGE_SHIFT
>> +
>>   /*
>>    * Huge pte definitions.
>>    */
>>
>>
>> As an aside, I think there is a bug in arm64's set_ptes() for PA > 48-bit
>> case. But that won't affect this.
>>
>>
>> With VM_DEBUG on, this is the first warning I see during boot:
>>
>>
>> [    0.278110] page:00000000c7ced4e8 refcount:12 mapcount:0
>> mapping:00000000b2f9739b index:0x1a8 pfn:0x1bff30
>> [    0.278742] head:00000000c7ced4e8 order:2 entire_mapcount:0
>> nr_pages_mapped:2 pincount:0
>
> ^ Ah, you are running with mTHP. Let me play with that.

Err... It's in mm-unstable, but I'm not enabling any sizes. It should only be set
up for PMD-sized THP.

I am using XFS though, so I imagine it's a file folio.

I've rebased your rmap cleanup and fork batching to the version of mm-unstable
that I was doing all my other testing with so I could compare numbers. But it's
not very old (perhaps a week). All the patches applied without any conflict.

>
> The warning would indicate that nr is too large (or something else is messed up).
>


2023-12-20 12:06:22

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20.12.23 12:51, Ryan Roberts wrote:
> On 20/12/2023 11:36, David Hildenbrand wrote:
>> On 20.12.23 12:28, Ryan Roberts wrote:
>>> On 20/12/2023 10:56, David Hildenbrand wrote:
>>>> On 20.12.23 11:41, Ryan Roberts wrote:
>>>>> On 20/12/2023 10:16, David Hildenbrand wrote:
>>>>>> On 20.12.23 11:11, Ryan Roberts wrote:
>>>>>>> On 20/12/2023 09:54, David Hildenbrand wrote:
>>>>>>>> On 20.12.23 10:51, Ryan Roberts wrote:
>>>>>>>>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>>>>>>>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>>>>>>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>>>>>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>>>>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>>>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>>>>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of
>>>>>>>>>>>>>>> memory,
>>>>>>>>>>>>>>> all belonging to the same folio. A pte batch is then
>>>>>>>>>>>>>>> write-protected in
>>>>>>>>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects()
>>>>>>>>>>>>>>> and is
>>>>>>>>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The primary motivation for this change is to reduce the number of tlb
>>>>>>>>>>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>>>>>>>>>>> fork, as it is about to add transparent support for the "contiguous
>>>>>>>>>>>>>>> bit"
>>>>>>>>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the backend
>>>>>>>>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is expensive,
>>>>>>>>>>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>>>>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in the
>>>>>>>>>>>>>>> child, the backend does not need to fold a contiguous range once they
>>>>>>>>>>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>>>>>>>>>>> range in the first place.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This code is very performance sensitive, and a significant amount of
>>>>>>>>>>>>>>> effort has been put into not regressing performance for the order-0
>>>>>>>>>>>>>>> folio case. By default, pte_batch_remaining() is compile constant 1,
>>>>>>>>>>>>>>> which enables the compiler to simplify the extra loops that are added
>>>>>>>>>>>>>>> for batching and produce code that is equivalent to (and as
>>>>>>>>>>>>>>> performant as) the previous implementation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This change addresses the core-mm refactoring only and a separate
>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To ensure the arm64 backend is performant once implemented, this change is
>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The following microbenchmark results demonstrate that there is no
>>>>>>>>>>>>>>> significant performance change after this patch. Fork is called in a
>>>>>>>>>>>>>>> tight loop in a process with 1G of populated memory and the time for
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal). Tests
>>>>>>>>>>>>>>> performed for case where 1G memory is comprised of order-0 folios and
>>>>>>>>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is
>>>>>>>>>>>>>>> faster,
>>>>>>>>>>>>>>> positive is slower, compared to baseline upon which the series is
>>>>>>>>>>>>>>> based:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>         include/linux/pgtable.h | 80
>>>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>         mm/memory.c             | 92
>>>>>>>>>>>>>>> ++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>>         2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>>>>>>>>         #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>>>>>>>>         #endif
>>>>>>>>>>>>>>>         +#ifndef pte_batch_remaining
>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch
>>>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>>>>>>>>> + * @addr: Address of the first page.
>>>>>>>>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous
>>>>>>>>>>>>>>> batch of
>>>>>>>>>>>>>>> ptes.
>>>>>>>>>>>>>>> + * In such cases, this function returns the remaining number of
>>>>>>>>>>>>>>> pages to
>>>>>>>>>>>>>>> the end
>>>>>>>>>>>>>>> + * of the current batch, as defined by addr. This can be useful when
>>>>>>>>>>>>>>> iterating
>>>>>>>>>>>>>>> + * over ptes.
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * May be overridden by the architecture, else batch size is
>>>>>>>>>>>>>>> always 1.
>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned
>>>>>>>>>>>>>>> long
>>>>>>>>>>>>>>> addr,
>>>>>>>>>>>>>>> +                        unsigned long end)
>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>> +    return 1;
>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's a shame we now lose the optimization for all other architectures.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Was there no way to have some basic batching mechanism that doesn't
>>>>>>>>>>>>>> require
>>>>>>>>>>>>>> arch
>>>>>>>>>>>>>> specifics?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried a bunch of things but ultimately the way I've done it was the
>>>>>>>>>>>>> only
>>>>>>>>>>>>> way
>>>>>>>>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My original v3 posting was costing 5% extra and even my first attempt
>>>>>>>>>>>>> at an
>>>>>>>>>>>>> arch-specific version that didn't resolve to a compile-time constant 1
>>>>>>>>>>>>> still
>>>>>>>>>>>>> cost an extra 3%.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>>>>>>>>> * Check that PFN is consecutive
>>>>>>>>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>>>>>>>>
>>>>>>>>>>>>> I haven't tried this exact approach, but I'd be surprised if I can get
>>>>>>>>>>>>> the
>>>>>>>>>>>>> regression under 4% with this. Further along the series I spent a
>>>>>>>>>>>>> lot of
>>>>>>>>>>>>> time
>>>>>>>>>>>>> having to fiddle with the arm64 implementation; every conditional and
>>>>>>>>>>>>> every
>>>>>>>>>>>>> memory read (even when in cache) was a problem. There is just so
>>>>>>>>>>>>> little in
>>>>>>>>>>>>> the
>>>>>>>>>>>>> inner loop that every instruction matters. (At least on Ampere Altra
>>>>>>>>>>>>> and
>>>>>>>>>>>>> Apple
>>>>>>>>>>>>> M2).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the
>>>>>>>>>>>>> benefit to
>>>>>>>>>>>>> order-9 is around 10% in my measurements. Personally though, I'd
>>>>>>>>>>>>> prefer to
>>>>>>>>>>>>> play
>>>>>>>>>>>>> safe and ensure the common order-0 case doesn't regress, as you
>>>>>>>>>>>>> previously
>>>>>>>>>>>>> suggested.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I just hacked something up, on top of my beloved rmap cleanup/batching
>>>>>>>>>>>> series. I
>>>>>>>>>>>> implemented very generic and simple batching for large folios (all PTE
>>>>>>>>>>>> bits
>>>>>>>>>>>> except the PFN have to match).
>>>>>>>>>>>>
>>>>>>>>>>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R)
>>>>>>>>>>>> Silver
>>>>>>>>>>>> 4210R CPU.
>>>>>>>>>>>>
>>>>>>>>>>>> order-0: 0.014210 -> 0.013969
>>>>>>>>>>>>
>>>>>>>>>>>> -> Around 1.7 % faster
>>>>>>>>>>>>
>>>>>>>>>>>> order-9: 0.014373 -> 0.009149
>>>>>>>>>>>>
>>>>>>>>>>>> -> Around 36.3 % faster
>>>>>>>>>>>
>>>>>>>>>>> Well I guess that shows me :)
>>>>>>>>>>>
>>>>>>>>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>>>>>>>>
>>>>>>>>> I've just been trying to compile and noticed this. Will take a look at your
>>>>>>>>> update.
>>>>>>>>>
>>>>>>>>> But upon review, I've noticed the part that I think makes this difficult
>>>>>>>>> for
>>>>>>>>> arm64 with the contpte optimization; You are calling ptep_get() for every
>>>>>>>>> pte in
>>>>>>>>> the batch. While this is functionally correct, once arm64 has the contpte
>>>>>>>>> changes, its ptep_get() has to read every pte in the contpte block in
>>>>>>>>> order to
>>>>>>>>> gather the access and dirty bits. So if your batching function ends up
>>>>>>>>> walking
>>>>>>>>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>>>>>>>>> performance. That's why I added the arch-specific pte_batch_remaining()
>>>>>>>>> function; this allows the core-mm to skip to the end of the contpte
>>>>>>>>> block and
>>>>>>>>> avoid ptep_get() for the 15 tail ptes. So we end up with 16 READ_ONCE()s
>>>>>>>>> instead
>>>>>>>>> of 256.
>>>>>>>>>
>>>>>>>>> I considered making a ptep_get_noyoungdirty() variant, which would avoid
>>>>>>>>> the
>>>>>>>>> bit
>>>>>>>>> gathering. But we have a similar problem in zap_pte_range() and that
>>>>>>>>> function
>>>>>>>>> needs the dirty bit to update the folio. So it doesn't work there. (see
>>>>>>>>> patch 3
>>>>>>>>> in my series).
>>>>>>>>>
>>>>>>>>> I guess you are going to say that we should combine both approaches, so
>>>>>>>>> that
>>>>>>>>> your batching loop can skip forward an arch-provided number of ptes? That
>>>>>>>>> would
>>>>>>>>> certainly work, but feels like an orthogonal change to what I'm trying to
>>>>>>>>> achieve :). Anyway, I'll spend some time playing with it today.
>>>>>>>>
>>>>>>>> You can overwrite the function or add special-casing internally, yes.
>>>>>>>>
>>>>>>>> Right now, your patch is called "mm: Batch-copy PTE ranges during fork()"
>>>>>>>> and it
>>>>>>>> doesn't do any of that besides preparing for some arm64 work.
>>>>>>>>
>>>>>>>
>>>>>>> Well it allows an arch to opt-in to batching. But I see your point.
>>>>>>>
>>>>>>> How do you want to handle your patches? Do you want to clean them up and I'll
>>>>>>> base my stuff on top? Or do you want me to take them and sort it all out?
>>>>>>
>>>>>> Whatever you prefer, it was mostly a quick prototype to see if we can achieve
>>>>>> decent performance.
>>>>>
>>>>> I'm about to run it on Altra and M2. But I assume it will show similar results.
>>>
>>> OK results in, not looking great, which aligns with my previous experience. That
>>> said, I'm seeing some "BUG: Bad page state in process gmain  pfn:12a094" so
>>> perhaps these results are not valid...
>>
>> I didn't see that so far on x86, maybe related to the PFN fixup?
>
> All I've done is define PFN_PTE_SHIFT for arm64 on top of your latest patch:
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index b19a8aee684c..9eb0fd693df9 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -359,6 +359,8 @@ static inline void set_ptes(struct mm_struct *mm,
> }
> #define set_ptes set_ptes
>
> +#define PFN_PTE_SHIFT PAGE_SHIFT
> +
> /*
> * Huge pte definitions.
> */
>
>
> As an aside, I think there is a bug in arm64's set_ptes() for PA > 48-bit case. But that won't affect this.
>
>
> With VM_DEBUG on, this is the first warning I see during boot:
>
>
> [ 0.278110] page:00000000c7ced4e8 refcount:12 mapcount:0 mapping:00000000b2f9739b index:0x1a8 pfn:0x1bff30
> [ 0.278742] head:00000000c7ced4e8 order:2 entire_mapcount:0 nr_pages_mapped:2 pincount:0

^ Ah, you are running with mTHP. Let me play with that.

The warning would indicate that nr is too large (or something else is
messed up).
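
In other words (reconstructed from the warning text you pasted, not quoted
from the actual source, and the wrapper name below is made up purely for
illustration), the check that fires boils down to the last page of the batch
still having to belong to the folio that was passed in:

static inline void rmap_batch_sanity_check(struct folio *folio,
					   struct page *page, int nr_pages)
{
	/* page_folio() is what the _Generic() in the warning expands to. */
	VM_WARN_ON_FOLIO(page_folio(page + nr_pages - 1) != folio, folio);
}

So if the nr we feed into the rmap batching overshoots the end of the folio,
this is exactly the splat you'd see.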

--
Cheers,

David / dhildenb


2023-12-20 12:08:48

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20.12.23 13:04, Ryan Roberts wrote:
> On 20/12/2023 11:58, David Hildenbrand wrote:
>> On 20.12.23 12:51, Ryan Roberts wrote:
>>> On 20/12/2023 11:36, David Hildenbrand wrote:
>>>> On 20.12.23 12:28, Ryan Roberts wrote:
>>>>> On 20/12/2023 10:56, David Hildenbrand wrote:
>>>>>> On 20.12.23 11:41, Ryan Roberts wrote:
>>>>>>> On 20/12/2023 10:16, David Hildenbrand wrote:
>>>>>>>> On 20.12.23 11:11, Ryan Roberts wrote:
>>>>>>>>> On 20/12/2023 09:54, David Hildenbrand wrote:
>>>>>>>>>> On 20.12.23 10:51, Ryan Roberts wrote:
>>>>>>>>>>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>>>>>>>>>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>>>>>>>>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>>>>>>>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>>>>>>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>>>>>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>>>>>>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>>>>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of
>>>>>>>>>>>>>>>>> memory,
>>>>>>>>>>>>>>>>> all belonging to the same folio. A pte batch is then
>>>>>>>>>>>>>>>>> write-protected in
>>>>>>>>>>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects()
>>>>>>>>>>>>>>>>> and is
>>>>>>>>>>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The primary motivation for this change is to reduce the number
>>>>>>>>>>>>>>>>> of tlb
>>>>>>>>>>>>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>>>>>>>>>>>>> fork, as it is about to add transparent support for the "contiguous
>>>>>>>>>>>>>>>>> bit"
>>>>>>>>>>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>>>>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>>>>>>>>>>>> backend
>>>>>>>>>>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is
>>>>>>>>>>>>>>>>> expensive,
>>>>>>>>>>>>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>>>>>>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> child, the backend does not need to fold a contiguous range once
>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>>>>>>>>>>>>> range in the first place.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This code is very performance sensitive, and a significant
>>>>>>>>>>>>>>>>> amount of
>>>>>>>>>>>>>>>>> effort has been put into not regressing performance for the order-0
>>>>>>>>>>>>>>>>> folio case. By default, pte_batch_remaining() is compile
>>>>>>>>>>>>>>>>> constant 1,
>>>>>>>>>>>>>>>>> which enables the compiler to simplify the extra loops that are
>>>>>>>>>>>>>>>>> added
>>>>>>>>>>>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>>>>>>>>>>>> performant) as the previous implementation.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This change addresses the core-mm refactoring only and a separate
>>>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>>>>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>>>>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To ensure the arm64 is performant once implemented, this change is
>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The following microbenchmark results demonstrate that there is no
>>>>>>>>>>>>>>>>> significant performance change after this patch. Fork is called
>>>>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>>> tight loop in a process with 1G of populated memory and the time
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>>>>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal).
>>>>>>>>>>>>>>>>> Tests
>>>>>>>>>>>>>>>>> performed for case where 1G memory is comprised of order-0
>>>>>>>>>>>>>>>>> folios and
>>>>>>>>>>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is
>>>>>>>>>>>>>>>>> faster,
>>>>>>>>>>>>>>>>> positive is slower, compared to baseline upon which the series is
>>>>>>>>>>>>>>>>> based:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>>>>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>>>>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>>>>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>>>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>          include/linux/pgtable.h | 80
>>>>>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>>>          mm/memory.c             | 92
>>>>>>>>>>>>>>>>> ++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>>>>          2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>>>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>>>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>>>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>>>>>>>>>>          #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>>>>>>>>>>          #endif
>>>>>>>>>>>>>>>>>          +#ifndef pte_batch_remaining
>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch
>>>>>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>>>>>>>>>>> + * @addr: Address of the first page.
>>>>>>>>>>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous
>>>>>>>>>>>>>>>>> batch of
>>>>>>>>>>>>>>>>> ptes.
>>>>>>>>>>>>>>>>> + * In such cases, this function returns the remaining number of
>>>>>>>>>>>>>>>>> pages to
>>>>>>>>>>>>>>>>> the end
>>>>>>>>>>>>>>>>> + * of the current batch, as defined by addr. This can be useful
>>>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>>> iterating
>>>>>>>>>>>>>>>>> + * over ptes.
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * May be overridden by the architecture, else batch size is
>>>>>>>>>>>>>>>>> always 1.
>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned
>>>>>>>>>>>>>>>>> long
>>>>>>>>>>>>>>>>> addr,
>>>>>>>>>>>>>>>>> +                        unsigned long end)
>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>> +    return 1;
>>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's a shame we now lose the optimization for all other
>>>>>>>>>>>>>>>> architectures.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Was there no way to have some basic batching mechanism that doesn't
>>>>>>>>>>>>>>>> require
>>>>>>>>>>>>>>>> arch
>>>>>>>>>>>>>>>> specifics?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried a bunch of things but ultimately the way I've done it was the
>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>> way
>>>>>>>>>>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My original v3 posting was costing 5% extra and even my first attempt
>>>>>>>>>>>>>>> at an
>>>>>>>>>>>>>>> arch-specific version that didn't resolve to a compile-time
>>>>>>>>>>>>>>> constant 1
>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>> cost an extra 3%.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>>>>>>>>>>> * Check that PFN is consecutive
>>>>>>>>>>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I haven't tried this exact approach, but I'd be surprised if I can
>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> regression under 4% with this. Further along the series I spent a
>>>>>>>>>>>>>>> lot of
>>>>>>>>>>>>>>> time
>>>>>>>>>>>>>>> having to fiddle with the arm64 implementation; every conditional and
>>>>>>>>>>>>>>> every
>>>>>>>>>>>>>>> memory read (even when in cache) was a problem. There is just so
>>>>>>>>>>>>>>> little in
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> inner loop that every instruction matters. (At least on Ampere Altra
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> Apple
>>>>>>>>>>>>>>> M2).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the
>>>>>>>>>>>>>>> benefit to
>>>>>>>>>>>>>>> order-9 is around 10% in my measurements. Personally though, I'd
>>>>>>>>>>>>>>> prefer to
>>>>>>>>>>>>>>> play
>>>>>>>>>>>>>>> safe and ensure the common order-0 case doesn't regress, as you
>>>>>>>>>>>>>>> previously
>>>>>>>>>>>>>>> suggested.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I just hacked something up, on top of my beloved rmap cleanup/batching
>>>>>>>>>>>>>> series. I
>>>>>>>>>>>>>> implemented very generic and simple batching for large folios (all PTE
>>>>>>>>>>>>>> bits
>>>>>>>>>>>>>> except the PFN have to match).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R)
>>>>>>>>>>>>>> Silver
>>>>>>>>>>>>>> 4210R CPU.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> order-0: 0.014210 -> 0.013969
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -> Around 1.7 % faster
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> order-9: 0.014373 -> 0.009149
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -> Around 36.3 % faster
>>>>>>>>>>>>>
>>>>>>>>>>>>> Well I guess that shows me :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>>>>>>>>>>
>>>>>>>>>>> I've just been trying to compile and noticed this. Will take a look at
>>>>>>>>>>> your
>>>>>>>>>>> update.
>>>>>>>>>>>
>>>>>>>>>>> But upon review, I've noticed the part that I think makes this difficult
>>>>>>>>>>> for
>>>>>>>>>>> arm64 with the contpte optimization; You are calling ptep_get() for every
>>>>>>>>>>> pte in
>>>>>>>>>>> the batch. While this is functionally correct, once arm64 has the contpte
>>>>>>>>>>> changes, its ptep_get() has to read every pte in the contpte block in
>>>>>>>>>>> order to
>>>>>>>>>>> gather the access and dirty bits. So if your batching function ends up
>>>>>>>>>>> walking
>>>>>>>>>>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>>>>>>>>>>> performance. That's why I added the arch-specific pte_batch_remaining()
>>>>>>>>>>> function; this allows the core-mm to skip to the end of the contpte
>>>>>>>>>>> block and
>>>>>>>>>>> avoid ptep_get() for the 15 tail ptes. So we end up with 16 READ_ONCE()s
>>>>>>>>>>> instead
>>>>>>>>>>> of 256.
>>>>>>>>>>>
>>>>>>>>>>> I considered making a ptep_get_noyoungdirty() variant, which would avoid
>>>>>>>>>>> the
>>>>>>>>>>> bit
>>>>>>>>>>> gathering. But we have a similar problem in zap_pte_range() and that
>>>>>>>>>>> function
>>>>>>>>>>> needs the dirty bit to update the folio. So it doesn't work there. (see
>>>>>>>>>>> patch 3
>>>>>>>>>>> in my series).
>>>>>>>>>>>
>>>>>>>>>>> I guess you are going to say that we should combine both approaches, so
>>>>>>>>>>> that
>>>>>>>>>>> your batching loop can skip forward an arch-provided number of ptes? That
>>>>>>>>>>> would
>>>>>>>>>>> certainly work, but feels like an orthogonal change to what I'm trying to
>>>>>>>>>>> achieve :). Anyway, I'll spend some time playing with it today.
>>>>>>>>>>
>>>>>>>>>> You can overwrite the function or add special-casing internally, yes.
>>>>>>>>>>
>>>>>>>>>> Right now, your patch is called "mm: Batch-copy PTE ranges during fork()"
>>>>>>>>>> and it
>>>>>>>>>> doesn't do any of that besides preparing for some arm64 work.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Well it allows an arch to opt-in to batching. But I see your point.
>>>>>>>>>
>>>>>>>>> How do you want to handle your patches? Do you want to clean them up and
>>>>>>>>> I'll
>>>>>>>>> base my stuff on top? Or do you want me to take them and sort it all out?
>>>>>>>>
>>>>>>>> Whatever you prefer, it was mostly a quick prototype to see if we can
>>>>>>>> achieve
>>>>>>>> decent performance.
>>>>>>>
>>>>>>> I'm about to run it on Altra and M2. But I assume it will show similar
>>>>>>> results.
>>>>>
>>>>> OK results in, not looking great, which aligns with my previous experience.
>>>>> That
>>>>> said, I'm seeing some "BUG: Bad page state in process gmain  pfn:12a094" so
>>>>> perhaps these results are not valid...
>>>>
>>>> I didn't see that so far on x86, maybe related to the PFN fixup?
>>>
>>> All I've done is define PFN_PTE_SHIFT for arm64 on top of your latest patch:
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index b19a8aee684c..9eb0fd693df9 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -359,6 +359,8 @@ static inline void set_ptes(struct mm_struct *mm,
>>>   }
>>>   #define set_ptes set_ptes
>>>   +#define PFN_PTE_SHIFT          PAGE_SHIFT
>>> +
>>>   /*
>>>    * Huge pte definitions.
>>>    */
>>>
>>>
>>> As an aside, I think there is a bug in arm64's set_ptes() for PA > 48-bit
>>> case. But that won't affect this.
>>>
>>>
>>> With VM_DEBUG on, this is the first warning I see during boot:
>>>
>>>
>>> [    0.278110] page:00000000c7ced4e8 refcount:12 mapcount:0
>>> mapping:00000000b2f9739b index:0x1a8 pfn:0x1bff30
>>> [    0.278742] head:00000000c7ced4e8 order:2 entire_mapcount:0
>>> nr_pages_mapped:2 pincount:0
>>
>> ^ Ah, you are running with mTHP. Let me play with that.
>
> Err... It's in mm-unstable, but I'm not enabling any sizes. It should only be set
> up for PMD-sized THP.
>
> I am using XFS though, so I imagine it's a file folio.
>

Right, that's even weirder :)

I should have that in my environment as well. Let me dig.

--
Cheers,

David / dhildenb


2023-12-20 12:55:16

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20.12.23 13:04, Ryan Roberts wrote:
> On 20/12/2023 11:58, David Hildenbrand wrote:
>> On 20.12.23 12:51, Ryan Roberts wrote:
>>> On 20/12/2023 11:36, David Hildenbrand wrote:
>>>> On 20.12.23 12:28, Ryan Roberts wrote:
>>>>> On 20/12/2023 10:56, David Hildenbrand wrote:
>>>>>> On 20.12.23 11:41, Ryan Roberts wrote:
>>>>>>> On 20/12/2023 10:16, David Hildenbrand wrote:
>>>>>>>> On 20.12.23 11:11, Ryan Roberts wrote:
>>>>>>>>> On 20/12/2023 09:54, David Hildenbrand wrote:
>>>>>>>>>> On 20.12.23 10:51, Ryan Roberts wrote:
>>>>>>>>>>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>>>>>>>>>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>>>>>>>>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>>>>>>>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>>>>>>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>>>>>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>>>>>>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>>>>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of
>>>>>>>>>>>>>>>>> memory,
>>>>>>>>>>>>>>>>> all belonging to the same folio. A pte batch is then
>>>>>>>>>>>>>>>>> write-protected in
>>>>>>>>>>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects()
>>>>>>>>>>>>>>>>> and is
>>>>>>>>>>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The primary motivation for this change is to reduce the number
>>>>>>>>>>>>>>>>> of tlb
>>>>>>>>>>>>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>>>>>>>>>>>>> fork, as it is about to add transparent support for the "contiguous
>>>>>>>>>>>>>>>>> bit"
>>>>>>>>>>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>>>>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>>>>>>>>>>>> backend
>>>>>>>>>>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is
>>>>>>>>>>>>>>>>> expensive,
>>>>>>>>>>>>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>>>>>>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> child, the backend does not need to fold a contiguous range once
>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>>>>>>>>>>>>> range in the first place.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This code is very performance sensitive, and a significant
>>>>>>>>>>>>>>>>> amount of
>>>>>>>>>>>>>>>>> effort has been put into not regressing performance for the order-0
>>>>>>>>>>>>>>>>> folio case. By default, pte_batch_remaining() is compile
>>>>>>>>>>>>>>>>> constant 1,
>>>>>>>>>>>>>>>>> which enables the compiler to simplify the extra loops that are
>>>>>>>>>>>>>>>>> added
>>>>>>>>>>>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>>>>>>>>>>>> performant) as the previous implementation.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This change addresses the core-mm refactoring only and a separate
>>>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>>>>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>>>>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To ensure the arm64 is performant once implemented, this change is
>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The following microbenchmark results demonstrate that there is no
>>>>>>>>>>>>>>>>> significant performance change after this patch. Fork is called
>>>>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>>> tight loop in a process with 1G of populated memory and the time
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>>>>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal).
>>>>>>>>>>>>>>>>> Tests
>>>>>>>>>>>>>>>>> performed for case where 1G memory is comprised of order-0
>>>>>>>>>>>>>>>>> folios and
>>>>>>>>>>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is
>>>>>>>>>>>>>>>>> faster,
>>>>>>>>>>>>>>>>> positive is slower, compared to baseline upon which the series is
>>>>>>>>>>>>>>>>> based:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>>>>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>>>>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>>>>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>>>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>          include/linux/pgtable.h | 80
>>>>>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>>>          mm/memory.c             | 92
>>>>>>>>>>>>>>>>> ++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>>>>          2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>>>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>>>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>>>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>>>>>>>>>>          #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>>>>>>>>>>          #endif
>>>>>>>>>>>>>>>>>          +#ifndef pte_batch_remaining
>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch
>>>>>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>>>>>>>>>>> + * @addr: Address of the first page.
>>>>>>>>>>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous
>>>>>>>>>>>>>>>>> batch of
>>>>>>>>>>>>>>>>> ptes.
>>>>>>>>>>>>>>>>> + * In such cases, this function returns the remaining number of
>>>>>>>>>>>>>>>>> pages to
>>>>>>>>>>>>>>>>> the end
>>>>>>>>>>>>>>>>> + * of the current batch, as defined by addr. This can be useful
>>>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>>> iterating
>>>>>>>>>>>>>>>>> + * over ptes.
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * May be overridden by the architecture, else batch size is
>>>>>>>>>>>>>>>>> always 1.
>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned
>>>>>>>>>>>>>>>>> long
>>>>>>>>>>>>>>>>> addr,
>>>>>>>>>>>>>>>>> +                        unsigned long end)
>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>> +    return 1;
>>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's a shame we now lose the optimization for all other
>>>>>>>>>>>>>>>> architectures.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Was there no way to have some basic batching mechanism that doesn't
>>>>>>>>>>>>>>>> require
>>>>>>>>>>>>>>>> arch
>>>>>>>>>>>>>>>> specifics?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried a bunch of things but ultimately the way I've done it was the
>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>> way
>>>>>>>>>>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My original v3 posting was costing 5% extra and even my first attempt
>>>>>>>>>>>>>>> at an
>>>>>>>>>>>>>>> arch-specific version that didn't resolve to a compile-time
>>>>>>>>>>>>>>> constant 1
>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>> cost an extra 3%.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>>>>>>>>>>> * Check that PFN is consecutive
>>>>>>>>>>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I haven't tried this exact approach, but I'd be surprised if I can
>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> regression under 4% with this. Further along the series I spent a
>>>>>>>>>>>>>>> lot of
>>>>>>>>>>>>>>> time
>>>>>>>>>>>>>>> having to fiddle with the arm64 implementation; every conditional and
>>>>>>>>>>>>>>> every
>>>>>>>>>>>>>>> memory read (even when in cache) was a problem. There is just so
>>>>>>>>>>>>>>> little in
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> inner loop that every instruction matters. (At least on Ampere Altra
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> Apple
>>>>>>>>>>>>>>> M2).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the
>>>>>>>>>>>>>>> benefit to
>>>>>>>>>>>>>>> order-9 is around 10% in my measurements. Personally though, I'd
>>>>>>>>>>>>>>> prefer to
>>>>>>>>>>>>>>> play
>>>>>>>>>>>>>>> safe and ensure the common order-0 case doesn't regress, as you
>>>>>>>>>>>>>>> previously
>>>>>>>>>>>>>>> suggested.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I just hacked something up, on top of my beloved rmap cleanup/batching
>>>>>>>>>>>>>> series. I
>>>>>>>>>>>>>> implemented very generic and simple batching for large folios (all PTE
>>>>>>>>>>>>>> bits
>>>>>>>>>>>>>> except the PFN have to match).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R)
>>>>>>>>>>>>>> Silver
>>>>>>>>>>>>>> 4210R CPU.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> order-0: 0.014210 -> 0.013969
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -> Around 1.7 % faster
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> order-9: 0.014373 -> 0.009149
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -> Around 36.3 % faster
>>>>>>>>>>>>>
>>>>>>>>>>>>> Well I guess that shows me :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>>>>>>>>>>
>>>>>>>>>>> I've just been trying to compile and noticed this. Will take a look at
>>>>>>>>>>> your
>>>>>>>>>>> update.
>>>>>>>>>>>
>>>>>>>>>>> But upon review, I've noticed the part that I think makes this difficult
>>>>>>>>>>> for
>>>>>>>>>>> arm64 with the contpte optimization; You are calling ptep_get() for every
>>>>>>>>>>> pte in
>>>>>>>>>>> the batch. While this is functionally correct, once arm64 has the contpte
>>>>>>>>>>> changes, its ptep_get() has to read every pte in the contpte block in
>>>>>>>>>>> order to
>>>>>>>>>>> gather the access and dirty bits. So if your batching function ends up
>>>>>>>>>>> walking
>>>>>>>>>>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>>>>>>>>>>> performance. That's why I added the arch-specific pte_batch_remaining()
>>>>>>>>>>> function; this allows the core-mm to skip to the end of the contpte
>>>>>>>>>>> block and
>>>>>>>>>>> avoid ptep_get() for the 15 tail ptes. So we end up with 16 READ_ONCE()s
>>>>>>>>>>> instead
>>>>>>>>>>> of 256.
>>>>>>>>>>>
>>>>>>>>>>> I considered making a ptep_get_noyoungdirty() variant, which would avoid
>>>>>>>>>>> the
>>>>>>>>>>> bit
>>>>>>>>>>> gathering. But we have a similar problem in zap_pte_range() and that
>>>>>>>>>>> function
>>>>>>>>>>> needs the dirty bit to update the folio. So it doesn't work there. (see
>>>>>>>>>>> patch 3
>>>>>>>>>>> in my series).
>>>>>>>>>>>
>>>>>>>>>>> I guess you are going to say that we should combine both approaches, so
>>>>>>>>>>> that
>>>>>>>>>>> your batching loop can skip forward an arch-provided number of ptes? That
>>>>>>>>>>> would
>>>>>>>>>>> certainly work, but feels like an orthogonal change to what I'm trying to
>>>>>>>>>>> achieve :). Anyway, I'll spend some time playing with it today.
>>>>>>>>>>
>>>>>>>>>> You can overwrite the function or add special-casing internally, yes.
>>>>>>>>>>
>>>>>>>>>> Right now, your patch is called "mm: Batch-copy PTE ranges during fork()"
>>>>>>>>>> and it
>>>>>>>>>> doesn't do any of that besides preparing for some arm64 work.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Well it allows an arch to opt-in to batching. But I see your point.
>>>>>>>>>
>>>>>>>>> How do you want to handle your patches? Do you want to clean them up and
>>>>>>>>> I'll
>>>>>>>>> base my stuff on top? Or do you want me to take them and sort it all out?
>>>>>>>>
>>>>>>>> Whatever you prefer, it was mostly a quick prototype to see if we can
>>>>>>>> achieve
>>>>>>>> decent performance.
>>>>>>>
>>>>>>> I'm about to run it on Altra and M2. But I assume it will show similar
>>>>>>> results.
>>>>>
>>>>> OK results in, not looking great, which aligns with my previous experience.
>>>>> That
>>>>> said, I'm seeing some "BUG: Bad page state in process gmain  pfn:12a094" so
>>>>> perhaps these results are not valid...
>>>>
>>>> I didn't see that so far on x86, maybe related to the PFN fixup?
>>>
>>> All I've done is define PFN_PTE_SHIFT for arm64 on top of your latest patch:
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index b19a8aee684c..9eb0fd693df9 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -359,6 +359,8 @@ static inline void set_ptes(struct mm_struct *mm,
>>>   }
>>>   #define set_ptes set_ptes
>>>   +#define PFN_PTE_SHIFT          PAGE_SHIFT
>>> +
>>>   /*
>>>    * Huge pte definitions.
>>>    */
>>>
>>>
>>> As an aside, I think there is a bug in arm64's set_ptes() for PA > 48-bit
>>> case. But that won't affect this.
>>>
>>>
>>> With VM_DEBUG on, this is the first warning I see during boot:
>>>
>>>
>>> [    0.278110] page:00000000c7ced4e8 refcount:12 mapcount:0
>>> mapping:00000000b2f9739b index:0x1a8 pfn:0x1bff30
>>> [    0.278742] head:00000000c7ced4e8 order:2 entire_mapcount:0
>>> nr_pages_mapped:2 pincount:0
>>
>> ^ Ah, you are running with mTHP. Let me play with that.
>
> Err... It's in mm-unstable, but I'm not enabling any sizes. It should only be set
> up for PMD-sized THP.
>
> I am using XFS though, so I imagine it's a file folio.
>
> I've rebased your rmap cleanup and fork batching to the version of mm-unstable
> that I was doing all my other testing with so I could compare numbers. But its
> not very old (perhaps a week). All the patches applied without any conflict.

I think it was something stupid: I would get "17" from folio_pte_batch()
for an order-4 folio, but only sometimes. The rmap sanity checks were
definitely worth it :)

I guess we hit the case "next mapped folio is actually the next physical
folio" and the detection for that was off by one.

diff --git a/mm/memory.c b/mm/memory.c
index 187d1b9b70e2..2af34add7ed7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -975,7 +975,7 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		 * corner cases the next PFN might fall into a different
 		 * folio.
 		 */
-		if (pte_pfn(pte) == folio_end_pfn - 1)
+		if (pte_pfn(pte) == folio_end_pfn)
 			break;

Briefly tested, have to do more testing.

I only tested with order-9, which means max_nr would cap at 512.
Shouldn't affect the performance measurements, will redo them.
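
To make the boundary semantics of the hunk above concrete, here is a
self-contained sketch (hypothetical helper, not the mm/memory.c
folio_pte_batch() code): folio_end_pfn is exclusive, so the walk has to stop
as soon as the candidate pfn reaches it. An off-by-one in that comparison
lets the first page of a physically adjacent folio slip into the batch,
which is how a 16-page order-4 folio can report a count of 17.

/*
 * Hypothetical stand-alone illustration, not the kernel implementation:
 * count how many consecutive pfns starting at 'pfn' stay inside the folio
 * covering [folio_start_pfn, folio_end_pfn), where folio_end_pfn is
 * exclusive.
 */
static unsigned int pfns_in_folio(unsigned long pfn,
				  unsigned long folio_start_pfn,
				  unsigned long folio_end_pfn,
				  unsigned int max_nr)
{
	unsigned int nr = 0;

	while (nr < max_nr && pfn >= folio_start_pfn && pfn < folio_end_pfn) {
		nr++;
		pfn++;
	}
	return nr;
}

/*
 * An order-4 folio at pfn 0x1000 spans [0x1000, 0x1010): the result is 16,
 * even if pfn 0x1010 happens to start another, physically contiguous folio.
 */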

--
Cheers,

David / dhildenb


2023-12-20 13:02:59

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20/12/2023 12:54, David Hildenbrand wrote:
> On 20.12.23 13:04, Ryan Roberts wrote:
>> On 20/12/2023 11:58, David Hildenbrand wrote:
>>> On 20.12.23 12:51, Ryan Roberts wrote:
>>>> On 20/12/2023 11:36, David Hildenbrand wrote:
>>>>> On 20.12.23 12:28, Ryan Roberts wrote:
>>>>>> On 20/12/2023 10:56, David Hildenbrand wrote:
>>>>>>> On 20.12.23 11:41, Ryan Roberts wrote:
>>>>>>>> On 20/12/2023 10:16, David Hildenbrand wrote:
>>>>>>>>> On 20.12.23 11:11, Ryan Roberts wrote:
>>>>>>>>>> On 20/12/2023 09:54, David Hildenbrand wrote:
>>>>>>>>>>> On 20.12.23 10:51, Ryan Roberts wrote:
>>>>>>>>>>>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>>>>>>>>>>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>>>>>>>>>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>>>>>>>>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>>>>>>>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>>>>>>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A
>>>>>>>>>>>>>>>>>> given
>>>>>>>>>>>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>>>>>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of
>>>>>>>>>>>>>>>>>> memory,
>>>>>>>>>>>>>>>>>> all belonging to the same folio. A pte batch is then
>>>>>>>>>>>>>>>>>> write-protected in
>>>>>>>>>>>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects()
>>>>>>>>>>>>>>>>>> and is
>>>>>>>>>>>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The primary motivation for this change is to reduce the number
>>>>>>>>>>>>>>>>>> of tlb
>>>>>>>>>>>>>>>>>> maintenance operations that the arm64 backend has to perform
>>>>>>>>>>>>>>>>>> during
>>>>>>>>>>>>>>>>>> fork, as it is about to add transparent support for the
>>>>>>>>>>>>>>>>>> "contiguous
>>>>>>>>>>>>>>>>>> bit"
>>>>>>>>>>>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>>>>>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>>>>>>>>>>>>> backend
>>>>>>>>>>>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is
>>>>>>>>>>>>>>>>>> expensive,
>>>>>>>>>>>>>>>>>> when all ptes in the range are being write-protected.
>>>>>>>>>>>>>>>>>> Similarly, by
>>>>>>>>>>>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> child, the backend does not need to fold a contiguous range once
>>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>> are all populated - they can be initially populated as a
>>>>>>>>>>>>>>>>>> contiguous
>>>>>>>>>>>>>>>>>> range in the first place.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This code is very performance sensitive, and a significant
>>>>>>>>>>>>>>>>>> amount of
>>>>>>>>>>>>>>>>>> effort has been put into not regressing performance for the
>>>>>>>>>>>>>>>>>> order-0
>>>>>>>>>>>>>>>>>> folio case. By default, pte_batch_remaining() is compile
>>>>>>>>>>>>>>>>>> constant 1,
>>>>>>>>>>>>>>>>>> which enables the compiler to simplify the extra loops that are
>>>>>>>>>>>>>>>>>> added
>>>>>>>>>>>>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>>>>>>>>>>>>> performant) as the previous implementation.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This change addresses the core-mm refactoring only and a separate
>>>>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>>>>>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>>>>>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> To ensure the arm64 is performant once implemented, this
>>>>>>>>>>>>>>>>>> change is
>>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The following microbenchmark results demonstrate that there is no
>>>>>>>>>>>>>>>>>> significant performance change after this patch. Fork is called
>>>>>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>>>> tight loop in a process with 1G of populated memory and the time
>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>>>>>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal).
>>>>>>>>>>>>>>>>>> Tests
>>>>>>>>>>>>>>>>>> performed for case where 1G memory is comprised of order-0
>>>>>>>>>>>>>>>>>> folios and
>>>>>>>>>>>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is
>>>>>>>>>>>>>>>>>> faster,
>>>>>>>>>>>>>>>>>> positive is slower, compared to baseline upon which the series is
>>>>>>>>>>>>>>>>>> based:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>>>>>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>>>>>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>>>>>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>>>>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>           include/linux/pgtable.h | 80
>>>>>>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>>>>           mm/memory.c             | 92
>>>>>>>>>>>>>>>>>> ++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>>>>>           2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>>>>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>>>>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>>>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>>>>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>>>>>>>>>>>           #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>>>>>>>>>>>           #endif
>>>>>>>>>>>>>>>>>>           +#ifndef pte_batch_remaining
>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch
>>>>>>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>>>>>>>>>>>> + * @addr: Address of the first page.
>>>>>>>>>>>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>> + * Some architectures (arm64) can efficiently modify a
>>>>>>>>>>>>>>>>>> contiguous
>>>>>>>>>>>>>>>>>> batch of
>>>>>>>>>>>>>>>>>> ptes.
>>>>>>>>>>>>>>>>>> + * In such cases, this function returns the remaining number of
>>>>>>>>>>>>>>>>>> pages to
>>>>>>>>>>>>>>>>>> the end
>>>>>>>>>>>>>>>>>> + * of the current batch, as defined by addr. This can be useful
>>>>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>>>> iterating
>>>>>>>>>>>>>>>>>> + * over ptes.
>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>> + * May be overridden by the architecture, else batch size is
>>>>>>>>>>>>>>>>>> always 1.
>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte,
>>>>>>>>>>>>>>>>>> unsigned
>>>>>>>>>>>>>>>>>> long
>>>>>>>>>>>>>>>>>> addr,
>>>>>>>>>>>>>>>>>> +                        unsigned long end)
>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>> +    return 1;
>>>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It's a shame we now lose the optimization for all other
>>>>>>>>>>>>>>>>> architectures.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Was there no way to have some basic batching mechanism that
>>>>>>>>>>>>>>>>> doesn't
>>>>>>>>>>>>>>>>> require
>>>>>>>>>>>>>>>>> arch
>>>>>>>>>>>>>>>>> specifics?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I tried a bunch of things but ultimately the way I've done it
>>>>>>>>>>>>>>>> was the
>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>> way
>>>>>>>>>>>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My original v3 posting was costing 5% extra and even my first
>>>>>>>>>>>>>>>> attempt
>>>>>>>>>>>>>>>> at an
>>>>>>>>>>>>>>>> arch-specific version that didn't resolve to a compile-time
>>>>>>>>>>>>>>>> constant 1
>>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>> cost an extra 3%.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>>>>>>>>>>>> * Check that PFN is consecutive
>>>>>>>>>>>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I haven't tried this exact approach, but I'd be surprised if I can
>>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> regression under 4% with this. Further along the series I spent a
>>>>>>>>>>>>>>>> lot of
>>>>>>>>>>>>>>>> time
>>>>>>>>>>>>>>>> having to fiddle with the arm64 implementation; every
>>>>>>>>>>>>>>>> conditional and
>>>>>>>>>>>>>>>> every
>>>>>>>>>>>>>>>> memory read (even when in cache) was a problem. There is just so
>>>>>>>>>>>>>>>> little in
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> inner loop that every instruction matters. (At least on Ampere
>>>>>>>>>>>>>>>> Altra
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> Apple
>>>>>>>>>>>>>>>> M2).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the
>>>>>>>>>>>>>>>> benefit to
>>>>>>>>>>>>>>>> order-9 is around 10% in my measurements. Personally though, I'd
>>>>>>>>>>>>>>>> prefer to
>>>>>>>>>>>>>>>> play
>>>>>>>>>>>>>>>> safe and ensure the common order-0 case doesn't regress, as you
>>>>>>>>>>>>>>>> previously
>>>>>>>>>>>>>>>> suggested.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I just hacked something up, on top of my beloved rmap
>>>>>>>>>>>>>>> cleanup/batching
>>>>>>>>>>>>>>> series. I
>>>>>>>>>>>>>>> implemented very generic and simple batching for large folios
>>>>>>>>>>>>>>> (all PTE
>>>>>>>>>>>>>>> bits
>>>>>>>>>>>>>>> except the PFN have to match).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Some very quick testing (don't trust each last % ) on Intel(R)
>>>>>>>>>>>>>>> Xeon(R)
>>>>>>>>>>>>>>> Silver
>>>>>>>>>>>>>>> 4210R CPU.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> order-0: 0.014210 -> 0.013969
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -> Around 1.7 % faster
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> order-9: 0.014373 -> 0.009149
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -> Around 36.3 % faster
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Well I guess that shows me :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>>>>>>>>>>>
>>>>>>>>>>>> I've just been trying to compile and noticed this. Will take a look at
>>>>>>>>>>>> your
>>>>>>>>>>>> update.
>>>>>>>>>>>>
>>>>>>>>>>>> But upon review, I've noticed the part that I think makes this
>>>>>>>>>>>> difficult
>>>>>>>>>>>> for
>>>>>>>>>>>> arm64 with the contpte optimization; You are calling ptep_get() for
>>>>>>>>>>>> every
>>>>>>>>>>>> pte in
>>>>>>>>>>>> the batch. While this is functionally correct, once arm64 has the
>>>>>>>>>>>> contpte
>>>>>>>>>>>> changes, its ptep_get() has to read every pte in the contpte block in
>>>>>>>>>>>> order to
>>>>>>>>>>>> gather the access and dirty bits. So if your batching function ends up
>>>>>>>>>>>> walking
>>>>>>>>>>>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>>>>>>>>>>>> performance. That's why I added the arch-specific pte_batch_remaining()
>>>>>>>>>>>> function; this allows the core-mm to skip to the end of the contpte
>>>>>>>>>>>> block and
>>>>>>>>>>>> avoid ptep_get() for the 15 tail ptes. So we end up with 16
>>>>>>>>>>>> READ_ONCE()s
>>>>>>>>>>>> instead
>>>>>>>>>>>> of 256.
>>>>>>>>>>>>
>>>>>>>>>>>> I considered making a ptep_get_noyoungdirty() variant, which would
>>>>>>>>>>>> avoid
>>>>>>>>>>>> the
>>>>>>>>>>>> bit
>>>>>>>>>>>> gathering. But we have a similar problem in zap_pte_range() and that
>>>>>>>>>>>> function
>>>>>>>>>>>> needs the dirty bit to update the folio. So it doesn't work there. (see
>>>>>>>>>>>> patch 3
>>>>>>>>>>>> in my series).
>>>>>>>>>>>>
>>>>>>>>>>>> I guess you are going to say that we should combine both approaches, so
>>>>>>>>>>>> that
>>>>>>>>>>>> your batching loop can skip forward an arch-provided number of ptes?
>>>>>>>>>>>> That
>>>>>>>>>>>> would
>>>>>>>>>>>> certainly work, but feels like an orthogonal change to what I'm
>>>>>>>>>>>> trying to
>>>>>>>>>>>> achieve :). Anyway, I'll spend some time playing with it today.
>>>>>>>>>>>
>>>>>>>>>>> You can overwrite the function or add special-casing internally, yes.
>>>>>>>>>>>
>>>>>>>>>>> Right now, your patch is called "mm: Batch-copy PTE ranges during
>>>>>>>>>>> fork()"
>>>>>>>>>>> and it
>>>>>>>>>>> doesn't do any of that besides preparing for some arm64 work.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Well it allows an arch to opt-in to batching. But I see your point.
>>>>>>>>>>
>>>>>>>>>> How do you want to handle your patches? Do you want to clean them up and
>>>>>>>>>> I'll
>>>>>>>>>> base my stuff on top? Or do you want me to take them and sort it all out?
>>>>>>>>>
>>>>>>>>> Whatever you prefer, it was mostly a quick prototype to see if we can
>>>>>>>>> achieve
>>>>>>>>> decent performance.
>>>>>>>>
>>>>>>>> I'm about to run it on Altra and M2. But I assume it will show similar
>>>>>>>> results.
>>>>>>
>>>>>> OK results in, not looking great, which aligns with my previous experience.
>>>>>> That
>>>>>> said, I'm seeing some "BUG: Bad page state in process gmain  pfn:12a094" so
>>>>>> perhaps these results are not valid...
>>>>>
>>>>> I didn't see that so far on x86, maybe related to the PFN fixup?
>>>>
>>>> All I've done is define PFN_PTE_SHIFT for arm64 on top of your latest patch:
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h
>>>> b/arch/arm64/include/asm/pgtable.h
>>>> index b19a8aee684c..9eb0fd693df9 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -359,6 +359,8 @@ static inline void set_ptes(struct mm_struct *mm,
>>>>    }
>>>>    #define set_ptes set_ptes
>>>>    +#define PFN_PTE_SHIFT          PAGE_SHIFT
>>>> +
>>>>    /*
>>>>     * Huge pte definitions.
>>>>     */
>>>>
>>>>
>>>> As an aside, I think there is a bug in arm64's set_ptes() for PA > 48-bit
>>>> case. But that won't affect this.
>>>>
>>>>
>>>> With VM_DEBUG on, this is the first warning I see during boot:
>>>>
>>>>
>>>> [    0.278110] page:00000000c7ced4e8 refcount:12 mapcount:0
>>>> mapping:00000000b2f9739b index:0x1a8 pfn:0x1bff30
>>>> [    0.278742] head:00000000c7ced4e8 order:2 entire_mapcount:0
>>>> nr_pages_mapped:2 pincount:0
>>>
>>> ^ Ah, you are running with mTHP. Let me play with that.
>>
>> Err... It's in mm-unstable, but I'm not enabling any sizes. It should only be set
>> up for PMD-sized THP.
>>
>> I am using XFS though, so I imagine it's a file folio.
>>
>> I've rebased your rmap cleanup and fork batching to the version of mm-unstable
>> that I was doing all my other testing with so I could compare numbers. But its
>> not very old (perhaps a week). All the patches applied without any conflict.
>
> I think it was something stupid: I would get "17" from folio_pte_batch() for an
> order-4 folio, but only sometimes. The rmap sanity checks were definitely worth
> it :)
>
> I guess we hit the case "next mapped folio is actually the next physical folio"
> and the detection for that was off by one.
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 187d1b9b70e2..2af34add7ed7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -975,7 +975,7 @@ static inline int folio_pte_batch(struct folio *folio,
> unsigned long addr,
>                  * corner cases the next PFN might fall into a different
>                  * folio.
>                  */
> -               if (pte_pfn(pte) == folio_end_pfn - 1)
> +               if (pte_pfn(pte) == folio_end_pfn)
>                         break;
>

haha, of course! I've been staring at this for an hour and didn't notice.

I no longer see any warnings during boot with debug enabled. Will rerun perf
measurements.


> Briefly tested, have to do more testing.
>
> I only tested with order-9, which means max_nr would cap at 512. Shouldn't
> affect the performance measurements, will redo them.
>


2023-12-20 13:06:18

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20.12.23 13:04, Ryan Roberts wrote:
> On 20/12/2023 11:58, David Hildenbrand wrote:
>> On 20.12.23 12:51, Ryan Roberts wrote:
>>> On 20/12/2023 11:36, David Hildenbrand wrote:
>>>> On 20.12.23 12:28, Ryan Roberts wrote:
>>>>> On 20/12/2023 10:56, David Hildenbrand wrote:
>>>>>> On 20.12.23 11:41, Ryan Roberts wrote:
>>>>>>> On 20/12/2023 10:16, David Hildenbrand wrote:
>>>>>>>> On 20.12.23 11:11, Ryan Roberts wrote:
>>>>>>>>> On 20/12/2023 09:54, David Hildenbrand wrote:
>>>>>>>>>> On 20.12.23 10:51, Ryan Roberts wrote:
>>>>>>>>>>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>>>>>>>>>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>>>>>>>>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>>>>>>>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>>>>>>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>>>>>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A given
>>>>>>>>>>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>>>>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of
>>>>>>>>>>>>>>>>> memory,
>>>>>>>>>>>>>>>>> all belonging to the same folio. A pte batch is then
>>>>>>>>>>>>>>>>> write-protected in
>>>>>>>>>>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects()
>>>>>>>>>>>>>>>>> and is
>>>>>>>>>>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The primary motivation for this change is to reduce the number
>>>>>>>>>>>>>>>>> of tlb
>>>>>>>>>>>>>>>>> maintenance operations that the arm64 backend has to perform during
>>>>>>>>>>>>>>>>> fork, as it is about to add transparent support for the "contiguous
>>>>>>>>>>>>>>>>> bit"
>>>>>>>>>>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>>>>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>>>>>>>>>>>> backend
>>>>>>>>>>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is
>>>>>>>>>>>>>>>>> expensive,
>>>>>>>>>>>>>>>>> when all ptes in the range are being write-protected. Similarly, by
>>>>>>>>>>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> child, the backend does not need to fold a contiguous range once
>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>> are all populated - they can be initially populated as a contiguous
>>>>>>>>>>>>>>>>> range in the first place.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This code is very performance sensitive, and a significant
>>>>>>>>>>>>>>>>> amount of
>>>>>>>>>>>>>>>>> effort has been put into not regressing performance for the order-0
>>>>>>>>>>>>>>>>> folio case. By default, pte_batch_remaining() is compile
>>>>>>>>>>>>>>>>> constant 1,
>>>>>>>>>>>>>>>>> which enables the compiler to simplify the extra loops that are
>>>>>>>>>>>>>>>>> added
>>>>>>>>>>>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>>>>>>>>>>>> performant) as the previous implementation.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This change addresses the core-mm refactoring only and a separate
>>>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>>>>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>>>>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To ensure the arm64 is performant once implemented, this change is
>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The following microbenchmark results demonstrate that there is no
>>>>>>>>>>>>>>>>> significant performance change after this patch. Fork is called
>>>>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>>> tight loop in a process with 1G of populated memory and the time
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>>>>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal).
>>>>>>>>>>>>>>>>> Tests
>>>>>>>>>>>>>>>>> performed for case where 1G memory is comprised of order-0
>>>>>>>>>>>>>>>>> folios and
>>>>>>>>>>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is
>>>>>>>>>>>>>>>>> faster,
>>>>>>>>>>>>>>>>> positive is slower, compared to baseline upon which the series is
>>>>>>>>>>>>>>>>> based:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>>>>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>>>>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>>>>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>>>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>          include/linux/pgtable.h | 80
>>>>>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>>>          mm/memory.c             | 92
>>>>>>>>>>>>>>>>> ++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>>>>          2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>>>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>>>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>>>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>>>>>>>>>>          #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>>>>>>>>>>          #endif
>>>>>>>>>>>>>>>>>          +#ifndef pte_batch_remaining
>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch
>>>>>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>>>>>>>>>>> + * @addr: Address of the first page.
>>>>>>>>>>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * Some architectures (arm64) can efficiently modify a contiguous
>>>>>>>>>>>>>>>>> batch of
>>>>>>>>>>>>>>>>> ptes.
>>>>>>>>>>>>>>>>> + * In such cases, this function returns the remaining number of
>>>>>>>>>>>>>>>>> pages to
>>>>>>>>>>>>>>>>> the end
>>>>>>>>>>>>>>>>> + * of the current batch, as defined by addr. This can be useful
>>>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>>> iterating
>>>>>>>>>>>>>>>>> + * over ptes.
>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>> + * May be overridden by the architecture, else batch size is
>>>>>>>>>>>>>>>>> always 1.
>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte, unsigned
>>>>>>>>>>>>>>>>> long
>>>>>>>>>>>>>>>>> addr,
>>>>>>>>>>>>>>>>> +                        unsigned long end)
>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>> +    return 1;
>>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's a shame we now lose the optimization for all other
>>>>>>>>>>>>>>>> architectures.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Was there no way to have some basic batching mechanism that doesn't
>>>>>>>>>>>>>>>> require
>>>>>>>>>>>>>>>> arch
>>>>>>>>>>>>>>>> specifics?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried a bunch of things but ultimately the way I've done it was the
>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>> way
>>>>>>>>>>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My original v3 posting was costing 5% extra and even my first attempt
>>>>>>>>>>>>>>> at an
>>>>>>>>>>>>>>> arch-specific version that didn't resolve to a compile-time
>>>>>>>>>>>>>>> constant 1
>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>> cost an extra 3%.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>>>>>>>>>>> * Check that PFN is consecutive
>>>>>>>>>>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I haven't tried this exact approach, but I'd be surprised if I can
>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> regression under 4% with this. Further along the series I spent a
>>>>>>>>>>>>>>> lot of
>>>>>>>>>>>>>>> time
>>>>>>>>>>>>>>> having to fiddle with the arm64 implementation; every conditional and
>>>>>>>>>>>>>>> every
>>>>>>>>>>>>>>> memory read (even when in cache) was a problem. There is just so
>>>>>>>>>>>>>>> little in
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> inner loop that every instruction matters. (At least on Ampere Altra
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> Apple
>>>>>>>>>>>>>>> M2).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the
>>>>>>>>>>>>>>> benefit to
>>>>>>>>>>>>>>> order-9 is around 10% in my measurements. Personally though, I'd
>>>>>>>>>>>>>>> prefer to
>>>>>>>>>>>>>>> play
>>>>>>>>>>>>>>> safe and ensure the common order-0 case doesn't regress, as you
>>>>>>>>>>>>>>> previously
>>>>>>>>>>>>>>> suggested.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I just hacked something up, on top of my beloved rmap cleanup/batching
>>>>>>>>>>>>>> series. I
>>>>>>>>>>>>>> implemented very generic and simple batching for large folios (all PTE
>>>>>>>>>>>>>> bits
>>>>>>>>>>>>>> except the PFN have to match).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Some very quick testing (don't trust each last % ) on Intel(R) Xeon(R)
>>>>>>>>>>>>>> Silver
>>>>>>>>>>>>>> 4210R CPU.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> order-0: 0.014210 -> 0.013969
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -> Around 1.7 % faster
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> order-9: 0.014373 -> 0.009149
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -> Around 36.3 % faster
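For illustration, the generic batching being described could look roughly like the sketch below. This is a hypothetical sketch, not the actual prototype: it assumes a pte_next_pfn() helper (mentioned later in the thread) that returns the pte with its PFN advanced by one page, and it expects the caller to cap max_nr at the number of pages left in the folio and in the range being copied.

#include <linux/pgtable.h>

/*
 * Hypothetical sketch: count how many ptes starting at @start_ptep map
 * consecutive PFNs and are otherwise identical (all PTE bits except the
 * PFN have to match). @max_nr must already be capped by the caller at
 * the number of pages remaining in the folio and in the copy range.
 */
static inline int sketch_pte_batch(pte_t *start_ptep, pte_t pte, int max_nr)
{
	pte_t expected = pte_next_pfn(pte);
	pte_t *ptep = start_ptep + 1;
	int nr = 1;

	while (nr < max_nr) {
		if (!pte_same(ptep_get(ptep), expected))
			break;
		expected = pte_next_pfn(expected);
		ptep++;
		nr++;
	}
	return nr;
}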
>>>>>>>>>>>>>
>>>>>>>>>>>>> Well I guess that shows me :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>>>>>>>>>>
>>>>>>>>>>> I've just been trying to compile and noticed this. Will take a look at
>>>>>>>>>>> your
>>>>>>>>>>> update.
>>>>>>>>>>>
>>>>>>>>>>> But upon review, I've noticed the part that I think makes this difficult
>>>>>>>>>>> for
>>>>>>>>>>> arm64 with the contpte optimization; You are calling ptep_get() for every
>>>>>>>>>>> pte in
>>>>>>>>>>> the batch. While this is functionally correct, once arm64 has the contpte
>>>>>>>>>>> changes, its ptep_get() has to read every pte in the contpte block in
>>>>>>>>>>> order to
>>>>>>>>>>> gather the access and dirty bits. So if your batching function ends up
>>>>>>>>>>> walking
>>>>>>>>>>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>>>>>>>>>>> performance. That's why I added the arch-specific pte_batch_remaining()
>>>>>>>>>>> function; this allows the core-mm to skip to the end of the contpte
>>>>>>>>>>> block and
>>>>>>>>>>> avoid ptep_get() for the 15 tail ptes. So we end up with 16 READ_ONCE()s
>>>>>>>>>>> instead
>>>>>>>>>>> of 256.
>>>>>>>>>>>
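To make the arm64 side of that concrete, an override along the following lines would let the core loop skip the tail ptes of a contpte block. This is a hypothetical sketch, not the patch in this series; it assumes arm64's pte_valid_cont() and CONT_PTE_SIZE definitions.

/*
 * Hypothetical arm64-style override: for a contpte mapping, report the
 * number of pages up to the next naturally aligned contpte boundary (or
 * up to @end, whichever comes first), so the caller performs a single
 * ptep_get() per contpte block instead of one per pte.
 */
#define pte_batch_remaining pte_batch_remaining
static inline unsigned int pte_batch_remaining(pte_t pte, unsigned long addr,
					       unsigned long end)
{
	unsigned long next;

	if (!pte_valid_cont(pte))
		return 1;

	next = ALIGN(addr + 1, CONT_PTE_SIZE);
	return (min(next, end) - addr) >> PAGE_SHIFT;
}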
>>>>>>>>>>> I considered making a ptep_get_noyoungdirty() variant, which would avoid
>>>>>>>>>>> the
>>>>>>>>>>> bit
>>>>>>>>>>> gathering. But we have a similar problem in zap_pte_range() and that
>>>>>>>>>>> function
>>>>>>>>>>> needs the dirty bit to update the folio. So it doesn't work there. (see
>>>>>>>>>>> patch 3
>>>>>>>>>>> in my series).
>>>>>>>>>>>
>>>>>>>>>>> I guess you are going to say that we should combine both approaches, so
>>>>>>>>>>> that
>>>>>>>>>>> your batching loop can skip forward an arch-provided number of ptes? That
>>>>>>>>>>> would
>>>>>>>>>>> certainly work, but feels like an orthogonal change to what I'm trying to
>>>>>>>>>>> achieve :). Anyway, I'll spend some time playing with it today.
>>>>>>>>>>
>>>>>>>>>> You can overwrite the function or add special-casing internally, yes.
>>>>>>>>>>
>>>>>>>>>> Right now, your patch is called "mm: Batch-copy PTE ranges during fork()"
>>>>>>>>>> and it
>>>>>>>>>> doesn't do any of that besides preparing for some arm64 work.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Well it allows an arch to opt-in to batching. But I see your point.
>>>>>>>>>
>>>>>>>>> How do you want to handle your patches? Do you want to clean them up and
>>>>>>>>> I'll
>>>>>>>>> base my stuff on top? Or do you want me to take them and sort it all out?
>>>>>>>>
>>>>>>>> Whatever you prefer, it was mostly a quick prototype to see if we can
>>>>>>>> achieve
>>>>>>>> decent performance.
>>>>>>>
>>>>>>> I'm about to run it on Altra and M2. But I assume it will show similar
>>>>>>> results.
>>>>>
>>>>> OK results in, not looking great, which aligns with my previous experience.
>>>>> That
>>>>> said, I'm seeing some "BUG: Bad page state in process gmain  pfn:12a094" so
>>>>> perhaps these results are not valid...
>>>>
>>>> I didn't see that so far on x86, maybe related to the PFN fixup?
>>>
>>> All I've done is define PFN_PTE_SHIFT for arm64 on top of your latest patch:
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index b19a8aee684c..9eb0fd693df9 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -359,6 +359,8 @@ static inline void set_ptes(struct mm_struct *mm,
>>>   }
>>>   #define set_ptes set_ptes
>>>   +#define PFN_PTE_SHIFT          PAGE_SHIFT
>>> +
>>>   /*
>>>    * Huge pte definitions.
>>>    */
>>>
>>>
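For context, the pte_next_pfn() helper that the compile fixup needs is presumably derived from PFN_PTE_SHIFT along these lines (a sketch of the assumed generic fallback, which an architecture can still override):

#ifndef pte_next_pfn
static inline pte_t pte_next_pfn(pte_t pte)
{
	/* Advance the encoded PFN by one page. */
	return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
}
#endif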
>>> As an aside, I think there is a bug in arm64's set_ptes() for PA > 48-bit
>>> case. But that won't affect this.
>>>
>>>
>>> With VM_DEBUG on, this is the first warning I see during boot:
>>>
>>>
>>> [    0.278110] page:00000000c7ced4e8 refcount:12 mapcount:0
>>> mapping:00000000b2f9739b index:0x1a8 pfn:0x1bff30
>>> [    0.278742] head:00000000c7ced4e8 order:2 entire_mapcount:0
>>> nr_pages_mapped:2 pincount:0
>>
>> ^ Ah, you are running with mTHP. Let me play with that.
>
> Err... It's in mm-unstable, but I'm not enabling any sizes. It should only be set
> up for PMD-sized THP.
>
> I am using XFS though, so I imagine it's a file folio.
>
> I've rebased your rmap cleanup and fork batching to the version of mm-unstable
> that I was doing all my other testing with so I could compare numbers. But it's
> not very old (perhaps a week). All the patches applied without any conflict.


It would also be interesting to know if the compiler on arm64 decides to
do something stupid: like not inline wrprotect_ptes().

Because with an effective unlikely(folio_test_large(folio)) we shouldn't
see that much overhead.
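The shape being assumed here is roughly the following sketch (hypothetical, not the actual mm-unstable code, and reusing the sketch_pte_batch() sketch from earlier): order-0 folios should only pay for one well-predicted branch, provided helpers such as wrprotect_ptes() end up inlined.

static inline int sketch_nr_to_copy(struct folio *folio, pte_t *src_pte,
				    pte_t pte, int max_nr)
{
	/* Small folios take the likely path and never reach the batching. */
	if (likely(!folio_test_large(folio)))
		return 1;

	return sketch_pte_batch(src_pte, pte, max_nr);
}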

--
Cheers,

David / dhildenb


2023-12-20 13:10:53

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20/12/2023 13:06, David Hildenbrand wrote:
> On 20.12.23 13:04, Ryan Roberts wrote:
>> On 20/12/2023 11:58, David Hildenbrand wrote:
>>> On 20.12.23 12:51, Ryan Roberts wrote:
>>>> On 20/12/2023 11:36, David Hildenbrand wrote:
>>>>> On 20.12.23 12:28, Ryan Roberts wrote:
>>>>>> On 20/12/2023 10:56, David Hildenbrand wrote:
>>>>>>> On 20.12.23 11:41, Ryan Roberts wrote:
>>>>>>>> On 20/12/2023 10:16, David Hildenbrand wrote:
>>>>>>>>> On 20.12.23 11:11, Ryan Roberts wrote:
>>>>>>>>>> On 20/12/2023 09:54, David Hildenbrand wrote:
>>>>>>>>>>> On 20.12.23 10:51, Ryan Roberts wrote:
>>>>>>>>>>>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>>>>>>>>>>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>>>>>>>>>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>>>>>>>>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>>>>>>>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>>>>>>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A
>>>>>>>>>>>>>>>>>> given
>>>>>>>>>>>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>>>>>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of
>>>>>>>>>>>>>>>>>> memory,
>>>>>>>>>>>>>>>>>> all belonging to the same folio. A pte batch is then
>>>>>>>>>>>>>>>>>> write-protected in
>>>>>>>>>>>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects()
>>>>>>>>>>>>>>>>>> and is
>>>>>>>>>>>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The primary motivation for this change is to reduce the number
>>>>>>>>>>>>>>>>>> of tlb
>>>>>>>>>>>>>>>>>> maintenance operations that the arm64 backend has to perform
>>>>>>>>>>>>>>>>>> during
>>>>>>>>>>>>>>>>>> fork, as it is about to add transparent support for the
>>>>>>>>>>>>>>>>>> "contiguous
>>>>>>>>>>>>>>>>>> bit"
>>>>>>>>>>>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>>>>>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>>>>>>>>>>>>> backend
>>>>>>>>>>>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is
>>>>>>>>>>>>>>>>>> expensive,
>>>>>>>>>>>>>>>>>> when all ptes in the range are being write-protected.
>>>>>>>>>>>>>>>>>> Similarly, by
>>>>>>>>>>>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> child, the backend does not need to fold a contiguous range once
>>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>> are all populated - they can be initially populated as a
>>>>>>>>>>>>>>>>>> contiguous
>>>>>>>>>>>>>>>>>> range in the first place.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This code is very performance sensitive, and a significant
>>>>>>>>>>>>>>>>>> amount of
>>>>>>>>>>>>>>>>>> effort has been put into not regressing performance for the
>>>>>>>>>>>>>>>>>> order-0
>>>>>>>>>>>>>>>>>> folio case. By default, pte_batch_remaining() is compile
>>>>>>>>>>>>>>>>>> constant 1,
>>>>>>>>>>>>>>>>>> which enables the compiler to simplify the extra loops that are
>>>>>>>>>>>>>>>>>> added
>>>>>>>>>>>>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>>>>>>>>>>>>> performant) to the previous implementation.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This change addresses the core-mm refactoring only and a separate
>>>>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>>>>>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>>>>>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> To ensure the arm64 is performant once implemented, this
>>>>>>>>>>>>>>>>>> change is
>>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The following microbenchmark results demonstrate that there is no
>>>>>>>>>>>>>>>>>> significant performance change after this patch. Fork is called
>>>>>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>>>> tight loop in a process with 1G of populated memory and the time
>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>>>>>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal).
>>>>>>>>>>>>>>>>>> Tests
>>>>>>>>>>>>>>>>>> performed for case where 1G memory is comprised of order-0
>>>>>>>>>>>>>>>>>> folios and
>>>>>>>>>>>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is
>>>>>>>>>>>>>>>>>> faster,
>>>>>>>>>>>>>>>>>> positive is slower, compared to baseline upon which the series is
>>>>>>>>>>>>>>>>>> based:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>>>>>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>>>>>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>>>>>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>>>>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>           include/linux/pgtable.h | 80
>>>>>>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>>>>           mm/memory.c             | 92
>>>>>>>>>>>>>>>>>> ++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>>>>>           2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>>>>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>>>>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>>>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>>>>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>>>>>>>>>>>           #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>>>>>>>>>>>           #endif
>>>>>>>>>>>>>>>>>>           +#ifndef pte_batch_remaining
>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch
>>>>>>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>>>>>>>>>>>> + * @addr: Address of the first page.
>>>>>>>>>>>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>> + * Some architectures (arm64) can efficiently modify a
>>>>>>>>>>>>>>>>>> contiguous
>>>>>>>>>>>>>>>>>> batch of
>>>>>>>>>>>>>>>>>> ptes.
>>>>>>>>>>>>>>>>>> + * In such cases, this function returns the remaining number of
>>>>>>>>>>>>>>>>>> pages to
>>>>>>>>>>>>>>>>>> the end
>>>>>>>>>>>>>>>>>> + * of the current batch, as defined by addr. This can be useful
>>>>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>>>> iterating
>>>>>>>>>>>>>>>>>> + * over ptes.
>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>> + * May be overridden by the architecture, else batch size is
>>>>>>>>>>>>>>>>>> always 1.
>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte,
>>>>>>>>>>>>>>>>>> unsigned
>>>>>>>>>>>>>>>>>> long
>>>>>>>>>>>>>>>>>> addr,
>>>>>>>>>>>>>>>>>> +                        unsigned long end)
>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>> +    return 1;
>>>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It's a shame we now lose the optimization for all other
>>>>>>>>>>>>>>>>> architectures.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Was there no way to have some basic batching mechanism that
>>>>>>>>>>>>>>>>> doesn't
>>>>>>>>>>>>>>>>> require
>>>>>>>>>>>>>>>>> arch
>>>>>>>>>>>>>>>>> specifics?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I tried a bunch of things but ultimately the way I've done it
>>>>>>>>>>>>>>>> was the
>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>> way
>>>>>>>>>>>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My original v3 posting was costing 5% extra and even my first
>>>>>>>>>>>>>>>> attempt
>>>>>>>>>>>>>>>> at an
>>>>>>>>>>>>>>>> arch-specific version that didn't resolve to a compile-time
>>>>>>>>>>>>>>>> constant 1
>>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>> cost an extra 3%.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>>>>>>>>>>>> * Check that PFN is consecutive
>>>>>>>>>>>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I haven't tried this exact approach, but I'd be surprised if I can
>>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> regression under 4% with this. Further along the series I spent a
>>>>>>>>>>>>>>>> lot of
>>>>>>>>>>>>>>>> time
>>>>>>>>>>>>>>>> having to fiddle with the arm64 implementation; every
>>>>>>>>>>>>>>>> conditional and
>>>>>>>>>>>>>>>> every
>>>>>>>>>>>>>>>> memory read (even when in cache) was a problem. There is just so
>>>>>>>>>>>>>>>> little in
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> inner loop that every instruction matters. (At least on Ampere
>>>>>>>>>>>>>>>> Altra
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> Apple
>>>>>>>>>>>>>>>> M2).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the
>>>>>>>>>>>>>>>> benefit to
>>>>>>>>>>>>>>>> order-9 is around 10% in my measurements. Personally though, I'd
>>>>>>>>>>>>>>>> prefer to
>>>>>>>>>>>>>>>> play
>>>>>>>>>>>>>>>> safe and ensure the common order-0 case doesn't regress, as you
>>>>>>>>>>>>>>>> previously
>>>>>>>>>>>>>>>> suggested.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I just hacked something up, on top of my beloved rmap
>>>>>>>>>>>>>>> cleanup/batching
>>>>>>>>>>>>>>> series. I
>>>>>>>>>>>>>>> implemented very generic and simple batching for large folios
>>>>>>>>>>>>>>> (all PTE
>>>>>>>>>>>>>>> bits
>>>>>>>>>>>>>>> except the PFN have to match).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Some very quick testing (don't trust each last % ) on Intel(R)
>>>>>>>>>>>>>>> Xeon(R)
>>>>>>>>>>>>>>> Silver
>>>>>>>>>>>>>>> 4210R CPU.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> order-0: 0.014210 -> 0.013969
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -> Around 1.7 % faster
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> order-9: 0.014373 -> 0.009149
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -> Around 36.3 % faster
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Well I guess that shows me :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>>>>>>>>>>>
>>>>>>>>>>>> I've just been trying to compile and noticed this. Will take a look at
>>>>>>>>>>>> your
>>>>>>>>>>>> update.
>>>>>>>>>>>>
>>>>>>>>>>>> But upon review, I've noticed the part that I think makes this
>>>>>>>>>>>> difficult
>>>>>>>>>>>> for
>>>>>>>>>>>> arm64 with the contpte optimization; You are calling ptep_get() for
>>>>>>>>>>>> every
>>>>>>>>>>>> pte in
>>>>>>>>>>>> the batch. While this is functionally correct, once arm64 has the
>>>>>>>>>>>> contpte
>>>>>>>>>>>> changes, its ptep_get() has to read every pte in the contpte block in
>>>>>>>>>>>> order to
>>>>>>>>>>>> gather the access and dirty bits. So if your batching function ends up
>>>>>>>>>>>> walking
>>>>>>>>>>>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>>>>>>>>>>>> performance. That's why I added the arch-specific pte_batch_remaining()
>>>>>>>>>>>> function; this allows the core-mm to skip to the end of the contpte
>>>>>>>>>>>> block and
>>>>>>>>>>>> avoid ptep_get() for the 15 tail ptes. So we end up with 16
>>>>>>>>>>>> READ_ONCE()s
>>>>>>>>>>>> instead
>>>>>>>>>>>> of 256.
>>>>>>>>>>>>
>>>>>>>>>>>> I considered making a ptep_get_noyoungdirty() variant, which would
>>>>>>>>>>>> avoid
>>>>>>>>>>>> the
>>>>>>>>>>>> bit
>>>>>>>>>>>> gathering. But we have a similar problem in zap_pte_range() and that
>>>>>>>>>>>> function
>>>>>>>>>>>> needs the dirty bit to update the folio. So it doesn't work there. (see
>>>>>>>>>>>> patch 3
>>>>>>>>>>>> in my series).
>>>>>>>>>>>>
>>>>>>>>>>>> I guess you are going to say that we should combine both approaches, so
>>>>>>>>>>>> that
>>>>>>>>>>>> your batching loop can skip forward an arch-provided number of ptes?
>>>>>>>>>>>> That
>>>>>>>>>>>> would
>>>>>>>>>>>> certainly work, but feels like an orthogonal change to what I'm
>>>>>>>>>>>> trying to
>>>>>>>>>>>> achieve :). Anyway, I'll spend some time playing with it today.
>>>>>>>>>>>
>>>>>>>>>>> You can overwrite the function or add special-casing internally, yes.
>>>>>>>>>>>
>>>>>>>>>>> Right now, your patch is called "mm: Batch-copy PTE ranges during
>>>>>>>>>>> fork()"
>>>>>>>>>>> and it
>>>>>>>>>>> doesn't do any of that besides preparing for some arm64 work.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Well it allows an arch to opt-in to batching. But I see your point.
>>>>>>>>>>
>>>>>>>>>> How do you want to handle your patches? Do you want to clean them up and
>>>>>>>>>> I'll
>>>>>>>>>> base my stuff on top? Or do you want me to take them and sort it all out?
>>>>>>>>>
>>>>>>>>> Whatever you prefer, it was mostly a quick prototype to see if we can
>>>>>>>>> achieve
>>>>>>>>> decent performance.
>>>>>>>>
>>>>>>>> I'm about to run it on Altra and M2. But I assume it will show similar
>>>>>>>> results.
>>>>>>
>>>>>> OK results in, not looking great, which aligns with my previous experience.
>>>>>> That
>>>>>> said, I'm seeing some "BUG: Bad page state in process gmain  pfn:12a094" so
>>>>>> perhaps these results are not valid...
>>>>>
>>>>> I didn't see that so far on x86, maybe related to the PFN fixup?
>>>>
>>>> All I've done is define PFN_PTE_SHIFT for arm64 on top of your latest patch:
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h
>>>> b/arch/arm64/include/asm/pgtable.h
>>>> index b19a8aee684c..9eb0fd693df9 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -359,6 +359,8 @@ static inline void set_ptes(struct mm_struct *mm,
>>>>    }
>>>>    #define set_ptes set_ptes
>>>>    +#define PFN_PTE_SHIFT          PAGE_SHIFT
>>>> +
>>>>    /*
>>>>     * Huge pte definitions.
>>>>     */
>>>>
>>>>
>>>> As an aside, I think there is a bug in arm64's set_ptes() for PA > 48-bit
>>>> case. But that won't affect this.
>>>>
>>>>
>>>> With VM_DEBUG on, this is the first warning I see during boot:
>>>>
>>>>
>>>> [    0.278110] page:00000000c7ced4e8 refcount:12 mapcount:0
>>>> mapping:00000000b2f9739b index:0x1a8 pfn:0x1bff30
>>>> [    0.278742] head:00000000c7ced4e8 order:2 entire_mapcount:0
>>>> nr_pages_mapped:2 pincount:0
>>>
>>> ^ Ah, you are running with mTHP. Let me play with that.
>>
>> Err... It's in mm-unstable, but I'm not enabling any sizes. It should only be set
>> up for PMD-sized THP.
>>
>> I am using XFS though, so I imagine it's a file folio.
>>
>> I've rebased your rmap cleanup and fork batching to the version of mm-unstable
>> that I was doing all my other testing with so I could compare numbers. But it's
>> not very old (perhaps a week). All the patches applied without any conflict.
>
>
> It would also be interesting to know if the compiler on arm64 decides to do
> something stupid: like not inline wrprotect_ptes().
>
> Because with an effective unlikely(folio_test_large(folio)) we shouldn't see
> that much overhead.
>

What version of gcc are you using? I must confess I'm using the Ubuntu 20.04
default version:

aarch64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

Perhaps I should grab something a bit newer?


2023-12-20 13:13:44

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20.12.23 14:10, Ryan Roberts wrote:
> On 20/12/2023 13:06, David Hildenbrand wrote:
>> On 20.12.23 13:04, Ryan Roberts wrote:
>>> On 20/12/2023 11:58, David Hildenbrand wrote:
>>>> On 20.12.23 12:51, Ryan Roberts wrote:
>>>>> On 20/12/2023 11:36, David Hildenbrand wrote:
>>>>>> On 20.12.23 12:28, Ryan Roberts wrote:
>>>>>>> On 20/12/2023 10:56, David Hildenbrand wrote:
>>>>>>>> On 20.12.23 11:41, Ryan Roberts wrote:
>>>>>>>>> On 20/12/2023 10:16, David Hildenbrand wrote:
>>>>>>>>>> On 20.12.23 11:11, Ryan Roberts wrote:
>>>>>>>>>>> On 20/12/2023 09:54, David Hildenbrand wrote:
>>>>>>>>>>>> On 20.12.23 10:51, Ryan Roberts wrote:
>>>>>>>>>>>>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>>>>>>>>>>>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>>>>>>>>>>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>>>>>>>>>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>>>>>>>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A
>>>>>>>>>>>>>>>>>>> given
>>>>>>>>>>>>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>>>>>>>>>>>>> pte_batch_remaining(), and maps a physically contiguous block of
>>>>>>>>>>>>>>>>>>> memory,
>>>>>>>>>>>>>>>>>>> all belonging to the same folio. A pte batch is then
>>>>>>>>>>>>>>>>>>> write-protected in
>>>>>>>>>>>>>>>>>>> one go in the parent using the new helper, ptep_set_wrprotects()
>>>>>>>>>>>>>>>>>>> and is
>>>>>>>>>>>>>>>>>>> set in one go in the child using the new helper, set_ptes_full().
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The primary motivation for this change is to reduce the number
>>>>>>>>>>>>>>>>>>> of tlb
>>>>>>>>>>>>>>>>>>> maintenance operations that the arm64 backend has to perform
>>>>>>>>>>>>>>>>>>> during
>>>>>>>>>>>>>>>>>>> fork, as it is about to add transparent support for the
>>>>>>>>>>>>>>>>>>> "contiguous
>>>>>>>>>>>>>>>>>>> bit"
>>>>>>>>>>>>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>>>>>>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>>>>>>>>>>>>>> backend
>>>>>>>>>>>>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is
>>>>>>>>>>>>>>>>>>> expensive,
>>>>>>>>>>>>>>>>>>> when all ptes in the range are being write-protected.
>>>>>>>>>>>>>>>>>>> Similarly, by
>>>>>>>>>>>>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up ptes in
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> child, the backend does not need to fold a contiguous range once
>>>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>>> are all populated - they can be initially populated as a
>>>>>>>>>>>>>>>>>>> contiguous
>>>>>>>>>>>>>>>>>>> range in the first place.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This code is very performance sensitive, and a significant
>>>>>>>>>>>>>>>>>>> amount of
>>>>>>>>>>>>>>>>>>> effort has been put into not regressing performance for the
>>>>>>>>>>>>>>>>>>> order-0
>>>>>>>>>>>>>>>>>>> folio case. By default, pte_batch_remaining() is compile
>>>>>>>>>>>>>>>>>>> constant 1,
>>>>>>>>>>>>>>>>>>> which enables the compiler to simplify the extra loops that are
>>>>>>>>>>>>>>>>>>> added
>>>>>>>>>>>>>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>>>>>>>>>>>>>> performant) to the previous implementation.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This change addresses the core-mm refactoring only and a separate
>>>>>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>>>>>>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>>>>>>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> To ensure the arm64 is performant once implemented, this
>>>>>>>>>>>>>>>>>>> change is
>>>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The following microbenchmark results demonstrate that there is no
>>>>>>>>>>>>>>>>>>> significant performance change after this patch. Fork is called
>>>>>>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>>>>> tight loop in a process with 1G of populated memory and the time
>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>>>>>>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal).
>>>>>>>>>>>>>>>>>>> Tests
>>>>>>>>>>>>>>>>>>> performed for case where 1G memory is comprised of order-0
>>>>>>>>>>>>>>>>>>> folios and
>>>>>>>>>>>>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is
>>>>>>>>>>>>>>>>>>> faster,
>>>>>>>>>>>>>>>>>>> positive is slower, compared to baseline upon which the series is
>>>>>>>>>>>>>>>>>>> based:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>>>>>>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>>>>>>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>>>>>>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>>>>>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>           include/linux/pgtable.h | 80
>>>>>>>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>>>>>           mm/memory.c             | 92
>>>>>>>>>>>>>>>>>>> ++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>>>>>>           2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>>>>>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>>>>>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>>>>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>>>>>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>>>>>>>>>>>>           #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>>>>>>>>>>>>>>>>>>           #endif
>>>>>>>>>>>>>>>>>>>           +#ifndef pte_batch_remaining
>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next batch
>>>>>>>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>>>>>>>>>>>>> + * @addr: Address of the first page.
>>>>>>>>>>>>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>> + * Some architectures (arm64) can efficiently modify a
>>>>>>>>>>>>>>>>>>> contiguous
>>>>>>>>>>>>>>>>>>> batch of
>>>>>>>>>>>>>>>>>>> ptes.
>>>>>>>>>>>>>>>>>>> + * In such cases, this function returns the remaining number of
>>>>>>>>>>>>>>>>>>> pages to
>>>>>>>>>>>>>>>>>>> the end
>>>>>>>>>>>>>>>>>>> + * of the current batch, as defined by addr. This can be useful
>>>>>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>>>>> iterating
>>>>>>>>>>>>>>>>>>> + * over ptes.
>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>> + * May be overridden by the architecture, else batch size is
>>>>>>>>>>>>>>>>>>> always 1.
>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte,
>>>>>>>>>>>>>>>>>>> unsigned
>>>>>>>>>>>>>>>>>>> long
>>>>>>>>>>>>>>>>>>> addr,
>>>>>>>>>>>>>>>>>>> +                        unsigned long end)
>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>> +    return 1;
>>>>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It's a shame we now lose the optimization for all other
>>>>>>>>>>>>>>>>>> architectures.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Was there no way to have some basic batching mechanism that
>>>>>>>>>>>>>>>>>> doesn't
>>>>>>>>>>>>>>>>>> require
>>>>>>>>>>>>>>>>>> arch
>>>>>>>>>>>>>>>>>> specifics?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I tried a bunch of things but ultimately the way I've done it
>>>>>>>>>>>>>>>>> was the
>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>> way
>>>>>>>>>>>>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> My original v3 posting was costing 5% extra and even my first
>>>>>>>>>>>>>>>>> attempt
>>>>>>>>>>>>>>>>> at an
>>>>>>>>>>>>>>>>> arch-specific version that didn't resolve to a compile-time
>>>>>>>>>>>>>>>>> constant 1
>>>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>> cost an extra 3%.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'd have thought that something very basic would have worked like:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>>>>>>>>>>>>> * Check that PFN is consecutive
>>>>>>>>>>>>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I haven't tried this exact approach, but I'd be surprised if I can
>>>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> regression under 4% with this. Further along the series I spent a
>>>>>>>>>>>>>>>>> lot of
>>>>>>>>>>>>>>>>> time
>>>>>>>>>>>>>>>>> having to fiddle with the arm64 implementation; every
>>>>>>>>>>>>>>>>> conditional and
>>>>>>>>>>>>>>>>> every
>>>>>>>>>>>>>>>>> memory read (even when in cache) was a problem. There is just so
>>>>>>>>>>>>>>>>> little in
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> inner loop that every instruction matters. (At least on Ampere
>>>>>>>>>>>>>>>>> Altra
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> Apple
>>>>>>>>>>>>>>>>> M2).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the
>>>>>>>>>>>>>>>>> benefit to
>>>>>>>>>>>>>>>>> order-9 is around 10% in my measurements. Personally though, I'd
>>>>>>>>>>>>>>>>> prefer to
>>>>>>>>>>>>>>>>> play
>>>>>>>>>>>>>>>>> safe and ensure the common order-0 case doesn't regress, as you
>>>>>>>>>>>>>>>>> previously
>>>>>>>>>>>>>>>>> suggested.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I just hacked something up, on top of my beloved rmap
>>>>>>>>>>>>>>>> cleanup/batching
>>>>>>>>>>>>>>>> series. I
>>>>>>>>>>>>>>>> implemented very generic and simple batching for large folios
>>>>>>>>>>>>>>>> (all PTE
>>>>>>>>>>>>>>>> bits
>>>>>>>>>>>>>>>> except the PFN have to match).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Some very quick testing (don't trust each last % ) on Intel(R)
>>>>>>>>>>>>>>>> Xeon(R)
>>>>>>>>>>>>>>>> Silver
>>>>>>>>>>>>>>>> 4210R CPU.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> order-0: 0.014210 -> 0.013969
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -> Around 1.7 % faster
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> order-9: 0.014373 -> 0.009149
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -> Around 36.3 % faster
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Well I guess that shows me :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've just been trying to compile and noticed this. Will take a look at
>>>>>>>>>>>>> your
>>>>>>>>>>>>> update.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But upon review, I've noticed the part that I think makes this
>>>>>>>>>>>>> difficult
>>>>>>>>>>>>> for
>>>>>>>>>>>>> arm64 with the contpte optimization; You are calling ptep_get() for
>>>>>>>>>>>>> every
>>>>>>>>>>>>> pte in
>>>>>>>>>>>>> the batch. While this is functionally correct, once arm64 has the
>>>>>>>>>>>>> contpte
>>>>>>>>>>>>> changes, its ptep_get() has to read every pte in the contpte block in
>>>>>>>>>>>>> order to
>>>>>>>>>>>>> gather the access and dirty bits. So if your batching function ends up
>>>>>>>>>>>>> walking
>>>>>>>>>>>>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>>>>>>>>>>>>> performance. That's why I added the arch-specific pte_batch_remaining()
>>>>>>>>>>>>> function; this allows the core-mm to skip to the end of the contpte
>>>>>>>>>>>>> block and
>>>>>>>>>>>>> avoid ptep_get() for the 15 tail ptes. So we end up with 16
>>>>>>>>>>>>> READ_ONCE()s
>>>>>>>>>>>>> instead
>>>>>>>>>>>>> of 256.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I considered making a ptep_get_noyoungdirty() variant, which would
>>>>>>>>>>>>> avoid
>>>>>>>>>>>>> the
>>>>>>>>>>>>> bit
>>>>>>>>>>>>> gathering. But we have a similar problem in zap_pte_range() and that
>>>>>>>>>>>>> function
>>>>>>>>>>>>> needs the dirty bit to update the folio. So it doesn't work there. (see
>>>>>>>>>>>>> patch 3
>>>>>>>>>>>>> in my series).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I guess you are going to say that we should combine both approaches, so
>>>>>>>>>>>>> that
>>>>>>>>>>>>> your batching loop can skip forward an arch-provided number of ptes?
>>>>>>>>>>>>> That
>>>>>>>>>>>>> would
>>>>>>>>>>>>> certainly work, but feels like an orthogonal change to what I'm
>>>>>>>>>>>>> trying to
>>>>>>>>>>>>> achieve :). Anyway, I'll spend some time playing with it today.
>>>>>>>>>>>>
>>>>>>>>>>>> You can overwrite the function or add special-casing internally, yes.
>>>>>>>>>>>>
>>>>>>>>>>>> Right now, your patch is called "mm: Batch-copy PTE ranges during
>>>>>>>>>>>> fork()"
>>>>>>>>>>>> and it
>>>>>>>>>>>> doesn't do any of that besides preparing for some arm64 work.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Well it allows an arch to opt-in to batching. But I see your point.
>>>>>>>>>>>
>>>>>>>>>>> How do you want to handle your patches? Do you want to clean them up and
>>>>>>>>>>> I'll
>>>>>>>>>>> base my stuff on top? Or do you want me to take them and sort it all out?
>>>>>>>>>>
>>>>>>>>>> Whatever you prefer, it was mostly a quick prototype to see if we can
>>>>>>>>>> achieve
>>>>>>>>>> decent performance.
>>>>>>>>>
>>>>>>>>> I'm about to run it on Altra and M2. But I assume it will show similar
>>>>>>>>> results.
>>>>>>>
>>>>>>> OK results in, not looking great, which aligns with my previous experience.
>>>>>>> That
>>>>>>> said, I'm seeing some "BUG: Bad page state in process gmain  pfn:12a094" so
>>>>>>> perhaps these results are not valid...
>>>>>>
>>>>>> I didn't see that so far on x86, maybe related to the PFN fixup?
>>>>>
>>>>> All I've done is define PFN_PTE_SHIFT for arm64 on top of your latest patch:
>>>>>
>>>>> diff --git a/arch/arm64/include/asm/pgtable.h
>>>>> b/arch/arm64/include/asm/pgtable.h
>>>>> index b19a8aee684c..9eb0fd693df9 100644
>>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>>> @@ -359,6 +359,8 @@ static inline void set_ptes(struct mm_struct *mm,
>>>>>    }
>>>>>    #define set_ptes set_ptes
>>>>>    +#define PFN_PTE_SHIFT          PAGE_SHIFT
>>>>> +
>>>>>    /*
>>>>>     * Huge pte definitions.
>>>>>     */
>>>>>
>>>>>
>>>>> As an aside, I think there is a bug in arm64's set_ptes() for PA > 48-bit
>>>>> case. But that won't affect this.
>>>>>
>>>>>
>>>>> With VM_DEBUG on, this is the first warning I see during boot:
>>>>>
>>>>>
>>>>> [    0.278110] page:00000000c7ced4e8 refcount:12 mapcount:0
>>>>> mapping:00000000b2f9739b index:0x1a8 pfn:0x1bff30
>>>>> [    0.278742] head:00000000c7ced4e8 order:2 entire_mapcount:0
>>>>> nr_pages_mapped:2 pincount:0
>>>>
>>>> ^ Ah, you are running with mTHP. Let me play with that.
>>>
>>> Err... It's in mm-unstable, but I'm not enabling any sizes. It should only be set
>>> up for PMD-sized THP.
>>>
>>> I am using XFS though, so I imagine it's a file folio.
>>>
>>> I've rebased your rmap cleanup and fork batching to the version of mm-unstable
>>> that I was doing all my other testing with so I could compare numbers. But it's
>>> not very old (perhaps a week). All the patches applied without any conflict.
>>
>>
>> It would also be interesting to know if the compiler on arm64 decides to do
>> something stupid: like not inline wrprotect_ptes().
>>
>> Because with an effective unlikely(folio_test_large(folio)) we shouldn't see
>> that much overhead.
>>
>
> What version of gcc are you using? I must confess I'm using the Ubuntu 20.04
> default version:
>
> aarch64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
>
> Perhaps I should grab something a bit newer?
>

gcc version 13.2.1 20231011 (Red Hat 13.2.1-4) (GCC)

From Fedora 38. So "a bit" newer :P

--
Cheers,

David / dhildenb


2023-12-20 13:33:25

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20/12/2023 13:13, David Hildenbrand wrote:
> On 20.12.23 14:10, Ryan Roberts wrote:
>> On 20/12/2023 13:06, David Hildenbrand wrote:
>>> On 20.12.23 13:04, Ryan Roberts wrote:
>>>> On 20/12/2023 11:58, David Hildenbrand wrote:
>>>>> On 20.12.23 12:51, Ryan Roberts wrote:
>>>>>> On 20/12/2023 11:36, David Hildenbrand wrote:
>>>>>>> On 20.12.23 12:28, Ryan Roberts wrote:
>>>>>>>> On 20/12/2023 10:56, David Hildenbrand wrote:
>>>>>>>>> On 20.12.23 11:41, Ryan Roberts wrote:
>>>>>>>>>> On 20/12/2023 10:16, David Hildenbrand wrote:
>>>>>>>>>>> On 20.12.23 11:11, Ryan Roberts wrote:
>>>>>>>>>>>> On 20/12/2023 09:54, David Hildenbrand wrote:
>>>>>>>>>>>>> On 20.12.23 10:51, Ryan Roberts wrote:
>>>>>>>>>>>>>> On 20/12/2023 09:17, David Hildenbrand wrote:
>>>>>>>>>>>>>>> On 19.12.23 18:42, Ryan Roberts wrote:
>>>>>>>>>>>>>>>> On 19/12/2023 17:22, David Hildenbrand wrote:
>>>>>>>>>>>>>>>>> On 19.12.23 09:30, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>>> On 18/12/2023 17:47, David Hildenbrand wrote:
>>>>>>>>>>>>>>>>>>> On 18.12.23 11:50, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>>>>> Convert copy_pte_range() to copy a batch of ptes in one go. A
>>>>>>>>>>>>>>>>>>>> given
>>>>>>>>>>>>>>>>>>>> batch is determined by the architecture with the new helper,
>>>>>>>>>>>>>>>>>>>> pte_batch_remaining(), and maps a physically contiguous
>>>>>>>>>>>>>>>>>>>> block of
>>>>>>>>>>>>>>>>>>>> memory,
>>>>>>>>>>>>>>>>>>>> all belonging to the same folio. A pte batch is then
>>>>>>>>>>>>>>>>>>>> write-protected in
>>>>>>>>>>>>>>>>>>>> one go in the parent using the new helper,
>>>>>>>>>>>>>>>>>>>> ptep_set_wrprotects()
>>>>>>>>>>>>>>>>>>>> and is
>>>>>>>>>>>>>>>>>>>> set in one go in the child using the new helper,
>>>>>>>>>>>>>>>>>>>> set_ptes_full().
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The primary motivation for this change is to reduce the number
>>>>>>>>>>>>>>>>>>>> of tlb
>>>>>>>>>>>>>>>>>>>> maintenance operations that the arm64 backend has to perform
>>>>>>>>>>>>>>>>>>>> during
>>>>>>>>>>>>>>>>>>>> fork, as it is about to add transparent support for the
>>>>>>>>>>>>>>>>>>>> "contiguous
>>>>>>>>>>>>>>>>>>>> bit"
>>>>>>>>>>>>>>>>>>>> in its ptes. By write-protecting the parent using the new
>>>>>>>>>>>>>>>>>>>> ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>>>>>>>>>>>>>>> backend
>>>>>>>>>>>>>>>>>>>> can avoid having to unfold contig ranges of PTEs, which is
>>>>>>>>>>>>>>>>>>>> expensive,
>>>>>>>>>>>>>>>>>>>> when all ptes in the range are being write-protected.
>>>>>>>>>>>>>>>>>>>> Similarly, by
>>>>>>>>>>>>>>>>>>>> using set_ptes_full() rather than set_pte_at() to set up
>>>>>>>>>>>>>>>>>>>> ptes in
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> child, the backend does not need to fold a contiguous range
>>>>>>>>>>>>>>>>>>>> once
>>>>>>>>>>>>>>>>>>>> they
>>>>>>>>>>>>>>>>>>>> are all populated - they can be initially populated as a
>>>>>>>>>>>>>>>>>>>> contiguous
>>>>>>>>>>>>>>>>>>>> range in the first place.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This code is very performance sensitive, and a significant
>>>>>>>>>>>>>>>>>>>> amount of
>>>>>>>>>>>>>>>>>>>> effort has been put into not regressing performance for the
>>>>>>>>>>>>>>>>>>>> order-0
>>>>>>>>>>>>>>>>>>>> folio case. By default, pte_batch_remaining() is compile
>>>>>>>>>>>>>>>>>>>> constant 1,
>>>>>>>>>>>>>>>>>>>> which enables the compiler to simplify the extra loops that are
>>>>>>>>>>>>>>>>>>>> added
>>>>>>>>>>>>>>>>>>>> for batching and produce code that is equivalent (and equally
>>>>>>>>>>>>>>>>>>>> performant) to the previous implementation.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This change addresses the core-mm refactoring only and a
>>>>>>>>>>>>>>>>>>>> separate
>>>>>>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>>>>>> will implement pte_batch_remaining(), ptep_set_wrprotects() and
>>>>>>>>>>>>>>>>>>>> set_ptes_full() in the arm64 backend to realize the performance
>>>>>>>>>>>>>>>>>>>> improvement as part of the work to enable contpte mappings.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> To ensure the arm64 is performant once implemented, this
>>>>>>>>>>>>>>>>>>>> change is
>>>>>>>>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>>>>> careful to only call ptep_get() once per pte batch.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The following microbenchmark results demonstrate that there
>>>>>>>>>>>>>>>>>>>> is no
>>>>>>>>>>>>>>>>>>>> significant performance change after this patch. Fork is called
>>>>>>>>>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>>>>>> tight loop in a process with 1G of populated memory and the
>>>>>>>>>>>>>>>>>>>> time
>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> function to execute is measured. 100 iterations per run, 8 runs
>>>>>>>>>>>>>>>>>>>> performed on both Apple M2 (VM) and Ampere Altra (bare metal).
>>>>>>>>>>>>>>>>>>>> Tests
>>>>>>>>>>>>>>>>>>>> performed for case where 1G memory is comprised of order-0
>>>>>>>>>>>>>>>>>>>> folios and
>>>>>>>>>>>>>>>>>>>> case where comprised of pte-mapped order-9 folios. Negative is
>>>>>>>>>>>>>>>>>>>> faster,
>>>>>>>>>>>>>>>>>>>> positive is slower, compared to baseline upon which the
>>>>>>>>>>>>>>>>>>>> series is
>>>>>>>>>>>>>>>>>>>> based:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> | Apple M2 VM   | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.1% |    0.0% |    1.2% |
>>>>>>>>>>>>>>>>>>>> | after-change  |   -1.0% |    2.0% |   -0.1% |    1.1% |
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> | Ampere Altra  | order-0 (pte-map) | order-9 (pte-map) |
>>>>>>>>>>>>>>>>>>>> | fork          |-------------------|-------------------|
>>>>>>>>>>>>>>>>>>>> | microbench    |    mean |   stdev |    mean |   stdev |
>>>>>>>>>>>>>>>>>>>> |---------------|---------|---------|---------|---------|
>>>>>>>>>>>>>>>>>>>> | baseline      |    0.0% |    1.0% |    0.0% |    0.1% |
>>>>>>>>>>>>>>>>>>>> | after-change  |   -0.1% |    1.2% |   -0.1% |    0.1% |
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Tested-by: John Hubbard <[email protected]>
>>>>>>>>>>>>>>>>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>>>>>>>>>>>>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>            include/linux/pgtable.h | 80
>>>>>>>>>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>>>>>>            mm/memory.c             | 92
>>>>>>>>>>>>>>>>>>>> ++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>>>>>>>            2 files changed, 139 insertions(+), 33 deletions(-)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>>>>>>>>>>>>> index af7639c3b0a3..db93fb81465a 100644
>>>>>>>>>>>>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>>>>>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>>>>>>>>>>>>> @@ -205,6 +205,27 @@ static inline int pmd_young(pmd_t pmd)
>>>>>>>>>>>>>>>>>>>>            #define arch_flush_lazy_mmu_mode()    do {} while
>>>>>>>>>>>>>>>>>>>> (0)
>>>>>>>>>>>>>>>>>>>>            #endif
>>>>>>>>>>>>>>>>>>>>            +#ifndef pte_batch_remaining
>>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>>> + * pte_batch_remaining - Number of pages from addr to next
>>>>>>>>>>>>>>>>>>>> batch
>>>>>>>>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>>>>>>>>> + * @pte: Page table entry for the first page.
>>>>>>>>>>>>>>>>>>>> + * @addr: Address of the first page.
>>>>>>>>>>>>>>>>>>>> + * @end: Batch ceiling (e.g. end of vma).
>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>> + * Some architectures (arm64) can efficiently modify a
>>>>>>>>>>>>>>>>>>>> contiguous
>>>>>>>>>>>>>>>>>>>> batch of
>>>>>>>>>>>>>>>>>>>> ptes.
>>>>>>>>>>>>>>>>>>>> + * In such cases, this function returns the remaining
>>>>>>>>>>>>>>>>>>>> number of
>>>>>>>>>>>>>>>>>>>> pages to
>>>>>>>>>>>>>>>>>>>> the end
>>>>>>>>>>>>>>>>>>>> + * of the current batch, as defined by addr. This can be
>>>>>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>>>>>> iterating
>>>>>>>>>>>>>>>>>>>> + * over ptes.
>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>> + * May be overridden by the architecture, else batch size is
>>>>>>>>>>>>>>>>>>>> always 1.
>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>> +static inline unsigned int pte_batch_remaining(pte_t pte,
>>>>>>>>>>>>>>>>>>>> unsigned
>>>>>>>>>>>>>>>>>>>> long
>>>>>>>>>>>>>>>>>>>> addr,
>>>>>>>>>>>>>>>>>>>> +                        unsigned long end)
>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>> +    return 1;
>>>>>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It's a shame we now lose the optimization for all other
>>>>>>>>>>>>>>>>>>> architectures.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Was there no way to have some basic batching mechanism that
>>>>>>>>>>>>>>>>>>> doesn't
>>>>>>>>>>>>>>>>>>> require
>>>>>>>>>>>>>>>>>>> arch
>>>>>>>>>>>>>>>>>>> specifics?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I tried a bunch of things but ultimately the way I've done it
>>>>>>>>>>>>>>>>>> was the
>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>> way
>>>>>>>>>>>>>>>>>> to reduce the order-0 fork regression to 0.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> My original v3 posting was costing 5% extra and even my first
>>>>>>>>>>>>>>>>>> attempt
>>>>>>>>>>>>>>>>>> at an
>>>>>>>>>>>>>>>>>> arch-specific version that didn't resolve to a compile-time
>>>>>>>>>>>>>>>>>> constant 1
>>>>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>>> cost an extra 3%.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'd have thought that something very basic would have worked
>>>>>>>>>>>>>>>>>>> like:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> * Check if PTE is the same when setting the PFN to 0.
>>>>>>>>>>>>>>>>>>> * Check that PFN is consecutive
>>>>>>>>>>>>>>>>>>> * Check that all PFNs belong to the same folio
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I haven't tried this exact approach, but I'd be surprised if I
>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> regression under 4% with this. Further along the series I spent a
>>>>>>>>>>>>>>>>>> lot of
>>>>>>>>>>>>>>>>>> time
>>>>>>>>>>>>>>>>>> having to fiddle with the arm64 implementation; every
>>>>>>>>>>>>>>>>>> conditional and
>>>>>>>>>>>>>>>>>> every
>>>>>>>>>>>>>>>>>> memory read (even when in cache) was a problem. There is just so
>>>>>>>>>>>>>>>>>> little in
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> inner loop that every instruction matters. (At least on Ampere
>>>>>>>>>>>>>>>>>> Altra
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> Apple
>>>>>>>>>>>>>>>>>> M2).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Of course if you're willing to pay that 4-5% for order-0 then the
>>>>>>>>>>>>>>>>>> benefit to
>>>>>>>>>>>>>>>>>> order-9 is around 10% in my measurements. Personally though, I'd
>>>>>>>>>>>>>>>>>> prefer to
>>>>>>>>>>>>>>>>>> play
>>>>>>>>>>>>>>>>>> safe and ensure the common order-0 case doesn't regress, as you
>>>>>>>>>>>>>>>>>> previously
>>>>>>>>>>>>>>>>>> suggested.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I just hacked something up, on top of my beloved rmap
>>>>>>>>>>>>>>>>> cleanup/batching
>>>>>>>>>>>>>>>>> series. I
>>>>>>>>>>>>>>>>> implemented very generic and simple batching for large folios
>>>>>>>>>>>>>>>>> (all PTE
>>>>>>>>>>>>>>>>> bits
>>>>>>>>>>>>>>>>> except the PFN have to match).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Some very quick testing (don't trust each last % ) on Intel(R)
>>>>>>>>>>>>>>>>> Xeon(R)
>>>>>>>>>>>>>>>>> Silver
>>>>>>>>>>>>>>>>> 4210R CPU.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> order-0: 0.014210 -> 0.013969
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -> Around 1.7 % faster
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> order-9: 0.014373 -> 0.009149
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -> Around 36.3 % faster
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Well I guess that shows me :)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'll do a review and run the tests on my HW to see if it concurs.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I pushed a simple compile fixup (we need pte_next_pfn()).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've just been trying to compile and noticed this. Will take a
>>>>>>>>>>>>>> look at
>>>>>>>>>>>>>> your
>>>>>>>>>>>>>> update.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But upon review, I've noticed the part that I think makes this
>>>>>>>>>>>>>> difficult
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>> arm64 with the contpte optimization; You are calling ptep_get() for
>>>>>>>>>>>>>> every
>>>>>>>>>>>>>> pte in
>>>>>>>>>>>>>> the batch. While this is functionally correct, once arm64 has the
>>>>>>>>>>>>>> contpte
>>>>>>>>>>>>>> changes, its ptep_get() has to read every pte in the contpte block in
>>>>>>>>>>>>>> order to
>>>>>>>>>>>>>> gather the access and dirty bits. So if your batching function
>>>>>>>>>>>>>> ends up
>>>>>>>>>>>>>> walking
>>>>>>>>>>>>>> a 16 entry contpte block, that will cause 16 x 16 reads, which kills
>>>>>>>>>>>>>> performance. That's why I added the arch-specific
>>>>>>>>>>>>>> pte_batch_remaining()
>>>>>>>>>>>>>> function; this allows the core-mm to skip to the end of the contpte
>>>>>>>>>>>>>> block and
>>>>>>>>>>>>>> avoid ptep_get() for the 15 tail ptes. So we end up with 16
>>>>>>>>>>>>>> READ_ONCE()s
>>>>>>>>>>>>>> instead
>>>>>>>>>>>>>> of 256.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I considered making a ptep_get_noyoungdirty() variant, which would
>>>>>>>>>>>>>> avoid
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> bit
>>>>>>>>>>>>>> gathering. But we have a similar problem in zap_pte_range() and that
>>>>>>>>>>>>>> function
>>>>>>>>>>>>>> needs the dirty bit to update the folio. So it doesn't work there.
>>>>>>>>>>>>>> (see
>>>>>>>>>>>>>> patch 3
>>>>>>>>>>>>>> in my series).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I guess you are going to say that we should combine both
>>>>>>>>>>>>>> approaches, so
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>> your batching loop can skip forward an arch-provided number of ptes?
>>>>>>>>>>>>>> That
>>>>>>>>>>>>>> would
>>>>>>>>>>>>>> certainly work, but feels like an orthogonal change to what I'm
>>>>>>>>>>>>>> trying to
>>>>>>>>>>>>>> achieve :). Anyway, I'll spend some time playing with it today.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You can overwrite the function or add special-casing internally, yes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Right now, your patch is called "mm: Batch-copy PTE ranges during
>>>>>>>>>>>>> fork()"
>>>>>>>>>>>>> and it
>>>>>>>>>>>>> doesn't do any of that besides preparing for some arm64 work.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Well it allows an arch to opt-in to batching. But I see your point.
>>>>>>>>>>>>
>>>>>>>>>>>> How do you want to handle your patches? Do you want to clean them up
>>>>>>>>>>>> and
>>>>>>>>>>>> I'll
>>>>>>>>>>>> base my stuff on top? Or do you want me to take them and sort it all
>>>>>>>>>>>> out?
>>>>>>>>>>>
>>>>>>>>>>> Whatever you prefer, it was mostly a quick prototype to see if we can
>>>>>>>>>>> achieve
>>>>>>>>>>> decent performance.
>>>>>>>>>>
>>>>>>>>>> I'm about to run it on Altra and M2. But I assume it will show similar
>>>>>>>>>> results.
>>>>>>>>
>>>>>>>> OK results in, not looking great, which aligns with my previous experience.
>>>>>>>> That
>>>>>>>> said, I'm seeing some "BUG: Bad page state in process gmain  pfn:12a094" so
>>>>>>>> perhaps these results are not valid...
>>>>>>>
>>>>>>> I didn't see that so far on x86, maybe related to the PFN fixup?
>>>>>>
>>>>>> All I've done is define PFN_PTE_SHIFT for arm64 on top of your latest patch:
>>>>>>
>>>>>> diff --git a/arch/arm64/include/asm/pgtable.h
>>>>>> b/arch/arm64/include/asm/pgtable.h
>>>>>> index b19a8aee684c..9eb0fd693df9 100644
>>>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>>>> @@ -359,6 +359,8 @@ static inline void set_ptes(struct mm_struct *mm,
>>>>>>     }
>>>>>>     #define set_ptes set_ptes
>>>>>>     +#define PFN_PTE_SHIFT          PAGE_SHIFT
>>>>>> +
>>>>>>     /*
>>>>>>      * Huge pte definitions.
>>>>>>      */
>>>>>>
>>>>>>
>>>>>> As an aside, I think there is a bug in arm64's set_ptes() for PA > 48-bit
>>>>>> case. But that won't affect this.
>>>>>>
>>>>>>
>>>>>> With VM_DEBUG on, this is the first warning I see during boot:
>>>>>>
>>>>>>
>>>>>> [    0.278110] page:00000000c7ced4e8 refcount:12 mapcount:0
>>>>>> mapping:00000000b2f9739b index:0x1a8 pfn:0x1bff30
>>>>>> [    0.278742] head:00000000c7ced4e8 order:2 entire_mapcount:0
>>>>>> nr_pages_mapped:2 pincount:0
>>>>>
>>>>> ^ Ah, you are running with mTHP. Let me play with that.
>>>>
>>>> Err... It's in mm-unstable, but I'm not enabling any sizes. It should only be
>>>> set
>>>> up for PMD-sized THP.
>>>>
>>>> I am using XFS though, so I imagine it's a file folio.
>>>>
>>>> I've rebased your rmap cleanup and fork batching to the version of mm-unstable
>>>> that I was doing all my other testing with so I could compare numbers. But it's
>>>> not very old (perhaps a week). All the patches applied without any conflict.
>>>
>>>
>>> It would also be interesting to know if the compiler on arm64 decides to do
>>> something stupid: like not inline wrprotect_ptes().
>>>
>>> Because with an effective unlikely(folio_test_large(folio)) we shouldn't see
>>> that much overhead.
>>>
>>
>> What version of gcc are you using? I must confess I'm using the Ubuntu 20.04
>> default version:
>>
>> aarch64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
>>
>> Perhaps I should grab something a bit newer?
>>
>
> gcc version 13.2.1 20231011 (Red Hat 13.2.1-4) (GCC)
>
> From Fedora 38. So "a bit" newer :P
>

I'll retry with a newer toolchain.

FWIW, with the code fix and the original compiler:

Fork, order-0, Apple M2:
| kernel                |   mean_rel |   std_rel |
|:----------------------|-----------:|----------:|
| mm-unstable           |       0.0% |      0.8% |
| hugetlb-rmap-cleanups |       1.3% |      2.0% |
| fork-batching         |       4.3% |      1.0% |

Fork, order-9, Apple M2:
| kernel                |   mean_rel |   std_rel |
|:----------------------|-----------:|----------:|
| mm-unstable           |       0.0% |      0.8% |
| hugetlb-rmap-cleanups |       0.9% |      0.9% |
| fork-batching         |     -37.3% |      1.0% |

Fork, order-0, Ampere Altra:
| kernel                |   mean_rel |   std_rel |
|:----------------------|-----------:|----------:|
| mm-unstable           |       0.0% |      0.7% |
| hugetlb-rmap-cleanups |       3.2% |      0.7% |
| fork-batching         |       5.5% |      1.1% |

Fork, order-9, Ampere Altra:
| kernel                |   mean_rel |   std_rel |
|:----------------------|-----------:|----------:|
| mm-unstable           |       0.0% |      0.1% |
| hugetlb-rmap-cleanups |       0.5% |      0.1% |
| fork-batching         |     -10.4% |      0.1% |
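
(On David's inlining question above: my understanding is that the order-0 case is meant to stay
on the old per-pte path behind a branch the compiler treats as cold, along the lines of the
sketch below. The helper and its shape are purely illustrative and not the actual fork-batching
patch; the point is only that if the compiler declines to inline such a wrapper, every order-0
pte pays a function call, which would be consistent with the ~4-5% order-0 regressions above.)

static inline void wrprotect_ptes_sketch(struct mm_struct *mm, unsigned long addr,
					 pte_t *ptep, struct folio *folio,
					 unsigned int nr)
{
	if (likely(!folio_test_large(folio))) {
		/* order-0 folio: identical to the pre-batching behaviour */
		ptep_set_wrprotect(mm, addr, ptep);
		return;
	}

	/* large folio: write-protect the whole batch of nr ptes */
	for (; nr; nr--, ptep++, addr += PAGE_SIZE)
		ptep_set_wrprotect(mm, addr, ptep);
}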


2023-12-20 14:02:52

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

[...]

>>>
>>
>> gcc version 13.2.1 20231011 (Red Hat 13.2.1-4) (GCC)
>>
>> From Fedora 38. So "a bit" newer :P
>>
>
> I'll retry with newer toolchain.
>
> FWIW, with the code fix and the original compiler:
>
> Fork, order-0, Apple M2:
> | kernel                |   mean_rel |   std_rel |
> |:----------------------|-----------:|----------:|
> | mm-unstable           |       0.0% |      0.8% |
> | hugetlb-rmap-cleanups |       1.3% |      2.0% |
> | fork-batching         |       4.3% |      1.0% |
>
> Fork, order-9, Apple M2:
> | kernel                |   mean_rel |   std_rel |
> |:----------------------|-----------:|----------:|
> | mm-unstable           |       0.0% |      0.8% |
> | hugetlb-rmap-cleanups |       0.9% |      0.9% |
> | fork-batching         |     -37.3% |      1.0% |
>
> Fork, order-0, Ampere Altra:
> | kernel                |   mean_rel |   std_rel |
> |:----------------------|-----------:|----------:|
> | mm-unstable           |       0.0% |      0.7% |
> | hugetlb-rmap-cleanups |       3.2% |      0.7% |
> | fork-batching         |       5.5% |      1.1% |
>
> Fork, order-9, Ampere Altra:
> | kernel                |   mean_rel |   std_rel |
> |:----------------------|-----------:|----------:|
> | mm-unstable           |       0.0% |      0.1% |
> | hugetlb-rmap-cleanups |       0.5% |      0.1% |
> | fork-batching         |     -10.4% |      0.1% |
>

I just gave it another quick benchmark run on that Intel system.

hugetlb-rmap-cleanups -> fork-batching

order-0: 0.014114 -> 0.013848

-1.9%

order-9: 0.014262 -> 0.009410

-34%

Note that I disable SMT and turbo, and pin the test to one CPU, to make
the results as stable as possible. My kernel config has anything related
to debugging disabled.
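
(For anyone checking the arithmetic: these percentages are just the relative change of the mean
runtime against the quoted baseline, which is also how I read the mean_rel columns in the tables
above. A trivial check:)

#include <stdio.h>

/* Relative change of a mean runtime against a baseline, in percent. */
static double rel_delta_pct(double mean, double baseline)
{
	return (mean - baseline) / baseline * 100.0;
}

int main(void)
{
	/* raw timings quoted above: hugetlb-rmap-cleanups -> fork-batching */
	printf("order-0: %.1f%%\n", rel_delta_pct(0.013848, 0.014114)); /* -1.9% */
	printf("order-9: %.1f%%\n", rel_delta_pct(0.009410, 0.014262)); /* -34.0% */
	return 0;
}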

--
Cheers,

David / dhildenb


2023-12-20 15:05:18

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20/12/2023 14:00, David Hildenbrand wrote:
> [...]
>
>>>>
>>>
>>> gcc version 13.2.1 20231011 (Red Hat 13.2.1-4) (GCC)
>>>
>>>  From Fedora 38. So "a bit" newer :P
>>>
>>
>> I'll retry with newer toolchain.
>>
>> FWIW, with the code fix and the original compiler:
>>
>> Fork, order-0, Apple M2:
>> | kernel                |   mean_rel |   std_rel |
>> |:----------------------|-----------:|----------:|
>> | mm-unstable           |       0.0% |      0.8% |
>> | hugetlb-rmap-cleanups |       1.3% |      2.0% |
>> | fork-batching         |       4.3% |      1.0% |
>>
>> Fork, order-9, Apple M2:
>> | kernel                |   mean_rel |   std_rel |
>> |:----------------------|-----------:|----------:|
>> | mm-unstable           |       0.0% |      0.8% |
>> | hugetlb-rmap-cleanups |       0.9% |      0.9% |
>> | fork-batching         |     -37.3% |      1.0% |
>>
>> Fork, order-0, Ampere Altra:
>> | kernel                |   mean_rel |   std_rel |
>> |:----------------------|-----------:|----------:|
>> | mm-unstable           |       0.0% |      0.7% |
>> | hugetlb-rmap-cleanups |       3.2% |      0.7% |
>> | fork-batching         |       5.5% |      1.1% |
>>
>> Fork, order-9, Ampere Altra:
>> | kernel                |   mean_rel |   std_rel |
>> |:----------------------|-----------:|----------:|
>> | mm-unstable           |       0.0% |      0.1% |
>> | hugetlb-rmap-cleanups |       0.5% |      0.1% |
>> | fork-batching         |     -10.4% |      0.1% |
>>
>
> I just gave it another quick benchmark run on that Intel system.
>
> hugetlb-rmap-cleanups -> fork-batching
>
> order-0: 0.014114 -> 0.013848
>
> -1.9%
>
> order-9: 0.014262 -> 0.009410
>
> -34%
>
> Note that I disable SMT and turbo, and pin the test to one CPU, to make the
> results as stable as possible. My kernel config has anything related to
> debugging disabled.
>

And with gcc 13.2 on arm64:

Fork, order-0, Apple M2 VM:
| kernel                |   mean_rel |   std_rel |
|:----------------------|-----------:|----------:|
| mm-unstable           |       0.0% |      1.5% |
| hugetlb-rmap-cleanups |      -3.3% |      1.1% |
| fork-batching         |      -3.6% |      1.4% |

Fork, order-9, Apple M2 VM:
| kernel                |   mean_rel |   std_rel |
|:----------------------|-----------:|----------:|
| mm-unstable           |       0.0% |      1.8% |
| hugetlb-rmap-cleanups |      -5.8% |      1.3% |
| fork-batching         |     -38.1% |      2.3% |

Fork, order-0, Ampere Altra:
| kernel                |   mean_rel |   std_rel |
|:----------------------|-----------:|----------:|
| mm-unstable           |       0.0% |      1.3% |
| hugetlb-rmap-cleanups |      -0.1% |      0.4% |
| fork-batching         |      -0.4% |      0.5% |

Fork, order-9, Ampere Altra:
| kernel                |   mean_rel |   std_rel |
|:----------------------|-----------:|----------:|
| mm-unstable           |       0.0% |      0.1% |
| hugetlb-rmap-cleanups |      -0.1% |      0.1% |
| fork-batching         |     -13.9% |      0.1% |


So all looking good. Compiler was the issue. Sorry for the noise.

So please go ahead with your rmap v2 stuff, and I'll wait for you to post the
fork and zap batching patches properly, then rebase my arm64 contpte stuff on
top and remeasure everything.

Thanks,
Ryan


2023-12-20 15:44:23

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20.12.23 16:05, Ryan Roberts wrote:
> On 20/12/2023 14:00, David Hildenbrand wrote:
>> [...]
>>
>>>>>
>>>>
>>>> gcc version 13.2.1 20231011 (Red Hat 13.2.1-4) (GCC)
>>>>
>>>>  From Fedora 38. So "a bit" newer :P
>>>>
>>>
>>> I'll retry with newer toolchain.
>>>
>>> FWIW, with the code fix and the original compiler:
>>>
>>> Fork, order-0, Apple M2:
>>> | kernel                |   mean_rel |   std_rel |
>>> |:----------------------|-----------:|----------:|
>>> | mm-unstable           |       0.0% |      0.8% |
>>> | hugetlb-rmap-cleanups |       1.3% |      2.0% |
>>> | fork-batching         |       4.3% |      1.0% |
>>>
>>> Fork, order-9, Apple M2:
>>> | kernel                |   mean_rel |   std_rel |
>>> |:----------------------|-----------:|----------:|
>>> | mm-unstable           |       0.0% |      0.8% |
>>> | hugetlb-rmap-cleanups |       0.9% |      0.9% |
>>> | fork-batching         |     -37.3% |      1.0% |
>>>
>>> Fork, order-0, Ampere Altra:
>>> | kernel                |   mean_rel |   std_rel |
>>> |:----------------------|-----------:|----------:|
>>> | mm-unstable           |       0.0% |      0.7% |
>>> | hugetlb-rmap-cleanups |       3.2% |      0.7% |
>>> | fork-batching         |       5.5% |      1.1% |
>>>
>>> Fork, order-9, Ampere Altra:
>>> | kernel                |   mean_rel |   std_rel |
>>> |:----------------------|-----------:|----------:|
>>> | mm-unstable           |       0.0% |      0.1% |
>>> | hugetlb-rmap-cleanups |       0.5% |      0.1% |
>>> | fork-batching         |     -10.4% |      0.1% |
>>>
>>
>> I just gave it another quick benchmark run on that Intel system.
>>
>> hugetlb-rmap-cleanups -> fork-batching
>>
>> order-0: 0.014114 -> 0.013848
>>
>> -1.9%
>>
>> order-9: 0.014262 -> 0.009410
>>
>> -34%
>>
>> Note that I disable SMT and turbo, and pin the test to one CPU, to make the
>> results as stable as possible. My kernel config has anything related to
>> debugging disabled.
>>
>
> And with gcc 13.2 on arm64:
>
> Fork, order-0, Apple M2 VM:
> | kernel                |   mean_rel |   std_rel |
> |:----------------------|-----------:|----------:|
> | mm-unstable           |       0.0% |      1.5% |
> | hugetlb-rmap-cleanups |      -3.3% |      1.1% |
> | fork-batching         |      -3.6% |      1.4% |
>
> Fork, order-9, Apple M2 VM:
> | kernel                |   mean_rel |   std_rel |
> |:----------------------|-----------:|----------:|
> | mm-unstable           |       0.0% |      1.8% |
> | hugetlb-rmap-cleanups |      -5.8% |      1.3% |
> | fork-batching         |     -38.1% |      2.3% |
>
> Fork, order-0, Ampere Altra:
> | kernel                |   mean_rel |   std_rel |
> |:----------------------|-----------:|----------:|
> | mm-unstable           |       0.0% |      1.3% |
> | hugetlb-rmap-cleanups |      -0.1% |      0.4% |
> | fork-batching         |      -0.4% |      0.5% |
>
> Fork, order-9, Ampere Altra:
> | kernel                |   mean_rel |   std_rel |
> |:----------------------|-----------:|----------:|
> | mm-unstable           |       0.0% |      0.1% |
> | hugetlb-rmap-cleanups |      -0.1% |      0.1% |
> | fork-batching         |     -13.9% |      0.1% |
>
>
> So all looking good. Compiler was the issue. Sorry for the noise.

No need to be sorry, good that we figured out what's going wrong here.

Weird that the compiler makes such a difference here.

>
> So please go ahead with your rmap v2 stuff, and I'll wait for you to post the
> fork and zap batching patches properly, then rebase my arm64 contpte stuff on
> top and remeasure everything.

Yes, will get rmap v2 out soon, then start working on fork, and then try
tackling zap. I have some holiday coming up, so it might take some time
-- but there is plenty of time left.

--
Cheers,

David / dhildenb


2023-12-20 16:03:55

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 02/16] mm: Batch-copy PTE ranges during fork()

On 20/12/2023 15:35, David Hildenbrand wrote:
> On 20.12.23 16:05, Ryan Roberts wrote:
>> On 20/12/2023 14:00, David Hildenbrand wrote:
>>> [...]
>>>
>>>>>>
>>>>>
>>>>> gcc version 13.2.1 20231011 (Red Hat 13.2.1-4) (GCC)
>>>>>
>>>>>   From Fedora 38. So "a bit" newer :P
>>>>>
>>>>
>>>> I'll retry with newer toolchain.
>>>>
>>>> FWIW, with the code fix and the original compiler:
>>>>
>>>> Fork, order-0, Apple M2:
>>>> | kernel                |   mean_rel |   std_rel |
>>>> |:----------------------|-----------:|----------:|
>>>> | mm-unstable           |       0.0% |      0.8% |
>>>> | hugetlb-rmap-cleanups |       1.3% |      2.0% |
>>>> | fork-batching         |       4.3% |      1.0% |
>>>>
>>>> Fork, order-9, Apple M2:
>>>> | kernel                |   mean_rel |   std_rel |
>>>> |:----------------------|-----------:|----------:|
>>>> | mm-unstable           |       0.0% |      0.8% |
>>>> | hugetlb-rmap-cleanups |       0.9% |      0.9% |
>>>> | fork-batching         |     -37.3% |      1.0% |
>>>>
>>>> Fork, order-0, Ampere Altra:
>>>> | kernel                |   mean_rel |   std_rel |
>>>> |:----------------------|-----------:|----------:|
>>>> | mm-unstable           |       0.0% |      0.7% |
>>>> | hugetlb-rmap-cleanups |       3.2% |      0.7% |
>>>> | fork-batching         |       5.5% |      1.1% |
>>>>
>>>> Fork, order-9, Ampere Altra:
>>>> | kernel                |   mean_rel |   std_rel |
>>>> |:----------------------|-----------:|----------:|
>>>> | mm-unstable           |       0.0% |      0.1% |
>>>> | hugetlb-rmap-cleanups |       0.5% |      0.1% |
>>>> | fork-batching         |     -10.4% |      0.1% |
>>>>
>>>
>>> I just gave it another quick benchmark run on that Intel system.
>>>
>>> hugetlb-rmap-cleanups -> fork-batching
>>>
>>> order-0: 0.014114 -> 0.013848
>>>
>>> -1.9%
>>>
>>> order-9: 0.014262 -> 0.009410
>>>
>>> -34%
>>>
>>> Note that I disable SMT and turbo, and pin the test to one CPU, to make the
>>> results as stable as possible. My kernel config has anything related to
>>> debugging disabled.
>>>
>>
>> And with gcc 13.2 on arm64:
>>
>> Fork, order-0, Apple M2 VM:
>> | kernel                |   mean_rel |   std_rel |
>> |:----------------------|-----------:|----------:|
>> | mm-unstable           |       0.0% |      1.5% |
>> | hugetlb-rmap-cleanups |      -3.3% |      1.1% |
>> | fork-batching         |      -3.6% |      1.4% |
>>
>> Fork, order-9, Apple M2 VM:
>> | kernel                |   mean_rel |   std_rel |
>> |:----------------------|-----------:|----------:|
>> | mm-unstable           |       0.0% |      1.8% |
>> | hugetlb-rmap-cleanups |      -5.8% |      1.3% |
>> | fork-batching         |     -38.1% |      2.3% |
>>
>> Fork, order-0, Ampere Altra:
>> | kernel                |   mean_rel |   std_rel |
>> |:----------------------|-----------:|----------:|
>> | mm-unstable           |       0.0% |      1.3% |
>> | hugetlb-rmap-cleanups |      -0.1% |      0.4% |
>> | fork-batching         |      -0.4% |      0.5% |
>>
>> Fork, order-9, Ampere Altra:
>> | kernel                |   mean_rel |   std_rel |
>> |:----------------------|-----------:|----------:|
>> | mm-unstable           |       0.0% |      0.1% |
>> | hugetlb-rmap-cleanups |      -0.1% |      0.1% |
>> | fork-batching         |     -13.9% |      0.1% |
>>
>>
>> So all looking good. Compiler was the issue. Sorry for the noise.
>
> No need to be sorry, good that we figured out what's going wrong here.
>
> Weird that the compiler makes such a difference here.
>
>>
>> So please go ahead with your rmap v2 stuff, and I'll wait for you to post the
>> fork and zap batching patches properly, then rebase my arm64 contpte stuff on
>> top and remeasure everything.
>
> Yes, will get rmap v2 out soon, then start working on fork, and then try
> tackling zap. I have some holiday coming up, so it might take some time -- but
> there is plenty of time left.

Me too, I'll be out from end of Friday, returning on 2nd Jan.

Happy Christmas!

>


2024-01-15 15:14:35

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v4 14/16] arm64/mm: Wire up PTE_CONT for user mappings

Hi Ryan,

On 18/12/2023 11:50, Ryan Roberts wrote:
> With the ptep API sufficiently refactored, we can now introduce a new
> "contpte" API layer, which transparently manages the PTE_CONT bit for
> user mappings. Whenever it detects a set of PTEs that meet the
> requirements for a contiguous range, the PTEs are re-painted with the
> PTE_CONT bit. Use of contpte mappings is intended to be transparent to
> the core-mm, which continues to interact with individual ptes.
>
> Since a contpte block only has a single access and dirty bit, the
> semantic here changes slightly; when getting a pte (e.g. ptep_get())
> that is part of a contpte mapping, the access and dirty information are
> pulled from the block (so all ptes in the block return the same
> access/dirty info). When changing the access/dirty info on a pte (e.g.
> ptep_set_access_flags()) that is part of a contpte mapping, this change
> will affect the whole contpte block. This works fine in practice
> since we guarantee that only a single folio is mapped by a contpte
> block, and the core-mm tracks access/dirty information per folio.
>
> This initial change provides a baseline that can be optimized in future
> commits. That said, fold/unfold operations (which imply tlb
> invalidation) are avoided where possible with a few tricks for
> access/dirty bit management. Write-protect modifications for contpte
> mappings are currently non-optimal, and incur a regression in fork()
> performance. This will be addressed in follow-up changes.
>
> In order for the public functions, which used to be pure inline, to
> continue to be callable by modules, export all the contpte_* symbols
> that are now called by those public inline functions.
>
> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
> at build time. It defaults to enabled as long as its dependency,
> TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
> enabled, then there is no chance of meeting the physical contiguity
> requirement for contpte mappings.
>
> Tested-by: John Hubbard <[email protected]>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> arch/arm64/Kconfig | 10 +-
> arch/arm64/include/asm/pgtable.h | 184 +++++++++++++++
> arch/arm64/mm/Makefile | 1 +
> arch/arm64/mm/contpte.c | 388 +++++++++++++++++++++++++++++++
> 4 files changed, 582 insertions(+), 1 deletion(-)
> create mode 100644 arch/arm64/mm/contpte.c
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 7b071a00425d..de76e484ff3a 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2209,6 +2209,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
> select UNWIND_TABLES
> select DYNAMIC_SCS
>
> +config ARM64_CONTPTE
> + bool "Contiguous PTE mappings for user memory" if EXPERT
> + depends on TRANSPARENT_HUGEPAGE
> + default y
> + help
> + When enabled, user mappings are configured using the PTE contiguous
> + bit, for any mappings that meet the size and alignment requirements.
> + This reduces TLB pressure and improves performance.
> +
> endmenu # "Kernel Features"
>
> menu "Boot options"
> @@ -2318,4 +2327,3 @@ endmenu # "CPU Power Management"
> source "drivers/acpi/Kconfig"
>
> source "arch/arm64/kvm/Kconfig"
> -
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 6930c14f062f..e64120452301 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
> */
> #define pte_valid_not_user(pte) \
> ((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
> +/*
> + * Returns true if the pte is valid and has the contiguous bit set.
> + */
> +#define pte_valid_cont(pte) (pte_valid(pte) && pte_cont(pte))
> /*
> * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
> * so that we don't erroneously return false for pages that have been
> @@ -1116,6 +1120,184 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep,
> pte_t old_pte, pte_t new_pte);
>
> +#ifdef CONFIG_ARM64_CONTPTE
> +
> +/*
> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
> + * a private implementation detail of the public ptep API (see below).
> + */
> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte);
> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte);
> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte, unsigned int nr);
> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep);
> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep);
> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + pte_t entry, int dirty);
> +
> +static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte)
> +{
> + /*
> + * Only bother trying if both the virtual and physical addresses are
> + * aligned and correspond to the last entry in a contig range. The core
> + * code mostly modifies ranges from low to high, so this is likely
> + * the last modification in the contig range, so a good time to fold.
> + * We can't fold special mappings, because there is no associated folio.
> + */
> +
> + const unsigned long contmask = CONT_PTES - 1;
> + bool valign = (((unsigned long)ptep >> 3) & contmask) == contmask;
> + bool palign = (pte_pfn(pte) & contmask) == contmask;
> +
> + if (valign && palign &&
> + pte_valid(pte) && !pte_cont(pte) && !pte_special(pte))
> + __contpte_try_fold(mm, addr, ptep, pte);
> +}
> +
> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte)
> +{
> + if (pte_valid_cont(pte))
> + __contpte_try_unfold(mm, addr, ptep, pte);
> +}
> +
> +/*
> + * The below functions constitute the public API that arm64 presents to the
> + * core-mm to manipulate PTE entries within their page tables (or at least this
> + * is the subset of the API that arm64 needs to implement). These public
> + * versions will automatically and transparently apply the contiguous bit where
> + * it makes sense to do so. Therefore any users that are contig-aware (e.g.
> + * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
> + * private versions, which are prefixed with double underscore. All of these
> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
> + * held.
> + */
> +
> +#define ptep_get ptep_get
> +static inline pte_t ptep_get(pte_t *ptep)
> +{
> + pte_t pte = __ptep_get(ptep);
> +
> + if (!pte_valid_cont(pte))
> + return pte;
> +
> + return contpte_ptep_get(ptep, pte);
> +}
> +
> +#define ptep_get_lockless ptep_get_lockless
> +static inline pte_t ptep_get_lockless(pte_t *ptep)
> +{
> + pte_t pte = __ptep_get(ptep);
> +
> + if (!pte_valid_cont(pte))
> + return pte;
> +
> + return contpte_ptep_get_lockless(ptep);
> +}
> +
> +static inline void set_pte(pte_t *ptep, pte_t pte)
> +{
> + /*
> + * We don't have the mm or vaddr so cannot unfold or fold contig entries
> + * (since it requires tlb maintenance). set_pte() is not used in core
> + * code, so this should never even be called. Regardless do our best to
> + * service any call and emit a warning if there is any attempt to set a
> + * pte on top of an existing contig range.
> + */
> + pte_t orig_pte = __ptep_get(ptep);
> +
> + WARN_ON_ONCE(pte_valid_cont(orig_pte));
> + __set_pte(ptep, pte_mknoncont(pte));
> +}
> +
> +#define set_ptes set_ptes
> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> + pte = pte_mknoncont(pte);
> +
> + if (nr == 1) {
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> + __set_ptes(mm, addr, ptep, pte, 1);
> + contpte_try_fold(mm, addr, ptep, pte);
> + } else
> + contpte_set_ptes(mm, addr, ptep, pte, nr);
> +}
> +
> +static inline void pte_clear(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep)
> +{
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> + __pte_clear(mm, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
> +static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep)
> +{
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> + return __ptep_get_and_clear(mm, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> +static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep)
> +{
> + pte_t orig_pte = __ptep_get(ptep);
> +
> + if (!pte_valid_cont(orig_pte))
> + return __ptep_test_and_clear_young(vma, addr, ptep);
> +
> + return contpte_ptep_test_and_clear_young(vma, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep)
> +{
> + pte_t orig_pte = __ptep_get(ptep);
> +
> + if (!pte_valid_cont(orig_pte))
> + return __ptep_clear_flush_young(vma, addr, ptep);
> +
> + return contpte_ptep_clear_flush_young(vma, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_SET_WRPROTECT
> +static inline void ptep_set_wrprotect(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep)
> +{
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> + __ptep_set_wrprotect(mm, addr, ptep);
> + contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
> +}
> +
> +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
> +static inline int ptep_set_access_flags(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + pte_t entry, int dirty)
> +{
> + pte_t orig_pte = __ptep_get(ptep);
> +
> + entry = pte_mknoncont(entry);
> +
> + if (!pte_valid_cont(orig_pte))
> + return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> +
> + return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> +}
> +
> +#else /* CONFIG_ARM64_CONTPTE */
> +
> #define ptep_get __ptep_get
> #define set_pte __set_pte
> #define set_ptes __set_ptes
> @@ -1131,6 +1313,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
> #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
> #define ptep_set_access_flags __ptep_set_access_flags
>
> +#endif /* CONFIG_ARM64_CONTPTE */
> +
> #endif /* !__ASSEMBLY__ */
>
> #endif /* __ASM_PGTABLE_H */
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index dbd1bc95967d..60454256945b 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -3,6 +3,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \
> cache.o copypage.o flush.o \
> ioremap.o mmap.o pgd.o mmu.o \
> context.o proc.o pageattr.o fixmap.o
> +obj-$(CONFIG_ARM64_CONTPTE) += contpte.o
> obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
> obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
> obj-$(CONFIG_PTDUMP_DEBUGFS) += ptdump_debugfs.o
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> new file mode 100644
> index 000000000000..69c36749dd98
> --- /dev/null
> +++ b/arch/arm64/mm/contpte.c
> @@ -0,0 +1,388 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/export.h>
> +#include <asm/tlbflush.h>
> +
> +static inline bool mm_is_user(struct mm_struct *mm)
> +{
> + /*
> + * Don't attempt to apply the contig bit to kernel mappings, because
> + * dynamically adding/removing the contig bit can cause page faults.
> + * These racing faults are ok for user space, since they get serialized
> + * on the PTL. But kernel mappings can't tolerate faults.
> + */
> + return mm != &init_mm;
> +}
> +
> +static inline pte_t *contpte_align_down(pte_t *ptep)
> +{
> + return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
> +}
> +
> +static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, int nr)
> +{
> + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> + unsigned long start_addr = addr;
> + int i;
> +
> + for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
> + __pte_clear(mm, addr, ptep);
> +
> + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> +}
> +
> +static bool ptep_any_valid(pte_t *ptep, int nr)
> +{
> + int i;
> +
> + for (i = 0; i < nr; i++, ptep++) {
> + if (pte_valid(__ptep_get(ptep)))
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte)
> +{
> + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> + unsigned long start_addr;
> + pte_t *start_ptep;
> + int i;
> +
> + start_ptep = ptep = contpte_align_down(ptep);
> + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> + pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
> + pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> +
> + if (pte_dirty(ptent))
> + pte = pte_mkdirty(pte);
> +
> + if (pte_young(ptent))
> + pte = pte_mkyoung(pte);
> + }
> +
> + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> +
> + __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
> +}
> +
> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte)
> +{
> + /*
> + * We have already checked that the virtual and physical addresses are
> + * correctly aligned for a contpte mapping in contpte_try_fold() so the
> + * remaining checks are to ensure that the contpte range is fully
> + * covered by a single folio, and ensure that all the ptes are valid
> + * with contiguous PFNs and matching prots. We ignore the state of the
> + * access and dirty bits for the purpose of deciding if it's a contiguous
> + * range; the folding process will generate a single contpte entry which
> + * has a single access and dirty bit. Those 2 bits are the logical OR of
> + * their respective bits in the constituent pte entries. In order to
> + * ensure the contpte range is covered by a single folio, we must
> + * recover the folio from the pfn, but special mappings don't have a
> + * folio backing them. Fortunately contpte_try_fold() already checked
> + * that the pte is not special - we never try to fold special mappings.
> + * Note we can't use vm_normal_page() for this since we don't have the
> + * vma.
> + */
> +
> + unsigned long folio_saddr;
> + unsigned long folio_eaddr;
> + unsigned long cont_saddr;
> + unsigned long cont_eaddr;
> + struct folio *folio;
> + struct page *page;
> + unsigned long pfn;
> + pte_t *orig_ptep;
> + pgprot_t prot;
> + pte_t subpte;
> + int i;
> +
> + if (!mm_is_user(mm))
> + return;
> +
> + page = pte_page(pte);
> + folio = page_folio(page);
> + folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
> + folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
> + cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> + cont_eaddr = cont_saddr + CONT_PTE_SIZE;
> +
> + if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
> + return;
> +
> + pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> + orig_ptep = ptep;
> + ptep = contpte_align_down(ptep);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> + subpte = __ptep_get(ptep);
> + subpte = pte_mkold(pte_mkclean(subpte));
> +
> + if (!pte_valid(subpte) ||
> + pte_pfn(subpte) != pfn ||
> + pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
> + return;
> + }
> +
> + pte = pte_mkcont(pte);
> + contpte_convert(mm, addr, orig_ptep, pte);
> +}
> +EXPORT_SYMBOL(__contpte_try_fold);
> +
> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte)
> +{
> + /*
> + * We have already checked that the ptes are contiguous in
> + * contpte_try_unfold(), so just check that the mm is user space.
> + */
> +
> + if (!mm_is_user(mm))
> + return;
> +
> + pte = pte_mknoncont(pte);
> + contpte_convert(mm, addr, ptep, pte);
> +}
> +EXPORT_SYMBOL(__contpte_try_unfold);
> +
> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> +{
> + /*
> + * Gather access/dirty bits, which may be populated in any of the ptes
> + * of the contig range. We are guaranteed to be holding the PTL, so any
> + * contiguous range cannot be unfolded or otherwise modified under our
> + * feet.
> + */
> +
> + pte_t pte;
> + int i;
> +
> + ptep = contpte_align_down(ptep);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++) {
> + pte = __ptep_get(ptep);
> +
> + if (pte_dirty(pte))
> + orig_pte = pte_mkdirty(orig_pte);
> +
> + if (pte_young(pte))
> + orig_pte = pte_mkyoung(orig_pte);
> + }
> +
> + return orig_pte;
> +}
> +EXPORT_SYMBOL(contpte_ptep_get);
> +
> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> +{
> + /*
> + * Gather access/dirty bits, which may be populated in any of the ptes
> + * of the contig range. We may not be holding the PTL, so any contiguous
> + * range may be unfolded/modified/refolded under our feet. Therefore we
> + * ensure we read a _consistent_ contpte range by checking that all ptes
> + * in the range are valid and have CONT_PTE set, that all pfns are
> + * contiguous and that all pgprots are the same (ignoring access/dirty).
> + * If we find a pte that is not consistent, then we must be racing with
> + * an update so start again. If the target pte does not have CONT_PTE
> + * set then that is considered consistent on its own because it is not
> + * part of a contpte range.
> + */
> +
> + pgprot_t orig_prot;
> + unsigned long pfn;
> + pte_t orig_pte;
> + pgprot_t prot;
> + pte_t *ptep;
> + pte_t pte;
> + int i;
> +
> +retry:
> + orig_pte = __ptep_get(orig_ptep);
> +
> + if (!pte_valid_cont(orig_pte))
> + return orig_pte;
> +
> + orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
> + ptep = contpte_align_down(orig_ptep);
> + pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> + pte = __ptep_get(ptep);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +
> + if (!pte_valid_cont(pte) ||
> + pte_pfn(pte) != pfn ||
> + pgprot_val(prot) != pgprot_val(orig_prot))
> + goto retry;
> +
> + if (pte_dirty(pte))
> + orig_pte = pte_mkdirty(orig_pte);
> +
> + if (pte_young(pte))
> + orig_pte = pte_mkyoung(orig_pte);
> + }
> +
> + return orig_pte;
> +}
> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
> +
> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> + unsigned long next;
> + unsigned long end;
> + unsigned long pfn;
> + pgprot_t prot;
> + pte_t orig_pte;
> +
> + if (!mm_is_user(mm))
> + return __set_ptes(mm, addr, ptep, pte, nr);
> +
> + end = addr + (nr << PAGE_SHIFT);
> + pfn = pte_pfn(pte);
> + prot = pte_pgprot(pte);
> +
> + do {
> + next = pte_cont_addr_end(addr, end);
> + nr = (next - addr) >> PAGE_SHIFT;
> + pte = pfn_pte(pfn, prot);
> +
> + if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
> + pte = pte_mkcont(pte);
> + else
> + pte = pte_mknoncont(pte);
> +
> + /*
> + * If operating on a partial contiguous range then we must first
> + * unfold the contiguous range if it was previously folded.
> + * Otherwise we could end up with overlapping tlb entries.
> + */
> + if (nr != CONT_PTES)
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +
> + /*
> + * If we are replacing ptes that were contiguous or if the new
> + * ptes are contiguous and any of the ptes being replaced are
> + * valid, we need to clear and flush the range to prevent
> + * overlapping tlb entries.
> + */
> + orig_pte = __ptep_get(ptep);
> + if (pte_valid_cont(orig_pte) ||
> + (pte_cont(pte) && ptep_any_valid(ptep, nr)))
> + ptep_clear_flush_range(mm, addr, ptep, nr);
> +
> + __set_ptes(mm, addr, ptep, pte, nr);
> +
> + addr = next;
> + ptep += nr;
> + pfn += nr;
> +
> + } while (addr != end);
> +}
> +EXPORT_SYMBOL(contpte_set_ptes);
> +
> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep)
> +{
> + /*
> + * ptep_clear_flush_young() technically requires us to clear the access
> + * flag for a _single_ pte. However, the core-mm code actually tracks
> + * access/dirty per folio, not per page. And since we only create a
> + * contig range when the range is covered by a single folio, we can get
> + * away with clearing young for the whole contig range here, so we avoid
> + * having to unfold.
> + */
> +
> + int young = 0;
> + int i;
> +
> + ptep = contpte_align_down(ptep);
> + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> + young |= __ptep_test_and_clear_young(vma, addr, ptep);
> +
> + return young;
> +}
> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
> +
> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep)
> +{
> + int young;
> +
> + young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
> +
> + if (young) {
> + /*
> + * See comment in __ptep_clear_flush_young(); same rationale for
> + * eliding the trailing DSB applies here.
> + */
> + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> + __flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
> + PAGE_SIZE, true, 3);
> + }
> +
> + return young;
> +}
> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
> +
> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + pte_t entry, int dirty)
> +{
> + unsigned long start_addr;
> + pte_t orig_pte;
> + int i;
> +
> + /*
> + * Gather the access/dirty bits for the contiguous range. If nothing has
> + * changed, it's a noop.
> + */
> + orig_pte = pte_mknoncont(ptep_get(ptep));
> + if (pte_val(orig_pte) == pte_val(entry))
> + return 0;
> +
> + /*
> + * We can fix up access/dirty bits without having to unfold/fold the
> + * contig range. But if the write bit is changing, we need to go through
> + * the full unfold/fold cycle.
> + */
> + if (pte_write(orig_pte) == pte_write(entry)) {
> + /*
> + * For HW access management, we technically only need to update
> + * the flag on a single pte in the range. But for SW access
> + * management, we need to update all the ptes to prevent extra
> + * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
> + * and instead flush the whole range at the end.
> + */
> + ptep = contpte_align_down(ptep);
> + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> + __ptep_set_access_flags(vma, addr, ptep, entry, 0);


entry was passed through pte_mknoncont() in ptep_set_access_flags(), so here you lose
the contpte range. Is that intentional? Or am I mistaken?


> +
> + if (dirty)
> + __flush_tlb_range(vma, start_addr, addr,
> + PAGE_SIZE, true, 3);
> + } else {
> + __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
> + __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> + contpte_try_fold(vma->vm_mm, addr, ptep, entry);
> + }
> +
> + return 1;
> +}
> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);

2024-01-15 16:27:31

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 14/16] arm64/mm: Wire up PTE_CONT for user mappings

On 15/01/2024 15:14, Alexandre Ghiti wrote:
> Hi Ryan,
>
> On 18/12/2023 11:50, Ryan Roberts wrote:
>> With the ptep API sufficiently refactored, we can now introduce a new
>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>> user mappings. Whenever it detects a set of PTEs that meet the
>> requirements for a contiguous range, the PTEs are re-painted with the
>> PTE_CONT bit. Use of contpte mappings is intended to be transparent to
>> the core-mm, which continues to interact with individual ptes.
>>
>> Since a contpte block only has a single access and dirty bit, the
>> semantic here changes slightly; when getting a pte (e.g. ptep_get())
>> that is part of a contpte mapping, the access and dirty information are
>> pulled from the block (so all ptes in the block return the same
>> access/dirty info). When changing the access/dirty info on a pte (e.g.
>> ptep_set_access_flags()) that is part of a contpte mapping, this change
>> will affect the whole contpte block. This works fine in practice
>> since we guarantee that only a single folio is mapped by a contpte
>> block, and the core-mm tracks access/dirty information per folio.
>>
>> This initial change provides a baseline that can be optimized in future
>> commits. That said, fold/unfold operations (which imply tlb
>> invalidation) are avoided where possible with a few tricks for
>> access/dirty bit management. Write-protect modifications for contpte
>> mappings are currently non-optimal, and incur a regression in fork()
>> performance. This will be addressed in follow-up changes.
>>
>> In order for the public functions, which used to be pure inline, to
>> continue to be callable by modules, export all the contpte_* symbols
>> that are now called by those public inline functions.
>>
>> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
>> at build time. It defaults to enabled as long as its dependency,
>> TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
>> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
>> enabled, then there is no chance of meeting the physical contiguity
>> requirement for contpte mappings.
>>
>> Tested-by: John Hubbard <[email protected]>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>>   arch/arm64/Kconfig               |  10 +-
>>   arch/arm64/include/asm/pgtable.h | 184 +++++++++++++++
>>   arch/arm64/mm/Makefile           |   1 +
>>   arch/arm64/mm/contpte.c          | 388 +++++++++++++++++++++++++++++++
>>   4 files changed, 582 insertions(+), 1 deletion(-)
>>   create mode 100644 arch/arm64/mm/contpte.c
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 7b071a00425d..de76e484ff3a 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -2209,6 +2209,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>>       select UNWIND_TABLES
>>       select DYNAMIC_SCS
>>   +config ARM64_CONTPTE
>> +    bool "Contiguous PTE mappings for user memory" if EXPERT
>> +    depends on TRANSPARENT_HUGEPAGE
>> +    default y
>> +    help
>> +      When enabled, user mappings are configured using the PTE contiguous
>> +      bit, for any mappings that meet the size and alignment requirements.
>> +      This reduces TLB pressure and improves performance.
>> +
>>   endmenu # "Kernel Features"
>>     menu "Boot options"
>> @@ -2318,4 +2327,3 @@ endmenu # "CPU Power Management"
>>   source "drivers/acpi/Kconfig"
>>     source "arch/arm64/kvm/Kconfig"
>> -
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 6930c14f062f..e64120452301 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>>    */
>>   #define pte_valid_not_user(pte) \
>>       ((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID |
>> PTE_UXN))
>> +/*
>> + * Returns true if the pte is valid and has the contiguous bit set.
>> + */
>> +#define pte_valid_cont(pte)    (pte_valid(pte) && pte_cont(pte))
>>   /*
>>    * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>>    * so that we don't erroneously return false for pages that have been
>> @@ -1116,6 +1120,184 @@ extern void ptep_modify_prot_commit(struct
>> vm_area_struct *vma,
>>                       unsigned long addr, pte_t *ptep,
>>                       pte_t old_pte, pte_t new_pte);
>>   +#ifdef CONFIG_ARM64_CONTPTE
>> +
>> +/*
>> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
>> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
>> + * a private implementation detail of the public ptep API (see below).
>> + */
>> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>> +                pte_t *ptep, pte_t pte);
>> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +                pte_t *ptep, pte_t pte);
>> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> +                pte_t *ptep, pte_t pte, unsigned int nr);
>> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +                unsigned long addr, pte_t *ptep);
>> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> +                unsigned long addr, pte_t *ptep);
>> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> +                unsigned long addr, pte_t *ptep,
>> +                pte_t entry, int dirty);
>> +
>> +static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>> +                    pte_t *ptep, pte_t pte)
>> +{
>> +    /*
>> +     * Only bother trying if both the virtual and physical addresses are
>> +     * aligned and correspond to the last entry in a contig range. The core
>> +     * code mostly modifies ranges from low to high, so this is likely
>> +     * the last modification in the contig range, so a good time to fold.
>> +     * We can't fold special mappings, because there is no associated folio.
>> +     */
>> +
>> +    const unsigned long contmask = CONT_PTES - 1;
>> +    bool valign = (((unsigned long)ptep >> 3) & contmask) == contmask;
>> +    bool palign = (pte_pfn(pte) & contmask) == contmask;
>> +
>> +    if (valign && palign &&
>> +        pte_valid(pte) && !pte_cont(pte) && !pte_special(pte))
>> +        __contpte_try_fold(mm, addr, ptep, pte);
>> +}
>> +
>> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +                    pte_t *ptep, pte_t pte)
>> +{
>> +    if (pte_valid_cont(pte))
>> +        __contpte_try_unfold(mm, addr, ptep, pte);
>> +}
>> +
>> +/*
>> + * The below functions constitute the public API that arm64 presents to the
>> + * core-mm to manipulate PTE entries within their page tables (or at least this
>> + * is the subset of the API that arm64 needs to implement). These public
>> + * versions will automatically and transparently apply the contiguous bit where
>> + * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>> + * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
>> + * private versions, which are prefixed with double underscore. All of these
>> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
>> + * held.
>> + */
>> +
>> +#define ptep_get ptep_get
>> +static inline pte_t ptep_get(pte_t *ptep)
>> +{
>> +    pte_t pte = __ptep_get(ptep);
>> +
>> +    if (!pte_valid_cont(pte))
>> +        return pte;
>> +
>> +    return contpte_ptep_get(ptep, pte);
>> +}
>> +
>> +#define ptep_get_lockless ptep_get_lockless
>> +static inline pte_t ptep_get_lockless(pte_t *ptep)
>> +{
>> +    pte_t pte = __ptep_get(ptep);
>> +
>> +    if (!pte_valid_cont(pte))
>> +        return pte;
>> +
>> +    return contpte_ptep_get_lockless(ptep);
>> +}
>> +
>> +static inline void set_pte(pte_t *ptep, pte_t pte)
>> +{
>> +    /*
>> +     * We don't have the mm or vaddr so cannot unfold or fold contig entries
>> +     * (since it requires tlb maintenance). set_pte() is not used in core
>> +     * code, so this should never even be called. Regardless do our best to
>> +     * service any call and emit a warning if there is any attempt to set a
>> +     * pte on top of an existing contig range.
>> +     */
>> +    pte_t orig_pte = __ptep_get(ptep);
>> +
>> +    WARN_ON_ONCE(pte_valid_cont(orig_pte));
>> +    __set_pte(ptep, pte_mknoncont(pte));
>> +}
>> +
>> +#define set_ptes set_ptes
>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>> +                pte_t *ptep, pte_t pte, unsigned int nr)
>> +{
>> +    pte = pte_mknoncont(pte);
>> +
>> +    if (nr == 1) {
>> +        contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +        __set_ptes(mm, addr, ptep, pte, 1);
>> +        contpte_try_fold(mm, addr, ptep, pte);
>> +    } else
>> +        contpte_set_ptes(mm, addr, ptep, pte, nr);
>> +}
>> +
>> +static inline void pte_clear(struct mm_struct *mm,
>> +                unsigned long addr, pte_t *ptep)
>> +{
>> +    contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +    __pte_clear(mm, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>> +static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>> +                unsigned long addr, pte_t *ptep)
>> +{
>> +    contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +    return __ptep_get_and_clear(mm, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>> +static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +                unsigned long addr, pte_t *ptep)
>> +{
>> +    pte_t orig_pte = __ptep_get(ptep);
>> +
>> +    if (!pte_valid_cont(orig_pte))
>> +        return __ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +    return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> +                unsigned long addr, pte_t *ptep)
>> +{
>> +    pte_t orig_pte = __ptep_get(ptep);
>> +
>> +    if (!pte_valid_cont(orig_pte))
>> +        return __ptep_clear_flush_young(vma, addr, ptep);
>> +
>> +    return contpte_ptep_clear_flush_young(vma, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_SET_WRPROTECT
>> +static inline void ptep_set_wrprotect(struct mm_struct *mm,
>> +                unsigned long addr, pte_t *ptep)
>> +{
>> +    contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +    __ptep_set_wrprotect(mm, addr, ptep);
>> +    contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>> +static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>> +                unsigned long addr, pte_t *ptep,
>> +                pte_t entry, int dirty)
>> +{
>> +    pte_t orig_pte = __ptep_get(ptep);
>> +
>> +    entry = pte_mknoncont(entry);
>> +
>> +    if (!pte_valid_cont(orig_pte))
>> +        return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +
>> +    return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +}
>> +
>> +#else /* CONFIG_ARM64_CONTPTE */
>> +
>>   #define ptep_get                __ptep_get
>>   #define set_pte                    __set_pte
>>   #define set_ptes                __set_ptes
>> @@ -1131,6 +1313,8 @@ extern void ptep_modify_prot_commit(struct
>> vm_area_struct *vma,
>>   #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>   #define ptep_set_access_flags            __ptep_set_access_flags
>>   +#endif /* CONFIG_ARM64_CONTPTE */
>> +
>>   #endif /* !__ASSEMBLY__ */
>>     #endif /* __ASM_PGTABLE_H */
>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>> index dbd1bc95967d..60454256945b 100644
>> --- a/arch/arm64/mm/Makefile
>> +++ b/arch/arm64/mm/Makefile
>> @@ -3,6 +3,7 @@ obj-y                := dma-mapping.o extable.o fault.o init.o \
>>                      cache.o copypage.o flush.o \
>>                      ioremap.o mmap.o pgd.o mmu.o \
>>                      context.o proc.o pageattr.o fixmap.o
>> +obj-$(CONFIG_ARM64_CONTPTE)    += contpte.o
>>   obj-$(CONFIG_HUGETLB_PAGE)    += hugetlbpage.o
>>   obj-$(CONFIG_PTDUMP_CORE)    += ptdump.o
>>   obj-$(CONFIG_PTDUMP_DEBUGFS)    += ptdump_debugfs.o
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> new file mode 100644
>> index 000000000000..69c36749dd98
>> --- /dev/null
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -0,0 +1,388 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2023 ARM Ltd.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/export.h>
>> +#include <asm/tlbflush.h>
>> +
>> +static inline bool mm_is_user(struct mm_struct *mm)
>> +{
>> +    /*
>> +     * Don't attempt to apply the contig bit to kernel mappings, because
>> +     * dynamically adding/removing the contig bit can cause page faults.
>> +     * These racing faults are ok for user space, since they get serialized
>> +     * on the PTL. But kernel mappings can't tolerate faults.
>> +     */
>> +    return mm != &init_mm;
>> +}
>> +
>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>> +{
>> +    return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>> +}
>> +
>> +static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
>> +                pte_t *ptep, int nr)
>> +{
>> +    struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>> +    unsigned long start_addr = addr;
>> +    int i;
>> +
>> +    for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
>> +        __pte_clear(mm, addr, ptep);
>> +
>> +    __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>> +}
>> +
>> +static bool ptep_any_valid(pte_t *ptep, int nr)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < nr; i++, ptep++) {
>> +        if (pte_valid(__ptep_get(ptep)))
>> +            return true;
>> +    }
>> +
>> +    return false;
>> +}
>> +
>> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>> +                pte_t *ptep, pte_t pte)
>> +{
>> +    struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>> +    unsigned long start_addr;
>> +    pte_t *start_ptep;
>> +    int i;
>> +
>> +    start_ptep = ptep = contpte_align_down(ptep);
>> +    start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +    pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>> +
>> +    for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>> +        pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>> +
>> +        if (pte_dirty(ptent))
>> +            pte = pte_mkdirty(pte);
>> +
>> +        if (pte_young(ptent))
>> +            pte = pte_mkyoung(pte);
>> +    }
>> +
>> +    __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>> +
>> +    __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>> +}
>> +
>> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>> +            pte_t *ptep, pte_t pte)
>> +{
>> +    /*
>> +     * We have already checked that the virtual and physical addresses are
>> +     * correctly aligned for a contpte mapping in contpte_try_fold() so the
>> +     * remaining checks are to ensure that the contpte range is fully
>> +     * covered by a single folio, and ensure that all the ptes are valid
>> +     * with contiguous PFNs and matching prots. We ignore the state of the
>> +     * access and dirty bits for the purpose of deciding if it's a contiguous
>> +     * range; the folding process will generate a single contpte entry which
>> +     * has a single access and dirty bit. Those 2 bits are the logical OR of
>> +     * their respective bits in the constituent pte entries. In order to
>> +     * ensure the contpte range is covered by a single folio, we must
>> +     * recover the folio from the pfn, but special mappings don't have a
>> +     * folio backing them. Fortunately contpte_try_fold() already checked
>> +     * that the pte is not special - we never try to fold special mappings.
>> +     * Note we can't use vm_normal_page() for this since we don't have the
>> +     * vma.
>> +     */
>> +
>> +    unsigned long folio_saddr;
>> +    unsigned long folio_eaddr;
>> +    unsigned long cont_saddr;
>> +    unsigned long cont_eaddr;
>> +    struct folio *folio;
>> +    struct page *page;
>> +    unsigned long pfn;
>> +    pte_t *orig_ptep;
>> +    pgprot_t prot;
>> +    pte_t subpte;
>> +    int i;
>> +
>> +    if (!mm_is_user(mm))
>> +        return;
>> +
>> +    page = pte_page(pte);
>> +    folio = page_folio(page);
>> +    folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
>> +    folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
>> +    cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +    cont_eaddr = cont_saddr + CONT_PTE_SIZE;
>> +
>> +    if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
>> +        return;
>> +
>> +    pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>> +    orig_ptep = ptep;
>> +    ptep = contpte_align_down(ptep);
>> +
>> +    for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>> +        subpte = __ptep_get(ptep);
>> +        subpte = pte_mkold(pte_mkclean(subpte));
>> +
>> +        if (!pte_valid(subpte) ||
>> +            pte_pfn(subpte) != pfn ||
>> +            pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
>> +            return;
>> +    }
>> +
>> +    pte = pte_mkcont(pte);
>> +    contpte_convert(mm, addr, orig_ptep, pte);
>> +}
>> +EXPORT_SYMBOL(__contpte_try_fold);
>> +
>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +            pte_t *ptep, pte_t pte)
>> +{
>> +    /*
>> +     * We have already checked that the ptes are contiguous in
>> +     * contpte_try_unfold(), so just check that the mm is user space.
>> +     */
>> +
>> +    if (!mm_is_user(mm))
>> +        return;
>> +
>> +    pte = pte_mknoncont(pte);
>> +    contpte_convert(mm, addr, ptep, pte);
>> +}
>> +EXPORT_SYMBOL(__contpte_try_unfold);
>> +
>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> +{
>> +    /*
>> +     * Gather access/dirty bits, which may be populated in any of the ptes
>> +     * of the contig range. We are guaranteed to be holding the PTL, so any
>> +     * contiguous range cannot be unfolded or otherwise modified under our
>> +     * feet.
>> +     */
>> +
>> +    pte_t pte;
>> +    int i;
>> +
>> +    ptep = contpte_align_down(ptep);
>> +
>> +    for (i = 0; i < CONT_PTES; i++, ptep++) {
>> +        pte = __ptep_get(ptep);
>> +
>> +        if (pte_dirty(pte))
>> +            orig_pte = pte_mkdirty(orig_pte);
>> +
>> +        if (pte_young(pte))
>> +            orig_pte = pte_mkyoung(orig_pte);
>> +    }
>> +
>> +    return orig_pte;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_get);
>> +
>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>> +{
>> +    /*
>> +     * Gather access/dirty bits, which may be populated in any of the ptes
>> +     * of the contig range. We may not be holding the PTL, so any contiguous
>> +     * range may be unfolded/modified/refolded under our feet. Therefore we
>> +     * ensure we read a _consistent_ contpte range by checking that all ptes
>> +     * in the range are valid and have CONT_PTE set, that all pfns are
>> +     * contiguous and that all pgprots are the same (ignoring access/dirty).
>> +     * If we find a pte that is not consistent, then we must be racing with
>> +     * an update so start again. If the target pte does not have CONT_PTE
>> +     * set then that is considered consistent on its own because it is not
>> +     * part of a contpte range.
>> +     */
>> +
>> +    pgprot_t orig_prot;
>> +    unsigned long pfn;
>> +    pte_t orig_pte;
>> +    pgprot_t prot;
>> +    pte_t *ptep;
>> +    pte_t pte;
>> +    int i;
>> +
>> +retry:
>> +    orig_pte = __ptep_get(orig_ptep);
>> +
>> +    if (!pte_valid_cont(orig_pte))
>> +        return orig_pte;
>> +
>> +    orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>> +    ptep = contpte_align_down(orig_ptep);
>> +    pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>> +
>> +    for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>> +        pte = __ptep_get(ptep);
>> +        prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>> +
>> +        if (!pte_valid_cont(pte) ||
>> +           pte_pfn(pte) != pfn ||
>> +           pgprot_val(prot) != pgprot_val(orig_prot))
>> +            goto retry;
>> +
>> +        if (pte_dirty(pte))
>> +            orig_pte = pte_mkdirty(orig_pte);
>> +
>> +        if (pte_young(pte))
>> +            orig_pte = pte_mkyoung(orig_pte);
>> +    }
>> +
>> +    return orig_pte;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
>> +
>> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> +                    pte_t *ptep, pte_t pte, unsigned int nr)
>> +{
>> +    unsigned long next;
>> +    unsigned long end;
>> +    unsigned long pfn;
>> +    pgprot_t prot;
>> +    pte_t orig_pte;
>> +
>> +    if (!mm_is_user(mm))
>> +        return __set_ptes(mm, addr, ptep, pte, nr);
>> +
>> +    end = addr + (nr << PAGE_SHIFT);
>> +    pfn = pte_pfn(pte);
>> +    prot = pte_pgprot(pte);
>> +
>> +    do {
>> +        next = pte_cont_addr_end(addr, end);
>> +        nr = (next - addr) >> PAGE_SHIFT;
>> +        pte = pfn_pte(pfn, prot);
>> +
>> +        if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>> +            pte = pte_mkcont(pte);
>> +        else
>> +            pte = pte_mknoncont(pte);
>> +
>> +        /*
>> +         * If operating on a partial contiguous range then we must first
>> +         * unfold the contiguous range if it was previously folded.
>> +         * Otherwise we could end up with overlapping tlb entries.
>> +         */
>> +        if (nr != CONT_PTES)
>> +            contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +
>> +        /*
>> +         * If we are replacing ptes that were contiguous or if the new
>> +         * ptes are contiguous and any of the ptes being replaced are
>> +         * valid, we need to clear and flush the range to prevent
>> +         * overlapping tlb entries.
>> +         */
>> +        orig_pte = __ptep_get(ptep);
>> +        if (pte_valid_cont(orig_pte) ||
>> +            (pte_cont(pte) && ptep_any_valid(ptep, nr)))
>> +            ptep_clear_flush_range(mm, addr, ptep, nr);
>> +
>> +        __set_ptes(mm, addr, ptep, pte, nr);
>> +
>> +        addr = next;
>> +        ptep += nr;
>> +        pfn += nr;
>> +
>> +    } while (addr != end);
>> +}
>> +EXPORT_SYMBOL(contpte_set_ptes);
>> +
>> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +                    unsigned long addr, pte_t *ptep)
>> +{
>> +    /*
>> +     * ptep_clear_flush_young() technically requires us to clear the access
>> +     * flag for a _single_ pte. However, the core-mm code actually tracks
>> +     * access/dirty per folio, not per page. And since we only create a
>> +     * contig range when the range is covered by a single folio, we can get
>> +     * away with clearing young for the whole contig range here, so we avoid
>> +     * having to unfold.
>> +     */
>> +
>> +    int young = 0;
>> +    int i;
>> +
>> +    ptep = contpte_align_down(ptep);
>> +    addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +    for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> +        young |= __ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +    return young;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
>> +
>> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> +                    unsigned long addr, pte_t *ptep)
>> +{
>> +    int young;
>> +
>> +    young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +    if (young) {
>> +        /*
>> +         * See comment in __ptep_clear_flush_young(); same rationale for
>> +         * eliding the trailing DSB applies here.
>> +         */
>> +        addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +        __flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
>> +                     PAGE_SIZE, true, 3);
>> +    }
>> +
>> +    return young;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
>> +
>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> +                    unsigned long addr, pte_t *ptep,
>> +                    pte_t entry, int dirty)
>> +{
>> +    unsigned long start_addr;
>> +    pte_t orig_pte;
>> +    int i;
>> +
>> +    /*
>> +     * Gather the access/dirty bits for the contiguous range. If nothing has
>> +     * changed, it's a noop.
>> +     */
>> +    orig_pte = pte_mknoncont(ptep_get(ptep));
>> +    if (pte_val(orig_pte) == pte_val(entry))
>> +        return 0;
>> +
>> +    /*
>> +     * We can fix up access/dirty bits without having to unfold/fold the
>> +     * contig range. But if the write bit is changing, we need to go through
>> +     * the full unfold/fold cycle.
>> +     */
>> +    if (pte_write(orig_pte) == pte_write(entry)) {
>> +        /*
>> +         * For HW access management, we technically only need to update
>> +         * the flag on a single pte in the range. But for SW access
>> +         * management, we need to update all the ptes to prevent extra
>> +         * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
>> +         * and instead flush the whole range at the end.
>> +         */
>> +        ptep = contpte_align_down(ptep);
>> +        start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +        for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> +            __ptep_set_access_flags(vma, addr, ptep, entry, 0);
>
>
> entry was pte_mknoncont() in ptep_set_access_flags() so here you lose the
> contpte range, is that intentional? Or am I mistaken?

entry doesn't have the PTE_CONT bit set, that's correct. I intentionally strip that
bit at the interface boundary, because it is the implementation's job to decide
whether it's a contpte block, not the caller's. But there are situations where
the caller can end up with a pte that has PTE_CONT set (by having done a
previous ptep_get(), for example) and then forwards that pte to a setter. So
stripping it is required; it would probably be cleaner to strip it before
returning it from ptep_get(), but that would be problematic for pte_leaf_size(),
which is called from perf_get_pgtable_size().

In this particular case, __ptep_set_access_flags() only modifies the PTE's
access flags, so CONT_PTE will remain as it is in the page table. The fact that
entry has it cleared is not a problem.
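
To make that concrete, here is a minimal, hypothetical sketch (fixup_young() is
an invented helper, not core-mm code; it assumes the arm64 ptep API above) of
how a caller can legitimately forward a PTE_CONT-carrying pte to a setter,
which is why the setter must strip the bit itself:

static void fixup_young(struct vm_area_struct *vma, unsigned long addr,
            pte_t *ptep)
{
    /* ptep_get() does not strip PTE_CONT (pte_leaf_size() relies on it). */
    pte_t pte = ptep_get(ptep);

    pte = pte_mkyoung(pte);

    /*
     * ptep_set_access_flags() does pte_mknoncont(entry) internally, so it
     * doesn't matter that PTE_CONT may have leaked into 'pte' above; whether
     * the block stays contpte is the implementation's decision.
     */
    ptep_set_access_flags(vma, addr, ptep, pte, 0);
}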

Thanks,
Ryan


>
>
>> +
>> +        if (dirty)
>> +            __flush_tlb_range(vma, start_addr, addr,
>> +                            PAGE_SIZE, true, 3);
>> +    } else {
>> +        __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>> +        __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +        contpte_try_fold(vma->vm_mm, addr, ptep, entry);
>> +    }
>> +
>> +    return 1;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);


2024-01-15 21:23:47

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v4 14/16] arm64/mm: Wire up PTE_CONT for user mappings

On 15/01/2024 17:27, Ryan Roberts wrote:
> On 15/01/2024 15:14, Alexandre Ghiti wrote:
>> Hi Ryan,
>>
>> On 18/12/2023 11:50, Ryan Roberts wrote:
>>> With the ptep API sufficiently refactored, we can now introduce a new
>>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>>> user mappings. Whenever it detects a set of PTEs that meet the
>>> requirements for a contiguous range, the PTEs are re-painted with the
>>> PTE_CONT bit. Use of contpte mappings is intended to be transparent to
>>> the core-mm, which continues to interact with individual ptes.
>>>
>>> Since a contpte block only has a single access and dirty bit, the
>>> semantics here change slightly; when getting a pte (e.g. ptep_get())
>>> that is part of a contpte mapping, the access and dirty information are
>>> pulled from the block (so all ptes in the block return the same
>>> access/dirty info). When changing the access/dirty info on a pte (e.g.
>>> ptep_set_access_flags()) that is part of a contpte mapping, this change
>>> will affect the whole contpte block. This works fine in practice
>>> since we guarantee that only a single folio is mapped by a contpte
>>> block, and the core-mm tracks access/dirty information per folio.
>>>
>>> This initial change provides a baseline that can be optimized in future
>>> commits. That said, fold/unfold operations (which imply tlb
>>> invalidation) are avoided where possible with a few tricks for
>>> access/dirty bit management. Write-protect modifications for contpte
>>> mappings are currently non-optimal, and incur a regression in fork()
>>> performance. This will be addressed in follow-up changes.
>>>
>>> In order for the public functions, which used to be pure inline, to
>>> continue to be callable by modules, export all the contpte_* symbols
>>> that are now called by those public inline functions.
>>>
>>> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
>>> at build time. It defaults to enabled as long as its dependency,
>>> TRANSPARENT_HUGEPAGE, is also enabled. The core-mm depends upon
>>> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
>>> enabled, then there is no chance of meeting the physical contiguity
>>> requirement for contpte mappings.
>>>
>>> Tested-by: John Hubbard <[email protected]>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>>   arch/arm64/Kconfig               |  10 +-
>>>   arch/arm64/include/asm/pgtable.h | 184 +++++++++++++++
>>>   arch/arm64/mm/Makefile           |   1 +
>>>   arch/arm64/mm/contpte.c          | 388 +++++++++++++++++++++++++++++++
>>>   4 files changed, 582 insertions(+), 1 deletion(-)
>>>   create mode 100644 arch/arm64/mm/contpte.c
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index 7b071a00425d..de76e484ff3a 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -2209,6 +2209,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>>>       select UNWIND_TABLES
>>>       select DYNAMIC_SCS
>>>   +config ARM64_CONTPTE
>>> +    bool "Contiguous PTE mappings for user memory" if EXPERT
>>> +    depends on TRANSPARENT_HUGEPAGE
>>> +    default y
>>> +    help
>>> +      When enabled, user mappings are configured using the PTE contiguous
>>> +      bit, for any mappings that meet the size and alignment requirements.
>>> +      This reduces TLB pressure and improves performance.
>>> +
>>>   endmenu # "Kernel Features"
>>>     menu "Boot options"
>>> @@ -2318,4 +2327,3 @@ endmenu # "CPU Power Management"
>>>   source "drivers/acpi/Kconfig"
>>>     source "arch/arm64/kvm/Kconfig"
>>> -
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 6930c14f062f..e64120452301 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>>>    */
>>>   #define pte_valid_not_user(pte) \
>>>       ((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID |
>>> PTE_UXN))
>>> +/*
>>> + * Returns true if the pte is valid and has the contiguous bit set.
>>> + */
>>> +#define pte_valid_cont(pte)    (pte_valid(pte) && pte_cont(pte))
>>>   /*
>>>    * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>>>    * so that we don't erroneously return false for pages that have been
>>> @@ -1116,6 +1120,184 @@ extern void ptep_modify_prot_commit(struct
>>> vm_area_struct *vma,
>>>                       unsigned long addr, pte_t *ptep,
>>>                       pte_t old_pte, pte_t new_pte);
>>>   +#ifdef CONFIG_ARM64_CONTPTE
>>> +
>>> +/*
>>> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
>>> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
>>> + * a private implementation detail of the public ptep API (see below).
>>> + */
>>> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>>> +                pte_t *ptep, pte_t pte);
>>> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>> +                pte_t *ptep, pte_t pte);
>>> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>>> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>>> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>> +                pte_t *ptep, pte_t pte, unsigned int nr);
>>> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> +                unsigned long addr, pte_t *ptep);
>>> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>> +                unsigned long addr, pte_t *ptep);
>>> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>> +                unsigned long addr, pte_t *ptep,
>>> +                pte_t entry, int dirty);
>>> +
>>> +static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>>> +                    pte_t *ptep, pte_t pte)
>>> +{
>>> +    /*
>>> +     * Only bother trying if both the virtual and physical addresses are
>>> +     * aligned and correspond to the last entry in a contig range. The core
>>> +     * code mostly modifies ranges from low to high, so this is likely
>>> +     * the last modification in the contig range, so a good time to fold.
>>> +     * We can't fold special mappings, because there is no associated folio.
>>> +     */
>>> +
>>> +    const unsigned long contmask = CONT_PTES - 1;
>>> +    bool valign = (((unsigned long)ptep >> 3) & contmask) == contmask;
>>> +    bool palign = (pte_pfn(pte) & contmask) == contmask;
>>> +
>>> +    if (valign && palign &&
>>> +        pte_valid(pte) && !pte_cont(pte) && !pte_special(pte))
>>> +        __contpte_try_fold(mm, addr, ptep, pte);
>>> +}
>>> +
>>> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>> +                    pte_t *ptep, pte_t pte)
>>> +{
>>> +    if (pte_valid_cont(pte))
>>> +        __contpte_try_unfold(mm, addr, ptep, pte);
>>> +}
>>> +
>>> +/*
>>> + * The below functions constitute the public API that arm64 presents to the
>>> + * core-mm to manipulate PTE entries within their page tables (or at least this
>>> + * is the subset of the API that arm64 needs to implement). These public
>>> + * versions will automatically and transparently apply the contiguous bit where
>>> + * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>>> + * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
>>> + * private versions, which are prefixed with double underscore. All of these
>>> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
>>> + * held.
>>> + */
>>> +
>>> +#define ptep_get ptep_get
>>> +static inline pte_t ptep_get(pte_t *ptep)
>>> +{
>>> +    pte_t pte = __ptep_get(ptep);
>>> +
>>> +    if (!pte_valid_cont(pte))
>>> +        return pte;
>>> +
>>> +    return contpte_ptep_get(ptep, pte);
>>> +}
>>> +
>>> +#define ptep_get_lockless ptep_get_lockless
>>> +static inline pte_t ptep_get_lockless(pte_t *ptep)
>>> +{
>>> +    pte_t pte = __ptep_get(ptep);
>>> +
>>> +    if (!pte_valid_cont(pte))
>>> +        return pte;
>>> +
>>> +    return contpte_ptep_get_lockless(ptep);
>>> +}
>>> +
>>> +static inline void set_pte(pte_t *ptep, pte_t pte)
>>> +{
>>> +    /*
>>> +     * We don't have the mm or vaddr so cannot unfold or fold contig entries
>>> +     * (since it requires tlb maintenance). set_pte() is not used in core
>>> +     * code, so this should never even be called. Regardless, do our best to
>>> +     * service any call and emit a warning if there is any attempt to set a
>>> +     * pte on top of an existing contig range.
>>> +     */
>>> +    pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +    WARN_ON_ONCE(pte_valid_cont(orig_pte));
>>> +    __set_pte(ptep, pte_mknoncont(pte));
>>> +}
>>> +
>>> +#define set_ptes set_ptes
>>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>> +                pte_t *ptep, pte_t pte, unsigned int nr)
>>> +{
>>> +    pte = pte_mknoncont(pte);
>>> +
>>> +    if (nr == 1) {
>>> +        contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +        __set_ptes(mm, addr, ptep, pte, 1);
>>> +        contpte_try_fold(mm, addr, ptep, pte);
>>> +    } else
>>> +        contpte_set_ptes(mm, addr, ptep, pte, nr);
>>> +}
>>> +
>>> +static inline void pte_clear(struct mm_struct *mm,
>>> +                unsigned long addr, pte_t *ptep)
>>> +{
>>> +    contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +    __pte_clear(mm, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>> +static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>> +                unsigned long addr, pte_t *ptep)
>>> +{
>>> +    contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +    return __ptep_get_and_clear(mm, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>> +static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> +                unsigned long addr, pte_t *ptep)
>>> +{
>>> +    pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +    if (!pte_valid_cont(orig_pte))
>>> +        return __ptep_test_and_clear_young(vma, addr, ptep);
>>> +
>>> +    return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>> +                unsigned long addr, pte_t *ptep)
>>> +{
>>> +    pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +    if (!pte_valid_cont(orig_pte))
>>> +        return __ptep_clear_flush_young(vma, addr, ptep);
>>> +
>>> +    return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_SET_WRPROTECT
>>> +static inline void ptep_set_wrprotect(struct mm_struct *mm,
>>> +                unsigned long addr, pte_t *ptep)
>>> +{
>>> +    contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +    __ptep_set_wrprotect(mm, addr, ptep);
>>> +    contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>> +static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>>> +                unsigned long addr, pte_t *ptep,
>>> +                pte_t entry, int dirty)
>>> +{
>>> +    pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +    entry = pte_mknoncont(entry);
>>> +
>>> +    if (!pte_valid_cont(orig_pte))
>>> +        return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>> +
>>> +    return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>> +}
>>> +
>>> +#else /* CONFIG_ARM64_CONTPTE */
>>> +
>>>   #define ptep_get                __ptep_get
>>>   #define set_pte                    __set_pte
>>>   #define set_ptes                __set_ptes
>>> @@ -1131,6 +1313,8 @@ extern void ptep_modify_prot_commit(struct
>>> vm_area_struct *vma,
>>>   #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>>   #define ptep_set_access_flags            __ptep_set_access_flags
>>>   +#endif /* CONFIG_ARM64_CONTPTE */
>>> +
>>>   #endif /* !__ASSEMBLY__ */
>>>     #endif /* __ASM_PGTABLE_H */
>>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>>> index dbd1bc95967d..60454256945b 100644
>>> --- a/arch/arm64/mm/Makefile
>>> +++ b/arch/arm64/mm/Makefile
>>> @@ -3,6 +3,7 @@ obj-y                := dma-mapping.o extable.o fault.o init.o \
>>>                      cache.o copypage.o flush.o \
>>>                      ioremap.o mmap.o pgd.o mmu.o \
>>>                      context.o proc.o pageattr.o fixmap.o
>>> +obj-$(CONFIG_ARM64_CONTPTE)    += contpte.o
>>>   obj-$(CONFIG_HUGETLB_PAGE)    += hugetlbpage.o
>>>   obj-$(CONFIG_PTDUMP_CORE)    += ptdump.o
>>>   obj-$(CONFIG_PTDUMP_DEBUGFS)    += ptdump_debugfs.o
>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>> new file mode 100644
>>> index 000000000000..69c36749dd98
>>> --- /dev/null
>>> +++ b/arch/arm64/mm/contpte.c
>>> @@ -0,0 +1,388 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * Copyright (C) 2023 ARM Ltd.
>>> + */
>>> +
>>> +#include <linux/mm.h>
>>> +#include <linux/export.h>
>>> +#include <asm/tlbflush.h>
>>> +
>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>> +{
>>> +    /*
>>> +     * Don't attempt to apply the contig bit to kernel mappings, because
>>> +     * dynamically adding/removing the contig bit can cause page faults.
>>> +     * These racing faults are ok for user space, since they get serialized
>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>> +     */
>>> +    return mm != &init_mm;
>>> +}
>>> +
>>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>>> +{
>>> +    return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>>> +}
>>> +
>>> +static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
>>> +                pte_t *ptep, int nr)
>>> +{
>>> +    struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>> +    unsigned long start_addr = addr;
>>> +    int i;
>>> +
>>> +    for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
>>> +        __pte_clear(mm, addr, ptep);
>>> +
>>> +    __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>> +}
>>> +
>>> +static bool ptep_any_valid(pte_t *ptep, int nr)
>>> +{
>>> +    int i;
>>> +
>>> +    for (i = 0; i < nr; i++, ptep++) {
>>> +        if (pte_valid(__ptep_get(ptep)))
>>> +            return true;
>>> +    }
>>> +
>>> +    return false;
>>> +}
>>> +
>>> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>>> +                pte_t *ptep, pte_t pte)
>>> +{
>>> +    struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>> +    unsigned long start_addr;
>>> +    pte_t *start_ptep;
>>> +    int i;
>>> +
>>> +    start_ptep = ptep = contpte_align_down(ptep);
>>> +    start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +    pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>>> +
>>> +    for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>>> +        pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>> +
>>> +        if (pte_dirty(ptent))
>>> +            pte = pte_mkdirty(pte);
>>> +
>>> +        if (pte_young(ptent))
>>> +            pte = pte_mkyoung(pte);
>>> +    }
>>> +
>>> +    __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>> +
>>> +    __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>>> +}
>>> +
>>> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>>> +            pte_t *ptep, pte_t pte)
>>> +{
>>> +    /*
>>> +     * We have already checked that the virtual and physical addresses are
>>> +     * correctly aligned for a contpte mapping in contpte_try_fold() so the
>>> +     * remaining checks are to ensure that the contpte range is fully
>>> +     * covered by a single folio, and ensure that all the ptes are valid
>>> +     * with contiguous PFNs and matching prots. We ignore the state of the
>>> +     * access and dirty bits for the purpose of deciding if it's a contiguous
>>> +     * range; the folding process will generate a single contpte entry which
>>> +     * has a single access and dirty bit. Those 2 bits are the logical OR of
>>> +     * their respective bits in the constituent pte entries. In order to
>>> +     * ensure the contpte range is covered by a single folio, we must
>>> +     * recover the folio from the pfn, but special mappings don't have a
>>> +     * folio backing them. Fortunately contpte_try_fold() already checked
>>> +     * that the pte is not special - we never try to fold special mappings.
>>> +     * Note we can't use vm_normal_page() for this since we don't have the
>>> +     * vma.
>>> +     */
>>> +
>>> +    unsigned long folio_saddr;
>>> +    unsigned long folio_eaddr;
>>> +    unsigned long cont_saddr;
>>> +    unsigned long cont_eaddr;
>>> +    struct folio *folio;
>>> +    struct page *page;
>>> +    unsigned long pfn;
>>> +    pte_t *orig_ptep;
>>> +    pgprot_t prot;
>>> +    pte_t subpte;
>>> +    int i;
>>> +
>>> +    if (!mm_is_user(mm))
>>> +        return;
>>> +
>>> +    page = pte_page(pte);
>>> +    folio = page_folio(page);
>>> +    folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
>>> +    folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
>>> +    cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +    cont_eaddr = cont_saddr + CONT_PTE_SIZE;
>>> +
>>> +    if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
>>> +        return;
>>> +
>>> +    pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>> +    orig_ptep = ptep;
>>> +    ptep = contpte_align_down(ptep);
>>> +
>>> +    for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>> +        subpte = __ptep_get(ptep);
>>> +        subpte = pte_mkold(pte_mkclean(subpte));
>>> +
>>> +        if (!pte_valid(subpte) ||
>>> +            pte_pfn(subpte) != pfn ||
>>> +            pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
>>> +            return;
>>> +    }
>>> +
>>> +    pte = pte_mkcont(pte);
>>> +    contpte_convert(mm, addr, orig_ptep, pte);
>>> +}
>>> +EXPORT_SYMBOL(__contpte_try_fold);
>>> +
>>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>> +            pte_t *ptep, pte_t pte)
>>> +{
>>> +    /*
>>> +     * We have already checked that the ptes are contiguous in
>>> +     * contpte_try_unfold(), so just check that the mm is user space.
>>> +     */
>>> +
>>> +    if (!mm_is_user(mm))
>>> +        return;
>>> +
>>> +    pte = pte_mknoncont(pte);
>>> +    contpte_convert(mm, addr, ptep, pte);
>>> +}
>>> +EXPORT_SYMBOL(__contpte_try_unfold);
>>> +
>>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>> +{
>>> +    /*
>>> +     * Gather access/dirty bits, which may be populated in any of the ptes
>>> +     * of the contig range. We are guaranteed to be holding the PTL, so any
>>> +     * contiguous range cannot be unfolded or otherwise modified under our
>>> +     * feet.
>>> +     */
>>> +
>>> +    pte_t pte;
>>> +    int i;
>>> +
>>> +    ptep = contpte_align_down(ptep);
>>> +
>>> +    for (i = 0; i < CONT_PTES; i++, ptep++) {
>>> +        pte = __ptep_get(ptep);
>>> +
>>> +        if (pte_dirty(pte))
>>> +            orig_pte = pte_mkdirty(orig_pte);
>>> +
>>> +        if (pte_young(pte))
>>> +            orig_pte = pte_mkyoung(orig_pte);
>>> +    }
>>> +
>>> +    return orig_pte;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_get);
>>> +
>>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>>> +{
>>> +    /*
>>> +     * Gather access/dirty bits, which may be populated in any of the ptes
>>> +     * of the contig range. We may not be holding the PTL, so any contiguous
>>> +     * range may be unfolded/modified/refolded under our feet. Therefore we
>>> +     * ensure we read a _consistent_ contpte range by checking that all ptes
>>> +     * in the range are valid and have CONT_PTE set, that all pfns are
>>> +     * contiguous and that all pgprots are the same (ignoring access/dirty).
>>> +     * If we find a pte that is not consistent, then we must be racing with
>>> +     * an update so start again. If the target pte does not have CONT_PTE
>>> +     * set then that is considered consistent on its own because it is not
>>> +     * part of a contpte range.
>>> +     */
>>> +
>>> +    pgprot_t orig_prot;
>>> +    unsigned long pfn;
>>> +    pte_t orig_pte;
>>> +    pgprot_t prot;
>>> +    pte_t *ptep;
>>> +    pte_t pte;
>>> +    int i;
>>> +
>>> +retry:
>>> +    orig_pte = __ptep_get(orig_ptep);
>>> +
>>> +    if (!pte_valid_cont(orig_pte))
>>> +        return orig_pte;
>>> +
>>> +    orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>>> +    ptep = contpte_align_down(orig_ptep);
>>> +    pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>>> +
>>> +    for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>> +        pte = __ptep_get(ptep);
>>> +        prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>> +
>>> +        if (!pte_valid_cont(pte) ||
>>> +           pte_pfn(pte) != pfn ||
>>> +           pgprot_val(prot) != pgprot_val(orig_prot))
>>> +            goto retry;
>>> +
>>> +        if (pte_dirty(pte))
>>> +            orig_pte = pte_mkdirty(orig_pte);
>>> +
>>> +        if (pte_young(pte))
>>> +            orig_pte = pte_mkyoung(orig_pte);
>>> +    }
>>> +
>>> +    return orig_pte;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
>>> +
>>> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>> +                    pte_t *ptep, pte_t pte, unsigned int nr)
>>> +{
>>> +    unsigned long next;
>>> +    unsigned long end;
>>> +    unsigned long pfn;
>>> +    pgprot_t prot;
>>> +    pte_t orig_pte;
>>> +
>>> +    if (!mm_is_user(mm))
>>> +        return __set_ptes(mm, addr, ptep, pte, nr);
>>> +
>>> +    end = addr + (nr << PAGE_SHIFT);
>>> +    pfn = pte_pfn(pte);
>>> +    prot = pte_pgprot(pte);
>>> +
>>> +    do {
>>> +        next = pte_cont_addr_end(addr, end);
>>> +        nr = (next - addr) >> PAGE_SHIFT;
>>> +        pte = pfn_pte(pfn, prot);
>>> +
>>> +        if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>>> +            pte = pte_mkcont(pte);
>>> +        else
>>> +            pte = pte_mknoncont(pte);
>>> +
>>> +        /*
>>> +         * If operating on a partial contiguous range then we must first
>>> +         * unfold the contiguous range if it was previously folded.
>>> +         * Otherwise we could end up with overlapping tlb entries.
>>> +         */
>>> +        if (nr != CONT_PTES)
>>> +            contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +
>>> +        /*
>>> +         * If we are replacing ptes that were contiguous or if the new
>>> +         * ptes are contiguous and any of the ptes being replaced are
>>> +         * valid, we need to clear and flush the range to prevent
>>> +         * overlapping tlb entries.
>>> +         */
>>> +        orig_pte = __ptep_get(ptep);
>>> +        if (pte_valid_cont(orig_pte) ||
>>> +            (pte_cont(pte) && ptep_any_valid(ptep, nr)))
>>> +            ptep_clear_flush_range(mm, addr, ptep, nr);
>>> +
>>> +        __set_ptes(mm, addr, ptep, pte, nr);
>>> +
>>> +        addr = next;
>>> +        ptep += nr;
>>> +        pfn += nr;
>>> +
>>> +    } while (addr != end);
>>> +}
>>> +EXPORT_SYMBOL(contpte_set_ptes);
>>> +
>>> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> +                    unsigned long addr, pte_t *ptep)
>>> +{
>>> +    /*
>>> +     * ptep_clear_flush_young() technically requires us to clear the access
>>> +     * flag for a _single_ pte. However, the core-mm code actually tracks
>>> +     * access/dirty per folio, not per page. And since we only create a
>>> +     * contig range when the range is covered by a single folio, we can get
>>> +     * away with clearing young for the whole contig range here, so we avoid
>>> +     * having to unfold.
>>> +     */
>>> +
>>> +    int young = 0;
>>> +    int i;
>>> +
>>> +    ptep = contpte_align_down(ptep);
>>> +    addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +
>>> +    for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>> +        young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>> +
>>> +    return young;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
>>> +
>>> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>> +                    unsigned long addr, pte_t *ptep)
>>> +{
>>> +    int young;
>>> +
>>> +    young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>> +
>>> +    if (young) {
>>> +        /*
>>> +         * See comment in __ptep_clear_flush_young(); same rationale for
>>> +         * eliding the trailing DSB applies here.
>>> +         */
>>> +        addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +        __flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
>>> +                     PAGE_SIZE, true, 3);
>>> +    }
>>> +
>>> +    return young;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
>>> +
>>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>> +                    unsigned long addr, pte_t *ptep,
>>> +                    pte_t entry, int dirty)
>>> +{
>>> +    unsigned long start_addr;
>>> +    pte_t orig_pte;
>>> +    int i;
>>> +
>>> +    /*
>>> +     * Gather the access/dirty bits for the contiguous range. If nothing has
>>> +     * changed, it's a noop.
>>> +     */
>>> +    orig_pte = pte_mknoncont(ptep_get(ptep));
>>> +    if (pte_val(orig_pte) == pte_val(entry))
>>> +        return 0;
>>> +
>>> +    /*
>>> +     * We can fix up access/dirty bits without having to unfold/fold the
>>> +     * contig range. But if the write bit is changing, we need to go through
>>> +     * the full unfold/fold cycle.
>>> +     */
>>> +    if (pte_write(orig_pte) == pte_write(entry)) {
>>> +        /*
>>> +         * For HW access management, we technically only need to update
>>> +         * the flag on a single pte in the range. But for SW access
>>> +         * management, we need to update all the ptes to prevent extra
>>> +         * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
>>> +         * and instead flush the whole range at the end.
>>> +         */
>>> +        ptep = contpte_align_down(ptep);
>>> +        start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +
>>> +        for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>> +            __ptep_set_access_flags(vma, addr, ptep, entry, 0);
>>
>> entry was pte_mknoncont() in ptep_set_access_flags() so here you lose the
>> contpte range, is that intentional? Or am I mistaken?
> entry doesn't have the PTE_CONT bit set, that's correct. I intentionally strip that
> bit at the interface boundary, because it is the implementation's job to decide
> whether it's a contpte block, not the caller's. But there are situations where
> the caller can end up with a pte that has PTE_CONT set (by having done a
> previous ptep_get(), for example) and then forwards that pte to a setter. So
> stripping it is required; it would probably be cleaner to strip it before
> returning it from ptep_get(), but that would be problematic for pte_leaf_size(),
> which is called from perf_get_pgtable_size().
>
> In this particular case, __ptep_set_access_flags() only modifies the PTE's
> access flags, so CONT_PTE will remain as it is in the page table. The fact that
> entry has it cleared is not a problem.


I see, I had not checked the arm64 implementation of
ptep_set_access_flags(). For context, I'm merging the arm64 contpte
support with the riscv napot support, the implementation being quite
similar (although riscv is a bit different as it uses bits from the pfn
to advertise the number of contiguous ptes).

Anyway, our implementation of ptep_set_access_flags() actually sets the
ptep with entry, so we would actually lose the cont bit. I would simply
do the following (I will in my patchset, no need for you to worry about
this):

__ptep_set_access_flags(vma, addr, ptep, pte_mkcont(entry), 0);
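
In context, that single change would land in the loop quoted above (just a
sketch; the alignment lines are copied from the arm64 patch and only the
pte_mkcont() is new):

    ptep = contpte_align_down(ptep);
    start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);

    for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
        __ptep_set_access_flags(vma, addr, ptep, pte_mkcont(entry), 0);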

Let me know if you think this is not right,

Thanks,

Alex


>
> Thanks,
> Ryan
>
>
>>
>>> +
>>> +        if (dirty)
>>> +            __flush_tlb_range(vma, start_addr, addr,
>>> +                            PAGE_SIZE, true, 3);
>>> +    } else {
>>> +        __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>>> +        __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>> +        contpte_try_fold(vma->vm_mm, addr, ptep, entry);
>>> +    }
>>> +
>>> +    return 1;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);

2024-01-16 14:45:22

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 14/16] arm64/mm: Wire up PTE_CONT for user mappings

On 15/01/2024 21:23, Alexandre Ghiti wrote:
> On 15/01/2024 17:27, Ryan Roberts wrote:
>> On 15/01/2024 15:14, Alexandre Ghiti wrote:
>>> Hi Ryan,
>>>
>>> On 18/12/2023 11:50, Ryan Roberts wrote:
>>>> With the ptep API sufficiently refactored, we can now introduce a new
>>>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>>>> user mappings. Whenever it detects a set of PTEs that meet the
>>>> requirements for a contiguous range, the PTEs are re-painted with the
>>>> PTE_CONT bit. Use of contpte mappings is intended to be transparent to
>>>> the core-mm, which continues to interact with individual ptes.
>>>>
>>>> Since a contpte block only has a single access and dirty bit, the
>>>> semantics here change slightly; when getting a pte (e.g. ptep_get())
>>>> that is part of a contpte mapping, the access and dirty information are
>>>> pulled from the block (so all ptes in the block return the same
>>>> access/dirty info). When changing the access/dirty info on a pte (e.g.
>>>> ptep_set_access_flags()) that is part of a contpte mapping, this change
>>>> will affect the whole contpte block. This works fine in practice
>>>> since we guarantee that only a single folio is mapped by a contpte
>>>> block, and the core-mm tracks access/dirty information per folio.
>>>>
>>>> This initial change provides a baseline that can be optimized in future
>>>> commits. That said, fold/unfold operations (which imply tlb
>>>> invalidation) are avoided where possible with a few tricks for
>>>> access/dirty bit management. Write-protect modifications for contpte
>>>> mappings are currently non-optimal, and incur a regression in fork()
>>>> performance. This will be addressed in follow-up changes.
>>>>
>>>> In order for the public functions, which used to be pure inline, to
>>>> continue to be callable by modules, export all the contpte_* symbols
>>>> that are now called by those public inline functions.
>>>>
>>>> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
>>>> at build time. It defaults to enabled as long as its dependency,
>>>> TRANSPARENT_HUGEPAGE, is also enabled. The core-mm depends upon
>>>> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
>>>> enabled, then there is no chance of meeting the physical contiguity
>>>> requirement for contpte mappings.
>>>>
>>>> Tested-by: John Hubbard <[email protected]>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>> ---
>>>>    arch/arm64/Kconfig               |  10 +-
>>>>    arch/arm64/include/asm/pgtable.h | 184 +++++++++++++++
>>>>    arch/arm64/mm/Makefile           |   1 +
>>>>    arch/arm64/mm/contpte.c          | 388 +++++++++++++++++++++++++++++++
>>>>    4 files changed, 582 insertions(+), 1 deletion(-)
>>>>    create mode 100644 arch/arm64/mm/contpte.c
>>>>
>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>>> index 7b071a00425d..de76e484ff3a 100644
>>>> --- a/arch/arm64/Kconfig
>>>> +++ b/arch/arm64/Kconfig
>>>> @@ -2209,6 +2209,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>>>>        select UNWIND_TABLES
>>>>        select DYNAMIC_SCS
>>>>    +config ARM64_CONTPTE
>>>> +    bool "Contiguous PTE mappings for user memory" if EXPERT
>>>> +    depends on TRANSPARENT_HUGEPAGE
>>>> +    default y
>>>> +    help
>>>> +      When enabled, user mappings are configured using the PTE contiguous
>>>> +      bit, for any mappings that meet the size and alignment requirements.
>>>> +      This reduces TLB pressure and improves performance.
>>>> +
>>>>    endmenu # "Kernel Features"
>>>>      menu "Boot options"
>>>> @@ -2318,4 +2327,3 @@ endmenu # "CPU Power Management"
>>>>    source "drivers/acpi/Kconfig"
>>>>      source "arch/arm64/kvm/Kconfig"
>>>> -
>>>> diff --git a/arch/arm64/include/asm/pgtable.h
>>>> b/arch/arm64/include/asm/pgtable.h
>>>> index 6930c14f062f..e64120452301 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>>>>     */
>>>>    #define pte_valid_not_user(pte) \
>>>>        ((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID |
>>>> PTE_UXN))
>>>> +/*
>>>> + * Returns true if the pte is valid and has the contiguous bit set.
>>>> + */
>>>> +#define pte_valid_cont(pte)    (pte_valid(pte) && pte_cont(pte))
>>>>    /*
>>>>     * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>>>>     * so that we don't erroneously return false for pages that have been
>>>> @@ -1116,6 +1120,184 @@ extern void ptep_modify_prot_commit(struct
>>>> vm_area_struct *vma,
>>>>                        unsigned long addr, pte_t *ptep,
>>>>                        pte_t old_pte, pte_t new_pte);
>>>>    +#ifdef CONFIG_ARM64_CONTPTE
>>>> +
>>>> +/*
>>>> + * The contpte APIs are used to transparently manage the contiguous bit in
>>>> ptes
>>>> + * where it is possible and makes sense to do so. The PTE_CONT bit is
>>>> considered
>>>> + * a private implementation detail of the public ptep API (see below).
>>>> + */
>>>> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>>>> +                pte_t *ptep, pte_t pte);
>>>> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>>> +                pte_t *ptep, pte_t pte);
>>>> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>>>> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>>>> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>>> +                pte_t *ptep, pte_t pte, unsigned int nr);
>>>> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> +                unsigned long addr, pte_t *ptep);
>>>> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>> +                unsigned long addr, pte_t *ptep);
>>>> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>>> +                unsigned long addr, pte_t *ptep,
>>>> +                pte_t entry, int dirty);
>>>> +
>>>> +static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>>>> +                    pte_t *ptep, pte_t pte)
>>>> +{
>>>> +    /*
>>>> +     * Only bother trying if both the virtual and physical addresses are
>>>> +     * aligned and correspond to the last entry in a contig range. The core
>>>> +     * code mostly modifies ranges from low to high, so this is likely
>>>> +     * the last modification in the contig range, so a good time to fold.
>>>> +     * We can't fold special mappings, because there is no associated folio.
>>>> +     */
>>>> +
>>>> +    const unsigned long contmask = CONT_PTES - 1;
>>>> +    bool valign = (((unsigned long)ptep >> 3) & contmask) == contmask;
>>>> +    bool palign = (pte_pfn(pte) & contmask) == contmask;
>>>> +
>>>> +    if (valign && palign &&
>>>> +        pte_valid(pte) && !pte_cont(pte) && !pte_special(pte))
>>>> +        __contpte_try_fold(mm, addr, ptep, pte);
>>>> +}
>>>> +
>>>> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long
>>>> addr,
>>>> +                    pte_t *ptep, pte_t pte)
>>>> +{
>>>> +    if (pte_valid_cont(pte))
>>>> +        __contpte_try_unfold(mm, addr, ptep, pte);
>>>> +}
>>>> +
>>>> +/*
>>>> + * The below functions constitute the public API that arm64 presents to the
>>>> + * core-mm to manipulate PTE entries within their page tables (or at least
>>>> this
>>>> + * is the subset of the API that arm64 needs to implement). These public
>>>> + * versions will automatically and transparently apply the contiguous bit
>>>> where
>>>> + * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>>>> + * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
>>>> + * private versions, which are prefixed with double underscore. All of these
>>>> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
>>>> + * held.
>>>> + */
>>>> +
>>>> +#define ptep_get ptep_get
>>>> +static inline pte_t ptep_get(pte_t *ptep)
>>>> +{
>>>> +    pte_t pte = __ptep_get(ptep);
>>>> +
>>>> +    if (!pte_valid_cont(pte))
>>>> +        return pte;
>>>> +
>>>> +    return contpte_ptep_get(ptep, pte);
>>>> +}
>>>> +
>>>> +#define ptep_get_lockless ptep_get_lockless
>>>> +static inline pte_t ptep_get_lockless(pte_t *ptep)
>>>> +{
>>>> +    pte_t pte = __ptep_get(ptep);
>>>> +
>>>> +    if (!pte_valid_cont(pte))
>>>> +        return pte;
>>>> +
>>>> +    return contpte_ptep_get_lockless(ptep);
>>>> +}
>>>> +
>>>> +static inline void set_pte(pte_t *ptep, pte_t pte)
>>>> +{
>>>> +    /*
>>>> +     * We don't have the mm or vaddr so cannot unfold or fold contig entries
>>>> +     * (since it requires tlb maintenance). set_pte() is not used in core
>>>> +     * code, so this should never even be called. Regardless, do our best to
>>>> +     * service any call and emit a warning if there is any attempt to set a
>>>> +     * pte on top of an existing contig range.
>>>> +     */
>>>> +    pte_t orig_pte = __ptep_get(ptep);
>>>> +
>>>> +    WARN_ON_ONCE(pte_valid_cont(orig_pte));
>>>> +    __set_pte(ptep, pte_mknoncont(pte));
>>>> +}
>>>> +
>>>> +#define set_ptes set_ptes
>>>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>>> +                pte_t *ptep, pte_t pte, unsigned int nr)
>>>> +{
>>>> +    pte = pte_mknoncont(pte);
>>>> +
>>>> +    if (nr == 1) {
>>>> +        contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>>> +        __set_ptes(mm, addr, ptep, pte, 1);
>>>> +        contpte_try_fold(mm, addr, ptep, pte);
>>>> +    } else
>>>> +        contpte_set_ptes(mm, addr, ptep, pte, nr);
>>>> +}
>>>> +
>>>> +static inline void pte_clear(struct mm_struct *mm,
>>>> +                unsigned long addr, pte_t *ptep)
>>>> +{
>>>> +    contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>>> +    __pte_clear(mm, addr, ptep);
>>>> +}
>>>> +
>>>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>> +static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>> +                unsigned long addr, pte_t *ptep)
>>>> +{
>>>> +    contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>>> +    return __ptep_get_and_clear(mm, addr, ptep);
>>>> +}
>>>> +
>>>> +#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>>> +static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> +                unsigned long addr, pte_t *ptep)
>>>> +{
>>>> +    pte_t orig_pte = __ptep_get(ptep);
>>>> +
>>>> +    if (!pte_valid_cont(orig_pte))
>>>> +        return __ptep_test_and_clear_young(vma, addr, ptep);
>>>> +
>>>> +    return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>>> +}
>>>> +
>>>> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>>> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>> +                unsigned long addr, pte_t *ptep)
>>>> +{
>>>> +    pte_t orig_pte = __ptep_get(ptep);
>>>> +
>>>> +    if (!pte_valid_cont(orig_pte))
>>>> +        return __ptep_clear_flush_young(vma, addr, ptep);
>>>> +
>>>> +    return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>>> +}
>>>> +
>>>> +#define __HAVE_ARCH_PTEP_SET_WRPROTECT
>>>> +static inline void ptep_set_wrprotect(struct mm_struct *mm,
>>>> +                unsigned long addr, pte_t *ptep)
>>>> +{
>>>> +    contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>>> +    __ptep_set_wrprotect(mm, addr, ptep);
>>>> +    contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
>>>> +}
>>>> +
>>>> +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>>> +static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>>>> +                unsigned long addr, pte_t *ptep,
>>>> +                pte_t entry, int dirty)
>>>> +{
>>>> +    pte_t orig_pte = __ptep_get(ptep);
>>>> +
>>>> +    entry = pte_mknoncont(entry);
>>>> +
>>>> +    if (!pte_valid_cont(orig_pte))
>>>> +        return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>>> +
>>>> +    return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>>> +}
>>>> +
>>>> +#else /* CONFIG_ARM64_CONTPTE */
>>>> +
>>>>    #define ptep_get                __ptep_get
>>>>    #define set_pte                    __set_pte
>>>>    #define set_ptes                __set_ptes
>>>> @@ -1131,6 +1313,8 @@ extern void ptep_modify_prot_commit(struct
>>>> vm_area_struct *vma,
>>>>    #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>>>    #define ptep_set_access_flags            __ptep_set_access_flags
>>>>    +#endif /* CONFIG_ARM64_CONTPTE */
>>>> +
>>>>    #endif /* !__ASSEMBLY__ */
>>>>      #endif /* __ASM_PGTABLE_H */
>>>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>>>> index dbd1bc95967d..60454256945b 100644
>>>> --- a/arch/arm64/mm/Makefile
>>>> +++ b/arch/arm64/mm/Makefile
>>>> @@ -3,6 +3,7 @@ obj-y                := dma-mapping.o extable.o fault.o
>>>> init.o \
>>>>                       cache.o copypage.o flush.o \
>>>>                       ioremap.o mmap.o pgd.o mmu.o \
>>>>                       context.o proc.o pageattr.o fixmap.o
>>>> +obj-$(CONFIG_ARM64_CONTPTE)    += contpte.o
>>>>    obj-$(CONFIG_HUGETLB_PAGE)    += hugetlbpage.o
>>>>    obj-$(CONFIG_PTDUMP_CORE)    += ptdump.o
>>>>    obj-$(CONFIG_PTDUMP_DEBUGFS)    += ptdump_debugfs.o
>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>> new file mode 100644
>>>> index 000000000000..69c36749dd98
>>>> --- /dev/null
>>>> +++ b/arch/arm64/mm/contpte.c
>>>> @@ -0,0 +1,388 @@
>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>> +/*
>>>> + * Copyright (C) 2023 ARM Ltd.
>>>> + */
>>>> +
>>>> +#include <linux/mm.h>
>>>> +#include <linux/export.h>
>>>> +#include <asm/tlbflush.h>
>>>> +
>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>> +{
>>>> +    /*
>>>> +     * Don't attempt to apply the contig bit to kernel mappings, because
>>>> +     * dynamically adding/removing the contig bit can cause page faults.
>>>> +     * These racing faults are ok for user space, since they get serialized
>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>> +     */
>>>> +    return mm != &init_mm;
>>>> +}
>>>> +
>>>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>>>> +{
>>>> +    return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>>>> +}
>>>> +
>>>> +static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
>>>> +                pte_t *ptep, int nr)
>>>> +{
>>>> +    struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>>> +    unsigned long start_addr = addr;
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
>>>> +        __pte_clear(mm, addr, ptep);
>>>> +
>>>> +    __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>>> +}
>>>> +
>>>> +static bool ptep_any_valid(pte_t *ptep, int nr)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < nr; i++, ptep++) {
>>>> +        if (pte_valid(__ptep_get(ptep)))
>>>> +            return true;
>>>> +    }
>>>> +
>>>> +    return false;
>>>> +}
>>>> +
>>>> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>>>> +                pte_t *ptep, pte_t pte)
>>>> +{
>>>> +    struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>>> +    unsigned long start_addr;
>>>> +    pte_t *start_ptep;
>>>> +    int i;
>>>> +
>>>> +    start_ptep = ptep = contpte_align_down(ptep);
>>>> +    start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> +    pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>>>> +
>>>> +    for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>>>> +        pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>>> +
>>>> +        if (pte_dirty(ptent))
>>>> +            pte = pte_mkdirty(pte);
>>>> +
>>>> +        if (pte_young(ptent))
>>>> +            pte = pte_mkyoung(pte);
>>>> +    }
>>>> +
>>>> +    __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>>> +
>>>> +    __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>>>> +}
>>>> +
>>>> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>>>> +            pte_t *ptep, pte_t pte)
>>>> +{
>>>> +    /*
>>>> +     * We have already checked that the virtual and physical addresses are
>>>> +     * correctly aligned for a contpte mapping in contpte_try_fold() so the
>>>> +     * remaining checks are to ensure that the contpte range is fully
>>>> +     * covered by a single folio, and ensure that all the ptes are valid
>>>> +     * with contiguous PFNs and matching prots. We ignore the state of the
>>>> +     * access and dirty bits for the purpose of deciding if it's a contiguous
>>>> +     * range; the folding process will generate a single contpte entry which
>>>> +     * has a single access and dirty bit. Those 2 bits are the logical OR of
>>>> +     * their respective bits in the constituent pte entries. In order to
>>>> +     * ensure the contpte range is covered by a single folio, we must
>>>> +     * recover the folio from the pfn, but special mappings don't have a
>>>> +     * folio backing them. Fortunately contpte_try_fold() already checked
>>>> +     * that the pte is not special - we never try to fold special mappings.
>>>> +     * Note we can't use vm_normal_page() for this since we don't have the
>>>> +     * vma.
>>>> +     */
>>>> +
>>>> +    unsigned long folio_saddr;
>>>> +    unsigned long folio_eaddr;
>>>> +    unsigned long cont_saddr;
>>>> +    unsigned long cont_eaddr;
>>>> +    struct folio *folio;
>>>> +    struct page *page;
>>>> +    unsigned long pfn;
>>>> +    pte_t *orig_ptep;
>>>> +    pgprot_t prot;
>>>> +    pte_t subpte;
>>>> +    int i;
>>>> +
>>>> +    if (!mm_is_user(mm))
>>>> +        return;
>>>> +
>>>> +    page = pte_page(pte);
>>>> +    folio = page_folio(page);
>>>> +    folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
>>>> +    folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
>>>> +    cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> +    cont_eaddr = cont_saddr + CONT_PTE_SIZE;
>>>> +
>>>> +    if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
>>>> +        return;
>>>> +
>>>> +    pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>>> +    orig_ptep = ptep;
>>>> +    ptep = contpte_align_down(ptep);
>>>> +
>>>> +    for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>>> +        subpte = __ptep_get(ptep);
>>>> +        subpte = pte_mkold(pte_mkclean(subpte));
>>>> +
>>>> +        if (!pte_valid(subpte) ||
>>>> +            pte_pfn(subpte) != pfn ||
>>>> +            pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
>>>> +            return;
>>>> +    }
>>>> +
>>>> +    pte = pte_mkcont(pte);
>>>> +    contpte_convert(mm, addr, orig_ptep, pte);
>>>> +}
>>>> +EXPORT_SYMBOL(__contpte_try_fold);
>>>> +
>>>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>>> +            pte_t *ptep, pte_t pte)
>>>> +{
>>>> +    /*
>>>> +     * We have already checked that the ptes are contiguous in
>>>> +     * contpte_try_unfold(), so just check that the mm is user space.
>>>> +     */
>>>> +
>>>> +    if (!mm_is_user(mm))
>>>> +        return;
>>>> +
>>>> +    pte = pte_mknoncont(pte);
>>>> +    contpte_convert(mm, addr, ptep, pte);
>>>> +}
>>>> +EXPORT_SYMBOL(__contpte_try_unfold);
>>>> +
>>>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>> +{
>>>> +    /*
>>>> +     * Gather access/dirty bits, which may be populated in any of the ptes
>>>> +     * of the contig range. We are guaranteed to be holding the PTL, so any
>>>> +     * contiguous range cannot be unfolded or otherwise modified under our
>>>> +     * feet.
>>>> +     */
>>>> +
>>>> +    pte_t pte;
>>>> +    int i;
>>>> +
>>>> +    ptep = contpte_align_down(ptep);
>>>> +
>>>> +    for (i = 0; i < CONT_PTES; i++, ptep++) {
>>>> +        pte = __ptep_get(ptep);
>>>> +
>>>> +        if (pte_dirty(pte))
>>>> +            orig_pte = pte_mkdirty(orig_pte);
>>>> +
>>>> +        if (pte_young(pte))
>>>> +            orig_pte = pte_mkyoung(orig_pte);
>>>> +    }
>>>> +
>>>> +    return orig_pte;
>>>> +}
>>>> +EXPORT_SYMBOL(contpte_ptep_get);
>>>> +
>>>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>>>> +{
>>>> +    /*
>>>> +     * Gather access/dirty bits, which may be populated in any of the ptes
>>>> +     * of the contig range. We may not be holding the PTL, so any contiguous
>>>> +     * range may be unfolded/modified/refolded under our feet. Therefore we
>>>> +     * ensure we read a _consistent_ contpte range by checking that all ptes
>>>> +     * in the range are valid and have CONT_PTE set, that all pfns are
>>>> +     * contiguous and that all pgprots are the same (ignoring access/dirty).
>>>> +     * If we find a pte that is not consistent, then we must be racing with
>>>> +     * an update so start again. If the target pte does not have CONT_PTE
>>>> +     * set then that is considered consistent on its own because it is not
>>>> +     * part of a contpte range.
>>>> +     */
>>>> +
>>>> +    pgprot_t orig_prot;
>>>> +    unsigned long pfn;
>>>> +    pte_t orig_pte;
>>>> +    pgprot_t prot;
>>>> +    pte_t *ptep;
>>>> +    pte_t pte;
>>>> +    int i;
>>>> +
>>>> +retry:
>>>> +    orig_pte = __ptep_get(orig_ptep);
>>>> +
>>>> +    if (!pte_valid_cont(orig_pte))
>>>> +        return orig_pte;
>>>> +
>>>> +    orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>>>> +    ptep = contpte_align_down(orig_ptep);
>>>> +    pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>>>> +
>>>> +    for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>>> +        pte = __ptep_get(ptep);
>>>> +        prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>>> +
>>>> +        if (!pte_valid_cont(pte) ||
>>>> +           pte_pfn(pte) != pfn ||
>>>> +           pgprot_val(prot) != pgprot_val(orig_prot))
>>>> +            goto retry;
>>>> +
>>>> +        if (pte_dirty(pte))
>>>> +            orig_pte = pte_mkdirty(orig_pte);
>>>> +
>>>> +        if (pte_young(pte))
>>>> +            orig_pte = pte_mkyoung(orig_pte);
>>>> +    }
>>>> +
>>>> +    return orig_pte;
>>>> +}
>>>> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
>>>> +
>>>> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>>> +                    pte_t *ptep, pte_t pte, unsigned int nr)
>>>> +{
>>>> +    unsigned long next;
>>>> +    unsigned long end;
>>>> +    unsigned long pfn;
>>>> +    pgprot_t prot;
>>>> +    pte_t orig_pte;
>>>> +
>>>> +    if (!mm_is_user(mm))
>>>> +        return __set_ptes(mm, addr, ptep, pte, nr);
>>>> +
>>>> +    end = addr + (nr << PAGE_SHIFT);
>>>> +    pfn = pte_pfn(pte);
>>>> +    prot = pte_pgprot(pte);
>>>> +
>>>> +    do {
>>>> +        next = pte_cont_addr_end(addr, end);
>>>> +        nr = (next - addr) >> PAGE_SHIFT;
>>>> +        pte = pfn_pte(pfn, prot);
>>>> +
>>>> +        if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>>>> +            pte = pte_mkcont(pte);
>>>> +        else
>>>> +            pte = pte_mknoncont(pte);
>>>> +
>>>> +        /*
>>>> +         * If operating on a partial contiguous range then we must first
>>>> +         * unfold the contiguous range if it was previously folded.
>>>> +         * Otherwise we could end up with overlapping tlb entries.
>>>> +         */
>>>> +        if (nr != CONT_PTES)
>>>> +            contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>>> +
>>>> +        /*
>>>> +         * If we are replacing ptes that were contiguous or if the new
>>>> +         * ptes are contiguous and any of the ptes being replaced are
>>>> +         * valid, we need to clear and flush the range to prevent
>>>> +         * overlapping tlb entries.
>>>> +         */
>>>> +        orig_pte = __ptep_get(ptep);
>>>> +        if (pte_valid_cont(orig_pte) ||
>>>> +            (pte_cont(pte) && ptep_any_valid(ptep, nr)))
>>>> +            ptep_clear_flush_range(mm, addr, ptep, nr);
>>>> +
>>>> +        __set_ptes(mm, addr, ptep, pte, nr);
>>>> +
>>>> +        addr = next;
>>>> +        ptep += nr;
>>>> +        pfn += nr;
>>>> +
>>>> +    } while (addr != end);
>>>> +}
>>>> +EXPORT_SYMBOL(contpte_set_ptes);
>>>> +
>>>> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> +                    unsigned long addr, pte_t *ptep)
>>>> +{
>>>> +    /*
>>>> +     * ptep_clear_flush_young() technically requires us to clear the access
>>>> +     * flag for a _single_ pte. However, the core-mm code actually tracks
>>>> +     * access/dirty per folio, not per page. And since we only create a
>>>> +     * contig range when the range is covered by a single folio, we can get
>>>> +     * away with clearing young for the whole contig range here, so we avoid
>>>> +     * having to unfold.
>>>> +     */
>>>> +
>>>> +    int young = 0;
>>>> +    int i;
>>>> +
>>>> +    ptep = contpte_align_down(ptep);
>>>> +    addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> +
>>>> +    for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>>> +        young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>>> +
>>>> +    return young;
>>>> +}
>>>> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
>>>> +
>>>> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>> +                    unsigned long addr, pte_t *ptep)
>>>> +{
>>>> +    int young;
>>>> +
>>>> +    young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>>> +
>>>> +    if (young) {
>>>> +        /*
>>>> +         * See comment in __ptep_clear_flush_young(); same rationale for
>>>> +         * eliding the trailing DSB applies here.
>>>> +         */
>>>> +        addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> +        __flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
>>>> +                     PAGE_SIZE, true, 3);
>>>> +    }
>>>> +
>>>> +    return young;
>>>> +}
>>>> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
>>>> +
>>>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>>> +                    unsigned long addr, pte_t *ptep,
>>>> +                    pte_t entry, int dirty)
>>>> +{
>>>> +    unsigned long start_addr;
>>>> +    pte_t orig_pte;
>>>> +    int i;
>>>> +
>>>> +    /*
>>>> +     * Gather the access/dirty bits for the contiguous range. If nothing has
>>>> +     * changed, it's a noop.
>>>> +     */
>>>> +    orig_pte = pte_mknoncont(ptep_get(ptep));
>>>> +    if (pte_val(orig_pte) == pte_val(entry))
>>>> +        return 0;
>>>> +
>>>> +    /*
>>>> +     * We can fix up access/dirty bits without having to unfold/fold the
>>>> +     * contig range. But if the write bit is changing, we need to go through
>>>> +     * the full unfold/fold cycle.
>>>> +     */
>>>> +    if (pte_write(orig_pte) == pte_write(entry)) {
>>>> +        /*
>>>> +         * For HW access management, we technically only need to update
>>>> +         * the flag on a single pte in the range. But for SW access
>>>> +         * management, we need to update all the ptes to prevent extra
>>>> +         * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
>>>> +         * and instead flush the whole range at the end.
>>>> +         */
>>>> +        ptep = contpte_align_down(ptep);
>>>> +        start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> +
>>>> +        for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>>> +            __ptep_set_access_flags(vma, addr, ptep, entry, 0);
>>>
>>> entry was pte_mknoncont() in ptep_set_access_flags() so here you lose the
>>> contpte range; is that intentional? Or am I mistaken?
>> entry doesn't have PTE_CONT bit set, that's correct. I intentionally strip that
>> bit at the interface boundary, because it is the implementation's job to decide
>> whether it's a contpte block, not the caller's. But there are situations where
>> the caller can end up with a pte that has PTE_CONT set (by having done a
>> previous ptep_get() for example) and then it forwards the pte to a setter. So
>> stripping it is required; it would probably be cleaner to strip it before
>> returning it from ptep_get(), but that would be problematic for pte_leaf_size()
>> which is called from perf_get_pgtable_size().
>>
>> In this particular case, __ptep_set_access_flags() only modifies the PTE's
>> access flags, so CONT_PTE will remain as it is in the page table. The fact that
>> entry has it cleared is not a problem.
>
>
> I see, I had not checked the arm64 implementation of ptep_set_access_flags().
> For context, I'm merging the arm64 contpte support with the riscv napot support,
> the implementation being quite similar (although riscv is a bit different as it
> uses bits from the pfn to advertise the number of contiguous ptes).
>
> Anyway, our implementation of ptep_set_access_flags() actually sets the ptep
> with entry, so we would lose the cont bit. I would simply do the
> following (I will in my patchset, no need for you to worry about this):
>
> __ptep_set_access_flags(vma, addr, ptep, pte_mkcont(entry), 0);
>
> Let me know if you think this is not right,

I'm not familiar with riscv HW specs or the Linux implementation, so it's hard
for me to say for sure. If your __ptep_set_access_flags() implementation is
writing all of entry to the PTE, then it looks like it's probably wrong to me -
you will be writing the same PFN to every PTE in the contpte block. Certainly on
Arm that would be wrong.
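
To make that concrete, here is a minimal sketch of what "each PTE carries its
own PFN" means for an arm64 contpte block; the helper below is made up purely
for illustration (the series itself goes through __set_ptes(), which advances
the pfn for every entry):

static void paint_contpte_block_example(pte_t *ptep, unsigned long pfn,
					pgprot_t prot)
{
	int i;

	/* Every pte in the block gets its own pfn; only the flags are shared. */
	for (i = 0; i < CONT_PTES; i++, ptep++)
		__set_pte(ptep, pte_mkcont(pfn_pte(pfn + i, prot)));
}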

On Arm, __ptep_set_access_flags() is just folding the access flags in entry into
the PTE without changing anything else. That's needed in order to deal with
racing HW updates of the access/dirty bits.
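
Roughly, the shape of it is something like the sketch below (not the actual
arm64 code, and the flag mask is only illustrative): read the live value, merge
in just the flag bits from entry, and retry with cmpxchg so a concurrent HW
update of the access/dirty bits is never overwritten:

static void fold_access_flags_example(pte_t *ptep, pte_t entry)
{
	pteval_t old, newval;

	do {
		old = READ_ONCE(pte_val(*ptep));
		/* Keep the pfn, PTE_CONT, etc; only merge in the new flag bits. */
		newval = old | (pte_val(entry) & (PTE_AF | PTE_DIRTY | PTE_WRITE));
	} while (cmpxchg_relaxed(&pte_val(*ptep), old, newval) != old);
}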

Thanks,
Ryan


>
> Thanks,
>
> Alex
>
>
>>
>> Thanks,
>> Ryan
>>
>>
>>>
>>>> +
>>>> +        if (dirty)
>>>> +            __flush_tlb_range(vma, start_addr, addr,
>>>> +                            PAGE_SIZE, true, 3);
>>>> +    } else {
>>>> +        __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>>>> +        __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>>> +        contpte_try_fold(vma->vm_mm, addr, ptep, entry);
>>>> +    }
>>>> +
>>>> +    return 1;
>>>> +}
>>>> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);


2024-01-16 22:17:25

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v4 14/16] arm64/mm: Wire up PTE_CONT for user mappings

On 16/01/2024 15:44, Ryan Roberts wrote:
> On 15/01/2024 21:23, Alexandre Ghiti wrote:
>> On 15/01/2024 17:27, Ryan Roberts wrote:
>>> On 15/01/2024 15:14, Alexandre Ghiti wrote:
>>>> Hi Ryan,
>>>>
>>>> On 18/12/2023 11:50, Ryan Roberts wrote:
>>>>> With the ptep API sufficiently refactored, we can now introduce a new
>>>>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>>>>> user mappings. Whenever it detects a set of PTEs that meet the
>>>>> requirements for a contiguous range, the PTEs are re-painted with the
>>>>> PTE_CONT bit. Use of contpte mappings is intended to be transparent to
>>>>> the core-mm, which continues to interact with individual ptes.
>>>>>
>>>>> Since a contpte block only has a single access and dirty bit, the
>>>>> semantics here change slightly; when getting a pte (e.g. ptep_get())
>>>>> that is part of a contpte mapping, the access and dirty information are
>>>>> pulled from the block (so all ptes in the block return the same
>>>>> access/dirty info). When changing the access/dirty info on a pte (e.g.
>>>>> ptep_set_access_flags()) that is part of a contpte mapping, this change
>>>>> will affect the whole contpte block. This works fine in practice
>>>>> since we guarantee that only a single folio is mapped by a contpte
>>>>> block, and the core-mm tracks access/dirty information per folio.
>>>>>
>>>>> This initial change provides a baseline that can be optimized in future
>>>>> commits. That said, fold/unfold operations (which imply tlb
>>>>> invalidation) are avoided where possible with a few tricks for
>>>>> access/dirty bit management. Write-protect modifications for contpte
>>>>> mappings are currently non-optimal, and incur a regression in fork()
>>>>> performance. This will be addressed in follow-up changes.
>>>>>
>>>>> In order for the public functions, which used to be pure inline, to
>>>>> continue to be callable by modules, export all the contpte_* symbols
>>>>> that are now called by those public inline functions.
>>>>>
>>>>> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
>>>>> at build time. It defaults to enabled as long as its dependency,
>>>>> TRANSPARENT_HUGEPAGE, is also enabled. The core-mm depends upon
>>>>> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
>>>>> enabled, then there is no chance of meeting the physical contiguity
>>>>> requirement for contpte mappings.
>>>>>
>>>>> Tested-by: John Hubbard <[email protected]>
>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>> ---
>>>>>    arch/arm64/Kconfig               |  10 +-
>>>>>    arch/arm64/include/asm/pgtable.h | 184 +++++++++++++++
>>>>>    arch/arm64/mm/Makefile           |   1 +
>>>>>    arch/arm64/mm/contpte.c          | 388 +++++++++++++++++++++++++++++++
>>>>>    4 files changed, 582 insertions(+), 1 deletion(-)
>>>>>    create mode 100644 arch/arm64/mm/contpte.c
>>>>>
>>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>>>> index 7b071a00425d..de76e484ff3a 100644
>>>>> --- a/arch/arm64/Kconfig
>>>>> +++ b/arch/arm64/Kconfig
>>>>> @@ -2209,6 +2209,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>>>>>        select UNWIND_TABLES
>>>>>        select DYNAMIC_SCS
>>>>>    +config ARM64_CONTPTE
>>>>> +    bool "Contiguous PTE mappings for user memory" if EXPERT
>>>>> +    depends on TRANSPARENT_HUGEPAGE
>>>>> +    default y
>>>>> +    help
>>>>> +      When enabled, user mappings are configured using the PTE contiguous
>>>>> +      bit, for any mappings that meet the size and alignment requirements.
>>>>> +      This reduces TLB pressure and improves performance.
>>>>> +
>>>>>    endmenu # "Kernel Features"
>>>>>      menu "Boot options"
>>>>> @@ -2318,4 +2327,3 @@ endmenu # "CPU Power Management"
>>>>>    source "drivers/acpi/Kconfig"
>>>>>      source "arch/arm64/kvm/Kconfig"
>>>>> -
>>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>>> index 6930c14f062f..e64120452301 100644
>>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>>> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>>>>>     */
>>>>>    #define pte_valid_not_user(pte) \
>>>>>        ((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
>>>>> +/*
>>>>> + * Returns true if the pte is valid and has the contiguous bit set.
>>>>> + */
>>>>> +#define pte_valid_cont(pte)    (pte_valid(pte) && pte_cont(pte))
>>>>>    /*
>>>>>     * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>>>>>     * so that we don't erroneously return false for pages that have been
>>>>> @@ -1116,6 +1120,184 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>>>>                        unsigned long addr, pte_t *ptep,
>>>>>                        pte_t old_pte, pte_t new_pte);
>>>>>    +#ifdef CONFIG_ARM64_CONTPTE
>>>>> +
>>>>> +/*
>>>>> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
>>>>> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
>>>>> + * a private implementation detail of the public ptep API (see below).
>>>>> + */
>>>>> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>>>>> +                pte_t *ptep, pte_t pte);
>>>>> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>>>> +                pte_t *ptep, pte_t pte);
>>>>> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>>>>> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>>>>> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>>>> +                pte_t *ptep, pte_t pte, unsigned int nr);
>>>>> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>> +                unsigned long addr, pte_t *ptep);
>>>>> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>> +                unsigned long addr, pte_t *ptep);
>>>>> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>>>> +                unsigned long addr, pte_t *ptep,
>>>>> +                pte_t entry, int dirty);
>>>>> +
>>>>> +static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>>>>> +                    pte_t *ptep, pte_t pte)
>>>>> +{
>>>>> +    /*
>>>>> +     * Only bother trying if both the virtual and physical addresses are
>>>>> +     * aligned and correspond to the last entry in a contig range. The core
>>>>> +     * code mostly modifies ranges from low to high, so this is likely
>>>>> +     * the last modification in the contig range, so a good time to fold.
>>>>> +     * We can't fold special mappings, because there is no associated folio.
>>>>> +     */
>>>>> +
>>>>> +    const unsigned long contmask = CONT_PTES - 1;
>>>>> +    bool valign = (((unsigned long)ptep >> 3) & contmask) == contmask;
>>>>> +    bool palign = (pte_pfn(pte) & contmask) == contmask;
>>>>> +
>>>>> +    if (valign && palign &&
>>>>> +        pte_valid(pte) && !pte_cont(pte) && !pte_special(pte))
>>>>> +        __contpte_try_fold(mm, addr, ptep, pte);
>>>>> +}
>>>>> +
>>>>> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>>>> +                    pte_t *ptep, pte_t pte)
>>>>> +{
>>>>> +    if (pte_valid_cont(pte))
>>>>> +        __contpte_try_unfold(mm, addr, ptep, pte);
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * The below functions constitute the public API that arm64 presents to the
>>>>> + * core-mm to manipulate PTE entries within their page tables (or at least this
>>>>> + * is the subset of the API that arm64 needs to implement). These public
>>>>> + * versions will automatically and transparently apply the contiguous bit where
>>>>> + * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>>>>> + * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
>>>>> + * private versions, which are prefixed with double underscore. All of these
>>>>> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
>>>>> + * held.
>>>>> + */
>>>>> +
>>>>> +#define ptep_get ptep_get
>>>>> +static inline pte_t ptep_get(pte_t *ptep)
>>>>> +{
>>>>> +    pte_t pte = __ptep_get(ptep);
>>>>> +
>>>>> +    if (!pte_valid_cont(pte))
>>>>> +        return pte;
>>>>> +
>>>>> +    return contpte_ptep_get(ptep, pte);
>>>>> +}
>>>>> +
>>>>> +#define ptep_get_lockless ptep_get_lockless
>>>>> +static inline pte_t ptep_get_lockless(pte_t *ptep)
>>>>> +{
>>>>> +    pte_t pte = __ptep_get(ptep);
>>>>> +
>>>>> +    if (!pte_valid_cont(pte))
>>>>> +        return pte;
>>>>> +
>>>>> +    return contpte_ptep_get_lockless(ptep);
>>>>> +}
>>>>> +
>>>>> +static inline void set_pte(pte_t *ptep, pte_t pte)
>>>>> +{
>>>>> +    /*
>>>>> +     * We don't have the mm or vaddr so cannot unfold or fold contig entries
>>>>> +     * (since it requires tlb maintenance). set_pte() is not used in core
>>>>> +     * code, so this should never even be called. Regardless do our best to
>>>>> +     * service any call and emit a warning if there is any attempt to set a
>>>>> +     * pte on top of an existing contig range.
>>>>> +     */
>>>>> +    pte_t orig_pte = __ptep_get(ptep);
>>>>> +
>>>>> +    WARN_ON_ONCE(pte_valid_cont(orig_pte));
>>>>> +    __set_pte(ptep, pte_mknoncont(pte));
>>>>> +}
>>>>> +
>>>>> +#define set_ptes set_ptes
>>>>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>>>> +                pte_t *ptep, pte_t pte, unsigned int nr)
>>>>> +{
>>>>> +    pte = pte_mknoncont(pte);
>>>>> +
>>>>> +    if (nr == 1) {
>>>>> +        contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>>>> +        __set_ptes(mm, addr, ptep, pte, 1);
>>>>> +        contpte_try_fold(mm, addr, ptep, pte);
>>>>> +    } else
>>>>> +        contpte_set_ptes(mm, addr, ptep, pte, nr);
>>>>> +}
>>>>> +
>>>>> +static inline void pte_clear(struct mm_struct *mm,
>>>>> +                unsigned long addr, pte_t *ptep)
>>>>> +{
>>>>> +    contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>>>> +    __pte_clear(mm, addr, ptep);
>>>>> +}
>>>>> +
>>>>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>>> +static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>>> +                unsigned long addr, pte_t *ptep)
>>>>> +{
>>>>> +    contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>>>> +    return __ptep_get_and_clear(mm, addr, ptep);
>>>>> +}
>>>>> +
>>>>> +#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>>>> +static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>> +                unsigned long addr, pte_t *ptep)
>>>>> +{
>>>>> +    pte_t orig_pte = __ptep_get(ptep);
>>>>> +
>>>>> +    if (!pte_valid_cont(orig_pte))
>>>>> +        return __ptep_test_and_clear_young(vma, addr, ptep);
>>>>> +
>>>>> +    return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>>>> +}
>>>>> +
>>>>> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>>>> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>> +                unsigned long addr, pte_t *ptep)
>>>>> +{
>>>>> +    pte_t orig_pte = __ptep_get(ptep);
>>>>> +
>>>>> +    if (!pte_valid_cont(orig_pte))
>>>>> +        return __ptep_clear_flush_young(vma, addr, ptep);
>>>>> +
>>>>> +    return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>>>> +}
>>>>> +
>>>>> +#define __HAVE_ARCH_PTEP_SET_WRPROTECT
>>>>> +static inline void ptep_set_wrprotect(struct mm_struct *mm,
>>>>> +                unsigned long addr, pte_t *ptep)
>>>>> +{
>>>>> +    contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>>>> +    __ptep_set_wrprotect(mm, addr, ptep);
>>>>> +    contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
>>>>> +}
>>>>> +
>>>>> +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>>>> +static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>>>>> +                unsigned long addr, pte_t *ptep,
>>>>> +                pte_t entry, int dirty)
>>>>> +{
>>>>> +    pte_t orig_pte = __ptep_get(ptep);
>>>>> +
>>>>> +    entry = pte_mknoncont(entry);
>>>>> +
>>>>> +    if (!pte_valid_cont(orig_pte))
>>>>> +        return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>>>> +
>>>>> +    return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>>>> +}
>>>>> +
>>>>> +#else /* CONFIG_ARM64_CONTPTE */
>>>>> +
>>>>>    #define ptep_get                __ptep_get
>>>>>    #define set_pte                    __set_pte
>>>>>    #define set_ptes                __set_ptes
>>>>> @@ -1131,6 +1313,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>>>>    #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>>>>    #define ptep_set_access_flags            __ptep_set_access_flags
>>>>>    +#endif /* CONFIG_ARM64_CONTPTE */
>>>>> +
>>>>>    #endif /* !__ASSEMBLY__ */
>>>>>      #endif /* __ASM_PGTABLE_H */
>>>>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>>>>> index dbd1bc95967d..60454256945b 100644
>>>>> --- a/arch/arm64/mm/Makefile
>>>>> +++ b/arch/arm64/mm/Makefile
>>>>> @@ -3,6 +3,7 @@ obj-y                := dma-mapping.o extable.o fault.o init.o \
>>>>>                       cache.o copypage.o flush.o \
>>>>>                       ioremap.o mmap.o pgd.o mmu.o \
>>>>>                       context.o proc.o pageattr.o fixmap.o
>>>>> +obj-$(CONFIG_ARM64_CONTPTE)    += contpte.o
>>>>>    obj-$(CONFIG_HUGETLB_PAGE)    += hugetlbpage.o
>>>>>    obj-$(CONFIG_PTDUMP_CORE)    += ptdump.o
>>>>>    obj-$(CONFIG_PTDUMP_DEBUGFS)    += ptdump_debugfs.o
>>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>>> new file mode 100644
>>>>> index 000000000000..69c36749dd98
>>>>> --- /dev/null
>>>>> +++ b/arch/arm64/mm/contpte.c
>>>>> @@ -0,0 +1,388 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>>> +/*
>>>>> + * Copyright (C) 2023 ARM Ltd.
>>>>> + */
>>>>> +
>>>>> +#include <linux/mm.h>
>>>>> +#include <linux/export.h>
>>>>> +#include <asm/tlbflush.h>
>>>>> +
>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>> +{
>>>>> +    /*
>>>>> +     * Don't attempt to apply the contig bit to kernel mappings, because
>>>>> +     * dynamically adding/removing the contig bit can cause page faults.
>>>>> +     * These racing faults are ok for user space, since they get serialized
>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>> +     */
>>>>> +    return mm != &init_mm;
>>>>> +}
>>>>> +
>>>>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>>>>> +{
>>>>> +    return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>>>>> +}
>>>>> +
>>>>> +static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
>>>>> +                pte_t *ptep, int nr)
>>>>> +{
>>>>> +    struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>>>> +    unsigned long start_addr = addr;
>>>>> +    int i;
>>>>> +
>>>>> +    for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
>>>>> +        __pte_clear(mm, addr, ptep);
>>>>> +
>>>>> +    __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>>>> +}
>>>>> +
>>>>> +static bool ptep_any_valid(pte_t *ptep, int nr)
>>>>> +{
>>>>> +    int i;
>>>>> +
>>>>> +    for (i = 0; i < nr; i++, ptep++) {
>>>>> +        if (pte_valid(__ptep_get(ptep)))
>>>>> +            return true;
>>>>> +    }
>>>>> +
>>>>> +    return false;
>>>>> +}
>>>>> +
>>>>> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>>>>> +                pte_t *ptep, pte_t pte)
>>>>> +{
>>>>> +    struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>>>> +    unsigned long start_addr;
>>>>> +    pte_t *start_ptep;
>>>>> +    int i;
>>>>> +
>>>>> +    start_ptep = ptep = contpte_align_down(ptep);
>>>>> +    start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>>> +    pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>>>>> +
>>>>> +    for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>>>>> +        pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>>>> +
>>>>> +        if (pte_dirty(ptent))
>>>>> +            pte = pte_mkdirty(pte);
>>>>> +
>>>>> +        if (pte_young(ptent))
>>>>> +            pte = pte_mkyoung(pte);
>>>>> +    }
>>>>> +
>>>>> +    __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>>>> +
>>>>> +    __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>>>>> +}
>>>>> +
>>>>> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>>>>> +            pte_t *ptep, pte_t pte)
>>>>> +{
>>>>> +    /*
>>>>> +     * We have already checked that the virtual and physical addresses are
>>>>> +     * correctly aligned for a contpte mapping in contpte_try_fold() so the
>>>>> +     * remaining checks are to ensure that the contpte range is fully
>>>>> +     * covered by a single folio, and ensure that all the ptes are valid
>>>>> +     * with contiguous PFNs and matching prots. We ignore the state of the
>>>>> +     * access and dirty bits for the purpose of deciding if it's a contiguous
>>>>> +     * range; the folding process will generate a single contpte entry which
>>>>> +     * has a single access and dirty bit. Those 2 bits are the logical OR of
>>>>> +     * their respective bits in the constituent pte entries. In order to
>>>>> +     * ensure the contpte range is covered by a single folio, we must
>>>>> +     * recover the folio from the pfn, but special mappings don't have a
>>>>> +     * folio backing them. Fortunately contpte_try_fold() already checked
>>>>> +     * that the pte is not special - we never try to fold special mappings.
>>>>> +     * Note we can't use vm_normal_page() for this since we don't have the
>>>>> +     * vma.
>>>>> +     */
>>>>> +
>>>>> +    unsigned long folio_saddr;
>>>>> +    unsigned long folio_eaddr;
>>>>> +    unsigned long cont_saddr;
>>>>> +    unsigned long cont_eaddr;
>>>>> +    struct folio *folio;
>>>>> +    struct page *page;
>>>>> +    unsigned long pfn;
>>>>> +    pte_t *orig_ptep;
>>>>> +    pgprot_t prot;
>>>>> +    pte_t subpte;
>>>>> +    int i;
>>>>> +
>>>>> +    if (!mm_is_user(mm))
>>>>> +        return;
>>>>> +
>>>>> +    page = pte_page(pte);
>>>>> +    folio = page_folio(page);
>>>>> +    folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
>>>>> +    folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
>>>>> +    cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>>> +    cont_eaddr = cont_saddr + CONT_PTE_SIZE;
>>>>> +
>>>>> +    if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
>>>>> +        return;
>>>>> +
>>>>> +    pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
>>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>>>> +    orig_ptep = ptep;
>>>>> +    ptep = contpte_align_down(ptep);
>>>>> +
>>>>> +    for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>>>> +        subpte = __ptep_get(ptep);
>>>>> +        subpte = pte_mkold(pte_mkclean(subpte));
>>>>> +
>>>>> +        if (!pte_valid(subpte) ||
>>>>> +            pte_pfn(subpte) != pfn ||
>>>>> +            pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
>>>>> +            return;
>>>>> +    }
>>>>> +
>>>>> +    pte = pte_mkcont(pte);
>>>>> +    contpte_convert(mm, addr, orig_ptep, pte);
>>>>> +}
>>>>> +EXPORT_SYMBOL(__contpte_try_fold);
>>>>> +
>>>>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>>>> +            pte_t *ptep, pte_t pte)
>>>>> +{
>>>>> +    /*
>>>>> +     * We have already checked that the ptes are contiguous in
>>>>> +     * contpte_try_unfold(), so just check that the mm is user space.
>>>>> +     */
>>>>> +
>>>>> +    if (!mm_is_user(mm))
>>>>> +        return;
>>>>> +
>>>>> +    pte = pte_mknoncont(pte);
>>>>> +    contpte_convert(mm, addr, ptep, pte);
>>>>> +}
>>>>> +EXPORT_SYMBOL(__contpte_try_unfold);
>>>>> +
>>>>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>>> +{
>>>>> +    /*
>>>>> +     * Gather access/dirty bits, which may be populated in any of the ptes
>>>>> +     * of the contig range. We are guaranteed to be holding the PTL, so any
>>>>> +     * contiguous range cannot be unfolded or otherwise modified under our
>>>>> +     * feet.
>>>>> +     */
>>>>> +
>>>>> +    pte_t pte;
>>>>> +    int i;
>>>>> +
>>>>> +    ptep = contpte_align_down(ptep);
>>>>> +
>>>>> +    for (i = 0; i < CONT_PTES; i++, ptep++) {
>>>>> +        pte = __ptep_get(ptep);
>>>>> +
>>>>> +        if (pte_dirty(pte))
>>>>> +            orig_pte = pte_mkdirty(orig_pte);
>>>>> +
>>>>> +        if (pte_young(pte))
>>>>> +            orig_pte = pte_mkyoung(orig_pte);
>>>>> +    }
>>>>> +
>>>>> +    return orig_pte;
>>>>> +}
>>>>> +EXPORT_SYMBOL(contpte_ptep_get);
>>>>> +
>>>>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>>>>> +{
>>>>> +    /*
>>>>> +     * Gather access/dirty bits, which may be populated in any of the ptes
>>>>> +     * of the contig range. We may not be holding the PTL, so any contiguous
>>>>> +     * range may be unfolded/modified/refolded under our feet. Therefore we
>>>>> +     * ensure we read a _consistent_ contpte range by checking that all ptes
>>>>> +     * in the range are valid and have CONT_PTE set, that all pfns are
>>>>> +     * contiguous and that all pgprots are the same (ignoring access/dirty).
>>>>> +     * If we find a pte that is not consistent, then we must be racing with
>>>>> +     * an update so start again. If the target pte does not have CONT_PTE
>>>>> +     * set then that is considered consistent on its own because it is not
>>>>> +     * part of a contpte range.
>>>>> +     */
>>>>> +
>>>>> +    pgprot_t orig_prot;
>>>>> +    unsigned long pfn;
>>>>> +    pte_t orig_pte;
>>>>> +    pgprot_t prot;
>>>>> +    pte_t *ptep;
>>>>> +    pte_t pte;
>>>>> +    int i;
>>>>> +
>>>>> +retry:
>>>>> +    orig_pte = __ptep_get(orig_ptep);
>>>>> +
>>>>> +    if (!pte_valid_cont(orig_pte))
>>>>> +        return orig_pte;
>>>>> +
>>>>> +    orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>>>>> +    ptep = contpte_align_down(orig_ptep);
>>>>> +    pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>>>>> +
>>>>> +    for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>>>> +        pte = __ptep_get(ptep);
>>>>> +        prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>>>> +
>>>>> +        if (!pte_valid_cont(pte) ||
>>>>> +           pte_pfn(pte) != pfn ||
>>>>> +           pgprot_val(prot) != pgprot_val(orig_prot))
>>>>> +            goto retry;
>>>>> +
>>>>> +        if (pte_dirty(pte))
>>>>> +            orig_pte = pte_mkdirty(orig_pte);
>>>>> +
>>>>> +        if (pte_young(pte))
>>>>> +            orig_pte = pte_mkyoung(orig_pte);
>>>>> +    }
>>>>> +
>>>>> +    return orig_pte;
>>>>> +}
>>>>> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
>>>>> +
>>>>> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>>>> +                    pte_t *ptep, pte_t pte, unsigned int nr)
>>>>> +{
>>>>> +    unsigned long next;
>>>>> +    unsigned long end;
>>>>> +    unsigned long pfn;
>>>>> +    pgprot_t prot;
>>>>> +    pte_t orig_pte;
>>>>> +
>>>>> +    if (!mm_is_user(mm))
>>>>> +        return __set_ptes(mm, addr, ptep, pte, nr);
>>>>> +
>>>>> +    end = addr + (nr << PAGE_SHIFT);
>>>>> +    pfn = pte_pfn(pte);
>>>>> +    prot = pte_pgprot(pte);
>>>>> +
>>>>> +    do {
>>>>> +        next = pte_cont_addr_end(addr, end);
>>>>> +        nr = (next - addr) >> PAGE_SHIFT;
>>>>> +        pte = pfn_pte(pfn, prot);
>>>>> +
>>>>> +        if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>>>>> +            pte = pte_mkcont(pte);
>>>>> +        else
>>>>> +            pte = pte_mknoncont(pte);
>>>>> +
>>>>> +        /*
>>>>> +         * If operating on a partial contiguous range then we must first
>>>>> +         * unfold the contiguous range if it was previously folded.
>>>>> +         * Otherwise we could end up with overlapping tlb entries.
>>>>> +         */
>>>>> +        if (nr != CONT_PTES)
>>>>> +            contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>>>> +
>>>>> +        /*
>>>>> +         * If we are replacing ptes that were contiguous or if the new
>>>>> +         * ptes are contiguous and any of the ptes being replaced are
>>>>> +         * valid, we need to clear and flush the range to prevent
>>>>> +         * overlapping tlb entries.
>>>>> +         */
>>>>> +        orig_pte = __ptep_get(ptep);
>>>>> +        if (pte_valid_cont(orig_pte) ||
>>>>> +            (pte_cont(pte) && ptep_any_valid(ptep, nr)))
>>>>> +            ptep_clear_flush_range(mm, addr, ptep, nr);
>>>>> +
>>>>> +        __set_ptes(mm, addr, ptep, pte, nr);
>>>>> +
>>>>> +        addr = next;
>>>>> +        ptep += nr;
>>>>> +        pfn += nr;
>>>>> +
>>>>> +    } while (addr != end);
>>>>> +}
>>>>> +EXPORT_SYMBOL(contpte_set_ptes);
>>>>> +
>>>>> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>> +                    unsigned long addr, pte_t *ptep)
>>>>> +{
>>>>> +    /*
>>>>> +     * ptep_clear_flush_young() technically requires us to clear the access
>>>>> +     * flag for a _single_ pte. However, the core-mm code actually tracks
>>>>> +     * access/dirty per folio, not per page. And since we only create a
>>>>> +     * contig range when the range is covered by a single folio, we can get
>>>>> +     * away with clearing young for the whole contig range here, so we avoid
>>>>> +     * having to unfold.
>>>>> +     */
>>>>> +
>>>>> +    int young = 0;
>>>>> +    int i;
>>>>> +
>>>>> +    ptep = contpte_align_down(ptep);
>>>>> +    addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>>> +
>>>>> +    for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>>>> +        young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>>>> +
>>>>> +    return young;
>>>>> +}
>>>>> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
>>>>> +
>>>>> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>> +                    unsigned long addr, pte_t *ptep)
>>>>> +{
>>>>> +    int young;
>>>>> +
>>>>> +    young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>>>> +
>>>>> +    if (young) {
>>>>> +        /*
>>>>> +         * See comment in __ptep_clear_flush_young(); same rationale for
>>>>> +         * eliding the trailing DSB applies here.
>>>>> +         */
>>>>> +        addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>>> +        __flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
>>>>> +                     PAGE_SIZE, true, 3);
>>>>> +    }
>>>>> +
>>>>> +    return young;
>>>>> +}
>>>>> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
>>>>> +
>>>>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>>>> +                    unsigned long addr, pte_t *ptep,
>>>>> +                    pte_t entry, int dirty)
>>>>> +{
>>>>> +    unsigned long start_addr;
>>>>> +    pte_t orig_pte;
>>>>> +    int i;
>>>>> +
>>>>> +    /*
>>>>> +     * Gather the access/dirty bits for the contiguous range. If nothing has
>>>>> +     * changed, it's a noop.
>>>>> +     */
>>>>> +    orig_pte = pte_mknoncont(ptep_get(ptep));
>>>>> +    if (pte_val(orig_pte) == pte_val(entry))
>>>>> +        return 0;
>>>>> +
>>>>> +    /*
>>>>> +     * We can fix up access/dirty bits without having to unfold/fold the
>>>>> +     * contig range. But if the write bit is changing, we need to go through
>>>>> +     * the full unfold/fold cycle.
>>>>> +     */
>>>>> +    if (pte_write(orig_pte) == pte_write(entry)) {
>>>>> +        /*
>>>>> +         * For HW access management, we technically only need to update
>>>>> +         * the flag on a single pte in the range. But for SW access
>>>>> +         * management, we need to update all the ptes to prevent extra
>>>>> +         * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
>>>>> +         * and instead flush the whole range at the end.
>>>>> +         */
>>>>> +        ptep = contpte_align_down(ptep);
>>>>> +        start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>>> +
>>>>> +        for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>>>> +            __ptep_set_access_flags(vma, addr, ptep, entry, 0);
>>>> entry was pte_mknoncont() in ptep_set_access_flags() so here you lose the
>>>> contpte range; is that intentional? Or am I mistaken?
>>> entry doesn't have PTE_CONT bit set, that's correct. I intentionally strip that
>>> bit at the interface boundary, because it is the implementation's job to decide
>>> whether it's a contpte block, not the caller's. But there are situations where
>>> the caller can end up with a pte that has PTE_CONT set (by having done a
>>> previous ptep_get() for example) and then it forwards the pte to a setter. So
>>> stripping it is required; it would probably be cleaner to strip it before
>>> returning it from ptep_get(), but that would be problematic for pte_leaf_size()
>>> which is called from perf_get_pgtable_size().
>>>
>>> In this particular case, __ptep_set_access_flags() only modifies the PTE's
>>> access flags, so CONT_PTE will remain as it is in the page table. The fact that
>>> entry has it cleared is not a problem.
>>
>> I see, I had not checked the arm64 implementation of ptep_set_access_flags().
>> For context, I'm merging the arm64 contpte support with the riscv napot support,
>> the implementation being quite similar (although riscv is a bit different as it
>> uses bits from the pfn to advertise the number of contiguous ptes).
>>
>> Anyway, our implementation of ptep_set_access_flags() actually sets the ptep
>> with entry, so we would lose the cont bit. I would simply do the
>> following (I will in my patchset, no need for you to worry about this):
>>
>> __ptep_set_access_flags(vma, addr, ptep, pte_mkcont(entry), 0);
>>
>> Let me know if you think this is not right,
> I'm not familiar with riscv HW specs or the Linux implementation, so it's hard
> for me to say for sure. If your __ptep_set_access_flags() implementation is
> writing all of entry to the PTE, then it looks like it's probably wrong to me -
> you will be writing the same PFN to every PTE in the contpte block. Certainly on
> Arm that would be wrong.


On riscv, ptes in the same contiguous range are expected to carry the
same pfn, because the low-order bits of the pfn are used to advertise the
number of ptes in the mapping. The HW uses the low-order virtual address
bits to select the correct page within the range (see page 89 of the riscv
privileged specification
https://drive.google.com/file/d/1EMip5dZlnypTk7pt4WWUKmtjUKTOkBqh/view).
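
Concretely, my understanding of that encoding for the 64KiB case is roughly the
sketch below: the pte has the NAPOT bit set and ppn[3:0] carries the pattern
0b1000, so the low pfn bits are a size code rather than address bits (the names
below are made up for illustration, not the riscv kernel's):

#define NAPOT_64K_PTES		16UL
#define NAPOT_64K_PATTERN	0x8UL		/* ppn[3:0] == 0b1000 */

/* Encode a 16-page-aligned base pfn into the napot pfn field. */
static unsigned long napot_64k_encode_pfn_example(unsigned long base_pfn)
{
	return (base_pfn & ~(NAPOT_64K_PTES - 1)) | NAPOT_64K_PATTERN;
}

/* Recover the base pfn by masking off the size-pattern bits. */
static unsigned long napot_64k_base_pfn_example(unsigned long napot_pfn)
{
	return napot_pfn & ~(NAPOT_64K_PTES - 1);
}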


>
> On Arm, __ptep_set_access_flags() is just folding the access flags in entry into
> the PTE without changing anything else. That's needed in order to deal with
> racing HW updates of the access/dirty bits.


Ok, I have to think more about that; we don't do that (neither does x86),
so maybe I'm missing something.
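
Writing it out for myself, I think the window being described is something like
the generic sketch below (not arm64's or riscv's actual code):

static void lost_dirty_bit_example(pte_t *ptep)
{
	pte_t old = __ptep_get(ptep);		/* HW dirty bit clear at this point     */
	pte_t young = pte_mkyoung(old);		/* compute the updated entry ...        */
						/* <-- HW sets the dirty bit here       */
	__set_pte(ptep, young);			/* ... plain store discards that update */
}

Folding the flags in with an atomic read-modify-write (e.g. a cmpxchg loop that
only ORs in the changed bits) closes that window.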


Thanks,

Alex


>
> Thanks,
> Ryan
>
>
>> Thanks,
>>
>> Alex
>>
>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>>>> +
>>>>> +        if (dirty)
>>>>> +            __flush_tlb_range(vma, start_addr, addr,
>>>>> +                            PAGE_SIZE, true, 3);
>>>>> +    } else {
>>>>> +        __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>>>>> +        __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>>>> +        contpte_try_fold(vma->vm_mm, addr, ptep, entry);
>>>>> +    }
>>>>> +
>>>>> +    return 1;
>>>>> +}
>>>>> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);