2024-01-29 12:50:17

by David Hildenbrand

Subject: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

Now that the rmap overhaul[1], which provides a clean interface for rmap
batching, is upstream, let's implement PTE batching during fork when
processing PTE-mapped THPs.

This series is partially based on Ryan's previous work[2] to implement
cont-pte support on arm64, but it's a complete rewrite based on [1] to
optimize all architectures independent of any such PTE bits, and to
use the new rmap batching functions that simplify the code and prepare
for further rmap accounting changes.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch and (c) perform batch PTE setting/updates.

While this series should be beneficial for adding cont-pte support on
ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
for large folios with minimal added overhead and further changes[4] that
build up on top of the total mapcount.

Independent of all that, this series results in a speedup during fork with
PTE-mapped THP, which is the default with THPs that are smaller than a PMD
(for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
of the same size (stddev < 1%) results in the following runtimes
for fork() (shorter is better):

Folio Size | v6.8-rc1 |   New    | Change
------------------------------------------
      4KiB | 0.014328 | 0.014035 |  - 2%
     16KiB | 0.014263 | 0.01196  |  -16%
     32KiB | 0.014334 | 0.01094  |  -24%
     64KiB | 0.014046 | 0.010444 |  -26%
    128KiB | 0.014011 | 0.010063 |  -28%
    256KiB | 0.013993 | 0.009938 |  -29%
    512KiB | 0.013983 | 0.00985  |  -30%
   1024KiB | 0.013986 | 0.00982  |  -30%
   2048KiB | 0.014305 | 0.010076 |  -30%

Note that these numbers are even better than the ones from v1 (verified
over multiple reboots), even though there were only minimal code changes.
Well, I removed a pte_mkclean() call for anon folios, maybe that also
plays a role.

But my experience is that fork() is extremely sensitive to code size,
inlining, ..., so I suspect other architectures will rather see a change
of around -20% instead of -30%, and it will be easy to "lose" some of that
speedup through subtle code changes in the future.

Next up is PTE batching when unmapping. Only tested on x86-64.
Compile-tested on most other architectures.

v2 -> v3:
* Rebased on mm-unstable
* Picked up RB's
* Updated documentation of wrprotect_ptes().

v1 -> v2:
* "arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary"
-> Added patch from Ryan
* "arm/pgtable: define PFN_PTE_SHIFT"
-> Removed the arm64 bits
* "mm/pgtable: make pte_next_pfn() independent of set_ptes()"
* "arm/mm: use pte_next_pfn() in set_ptes()"
* "powerpc/mm: use pte_next_pfn() in set_ptes()"
-> Added to use pte_next_pfn() in some arch set_ptes() implementations
I tried to make use of pte_next_pfn() also in the others, but it's
not trivial because the other archs implement set_ptes() in their
asm/pgtable.h. Future work.
* "mm/memory: factor out copying the actual PTE in copy_present_pte()"
-> Move common folio_get() out of if/else
* "mm/memory: optimize fork() with PTE-mapped THP"
-> Add doc for wrprotect_ptes
-> Extend description to mention handling of pinned folios
-> Move common folio_ref_add() out of if/else
* "mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()"
-> Be more conservative with dirty/soft-dirty, let the caller specify
using flags

[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]
[3] https://lkml.kernel.org/r/[email protected]
[4] https://lkml.kernel.org/r/[email protected]
[5] https://lkml.kernel.org/r/[email protected]

Cc: Andrew Morton <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Russell King <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

---

Andrew asked for a resend based on latest mm-unstable. I am sending this
out earlier than I would usually have sent out the next version, so we can
pull it into mm-unstable again now that v1 was dropped.

David Hildenbrand (14):
arm/pgtable: define PFN_PTE_SHIFT
nios2/pgtable: define PFN_PTE_SHIFT
powerpc/pgtable: define PFN_PTE_SHIFT
riscv/pgtable: define PFN_PTE_SHIFT
s390/pgtable: define PFN_PTE_SHIFT
sparc/pgtable: define PFN_PTE_SHIFT
mm/pgtable: make pte_next_pfn() independent of set_ptes()
arm/mm: use pte_next_pfn() in set_ptes()
powerpc/mm: use pte_next_pfn() in set_ptes()
mm/memory: factor out copying the actual PTE in copy_present_pte()
mm/memory: pass PTE to copy_present_pte()
mm/memory: optimize fork() with PTE-mapped THP
mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
mm/memory: ignore writable bit in folio_pte_batch()

Ryan Roberts (1):
arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary

 arch/arm/include/asm/pgtable.h      |   2 +
 arch/arm/mm/mmu.c                   |   2 +-
 arch/arm64/include/asm/pgtable.h    |  28 ++--
 arch/nios2/include/asm/pgtable.h    |   2 +
 arch/powerpc/include/asm/pgtable.h  |   2 +
 arch/powerpc/mm/pgtable.c           |   5 +-
 arch/riscv/include/asm/pgtable.h    |   2 +
 arch/s390/include/asm/pgtable.h     |   2 +
 arch/sparc/include/asm/pgtable_64.h |   2 +
 include/linux/pgtable.h             |  33 ++++-
 mm/memory.c                         | 212 ++++++++++++++++++++++------
 11 files changed, 229 insertions(+), 63 deletions(-)


base-commit: d162e170f1181b4305494843e1976584ddf2b72e
--
2.43.0



2024-01-29 12:50:30

by David Hildenbrand

Subject: [PATCH v3 01/15] arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary

From: Ryan Roberts <[email protected]>

Since the high bits [51:48] of an OA are not stored contiguously in the
PTE, there is a theoretical bug in set_ptes(), which just adds PAGE_SIZE
to the pte to get the pte with the next pfn. This works until the pfn
crosses the 48-bit boundary, at which point we overflow into the upper
attributes.

Of course one could argue (and Matthew Wilcox has :) that we will never
see a folio cross this boundary because we only allow naturally aligned
power-of-2 allocation, so this would require a half-petabyte folio. So
it's only a theoretical bug. But it's better that the code is robust
regardless.

I've implemented pte_next_pfn() as part of the fix, which is an opt-in
core-mm interface. So that is now available to the core-mm, which will
be needed shortly to support forthcoming fork()-batching optimizations.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 4a169d61c2ed ("arm64: implement the new page table range API")
Closes: https://lore.kernel.org/linux-mm/[email protected]/
Signed-off-by: Ryan Roberts <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b50270107e2f..9428801c1040 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -341,6 +341,22 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
mte_sync_tags(pte, nr_pages);
}

+/*
+ * Select all bits except the pfn
+ */
+static inline pgprot_t pte_pgprot(pte_t pte)
+{
+ unsigned long pfn = pte_pfn(pte);
+
+ return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
+}
+
+#define pte_next_pfn pte_next_pfn
+static inline pte_t pte_next_pfn(pte_t pte)
+{
+ return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
+}
+
static inline void set_ptes(struct mm_struct *mm,
unsigned long __always_unused addr,
pte_t *ptep, pte_t pte, unsigned int nr)
@@ -354,7 +370,7 @@ static inline void set_ptes(struct mm_struct *mm,
if (--nr == 0)
break;
ptep++;
- pte_val(pte) += PAGE_SIZE;
+ pte = pte_next_pfn(pte);
}
}
#define set_ptes set_ptes
@@ -433,16 +449,6 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
return clear_pte_bit(pte, __pgprot(PTE_SWP_EXCLUSIVE));
}

-/*
- * Select all bits except the pfn
- */
-static inline pgprot_t pte_pgprot(pte_t pte)
-{
- unsigned long pfn = pte_pfn(pte);
-
- return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
-}
-
#ifdef CONFIG_NUMA_BALANCING
/*
* See the comment in include/linux/pgtable.h
--
2.43.0


2024-01-29 12:50:46

by David Hildenbrand

Subject: [PATCH v3 02/15] arm/pgtable: define PFN_PTE_SHIFT

We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand <[email protected]>
---
arch/arm/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index d657b84b6bf7..be91e376df79 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -209,6 +209,8 @@ static inline void __sync_icache_dcache(pte_t pteval)
extern void __sync_icache_dcache(pte_t pteval);
#endif

+#define PFN_PTE_SHIFT PAGE_SHIFT
+
void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pteval, unsigned int nr);
#define set_ptes set_ptes
--
2.43.0


2024-01-29 12:51:20

by David Hildenbrand

Subject: [PATCH v3 04/15] powerpc/pgtable: define PFN_PTE_SHIFT

We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Reviewed-by: Christophe Leroy <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/powerpc/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 9224f23065ff..7a1ba8889aea 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -41,6 +41,8 @@ struct mm_struct;

#ifndef __ASSEMBLY__

+#define PFN_PTE_SHIFT PTE_RPN_SHIFT
+
void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
pte_t pte, unsigned int nr);
#define set_ptes set_ptes
--
2.43.0


2024-01-29 12:51:33

by David Hildenbrand

Subject: [PATCH v3 05/15] riscv/pgtable: define PFN_PTE_SHIFT

We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/riscv/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 0c94260b5d0c..add5cd30ab34 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -523,6 +523,8 @@ static inline void __set_pte_at(pte_t *ptep, pte_t pteval)
set_pte(ptep, pteval);
}

+#define PFN_PTE_SHIFT _PAGE_PFN_SHIFT
+
static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pteval, unsigned int nr)
{
--
2.43.0


2024-01-29 12:51:41

by David Hildenbrand

Subject: [PATCH v3 03/15] nios2/pgtable: define PFN_PTE_SHIFT

We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand <[email protected]>
---
arch/nios2/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index 5144506dfa69..d052dfcbe8d3 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -178,6 +178,8 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
*ptep = pteval;
}

+#define PFN_PTE_SHIFT 0
+
static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, unsigned int nr)
{
--
2.43.0


2024-01-29 12:51:46

by David Hildenbrand

Subject: [PATCH v3 06/15] s390/pgtable: define PFN_PTE_SHIFT

We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand <[email protected]>
---
arch/s390/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 1299b56e43f6..4b91e65c85d9 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1316,6 +1316,8 @@ pgprot_t pgprot_writecombine(pgprot_t prot);
#define pgprot_writethrough pgprot_writethrough
pgprot_t pgprot_writethrough(pgprot_t prot);

+#define PFN_PTE_SHIFT PAGE_SHIFT
+
/*
* Set multiple PTEs to consecutive pages with a single call. All PTEs
* are within the same folio, PMD and VMA.
--
2.43.0


2024-01-29 12:52:01

by David Hildenbrand

Subject: [PATCH v3 07/15] sparc/pgtable: define PFN_PTE_SHIFT

We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand <[email protected]>
---
arch/sparc/include/asm/pgtable_64.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index a8c871b7d786..652af9d63fa2 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -929,6 +929,8 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT);
}

+#define PFN_PTE_SHIFT PAGE_SHIFT
+
static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, unsigned int nr)
{
--
2.43.0


2024-01-29 12:52:14

by David Hildenbrand

Subject: [PATCH v3 08/15] mm/pgtable: make pte_next_pfn() independent of set_ptes()

Let's provide pte_next_pfn(), independently of set_ptes(). This allows for
using the generic pte_next_pfn() version in some arch-specific set_ptes()
implementations, and prepares for reusing pte_next_pfn() in other contexts.

Reviewed-by: Christophe Leroy <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/pgtable.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f6d0e3513948..351cd9dc7194 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,7 +212,6 @@ static inline int pmd_dirty(pmd_t pmd)
#define arch_flush_lazy_mmu_mode() do {} while (0)
#endif

-#ifndef set_ptes

#ifndef pte_next_pfn
static inline pte_t pte_next_pfn(pte_t pte)
@@ -221,6 +220,7 @@ static inline pte_t pte_next_pfn(pte_t pte)
}
#endif

+#ifndef set_ptes
/**
* set_ptes - Map consecutive pages to a contiguous range of addresses.
* @mm: Address space to map the pages into.
--
2.43.0


2024-01-29 12:52:31

by David Hildenbrand

Subject: [PATCH v3 09/15] arm/mm: use pte_next_pfn() in set_ptes()

Let's use our handy helper now that it's available on all archs.

Signed-off-by: David Hildenbrand <[email protected]>
---
arch/arm/mm/mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 674ed71573a8..c24e29c0b9a4 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -1814,6 +1814,6 @@ void set_ptes(struct mm_struct *mm, unsigned long addr,
if (--nr == 0)
break;
ptep++;
- pte_val(pteval) += PAGE_SIZE;
+ pteval = pte_next_pfn(pteval);
}
}
--
2.43.0


2024-01-29 12:52:44

by David Hildenbrand

Subject: [PATCH v3 10/15] powerpc/mm: use pte_next_pfn() in set_ptes()

Let's use our handy new helper. Note that the implementation is slightly
different, but shouldn't really make a difference in practice.

Reviewed-by: Christophe Leroy <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/powerpc/mm/pgtable.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index a04ae4449a02..549a440ed7f6 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -220,10 +220,7 @@ void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
break;
ptep++;
addr += PAGE_SIZE;
- /*
- * increment the pfn.
- */
- pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
+ pte = pte_next_pfn(pte);
}
}

--
2.43.0


2024-01-29 12:52:58

by David Hildenbrand

Subject: [PATCH v3 11/15] mm/memory: factor out copying the actual PTE in copy_present_pte()

Let's prepare for further changes.

Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 63 ++++++++++++++++++++++++++++-------------------------
1 file changed, 33 insertions(+), 30 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8d14ba440929..a3bdb25f4c8d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -930,6 +930,29 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
return 0;
}

+static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte,
+ pte_t pte, unsigned long addr)
+{
+ struct mm_struct *src_mm = src_vma->vm_mm;
+
+ /* If it's a COW mapping, write protect it both processes. */
+ if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
+ ptep_set_wrprotect(src_mm, addr, src_pte);
+ pte = pte_wrprotect(pte);
+ }
+
+ /* If it's a shared mapping, mark it clean in the child. */
+ if (src_vma->vm_flags & VM_SHARED)
+ pte = pte_mkclean(pte);
+ pte = pte_mkold(pte);
+
+ if (!userfaultfd_wp(dst_vma))
+ pte = pte_clear_uffd_wp(pte);
+
+ set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+}
+
/*
* Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
* is required to copy this pte.
@@ -939,23 +962,23 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
struct folio **prealloc)
{
- struct mm_struct *src_mm = src_vma->vm_mm;
- unsigned long vm_flags = src_vma->vm_flags;
pte_t pte = ptep_get(src_pte);
struct page *page;
struct folio *folio;

page = vm_normal_page(src_vma, addr, pte);
- if (page)
- folio = page_folio(page);
- if (page && folio_test_anon(folio)) {
+ if (unlikely(!page))
+ goto copy_pte;
+
+ folio = page_folio(page);
+ folio_get(folio);
+ if (folio_test_anon(folio)) {
/*
* If this page may have been pinned by the parent process,
* copy the page immediately for the child so that we'll always
* guarantee the pinned page won't be randomly replaced in the
* future.
*/
- folio_get(folio);
if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, src_vma))) {
/* Page may be pinned, we have to copy. */
folio_put(folio);
@@ -963,34 +986,14 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
addr, rss, prealloc, page);
}
rss[MM_ANONPAGES]++;
- } else if (page) {
- folio_get(folio);
+ VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
+ } else {
folio_dup_file_rmap_pte(folio, page);
rss[mm_counter_file(folio)]++;
}

- /*
- * If it's a COW mapping, write protect it both
- * in the parent and the child
- */
- if (is_cow_mapping(vm_flags) && pte_write(pte)) {
- ptep_set_wrprotect(src_mm, addr, src_pte);
- pte = pte_wrprotect(pte);
- }
- VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
-
- /*
- * If it's a shared mapping, mark it clean in
- * the child
- */
- if (vm_flags & VM_SHARED)
- pte = pte_mkclean(pte);
- pte = pte_mkold(pte);
-
- if (!userfaultfd_wp(dst_vma))
- pte = pte_clear_uffd_wp(pte);
-
- set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+copy_pte:
+ __copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, pte, addr);
return 0;
}

--
2.43.0


2024-01-29 12:53:14

by David Hildenbrand

Subject: [PATCH v3 12/15] mm/memory: pass PTE to copy_present_pte()

We already read it, let's just forward it.

This patch is based on work by Ryan Roberts.

Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index a3bdb25f4c8d..41b24da5be38 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -959,10 +959,9 @@ static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
*/
static inline int
copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
- pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
- struct folio **prealloc)
+ pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
+ int *rss, struct folio **prealloc)
{
- pte_t pte = ptep_get(src_pte);
struct page *page;
struct folio *folio;

@@ -1103,7 +1102,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
}
/* copy_present_pte() will clear `*prealloc' if consumed */
ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
- addr, rss, &prealloc);
+ ptent, addr, rss, &prealloc);
/*
* If we need a pre-allocated page for this pte, drop the
* locks, allocate, and try again.
--
2.43.0


2024-01-29 12:53:29

by David Hildenbrand

Subject: [PATCH v3 13/15] mm/memory: optimize fork() with PTE-mapped THP

Let's implement PTE batching when consecutive (present) PTEs map
consecutive pages of the same large folio, and all other PTE bits besides
the PFNs are equal.

We will optimize folio_pte_batch() separately, to ignore selected
PTE bits. This patch is based on work by Ryan Roberts.

Use __always_inline for __copy_present_ptes() and keep the handling for
single PTEs completely separate from the multi-PTE case: we really want
the compiler to optimize for the single-PTE case with small folios, to
not degrade performance.

Note that PTE batching will never exceed a single page table and will
always stay within VMA boundaries.

Further, processing PTE-mapped THP that may be pinned and have
PageAnonExclusive set on at least one subpage should work as expected,
but there is room for improvement: we will repeatedly (1) detect a PTE
batch, (2) detect that we have to copy a page, and (3) fall back and
allocate a single page to copy a single page. For now we won't care, as
pinned pages are a corner case, and we should rather look into maintaining
only a single PageAnonExclusive bit for large folios.

Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/pgtable.h | 31 +++++++++++
mm/memory.c | 112 +++++++++++++++++++++++++++++++++-------
2 files changed, 124 insertions(+), 19 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 351cd9dc7194..aab227e12493 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -650,6 +650,37 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
}
#endif

+#ifndef wrprotect_ptes
+/**
+ * wrprotect_ptes - Write-protect PTEs that map consecutive pages of the same
+ * folio.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to write-protect.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_set_wrprotect().
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock. The PTEs map consecutive
+ * pages that belong to the same folio. The PTEs are all in the same PMD.
+ */
+static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, unsigned int nr)
+{
+ for (;;) {
+ ptep_set_wrprotect(mm, addr, ptep);
+ if (--nr == 0)
+ break;
+ ptep++;
+ addr += PAGE_SIZE;
+ }
+}
+#endif
+
/*
* On some architectures hardware does not set page access bit when accessing
* memory page, it is responsibility of software setting this bit. It brings
diff --git a/mm/memory.c b/mm/memory.c
index 41b24da5be38..86f8a0021c8e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -930,15 +930,15 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
return 0;
}

-static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
+static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte,
- pte_t pte, unsigned long addr)
+ pte_t pte, unsigned long addr, int nr)
{
struct mm_struct *src_mm = src_vma->vm_mm;

/* If it's a COW mapping, write protect it both processes. */
if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
- ptep_set_wrprotect(src_mm, addr, src_pte);
+ wrprotect_ptes(src_mm, addr, src_pte, nr);
pte = pte_wrprotect(pte);
}

@@ -950,26 +950,93 @@ static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
if (!userfaultfd_wp(dst_vma))
pte = pte_clear_uffd_wp(pte);

- set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+ set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
+}
+
+/*
+ * Detect a PTE batch: consecutive (present) PTEs that map consecutive
+ * pages of the same folio.
+ *
+ * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
+ */
+static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
+ pte_t *start_ptep, pte_t pte, int max_nr)
+{
+ unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
+ const pte_t *end_ptep = start_ptep + max_nr;
+ pte_t expected_pte = pte_next_pfn(pte);
+ pte_t *ptep = start_ptep + 1;
+
+ VM_WARN_ON_FOLIO(!pte_present(pte), folio);
+
+ while (ptep != end_ptep) {
+ pte = ptep_get(ptep);
+
+ if (!pte_same(pte, expected_pte))
+ break;
+
+ /*
+ * Stop immediately once we reached the end of the folio. In
+ * corner cases the next PFN might fall into a different
+ * folio.
+ */
+ if (pte_pfn(pte) == folio_end_pfn)
+ break;
+
+ expected_pte = pte_next_pfn(expected_pte);
+ ptep++;
+ }
+
+ return ptep - start_ptep;
}

/*
- * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
- * is required to copy this pte.
+ * Copy one present PTE, trying to batch-process subsequent PTEs that map
+ * consecutive pages of the same folio by copying them as well.
+ *
+ * Returns -EAGAIN if one preallocated page is required to copy the next PTE.
+ * Otherwise, returns the number of copied PTEs (at least 1).
*/
static inline int
-copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
+copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
- int *rss, struct folio **prealloc)
+ int max_nr, int *rss, struct folio **prealloc)
{
struct page *page;
struct folio *folio;
+ int err, nr;

page = vm_normal_page(src_vma, addr, pte);
if (unlikely(!page))
goto copy_pte;

folio = page_folio(page);
+
+ /*
+ * If we likely have to copy, just don't bother with batching. Make
+ * sure that the common "small folio" case is as fast as possible
+ * by keeping the batching logic separate.
+ */
+ if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
+ nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr);
+ folio_ref_add(folio, nr);
+ if (folio_test_anon(folio)) {
+ if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
+ nr, src_vma))) {
+ folio_ref_sub(folio, nr);
+ return -EAGAIN;
+ }
+ rss[MM_ANONPAGES] += nr;
+ VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
+ } else {
+ folio_dup_file_rmap_ptes(folio, page, nr);
+ rss[mm_counter_file(folio)] += nr;
+ }
+ __copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte,
+ addr, nr);
+ return nr;
+ }
+
folio_get(folio);
if (folio_test_anon(folio)) {
/*
@@ -981,8 +1048,9 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, src_vma))) {
/* Page may be pinned, we have to copy. */
folio_put(folio);
- return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
- addr, rss, prealloc, page);
+ err = copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
+ addr, rss, prealloc, page);
+ return err ? err : 1;
}
rss[MM_ANONPAGES]++;
VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
@@ -992,8 +1060,8 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
}

copy_pte:
- __copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, pte, addr);
- return 0;
+ __copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte, addr, 1);
+ return 1;
}

static inline struct folio *folio_prealloc(struct mm_struct *src_mm,
@@ -1030,10 +1098,11 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
pte_t *src_pte, *dst_pte;
pte_t ptent;
spinlock_t *src_ptl, *dst_ptl;
- int progress, ret = 0;
+ int progress, max_nr, ret = 0;
int rss[NR_MM_COUNTERS];
swp_entry_t entry = (swp_entry_t){0};
struct folio *prealloc = NULL;
+ int nr;

again:
progress = 0;
@@ -1064,6 +1133,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
arch_enter_lazy_mmu_mode();

do {
+ nr = 1;
+
/*
* We are holding two locks at this point - either of them
* could generate latencies in another task on another CPU.
@@ -1100,9 +1171,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
*/
WARN_ON_ONCE(ret != -ENOENT);
}
- /* copy_present_pte() will clear `*prealloc' if consumed */
- ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
- ptent, addr, rss, &prealloc);
+ /* copy_present_ptes() will clear `*prealloc' if consumed */
+ max_nr = (end - addr) / PAGE_SIZE;
+ ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
+ ptent, addr, max_nr, rss, &prealloc);
/*
* If we need a pre-allocated page for this pte, drop the
* locks, allocate, and try again.
@@ -1119,8 +1191,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
folio_put(prealloc);
prealloc = NULL;
}
- progress += 8;
- } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+ nr = ret;
+ progress += 8 * nr;
+ } while (dst_pte += nr, src_pte += nr, addr += PAGE_SIZE * nr,
+ addr != end);

arch_leave_lazy_mmu_mode();
pte_unmap_unlock(orig_src_pte, src_ptl);
@@ -1141,7 +1215,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
prealloc = folio_prealloc(src_mm, src_vma, addr, false);
if (!prealloc)
return -ENOMEM;
- } else if (ret) {
+ } else if (ret < 0) {
VM_WARN_ON_ONCE(1);
}

--
2.43.0


2024-01-29 12:53:44

by David Hildenbrand

Subject: [PATCH v3 14/15] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()

Let's always ignore the accessed/young bit: we'll always mark the PTE
as old in our child process during fork, and upcoming users will
similarly not care.

Ignore the dirty bit only if we don't want to duplicate the dirty bit
into the child process during fork. Maybe, we could just set all PTEs
in the child dirty if any PTE is dirty. For now, let's keep the behavior
unchanged; this can be optimized later if required.

Ignore the soft-dirty bit only if the bit doesn't have any meaning in
the src vma, and similarly won't have any in the copied dst vma.

For now, we won't bother with the uffd-wp bit.

Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 36 +++++++++++++++++++++++++++++++-----
1 file changed, 31 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 86f8a0021c8e..b2ec2b6b54c7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -953,24 +953,44 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
}

+/* Flags for folio_pte_batch(). */
+typedef int __bitwise fpb_t;
+
+/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
+#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0))
+
+/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
+#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1))
+
+static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
+{
+ if (flags & FPB_IGNORE_DIRTY)
+ pte = pte_mkclean(pte);
+ if (likely(flags & FPB_IGNORE_SOFT_DIRTY))
+ pte = pte_clear_soft_dirty(pte);
+ return pte_mkold(pte);
+}
+
/*
* Detect a PTE batch: consecutive (present) PTEs that map consecutive
* pages of the same folio.
*
- * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
+ * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
+ * the accessed bit, dirty bit (with FPB_IGNORE_DIRTY) and soft-dirty bit
+ * (with FPB_IGNORE_SOFT_DIRTY).
*/
static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
- pte_t *start_ptep, pte_t pte, int max_nr)
+ pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags)
{
unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
const pte_t *end_ptep = start_ptep + max_nr;
- pte_t expected_pte = pte_next_pfn(pte);
+ pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte), flags);
pte_t *ptep = start_ptep + 1;

VM_WARN_ON_FOLIO(!pte_present(pte), folio);

while (ptep != end_ptep) {
- pte = ptep_get(ptep);
+ pte = __pte_batch_clear_ignored(ptep_get(ptep), flags);

if (!pte_same(pte, expected_pte))
break;
@@ -1004,6 +1024,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
{
struct page *page;
struct folio *folio;
+ fpb_t flags = 0;
int err, nr;

page = vm_normal_page(src_vma, addr, pte);
@@ -1018,7 +1039,12 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
* by keeping the batching logic separate.
*/
if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
- nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr);
+ if (src_vma->vm_flags & VM_SHARED)
+ flags |= FPB_IGNORE_DIRTY;
+ if (!vma_soft_dirty_enabled(src_vma))
+ flags |= FPB_IGNORE_SOFT_DIRTY;
+
+ nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags);
folio_ref_add(folio, nr);
if (folio_test_anon(folio)) {
if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
--
2.43.0


2024-01-29 12:53:52

by David Hildenbrand

Subject: [PATCH v3 15/15] mm/memory: ignore writable bit in folio_pte_batch()

.. and conditionally return to the caller if any PTE except the first one
is writable. fork() has to make sure to properly write-protect in case any
PTE is writable. Other users (e.g., page unmapping) are expected to not
care.

Reviewed-by: Ryan Roberts <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 30 ++++++++++++++++++++++++------
1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b2ec2b6b54c7..b05fd28dbce1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -968,7 +968,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
pte = pte_mkclean(pte);
if (likely(flags & FPB_IGNORE_SOFT_DIRTY))
pte = pte_clear_soft_dirty(pte);
- return pte_mkold(pte);
+ return pte_wrprotect(pte_mkold(pte));
}

/*
@@ -976,21 +976,32 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
* pages of the same folio.
*
* All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
- * the accessed bit, dirty bit (with FPB_IGNORE_DIRTY) and soft-dirty bit
- * (with FPB_IGNORE_SOFT_DIRTY).
+ * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
+ * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
+ *
+ * If "any_writable" is set, it will indicate if any other PTE besides the
+ * first (given) PTE is writable.
*/
static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
- pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags)
+ pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
+ bool *any_writable)
{
unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
const pte_t *end_ptep = start_ptep + max_nr;
pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte), flags);
pte_t *ptep = start_ptep + 1;
+ bool writable;
+
+ if (any_writable)
+ *any_writable = false;

VM_WARN_ON_FOLIO(!pte_present(pte), folio);

while (ptep != end_ptep) {
- pte = __pte_batch_clear_ignored(ptep_get(ptep), flags);
+ pte = ptep_get(ptep);
+ if (any_writable)
+ writable = !!pte_write(pte);
+ pte = __pte_batch_clear_ignored(pte, flags);

if (!pte_same(pte, expected_pte))
break;
@@ -1003,6 +1014,9 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
if (pte_pfn(pte) == folio_end_pfn)
break;

+ if (any_writable)
+ *any_writable |= writable;
+
expected_pte = pte_next_pfn(expected_pte);
ptep++;
}
@@ -1024,6 +1038,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
{
struct page *page;
struct folio *folio;
+ bool any_writable;
fpb_t flags = 0;
int err, nr;

@@ -1044,7 +1059,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
if (!vma_soft_dirty_enabled(src_vma))
flags |= FPB_IGNORE_SOFT_DIRTY;

- nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags);
+ nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
+ &any_writable);
folio_ref_add(folio, nr);
if (folio_test_anon(folio)) {
if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
@@ -1058,6 +1074,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
folio_dup_file_rmap_ptes(folio, page, nr);
rss[mm_counter_file(folio)] += nr;
}
+ if (any_writable)
+ pte = pte_mkwrite(pte, src_vma);
__copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte,
addr, nr);
return nr;
--
2.43.0


2024-01-31 10:59:37

by Ryan Roberts

Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

On 29/01/2024 12:46, David Hildenbrand wrote:
> Now that the rmap overhaul[1] is upstream that provides a clean interface
> for rmap batching, let's implement PTE batching during fork when processing
> PTE-mapped THPs.
>
> This series is partially based on Ryan's previous work[2] to implement
> cont-pte support on arm64, but it's a complete rewrite based on [1] to
> optimize all architectures independent of any such PTE bits, and to
> use the new rmap batching functions that simplify the code and prepare
> for further rmap accounting changes.
>
> We collect consecutive PTEs that map consecutive pages of the same large
> folio, making sure that the other PTE bits are compatible, and (a) adjust
> the refcount only once per batch, (b) call rmap handling functions only
> once per batch and (c) perform batch PTE setting/updates.
>
> While this series should be beneficial for adding cont-pte support on
> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
> for large folios with minimal added overhead and further changes[4] that
> build up on top of the total mapcount.
>
> Independent of all that, this series results in a speedup during fork with
> PTE-mapped THP, which is the default with THPs that are smaller than a PMD
> (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
>
> On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
> of the same size (stddev < 1%) results in the following runtimes
> for fork() (shorter is better):
>
> Folio Size | v6.8-rc1 |      New | Change
> ------------------------------------------
>       4KiB | 0.014328 | 0.014035 |   - 2%
>      16KiB | 0.014263 | 0.01196  |   -16%
>      32KiB | 0.014334 | 0.01094  |   -24%
>      64KiB | 0.014046 | 0.010444 |   -26%
>     128KiB | 0.014011 | 0.010063 |   -28%
>     256KiB | 0.013993 | 0.009938 |   -29%
>     512KiB | 0.013983 | 0.00985  |   -30%
>    1024KiB | 0.013986 | 0.00982  |   -30%
>    2048KiB | 0.014305 | 0.010076 |   -30%

Just a heads up that I'm seeing some strange results on Apple M2. Fork for
order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty sure I
didn't see this problem with version 1; although that was on a different
baseline and I've thrown the numbers away so will rerun and try to debug this.

| kernel      |   mean_rel |   std_rel |
|:------------|-----------:|----------:|
| mm-unstable |       0.0% |      1.1% |
| patch 1     |      -2.3% |      1.3% |
| patch 10    |      -2.9% |      2.7% |
| patch 11    |      13.5% |      0.5% |
| patch 12    |      15.2% |      1.2% |
| patch 13    |      18.2% |      0.7% |
| patch 14    |      20.5% |      1.0% |
| patch 15    |      17.1% |      1.6% |
| patch 15    |      16.7% |      0.8% |

fork for order-9 is looking good (-20%), and for the zap series, munmap is
looking good, but dontneed is looking poor for both order-0 and 9. But one thing
at a time... let's concentrate on fork order-0 first.

Note that I'm still using the "old" benchmark code. Could you resend me the link
to the new code? Although I don't think there should be any effect for order-0
anyway, if I understood your changes correctly?


>
> Note that these numbers are even better than the ones from v1 (verified
> over multiple reboots), even though there were only minimal code changes.
> Well, I removed a pte_mkclean() call for anon folios, maybe that also
> plays a role.
>
> But my experience is that fork() is extremely sensitive to code size,
> inlining, ... so I suspect we'll see on other architectures rather a change
> of -20% instead of -30%, and it will be easy to "lose" some of that speedup
> in the future by subtle code changes.
>
> Next up is PTE batching when unmapping. Only tested on x86-64.
> Compile-tested on most other architectures.
>
> v2 -> v3:
> * Rebased on mm-unstable
> * Picked up RB's
> * Updated documentation of wrprotect_ptes().
>
> v1 -> v2:
> * "arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary"
> -> Added patch from Ryan
> * "arm/pgtable: define PFN_PTE_SHIFT"
> -> Removed the arm64 bits
> * "mm/pgtable: make pte_next_pfn() independent of set_ptes()"
> * "arm/mm: use pte_next_pfn() in set_ptes()"
> * "powerpc/mm: use pte_next_pfn() in set_ptes()"
> -> Added to use pte_next_pfn() in some arch set_ptes() implementations
> I tried to make use of pte_next_pfn() also in the others, but it's
> not trivial because the other archs implement set_ptes() in their
> asm/pgtable.h. Future work.
> * "mm/memory: factor out copying the actual PTE in copy_present_pte()"
> -> Move common folio_get() out of if/else
> * "mm/memory: optimize fork() with PTE-mapped THP"
> -> Add doc for wrprotect_ptes
> -> Extend description to mention handling of pinned folios
> -> Move common folio_ref_add() out of if/else
> * "mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()"
> -> Be more conservative with dirty/soft-dirty, let the caller specify
> using flags
>
> [1] https://lkml.kernel.org/r/[email protected]
> [2] https://lkml.kernel.org/r/[email protected]
> [3] https://lkml.kernel.org/r/[email protected]
> [4] https://lkml.kernel.org/r/[email protected]
> [5] https://lkml.kernel.org/r/[email protected]
>
> Cc: Andrew Morton <[email protected]>
> Cc: Matthew Wilcox (Oracle) <[email protected]>
> Cc: Ryan Roberts <[email protected]>
> Cc: Russell King <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Dinh Nguyen <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Nicholas Piggin <[email protected]>
> Cc: Christophe Leroy <[email protected]>
> Cc: "Aneesh Kumar K.V" <[email protected]>
> Cc: "Naveen N. Rao" <[email protected]>
> Cc: Paul Walmsley <[email protected]>
> Cc: Palmer Dabbelt <[email protected]>
> Cc: Albert Ou <[email protected]>
> Cc: Alexander Gordeev <[email protected]>
> Cc: Gerald Schaefer <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> Cc: Vasily Gorbik <[email protected]>
> Cc: Christian Borntraeger <[email protected]>
> Cc: Sven Schnelle <[email protected]>
> Cc: "David S. Miller" <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
>
> ---
>
> Andrew asked for a resend based on latest mm-unstable. I am sending this
> out earlier than I would usually have sent out the next version, so we can
> pull it into mm-unstable again now that v1 was dropped.
>
> David Hildenbrand (14):
> arm/pgtable: define PFN_PTE_SHIFT
> nios2/pgtable: define PFN_PTE_SHIFT
> powerpc/pgtable: define PFN_PTE_SHIFT
> riscv/pgtable: define PFN_PTE_SHIFT
> s390/pgtable: define PFN_PTE_SHIFT
> sparc/pgtable: define PFN_PTE_SHIFT
> mm/pgtable: make pte_next_pfn() independent of set_ptes()
> arm/mm: use pte_next_pfn() in set_ptes()
> powerpc/mm: use pte_next_pfn() in set_ptes()
> mm/memory: factor out copying the actual PTE in copy_present_pte()
> mm/memory: pass PTE to copy_present_pte()
> mm/memory: optimize fork() with PTE-mapped THP
> mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
> mm/memory: ignore writable bit in folio_pte_batch()
>
> Ryan Roberts (1):
> arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary
>
> arch/arm/include/asm/pgtable.h | 2 +
> arch/arm/mm/mmu.c | 2 +-
> arch/arm64/include/asm/pgtable.h | 28 ++--
> arch/nios2/include/asm/pgtable.h | 2 +
> arch/powerpc/include/asm/pgtable.h | 2 +
> arch/powerpc/mm/pgtable.c | 5 +-
> arch/riscv/include/asm/pgtable.h | 2 +
> arch/s390/include/asm/pgtable.h | 2 +
> arch/sparc/include/asm/pgtable_64.h | 2 +
> include/linux/pgtable.h | 33 ++++-
> mm/memory.c | 212 ++++++++++++++++++++++------
> 11 files changed, 229 insertions(+), 63 deletions(-)
>
>
> base-commit: d162e170f1181b4305494843e1976584ddf2b72e


2024-01-31 11:09:52

by David Hildenbrand

Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

On 31.01.24 11:43, Ryan Roberts wrote:
> On 29/01/2024 12:46, David Hildenbrand wrote:
>> [...]
>
> Just a heads up that I'm seeing some strange results on Apple M2. Fork for
> order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty sure I
> didn't see this problem with version 1; although that was on a different
> baseline and I've thrown the numbers away so will rerun and try to debug this.
>

So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe
this. fork() for order-0 was consistently effectively unchanged. Do you
observe that on other ARM systems as well?


> | kernel | mean_rel | std_rel |
> |:------------|-----------:|----------:|
> | mm-unstable | 0.0% | 1.1% |
> | patch 1 | -2.3% | 1.3% |
> | patch 10 | -2.9% | 2.7% |
> | patch 11 | 13.5% | 0.5% |
> | patch 12 | 15.2% | 1.2% |
> | patch 13 | 18.2% | 0.7% |
> | patch 14 | 20.5% | 1.0% |
> | patch 15 | 17.1% | 1.6% |
> | patch 15 | 16.7% | 0.8% |
>
> fork for order-9 is looking good (-20%), and for the zap series, munmap is
> looking good, but dontneed is looking poor for both order-0 and 9. But one thing
> at a time... let's concentrate on fork order-0 first.

munmap and dontneed end up calling the exact same call paths. So a big
performance difference is rather surprising and might indicate something
else.

(I think I told you that I was running into some kind of VMA merging
problem where my benchmark would suddenly end up with 1 VMA per page.
The new benchmark below works around that, but I am not sure if that was
fixed in the meantime.)

VMA merging can of course explain a big difference in fork and munmap
vs. dontneed times, especially when comparing different code base where
that VMA merging behavior was different.

>
> Note that I'm still using the "old" benchmark code. Could you resend me the link
> to the new code? Although I don't think there should be any effect for order-0
> anyway, if I understood your changes correctly?

This is the combined one (small and large PTEs):

https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?inline=false

--
Cheers,

David / dhildenb


2024-01-31 11:28:26

by David Hildenbrand

Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

On 31.01.24 12:16, Ryan Roberts wrote:
> On 31/01/2024 11:06, David Hildenbrand wrote:
>> On 31.01.24 11:43, Ryan Roberts wrote:
>>> On 29/01/2024 12:46, David Hildenbrand wrote:
>>>> [...]
>>>
>>> Just a heads up that I'm seeing some strange results on Apple M2. Fork for
>>> order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty sure I
>>> didn't see this problem with version 1; although that was on a different
>>> baseline and I've thrown the numbers away so will rerun and try to debug this.
>>>
>>
>> So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe this.
>> fork() for order-0 was consistently effectively unchanged. Do you observe that
>> on other ARM systems as well?
>
> Nope; running the exact same kernel binary and user space on Altra, I see
> sensible numbers;
>
> fork order-0: -1.3%
> fork order-9: -7.6%
> dontneed order-0: -0.5%
> dontneed order-9: 0.1%
> munmap order-0: 0.0%
> munmap order-9: -67.9%
>
> So I guess some pipelining issue that causes the M2 to stall more?

With effectively one added folio_test_large(), it could only be a code
layout problem? Or the compiler does something stupid, but you say that
you run the exact same kernel binary, so that doesn't make sense.

I'm also surprised about the dontneed vs. munmap numbers. Doesn't make
any sense (again, there was this VMA merging problem but it would still
allow for batching within a single VMA that spans exactly one large folio).

What are you using as baseline? Really just mm-unstable vs.
mm-unstable+patches?

Let's see if the new test changes the numbers you measure.

--
Cheers,

David / dhildenb


2024-01-31 11:54:57

by Ryan Roberts

Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

On 31/01/2024 11:06, David Hildenbrand wrote:
> On 31.01.24 11:43, Ryan Roberts wrote:
>> On 29/01/2024 12:46, David Hildenbrand wrote:
>>> [...]
>>
>> Just a heads up that I'm seeing some strange results on Apple M2. Fork for
>> order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty sure I
>> didn't see this problem with version 1; although that was on a different
>> baseline and I've thrown the numbers away so will rerun and try to debug this.
>>
>
> So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe this.
> fork() for order-0 was consistently effectively unchanged. Do you observe that
> on other ARM systems as well?

Nope; running the exact same kernel binary and user space on Altra, I see
sensible numbers;

fork order-0: -1.3%
fork order-9: -7.6%
dontneed order-0: -0.5%
dontneed order-9: 0.1%
munmap order-0: 0.0%
munmap order-9: -67.9%

So I guess some pipelining issue that causes the M2 to stall more?

>
>
>> | kernel      |   mean_rel |   std_rel |
>> |:------------|-----------:|----------:|
>> | mm-unstable |       0.0% |      1.1% |
>> | patch 1     |      -2.3% |      1.3% |
>> | patch 10    |      -2.9% |      2.7% |
>> | patch 11    |      13.5% |      0.5% |
>> | patch 12    |      15.2% |      1.2% |
>> | patch 13    |      18.2% |      0.7% |
>> | patch 14    |      20.5% |      1.0% |
>> | patch 15    |      17.1% |      1.6% |
>> | patch 15    |      16.7% |      0.8% |
>>
>> fork for order-9 is looking good (-20%), and for the zap series, munmap is
>> looking good, but dontneed is looking poor for both order-0 and 9. But one thing
>> at a time... let's concentrate on fork order-0 first.
>
> munmap and dontneed end up calling the exact same call paths. So a big
> performance difference is rather surprising and might indicate something else.
>
> (I think I told you that I was running into some kind of VMA merging problem where
> my benchmark would suddenly end up with 1 VMA per page. The new benchmark below
> works around that, but I am not sure if that was fixed in the meantime.)
>
> VMA merging can of course explain a big difference in fork and munmap vs.
> dontneed times, especially when comparing different code base where that VMA
> merging behavior was different.
>
>>
>> Note that I'm still using the "old" benchmark code. Could you resend me the link
>> to the new code? Although I don't think there should be any effect for order-0
>> anyway, if I understood your changes correctly?
>
> This is the combined one (small and large PTEs):
>
> https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?inline=false

I'll have a go with this.

>


2024-01-31 11:56:44

by Ryan Roberts

Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

On 31/01/2024 11:28, David Hildenbrand wrote:
> On 31.01.24 12:16, Ryan Roberts wrote:
>> On 31/01/2024 11:06, David Hildenbrand wrote:
>>> On 31.01.24 11:43, Ryan Roberts wrote:
>>>> On 29/01/2024 12:46, David Hildenbrand wrote:
>>>>> [...]
>>>>
>>>> Just a heads up that I'm seeing some strange results on Apple M2. Fork for
>>>> order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty sure I
>>>> didn't see this problem with version 1; although that was on a different
>>>> baseline and I've thrown the numbers away so will rerun and try to debug this.

Numbers for v1 of the series, both on top of 6.8-rc1 and rebased to the same
mm-unstable base as v3 of the series (first 2 rows are from what I just posted
for context):

| kernel             |   mean_rel |   std_rel |
|:-------------------|-----------:|----------:|
| mm-unstable (base) |       0.0% |      1.1% |
| mm-unstable + v3   |      16.7% |      0.8% |
| mm-unstable + v1   |      -2.5% |      1.7% |
| v6.8-rc1 + v1      |      -6.6% |      1.1% |

So all looks good with v1. And this seems to suggest mm-unstable has regressed
by ~4% vs v6.8-rc1. Is this really a useful benchmark? Does the raw performance
of the fork() syscall really matter? Evidence suggests it's moving all over the
place; breathe on the code and it changes. Not a great place to be when using
the test for gating purposes!

Still with the old tests - I'll move to the new ones now.


>>>>
>>>
>>> So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe this.
>>> fork() for order-0 was consistently effectively unchanged. Do you observe that
>>> on other ARM systems as well?
>>
>> Nope; running the exact same kernel binary and user space on Altra, I see
>> sensible numbers;
>>
>> fork order-0: -1.3%
>> fork order-9: -7.6%
>> dontneed order-0: -0.5%
>> dontneed order-9: 0.1%
>> munmap order-0: 0.0%
>> munmap order-9: -67.9%
>>
>> So I guess some pipelining issue that causes the M2 to stall more?
>
> With one effective added folio_test_large(), it could only be a code layout
> problem? Or the compiler does something stupid, but you say that you run the
> exact same kernel binary, so that doesn't make sense.

Yup, same binary. We know this code is very sensitive - 1 cycle makes a big
difference. So could easily be code layout, branch prediction, etc...

>
> I'm also surprised about the dontneed vs. munmap numbers.

You mean the ones for Altra that I posted? (I didn't post any for M2). The altra
numbers look ok to me; dontneed has no change, and munmap has no change for
order-0 and is massively improved for order-9.

> Doesn't make any sense
> (again, there was this VMA merging problem but it would still allow for batching
> within a single VMA that spans exactly one large folio).
>
> What are you using as baseline? Really just mm-unstable vs. mm-unstable+patches?

yes. except for "v6.8-rc1 + v1" above.

>
> Let's see if the new test changes the numbers you measure.
>


2024-01-31 12:56:44

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

On 31.01.24 13:37, Ryan Roberts wrote:
> On 31/01/2024 11:49, Ryan Roberts wrote:
>> On 31/01/2024 11:28, David Hildenbrand wrote:
>>> On 31.01.24 12:16, Ryan Roberts wrote:
>>>> On 31/01/2024 11:06, David Hildenbrand wrote:
>>>>> On 31.01.24 11:43, Ryan Roberts wrote:
>>>>>> On 29/01/2024 12:46, David Hildenbrand wrote:
>>>>>>> Now that the rmap overhaul[1] is upstream that provides a clean interface
>>>>>>> for rmap batching, let's implement PTE batching during fork when processing
>>>>>>> PTE-mapped THPs.
>>>>>>>
>>>>>>> This series is partially based on Ryan's previous work[2] to implement
>>>>>>> cont-pte support on arm64, but its a complete rewrite based on [1] to
>>>>>>> optimize all architectures independent of any such PTE bits, and to
>>>>>>> use the new rmap batching functions that simplify the code and prepare
>>>>>>> for further rmap accounting changes.
>>>>>>>
>>>>>>> We collect consecutive PTEs that map consecutive pages of the same large
>>>>>>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>>>>>>> the refcount only once per batch, (b) call rmap handling functions only
>>>>>>> once per batch and (c) perform batch PTE setting/updates.
>>>>>>>
>>>>>>> While this series should be beneficial for adding cont-pte support on
>>>>>>> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
>>>>>>> for large folios with minimal added overhead and further changes[4] that
>>>>>>> build up on top of the total mapcount.
>>>>>>>
>>>>>>> Independent of all that, this series results in a speedup during fork with
>>>>>>> PTE-mapped THP, which is the default with THPs that are smaller than a PMD
>>>>>>> (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
>>>>>>>
>>>>>>> On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
>>>>>>> of the same size (stddev < 1%) results in the following runtimes
>>>>>>> for fork() (shorter is better):
>>>>>>>
>>>>>>> Folio Size | v6.8-rc1 |      New | Change
>>>>>>> ------------------------------------------
>>>>>>>         4KiB | 0.014328 | 0.014035 |   - 2%
>>>>>>>        16KiB | 0.014263 | 0.01196  |   -16%
>>>>>>>        32KiB | 0.014334 | 0.01094  |   -24%
>>>>>>>        64KiB | 0.014046 | 0.010444 |   -26%
>>>>>>>       128KiB | 0.014011 | 0.010063 |   -28%
>>>>>>>       256KiB | 0.013993 | 0.009938 |   -29%
>>>>>>>       512KiB | 0.013983 | 0.00985  |   -30%
>>>>>>>      1024KiB | 0.013986 | 0.00982  |   -30%
>>>>>>>      2048KiB | 0.014305 | 0.010076 |   -30%
>>>>>>
>>>>>> Just a heads up that I'm seeing some strange results on Apple M2. Fork for
>>>>>> order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty
>>>>>> sure I
>>>>>> didn't see this problem with version 1; although that was on a different
>>>>>> baseline and I've thrown the numbers away so will rerun and try to debug this.
>>
>> Numbers for v1 of the series, both on top of 6.8-rc1 and rebased to the same
>> mm-unstable base as v3 of the series (first 2 rows are from what I just posted
>> for context):
>>
>> | kernel | mean_rel | std_rel |
>> |:-------------------|-----------:|----------:|
>> | mm-unstabe (base) | 0.0% | 1.1% |
>> | mm-unstable + v3 | 16.7% | 0.8% |
>> | mm-unstable + v1 | -2.5% | 1.7% |
>> | v6.8-rc1 + v1 | -6.6% | 1.1% |
>>
>> So all looks good with v1. And seems to suggest mm-unstable has regressed by ~4%
>> vs v6.8-rc1. Is this really a useful benchmark? Does the raw performance of
>> fork() syscall really matter? Evidence suggests its moving all over the place -
>> breath on the code and it changes - not a great place to be when using the test
>> for gating purposes!
>>
>> Still with the old tests - I'll move to the new ones now.
>>
>>
>>>>>>
>>>>>
>>>>> So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe this.
>>>>> fork() for order-0 was consistently effectively unchanged. Do you observe that
>>>>> on other ARM systems as well?
>>>>
>>>> Nope; running the exact same kernel binary and user space on Altra, I see
>>>> sensible numbers;
>>>>
>>>> fork order-0: -1.3%
>>>> fork order-9: -7.6%
>>>> dontneed order-0: -0.5%
>>>> dontneed order-9: 0.1%
>>>> munmap order-0: 0.0%
>>>> munmap order-9: -67.9%
>>>>
>>>> So I guess some pipelining issue that causes the M2 to stall more?
>>>
>>> With one effective added folio_test_large(), it could only be a code layout
>>> problem? Or the compiler does something stupid, but you say that you run the
>>> exact same kernel binary, so that doesn't make sense.
>>
>> Yup, same binary. We know this code is very sensitive - 1 cycle makes a big
>> difference. So could easily be code layout, branch prediction, etc...
>>
>>>
>>> I'm also surprised about the dontneed vs. munmap numbers.
>>
>> You mean the ones for Altra that I posted? (I didn't post any for M2). The altra
>> numbers look ok to me; dontneed has no change, and munmap has no change for
>> order-0 and is massively improved for order-9.
>>
>> Doesn't make any sense
>>> (again, there was this VMA merging problem but it would still allow for batching
>>> within a single VMA that spans exactly one large folio).
>>>
>>> What are you using as baseline? Really just mm-unstable vs. mm-unstable+patches?
>>
>> yes. except for "v6.8-rc1 + v1" above.
>>
>>>
>>> Let's see if the new test changes the numbers you measure.
>
> Nope: looks the same. I've taken my test harness out of the picture and done
> everything manually from the ground up, with the old tests and the new. Headline
> is that I see similar numbers from both.

It took me a while to get really reproducible numbers on Intel. Most
importantly:
* Set a fixed CPU frequency, disabling any boost and avoiding any
thermal throttling.
* Pin the test to CPUs and set a nice level.
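[Editor's note: the pinning and priority part of the setup above can be sketched from Python as follows. This is a hypothetical helper, not the harness used in the thread; fixing the CPU frequency itself requires root access to sysfs and is only noted in a comment:]

```python
import os

def pin_and_prioritize(cpu=0, niceness=-20):
    """Pin the current process to one CPU and raise its scheduling priority.

    Fixing the CPU frequency (the other recommendation above) needs root,
    e.g. writing "performance" to
    /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor, so it is not
    attempted here.
    """
    os.sched_setaffinity(0, {cpu})   # restrict this process to a single CPU
    try:
        os.nice(niceness)            # lower nice value = higher priority;
    except PermissionError:          # negative values need privileges
        pass
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    print(pin_and_prioritize())
```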

Another thing is to avoid systems where you can have NUMA effects
within a single socket. Otherwise, memory access latency is just random
and depends on what the buddy allocator enjoys giving you.

But you seem to get the same +17% even after reboots, so that indicates
that the CPU is not happy about the code for some reason. And the weird
thing is that nothing significantly changed for order-0 folios between
v1 and v3 that could explain any of this.

I'm not worried about 5% or so; nobody cares. But it would be good to
have at least an explanation why only that system shows +17%.

>
> Some details:
> - I'm running for 10 seconds then averaging the output

Same here.

> - test is bimodal; first run (of 10 seconds) after boot is a bit faster on
> average (up to 10%) than the rest; I could guess this is due to the memory
> being allocated more contiguously the first few times through, so struct
> pages have better locality, but that's a guess.

I think it also has to do with the PCP lists, and the high-pcp auto
tuning (I played with disabling that). Running on a freshly booted
system gave me reproducible results.

But yes: I was observing something similar on AMD EPYC, where you get
consecutive pages from the buddy, but once you allocate from the PCP it
might no longer be consecutive.

> - test is 5-10% slower when output is printed to terminal vs when redirected to
> file. I've always effectively been redirecting. Not sure if this overhead
> could start to dominate the regression and that's why you don't see it?

That's weird, because we don't print while measuring? Anyhow, 5/10%
variance on some system is not the end of the world.

>
> I'm inclined to run this test for the last N kernel releases and if the number
> moves around significantly, conclude that these tests don't really matter.
> Otherwise its an exercise in randomly refactoring code until it works well, but
> that's just overfitting to the compiler and hw. What do you think?

Personally, I wouldn't lose sleep if you see weird, unexplainable
behavior on some system (not even architecture!). Trying to optimize for
that would indeed be random refactorings.

But I would not be so fast to say that "these tests don't really matter"
and then go wild and degrade them as much as you want. There are use
cases that care about fork performance especially with order-0 pages --
such as Redis.

--
Cheers,

David / dhildenb


2024-01-31 13:00:30

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

>>
>> I'm also surprised about the dontneed vs. munmap numbers.
>
> You mean the ones for Altra that I posted? (I didn't post any for M2). The altra
> numbers look ok to me; dontneed has no change, and munmap has no change for
> order-0 and is massively improved for order-9.


I would expect that dontneed would similarly benefit -- same code path.
But I focused on munmap measurements for now, I'll try finding time to
confirm that it's the same on Intel.

--
Cheers,

David / dhildenb


2024-01-31 13:13:00

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

On 31/01/2024 11:49, Ryan Roberts wrote:
> On 31/01/2024 11:28, David Hildenbrand wrote:
>> On 31.01.24 12:16, Ryan Roberts wrote:
>>> On 31/01/2024 11:06, David Hildenbrand wrote:
>>>> On 31.01.24 11:43, Ryan Roberts wrote:
>>>>> On 29/01/2024 12:46, David Hildenbrand wrote:
>>>>>> Now that the rmap overhaul[1] is upstream that provides a clean interface
>>>>>> for rmap batching, let's implement PTE batching during fork when processing
>>>>>> PTE-mapped THPs.
>>>>>>
>>>>>> This series is partially based on Ryan's previous work[2] to implement
>>>>>> cont-pte support on arm64, but its a complete rewrite based on [1] to
>>>>>> optimize all architectures independent of any such PTE bits, and to
>>>>>> use the new rmap batching functions that simplify the code and prepare
>>>>>> for further rmap accounting changes.
>>>>>>
>>>>>> We collect consecutive PTEs that map consecutive pages of the same large
>>>>>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>>>>>> the refcount only once per batch, (b) call rmap handling functions only
>>>>>> once per batch and (c) perform batch PTE setting/updates.
>>>>>>
>>>>>> While this series should be beneficial for adding cont-pte support on
>>>>>> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
>>>>>> for large folios with minimal added overhead and further changes[4] that
>>>>>> build up on top of the total mapcount.
>>>>>>
>>>>>> Independent of all that, this series results in a speedup during fork with
>>>>>> PTE-mapped THP, which is the default with THPs that are smaller than a PMD
>>>>>> (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
>>>>>>
>>>>>> On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
>>>>>> of the same size (stddev < 1%) results in the following runtimes
>>>>>> for fork() (shorter is better):
>>>>>>
>>>>>> Folio Size | v6.8-rc1 |      New | Change
>>>>>> ------------------------------------------
>>>>>>         4KiB | 0.014328 | 0.014035 |   - 2%
>>>>>>        16KiB | 0.014263 | 0.01196  |   -16%
>>>>>>        32KiB | 0.014334 | 0.01094  |   -24%
>>>>>>        64KiB | 0.014046 | 0.010444 |   -26%
>>>>>>       128KiB | 0.014011 | 0.010063 |   -28%
>>>>>>       256KiB | 0.013993 | 0.009938 |   -29%
>>>>>>       512KiB | 0.013983 | 0.00985  |   -30%
>>>>>>      1024KiB | 0.013986 | 0.00982  |   -30%
>>>>>>      2048KiB | 0.014305 | 0.010076 |   -30%
>>>>>
>>>>> Just a heads up that I'm seeing some strange results on Apple M2. Fork for
>>>>> order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty
>>>>> sure I
>>>>> didn't see this problem with version 1; although that was on a different
>>>>> baseline and I've thrown the numbers away so will rerun and try to debug this.
>
> Numbers for v1 of the series, both on top of 6.8-rc1 and rebased to the same
> mm-unstable base as v3 of the series (first 2 rows are from what I just posted
> for context):
>
> | kernel | mean_rel | std_rel |
> |:-------------------|-----------:|----------:|
> | mm-unstabe (base) | 0.0% | 1.1% |
> | mm-unstable + v3 | 16.7% | 0.8% |
> | mm-unstable + v1 | -2.5% | 1.7% |
> | v6.8-rc1 + v1 | -6.6% | 1.1% |
>
> So all looks good with v1. And seems to suggest mm-unstable has regressed by ~4%
> vs v6.8-rc1. Is this really a useful benchmark? Does the raw performance of
> fork() syscall really matter? Evidence suggests its moving all over the place -
> breath on the code and it changes - not a great place to be when using the test
> for gating purposes!
>
> Still with the old tests - I'll move to the new ones now.
>
>
>>>>>
>>>>
>>>> So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe this.
>>>> fork() for order-0 was consistently effectively unchanged. Do you observe that
>>>> on other ARM systems as well?
>>>
>>> Nope; running the exact same kernel binary and user space on Altra, I see
>>> sensible numbers;
>>>
>>> fork order-0: -1.3%
>>> fork order-9: -7.6%
>>> dontneed order-0: -0.5%
>>> dontneed order-9: 0.1%
>>> munmap order-0: 0.0%
>>> munmap order-9: -67.9%
>>>
>>> So I guess some pipelining issue that causes the M2 to stall more?
>>
>> With one effective added folio_test_large(), it could only be a code layout
>> problem? Or the compiler does something stupid, but you say that you run the
>> exact same kernel binary, so that doesn't make sense.
>
> Yup, same binary. We know this code is very sensitive - 1 cycle makes a big
> difference. So could easily be code layout, branch prediction, etc...
>
>>
>> I'm also surprised about the dontneed vs. munmap numbers.
>
> You mean the ones for Altra that I posted? (I didn't post any for M2). The altra
> numbers look ok to me; dontneed has no change, and munmap has no change for
> order-0 and is massively improved for order-9.
>
> Doesn't make any sense
>> (again, there was this VMA merging problem but it would still allow for batching
>> within a single VMA that spans exactly one large folio).
>>
>> What are you using as baseline? Really just mm-unstable vs. mm-unstable+patches?
>
> yes. except for "v6.8-rc1 + v1" above.
>
>>
>> Let's see if the new test changes the numbers you measure.

Nope: looks the same. I've taken my test harness out of the picture and done
everything manually from the ground up, with the old tests and the new. Headline
is that I see similar numbers from both.

Some details:
- I'm running for 10 seconds then averaging the output
- test is bimodal; first run (of 10 seconds) after boot is a bit faster on
average (up to 10%) than the rest; I could guess this is due to the memory
being allocated more contiguously the first few times through, so struct
pages have better locality, but that's a guess.
- test is 5-10% slower when output is printed to terminal vs when redirected to
file. I've always effectively been redirecting. Not sure if this overhead
could start to dominate the regression and that's why you don't see it?

I'm inclined to run this test for the last N kernel releases and if the number
moves around significantly, conclude that these tests don't really matter.
Otherwise it's an exercise in randomly refactoring code until it works well, but
that's just overfitting to the compiler and hw. What do you think?

Thanks,
Ryan


2024-01-31 13:17:44

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

On 31/01/2024 12:56, David Hildenbrand wrote:
> On 31.01.24 13:37, Ryan Roberts wrote:
>> On 31/01/2024 11:49, Ryan Roberts wrote:
>>> On 31/01/2024 11:28, David Hildenbrand wrote:
>>>> On 31.01.24 12:16, Ryan Roberts wrote:
>>>>> On 31/01/2024 11:06, David Hildenbrand wrote:
>>>>>> On 31.01.24 11:43, Ryan Roberts wrote:
>>>>>>> On 29/01/2024 12:46, David Hildenbrand wrote:
>>>>>>>> Now that the rmap overhaul[1] is upstream that provides a clean interface
>>>>>>>> for rmap batching, let's implement PTE batching during fork when processing
>>>>>>>> PTE-mapped THPs.
>>>>>>>>
>>>>>>>> This series is partially based on Ryan's previous work[2] to implement
>>>>>>>> cont-pte support on arm64, but its a complete rewrite based on [1] to
>>>>>>>> optimize all architectures independent of any such PTE bits, and to
>>>>>>>> use the new rmap batching functions that simplify the code and prepare
>>>>>>>> for further rmap accounting changes.
>>>>>>>>
>>>>>>>> We collect consecutive PTEs that map consecutive pages of the same large
>>>>>>>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>>>>>>>> the refcount only once per batch, (b) call rmap handling functions only
>>>>>>>> once per batch and (c) perform batch PTE setting/updates.
>>>>>>>>
>>>>>>>> While this series should be beneficial for adding cont-pte support on
>>>>>>>> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
>>>>>>>> for large folios with minimal added overhead and further changes[4] that
>>>>>>>> build up on top of the total mapcount.
>>>>>>>>
>>>>>>>> Independent of all that, this series results in a speedup during fork with
>>>>>>>> PTE-mapped THP, which is the default with THPs that are smaller than a PMD
>>>>>>>> (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
>>>>>>>>
>>>>>>>> On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
>>>>>>>> of the same size (stddev < 1%) results in the following runtimes
>>>>>>>> for fork() (shorter is better):
>>>>>>>>
>>>>>>>> Folio Size | v6.8-rc1 |      New | Change
>>>>>>>> ------------------------------------------
>>>>>>>>          4KiB | 0.014328 | 0.014035 |   - 2%
>>>>>>>>         16KiB | 0.014263 | 0.01196  |   -16%
>>>>>>>>         32KiB | 0.014334 | 0.01094  |   -24%
>>>>>>>>         64KiB | 0.014046 | 0.010444 |   -26%
>>>>>>>>        128KiB | 0.014011 | 0.010063 |   -28%
>>>>>>>>        256KiB | 0.013993 | 0.009938 |   -29%
>>>>>>>>        512KiB | 0.013983 | 0.00985  |   -30%
>>>>>>>>       1024KiB | 0.013986 | 0.00982  |   -30%
>>>>>>>>       2048KiB | 0.014305 | 0.010076 |   -30%
>>>>>>>
>>>>>>> Just a heads up that I'm seeing some strange results on Apple M2. Fork for
>>>>>>> order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty
>>>>>>> sure I
>>>>>>> didn't see this problem with version 1; although that was on a different
>>>>>>> baseline and I've thrown the numbers away so will rerun and try to debug
>>>>>>> this.
>>>
>>> Numbers for v1 of the series, both on top of 6.8-rc1 and rebased to the same
>>> mm-unstable base as v3 of the series (first 2 rows are from what I just posted
>>> for context):
>>>
>>> | kernel             |   mean_rel |   std_rel |
>>> |:-------------------|-----------:|----------:|
>>> | mm-unstabe (base)  |       0.0% |      1.1% |
>>> | mm-unstable + v3   |      16.7% |      0.8% |
>>> | mm-unstable + v1   |      -2.5% |      1.7% |
>>> | v6.8-rc1 + v1      |      -6.6% |      1.1% |
>>>
>>> So all looks good with v1. And seems to suggest mm-unstable has regressed by ~4%
>>> vs v6.8-rc1. Is this really a useful benchmark? Does the raw performance of
>>> fork() syscall really matter? Evidence suggests its moving all over the place -
>>> breath on the code and it changes - not a great place to be when using the test
>>> for gating purposes!
>>>
>>> Still with the old tests - I'll move to the new ones now.
>>>
>>>
>>>>>>>
>>>>>>
>>>>>> So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe this.
>>>>>> fork() for order-0 was consistently effectively unchanged. Do you observe
>>>>>> that
>>>>>> on other ARM systems as well?
>>>>>
>>>>> Nope; running the exact same kernel binary and user space on Altra, I see
>>>>> sensible numbers;
>>>>>
>>>>> fork order-0: -1.3%
>>>>> fork order-9: -7.6%
>>>>> dontneed order-0: -0.5%
>>>>> dontneed order-9: 0.1%
>>>>> munmap order-0: 0.0%
>>>>> munmap order-9: -67.9%
>>>>>
>>>>> So I guess some pipelining issue that causes the M2 to stall more?
>>>>
>>>> With one effective added folio_test_large(), it could only be a code layout
>>>> problem? Or the compiler does something stupid, but you say that you run the
>>>> exact same kernel binary, so that doesn't make sense.
>>>
>>> Yup, same binary. We know this code is very sensitive - 1 cycle makes a big
>>> difference. So could easily be code layout, branch prediction, etc...
>>>
>>>>
>>>> I'm also surprised about the dontneed vs. munmap numbers.
>>>
>>> You mean the ones for Altra that I posted? (I didn't post any for M2). The altra
>>> numbers look ok to me; dontneed has no change, and munmap has no change for
>>> order-0 and is massively improved for order-9.
>>>
>>>   Doesn't make any sense
>>>> (again, there was this VMA merging problem but it would still allow for
>>>> batching
>>>> within a single VMA that spans exactly one large folio).
>>>>
>>>> What are you using as baseline? Really just mm-unstable vs.
>>>> mm-unstable+patches?
>>>
>>> yes. except for "v6.8-rc1 + v1" above.
>>>
>>>>
>>>> Let's see if the new test changes the numbers you measure.
>>
>> Nope: looks the same. I've taken my test harness out of the picture and done
>> everything manually from the ground up, with the old tests and the new. Headline
>> is that I see similar numbers from both.
>
> I took me a while to get really reproducible numbers on Intel. Most importantly:
> * Set a fixed CPU frequency, disabling any boost and avoiding any
>   thermal throttling.
> * Pin the test to CPUs and set a nice level.

I'm already pinning the test to cpu 0. But for M2, at least, I'm running in a VM
on top of macos, and I don't have a mechanism to pin the QEMU threads to the
physical CPUs. Anyway, I don't think these are problems because for a given
kernel build I can accurately repro numbers.

>
> Another thing is, to avoid systems where you can have NUMA effects within a
> single socket. Otherwise, memory access latency is just random and depends on
> what the buddy enjoys giving you.

Yep; same. M2 is 1 NUMA node. On Altra, I'm disabling the second NUMA node to
remove those effects.

>
> But you seem to get the same +17 even after reboots, so that indicates that the
> CPU is not happy about the code for some reason. And the weird thing is, that
> nothing significantly changed for order-0 folios between v1 and v3 that could
> explain any of this.
>
> I'm not worried about 5% or so, nobody cares. But it would be good to have at
> least an explanation why only that system shows +17%.

Yep understood.

>
>>
>> Some details:
>>   - I'm running for 10 seconds then averaging the output
>
> Same here.
>
>>   - test is bimodal; first run (of 10 seconds) after boot is a bit faster on
>>     average (up to 10%) than the rest; I could guess this is due to the memory
>>     being allocated more contiguously the first few times through, so struct
>>     pages have better locality, but that's a guess.
>
> I think it also has to do with the PCP lists, and the high-pcp auto tuning (I
> played with disabling that). Running on a freshly booted system gave me
> reproducible results.
>
> But yes: I was observing something similar on AMD EPYC, where you get
> consecutive pages from the buddy, but once you allocate from the PCP it might no
> longer be consecutive.
>
>>   - test is 5-10% slower when output is printed to terminal vs when redirected to
>>     file. I've always effectively been redirecting. Not sure if this overhead
>>     could start to dominate the regression and that's why you don't see it?
>
> That's weird, because we don't print while measuring? Anyhow, 5/10% variance on
> some system is not the end of the world.

I imagine it's cache effects? More work to do to print the output could be
evicting some code that's in the benchmark path?

>
>>
>> I'm inclined to run this test for the last N kernel releases and if the number
>> moves around significantly, conclude that these tests don't really matter.
>> Otherwise its an exercise in randomly refactoring code until it works well, but
>> that's just overfitting to the compiler and hw. What do you think?
>
> Personally, I wouldn't lose sleep if you see weird, unexplainable behavior on
> some system (not even architecture!). Trying to optimize for that would indeed
> be random refactorings.
>
> But I would not be so fast to say that "these tests don't really matter" and
> then go wild and degrade them as much as you want. There are use cases that care
> about fork performance especially with order-0 pages -- such as Redis.

Indeed. But also remember that my fork baseline time is ~2.5ms, and I think you
said yours was 14ms :)

I'll continue to mess around with it until the end of the day. But if I'm not
making any headway, then I'll change tack; I'll just measure the performance of
my contpte changes using your fork/zap stuff as the baseline and post based on that.


2024-01-31 13:58:23

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

On 31/01/2024 13:38, David Hildenbrand wrote:
>>>> Nope: looks the same. I've taken my test harness out of the picture and done
>>>> everything manually from the ground up, with the old tests and the new.
>>>> Headline
>>>> is that I see similar numbers from both.
>>>
>>> I took me a while to get really reproducible numbers on Intel. Most importantly:
>>> * Set a fixed CPU frequency, disabling any boost and avoiding any
>>>    thermal throttling.
>>> * Pin the test to CPUs and set a nice level.
>>
>> I'm already pinning the test to cpu 0. But for M2, at least, I'm running in a VM
>> on top of macos, and I don't have a mechanism to pin the QEMU threads to the
>> physical CPUs. Anyway, I don't think these are problems because for a given
>> kernel build I can accurately repro numbers.
>
> Oh, you do have a layer of virtualization in there. I *suspect* that might
> amplify some odd things regarding code layout, caching effects, etc.
>
> I guess especially the fork() benchmark is too sensible (fast) for things like
> that, so I would just focus on bare metal results where you can control the
> environment completely.

Yeah, maybe. OK I'll park M2 for now.

>
> Note that regarding NUMA effects, I mean when some memory access within the same
> socket is faster/slower even with only a single node. On AMD EPYC that's
> possible, depending on which core you are running and on which memory controller
> the memory you want to access is located. If both are in different quadrants
> IIUC, the access latency will be different.

I've configured the NUMA to only bring the RAM and CPUs for a single socket
online, so I shouldn't be seeing any of these effects. Anyway, I've been using
the Altra as a secondary because its so much slower than the M2. Let me move
over to it and see if everything looks more straightforward there.

>
>>> But yes: I was observing something similar on AMD EPYC, where you get
>>> consecutive pages from the buddy, but once you allocate from the PCP it might no
>>> longer be consecutive.
>>>
>>>>    - test is 5-10% slower when output is printed to terminal vs when
>>>> redirected to
>>>>      file. I've always effectively been redirecting. Not sure if this overhead
>>>>      could start to dominate the regression and that's why you don't see it?
>>>
>>> That's weird, because we don't print while measuring? Anyhow, 5/10% variance on
>>> some system is not the end of the world.
>>
>> I imagine its cache effects? More work to do to print the output could be
>> evicting some code that's in the benchmark path?
>
> Maybe. Do you also see these oddities on the bare metal system?
>
>>
>>>
>>>>
>>>> I'm inclined to run this test for the last N kernel releases and if the number
>>>> moves around significantly, conclude that these tests don't really matter.
>>>> Otherwise its an exercise in randomly refactoring code until it works well, but
>>>> that's just overfitting to the compiler and hw. What do you think?
>>>
>>> Personally, I wouldn't lose sleep if you see weird, unexplainable behavior on
>>> some system (not even architecture!). Trying to optimize for that would indeed
>>> be random refactorings.
>>>
>>> But I would not be so fast to say that "these tests don't really matter" and
>>> then go wild and degrade them as much as you want. There are use cases that care
>>> about fork performance especially with order-0 pages -- such as Redis.
>>
>> Indeed. But also remember that my fork baseline time is ~2.5ms, and I think you
>> said yours was 14ms :)
>
> Yes, no idea why M2 is that fast (BTW, which page size? 4k or 16k? ) :)

The guest kernel is using 4K pages. I'm not quite sure what is happening at
stage2; QEMU doesn't expose any options to explicitly request huge pages for
macos AFAICT.

>
>>
>> I'll continue to mess around with it until the end of the day. But I'm not
>> making any headway, then I'll change tack; I'll just measure the performance of
>> my contpte changes using your fork/zap stuff as the baseline and post based on
>> that.
>
> You should likely not focus on M2 results. Just pick a representative bare metal
> machine where you get consistent, explainable results.
>
> Nothing in the code is fine-tuned for a particular architecture so far, only
> order-0 handling is kept separate.
>
> BTW: I see the exact same speedups for dontneed that I see for munmap. For
> example, for order-9, it goes from 0.023412s -> 0.009785, so -58%. So I'm
> curious why you see a speedup for munmap but not for dontneed.

Ugh... ok, coming up.


2024-01-31 14:08:59

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

>>> Nope: looks the same. I've taken my test harness out of the picture and done
>>> everything manually from the ground up, with the old tests and the new. Headline
>>> is that I see similar numbers from both.
>>
>> I took me a while to get really reproducible numbers on Intel. Most importantly:
>> * Set a fixed CPU frequency, disabling any boost and avoiding any
>>   thermal throttling.
>> * Pin the test to CPUs and set a nice level.
>
> I'm already pinning the test to cpu 0. But for M2, at least, I'm running in a VM
> on top of macos, and I don't have a mechanism to pin the QEMU threads to the
> physical CPUs. Anyway, I don't think these are problems because for a given
> kernel build I can accurately repro numbers.

Oh, you do have a layer of virtualization in there. I *suspect* that
might amplify some odd things regarding code layout, caching effects, etc.

I guess especially the fork() benchmark is too sensitive (fast) for
things like that, so I would just focus on bare metal results where you
can control the environment completely.

Note that regarding NUMA effects, I mean when some memory access within
the same socket is faster/slower even with only a single node. On AMD
EPYC that's possible, depending on which core you are running and on
which memory controller the memory you want to access is located. If
both are in different quadrants IIUC, the access latency will be different.

>> But yes: I was observing something similar on AMD EPYC, where you get
>> consecutive pages from the buddy, but once you allocate from the PCP it might no
>> longer be consecutive.
>>
>>>   - test is 5-10% slower when output is printed to terminal vs when redirected to
>>>     file. I've always effectively been redirecting. Not sure if this overhead
>>>     could start to dominate the regression and that's why you don't see it?
>>
>> That's weird, because we don't print while measuring? Anyhow, 5/10% variance on
>> some system is not the end of the world.
>
> I imagine its cache effects? More work to do to print the output could be
> evicting some code that's in the benchmark path?

Maybe. Do you also see these oddities on the bare metal system?

>
>>
>>>
>>> I'm inclined to run this test for the last N kernel releases and if the number
>>> moves around significantly, conclude that these tests don't really matter.
>>> Otherwise it's an exercise in randomly refactoring code until it works well, but
>>> that's just overfitting to the compiler and hw. What do you think?
>>
>> Personally, I wouldn't lose sleep if you see weird, unexplainable behavior on
>> some system (not even architecture!). Trying to optimize for that would indeed
>> be random refactorings.
>>
>> But I would not be so fast to say that "these tests don't really matter" and
>> then go wild and degrade them as much as you want. There are use cases that care
>> about fork performance especially with order-0 pages -- such as Redis.
>
> Indeed. But also remember that my fork baseline time is ~2.5ms, and I think you
> said yours was 14ms :)

Yes, no idea why M2 is that fast (BTW, which page size? 4k or 16k?) :)

>
> I'll continue to mess around with it until the end of the day. But if I'm not
> making any headway by then, I'll change tack; I'll just measure the performance of
> my contpte changes using your fork/zap stuff as the baseline and post based on that.

You should likely not focus on M2 results. Just pick a representative
bare metal machine where you get consistent, explainable results.

Nothing in the code is fine-tuned for a particular architecture so far,
only order-0 handling is kept separate.

BTW: I see the exact same speedups for dontneed that I see for munmap.
For example, for order-9, it goes from 0.023412s -> 0.009785s, so -58%.
So I'm curious why you see a speedup for munmap but not for dontneed.

--
Cheers,

David / dhildenb


2024-01-31 14:30:13

by David Hildenbrand

Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

>> Note that regarding NUMA effects, I mean when some memory access within the same
>> socket is faster/slower even with only a single node. On AMD EPYC that's
>> possible, depending on which core you are running and on which memory controller
>> the memory you want to access is located. If both are in different quadrants
>> IIUC, the access latency will be different.
>
> I've configured the NUMA to only bring the RAM and CPUs for a single socket
> online, so I shouldn't be seeing any of these effects. Anyway, I've been using
> the Altra as a secondary because it's so much slower than the M2. Let me move
> over to it and see if everything looks more straightforward there.

Better use a system where people will actually run Linux production
workloads on, even if it is slower :)

[...]

>>>
>>> I'll continue to mess around with it until the end of the day. But I'm not
>>> making any headway, then I'll change tack; I'll just measure the performance of
>>> my contpte changes using your fork/zap stuff as the baseline and post based on
>>> that.
>>
>> You should likely not focus on M2 results. Just pick a representative bare metal
>> machine where you get consistent, explainable results.
>>
>> Nothing in the code is fine-tuned for a particular architecture so far, only
>> order-0 handling is kept separate.
>>
>> BTW: I see the exact same speedups for dontneed that I see for munmap. For
>> example, for order-9, it goes from 0.023412s -> 0.009785, so -58%. So I'm
>> curious why you see a speedup for munmap but not for dontneed.
>
> Ugh... ok, coming up.

Hopefully you were just staring at the wrong numbers (e.g., only with
fork patches). Because both (munmap/pte-dontneed) are using the exact
same code path.

--
Cheers,

David / dhildenb


2024-01-31 15:11:18

by Ryan Roberts

Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

On 31/01/2024 15:05, David Hildenbrand wrote:
> On 31.01.24 16:02, Ryan Roberts wrote:
>> On 31/01/2024 14:29, David Hildenbrand wrote:
>>>>> Note that regarding NUMA effects, I mean when some memory access within the
>>>>> same
>>>>> socket is faster/slower even with only a single node. On AMD EPYC that's
>>>>> possible, depending on which core you are running and on which memory
>>>>> controller
>>>>> the memory you want to access is located. If both are in different quadrants
>>>>> IIUC, the access latency will be different.
>>>>
>>>> I've configured the NUMA to only bring the RAM and CPUs for a single socket
>>>> online, so I shouldn't be seeing any of these effects. Anyway, I've been using
>>>> the Altra as a secondary because its so much slower than the M2. Let me move
>>>> over to it and see if everything looks more straightforward there.
>>>
>>> Better use a system where people will actually run Linux production workloads
>>> on, even if it is slower :)
>>>
>>> [...]
>>>
>>>>>>
>>>>>> I'll continue to mess around with it until the end of the day. But I'm not
>>>>>> making any headway, then I'll change tack; I'll just measure the
>>>>>> performance of
>>>>>> my contpte changes using your fork/zap stuff as the baseline and post
>>>>>> based on
>>>>>> that.
>>>>>
>>>>> You should likely not focus on M2 results. Just pick a representative bare
>>>>> metal
>>>>> machine where you get consistent, explainable results.
>>>>>
>>>>> Nothing in the code is fine-tuned for a particular architecture so far, only
>>>>> order-0 handling is kept separate.
>>>>>
>>>>> BTW: I see the exact same speedups for dontneed that I see for munmap. For
>>>>> example, for order-9, it goes from 0.023412s -> 0.009785, so -58%. So I'm
>>>>> curious why you see a speedup for munmap but not for dontneed.
>>>>
>>>> Ugh... ok, coming up.
>>>
>>> Hopefully you were just staring at the wrong numbers (e.g., only with fork
>>> patches). Because both (munmap/pte-dontneed) are using the exact same code path.
>>>
>>
>> Ahh... I'm doing pte-dontneed, which is the only option in your original
>> benchmark - it does MADV_DONTNEED one page at a time. It looks like your new
>> benchmark has an additional "dontneed" option that does it in one shot. Which
>> option are you running? Assuming the latter, I think that explains it.
>
> I temporarily removed that option and then re-added it. Guess you got a wrong
> snapshot of the benchmark :D
>
> pte-dontneed not observing any change is great (no batching possible).

indeed.

>
> dontneed should hopefully/likely see a speedup.

Yes, but that's almost exactly the same path as munmap, so I'm not sure it
really adds much for this particular series. Anyway, on Altra at least, I'm
seeing no regressions, so:

Tested-by: Ryan Roberts <[email protected]>

>
> Great!
>


2024-01-31 15:12:26

by David Hildenbrand

Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

>> dontneed should hopefully/likely see a speedup.
>
> Yes, but that's almost exactly the same path as munmap, so I'm sure it really
> adds much for this particular series.

Right, that's why I'm not including these measurements. dontneed vs.
munmap is more about measuring the overhead of VMA modifications + page
table reclaim.

> Anyway, on Altra at least, I'm seeing no
> regressions, so:
>
> Tested-by: Ryan Roberts <[email protected]>
>

Thanks!

--
Cheers,

David / dhildenb


2024-01-31 15:15:46

by Ryan Roberts

Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

On 31/01/2024 14:29, David Hildenbrand wrote:
>>> Note that regarding NUMA effects, I mean when some memory access within the same
>>> socket is faster/slower even with only a single node. On AMD EPYC that's
>>> possible, depending on which core you are running and on which memory controller
>>> the memory you want to access is located. If both are in different quadrants
>>> IIUC, the access latency will be different.
>>
>> I've configured the NUMA to only bring the RAM and CPUs for a single socket
>> online, so I shouldn't be seeing any of these effects. Anyway, I've been using
>> the Altra as a secondary because its so much slower than the M2. Let me move
>> over to it and see if everything looks more straightforward there.
>
> Better use a system where people will actually run Linux production workloads
> on, even if it is slower :)
>
> [...]
>
>>>>
>>>> I'll continue to mess around with it until the end of the day. But I'm not
>>>> making any headway, then I'll change tack; I'll just measure the performance of
>>>> my contpte changes using your fork/zap stuff as the baseline and post based on
>>>> that.
>>>
>>> You should likely not focus on M2 results. Just pick a representative bare metal
>>> machine where you get consistent, explainable results.
>>>
>>> Nothing in the code is fine-tuned for a particular architecture so far, only
>>> order-0 handling is kept separate.
>>>
>>> BTW: I see the exact same speedups for dontneed that I see for munmap. For
>>> example, for order-9, it goes from 0.023412s -> 0.009785, so -58%. So I'm
>>> curious why you see a speedup for munmap but not for dontneed.
>>
>> Ugh... ok, coming up.
>
> Hopefully you were just staring at the wrong numbers (e.g., only with fork
> patches). Because both (munmap/pte-dontneed) are using the exact same code path.
>

Ahh... I'm doing pte-dontneed, which is the only option in your original
benchmark - it does MADV_DONTNEED one page at a time. It looks like your new
benchmark has an additional "dontneed" option that does it in one shot. Which
option are you running? Assuming the latter, I think that explains it.

2024-01-31 15:35:38

by David Hildenbrand

Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

On 31.01.24 16:02, Ryan Roberts wrote:
> On 31/01/2024 14:29, David Hildenbrand wrote:
>>>> Note that regarding NUMA effects, I mean when some memory access within the same
>>>> socket is faster/slower even with only a single node. On AMD EPYC that's
>>>> possible, depending on which core you are running and on which memory controller
>>>> the memory you want to access is located. If both are in different quadrants
>>>> IIUC, the access latency will be different.
>>>
>>> I've configured the NUMA to only bring the RAM and CPUs for a single socket
>>> online, so I shouldn't be seeing any of these effects. Anyway, I've been using
>>> the Altra as a secondary because its so much slower than the M2. Let me move
>>> over to it and see if everything looks more straightforward there.
>>
>> Better use a system where people will actually run Linux production workloads
>> on, even if it is slower :)
>>
>> [...]
>>
>>>>>
>>>>> I'll continue to mess around with it until the end of the day. But I'm not
>>>>> making any headway, then I'll change tack; I'll just measure the performance of
>>>>> my contpte changes using your fork/zap stuff as the baseline and post based on
>>>>> that.
>>>>
>>>> You should likely not focus on M2 results. Just pick a representative bare metal
>>>> machine where you get consistent, explainable results.
>>>>
>>>> Nothing in the code is fine-tuned for a particular architecture so far, only
>>>> order-0 handling is kept separate.
>>>>
>>>> BTW: I see the exact same speedups for dontneed that I see for munmap. For
>>>> example, for order-9, it goes from 0.023412s -> 0.009785, so -58%. So I'm
>>>> curious why you see a speedup for munmap but not for dontneed.
>>>
>>> Ugh... ok, coming up.
>>
>> Hopefully you were just staring at the wrong numbers (e.g., only with fork
>> patches). Because both (munmap/pte-dontneed) are using the exact same code path.
>>
>
> Ahh... I'm doing pte-dontneed, which is the only option in your original
> benchmark - it does MADV_DONTNEED one page at a time. It looks like your new
> benchmark has an additional "dontneed" option that does it in one shot. Which
> option are you running? Assuming the latter, I think that explains it.

I temporarily removed that option and then re-added it. Guess you got a
wrong snapshot of the benchmark :D

pte-dontneed not observing any change is great (no batching possible).

dontneed should hopefully/likely see a speedup.

Great!

--
Cheers,

David / dhildenb


2024-02-08 06:11:34

by Mike Rapoport

Subject: Re: [PATCH v3 01/15] arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary

On Mon, Jan 29, 2024 at 01:46:35PM +0100, David Hildenbrand wrote:
> From: Ryan Roberts <[email protected]>
>
> Since the high bits [51:48] of an OA are not stored contiguously in the
> PTE, there is a theoretical bug in set_ptes(), which just adds PAGE_SIZE
> to the pte to get the pte with the next pfn. This works until the pfn
> crosses the 48-bit boundary, at which point we overflow into the upper
> attributes.
>
> Of course one could argue (and Matthew Wilcox has :) that we will never
> see a folio cross this boundary because we only allow naturally aligned
> power-of-2 allocation, so this would require a half-petabyte folio. So
> it's only a theoretical bug. But it's better that the code is robust
> regardless.
>
> I've implemented pte_next_pfn() as part of the fix, which is an opt-in
> core-mm interface. So that is now available to the core-mm, which will
> be needed shortly to support forthcoming fork()-batching optimizations.
>
> Link: https://lkml.kernel.org/r/[email protected]
> Fixes: 4a169d61c2ed ("arm64: implement the new page table range API")
> Closes: https://lore.kernel.org/linux-mm/[email protected]/
> Signed-off-by: Ryan Roberts <[email protected]>
> Reviewed-by: Catalin Marinas <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> arch/arm64/include/asm/pgtable.h | 28 +++++++++++++++++-----------
> 1 file changed, 17 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index b50270107e2f..9428801c1040 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -341,6 +341,22 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
> mte_sync_tags(pte, nr_pages);
> }
>
> +/*
> + * Select all bits except the pfn
> + */
> +static inline pgprot_t pte_pgprot(pte_t pte)
> +{
> + unsigned long pfn = pte_pfn(pte);
> +
> + return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
> +}
> +
> +#define pte_next_pfn pte_next_pfn
> +static inline pte_t pte_next_pfn(pte_t pte)
> +{
> + return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
> +}
> +
> static inline void set_ptes(struct mm_struct *mm,
> unsigned long __always_unused addr,
> pte_t *ptep, pte_t pte, unsigned int nr)
> @@ -354,7 +370,7 @@ static inline void set_ptes(struct mm_struct *mm,
> if (--nr == 0)
> break;
> ptep++;
> - pte_val(pte) += PAGE_SIZE;
> + pte = pte_next_pfn(pte);
> }
> }
> #define set_ptes set_ptes
> @@ -433,16 +449,6 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
> return clear_pte_bit(pte, __pgprot(PTE_SWP_EXCLUSIVE));
> }
>
> -/*
> - * Select all bits except the pfn
> - */
> -static inline pgprot_t pte_pgprot(pte_t pte)
> -{
> - unsigned long pfn = pte_pfn(pte);
> -
> - return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
> -}
> -
> #ifdef CONFIG_NUMA_BALANCING
> /*
> * See the comment in include/linux/pgtable.h
> --
> 2.43.0
>
>

--
Sincerely yours,
Mike.

2024-02-08 06:12:29

by Mike Rapoport

Subject: Re: [PATCH v3 02/15] arm/pgtable: define PFN_PTE_SHIFT

On Mon, Jan 29, 2024 at 01:46:36PM +0100, David Hildenbrand wrote:
> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
> simply define PFN_PTE_SHIFT, required by pte_next_pfn().
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> arch/arm/include/asm/pgtable.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
> index d657b84b6bf7..be91e376df79 100644
> --- a/arch/arm/include/asm/pgtable.h
> +++ b/arch/arm/include/asm/pgtable.h
> @@ -209,6 +209,8 @@ static inline void __sync_icache_dcache(pte_t pteval)
> extern void __sync_icache_dcache(pte_t pteval);
> #endif
>
> +#define PFN_PTE_SHIFT PAGE_SHIFT
> +
> void set_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, pte_t pteval, unsigned int nr);
> #define set_ptes set_ptes
> --
> 2.43.0
>
>

--
Sincerely yours,
Mike.

2024-02-08 06:13:19

by Mike Rapoport

Subject: Re: [PATCH v3 03/15] nios2/pgtable: define PFN_PTE_SHIFT

On Mon, Jan 29, 2024 at 01:46:37PM +0100, David Hildenbrand wrote:
> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
> simply define PFN_PTE_SHIFT, required by pte_next_pfn().
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> arch/nios2/include/asm/pgtable.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
> index 5144506dfa69..d052dfcbe8d3 100644
> --- a/arch/nios2/include/asm/pgtable.h
> +++ b/arch/nios2/include/asm/pgtable.h
> @@ -178,6 +178,8 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
> *ptep = pteval;
> }
>
> +#define PFN_PTE_SHIFT 0
> +
> static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, pte_t pte, unsigned int nr)
> {
> --
> 2.43.0
>
>

--
Sincerely yours,
Mike.

2024-02-08 06:14:18

by Mike Rapoport

Subject: Re: [PATCH v3 04/15] powerpc/pgtable: define PFN_PTE_SHIFT

On Mon, Jan 29, 2024 at 01:46:38PM +0100, David Hildenbrand wrote:
> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
> simply define PFN_PTE_SHIFT, required by pte_next_pfn().
>
> Reviewed-by: Christophe Leroy <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> arch/powerpc/include/asm/pgtable.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
> index 9224f23065ff..7a1ba8889aea 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -41,6 +41,8 @@ struct mm_struct;
>
> #ifndef __ASSEMBLY__
>
> +#define PFN_PTE_SHIFT PTE_RPN_SHIFT
> +
> void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
> pte_t pte, unsigned int nr);
> #define set_ptes set_ptes
> --
> 2.43.0
>
>

--
Sincerely yours,
Mike.

2024-02-08 06:15:38

by Mike Rapoport

Subject: Re: [PATCH v3 05/15] riscv/pgtable: define PFN_PTE_SHIFT

On Mon, Jan 29, 2024 at 01:46:39PM +0100, David Hildenbrand wrote:
> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
> simply define PFN_PTE_SHIFT, required by pte_next_pfn().
>
> Reviewed-by: Alexandre Ghiti <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> arch/riscv/include/asm/pgtable.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 0c94260b5d0c..add5cd30ab34 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -523,6 +523,8 @@ static inline void __set_pte_at(pte_t *ptep, pte_t pteval)
> set_pte(ptep, pteval);
> }
>
> +#define PFN_PTE_SHIFT _PAGE_PFN_SHIFT
> +
> static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, pte_t pteval, unsigned int nr)
> {
> --
> 2.43.0
>
>

--
Sincerely yours,
Mike.

2024-02-08 06:16:02

by Mike Rapoport

Subject: Re: [PATCH v3 06/15] s390/pgtable: define PFN_PTE_SHIFT

On Mon, Jan 29, 2024 at 01:46:40PM +0100, David Hildenbrand wrote:
> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
> simply define PFN_PTE_SHIFT, required by pte_next_pfn().
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> arch/s390/include/asm/pgtable.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 1299b56e43f6..4b91e65c85d9 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1316,6 +1316,8 @@ pgprot_t pgprot_writecombine(pgprot_t prot);
> #define pgprot_writethrough pgprot_writethrough
> pgprot_t pgprot_writethrough(pgprot_t prot);
>
> +#define PFN_PTE_SHIFT PAGE_SHIFT
> +
> /*
> * Set multiple PTEs to consecutive pages with a single call. All PTEs
> * are within the same folio, PMD and VMA.
> --
> 2.43.0
>
>

--
Sincerely yours,
Mike.

2024-02-08 06:19:40

by Mike Rapoport

Subject: Re: [PATCH v3 07/15] sparc/pgtable: define PFN_PTE_SHIFT

On Mon, Jan 29, 2024 at 01:46:41PM +0100, David Hildenbrand wrote:
> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
> simply define PFN_PTE_SHIFT, required by pte_next_pfn().
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> arch/sparc/include/asm/pgtable_64.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> index a8c871b7d786..652af9d63fa2 100644
> --- a/arch/sparc/include/asm/pgtable_64.h
> +++ b/arch/sparc/include/asm/pgtable_64.h
> @@ -929,6 +929,8 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
> maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT);
> }
>
> +#define PFN_PTE_SHIFT PAGE_SHIFT
> +
> static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, pte_t pte, unsigned int nr)
> {
> --
> 2.43.0
>
>

--
Sincerely yours,
Mike.

2024-02-08 06:20:04

by Mike Rapoport

Subject: Re: [PATCH v3 08/15] mm/pgtable: make pte_next_pfn() independent of set_ptes()

On Mon, Jan 29, 2024 at 01:46:42PM +0100, David Hildenbrand wrote:
> Let's provide pte_next_pfn(), independently of set_ptes(). This allows for
> using the generic pte_next_pfn() version in some arch-specific set_ptes()
> implementations, and prepares for reusing pte_next_pfn() in other contexts.
>
> Reviewed-by: Christophe Leroy <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> include/linux/pgtable.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index f6d0e3513948..351cd9dc7194 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -212,7 +212,6 @@ static inline int pmd_dirty(pmd_t pmd)
> #define arch_flush_lazy_mmu_mode() do {} while (0)
> #endif
>
> -#ifndef set_ptes
>
> #ifndef pte_next_pfn
> static inline pte_t pte_next_pfn(pte_t pte)
> @@ -221,6 +220,7 @@ static inline pte_t pte_next_pfn(pte_t pte)
> }
> #endif
>
> +#ifndef set_ptes
> /**
> * set_ptes - Map consecutive pages to a contiguous range of addresses.
> * @mm: Address space to map the pages into.
> --
> 2.43.0
>
>

--
Sincerely yours,
Mike.

2024-02-08 06:21:35

by Mike Rapoport

Subject: Re: [PATCH v3 10/15] powerpc/mm: use pte_next_pfn() in set_ptes()

On Mon, Jan 29, 2024 at 01:46:44PM +0100, David Hildenbrand wrote:
> Let's use our handy new helper. Note that the implementation is slightly
> different, but shouldn't really make a difference in practice.
>
> Reviewed-by: Christophe Leroy <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> arch/powerpc/mm/pgtable.c | 5 +----
> 1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index a04ae4449a02..549a440ed7f6 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -220,10 +220,7 @@ void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
> break;
> ptep++;
> addr += PAGE_SIZE;
> - /*
> - * increment the pfn.
> - */
> - pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
> + pte = pte_next_pfn(pte);
> }
> }
>
> --
> 2.43.0
>
>

--
Sincerely yours,
Mike.

2024-02-08 06:23:50

by Mike Rapoport

Subject: Re: [PATCH v3 09/15] arm/mm: use pte_next_pfn() in set_ptes()

On Mon, Jan 29, 2024 at 01:46:43PM +0100, David Hildenbrand wrote:
> Let's use our handy helper now that it's available on all archs.
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> arch/arm/mm/mmu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
> index 674ed71573a8..c24e29c0b9a4 100644
> --- a/arch/arm/mm/mmu.c
> +++ b/arch/arm/mm/mmu.c
> @@ -1814,6 +1814,6 @@ void set_ptes(struct mm_struct *mm, unsigned long addr,
> if (--nr == 0)
> break;
> ptep++;
> - pte_val(pteval) += PAGE_SIZE;
> + pteval = pte_next_pfn(pteval);
> }
> }
> --
> 2.43.0
>
>

--
Sincerely yours,
Mike.

2024-02-08 06:29:57

by Mike Rapoport

Subject: Re: [PATCH v3 11/15] mm/memory: factor out copying the actual PTE in copy_present_pte()

On Mon, Jan 29, 2024 at 01:46:45PM +0100, David Hildenbrand wrote:
> Let's prepare for further changes.
>
> Reviewed-by: Ryan Roberts <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> mm/memory.c | 63 ++++++++++++++++++++++++++++-------------------------
> 1 file changed, 33 insertions(+), 30 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 8d14ba440929..a3bdb25f4c8d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -930,6 +930,29 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
> return 0;
> }
>
> +static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
> + struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte,
> + pte_t pte, unsigned long addr)
> +{
> + struct mm_struct *src_mm = src_vma->vm_mm;
> +
> + /* If it's a COW mapping, write protect it in both processes. */
> + if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
> + ptep_set_wrprotect(src_mm, addr, src_pte);
> + pte = pte_wrprotect(pte);
> + }
> +
> + /* If it's a shared mapping, mark it clean in the child. */
> + if (src_vma->vm_flags & VM_SHARED)
> + pte = pte_mkclean(pte);
> + pte = pte_mkold(pte);
> +
> + if (!userfaultfd_wp(dst_vma))
> + pte = pte_clear_uffd_wp(pte);
> +
> + set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> +}
> +
> /*
> * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
> * is required to copy this pte.
> @@ -939,23 +962,23 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> struct folio **prealloc)
> {
> - struct mm_struct *src_mm = src_vma->vm_mm;
> - unsigned long vm_flags = src_vma->vm_flags;
> pte_t pte = ptep_get(src_pte);
> struct page *page;
> struct folio *folio;
>
> page = vm_normal_page(src_vma, addr, pte);
> - if (page)
> - folio = page_folio(page);
> - if (page && folio_test_anon(folio)) {
> + if (unlikely(!page))
> + goto copy_pte;
> +
> + folio = page_folio(page);
> + folio_get(folio);
> + if (folio_test_anon(folio)) {
> /*
> * If this page may have been pinned by the parent process,
> * copy the page immediately for the child so that we'll always
> * guarantee the pinned page won't be randomly replaced in the
> * future.
> */
> - folio_get(folio);
> if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, src_vma))) {
> /* Page may be pinned, we have to copy. */
> folio_put(folio);
> @@ -963,34 +986,14 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> addr, rss, prealloc, page);
> }
> rss[MM_ANONPAGES]++;
> - } else if (page) {
> - folio_get(folio);
> + VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
> + } else {
> folio_dup_file_rmap_pte(folio, page);
> rss[mm_counter_file(folio)]++;
> }
>
> - /*
> - * If it's a COW mapping, write protect it both
> - * in the parent and the child
> - */
> - if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> - ptep_set_wrprotect(src_mm, addr, src_pte);
> - pte = pte_wrprotect(pte);
> - }
> - VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
> -
> - /*
> - * If it's a shared mapping, mark it clean in
> - * the child
> - */
> - if (vm_flags & VM_SHARED)
> - pte = pte_mkclean(pte);
> - pte = pte_mkold(pte);
> -
> - if (!userfaultfd_wp(dst_vma))
> - pte = pte_clear_uffd_wp(pte);
> -
> - set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> +copy_pte:
> + __copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, pte, addr);
> return 0;
> }
>
> --
> 2.43.0
>
>

--
Sincerely yours,
Mike.

2024-02-08 06:30:53

by Mike Rapoport

Subject: Re: [PATCH v3 12/15] mm/memory: pass PTE to copy_present_pte()

On Mon, Jan 29, 2024 at 01:46:46PM +0100, David Hildenbrand wrote:
> We already read it, let's just forward it.
>
> This patch is based on work by Ryan Roberts.
>
> Reviewed-by: Ryan Roberts <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> mm/memory.c | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a3bdb25f4c8d..41b24da5be38 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -959,10 +959,9 @@ static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
> */
> static inline int
> copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> - pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> - struct folio **prealloc)
> + pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
> + int *rss, struct folio **prealloc)
> {
> - pte_t pte = ptep_get(src_pte);
> struct page *page;
> struct folio *folio;
>
> @@ -1103,7 +1102,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> }
> /* copy_present_pte() will clear `*prealloc' if consumed */
> ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
> - addr, rss, &prealloc);
> + ptent, addr, rss, &prealloc);
> /*
> * If we need a pre-allocated page for this pte, drop the
> * locks, allocate, and try again.
> --
> 2.43.0
>
>

--
Sincerely yours,
Mike.

2024-02-08 06:42:31

by Mike Rapoport

Subject: Re: [PATCH v3 13/15] mm/memory: optimize fork() with PTE-mapped THP

On Mon, Jan 29, 2024 at 01:46:47PM +0100, David Hildenbrand wrote:
> Let's implement PTE batching when consecutive (present) PTEs map
> consecutive pages of the same large folio, and all other PTE bits besides
> the PFNs are equal.
>
> We will optimize folio_pte_batch() separately, to ignore selected
> PTE bits. This patch is based on work by Ryan Roberts.
>
> Use __always_inline for __copy_present_ptes() and keep the handling for
> single PTEs completely separate from the multi-PTE case: we really want
> the compiler to optimize for the single-PTE case with small folios, to
> not degrade performance.
>
> Note that PTE batching will never exceed a single page table and will
> always stay within VMA boundaries.
>
> Further, processing PTE-mapped THP that may be pinned and have
> PageAnonExclusive set on at least one subpage should work as expected,
> but there is room for improvement: We will repeatedly (1) detect a PTE
> batch (2) detect that we have to copy a page (3) fall back and allocate a
> single page to copy a single page. For now we won't care as pinned pages
> are a corner case, and we should rather look into maintaining only a
> single PageAnonExclusive bit for large folios.
>
> Reviewed-by: Ryan Roberts <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Mike Rapoport (IBM) <[email protected]>

> ---
> include/linux/pgtable.h | 31 +++++++++++
> mm/memory.c | 112 +++++++++++++++++++++++++++++++++-------
> 2 files changed, 124 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 351cd9dc7194..aab227e12493 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -650,6 +650,37 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
> }
> #endif
>
> +#ifndef wrprotect_ptes
> +/**
> + * wrprotect_ptes - Write-protect PTEs that map consecutive pages of the same
> + * folio.
> + * @mm: Address space the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to write-protect.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_set_wrprotect().
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock. The PTEs map consecutive
> + * pages that belong to the same folio. The PTEs are all in the same PMD.
> + */
> +static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr)
> +{
> + for (;;) {
> + ptep_set_wrprotect(mm, addr, ptep);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + }
> +}
> +#endif
> +
> /*
> * On some architectures hardware does not set page access bit when accessing
> * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/memory.c b/mm/memory.c
> index 41b24da5be38..86f8a0021c8e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -930,15 +930,15 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
> return 0;
> }
>
> -static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
> +static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
> struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte,
> - pte_t pte, unsigned long addr)
> + pte_t pte, unsigned long addr, int nr)
> {
> struct mm_struct *src_mm = src_vma->vm_mm;
>
> /* If it's a COW mapping, write protect it both processes. */
> if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
> - ptep_set_wrprotect(src_mm, addr, src_pte);
> + wrprotect_ptes(src_mm, addr, src_pte, nr);
> pte = pte_wrprotect(pte);
> }
>
> @@ -950,26 +950,93 @@ static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
> if (!userfaultfd_wp(dst_vma))
> pte = pte_clear_uffd_wp(pte);
>
> - set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> + set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
> +}
> +
> +/*
> + * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> + * pages of the same folio.
> + *
> + * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
> + */
> +static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> + pte_t *start_ptep, pte_t pte, int max_nr)
> +{
> + unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
> + const pte_t *end_ptep = start_ptep + max_nr;
> + pte_t expected_pte = pte_next_pfn(pte);
> + pte_t *ptep = start_ptep + 1;
> +
> + VM_WARN_ON_FOLIO(!pte_present(pte), folio);
> +
> + while (ptep != end_ptep) {
> + pte = ptep_get(ptep);
> +
> + if (!pte_same(pte, expected_pte))
> + break;
> +
> + /*
> + * Stop immediately once we reached the end of the folio. In
> + * corner cases the next PFN might fall into a different
> + * folio.
> + */
> + if (pte_pfn(pte) == folio_end_pfn)
> + break;
> +
> + expected_pte = pte_next_pfn(expected_pte);
> + ptep++;
> + }
> +
> + return ptep - start_ptep;
> }
>
> /*
> - * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
> - * is required to copy this pte.
> + * Copy one present PTE, trying to batch-process subsequent PTEs that map
> + * consecutive pages of the same folio by copying them as well.
> + *
> + * Returns -EAGAIN if one preallocated page is required to copy the next PTE.
> + * Otherwise, returns the number of copied PTEs (at least 1).
> */
> static inline int
> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
> - int *rss, struct folio **prealloc)
> + int max_nr, int *rss, struct folio **prealloc)
> {
> struct page *page;
> struct folio *folio;
> + int err, nr;
>
> page = vm_normal_page(src_vma, addr, pte);
> if (unlikely(!page))
> goto copy_pte;
>
> folio = page_folio(page);
> +
> + /*
> + * If we likely have to copy, just don't bother with batching. Make
> + * sure that the common "small folio" case is as fast as possible
> + * by keeping the batching logic separate.
> + */
> + if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
> + nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr);
> + folio_ref_add(folio, nr);
> + if (folio_test_anon(folio)) {
> + if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
> + nr, src_vma))) {
> + folio_ref_sub(folio, nr);
> + return -EAGAIN;
> + }
> + rss[MM_ANONPAGES] += nr;
> + VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
> + } else {
> + folio_dup_file_rmap_ptes(folio, page, nr);
> + rss[mm_counter_file(folio)] += nr;
> + }
> + __copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte,
> + addr, nr);
> + return nr;
> + }
> +
> folio_get(folio);
> if (folio_test_anon(folio)) {
> /*
> @@ -981,8 +1048,9 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, src_vma))) {
> /* Page may be pinned, we have to copy. */
> folio_put(folio);
> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> - addr, rss, prealloc, page);
> + err = copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> + addr, rss, prealloc, page);
> + return err ? err : 1;
> }
> rss[MM_ANONPAGES]++;
> VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
> @@ -992,8 +1060,8 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> }
>
> copy_pte:
> - __copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, pte, addr);
> - return 0;
> + __copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte, addr, 1);
> + return 1;
> }
>
> static inline struct folio *folio_prealloc(struct mm_struct *src_mm,
> @@ -1030,10 +1098,11 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> pte_t *src_pte, *dst_pte;
> pte_t ptent;
> spinlock_t *src_ptl, *dst_ptl;
> - int progress, ret = 0;
> + int progress, max_nr, ret = 0;
> int rss[NR_MM_COUNTERS];
> swp_entry_t entry = (swp_entry_t){0};
> struct folio *prealloc = NULL;
> + int nr;
>
> again:
> progress = 0;
> @@ -1064,6 +1133,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> arch_enter_lazy_mmu_mode();
>
> do {
> + nr = 1;
> +
> /*
> * We are holding two locks at this point - either of them
> * could generate latencies in another task on another CPU.
> @@ -1100,9 +1171,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> */
> WARN_ON_ONCE(ret != -ENOENT);
> }
> - /* copy_present_pte() will clear `*prealloc' if consumed */
> - ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
> - ptent, addr, rss, &prealloc);
> + /* copy_present_ptes() will clear `*prealloc' if consumed */
> + max_nr = (end - addr) / PAGE_SIZE;
> + ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
> + ptent, addr, max_nr, rss, &prealloc);
> /*
> * If we need a pre-allocated page for this pte, drop the
> * locks, allocate, and try again.
> @@ -1119,8 +1191,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> folio_put(prealloc);
> prealloc = NULL;
> }
> - progress += 8;
> - } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
> + nr = ret;
> + progress += 8 * nr;
> + } while (dst_pte += nr, src_pte += nr, addr += PAGE_SIZE * nr,
> + addr != end);
>
> arch_leave_lazy_mmu_mode();
> pte_unmap_unlock(orig_src_pte, src_ptl);
> @@ -1141,7 +1215,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> prealloc = folio_prealloc(src_mm, src_vma, addr, false);
> if (!prealloc)
> return -ENOMEM;
> - } else if (ret) {
> + } else if (ret < 0) {
> VM_WARN_ON_ONCE(1);
> }
>
> --
> 2.43.0
>
>

--
Sincerely yours,
Mike.

2024-02-09 23:03:36

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 01/15] arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary

On 08.02.24 07:10, Mike Rapoport wrote:
> On Mon, Jan 29, 2024 at 01:46:35PM +0100, David Hildenbrand wrote:
>> From: Ryan Roberts <[email protected]>
>>
>> Since the high bits [51:48] of an OA are not stored contiguously in the
>> PTE, there is a theoretical bug in set_ptes(), which just adds PAGE_SIZE
>> to the pte to get the pte with the next pfn. This works until the pfn
>> crosses the 48-bit boundary, at which point we overflow into the upper
>> attributes.
>>
>> Of course one could argue (and Matthew Wilcox has :) that we will never
>> see a folio cross this boundary because we only allow naturally aligned
>> power-of-2 allocation, so this would require a half-petabyte folio. So
>> it's only a theoretical bug. But it's better that the code is robust
>> regardless.
>>
>> I've implemented pte_next_pfn() as part of the fix, which is an opt-in
>> core-mm interface. So that is now available to the core-mm, which will
>> be needed shortly to support forthcoming fork()-batching optimizations.
>>
>> Link: https://lkml.kernel.org/r/[email protected]
>> Fixes: 4a169d61c2ed ("arm64: implement the new page table range API")
>> Closes: https://lore.kernel.org/linux-mm/[email protected]/
>> Signed-off-by: Ryan Roberts <[email protected]>
>> Reviewed-by: Catalin Marinas <[email protected]>
>> Reviewed-by: David Hildenbrand <[email protected]>
>> Signed-off-by: David Hildenbrand <[email protected]>
>
> Reviewed-by: Mike Rapoport (IBM) <[email protected]>

Thanks for the review Mike, appreciated!

--
Cheers,

David / dhildenb


2024-02-14 22:41:23

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 12/15] mm/memory: pass PTE to copy_present_pte()

On 29.01.24 13:46, David Hildenbrand wrote:
> We already read it, let's just forward it.
>
> This patch is based on work by Ryan Roberts.
>
> Reviewed-by: Ryan Roberts <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/memory.c | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a3bdb25f4c8d..41b24da5be38 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -959,10 +959,9 @@ static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
> */
> static inline int
> copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> - pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> - struct folio **prealloc)
> + pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
> + int *rss, struct folio **prealloc)
> {
> - pte_t pte = ptep_get(src_pte);
> struct page *page;
> struct folio *folio;
>
> @@ -1103,7 +1102,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> }
> /* copy_present_pte() will clear `*prealloc' if consumed */
> ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
> - addr, rss, &prealloc);
> + ptent, addr, rss, &prealloc);
> /*
> * If we need a pre-allocated page for this pte, drop the
> * locks, allocate, and try again.

The following fixup for that device-exclusive thingy on top (fixing a hmm
selftest I just discovered to be broken).


From 8f9e44f25087dc71890b8d9bd680375691232e85 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <[email protected]>
Date: Wed, 14 Feb 2024 23:09:29 +0100
Subject: [PATCH] fixup: mm/memory: pass PTE to copy_present_pte()

For device-exclusive non-swap entries (is_device_exclusive_entry()),
copy_nonpresent_pte() can turn the PTEs into actual present PTEs while
holding the page table lock.

We have to re-read the PTE after that operation, so that we won't be
working on the stale non-present PTE value while assuming it is present.

This fixes the hmm "exclusive_cow" selftest.

./run_vmtests.sh -t hmm
# # RUN hmm.hmm_device_private.exclusive_cow ...
# # OK hmm.hmm_device_private.exclusive_cow
# ok 23 hmm.hmm_device_private.exclusive_cow

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 3b8e56eb08a3..29a75f38df7c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1208,6 +1208,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
progress += 8;
continue;
}
+ ptent = ptep_get(src_pte);
+ VM_WARN_ON_ONCE(!pte_present(ptent));

/*
* Device exclusive entry restored, continue by copying
--
2.43.0


--
Cheers,

David / dhildenb


Subject: Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

Hello:

This series was applied to riscv/linux.git (fixes)
by Andrew Morton <[email protected]>:

On Mon, 29 Jan 2024 13:46:34 +0100 you wrote:
> Now that the rmap overhaul[1] is upstream that provides a clean interface
> for rmap batching, let's implement PTE batching during fork when processing
> PTE-mapped THPs.
>
> This series is partially based on Ryan's previous work[2] to implement
> cont-pte support on arm64, but its a complete rewrite based on [1] to
> optimize all architectures independent of any such PTE bits, and to
> use the new rmap batching functions that simplify the code and prepare
> for further rmap accounting changes.
>
> [...]

Here is the summary with links:
- [v3,01/15] arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary
(no matching commit)
- [v3,02/15] arm/pgtable: define PFN_PTE_SHIFT
(no matching commit)
- [v3,03/15] nios2/pgtable: define PFN_PTE_SHIFT
(no matching commit)
- [v3,04/15] powerpc/pgtable: define PFN_PTE_SHIFT
(no matching commit)
- [v3,05/15] riscv/pgtable: define PFN_PTE_SHIFT
https://git.kernel.org/riscv/c/57c254b2fb31
- [v3,06/15] s390/pgtable: define PFN_PTE_SHIFT
(no matching commit)
- [v3,07/15] sparc/pgtable: define PFN_PTE_SHIFT
(no matching commit)
- [v3,08/15] mm/pgtable: make pte_next_pfn() independent of set_ptes()
(no matching commit)
- [v3,09/15] arm/mm: use pte_next_pfn() in set_ptes()
(no matching commit)
- [v3,10/15] powerpc/mm: use pte_next_pfn() in set_ptes()
(no matching commit)
- [v3,11/15] mm/memory: factor out copying the actual PTE in copy_present_pte()
(no matching commit)
- [v3,12/15] mm/memory: pass PTE to copy_present_pte()
(no matching commit)
- [v3,13/15] mm/memory: optimize fork() with PTE-mapped THP
(no matching commit)
- [v3,14/15] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
(no matching commit)
- [v3,15/15] mm/memory: ignore writable bit in folio_pte_batch()
(no matching commit)

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html