2024-04-09 19:25:09

Subject: [PATCH v1 00/18] mm: mapcount for large folios + page_mapcount() cleanups

This series tracks the mapcount of large folios in a single value, so
it can be read efficiently and atomically, just like the mapcount of
small folios.
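
A minimal userspace sketch of the idea, assuming C11 atomics in place of the kernel's atomic_t. The struct, its field names and the model_* helpers are hypothetical and are not the kernel's struct folio or its API. Each map/unmap updates the per-page counter and one folio-wide value, so reading the folio's mapcount is a single atomic load instead of a walk over all pages.

```c
#include <assert.h>
#include <stdatomic.h>

#define MODEL_NR_PAGES 16 /* hypothetical order-4 folio */

/* Userspace model only; names are illustrative, not kernel API. */
struct model_folio {
	atomic_int page_mapcount[MODEL_NR_PAGES]; /* per-page counters */
	atomic_int large_mapcount;                /* single folio-wide value */
};

static void model_folio_init(struct model_folio *f)
{
	for (int i = 0; i < MODEL_NR_PAGES; i++)
		atomic_init(&f->page_mapcount[i], 0);
	atomic_init(&f->large_mapcount, 0);
}

/* Map one page: bump the per-page counter and the folio-wide value. */
static void model_map_page(struct model_folio *f, int idx)
{
	assert(idx >= 0 && idx < MODEL_NR_PAGES);
	atomic_fetch_add(&f->page_mapcount[idx], 1);
	atomic_fetch_add(&f->large_mapcount, 1);
}

/* Unmap one page: the reverse of model_map_page(). */
static void model_unmap_page(struct model_folio *f, int idx)
{
	assert(idx >= 0 && idx < MODEL_NR_PAGES);
	atomic_fetch_sub(&f->page_mapcount[idx], 1);
	atomic_fetch_sub(&f->large_mapcount, 1);
}

/* Fast read: one atomic load, independent of the folio size. */
static int model_folio_mapcount(struct model_folio *f)
{
	return atomic_load(&f->large_mapcount);
}

/* Slow read: sums all pages; under concurrency this is not a snapshot. */
static int model_folio_mapcount_slow(struct model_folio *f)
{
	int sum = 0;

	for (int i = 0; i < MODEL_NR_PAGES; i++)
		sum += atomic_load(&f->page_mapcount[i]);
	return sum;
}
```

In this single-threaded model both readers agree. The difference is that the fast reader stays one load and remains consistent when other threads map and unmap concurrently.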

folio_mapcount() is then used in a couple more places, most notably to
reduce false negatives in folio_likely_mapped_shared(), and many users of
page_mapcount() are cleaned up (that's maybe why you got CCed on the
full series, sorry sh+xtensa folks! :) ).
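
To illustrate how a folio-wide mapcount can reduce false negatives in a "likely mapped shared" guess, here is a hedged sketch. It is not the code from this series: the struct and both functions are hypothetical, and the model ignores entire (PMD) mappings and hugetlb. The only point is the counting argument: if the total exceeds the number of pages, at least one page must be mapped more than once, even when the first page is not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inputs for the sketch; not the kernel's struct folio. */
struct shared_guess_in {
	int nr_pages;            /* number of pages in the folio */
	int total_mapcount;      /* folio-wide mapcount (single value) */
	int first_page_mapcount; /* mapcount of the first page only */
};

/* Guess from the first page alone: misses sharing of other pages. */
static bool guess_first_page_only(const struct shared_guess_in *in)
{
	return in->first_page_mapcount > 1;
}

/* Guess that also consults the folio-wide mapcount. */
static bool guess_with_total(const struct shared_guess_in *in)
{
	assert(in->nr_pages > 0);

	/* At most one mapping in total: cannot be mapped shared. */
	if (in->total_mapcount <= 1)
		return false;

	/* More mappings than pages: some page is mapped at least twice. */
	if (in->total_mapcount > in->nr_pages)
		return true;

	/* Otherwise this is still only a guess based on the first page. */
	return in->first_page_mapcount > 1;
}
```

The result stays a heuristic either way: a page mapped twice is not necessarily mapped by two different processes, and a total at or below the page count can still hide sharing.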

The remaining s390x user and one KSM user of page_mapcount() are getting
removed separately on the list right now. I have patches to handle the
other KSM one, the khugepaged one and the kpagecount one; as they are not
as "obvious", I will send them out separately in the future. Once that is
all in place, I'm planning on moving page_mapcount() into
fs/proc/task_mmu.c, the remaining user for the time being (and we can
discuss at LSF/MM details on that :) ).

I proposed the mapcount for large folios (previously called total
mapcount) originally in part of [1] and I later included it in [2] where
it is a requirement. In the meantime, I changed the patch a bit so I
dropped all RB's. During the discussion of [1], Peter Xu correctly raised
that this additional tracking might affect the performance when
PMD->PTE remapping THPs. In the meantime, I addressed that by batching RMAP
operations during fork(), unmap/zap and when PMD->PTE remapping THPs.
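
Why batching helps can be sketched in userspace as follows. All names are hypothetical, C11 atomics stand in for the kernel's, and the plain int that counts read-modify-write operations only works because the model is single-threaded. The per-page counters are still updated one by one, but the shared folio-wide value is updated once per batch instead of once per PTE.

```c
#include <assert.h>
#include <stdatomic.h>

#define BATCH_NR_PAGES 512 /* hypothetical order-9 folio */

/* Userspace model only; names are illustrative, not kernel API. */
struct batch_folio {
	atomic_int page_mapcount[BATCH_NR_PAGES];
	atomic_int large_mapcount;
	int large_mapcount_rmw; /* RMW ops issued on large_mapcount */
};

static void batch_folio_init(struct batch_folio *f)
{
	for (int i = 0; i < BATCH_NR_PAGES; i++)
		atomic_init(&f->page_mapcount[i], 0);
	atomic_init(&f->large_mapcount, 0);
	f->large_mapcount_rmw = 0;
}

/* Unbatched: one RMW on the shared value for every mapped page. */
static void map_ptes_one_by_one(struct batch_folio *f, int first, int nr)
{
	assert(first >= 0 && nr >= 0 && first + nr <= BATCH_NR_PAGES);
	for (int i = first; i < first + nr; i++) {
		atomic_fetch_add(&f->page_mapcount[i], 1);
		atomic_fetch_add(&f->large_mapcount, 1);
		f->large_mapcount_rmw++;
	}
}

/* Batched: per-page updates as before, one RMW on the shared value. */
static void map_ptes_batched(struct batch_folio *f, int first, int nr)
{
	assert(first >= 0 && nr >= 0 && first + nr <= BATCH_NR_PAGES);
	for (int i = first; i < first + nr; i++)
		atomic_fetch_add(&f->page_mapcount[i], 1);
	atomic_fetch_add(&f->large_mapcount, nr);
	f->large_mapcount_rmw++;
}
```

Both variants leave the counters in the same state. Only the number of atomic operations on the shared value differs, which is the cost the batching is meant to amortize.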

Running some of my micro-benchmarks [3] (fork,munmap,cow-byte,remap) on 1
GiB of memory backed by folios with the same order, I observe the following
on an Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz tuned for reproducible
results as much as possible:

Standard deviation is mostly < 1%, except for order-9, where it's < 2% for
fork() and munmap().

(1) Small folios are not affected (< 1%) in all 4 microbenchmarks.
(2) Order-4 folios are not affected (< 1%) in all 4 microbenchmarks. A bit
weird compared to the other orders ...
(3) PMD->PTE remapping of order-9 THPs is not affected (< 1%)
(4) COW-byte (COWing a single page by writing a single byte) is not
affected for any order (< 1%). The page copy_fault overhead dominates
everything.
(5) fork() is mostly not affected (< 1%), except order-2, where we have
a slowdown of ~4%. Already for order-3 folios, we're down to a slowdown
of < 1%.
(6) munmap() sees a slowdown by < 3% for some orders (order-5,
order-6, order-9), but less for others (< 1% for order-4 and order-8,
< 2% for order-2, order-3, order-7).
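
For readers unfamiliar with the COW-byte case in (4), the following is a minimal sketch of the access pattern only. It is not the benchmark from [3]: it does no timing, does not control the folio order, and cow_byte_once() is a hypothetical helper. The parent populates private anonymous memory, forks, and the child writes one byte per page so that each write takes a COW fault.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS with strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Populate len bytes, fork, and let the child write one byte per page.
 * Returns the number of pages whose content in the parent is unchanged
 * afterwards (expected: all of them), or -1 on error.
 */
static long cow_byte_once(size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *mem;
	long pages = 0;
	int status;
	pid_t pid;

	if (page <= 0 || len == 0 || len % (size_t)page)
		return -1;

	mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -1;

	/* Touch every page so the child shares populated pages. */
	for (size_t off = 0; off < len; off += (size_t)page)
		mem[off] = 1;

	pid = fork();
	if (pid < 0) {
		munmap(mem, len);
		return -1;
	}
	if (pid == 0) {
		/* Child: a single-byte write per page triggers COW. */
		for (size_t off = 0; off < len; off += (size_t)page)
			mem[off] = 2;
		_exit(0);
	}

	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
	    WEXITSTATUS(status) != 0) {
		munmap(mem, len);
		return -1;
	}

	/* The child's writes must not be visible in the parent. */
	for (size_t off = 0; off < len; off += (size_t)page) {
		if (mem[off] != 1) {
			munmap(mem, len);
			return -1;
		}
		pages++;
	}
	munmap(mem, len);
	return pages;
}
```

A real measurement would time the child's write loop and control the folio size backing the range; both are left out here on purpose.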

Especially the fork() and munmap() benchmarks are sensitive to each added
instruction and other system noise, so I suspect some of the changes and
the observed weirdness (order-4) are due to code layout changes and other
factors, but not really due to the added atomics.

So in the common case where we can batch, the added atomics don't really
make a big difference, especially in light of the improvements for
large folios that we recently gained due to batching. Surprisingly, for
some cases where we cannot batch (e.g., COW), the added atomics don't seem
to matter, because other overhead dominates.

My fork and munmap micro-benchmarks don't cover cases where we cannot
batch-process bigger parts of large folios. As this is not the common case,
I'm not worrying about that right now.

Future work is batching RMAP operations during swapout and folio
migration.

Not CCing everybody recommended by get_maintainer.pl (e.g., cgroups folks
just because of the doc update), to reduce noise. Tested on
x86-64, compile-tested on a bunch of other archs. Will do more testing
in the upcoming days.

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?ref_type=heads

Cc: Andrew Morton <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Richard Chang <[email protected]>

David Hildenbrand (18):
  mm: allow for detecting underflows with page_mapcount() again
  mm/rmap: always inline anon/file rmap duplication of a single PTE
  mm/rmap: add fast-path for small folios when
    adding/removing/duplicating
  mm: track mapcount of large folios in single value
  mm: improve folio_likely_mapped_shared() using the mapcount of large
    folios
  mm: make folio_mapcount() return 0 for small typed folios
  mm/memory: use folio_mapcount() in zap_present_folio_ptes()
  mm/huge_memory: use folio_mapcount() in zap_huge_pmd() sanity check
  mm/memory-failure: use folio_mapcount() in hwpoison_user_mappings()
  mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range()
  mm/migrate: use folio_likely_mapped_shared() in
    add_page_for_migration()
  sh/mm/cache: use folio_mapped() in copy_from_user_page()
  mm/filemap: use folio_mapcount() in filemap_unaccount_folio()
  mm/migrate_device: use folio_mapcount() in migrate_vma_check_page()
  trace/events/page_ref: trace the raw page mapcount value
  xtensa/mm: convert check_tlb_entry() to sanity check folios
  mm/debug: print only page mapcount (excluding folio entire mapcount)
    in __dump_folio()
  Documentation/admin-guide/cgroup-v1/memory.rst: don't reference
    page_mapcount()

 .../admin-guide/cgroup-v1/memory.rst |  4 +-
 Documentation/mm/transhuge.rst       | 12 +--
 arch/sh/mm/cache.c                   |  2 +-
 arch/xtensa/mm/tlb.c                 | 11 +--
 include/linux/mm.h                   | 77 +++++++++++--------
 include/linux/mm_types.h             |  5 +-
 include/linux/rmap.h                 | 40 +++++++++-
 include/trace/events/page_ref.h      |  4 +-
 mm/debug.c                           | 12 +--
 mm/filemap.c                         |  2 +-
 mm/huge_memory.c                     |  2 +-
 mm/hugetlb.c                         |  4 +-
 mm/internal.h                        |  3 +
 mm/khugepaged.c                      |  2 +-
 mm/memory-failure.c                  |  4 +-
 mm/memory.c                          |  3 +-
 mm/migrate.c                         |  2 +-
 mm/migrate_device.c                  | 12 +--
 mm/page_alloc.c                      | 12 ++-
 mm/rmap.c                            | 60 +++++++--------
 20 files changed, 163 insertions(+), 110 deletions(-)

--
2.44.0

2024-04-09 19:25:12

Subject: [PATCH v1 04/18] mm: track mapcount of large folios in single value

Subject: [PATCH v1 03/18] mm/rmap: add fast-path for small folios when adding/removing/duplicating

Subject: [PATCH v1 02/18] mm/rmap: always inline anon/file rmap duplication of a single PTE

Subject: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

Subject: [PATCH v1 09/18] mm/memory-failure: use folio_mapcount() in hwpoison_user_mappings()

Subject: [PATCH v1 06/18] mm: make folio_mapcount() return 0 for small typed folios

Subject: [PATCH v1 10/18] mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range()

Subject: [PATCH v1 07/18] mm/memory: use folio_mapcount() in zap_present_folio_ptes()

Subject: [PATCH v1 11/18] mm/migrate: use folio_likely_mapped_shared() in add_page_for_migration()

Subject: [PATCH v1 13/18] mm/filemap: use folio_mapcount() in filemap_unaccount_folio()

Subject: [PATCH v1 14/18] mm/migrate_device: use folio_mapcount() in migrate_vma_check_page()

Subject: [PATCH v1 15/18] trace/events/page_ref: trace the raw page mapcount value

Subject: [PATCH v1 16/18] xtensa/mm: convert check_tlb_entry() to sanity check folios

Subject: [PATCH v1 17/18] mm/debug: print only page mapcount (excluding folio entire mapcount) in __dump_folio()

Subject: [PATCH v1 18/18] Documentation/admin-guide/cgroup-v1/memory.rst: don't reference page_mapcount()

Subject: [PATCH v1 08/18] mm/huge_memory: use folio_mapcount() in zap_huge_pmd() sanity check

Subject: [PATCH v1 12/18] sh/mm/cache: use folio_mapped() in copy_from_user_page()

Subject: Re: [PATCH v1 04/18] mm: track mapcount of large folios in single value

Subject: Re: [PATCH v1 04/18] mm: track mapcount of large folios in single value

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

Subject: Re: [PATCH v1 04/18] mm: track mapcount of large folios in single value

Subject: Re: [PATCH v1 04/18] mm: track mapcount of large folios in single value

Subject: Re: [PATCH v1 04/18] mm: track mapcount of large folios in single value

Subject: Re: [PATCH v1 02/18] mm/rmap: always inline anon/file rmap duplication of a single PTE

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

Subject: Re: [PATCH v1 02/18] mm/rmap: always inline anon/file rmap duplication of a single PTE

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

Subject: Re: [PATCH v1 04/18] mm: track mapcount of large folios in single value

Subject: Re: [PATCH v1 03/18] mm/rmap: add fast-path for small folios when adding/removing/duplicating

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

Subject: Re: [PATCH v1 02/18] mm/rmap: always inline anon/file rmap duplication of a single PTE

Subject: Re: [PATCH v1 06/18] mm: make folio_mapcount() return 0 for small typed folios