2024-04-09 19:25:09

by David Hildenbrand

Subject: [PATCH v1 00/18] mm: mapcount for large folios + page_mapcount() cleanups

This series tracks the mapcount of large folios in a single value, so
it can be read efficiently and atomically, just like the mapcount of
small folios.

folio_mapcount() is then used in a couple more places, most notably to
reduce false negatives in folio_likely_mapped_shared(), and many users of
page_mapcount() are cleaned up (that's maybe why you got CCed on the
full series, sorry sh+xtensa folks! :) ).

The remaining s390x user and one KSM user of page_mapcount() are getting
removed separately on the list right now. I have patches to handle the
other KSM one, the khugepaged one and the kpagecount one; as they are not
as "obvious", I will send them out separately in the future. Once that is
all in place, I'm planning on moving page_mapcount() into
fs/proc/task_mmu.c, the remaining user for the time being (and we can
discuss the details at LSF/MM :) ).

I proposed the mapcount for large folios (previously called total
mapcount) originally as part of [1] and later included it in [2], where
it is a requirement. In the meantime, I changed the patch a bit, so I
dropped all RBs. During the discussion of [1], Peter Xu correctly raised
that this additional tracking might affect the performance when
PMD->PTE remapping THPs. Since then, I have addressed that by batching RMAP
operations during fork(), unmap/zap and when PMD->PTE remapping THPs.

Running some of my micro-benchmarks [3] (fork,munmap,cow-byte,remap) on 1
GiB of memory backed by folios with the same order, I observe the following
on an Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz tuned for reproducible
results as much as possible:

Standard deviation is mostly < 1%, except for order-9, where it's < 2% for
fork() and munmap().

(1) Small folios are not affected (< 1%) in all 4 microbenchmarks.
(2) Order-4 folios are not affected (< 1%) in all 4 microbenchmarks. A bit
weird compared to the other orders ...
(3) PMD->PTE remapping of order-9 THPs is not affected (< 1%)
(4) COW-byte (COWing a single page by writing a single byte) is not
affected for any order (< 1 %). The page copy and fault overhead dominates
everything.
(5) fork() is mostly not affected (< 1%), except order-2, where we have
a slowdown of ~4%. Already for order-3 folios, we're down to a slowdown
of < 1%.
(6) munmap() sees a slowdown of < 3% for some orders (order-5,
order-6, order-9), but less for others (< 1% for order-4 and order-8,
< 2% for order-2, order-3, order-7).

The fork() and munmap() benchmarks in particular are sensitive to each added
instruction and to other system noise, so I suspect some of the change and
the observed weirdness (order-4) are due to code layout changes and other
factors, not really due to the added atomics.

So in the common case where we can batch, the added atomics don't really
make a big difference, especially in light of the improvements for
large folios that we recently gained due to batching. Surprisingly, for
some cases where we cannot batch (e.g., COW), the added atomics don't seem
to matter, because other overhead dominates.

My fork and munmap micro-benchmarks don't cover cases where we cannot
batch-process bigger parts of large folios. As this is not the common case,
I'm not worrying about that right now.

Future work is batching RMAP operations during swapout and folio
migration.

Not CCing everybody recommended by get_maintainers (e.g., cgroups folks
just because of the doc update), to reduce noise. Tested on x86-64,
compile-tested on a bunch of other archs. Will do more testing in the
upcoming days.

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?ref_type=heads

Cc: Andrew Morton <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Richard Chang <[email protected]>

David Hildenbrand (18):
mm: allow for detecting underflows with page_mapcount() again
mm/rmap: always inline anon/file rmap duplication of a single PTE
mm/rmap: add fast-path for small folios when adding/removing/duplicating
mm: track mapcount of large folios in single value
mm: improve folio_likely_mapped_shared() using the mapcount of large folios
mm: make folio_mapcount() return 0 for small typed folios
mm/memory: use folio_mapcount() in zap_present_folio_ptes()
mm/huge_memory: use folio_mapcount() in zap_huge_pmd() sanity check
mm/memory-failure: use folio_mapcount() in hwpoison_user_mappings()
mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range()
mm/migrate: use folio_likely_mapped_shared() in add_page_for_migration()
sh/mm/cache: use folio_mapped() in copy_from_user_page()
mm/filemap: use folio_mapcount() in filemap_unaccount_folio()
mm/migrate_device: use folio_mapcount() in migrate_vma_check_page()
trace/events/page_ref: trace the raw page mapcount value
xtensa/mm: convert check_tlb_entry() to sanity check folios
mm/debug: print only page mapcount (excluding folio entire mapcount) in __dump_folio()
Documentation/admin-guide/cgroup-v1/memory.rst: don't reference page_mapcount()

.../admin-guide/cgroup-v1/memory.rst | 4 +-
Documentation/mm/transhuge.rst | 12 +--
arch/sh/mm/cache.c | 2 +-
arch/xtensa/mm/tlb.c | 11 +--
include/linux/mm.h | 77 +++++++++++--------
include/linux/mm_types.h | 5 +-
include/linux/rmap.h | 40 +++++++++-
include/trace/events/page_ref.h | 4 +-
mm/debug.c | 12 +--
mm/filemap.c | 2 +-
mm/huge_memory.c | 2 +-
mm/hugetlb.c | 4 +-
mm/internal.h | 3 +
mm/khugepaged.c | 2 +-
mm/memory-failure.c | 4 +-
mm/memory.c | 3 +-
mm/migrate.c | 2 +-
mm/migrate_device.c | 12 +--
mm/page_alloc.c | 12 ++-
mm/rmap.c | 60 +++++++--------
20 files changed, 163 insertions(+), 110 deletions(-)

--
2.44.0



2024-04-09 19:25:12

by David Hildenbrand

Subject: [PATCH v1 04/18] mm: track mapcount of large folios in single value

Let's track the mapcount of large folios in a single value. The mapcount of
a large folio currently corresponds to the sum of the entire mapcount and
all page mapcounts.

This sum is what we actually want to know in folio_mapcount() and it is
also sufficient for implementing folio_mapped().

With PTE-mapped THP becoming more important and more widely used, we want
to avoid looping over all pages of a folio just to obtain the mapcount
of large folios. The comment "In the common case, avoid the loop when no
pages mapped by PTE" in folio_total_mapcount() no longer holds for
mTHPs, which are always mapped by PTE.

Further, we are planning on using folio_mapcount() more
frequently, and might even want to remove page mapcounts for large
folios in some kernel configs. Therefore, allow for reading the mapcount of
large folios efficiently and atomically without looping over any pages.
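
To make the difference concrete, here is a minimal user-space sketch contrasting the old approach (summing the entire mapcount and every page mapcount, as folio_total_mapcount() did) with reading one large mapcount that is adjusted once per batch. The names model_folio and MODEL_NR_PAGES are hypothetical, C11 atomics stand in for the kernel's atomic_t, and the sketch assumes the same -1 bias the kernel uses.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space model of a large folio; not the kernel's
 * struct folio. As in the kernel, every counter is biased by -1,
 * so -1 means "not mapped". */
#define MODEL_NR_PAGES 16

struct model_folio {
	atomic_int entire_mapcount;               /* PMD mappings */
	atomic_int large_mapcount;                /* all mappings (new) */
	atomic_int page_mapcount[MODEL_NR_PAGES]; /* PTE mappings per page */
};

static void model_folio_init(struct model_folio *f)
{
	atomic_init(&f->entire_mapcount, -1);
	atomic_init(&f->large_mapcount, -1);
	for (int i = 0; i < MODEL_NR_PAGES; i++)
		atomic_init(&f->page_mapcount[i], -1);
}

/* Old scheme: sum the entire mapcount and all page mapcounts, O(nr_pages). */
static int model_mapcount_by_loop(struct model_folio *f)
{
	int mapcount = atomic_load(&f->entire_mapcount) + 1;

	for (int i = 0; i < MODEL_NR_PAGES; i++)
		mapcount += atomic_load(&f->page_mapcount[i]) + 1;
	return mapcount;
}

/* New scheme: a single atomic read, O(1). */
static int model_large_mapcount(struct model_folio *f)
{
	return atomic_load(&f->large_mapcount) + 1;
}

/* Map nr pages by PTE starting at page index first: per-page counters
 * are still bumped, but the large mapcount is adjusted once per batch. */
static void model_add_ptes(struct model_folio *f, int first, int nr)
{
	assert(first >= 0 && nr > 0 && first + nr <= MODEL_NR_PAGES);
	for (int i = first; i < first + nr; i++)
		atomic_fetch_add(&f->page_mapcount[i], 1);
	atomic_fetch_add(&f->large_mapcount, nr);
}

/* Map the whole folio with a single PMD. */
static void model_add_pmd(struct model_folio *f)
{
	atomic_fetch_add(&f->entire_mapcount, 1);
	atomic_fetch_add(&f->large_mapcount, 1);
}
```

Mapping the folio once by PMD, then all 16 pages by PTE, then four pages by PTE again yields 1 + 16 + 4 = 21 from both functions, but only model_large_mapcount() avoids reading every per-page counter.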

Maintain the mapcount also for hugetlb pages for simplicity. Use the new
mapcount to implement folio_mapcount() and folio_mapped(). Make
page_mapped() simply call folio_mapped(). We can now get rid of
folio_large_is_mapped().

_nr_pages_mapped is now only used in rmap code and for debugging
purposes. Keep folio_nr_pages_mapped() around, but document that its use
should be limited to rmap internals and debugging purposes.

This change implies one additional atomic add/sub whenever
mapping/unmapping (parts of) a large folio.

As we now batch RMAP operations for PTE-mapped THP during fork(),
during unmap/zap, and when PTE-remapping a PMD-mapped THP, and we adjust
the large mapcount for a PTE batch only once, the added overhead in the
common case is small. Only when unmapping individual pages of a large folio
(e.g., during COW) might the overhead be bigger in comparison, but it's
essentially one additional atomic operation.
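
As a rough illustration of the overhead argument, the hypothetical cost model below (not kernel code) counts atomic read-modify-write operations when PTE-unmapping nr_pages of a large folio in calls of batch pages each. It assumes one atomic per page on page->_mapcount plus one atomic_sub() on folio->_large_mapcount per rmap call, and it ignores _nr_pages_mapped updates.

```c
#include <assert.h>

/* Hypothetical cost model, not kernel code. _nr_pages_mapped updates
 * are deliberately ignored. */
struct atomic_cost {
	unsigned long per_page; /* atomics on per-page mapcounts */
	unsigned long large;    /* extra atomics on the large mapcount */
};

/* PTE-unmap nr_pages pages, issuing rmap calls of up to batch pages. */
static struct atomic_cost unmap_cost(unsigned long nr_pages,
				     unsigned long batch)
{
	struct atomic_cost c = { 0, 0 };

	assert(batch > 0);
	while (nr_pages > 0) {
		unsigned long n = nr_pages < batch ? nr_pages : batch;

		c.per_page += n; /* one atomic per page in the batch */
		c.large += 1;    /* one atomic_sub() per rmap call */
		nr_pages -= n;
	}
	return c;
}
```

Unmapping all 512 pages of a THP in one batch adds a single extra atomic, while 512 single-page calls add 512, one per call.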

Note that our refcount would overflow before the new mapcount could:
each mapping requires a folio reference. Extend the documentation of
folio_mapcount().
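
The overflow argument can be sketched as a simple invariant. In this hypothetical model (plain ints, without the kernel's atomics or its -1 bias), taking the folio reference always precedes bumping the mapcount, so the mapcount never exceeds the refcount and the refcount limit is reached first.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical model: plain ints, no -1 bias, no atomics. */
struct model_folio {
	int refcount;
	int mapcount;
};

/* Map the folio once more. Each new mapping first takes a folio
 * reference; if the refcount cannot grow, neither can the mapcount. */
static bool model_map_once(struct model_folio *f)
{
	if (f->refcount == INT_MAX)
		return false; /* the refcount is exhausted first */
	f->refcount++;
	/* Invariant: mapcount < refcount here, so mapcount + 1 cannot overflow. */
	assert(f->mapcount < f->refcount);
	f->mapcount++;
	return true;
}
```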

Signed-off-by: David Hildenbrand <[email protected]>
---
Documentation/mm/transhuge.rst | 12 +++++-----
include/linux/mm.h | 44 ++++++++++++++++------------------
include/linux/mm_types.h | 5 ++--
include/linux/rmap.h | 10 ++++++++
mm/debug.c | 3 ++-
mm/hugetlb.c | 4 ++--
mm/internal.h | 3 +++
mm/khugepaged.c | 2 +-
mm/page_alloc.c | 4 ++++
mm/rmap.c | 34 +++++++++-----------------
10 files changed, 62 insertions(+), 59 deletions(-)

diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst
index 93c9239b9ebe..1ba0ad63246c 100644
--- a/Documentation/mm/transhuge.rst
+++ b/Documentation/mm/transhuge.rst
@@ -116,14 +116,14 @@ pages:
succeeds on tail pages.

- map/unmap of a PMD entry for the whole THP increment/decrement
- folio->_entire_mapcount and also increment/decrement
- folio->_nr_pages_mapped by ENTIRELY_MAPPED when _entire_mapcount
- goes from -1 to 0 or 0 to -1.
+ folio->_entire_mapcount, increment/decrement folio->_large_mapcount
+ and also increment/decrement folio->_nr_pages_mapped by ENTIRELY_MAPPED
+ when _entire_mapcount goes from -1 to 0 or 0 to -1.

- map/unmap of individual pages with PTE entry increment/decrement
- page->_mapcount and also increment/decrement folio->_nr_pages_mapped
- when page->_mapcount goes from -1 to 0 or 0 to -1 as this counts
- the number of pages mapped by PTE.
+ page->_mapcount, increment/decrement folio->_large_mapcount and also
+ increment/decrement folio->_nr_pages_mapped when page->_mapcount goes
+ from -1 to 0 or 0 to -1 as this counts the number of pages mapped by PTE.

split_huge_page internally has to distribute the refcounts in the head
page to the tail pages before clearing all PG_head/tail bits from the page
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0fb8a40f82dd..1862a216af15 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1239,16 +1239,26 @@ static inline int page_mapcount(struct page *page)
return mapcount;
}

-int folio_total_mapcount(const struct folio *folio);
+static inline int folio_large_mapcount(const struct folio *folio)
+{
+ VM_WARN_ON_FOLIO(!folio_test_large(folio), folio);
+ return atomic_read(&folio->_large_mapcount) + 1;
+}

/**
- * folio_mapcount() - Calculate the number of mappings of this folio.
+ * folio_mapcount() - Number of mappings of this folio.
* @folio: The folio.
*
- * A large folio tracks both how many times the entire folio is mapped,
- * and how many times each individual page in the folio is mapped.
- * This function calculates the total number of times the folio is
- * mapped.
+ * The folio mapcount corresponds to the number of present user page table
+ * entries that reference any part of a folio. Each such present user page
+ * table entry must be paired with exactly one folio reference.
+ *
+ * For ordinary folios, each user page table entry (PTE/PMD/PUD/...) counts
+ * exactly once.
+ *
+ * For hugetlb folios, each abstracted "hugetlb" user page table entry that
+ * references the entire folio counts exactly once, even when such special
+ * page table entries are comprised of multiple ordinary page table entries.
*
* Return: The number of times this folio is mapped.
*/
@@ -1256,17 +1266,7 @@ static inline int folio_mapcount(const struct folio *folio)
{
if (likely(!folio_test_large(folio)))
return atomic_read(&folio->_mapcount) + 1;
- return folio_total_mapcount(folio);
-}
-
-static inline bool folio_large_is_mapped(const struct folio *folio)
-{
- /*
- * Reading _entire_mapcount below could be omitted if hugetlb
- * participated in incrementing nr_pages_mapped when compound mapped.
- */
- return atomic_read(&folio->_nr_pages_mapped) > 0 ||
- atomic_read(&folio->_entire_mapcount) >= 0;
+ return folio_large_mapcount(folio);
}

/**
@@ -1275,11 +1275,9 @@ static inline bool folio_large_is_mapped(const struct folio *folio)
*
* Return: True if any page in this folio is referenced by user page tables.
*/
-static inline bool folio_mapped(struct folio *folio)
+static inline bool folio_mapped(const struct folio *folio)
{
- if (likely(!folio_test_large(folio)))
- return atomic_read(&folio->_mapcount) >= 0;
- return folio_large_is_mapped(folio);
+ return folio_mapcount(folio) >= 1;
}

/*
@@ -1289,9 +1287,7 @@ static inline bool folio_mapped(struct folio *folio)
*/
static inline bool page_mapped(const struct page *page)
{
- if (likely(!PageCompound(page)))
- return atomic_read(&page->_mapcount) >= 0;
- return folio_large_is_mapped(page_folio(page));
+ return folio_mapped(page_folio(page));
}

static inline struct page *virt_to_head_page(const void *x)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4260c595a79d..c432add95913 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -289,7 +289,8 @@ typedef struct {
* @virtual: Virtual address in the kernel direct map.
* @_last_cpupid: IDs of last CPU and last process that accessed the folio.
* @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
- * @_nr_pages_mapped: Do not use directly, call folio_mapcount().
+ * @_large_mapcount: Do not use directly, call folio_mapcount().
+ * @_nr_pages_mapped: Do not use outside of rmap and debug code.
* @_pincount: Do not use directly, call folio_maybe_dma_pinned().
* @_folio_nr_pages: Do not use directly, call folio_nr_pages().
* @_hugetlb_subpool: Do not use directly, use accessor in hugetlb.h.
@@ -348,8 +349,8 @@ struct folio {
struct {
unsigned long _flags_1;
unsigned long _head_1;
- unsigned long _folio_avail;
/* public: */
+ atomic_t _large_mapcount;
atomic_t _entire_mapcount;
atomic_t _nr_pages_mapped;
atomic_t _pincount;
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 327f1ca5a487..0f906dc6d280 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -273,6 +273,7 @@ static inline int hugetlb_try_dup_anon_rmap(struct folio *folio,
ClearPageAnonExclusive(&folio->page);
}
atomic_inc(&folio->_entire_mapcount);
+ atomic_inc(&folio->_large_mapcount);
return 0;
}

@@ -306,6 +307,7 @@ static inline void hugetlb_add_file_rmap(struct folio *folio)
VM_WARN_ON_FOLIO(folio_test_anon(folio), folio);

atomic_inc(&folio->_entire_mapcount);
+ atomic_inc(&folio->_large_mapcount);
}

static inline void hugetlb_remove_rmap(struct folio *folio)
@@ -313,11 +315,14 @@ static inline void hugetlb_remove_rmap(struct folio *folio)
VM_WARN_ON_FOLIO(!folio_test_hugetlb(folio), folio);

atomic_dec(&folio->_entire_mapcount);
+ atomic_dec(&folio->_large_mapcount);
}

static __always_inline void __folio_dup_file_rmap(struct folio *folio,
struct page *page, int nr_pages, enum rmap_level level)
{
+ const int orig_nr_pages = nr_pages;
+
__folio_rmap_sanity_checks(folio, page, nr_pages, level);

switch (level) {
@@ -330,9 +335,11 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
do {
atomic_inc(&page->_mapcount);
} while (page++, --nr_pages > 0);
+ atomic_add(orig_nr_pages, &folio->_large_mapcount);
break;
case RMAP_LEVEL_PMD:
atomic_inc(&folio->_entire_mapcount);
+ atomic_inc(&folio->_large_mapcount);
break;
}
}
@@ -382,6 +389,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
struct page *page, int nr_pages, struct vm_area_struct *src_vma,
enum rmap_level level)
{
+ const int orig_nr_pages = nr_pages;
bool maybe_pinned;
int i;

@@ -423,6 +431,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
ClearPageAnonExclusive(page);
atomic_inc(&page->_mapcount);
} while (page++, --nr_pages > 0);
+ atomic_add(orig_nr_pages, &folio->_large_mapcount);
break;
case RMAP_LEVEL_PMD:
if (PageAnonExclusive(page)) {
@@ -431,6 +440,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
ClearPageAnonExclusive(page);
}
atomic_inc(&folio->_entire_mapcount);
+ atomic_inc(&folio->_large_mapcount);
break;
}
return 0;
diff --git a/mm/debug.c b/mm/debug.c
index b71186f1fb0b..d064db42af54 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -68,8 +68,9 @@ static void __dump_folio(struct folio *folio, struct page *page,
folio_ref_count(folio), mapcount, mapping,
folio->index + idx, pfn);
if (folio_test_large(folio)) {
- pr_warn("head: order:%u entire_mapcount:%d nr_pages_mapped:%d pincount:%d\n",
+ pr_warn("head: order:%u mapcount:%d entire_mapcount:%d nr_pages_mapped:%d pincount:%d\n",
folio_order(folio),
+ folio_mapcount(folio),
folio_entire_mapcount(folio),
folio_nr_pages_mapped(folio),
atomic_read(&folio->_pincount));
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 454900c84b30..a8536349de13 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1517,7 +1517,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
struct page *p;

atomic_set(&folio->_entire_mapcount, 0);
- atomic_set(&folio->_nr_pages_mapped, 0);
+ atomic_set(&folio->_large_mapcount, 0);
atomic_set(&folio->_pincount, 0);

for (i = 1; i < nr_pages; i++) {
@@ -2120,7 +2120,7 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
/* we rely on prep_new_hugetlb_folio to set the hugetlb flag */
folio_set_order(folio, order);
atomic_set(&folio->_entire_mapcount, -1);
- atomic_set(&folio->_nr_pages_mapped, 0);
+ atomic_set(&folio->_large_mapcount, -1);
atomic_set(&folio->_pincount, 0);
return true;

diff --git a/mm/internal.h b/mm/internal.h
index 9d3250b4a08a..51fa6246769c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -72,6 +72,8 @@ void page_writeback_init(void);
/*
* How many individual pages have an elevated _mapcount. Excludes
* the folio's entire_mapcount.
+ *
+ * Don't use this function outside of debugging code.
*/
static inline int folio_nr_pages_mapped(const struct folio *folio)
{
@@ -610,6 +612,7 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
struct folio *folio = (struct folio *)page;

folio_set_order(folio, order);
+ atomic_set(&folio->_large_mapcount, -1);
atomic_set(&folio->_entire_mapcount, -1);
atomic_set(&folio->_nr_pages_mapped, 0);
atomic_set(&folio->_pincount, 0);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 89e2624fb3ff..2f73d2aa9ae8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1358,7 +1358,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
* Check if the page has any GUP (or other external) pins.
*
* Here the check may be racy:
- * it may see total_mapcount > refcount in some cases?
+ * it may see folio_mapcount() > folio_ref_count().
* But such case is ephemeral we could always retry collapse
* later. However it may report false positive if the page
* has excessive GUP pins (i.e. 512). Anyway the same check
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index adbb7e6e0c72..393366d4a704 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -941,6 +941,10 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
bad_page(page, "nonzero entire_mapcount");
goto out;
}
+ if (unlikely(folio_large_mapcount(folio))) {
+ bad_page(page, "nonzero large_mapcount");
+ goto out;
+ }
if (unlikely(atomic_read(&folio->_nr_pages_mapped))) {
bad_page(page, "nonzero nr_pages_mapped");
goto out;
diff --git a/mm/rmap.c b/mm/rmap.c
index 4bde6d60db6c..2608c40dffad 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1138,34 +1138,12 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
return page_vma_mkclean_one(&pvmw);
}

-int folio_total_mapcount(const struct folio *folio)
-{
- int mapcount = folio_entire_mapcount(folio);
- int nr_pages;
- int i;
-
- /* In the common case, avoid the loop when no pages mapped by PTE */
- if (folio_nr_pages_mapped(folio) == 0)
- return mapcount;
- /*
- * Add all the PTE mappings of those pages mapped by PTE.
- * Limit the loop to folio_nr_pages_mapped()?
- * Perhaps: given all the raciness, that may be a good or a bad idea.
- */
- nr_pages = folio_nr_pages(folio);
- for (i = 0; i < nr_pages; i++)
- mapcount += atomic_read(&folio_page(folio, i)->_mapcount);
-
- /* But each of those _mapcounts was based on -1 */
- mapcount += nr_pages;
- return mapcount;
-}
-
static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
struct page *page, int nr_pages, enum rmap_level level,
int *nr_pmdmapped)
{
atomic_t *mapped = &folio->_nr_pages_mapped;
+ const int orig_nr_pages = nr_pages;
int first, nr = 0;

__folio_rmap_sanity_checks(folio, page, nr_pages, level);
@@ -1185,6 +1163,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
nr++;
}
} while (page++, --nr_pages > 0);
+ atomic_add(orig_nr_pages, &folio->_large_mapcount);
break;
case RMAP_LEVEL_PMD:
first = atomic_inc_and_test(&folio->_entire_mapcount);
@@ -1201,6 +1180,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
nr = 0;
}
}
+ atomic_inc(&folio->_large_mapcount);
break;
}
return nr;
@@ -1436,10 +1416,14 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
SetPageAnonExclusive(page);
}

+ /* increment count (starts at -1) */
+ atomic_set(&folio->_large_mapcount, nr - 1);
atomic_set(&folio->_nr_pages_mapped, nr);
} else {
/* increment count (starts at -1) */
atomic_set(&folio->_entire_mapcount, 0);
+ /* increment count (starts at -1) */
+ atomic_set(&folio->_large_mapcount, 0);
atomic_set(&folio->_nr_pages_mapped, ENTIRELY_MAPPED);
SetPageAnonExclusive(&folio->page);
__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
@@ -1522,6 +1506,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
break;
}

+ atomic_sub(nr_pages, &folio->_large_mapcount);
do {
last = atomic_add_negative(-1, &page->_mapcount);
if (last) {
@@ -1532,6 +1517,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
} while (page++, --nr_pages > 0);
break;
case RMAP_LEVEL_PMD:
+ atomic_dec(&folio->_large_mapcount);
last = atomic_add_negative(-1, &folio->_entire_mapcount);
if (last) {
nr = atomic_sub_return_relaxed(ENTIRELY_MAPPED, mapped);
@@ -2714,6 +2700,7 @@ void hugetlb_add_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);

atomic_inc(&folio->_entire_mapcount);
+ atomic_inc(&folio->_large_mapcount);
if (flags & RMAP_EXCLUSIVE)
SetPageAnonExclusive(&folio->page);
VM_WARN_ON_FOLIO(folio_entire_mapcount(folio) > 1 &&
@@ -2728,6 +2715,7 @@ void hugetlb_add_new_anon_rmap(struct folio *folio,
BUG_ON(address < vma->vm_start || address >= vma->vm_end);
/* increment count (starts at -1) */
atomic_set(&folio->_entire_mapcount, 0);
+ atomic_set(&folio->_large_mapcount, 0);
folio_clear_hugetlb_restore_reserve(folio);
__folio_set_anon(folio, vma, address, true);
SetPageAnonExclusive(&folio->page);
--
2.44.0


2024-04-09 19:25:20

by David Hildenbrand

Subject: [PATCH v1 03/18] mm/rmap: add fast-path for small folios when adding/removing/duplicating

Let's add a fast-path for small folios to all relevant rmap functions.
Note that only RMAP_LEVEL_PTE applies.

This is a preparation for tracking the mapcount of large folios in a
single value.
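
A minimal user-space sketch of the PTE case of __folio_add_rmap() with the new fast path follows. It assumes C11 atomics and uses hypothetical model_* names; MODEL_ENTIRELY_MAPPED stands in for the kernel's ENTIRELY_MAPPED, and its value here is only illustrative. Small folios skip the loop and never touch _nr_pages_mapped, so the large-folio loop no longer needs a per-page folio_test_large() check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for ENTIRELY_MAPPED (assumed value). */
#define MODEL_ENTIRELY_MAPPED 0x800000

struct model_page {
	atomic_int mapcount; /* biased by -1 */
};

struct model_folio {
	bool large;
	atomic_int nr_pages_mapped; /* only used for large folios */
};

/* Returns how many pages went from "unmapped" to "mapped". */
static int model_add_rmap_ptes(struct model_folio *folio,
			       struct model_page *page, int nr_pages)
{
	int nr = 0;

	/* Fast path: a small folio has one page and no _nr_pages_mapped. */
	if (!folio->large) {
		assert(nr_pages == 1);
		/* Like atomic_inc_and_test(): true if the new value is 0. */
		return atomic_fetch_add(&page->mapcount, 1) + 1 == 0;
	}

	do {
		/* First PTE mapping of this page? */
		if (atomic_fetch_add(&page->mapcount, 1) + 1 == 0) {
			if (atomic_fetch_add(&folio->nr_pages_mapped, 1) + 1 <
			    MODEL_ENTIRELY_MAPPED)
				nr++;
		}
	} while (page++, --nr_pages > 0);
	return nr;
}
```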

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/rmap.h | 13 +++++++++++++
mm/rmap.c | 26 ++++++++++++++++----------
2 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 9549d78928bb..327f1ca5a487 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -322,6 +322,11 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,

switch (level) {
case RMAP_LEVEL_PTE:
+ if (!folio_test_large(folio)) {
+ atomic_inc(&page->_mapcount);
+ break;
+ }
+
do {
atomic_inc(&page->_mapcount);
} while (page++, --nr_pages > 0);
@@ -405,6 +410,14 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
if (PageAnonExclusive(page + i))
return -EBUSY;
}
+
+ if (!folio_test_large(folio)) {
+ if (PageAnonExclusive(page))
+ ClearPageAnonExclusive(page);
+ atomic_inc(&page->_mapcount);
+ break;
+ }
+
do {
if (PageAnonExclusive(page))
ClearPageAnonExclusive(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index 56b313aa2ebf..4bde6d60db6c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1172,15 +1172,18 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,

switch (level) {
case RMAP_LEVEL_PTE:
+ if (!folio_test_large(folio)) {
+ nr = atomic_inc_and_test(&page->_mapcount);
+ break;
+ }
+
do {
first = atomic_inc_and_test(&page->_mapcount);
- if (first && folio_test_large(folio)) {
+ if (first) {
first = atomic_inc_return_relaxed(mapped);
- first = (first < ENTIRELY_MAPPED);
+ if (first < ENTIRELY_MAPPED)
+ nr++;
}
-
- if (first)
- nr++;
} while (page++, --nr_pages > 0);
break;
case RMAP_LEVEL_PMD:
@@ -1514,15 +1517,18 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,

switch (level) {
case RMAP_LEVEL_PTE:
+ if (!folio_test_large(folio)) {
+ nr = atomic_add_negative(-1, &page->_mapcount);
+ break;
+ }
+
do {
last = atomic_add_negative(-1, &page->_mapcount);
- if (last && folio_test_large(folio)) {
+ if (last) {
last = atomic_dec_return_relaxed(mapped);
- last = (last < ENTIRELY_MAPPED);
+ if (last < ENTIRELY_MAPPED)
+ nr++;
}
-
- if (last)
- nr++;
} while (page++, --nr_pages > 0);
break;
case RMAP_LEVEL_PMD:
--
2.44.0


2024-04-09 19:25:38

by David Hildenbrand

Subject: [PATCH v1 02/18] mm/rmap: always inline anon/file rmap duplication of a single PTE

As we grow the code, the compiler might make stupid decisions and
unnecessarily degrade fork() performance. Let's make sure to always inline
functions that operate on a single PTE so the compiler will always
optimize out the loop and avoid a function call.
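
The difference between the old macro and the new wrapper can be sketched in user space as follows. The model_* names are hypothetical, the worker is a simplified stand-in for __folio_dup_file_rmap(), and __always_inline is defined with the GCC/Clang attribute only if the C library has not already provided it (an assumption for a non-kernel build).

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed user-space definition; the kernel provides its own. */
#ifndef __always_inline
#define __always_inline inline __attribute__((__always_inline__))
#endif

/* Hypothetical worker: bump nr_pages consecutive per-page mapcounts. */
static __always_inline void model_dup_rmap(atomic_int *mapcount, int nr_pages)
{
	assert(nr_pages > 0);
	do {
		atomic_fetch_add(mapcount, 1);
	} while (mapcount++, --nr_pages > 0);
}

/* Batched variant: plain inline, so the compiler may emit a call. */
static inline void model_dup_rmap_ptes(atomic_int *mapcount, int nr_pages)
{
	model_dup_rmap(mapcount, nr_pages);
}

/* Before: a macro, so a single-PTE call inherits the batched variant's
 * inlining decision and may still run the loop at runtime. */
#define model_dup_rmap_pte_old(mapcount) model_dup_rmap_ptes((mapcount), 1)

/* After: call the always-inline worker directly with a constant 1; once
 * inlined, the compiler can fold the do/while loop into one increment. */
static __always_inline void model_dup_rmap_pte(atomic_int *mapcount)
{
	model_dup_rmap(mapcount, 1);
}
```

Both single-PTE helpers behave identically; the point of the change is that only the second one guarantees that the constant nr_pages == 1 reaches an inlined loop body the compiler can simplify.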

This is a preparation for maintaining a total mapcount for large folios.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/rmap.h | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 9bf9324214fc..9549d78928bb 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -347,8 +347,12 @@ static inline void folio_dup_file_rmap_ptes(struct folio *folio,
{
__folio_dup_file_rmap(folio, page, nr_pages, RMAP_LEVEL_PTE);
}
-#define folio_dup_file_rmap_pte(folio, page) \
- folio_dup_file_rmap_ptes(folio, page, 1)
+
+static __always_inline void folio_dup_file_rmap_pte(struct folio *folio,
+ struct page *page)
+{
+ __folio_dup_file_rmap(folio, page, 1, RMAP_LEVEL_PTE);
+}

/**
* folio_dup_file_rmap_pmd - duplicate a PMD mapping of a page range of a folio
@@ -448,8 +452,13 @@ static inline int folio_try_dup_anon_rmap_ptes(struct folio *folio,
return __folio_try_dup_anon_rmap(folio, page, nr_pages, src_vma,
RMAP_LEVEL_PTE);
}
-#define folio_try_dup_anon_rmap_pte(folio, page, vma) \
- folio_try_dup_anon_rmap_ptes(folio, page, 1, vma)
+
+static __always_inline int folio_try_dup_anon_rmap_pte(struct folio *folio,
+ struct page *page, struct vm_area_struct *src_vma)
+{
+ return __folio_try_dup_anon_rmap(folio, page, 1, src_vma,
+ RMAP_LEVEL_PTE);
+}

/**
* folio_try_dup_anon_rmap_pmd - try duplicating a PMD mapping of a page range
--
2.44.0


2024-04-09 19:25:39

by David Hildenbrand

Subject: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

We can now read the mapcount of large folios very efficiently. Use it to
improve our handling of partially-mappable folios, falling back
to making a guess only in case the folio is not "obviously mapped shared".

We can now better detect partially-mappable folios where the first page is
not mapped as "mapped shared", reducing "false negatives"; but false
negatives are still possible.
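
The decision logic of the new folio_likely_mapped_shared() can be modeled on a plain snapshot of the counters it reads. The struct below is hypothetical and stores unbiased values (the kernel's _mapcount is biased by -1, hence its "> 0" check on the first page).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot; all counts are unbiased. */
struct folio_snapshot {
	bool large;
	bool hugetlb;
	int nr_pages;
	int mapcount;            /* folio_mapcount() */
	int entire_mapcount;     /* folio_entire_mapcount() */
	int first_page_mapcount; /* mapcount of the first page */
};

static bool likely_mapped_shared(const struct folio_snapshot *f)
{
	assert(f->nr_pages >= 1);
	/* Only partially-mappable folios require more care. */
	if (!f->large || f->hugetlb)
		return f->mapcount > 1;
	/* A single mapping implies "mapped exclusively". */
	if (f->mapcount <= 1)
		return false;
	/* A PMD mapping plus another mapping, or more mappings than
	 * pages: some page must be mapped more than once. */
	if (f->entire_mapcount || f->mapcount > f->nr_pages)
		return true;
	/* Otherwise, guess based on the first page. */
	return f->first_page_mapcount > 1;
}
```

For a 16-page folio whose first page is unmapped, one process mapping pages 1-8 and another mapping pages 1-10 gives a mapcount of 18 > 16, so the folio is now reported as shared where the old first-page check returned false. If the two processes instead map disjoint ranges (mapcount 8), the result is still a false negative.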

While at it, fix up a wrong comment (false positive vs. false negative)
for KSM folios.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm.h | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1862a216af15..daf687f0e8e5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2183,7 +2183,7 @@ static inline size_t folio_size(struct folio *folio)
* indicate "mapped shared" (false positive) when two VMAs in the same MM
* cover the same file range.
* #. For (small) KSM folios, the return value can wrongly indicate "mapped
- * shared" (false negative), when the folio is mapped multiple times into
+ * shared" (false positive), when the folio is mapped multiple times into
* the same MM.
*
* Further, this function only considers current page table mappings that
@@ -2200,7 +2200,22 @@ static inline size_t folio_size(struct folio *folio)
*/
static inline bool folio_likely_mapped_shared(struct folio *folio)
{
- return page_mapcount(folio_page(folio, 0)) > 1;
+ int mapcount = folio_mapcount(folio);
+
+ /* Only partially-mappable folios require more care. */
+ if (!folio_test_large(folio) || unlikely(folio_test_hugetlb(folio)))
+ return mapcount > 1;
+
+ /* A single mapping implies "mapped exclusively". */
+ if (mapcount <= 1)
+ return false;
+
+ /* If any page is mapped more than once we treat it "mapped shared". */
+ if (folio_entire_mapcount(folio) || mapcount > folio_nr_pages(folio))
+ return true;
+
+ /* Let's guess based on the first subpage. */
+ return atomic_read(&folio->_mapcount) > 0;
}

#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
--
2.44.0


2024-04-09 19:27:05

by David Hildenbrand

Subject: [PATCH v1 09/18] mm/memory-failure: use folio_mapcount() in hwpoison_user_mappings()

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. We can only unmap full folios; page_mapped(),
which we check here, is translated to folio_mapped() -- based on
folio_mapcount(). So let's print the folio mapcount instead.

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory-failure.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 88359a185c5f..ee2f4b8905ef 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1628,8 +1628,8 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,

unmap_success = !page_mapped(p);
if (!unmap_success)
- pr_err("%#lx: failed to unmap page (mapcount=%d)\n",
- pfn, page_mapcount(p));
+ pr_err("%#lx: failed to unmap page (folio mapcount=%d)\n",
+ pfn, folio_mapcount(page_folio(p)));

/*
* try_to_unmap() might put mlocked page in lru cache, so call
--
2.44.0


2024-04-09 19:27:29

by David Hildenbrand

Subject: [PATCH v1 06/18] mm: make folio_mapcount() return 0 for small typed folios

We already handle it properly for large folios. Let's also return "0"
for small typed folios, like page_mapcount() currently would.

Consequently, folio_mapcount() will never return negative values for
typed folios, but may return negative values for underflows.
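
The small-folio branch can be sketched in isolation. As a simplification, the sketch assumes that a typed small folio stores a large negative value in the _mapcount word and that anything below a reserve threshold counts as a page type; MODEL_MAPCOUNT_RESERVE and MODEL_TYPED_VALUE are hypothetical stand-ins, and the kernel's actual encoding may differ.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical constants, not copied from any kernel version. */
#define MODEL_MAPCOUNT_RESERVE (-128)
#define MODEL_TYPED_VALUE (INT_MIN / 2) /* hypothetical page-type value */

static_assert(MODEL_TYPED_VALUE < MODEL_MAPCOUNT_RESERVE,
	      "page-type values must sort below the mapcount reserve");

static int model_page_type_has_type(int raw)
{
	return raw < MODEL_MAPCOUNT_RESERVE;
}

/* Small-folio branch of folio_mapcount() after this patch: typed folios
 * report 0, while a genuine underflow (raw == -2) still shows up as -1. */
static int model_small_folio_mapcount(int raw)
{
	return model_page_type_has_type(raw) ? 0 : raw + 1;
}
```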

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm.h | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index daf687f0e8e5..d453232bba62 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1260,12 +1260,19 @@ static inline int folio_large_mapcount(const struct folio *folio)
* references the entire folio counts exactly once, even when such special
* page table entries are comprised of multiple ordinary page table entries.
*
+ * Will report 0 for pages which cannot be mapped into userspace, such as
+ * slab, page tables and similar.
+ *
* Return: The number of times this folio is mapped.
*/
static inline int folio_mapcount(const struct folio *folio)
{
- if (likely(!folio_test_large(folio)))
- return atomic_read(&folio->_mapcount) + 1;
+ int mapcount;
+
+ if (likely(!folio_test_large(folio))) {
+ mapcount = atomic_read(&folio->_mapcount);
+ return page_type_has_type(mapcount) ? 0 : mapcount + 1;
+ }
return folio_large_mapcount(folio);
}

--
2.44.0


2024-04-09 19:27:43

by David Hildenbrand

Subject: [PATCH v1 10/18] mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range()

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary.

For tracing purposes, we use page_mapcount() in
__alloc_contig_migrate_range(). Adding that mapcount to total_mapped sounds
strange: total_migrated and total_reclaimed would count each page only
once, not multiple times.

But then, isolate_migratepages_range() adds each folio only once to the
list, so for a large folio we would only query the mapcount of its first
page, which says little about the folio as a whole.

Let's simply use folio_mapped() * folio_nr_pages(), which makes more
sense as nr_migratepages is also incremented by the number of pages in
the folio in case of successful migration.
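A small user-space model may help compare the two accounting schemes. The `model_folio` struct and its fields are hypothetical stand-ins for what the tracing code can observe on `cc->migratepages`; the example assumes a 16-page folio whose pages 4..15 are PTE-mapped once (so its first page is unmapped), a small folio mapped by three processes, and an unmapped small folio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one folio on cc->migratepages. */
struct model_folio {
	long nr_pages;     /* folio_nr_pages() */
	int head_mapcount; /* page_mapcount() of the first page */
	int mapcount;      /* folio_mapcount() */
};

static bool model_folio_mapped(const struct model_folio *f)
{
	return f->mapcount > 0;
}

/* Old accounting: mapcount of the first page only. */
static long model_total_mapped_old(const struct model_folio *f, size_t n)
{
	long total = 0;

	for (size_t i = 0; i < n; i++)
		total += f[i].head_mapcount;
	return total;
}

/* New accounting: every page of a mapped folio counts once, matching
 * how nr_migratepages is incremented per page on success. */
static long model_total_mapped_new(const struct model_folio *f, size_t n)
{
	long total = 0;

	for (size_t i = 0; i < n; i++)
		total += model_folio_mapped(&f[i]) * f[i].nr_pages;
	return total;
}
```

For that list, the old scheme reports 3 (only the small folio's three mappings), while the new one reports 17 (16 + 1 mapped pages).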

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/page_alloc.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 393366d4a704..40fc0f60e021 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6389,8 +6389,12 @@ int __alloc_contig_migrate_range(struct compact_control *cc,

if (trace_mm_alloc_contig_migrate_range_info_enabled()) {
total_reclaimed += nr_reclaimed;
- list_for_each_entry(page, &cc->migratepages, lru)
- total_mapped += page_mapcount(page);
+ list_for_each_entry(page, &cc->migratepages, lru) {
+ struct folio *folio = page_folio(page);
+
+ total_mapped += folio_mapped(folio) *
+ folio_nr_pages(folio);
+ }
}

ret = migrate_pages(&cc->migratepages, alloc_migration_target,
--
2.44.0


2024-04-09 19:27:58

by David Hildenbrand

Subject: [PATCH v1 07/18] mm/memory: use folio_mapcount() in zap_present_folio_ptes()

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. In zap_present_folio_ptes(), let's simply check
the folio mapcount. If there is an issue, the folio mapcount will
underflow at some point during unmapping either way.

As indicated already in commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap
with PTE-mapped THP"), we already documented "If we ever have a cheap
folio_mapcount(), we might just want to check for underflows there.".

There is no change for small folios. For large folios, we'll now catch
more underflows when batch-unmapping, because instead of only testing
the mapcount of the first subpage, we'll test if the folio mapcount
underflows.
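The difference can be illustrated with a simplified model of a 4-page folio that suffers an artificial double unmap. This sketch uses logical counts (0 == unmapped) rather than the kernel's -1-biased raw values, and `model_remove_rmap_ptes()` is only a rough, hypothetical analogue of folio_remove_rmap_ptes(). Pages 2-3 are first unmapped legitimately, then a buggy batch removes pages 0-3 again.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_PAGES 4

/* Hypothetical model with logical counts (0 == unmapped). */
struct model_folio {
	int page_mapcount[MODEL_NR_PAGES];
	int large_mapcount; /* sum of all page mapcounts */
};

/* Rough analogue of folio_remove_rmap_ptes() for nr pages from idx. */
static void model_remove_rmap_ptes(struct model_folio *f, int idx, int nr)
{
	for (int i = idx; i < idx + nr; i++)
		f->page_mapcount[i]--;
	f->large_mapcount -= nr;
}

/* Old sanity check: only the first page of the batch. */
static bool model_old_check_fires(const struct model_folio *f, int idx)
{
	return f->page_mapcount[idx] < 0;
}

/* New sanity check: the folio-wide mapcount. */
static bool model_new_check_fires(const struct model_folio *f)
{
	return f->large_mapcount < 0;
}
```

After the second removal, page 0 sits at 0, so the old check stays silent, while the folio mapcount is -2 and the new check fires.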

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 78422d1c7381..178492efb4af 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1502,8 +1502,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
if (!delay_rmap) {
folio_remove_rmap_ptes(folio, page, nr, vma);

- /* Only sanity-check the first page in a batch. */
- if (unlikely(page_mapcount(page) < 0))
+ if (unlikely(folio_mapcount(folio) < 0))
print_bad_pte(vma, addr, ptent, page);
}

--
2.44.0


2024-04-09 19:28:16

by David Hildenbrand

Subject: [PATCH v1 11/18] mm/migrate: use folio_likely_mapped_shared() in add_page_for_migration()

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. In add_page_for_migration(), we actually want to
check if the folio is mapped shared, to reject such folios. So let's
use folio_likely_mapped_shared() instead.

For small folios, fully mapped THP, and hugetlb folios, there is no change.
For partially mapped, shared THP, we should now do a better job at
rejecting such folios.
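As a rough illustration, the following user-space model compares the old check with the folio_likely_mapped_shared() logic as proposed in patch 05 of this series. The `model_folio` fields are hypothetical snapshots of the relevant counters, using logical (not -1-biased) values; the final fallback mirrors the kernel's `atomic_read(&folio->_mapcount) > 0`, i.e. "first page mapped at least twice".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of a folio's counters (logical values). */
struct model_folio {
	bool large;
	bool hugetlb;
	int nr_pages;
	int mapcount;            /* folio_mapcount() */
	int entire_mapcount;     /* folio_entire_mapcount() */
	int first_page_mapcount; /* PTE mappings of the first page */
};

/* Old: page_mapcount() of the first page, which includes the entire
 * (PMD) mapcount for large folios. */
static bool model_old_mapped_shared(const struct model_folio *f)
{
	return f->first_page_mapcount + f->entire_mapcount > 1;
}

/* New: modeled after the patch 05 version of the function. */
static bool model_new_mapped_shared(const struct model_folio *f)
{
	int mapcount = f->mapcount;

	if (!f->large || f->hugetlb)
		return mapcount > 1;
	if (mapcount <= 1)
		return false;
	if (f->entire_mapcount || mapcount > f->nr_pages)
		return true;
	/* Guess based on the first page: mapped more than once? */
	return f->first_page_mapcount > 1;
}
```

For a 16-page THP where two processes each map pages 4..15, the old check sees an unmapped first page and answers "exclusive", while the new one sees a mapcount of 24 > 16 and answers "shared". An exclusively PMD-mapped THP has a folio mapcount of 1 and is reported as exclusive by both.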

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/migrate.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 285072bca29c..d87ce32645d4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2140,7 +2140,7 @@ static int add_page_for_migration(struct mm_struct *mm, const void __user *p,
goto out_putfolio;

err = -EACCES;
- if (page_mapcount(page) > 1 && !migrate_all)
+ if (folio_likely_mapped_shared(folio) && !migrate_all)
goto out_putfolio;

err = -EBUSY;
--
2.44.0


2024-04-09 19:28:53

by David Hildenbrand

Subject: [PATCH v1 13/18] mm/filemap: use folio_mapcount() in filemap_unaccount_folio()

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary.

Let's use folio_mapcount() instead of page_mapcount() in filemap_unaccount_folio().

No functional change intended, because we're only dealing with small
folios.

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/filemap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index c668e11cd6ef..d4aa82ad5b59 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -168,7 +168,7 @@ static void filemap_unaccount_folio(struct address_space *mapping,
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);

if (mapping_exiting(mapping) && !folio_test_large(folio)) {
- int mapcount = page_mapcount(&folio->page);
+ int mapcount = folio_mapcount(folio);

if (folio_ref_count(folio) >= mapcount + 2) {
/*
--
2.44.0


2024-04-09 19:29:03

by David Hildenbrand

Subject: [PATCH v1 14/18] mm/migrate_device: use folio_mapcount() in migrate_vma_check_page()

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. Let's convert migrate_vma_check_page() to work on
a folio internally so we can remove the page_mapcount() usage.

Note that we reject any large folios.

There is a lot more folio conversion to be had, but that has to wait for
another day. No functional change intended.
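The reference accounting in migrate_vma_check_page() can be sketched as a pure function over a hypothetical snapshot of the folio. The struct and function names are illustrative assumptions, and this simplified model only accounts for the caller's single extra reference, the ZONE_DEVICE reference, and the page-cache (plus private) references; any remaining reference not explained by a mapping means the folio is held elsewhere.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of what the check inspects. */
struct model_folio {
	bool large;
	bool zone_device;
	bool has_mapping; /* folio_mapping() != NULL */
	bool has_private; /* folio_has_private() */
	int ref_count;    /* folio_ref_count() */
	int mapcount;     /* folio_mapcount() */
};

static bool model_check_page(const struct model_folio *f)
{
	/* The caller's own extra reference. */
	int extra = 1;

	/* Large folios are rejected outright. */
	if (f->large)
		return false;
	/* ZONE_DEVICE folios carry one extra reference. */
	if (f->zone_device)
		extra++;
	/* File-backed: page cache reference, plus one for private data. */
	if (f->has_mapping)
		extra += 1 + f->has_private;
	/* Unexplained references (e.g., a pin) make migration unsafe. */
	return (f->ref_count - extra) <= f->mapcount;
}
```

For example, an anonymous folio mapped once with a refcount of 2 passes, while the same folio with a refcount of 3 is rejected.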

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/migrate_device.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index d40b46ae9d65..b929b450b77c 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -324,6 +324,8 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
*/
static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
{
+ struct folio *folio = page_folio(page);
+
/*
* One extra ref because caller holds an extra reference, either from
* isolate_lru_page() for a regular page, or migrate_vma_collect() for
@@ -336,18 +338,18 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
* check them than regular pages, because they can be mapped with a pmd
* or with a pte (split pte mapping).
*/
- if (PageCompound(page))
+ if (folio_test_large(folio))
return false;

/* Page from ZONE_DEVICE have one extra reference */
- if (is_zone_device_page(page))
+ if (folio_is_zone_device(folio))
extra++;

/* For file back page */
- if (page_mapping(page))
- extra += 1 + page_has_private(page);
+ if (folio_mapping(folio))
+ extra += 1 + folio_has_private(folio);

- if ((page_count(page) - extra) > page_mapcount(page))
+ if ((folio_ref_count(folio) - extra) > folio_mapcount(folio))
return false;

return true;
--
2.44.0


2024-04-09 19:29:20

by David Hildenbrand

Subject: [PATCH v1 15/18] trace/events/page_ref: trace the raw page mapcount value

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. We already trace raw page->refcount, raw page->flags
and raw page->mapping, and don't involve any folios. Let's also trace the
raw mapcount value: it does not include the entire mapcount of large
folios, and we don't add "1" to it.

This makes a lot more sense when dealing with typed folios, and it's for
debugging purposes only either way.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/trace/events/page_ref.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/page_ref.h b/include/trace/events/page_ref.h
index 8a99c1cd417b..fe33a255b7d0 100644
--- a/include/trace/events/page_ref.h
+++ b/include/trace/events/page_ref.h
@@ -30,7 +30,7 @@ DECLARE_EVENT_CLASS(page_ref_mod_template,
__entry->pfn = page_to_pfn(page);
__entry->flags = page->flags;
__entry->count = page_ref_count(page);
- __entry->mapcount = page_mapcount(page);
+ __entry->mapcount = atomic_read(&page->_mapcount);
__entry->mapping = page->mapping;
__entry->mt = get_pageblock_migratetype(page);
__entry->val = v;
@@ -79,7 +79,7 @@ DECLARE_EVENT_CLASS(page_ref_mod_and_test_template,
__entry->pfn = page_to_pfn(page);
__entry->flags = page->flags;
__entry->count = page_ref_count(page);
- __entry->mapcount = page_mapcount(page);
+ __entry->mapcount = atomic_read(&page->_mapcount);
__entry->mapping = page->mapping;
__entry->mt = get_pageblock_migratetype(page);
__entry->val = v;
--
2.44.0


2024-04-09 19:29:54

by David Hildenbrand

Subject: [PATCH v1 16/18] xtensa/mm: convert check_tlb_entry() to sanity check folios

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. So let's convert check_tlb_entry() to perform
sanity checks on folios instead of pages.

This essentially already happened: page_count() is mapped to
folio_ref_count(), and page_mapped() to folio_mapped() internally.
However, we would have printed the page_mapcount(), which
does not really match what page_mapped() would have checked.

Let's simply print the folio mapcount to avoid using page_mapcount(). For
small folios there is no change.

Signed-off-by: David Hildenbrand <[email protected]>
---
arch/xtensa/mm/tlb.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c
index 4f974b74883c..d8b60d6e50a8 100644
--- a/arch/xtensa/mm/tlb.c
+++ b/arch/xtensa/mm/tlb.c
@@ -256,12 +256,13 @@ static int check_tlb_entry(unsigned w, unsigned e, bool dtlb)
dtlb ? 'D' : 'I', w, e, r0, r1, pte);
if (pte == 0 || !pte_present(__pte(pte))) {
struct page *p = pfn_to_page(r1 >> PAGE_SHIFT);
- pr_err("page refcount: %d, mapcount: %d\n",
- page_count(p),
- page_mapcount(p));
- if (!page_count(p))
+ struct folio *f = page_folio(p);
+
+ pr_err("folio refcount: %d, mapcount: %d\n",
+ folio_ref_count(f), folio_mapcount(f));
+ if (!folio_ref_count(f))
rc |= TLB_INSANE;
- else if (page_mapcount(p))
+ else if (folio_mapped(f))
rc |= TLB_SUSPICIOUS;
} else {
rc |= TLB_INSANE;
--
2.44.0


2024-04-09 19:30:18

by David Hildenbrand

Subject: [PATCH v1 17/18] mm/debug: print only page mapcount (excluding folio entire mapcount) in __dump_folio()

Let's simplify and only print the page mapcount: we already print the
large folio mapcount and the entire folio mapcount for large folios
separately; that should be sufficient to figure out what's happening.

While at it, print the page mapcount also if it had an underflow,
filtering out only typed pages.
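The change in the printed value can be modeled with two small functions. `raw` stands for the page's -1-biased `_mapcount` and `entire` for folio_entire_mapcount(); the reserve threshold and the `model_*` names are assumptions for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed threshold below which a raw value encodes a page type. */
#define MODEL_MAPCOUNT_RESERVE (-128)

/* Old value: clamps negatives (hiding underflows) and adds the
 * entire mapcount for large folios. */
static int model_dump_old(int raw, bool large, int entire)
{
	int mapcount = raw + 1;

	if (mapcount < 0)
		mapcount = 0;
	if (large)
		mapcount += entire;
	return mapcount;
}

/* New value: page mapcount only; typed pages print 0, while
 * underflows remain visible as negative numbers. */
static int model_dump_new(int raw)
{
	return raw < MODEL_MAPCOUNT_RESERVE ? 0 : raw + 1;
}
```

For a THP page that is PTE-mapped once while the folio is also PMD-mapped once, the old output is 2 and the new output is 1; for an underflowed raw value of -2, the old output hides the problem (0) while the new one prints -1.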

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/debug.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/debug.c b/mm/debug.c
index d064db42af54..69e524c3e601 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -55,15 +55,10 @@ static void __dump_folio(struct folio *folio, struct page *page,
unsigned long pfn, unsigned long idx)
{
struct address_space *mapping = folio_mapping(folio);
- int mapcount = atomic_read(&page->_mapcount) + 1;
+ int mapcount = atomic_read(&page->_mapcount);
char *type = "";

- /* Open-code page_mapcount() to avoid looking up a stale folio */
- if (mapcount < 0)
- mapcount = 0;
- if (folio_test_large(folio))
- mapcount += folio_entire_mapcount(folio);
-
+ mapcount = page_type_has_type(mapcount) ? 0 : mapcount + 1;
pr_warn("page: refcount:%d mapcount:%d mapping:%p index:%#lx pfn:%#lx\n",
folio_ref_count(folio), mapcount, mapping,
folio->index + idx, pfn);
--
2.44.0


2024-04-09 19:30:48

by David Hildenbrand

Subject: [PATCH v1 18/18] Documentation/admin-guide/cgroup-v1/memory.rst: don't reference page_mapcount()

Let's stop talking about page_mapcount().

Signed-off-by: David Hildenbrand <[email protected]>
---
Documentation/admin-guide/cgroup-v1/memory.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 46110e6a31bb..9cde26d33843 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -802,8 +802,8 @@ a page or a swap can be moved only when it is charged to the task's current
| | anonymous pages, file pages (and swaps) in the range mmapped by the task |
| | will be moved even if the task hasn't done page fault, i.e. they might |
| | not be the task's "RSS", but other task's "RSS" that maps the same file. |
-| | And mapcount of the page is ignored (the page can be moved even if |
-| | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to |
+| | The mapcount of the page is ignored (the page can be moved independent |
+| | of the mapcount). You must enable Swap Extension (see 2.4) to |
| | enable move of swap charges. |
+---+--------------------------------------------------------------------------+

--
2.44.0


2024-04-09 19:31:04

by David Hildenbrand

Subject: [PATCH v1 08/18] mm/huge_memory: use folio_mapcount() in zap_huge_pmd() sanity check

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. Let's similarly check for folio_mapcount() underflows
instead of page_mapcount() underflows like we do in
zap_present_folio_ptes() now.

Instead of the VM_BUG_ON(), we should actually be doing something like
print_bad_pte(). For now, let's keep it simple and use WARN_ON_ONCE(),
performing that check independently of DEBUG_VM.

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/huge_memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d8d2ed80b0bf..68ac27d229ef 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1851,7 +1851,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,

folio = page_folio(page);
folio_remove_rmap_pmd(folio, page, vma);
- VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+ WARN_ON_ONCE(folio_mapcount(folio) < 0);
VM_BUG_ON_PAGE(!PageHead(page), page);
} else if (thp_migration_supported()) {
swp_entry_t entry;
--
2.44.0


2024-04-09 19:32:54

by David Hildenbrand

Subject: [PATCH v1 12/18] sh/mm/cache: use folio_mapped() in copy_from_user_page()

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary.

We're already using folio_mapped() in copy_user_highpage() and
copy_to_user_page() for a similar purpose so ... let's also simply use
it for copy_from_user_page().

There is no change for small folios. Likely we won't stumble over many
large folios on sh in that code either way.

Signed-off-by: David Hildenbrand <[email protected]>
---
arch/sh/mm/cache.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/sh/mm/cache.c b/arch/sh/mm/cache.c
index 9bcaa5619eab..d8be352e14d2 100644
--- a/arch/sh/mm/cache.c
+++ b/arch/sh/mm/cache.c
@@ -84,7 +84,7 @@ void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
{
struct folio *folio = page_folio(page);

- if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
+ if (boot_cpu_data.dcache.n_aliases && folio_mapped(folio) &&
test_bit(PG_dcache_clean, &folio->flags)) {
void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
memcpy(dst, vfrom, len);
--
2.44.0


2024-04-09 20:14:00

by Zi Yan

Subject: Re: [PATCH v1 04/18] mm: track mapcount of large folios in single value

On 9 Apr 2024, at 15:22, David Hildenbrand wrote:

> Let's track the mapcount of large folios in a single value. The mapcount of
> a large folio currently corresponds to the sum of the entire mapcount and
> all page mapcounts.
>
> This sum is what we actually want to know in folio_mapcount() and it is
> also sufficient for implementing folio_mapped().
>
> With PTE-mapped THP becoming more important and more widely used, we want
> to avoid looping over all pages of a folio just to obtain the mapcount
> of large folios. The comment "In the common case, avoid the loop when no
> pages mapped by PTE" in folio_total_mapcount() no longer holds for
> mTHP that are always mapped by PTE.
>
> Further, we are planning on using folio_mapcount() more
> frequently, and might even want to remove page mapcounts for large
> folios in some kernel configs. Therefore, allow for reading the mapcount of
> large folios efficiently and atomically without looping over any pages.
>
> Maintain the mapcount also for hugetlb pages for simplicity. Use the new
> mapcount to implement folio_mapcount() and folio_mapped(). Make
> page_mapped() simply call folio_mapped(). We can now get rid of
> folio_large_is_mapped().
>
> _nr_pages_mapped is now only used in rmap code and for debugging
> purposes. Keep folio_nr_pages_mapped() around, but document that its use
> should be limited to rmap internals and debugging purposes.
>
> This change implies one additional atomic add/sub whenever
> mapping/unmapping (parts of) a large folio.
>
> As we now batch RMAP operations for PTE-mapped THP during fork(),
> during unmap/zap, and when PTE-remapping a PMD-mapped THP, and we adjust
> the large mapcount for a PTE batch only once, the added overhead in the
> common case is small. Only when unmapping individual pages of a large folio
> (e.g., during COW), the overhead might be bigger in comparison, but it's
> essentially one additional atomic operation.
>
> Note that before the new mapcount would overflow, already our refcount
> would overflow: each mapping requires a folio reference. Extend the
> focumentation of folio_mapcount().

s/focumentation/documentation/ ;)

--
Best Regards,
Yan, Zi



2024-04-10 08:21:09

by David Hildenbrand

Subject: Re: [PATCH v1 04/18] mm: track mapcount of large folios in single value

On 09.04.24 22:13, Zi Yan wrote:
> On 9 Apr 2024, at 15:22, David Hildenbrand wrote:
>
>> Let's track the mapcount of large folios in a single value. The mapcount of
>> a large folio currently corresponds to the sum of the entire mapcount and
>> all page mapcounts.
>>
>> This sum is what we actually want to know in folio_mapcount() and it is
>> also sufficient for implementing folio_mapped().
>>
>> With PTE-mapped THP becoming more important and more widely used, we want
>> to avoid looping over all pages of a folio just to obtain the mapcount
>> of large folios. The comment "In the common case, avoid the loop when no
>> pages mapped by PTE" in folio_total_mapcount() no longer holds for
>> mTHP that are always mapped by PTE.
>>
>> Further, we are planning on using folio_mapcount() more
>> frequently, and might even want to remove page mapcounts for large
>> folios in some kernel configs. Therefore, allow for reading the mapcount of
>> large folios efficiently and atomically without looping over any pages.
>>
>> Maintain the mapcount also for hugetlb pages for simplicity. Use the new
>> mapcount to implement folio_mapcount() and folio_mapped(). Make
>> page_mapped() simply call folio_mapped(). We can now get rid of
>> folio_large_is_mapped().
>>
>> _nr_pages_mapped is now only used in rmap code and for debugging
>> purposes. Keep folio_nr_pages_mapped() around, but document that its use
>> should be limited to rmap internals and debugging purposes.
>>
>> This change implies one additional atomic add/sub whenever
>> mapping/unmapping (parts of) a large folio.
>>
>> As we now batch RMAP operations for PTE-mapped THP during fork(),
>> during unmap/zap, and when PTE-remapping a PMD-mapped THP, and we adjust
>> the large mapcount for a PTE batch only once, the added overhead in the
>> common case is small. Only when unmapping individual pages of a large folio
>> (e.g., during COW), the overhead might be bigger in comparison, but it's
>> essentially one additional atomic operation.
>>
>> Note that before the new mapcount would overflow, already our refcount
>> would overflow: each mapping requires a folio reference. Extend the
>> focumentation of folio_mapcount().
>
> s/focumentation/documentation/ ;)

Thanks! :)

--
Cheers,

David / dhildenb


2024-04-16 10:47:28

by David Hildenbrand

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

On 16.04.24 12:40, Lance Yang wrote:
> Hey David,
>
> Maybe I spotted a bug below.

Thanks for the review!

>
> [...]
> static inline bool folio_likely_mapped_shared(struct folio *folio)
> {
> - return page_mapcount(folio_page(folio, 0)) > 1;
> + int mapcount = folio_mapcount(folio);
> +
> + /* Only partially-mappable folios require more care. */
> + if (!folio_test_large(folio) || unlikely(folio_test_hugetlb(folio)))
> + return mapcount > 1;
> +
> + /* A single mapping implies "mapped exclusively". */
> + if (mapcount <= 1)
> + return false;
> +
> + /* If any page is mapped more than once we treat it "mapped shared". */
> + if (folio_entire_mapcount(folio) || mapcount > folio_nr_pages(folio))
> + return true;
>
> bug: if a PMD-mapped THP is exclusively mapped, the folio_entire_mapcount()
> function will return 1 (atomic_read(&folio->_entire_mapcount) + 1).

If it's exclusively mapped, then folio_mapcount(folio)==1. In which case
the previous statement:

if (mapcount <= 1)
return false;

Catches it.

IOW, once we reach this point we know that folio_mapcount(folio) > 1, and
there must be something else besides the entire mapping ("more than once").


Or did I not address your concern?

--
Cheers,

David / dhildenb


2024-04-16 10:48:22

by Lance Yang

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

Hey David,

Maybe I spotted a bug below.

[...]
static inline bool folio_likely_mapped_shared(struct folio *folio)
{
- return page_mapcount(folio_page(folio, 0)) > 1;
+ int mapcount = folio_mapcount(folio);
+
+ /* Only partially-mappable folios require more care. */
+ if (!folio_test_large(folio) || unlikely(folio_test_hugetlb(folio)))
+ return mapcount > 1;
+
+ /* A single mapping implies "mapped exclusively". */
+ if (mapcount <= 1)
+ return false;
+
+ /* If any page is mapped more than once we treat it "mapped shared". */
+ if (folio_entire_mapcount(folio) || mapcount > folio_nr_pages(folio))
+ return true;

bug: if a PMD-mapped THP is exclusively mapped, the folio_entire_mapcount()
function will return 1 (atomic_read(&folio->_entire_mapcount) + 1).

IIUC, when mapping a PMD entry for the entire THP, folio->_entire_mapcount
increments from -1 to 0.

Thanks,
Lance

+
+ /* Let's guess based on the first subpage. */
+ return atomic_read(&folio->_mapcount) > 0;
}
[...]

2024-04-16 10:53:56

by Lance Yang

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

On Tue, Apr 16, 2024 at 6:47 PM David Hildenbrand <[email protected]> wrote:
>
> On 16.04.24 12:40, Lance Yang wrote:
> > Hey David,
> >
> > Maybe I spotted a bug below.
>
> Thanks for the review!
>
> >
> > [...]
> > static inline bool folio_likely_mapped_shared(struct folio *folio)
> > {
> > - return page_mapcount(folio_page(folio, 0)) > 1;
> > + int mapcount = folio_mapcount(folio);
> > +
> > + /* Only partially-mappable folios require more care. */
> > + if (!folio_test_large(folio) || unlikely(folio_test_hugetlb(folio)))
> > + return mapcount > 1;
> > +
> > + /* A single mapping implies "mapped exclusively". */
> > + if (mapcount <= 1)
> > + return false;
> > +
> > + /* If any page is mapped more than once we treat it "mapped shared". */
> > + if (folio_entire_mapcount(folio) || mapcount > folio_nr_pages(folio))
> > + return true;
> >
> > bug: if a PMD-mapped THP is exclusively mapped, the folio_entire_mapcount()
> > function will return 1 (atomic_read(&folio->_entire_mapcount) + 1).
>
> If it's exclusively mapped, then folio_mapcount(folio)==1. In which case
> the previous statement:
>
> if (mapcount <= 1)
> return false;
>
> Catches it.

You're right!

>
> IOW, once we reach this point we know that folio_mapcount(folio) > 1, and
> there must be something else besides the entire mapping ("more than once").
>
>
> Or did I not address your concern?

Sorry, my mistake :(

Thanks,
Lance

>
> --
> Cheers,
>
> David / dhildenb
>

2024-04-16 10:54:37

by David Hildenbrand

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

On 16.04.24 12:52, Lance Yang wrote:
> On Tue, Apr 16, 2024 at 6:47 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 16.04.24 12:40, Lance Yang wrote:
>>> Hey David,
>>>
>>> Maybe I spotted a bug below.
>>
>> Thanks for the review!
>>
>>>
>>> [...]
>>> static inline bool folio_likely_mapped_shared(struct folio *folio)
>>> {
>>> - return page_mapcount(folio_page(folio, 0)) > 1;
>>> + int mapcount = folio_mapcount(folio);
>>> +
>>> + /* Only partially-mappable folios require more care. */
>>> + if (!folio_test_large(folio) || unlikely(folio_test_hugetlb(folio)))
>>> + return mapcount > 1;
>>> +
>>> + /* A single mapping implies "mapped exclusively". */
>>> + if (mapcount <= 1)
>>> + return false;
>>> +
>>> + /* If any page is mapped more than once we treat it "mapped shared". */
>>> + if (folio_entire_mapcount(folio) || mapcount > folio_nr_pages(folio))
>>> + return true;
>>>
>>> bug: if a PMD-mapped THP is exclusively mapped, the folio_entire_mapcount()
>>> function will return 1 (atomic_read(&folio->_entire_mapcount) + 1).
>>
>> If it's exclusively mapped, then folio_mapcount(folio)==1. In which case
>> the previous statement:
>>
>> if (mapcount <= 1)
>> return false;
>>
>> Catches it.
>
> You're right!
>
>>
>> IOW, once we reach this point we know that folio_mapcount(folio) > 1, and
>> there must be something else besides the entire mapping ("more than once").
>>
>>
>> Or did I not address your concern?
>
> Sorry, my mistake :(

No worries, thanks for the review and thinking this through!

--
Cheers,

David / dhildenb


2024-04-18 14:50:46

by Lance Yang

Subject: Re: [PATCH v1 04/18] mm: track mapcount of large folios in single value

Hey David,

FWIW, just a nit below.

diff --git a/mm/rmap.c b/mm/rmap.c
index 2608c40dffad..08bb6834cf72 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1143,7 +1143,6 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
int *nr_pmdmapped)
{
atomic_t *mapped = &folio->_nr_pages_mapped;
- const int orig_nr_pages = nr_pages;
int first, nr = 0;

__folio_rmap_sanity_checks(folio, page, nr_pages, level);
@@ -1155,6 +1154,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
break;
}

+ atomic_add(nr_pages, &folio->_large_mapcount);
do {
first = atomic_inc_and_test(&page->_mapcount);
if (first) {
@@ -1163,7 +1163,6 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
nr++;
}
} while (page++, --nr_pages > 0);
- atomic_add(orig_nr_pages, &folio->_large_mapcount);
break;
case RMAP_LEVEL_PMD:
first = atomic_inc_and_test(&folio->_entire_mapcount);

Thanks,
Lance

2024-04-18 15:10:07

by David Hildenbrand

Subject: Re: [PATCH v1 04/18] mm: track mapcount of large folios in single value

On 18.04.24 16:50, Lance Yang wrote:
> Hey David,
>
> FWIW, just a nit below.

Hi!

Thanks, but that was done on purpose.

This way, we'll have a memory barrier (due to at least one
atomic_inc_and_test()) between incrementing the folio refcount
(happening before the rmap change) and incrementing the mapcount.

Is it required? Not 100% sure, refcount vs. mapcount checks are always a
bit racy. But doing it this way lets me sleep better at night ;)

[with no subpage mapcounts, we'd do the atomic_inc_and_test on the large
mapcount and have the memory barrier there again; but that's stuff for
the future]
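The ordering argument can be sketched with a user-space model built on C11 atomics. Per the kernel's Documentation/atomic_t.txt, value-returning RMW operations such as atomic_inc_and_test() are fully ordered, while atomic_add() is unordered; in this hypothetical model a seq_cst `atomic_fetch_add()` loosely stands in for the former and a relaxed add for the latter. The struct, the helper names, and the simplified `nr_pages_mapped` handling are assumptions, not the actual __folio_add_rmap() code.

```c
#include <assert.h>
#include <stdatomic.h>

#define MODEL_NR_PAGES 16

struct model_folio {
	atomic_int refcount;
	atomic_int large_mapcount;                /* logical count in this model */
	atomic_int nr_pages_mapped;
	atomic_int page_mapcount[MODEL_NR_PAGES]; /* -1 biased, like _mapcount */
};

static void model_folio_init(struct model_folio *f)
{
	atomic_init(&f->refcount, 1);
	atomic_init(&f->large_mapcount, 0);
	atomic_init(&f->nr_pages_mapped, 0);
	for (int i = 0; i < MODEL_NR_PAGES; i++)
		atomic_init(&f->page_mapcount[i], -1);
}

/* Map nr pages starting at idx by PTE. The caller is expected to have
 * taken folio references before calling this. */
static void model_add_rmap_ptes(struct model_folio *f, int idx, int nr)
{
	for (int i = idx; i < idx + nr; i++) {
		/* seq_cst RMW: rough stand-in for atomic_inc_and_test(). */
		if (atomic_fetch_add(&f->page_mapcount[i], 1) == -1)
			atomic_fetch_add_explicit(&f->nr_pages_mapped, 1,
						  memory_order_relaxed);
	}
	/* Relaxed RMW after the loop: rough stand-in for atomic_add(), so
	 * at least one ordered RMW separates it from the refcount update. */
	atomic_fetch_add_explicit(&f->large_mapcount, nr, memory_order_relaxed);
}
```

Mapping pages 0..3 and then remapping pages 0..1 leaves the large mapcount at 6, while only 4 distinct pages are counted as mapped.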

Thanks!

>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2608c40dffad..08bb6834cf72 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1143,7 +1143,6 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
> int *nr_pmdmapped)
> {
> atomic_t *mapped = &folio->_nr_pages_mapped;
> - const int orig_nr_pages = nr_pages;
> int first, nr = 0;
>
> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> @@ -1155,6 +1154,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
> break;
> }
>
> + atomic_add(nr_pages, &folio->_large_mapcount);
> do {
> first = atomic_inc_and_test(&page->_mapcount);
> if (first) {
> @@ -1163,7 +1163,6 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
> nr++;
> }
> } while (page++, --nr_pages > 0);
> - atomic_add(orig_nr_pages, &folio->_large_mapcount);
> break;
> case RMAP_LEVEL_PMD:
> first = atomic_inc_and_test(&folio->_entire_mapcount);
>
> Thanks,
> Lance
>

--
Cheers,

David / dhildenb


2024-04-19 00:32:27

by Lance Yang

Subject: Re: [PATCH v1 04/18] mm: track mapcount of large folios in single value

On Thu, Apr 18, 2024 at 11:09 PM David Hildenbrand <[email protected]> wrote:
>
> On 18.04.24 16:50, Lance Yang wrote:
> > Hey David,
> >
> > FWIW, just a nit below.
>
> Hi!
>

Thanks for clarifying!

> Thanks, but that was done on purpose.
>
> This way, we'll have a memory barrier (due to at least one
> atomic_inc_and_test()) between incrementing the folio refcount
> (happening before the rmap change) and incrementing the mapcount.
>
> Is it required? Not 100% sure, refcount vs. mapcount checks are always a
> bit racy. But doing it this way lets me sleep better at night ;)

Yep, I understood :)

Thanks,
Lance

>
> [with no subpage mapcounts, we'd do the atomic_inc_and_test on the large
> mapcount and have the memory barrier there again; but that's stuff for
> the future]
>
> Thanks!



>
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 2608c40dffad..08bb6834cf72 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1143,7 +1143,6 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
> > int *nr_pmdmapped)
> > {
> > atomic_t *mapped = &folio->_nr_pages_mapped;
> > - const int orig_nr_pages = nr_pages;
> > int first, nr = 0;
> >
> > __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> > @@ -1155,6 +1154,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
> > break;
> > }
> >
> > + atomic_add(nr_pages, &folio->_large_mapcount);
> > do {
> > first = atomic_inc_and_test(&page->_mapcount);
> > if (first) {
> > @@ -1163,7 +1163,6 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
> > nr++;
> > }
> > } while (page++, --nr_pages > 0);
> > - atomic_add(orig_nr_pages, &folio->_large_mapcount);
> > break;
> > case RMAP_LEVEL_PMD:
> > first = atomic_inc_and_test(&folio->_entire_mapcount);
> >
> > Thanks,
> > Lance
> >
>
> --
> Cheers,
>
> David / dhildenb
>

2024-04-19 02:26:29

by Yin, Fengwei

Subject: Re: [PATCH v1 02/18] mm/rmap: always inline anon/file rmap duplication of a single PTE



On 4/10/2024 3:22 AM, David Hildenbrand wrote:
> As we grow the code, the compiler might make stupid decisions and
> unnecessarily degrade fork() performance. Let's make sure to always inline
> functions that operate on a single PTE so the compiler will always
> optimize out the loop and avoid a function call.
>
> This is a preparation for maintaining a total mapcount for large folios.
>
> Signed-off-by: David Hildenbrand<[email protected]>
The patch looks good to me. Just curious: Is this change driven by code
review or by performance profiling data? Thanks.


Regards
Yin, Fengwei

2024-04-19 02:30:18

by Yin, Fengwei

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios



On 4/10/2024 3:22 AM, David Hildenbrand wrote:
> @@ -2200,7 +2200,22 @@ static inline size_t folio_size(struct folio *folio)
>   */
>  static inline bool folio_likely_mapped_shared(struct folio *folio)
>  {
> -	return page_mapcount(folio_page(folio, 0)) > 1;
> +	int mapcount = folio_mapcount(folio);
> +
> +	/* Only partially-mappable folios require more care. */
> +	if (!folio_test_large(folio) || unlikely(folio_test_hugetlb(folio)))
> +		return mapcount > 1;
My understanding is that the mapcount > folio_nr_pages(folio) check also
covers order-0 folios, and folio_entire_mapcount() can cover hugetlb (I
am not 100% sure about this one). I am wondering whether we can drop the
above two lines? Thanks.


Regards
Yin, Fengwei

> +
> +	/* A single mapping implies "mapped exclusively". */
> +	if (mapcount <= 1)
> +		return false;
> +
> +	/* If any page is mapped more than once we treat it "mapped shared". */
> +	if (folio_entire_mapcount(folio) || mapcount > folio_nr_pages(folio))
> +		return true;
> +
> +	/* Let's guess based on the first subpage. */
> +	return atomic_read(&folio->_mapcount) > 0;
>  }


2024-04-19 09:14:49

by David Hildenbrand

Subject: Re: [PATCH v1 02/18] mm/rmap: always inline anon/file rmap duplication of a single PTE

On 19.04.24 04:25, Yin, Fengwei wrote:
>
>
> On 4/10/2024 3:22 AM, David Hildenbrand wrote:
>> As we grow the code, the compiler might make stupid decisions and
>> unnecessarily degrade fork() performance. Let's make sure to always inline
>> functions that operate on a single PTE so the compiler will always
>> optimize out the loop and avoid a function call.
>>
>> This is a preparation for maintaining a total mapcount for large folios.
>>
>> Signed-off-by: David Hildenbrand <[email protected]>
> The patch looks good to me. Just curious: was this change driven by code
> review or by performance profiling? Thanks.

It was identified while observing a performance degradation with small
folios in the fork() microbenchmark discussed in the cover letter
(mentioned here as "unnecessarily degrade fork() performance").

The added atomic_add() was enough to make the compiler stop inlining the
function (and thereby stop optimizing out nr_pages); instead, it inserted
a call to an out-of-line function in which nr_pages is not optimized out.
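A rough illustration of the inlining point, assuming GCC or Clang: the kernel's __always_inline annotation corresponds to the always_inline function attribute used here. toy_dup_rmap() and toy_dup_rmap_pte() are hypothetical stand-ins for a batched rmap helper and its single-PTE wrapper, operating on plain ints rather than atomics.

```c
#include <assert.h>

/*
 * Hypothetical batched helper: increment nr_pages toy mapcounts and count
 * those that went from -1 ("unmapped") to 0. nr_pages is an ordinary
 * runtime parameter here.
 */
static inline __attribute__((always_inline))
int toy_dup_rmap(int *mapcounts, int nr_pages)
{
	int nr = 0;

	do {
		if ((*mapcounts)++ == -1)
			nr++;
	} while (mapcounts++, --nr_pages > 0);
	return nr;
}

/*
 * Hypothetical single-PTE wrapper. Because both functions must be inlined,
 * every call site sees nr_pages == 1 as a constant, so the compiler can
 * drop the loop instead of calling an out-of-line copy of toy_dup_rmap().
 */
static inline __attribute__((always_inline))
int toy_dup_rmap_pte(int *mapcount)
{
	return toy_dup_rmap(mapcount, 1);
}
```

Running this only checks behavior; whether the loop and call actually disappear is an expectation that would have to be confirmed by inspecting the optimized assembly of a caller of toy_dup_rmap_pte().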

--
Cheers,

David / dhildenb


2024-04-19 09:28:18

by David Hildenbrand

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

On 19.04.24 04:29, Yin, Fengwei wrote:
>
>
> On 4/10/2024 3:22 AM, David Hildenbrand wrote:
>> @@ -2200,7 +2200,22 @@ static inline size_t folio_size(struct folio *folio)
>>   */
>>  static inline bool folio_likely_mapped_shared(struct folio *folio)
>>  {
>> -	return page_mapcount(folio_page(folio, 0)) > 1;
>> +	int mapcount = folio_mapcount(folio);
>> +
>> +	/* Only partially-mappable folios require more care. */
>> +	if (!folio_test_large(folio) || unlikely(folio_test_hugetlb(folio)))
>> +		return mapcount > 1;
> My understanding is that the mapcount > folio_nr_pages(folio) check also
> covers order-0 folios, and folio_entire_mapcount() can cover hugetlb (I
> am not 100% sure about this one). I am wondering whether we can drop the
> above two lines? Thanks.

folio_entire_mapcount() does not apply to small folios, so we must not
call it for small folios.

Regarding hugetlb, subpage mapcounts are completely unused, except the
mapcount of subpage 0, which is now *always* negative (it stores a page
type), so that value cannot be trusted at all.

So in the end, it looked cleanest to special-case only partially-mappable
folios, where we know that the entire mapcount exists and that the
mapcount of subpage 0 actually stores something reasonable (not a type).
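To make the case analysis concrete, here is a small sketch of the decision order in folio_likely_mapped_shared(). It works on a plain snapshot of counters (struct toy_folio_counts and toy_likely_mapped_shared() are hypothetical, not kernel API) and ignores concurrency. The small/hugetlb early return comes first because folio_entire_mapcount() must not be called for small folios and the page 0 mapcount of a hugetlb folio holds a page type.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the values the heuristic reads. */
struct toy_folio_counts {
	bool large;                  /* folio_test_large() */
	bool hugetlb;                /* folio_test_hugetlb() */
	int nr_pages;                /* folio_nr_pages() */
	int mapcount;                /* folio_mapcount() */
	int entire_mapcount;         /* folio_entire_mapcount(), large folios only */
	int first_page_raw_mapcount; /* raw _mapcount of page 0; -1 means unmapped */
};

static bool toy_likely_mapped_shared(const struct toy_folio_counts *f)
{
	/* Small and hugetlb folios: the total mapcount is all we can trust. */
	if (!f->large || f->hugetlb)
		return f->mapcount > 1;

	/* A single mapping implies "mapped exclusively". */
	if (f->mapcount <= 1)
		return false;

	/*
	 * A PMD mapping plus any other mapping, or more mappings than pages,
	 * means some page is mapped more than once: treat as "mapped shared".
	 */
	if (f->entire_mapcount || f->mapcount > f->nr_pages)
		return true;

	/* Otherwise guess from page 0; this can still give a false negative. */
	return f->first_page_raw_mapcount > 0;
}
```

For a 4-page folio, one process mapping all four pages and two processes each mapping two different pages (page 0 mapped once) produce identical counters, so both are reported as not shared; that is the remaining false negative. If instead two processes each map pages 1 to 3 while page 0 is unmapped, the mapcount (6) exceeds the number of pages, a case the old page-0-only check would have missed.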

Thanks!

--
Cheers,

David / dhildenb


2024-04-19 13:49:13

by Yin, Fengwei

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios



On 4/19/2024 5:19 PM, David Hildenbrand wrote:
> On 19.04.24 04:29, Yin, Fengwei wrote:
>>
>>
>> On 4/10/2024 3:22 AM, David Hildenbrand wrote:
>>> @@ -2200,7 +2200,22 @@ static inline size_t folio_size(struct folio *folio)
>>>   */
>>>  static inline bool folio_likely_mapped_shared(struct folio *folio)
>>>  {
>>> -	return page_mapcount(folio_page(folio, 0)) > 1;
>>> +	int mapcount = folio_mapcount(folio);
>>> +
>>> +	/* Only partially-mappable folios require more care. */
>>> +	if (!folio_test_large(folio) || unlikely(folio_test_hugetlb(folio)))
>>> +		return mapcount > 1;
>> My understanding is that the mapcount > folio_nr_pages(folio) check also
>> covers order-0 folios, and folio_entire_mapcount() can cover hugetlb (I
>> am not 100% sure about this one). I am wondering whether we can drop the
>> above two lines? Thanks.
>
> folio_entire_mapcount() does not apply to small folios, so we must not
> call it for small folios.
Right. I missed this part. Thanks for the clarification.


Regards
Yin, Fengwei

>
> Regarding hugetlb, subpage mapcounts are completely unused, except the
> mapcount of subpage 0, which is now *always* negative (it stores a page
> type), so that value cannot be trusted at all.
>
> So in the end, it looked cleanest to special-case only partially-mappable
> folios, where we know that the entire mapcount exists and that the
> mapcount of subpage 0 actually stores something reasonable (not a type).
>
> Thanks!
>

2024-04-19 13:49:40

by David Hildenbrand

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

On 19.04.24 15:47, Yin, Fengwei wrote:
>
>
> On 4/19/2024 5:19 PM, David Hildenbrand wrote:
>> On 19.04.24 04:29, Yin, Fengwei wrote:
>>>
>>>
>>> On 4/10/2024 3:22 AM, David Hildenbrand wrote:
>>>> @@ -2200,7 +2200,22 @@ static inline size_t folio_size(struct folio *folio)
>>>>   */
>>>>  static inline bool folio_likely_mapped_shared(struct folio *folio)
>>>>  {
>>>> -	return page_mapcount(folio_page(folio, 0)) > 1;
>>>> +	int mapcount = folio_mapcount(folio);
>>>> +
>>>> +	/* Only partially-mappable folios require more care. */
>>>> +	if (!folio_test_large(folio) || unlikely(folio_test_hugetlb(folio)))
>>>> +		return mapcount > 1;
>>> My understanding is that the mapcount > folio_nr_pages(folio) check also
>>> covers order-0 folios, and folio_entire_mapcount() can cover hugetlb (I
>>> am not 100% sure about this one). I am wondering whether we can drop the
>>> above two lines? Thanks.
>>
>> folio_entire_mapcount() does not apply to small folios, so we must not
>> call it for small folios.
> Right. I missed this part. Thanks for the clarification.

Thanks for the review!

--
Cheers,

David / dhildenb


2024-04-19 14:07:49

by Yin, Fengwei

Subject: Re: [PATCH v1 04/18] mm: track mapcount of large folios in single value



On 4/10/2024 3:22 AM, David Hildenbrand wrote:
> Let's track the mapcount of large folios in a single value. The mapcount of
> a large folio currently corresponds to the sum of the entire mapcount and
> all page mapcounts.
>
> This sum is what we actually want to know in folio_mapcount() and it is
> also sufficient for implementing folio_mapped().
>
> With PTE-mapped THP becoming more important and more widely used, we want
> to avoid looping over all pages of a folio just to obtain the mapcount
> of large folios. The comment "In the common case, avoid the loop when no
> pages mapped by PTE" in folio_total_mapcount() no longer holds for
> mTHP, which are always mapped by PTE.
>
> Further, we are planning on using folio_mapcount() more
> frequently, and might even want to remove page mapcounts for large
> folios in some kernel configs. Therefore, allow for reading the mapcount of
> large folios efficiently and atomically without looping over any pages.
>
> Maintain the mapcount also for hugetlb pages for simplicity. Use the new
> mapcount to implement folio_mapcount() and folio_mapped(). Make
> page_mapped() simply call folio_mapped(). We can now get rid of
> folio_large_is_mapped().
>
> _nr_pages_mapped is now only used in rmap code and for debugging
> purposes. Keep folio_nr_pages_mapped() around, but document that its use
> should be limited to rmap internals and debugging purposes.
>
> This change implies one additional atomic add/sub whenever
> mapping/unmapping (parts of) a large folio.
>
> As we now batch RMAP operations for PTE-mapped THP during fork(),
> during unmap/zap, and when PTE-remapping a PMD-mapped THP, and we adjust
> the large mapcount for a PTE batch only once, the added overhead in the
> common case is small. Only when unmapping individual pages of a large folio
> (e.g., during COW) might the overhead be bigger in comparison, but it's
> essentially one additional atomic operation.
>
> Note that our refcount would overflow before the new mapcount could:
> each mapping requires a folio reference. Extend the documentation of
> folio_mapcount().
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Yin Fengwei <[email protected]>
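A minimal sketch of the idea behind tracking the mapcount of a large folio in a single value, assuming a toy folio of eight pages; struct toy_large_folio and the toy_* functions are hypothetical and only model the counters. Every map or unmap operation also adjusts one summary counter (once per PTE batch), so reading the mapcount becomes a single atomic load instead of a loop over the entire mapcount and every page mapcount.

```c
#include <assert.h>
#include <stdatomic.h>

#define TOY_NR_PAGES 8 /* hypothetical folio size for the sketch */

/* Counters start at -1 ("unmapped"), as with _mapcount. */
struct toy_large_folio {
	atomic_int entire_mapcount;             /* PMD mappings */
	atomic_int page_mapcount[TOY_NR_PAGES]; /* PTE mappings per page */
	atomic_int large_mapcount;              /* single summary value */
};

static void toy_init(struct toy_large_folio *f)
{
	atomic_init(&f->entire_mapcount, -1);
	atomic_init(&f->large_mapcount, -1);
	for (int i = 0; i < TOY_NR_PAGES; i++)
		atomic_init(&f->page_mapcount[i], -1);
}

static void toy_map_pmd(struct toy_large_folio *f)
{
	atomic_fetch_add(&f->entire_mapcount, 1);
	atomic_fetch_add(&f->large_mapcount, 1); /* one mapping of the whole folio */
}

static void toy_map_ptes(struct toy_large_folio *f, int first, int nr)
{
	assert(first >= 0 && nr > 0 && first + nr <= TOY_NR_PAGES);
	for (int i = first; i < first + nr; i++)
		atomic_fetch_add(&f->page_mapcount[i], 1);
	atomic_fetch_add(&f->large_mapcount, nr); /* one update per batch */
}

static void toy_unmap_pte(struct toy_large_folio *f, int page)
{
	atomic_fetch_sub(&f->page_mapcount[page], 1);
	atomic_fetch_sub(&f->large_mapcount, 1);
}

/* Old approach: O(nr_pages) loop, and not one atomic snapshot. */
static int toy_mapcount_by_summing(struct toy_large_folio *f)
{
	int sum = atomic_load(&f->entire_mapcount) + 1;

	for (int i = 0; i < TOY_NR_PAGES; i++)
		sum += atomic_load(&f->page_mapcount[i]) + 1;
	return sum;
}

/* New approach: a single atomic read. */
static int toy_mapcount(struct toy_large_folio *f)
{
	return atomic_load(&f->large_mapcount) + 1;
}
```

The sketch has no overflow handling; the patch argues none is needed because every mapping also holds a folio reference, so the refcount would overflow first.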

2024-04-19 14:08:09

by Yin, Fengwei

Subject: Re: [PATCH v1 03/18] mm/rmap: add fast-path for small folios when adding/removing/duplicating



On 4/10/2024 3:22 AM, David Hildenbrand wrote:
> Let's add a fast-path for small folios to all relevant rmap functions.
> Note that only RMAP_LEVEL_PTE applies.
>
> This is a preparation for tracking the mapcount of large folios in a
> single value.
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Yin Fengwei <[email protected]>

2024-04-19 14:09:04

by Yin, Fengwei

Subject: Re: [PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios



On 4/10/2024 3:22 AM, David Hildenbrand wrote:
> We can now read the mapcount of large folios very efficiently. Use it to
> improve our handling of partially-mappable folios, falling back
> to making a guess only in case the folio is not "obviously mapped shared".
>
> We can now better detect partially-mappable folios where the first page is
> not mapped as "mapped shared", reducing "false negatives"; but false
> negatives are still possible.
>
> While at it, fix up a wrong comment (false positive vs. false negative)
> for KSM folios.
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Yin Fengwei <[email protected]>

2024-04-19 14:15:35

by Yin, Fengwei

Subject: Re: [PATCH v1 02/18] mm/rmap: always inline anon/file rmap duplication of a single PTE



On 4/10/2024 3:22 AM, David Hildenbrand wrote:
> As we grow the code, the compiler might make stupid decisions and
> unnecessarily degrade fork() performance. Let's make sure to always inline
> functions that operate on a single PTE so the compiler will always
> optimize out the loop and avoid a function call.
>
> This is a preparation for maintaining a total mapcount for large folios.
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Yin Fengwei <[email protected]>

2024-04-24 09:40:29

by David Hildenbrand

Subject: Re: [PATCH v1 06/18] mm: make folio_mapcount() return 0 for small typed folios

On 09.04.24 21:22, David Hildenbrand wrote:
> We already handle it properly for large folios. Let's also return "0"
> for small typed folios, like page_mapcount() currently would.
>
> Consequently, folio_mapcount() will never return negative values for
> typed folios, but may return negative values for underflows.
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> include/linux/mm.h | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index daf687f0e8e5..d453232bba62 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1260,12 +1260,19 @@ static inline int folio_large_mapcount(const struct folio *folio)
>   * references the entire folio counts exactly once, even when such special
>   * page table entries are comprised of multiple ordinary page table entries.
>   *
> + * Will report 0 for pages which cannot be mapped into userspace, such as
> + * slab, page tables and similar.
> + *
>   * Return: The number of times this folio is mapped.
>   */
>  static inline int folio_mapcount(const struct folio *folio)
>  {
> -	if (likely(!folio_test_large(folio)))
> -		return atomic_read(&folio->_mapcount) + 1;
> +	int mapcount;
> +
> +	if (likely(!folio_test_large(folio))) {
> +		mapcount = atomic_read(&folio->_mapcount);
> +		return page_type_has_type(mapcount) ? 0 : mapcount + 1;
> +	}
>  	return folio_large_mapcount(folio);
>  }
>

From 98acfb7ff35cb65fcfca5e799bf58f8afe84a645 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <[email protected]>
Date: Wed, 24 Apr 2024 10:56:17 +0200
Subject: [PATCH] !fixup: mm: make folio_mapcount() return 0 for small typed
folios

Just like page_mapcount(), let's make folio_mapcount() slightly more
efficient.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cf700c5cdd58b..78e583b50e421 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1271,8 +1271,11 @@ static inline int folio_mapcount(const struct folio *folio)
 	int mapcount;
 
 	if (likely(!folio_test_large(folio))) {
-		mapcount = atomic_read(&folio->_mapcount);
-		return page_type_has_type(mapcount) ? 0 : mapcount + 1;
+		mapcount = atomic_read(&folio->_mapcount) + 1;
+		/* Handle page_has_type() pages */
+		if (mapcount < PAGE_MAPCOUNT_RESERVE + 1)
+			mapcount = 0;
+		return mapcount;
 	}
 	return folio_large_mapcount(folio);
 }
--
2.44.0
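A quick way to convince oneself that this fixup does not change behavior is to compare both versions of the small-folio branch over a range of raw _mapcount values. In this sketch TOY_PAGE_MAPCOUNT_RESERVE assumes the value -128 that PAGE_MAPCOUNT_RESERVE had around this time, page types are modeled as an all-ones value with one type bit cleared, and the 0x1000 bit is an illustrative choice; all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value, mirroring PAGE_MAPCOUNT_RESERVE at the time (-128). */
#define TOY_PAGE_MAPCOUNT_RESERVE (-128)

/* Models page_type_has_type(): page types are very negative raw values. */
static bool toy_page_type_has_type(int raw)
{
	return raw < TOY_PAGE_MAPCOUNT_RESERVE;
}

/* v1 of the patch: test the raw value, then add 1. */
static int toy_small_mapcount_v1(int raw)
{
	return toy_page_type_has_type(raw) ? 0 : raw + 1;
}

/* Fixup: add 1 first, then compare against the shifted bound. */
static int toy_small_mapcount_fixup(int raw)
{
	int mapcount = raw + 1;

	/* Handle page_has_type() pages */
	if (mapcount < TOY_PAGE_MAPCOUNT_RESERVE + 1)
		mapcount = 0;
	return mapcount;
}
```

Both versions return 0 for a typed page (raw value -1 & ~0x1000, i.e. -4097) and still return a negative value for an underflowed mapcount (raw value -2 yields -1), matching the patch description.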


--
Cheers,

David / dhildenb