2024-06-07 12:24:31

by David Hildenbrand

Subject: [PATCH v1 0/6] fs/proc: move page_mapcount() to fs/proc/internal.h

With all other page_mapcount() users in the tree gone, move
page_mapcount() to fs/proc/internal.h, rename it and extend the
documentation to prevent future (ab)use.

... of course, I found some issues while working on that code, which I sort
out first ;)

We'll now only end up calling page_mapcount()
[now folio_precise_page_mapcount()] on pages mapped via present page table
entries. The exception is /proc/kpagecount, which still does questionable
things, but we'll leave that legacy interface as is for now.

Did a quick sanity check. Likely we would want some better selftests
for /proc/$/pagemap + smaps. I'll see if I can find some time to write
some more.
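
For reference (not part of this series), a minimal userspace probe of the
pagemap bits this series touches might look roughly like the sketch below.
It is untested; the bit positions are taken from
Documentation/admin-guide/mm/pagemap.rst (bit 63 present, bit 61 file-page,
bit 56 exclusively mapped):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define PM_PRESENT		(1ULL << 63)
#define PM_FILE			(1ULL << 61)
#define PM_MMAP_EXCLUSIVE	(1ULL << 56)

/* Return the pagemap entry for one virtual address of this process. */
static uint64_t pagemap_entry(void *addr)
{
	off_t off = (uintptr_t)addr / sysconf(_SC_PAGESIZE) * sizeof(uint64_t);
	uint64_t ent = 0;
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd >= 0) {
		if (pread(fd, &ent, sizeof(ent), off) != sizeof(ent))
			ent = 0;
		close(fd);
	}
	return ent;
}

int main(void)
{
	char *p = malloc(4096);
	uint64_t ent;

	if (!p)
		return 1;
	*p = 1;	/* fault in one anonymous page */
	ent = pagemap_entry(p);
	printf("present=%d file=%d exclusive=%d\n",
	       !!(ent & PM_PRESENT), !!(ent & PM_FILE),
	       !!(ent & PM_MMAP_EXCLUSIVE));
	return 0;
}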

Cc: Andrew Morton <[email protected]>
Cc: Jonathan Corbet <[email protected]>

David Hildenbrand (6):
fs/proc/task_mmu: indicate PM_FILE for PMD-mapped file THP
fs/proc/task_mmu: don't indicate PM_MMAP_EXCLUSIVE without PM_PRESENT
fs/proc/task_mmu: properly detect PM_MMAP_EXCLUSIVE per page of
PMD-mapped THPs
fs/proc/task_mmu: account non-present entries as "maybe shared, but no
idea how often"
fs/proc: move page_mapcount() to fs/proc/internal.h
Documentation/admin-guide/mm/pagemap.rst: drop "Using pagemap to do
something useful"

Documentation/admin-guide/mm/pagemap.rst | 21 -----
fs/proc/internal.h | 33 ++++++++
fs/proc/page.c | 21 +++--
fs/proc/task_mmu.c | 102 +++++++++++++----------
include/linux/mm.h | 27 +-----
5 files changed, 104 insertions(+), 100 deletions(-)


base-commit: 19b8422c5bd56fb5e7085995801c6543a98bda1f
--
2.45.2



2024-06-07 12:24:42

by David Hildenbrand

Subject: [PATCH v1 1/6] fs/proc/task_mmu: indicate PM_FILE for PMD-mapped file THP

Looks like we never taught pagemap_pmd_range() about the existence of
PMD-mapped file THPs. Seems to date back to the times when we first added
support for non-anon THPs in the form of shmem THP.

Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Cc: Kirill A. Shutemov <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/task_mmu.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5aceb3db7565e..08465b904ced5 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1522,6 +1522,8 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
}
#endif

+ if (page && !PageAnon(page))
+ flags |= PM_FILE;
if (page && !migration && page_mapcount(page) == 1)
flags |= PM_MMAP_EXCLUSIVE;

--
2.45.2


2024-06-07 12:25:13

by David Hildenbrand

Subject: [PATCH v1 2/6] fs/proc/task_mmu: don't indicate PM_MMAP_EXCLUSIVE without PM_PRESENT

Relying on the mapcount for non-present PTEs that reference pages
doesn't make any sense: they are not accounted in the mapcount, so
page_mapcount() == 1 won't return the result we actually want to know.

While we don't check the mapcount for migration entries already, we
could end up checking it for swap, hwpoison, device exclusive, ...
entries, which we really shouldn't.

There is one exception: device private entries, which we consider
fake-present (e.g., incremented the mapcount). But we won't care about
that for now for PM_MMAP_EXCLUSIVE, because indicating PM_SWAP for them
although they are fake-present already sounds suspiciously wrong.

Let's never indicate PM_MMAP_EXCLUSIVE without PM_PRESENT.

Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/task_mmu.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 08465b904ced5..40e38bc33a9d2 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1415,7 +1415,6 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
{
u64 frame = 0, flags = 0;
struct page *page = NULL;
- bool migration = false;

if (pte_present(pte)) {
if (pm->show_pfn)
@@ -1447,7 +1446,6 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
(offset << MAX_SWAPFILES_SHIFT);
}
flags |= PM_SWAP;
- migration = is_migration_entry(entry);
if (is_pfn_swap_entry(entry))
page = pfn_swap_entry_to_page(entry);
if (pte_marker_entry_uffd_wp(entry))
@@ -1456,7 +1454,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,

if (page && !PageAnon(page))
flags |= PM_FILE;
- if (page && !migration && page_mapcount(page) == 1)
+ if (page && (flags & PM_PRESENT) && page_mapcount(page) == 1)
flags |= PM_MMAP_EXCLUSIVE;
if (vma->vm_flags & VM_SOFTDIRTY)
flags |= PM_SOFT_DIRTY;
@@ -1473,7 +1471,6 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
pte_t *pte, *orig_pte;
int err = 0;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- bool migration = false;

ptl = pmd_trans_huge_lock(pmdp, vma);
if (ptl) {
@@ -1517,14 +1514,13 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
if (pmd_swp_uffd_wp(pmd))
flags |= PM_UFFD_WP;
VM_BUG_ON(!is_pmd_migration_entry(pmd));
- migration = is_migration_entry(entry);
page = pfn_swap_entry_to_page(entry);
}
#endif

if (page && !PageAnon(page))
flags |= PM_FILE;
- if (page && !migration && page_mapcount(page) == 1)
+ if (page && (flags & PM_PRESENT) && page_mapcount(page) == 1)
flags |= PM_MMAP_EXCLUSIVE;

for (; addr != end; addr += PAGE_SIZE) {
--
2.45.2


2024-06-07 12:25:27

by David Hildenbrand

Subject: [PATCH v1 3/6] fs/proc/task_mmu: properly detect PM_MMAP_EXCLUSIVE per page of PMD-mapped THPs

We added PM_MMAP_EXCLUSIVE in 2015 via commit 77bb499bb60f ("pagemap: add
mmap-exclusive bit for marking pages mapped only here"), when THPs could
not be partially mapped and page_mapcount() returned something
that was true for all pages of the THP.

In 2016, we added support for partially mapping THPs via
commit 53f9263baba6 ("mm: rework mapcount accounting to enable 4k mapping
of THPs") but missed to determine PM_MMAP_EXCLUSIVE as well per page.

Checking page_mapcount() on the head page does not tell the whole story.

We should check each individual page. In a future without per-page
mapcounts it will be different, but we'll change that to be consistent
with PTE-mapped THPs once we deal with that.

Fixes: 53f9263baba6 ("mm: rework mapcount accounting to enable 4k mapping of THPs")
Cc: Kirill A. Shutemov <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/task_mmu.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 40e38bc33a9d2..f427176ce2c34 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1474,6 +1474,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,

ptl = pmd_trans_huge_lock(pmdp, vma);
if (ptl) {
+ unsigned int idx = (addr & ~PMD_MASK) >> PAGE_SHIFT;
u64 flags = 0, frame = 0;
pmd_t pmd = *pmdp;
struct page *page = NULL;
@@ -1490,8 +1491,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
if (pmd_uffd_wp(pmd))
flags |= PM_UFFD_WP;
if (pm->show_pfn)
- frame = pmd_pfn(pmd) +
- ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+ frame = pmd_pfn(pmd) + idx;
}
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
else if (is_swap_pmd(pmd)) {
@@ -1500,11 +1500,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,

if (pm->show_pfn) {
if (is_pfn_swap_entry(entry))
- offset = swp_offset_pfn(entry);
+ offset = swp_offset_pfn(entry) + idx;
else
- offset = swp_offset(entry);
- offset = offset +
- ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+ offset = swp_offset(entry) + idx;
frame = swp_type(entry) |
(offset << MAX_SWAPFILES_SHIFT);
}
@@ -1520,12 +1518,16 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,

if (page && !PageAnon(page))
flags |= PM_FILE;
- if (page && (flags & PM_PRESENT) && page_mapcount(page) == 1)
- flags |= PM_MMAP_EXCLUSIVE;

- for (; addr != end; addr += PAGE_SIZE) {
- pagemap_entry_t pme = make_pme(frame, flags);
+ for (; addr != end; addr += PAGE_SIZE, idx++) {
+ unsigned long cur_flags = flags;
+ pagemap_entry_t pme;
+
+ if (page && (flags & PM_PRESENT) &&
+ page_mapcount(page + idx) == 1)
+ cur_flags |= PM_MMAP_EXCLUSIVE;

+ pme = make_pme(frame, cur_flags);
err = add_to_pagemap(&pme, pm);
if (err)
break;
--
2.45.2


2024-06-07 12:25:34

by David Hildenbrand

Subject: [PATCH v1 4/6] fs/proc/task_mmu: account non-present entries as "maybe shared, but no idea how often"

We currently rely on mapcount information for pages referenced by
non-present entries to calculate the USS (shared vs. private) and the
PSS.

However, relying on mapcounts for non-present entries doesn't make any
sense. We have to treat such entries as "maybe shared, but no idea how
often", implying that they will *not* get accounted towards the USS, and
will get fully accounted to the PSS (no idea how often shared).

There is one exception: device exclusive entries essentially behave like
present entries (e.g., mapcount incremented).

In smaps_pmd_entry(), use is_pfn_swap_entry() instead of
is_migration_entry(), which should not make a real difference but makes
the code look more similar to the PTE variant.

While at it, adjust the comments in smaps_account().

Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/task_mmu.c | 53 +++++++++++++++++++++++++++-------------------
1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f427176ce2c34..67d9b406c7586 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -442,7 +442,7 @@ static void smaps_page_accumulate(struct mem_size_stats *mss,

static void smaps_account(struct mem_size_stats *mss, struct page *page,
bool compound, bool young, bool dirty, bool locked,
- bool migration)
+ bool present)
{
struct folio *folio = page_folio(page);
int i, nr = compound ? compound_nr(page) : 1;
@@ -471,22 +471,27 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
* Then accumulate quantities that may depend on sharing, or that may
* differ page-by-page.
*
- * refcount == 1 guarantees the page is mapped exactly once.
- * If any subpage of the compound page mapped with PTE it would elevate
- * the refcount.
+ * refcount == 1 for present entries guarantees that the folio is mapped
+ * exactly once. For large folios this implies that exactly one
+ * PTE/PMD/... maps (a part of) this folio.
*
- * The page_mapcount() is called to get a snapshot of the mapcount.
- * Without holding the page lock this snapshot can be slightly wrong as
- * we cannot always read the mapcount atomically. It is not safe to
- * call page_mapcount() even with PTL held if the page is not mapped,
- * especially for migration entries. Treat regular migration entries
- * as mapcount == 1.
+ * Treat all non-present entries (where relying on the mapcount and
+ * refcount doesn't make sense) as "maybe shared, but not sure how
+ * often". We treat device private entries as being fake-present.
+ *
+ * Note that it would not be safe to read the mapcount especially for
+ * pages referenced by migration entries, even with the PTL held.
*/
- if ((folio_ref_count(folio) == 1) || migration) {
+ if (folio_ref_count(folio) == 1 || !present) {
smaps_page_accumulate(mss, folio, size, size << PSS_SHIFT,
- dirty, locked, true);
+ dirty, locked, present);
return;
}
+ /*
+ * The page_mapcount() is called to get a snapshot of the mapcount.
+ * Without holding the folio lock this snapshot can be slightly wrong as
+ * we cannot always read the mapcount atomically.
+ */
for (i = 0; i < nr; i++, page++) {
int mapcount = page_mapcount(page);
unsigned long pss = PAGE_SIZE << PSS_SHIFT;
@@ -531,13 +536,14 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
struct vm_area_struct *vma = walk->vma;
bool locked = !!(vma->vm_flags & VM_LOCKED);
struct page *page = NULL;
- bool migration = false, young = false, dirty = false;
+ bool present = false, young = false, dirty = false;
pte_t ptent = ptep_get(pte);

if (pte_present(ptent)) {
page = vm_normal_page(vma, addr, ptent);
young = pte_young(ptent);
dirty = pte_dirty(ptent);
+ present = true;
} else if (is_swap_pte(ptent)) {
swp_entry_t swpent = pte_to_swp_entry(ptent);

@@ -555,8 +561,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
}
} else if (is_pfn_swap_entry(swpent)) {
- if (is_migration_entry(swpent))
- migration = true;
+ if (is_device_private_entry(swpent))
+ present = true;
page = pfn_swap_entry_to_page(swpent);
}
} else {
@@ -567,7 +573,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
if (!page)
return;

- smaps_account(mss, page, false, young, dirty, locked, migration);
+ smaps_account(mss, page, false, young, dirty, locked, present);
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -578,18 +584,17 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
struct vm_area_struct *vma = walk->vma;
bool locked = !!(vma->vm_flags & VM_LOCKED);
struct page *page = NULL;
+ bool present = false;
struct folio *folio;
- bool migration = false;

if (pmd_present(*pmd)) {
page = vm_normal_page_pmd(vma, addr, *pmd);
+ present = true;
} else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
swp_entry_t entry = pmd_to_swp_entry(*pmd);

- if (is_migration_entry(entry)) {
- migration = true;
+ if (is_pfn_swap_entry(entry))
page = pfn_swap_entry_to_page(entry);
- }
}
if (IS_ERR_OR_NULL(page))
return;
@@ -604,7 +609,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
mss->file_thp += HPAGE_PMD_SIZE;

smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd),
- locked, migration);
+ locked, present);
}
#else
static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
@@ -732,17 +737,21 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
struct vm_area_struct *vma = walk->vma;
pte_t ptent = huge_ptep_get(pte);
struct folio *folio = NULL;
+ bool present = false;

if (pte_present(ptent)) {
folio = page_folio(pte_page(ptent));
+ present = true;
} else if (is_swap_pte(ptent)) {
swp_entry_t swpent = pte_to_swp_entry(ptent);

if (is_pfn_swap_entry(swpent))
folio = pfn_swap_entry_folio(swpent);
}
+
if (folio) {
- if (folio_likely_mapped_shared(folio) ||
+ /* We treat non-present entries as "maybe shared". */
+ if (!present || folio_likely_mapped_shared(folio) ||
hugetlb_pmd_shared(pte))
mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
else
--
2.45.2


2024-06-07 12:26:32

by David Hildenbrand

Subject: [PATCH v1 5/6] fs/proc: move page_mapcount() to fs/proc/internal.h

... and rename it to folio_precise_page_mapcount(). fs/proc is the last
remaining user, and that should stay that way.

While at it, clean up kpagecount_read() a bit: there are still some legacy
leftovers -- when the interface was introduced it returned the page
refcount, but it was changed shortly afterwards to return the page
mapcount. Further, do some simple folio conversion.

Once we stop using the per-page mapcounts of large folios, all
folio_precise_page_mapcount() users will have to implement an
alternative way to achieve what they are trying to achieve, possibly in
a less precise way.

Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/internal.h | 33 +++++++++++++++++++++++++++++++++
fs/proc/page.c | 21 ++++++++++-----------
fs/proc/task_mmu.c | 35 ++++++++++++++++++++++-------------
include/linux/mm.h | 27 +--------------------------
4 files changed, 66 insertions(+), 50 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index a71ac5379584a..a8a8576d8592e 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -13,6 +13,7 @@
#include <linux/binfmts.h>
#include <linux/sched/coredump.h>
#include <linux/sched/task.h>
+#include <linux/mm.h>

struct ctl_table_header;
struct mempolicy;
@@ -142,6 +143,38 @@ unsigned name_to_int(const struct qstr *qstr);
/* Worst case buffer size needed for holding an integer. */
#define PROC_NUMBUF 13

+/**
+ * folio_precise_page_mapcount() - Number of mappings of this folio page.
+ * @folio: The folio.
+ * @page: The page.
+ *
+ * The number of present user page table entries that reference this page
+ * as tracked via the RMAP: either referenced directly (PTE) or as part of
+ * a larger area that covers this page (e.g., PMD).
+ *
+ * Use this function only for the calculation of existing statistics
+ * (USS, PSS, mapcount_max) and for debugging purposes (/proc/kpagecount).
+ *
+ * Do not add new users.
+ *
+ * Returns: The number of mappings of this folio page. 0 for
+ * folios that are not mapped to user space or are not tracked via the RMAP
+ * (e.g., shared zeropage).
+ */
+static inline int folio_precise_page_mapcount(struct folio *folio,
+ struct page *page)
+{
+ int mapcount = atomic_read(&page->_mapcount) + 1;
+
+ /* Handle page_has_type() pages */
+ if (mapcount < PAGE_MAPCOUNT_RESERVE + 1)
+ mapcount = 0;
+ if (folio_test_large(folio))
+ mapcount += folio_entire_mapcount(folio);
+
+ return mapcount;
+}
+
/*
* array.c
*/
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 2fb64bdb64eb1..e8440db8cfbf9 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -37,21 +37,19 @@ static inline unsigned long get_max_dump_pfn(void)
#endif
}

-/* /proc/kpagecount - an array exposing page counts
+/* /proc/kpagecount - an array exposing page mapcounts
*
* Each entry is a u64 representing the corresponding
- * physical page count.
+ * physical page mapcount.
*/
static ssize_t kpagecount_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
const unsigned long max_dump_pfn = get_max_dump_pfn();
u64 __user *out = (u64 __user *)buf;
- struct page *ppage;
unsigned long src = *ppos;
unsigned long pfn;
ssize_t ret = 0;
- u64 pcount;

pfn = src / KPMSIZE;
if (src & KPMMASK || count & KPMMASK)
@@ -61,18 +59,19 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
count = min_t(unsigned long, count, (max_dump_pfn * KPMSIZE) - src);

while (count > 0) {
+ struct page *page;
+ u64 mapcount = 0;
+
/*
* TODO: ZONE_DEVICE support requires to identify
* memmaps that were actually initialized.
*/
- ppage = pfn_to_online_page(pfn);
-
- if (!ppage)
- pcount = 0;
- else
- pcount = page_mapcount(ppage);
+ page = pfn_to_online_page(pfn);
+ if (page)
+ mapcount = folio_precise_page_mapcount(page_folio(page),
+ page);

- if (put_user(pcount, out)) {
+ if (put_user(mapcount, out)) {
ret = -EFAULT;
break;
}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 67d9b406c7586..631371cb80a05 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -488,12 +488,12 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
return;
}
/*
- * The page_mapcount() is called to get a snapshot of the mapcount.
- * Without holding the folio lock this snapshot can be slightly wrong as
- * we cannot always read the mapcount atomically.
+ * We obtain a snapshot of the mapcount. Without holding the folio lock
+ * this snapshot can be slightly wrong as we cannot always read the
+ * mapcount atomically.
*/
for (i = 0; i < nr; i++, page++) {
- int mapcount = page_mapcount(page);
+ int mapcount = folio_precise_page_mapcount(folio, page);
unsigned long pss = PAGE_SIZE << PSS_SHIFT;
if (mapcount >= 2)
pss /= mapcount;
@@ -1424,6 +1424,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
{
u64 frame = 0, flags = 0;
struct page *page = NULL;
+ struct folio *folio;

if (pte_present(pte)) {
if (pm->show_pfn)
@@ -1461,10 +1462,14 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
flags |= PM_UFFD_WP;
}

- if (page && !PageAnon(page))
- flags |= PM_FILE;
- if (page && (flags & PM_PRESENT) && page_mapcount(page) == 1)
- flags |= PM_MMAP_EXCLUSIVE;
+ if (page) {
+ folio = page_folio(page);
+ if (!folio_test_anon(folio))
+ flags |= PM_FILE;
+ if ((flags & PM_PRESENT) &&
+ folio_precise_page_mapcount(folio, page) == 1)
+ flags |= PM_MMAP_EXCLUSIVE;
+ }
if (vma->vm_flags & VM_SOFTDIRTY)
flags |= PM_SOFT_DIRTY;

@@ -1487,6 +1492,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
u64 flags = 0, frame = 0;
pmd_t pmd = *pmdp;
struct page *page = NULL;
+ struct folio *folio;

if (vma->vm_flags & VM_SOFTDIRTY)
flags |= PM_SOFT_DIRTY;
@@ -1525,15 +1531,18 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
}
#endif

- if (page && !PageAnon(page))
- flags |= PM_FILE;
+ if (page) {
+ folio = page_folio(page);
+ if (!folio_test_anon(folio))
+ flags |= PM_FILE;
+ }

for (; addr != end; addr += PAGE_SIZE, idx++) {
unsigned long cur_flags = flags;
pagemap_entry_t pme;

- if (page && (flags & PM_PRESENT) &&
- page_mapcount(page + idx) == 1)
+ if (folio && (flags & PM_PRESENT) &&
+ folio_precise_page_mapcount(folio, page + idx) == 1)
cur_flags |= PM_MMAP_EXCLUSIVE;

pme = make_pme(frame, cur_flags);
@@ -2572,7 +2581,7 @@ static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty,
unsigned long nr_pages)
{
struct folio *folio = page_folio(page);
- int count = page_mapcount(page);
+ int count = folio_precise_page_mapcount(folio, page);

md->pages += nr_pages;
if (pte_dirty || folio_test_dirty(folio))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 04795a5090267..42e3752b5eed5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1197,8 +1197,7 @@ static inline int is_vmalloc_or_module_addr(const void *x)
/*
* How many times the entire folio is mapped as a single unit (eg by a
* PMD or PUD entry). This is probably not what you want, except for
- * debugging purposes - it does not include PTE-mapped sub-pages; look
- * at folio_mapcount() or page_mapcount() instead.
+ * debugging purposes or implementation of other core folio_*() primitives.
*/
static inline int folio_entire_mapcount(const struct folio *folio)
{
@@ -1206,30 +1205,6 @@ static inline int folio_entire_mapcount(const struct folio *folio)
return atomic_read(&folio->_entire_mapcount) + 1;
}

-/**
- * page_mapcount() - Number of times this precise page is mapped.
- * @page: The page.
- *
- * The number of times this page is mapped. If this page is part of
- * a large folio, it includes the number of times this page is mapped
- * as part of that folio.
- *
- * Will report 0 for pages which cannot be mapped into userspace, eg
- * slab, page tables and similar.
- */
-static inline int page_mapcount(struct page *page)
-{
- int mapcount = atomic_read(&page->_mapcount) + 1;
-
- /* Handle page_has_type() pages */
- if (mapcount < PAGE_MAPCOUNT_RESERVE + 1)
- mapcount = 0;
- if (unlikely(PageCompound(page)))
- mapcount += folio_entire_mapcount(page_folio(page));
-
- return mapcount;
-}
-
static inline int folio_large_mapcount(const struct folio *folio)
{
VM_WARN_ON_FOLIO(!folio_test_large(folio), folio);
--
2.45.2


2024-06-07 12:27:01

by David Hildenbrand

Subject: [PATCH v1 6/6] Documentation/admin-guide/mm/pagemap.rst: drop "Using pagemap to do something useful"

That example was added in 2008. In 2015, we restricted access to the
PFNs in the pagemap to CAP_SYS_ADMIN, making that approach much less
usable.

It's 2024 now, and using that racy and low-level mechanism to calculate the
USS should not be considered a good example anymore. /proc/$pid/smaps
and /proc/$pid/smaps_rollup can do a much better job without any of
that low-level handling.

Let's just drop that example.
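
For anyone who relied on that example: the USS can be derived from
/proc/<pid>/smaps_rollup instead, e.g., by summing the Private_* fields.
A rough, untested sketch (exactly which fields to include is a matter of
definition):

#include <stdio.h>

/* Rough sketch: USS of a process ~= sum of the Private_* fields
 * reported by /proc/<pid>/smaps_rollup (values are in kB). */
int main(int argc, char **argv)
{
	char path[64], line[256];
	unsigned long kb, uss_kb = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/smaps_rollup",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Private_Clean: %lu kB", &kb) == 1 ||
		    sscanf(line, "Private_Dirty: %lu kB", &kb) == 1 ||
		    sscanf(line, "Private_Hugetlb: %lu kB", &kb) == 1)
			uss_kb += kb;
	}
	fclose(f);
	printf("USS: %lu kB\n", uss_kb);
	return 0;
}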

Signed-off-by: David Hildenbrand <[email protected]>
---
Documentation/admin-guide/mm/pagemap.rst | 21 ---------------------
1 file changed, 21 deletions(-)

diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index f5f065c67615d..f2817a8015962 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -173,27 +173,6 @@ LRU related page flags
The page-types tool in the tools/mm directory can be used to query the
above flags.

-Using pagemap to do something useful
-====================================
-
-The general procedure for using pagemap to find out about a process' memory
-usage goes like this:
-
- 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
- mapped to what.
- 2. Select the maps you are interested in -- all of them, or a particular
- library, or the stack or the heap, etc.
- 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
- 4. Read a u64 for each page from pagemap.
- 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you
- just read, seek to that entry in the file, and read the data you want.
-
-For example, to find the "unique set size" (USS), which is the amount of
-memory that a process is using that is not shared with any other process,
-you can go through every map in the process, find the PFNs, look those up
-in kpagecount, and tally up the number of pages that are only referenced
-once.
-
Exceptions for Shared Memory
============================

--
2.45.2


2024-06-07 13:18:56

by Oscar Salvador

Subject: Re: [PATCH v1 0/6] fs/proc: move page_mapcount() to fs/proc/internal.h

On Fri, Jun 07, 2024 at 02:23:51PM +0200, David Hildenbrand wrote:
> With all other page_mapcount() users in the tree gone, move
> page_mapcount() to fs/proc/internal.h, rename it and extend the
> documentation to prevent future (ab)use.
>
> ... of course, I found some issues while working on that code, which I sort
> out first ;)
>
> We'll now only end up calling page_mapcount()
> [now folio_precise_page_mapcount()] on pages mapped via present page table
> entries. The exception is /proc/kpagecount, which still does questionable
> things, but we'll leave that legacy interface as is for now.
>
> Did a quick sanity check. Likely we would want some better selftests
> for /proc/$/pagemap + smaps. I'll see if I can find some time to write
> some more.

I stumbled upon some of these issues while unifying .{pud/pmd}_entry and
.hugetlb_entry.
I am not sure what the current state of the pagemap/smaps selftests is, but
since I am going to need them anyway to keep me in check and make sure
I do not break anything hugetlb-related, I might as well write some of
them.


--
Oscar Salvador
SUSE Labs

2024-06-07 13:38:49

by Lance Yang

Subject: Re: [PATCH v1 1/6] fs/proc/task_mmu: indicate PM_FILE for PMD-mapped file THP

On Fri, Jun 7, 2024 at 8:24 PM David Hildenbrand <[email protected]> wrote:
>
> Looks like we never taught pagemap_pmd_range() about the existence of
> PMD-mapped file THPs. Seems to date back to the times when we first added
> support for non-anon THPs in the form of shmem THP.
>
> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> Cc: Kirill A. Shutemov <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

LGTM. Feel free to add:
Reviewed-by: Lance Yang <[email protected]>

Thanks,
Lance

> ---
> fs/proc/task_mmu.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 5aceb3db7565e..08465b904ced5 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1522,6 +1522,8 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> }
> #endif
>
> + if (page && !PageAnon(page))
> + flags |= PM_FILE;
> if (page && !migration && page_mapcount(page) == 1)
> flags |= PM_MMAP_EXCLUSIVE;
>
> --
> 2.45.2
>
>

2024-06-07 15:42:25

by Kirill A. Shutemov

Subject: Re: [PATCH v1 1/6] fs/proc/task_mmu: indicate PM_FILE for PMD-mapped file THP

On Fri, Jun 07, 2024 at 02:23:52PM +0200, David Hildenbrand wrote:
> Looks like we never taught pagemap_pmd_range() about the existence of
> PMD-mapped file THPs. Seems to date back to the times when we first added
> support for non-anon THPs in the form of shmem THP.
>
> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> Cc: Kirill A. Shutemov <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-06-10 04:32:33

by Oscar Salvador

Subject: Re: [PATCH v1 1/6] fs/proc/task_mmu: indicate PM_FILE for PMD-mapped file THP

On Fri, Jun 07, 2024 at 02:23:52PM +0200, David Hildenbrand wrote:
> Looks like we never taught pagemap_pmd_range() about the existence of
> PMD-mapped file THPs. Seems to date back to the times when we first added
> support for non-anon THPs in the form of shmem THP.
>
> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> Cc: Kirill A. Shutemov <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Oscar Salvador <[email protected]>

> ---
> fs/proc/task_mmu.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 5aceb3db7565e..08465b904ced5 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1522,6 +1522,8 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> }
> #endif
>
> + if (page && !PageAnon(page))
> + flags |= PM_FILE;
> if (page && !migration && page_mapcount(page) == 1)
> flags |= PM_MMAP_EXCLUSIVE;
>
> --
> 2.45.2
>
>

--
Oscar Salvador
SUSE Labs

2024-06-10 04:39:00

by Oscar Salvador

Subject: Re: [PATCH v1 2/6] fs/proc/task_mmu: don't indicate PM_MMAP_EXCLUSIVE without PM_PRESENT

On Fri, Jun 07, 2024 at 02:23:53PM +0200, David Hildenbrand wrote:
> Relying on the mapcount for non-present PTEs that reference pages
> doesn't make any sense: they are not accounted in the mapcount, so
> page_mapcount() == 1 won't return the result we actually want to know.
>
> While we don't check the mapcount for migration entries already, we
> could end up checking it for swap, hwpoison, device exclusive, ...
> entries, which we really shouldn't.
>
> There is one exception: device private entries, which we consider
> fake-present (e.g., incremented the mapcount). But we won't care about
> that for now for PM_MMAP_EXCLUSIVE, because indicating PM_SWAP for them
> although they are fake-present already sounds suspiciously wrong.
>
> Let's never indicate PM_MMAP_EXCLUSIVE without PM_PRESENT.

Alternatively we could use is_pfn_swap_entry?
But the PM_PRESENT approach seems more correct.

> Signed-off-by: David Hildenbrand <[email protected]>

Signed-off-by: Oscar Salvador <[email protected]>

--
Oscar Salvador
SUSE Labs

2024-06-10 04:49:25

by Oscar Salvador

Subject: Re: [PATCH v1 2/6] fs/proc/task_mmu: don't indicate PM_MMAP_EXCLUSIVE without PM_PRESENT

On Fri, Jun 07, 2024 at 02:23:53PM +0200, David Hildenbrand wrote:
> Relying on the mapcount for non-present PTEs that reference pages
> doesn't make any sense: they are not accounted in the mapcount, so
> page_mapcount() == 1 won't return the result we actually want to know.
>
> While we don't check the mapcount for migration entries already, we
> could end up checking it for swap, hwpoison, device exclusive, ...
> entries, which we really shouldn't.
>
> There is one exception: device private entries, which we consider
> fake-present (e.g., incremented the mapcount). But we won't care about
> that for now for PM_MMAP_EXCLUSIVE, because indicating PM_SWAP for them
> although they are fake-present already sounds suspiciously wrong.
>
> Let's never indicate PM_MMAP_EXCLUSIVE without PM_PRESENT.
>
> Signed-off-by: David Hildenbrand <[email protected]>

Forgot to comment on something:

> @@ -1517,14 +1514,13 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> if (pmd_swp_uffd_wp(pmd))
> flags |= PM_UFFD_WP;
> VM_BUG_ON(!is_pmd_migration_entry(pmd));
> - migration = is_migration_entry(entry);
> page = pfn_swap_entry_to_page(entry);

We do not really need to get the page anymore here as that is the non-present
part.

Then we could get away without checking the flags as only page != NULL
would mean a present pmd.

Not that we gain much as this is far from being a hot-path, but just
saying..

--
Oscar Salvador
SUSE Labs

2024-06-10 04:51:31

by Oscar Salvador

Subject: Re: [PATCH v1 3/6] fs/proc/task_mmu: properly detect PM_MMAP_EXCLUSIVE per page of PMD-mapped THPs

On Fri, Jun 07, 2024 at 02:23:54PM +0200, David Hildenbrand wrote:
> We added PM_MMAP_EXCLUSIVE in 2015 via commit 77bb499bb60f ("pagemap: add
> mmap-exclusive bit for marking pages mapped only here"), when THPs could
> not be partially mapped and page_mapcount() returned something
> that was true for all pages of the THP.
>
> In 2016, we added support for partially mapping THPs via
> commit 53f9263baba6 ("mm: rework mapcount accounting to enable 4k mapping
> of THPs") but missed to determine PM_MMAP_EXCLUSIVE as well per page.
>
> Checking page_mapcount() on the head page does not tell the whole story.
>
> We should check each individual page. In a future without per-page
> mapcounts it will be different, but we'll change that to be consistent
> with PTE-mapped THPs once we deal with that.
>
> Fixes: 53f9263baba6 ("mm: rework mapcount accounting to enable 4k mapping of THPs")
> Cc: Kirill A. Shutemov <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Oscar Salvador <[email protected]>

--
Oscar Salvador
SUSE Labs

2024-06-11 07:14:15

by Oscar Salvador

Subject: Re: [PATCH v1 2/6] fs/proc/task_mmu: don't indicate PM_MMAP_EXCLUSIVE without PM_PRESENT

On Mon, Jun 10, 2024 at 06:38:33AM +0200, Oscar Salvador wrote:
> Signed-off-by: Oscar Salvador <[email protected]>

Uh, I spaced out here, sorry.

Reviewed-by: Oscar Salvador <[email protected]>


--
Oscar Salvador
SUSE Labs

2024-06-11 10:46:13

by David Hildenbrand

Subject: Re: [PATCH v1 2/6] fs/proc/task_mmu: don't indicate PM_MMAP_EXCLUSIVE without PM_PRESENT

On 10.06.24 06:38, Oscar Salvador wrote:
> On Fri, Jun 07, 2024 at 02:23:53PM +0200, David Hildenbrand wrote:
>> Relying on the mapcount for non-present PTEs that reference pages
>> doesn't make any sense: they are not accounted in the mapcount, so
>> page_mapcount() == 1 won't return the result we actually want to know.
>>
>> While we don't check the mapcount for migration entries already, we
>> could end up checking it for swap, hwpoison, device exclusive, ...
>> entries, which we really shouldn't.
>>
>> There is one exception: device private entries, which we consider
>> fake-present (e.g., incremented the mapcount). But we won't care about
>> that for now for PM_MMAP_EXCLUSIVE, because indicating PM_SWAP for them
>> although they are fake-present already sounds suspiciously wrong.
>>
>> Let's never indicate PM_MMAP_EXCLUSIVE without PM_PRESENT.
>
> Alternatively we could use is_pfn_swap_entry?

It's all weird, because only device private fake swp entries are
fake-present. For these, we might want to use PM_PRESENT, but I don't
care enough about device private entries to handle that here in a better
way :)

Indicating PM_SWAP for something that is not swap (migration/poison/...)
is also a bit weird. But likely nobody has cared about that so far: it's
either present (PM_PRESENT), something else (PM_SWAP), or nothing is
there (no bit set).

Thanks!

--
Cheers,

David / dhildenb


2024-06-11 10:51:44

by David Hildenbrand

Subject: Re: [PATCH v1 2/6] fs/proc/task_mmu: don't indicate PM_MMAP_EXCLUSIVE without PM_PRESENT

On 11.06.24 09:13, Oscar Salvador wrote:
> On Mon, Jun 10, 2024 at 06:38:33AM +0200, Oscar Salvador wrote:
>> Signed-off-by: Oscar Salvador <[email protected]>
>
> Uh, I spaced out here, sorry.
>
> Reviewed-by: Oscar Salvador <[email protected]>

:)

Thanks!

--
Cheers,

David / dhildenb


2024-06-11 10:56:56

by David Hildenbrand

Subject: Re: [PATCH v1 2/6] fs/proc/task_mmu: don't indicate PM_MMAP_EXCLUSIVE without PM_PRESENT

On 10.06.24 06:49, Oscar Salvador wrote:
> On Fri, Jun 07, 2024 at 02:23:53PM +0200, David Hildenbrand wrote:
>> Relying on the mapcount for non-present PTEs that reference pages
>> doesn't make any sense: they are not accounted in the mapcount, so
>> page_mapcount() == 1 won't return the result we actually want to know.
>>
>> While we don't check the mapcount for migration entries already, we
>> could end up checking it for swap, hwpoison, device exclusive, ...
>> entries, which we really shouldn't.
>>
>> There is one exception: device private entries, which we consider
>> fake-present (e.g., incremented the mapcount). But we won't care about
>> that for now for PM_MMAP_EXCLUSIVE, because indicating PM_SWAP for them
>> although they are fake-present already sounds suspiciously wrong.
>>
>> Let's never indicate PM_MMAP_EXCLUSIVE without PM_PRESENT.
>>
>> Signed-off-by: David Hildenbrand <[email protected]>
>
> Forgot to comment on something:
>
>> @@ -1517,14 +1514,13 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
>> if (pmd_swp_uffd_wp(pmd))
>> flags |= PM_UFFD_WP;
>> VM_BUG_ON(!is_pmd_migration_entry(pmd));
>> - migration = is_migration_entry(entry);
>> page = pfn_swap_entry_to_page(entry);
>
> We do not really need to get the page anymore here as that is the non-present
> part.
>
> Then we could get away without checking the flags as only page != NULL
> would mean a present pmd.
>
> Not that we gain much as this is far from being a hot-path, but just
> saying..

I *think* we still want that for indicating PM_FILE after patch #1.

--
Cheers,

David / dhildenb


2024-06-11 11:16:05

by Oscar Salvador

Subject: Re: [PATCH v1 2/6] fs/proc/task_mmu: don't indicate PM_MMAP_EXCLUSIVE without PM_PRESENT

On Tue, Jun 11, 2024 at 12:50:46PM +0200, David Hildenbrand wrote:
> I *think* we still want that for indicating PM_FILE after patch #1.

Yes, we do, disregard that comment please.


--
Oscar Salvador
SUSE Labs