2022-03-08 19:21:35

by David Hildenbrand

Subject: [PATCH v1 00/15] mm: COW fixes part 2: reliable GUP pins of anonymous pages

This series is the result of the discussion on the previous approach [2].
More information on the general COW issues can be found there. It is based
on v5.17-rc7 and [1], which resides in -mm and -next:
[PATCH v3 0/9] mm: COW fixes part 1: fix the COW security issue for
THP and swap

v1 is located at:
https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_2_v1

This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
on an anonymous page and COW logic failed to detect exclusivity of the page,
replacing the anonymous page by a copy in the page table: the
GUP pin loses synchronicity with the page actually mapped into the page table.

This issue, including other related COW issues, has been summarized in [3]
under 3):
"
3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN)

page_maybe_dma_pinned() is used to check if a page may be pinned for
DMA (using FOLL_PIN instead of FOLL_GET). While false positives are
tolerable, false negatives are problematic: pages that are pinned for
DMA must not be added to the swapcache. If it happens, the (now pinned)
page could be faulted back from the swapcache into page tables
read-only. Future write-access would detect the pinning and COW the
page, losing synchronicity. For the interested reader, this is nicely
documented in feb889fb40fa ("mm: don't put pinned pages into the swap
cache").

Peter reports [8] that page_maybe_dma_pinned() as used is racy in some
cases and can result in a violation of the documented semantics:
giving false negatives because of the race.

There are cases where we call it without properly taking a per-process
sequence lock, turning the usage of page_maybe_dma_pinned() racy. While
one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to
handle, there is especially one rmap case (shrink_page_list) that's hard
to fix: in the rmap world, we're not limited to a single process.

The shrink_page_list() issue is really subtle. If we race with
someone pinning a page, we can trigger the same issue as in the FOLL_GET
case. See the detail section at the end of this mail for a discussion of how
badly this can bite us with VFIO or other FOLL_PIN users.

It's harder to reproduce, but I managed to modify the O_DIRECT
reproducer to use io_uring fixed buffers [15] instead, which ends up
using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can
similarly trigger a loss of synchronicity and consequently a memory
corruption.

Again, the root issue is that a write-fault on a page that has
additional references results in a COW and thereby a loss of
synchronicity and consequently a memory corruption if two parties
believe they are referencing the same page.
"

This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable,
also taking care of concurrent pinning via GUP-fast; for example, it
fully fixes a recently reported issue regarding NUMA
balancing [4]. While doing that, it further reduces "unnecessary
COWs", especially when we don't fork()/KSM and don't swap out, and fixes
COW security for hugetlb with FOLL_PIN.

In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped
anonymous page is exclusive. Exclusive anonymous pages that are mapped
R/O can directly be mapped R/W by the COW logic in the write fault handler.
Before an exclusive anonymous page can be shared (fork(), KSM), it first
has to be marked possibly shared -- which will fail if there are
GUP pins on the page. GUP is only allowed to take a pin on anonymous pages
that are exclusive. The PT lock is the primary mechanism to synchronize
modifications of PG_anon_exclusive. GUP-fast is synchronized either via the
src_mm->write_protect_seq or via clear/invalidate+flush of the relevant
page table entry.
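
For illustration, a rough, non-authoritative sketch of the resulting rules
(the helper names match the ones introduced by this series, the rest is
commented pseudo-code):

	/* COW write fault on an anonymous page (see do_wp_page() changes): */
	if (PageAnonExclusive(page)) {
		/* Nobody else can rely on this page: map it R/W right away. */
		wp_page_reuse(vmf);
	} else {
		/* Possibly shared: re-check under the page lock or copy. */
	}

	/* Sharing for fork() (under PT lock + src_mm->write_protect_seq): */
	if (page_try_dup_anon_rmap(page, compound, src_vma))
		/* page may be pinned: copy it instead of sharing it */;

	/* Sharing for KSM / unmapping for swap or migration (under PT lock,
	 * after clear/invalidate+flush of the page table entry): */
	if (page_try_share_anon_rmap(page))
		/* page may be pinned: sharing/unmapping fails */;

	/* GUP (FOLL_PIN) on an anonymous page: only pin if exclusive. */
	if (!PageAnonExclusive(page))
		/* don't pin; unshare first (patches #13/#14) */;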

Special care has to be taken about swap, migration, and THPs (where a
PMD mapping can be converted to a PTE mapping and we have to track
information for subpages). Apart from that, we let the rmap code handle most
of the magic. For reliable R/O pins of anonymous pages, we need the
FAULT_FLAG_UNSHARE logic from our previous approach [2]; however, it's now
100% mapcount-free and I simplified it a bit further.

#1 is a fix
#3-#9 are mostly rmap preparations for PG_anon_exclusive handling
#10 introduces PG_anon_exclusive
#11 uses PG_anon_exclusive and makes R/W pins of anonymous pages
reliable
#12 is a preparation for reliable R/O pins
#13 and #14 reuse/modify GUP-triggered unsharing for R/O GUP pins,
making R/O pins of anonymous pages reliable
#15 adds sanity checks when (un)pinning anonymous pages

I'm not proud of patch #10; suggestions welcome. Patch #11 contains
extensive explanations and the main logic for R/W pins. #12 and #13
resemble what we proposed in the previous approach [2]. I consider the
general approach of #15 very nice and helpful, and I remember Linus even
envisioning something like that for finding BUGs, although we might
eventually want to implement the sanity checks differently.

It passes my growing set of tests for "wrong COW" and "missed COW",
including the ones in [3] -- I'd really appreciate some experienced eyes
to take a close look at corner cases.

I'm planning on sending a part 3 that will remember PG_anon_exclusive for
ordinary swap entries: this will make FOLL_GET | FOLL_WRITE references
reliable and fix the memory corruptions for O_DIRECT -- as described in
[3] under 2) -- as well, as long as there is no fork().

The long-term goal should be to convert relevant users of FOLL_GET to
FOLL_PIN; however, with part 3 that wouldn't be required to fix the obvious
memory corruptions we are aware of. Once that's in place, we can streamline
our COW logic for hugetlb to rely on page_count() as well and fix any
possible COW security issues.

[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]
[3] https://lore.kernel.org/r/[email protected]
[4] https://bugzilla.kernel.org/show_bug.cgi?id=215616


RFC -> v1:
* Rephrased/extended some patch descriptions+comments
* Tested on aarch64, ppc64 and x86_64
* "mm/rmap: convert RMAP flags to a proper distinct rmap_t type"
-> Added
* "mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()"
-> Added
* "mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"
-> Fixed __do_huge_pmd_anonymous_page() to recheck after temporarily
dropping the PT lock.
-> Use "reuse" label in __do_huge_pmd_anonymous_page()
-> Slightly simplify logic in hugetlb_cow()
-> In remove_migration_pte(), remove unrelated changes around
page_remove_rmap()
* "mm: support GUP-triggered unsharing of anonymous pages"
-> In handle_pte_fault(), trigger pte_mkdirty() only with
FAULT_FLAG_WRITE
-> In __handle_mm_fault(), extend comment regarding anonymous PUDs
* "mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared
anonymous page"
-> Added unsharing logic to gup_hugepte() and gup_huge_pud()
-> Changed return logic in __follow_hugetlb_must_fault(), making sure
that "unshare" is always set
* "mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are
exclusive when (un)pinning"
-> Slightly simplified sanity_check_pinned_pages()

David Hildenbrand (15):
mm/rmap: fix missing swap_free() in try_to_unmap() after
arch_unmap_one() failed
mm/hugetlb: take src_mm->write_protect_seq in
copy_hugetlb_page_range()
mm/memory: slightly simplify copy_present_pte()
mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and
page_try_dup_anon_rmap()
mm/rmap: convert RMAP flags to a proper distinct rmap_t type
mm/rmap: remove do_page_add_anon_rmap()
mm/rmap: pass rmap flags to hugepage_add_anon_rmap()
mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()
mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon()
page exclusively
mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages
mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
mm/gup: disallow follow_page(FOLL_PIN)
mm: support GUP-triggered unsharing of anonymous pages
mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared
anonymous page
mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are
exclusive when (un)pinning

fs/proc/page.c | 3 +-
include/linux/mm.h | 46 ++++++-
include/linux/mm_types.h | 8 ++
include/linux/page-flags.h | 124 +++++++++++++++++-
include/linux/rmap.h | 109 ++++++++++++++--
include/linux/swap.h | 15 ++-
include/linux/swapops.h | 25 ++++
include/trace/events/mmflags.h | 2 +-
kernel/events/uprobes.c | 2 +-
mm/gup.c | 103 ++++++++++++++-
mm/huge_memory.c | 122 +++++++++++++-----
mm/hugetlb.c | 137 ++++++++++++++------
mm/khugepaged.c | 2 +-
mm/ksm.c | 15 ++-
mm/memory-failure.c | 24 +++-
mm/memory.c | 221 ++++++++++++++++++++-------------
mm/memremap.c | 11 ++
mm/migrate.c | 40 +++++-
mm/mprotect.c | 8 +-
mm/page_alloc.c | 13 ++
mm/rmap.c | 95 ++++++++++----
mm/swapfile.c | 4 +-
mm/userfaultfd.c | 2 +-
23 files changed, 904 insertions(+), 227 deletions(-)

--
2.35.1


2022-03-08 19:36:20

by David Hildenbrand

Subject: [PATCH v1 10/15] mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages

The basic question we would like to have a reliable and efficient answer
to is: is this anonymous page exclusive to a single process or might it
be shared?

In an ideal world, we'd have a spare pageflag. Unfortunately, pageflags
don't grow on trees, so we have to get a little creative for the time
being.

Introduce a way to mark an anonymous page as exclusive, with the
ultimate goal of teaching our COW logic to not do "wrong COWs", whereby
GUP pins lose consistency with the pages mapped into the page table,
resulting in reported memory corruptions.

Most pageflags already have semantics for anonymous pages, so we're left
with reusing PG_slab for our purpose: for PageAnon() pages, PG_slab now
translates to PG_anon_exclusive; teach the in-kernel code that manually
handles PG_slab about that.

Add a spoiler on the semantics of PG_anon_exclusive as documentation. More
documentation will be contained in the code that actually makes use of
PG_anon_exclusive.

We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
zapping) of page table entries; the page freeing code will handle that when
it also invalidates page->mapping so the page no longer indicates PageAnon().
Letting the information about exclusivity stick around will be an important
property when adding sanity checks to the unpinning code.
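
For example, the freeing paths adjusted below (hugetlb, memremap,
page_alloc) all follow this pattern; a condensed sketch:

	if (PageAnon(page)) {
		/* PG_anon_exclusive aliases PG_slab, so clear it first ... */
		__ClearPageAnonExclusive(page);
		/* ... and make sure the clear is visible before the mapping
		 * goes away, to avoid a racing PageSlab() false positive. */
		wmb();
	}
	page->mapping = NULL;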

RFC notes: in-tree tools/cgroup/memcg_slabinfo.py looks like it might need
some care. We'd have to look up the head page and check if
PageAnon() is set. Similarly, tools living outside the kernel
repository like crash and makedumpfile might need adaptations.

Cc: Roman Gushchin <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/page.c | 3 +-
include/linux/page-flags.h | 124 ++++++++++++++++++++++++++++++++-
include/trace/events/mmflags.h | 2 +-
mm/hugetlb.c | 4 ++
mm/memory-failure.c | 24 +++++--
mm/memremap.c | 11 +++
mm/page_alloc.c | 13 ++++
7 files changed, 170 insertions(+), 11 deletions(-)

diff --git a/fs/proc/page.c b/fs/proc/page.c
index 9f1077d94cde..61187568e5f7 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -182,8 +182,7 @@ u64 stable_page_flags(struct page *page)

u |= kpf_copy_bit(k, KPF_LOCKED, PG_locked);

- u |= kpf_copy_bit(k, KPF_SLAB, PG_slab);
- if (PageTail(page) && PageSlab(compound_head(page)))
+ if (PageSlab(page) || PageSlab(compound_head(page)))
u |= 1 << KPF_SLAB;

u |= kpf_copy_bit(k, KPF_ERROR, PG_error);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1c3b6e5c8bfd..b3c2cf71467e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -107,7 +107,7 @@ enum pageflags {
PG_workingset,
PG_waiters, /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
PG_error,
- PG_slab,
+ PG_slab, /* Slab page if !PageAnon() */
PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
PG_arch_1,
PG_reserved,
@@ -142,6 +142,63 @@ enum pageflags {

PG_readahead = PG_reclaim,

+ /*
+ * Depending on the way an anonymous folio can be mapped into a page
+ * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
+ * THP), PG_anon_exclusive may be set only for the head page or for
+ * subpages of an anonymous folio.
+ *
+ * PG_anon_exclusive is *usually* only expressive in combination with a
+ * page table entry. Depending on the page table entry type it might
+ * store the following information:
+ *
+ * Is what's mapped via this page table entry exclusive to the
+ * single process and can be mapped writable without further
+ * checks? If not, it might be shared and we might have to COW.
+ *
+ * For now, we only expect PTE-mapped THPs to make use of
+ * PG_anon_exclusive in subpages. For other anonymous compound
+ * folios (i.e., hugetlb), only the head page is logically mapped and
+ * holds this information.
+ *
+ * For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
+ * set on the head page. When replacing the PMD by a page table full
+ * of PTEs, PG_anon_exclusive, if set on the head page, will be set on
+ * all tail pages accordingly. Note that converting from a PTE-mapping
+ * to a PMD mapping using the same compound page is currently not
+ * possible and consequently doesn't require care.
+ *
+ * If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
+ * it should only pin if the relevant PG_anon_exclusive bit is set. In that case,
+ * the pin will be fully reliable and stay consistent with the pages
+ * mapped into the page table, as the bit cannot get cleared (e.g., by
+ * fork(), KSM) while the page is pinned. For anonymous pages that
+ * are mapped R/W, PG_anon_exclusive can be assumed to always be set
+ * because such pages cannot possibly be shared.
+ *
+ * The page table lock protecting the page table entry is the primary
+ * synchronization mechanism for PG_anon_exclusive; GUP-fast that does
+ * not take the PT lock needs special care when trying to clear the
+ * flag.
+ *
+ * Page table entry types and PG_anon_exclusive:
+ * * Present: PG_anon_exclusive applies.
+ * * Swap: the information is lost. PG_anon_exclusive was cleared.
+ * * Migration: the entry holds this information instead.
+ * PG_anon_exclusive was cleared.
+ * * Device private: PG_anon_exclusive applies.
+ * * Device exclusive: PG_anon_exclusive applies.
+ * * HW Poison: PG_anon_exclusive is stale and not changed.
+ *
+ * If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
+ * not allowed and the flag will stick around until the page is freed
+ * and folio->mapping is cleared.
+ *
+ * Before clearing folio->mapping, PG_anon_exclusive has to be cleared
+ * to not result in PageSlab() false positives.
+ */
+ PG_anon_exclusive = PG_slab,
+
/* Filesystems */
PG_checked = PG_owner_priv_1,

@@ -425,7 +482,6 @@ PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
PAGEFLAG(Workingset, workingset, PF_HEAD)
TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
-__PAGEFLAG(Slab, slab, PF_NO_TAIL)
__PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
PAGEFLAG(Checked, checked, PF_NO_COMPOUND) /* Used by some filesystems */

@@ -920,6 +976,70 @@ extern bool is_free_buddy_page(struct page *page);

__PAGEFLAG(Isolated, isolated, PF_ANY);

+static __always_inline bool folio_test_slab(struct folio *folio)
+{
+ return !folio_test_anon(folio) &&
+ test_bit(PG_slab, folio_flags(folio, FOLIO_PF_NO_TAIL));
+}
+
+static __always_inline int PageSlab(struct page *page)
+{
+ return !PageAnon(page) &&
+ test_bit(PG_slab, &PF_NO_TAIL(page, 0)->flags);
+}
+
+static __always_inline void __folio_set_slab(struct folio *folio)
+{
+ VM_BUG_ON(folio_test_anon(folio));
+ __set_bit(PG_slab, folio_flags(folio, FOLIO_PF_NO_TAIL));
+}
+
+static __always_inline void __SetPageSlab(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(PageAnon(page), page);
+ __set_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
+}
+
+static __always_inline void __folio_clear_slab(struct folio *folio)
+{
+ VM_BUG_ON(folio_test_anon(folio));
+ __clear_bit(PG_slab, folio_flags(folio, FOLIO_PF_NO_TAIL));
+}
+
+static __always_inline void __ClearPageSlab(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(PageAnon(page), page);
+ __clear_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
+}
+
+static __always_inline int PageAnonExclusive(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+ VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+ return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void SetPageAnonExclusive(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
+ VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+ set_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void ClearPageAnonExclusive(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
+ VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+ clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void __ClearPageAnonExclusive(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+ VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+ __clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
#ifdef CONFIG_MMU
#define __PG_MLOCKED (1UL << PG_mlocked)
#else
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 116ed4d5d0f8..5662f74e027f 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -103,7 +103,7 @@
{1UL << PG_lru, "lru" }, \
{1UL << PG_active, "active" }, \
{1UL << PG_workingset, "workingset" }, \
- {1UL << PG_slab, "slab" }, \
+ {1UL << PG_slab, "slab/anon_exclusive" }, \
{1UL << PG_owner_priv_1, "owner_priv_1" }, \
{1UL << PG_arch_1, "arch_1" }, \
{1UL << PG_reserved, "reserved" }, \
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9fb990d95dab..bf5ce91c623c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1669,6 +1669,10 @@ void free_huge_page(struct page *page)
VM_BUG_ON_PAGE(page_mapcount(page), page);

hugetlb_set_page_subpool(page, NULL);
+ if (PageAnon(page)) {
+ __ClearPageAnonExclusive(page);
+ wmb(); /* avoid PageSlab() false positives */
+ }
page->mapping = NULL;
restore_reserve = HPageRestoreReserve(page);
ClearHPageRestoreReserve(page);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 97a9ed8f87a9..176fa2fbd699 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1470,17 +1470,29 @@ static int identify_page_state(unsigned long pfn, struct page *p,
* The first check uses the current page flags which may not have any
* relevant information. The second check with the saved page flags is
* carried out only if the first check can't determine the page status.
+ *
+ * Note that PG_slab is also used as PG_anon_exclusive for PageAnon()
+ * pages. Most of these pages should have been handled previously,
+ * however, let's play safe and verify via PageAnon().
*/
- for (ps = error_states;; ps++)
- if ((p->flags & ps->mask) == ps->res)
- break;
+ for (ps = error_states;; ps++) {
+ if ((p->flags & ps->mask) != ps->res)
+ continue;
+ if ((ps->type == MF_MSG_SLAB) && PageAnon(p))
+ continue;
+ break;
+ }

page_flags |= (p->flags & (1UL << PG_dirty));

if (!ps->mask)
- for (ps = error_states;; ps++)
- if ((page_flags & ps->mask) == ps->res)
- break;
+ for (ps = error_states;; ps++) {
+ if ((page_flags & ps->mask) != ps->res)
+ continue;
+ if ((ps->type == MF_MSG_SLAB) && PageAnon(p))
+ continue;
+ break;
+ }
return page_action(ps, p, pfn);
}

diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11f..a6b82ae8b3e6 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -478,6 +478,17 @@ void free_devmap_managed_page(struct page *page)

mem_cgroup_uncharge(page_folio(page));

+ /*
+ * Note: we don't expect anonymous compound pages yet. Once supported
+ * and we could PTE-map them similar to THP, we'd have to clear
+ * PG_anon_exclusive on all tail pages.
+ */
+ VM_BUG_ON_PAGE(PageAnon(page) && PageCompound(page), page);
+ if (PageAnon(page)) {
+ __ClearPageAnonExclusive(page);
+ wmb(); /* avoid PageSlab() false positives */
+ }
+
/*
* When a device_private page is freed, the page->mapping field
* may still contain a (stale) mapping value. For example, the
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..e59e68a61d1a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1297,6 +1297,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
{
int bad = 0;
bool skip_kasan_poison = should_skip_kasan_poison(page, fpi_flags);
+ const bool anon = PageAnon(page);

VM_BUG_ON_PAGE(PageTail(page), page);

@@ -1329,6 +1330,14 @@ static __always_inline bool free_pages_prepare(struct page *page,
ClearPageHasHWPoisoned(page);
}
for (i = 1; i < (1 << order); i++) {
+ /*
+ * Freeing a previously PTE-mapped THP can have
+ * PG_anon_exclusive set on tail pages. Clear it
+ * manually as it's overloaded with PG_slab that we
+ * want to catch in check_free_page().
+ */
+ if (anon)
+ __ClearPageAnonExclusive(page + i);
if (compound)
bad += free_tail_pages_check(page, page + i);
if (unlikely(check_free_page(page + i))) {
@@ -1338,6 +1347,10 @@ static __always_inline bool free_pages_prepare(struct page *page,
(page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
}
}
+ if (anon) {
+ __ClearPageAnonExclusive(page);
+ wmb(); /* avoid PageSlab() false positives */
+ }
if (PageMappingFlags(page))
page->mapping = NULL;
if (memcg_kmem_enabled() && PageMemcgKmem(page))
--
2.35.1

2022-03-08 19:37:56

by David Hildenbrand

Subject: [PATCH v1 11/15] mm: remember exclusively mapped anonymous pages with PG_anon_exclusive

Let's mark exclusively mapped anonymous pages with PG_anon_exclusive and
use that information to make GUP pins reliable: they stay consistent with
the page mapped into the page table even if the page table entry gets
write-protected.

With that information at hand, we can extend our COW logic to always
reuse anonymous pages that are exclusive. For anonymous pages that
might be shared, the existing logic applies.

As already documented, PG_anon_exclusive is usually only expressive in
combination with a page table entry. Especially PTE vs. PMD-mapped
anonymous pages require more thought, some examples: due to mremap() we
can easily have a single compound page PTE-mapped into multiple page tables
exclusively in a single process -- multiple page table locks apply.
Further, due to MADV_WIPEONFORK we might not necessarily write-protect
all PTEs, and only some subpages might be pinned. Long story short: once
PTE-mapped, we have to track information about exclusivity per subpage,
but until then, we can just track it for the compound page in the head
page and don't have to update a whole bunch of subpages all the time
for a simple PMD mapping of a THP.

For simplicity, this commit mostly talks about "anonymous pages", while
it's for THP actually "the part of an anonymous folio referenced via
a page table entry".

To not spill PG_anon_exclusive code all over the mm code-base, we let
the anon rmap code handle all the PG_anon_exclusive logic it can easily
handle.

If a writable, present page table entry points at an anonymous (sub)page,
that (sub)page must be PG_anon_exclusive. If GUP wants to take a reliable
pin (FOLL_PIN) on an anonymous page referenced via a present
page table entry, it must only pin if PG_anon_exclusive is set for the
mapped (sub)page.

This commit doesn't adjust GUP, so this is only implicitly handled for
FOLL_WRITE; follow-up commits will teach GUP to also respect it for
FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages
fully reliable.

Whenever an anonymous page is to be shared (fork(), KSM), or when
temporarily unmapping an anonymous page (swap, migration), the relevant
PG_anon_exclusive bit has to be cleared to mark the anonymous page
possibly shared. Clearing will fail if there are GUP pins on the page:
* For fork(), this means having to copy the page and not being able to
share it. fork() protects against concurrent GUP using the PT lock and
the src_mm->write_protect_seq.
* For KSM, this means sharing will fail. For swap, this means unmapping
will fail. For migration, this means migration will fail early. All
three cases protect against concurrent GUP using the PT lock and a
proper clear/invalidate+flush of the relevant page table entry.

This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
pinned page gets mapped R/O and the subsequent write fault ends up
replacing the page instead of reusing it. It improves the situation for
O_DIRECT/vmsplice/... that still use FOLL_GET instead of FOLL_PIN,
if fork() is *not* involved; however, swapout and fork() are still
problematic. Properly using FOLL_PIN instead of FOLL_GET for these
GUP users will fix the issue for them.

I. Details about basic handling

I.1. Fresh anonymous pages

page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
given page exclusive via __page_set_anon_rmap(exclusive=1). As that is
the mechanism by which fresh anonymous pages come to life (besides the
migration code, where we copy page->mapping), all fresh anonymous pages
will start out as exclusive.

I.2. COW reuse handling of anonymous pages

When a COW handler stumbles over a (sub)page that's marked exclusive, it
simply reuses it. Otherwise, the handler tries harder under page lock to
detect if the (sub)page is exclusive and can be reused. If exclusive,
page_move_anon_rmap() will mark the given (sub)page exclusive.

Note that hugetlb code does not yet check for PageAnonExclusive(), as it
still uses the old COW logic that is prone to the COW security issue
because hugetlb code cannot really tolerate unnecessary/wrong COW as
huge pages are a scarce resource.

I.3. Migration handling

try_to_migrate() has to try marking an exclusive anonymous page shared
via page_try_share_anon_rmap(). If it fails because there are GUP pins
on the page, unmap fails. migrate_vma_collect_pmd() and
__split_huge_pmd_locked() are handled similarly.

Writable migration entries implicitly point at exclusive anonymous pages.
For readable migration entries that information is stored via a new
"readable-exclusive" migration entry, specific to anonymous pages.

When restoring a migration entry in remove_migration_pte(), information
about exclusivity is detected via the migration entry type, and
RMAP_EXCLUSIVE is set accordingly for
page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that
information.

I.4. Swapout handling

try_to_unmap() has to try marking the mapped page possibly shared via
page_try_share_anon_rmap(). If it fails because there are GUP pins on the
page, unmap fails. For now, information about exclusivity is lost. In the
future, we might want to remember that information in the swap entry in
some cases, however, it requires more thought, care, and a way to store
that information in swap entries.

I.5. Swapin handling

do_swap_page() will never stumble over exclusive anonymous pages in the
swap cache, as try_to_migrate() prohibits that. do_swap_page() always has
to detect manually if an anonymous page is exclusive and has to set
RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.

I.6. THP handling

__split_huge_pmd_locked() has to move the information about exclusivity
from the PMD to the PTEs.

a) In case we have a readable-exclusive PMD migration entry, simply insert
readable-exclusive PTE migration entries.

b) In case we have a present PMD entry and we don't want to freeze
("convert to migration entries"), simply forward PG_anon_exclusive to
all sub-pages, no need to temporarily clear the bit.

c) In case we have a present PMD entry and want to freeze, handle it
similar to try_to_migrate(): try marking the page shared first. In case
we fail, we ignore the "freeze" instruction and simply split ordinarily.
try_to_migrate() will properly fail because the THP is still mapped via
PTEs.

When splitting a compound anonymous folio (THP), the information about
exclusivity is implicitly handled via the migration entries: no need to
replicate PG_anon_exclusive manually.

I.7. fork() handling

fork() handling is relatively easy, because PG_anon_exclusive is only
expressive for some page table entry types.

a) Present anonymous pages

page_try_dup_anon_rmap() will mark the given subpage shared -- which
will fail if the page is pinned. If it fails, we have to copy the page (or
PTE-map a PMD to handle it on the PTE level). See the sketch at the end of
this section for the resulting caller pattern.

Note that device exclusive entries are just a pointer at a PageAnon()
page. fork() will first convert a device exclusive entry to a present
page table entry and handle it just like present anonymous pages.

b) Device private entry

Device private entries point at PageAnon() pages that cannot be mapped
directly and, therefore, cannot get pinned.

page_try_dup_anon_rmap() will mark the given subpage shared, which
cannot fail because they cannot get pinned.

c) HW poison entries

PG_anon_exclusive will remain untouched and is stale -- the page table
entry is just a placeholder after all.

d) Migration entries

Writable and readable-exclusive entries are converted to readable
entries: possibly shared.
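
For illustration, the fork() caller pattern for case a) on the PTE level
roughly looks as follows; this is a simplified sketch of the
copy_present_pte() flow, not the literal code, with the argument list of
the copy fallback abbreviated:

	/* fork(): duplicating a present PTE mapping an anonymous page,
	 * protected by the PT lock and src_mm->write_protect_seq. */
	get_page(page);
	if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
		/* The page may be pinned: we must not share it with the
		 * child. Fall back to copying it into a new, exclusive
		 * anonymous page (copy_present_page()). */
		put_page(page);
		return copy_present_page(/* ..., page */);
	}
	/* Success: PG_anon_exclusive was cleared, parent and child map the
	 * (now possibly shared) page R/O. */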

I.8. mprotect() handling

mprotect() only has to properly handle the new readable-exclusive
migration entry:

When write-protecting a migration entry that points at an anonymous
page, remember the information about exclusivity via the
"readable-exclusive" migration entry type.

II. Migration and GUP-fast

Whenever replacing a present page table entry that maps an exclusive
anonymous page by a migration entry, we have to mark the page possibly
shared and synchronize against GUP-fast by a proper
clear/invalidate+flush to make the following scenario impossible:

1. try_to_migrate() places a migration entry after checking for GUP pins
and marks the page possibly shared.
2. GUP-fast pins the page due to lack of synchronization
3. fork() converts the "writable/readable-exclusive" migration entry into a
readable migration entry
4. Migration fails due to the GUP pin (failing to freeze the refcount)
5. Migration entries are restored. PG_anon_exclusive is lost

-> We have a pinned page that is not marked exclusive anymore.

Note that we move information about exclusivity from the page to the
migration entry as it otherwise highly overcomplicates fork() and
PTE-mapping a THP.
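
In code, the required ordering on the unmap side looks roughly like this
(a condensed sketch of what try_to_migrate_one() and
migrate_vma_collect_pmd() do in this patch, not the exact code):

	/* Under the PT lock: */
	flush_cache_page(vma, address, pte_pfn(*ptep));
	pteval = ptep_clear_flush(vma, address, ptep);	/* stops GUP-fast */

	if (anon_exclusive && page_try_share_anon_rmap(page)) {
		/* The page may be pinned: restore the PTE and fail. */
		set_pte_at(mm, address, ptep, pteval);
		return false;
	}
	/* Safe now: install the readable-exclusive (or plain readable)
	 * migration entry. */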

III. Swapout and GUP-fast

Whenever replacing a present page table entry that maps an exclusive
anonymous page by a swap entry, we have to mark the page possibly
shared and synchronize against GUP-fast by a proper
clear/invalidate+flush to make the following scenario impossible:

1. try_to_unmap() places a swap entry after checking for GUP pins and
clears exclusivity information on the page.
2. GUP-fast pins the page due to lack of synchronization.

-> We have a pinned page that is not marked exclusive anymore.

If we'd ever store information about exclusivity in the swap entry,
similar to migration handling, the same considerations as in II would
apply. This is future work.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/rmap.h | 33 ++++++++++++++++++
include/linux/swap.h | 15 ++++++---
include/linux/swapops.h | 25 ++++++++++++++
mm/huge_memory.c | 75 ++++++++++++++++++++++++++++++++++++-----
mm/hugetlb.c | 15 ++++++---
mm/ksm.c | 13 ++++++-
mm/memory.c | 33 +++++++++++++-----
mm/migrate.c | 34 +++++++++++++++++--
mm/mprotect.c | 8 +++--
mm/rmap.c | 59 +++++++++++++++++++++++++++++---
10 files changed, 275 insertions(+), 35 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 51953bace0a3..1bc522d28a78 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -235,6 +235,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
unlikely(page_needs_cow_for_dma(vma, page))))
return -EBUSY;

+ ClearPageAnonExclusive(page);
/*
* It's okay to share the anon page between both processes, mapping
* the page R/O into both processes.
@@ -243,6 +244,38 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
return 0;
}

+/**
+ * page_try_share_anon_rmap - try marking an exclusive anonymous page possibly
+ * shared to prepare for KSM or temporary unmapping
+ * @page: the exclusive anonymous page to try marking possibly shared
+ *
+ * The caller needs to hold the PT lock and has to have the page table entry
+ * cleared/invalidated+flushed, to properly sync against GUP-fast.
+ *
+ * This is similar to page_try_dup_anon_rmap(), however, not used during fork()
+ * to duplicate a mapping, but instead to prepare for KSM or temporarily
+ * unmapping a page (swap, migration) via page_remove_rmap().
+ *
+ * Marking the page shared can only fail if the page may be pinned; device
+ * private pages cannot get pinned and consequently this function cannot fail.
+ *
+ * Returns 0 if marking the page possibly shared succeeded. Returns -EBUSY
+ * otherwise.
+ */
+static inline int page_try_share_anon_rmap(struct page *page)
+{
+ VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);
+
+ /* See page_try_dup_anon_rmap(). */
+ if (likely(!is_device_private_page(page) &&
+ unlikely(page_maybe_dma_pinned(page))))
+ return -EBUSY;
+
+ ClearPageAnonExclusive(page);
+ return 0;
+}
+
+
/*
* Called from mm/vmscan.c to handle paging out
*/
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b546e4bd5c5a..422765d1141c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -78,12 +78,19 @@ static inline int current_is_kswapd(void)
#endif

/*
- * NUMA node memory migration support
+ * Page migration support.
+ *
+ * SWP_MIGRATION_READ_EXCLUSIVE is only applicable to anonymous pages and
+ * indicates that the referenced (part of) an anonymous page is exclusive to
+ * a single process. For SWP_MIGRATION_WRITE, that information is implicit:
+ * (part of) an anonymous page that are mapped writable are exclusive to a
+ * single process.
*/
#ifdef CONFIG_MIGRATION
-#define SWP_MIGRATION_NUM 2
-#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM)
-#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
+#define SWP_MIGRATION_NUM 3
+#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM)
+#define SWP_MIGRATION_READ_EXCLUSIVE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
+#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 2)
#else
#define SWP_MIGRATION_NUM 0
#endif
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d356ab4047f7..06280fc1c99b 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -194,6 +194,7 @@ static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
static inline int is_migration_entry(swp_entry_t entry)
{
return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
+ swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE ||
swp_type(entry) == SWP_MIGRATION_WRITE);
}

@@ -202,11 +203,26 @@ static inline int is_writable_migration_entry(swp_entry_t entry)
return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
}

+static inline int is_readable_migration_entry(swp_entry_t entry)
+{
+ return unlikely(swp_type(entry) == SWP_MIGRATION_READ);
+}
+
+static inline int is_readable_exclusive_migration_entry(swp_entry_t entry)
+{
+ return unlikely(swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE);
+}
+
static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
{
return swp_entry(SWP_MIGRATION_READ, offset);
}

+static inline swp_entry_t make_readable_exclusive_migration_entry(pgoff_t offset)
+{
+ return swp_entry(SWP_MIGRATION_READ_EXCLUSIVE, offset);
+}
+
static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
{
return swp_entry(SWP_MIGRATION_WRITE, offset);
@@ -224,6 +240,11 @@ static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
return swp_entry(0, 0);
}

+static inline swp_entry_t make_readable_exclusive_migration_entry(pgoff_t offset)
+{
+ return swp_entry(0, 0);
+}
+
static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
{
return swp_entry(0, 0);
@@ -244,6 +265,10 @@ static inline int is_writable_migration_entry(swp_entry_t entry)
{
return 0;
}
+static inline int is_readable_migration_entry(swp_entry_t entry)
+{
+ return 0;
+}

#endif

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0b6fb409b9e4..0e83c1551da3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1054,7 +1054,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
swp_entry_t entry = pmd_to_swp_entry(pmd);

VM_BUG_ON(!is_pmd_migration_entry(pmd));
- if (is_writable_migration_entry(entry)) {
+ if (!is_readable_migration_entry(entry)) {
entry = make_readable_migration_entry(
swp_offset(entry));
pmd = swp_entry_to_pmd(entry);
@@ -1292,6 +1292,10 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
page = pmd_page(orig_pmd);
VM_BUG_ON_PAGE(!PageHead(page), page);

+ /* Early check when only holding the PT lock. */
+ if (PageAnonExclusive(page))
+ goto reuse;
+
if (!trylock_page(page)) {
get_page(page);
spin_unlock(vmf->ptl);
@@ -1306,6 +1310,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
put_page(page);
}

+ /* Recheck after temporarily dropping the PT lock. */
+ if (PageAnonExclusive(page)) {
+ unlock_page(page);
+ goto reuse;
+ }
+
/*
* See do_wp_page(): we can only map the page writable if there are
* no additional references. Note that we always drain the LRU
@@ -1319,11 +1329,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
pmd_t entry;

page_move_anon_rmap(page, vma);
+ unlock_page(page);
+reuse:
entry = pmd_mkyoung(orig_pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
- unlock_page(page);
spin_unlock(vmf->ptl);
return VM_FAULT_WRITE;
}
@@ -1741,6 +1752,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
if (is_swap_pmd(*pmd)) {
swp_entry_t entry = pmd_to_swp_entry(*pmd);
+ struct page *page = pfn_swap_entry_to_page(entry);

VM_BUG_ON(!is_pmd_migration_entry(*pmd));
if (is_writable_migration_entry(entry)) {
@@ -1749,8 +1761,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
* A protection check is difficult so
* just be safe and disable write
*/
- entry = make_readable_migration_entry(
- swp_offset(entry));
+ if (PageAnon(page))
+ entry = make_readable_exclusive_migration_entry(swp_offset(entry));
+ else
+ entry = make_readable_migration_entry(swp_offset(entry));
newpmd = swp_entry_to_pmd(entry);
if (pmd_swp_soft_dirty(*pmd))
newpmd = pmd_swp_mksoft_dirty(newpmd);
@@ -1959,6 +1973,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
pgtable_t pgtable;
pmd_t old_pmd, _pmd;
bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
+ bool anon_exclusive = false;
unsigned long addr;
int i;

@@ -2040,6 +2055,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
entry = pmd_to_swp_entry(old_pmd);
page = pfn_swap_entry_to_page(entry);
write = is_writable_migration_entry(entry);
+ if (PageAnon(page))
+ anon_exclusive = is_readable_exclusive_migration_entry(entry);
young = false;
soft_dirty = pmd_swp_soft_dirty(old_pmd);
uffd_wp = pmd_swp_uffd_wp(old_pmd);
@@ -2051,6 +2068,23 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
young = pmd_young(old_pmd);
soft_dirty = pmd_soft_dirty(old_pmd);
uffd_wp = pmd_uffd_wp(old_pmd);
+
+ /*
+ * Without "freeze", we'll simply split the PMD, propagating the
+ * PageAnonExclusive() flag for each PTE by setting it for
+ * each subpage -- no need to (temporarily) clear.
+ *
+ * With "freeze" we want to replace mapped pages by
+ * migration entries right away. This is only possible if we
+ * managed to clear PageAnonExclusive() -- see
+ * set_pmd_migration_entry().
+ *
+ * In case we cannot clear PageAnonExclusive(), split the PMD
+ * only and let try_to_migrate_one() fail later.
+ */
+ anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
+ if (freeze && anon_exclusive && page_try_share_anon_rmap(page))
+ freeze = false;
}
VM_BUG_ON_PAGE(!page_count(page), page);
page_ref_add(page, HPAGE_PMD_NR - 1);
@@ -2074,6 +2108,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (write)
swp_entry = make_writable_migration_entry(
page_to_pfn(page + i));
+ else if (anon_exclusive)
+ swp_entry = make_readable_exclusive_migration_entry(
+ page_to_pfn(page + i));
else
swp_entry = make_readable_migration_entry(
page_to_pfn(page + i));
@@ -2085,6 +2122,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
} else {
entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
entry = maybe_mkwrite(entry, vma);
+ if (anon_exclusive)
+ SetPageAnonExclusive(page + i);
if (!write)
entry = pte_wrprotect(entry);
if (!young)
@@ -2315,8 +2354,11 @@ static void __split_huge_page_tail(struct page *head, int tail,
*
* After successful get_page_unless_zero() might follow flags change,
* for example lock_page() which set PG_waiters.
+ *
+ * Keep PG_anon_exclusive information, already maintained for all
+ * subpages of a compound page, untouched.
*/
- page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+ page_tail->flags &= ~(PAGE_FLAGS_CHECK_AT_PREP & ~PG_anon_exclusive);
page_tail->flags |= (head->flags &
((1L << PG_referenced) |
(1L << PG_swapbacked) |
@@ -3070,6 +3112,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
struct vm_area_struct *vma = pvmw->vma;
struct mm_struct *mm = vma->vm_mm;
unsigned long address = pvmw->address;
+ bool anon_exclusive;
pmd_t pmdval;
swp_entry_t entry;
pmd_t pmdswp;
@@ -3079,10 +3122,19 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,

flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
+
+ anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
+ if (anon_exclusive && page_try_share_anon_rmap(page)) {
+ set_pmd_at(mm, address, pvmw->pmd, pmdval);
+ return;
+ }
+
if (pmd_dirty(pmdval))
set_page_dirty(page);
if (pmd_write(pmdval))
entry = make_writable_migration_entry(page_to_pfn(page));
+ else if (anon_exclusive)
+ entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
else
entry = make_readable_migration_entry(page_to_pfn(page));
pmdswp = swp_entry_to_pmd(entry);
@@ -3116,10 +3168,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));

flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
- if (PageAnon(new))
- page_add_anon_rmap(new, vma, mmun_start, RMAP_COMPOUND);
- else
+ if (PageAnon(new)) {
+ int rmap_flags = RMAP_COMPOUND;
+
+ if (!is_readable_migration_entry(entry))
+ rmap_flags |= RMAP_EXCLUSIVE;
+
+ page_add_anon_rmap(new, vma, mmun_start, rmap_flags);
+ } else {
page_add_file_rmap(new, true);
+ }
+ VM_BUG_ON(pmd_write(pmde) && PageAnon(new) && !PageAnonExclusive(new));
set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
mlock_vma_page(new);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bf5ce91c623c..195a1a5d8b20 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4767,7 +4767,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
is_hugetlb_entry_hwpoisoned(entry))) {
swp_entry_t swp_entry = pte_to_swp_entry(entry);

- if (is_writable_migration_entry(swp_entry) && cow) {
+ if (!is_readable_migration_entry(swp_entry) && cow) {
/*
* COW mappings require pages in both
* parent and child to be set to read.
@@ -5167,6 +5167,8 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
set_huge_ptep_writable(vma, haddr, ptep);
return 0;
}
+ VM_BUG_ON_PAGE(PageAnon(old_page) && PageAnonExclusive(old_page),
+ old_page);

/*
* If the process that created a MAP_PRIVATE mapping is about to
@@ -6162,12 +6164,17 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
}
if (unlikely(is_hugetlb_entry_migration(pte))) {
swp_entry_t entry = pte_to_swp_entry(pte);
+ struct page *page = pfn_swap_entry_to_page(entry);

- if (is_writable_migration_entry(entry)) {
+ if (!is_readable_migration_entry(entry)) {
pte_t newpte;

- entry = make_readable_migration_entry(
- swp_offset(entry));
+ if (PageAnon(page))
+ entry = make_readable_exclusive_migration_entry(
+ swp_offset(entry));
+ else
+ entry = make_readable_migration_entry(
+ swp_offset(entry));
newpte = swp_entry_to_pte(entry);
set_huge_swap_pte_at(mm, address, ptep,
newpte, huge_page_size(h));
diff --git a/mm/ksm.c b/mm/ksm.c
index 9ff28097bc0a..350b46432d65 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -866,6 +866,7 @@ static inline struct stable_node *page_stable_node(struct page *page)
static inline void set_page_stable_node(struct page *page,
struct stable_node *stable_node)
{
+ VM_BUG_ON_PAGE(PageAnon(page) && PageAnonExclusive(page), page);
page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM);
}

@@ -1041,6 +1042,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
int swapped;
int err = -EFAULT;
struct mmu_notifier_range range;
+ bool anon_exclusive;

pvmw.address = page_address_in_vma(page, vma);
if (pvmw.address == -EFAULT)
@@ -1058,9 +1060,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?"))
goto out_unlock;

+ anon_exclusive = PageAnonExclusive(page);
if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
(pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) ||
- mm_tlb_flush_pending(mm)) {
+ anon_exclusive || mm_tlb_flush_pending(mm)) {
pte_t entry;

swapped = PageSwapCache(page);
@@ -1088,6 +1091,12 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
set_pte_at(mm, pvmw.address, pvmw.pte, entry);
goto out_unlock;
}
+
+ if (anon_exclusive && page_try_share_anon_rmap(page)) {
+ set_pte_at(mm, pvmw.address, pvmw.pte, entry);
+ goto out_unlock;
+ }
+
if (pte_dirty(entry))
set_page_dirty(page);

@@ -1146,6 +1155,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
pte_unmap_unlock(ptep, ptl);
goto out_mn;
}
+ VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
+ VM_BUG_ON_PAGE(PageAnon(kpage) && PageAnonExclusive(kpage), kpage);

/*
* No need to check ksm_use_zero_pages here: we can only have a
diff --git a/mm/memory.c b/mm/memory.c
index 7b32f422798d..a804f12d878c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -720,6 +720,8 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
else if (is_writable_device_exclusive_entry(entry))
pte = maybe_mkwrite(pte_mkdirty(pte), vma);

+ VM_BUG_ON(pte_write(pte) && !(PageAnon(page) && PageAnonExclusive(page)));
+
/*
* No need to take a page reference as one was already
* created when the swap entry was made.
@@ -799,11 +801,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,

rss[mm_counter(page)]++;

- if (is_writable_migration_entry(entry) &&
+ if (!is_readable_migration_entry(entry) &&
is_cow_mapping(vm_flags)) {
/*
- * COW mappings require pages in both
- * parent and child to be set to read.
+ * COW mappings require pages in both parent and child
+ * to be set to read. A previously exclusive entry is
+ * now shared.
*/
entry = make_readable_migration_entry(
swp_offset(entry));
@@ -954,6 +957,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
ptep_set_wrprotect(src_mm, addr, src_pte);
pte = pte_wrprotect(pte);
}
+ VM_BUG_ON(page && PageAnon(page) && PageAnonExclusive(page));

/*
* If it's a shared mapping, mark it clean in
@@ -2943,6 +2947,9 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
struct page *page = vmf->page;
pte_t entry;
+
+ VM_BUG_ON(PageAnon(page) && !PageAnonExclusive(page));
+
/*
* Clear the pages cpupid information as the existing
* information potentially belongs to a now completely
@@ -3277,6 +3284,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
if (PageAnon(vmf->page)) {
struct page *page = vmf->page;

+ /*
+ * If the page is exclusive to this process we must reuse the
+ * page without further checks.
+ */
+ if (PageAnonExclusive(page))
+ goto reuse;
+
/*
* We have to verify under page lock: these early checks are
* just an optimization to avoid locking the page and freeing
@@ -3309,6 +3323,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
*/
page_move_anon_rmap(page, vma);
unlock_page(page);
+reuse:
wp_page_reuse(vmf);
return VM_FAULT_WRITE;
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
@@ -3689,11 +3704,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* that are certainly not shared because we just allocated them without
* exposing them to the swapcache.
*/
- if ((vmf->flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
- (page != swapcache || page_count(page) == 1)) {
- pte = maybe_mkwrite(pte_mkdirty(pte), vma);
- vmf->flags &= ~FAULT_FLAG_WRITE;
- ret |= VM_FAULT_WRITE;
+ if (!PageKsm(page) && (page != swapcache || page_count(page) == 1)) {
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+ vmf->flags &= ~FAULT_FLAG_WRITE;
+ ret |= VM_FAULT_WRITE;
+ }
rmap_flags |= RMAP_EXCLUSIVE;
}
flush_icache_page(vma, page);
@@ -3713,6 +3729,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
page_add_anon_rmap(page, vma, vmf->address, rmap_flags);
}

+ VM_BUG_ON(!PageAnon(page) || (pte_write(pte) && !PageAnonExclusive(page)));
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);

diff --git a/mm/migrate.c b/mm/migrate.c
index fd9eba33b34a..7f440d2103ce 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -188,6 +188,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,

VM_BUG_ON_PAGE(PageTail(page), page);
while (page_vma_mapped_walk(&pvmw)) {
+ rmap_t rmap_flags = RMAP_NONE;
+
if (PageKsm(page))
new = page;
else
@@ -217,6 +219,9 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
else if (pte_swp_uffd_wp(*pvmw.pte))
pte = pte_mkuffd_wp(pte);

+ if (PageAnon(new) && !is_readable_migration_entry(entry))
+ rmap_flags |= RMAP_EXCLUSIVE;
+
if (unlikely(is_device_private_page(new))) {
if (pte_write(pte))
entry = make_writable_device_private_entry(
@@ -239,7 +244,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
if (PageAnon(new))
hugepage_add_anon_rmap(new, vma, pvmw.address,
- RMAP_NONE);
+ rmap_flags);
else
page_dup_file_rmap(new, true);
set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
@@ -248,7 +253,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
{
if (PageAnon(new))
page_add_anon_rmap(new, vma, pvmw.address,
- RMAP_NONE);
+ rmap_flags);
else
page_add_file_rmap(new, false);
set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
@@ -379,6 +384,10 @@ int folio_migrate_mapping(struct address_space *mapping,
/* No turning back from here */
newfolio->index = folio->index;
newfolio->mapping = folio->mapping;
+ /*
+ * Note: PG_anon_exclusive is always migrated via migration
+ * entries.
+ */
if (folio_test_swapbacked(folio))
__folio_set_swapbacked(newfolio);

@@ -2302,15 +2311,34 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
* set up a special migration page table entry now.
*/
if (trylock_page(page)) {
+ bool anon_exclusive;
pte_t swp_pte;

+ anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
+ if (anon_exclusive) {
+ flush_cache_page(vma, addr, pte_pfn(*ptep));
+ ptep_clear_flush(vma, addr, ptep);
+
+ if (page_try_share_anon_rmap(page)) {
+ set_pte_at(mm, addr, ptep, pte);
+ unlock_page(page);
+ put_page(page);
+ mpfn = 0;
+ goto next;
+ }
+ } else {
+ ptep_get_and_clear(mm, addr, ptep);
+ }
+
migrate->cpages++;
- ptep_get_and_clear(mm, addr, ptep);

/* Setup special migration page table entry */
if (mpfn & MIGRATE_PFN_WRITE)
entry = make_writable_migration_entry(
page_to_pfn(page));
+ else if (anon_exclusive)
+ entry = make_readable_exclusive_migration_entry(
+ page_to_pfn(page));
else
entry = make_readable_migration_entry(
page_to_pfn(page));
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2887644fd150..9441675e993c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -141,6 +141,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
+ struct page *page = pfn_swap_entry_to_page(entry);
pte_t newpte;

if (is_writable_migration_entry(entry)) {
@@ -148,8 +149,11 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
* A protection check is difficult so
* just be safe and disable write
*/
- entry = make_readable_migration_entry(
- swp_offset(entry));
+ if (PageAnon(page))
+ entry = make_readable_exclusive_migration_entry(
+ swp_offset(entry));
+ else
+ entry = make_readable_migration_entry(swp_offset(entry));
newpte = swp_entry_to_pte(entry);
if (pte_swp_soft_dirty(oldpte))
newpte = pte_swp_mksoft_dirty(newpte);
diff --git a/mm/rmap.c b/mm/rmap.c
index ebe7140c4493..9d2a7e11e8cc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1048,6 +1048,7 @@ EXPORT_SYMBOL_GPL(folio_mkclean);
void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
+ struct page *subpage = page;

page = compound_head(page);

@@ -1061,6 +1062,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
* PageAnon()) will not see one without the other.
*/
WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
+ SetPageAnonExclusive(subpage);
}

/**
@@ -1078,7 +1080,7 @@ static void __page_set_anon_rmap(struct page *page,
BUG_ON(!anon_vma);

if (PageAnon(page))
- return;
+ goto out;

/*
* If the page isn't exclusively mapped into this vma,
@@ -1097,6 +1099,9 @@ static void __page_set_anon_rmap(struct page *page,
anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
page->index = linear_page_index(vma, address);
+out:
+ if (exclusive)
+ SetPageAnonExclusive(page);
}

/**
@@ -1156,6 +1161,8 @@ void page_add_anon_rmap(struct page *page,
} else {
first = atomic_inc_and_test(&page->_mapcount);
}
+ VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
+ VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page);

if (first) {
int nr = compound ? thp_nr_pages(page) : 1;
@@ -1422,7 +1429,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
};
pte_t pteval;
struct page *subpage;
- bool ret = true;
+ bool anon_exclusive, ret = true;
struct mmu_notifier_range range;
enum ttu_flags flags = (enum ttu_flags)(long)arg;

@@ -1485,6 +1492,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,

subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
address = pvmw.address;
+ anon_exclusive = PageAnon(page) && PageAnonExclusive(subpage);

if (PageHuge(page) && !PageAnon(page)) {
/*
@@ -1520,9 +1528,12 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
}
}

- /* Nuke the page table entry. */
+ /*
+ * Nuke the page table entry. When having to clear
+ * PageAnonExclusive(), we always have to flush.
+ */
flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
- if (should_defer_flush(mm, flags)) {
+ if (should_defer_flush(mm, flags) && !anon_exclusive) {
/*
* We clear the PTE but do not flush so potentially
* a remote CPU could still be writing to the page.
@@ -1623,6 +1634,24 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
page_vma_mapped_walk_done(&pvmw);
break;
}
+ if (anon_exclusive &&
+ page_try_share_anon_rmap(subpage)) {
+ swap_free(entry);
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ ret = false;
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
+ /*
+ * Note: We *don't* remember yet if the page was mapped
+ * exclusively in the swap entry, so swapin code has
+ * to re-determine that manually and might detect the
+ * page as possibly shared, for example, if there are
+ * other references on the page or if the page is under
+ * writeback. We made sure that there are no GUP pins
+ * on the page that would rely on it, so for GUP pins
+ * this is fine.
+ */
if (list_empty(&mm->mmlist)) {
spin_lock(&mmlist_lock);
if (list_empty(&mm->mmlist))
@@ -1723,7 +1752,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
};
pte_t pteval;
struct page *subpage;
- bool ret = true;
+ bool anon_exclusive, ret = true;
struct mmu_notifier_range range;
enum ttu_flags flags = (enum ttu_flags)(long)arg;

@@ -1782,6 +1811,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,

subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
address = pvmw.address;
+ anon_exclusive = PageAnon(page) && PageAnonExclusive(subpage);

if (PageHuge(page) && !PageAnon(page)) {
/*
@@ -1833,6 +1863,9 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
swp_entry_t entry;
pte_t swp_pte;

+ if (anon_exclusive)
+ BUG_ON(page_try_share_anon_rmap(subpage));
+
/*
* Store the pfn of the page in a special migration
* pte. do_swap_page() will wait until the migration
@@ -1841,6 +1874,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
entry = pte_to_swp_entry(pteval);
if (is_writable_device_private_entry(entry))
entry = make_writable_migration_entry(pfn);
+ else if (anon_exclusive)
+ entry = make_readable_exclusive_migration_entry(pfn);
else
entry = make_readable_migration_entry(pfn);
swp_pte = swp_entry_to_pte(entry);
@@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
page_vma_mapped_walk_done(&pvmw);
break;
}
+ VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) &&
+ !anon_exclusive, page);
+ if (anon_exclusive &&
+ page_try_share_anon_rmap(subpage)) {
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ ret = false;
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }

/*
* Store the pfn of the page in a special migration
@@ -1912,6 +1956,9 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
if (pte_write(pteval))
entry = make_writable_migration_entry(
page_to_pfn(subpage));
+ else if (anon_exclusive)
+ entry = make_readable_exclusive_migration_entry(
+ page_to_pfn(subpage));
else
entry = make_readable_migration_entry(
page_to_pfn(subpage));
@@ -2425,6 +2472,8 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
BUG_ON(!anon_vma);
/* address might be in next vma when migration races vma_adjust */
first = atomic_inc_and_test(compound_mapcount_ptr(page));
+ VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
+ VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page);
if (first)
__page_set_anon_rmap(page, vma, address,
flags & RMAP_EXCLUSIVE);
--
2.35.1

2022-03-08 20:39:05

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 14/15] mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page

Whenever GUP currently ends up taking a R/O pin on an anonymous page that
might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
on the page table entry will end up replacing the mapped anonymous page
due to COW, resulting in the GUP pin no longer being consistent with the
page actually mapped into the page table.

The possible ways to deal with this situation are:
(1) Ignore and pin -- what we do right now.
(2) Fail to pin -- which would be rather surprising to callers and
could break user space.
(3) Trigger unsharing and pin the now exclusive page -- reliable R/O
pins.

Let's implement 3) because it provides the clearest semantics and
allows for checking in unpin_user_pages() and friends for possible BUGs:
when trying to unpin a page that's no longer exclusive, clearly
something went very wrong and might result in memory corruptions that
might be hard to debug. So we better have a nice way to spot such
issues.

This change implies that whenever user space *wrote* to a private
mapping (IOW, we have an anonymous page mapped), that GUP pins will
always remain consistent: reliable R/O GUP pins of anonymous pages.
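
To make the new control flow concrete, here is a minimal, self-contained
sketch of the caller-visible retry semantics (stand-in types, helpers and
flag values only, not the kernel code -- the authoritative pieces are
gup_must_unshare() and the faultin_page() changes in the diff below):

#include <stdbool.h>

/* stand-in flag values for this model only */
#define FOLL_WRITE 0x01
#define FOLL_PIN   0x02

struct page_model {
	bool anon;		/* PageAnon()          */
	bool anon_exclusive;	/* PageAnonExclusive() */
};

/* A R/O FOLL_PIN on a possibly shared anonymous page must unshare first. */
static bool must_unshare(unsigned int flags, const struct page_model *p)
{
	if ((flags & (FOLL_WRITE | FOLL_PIN)) != FOLL_PIN)
		return false;
	return p->anon && !p->anon_exclusive;
}

/* Model of the GUP retry: -EMLINK -> FAULT_FLAG_UNSHARE fault -> retry. */
static void pin_read_only(unsigned int flags, struct page_model *p)
{
	while (must_unshare(flags, p))
		p->anon_exclusive = true;	/* the unshare fault leaves an
						 * exclusive (possibly fresh)
						 * anonymous page behind */
	/* now safe to take the FOLL_PIN reference */
}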

As a side note, this commit fixes the COW security issue for hugetlb with
FOLL_PIN as documented in:
https://lore.kernel.org/r/[email protected]
The vmsplice reproducer still applies, because vmsplice uses FOLL_GET
instead of FOLL_PIN.

Note that follow_huge_pmd() doesn't apply because we cannot end up in
there with FOLL_PIN.

This commit is heavily based on prototype patches by Andrea.

Co-developed-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm.h | 39 +++++++++++++++++++++++++++++++++++++++
mm/gup.c | 42 +++++++++++++++++++++++++++++++++++++++---
mm/huge_memory.c | 3 +++
mm/hugetlb.c | 27 ++++++++++++++++++++++++---
4 files changed, 105 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 63ee06001189..45fe97213a27 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3003,6 +3003,45 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
return 0;
}

+/*
+ * Indicates for which pages that are write-protected in the page table,
+ * whether GUP has to trigger unsharing via FAULT_FLAG_UNSHARE such that the
+ * GUP pin will remain consistent with the pages mapped into the page tables
+ * of the MM.
+ *
+ * Temporary unmapping of PageAnonExclusive() pages or clearing of
+ * PageAnonExclusive() has to protect against concurrent GUP:
+ * * Ordinary GUP: Using the PT lock
+ * * GUP-fast and fork(): mm->write_protect_seq
+ * * GUP-fast and KSM or temporary unmapping (swap, migration):
+ * clear/invalidate+flush of the page table entry
+ *
+ * Must be called with the (sub)page that's actually referenced via the
+ * page table entry, which might not necessarily be the head page for a
+ * PTE-mapped THP.
+ */
+static inline bool gup_must_unshare(unsigned int flags, struct page *page)
+{
+ /*
+ * FOLL_WRITE is implicitly handled correctly as the page table entry
+ * has to be writable -- and if it references (part of) an anonymous
+ * folio, that part is required to be marked exclusive.
+ */
+ if ((flags & (FOLL_WRITE | FOLL_PIN)) != FOLL_PIN)
+ return false;
+ /*
+ * Note: PageAnon(page) is stable until the page is actually getting
+ * freed.
+ */
+ if (!PageAnon(page))
+ return false;
+ /*
+ * Note that PageKsm() pages cannot be exclusive, and consequently,
+ * cannot get pinned.
+ */
+ return !PageAnonExclusive(page);
+}
+
typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
unsigned long size, pte_fn_t fn, void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 7127e789d210..9e4864130202 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -568,6 +568,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
}
}

+ if (!pte_write(pte) && gup_must_unshare(flags, page)) {
+ page = ERR_PTR(-EMLINK);
+ goto out;
+ }
/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
if (unlikely(!try_grab_page(page, flags))) {
page = ERR_PTR(-ENOMEM);
@@ -820,6 +824,11 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
* When getting pages from ZONE_DEVICE memory, the @ctx->pgmap caches
* the device's dev_pagemap metadata to avoid repeating expensive lookups.
*
+ * When getting an anonymous page and the caller has to trigger unsharing
+ * of a shared anonymous page first, -EMLINK is returned. The caller should
+ * trigger a fault with FAULT_FLAG_UNSHARE set. Note that unsharing is only
+ * relevant with FOLL_PIN and !FOLL_WRITE.
+ *
* On output, the @ctx->page_mask is set according to the size of the page.
*
* Return: the mapped (struct page *), %NULL if no mapping exists, or
@@ -943,7 +952,8 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
* is, *@locked will be set to 0 and -EBUSY returned.
*/
static int faultin_page(struct vm_area_struct *vma,
- unsigned long address, unsigned int *flags, int *locked)
+ unsigned long address, unsigned int *flags, bool unshare,
+ int *locked)
{
unsigned int fault_flags = 0;
vm_fault_t ret;
@@ -968,6 +978,11 @@ static int faultin_page(struct vm_area_struct *vma,
*/
fault_flags |= FAULT_FLAG_TRIED;
}
+ if (unshare) {
+ fault_flags |= FAULT_FLAG_UNSHARE;
+ /* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */
+ VM_BUG_ON(fault_flags & FAULT_FLAG_WRITE);
+ }

ret = handle_mm_fault(vma, address, fault_flags, NULL);
if (ret & VM_FAULT_ERROR) {
@@ -1189,8 +1204,9 @@ static long __get_user_pages(struct mm_struct *mm,
cond_resched();

page = follow_page_mask(vma, start, foll_flags, &ctx);
- if (!page) {
- ret = faultin_page(vma, start, &foll_flags, locked);
+ if (!page || PTR_ERR(page) == -EMLINK) {
+ ret = faultin_page(vma, start, &foll_flags,
+ PTR_ERR(page) == -EMLINK, locked);
switch (ret) {
case 0:
goto retry;
@@ -2346,6 +2362,11 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
goto pte_unmap;
}

+ if (!pte_write(pte) && gup_must_unshare(flags, page)) {
+ put_compound_head(head, 1, flags);
+ goto pte_unmap;
+ }
+
VM_BUG_ON_PAGE(compound_head(page) != head, page);

/*
@@ -2529,6 +2550,11 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
return 0;
}

+ if (!pte_write(pte) && gup_must_unshare(flags, head)) {
+ put_compound_head(head, refs, flags);
+ return 0;
+ }
+
*nr += refs;
SetPageReferenced(head);
return 1;
@@ -2589,6 +2615,11 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
return 0;
}

+ if (!pmd_write(orig) && gup_must_unshare(flags, head)) {
+ put_compound_head(head, refs, flags);
+ return 0;
+ }
+
*nr += refs;
SetPageReferenced(head);
return 1;
@@ -2623,6 +2654,11 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
return 0;
}

+ if (!pud_write(orig) && gup_must_unshare(flags, head)) {
+ put_compound_head(head, refs, flags);
+ return 0;
+ }
+
*nr += refs;
SetPageReferenced(head);
return 1;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 617ed753996e..0f02c121c884 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1389,6 +1389,9 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);

+ if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
+ return ERR_PTR(-EMLINK);
+
if (!try_grab_page(page, flags))
return ERR_PTR(-ENOMEM);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7161f33aff9a..c5d63baa2957 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5956,6 +5956,25 @@ static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
}
}

+static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte,
+ bool *unshare)
+{
+ pte_t pteval = huge_ptep_get(pte);
+
+ *unshare = false;
+ if (is_swap_pte(pteval))
+ return true;
+ if (huge_pte_write(pteval))
+ return false;
+ if (flags & FOLL_WRITE)
+ return true;
+ if (gup_must_unshare(flags, pte_page(pteval))) {
+ *unshare = true;
+ return true;
+ }
+ return false;
+}
+
long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
unsigned long *position, unsigned long *nr_pages,
@@ -5970,6 +5989,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
while (vaddr < vma->vm_end && remainder) {
pte_t *pte;
spinlock_t *ptl = NULL;
+ bool unshare = false;
int absent;
struct page *page;

@@ -6020,9 +6040,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
* both cases, and because we can't follow correct pages
* directly from any kind of swap entries.
*/
- if (absent || is_swap_pte(huge_ptep_get(pte)) ||
- ((flags & FOLL_WRITE) &&
- !huge_pte_write(huge_ptep_get(pte)))) {
+ if (absent ||
+ __follow_hugetlb_must_fault(flags, pte, &unshare)) {
vm_fault_t ret;
unsigned int fault_flags = 0;

@@ -6030,6 +6049,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
spin_unlock(ptl);
if (flags & FOLL_WRITE)
fault_flags |= FAULT_FLAG_WRITE;
+ else if (unshare)
+ fault_flags |= FAULT_FLAG_UNSHARE;
if (locked)
fault_flags |= FAULT_FLAG_ALLOW_RETRY |
FAULT_FLAG_KILLABLE;
--
2.35.1

2022-03-08 21:51:36

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 12/15] mm/gup: disallow follow_page(FOLL_PIN)

We want to change the way we handle R/O pins on anonymous pages that
might be shared: if we detect a possibly shared anonymous page --
mapped R/O and !PageAnonExclusive() -- we want to trigger unsharing
via a page fault, resulting in an exclusive anonymous page that can be
pinned reliably without getting replaced via COW on the next write
fault.

However, the required page fault will be problematic for follow_page():
in contrast to ordinary GUP, follow_page() doesn't trigger faults
internally. So we would have to end up failing a R/O pin via
follow_page(), although there is something mapped R/O into the page
table, which might be rather surprising.

We don't seem to have follow_page(FOLL_PIN) users, and it's a purely
internal MM function. Let's just make our life easier and the semantics of
follow_page() clearer by just disallowing FOLL_PIN for follow_page()
completely.
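
The caller-visible effect is simply (illustrative sketch, not new API):

	/* unchanged: a temporary reference via FOLL_GET keeps working */
	page = follow_page(vma, addr, FOLL_GET);

	/* now unconditionally returns NULL: use pin_user_pages*() for FOLL_PIN */
	page = follow_page(vma, addr, FOLL_PIN);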

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/gup.c | 3 +++
mm/hugetlb.c | 8 +++++---
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index a9d4d724aef7..7127e789d210 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -875,6 +875,9 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
if (vma_is_secretmem(vma))
return NULL;

+ if (foll_flags & FOLL_PIN)
+ return NULL;
+
page = follow_page_mask(vma, address, foll_flags, &ctx);
if (ctx.pgmap)
put_dev_pagemap(ctx.pgmap);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 195a1a5d8b20..e40d6b68e25a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6694,9 +6694,11 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
spinlock_t *ptl;
pte_t pte;

- /* FOLL_GET and FOLL_PIN are mutually exclusive. */
- if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
- (FOLL_PIN | FOLL_GET)))
+ /*
+ * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
+ * follow_hugetlb_page().
+ */
+ if (WARN_ON_ONCE(flags & FOLL_PIN))
return NULL;

retry:
--
2.35.1

2022-03-08 22:14:16

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 05/15] mm/rmap: convert RMAP flags to a proper distinct rmap_t type

We want to pass the flags to more than one anon rmap function, getting
rid of special "do_page_add_anon_rmap()". So let's pass around a distinct
__bitwise type and refine documentation.
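
As a hedged illustration of what the __bitwise annotation buys us (example
only, not part of this patch): sparse can now flag accidental mixing of
plain integers with rmap_t values, while proper flag usage stays silent:

	/* fine: combining rmap_t flags */
	do_page_add_anon_rmap(page, vma, address, RMAP_EXCLUSIVE | RMAP_COMPOUND);

	/* fine: no special request */
	do_page_add_anon_rmap(page, vma, address, RMAP_NONE);

	/* sparse would warn here: plain integer passed where rmap_t is expected */
	do_page_add_anon_rmap(page, vma, address, 0x01);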

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/rmap.h | 22 ++++++++++++++++++----
mm/memory.c | 6 +++---
mm/rmap.c | 7 ++++---
3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 92c3585b8c6a..49f6b208938c 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -158,9 +158,23 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,

struct anon_vma *page_get_anon_vma(struct page *page);

-/* bitflags for do_page_add_anon_rmap() */
-#define RMAP_EXCLUSIVE 0x01
-#define RMAP_COMPOUND 0x02
+/* RMAP flags, currently only relevant for some anon rmap operations. */
+typedef int __bitwise rmap_t;
+
+/*
+ * No special request: if the page is a subpage of a compound page, it is
+ * mapped via a PTE. The mapped (sub)page is possibly shared between processes.
+ */
+#define RMAP_NONE ((__force rmap_t)0)
+
+/* The (sub)page is exclusive to a single process. */
+#define RMAP_EXCLUSIVE ((__force rmap_t)BIT(0))
+
+/*
+ * The compound page is not mapped via PTEs, but instead via a single PMD and
+ * should be accounted accordingly.
+ */
+#define RMAP_COMPOUND ((__force rmap_t)BIT(1))

/*
* rmap interfaces called when adding or removing pte of page
@@ -169,7 +183,7 @@ void page_move_anon_rmap(struct page *, struct vm_area_struct *);
void page_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, bool);
void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long, int);
+ unsigned long, rmap_t);
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, bool);
void page_add_file_rmap(struct page *, bool);
diff --git a/mm/memory.c b/mm/memory.c
index b9602d41d907..bbce3ca72974 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3515,10 +3515,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
struct page *page = NULL, *swapcache;
struct swap_info_struct *si = NULL;
+ rmap_t rmap_flags = RMAP_NONE;
swp_entry_t entry;
pte_t pte;
int locked;
- int exclusive = 0;
vm_fault_t ret = 0;
void *shadow = NULL;

@@ -3693,7 +3693,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
- exclusive = RMAP_EXCLUSIVE;
+ rmap_flags |= RMAP_EXCLUSIVE;
}
flush_icache_page(vma, page);
if (pte_swp_soft_dirty(vmf->orig_pte))
@@ -3709,7 +3709,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
page_add_new_anon_rmap(page, vma, vmf->address, false);
lru_cache_add_inactive_or_unevictable(page, vma);
} else {
- do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
+ do_page_add_anon_rmap(page, vma, vmf->address, rmap_flags);
}

set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index f825aeef61ca..3d7028d100ea 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1139,7 +1139,8 @@ static void __page_check_anon_rmap(struct page *page,
void page_add_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address, bool compound)
{
- do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0);
+ do_page_add_anon_rmap(page, vma, address,
+ compound ? RMAP_COMPOUND : RMAP_NONE);
}

/*
@@ -1148,7 +1149,7 @@ void page_add_anon_rmap(struct page *page,
* Everybody else should continue to use page_add_anon_rmap above.
*/
void do_page_add_anon_rmap(struct page *page,
- struct vm_area_struct *vma, unsigned long address, int flags)
+ struct vm_area_struct *vma, unsigned long address, rmap_t flags)
{
bool compound = flags & RMAP_COMPOUND;
bool first;
@@ -1189,7 +1190,7 @@ void do_page_add_anon_rmap(struct page *page,
/* address might be in next vma when migration races vma_adjust */
if (first)
__page_set_anon_rmap(page, vma, address,
- flags & RMAP_EXCLUSIVE);
+ !!(flags & RMAP_EXCLUSIVE));
else
__page_check_anon_rmap(page, vma, address);
}
--
2.35.1

2022-03-09 00:37:34

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v1 05/15] mm/rmap: convert RMAP flags to a proper distinct rmap_t type



> On Mar 8, 2022, at 6:14 AM, David Hildenbrand <[email protected]> wrote:
>
> We want to pass the flags to more than one anon rmap function, getting
> rid of special "do_page_add_anon_rmap()". So let's pass around a distinct
> __bitwise type and refine documentation.
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> include/linux/rmap.h | 22 ++++++++++++++++++----
> mm/memory.c | 6 +++---
> mm/rmap.c | 7 ++++---
> 3 files changed, 25 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 92c3585b8c6a..49f6b208938c 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -158,9 +158,23 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,
>
> struct anon_vma *page_get_anon_vma(struct page *page);
>
> -/* bitflags for do_page_add_anon_rmap() */
> -#define RMAP_EXCLUSIVE 0x01
> -#define RMAP_COMPOUND 0x02
> +/* RMAP flags, currently only relevant for some anon rmap operations. */
> +typedef int __bitwise rmap_t;
> +
> +/*
> + * No special request: if the page is a subpage of a compound page, it is
> + * mapped via a PTE. The mapped (sub)page is possibly shared between processes.
> + */
> +#define RMAP_NONE ((__force rmap_t)0)
> +
> +/* The (sub)page is exclusive to a single process. */
> +#define RMAP_EXCLUSIVE ((__force rmap_t)BIT(0))
> +
> +/*
> + * The compound page is not mapped via PTEs, but instead via a single PMD and
> + * should be accounted accordingly.
> + */
> +#define RMAP_COMPOUND ((__force rmap_t)BIT(1))

I was once shouted at for a similar suggestion, but I am going to try
once more… If you already define a new type, why not use bitfields?

It would be much easier to read. The last time I made such a suggestion,
Ingo said "I personally like bitfields in theory … [but] older versions of
GCC did a really poor job of optimizing them." At the time (2014), I looked
at GCC-4.4 and GCC-4.8 and there were some differences in the quality of
the generated code. Is it still the case?

2022-03-09 00:38:25

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 01/15] mm/rmap: fix missing swap_free() in try_to_unmap() after arch_unmap_one() failed

In case arch_unmap_one() fails, we already did a swap_duplicate(). Let's
undo that properly via swap_free().

Fixes: ca827d55ebaa ("mm, swap: Add infrastructure for saving page metadata on swap")
Reviewed-by: Khalid Aziz <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/rmap.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/mm/rmap.c b/mm/rmap.c
index 6a1e8c7f6213..f825aeef61ca 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1625,6 +1625,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
break;
}
if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+ swap_free(entry);
set_pte_at(mm, address, pvmw.pte, pteval);
ret = false;
page_vma_mapped_walk_done(&pvmw);
--
2.35.1

2022-03-09 00:40:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 00/15] mm: COW fixes part 2: reliable GUP pins of anonymous pages

On Tue, Mar 8, 2022 at 6:14 AM David Hildenbrand <[email protected]> wrote:
>
> This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
> on an anonymous page and COW logic fails to detect exclusivity of the page
> to then replacing the anonymous page by a copy in the page table [...]

From a cursory scan of the patches, this looks sane.

I'm not sure what the next step should be, but I really would like the
people who do a lot of pinning stuff to give it a good shake-down.
Including both looking at the patches, but very much actually running
it on whatever test-cases etc you people have.

Please?

Linus

2022-03-09 01:08:45

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 15/15] mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning

Let's verify when (un)pinning anonymous pages that we always deal with
exclusive anonymous pages, which guarantees that we'll have a reliable
PIN, meaning that we cannot end up with the GUP pin being inconsistent
with the pages mapped into the page tables due to a COW triggered
by a write fault.

When pinning pages, after conditionally triggering GUP unsharing of
possibly shared anonymous pages, we should always only see exclusive
anonymous pages. Note that anonymous pages that are mapped writable
must be marked exclusive, otherwise we'd have a BUG.

When pinning during ordinary GUP, simply add a check after our
conditional GUP-triggered unsharing checks. As we know exactly how the
page is mapped, we know exactly in which page we have to check for
PageAnonExclusive().

When pinning via GUP-fast we have to be careful, because we can race with
fork(): only verify that we didn't end up pinning a possibly shared
anonymous page after the seqcount check has confirmed that we didn't
race with a concurrent fork().

Similarly, when unpinning, verify that the pages are still marked as
exclusive: otherwise something turned the pages possibly shared, which
can result in random memory corruptions, which we really want to catch.

With only the pinned pages at hand and not the actual page table entries
we have to be a bit careful: hugetlb pages are always mapped via a
single logical page table entry referencing the head page and
PG_anon_exclusive of the head page applies. Anon THP are a bit more
complicated, because we might have obtained the page reference either via
a PMD or a PTE -- depending on how it is mapped, either PageAnonExclusive
of the head page (PMD-mapped THP) or of the tail page (PTE-mapped THP)
applies. As we don't know how the page was mapped, and to make our life
easier, check that either is set.

Take care to not verify in case we're unpinning during GUP-fast because
we detected concurrent fork(): we might stumble over an anonymous page
that is now shared.

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/gup.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++-
mm/huge_memory.c | 3 +++
mm/hugetlb.c | 3 +++
3 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index 9e4864130202..b36f02f2b720 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -45,6 +45,38 @@ static void hpage_pincount_sub(struct page *page, int refs)
atomic_sub(refs, compound_pincount_ptr(page));
}

+static inline void sanity_check_pinned_pages(struct page **pages,
+ unsigned long npages)
+{
+#ifdef CONFIG_DEBUG_VM
+ /*
+ * We only pin anonymous pages if they are exclusive. Once pinned, we
+ * can no longer turn them possibly shared and PageAnonExclusive() will
+ * stick around until the page is freed.
+ *
+ * We'd like to verify that our pinned anonymous pages are still mapped
+ * exclusively. The issue with anon THP is that we don't know how
+ * they are/were mapped when pinning them. However, for anon
+ * THP we can assume that either the given page (PTE-mapped THP) or
+ * the head page (PMD-mapped THP) should be PageAnonExclusive(). If
+ * neither is the case, there is certainly something wrong.
+ */
+ for (; npages; npages--, pages++) {
+ struct page *page = *pages;
+ struct page *head = compound_head(page);
+
+ if (!PageAnon(head))
+ continue;
+ if (!PageCompound(head) || PageHuge(head))
+ VM_BUG_ON_PAGE(!PageAnonExclusive(head), page);
+ else
+ /* Either a PTE-mapped or a PMD-mapped THP. */
+ VM_BUG_ON_PAGE(!PageAnonExclusive(head) &&
+ !PageAnonExclusive(page), page);
+ }
+#endif /* CONFIG_DEBUG_VM */
+}
+
/* Equivalent to calling put_page() @refs times. */
static void put_page_refs(struct page *page, int refs)
{
@@ -250,6 +282,7 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
*/
void unpin_user_page(struct page *page)
{
+ sanity_check_pinned_pages(&page, 1);
put_compound_head(compound_head(page), 1, FOLL_PIN);
}
EXPORT_SYMBOL(unpin_user_page);
@@ -340,6 +373,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
return;
}

+ sanity_check_pinned_pages(pages, npages);
for_each_compound_head(index, pages, npages, head, ntails) {
/*
* Checking PageDirty at this point may race with
@@ -404,6 +438,21 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
}
EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);

+static void unpin_user_pages_lockless(struct page **pages, unsigned long npages)
+{
+ unsigned long index;
+ struct page *head;
+ unsigned int ntails;
+
+ /*
+ * Don't perform any sanity checks because we might have raced with
+ * fork() and some anonymous pages might now actually be shared --
+ * which is why we're unpinning after all.
+ */
+ for_each_compound_head(index, pages, npages, head, ntails)
+ put_compound_head(head, ntails, FOLL_PIN);
+}
+
/**
* unpin_user_pages() - release an array of gup-pinned pages.
* @pages: array of pages to be marked dirty and released.
@@ -426,6 +475,7 @@ void unpin_user_pages(struct page **pages, unsigned long npages)
*/
if (WARN_ON(IS_ERR_VALUE(npages)))
return;
+ sanity_check_pinned_pages(pages, npages);

for_each_compound_head(index, pages, npages, head, ntails)
put_compound_head(head, ntails, FOLL_PIN);
@@ -572,6 +622,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
page = ERR_PTR(-EMLINK);
goto out;
}
+
+ VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
+ !PageAnonExclusive(page));
+
/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
if (unlikely(!try_grab_page(page, flags))) {
page = ERR_PTR(-ENOMEM);
@@ -2895,8 +2949,10 @@ static unsigned long lockless_pages_from_mm(unsigned long start,
*/
if (gup_flags & FOLL_PIN) {
if (read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
- unpin_user_pages(pages, nr_pinned);
+ unpin_user_pages_lockless(pages, nr_pinned);
return 0;
+ } else {
+ sanity_check_pinned_pages(pages, nr_pinned);
}
}
return nr_pinned;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0f02c121c884..d9559341a5f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1392,6 +1392,9 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
return ERR_PTR(-EMLINK);

+ VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
+ !PageAnonExclusive(page));
+
if (!try_grab_page(page, flags))
return ERR_PTR(-ENOMEM);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c5d63baa2957..0d150d100111 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6092,6 +6092,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
page = pte_page(huge_ptep_get(pte));

+ VM_BUG_ON((flags & FOLL_PIN) && PageAnon(page) &&
+ !PageAnonExclusive(page));
+
/*
* If subpage information not requested, update counters
* and skip the same_page loop below.
--
2.35.1

2022-03-09 01:31:25

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 08/15] mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()

New anonymous pages are always mapped natively: only THP/khugepaged code
maps a new compound anonymous page and passes "true". Otherwise, we're
just dealing with simple, non-compound pages.

Let's give the interface clearer semantics and document these.
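
For callers, the change boils down to (illustrative):

	/* before: callers had to spell out whether the page is compound */
	page_add_new_anon_rmap(page, vma, address, false);

	/* after: derived internally via PageCompound(page) */
	page_add_new_anon_rmap(page, vma, address);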

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/rmap.h | 2 +-
kernel/events/uprobes.c | 2 +-
mm/huge_memory.c | 2 +-
mm/khugepaged.c | 2 +-
mm/memory.c | 10 +++++-----
mm/migrate.c | 2 +-
mm/rmap.c | 9 ++++++---
mm/swapfile.c | 2 +-
mm/userfaultfd.c | 2 +-
9 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 94ee38829c63..51953bace0a3 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -183,7 +183,7 @@ void page_move_anon_rmap(struct page *, struct vm_area_struct *);
void page_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, rmap_t);
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long, bool);
+ unsigned long);
void page_add_file_rmap(struct page *, bool);
void page_remove_rmap(struct page *, bool);

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6357c3580d07..b6fdb23fb3ea 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -184,7 +184,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,

if (new_page) {
get_page(new_page);
- page_add_new_anon_rmap(new_page, vma, addr, false);
+ page_add_new_anon_rmap(new_page, vma, addr);
lru_cache_add_inactive_or_unevictable(new_page, vma);
} else
/* no new page, just dec_mm_counter for old_page */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2ca137e01e84..c1f7eaba23ff 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -647,7 +647,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,

entry = mk_huge_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
- page_add_new_anon_rmap(page, vma, haddr, true);
+ page_add_new_anon_rmap(page, vma, haddr);
lru_cache_add_inactive_or_unevictable(page, vma);
pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a325a646be33..96cc903c4788 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1183,7 +1183,7 @@ static void collapse_huge_page(struct mm_struct *mm,

spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
- page_add_new_anon_rmap(new_page, vma, address, true);
+ page_add_new_anon_rmap(new_page, vma, address);
lru_cache_add_inactive_or_unevictable(new_page, vma);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, address, pmd, _pmd);
diff --git a/mm/memory.c b/mm/memory.c
index e2d8e55c55c0..00c45b3a9576 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -896,7 +896,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
*prealloc = NULL;
copy_user_highpage(new_page, page, addr, src_vma);
__SetPageUptodate(new_page);
- page_add_new_anon_rmap(new_page, dst_vma, addr, false);
+ page_add_new_anon_rmap(new_page, dst_vma, addr);
lru_cache_add_inactive_or_unevictable(new_page, dst_vma);
rss[mm_counter(new_page)]++;

@@ -3052,7 +3052,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
* some TLBs while the old PTE remains in others.
*/
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
- page_add_new_anon_rmap(new_page, vma, vmf->address, false);
+ page_add_new_anon_rmap(new_page, vma, vmf->address);
lru_cache_add_inactive_or_unevictable(new_page, vma);
/*
* We call the notify macro here because, when using secondary
@@ -3706,7 +3706,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)

/* ksm created a completely new copy */
if (unlikely(page != swapcache && swapcache)) {
- page_add_new_anon_rmap(page, vma, vmf->address, false);
+ page_add_new_anon_rmap(page, vma, vmf->address);
lru_cache_add_inactive_or_unevictable(page, vma);
} else {
page_add_anon_rmap(page, vma, vmf->address, rmap_flags);
@@ -3856,7 +3856,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
}

inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, vmf->address, false);
+ page_add_new_anon_rmap(page, vma, vmf->address);
lru_cache_add_inactive_or_unevictable(page, vma);
setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
@@ -4033,7 +4033,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
/* copy-on-write page */
if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, addr, false);
+ page_add_new_anon_rmap(page, vma, addr);
lru_cache_add_inactive_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
diff --git a/mm/migrate.c b/mm/migrate.c
index e6b3cb3d148b..fd9eba33b34a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2725,7 +2725,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
goto unlock_abort;

inc_mm_counter(mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, addr, false);
+ page_add_new_anon_rmap(page, vma, addr);
if (!is_zone_device_page(page))
lru_cache_add_inactive_or_unevictable(page, vma);
get_page(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index 7162689203fc..ebe7140c4493 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1184,19 +1184,22 @@ void page_add_anon_rmap(struct page *page,
}

/**
- * page_add_new_anon_rmap - add pte mapping to a new anonymous page
+ * page_add_new_anon_rmap - add mapping to a new anonymous page
* @page: the page to add the mapping to
* @vma: the vm area in which the mapping is added
* @address: the user virtual address mapped
- * @compound: charge the page as compound or small page
+ *
+ * If it's a compound page, it is accounted as a compound page. As the page
+ * is new, it is assumed to be mapped exclusively by a single process.
*
* Same as page_add_anon_rmap but must only be called on *new* pages.
* This means the inc-and-test can be bypassed.
* Page does not have to be locked.
*/
void page_add_new_anon_rmap(struct page *page,
- struct vm_area_struct *vma, unsigned long address, bool compound)
+ struct vm_area_struct *vma, unsigned long address)
{
+ const bool compound = PageCompound(page);
int nr = compound ? thp_nr_pages(page) : 1;

VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 41ba8238d16b..7edc8e099b22 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1802,7 +1802,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
if (page == swapcache) {
page_add_anon_rmap(page, vma, addr, RMAP_NONE);
} else { /* ksm created a completely new copy */
- page_add_new_anon_rmap(page, vma, addr, false);
+ page_add_new_anon_rmap(page, vma, addr);
lru_cache_add_inactive_or_unevictable(page, vma);
}
set_pte_at(vma->vm_mm, addr, pte,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0780c2a57ff1..4ca854ce14f0 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -98,7 +98,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
if (page_in_cache)
page_add_file_rmap(page, false);
else
- page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
+ page_add_new_anon_rmap(page, dst_vma, dst_addr);

/*
* Must happen after rmap, as mm_counter() checks mapping (via
--
2.35.1

2022-03-09 08:04:14

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 00/15] mm: COW fixes part 2: reliable GUP pins of anonymous pages

On 08.03.22 22:22, Linus Torvalds wrote:
> On Tue, Mar 8, 2022 at 6:14 AM David Hildenbrand <[email protected]> wrote:
>>
>> This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
>> on an anonymous page and COW logic fails to detect exclusivity of the page
>> to then replacing the anonymous page by a copy in the page table [...]
>
> From a cursory scan of the patches, this looks sane.

Thanks for skimming over the patches that quickly!

>
> I'm not sure what the next step should be, but I really would like the
> people who do a lot of pinning stuff to give it a good shake-down.
> Including both looking at the patches, but very much actually running
> it on whatever test-cases etc you people have.
>
> Please?

My proposal would be to pull it into -next early after we have
v5.18-rc1. I expect some minor clashes with folio changes that should go
in in the next merge window, so I'll have to rebase+resend either way,
and I'm planning on thoroughly testing at least on s390x as well.

We'd then have plenty of time to further review+test while in -next
until the v5.19 merge window opens up.

By that time I should also have my selftests cleaned up and ready, and
part 3 ready to improve the situation for FOLL_GET|FOLL_WRITE until we
have the full FOLL_GET->FOLL_PIN conversion from John (I'll most
probably send out an early RFC of part 3 soonish). So we *might* be able
to have everything fixed in v5.19.

Last but not least, tools/cgroup/memcg_slabinfo.py as mentioned in patch
#10 still needs care due to the PG_slab reuse, but I consider that a
secondary concern (yet, it should be fixed and help from the Authors
would be appreciated ;) ).

--
Thanks,

David / dhildenb

2022-03-09 16:22:38

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 10/15] mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages

On Tue, Mar 08, 2022 at 03:14:32PM +0100, David Hildenbrand wrote:
> The basic question we would like to have a reliable and efficient answer
> to is: is this anonymous page exclusive to a single process or might it
> be shared?

Is this supposed to be for PAGE_SIZE pages as well, or is it only used
on pages > PAGE_SIZE?

> In an ideal world, we'd have a spare pageflag. Unfortunately, pageflags
> don't grow on trees, so we have to get a little creative for the time
> being.

This feels a little _too_ creative to me. There's now an implicit
requirement that SL[AOU]B doesn't use the bottom two bits of
->slab_cache, which is probably OK but would need to be documented.

I have plans to get rid of PageError and PagePrivate, but those are going
to be too late for you. I don't think mappedtodisk has meaning for anon
pages, even if they're in the swapcache. It would need PG_has_hwpoisoned
to shift to another bit ... but almost any bit will do for has_hwpoisoned.
Or have I overlooked something?

> @@ -920,6 +976,70 @@ extern bool is_free_buddy_page(struct page *page);
>
> __PAGEFLAG(Isolated, isolated, PF_ANY);
>
> +static __always_inline bool folio_test_slab(struct folio *folio)
> +{
> + return !folio_test_anon(folio) &&
> + test_bit(PG_slab, folio_flags(folio, FOLIO_PF_NO_TAIL));
> +}
> +
> +static __always_inline int PageSlab(struct page *page)
> +{
> + return !PageAnon(page) &&
> + test_bit(PG_slab, &PF_NO_TAIL(page, 0)->flags);
> +}

In case we do end up using this, this would be better implemented as

static __always_inline int PageSlab(struct page *page)
{
return folio_test_slab(page_folio(page));
}

since PageAnon already has a page_folio() call embedded in it.

> +static __always_inline void __SetPageSlab(struct page *page)
> +{
> + VM_BUG_ON_PGFLAGS(PageAnon(page), page);
> + __set_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
> +}

There's only one caller of __SetPageSlab() left, in kfence. And that
code looks ... weird.

for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
if (!i || (i % 2))
continue;

/* Verify we do not have a compound head page. */
if (WARN_ON(compound_head(&pages[i]) != &pages[i]))
goto err;

__SetPageSlab(&pages[i]);

I think the author probably intended WARN_ON(PageCompound(page)) because
they're actually verifying that it's not a tail page, rather than head
page.

> +static __always_inline void __ClearPageSlab(struct page *page)
> +{
> + VM_BUG_ON_PGFLAGS(PageAnon(page), page);
> + __clear_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
> +}

There are no remaining callers of __ClearPageSlab(). yay.

2022-03-09 19:16:03

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 10/15] mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages

Hi,

On 09.03.22 16:47, Matthew Wilcox wrote:
> On Tue, Mar 08, 2022 at 03:14:32PM +0100, David Hildenbrand wrote:
>> The basic question we would like to have a reliable and efficient answer
>> to is: is this anonymous page exclusive to a single process or might it
>> be shared?
>
> Is this supposed to be for PAGE_SIZE pages as well, or is it only used
> on pages > PAGE_SIZE?

As documented, simple/ordinary PAGE_SIZE pages as well (unfortunately,
otherwise we'd have more space :) ).

>
>> In an ideal world, we'd have a spare pageflag. Unfortunately, pageflags
>> don't grow on trees, so we have to get a little creative for the time
>> being.
>
> This feels a little _too_ creative to me. There's now an implicit

It's making the semantics of PG_slab depend on another bit in the head
page. I agree, it's not perfect, but it's not *too* crazy. As raised in
the cover letter, not proud of this, but I didn't really find an
alternative for the time being.

> requirement that SL[AOU]B doesn't use the bottom two bits of

I think only the last bit (0x1)

> ->slab_cache, which is probably OK but would need to be documented.

We'd already have false positive PageAnon() if that wasn't the case. At
least stable_page_flags() would already indicate something wrong, I
think (KPF_ANON). We'd need !PageSlab() checks at a couple of places
already if I'm not wrong.

I had a comment in there, but after the PageSlab cleanups came in, I
dropped it because my assumption was that actually nobody is really
allowed to use the lowest mapping bit for something else. We can
document that, of course.

So at least in that regard, I think this is fine.
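
To spell out that constraint with a simplified model (stand-in layout and
names, not the real struct page definition): for slab pages, ->slab_cache
shares storage with ->mapping, and PageAnon() only tests PAGE_MAPPING_ANON
(bit 0) -- so as long as the kmem_cache pointer stays at least 2-byte
aligned, a slab page can never look PageAnon():

#define PAGE_MAPPING_ANON 0x1UL

/* simplified stand-in for the union members that overlap in struct page */
union mapping_or_slab {
	void *mapping;		/* anon pages: anon_vma | PAGE_MAPPING_ANON */
	void *slab_cache;	/* slab pages: struct kmem_cache pointer    */
};

static _Bool looks_anon(const union mapping_or_slab *u)
{
	return ((unsigned long)u->mapping & PAGE_MAPPING_ANON) != 0;
}

/* Pointers handed out for kmem_cache structures are word-aligned, so bit 0
 * of ->slab_cache is always clear and looks_anon() stays false for slab. */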

>
> I have plans to get rid of PageError and PagePrivate, but those are going
> to be too late for you. I don't think mappedtodisk has meaning for anon
> pages, even if they're in the swapcache. It would need PG_has_hwpoisoned

Are you sure it's not used if the page is in the swapcache? I have no
detailed knowledge of how file-backed swap targets are handled in that
regard. So fs experience would be highly appreciated :)

> to shift to another bit ... but almost any bit will do for has_hwpoisoned.
> Or have I overlooked something?

Good question, I assume we could use another bit that's not already
defined/check on subpages of a compound page.


Overloading PG_reserved would be an alternative, however, that flag can
also indicate that the remainder of the memmap might be mostly garbage,
so it's not that good of a fit IMHO.

>
>> @@ -920,6 +976,70 @@ extern bool is_free_buddy_page(struct page *page);
>>
>> __PAGEFLAG(Isolated, isolated, PF_ANY);
>>
>> +static __always_inline bool folio_test_slab(struct folio *folio)
>> +{
>> + return !folio_test_anon(folio) &&
>> + test_bit(PG_slab, folio_flags(folio, FOLIO_PF_NO_TAIL));
>> +}
>> +
>> +static __always_inline int PageSlab(struct page *page)
>> +{
>> + return !PageAnon(page) &&
>> + test_bit(PG_slab, &PF_NO_TAIL(page, 0)->flags);
>> +}
>
> In case we do end up using this, this would be better implemented as
>
> static __always_inline int PageSlab(struct page *page)
> {
> return folio_test_slab(page_folio(page));
> }
>
> since PageAnon already has a page_folio() call embedded in it.

Agreed, I mainly copied the stubs and extended them.

>
>> +static __always_inline void __SetPageSlab(struct page *page)
>> +{
>> + VM_BUG_ON_PGFLAGS(PageAnon(page), page);
>> + __set_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
>> +}
>
> There's only one caller of __SetPageSlab() left, in kfence. And that
> code looks ... weird.
>
> for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
> if (!i || (i % 2))
> continue;
>
> /* Verify we do not have a compound head page. */
> if (WARN_ON(compound_head(&pages[i]) != &pages[i]))
> goto err;
>
> __SetPageSlab(&pages[i]);
>
> I think the author probably intended WARN_ON(PageCompound(page)) because
> they're actually verifying that it's not a tail page, rather than head
> page.

It's certainly a head-scratcher.

>
>> +static __always_inline void __ClearPageSlab(struct page *page)
>> +{
>> + VM_BUG_ON_PGFLAGS(PageAnon(page), page);
>> + __clear_bit(PG_slab, &PF_NO_TAIL(page, 1)->flags);
>> +}
>
> There are no remaining callers of __ClearPageSlab(). yay.
>

Indeed, nice.

Thanks!

--
Thanks,

David / dhildenb

2022-03-09 20:05:51

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 10/15] mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages

>> It's making the semantics of PG_slab depend on another bit in the head
>> page. I agree, it's not perfect, but it's not *too* crazy. As raised in
>> the cover letter, not proud of this, but I didn't really find an
>> alternative for the time being.
>>
>>> requirement that SL[AOU]B doesn't use the bottom two bits of
>>
>> I think only the last bit (0x1)
>
> Yeah, OK, they can use three of the four possible combinations of the
> bottom two bits ;-) Still, it's yet more constraints on use of struct
> page, which aren't obvious (and are even upside down for the casual
> observer).

I don't disagree that such constraints are nasty.

Having that said, I'd really like to avoid overloading PG_slab
(especially, such that I don't have to mess with
scripts/crash/makedumpfile). So if we can reuse MappedToDisk, that would
be very nice.

>
>>> I have plans to get rid of PageError and PagePrivate, but those are going
>>> to be too late for you. I don't think mappedtodisk has meaning for anon
>>> pages, even if they're in the swapcache. It would need PG_has_hwpoisoned
>>
>> Are you sure it's not used if the page is in the swapcache? I have no
>> detailed knowledge how file-back swap targets are handled in that
>> regard. So fs experience would be highly appreciated :)
>
> I have two arguments here. One is based on grep:
>
> $ git grep mappedtodisk
> fs/proc/page.c: u |= kpf_copy_bit(k, KPF_MAPPEDTODISK, PG_mappedtodisk);
> include/linux/page-flags.h: PG_mappedtodisk, /* Has blocks allocated on-disk */
> include/linux/page-flags.h: PG_has_hwpoisoned = PG_mappedtodisk,
> include/linux/page-flags.h:PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
> include/trace/events/mmflags.h: {1UL << PG_mappedtodisk, "mappedtodisk" }, \
> include/trace/events/pagemap.h: (folio_test_mappedtodisk(folio) ? PAGEMAP_MAPPEDDISK : 0) | \
> mm/migrate.c: if (folio_test_mappedtodisk(folio))
> mm/migrate.c: folio_set_mappedtodisk(newfolio);
> mm/truncate.c: folio_clear_mappedtodisk(folio);
> tools/vm/page-types.c: [KPF_MAPPEDTODISK] = "d:mappedtodisk",
>
> $ git grep MappedToDisk
> fs/buffer.c: SetPageMappedToDisk(page);
> fs/buffer.c: if (PageMappedToDisk(page))
> fs/buffer.c: SetPageMappedToDisk(page);
> fs/ext4/readpage.c: SetPageMappedToDisk(page);
> fs/f2fs/data.c: SetPageMappedToDisk(page);
> fs/f2fs/file.c: if (PageMappedToDisk(page))
> fs/fuse/dev.c: ClearPageMappedToDisk(newpage);
> fs/mpage.c: SetPageMappedToDisk(page);
> fs/nilfs2/file.c: if (PageMappedToDisk(page))
> fs/nilfs2/file.c: SetPageMappedToDisk(page);
> fs/nilfs2/page.c: ClearPageMappedToDisk(page);
> fs/nilfs2/page.c: SetPageMappedToDisk(dpage);
> fs/nilfs2/page.c: ClearPageMappedToDisk(dpage);
> fs/nilfs2/page.c: if (PageMappedToDisk(src) && !PageMappedToDisk(dst))
> fs/nilfs2/page.c: SetPageMappedToDisk(dst);
> fs/nilfs2/page.c: else if (!PageMappedToDisk(src) && PageMappedToDisk(dst))
> fs/nilfs2/page.c: ClearPageMappedToDisk(dst);
> fs/nilfs2/page.c: ClearPageMappedToDisk(page);
> include/linux/page-flags.h:PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
>
> so you can see it's _rarely_ used, and only by specific filesystems.

Right, but I spot ext4 and fs/buffer.c core functionality. That
naturally makes me nervous :)

>
> The swap code actually bypasses the filesystem (except for network
> filesystems) and submits BIOs directly (see __swap_writepage() and
> swap_readpage()). There's no check for MappedToDisk, and nowhere to
> set it/clear it.
>
> The second argument is that MappedToDisk is used for delayed allocation
> on writes for filesystems. But swap is required to be allocated at
> swapfile setup (so that the swap code can bypass the filesystem ...)
> and so this flag is useless.

I have some faint memory that there are corner cases, but maybe
(hopefully) my memory is wrong.

>
> Of course, I may have missed something. This is complex.
>

Yeah, that's why I was very careful. If any FS would end up setting that
flag we'd be in trouble and would have to blacklist that fs for swapping
or rework it (well, if we're even able to identify such a file system).

I think one sanity check I could add is making sure that
PageAnonExclusive() is never set when getting a page from the swapcache
in do_swap_page(). I think that would take care of the obvious bugs.
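
Something along these lines (a rough sketch only; exact form and placement
to be determined):

	/* in do_swap_page(), after grabbing the page from the swapcache */
	VM_BUG_ON_PAGE(PageAnon(page) && PageAnonExclusive(page), page);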

--
Thanks,

David / dhildenb

2022-03-09 20:15:50

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 10/15] mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages

On Wed, Mar 09, 2022 at 05:57:51PM +0100, David Hildenbrand wrote:
> Hi,
>
> On 09.03.22 16:47, Matthew Wilcox wrote:
> > On Tue, Mar 08, 2022 at 03:14:32PM +0100, David Hildenbrand wrote:
> >> The basic question we would like to have a reliable and efficient answer
> >> to is: is this anonymous page exclusive to a single process or might it
> >> be shared?
> >
> > Is this supposed to be for PAGE_SIZE pages as well, or is it only used
> > on pages > PAGE_SIZE?
>
> As documented, simple/ordinary PAGE_SIZE pages as well (unfortunately,
> otherwise we'd have more space :) ).

OK. Documentation doesn't always capture exactly what the author was
thinking, so I appreciate the clarification ;-)

> > This feels a little _too_ creative to me. There's now an implicit
>
> It's making the semantics of PG_slab depend on another bit in the head
> page. I agree, it's not perfect, but it's not *too* crazy. As raised in
> the cover letter, not proud of this, but I didn't really find an
> alternative for the time being.
>
> > requirement that SL[AOU]B doesn't use the bottom two bits of
>
> I think only the last bit (0x1)

Yeah, OK, they can use three of the four possible combinations of the
bottom two bits ;-) Still, it's yet more constraints on use of struct
page, which aren't obvious (and are even upside down for the casual
observer).

> > I have plans to get rid of PageError and PagePrivate, but those are going
> > to be too late for you. I don't think mappedtodisk has meaning for anon
> > pages, even if they're in the swapcache. It would need PG_has_hwpoisoned
>
> Are you sure it's not used if the page is in the swapcache? I have no
> detailed knowledge of how file-backed swap targets are handled in that
> regard. So fs experience would be highly appreciated :)

I have two arguments here. One is based on grep:

$ git grep mappedtodisk
fs/proc/page.c: u |= kpf_copy_bit(k, KPF_MAPPEDTODISK, PG_mappedtodisk);
include/linux/page-flags.h: PG_mappedtodisk, /* Has blocks allocated on-disk */
include/linux/page-flags.h: PG_has_hwpoisoned = PG_mappedtodisk,
include/linux/page-flags.h:PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
include/trace/events/mmflags.h: {1UL << PG_mappedtodisk, "mappedtodisk" }, \
include/trace/events/pagemap.h: (folio_test_mappedtodisk(folio) ? PAGEMAP_MAPPEDDISK : 0) | \
mm/migrate.c: if (folio_test_mappedtodisk(folio))
mm/migrate.c: folio_set_mappedtodisk(newfolio);
mm/truncate.c: folio_clear_mappedtodisk(folio);
tools/vm/page-types.c: [KPF_MAPPEDTODISK] = "d:mappedtodisk",

$ git grep MappedToDisk
fs/buffer.c: SetPageMappedToDisk(page);
fs/buffer.c: if (PageMappedToDisk(page))
fs/buffer.c: SetPageMappedToDisk(page);
fs/ext4/readpage.c: SetPageMappedToDisk(page);
fs/f2fs/data.c: SetPageMappedToDisk(page);
fs/f2fs/file.c: if (PageMappedToDisk(page))
fs/fuse/dev.c: ClearPageMappedToDisk(newpage);
fs/mpage.c: SetPageMappedToDisk(page);
fs/nilfs2/file.c: if (PageMappedToDisk(page))
fs/nilfs2/file.c: SetPageMappedToDisk(page);
fs/nilfs2/page.c: ClearPageMappedToDisk(page);
fs/nilfs2/page.c: SetPageMappedToDisk(dpage);
fs/nilfs2/page.c: ClearPageMappedToDisk(dpage);
fs/nilfs2/page.c: if (PageMappedToDisk(src) && !PageMappedToDisk(dst))
fs/nilfs2/page.c: SetPageMappedToDisk(dst);
fs/nilfs2/page.c: else if (!PageMappedToDisk(src) && PageMappedToDisk(dst))
fs/nilfs2/page.c: ClearPageMappedToDisk(dst);
fs/nilfs2/page.c: ClearPageMappedToDisk(page);
include/linux/page-flags.h:PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)

so you can see it's _rarely_ used, and only by specific filesystems.

The swap code actually bypasses the filesystem (except for network
filesystems) and submits BIOs directly (see __swap_writepage() and
swap_readpage()). There's no check for MappedToDisk, and nowhere to
set it/clear it.

The second argument is that MappedToDisk is used for delayed allocation
on writes for filesystems. But swap is required to be allocated at
swapfile setup (so that the swap code can bypass the filesystem ...)
and so this flag is useless.

Of course, I may have missed something. This is complex.

> > to shift to another bit ... but almost any bit will do for has_hwpoisoned.
> > Or have I overlooked something?
>
> Good question, I assume we could use another bit that's not already
> defined/check on subpages of a compound page.
>
>
> Overloading PG_reserved would be an alternative, however, that flag can
> also indicate that the remainder of the memmap might be mostly garbage,
> so it's not that good of a fit IMHO.

I wouldn't reuse PG_reserved for anything. Although I do have a modest
proposal that I should finish writing up ... let me start on that.

2022-03-10 14:16:01

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 00/15] mm: COW fixes part 2: reliable GUP pins of anonymous pages

On 10.03.22 12:13, Oded Gabbay wrote:
> On Wed, Mar 9, 2022 at 10:00 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 08.03.22 22:22, Linus Torvalds wrote:
>>> On Tue, Mar 8, 2022 at 6:14 AM David Hildenbrand <[email protected]> wrote:
>>>>
>>>> This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
>>>> on an anonymous page and COW logic fails to detect exclusivity of the page
>>>> to then replacing the anonymous page by a copy in the page table [...]
>>>
>>> From a cursory scan of the patches, this looks sane.
>>
>> Thanks for skimming over the patches that quickly!
>>
>>>
>>> I'm not sure what the next step should be, but I really would like the
>>> people who do a lot of pinning stuff to give it a good shake-down.
>>> Including both looking at the patches, but very much actually running
>>> it on whatever test-cases etc you people have.
>>>
>>> Please?
>
> I can take this patch-set and test it in our data-center with all the
> DL workloads we are running
> on Gaudi.

Perfect!

>
> David,
> Any chance you can prepare me a branch with your patch-set based on 5.17-rc7 ?
> I prefer to take a stable kernel and not 5.18-rc1 as this is going to
> run on hundreds of machines.

https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_2_v1

contains the state of this series, based on v5.17-rc7 and part 1.

I'll be leaving that branch unmodified when working on newer versions of
the patch set.

--
Thanks,

David / dhildenb

2022-03-10 18:35:11

by Oded Gabbay

[permalink] [raw]
Subject: Re: [PATCH v1 00/15] mm: COW fixes part 2: reliable GUP pins of anonymous pages

On Wed, Mar 9, 2022 at 10:00 AM David Hildenbrand <[email protected]> wrote:
>
> On 08.03.22 22:22, Linus Torvalds wrote:
> > On Tue, Mar 8, 2022 at 6:14 AM David Hildenbrand <[email protected]> wrote:
> >>
> >> This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
> >> on an anonymous page and COW logic fails to detect exclusivity of the page
> >> to then replacing the anonymous page by a copy in the page table [...]
> >
> > From a cursory scan of the patches, this looks sane.
>
> Thanks for skimming over the patches that quickly!
>
> >
> > I'm not sure what the next step should be, but I really would like the
> > people who do a lot of pinning stuff to give it a good shake-down.
> > Including both looking at the patches, but very much actually running
> > it on whatever test-cases etc you people have.
> >
> > Please?

I can take this patch-set and test it in our data-center with all the
DL workloads we are running
on Gaudi.

David,
Any chance you can prepare me a branch with your patch-set based on 5.17-rc7 ?
I prefer to take a stable kernel and not 5.18-rc1 as this is going to
run on hundreds of machines.

Thanks,
Oded

>
> My proposal would be to pull it into -next early after we have
> v5.18-rc1. I expect some minor clashes with folio changes that should go
> in in the next merge window, so I'll have to rebase+resend either way,
> and I'm planning on thoroughly testing at least on s390x as well.
>
> We'd then have plenty of time to further review+test while in -next
> until the v5.19 merge window opens up.
>
> By that time I should also have my selftests cleaned up and ready, and
> part 3 ready to improve the situation for FOLL_GET|FOLL_WRITE until we
> have the full FOLL_GET->FOLL_PIN conversion from John (I'll most
> probably sent out an early RFC of part 3 soonish). So we *might* be able
> to have everything fixed in v5.19.
>
> Last but not least, tools/cgroup/memcg_slabinfo.py as mentioned in patch
> #10 still needs care due to the PG_slab reuse, but I consider that a
> secondary concern (yet, it should be fixed and help from the Authors
> would be appreciated ;) ).
>
> --
> Thanks,
>
> David / dhildenb
>

2022-03-11 21:44:58

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 10/15] mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages

On 08.03.22 15:14, David Hildenbrand wrote:
> The basic question we would like to have a reliable and efficient answer
> to is: is this anonymous page exclusive to a single process or might it
> be shared?
>
> In an ideal world, we'd have a spare pageflag. Unfortunately, pageflags
> don't grow on trees, so we have to get a little creative for the time
> being.
>
> Introduce a way to mark an anonymous page as exclusive, with the
> ultimate goal of teaching our COW logic to not do "wrong COWs", whereby
> GUP pins lose consistency with the pages mapped into the page table,
> resulting in reported memory corruptions.
>
> Most pageflags already have semantics for anonymous pages, so we're left
> with reusing PG_slab for our purpose: for PageAnon() pages PG_slab now
> translates to PG_anon_exclusive, teach some in-kernel code that manually
> handles PG_slab about that.
>
> Add a spoiler on the semantics of PG_anon_exclusive as documentation. More
> documentation will be contained in the code that actually makes use of
> PG_anon_exclusive.
>
> We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
> zapping) of page table entries; page freeing code will handle that when
> it also invalidates page->mapping so the page no longer indicates PageAnon().
> Letting information about exclusivity stick around will be an important
> property when adding sanity checks to unpinning code.
>
> RFC notes: in-tree tools/cgroup/memcg_slabinfo.py looks like it might need
> some care. We'd have to lookup the head page and check if
> PageAnon() is set. Similarly, tools living outside the kernel
> repository like crash and makedumpfile might need adaptions.
>
> Cc: Roman Gushchin <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---

I'm currently testing with the following. My tests so far with a swapfile on
all different kinds of weird filesystems (excluding networking fs, though)
revealed no surprises so far:



From 13aeb929722e38d162d03baa75eadb8c2c84359d Mon Sep 17 00:00:00 2001
From: David Hildenbrand <[email protected]>
Date: Mon, 20 Dec 2021 20:23:51 +0100
Subject: [PATCH] mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for
PageAnon() pages

The basic question we would like to have a reliable and efficient answer
to is: is this anonymous page exclusive to a single process or might it
be shared? We need that information for ordinary/single pages, hugetlb
pages, and possibly each subpage of a THP.

Introduce a way to mark an anonymous page as exclusive, with the
ultimate goal of teaching our COW logic to not do "wrong COWs", whereby
GUP pins lose consistency with the pages mapped into the page table,
resulting in reported memory corruptions.

Most pageflags already have semantics for anonymous pages; however,
PG_mappedtodisk should never apply to pages in the swapcache, so let's
reuse that flag.

As PG_has_hwpoisoned also uses that flag on the second subpage of a
compound page, convert it to PG_waiters instead, which is marked as
PF_ONLY_HEAD and definitely does not apply to any tail pages.
__split_huge_page() properly does a ClearPageHasHWPoisoned() before
continuing with the THP split.

Use custom page flag modification functions such that we can do
additional sanity checks. Add a spoiler on the semantics of
PG_anon_exclusive as documentation. More documentation will be contained
in the code that actually makes use of PG_anon_exclusive.

We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
zapping) of page table entries; page freeing code will handle that when
it also invalidates page->mapping so the page no longer indicates PageAnon().
Letting information about exclusivity stick around will be an important
property when adding sanity checks to unpinning code.

Note that we properly clear the flag in free_pages_prepare() via
PAGE_FLAGS_CHECK_AT_PREP for each individual subpage of a compound page,
so there is no need to manually clear the flag.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/page-flags.h | 84 +++++++++++++++++++++++++++++++++++++-
mm/hugetlb.c | 2 +
mm/memory.c | 11 +++++
mm/memremap.c | 9 ++++
mm/swapfile.c | 4 ++
tools/vm/page-types.c | 8 +++-
6 files changed, 116 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1c3b6e5c8bfd..ac68d28b5e1f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -142,6 +142,60 @@ enum pageflags {

PG_readahead = PG_reclaim,

+ /*
+ * Depending on the way an anonymous folio can be mapped into a page
+ * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
+ * THP), PG_anon_exclusive may be set only for the head page or for
+ * subpages of an anonymous folio.
+ *
+ * PG_anon_exclusive is *usually* only expressive in combination with a
+ * page table entry. Depending on the page table entry type it might
+ * store the following information:
+ *
+ * Is what's mapped via this page table entry exclusive to the
+ * single process and can be mapped writable without further
+ * checks? If not, it might be shared and we might have to COW.
+ *
+ * For now, we only expect PTE-mapped THPs to make use of
+ * PG_anon_exclusive in subpages. For other anonymous compound
+ * folios (i.e., hugetlb), only the head page is logically mapped and
+ * holds this information.
+ *
+ * For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
+ * set on the head page. When replacing the PMD by a page table full
+ * of PTEs, PG_anon_exclusive, if set on the head page, will be set on
+ * all tail pages accordingly. Note that converting from a PTE-mapping
+ * to a PMD mapping using the same compound page is currently not
+ * possible and consequently doesn't require care.
+ *
+ * If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
+ * it should only pin if the relevant PG_anon_exclusive bit is set. In that case,
+ * the pin will be fully reliable and stay consistent with the pages
+ * mapped into the page table, as the bit cannot get cleared (e.g., by
+ * fork(), KSM) while the page is pinned. For anonymous pages that
+ * are mapped R/W, PG_anon_exclusive can be assumed to always be set
+ * because such pages cannot possibly be shared.
+ *
+ * The page table lock protecting the page table entry is the primary
+ * synchronization mechanism for PG_anon_exclusive; GUP-fast that does
+ * not take the PT lock needs special care when trying to clear the
+ * flag.
+ *
+ * Page table entry types and PG_anon_exclusive:
+ * * Present: PG_anon_exclusive applies.
+ * * Swap: the information is lost. PG_anon_exclusive was cleared.
+ * * Migration: the entry holds this information instead.
+ * PG_anon_exclusive was cleared.
+ * * Device private: PG_anon_exclusive applies.
+ * * Device exclusive: PG_anon_exclusive applies.
+ * * HW Poison: PG_anon_exclusive is stale and not changed.
+ *
+ * If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
+ * not allowed and the flag will stick around until the page is freed
+ * and folio->mapping is cleared.
+ */
+ PG_anon_exclusive = PG_mappedtodisk,
+
/* Filesystems */
PG_checked = PG_owner_priv_1,

@@ -176,7 +230,7 @@ enum pageflags {
* Indicates that at least one subpage is hwpoisoned in the
* THP.
*/
- PG_has_hwpoisoned = PG_mappedtodisk,
+ PG_has_hwpoisoned = PG_waiters,
#endif

/* non-lru isolated movable page */
@@ -920,6 +974,34 @@ extern bool is_free_buddy_page(struct page *page);

__PAGEFLAG(Isolated, isolated, PF_ANY);

+static __always_inline int PageAnonExclusive(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+ VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+ return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void SetPageAnonExclusive(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
+ VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+ set_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void ClearPageAnonExclusive(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
+ VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+ clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void __ClearPageAnonExclusive(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+ VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+ __clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
#ifdef CONFIG_MMU
#define __PG_MLOCKED (1UL << PG_mlocked)
#else
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9fb990d95dab..1ff0b9e1e28e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1669,6 +1669,8 @@ void free_huge_page(struct page *page)
VM_BUG_ON_PAGE(page_mapcount(page), page);

hugetlb_set_page_subpool(page, NULL);
+ if (PageAnon(page))
+ __ClearPageAnonExclusive(page);
page->mapping = NULL;
restore_reserve = HPageRestoreReserve(page);
ClearHPageRestoreReserve(page);
diff --git a/mm/memory.c b/mm/memory.c
index 7b32f422798d..d01fab481134 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3671,6 +3671,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_nomap;
}

+ /*
+ * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
+ * must never point at an anonymous page in the swapcache that is
+ * PG_anon_exclusive. Sanity check that this holds and especially, that
+ * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity
+ * check after taking the PT lock and making sure that nobody
+ * concurrently faulted in this page and set PG_anon_exclusive.
+ */
+ BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
+ BUG_ON(PageAnon(page) && PageAnonExclusive(page));
+
/*
* Remove the swap entry and conditionally try to free up the swapcache.
* We're already holding a reference on the page but haven't mapped it
diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11f..160ea92e4e17 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -478,6 +478,15 @@ void free_devmap_managed_page(struct page *page)

mem_cgroup_uncharge(page_folio(page));

+ /*
+ * Note: we don't expect anonymous compound pages yet. Once supported
+ * and we could PTE-map them similar to THP, we'd have to clear
+ * PG_anon_exclusive on all tail pages.
+ */
+ VM_BUG_ON_PAGE(PageAnon(page) && PageCompound(page), page);
+ if (PageAnon(page))
+ __ClearPageAnonExclusive(page);
+
/*
* When a device_private page is freed, the page->mapping field
* may still contain a (stale) mapping value. For example, the
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7edc8e099b22..493acb967b7a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1796,6 +1796,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
goto out;
}

+ /* See do_swap_page() */
+ BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
+ BUG_ON(PageAnon(page) && PageAnonExclusive(page));
+
dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
get_page(page);
diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index b1ed76d9a979..381dcc00cb62 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -80,9 +80,10 @@
#define KPF_SOFTDIRTY 40
#define KPF_ARCH_2 41

-/* [48-] take some arbitrary free slots for expanding overloaded flags
+/* [47-] take some arbitrary free slots for expanding overloaded flags
* not part of kernel API
*/
+#define KPF_ANON_EXCLUSIVE 47
#define KPF_READAHEAD 48
#define KPF_SLOB_FREE 49
#define KPF_SLUB_FROZEN 50
@@ -138,6 +139,7 @@ static const char * const page_flag_names[] = {
[KPF_SOFTDIRTY] = "f:softdirty",
[KPF_ARCH_2] = "H:arch_2",

+ [KPF_ANON_EXCLUSIVE] = "d:anon_exclusive",
[KPF_READAHEAD] = "I:readahead",
[KPF_SLOB_FREE] = "P:slob_free",
[KPF_SLUB_FROZEN] = "A:slub_frozen",
@@ -472,6 +474,10 @@ static int bit_mask_ok(uint64_t flags)

static uint64_t expand_overloaded_flags(uint64_t flags, uint64_t pme)
{
+ /* Anonymous pages overload PG_mappedtodisk */
+ if ((flags & BIT(ANON)) && (flags & BIT(MAPPEDTODISK)))
+ flags ^= BIT(MAPPEDTODISK) | BIT(ANON_EXCLUSIVE);
+
/* SLOB/SLUB overload several page flags */
if (flags & BIT(SLAB)) {
if (flags & BIT(PRIVATE))
--
2.35.1
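
To illustrate how the flag is meant to be consumed (this is not a hunk
from the series; wp_page_reuse()/wp_page_copy() are the existing mm/memory.c
helpers, and the composition below is only a simplified sketch of the
intent of the follow-up COW patch):

/*
 * Sketch only: an anon page that is still marked exclusive belongs to
 * this process alone (no fork()/KSM sharing happened), so a write fault
 * can reuse it and map it writable; anything else falls back to copying.
 */
static vm_fault_t do_wp_page_anon_sketch(struct vm_fault *vmf)
{
        struct page *page = vmf->page;

        if (PageAnon(page) && PageAnonExclusive(page)) {
                wp_page_reuse(vmf);
                return VM_FAULT_WRITE;
        }
        return wp_page_copy(vmf);
}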


--
Thanks,

David / dhildenb

2022-03-11 22:11:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 10/15] mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages

On Fri, Mar 11, 2022 at 10:46 AM David Hildenbrand <[email protected]> wrote:
>
> - PG_has_hwpoisoned = PG_mappedtodisk,
> + PG_has_hwpoisoned = PG_waiters,

That makes me too nervous for words. PG_waiters is very subtle.

Not only is it magical in bit location ways (and the special
clear_bit_unlock_is_negative_byte() macro that *literally* exists only
for unlocking a page), it just ends up having fairly subtle semantics
with intentionally racy clearing etc.
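
For reference, the fast path that macro exists for looks roughly like
this (paraphrasing mm/filemap.c around v5.17, so treat the details as
approximate):

void folio_unlock(struct folio *folio)
{
        /* Bit 7 allows x86 to check the byte's sign bit */
        BUILD_BUG_ON(PG_waiters != 7);
        BUILD_BUG_ON(PG_locked > 7);
        VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
        if (clear_bit_unlock_is_negative_byte(PG_locked,
                                              folio_flags(folio, 0)))
                folio_wake_bit(folio, PG_locked);
}

PG_waiters has to sit in bit 7 of the same byte as PG_locked for the
negative-byte trick to work at all, which is part of what makes that bit
so easy to get wrong.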

Mixing that up with any hwpoison bits - that aren't used by any normal
mortals and thus get very little coverage - just sounds horribly
horribly wrong.

Linus

2022-03-11 22:30:57

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 11/15] mm: remember exclusively mapped anonymous pages with PG_anon_exclusive

On 08.03.22 15:14, David Hildenbrand wrote:
> Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
> exclusive, and use that information to make GUP pins reliable and stay
> consistent with the page mapped into the page table even if the
> page table entry gets write-protected.
>
> With that information at hand, we can extend our COW logic to always
> reuse anonymous pages that are exclusive. For anonymous pages that
> might be shared, the existing logic applies.
>
> As already documented, PG_anon_exclusive is usually only expressive in
> combination with a page table entry. Especially PTE vs. PMD-mapped
> anonymous pages require more thought, some examples: due to mremap() we
> can easily have a single compound page PTE-mapped into multiple page tables
> exclusively in a single process -- multiple page table locks apply.
> Further, due to MADV_WIPEONFORK we might not necessarily write-protect
> all PTEs, and only some subpages might be pinned. Long story short: once
> PTE-mapped, we have to track information about exclusivity per sub-page,
> but until then, we can just track it for the compound page in the head
> page and not having to update a whole bunch of subpages all of the time
> for a simple PMD mapping of a THP.
>
> For simplicity, this commit mostly talks about "anonymous pages", while
> it's for THP actually "the part of an anonymous folio referenced via
> a page table entry".
>
> To not spill PG_anon_exclusive code all over the mm code-base, we let
> the anon rmap code handle all the PG_anon_exclusive logic it can easily
> handle.
>
> If a writable, present page table entry points at an anonymous (sub)page,
> that (sub)page must be PG_anon_exclusive. If GUP wants to take a reliable
> pin (FOLL_PIN) on an anonymous page referenced via a present
> page table entry, it must only pin if PG_anon_exclusive is set for the
> mapped (sub)page.
>
> This commit doesn't adjust GUP, so this is only implicitly handled for
> FOLL_WRITE; follow-up commits will teach GUP to also respect it for
> FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages
> fully reliable.
>
> Whenever an anonymous page is to be shared (fork(), KSM), or when
> temporarily unmapping an anonymous page (swap, migration), the relevant
> PG_anon_exclusive bit has to be cleared to mark the anonymous page
> possibly shared. Clearing will fail if there are GUP pins on the page:
> * For fork(), this means having to copy the page and not being able to
> share it. fork() protects against concurrent GUP using the PT lock and
> the src_mm->write_protect_seq.
> * For KSM, this means sharing will fail. For swap, this means unmapping
> will fail. For migration, this means migration will fail early. All
> three cases protect against concurrent GUP using the PT lock and a
> proper clear/invalidate+flush of the relevant page table entry.
>
> This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
> pinned page gets mapped R/O and the successive write fault ends up
> replacing the page instead of reusing it. It improves the situation for
> O_DIRECT/vmsplice/... that still use FOLL_GET instead of FOLL_PIN,
> if fork() is *not* involved, however swapout and fork() are still
> problematic. Properly using FOLL_PIN instead of FOLL_GET for these
> GUP users will fix the issue for them.
>
> I. Details about basic handling
>
> I.1. Fresh anonymous pages
>
> page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
> given page exclusive via __page_set_anon_rmap(exclusive=1). As that is
> the mechanism fresh anonymous pages come into life (besides migration
> code where we copy the page->mapping), all fresh anonymous pages will
> start out as exclusive.
>
> I.2. COW reuse handling of anonymous pages
>
> When a COW handler stumbles over a (sub)page that's marked exclusive, it
> simply reuses it. Otherwise, the handler tries harder under page lock to
> detect if the (sub)page is exclusive and can be reused. If exclusive,
> page_move_anon_rmap() will mark the given (sub)page exclusive.
>
> Note that hugetlb code does not yet check for PageAnonExclusive(), as it
> still uses the old COW logic that is prone to the COW security issue
> because hugetlb code cannot really tolerate unnecessary/wrong COW as
> huge pages are a scarce resource.
>
> I.3. Migration handling
>
> try_to_migrate() has to try marking an exclusive anonymous page shared
> via page_try_share_anon_rmap(). If it fails because there are GUP pins
> on the page, unmap fails. migrate_vma_collect_pmd() and
> __split_huge_pmd_locked() are handled similarly.
>
> Writable migration entries implicitly point at shared anonymous pages.
> For readable migration entries that information is stored via a new
> "readable-exclusive" migration entry, specific to anonymous pages.
>
> When restoring a migration entry in remove_migration_pte(), information
> about exclusivity is detected via the migration entry type, and
> RMAP_EXCLUSIVE is set accordingly for
> page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that
> information.
>
> I.4. Swapout handling
>
> try_to_unmap() has to try marking the mapped page possibly shared via
> page_try_share_anon_rmap(). If it fails because there are GUP pins on the
> page, unmap fails. For now, information about exclusivity is lost. In the
> future, we might want to remember that information in the swap entry in
> some cases, however, it requires more thought, care, and a way to store
> that information in swap entries.
>
> I.5. Swapin handling
>
> do_swap_page() will never stumble over exclusive anonymous pages in the
> swap cache, as try_to_migrate() prohibits that. do_swap_page() always has
> to detect manually if an anonymous page is exclusive and has to set
> RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.
>
> I.6. THP handling
>
> __split_huge_pmd_locked() has to move the information about exclusivity
> from the PMD to the PTEs.
>
> a) In case we have a readable-exclusive PMD migration entry, simply insert
> readable-exclusive PTE migration entries.
>
> b) In case we have a present PMD entry and we don't want to freeze
> ("convert to migration entries"), simply forward PG_anon_exclusive to
> all sub-pages, no need to temporarily clear the bit.
>
> c) In case we have a present PMD entry and want to freeze, handle it
> similar to try_to_migrate(): try marking the page shared first. In case
> we fail, we ignore the "freeze" instruction and simply split ordinarily.
> try_to_migrate() will properly fail because the THP is still mapped via
> PTEs.
>
> When splitting a compound anonymous folio (THP), the information about
> exclusivity is implicitly handled via the migration entries: no need to
> replicate PG_anon_exclusive manually.
>
> I.7. fork() handling
>
> fork() handling is relatively easy, because PG_anon_exclusive is only
> expressive for some page table entry types.
>
> a) Present anonymous pages
>
> page_try_dup_anon_rmap() will mark the given subpage shared -- which
> will fail if the page is pinned. If it failed, we have to copy (or
> PTE-map a PMD to handle it on the PTE level).
>
> Note that device exclusive entries are just a pointer at a PageAnon()
> page. fork() will first convert a device exclusive entry to a present
> page table and handle it just like present anonymous pages.
>
> b) Device private entry
>
> Device private entries point at PageAnon() pages that cannot be mapped
> directly and, therefore, cannot get pinned.
>
> page_try_dup_anon_rmap() will mark the given subpage shared, which
> cannot fail because they cannot get pinned.
>
> c) HW poison entries
>
> PG_anon_exclusive will remain untouched and is stale -- the page table
> entry is just a placeholder after all.
>
> d) Migration entries
>
> Writable and readable-exclusive entries are converted to readable
> entries: possibly shared.
>
> I.8. mprotect() handling
>
> mprotect() only has to properly handle the new readable-exclusive
> migration entry:
>
> When write-protecting a migration entry that points at an anonymous
> page, remember the information about exclusivity via the
> "readable-exclusive" migration entry type.
>
> II. Migration and GUP-fast
>
> Whenever replacing a present page table entry that maps an exclusive
> anonymous page by a migration entry, we have to mark the page possibly
> shared and synchronize against GUP-fast by a proper
> clear/invalidate+flush to make the following scenario impossible:
>
> 1. try_to_migrate() places a migration entry after checking for GUP pins
> and marks the page possibly shared.
> 2. GUP-fast pins the page due to lack of synchronization
> 3. fork() converts the "writable/readable-exclusive" migration entry into a
> readable migration entry
> 4. Migration fails due to the GUP pin (failing to freeze the refcount)
> 5. Migration entries are restored. PG_anon_exclusive is lost
>
> -> We have a pinned page that is not marked exclusive anymore.
>
> Note that we move information about exclusivity from the page to the
> migration entry as it otherwise highly overcomplicates fork() and
> PTE-mapping a THP.
>
> III. Swapout and GUP-fast
>
> Whenever replacing a present page table entry that maps an exclusive
> anonymous page by a swap entry, we have to mark the page possibly
> shared and synchronize against GUP-fast by a proper
> clear/invalidate+flush to make the following scenario impossible:
>
> 1. try_to_unmap() places a swap entry after checking for GUP pins and
> clears exclusivity information on the page.
> 2. GUP-fast pins the page due to lack of synchronization.
>
> -> We have a pinned page that is not marked exclusive anymore.
>
> If we'd ever store information about exclusivity in the swap entry,
> similar to migration handling, the same considerations as in II would
> apply. This is future work.
>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> include/linux/rmap.h | 33 ++++++++++++++++++
> include/linux/swap.h | 15 ++++++---
> include/linux/swapops.h | 25 ++++++++++++++
> mm/huge_memory.c | 75 ++++++++++++++++++++++++++++++++++++-----
> mm/hugetlb.c | 15 ++++++---
> mm/ksm.c | 13 ++++++-
> mm/memory.c | 33 +++++++++++++-----
> mm/migrate.c | 34 +++++++++++++++++--
> mm/mprotect.c | 8 +++--
> mm/rmap.c | 59 +++++++++++++++++++++++++++++---
> 10 files changed, 275 insertions(+), 35 deletions(-)
>

I'll be including the following two changes in the next version:



diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 1bc522d28a78..1aef834e1d60 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -224,6 +224,13 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
{
VM_BUG_ON_PAGE(!PageAnon(page), page);

+ /*
+ * No need to check+clear for already shared pages, including KSM
+ * pages.
+ */
+ if (!PageAnonExclusive(page))
+ goto dup;
+
/*
* If this page may have been pinned by the parent process,
* don't allow to duplicate the mapping but instead require to e.g.,
@@ -240,6 +247,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
* It's okay to share the anon page between both processes, mapping
* the page R/O into both processes.
*/
+dup:
__page_dup_rmap(page, compound);
return 0;
}
@@ -275,7 +283,6 @@ static inline int page_try_share_anon_rmap(struct page *page)
return 0;
}

-
/*
* Called from mm/vmscan.c to handle paging out
*/
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0e83c1551da3..f94c66959531 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2083,7 +2083,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
* only and let try_to_migrate_one() fail later.
*/
anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
- if (freeze && page_try_share_anon_rmap(page))
+ if (freeze && anon_exclusive && page_try_share_anon_rmap(page))
freeze = false;
}
VM_BUG_ON_PAGE(!page_count(page), page);
@@ -2355,10 +2355,14 @@ static void __split_huge_page_tail(struct page *head, int tail,
* After successful get_page_unless_zero() might follow flags change,
* for example lock_page() which set PG_waiters.
*
- * Keep PG_anon_exclusive information, already maintained for all
- * subpages of a compound page, untouched.
+ * Note that for anonymous pages, PG_anon_exclusive has been cleared
+ * in unmap_page() and is stored in the migration entry instead. It will
+ * be restored via remap_page(). We should never see PG_anon_exclusive
+ * at this point.
*/
- page_tail->flags &= ~(PAGE_FLAGS_CHECK_AT_PREP & ~PG_anon_exclusive);
+ VM_BUG_ON_PAGE(PageAnon(head) && PageAnonExclusive(page_tail),
+ page_tail);
+ page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
page_tail->flags |= (head->flags &
((1L << PG_referenced) |
(1L << PG_swapbacked) |



--
Thanks,

David / dhildenb

2022-03-11 22:45:45

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 10/15] mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages

On 11.03.22 20:22, Linus Torvalds wrote:
> On Fri, Mar 11, 2022 at 10:46 AM David Hildenbrand <[email protected]> wrote:
>>
>> - PG_has_hwpoisoned = PG_mappedtodisk,
>> + PG_has_hwpoisoned = PG_waiters,
>
> That makes me too nervous for words. PG_waiters is very subtle.

Yes, but PG_has_hwpoisoned is located on the second subpage of a
compound page, not on the head page.

>
> Not only is it magical in bit location ways (and the special
> clear_bit_unlock_is_negative_byte() macro that *literally* exists only
> for unlocking a page), it just ends up having fairly subtle semantics
> with intentionally racy clearing etc.
>
> Mixing that up with any hwpoison bits - that aren't used by any normal
> mortals and thus get very little coverage - just sounds horribly
> horribly wrong.

I used PG_error before, but felt like using a bit that is never ever
valid to be set/cleared/checked on a subpage would be an even better fit:

Note the:

PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) __CLEARPAGEFLAG(Waiters, waiters, PF_ONLY_HEAD)

whereby PF_ONLY_HEAD translates to:

"for compound page, callers only ever operate on the head page."


I can just switch to PG_error, but for the second subpage, PG_waiters
should be just fine (unless I am missing something important).

--
Thanks,

David / dhildenb

2022-03-11 23:15:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 10/15] mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages

On Fri, Mar 11, 2022 at 11:36 AM David Hildenbrand <[email protected]> wrote:
>
> Yes, but PG_has_hwpoisoned is located on the second subpage of a
> compound page, not on the head page.

Ahh. Ok, I missed that when just looking at the patch.

I wish we had a different namespace entirely for those separate bits
or something.

Linus

2022-03-11 23:39:30

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 10/15] mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages

On Fri, Mar 11, 2022 at 07:46:39PM +0100, David Hildenbrand wrote:
> I'm currently testing with the following. My tests so far with a swapfile on
> all different kinds of weird filesystems (excluding networking fs, though)
> revealed no surprises so far:

I like this a lot better than reusing PG_swap. Thanks!

I'm somewhat reluctant to introduce a new flag that can be set on tail
pages. Do we lose much if it's always set only on the head page?

> +++ b/include/linux/page-flags.h
> @@ -142,6 +142,60 @@ enum pageflags {
>
> PG_readahead = PG_reclaim,
>
> + /*
> + * Depending on the way an anonymous folio can be mapped into a page
> + * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
> + * THP), PG_anon_exclusive may be set only for the head page or for
> + * subpages of an anonymous folio.
> + *
> + * PG_anon_exclusive is *usually* only expressive in combination with a
> + * page table entry. Depending on the page table entry type it might
> + * store the following information:
> + *
> + * Is what's mapped via this page table entry exclusive to the
> + * single process and can be mapped writable without further
> + * checks? If not, it might be shared and we might have to COW.
> + *
> + * For now, we only expect PTE-mapped THPs to make use of
> + * PG_anon_exclusive in subpages. For other anonymous compound
> + * folios (i.e., hugetlb), only the head page is logically mapped and
> + * holds this information.
> + *
> + * For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
> + * set on the head page. When replacing the PMD by a page table full
> + * of PTEs, PG_anon_exclusive, if set on the head page, will be set on
> + * all tail pages accordingly. Note that converting from a PTE-mapping
> + * to a PMD mapping using the same compound page is currently not
> + * possible and consequently doesn't require care.
> + *
> + * If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
> + * it should only pin if the relevant PG_anon_exclusive bit is set. In that case,
> + * the pin will be fully reliable and stay consistent with the pages
> + * mapped into the page table, as the bit cannot get cleared (e.g., by
> + * fork(), KSM) while the page is pinned. For anonymous pages that
> + * are mapped R/W, PG_anon_exclusive can be assumed to always be set
> + * because such pages cannot possibly be shared.
> + *
> + * The page table lock protecting the page table entry is the primary
> + * synchronization mechanism for PG_anon_exclusive; GUP-fast that does
> + * not take the PT lock needs special care when trying to clear the
> + * flag.
> + *
> + * Page table entry types and PG_anon_exclusive:
> + * * Present: PG_anon_exclusive applies.
> + * * Swap: the information is lost. PG_anon_exclusive was cleared.
> + * * Migration: the entry holds this information instead.
> + * PG_anon_exclusive was cleared.
> + * * Device private: PG_anon_exclusive applies.
> + * * Device exclusive: PG_anon_exclusive applies.
> + * * HW Poison: PG_anon_exclusive is stale and not changed.
> + *
> + * If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
> + * not allowed and the flag will stick around until the page is freed
> + * and folio->mapping is cleared.
> + */

... I also don't think this is the right place for this comment. Not
sure where it should go.

> +static __always_inline void SetPageAnonExclusive(struct page *page)
> +{
> + VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);

hm. seems to me like we should have a PageAnonNotKsm which just does

	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
			PAGE_MAPPING_ANON;

because that's "a bit" inefficient. OK, that's just a VM_BUG_ON,
but we have other users in real code:

mm/migrate.c: if (PageAnon(page) && !PageKsm(page))
mm/page_idle.c: need_lock = !PageAnon(page) || PageKsm(page);
mm/rmap.c: if (!is_locked && (!PageAnon(page) || PageKsm(page))) {

2022-03-11 23:39:33

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 10/15] mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages

On Fri, Mar 11, 2022 at 08:36:37PM +0100, David Hildenbrand wrote:
> I used PG_error before, but felt like using a bit that is never ever
> valid to be set/cleared/checked on a subpage would be even a better fit:
>
> Note the:
>
> PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) __CLEARPAGEFLAG(Waiters, waiters, PF_ONLY_HEAD)
>
> whereby PF_ONLY_HEAD translates to:
>
> "for compound page, callers only ever operate on the head page."
>
>
> I can just switch to PG_error, but for the second subpage, PG_waiters
> should be just fine (unless I am missing something important).

I think you're missing something important that almost everybody misses
when looking at this code (including me).

PF_ANY flags can be set on individual pages.
PF_HEAD means "we automatically redirect all operations to the head page".
PF_ONLY_HEAD means "If you try to call this on a tail page, we BUG".
PF_NO_TAIL means "If you try to read this flag on a tail page, we'll
look at the head page instead, but if you try to set/clear this flag
on a tail page, we BUG"
PF_NO_COMPOUND means "We BUG() if you call this on a compound page"

So really, you can reuse any flag as PF_SECOND that isn't PF_ANY.

No, that's not what the documentation currently says. It should be.
I had a patch to reword it at some point, but I guess it got lost.
The current documentation reads like "We replicate the flag currently
set on the head page to all tail pages", but that just isn't what
the code does.
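
For reference, those policies translate to roughly the following in
include/linux/page-flags.h (paraphrased from around v5.17, so the exact
definitions may differ slightly; PF_POISONED_CHECK() just asserts that
the struct page isn't poisoned and evaluates to the page):

#define PF_ANY(page, enforce)   PF_POISONED_CHECK(page)
#define PF_HEAD(page, enforce)  PF_POISONED_CHECK(compound_head(page))
#define PF_ONLY_HEAD(page, enforce) ({                                  \
                VM_BUG_ON_PGFLAGS(PageTail(page), page);                \
                PF_POISONED_CHECK(page); })
#define PF_NO_TAIL(page, enforce) ({                                    \
                VM_BUG_ON_PGFLAGS(enforce && PageTail(page), page);     \
                PF_POISONED_CHECK(compound_head(page)); })
#define PF_NO_COMPOUND(page, enforce) ({                                \
                VM_BUG_ON_PGFLAGS(enforce && PageCompound(page), page); \
                PF_POISONED_CHECK(page); })
#define PF_SECOND(page, enforce) ({                                     \
                VM_BUG_ON_PGFLAGS(!PageHead(page), page);               \
                PF_POISONED_CHECK(&page[1]); })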

2022-03-12 09:22:49

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 10/15] mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages

On 11.03.22 22:02, Matthew Wilcox wrote:
> On Fri, Mar 11, 2022 at 07:46:39PM +0100, David Hildenbrand wrote:
>> I'm currently testing with the following. My tests so far with a swapfile on
>> all different kinds of weird filesystems (excluding networking fs, though)
>> revealed no surprises so far:
>
> I like this a lot better than reusing PG_swap. Thanks!
>
> I'm somewhat reluctant to introduce a new flag that can be set on tail
> pages. Do we lose much if it's always set only on the head page?

After spending one month on getting THP to work without PF_ANY, I can
say with confidence that the whole thing won't fly with THP when not
tracking it on the minimum-mapping granularity. For a PTE-mapped THP,
that's on the subpage level.

The next patch in the series documents some details. Intuitively, if we
could replace the pageflag by a PTE/PMD bit, we'd get roughly the same
result.

>
>> +++ b/include/linux/page-flags.h
>> @@ -142,6 +142,60 @@ enum pageflags {
>>
>> PG_readahead = PG_reclaim,
>>
>> + /*
>> + * Depending on the way an anonymous folio can be mapped into a page
>> + * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
>> + * THP), PG_anon_exclusive may be set only for the head page or for
>> + * subpages of an anonymous folio.
>> + *
>> + * PG_anon_exclusive is *usually* only expressive in combination with a
>> + * page table entry. Depending on the page table entry type it might
>> + * store the following information:
>> + *
>> + * Is what's mapped via this page table entry exclusive to the
>> + * single process and can be mapped writable without further
>> + * checks? If not, it might be shared and we might have to COW.
>> + *
>> + * For now, we only expect PTE-mapped THPs to make use of
>> + * PG_anon_exclusive in subpages. For other anonymous compound
>> + * folios (i.e., hugetlb), only the head page is logically mapped and
>> + * holds this information.
>> + *
>> + * For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
>> + * set on the head page. When replacing the PMD by a page table full
>> + * of PTEs, PG_anon_exclusive, if set on the head page, will be set on
>> + * all tail pages accordingly. Note that converting from a PTE-mapping
>> + * to a PMD mapping using the same compound page is currently not
>> + * possible and consequently doesn't require care.
>> + *
>> + * If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
>> + * it should only pin if the relevant PG_anon_exclusive bit is set. In that case,
>> + * the pin will be fully reliable and stay consistent with the pages
>> + * mapped into the page table, as the bit cannot get cleared (e.g., by
>> + * fork(), KSM) while the page is pinned. For anonymous pages that
>> + * are mapped R/W, PG_anon_exclusive can be assumed to always be set
>> + * because such pages cannot possibly be shared.
>> + *
>> + * The page table lock protecting the page table entry is the primary
>> + * synchronization mechanism for PG_anon_exclusive; GUP-fast that does
>> + * not take the PT lock needs special care when trying to clear the
>> + * flag.
>> + *
>> + * Page table entry types and PG_anon_exclusive:
>> + * * Present: PG_anon_exclusive applies.
>> + * * Swap: the information is lost. PG_anon_exclusive was cleared.
>> + * * Migration: the entry holds this information instead.
>> + * PG_anon_exclusive was cleared.
>> + * * Device private: PG_anon_exclusive applies.
>> + * * Device exclusive: PG_anon_exclusive applies.
>> + * * HW Poison: PG_anon_exclusive is stale and not changed.
>> + *
>> + * If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
>> + * not allowed and the flag will stick around until the page is freed
>> + * and folio->mapping is cleared.
>> + */
>
> ... I also don't think this is the right place for this comment. Not
> sure where it should go.

I went for "rather have some documentation at a sub-optimal place than
no documentation at all". I'm thinking about writing proper
documentation once everything is in place, and then moving some details
from this comment into that document.

>
>> +static __always_inline void SetPageAnonExclusive(struct page *page)
>> +{
>> + VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
>
> hm. seems to me like we should have a PageAnonNotKsm which just
> does
> return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
> PAGE_MAPPING_ANON;
> because that's "a bit" inefficient. OK, that's just a VM_BUG_ON,
> but we have other users in real code:
>
> mm/migrate.c: if (PageAnon(page) && !PageKsm(page))
> mm/page_idle.c: need_lock = !PageAnon(page) || PageKsm(page);
> mm/rmap.c: if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
>

I'm wondering whether the compiler isn't able to optimize that already.
That said, I can look into adding such a helper outside of this series.

Thanks!

--
Thanks,

David / dhildenb

2022-03-12 14:06:10

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 10/15] mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages

On 11.03.22 22:11, Matthew Wilcox wrote:
> On Fri, Mar 11, 2022 at 08:36:37PM +0100, David Hildenbrand wrote:
>> I used PG_error before, but felt like using a bit that is never ever
>> valid to be set/cleared/checked on a subpage would be even a better fit:
>>
>> Note the:
>>
>> PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) __CLEARPAGEFLAG(Waiters, waiters, PF_ONLY_HEAD)
>>
>> whereby PF_ONLY_HEAD translates to:
>>
>> "for compound page, callers only ever operate on the head page."
>>
>>
>> I can just switch to PG_error, but for the second subpage, PG_waiters
>> should be just fine (unless I am missing something important).
>
> I think you're missing something important that almost everybody misses
> when looking at this code (including me).
>
> PF_ANY flags can be set on individual pages.
> PF_HEAD means "we automatically redirect all operations to the head page".
> PF_ONLY_HEAD means "If you try to call this on a tail page, we BUG".
> PF_NO_TAIL means "If you try to read this flag on a tail page, we'll
> look at the head page instead, but if you try to set/clear this flag
> on a tail page, we BUG"

^ that's actually the confusing part for me

"modifications of the page flag must be done on small or head pages,
checks can be done on tail pages too."

Just from reading that I thought checks on the tail page would *not* get
redirected to the head.

I'd have written that as an extended version of PF_HEAD:

"for compound page all operations related to the page flag are applied
to head page. Checks done on a tail page will be redirected to the head
page."

[...]

>
> So really, you can reuse any flag as PF_SECOND that isn't PF_ANY.
>

I'll go back to PG_error, thanks!
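
(i.e., the next version would then carry something along the lines of

        PG_has_hwpoisoned = PG_error,

in the enum, with the existing PAGEFLAG(HasHWPoisoned, has_hwpoisoned,
PF_SECOND) accessors unchanged -- a sketch of the follow-up change, not
a posted hunk.)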

--
Thanks,

David / dhildenb