2021-12-17 11:32:12

by David Hildenbrand

Subject: [PATCH v1 00/11] mm: COW fixes part 1: fix the COW security issue for THP and hugetlb

Hi everybody,

as discussed in the linux-mm alignment session on Wednesday, this is part 1
of the COW fixes: fix the COW security issue using GUP-triggered
unsharing of shared anonymous pages (ordinary, THP, hugetlb). In the
meeting slides, this approach was referred to as "Copy On Read". If anybody
wants to have access to the slides, please feel free to reach out.


The patches are based on v5.16-rc5 and available at:
https://github.com/davidhildenbrand/linux/pull/new/unshare_v1


It is currently again possible for a child process to observe modifications
of anonymous pages performed by the parent process after fork() in some
cases, which is not only a violation of the POSIX semantics of MAP_PRIVATE,
but more importantly a real security issue.

This issue, including other related COW issues, has been summarized at [1]:
"
1. Observing Memory Modifications of Private Pages From A Child Process

Long story short: process-private memory might not be as private as you
think once you fork(): successive modifications of private memory
regions in the parent process can still be observed by the child
process, for example, by smart use of vmsplice()+munmap().

The core problem is that pinning pages readable in a child process, such
as done via the vmsplice system call, can result in a child process
observing memory modifications done in the parent process the child is
not supposed to observe. [1] contains an excellent summary and [2]
contains further details. This issue was assigned CVE-2020-29374 [9].

For this to trigger, it's required to use a fork() without subsequent
exec(), for example, as used under Android zygote. Without further
details about an application that forks less-privileged child processes,
one cannot really say what's actually affected and what's not -- see the
details section at the end of this mail for a short sshd/openssh analysis.

While commit 17839856fd58 ("gup: document and work around "COW can break
either way" issue") fixed this issue and resulted in other problems
(e.g., ptrace on pmem), commit 09854ba94c6a ("mm: do_wp_page()
simplification") re-introduced part of the problem unfortunately.

The original reproducer can be modified quite easily to use THP [3] and
make the issue appear again on upstream kernels. I modified it to use
hugetlb [4] and it triggers as well. The problem is certainly less
severe with hugetlb than with THP; it merely highlights that we still
have plenty of open holes we should be closing/fixing.

Regarding vmsplice(), the only known workaround is to disallow the
vmsplice() system call ... or disable THP and hugetlb. But who knows
what else is affected (RDMA? O_DIRECT?) to achieve the same goal -- in
the end, it's a more generic issue.
"

This security issue was first reported by Jann Horn on 27 May 2020 and it
currently affects anonymous THP and hugetlb again. The "security issue"
part for hugetlb might be less important than for THP. However, with this
approach it is easy to get the MAP_PRIVATE semantics right for any anonymous
pages in that regard and to avoid any such information leaks without much
added complexity.

Ordinary anonymous pages are currently not affected, because the COW logic
was changed in commit 09854ba94c6a ("mm: do_wp_page() simplification")
for them to COW on "page_count() != 1" instead of "mapcount > 1", which
unfortunately results in other COW issues, some of them documented in [1]
as well.

To fix this COW issue once and for all, introduce GUP-triggered unsharing
that can be conditionally triggered via FAULT_FLAG_UNSHARE. In contrast to
traditional COW, unsharing will leave the copied page mapped
write-protected in the page table, not having the semantics of a write
fault.

Logically, unsharing is triggered "early", as soon as GUP performs the
action that could otherwise result in a COW getting missed later and the
security issue triggering. However, unlike before, unsharing is not triggered
via a write fault with its undesired side effects.

Long story short, GUP triggers unsharing if all of the following conditions
are met:
* The page is mapped R/O
* We have an anonymous page, excluding KSM
* We want to read (!FOLL_WRITE)
* Unsharing is not disabled (!FOLL_NOUNSHARE)
* We want to take a reference (FOLL_GET or FOLL_PIN)
* The page is a shared anonymous page: mapcount > 1
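
For illustration, the combined check boils down to something like the
following sketch (the helper name is made up for this example;
FOLL_NOUNSHARE is introduced later in this series):

static inline bool gup_must_trigger_unshare(struct page *page, pte_t pte,
					    unsigned int foll_flags)
{
	/*
	 * Sketch only: would GUP have to trigger unsharing before taking a
	 * R/O reference (FOLL_GET/FOLL_PIN) on this page?
	 */
	if (pte_write(pte))				/* mapped R/W */
		return false;
	if (!PageAnon(page) || PageKsm(page))		/* anon only, no KSM */
		return false;
	if (foll_flags & FOLL_WRITE)			/* read access only */
		return false;
	if (foll_flags & FOLL_NOUNSHARE)		/* unsharing disabled */
		return false;
	if (!(foll_flags & (FOLL_GET | FOLL_PIN)))	/* only if we take a ref */
		return false;
	return page_mapcount(page) > 1;			/* shared anon page */
}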

To reliably detect shared anonymous THP without heavy locking, introduce
a mapcount_seqcount seqlock that protects the mapcount of a THP and can
be used to read an atomic mapcount value. The mapcount_seqcount is stored
inside the memmap of the compound page -- to keep it simple, factor out
a raw_seqcount_t from the seqcount_t.
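
For illustration, the resulting lockless read pattern is roughly the
following (a sketch of what __page_mapcount() does for an anon THP head
page after patch #5; the thp_mapcount_* helpers are introduced there and
the function name here is made up):

static inline int thp_head_mapcount_lockless(struct page *head)
{
	unsigned int seq;
	int ret;

	do {
		seq = thp_mapcount_read_begin(head);
		ret = head_compound_mapcount(head) +
		      atomic_read(&head->_mapcount) + 1;
		if (PageDoubleMap(head))
			ret--;
	} while (thp_mapcount_read_retry(head, seq));

	return ret;
}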

As this patch series introduces the same unsharing logic for any
anonymous pages, it also paves the way to fix other COW issues, e.g.,
documented in [1], without reintroducing the security issue or
reintroducing other issues we observed in the past (e.g., broken ptrace on
pmem).

All reproducers for this COW issue have been consolidated in the selftest
included in this series. Hopefully we'll get this fixed for good.

Future work:

* get_user_pages_fast_only() can currently spin on the mapcount_seqcount
when reading the mapcount, which might be a rare event. While this is
fine even when done from get_user_pages_fast_only() in IRQ context, we
might want to just fail fast in get_user_pages_fast_only(). We already
have patches prepared that add page_anon_maybe_shared() and
page_trans_huge_anon_maybe_shared() that will return "true" in case
spinning would be required and make get_user_pages_fast_only() fail fast.
I'm excluding them for simplicity.

... even better would be finding a way to not need the
mapcount_seqcount at all, but THP splitting and PageDoubleMap() give us a
hard time -- maybe we'll eventually find a way someday :)

* Part 2 will tackle the other user-space visible breakages / COW issues
raised in [1]. This series is the basis for adjusting the COW logic once
again without re-introducing the COW issue fixed in this series and
without reintroducing the issues we saw with the original CVE fix
(e.g., breaking ptrace on pmem). There might be further parts to improve
the GUP long-term <-> MM synchronicity and to optimize some things
around that.


The idea is by Andrea, and some patches are rewritten versions of prototype
patches by Andrea. I cross-compiled and tested as well as possible.

I'll CC locking+selftest folks only on the relevant patch and the cover
letter to minimize the noise. I'll put everyone on CC who was either
involved with the COW issues in the past or attended the linux-mm alignment
session on Wednesday. Apologies if I forget anyone :)

[1] https://lore.kernel.org/r/[email protected]


David Hildenbrand (11):
seqlock: provide lockdep-free raw_seqcount_t variant
mm: thp: consolidate mapcount logic on THP split
mm: simplify hugetlb and file-THP handling in __page_mapcount()
mm: thp: simplify total_mapcount()
mm: thp: allow for reading the THP mapcount atomically via a
raw_seqlock_t
mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)
mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required
(!hugetlb)
mm: hugetlb: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE
mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required
(hugetlb)
mm: thp: introduce and use page_trans_huge_anon_shared()
selftests/vm: add tests for the known COW security issues

Documentation/locking/seqlock.rst | 50 ++++
include/linux/huge_mm.h | 72 +++++
include/linux/mm.h | 14 +
include/linux/mm_types.h | 9 +
include/linux/seqlock.h | 145 +++++++---
mm/gup.c | 89 +++++-
mm/huge_memory.c | 120 +++++++--
mm/hugetlb.c | 129 +++++++--
mm/memory.c | 136 ++++++++--
mm/rmap.c | 40 +--
mm/swapfile.c | 35 ++-
mm/util.c | 24 +-
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/gup_cow.c | 312 ++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests.sh | 16 ++
15 files changed, 1044 insertions(+), 148 deletions(-)
create mode 100644 tools/testing/selftests/vm/gup_cow.c

--
2.31.1



2021-12-17 11:32:58

by David Hildenbrand

Subject: [PATCH v1 01/11] seqlock: provide lockdep-free raw_seqcount_t variant

Sometimes it is required to have a seqcount implementation that uses
a structure with a fixed and minimal size -- just a bare unsigned int --
independent of the kernel configuration. This is especially valuable, when
the raw_ variants of the seqlock function will be used and the additional
lockdep part of the seqcount_t structure remains essentially unused.

Let's provide a lockdep-free raw_seqcount_t variant that can be used via
the raw functions to have a basic seqlock.

The target use case is embedding a raw_seqcount_t in the "struct page",
where we really want a minimal size and cannot tolerate a sudden growth of
the seqcount_t structure resulting in a significant "struct page"
increase or even a layout change.

Provide raw_read_seqcount_retry(), to make it easy to match to
raw_read_seqcount_begin() in the code.

Let's add short documentation as well.

Note: There might be other possible users for raw_seqcount_t where the
lockdep part might be completely unused and just wastes memory --
essentially any users that only use the raw_ function variants.
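
As a quick illustration of the intended usage (a sketch, not part of this
patch; the example structure is made up):

struct foo {
	raw_seqcount_t seq;	/* always a bare unsigned int, no lockdep map */
	unsigned long a, b;
};

/* Writer side: must be serialized externally, e.g., by a spinlock. */
static void foo_update(struct foo *f, unsigned long a, unsigned long b)
{
	raw_write_seqcount_begin(&f->seq);
	f->a = a;
	f->b = b;
	raw_write_seqcount_end(&f->seq);
}

/* Reader side: lockless, retries if it raced with a writer. */
static void foo_read(struct foo *f, unsigned long *a, unsigned long *b)
{
	unsigned int seq;

	do {
		seq = raw_read_seqcount_begin(&f->seq);
		*a = f->a;
		*b = f->b;
	} while (raw_read_seqcount_retry(&f->seq, seq));
}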

Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Acked-by: Peter Xu <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
Documentation/locking/seqlock.rst | 50 +++++++++++
include/linux/seqlock.h | 145 +++++++++++++++++++++++-------
2 files changed, 162 insertions(+), 33 deletions(-)

diff --git a/Documentation/locking/seqlock.rst b/Documentation/locking/seqlock.rst
index 64405e5da63e..6f66ae29cc07 100644
--- a/Documentation/locking/seqlock.rst
+++ b/Documentation/locking/seqlock.rst
@@ -87,6 +87,56 @@ Read path::
} while (read_seqcount_retry(&foo_seqcount, seq));


+Raw sequence counters (``raw_seqcount_t``)
+==========================================
+
+This is the raw counting mechanism, which does not protect against multiple
+writers and does not perform any lockdep tracking. Write side critical sections
+must thus be serialized by an external lock.
+
+It is primarily useful when a fixed, minimal sequence counter size is
+required and the lockdep overhead cannot be tolerated or is unused.
+Prefer using a :ref:`seqcount_t`, a :ref:`seqlock_t` or a
+:ref:`seqcount_locktype_t` if possible.
+
+The raw sequence counter is very similar to the :ref:`seqcount_t`, however,
+it can only be used with functions that don't perform any implicit lockdep
+tracking: primarily the *raw* function variants.
+
+Initialization::
+
+ /* dynamic */
+ raw_seqcount_t foo_seqcount;
+ raw_seqcount_init(&foo_seqcount);
+
+ /* static */
+ static raw_seqcount_t foo_seqcount = RAW_SEQCNT_ZERO(foo_seqcount);
+
+ /* C99 struct init */
+ struct {
+ .seq = RAW_SEQCNT_ZERO(foo.seq),
+ } foo;
+
+Write path::
+
+ /* Serialized context with disabled preemption */
+
+ raw_write_seqcount_begin(&foo_seqcount);
+
+ /* ... [[write-side critical section]] ... */
+
+ raw_write_seqcount_end(&foo_seqcount);
+
+Read path::
+
+ do {
+ seq = raw_read_seqcount_begin(&foo_seqcount);
+
+ /* ... [[read-side critical section]] ... */
+
+ } while (raw_read_seqcount_retry(&foo_seqcount, seq));
+
+
.. _seqcount_locktype_t:

Sequence counters with associated locks (``seqcount_LOCKNAME_t``)
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 37ded6b8fee6..c61fba1f9893 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -60,15 +60,27 @@
* serialization and non-preemptibility requirements, use a sequential
* lock (seqlock_t) instead.
*
+ * If it's undesired to have lockdep, especially when a fixed, minimal,
+ * structure size is required, use raw_seqcount_t along with the raw
+ * function variants.
+ *
* See Documentation/locking/seqlock.rst
*/
+
+typedef unsigned int raw_seqcount_t;
+
typedef struct seqcount {
- unsigned sequence;
+ raw_seqcount_t sequence;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
#endif
} seqcount_t;

+static inline void __raw_seqcount_init(raw_seqcount_t *s)
+{
+ *s = 0;
+}
+
static inline void __seqcount_init(seqcount_t *s, const char *name,
struct lock_class_key *key)
{
@@ -76,9 +88,15 @@ static inline void __seqcount_init(seqcount_t *s, const char *name,
* Make sure we are not reinitializing a held lock:
*/
lockdep_init_map(&s->dep_map, name, key, 0);
- s->sequence = 0;
+ __raw_seqcount_init(&s->sequence);
}

+/**
+ * raw_seqcount_init() - runtime initializer for raw_seqcount_t
+ * @s: Pointer to the raw_seqcount_t instance
+ */
+# define raw_seqcount_init(s) __raw_seqcount_init(s)
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC

# define SEQCOUNT_DEP_MAP_INIT(lockname) \
@@ -111,11 +129,16 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s)
# define seqcount_lockdep_reader_access(x)
#endif

+/**
+ * RAW_SEQCNT_ZERO() - static initializer for raw_seqcount_t
+ */
+#define RAW_SEQCNT_ZERO() 0
+
/**
* SEQCNT_ZERO() - static initializer for seqcount_t
* @name: Name of the seqcount_t instance
*/
-#define SEQCNT_ZERO(name) { .sequence = 0, SEQCOUNT_DEP_MAP_INIT(name) }
+#define SEQCNT_ZERO(name) { .sequence = RAW_SEQCNT_ZERO(), SEQCOUNT_DEP_MAP_INIT(name) }

/*
* Sequence counters with associated locks (seqcount_LOCKNAME_t)
@@ -203,6 +226,12 @@ typedef struct seqcount_##lockname { \
__SEQ_LOCK(locktype *lock); \
} seqcount_##lockname##_t; \
\
+static __always_inline raw_seqcount_t * \
+__seqprop_##lockname##_raw_ptr(seqcount_##lockname##_t *s) \
+{ \
+ return &s->seqcount.sequence; \
+} \
+ \
static __always_inline seqcount_t * \
__seqprop_##lockname##_ptr(seqcount_##lockname##_t *s) \
{ \
@@ -247,10 +276,45 @@ __seqprop_##lockname##_assert(const seqcount_##lockname##_t *s) \
__SEQ_LOCK(lockdep_assert_held(lockmember)); \
}

+/*
+ * __raw_seqprop() for raw_seqcount_t
+ */
+
+static inline raw_seqcount_t *__raw_seqprop_raw_ptr(raw_seqcount_t *s)
+{
+ return s;
+}
+
+static inline seqcount_t *__raw_seqprop_ptr(raw_seqcount_t *s)
+{
+ BUILD_BUG();
+ return NULL;
+}
+
+static inline unsigned int __raw_seqprop_sequence(const raw_seqcount_t *s)
+{
+ return READ_ONCE(*s);
+}
+
+static inline bool __raw_seqprop_preemptible(const raw_seqcount_t *s)
+{
+ return false;
+}
+
+static inline void __raw_seqprop_assert(const raw_seqcount_t *s)
+{
+ lockdep_assert_preemption_disabled();
+}
+
/*
* __seqprop() for seqcount_t
*/

+static inline raw_seqcount_t *__seqprop_raw_ptr(seqcount_t *s)
+{
+ return &s->sequence;
+}
+
static inline seqcount_t *__seqprop_ptr(seqcount_t *s)
{
return s;
@@ -300,6 +364,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu
seqcount_##lockname##_t: __seqprop_##lockname##_##prop((void *)(s))

#define __seqprop(s, prop) _Generic(*(s), \
+ raw_seqcount_t: __raw_seqprop_##prop((void *)(s)), \
seqcount_t: __seqprop_##prop((void *)(s)), \
__seqprop_case((s), raw_spinlock, prop), \
__seqprop_case((s), spinlock, prop), \
@@ -307,6 +372,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu
__seqprop_case((s), mutex, prop), \
__seqprop_case((s), ww_mutex, prop))

+#define seqprop_raw_ptr(s) __seqprop(s, raw_ptr)
#define seqprop_ptr(s) __seqprop(s, ptr)
#define seqprop_sequence(s) __seqprop(s, sequence)
#define seqprop_preemptible(s) __seqprop(s, preemptible)
@@ -314,7 +380,8 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu

/**
* __read_seqcount_begin() - begin a seqcount_t read section w/o barrier
- * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
+ * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t
+ * variants
*
* __read_seqcount_begin is like read_seqcount_begin, but has no smp_rmb()
* barrier. Callers should ensure that smp_rmb() or equivalent ordering is
@@ -339,7 +406,8 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu

/**
* raw_read_seqcount_begin() - begin a seqcount_t read section w/o lockdep
- * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
+ * @s: Pointer to seqcount_t, raw_seqcount_t or any of the
+ * seqcount_LOCKNAME_t variants
*
* Return: count to be passed to read_seqcount_retry()
*/
@@ -365,7 +433,8 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu

/**
* raw_read_seqcount() - read the raw seqcount_t counter value
- * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
+ * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t
+ * variants
*
* raw_read_seqcount opens a read critical section of the given
* seqcount_t, without any lockdep checking, and without checking or
@@ -386,7 +455,8 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu
/**
* raw_seqcount_begin() - begin a seqcount_t read critical section w/o
* lockdep and w/o counter stabilization
- * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
+ * @s: Pointer to seqcount_t, raw_seqcount_t, or any of the seqcount_LOCKNAME_t
+ * variants
*
* raw_seqcount_begin opens a read critical section of the given
* seqcount_t. Unlike read_seqcount_begin(), this function will not wait
@@ -411,7 +481,8 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu

/**
* __read_seqcount_retry() - end a seqcount_t read section w/o barrier
- * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
+ * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t
+ * variants
* @start: count, from read_seqcount_begin()
*
* __read_seqcount_retry is like read_seqcount_retry, but has no smp_rmb()
@@ -425,17 +496,19 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu
* Return: true if a read section retry is required, else false
*/
#define __read_seqcount_retry(s, start) \
- do___read_seqcount_retry(seqprop_ptr(s), start)
+ do___read_seqcount_retry(seqprop_raw_ptr(s), start)

-static inline int do___read_seqcount_retry(const seqcount_t *s, unsigned start)
+static inline int do___read_seqcount_retry(const raw_seqcount_t *s,
+ unsigned int start)
{
kcsan_atomic_next(0);
- return unlikely(READ_ONCE(s->sequence) != start);
+ return unlikely(READ_ONCE(*s) != start);
}

/**
* read_seqcount_retry() - end a seqcount_t read critical section
- * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
+ * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t
+ * variants
* @start: count, from read_seqcount_begin()
*
* read_seqcount_retry closes the read critical section of given
@@ -445,9 +518,11 @@ static inline int do___read_seqcount_retry(const seqcount_t *s, unsigned start)
* Return: true if a read section retry is required, else false
*/
#define read_seqcount_retry(s, start) \
- do_read_seqcount_retry(seqprop_ptr(s), start)
+ do_read_seqcount_retry(seqprop_raw_ptr(s), start)
+#define raw_read_seqcount_retry(s, start) read_seqcount_retry(s, start)

-static inline int do_read_seqcount_retry(const seqcount_t *s, unsigned start)
+static inline int do_read_seqcount_retry(const raw_seqcount_t *s,
+ unsigned int start)
{
smp_rmb();
return do___read_seqcount_retry(s, start);
@@ -455,7 +530,8 @@ static inline int do_read_seqcount_retry(const seqcount_t *s, unsigned start)

/**
* raw_write_seqcount_begin() - start a seqcount_t write section w/o lockdep
- * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
+ * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t
+ * variants
*
* Context: check write_seqcount_begin()
*/
@@ -464,34 +540,35 @@ do { \
if (seqprop_preemptible(s)) \
preempt_disable(); \
\
- do_raw_write_seqcount_begin(seqprop_ptr(s)); \
+ do_raw_write_seqcount_begin(seqprop_raw_ptr(s)); \
} while (0)

-static inline void do_raw_write_seqcount_begin(seqcount_t *s)
+static inline void do_raw_write_seqcount_begin(raw_seqcount_t *s)
{
kcsan_nestable_atomic_begin();
- s->sequence++;
+ (*s)++;
smp_wmb();
}

/**
* raw_write_seqcount_end() - end a seqcount_t write section w/o lockdep
- * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
+ * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t
+ * variants
*
* Context: check write_seqcount_end()
*/
#define raw_write_seqcount_end(s) \
do { \
- do_raw_write_seqcount_end(seqprop_ptr(s)); \
+ do_raw_write_seqcount_end(seqprop_raw_ptr(s)); \
\
if (seqprop_preemptible(s)) \
preempt_enable(); \
} while (0)

-static inline void do_raw_write_seqcount_end(seqcount_t *s)
+static inline void do_raw_write_seqcount_end(raw_seqcount_t *s)
{
smp_wmb();
- s->sequence++;
+ (*s)++;
kcsan_nestable_atomic_end();
}

@@ -516,7 +593,7 @@ do { \

static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
{
- do_raw_write_seqcount_begin(s);
+ do_raw_write_seqcount_begin(&s->sequence);
seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
}

@@ -563,12 +640,13 @@ do { \
static inline void do_write_seqcount_end(seqcount_t *s)
{
seqcount_release(&s->dep_map, _RET_IP_);
- do_raw_write_seqcount_end(s);
+ do_raw_write_seqcount_end(&s->sequence);
}

/**
* raw_write_seqcount_barrier() - do a seqcount_t write barrier
- * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
+ * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t
+ * variants
*
* This can be used to provide an ordering guarantee instead of the usual
* consistency guarantee. It is one wmb cheaper, because it can collapse
@@ -608,33 +686,34 @@ static inline void do_write_seqcount_end(seqcount_t *s)
* }
*/
#define raw_write_seqcount_barrier(s) \
- do_raw_write_seqcount_barrier(seqprop_ptr(s))
+ do_raw_write_seqcount_barrier(seqprop_raw_ptr(s))

-static inline void do_raw_write_seqcount_barrier(seqcount_t *s)
+static inline void do_raw_write_seqcount_barrier(raw_seqcount_t *s)
{
kcsan_nestable_atomic_begin();
- s->sequence++;
+ (*s)++;
smp_wmb();
- s->sequence++;
+ (*s)++;
kcsan_nestable_atomic_end();
}

/**
* write_seqcount_invalidate() - invalidate in-progress seqcount_t read
* side operations
- * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
+ * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t
+ * variants
*
* After write_seqcount_invalidate, no seqcount_t read side operations
* will complete successfully and see data older than this.
*/
#define write_seqcount_invalidate(s) \
- do_write_seqcount_invalidate(seqprop_ptr(s))
+ do_write_seqcount_invalidate(seqprop_raw_ptr(s))

-static inline void do_write_seqcount_invalidate(seqcount_t *s)
+static inline void do_write_seqcount_invalidate(raw_seqcount_t *s)
{
smp_wmb();
kcsan_nestable_atomic_begin();
- s->sequence+=2;
+ (*s) += 2;
kcsan_nestable_atomic_end();
}

--
2.31.1


2021-12-17 11:33:38

by David Hildenbrand

Subject: [PATCH v1 02/11] mm: thp: consolidate mapcount logic on THP split

Let's consolidate the mapcount logic to make it easier to understand and
to prepare for further changes.

Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/huge_memory.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5483347291c..4751d03947da 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2101,21 +2101,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
pte = pte_offset_map(&_pmd, addr);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, addr, pte, entry);
- if (!pmd_migration)
- atomic_inc(&page[i]._mapcount);
pte_unmap(pte);
}

if (!pmd_migration) {
+ /* Sub-page mapcount accounting for above small mappings. */
+ int val = 1;
+
/*
* Set PG_double_map before dropping compound_mapcount to avoid
* false-negative page_mapped().
+ *
+ * The first to set PageDoubleMap() has to increment all
+ * sub-page mapcounts by one.
*/
- if (compound_mapcount(page) > 1 &&
- !TestSetPageDoubleMap(page)) {
- for (i = 0; i < HPAGE_PMD_NR; i++)
- atomic_inc(&page[i]._mapcount);
- }
+ if (compound_mapcount(page) > 1 && !TestSetPageDoubleMap(page))
+ val++;
+
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ atomic_add(val, &page[i]._mapcount);

lock_page_memcg(page);
if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
--
2.31.1


2021-12-17 11:33:46

by David Hildenbrand

Subject: [PATCH v1 03/11] mm: simplify hugetlb and file-THP handling in __page_mapcount()

Let's return early for hugetlb, which really only relies on the compound
mapcount so far and does not support PageDoubleMap() yet. Use the chance
to clean up the file-THP case to make it easier to grasp. While at it, use
head_compound_mapcount().

This is a preparation for further changes.

Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/util.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/util.c b/mm/util.c
index 741ba32a43ac..3239e75c148d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -732,15 +732,18 @@ int __page_mapcount(struct page *page)
{
int ret;

- ret = atomic_read(&page->_mapcount) + 1;
+ if (PageHuge(page))
+ return compound_mapcount(page);
/*
* For file THP page->_mapcount contains total number of mapping
* of the page: no need to look into compound_mapcount.
*/
- if (!PageAnon(page) && !PageHuge(page))
- return ret;
+ if (!PageAnon(page))
+ return atomic_read(&page->_mapcount) + 1;
+
+ ret = atomic_read(&page->_mapcount) + 1;
page = compound_head(page);
- ret += atomic_read(compound_mapcount_ptr(page)) + 1;
+ ret += head_compound_mapcount(page);
if (PageDoubleMap(page))
ret--;
return ret;
--
2.31.1


2021-12-17 11:33:54

by David Hildenbrand

Subject: [PATCH v1 04/11] mm: thp: simplify total_mapcount()

Let's simplify a bit, returning early for PageHuge() and using
head_compound_mapcount() as we only get called for head pages.

Note the VM_BUG_ON_PAGE(PageTail(page), page) check at the beginning of
total_mapcount().

This is a preparation for further changes.

Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/huge_memory.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4751d03947da..826cabcad11a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2506,12 +2506,11 @@ int total_mapcount(struct page *page)

if (likely(!PageCompound(page)))
return atomic_read(&page->_mapcount) + 1;
+ if (PageHuge(page))
+ return head_compound_mapcount(page);

- compound = compound_mapcount(page);
nr = compound_nr(page);
- if (PageHuge(page))
- return compound;
- ret = compound;
+ ret = compound = head_compound_mapcount(page);
for (i = 0; i < nr; i++)
ret += atomic_read(&page[i]._mapcount) + 1;
/* File pages has compound_mapcount included in _mapcount */
--
2.31.1


2021-12-17 11:34:01

by David Hildenbrand

Subject: [PATCH v1 05/11] mm: thp: allow for reading the THP mapcount atomically via a raw_seqlock_t

Currently, we are not able to read the mapcount of a THP atomically
without expensive locking, for example, if the THP is getting split
concurrently.

Also, we don't want mapcount readers to observe jitter on concurrent GUP
and unmapping like:
2 -> 1 -> 2 -> 1
We want to avoid such jitter and have the mapcount of a THP move in one
direction only.

The main challenge in avoiding such jitter is PageDoubleMap. If the
compound_mapcount and the tail mapcounts move in the same direction,
there is no problem. However, when the compound_mapcount is decreased and
reaches zero, the reader will initially see a decrease in the THP mapcount,
followed by PageDoubleMap being cleared and the mapcount increasing again.
The act of clearing PageDoubleMap leads readers to overestimate the mapcount
until all tail mapcounts (that the PageDoubleMap flag kept artificially
elevated) are finally released.

Introduce a raw_seqlock_t in the THP subpage at index 1 to allow reading
the THP mapcount atomically without grabbing the page lock, avoiding racing
with THP splitting or PageDoubleMap processing. For now, we only require
the seqlock for anonymous THP.

We use a PG_lock-based spinlock to synchronize the writer side. Note
that the PG_lock is located on the THP subpage at index 1, which is
unused so far.

To make especially page_mapcount() safe to call from IRQ context, as will
soon be required by GUP via get_user_pages_fast_only() in the context of
GUP-triggered unsharing of shared anonymous pages, make sure the reader
side cannot deadlock if the writer side is interrupted: disable local
interrupts on the writer side. Note that they are already disabled during
lock_page_memcg() in some configurations.
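
For illustration, the writer side then follows this pattern (sketch only;
the helpers are introduced below):

	unsigned long irq_flags;

	/* Serializes writers and keeps readers in IRQ context deadlock-free. */
	thp_mapcount_lock(head, &irq_flags);

	/* ... update compound_mapcount, sub-page mapcounts, PageDoubleMap ... */

	thp_mapcount_unlock(head, irq_flags);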

Fortunately, we do have as of now (mm/Kconfig)
config TRANSPARENT_HUGEPAGE
bool "Transparent Hugepage Support"
depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT
so the disabling of interrupts in our case in particular has no effect
on PREEMPT_RT, which is good.

We don't need this type of locking on the THP freeing path: Once the
compound_mapcount of an anonymous THP drops to 0, it won't suddenly
increase again, so PageDoubleMap cannot be cleared concurrently and
consequently the seqlock only needs to be taken if the PageDoubleMap
flag is found set.

Note: In the future, we could avoid disabling local interrupts on the
writer side by providing alternative functions that can be called
from IRQ context without deadlocking: These functions must not spin
but instead have to signal that locking failed. Or maybe we'll find
a way to just simplify the whole mapcount handling logic for
anonymous THP, but so far none has been identified.
Let's keep it simple for now.

This commit is based on prototype patches by Andrea.

Reported-by: Sergey Senozhatsky <[email protected]>
Reported-by: Hugh Dickins <[email protected]>
Fixes: c444eb564fb1 ("mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()")
Co-developed-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/huge_mm.h | 65 ++++++++++++++++++++++++++++++++++++++++
include/linux/mm_types.h | 9 ++++++
mm/huge_memory.c | 56 +++++++++++++++++++++++-----------
mm/rmap.c | 40 +++++++++++++++----------
mm/swapfile.c | 35 +++++++++++++---------
mm/util.c | 17 +++++++----
6 files changed, 170 insertions(+), 52 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f280f33ff223..44e02d47c65a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -318,6 +318,49 @@ static inline struct list_head *page_deferred_list(struct page *page)
return &page[2].deferred_list;
}

+static inline void thp_mapcount_seqcount_init(struct page *page)
+{
+ raw_seqcount_init(&page[1].mapcount_seqcount);
+}
+
+static inline unsigned int thp_mapcount_read_begin(struct page *page)
+{
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ return raw_read_seqcount_begin(&page[1].mapcount_seqcount);
+}
+
+static inline bool thp_mapcount_read_retry(struct page *page,
+ unsigned int seqcount)
+{
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ if (!raw_read_seqcount_retry(&page[1].mapcount_seqcount, seqcount))
+ return false;
+ cpu_relax();
+ return true;
+}
+
+static inline void thp_mapcount_lock(struct page *page,
+ unsigned long *irq_flags)
+{
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ /*
+ * Prevent deadlocks in thp_mapcount_read_begin() if it is called in IRQ
+ * context.
+ */
+ local_irq_save(*irq_flags);
+ bit_spin_lock(PG_locked, &page[1].flags);
+ raw_write_seqcount_begin(&page[1].mapcount_seqcount);
+}
+
+static inline void thp_mapcount_unlock(struct page *page,
+ unsigned long irq_flags)
+{
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ raw_write_seqcount_end(&page[1].mapcount_seqcount);
+ bit_spin_unlock(PG_locked, &page[1].flags);
+ local_irq_restore(irq_flags);
+}
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -467,6 +510,28 @@ static inline bool thp_migration_supported(void)
{
return false;
}
+
+static inline unsigned int thp_mapcount_read_begin(struct page *page)
+{
+ return 0;
+}
+
+static inline bool thp_mapcount_read_retry(struct page *page,
+ unsigned int seqcount)
+{
+ return false;
+}
+
+static inline void thp_mapcount_lock(struct page *page,
+ unsigned long *irq_flags)
+{
+}
+
+static inline void thp_mapcount_unlock(struct page *page,
+ unsigned long irq_flags)
+{
+}
+
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

/**
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c3a6e6209600..a85a2a75d4ff 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -151,6 +151,15 @@ struct page {
unsigned char compound_order;
atomic_t compound_mapcount;
unsigned int compound_nr; /* 1 << compound_order */
+ /*
+ * THP only: allow for atomic reading of the mapcount,
+ * for example when we might be racing with a concurrent
+ * THP split. Initialized for all THP but locking is
+ * so far only required for anon THP where such races
+ * apply. Write access is serialized via the
+ * PG_locked-based spinlock in the first tail page.
+ */
+ raw_seqcount_t mapcount_seqcount;
};
struct { /* Second tail page of compound page */
unsigned long _compound_pad_1; /* compound_head */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 826cabcad11a..1685821525e8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -527,6 +527,7 @@ void prep_transhuge_page(struct page *page)

INIT_LIST_HEAD(page_deferred_list(page));
set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
+ thp_mapcount_seqcount_init(page);
}

bool is_transparent_hugepage(struct page *page)
@@ -1959,11 +1960,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long haddr, bool freeze)
{
struct mm_struct *mm = vma->vm_mm;
+ unsigned long addr, irq_flags;
struct page *page;
pgtable_t pgtable;
pmd_t old_pmd, _pmd;
bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
- unsigned long addr;
int i;

VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
@@ -2108,6 +2109,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
/* Sub-page mapcount accounting for above small mappings. */
int val = 1;

+ /*
+ * lock_page_memcg() is taken before thp_mapcount_lock() in
+ * page_remove_anon_compound_rmap(), respect the same locking
+ * order.
+ */
+ lock_page_memcg(page);
+ thp_mapcount_lock(page, &irq_flags);
/*
* Set PG_double_map before dropping compound_mapcount to avoid
* false-negative page_mapped().
@@ -2121,7 +2129,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
for (i = 0; i < HPAGE_PMD_NR; i++)
atomic_add(val, &page[i]._mapcount);

- lock_page_memcg(page);
if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
/* Last compound_mapcount is gone. */
__mod_lruvec_page_state(page, NR_ANON_THPS,
@@ -2132,6 +2139,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
atomic_dec(&page[i]._mapcount);
}
}
+ thp_mapcount_unlock(page, irq_flags);
unlock_page_memcg(page);
}

@@ -2501,6 +2509,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
int total_mapcount(struct page *page)
{
int i, compound, nr, ret;
+ unsigned int seqcount;
+ bool double_map;

VM_BUG_ON_PAGE(PageTail(page), page);

@@ -2510,13 +2520,19 @@ int total_mapcount(struct page *page)
return head_compound_mapcount(page);

nr = compound_nr(page);
- ret = compound = head_compound_mapcount(page);
- for (i = 0; i < nr; i++)
- ret += atomic_read(&page[i]._mapcount) + 1;
+
+ do {
+ seqcount = thp_mapcount_read_begin(page);
+ ret = compound = head_compound_mapcount(page);
+ for (i = 0; i < nr; i++)
+ ret += atomic_read(&page[i]._mapcount) + 1;
+ double_map = PageDoubleMap(page);
+ } while (thp_mapcount_read_retry(page, seqcount));
+
/* File pages has compound_mapcount included in _mapcount */
if (!PageAnon(page))
return ret - compound * nr;
- if (PageDoubleMap(page))
+ if (double_map)
ret -= nr;
return ret;
}
@@ -2548,6 +2564,7 @@ int total_mapcount(struct page *page)
int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
{
int i, ret, _total_mapcount, mapcount;
+ unsigned int seqcount;

/* hugetlbfs shouldn't call it */
VM_BUG_ON_PAGE(PageHuge(page), page);
@@ -2561,17 +2578,22 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)

page = compound_head(page);

- _total_mapcount = ret = 0;
- for (i = 0; i < thp_nr_pages(page); i++) {
- mapcount = atomic_read(&page[i]._mapcount) + 1;
- ret = max(ret, mapcount);
- _total_mapcount += mapcount;
- }
- if (PageDoubleMap(page)) {
- ret -= 1;
- _total_mapcount -= thp_nr_pages(page);
- }
- mapcount = compound_mapcount(page);
+ do {
+ _total_mapcount = ret = 0;
+
+ seqcount = thp_mapcount_read_begin(page);
+ for (i = 0; i < thp_nr_pages(page); i++) {
+ mapcount = atomic_read(&page[i]._mapcount) + 1;
+ ret = max(ret, mapcount);
+ _total_mapcount += mapcount;
+ }
+ if (PageDoubleMap(page)) {
+ ret -= 1;
+ _total_mapcount -= thp_nr_pages(page);
+ }
+ mapcount = compound_mapcount(page);
+ } while (thp_mapcount_read_retry(page, seqcount));
+
ret += mapcount;
_total_mapcount += mapcount;
if (total_mapcount)
diff --git a/mm/rmap.c b/mm/rmap.c
index 163ac4e6bcee..0218052586e7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1294,6 +1294,7 @@ static void page_remove_file_rmap(struct page *page, bool compound)

static void page_remove_anon_compound_rmap(struct page *page)
{
+ unsigned long irq_flags;
int i, nr;

if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
@@ -1308,23 +1309,30 @@ static void page_remove_anon_compound_rmap(struct page *page)

__mod_lruvec_page_state(page, NR_ANON_THPS, -thp_nr_pages(page));

- if (TestClearPageDoubleMap(page)) {
- /*
- * Subpages can be mapped with PTEs too. Check how many of
- * them are still mapped.
- */
- for (i = 0, nr = 0; i < thp_nr_pages(page); i++) {
- if (atomic_add_negative(-1, &page[i]._mapcount))
- nr++;
- }
+ if (PageDoubleMap(page)) {
+ thp_mapcount_lock(page, &irq_flags);
+ if (TestClearPageDoubleMap(page)) {
+ /*
+ * Subpages can be mapped with PTEs too. Check how many
+ * of them are still mapped.
+ */
+ for (i = 0, nr = 0; i < thp_nr_pages(page); i++) {
+ if (atomic_add_negative(-1, &page[i]._mapcount))
+ nr++;
+ }
+ thp_mapcount_unlock(page, irq_flags);

- /*
- * Queue the page for deferred split if at least one small
- * page of the compound page is unmapped, but at least one
- * small page is still mapped.
- */
- if (nr && nr < thp_nr_pages(page))
- deferred_split_huge_page(page);
+ /*
+ * Queue the page for deferred split if at least one
+ * small page of the compound page is unmapped, but at
+ * least one small page is still mapped.
+ */
+ if (nr && nr < thp_nr_pages(page))
+ deferred_split_huge_page(page);
+ } else {
+ thp_mapcount_unlock(page, irq_flags);
+ nr = thp_nr_pages(page);
+ }
} else {
nr = thp_nr_pages(page);
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e59e08ef46e1..82aeb927a7ba 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1610,6 +1610,7 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
struct swap_cluster_info *ci = NULL;
unsigned char *map = NULL;
int mapcount, swapcount = 0;
+ unsigned int seqcount;

/* hugetlbfs shouldn't call it */
VM_BUG_ON_PAGE(PageHuge(page), page);
@@ -1625,7 +1626,6 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,

page = compound_head(page);

- _total_mapcount = _total_swapcount = map_swapcount = 0;
if (PageSwapCache(page)) {
swp_entry_t entry;

@@ -1638,21 +1638,28 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
}
if (map)
ci = lock_cluster(si, offset);
- for (i = 0; i < HPAGE_PMD_NR; i++) {
- mapcount = atomic_read(&page[i]._mapcount) + 1;
- _total_mapcount += mapcount;
- if (map) {
- swapcount = swap_count(map[offset + i]);
- _total_swapcount += swapcount;
+
+ do {
+ _total_mapcount = _total_swapcount = map_swapcount = 0;
+
+ seqcount = thp_mapcount_read_begin(page);
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ mapcount = atomic_read(&page[i]._mapcount) + 1;
+ _total_mapcount += mapcount;
+ if (map) {
+ swapcount = swap_count(map[offset + i]);
+ _total_swapcount += swapcount;
+ }
+ map_swapcount = max(map_swapcount, mapcount + swapcount);
}
- map_swapcount = max(map_swapcount, mapcount + swapcount);
- }
+ if (PageDoubleMap(page)) {
+ map_swapcount -= 1;
+ _total_mapcount -= HPAGE_PMD_NR;
+ }
+ mapcount = compound_mapcount(page);
+ } while (thp_mapcount_read_retry(page, seqcount));
+
unlock_cluster(ci);
- if (PageDoubleMap(page)) {
- map_swapcount -= 1;
- _total_mapcount -= HPAGE_PMD_NR;
- }
- mapcount = compound_mapcount(page);
map_swapcount += mapcount;
_total_mapcount += mapcount;
if (total_mapcount)
diff --git a/mm/util.c b/mm/util.c
index 3239e75c148d..f4b81c794da1 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -730,6 +730,8 @@ EXPORT_SYMBOL(folio_mapping);
/* Slow path of page_mapcount() for compound pages */
int __page_mapcount(struct page *page)
{
+ struct page *head_page;
+ unsigned int seqcount;
int ret;

if (PageHuge(page))
@@ -741,11 +743,16 @@ int __page_mapcount(struct page *page)
if (!PageAnon(page))
return atomic_read(&page->_mapcount) + 1;

- ret = atomic_read(&page->_mapcount) + 1;
- page = compound_head(page);
- ret += head_compound_mapcount(page);
- if (PageDoubleMap(page))
- ret--;
+ /* The mapcount_seqlock is so far only required for anonymous THP. */
+ head_page = compound_head(page);
+ do {
+ seqcount = thp_mapcount_read_begin(head_page);
+ ret = atomic_read(&page->_mapcount) + 1;
+ ret += head_compound_mapcount(head_page);
+ if (PageDoubleMap(head_page))
+ ret--;
+ } while (thp_mapcount_read_retry(head_page, seqcount));
+
return ret;
}
EXPORT_SYMBOL_GPL(__page_mapcount);
--
2.31.1


2021-12-17 11:34:09

by David Hildenbrand

Subject: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

FAULT_FLAG_UNSHARE is a new type of page fault applicable to COW-able
anonymous memory (including hugetlb but excluding KSM) and its purpose is
to allow for unsharing of shared anonymous pages on selected GUP *read*
access, in comparison to the traditional COW on *write* access.

In contrast to a COW, GUP-triggered unsharing will still maintain the write
protection. It will be triggered by GUP to properly prevent a child process
from finding ways via GUP to observe memory modifications of anonymous
memory of the parent process after fork().

Rename the relevant functions to make it clear whether we're dealing
with unsharing, cow, or both.

The hugetlb part will be added separately.
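
For readers who prefer the gist over the full diff: the core dispatch added
to handle_pte_fault() condenses to the following (taken from the hunk below,
slightly abbreviated):

	if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
		if (!pte_write(entry)) {
			if (vmf->flags & FAULT_FLAG_WRITE)
				return do_wp_page_cow(vmf);	/* map the copy R/W */
			return do_wp_page_unshare(vmf);		/* keep the copy R/O */
		}
		entry = pte_mkdirty(entry);
	}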

This commit is based on prototype patches by Andrea.

Co-developed-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm.h | 4 ++
mm/memory.c | 136 ++++++++++++++++++++++++++++++++++++++-------
2 files changed, 119 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..37d1fb2f865e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -436,6 +436,9 @@ extern pgprot_t protection_map[16];
* @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
* @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
* @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare a
+ * shared anonymous page (-> mapped R/O). Does not apply
+ * to KSM.
*
* About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
* whether we would allow page faults to retry by specifying these two
@@ -467,6 +470,7 @@ enum fault_flag {
FAULT_FLAG_REMOTE = 1 << 7,
FAULT_FLAG_INSTRUCTION = 1 << 8,
FAULT_FLAG_INTERRUPTIBLE = 1 << 9,
+ FAULT_FLAG_UNSHARE = 1 << 10,
};

/*
diff --git a/mm/memory.c b/mm/memory.c
index 8f1de811a1dc..7253a2ad4320 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2707,8 +2707,9 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
* read non-atomically. Before making any commitment, on those architectures
* or configurations (e.g. i386 with PAE) which might give a mix of unmatched
* parts, do_swap_page must check under lock before unmapping the pte and
- * proceeding (but do_wp_page is only called after already making such a check;
- * and do_anonymous_page can safely check later on).
+ * proceeding (but do_wp_page_cow/do_wp_page_unshare is only called after
+ * already making such a check; and do_anonymous_page can safely check later
+ * on).
*/
static inline int pte_unmap_same(struct vm_fault *vmf)
{
@@ -2726,8 +2727,8 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
return same;
}

-static inline bool cow_user_page(struct page *dst, struct page *src,
- struct vm_fault *vmf)
+static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
+ struct vm_fault *vmf)
{
bool ret;
void *kaddr;
@@ -2952,7 +2953,8 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
}

/*
- * Handle the case of a page which we actually need to copy to a new page.
+ * Handle the case of a page which we actually need to copy to a new page,
+ * either due to COW or unsharing.
*
* Called with mmap_lock locked and the old page referenced, but
* without the ptl held.
@@ -2967,7 +2969,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
* held to the old page, as well as updating the rmap.
* - In any case, unlock the PTL and drop the reference we took to the old page.
*/
-static vm_fault_t wp_page_copy(struct vm_fault *vmf)
+static vm_fault_t wp_page_copy(struct vm_fault *vmf, bool unshare)
{
struct vm_area_struct *vma = vmf->vma;
struct mm_struct *mm = vma->vm_mm;
@@ -2991,7 +2993,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
if (!new_page)
goto oom;

- if (!cow_user_page(new_page, old_page, vmf)) {
+ if (!__wp_page_copy_user(new_page, old_page, vmf)) {
/*
* COW failed, if the fault was solved by other,
* it's fine. If not, userspace would re-fault on
@@ -3033,7 +3035,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
entry = mk_pte(new_page, vma->vm_page_prot);
entry = pte_sw_mkyoung(entry);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (unlikely(unshare)) {
+ if (pte_soft_dirty(vmf->orig_pte))
+ entry = pte_mksoft_dirty(entry);
+ if (pte_uffd_wp(vmf->orig_pte))
+ entry = pte_mkuffd_wp(entry);
+ } else {
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ }

/*
* Clear the pte entry and flush it first, before updating the
@@ -3050,6 +3059,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
* mmu page tables (such as kvm shadow page tables), we want the
* new page to be mapped directly into the secondary page table.
*/
+ BUG_ON(unshare && pte_write(entry));
set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
update_mmu_cache(vma, vmf->address, vmf->pte);
if (old_page) {
@@ -3109,6 +3119,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
free_swap_cache(old_page);
put_page(old_page);
}
+ if (unlikely(unshare))
+ return 0;
return page_copied ? VM_FAULT_WRITE : 0;
oom_free_new:
put_page(new_page);
@@ -3118,6 +3130,70 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
return VM_FAULT_OOM;
}

+static __always_inline vm_fault_t wp_page_cow(struct vm_fault *vmf)
+{
+ return wp_page_copy(vmf, false);
+}
+
+static __always_inline vm_fault_t wp_page_unshare(struct vm_fault *vmf)
+{
+ return wp_page_copy(vmf, true);
+}
+
+/*
+ * This routine handles present pages, when GUP tries to take a read-only
+ * pin on a shared anonymous page. It's similar to do_wp_page_cow(), except that
+ * it keeps the pages mapped read-only and doesn't apply to KSM pages.
+ *
+ * If a parent process forks a child process, we share anonymous pages between
+ * both processes with COW semantics. Both processes will map these now shared
+ * anonymous pages read-only, and any write access triggers unsharing via COW.
+ *
+ * If the child takes a read-only pin on such a page (i.e., FOLL_WRITE is not
+ * set) and then unmaps the target page, we have:
+ *
+ * * page has mapcount == 1 and refcount > 1
+ * * page is mapped read-only into the parent
+ * * page is pinned by the child and can still be read
+ *
+ * For now, we rely on refcount > 1 to perform the COW and trigger unsharing.
+ * However, that leads to other hard-to-fix issues.
+ *
+ * GUP-triggered unsharing provides a parallel approach to trigger unsharing
+ * early, still allowing for relying on mapcount > 1 in COW code instead of on
+ * imprecise refcount > 1. Note that when we don't actually take a reference
+ * on the target page but instead use memory notifiers to synchronize to changes
+ * in the process page tables, unsharing is not required.
+ *
+ * Note that in the above scenario, it's impossible to distinguish during the
+ * write fault between:
+ *
+ * a) The parent process performed the pin and the child no longer has access
+ * to the page.
+ *
+ * b) The child process performed the pin and the child still has access to the
+ * page.
+ *
+ * In case of a), if we're dealing with a long-term read-only pin, the COW
+ * in the parent will result in the pinned page differing from the page actually
+ * mapped into the process page tables in the parent: loss of synchronicity.
+ * Therefore, we really want to perform the copy when the read-only pin happens.
+ */
+static vm_fault_t do_wp_page_unshare(struct vm_fault *vmf)
+ __releases(vmf->ptl)
+{
+ vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
+ if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
+ page_mapcount(vmf->page) > 1) {
+ get_page(vmf->page);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return wp_page_unshare(vmf);
+ }
+ vmf->page = NULL;
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
+}
+
/**
* finish_mkwrite_fault - finish page fault for a shared mapping, making PTE
* writeable once the page is prepared
@@ -3226,7 +3302,7 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
* but allow concurrent faults), with pte both mapped and locked.
* We return with mmap_lock still held, but pte unmapped and unlocked.
*/
-static vm_fault_t do_wp_page(struct vm_fault *vmf)
+static vm_fault_t do_wp_page_cow(struct vm_fault *vmf)
__releases(vmf->ptl)
{
struct vm_area_struct *vma = vmf->vma;
@@ -3258,7 +3334,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
return wp_pfn_shared(vmf);

pte_unmap_unlock(vmf->pte, vmf->ptl);
- return wp_page_copy(vmf);
+ return wp_page_cow(vmf);
}

/*
@@ -3296,7 +3372,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
get_page(vmf->page);

pte_unmap_unlock(vmf->pte, vmf->ptl);
- return wp_page_copy(vmf);
+ return wp_page_cow(vmf);
}

static void unmap_mapping_range_vma(struct vm_area_struct *vma,
@@ -3670,7 +3746,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}

if (vmf->flags & FAULT_FLAG_WRITE) {
- ret |= do_wp_page(vmf);
+ ret |= do_wp_page_cow(vmf);
if (ret & VM_FAULT_ERROR)
ret &= VM_FAULT_ERROR;
goto out;
@@ -4428,6 +4504,16 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
/* `inline' is required to avoid gcc 4.1.2 build error */
static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
{
+ if (vmf->flags & FAULT_FLAG_UNSHARE) {
+ /*
+ * We'll simply split the THP and handle unsharing on the
+ * PTE level. Unsharing only applies to anon THPs and we
+ * shouldn't ever find them inside shared mappings.
+ */
+ if (WARN_ON_ONCE(vmf->vma->vm_flags & VM_SHARED))
+ return 0;
+ goto split_fallback;
+ }
if (vma_is_anonymous(vmf->vma)) {
if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
return handle_userfault(vmf, VM_UFFD_WP);
@@ -4440,7 +4526,8 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
return ret;
}

- /* COW or write-notify handled on pte level: split pmd. */
+split_fallback:
+ /* COW, unsharing or write-notify handled on pte level: split pmd. */
__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);

return VM_FAULT_FALLBACK;
@@ -4551,8 +4638,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
return do_fault(vmf);
}

- if (!pte_present(vmf->orig_pte))
- return do_swap_page(vmf);
+ if (!pte_present(vmf->orig_pte)) {
+ if (likely(!(vmf->flags & FAULT_FLAG_UNSHARE)))
+ return do_swap_page(vmf);
+ return 0;
+ }

if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
@@ -4564,9 +4654,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
goto unlock;
}
- if (vmf->flags & FAULT_FLAG_WRITE) {
- if (!pte_write(entry))
- return do_wp_page(vmf);
+ if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
+ if (!pte_write(entry)) {
+ if (vmf->flags & FAULT_FLAG_WRITE)
+ return do_wp_page_cow(vmf);
+ else
+ return do_wp_page_unshare(vmf);
+ }
entry = pte_mkdirty(entry);
}
entry = pte_mkyoung(entry);
@@ -4607,7 +4701,6 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
.pgoff = linear_page_index(vma, address),
.gfp_mask = __get_fault_gfp_mask(vma),
};
- unsigned int dirty = flags & FAULT_FLAG_WRITE;
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
p4d_t *p4d;
@@ -4634,7 +4727,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,

/* NUMA case for anonymous PUDs would go here */

- if (dirty && !pud_write(orig_pud)) {
+ if ((flags & FAULT_FLAG_WRITE) && !pud_write(orig_pud)) {
ret = wp_huge_pud(&vmf, orig_pud);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
@@ -4672,7 +4765,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&vmf);

- if (dirty && !pmd_write(vmf.orig_pmd)) {
+ if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
+ !pmd_write(vmf.orig_pmd)) {
ret = wp_huge_pmd(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
--
2.31.1


2021-12-17 11:34:20

by David Hildenbrand

Subject: [PATCH v1 07/11] mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required (!hugetlb)

It is currently possible for a child process to observe modifications of
anonymous pages by the parent process after fork() in some cases, which is
not only a userspace visible violation of the POSIX semantics of
MAP_PRIVATE, but more importantly a real security issue.

This issue, including other related COW issues, has been summarized in [1]:
"
1. Observing Memory Modifications of Private Pages From A Child Process

Long story short: process-private memory might not be as private as you
think once you fork(): successive modifications of private memory
regions in the parent process can still be observed by the child
process, for example, by smart use of vmsplice()+munmap().

The core problem is that pinning pages readable in a child process, such
as done via the vmsplice system call, can result in a child process
observing memory modifications done in the parent process the child is
not supposed to observe. [1] contains an excellent summary and [2]
contains further details. This issue was assigned CVE-2020-29374 [9].

For this to trigger, it's required to use a fork() without subsequent
exec(), for example, as used under Android zygote. Without further
details about an application that forks less-privileged child processes,
one cannot really say what's actually affected and what's not -- see the
details section at the end of this mail for a short sshd/openssh analysis.

While commit 17839856fd58 ("gup: document and work around "COW can break
either way" issue") fixed this issue and resulted in other problems
(e.g., ptrace on pmem), commit 09854ba94c6a ("mm: do_wp_page()
simplification") re-introduced part of the problem unfortunately.

The original reproducer can be modified quite easily to use THP [3] and
make the issue appear again on upstream kernels. I modified it to use
hugetlb [4] and it triggers as well. The problem is certainly less
severe with hugetlb than with THP; it merely highlights that we still
have plenty of open holes we should be closing/fixing.

Regarding vmsplice(), the only known workaround is to disallow the
vmsplice() system call ... or disable THP and hugetlb. But who knows
what else is affected (RDMA? O_DIRECT?) to achieve the same goal -- in
the end, it's a more generic issue.
"

This security issue / MAP_PRIVATE POSIX violation was first reported by
Jann Horn on 27 May 2020 and it currently affects anonymous THP and
hugetlb.

Ordinary anonymous pages are currently not affected, because the COW logic
was changed in commit 09854ba94c6a ("mm: do_wp_page() simplification")
for them to COW on "page_count() != 1" instead of "mapcount > 1", which
unfortunately results in other COW issues, some of them documented in [1]
as well.

To fix this COW issue once and for all, introduce GUP-triggered unsharing
that can be conditionally triggered via FAULT_FLAG_UNSHARE. In contrast to
traditional COW, unsharing will leave the copied page mapped
write-protected in the page table, not having the semantics of a write
fault.

Logically, unsharing is triggered "early": as soon as GUP performs the
action that could later result in a missed COW and thereby trigger the
security issue. In contrast to the previous approach, however, unsharing
is not triggered via a write fault with its undesired side effects.

GUP triggers unsharing if all of the following conditions are met (see the
sketch after this list):
* The page is mapped R/O
* We have an anonymous page, excluding KSM
* We want to read (!FOLL_WRITE)
* Unsharing is not disabled (!FOLL_NOUNSHARE)
* We want to take a reference (FOLL_GET or FOLL_PIN)
* The page is a shared anonymous page: mapcount > 1
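
Condensed into a single check, this is roughly the following sketch; it
mirrors the gup_must_unshare() helper added below (must_unshare() is just
an illustrative name here, and the hugetlb case is still excluded by this
patch):

static bool must_unshare(unsigned int gup_flags, struct page *page,
			 bool is_head)
{
	if (gup_flags & (FOLL_WRITE | FOLL_NOUNSHARE))
		return false;	/* write access or unsharing disabled */
	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
		return false;	/* no reference is taken at all */
	if (!PageAnon(page) || PageKsm(page) || PageHuge(page))
		return false;	/* only non-KSM anonymous pages (!hugetlb) */
	return is_head ? page_trans_huge_mapcount(page, NULL) > 1 :
			 page_mapcount(page) > 1;
}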

As this patch introduces the same unsharing logic also for ordinary
PTE-mapped anonymous pages, it also paves the way to fix the other known
COW related issues documented in [1] without reintroducing the security
issue or reintroducing other issues we observed in the past (e.g., broken
ptrace on pmem).

We better leave the follow_page() API alone: it's an internal API whose
users don't expose page content to user space, and they don't expect to
get NULL for pages that are actually present -- because they usually don't
trigger faults. Introduce and use FOLL_NOUNSHARE for that purpose. We
could also think about using it for other corner cases, such as
get_dump_page().
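
For the regular GUP slow path, the resulting flow is that follow_page_mask()
reports -EMLINK for such pages and the caller responds by faulting with
FAULT_FLAG_UNSHARE and retrying; a condensed sketch of the
__get_user_pages() hunk below:

	page = follow_page_mask(vma, start, foll_flags, &ctx);
	if (!page || PTR_ERR(page) == -EMLINK) {
		/* -EMLINK: unshare the shared anonymous page, then retry */
		ret = faultin_page(vma, start, &foll_flags,
				   PTR_ERR(page) == -EMLINK, locked);
		if (!ret)
			goto retry;
	}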

Note: GUP users that use memory notifiers to synchronize with the MM
don't have to bother about unsharing: they don't actually take
a reference on the pages and are properly synchronized against MM
changes to never result in consistency issues.

Add a TODO item that the mechanism should be extended to also cover
long-term pinnings (FOLL_LONGTERM), avoiding the requirement for
FOLL_WRITE|FOLL_FORCE.

The hugetlb case will be handled separately.

This commit is based on prototype patches by Andrea.

[1] https://lore.kernel.org/r/[email protected]

Co-developed-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm.h | 10 ++++++
mm/gup.c | 90 ++++++++++++++++++++++++++++++++++++++++++++--
mm/huge_memory.c | 7 ++++
3 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 37d1fb2f865e..ebcdaed60701 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2975,6 +2975,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_NOUNSHARE 0x100000 /* don't trigger unsharing on shared anon pages */

/*
* FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
@@ -3029,6 +3030,12 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
* releasing pages: get_user_pages*() pages must be released via put_page(),
* while pin_user_pages*() pages must be released via unpin_user_page().
*
+ * FOLL_NOUNSHARE should be set when no unsharing should be triggered when
+ * eventually taking a read-only reference on a shared anonymous page, because
+ * we are sure that user space cannot use that reference for reading the page
+ * after eventually unmapping the page. FOLL_NOUNSHARE is implicitly set for the
+ * follow_page() API.
+ *
* Please see Documentation/core-api/pin_user_pages.rst for more information.
*/

@@ -3043,6 +3050,9 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
return 0;
}

+extern bool gup_must_unshare(unsigned int flags, struct page *page,
+ bool is_head);
+
typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
unsigned long size, pte_fn_t fn, void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 2c51e9748a6a..2a83388c3fb4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,53 @@ struct follow_page_context {
unsigned int page_mask;
};

+/*
+ * Indicates for which pages that are write-protected in the page table,
+ * whether GUP has to trigger unsharing via FAULT_FLAG_UNSHARE such that the
+ * GUP pin will remain consistent with the pages mapped into the page tables
+ * of the MM.
+ *
+ * This handling is required to guarantee that a child process that triggered
+ * a read-only GUP before unmapping the page of interest cannot observe
+ * modifications of shared anonymous pages with COW semantics in the parent
+ * after fork().
+ *
+ * TODO: although the security issue described no longer applies in any case,
+ * the full consistency between the pinned pages and the pages mapped into the
+ * page tables of the MM applies to short-term pinnings only. For
+ * FOLL_LONGTERM, FOLL_WRITE|FOLL_FORCE is required for now, which can be
+ * inefficient and still result in some consistency issues. Extend this
+ * mechanism to also provide full synchronicity to FOLL_LONGTERM, avoiding
+ * FOLL_WRITE|FOLL_FORCE.
+ *
+ * This function is safe to be called in IRQ context.
+ */
+bool gup_must_unshare(unsigned int flags, struct page *page, bool is_head)
+{
+ /* We only care about read faults where unsharing is desired. */
+ if (flags & (FOLL_WRITE | FOLL_NOUNSHARE))
+ return false;
+ /*
+ * We only care when the reference count of the page is to get
+ * increased. In particular, GUP users that rely on memory notifiers
+ * instead don't have to trigger unsharing.
+ */
+ if (!(flags & (FOLL_GET|FOLL_PIN)))
+ return false;
+ if (!PageAnon(page))
+ return false;
+ if (PageKsm(page))
+ return false;
+ if (PageHuge(page))
+ /* TODO: handle hugetlb as well. */
+ return false;
+ if (is_head) {
+ VM_BUG_ON(!PageTransHuge(page));
+ return page_trans_huge_mapcount(page, NULL) > 1;
+ }
+ return page_mapcount(page) > 1;
+}
+
static void hpage_pincount_add(struct page *page, int refs)
{
VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
@@ -543,6 +590,14 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
}
}

+ /*
+ * If unsharing is required, keep retrying to unshare until the
+ * page becomes exclusive.
+ */
+ if (!pte_write(pte) && gup_must_unshare(flags, page, false)) {
+ page = ERR_PTR(-EMLINK);
+ goto out;
+ }
/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
if (unlikely(!try_grab_page(page, flags))) {
page = ERR_PTR(-ENOMEM);
@@ -790,6 +845,11 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
* When getting pages from ZONE_DEVICE memory, the @ctx->pgmap caches
* the device's dev_pagemap metadata to avoid repeating expensive lookups.
*
+ * When getting an anonymous page and the caller has to trigger unsharing
+ * of a shared anonymous page first, -EMLINK is returned. The caller should
+ * trigger a fault with FAULT_FLAG_UNSHARE set. With FOLL_NOUNSHARE set, this
+ * function will never require unsharing and consequently never return -EMLINK.
+ *
* On output, the @ctx->page_mask is set according to the size of the page.
*
* Return: the mapped (struct page *), %NULL if no mapping exists, or
@@ -845,6 +905,12 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
if (vma_is_secretmem(vma))
return NULL;

+ /*
+ * Don't require unsharing in case we stumble over a read-only mapped,
+ * shared anonymous page: this is an internal API only and callers don't
+ * actually use it for exposing page content to user space.
+ */
+ foll_flags |= FOLL_NOUNSHARE;
page = follow_page_mask(vma, address, foll_flags, &ctx);
if (ctx.pgmap)
put_dev_pagemap(ctx.pgmap);
@@ -910,7 +976,8 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
* is, *@locked will be set to 0 and -EBUSY returned.
*/
static int faultin_page(struct vm_area_struct *vma,
- unsigned long address, unsigned int *flags, int *locked)
+ unsigned long address, unsigned int *flags, bool unshare,
+ int *locked)
{
unsigned int fault_flags = 0;
vm_fault_t ret;
@@ -935,6 +1002,12 @@ static int faultin_page(struct vm_area_struct *vma,
*/
fault_flags |= FAULT_FLAG_TRIED;
}
+ if (unshare) {
+ VM_BUG_ON(unshare && *flags & FOLL_NOUNSHARE);
+ fault_flags |= FAULT_FLAG_UNSHARE;
+ /* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */
+ VM_BUG_ON(fault_flags & FAULT_FLAG_WRITE);
+ }

ret = handle_mm_fault(vma, address, fault_flags, NULL);
if (ret & VM_FAULT_ERROR) {
@@ -1156,8 +1229,9 @@ static long __get_user_pages(struct mm_struct *mm,
cond_resched();

page = follow_page_mask(vma, start, foll_flags, &ctx);
- if (!page) {
- ret = faultin_page(vma, start, &foll_flags, locked);
+ if (!page || PTR_ERR(page) == -EMLINK) {
+ ret = faultin_page(vma, start, &foll_flags,
+ PTR_ERR(page) == -EMLINK, locked);
switch (ret) {
case 0:
goto retry;
@@ -2311,6 +2385,11 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
goto pte_unmap;
}

+ if (!pte_write(pte) && gup_must_unshare(flags, page, false)) {
+ put_compound_head(head, 1, flags);
+ goto pte_unmap;
+ }
+
VM_BUG_ON_PAGE(compound_head(page) != head, page);

/*
@@ -2554,6 +2633,11 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
return 0;
}

+ if (!pmd_write(orig) && gup_must_unshare(flags, head, true)) {
+ put_compound_head(head, refs, flags);
+ return 0;
+ }
+
*nr += refs;
SetPageReferenced(head);
return 1;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1685821525e8..57842e8b13d4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1375,6 +1375,13 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);

+ /*
+ * If unsharing is required, keep retrying to unshare until the
+ * page becomes exclusive.
+ */
+ if (!pmd_write(*pmd) && gup_must_unshare(flags, page, true))
+ return ERR_PTR(-EMLINK);
+
if (!try_grab_page(page, flags))
return ERR_PTR(-ENOMEM);

--
2.31.1


2021-12-17 11:34:22

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 08/11] mm: hugetlb: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE

Let's support FAULT_FLAG_UNSHARE to implement GUP-triggered unsharing,
preparing for its use in the GUP paths whenever there is a need to unshare
a shared anonymous hugetlb page.

The next patch will make use of it by setting FAULT_FLAG_UNSHARE once GUP
detects that unsharing is necessary.

This commit is based on a prototype patch by Andrea.

Co-developed-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/hugetlb.c | 86 ++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 63 insertions(+), 23 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a1baa198519a..5f2863b046ef 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5130,14 +5130,15 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
}

/*
- * Hugetlb_cow() should be called with page lock of the original hugepage held.
+ * __wp_hugetlb() should be called with page lock of the original hugepage held.
* Called with hugetlb_fault_mutex_table held and pte_page locked so we
* cannot race with other handlers or page migration.
* Keep the pte_same checks anyway to make transition from the mutex easier.
*/
-static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *ptep,
- struct page *pagecache_page, spinlock_t *ptl)
+static __always_inline vm_fault_t
+__wp_hugetlb(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep, struct page *pagecache_page,
+ spinlock_t *ptl, bool unshare)
{
pte_t pte;
struct hstate *h = hstate_vma(vma);
@@ -5151,11 +5152,21 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
old_page = pte_page(pte);

retry_avoidcopy:
- /* If no-one else is actually using this page, avoid the copy
- * and just make the page writable */
- if (page_mapcount(old_page) == 1 && PageAnon(old_page)) {
- page_move_anon_rmap(old_page, vma);
- set_huge_ptep_writable(vma, haddr, ptep);
+ if (!unshare) {
+ /*
+ * If no-one else is actually using this page, avoid the copy
+ * and just make the page writable.
+ */
+ if (page_mapcount(old_page) == 1 && PageAnon(old_page)) {
+ page_move_anon_rmap(old_page, vma);
+ set_huge_ptep_writable(vma, haddr, ptep);
+ return 0;
+ }
+ } else if (!PageAnon(old_page) || page_mapcount(old_page) == 1) {
+ /*
+ * GUP-triggered unsharing only applies to shared anonymous
+ * pages. If that no longer applies, there is nothing to do.
+ */
return 0;
}

@@ -5256,11 +5267,11 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
ClearHPageRestoreReserve(new_page);

- /* Break COW */
+ /* Break COW or unshare */
huge_ptep_clear_flush(vma, haddr, ptep);
mmu_notifier_invalidate_range(mm, range.start, range.end);
set_huge_pte_at(mm, haddr, ptep,
- make_huge_pte(vma, new_page, 1));
+ make_huge_pte(vma, new_page, !unshare));
page_remove_rmap(old_page, true);
hugepage_add_new_anon_rmap(new_page, vma, haddr);
SetHPageMigratable(new_page);
@@ -5270,7 +5281,10 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
spin_unlock(ptl);
mmu_notifier_invalidate_range_end(&range);
out_release_all:
- /* No restore in case of successful pagetable update (Break COW) */
+ /*
+ * No restore in case of successful pagetable update (Break COW or
+ * unshare)
+ */
if (new_page != old_page)
restore_reserve_on_error(h, vma, haddr, new_page);
put_page(new_page);
@@ -5281,6 +5295,23 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
return ret;
}

+static vm_fault_t
+wp_hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep, struct page *pagecache_page,
+ spinlock_t *ptl)
+{
+ return __wp_hugetlb(mm, vma, address, ptep, pagecache_page, ptl,
+ false);
+}
+
+static vm_fault_t
+wp_hugetlb_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep,
+ struct page *pagecache_page, spinlock_t *ptl)
+{
+ return __wp_hugetlb(mm, vma, address, ptep, pagecache_page, ptl, true);
+}
+
/* Return the pagecache page at a given address within a VMA */
static struct page *hugetlbfs_pagecache_page(struct hstate *h,
struct vm_area_struct *vma, unsigned long address)
@@ -5393,7 +5424,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
/*
* Currently, we are forced to kill the process in the event the
* original mapper has unmapped pages from the child due to a failed
- * COW. Warn that such a situation has occurred as it may not be obvious
+ * COW/unsharing. Warn that such a situation has occurred as it may not
+ * be obvious.
*/
if (is_vma_resv_set(vma, HPAGE_RESV_UNMAPPED)) {
pr_warn_ratelimited("PID %d killed due to inadequate hugepage pool\n",
@@ -5519,7 +5551,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
hugetlb_count_add(pages_per_huge_page(h), mm);
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
/* Optimization, do the COW without a second fault */
- ret = hugetlb_cow(mm, vma, address, ptep, page, ptl);
+ ret = wp_hugetlb_cow(mm, vma, address, ptep, page, ptl);
}

spin_unlock(ptl);
@@ -5649,14 +5681,15 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_mutex;

/*
- * If we are going to COW the mapping later, we examine the pending
- * reservations for this page now. This will ensure that any
+ * If we are going to COW/unshare the mapping later, we examine the
+ * pending reservations for this page now. This will ensure that any
* allocations necessary to record that reservation occur outside the
* spinlock. For private mappings, we also lookup the pagecache
* page now as it is used to determine if a reservation has been
* consumed.
*/
- if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
+ if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
+ !huge_pte_write(entry)) {
if (vma_needs_reservation(h, vma, haddr) < 0) {
ret = VM_FAULT_OOM;
goto out_mutex;
@@ -5671,14 +5704,17 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,

ptl = huge_pte_lock(h, mm, ptep);

- /* Check for a racing update before calling hugetlb_cow */
+ /*
+ * Check for a racing update before calling wp_hugetlb_cow /
+ * wp_hugetlb_unshare
+ */
if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
goto out_ptl;

/*
- * hugetlb_cow() requires page locks of pte_page(entry) and
- * pagecache_page, so here we need take the former one
- * when page != pagecache_page or !pagecache_page.
+ * wp_hugetlb_cow()/wp_hugetlb_unshare() requires page locks of
+ * pte_page(entry) and pagecache_page, so here we need to take the former
+ * one when page != pagecache_page or !pagecache_page.
*/
page = pte_page(entry);
if (page != pagecache_page)
@@ -5691,11 +5727,15 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,

if (flags & FAULT_FLAG_WRITE) {
if (!huge_pte_write(entry)) {
- ret = hugetlb_cow(mm, vma, address, ptep,
- pagecache_page, ptl);
+ ret = wp_hugetlb_cow(mm, vma, address, ptep,
+ pagecache_page, ptl);
goto out_put_page;
}
entry = huge_pte_mkdirty(entry);
+ } else if (flags & FAULT_FLAG_UNSHARE && !huge_pte_write(entry)) {
+ ret = wp_hugetlb_unshare(mm, vma, address, ptep, pagecache_page,
+ ptl);
+ goto out_put_page;
}
entry = pte_mkyoung(entry);
if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
--
2.31.1


2021-12-17 11:34:28

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 09/11] mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required (hugetlb)

Similar to the !hugetlb variant, invoke unsharing for shared anonymous
pages when required during GUP by setting FAULT_FLAG_UNSHARE in the
hugetlb code as well.

FAULT_FLAG_UNSHARE will trigger unsharing of shared anonymous pages during
GUP, resulting in a child process no longer being able to observe memory
modifications performed by the parent after fork() to anonymous shared
hugetlb pages.

This commit is based on prototype patches by Andrea.

Co-developed-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/gup.c | 3 +--
mm/hugetlb.c | 43 +++++++++++++++++++++++++++++++++++++++----
2 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 2a83388c3fb4..35d1b28e3829 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -67,8 +67,7 @@ bool gup_must_unshare(unsigned int flags, struct page *page, bool is_head)
if (PageKsm(page))
return false;
if (PageHuge(page))
- /* TODO: handle hugetlb as well. */
- return false;
+ return __page_mapcount(page) > 1;
if (is_head) {
VM_BUG_ON(!PageTransHuge(page));
return page_trans_huge_mapcount(page, NULL) > 1;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5f2863b046ef..dc42018ee1a6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5971,6 +5971,25 @@ static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
}
}

+static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte,
+ bool *unshare)
+{
+ pte_t pteval = huge_ptep_get(pte);
+
+ *unshare = false;
+ if (is_swap_pte(pteval))
+ return true;
+ if (huge_pte_write(pteval))
+ return false;
+ if (flags & FOLL_WRITE)
+ return true;
+ if (gup_must_unshare(flags, pte_page(pteval), true)) {
+ *unshare = true;
+ return true;
+ }
+ return false;
+}
+
long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
unsigned long *position, unsigned long *nr_pages,
@@ -5985,6 +6004,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
while (vaddr < vma->vm_end && remainder) {
pte_t *pte;
spinlock_t *ptl = NULL;
+ bool unshare;
int absent;
struct page *page;

@@ -6035,9 +6055,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
* both cases, and because we can't follow correct pages
* directly from any kind of swap entries.
*/
- if (absent || is_swap_pte(huge_ptep_get(pte)) ||
- ((flags & FOLL_WRITE) &&
- !huge_pte_write(huge_ptep_get(pte)))) {
+ if (absent ||
+ __follow_hugetlb_must_fault(flags, pte, &unshare)) {
vm_fault_t ret;
unsigned int fault_flags = 0;

@@ -6045,6 +6064,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
spin_unlock(ptl);
if (flags & FOLL_WRITE)
fault_flags |= FAULT_FLAG_WRITE;
+ else if (unshare)
+ fault_flags |= FAULT_FLAG_UNSHARE;
if (locked)
fault_flags |= FAULT_FLAG_ALLOW_RETRY |
FAULT_FLAG_KILLABLE;
@@ -6734,7 +6755,21 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
goto out;
pte = huge_ptep_get((pte_t *)pmd);
if (pte_present(pte)) {
- page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
+ struct page *head_page = pmd_page(*pmd);
+
+ /*
+ * follow_huge_pmd() is only called when coming via
+ * follow_page(), where we set FOLL_NOUNSHARE. Ordinary GUP
+ * goes via follow_hugetlb_page(), where we can properly unshare
+ * if required.
+ */
+ if (WARN_ON_ONCE(!huge_pte_write(pte) &&
+ gup_must_unshare(flags, head_page, true))) {
+ page = NULL;
+ goto out;
+ }
+
+ page = head_page + ((address & ~PMD_MASK) >> PAGE_SHIFT);
/*
* try_grab_page() should always succeed here, because: a) we
* hold the pmd (ptl) lock, and b) we've just checked that the
--
2.31.1


2021-12-17 11:34:32

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 10/11] mm: thp: introduce and use page_trans_huge_anon_shared()

Let's add an optimized way to check "page_trans_huge_mapcount() > 1"
that is allowed to break the loop early.
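
The idea, condensed into a sketch of the page_trans_huge_anon_shared()
helper added below (anon_thp_shared() is just an illustrative name; the
real implementation additionally runs under the THP mapcount seqcount so
that the snapshot is consistent):

static bool anon_thp_shared(struct page *head)
{
	int i, mapcount = head_compound_mapcount(head);

	if (mapcount > 1)		/* PMD-mapped by more than one process */
		return true;
	if (PageDoubleMap(head))	/* subpage mapcounts include the compound one */
		mapcount -= 1;
	for (i = 0; i < thp_nr_pages(head); i++)
		if (atomic_read(&head[i]._mapcount) + mapcount + 1 > 1)
			return true;	/* break early on the first shared subpage */
	return false;
}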

This commit is based on a prototype patch by Andrea.

Co-developed-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/huge_mm.h | 7 +++++++
mm/gup.c | 2 +-
mm/huge_memory.c | 34 ++++++++++++++++++++++++++++++++++
3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 44e02d47c65a..3a9d8cf64219 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -361,6 +361,8 @@ static inline void thp_mapcount_unlock(struct page *page,
local_irq_restore(irq_flags);
}

+extern bool page_trans_huge_anon_shared(struct page *page);
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -532,6 +534,11 @@ static inline void thp_mapcount_unlock(struct page *page,
{
}

+static inline bool page_trans_huge_anon_shared(struct page *page)
+{
+ return false;
+}
+
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

/**
diff --git a/mm/gup.c b/mm/gup.c
index 35d1b28e3829..496575ff9ac8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -70,7 +70,7 @@ bool gup_must_unshare(unsigned int flags, struct page *page, bool is_head)
return __page_mapcount(page) > 1;
if (is_head) {
VM_BUG_ON(!PageTransHuge(page));
- return page_trans_huge_mapcount(page, NULL) > 1;
+ return page_trans_huge_anon_shared(page);
}
return page_mapcount(page) > 1;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57842e8b13d4..dced82274f1d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1281,6 +1281,40 @@ void huge_pmd_set_accessed(struct vm_fault *vmf)
spin_unlock(vmf->ptl);
}

+
+static bool __page_trans_huge_anon_shared(struct page *page)
+{
+ int i, mapcount;
+
+ mapcount = head_compound_mapcount(page);
+ if (mapcount > 1)
+ return true;
+ if (PageDoubleMap(page))
+ mapcount -= 1;
+ for (i = 0; i < thp_nr_pages(page); i++) {
+ if (atomic_read(&page[i]._mapcount) + mapcount + 1 > 1)
+ return true;
+ }
+ return false;
+}
+
+/* A lightweight check corresponding to "page_trans_huge_mapcount() > 1". */
+bool page_trans_huge_anon_shared(struct page *page)
+{
+ unsigned int seqcount;
+ bool shared;
+
+ VM_BUG_ON_PAGE(PageHuge(page) || PageTail(page), page);
+ VM_BUG_ON_PAGE(!PageAnon(page) || !PageTransHuge(page), page);
+
+ do {
+ seqcount = thp_mapcount_read_begin(page);
+ shared = __page_trans_huge_anon_shared(page);
+ } while (thp_mapcount_read_retry(page, seqcount));
+
+ return shared;
+}
+
vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
--
2.31.1


2021-12-17 11:34:44

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v1 11/11] selftests/vm: add tests for the known COW security issues

Let's make sure the security issue / MAP_PRIVATE violation of POSIX
semantics doesn't reappear, using variations of the original vmsplice
reproducer. Ideally, we'd also test some more cases with R/O long-term
pinnings -- but the existing mechanisms like RDMA or VFIO require rather
complicated setups not suitable for simple selftests.
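
Each test exercises the same reproducer pattern, condensed here from the
child_vmsplice_fn() / test_child_ro_gup() pair below (fds[], mem, size,
old and new refer to the variables used there):

	/* child: take a R/O pin on the private page, then unmap it */
	vmsplice(fds[1], &iov, 1, 0);
	munmap(mem, size);
	/* parent: modify its private page after the child unmapped it */
	memset(mem, 0xff, size);
	/* child: reading the pinned data back must not show the parent's write */
	read(fds[0], new, size);
	memcmp(old, new, size);	/* != 0 means the modification leaked: test fails */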

In the future we might be able to add some O_DIRECT test and maybe
extend the gup tests in the kernel accordingly.

Using barrier() is a little clunky, but "volatile" is generally frowned
upon and makes checkpatch angry.

Cc: Shuah Khan <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/gup_cow.c | 312 ++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests.sh | 16 ++
3 files changed, 329 insertions(+)
create mode 100644 tools/testing/selftests/vm/gup_cow.c

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 1607322a112c..dad6037d735f 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -27,6 +27,7 @@ CFLAGS = -Wall -I ../../../../usr/include $(EXTRA_CFLAGS)
LDLIBS = -lrt -lpthread
TEST_GEN_FILES = compaction_test
TEST_GEN_FILES += gup_test
+TEST_GEN_FILES += gup_cow
TEST_GEN_FILES += hmm-tests
TEST_GEN_FILES += hugepage-mmap
TEST_GEN_FILES += hugepage-mremap
diff --git a/tools/testing/selftests/vm/gup_cow.c b/tools/testing/selftests/vm/gup_cow.c
new file mode 100644
index 000000000000..9d44ed2ffdfc
--- /dev/null
+++ b/tools/testing/selftests/vm/gup_cow.c
@@ -0,0 +1,312 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * GUP (Get User Pages) interaction with COW (Copy On Write) tests.
+ *
+ * Copyright 2021, Red Hat, Inc.
+ *
+ * Author(s): David Hildenbrand <[email protected]>
+ */
+#define _GNU_SOURCE
+#include <stdlib.h>
+#include <string.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+
+#include "../kselftest.h"
+
+#define barrier() asm volatile("" ::: "memory")
+
+static size_t pagesize;
+static size_t thpsize;
+static size_t hugetlbsize;
+
+struct shared_mem {
+ bool parent_ready;
+ bool child_ready;
+};
+struct shared_mem *shared;
+
+static size_t detect_thpsize(void)
+{
+ int fd = open("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size",
+ O_RDONLY);
+ size_t size = 0;
+ char buf[15];
+ int ret;
+
+ if (fd < 0)
+ return 0;
+
+ ret = pread(fd, buf, sizeof(buf), 0);
+ if (ret < 0 || ret == sizeof(buf))
+ goto out;
+ buf[ret] = 0;
+
+ size = strtoul(buf, NULL, 10);
+out:
+ close(fd);
+ if (size < pagesize)
+ size = 0;
+ return size;
+}
+
+static uint64_t pagemap_get_entry(int fd, void *addr)
+{
+ const unsigned long pfn = (unsigned long)addr / pagesize;
+ uint64_t entry;
+ int ret;
+
+ ret = pread(fd, &entry, sizeof(entry), pfn * sizeof(entry));
+ if (ret != sizeof(entry))
+ ksft_exit_fail_msg("reading pagemap failed\n");
+ return entry;
+}
+
+static bool page_is_populated(void *addr)
+{
+ int fd = open("/proc/self/pagemap", O_RDONLY);
+ uint64_t entry;
+ bool ret;
+
+ if (fd < 0)
+ ksft_exit_fail_msg("opening pagemap failed\n");
+
+ /* Present or swapped. */
+ entry = pagemap_get_entry(fd, addr);
+ ret = !!(entry & 0xc000000000000000ull);
+ close(fd);
+ return ret;
+}
+
+static int child_vmsplice_fn(unsigned char *mem, size_t size)
+{
+ struct iovec iov = {
+ .iov_base = mem,
+ .iov_len = size,
+ };
+ ssize_t cur, total, transferred;
+ char *old, *new;
+ int fds[2];
+
+ old = malloc(size);
+ new = malloc(size);
+
+ /* Backup the original content. */
+ memcpy(old, mem, size);
+
+ if (pipe(fds) < 0)
+ return -errno;
+
+ /* Trigger a read-only pin. */
+ transferred = vmsplice(fds[1], &iov, 1, 0);
+ if (transferred < 0)
+ return -errno;
+ if (transferred == 0)
+ return -EINVAL;
+
+ /* Unmap it from our page tables. */
+ if (munmap(mem, size) < 0)
+ return -errno;
+
+ /* Wait until the parent modified it. */
+ barrier();
+ shared->child_ready = true;
+ barrier();
+ while (!shared->parent_ready)
+ barrier();
+ barrier();
+
+ /* See if we still read the old values. */
+ total = 0;
+ while (total < transferred) {
+ cur = read(fds[0], new + total, transferred - total);
+ if (cur < 0)
+ return -errno;
+ total += cur;
+ }
+
+ return memcmp(old, new, transferred);
+}
+
+static void test_child_ro_gup(unsigned char *mem, size_t size)
+{
+ int ret;
+
+ /* Populate the page. */
+ memset(mem, 0, size);
+
+ shared->parent_ready = false;
+ shared->child_ready = false;
+ barrier();
+
+ ret = fork();
+ if (ret < 0) {
+ ksft_exit_fail_msg("fork failed\n");
+ } else if (!ret) {
+ ret = child_vmsplice_fn(mem, size);
+ exit(ret);
+ }
+
+ barrier();
+ while (!shared->child_ready)
+ barrier();
+ /* Modify the page. */
+ barrier();
+ memset(mem, 0xff, size);
+ barrier();
+ shared->parent_ready = true;
+
+ wait(&ret);
+ if (WIFEXITED(ret))
+ ret = WEXITSTATUS(ret);
+ else
+ ret = -EINVAL;
+
+ ksft_test_result(!ret, "child has correct MAP_PRIVATE semantics\n");
+}
+
+static void test_anon_ro_gup_child(void)
+{
+ unsigned char *mem;
+ int ret;
+
+ ksft_print_msg("[RUN] %s\n", __func__);
+
+ mem = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (mem == MAP_FAILED) {
+ ksft_test_result_fail("mmap failed\n");
+ return;
+ }
+
+ ret = madvise(mem, pagesize, MADV_NOHUGEPAGE);
+ /* Ignore if MADV_NOHUGEPAGE is not supported by this kernel. */
+ if (ret && errno != EINVAL) {
+ ksft_test_result_fail("madvise failed\n");
+ goto out;
+ }
+
+ test_child_ro_gup(mem, pagesize);
+out:
+ munmap(mem, pagesize);
+}
+
+static void test_anon_thp_ro_gup_child(void)
+{
+ unsigned char *mem, *mmap_mem;
+ size_t mmap_size;
+ int ret;
+
+ ksft_print_msg("[RUN] %s\n", __func__);
+
+ if (!thpsize) {
+ ksft_test_result_skip("THP size not detected\n");
+ return;
+ }
+
+ mmap_size = 2 * thpsize;
+ mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (mmap_mem == MAP_FAILED) {
+ ksft_test_result_fail("mmap failed\n");
+ return;
+ }
+
+ mem = (unsigned char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1));
+
+ ret = madvise(mem, thpsize, MADV_HUGEPAGE);
+ if (ret) {
+ ksft_test_result_fail("madvise(MADV_HUGEPAGE) failed\n");
+ goto out;
+ }
+
+ /*
+ * Touch the first sub-page and test whether we get another sub-page
+ * populated.
+ */
+ mem[0] = 0;
+ if (!page_is_populated(mem + pagesize)) {
+ ksft_test_result_skip("Did not get a THP populated\n");
+ goto out;
+ }
+
+ test_child_ro_gup(mem, thpsize);
+out:
+ munmap(mmap_mem, mmap_size);
+}
+
+static void test_anon_hugetlb_ro_gup_child(void)
+{
+ unsigned char *mem, *dummy;
+
+ ksft_print_msg("[RUN] %s\n", __func__);
+
+ if (!hugetlbsize) {
+ ksft_test_result_skip("hugetlb size not detected\n");
+ return;
+ }
+
+ ksft_print_msg("[INFO] Assuming hugetlb size of %zd bytes\n",
+ hugetlbsize);
+
+ mem = mmap(NULL, hugetlbsize, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
+ if (mem == MAP_FAILED) {
+ ksft_test_result_skip("need more free huge pages\n");
+ return;
+ }
+
+ /*
+ * We need a total of two hugetlb pages to handle COW/unsharing
+ * properly.
+ */
+ dummy = mmap(NULL, hugetlbsize, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
+ if (dummy == MAP_FAILED) {
+ ksft_test_result_skip("need more free huge pages\n");
+ goto out;
+ }
+ munmap(dummy, hugetlbsize);
+
+ test_child_ro_gup(mem, hugetlbsize);
+out:
+ munmap(mem, hugetlbsize);
+}
+
+int main(int argc, char **argv)
+{
+ int err;
+
+ pagesize = getpagesize();
+ thpsize = detect_thpsize();
+ /* For simplicity, we'll rely on the thp size. */
+ hugetlbsize = thpsize;
+
+ ksft_print_header();
+ ksft_set_plan(3);
+
+ /* We need an easy way to talk to our child. */
+ shared = mmap(NULL, pagesize, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+ if (shared == MAP_FAILED)
+ ksft_exit_fail_msg("mmap(MAP_SHARED)\n");
+
+ /*
+ * Tests for the security issue reported by Jann Horn that originally
+ * resulted in CVE-2020-29374. More generally, it's a violation of
+ * POSIX MAP_PRIVATE semantics, because some other process can modify
+ * pages that are supposed to be private to one process.
+ *
+ * So let's test that process-private pages stay private using the
+ * known vmsplice reproducer.
+ */
+ test_anon_ro_gup_child();
+ test_anon_thp_ro_gup_child();
+ test_anon_hugetlb_ro_gup_child();
+
+ err = ksft_get_fail_cnt();
+ if (err)
+ ksft_exit_fail_msg("%d out of %d tests failed\n",
+ err, ksft_test_num());
+ return ksft_exit_pass();
+}
diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh
index a24d30af3094..80e441e0ae45 100755
--- a/tools/testing/selftests/vm/run_vmtests.sh
+++ b/tools/testing/selftests/vm/run_vmtests.sh
@@ -168,6 +168,22 @@ else
echo "[PASS]"
fi

+echo "--------------------------------------------------------"
+echo "running "GUP interaction with COW tests.
+echo "--------------------------------------------------------"
+./gup_cow
+ret_val=$?
+
+if [ $ret_val -eq 0 ]; then
+ echo "[PASS]"
+elif [ $ret_val -eq $ksft_skip ]; then
+ echo "[SKIP]"
+ exitcode=$ksft_skip
+else
+ echo "[FAIL]"
+ exitcode=1
+fi
+
echo "-------------------"
echo "running userfaultfd"
echo "-------------------"
--
2.31.1


2021-12-17 17:02:32

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v1 01/11] seqlock: provide lockdep-free raw_seqcount_t variant



> On Dec 17, 2021, at 3:30 AM, David Hildenbrand <[email protected]> wrote:
>
> Sometimes it is required to have a seqcount implementation that uses
> a structure with a fixed and minimal size -- just a bare unsigned int --
> independent of the kernel configuration. This is especially valuable, when
> the raw_ variants of the seqlock function will be used and the additional
> lockdep part of the seqcount_t structure remains essentially unused.
>
> Let's provide a lockdep-free raw_seqcount_t variant that can be used via
> the raw functions to have a basic seqlock.
>
> The target use case is embedding a raw_seqcount_t in the "struct page",
> where we really want a minimal size and cannot tolerate a sudden grow of
> the seqcount_t structure resulting in a significant "struct page"
> increase or even a layout change.
>
> Provide raw_read_seqcount_retry(), to make it easy to match to
> raw_read_seqcount_begin() in the code.
>
> Let's add a short documentation as well.
>
> Note: There might be other possible users for raw_seqcount_t where the
> lockdep part might be completely unused and just wastes memory --
> essentially any users that only use the raw_ function variants.
>

Is it possible to force some policy when raw_seqcount_t is used to
prevent its abuse? For instance not to allow to acquire other (certain?)
locks when it is held?

[ snip ]

> +/**
> + * raw_seqcount_init() - runtime initializer for raw_seqcount_t
> + * @s: Pointer to the raw_seqcount_t instance
> + */
> +# define raw_seqcount_init(s) __raw_seqcount_init(s)
> +
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>
> # define SEQCOUNT_DEP_MAP_INIT(lockname) \
> @@ -111,11 +129,16 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s)
> # define seqcount_lockdep_reader_access(x)
> #endif
>
> +/**
> + * RAW_SEQCNT_ZERO() - static initializer for raw_seqcount_t
> + */
> +#define RAW_SEQCNT_ZERO() 0

I am not sure why RAW_SEQCNT_ZERO() should be a function-like macro.

Moreover, the documentation showed:

+ /* static */
+ static raw_seqcount_t foo_seqcount = RAW_SEQCNT_ZERO(foo_seqcount);
+

But RAW_SEQCNT_ZERO does not have an argument?


2021-12-17 17:16:05

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v1 03/11] mm: simplify hugetlb and file-THP handling in __page_mapcount()


> On Dec 17, 2021, at 3:30 AM, David Hildenbrand <[email protected]> wrote:
>
> Let's return early for hugetlb, which really only relies on the compound
> mapcount so far and does not support PageDoubleMap() yet. Use the chance
> to cleanup the file-THP case to make it easier to grasp. While at it, use
> head_compound_mapcount().
>
> This is a preparation for further changes.

It would be useful to add “no functional change intended” or something.

>
> Reviewed-by: Peter Xu <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/util.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/mm/util.c b/mm/util.c
> index 741ba32a43ac..3239e75c148d 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page)
> {
> int ret;
>
> - ret = atomic_read(&page->_mapcount) + 1;
> + if (PageHuge(page))
> + return compound_mapcount(page);

Before you return, perhaps you can add an assertion like:

VM_BUG_ON(PageDoubleMap(page));

This would make the code clearer and would ease debugging in the
future (if support for double-map is expanded).

2021-12-17 17:29:15

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 01/11] seqlock: provide lockdep-free raw_seqcount_t variant

On 17.12.21 18:02, Nadav Amit wrote:
>
>
>> On Dec 17, 2021, at 3:30 AM, David Hildenbrand <[email protected]> wrote:
>>
>> Sometimes it is required to have a seqcount implementation that uses
>> a structure with a fixed and minimal size -- just a bare unsigned int --
>> independent of the kernel configuration. This is especially valuable, when
>> the raw_ variants of the seqlock function will be used and the additional
>> lockdep part of the seqcount_t structure remains essentially unused.
>>
>> Let's provide a lockdep-free raw_seqcount_t variant that can be used via
>> the raw functions to have a basic seqlock.
>>
>> The target use case is embedding a raw_seqcount_t in the "struct page",
>> where we really want a minimal size and cannot tolerate a sudden grow of
>> the seqcount_t structure resulting in a significant "struct page"
>> increase or even a layout change.
>>
>> Provide raw_read_seqcount_retry(), to make it easy to match to
>> raw_read_seqcount_begin() in the code.
>>
>> Let's add a short documentation as well.
>>
>> Note: There might be other possible users for raw_seqcount_t where the
>> lockdep part might be completely unused and just wastes memory --
>> essentially any users that only use the raw_ function variants.
>>
>
> Is it possible to force some policy when raw_seqcount_t is used to
> prevent its abuse? For instance not to allow to acquire other (certain?)
> locks when it is held?
>

Good question ... in this series we won't be taking additional locks on
the reader or the writer side. Something like lockdep_forbid() /
lockdep_allow() to disallow any kind of locking. I haven't heard of
anything like that, maybe someone reading along has a clue?

The writer side might be easy to handle, but some seqcount operations
that don't do the full read()->retry() cycle are problematic
(->raw_read_seqcount).

> [ snip ]
>
>> +/**
>> + * raw_seqcount_init() - runtime initializer for raw_seqcount_t
>> + * @s: Pointer to the raw_seqcount_t instance
>> + */
>> +# define raw_seqcount_init(s) __raw_seqcount_init(s)
>> +
>> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>>
>> # define SEQCOUNT_DEP_MAP_INIT(lockname) \
>> @@ -111,11 +129,16 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s)
>> # define seqcount_lockdep_reader_access(x)
>> #endif
>>
>> +/**
>> + * RAW_SEQCNT_ZERO() - static initializer for raw_seqcount_t
>> + */
>> +#define RAW_SEQCNT_ZERO() 0
>
> I am not sure why RAW_SEQCNT_ZERO() should be a function-like macro.
>

I think I just went for consistency with SEQCNT_ZERO() -- but I agree,
that can just be simplified!

Thanks!


--
Thanks,

David / dhildenb


2021-12-17 17:30:36

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 03/11] mm: simplify hugetlb and file-THP handling in __page_mapcount()

On 17.12.21 18:16, Nadav Amit wrote:
>
>> On Dec 17, 2021, at 3:30 AM, David Hildenbrand <[email protected]> wrote:
>>
>> Let's return early for hugetlb, which really only relies on the compound
>> mapcount so far and does not support PageDoubleMap() yet. Use the chance
>> to cleanup the file-THP case to make it easier to grasp. While at it, use
>> head_compound_mapcount().
>>
>> This is a preparation for further changes.
>
> It would be useful to add “no functional change intended” or something.

Absolutely, same applies to other "simplification" patches.

>
>>
>> Reviewed-by: Peter Xu <[email protected]>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> mm/util.c | 11 +++++++----
>> 1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/util.c b/mm/util.c
>> index 741ba32a43ac..3239e75c148d 100644
>> --- a/mm/util.c
>> +++ b/mm/util.c
>> @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page)
>> {
>> int ret;
>>
>> - ret = atomic_read(&page->_mapcount) + 1;
>> + if (PageHuge(page))
>> + return compound_mapcount(page);
>
> Before you return, perhaps you can add an assertion like:
>
> VM_BUG_ON(PageDoubleMap(page));
>
> This would make the code clearer and would ease debugging in the
> future (if support for double-map is expanded).
>

I'd probably have to add this to a couple of places -- and I assume
anybody working on that has to grep the kernel for use of PageDoubleMap
already.

Thanks!

--
Thanks,

David / dhildenb


2021-12-17 17:50:00

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 01/11] seqlock: provide lockdep-free raw_seqcount_t variant

On 17.12.21 18:29, David Hildenbrand wrote:
> On 17.12.21 18:02, Nadav Amit wrote:
>>
>>
>>> On Dec 17, 2021, at 3:30 AM, David Hildenbrand <[email protected]> wrote:
>>>
>>> Sometimes it is required to have a seqcount implementation that uses
>>> a structure with a fixed and minimal size -- just a bare unsigned int --
>>> independent of the kernel configuration. This is especially valuable, when
>>> the raw_ variants of the seqlock function will be used and the additional
>>> lockdep part of the seqcount_t structure remains essentially unused.
>>>
>>> Let's provide a lockdep-free raw_seqcount_t variant that can be used via
>>> the raw functions to have a basic seqlock.
>>>
>>> The target use case is embedding a raw_seqcount_t in the "struct page",
>>> where we really want a minimal size and cannot tolerate a sudden grow of
>>> the seqcount_t structure resulting in a significant "struct page"
>>> increase or even a layout change.
>>>
>>> Provide raw_read_seqcount_retry(), to make it easy to match to
>>> raw_read_seqcount_begin() in the code.
>>>
>>> Let's add a short documentation as well.
>>>
>>> Note: There might be other possible users for raw_seqcount_t where the
>>> lockdep part might be completely unused and just wastes memory --
>>> essentially any users that only use the raw_ function variants.
>>>
>>
>> Is it possible to force some policy when raw_seqcount_t is used to
>> prevent its abuse? For instance not to allow to acquire other (certain?)
>> locks when it is held?
>>
>
> Good question ... in this series we won't be taking additional locks on
> the reader or the writer side. Something like lockdep_forbid() /
> lockdep_allow() to disallow any kind of locking. I haven't heard of
> anything like that, maybe someone reading along has a clue?
>
> The writer side might be easy to handle, but some seqcount operations
> that don't do the full read()->retry() cycle are problematic
> (->raw_read_seqcount).

Sorry, I forgot to mention an important point: the raw_seqcount_t
doesn't give you any additional "power" to abuse.

You can just use the ordinary seqcount_t with the raw_ functions. One
example is mm->write_protect_seq . So whatever we would want to "invent"
should also apply to the raw_ functions in general -- which might be
undesired or impossible (IIRC IRQ context).

--
Thanks,

David / dhildenb


2021-12-17 18:01:23

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v1 01/11] seqlock: provide lockdep-free raw_seqcount_t variant



> On Dec 17, 2021, at 9:49 AM, David Hildenbrand <[email protected]> wrote:
>
> On 17.12.21 18:29, David Hildenbrand wrote:
>> On 17.12.21 18:02, Nadav Amit wrote:
>>>
>>>
>>>> On Dec 17, 2021, at 3:30 AM, David Hildenbrand <[email protected]> wrote:
>>>>
>>>> Sometimes it is required to have a seqcount implementation that uses
>>>> a structure with a fixed and minimal size -- just a bare unsigned int --
>>>> independent of the kernel configuration. This is especially valuable, when
>>>> the raw_ variants of the seqlock function will be used and the additional
>>>> lockdep part of the seqcount_t structure remains essentially unused.
>>>>
>>>> Let's provide a lockdep-free raw_seqcount_t variant that can be used via
>>>> the raw functions to have a basic seqlock.
>>>>
>>>> The target use case is embedding a raw_seqcount_t in the "struct page",
>>>> where we really want a minimal size and cannot tolerate a sudden grow of
>>>> the seqcount_t structure resulting in a significant "struct page"
>>>> increase or even a layout change.
>>>>
>>>> Provide raw_read_seqcount_retry(), to make it easy to match to
>>>> raw_read_seqcount_begin() in the code.
>>>>
>>>> Let's add a short documentation as well.
>>>>
>>>> Note: There might be other possible users for raw_seqcount_t where the
>>>> lockdep part might be completely unused and just wastes memory --
>>>> essentially any users that only use the raw_ function variants.
>>>>
>>>
>>> Is it possible to force some policy when raw_seqcount_t is used to
>>> prevent its abuse? For instance not to allow to acquire other (certain?)
>>> locks when it is held?
>>>
>>
>> Good question ... in this series we won't be taking additional locks on
>> the reader or the writer side. Something like lockdep_forbid() /
>> lockdep_allow() to disallow any kind of locking. I haven't heard of
>> anything like that, maybe someone reading along has a clue?
>>
>> The writer side might be easy to handle, but some seqcount operations
>> that don't do the full read()->retry() cycle are problematic
>> (->raw_read_seqcount).
>
> Sorry, I forgot to mention an important point: the raw_seqcount_t
> doesn't give you any additional "power" to abuse.
>
> You can just use the ordinary seqcount_t with the raw_ functions. One
> example is mm->write_protect_seq . So whatever we would want to "invent"
> should also apply to the raw_ functions in general -- which might be
> undesired or impossible (IIRC IRQ context).
>

Thanks for the clarification. I was unfamiliar with
raw_read_seqcount_begin() (and friends). Indeed it is very very rarely
used.

2021-12-17 18:11:25

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 03/11] mm: simplify hugetlb and file-THP handling in __page_mapcount()

On 17.12.21 19:06, Mike Kravetz wrote:
> On 12/17/21 03:30, David Hildenbrand wrote:
>> Let's return early for hugetlb, which really only relies on the compound
>> mapcount so far and does not support PageDoubleMap() yet. Use the chance
>
> It is too early to say if hugetlb double mapping will use PageDoubleMap().
> I do not think (hope) it will be necessary. So, I think you can drop mention
> of it here.

Desires have most certainly been expressed by a couple of parties --
to PTE-map huge pages :) Hopefully we'll find a way to avoid
PageDoubleMap, I agree.

Dropping the comment!

>
>> to cleanup the file-THP case to make it easier to grasp. While at it, use
>> head_compound_mapcount().
>>
>> This is a preparation for further changes.
>>
>> Reviewed-by: Peter Xu <[email protected]>
>> Signed-off-by: David Hildenbrand <[email protected]>
>
> Acked-by: Mike Kravetz <[email protected]>
>

Thanks!

--
Thanks,

David / dhildenb


2021-12-17 18:16:14

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v1 03/11] mm: simplify hugetlb and file-THP handling in __page_mapcount()

On 12/17/21 03:30, David Hildenbrand wrote:
> Let's return early for hugetlb, which really only relies on the compound
> mapcount so far and does not support PageDoubleMap() yet. Use the chance

It is too early to say if hugetlb double mapping will use PageDoubleMap().
I do not think (hope) it will be necessary. So, I think you can drop mention
of it here.

> to cleanup the file-THP case to make it easier to grasp. While at it, use
> head_compound_mapcount().
>
> This is a preparation for further changes.
>
> Reviewed-by: Peter Xu <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Acked-by: Mike Kravetz <[email protected]>
--
Mike Kravetz

> ---
> mm/util.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/mm/util.c b/mm/util.c
> index 741ba32a43ac..3239e75c148d 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page)
> {
> int ret;
>
> - ret = atomic_read(&page->_mapcount) + 1;
> + if (PageHuge(page))
> + return compound_mapcount(page);
> /*
> * For file THP page->_mapcount contains total number of mapping
> * of the page: no need to look into compound_mapcount.
> */
> - if (!PageAnon(page) && !PageHuge(page))
> - return ret;
> + if (!PageAnon(page))
> + return atomic_read(&page->_mapcount) + 1;
> +
> + ret = atomic_read(&page->_mapcount) + 1;
> page = compound_head(page);
> - ret += atomic_read(compound_mapcount_ptr(page)) + 1;
> + ret += head_compound_mapcount(page);
> if (PageDoubleMap(page))
> ret--;
> return ret;
>



2021-12-17 19:05:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 3:34 AM David Hildenbrand <[email protected]> wrote:
>
> + * If the child takes a read-only pin on such a page (i.e., FOLL_WRITE is not
> + * set) and then unmaps the target page, we have:
> + *
> + * * page has mapcount == 1 and refcount > 1

All these games with mapcount make me think this is still broken.

mapcount has been a horribly broken thing in the past, and I'm not
convinced it's not a broken thing now.

> + vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
> + if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
> + page_mapcount(vmf->page) > 1) {

What keeps the mapcount stable in here?

And I still believe that the whole notion that "COW should use
mapcount" is pure and utter garbage.

If we are doing a COW, we need an *exclusive* access to the page. That
is not mapcount, that is the page ref.

mapcount is insane, and I think this is making this worse again.

Linus

2021-12-17 19:06:47

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH v1 02/11] mm: thp: consolidate mapcount logic on THP split

On Fri, Dec 17, 2021 at 3:33 AM David Hildenbrand <[email protected]> wrote:
>
> Let's consolidate the mapcount logic to make it easier to understand and
> to prepare for further changes.
>
> Reviewed-by: Peter Xu <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Yang Shi <[email protected]>

> ---
> mm/huge_memory.c | 18 +++++++++++-------
> 1 file changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e5483347291c..4751d03947da 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2101,21 +2101,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> pte = pte_offset_map(&_pmd, addr);
> BUG_ON(!pte_none(*pte));
> set_pte_at(mm, addr, pte, entry);
> - if (!pmd_migration)
> - atomic_inc(&page[i]._mapcount);
> pte_unmap(pte);
> }
>
> if (!pmd_migration) {
> + /* Sub-page mapcount accounting for above small mappings. */
> + int val = 1;
> +
> /*
> * Set PG_double_map before dropping compound_mapcount to avoid
> * false-negative page_mapped().
> + *
> + * The first to set PageDoubleMap() has to increment all
> + * sub-page mapcounts by one.
> */
> - if (compound_mapcount(page) > 1 &&
> - !TestSetPageDoubleMap(page)) {
> - for (i = 0; i < HPAGE_PMD_NR; i++)
> - atomic_inc(&page[i]._mapcount);
> - }
> + if (compound_mapcount(page) > 1 && !TestSetPageDoubleMap(page))
> + val++;
> +
> + for (i = 0; i < HPAGE_PMD_NR; i++)
> + atomic_add(val, &page[i]._mapcount);
>
> lock_page_memcg(page);
> if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
> --
> 2.31.1
>

2021-12-17 19:08:08

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH v1 03/11] mm: simplify hugetlb and file-THP handling in __page_mapcount()

On Fri, Dec 17, 2021 at 3:33 AM David Hildenbrand <[email protected]> wrote:
>
> Let's return early for hugetlb, which really only relies on the compound
> mapcount so far and does not support PageDoubleMap() yet. Use the chance
> to cleanup the file-THP case to make it easier to grasp. While at it, use
> head_compound_mapcount().
>
> This is a preparation for further changes.
>
> Reviewed-by: Peter Xu <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Yang Shi <[email protected]>

> ---
> mm/util.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/mm/util.c b/mm/util.c
> index 741ba32a43ac..3239e75c148d 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page)
> {
> int ret;
>
> - ret = atomic_read(&page->_mapcount) + 1;
> + if (PageHuge(page))
> + return compound_mapcount(page);
> /*
> * For file THP page->_mapcount contains total number of mapping
> * of the page: no need to look into compound_mapcount.
> */
> - if (!PageAnon(page) && !PageHuge(page))
> - return ret;
> + if (!PageAnon(page))
> + return atomic_read(&page->_mapcount) + 1;
> +
> + ret = atomic_read(&page->_mapcount) + 1;
> page = compound_head(page);
> - ret += atomic_read(compound_mapcount_ptr(page)) + 1;
> + ret += head_compound_mapcount(page);
> if (PageDoubleMap(page))
> ret--;
> return ret;
> --
> 2.31.1
>

2021-12-17 19:12:35

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH v1 04/11] mm: thp: simlify total_mapcount()

On Fri, Dec 17, 2021 at 3:33 AM David Hildenbrand <[email protected]> wrote:
>
> Let's simplify a bit, returning for PageHuge() early and using
> head_compound_page() as we are only getting called for HEAD pages.
>
> Note the VM_BUG_ON_PAGE(PageTail(page), page) check at the beginning of
> total_mapcount().
>
> This is a preparation for further changes.
>
> Reviewed-by: Peter Xu <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Yang Shi <[email protected]>

> ---
> mm/huge_memory.c | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4751d03947da..826cabcad11a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2506,12 +2506,11 @@ int total_mapcount(struct page *page)
>
> if (likely(!PageCompound(page)))
> return atomic_read(&page->_mapcount) + 1;
> + if (PageHuge(page))
> + return head_compound_mapcount(page);
>
> - compound = compound_mapcount(page);
> nr = compound_nr(page);
> - if (PageHuge(page))
> - return compound;
> - ret = compound;
> + ret = compound = head_compound_mapcount(page);
> for (i = 0; i < nr; i++)
> ret += atomic_read(&page[i]._mapcount) + 1;
> /* File pages has compound_mapcount included in _mapcount */
> --
> 2.31.1
>

2021-12-17 19:23:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds
<[email protected]> wrote:
>
> If we are doing a COW, we need an *exclusive* access to the page. That
> is not mapcount, that is the page ref.
>
> mapcount is insane, and I think this is making this worse again.

Maybe I'm misreading this, but afaik

- get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.

This just increments the page count, because mapcount == 1.

- fork()

- unmap in the original

- child now has "mapcount == 1" on a page again, but refcount is
elevated, and child HAS TO COW before writing.

Notice? "mapcount" is complete BS. The number of times a page is
mapped is irrelevant for COW. All that matters is that we get an
exclusive access to the page before we can write to it.

Anybody who takes mapcount into account at COW time is broken, and it
worries me how this is all mixing up with the COW logic.

Now, maybe this "unshare" case is sufficiently different from COW that
it's ok to look at mapcount for FAULT_FLAG_UNSHARE, as long as it
doesn't happen for a real COW.

But honestly, for "unshare", I still don't see that the mapcount
matters. What does "mapcount == 1" mean? Why is it meaningful?

Because if COW does things right, and always breaks a COW based on
refcount, then what's the problem with taking a read-only ref to the
page whether it is mapped multiple times or mapped just once? Anybody
who already had write access to the page can write to it regardless,
and any new writers go through COW and get a new page.

I must be missing something really fundamental here, but to me it
really reads like "mapcount can fundamentally never be relevant for
COW, and if it's not relevant for COW, how can it be relevant for a
read-only copy?"

Linus

2021-12-17 20:18:08

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 17.12.21 20:22, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds
> <[email protected]> wrote:
>>
>> If we are doing a COW, we need an *exclusive* access to the page. That
>> is not mapcount, that is the page ref.
>>
>> mapcount is insane, and I think this is making this worse again.
>
> Maybe I'm misreading this, but afaik
>
> - get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.
>
> This just increments the page count, because mapcount == 1.
>
> - fork()
>
> - unmap in the original
>
> - child now has "mapcount == 1" on a page again, but refcount is
> elevated, and child HAS TO COW before writing.

Hi Linus,

This is just GUP before fork(), which is in general
problematic/incompatible with sharing. What we're concerned about in the
context of this series (see the security issue) is GUP after fork(). And
we're not changing GUP before fork() or even the COW logic in the
context of this series.

I agree that GUP before fork() has to be handled differently: during
fork(), don't share something that cannot possibly be shared in a safe
way. Don't allow COW semantics for something that is just broken with COW.

>
> Notice? "mapcount" is complete BS. The number of times a page is
> mapped is irrelevant for COW. All that matters is that we get an
> exclusive access to the page before we can write to it.

We have to be very careful about the two sides of the story: GUP before
fork and GUP after fork.

>
> Anybody who takes mapcount into account at COW time is broken, and it
> worries me how this is all mixing up with the COW logic.
>
> Now, maybe this "unshare" case is sufficiently different from COW that
> it's ok to look at mapcount for FAULT_FLAG_UNSHARE, as long as it
> doesn't happen for a real COW.
>
> But honestly, for "unshare", I still don't see that the mapcount
> matters. What does "mapcount == 1" mean? Why is it meaningful?

I'll reply to your first mail in a sec.

GUP is the problem with COW, not ordinary processes mapping a page
(mapcount), where you will only get new sharers during fork() -- in a
very controlled way. So GUP has to take care to unshare *before* taking
a reference, such that we can never reach the point of missed COW. GUP
really is the problematic bit with it all.

Without GUP, we'd be living in a wonderful world in regards to COW.

>
> Because if COW does things right, and always breaks a COW based on
> refcount, then what's the problem with taking a read-only ref to the
> page whether it is mapped multiple times or mapped just once? Anybody
> who already had write access to the page can write to it regardless,
> and any new writers go through COW and get a new page.

Let's just take a look at what refcount does *wrong*. Let's use an
adjusted version of your example above, because it's a perfect fit:

1. mem = mmap(pagesize, MAP_PRIVATE)
-> refcount == 1

2. memset(mem, 0, pagesize); /* Page is mapped R/W */

3. fork() /* Page gets mapped R/O */
-> refcount > 1

4. child quits
-> refcount == 1

5. Take a R/O pin (RDMA, VFIO, ...)
-> refcount > 1

6. memset(mem, 0xff, pagesize);
-> Write fault -> COW

And GUP sees something different than our MM -- and this is perfectly
valid, the R/O pin is just reading page content we might be modifying
afterwards. Take out 3. and 4. and it works as expected. This wouldn't
happen when relying on the mapcount.

And 3+4 can really be anything that results in a R/O mapping of an
anonymous page, even if it's just swapout followed by read fault that
maps the page R/O.
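
To make the sequence concrete, here is a minimal userspace sketch of the
six steps above. The "R/O pin" is stood in for by vmsplice(), which takes
R/O GUP references on the page; RDMA/VFIO would pin through their own
ioctls instead, but the effect on the refcount is the same:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	size_t pagesize = sysconf(_SC_PAGESIZE);
	struct iovec iov;
	int fds[2];
	char *mem;

	/* 1. + 2.: private anonymous page, mapped R/W after the memset */
	mem = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(mem, 0, pagesize);

	/* 3. + 4.: fork() maps the page R/O; the child quits right away */
	if (fork() == 0)
		_exit(0);
	wait(NULL);

	/* 5.: take a R/O GUP reference (refcount > 1, mapcount == 1) */
	pipe(fds);
	iov.iov_base = mem;
	iov.iov_len = pagesize;
	vmsplice(fds[1], &iov, 1, 0);

	/* 6.: write fault; a page_count()-based check COWs here although
	 * the page is exclusive as far as the mapcount is concerned */
	memset(mem, 0xff, pagesize);
	return 0;
}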

>
> I must be missing something really fundamental here, but to me it
> really reads like "mapcount can fundamentally never be relevant for
> COW, and if it's not relevant for COW, how can it be relevant for a
> read-only copy?"

It really is the right value to use. Only GUP is the problematic bit
that has to trigger unsharing to not mess up COW logic later. Take GUP
out of the equation and COW just works as expected with the mapcount --
as long as we can read an atomic value and synchronize against fork.
(again, still composing the other mail :) )

--
Thanks,

David / dhildenb


2021-12-17 20:37:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 12:18 PM David Hildenbrand <[email protected]> wrote:
>
> On 17.12.21 20:22, Linus Torvalds wrote:
> > On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds
> > <[email protected]> wrote:
> >>
> > - get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.
> >
> > This just increments the page count, because mapcount == 1.
> >
> > - fork()
> >
> > - unmap in the original
> >
> > - child now has "mapcount == 1" on a page again, but refcount is
> > elevated, and child HAS TO COW before writing.
>
> Hi Linus,
>
> This is just GUP before fork(), which is in general
> problematic/incompatible with sharing.

Note that my example was not meant to be an example of a problem per
se, but purely as an example of how meaningless 'mapcount' is, and how
'mapcount==1' isn't really a very meaningful test.

So it wasn't meant to show "look, GUP before fork is problematic". We
have that problem already solved at least for regular pages.

It was purely meant to show how "mapcount==1" isn't a meaningful thing
to test, and my worry about how you're adding that nonsensical test to
the new code.

> Let's just take a look at what refcount does *wrong*. Let's use an
> adjusted version of your example above, because it's a perfect fit:
>
> 1. mem = mmap(pagesize, MAP_PRIVATE)
> -> refcount == 1
>
> 2. memset(mem, 0, pagesize); /* Page is mapped R/W */
>
> 3. fork() /* Page gets mapped R/O */
> -> refcount > 1
>
> 4. child quits
> -> refcount == 1
>
> 5. Take a R/O pin (RDMA, VFIO, ...)
> -> refcount > 1
>
> 6. memset(mem, 0xff, pagesize);
> -> Write fault -> COW

I do not believe this is actually a bug.

You asked for a R/O pin, and you got one.

Then somebody else modified that page, and you got exactly what you
asked for - a COW event. The original R/O pin has the original page
that it asked for, and can read it just fine.

So what is your argument?

Linus

2021-12-17 20:40:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 12:36 PM Linus Torvalds
<[email protected]> wrote:
>
> >
> > 5. Take a R/O pin (RDMA, VFIO, ...)
> > -> refcount > 1
> >
> > 6. memset(mem, 0xff, pagesize);
> > -> Write fault -> COW
>
> I do not believe this is actually a bug.
>
> You asked for a R/O pin, and you got one.

If you want a shared pin that actually follows the changes of your
process around, then that is what you should have asked for.

At the time of such a shared pin, you can do what we already do:
re-use the page if it has a refcount of 1. Or do an early COW event
(feel free to avoid the "mark it writable and dirty").

But note: *refcount* of 1. Not "mapcount". Because mapcount would be
broken garbage.
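
Roughly, as a simplified sketch from memory (not the verbatim mm/memory.c
code, and with a made-up helper name), the reuse decision is:

#include <linux/mm.h>
#include <linux/pagemap.h>

static bool can_reuse_anon_page(struct page *page)
{
	/* somebody else holds a reference: copying is always safe */
	if (PageKsm(page) || page_count(page) != 1)
		return false;
	if (!trylock_page(page))
		return false;
	/* recheck under the page lock; e.g. the swap cache may hold a ref */
	if (PageKsm(page) || page_count(page) != 1) {
		unlock_page(page);
		return false;
	}
	unlock_page(page);
	return true;	/* exclusive: safe to just mark it writable */
}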

Linus

2021-12-17 20:42:47

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 17.12.21 21:36, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 12:18 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 17.12.21 20:22, Linus Torvalds wrote:
>>> On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds
>>> <[email protected]> wrote:
>>>>
>>> - get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.
>>>
>>> This just increments the page count, because mapcount == 1.
>>>
>>> - fork()
>>>
>>> - unmap in the original
>>>
>>> - child now has "mapcount == 1" on a page again, but refcount is
>>> elevated, and child HAS TO COW before writing.
>>
>> Hi Linus,
>>
>> This is just GUP before fork(), which is in general
>> problematic/incompatible with sharing.
>
> Note that my example was not meant to be an example of a problem per
> se, but purely as an example of how meaningless 'mapcount' is, and how
> 'mapcount==1' isn't really a very meaningful test.
>
> So it wasn't meant to show "look, GUP before fork is problematic". We
> have that problem already solved at least for regular pages.
>
> It was purely meant to show how "mapcount==1" isn't a meaningful thing
> to test, and my worry about how you're adding that nonsensical test to
> the new code.
>
>> Let's just take a look at what refcount does *wrong*. Let's use an
>> adjusted version of your example above, because it's a perfect fit:
>>
>> 1. mem = mmap(pagesize, MAP_PRIVATE)
>> -> refcount == 1
>>
>> 2. memset(mem, 0, pagesize); /* Page is mapped R/W */
>>
>> 3. fork() /* Page gets mapped R/O */
>> -> refcount > 1
>>
>> 4. child quits
>> -> refcount == 1
>>
>> 5. Take a R/O pin (RDMA, VFIO, ...)
>> -> refcount > 1
>>
>> 6. memset(mem, 0xff, pagesize);
>> -> Write fault -> COW
>
> I do not believe this is actually a bug.

It's debatable if it's a BUG or not (I think it is one). It's for sure
inconsistent.

>
> You asked for a R/O pin, and you got one.
>
> Then somebody else modified that page, and you got exactly what you
> asked for - a COW event. The original R/O pin has the original page
> that it asked for, and can read it just fine.

Where in the code did I ask for a COW event? I asked for a R/O pin, not
any kind of memory protection.

--
Thanks,

David / dhildenb


2021-12-17 20:45:19

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 17.12.21 20:04, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 3:34 AM David Hildenbrand <[email protected]> wrote:
>>
>> + * If the child takes a read-only pin on such a page (i.e., FOLL_WRITE is not
>> + * set) and then unmaps the target page, we have:
>> + *
>> + * * page has mapcount == 1 and refcount > 1
>

Hi Linus,

> All these games with mapcount makes me think this is still broken.
>
> mapcount has been a horribly broken thing in the past, and I'm not
> convinced it's not a broken thing now.

It all started when Jann detected the security issue in GUP -- and this
patch set is fixing the problem exactly there, in GUP itself. Are you
aware of COW issues regarding the mapcount if we were to remove GUP from
the equation? My point is that COW without GUP works just perfectly
fine, but I'll be happy to learn about other cases I have been ignoring so far.

Unfortunately, page_count() is even more unreliable, and the issues
we're just detecting (see the link in the cover letter: memory
corruptions inside user space -- e.g., lost DMA writes) are even worse
than what we had before -- IMHO of course.

>
>> + vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
>> + if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
>> + page_mapcount(vmf->page) > 1) {
>
> What keeps the mapcount stable in here?

So, we're reading an atomic value here. It's read via atomic_read for
regular pages, and the THP mapcount case has also been made atomic (as
lockless as page_count) in patch #5.

If a page is mapped exactly once, page_mapcount(page) == 1 and there is
nothing to do.

If the page is mapped more than once, page_mapcount(page) > 1 and we
would have to trigger unsharing. And it's true that the value is
unstable in this case, but we really only care about
page_mapcount(page) > 1 vs. page_mapcount(page) == 1. In this respect,
there is no difference between the instability of the page_count and
the mapcount -- we still only care if it's > 1 or == 1.

So the only case we could care about is concurrent additional mappings
that can increment the mapcount -- which can only happen due to
concurrent fork. So if we're reading page_mapcount(page) == 1 the only
way we can get page_mapcount(page) > 1 is due to fork(). But we're
holding the mmap_lock in read mode during faults and fork requires the
mmap_lock in write mode.


>
> And I still believe that the whole notion that "COW should use
> mapcount" is pure and utter garbage.
>
> If we are doing a COW, we need an *exclusive* access to the page. That
> is not mapcount, that is the page ref.

I thought about this a lot, because initially I had the same opinion.

But really, we don't care about any speculative references (pagecache,
migration, daemon, pagevec, ...) or any short-term "I just want to grab
this reference real quick such that the page can't get freed" references.

All we care about are GUP references, and we attack that problem at the
root by triggering unsharing exactly at the point where GUP comes into play.

So IMHO GUP is the problem and needs unsharing either:
* On write access to a shared anonymous page, which is just COW as we
know it.
* On read access to a shared anonymous page, which is what we're
proposing in this patch set.

So as soon as GUP comes into play, even if only pinning R/O, we have to
trigger unsharing. Doing so enforces the invariant that it is
impossible to take a GUP pin on an anonymous page with a mapcount > 1.
In turn, the COW code does not need to worry about the GUP-after-fork()
security issue anymore and can focus on handling COW faults optimally,
as if GUP just didn't exist.
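
As a sketch, the check GUP performs before taking a reference then looks
roughly like this (simplified and with an illustrative name; the actual
gup_must_unshare() in the patches takes an extra argument and covers more
cases):

#include <linux/mm.h>

/* simplified sketch of the unshare check proposed by this series */
static inline bool gup_needs_unshare(unsigned int gup_flags,
				     struct page *page)
{
	/* write access breaks COW the usual way already */
	if (gup_flags & FOLL_WRITE)
		return false;
	/* only shared anonymous pages (not KSM) need unsharing */
	if (!PageAnon(page) || PageKsm(page))
		return false;
	return page_mapcount(page) > 1;
}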


--
Thanks,

David / dhildenb


2021-12-17 20:46:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 12:42 PM David Hildenbrand <[email protected]> wrote:
>
> > Then somebody else modified that page, and you got exactly what you
> > asked for - a COW event. The original R/O pin has the original page
> > that it asked for, and can read it just fine.
>
> Where in the code did I ask for a COW event? I asked for a R/O pin, not
> any kind of memory protection.

Why didn't you ask for a shared pin, if that is what you want?

We already support that.

If you don't like the read-only pins, don't use them. It's that simple.

Linus

2021-12-17 20:47:12

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 12:36:43PM -0800, Linus Torvalds wrote:

> > 5. Take a R/O pin (RDMA, VFIO, ...)
> > -> refcount > 1
> >
> > 6. memset(mem, 0xff, pagesize);
> > -> Write fault -> COW
>
> I do not believe this is actually a bug.
>
> You asked for a R/O pin, and you got one.
>
> Then somebody else modified that page, and you got exactly what you
> asked for - a COW event. The original R/O pin has the original page
> that it asked for, and can read it just fine.

To remind all, the GUP users, like RDMA, VFIO use
FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
COW breaking the coherence. In these case 'r/o pin' does not mean
"snapshot the data", but its only a promise not to write to the pages
and still desires coherence with the memory map.

Eg in RDMA we know of apps asking for a R/O pin of something in .bss
then filling that something with data finally doing the actual
DMA. Breaking COW after pin breaks those apps.
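
As a sketch (not a quote of any particular driver), such a user ends up
doing something like:

#include <linux/mm.h>

/* hedged sketch of an RDMA-style 'r/o pin' that stays coherent with the mm */
static int pin_user_buffer(unsigned long start, int nr_pages,
			   struct page **pages)
{
	unsigned int gup_flags = FOLL_LONGTERM | FOLL_WRITE | FOLL_FORCE;

	/* FOLL_WRITE|FOLL_FORCE breaks any COW up front, at pin time */
	return pin_user_pages_fast(start, nr_pages, gup_flags, pages);
}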

The above #5 can occur for O_DIRECT read and in that case the
'snapshot the data' is perfectly fine as racing the COW with the
O_DIRECT read just resolves the race toward the read() direction.

IIRC there is some other scenario that motivated this patch?

Jason

2021-12-17 20:51:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 12:45 PM David Hildenbrand <[email protected]> wrote:
>
> If a page is mapped exactly once, page_mapcount(page) == 1 and there is
> nothing to do.

Why?

You state that, but you stating that doesn't magically make it so.

What makes "mapcount==1" stable and special? Your "it's an
atomic_read()" argument is nonsense - it implies that the count can be
changing, but you will get _one_ answer.

What makes that one answer of a changing count special?

What if there are other references to that same page, gotten with
vmsplice(), and just about to be mapped into another address space?

This is the meat of my argument. You claim that "mapcount==1" is
special. I claim that you haven't explained why it would be. And I do
not believe it is.

Linus

2021-12-17 20:52:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 12:39 PM Linus Torvalds
<[email protected]> wrote:
>
> At the time of such a shared pin, you can do what we already do:
> re-use the page if it has a refcount of 1. Or do an early COW event
> (feel free to avoid the "mark it writable and dirty").

Note that this also depends on fork() doing the right thing, marking
things for "a fork() can not share this page any more".

Which it does for regular pages, and is exactly what that
page_needs_cow_for_dma() logic is all about (and the special
write_protect_seq around gup/fork).
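
Roughly, as a simplified sketch of that logic (paraphrased from memory,
with a made-up name; the real helper is page_needs_cow_for_dma()):

#include <linux/mm.h>
#include <linux/sched/coredump.h>

/* at fork(): copy a possibly-pinned page right away instead of sharing it COW */
static inline bool fork_needs_early_copy(struct mm_struct *src_mm,
					 struct page *page)
{
	if (!test_bit(MMF_HAS_PINNED, &src_mm->flags))
		return false;		/* this mm never pinned anything */
	return page_maybe_dma_pinned(page);
}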

I do believe that huge-pages don't do it right. But I think that as
you try to fix hugepages, you are now breaking the normal case.

If all your logic was only about hugepages, I wouldn't care so much.
But you are playing questionable games with code that I think is
correct.

Please explain why.

Linus

2021-12-17 20:55:42

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 17.12.21 21:51, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 12:45 PM David Hildenbrand <[email protected]> wrote:
>>
>> If a page is mapped exactly once, page_mapcount(page) == 1 and there is
>> nothing to do.
>
> Why?
>
> You state that, but you stating that doesn't magically make it so.
>
> What makes "mapcount==1" stable and special? Your "it's an
> atomic_read()" argument is nonsense - it implies that the count can be
> changing, but you will get _one_ answer.

And I explained why it cannot increment here: the only way it can
increase is via fork(), which cannot run concurrently.

>
> What makes that one answer of a changing count special?
>
> What if there are other references to that same page, gotten with
> vmsplice(), and just about to be mapped into another address space?

If we have a shared anonymous page we cannot have GUP references, not
even R/O ones. Because GUP would have unshared and copied the page,
resulting in a R/O mapped anonymous page.

What am I missing?


--
Thanks,

David / dhildenb


2021-12-17 20:56:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 12:47 PM Jason Gunthorpe <[email protected]> wrote:
>
> To remind all, the GUP users, like RDMA, VFIO use
> FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
> COW breaking the coherence. In these case 'r/o pin' does not mean
> "snapshot the data", but its only a promise not to write to the pages
> and still desires coherence with the memory map.
>
> Eg in RDMA we know of apps asking for a R/O pin of something in .bss
> then filling that something with data finally doing the actual
> DMA. Breaking COW after pin breaks those apps.

I agree.

And my argument is that those kinds of things that ask for a R/O pin
are broken, and should just make sure to use the shared pins.

If the page was already writable, you can just re-use the page
directly (marking it pinned, so that any subsequent fork() does the
right pre-cow thing)

And if the page was *NOT* already writable, you do a COW - which might
be sharing the page directly too, if it has no other references.

What's the downside of just doing this properly?

Again: if a DMA user wants coherent memory, then it should use the
coherent pinning. Not some kind of "read-only sharing that looks at
crazy mapcounts that have absolutely zero relevance to whether the
page is coherent or not".

Linus

2021-12-17 21:04:22

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 17.12.21 21:47, Jason Gunthorpe wrote:
> On Fri, Dec 17, 2021 at 12:36:43PM -0800, Linus Torvalds wrote:
>
>>> 5. Take a R/O pin (RDMA, VFIO, ...)
>>> -> refcount > 1
>>>
>>> 6. memset(mem, 0xff, pagesize);
>>> -> Write fault -> COW
>>
>> I do not believe this is actually a bug.
>>
>> You asked for a R/O pin, and you got one.
>>
>> Then somebody else modified that page, and you got exactly what you
>> asked for - a COW event. The original R/O pin has the original page
>> that it asked for, and can read it just fine.
>

Hi Jason

> To remind all, the GUP users, like RDMA, VFIO use
> FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the

I've heard that statement often. Can you point me at the code?

VFIO: drivers/vfio/vfio_iommu_type1.c

vaddr_get_pfns() will end up doing a
pin_user_pages_remote(FOLL_LONGTERM) without
FOLL_FORCE|FOLL_WRITE.

Is that added automatically internally?

Note the comment in the next patch

+ *
+ * TODO: although the security issue described does no longer apply in any case,
+ * the full consistency between the pinned pages and the pages mapped into the
+ * page tables of the MM only apply to short-term pinnings only. For
+ * FOLL_LONGTERM, FOLL_WRITE|FOLL_FORCE is required for now, which can be
+ * inefficient and still result in some consistency issues. Extend this
+ * mechanism to also provide full synchronicity to FOLL_LONGTERM, avoiding
+ * FOLL_WRITE|FOLL_FORCE.

>
> Eg in RDMA we know of apps asking for a R/O pin of something in .bss
> then filling that something with data finally doing the actual
> DMA. Breaking COW after pin breaks those apps.
>
> The above #5 can occur for O_DIRECT read and in that case the
> 'snapshot the data' is perfectly fine as racing the COW with the
> O_DIRECT read just resolves the race toward the read() direction.
>
> IIRC there is some other scenario that motivated this patch?

1. I want to fix the COW security issue as documented.
Reproducers in patch #11

2. I want to fix all of the other issues as documented and linked
in the cover letter that result from the imprecise page_count
check in COW code. Especially the ones where we have memory
corruptions, because this is just not acceptable. There are
reproducers as well for everybody that doesn't believe me.

But this series really just wants to fix the security issue as "part 1".
Without any more breakages.

I'm sorry, but it's all described in the cover letter. Maybe TL;DR

--
Thanks,

David / dhildenb


2021-12-17 21:15:49

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)



> On Dec 17, 2021, at 12:47 PM, Jason Gunthorpe <[email protected]> wrote:
>
> On Fri, Dec 17, 2021 at 12:36:43PM -0800, Linus Torvalds wrote:
>
>>> 5. Take a R/O pin (RDMA, VFIO, ...)
>>> -> refcount > 1
>>>
>>> 6. memset(mem, 0xff, pagesize);
>>> -> Write fault -> COW
>>
>> I do not believe this is actually a bug.
>>
>> You asked for a R/O pin, and you got one.
>>
>> Then somebody else modified that page, and you got exactly what you
>> asked for - a COW event. The original R/O pin has the original page
>> that it asked for, and can read it just fine.
>
> To remind all, the GUP users, like RDMA, VFIO use
> FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
> COW breaking the coherence. In these case 'r/o pin' does not mean
> "snapshot the data", but its only a promise not to write to the pages
> and still desires coherence with the memory map.
>
> Eg in RDMA we know of apps asking for a R/O pin of something in .bss
> then filling that something with data finally doing the actual
> DMA. Breaking COW after pin breaks those apps.
>
> The above #5 can occur for O_DIRECT read and in that case the
> 'snapshot the data' is perfectly fine as racing the COW with the
> O_DIRECT read just resolves the race toward the read() direction.
>
> IIRC there is some other scenario that motivated this patch?

I think that there is an assumption that once a page is COW-broken,
it would never have another write-fault that might lead to COW
breaking later.

AFAIK at least after userfaultfd-WP followed by
userfaultfd-write-unprotect a page might be write-protected and
go through do_wp_page() a second time to be COW-broken again. In
such a case, I think the FOLL_FORCE|FOLL_WRITE would not help.

I suspect (not sure) that this might even happen with mprotect()
since I do not see all code-paths preserving whether the page
was writable.


2021-12-17 21:17:22

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 17.12.21 21:56, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 12:47 PM Jason Gunthorpe <[email protected]> wrote:
>>
>> To remind all, the GUP users, like RDMA, VFIO use
>> FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
>> COW breaking the coherence. In these case 'r/o pin' does not mean
>> "snapshot the data", but its only a promise not to write to the pages
>> and still desires coherence with the memory map.
>>
>> Eg in RDMA we know of apps asking for a R/O pin of something in .bss
>> then filling that something with data finally doing the actual
>> DMA. Breaking COW after pin breaks those apps.
>
> I agree.
>

I agree that breaking COW after a pin should never be done. Therefore,
break the COW before the pin -> unsharing as implemented here.

> And my argument is that those kinds of things that ask for a R/O pin
> are broken, and should just make sure to use the shared pins.

And trigger a write fault although they are just reading. To me this is
just a band-aid instead of eventually ...

...
> What's the downside of just doing this properly?

Doing it properly by fixing GUP and not the COW logic. GUP and COW are
just incompatible and we should unshare early.

Honestly, the memory corruptions we can trigger in user space due to the
current COW logic, *especially* with FOLL_WRITE users such as O_DIRECT,
io_uring or vfio are not a joke anymore. (again, link in the cover letter)

--
Thanks,

David / dhildenb


2021-12-17 21:20:36

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 17.12.21 22:15, Nadav Amit wrote:
>
>
>> On Dec 17, 2021, at 12:47 PM, Jason Gunthorpe <[email protected]> wrote:
>>
>> On Fri, Dec 17, 2021 at 12:36:43PM -0800, Linus Torvalds wrote:
>>
>>>> 5. Take a R/O pin (RDMA, VFIO, ...)
>>>> -> refcount > 1
>>>>
>>>> 6. memset(mem, 0xff, pagesize);
>>>> -> Write fault -> COW
>>>
>>> I do not believe this is actually a bug.
>>>
>>> You asked for a R/O pin, and you got one.
>>>
>>> Then somebody else modified that page, and you got exactly what you
>>> asked for - a COW event. The original R/O pin has the original page
>>> that it asked for, and can read it just fine.
>>
>> To remind all, the GUP users, like RDMA, VFIO use
>> FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
>> COW breaking the coherence. In these case 'r/o pin' does not mean
>> "snapshot the data", but its only a promise not to write to the pages
>> and still desires coherence with the memory map.
>>
>> Eg in RDMA we know of apps asking for a R/O pin of something in .bss
>> then filling that something with data finally doing the actual
>> DMA. Breaking COW after pin breaks those apps.
>>
>> The above #5 can occur for O_DIRECT read and in that case the
>> 'snapshot the data' is perfectly fine as racing the COW with the
>> O_DIRECT read just resolves the race toward the read() direction.
>>
>> IIRC there is some other scenario that motivated this patch?
>
> I think that there is an assumption that once a page is COW-broken,
> it would never have another write-fault that might lead to COW
> breaking later.
>
> AFAIK at least after userfaultfd-WP followed by
> userfaultfd-write-unprotect a page might be write-protected and
> go through do_wp_page() a second time to be COW-broken again. In
> such case, I think the FOLL_FORCE|FOLL_WRITE would not help.
>
> I suspect (not sure) that this might even happen with mprotect()
> since I do not see all code-paths preserving whether the page
> was writable.
>

uffd-wp and mprotect() are broken as well, yes. But the easiest example
is just swap + read fault.

Section 2 and 3 in [1], along with reproducers.

Note that I didn't mention uffd-wp and mprotect(), because these require
"manual intervention". With swap, it's not your application doing
something "special".

[1]
https://lore.kernel.org/r/[email protected]

--
Thanks,

David / dhildenb


2021-12-17 21:29:00

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v1 01/11] seqlock: provide lockdep-free raw_seqcount_t variant

On Fri, Dec 17 2021 at 12:30, David Hildenbrand wrote:
> Sometimes it is required to have a seqcount implementation that uses
> a structure with a fixed and minimal size -- just a bare unsigned int --
> independent of the kernel configuration. This is especially valuable, when
> the raw_ variants of the seqlock function will be used and the additional
> lockdep part of the seqcount_t structure remains essentially unused.
>
> Let's provide a lockdep-free raw_seqcount_t variant that can be used via
> the raw functions to have a basic seqlock.
>
> The target use case is embedding a raw_seqcount_t in the "struct page",
> where we really want a minimal size and cannot tolerate a sudden grow of
> the seqcount_t structure resulting in a significant "struct page"
> increase or even a layout change.

Cannot tolerate? Could you please provide a reason and not just a
statement?

> Provide raw_read_seqcount_retry(), to make it easy to match to
> raw_read_seqcount_begin() in the code.
>
> Let's add a short documentation as well.
>
> Note: There might be other possible users for raw_seqcount_t where the
> lockdep part might be completely unused and just wastes memory --
> essentially any users that only use the raw_ function variants.

Even when the reader side uses raw_seqcount_begin/retry() the writer
side still can use the non-raw variant which validates that the
associated lock is held on write.

Aside from that, your proposed extra raw sequence count needs extra care
vs. PREEMPT_RT, and this wants to be very clearly documented. Why?

The lock association has two purposes:

1) Lockdep coverage which unearthed bugs already

2) PREEMPT_RT livelock prevention

Assume the following:

spin_lock(wrlock);
write_seqcount_begin(seq);

---> preemption by a high priority reader

seqcount_begin(seq); <-- live lock

The RT substitution does:

seqcount_begin(seq)
cnt = READ_ONCE(seq->sequence);

if (cnt & 1) {
lock(s->lock);
unlock(s->lock);
}

which prevents the deadlock because it makes the reader block on
the associated lock, which allows the preempted writer to make
progress.

This applies to raw_seqcount_begin() as well.

I have no objections against the construct itself, but this has to be
properly documented vs. the restriction this imposes.

As you can see above, the writer side therefore has to ensure that it
cannot be preempted on PREEMPT_RT, which limits what you can do inside
a preemption (or interrupt) disabled section on RT enabled
kernels. See Documentation/locking/locktypes.rst for further information.

Thanks,

tglx

2021-12-17 21:42:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 12:55 PM David Hildenbrand <[email protected]> wrote:
>
> If we have a shared anonymous page we cannot have GUP references, not
> even R/O ones. Because GUP would have unshared and copied the page,
> resulting in a R/O mapped anonymous page.

Doing a GUP on an actual shared page is wrong to begin with.

You even know that, you try to use "page_mapcount() > 1" to disallow it.

My point is that it's wrong regardless, and that "mapcount" is
dubious, and that COW cannot - and must not - use mapcount, and that I
think your shared case should strive to avoid it for the exact same
reason.

So, what I think should happen is:

(a) GUP makes sure that it only ever looks up pages that can be
shared with this VM. This may in involve breaking COW early with any
past fork().

(b) it marks such pages so that any future work will not cause them
to COW either

Note that (a) is not necessarily "always COW and have to allocate and
copy new page". In particular, if the page is already writable, you
know you already have exclusive access to it and don't need to COW.

And if it isn't writable, then the other common case is "the cow has
only one user, and it's us" - that's the "refcount == 1" case.

And (b) is what we do with that page_maybe_dma_pinned() logic for
fork(), but also for things like swap cache creation (eg see commit
feb889fb40fa: "mm: don't put pinned pages into the swap cache").

Note that this code all already exists, and already works - even
without getting the (very expensive) mmap_sem. So it works with
fast-GUP and it can race with concurrent forking by another thread,
which is why we also have that seqcount thing.

As far as I can tell, your "mapcount" logic fundamentally requires
mmap_sem for the fork() race avoidance, for example.

So this is why I don't like the mapcount games - I think they are very
fragile, and not at all as logical as the two simple rules a/b above.

I believe you can make mapcount games _work_ - we used to have
something like that. It was incredibly fragile, and it had its own set
of bugs, but with enough care it's doable.

But my argument really is that I think it's the wrong approach, and
that we should simply strive to follow the two simple conceptual rules
above.

Linus

2021-12-17 21:47:27

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 17.12.21 22:36, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 12:55 PM David Hildenbrand <[email protected]> wrote:
>>
>> If we have a shared anonymous page we cannot have GUP references, not
>> even R/O ones. Because GUP would have unshared and copied the page,
>> resulting in a R/O mapped anonymous page.
>
> Doing a GUP on an actual shared page is wrong to begin with.
>
> You even know that, you try to use "page_mapcount() > 1" to disallow it.

GUP is incompatible with shared anonymous pages, therefore it has to
trigger unsharing, correct.

>
> My point is that it's wrong regardless, and that "mapcount" is
> dubious, and that COW cannot - and must not - use mapcount, and that I
> think your shared case should strive to avoid it for the exact same
> reason.

For now I have not heard a compelling argument why the mapcount is
dubious, I repeat:

* mapcount can only increase due to fork()
* mapcount can decrease due to unmap / zap

We can protect against the transition == 1 -> >1 using the mmap_lock.

For COW the mapcount is the only thing that matters *if we take GUP* out
of the equation. And that's exactly what we do.

OTOH, take a look at which issues resulted from the page_count changes.
That's what I call dubious, sorry to say.

>
> So, what I think should happen is:
>
> (a) GUP makes sure that it only ever looks up pages that can be
> shared with this VM. This may in involve breaking COW early with any
> past fork().

Is that unsharing as we propose it?

>
> (b) it marks such pages so that any future work will not cause them
> to COW either

Right, exactly. GUP before fork does not result in a page getting shared
again.

>
> Note that (a) is not necessarily "always COW and have to allocate and
> copy new page". In particular, if the page is already writable, you
> know you already have exclusive access to it and don't need to COW.
>
> And if it isn't writable, then the other common case is "the cow has
> only one user, and it's us" - that's the "refcount == 1" case.
>
> And (b) is what we do with that page_maybe_dma_pinned() logic for
> fork(), but also for things like swap cache creation (eg see commit
> feb889fb40fa: "mm: don't put pinned pages into the swap cache").

I fully agree with b). GUP before fork is a totally different set of
problems than GUP after fork.

>
> Note that this code all already exists, and already works - even
> without getting the (very expensive) mmap_sem. So it works with
> fast-GUP and it can race with concurrent forking by another thread,
> which is why we also have that seqcount thing.

I know, I studied it intensively :)

>
> As far as I can tell, your "mapcount" logic fundamentally requires
> mmap_sem for the fork() race avoidance, for example.

Yes. Or any other more lightweight synchronization in the future. For
now this is just perfect.

>
> So this is why I don't like the mapcount games - I think they are very
> fragile, and not at all as logical as the two simple rules a/b above.

I don't really see anything fragile, really. I'm happy to learn as always.

>
> I believe you can make mapcount games _work_ - we used to have
> something like that. It was incredibly fragile, and it had its own set
> of bugs, but with enough care it's doable.

We made it work, and it was comparatively simple.

--
Thanks,

David / dhildenb


2021-12-17 21:50:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 1:47 PM David Hildenbrand <[email protected]> wrote:
>
> For now I have not heard a compelling argument why the mapcount is
> dubious, I repeat:
>
> * mapcount can only increase due to fork()
> * mapcount can decrease due to unmap / zap
>
> We can protect against the transition == 1 -> >1 using the mmap_lock.
>
> For COW the mapcount is the only thing that matters *if we take GUP* out
> of the equation. And that's exactly what we do.

What do you have against just doing what we already do in other parts,
that a/b thing?

Which avoids the whole mmap_sem issue. That was a big issue for the
rdma people, afaik.

Linus

2021-12-17 22:03:03

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 01/11] seqlock: provide lockdep-free raw_seqcount_t variant

On 17.12.21 22:28, Thomas Gleixner wrote:
> On Fri, Dec 17 2021 at 12:30, David Hildenbrand wrote:
>> Sometimes it is required to have a seqcount implementation that uses
>> a structure with a fixed and minimal size -- just a bare unsigned int --
>> independent of the kernel configuration. This is especially valuable, when
>> the raw_ variants of the seqlock function will be used and the additional
>> lockdep part of the seqcount_t structure remains essentially unused.
>>
>> Let's provide a lockdep-free raw_seqcount_t variant that can be used via
>> the raw functions to have a basic seqlock.
>>
>> The target use case is embedding a raw_seqcount_t in the "struct page",
>> where we really want a minimal size and cannot tolerate a sudden grow of
>> the seqcount_t structure resulting in a significant "struct page"
>> increase or even a layout change.
>

Hi Thomas,

thanks for your feedback!

> Cannot tolerate? Could you please provide a reason and not just a
> statement?

Absolutely.

"struct page" is supposed to have a minimal size with a fixed layout.
Embedding something inside such a structure can change the fixed layout
in a way that just completely breaks any assumptions about the location
of values.

Therefore, embedding a complex structure in it is usually avoided -- and
if we have to (spin_lock), we work around sudden size increases.

There are ways around it: allocate the lock and only store the pointer
in the struct page. But that most certainly adds complexity, which is
why I want to avoid it for now.


I'll extend that answer and add it to the patch description.

>
>> Provide raw_read_seqcount_retry(), to make it easy to match to
>> raw_read_seqcount_begin() in the code.
>>
>> Let's add a short documentation as well.
>>
>> Note: There might be other possible users for raw_seqcount_t where the
>> lockdep part might be completely unused and just wastes memory --
>> essentially any users that only use the raw_ function variants.
>
> Even when the reader side uses raw_seqcount_begin/retry() the writer
> side still can use the non-raw variant which validates that the
> associated lock is held on write.

Yes, that's my understanding as well.

>
> Aside of that your proposed extra raw sequence count needs extra care
> vs. PREEMPT_RT and this want's to be very clearly documented. Why?
>
> The lock association has two purposes:
>
> 1) Lockdep coverage which unearthed bugs already

Yes, that's a real shame to lose.

>
> 2) PREEMPT_RT livelock prevention
>
> Assume the following:
>
> spin_lock(wrlock);
> write_seqcount_begin(seq);
>
> ---> preemption by a high priority reader
>
> seqcount_begin(seq); <-- live lock
>
> The RT substitution does:
>
> seqcount_begin(seq)
> cnt = READ_ONCE(seq->sequence);
>
> if (cnt & 1) {
> lock(s->lock);
> unlock(s->lock);
> }
>
> which prevents the deadlock because it makes the reader block on
> the associated lock, which allows the preempted writer to make
> progress.
>
> This applies to raw_seqcount_begin() as well.
>
> I have no objections against the construct itself, but this has to be
> properly documented vs. the restriction this imposes.

Absolutely, any input is highly appreciated.

But to mention it again: whatever you can do with raw_seqcount_t, you
can do with seqcount_t, and there are already users relying completely
on the raw_ function variants (see my other reply).

So the documentation should most probably be extended to cover the raw_
functions and seqcount_t in general.
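
To make the intended pairing concrete, a minimal sketch -- note that
raw_seqcount_t and raw_read_seqcount_retry() are only introduced by this
patch, so this illustrates the proposal, not current mainline:

#include <linux/seqlock.h>

struct obj {
	raw_seqcount_t seq;	/* proposed: a bare unsigned int, no lockdep */
	unsigned long a, b;
};

/* reader: retry until an even, unchanged sequence count was observed */
static void obj_read(struct obj *o, unsigned long *a, unsigned long *b)
{
	unsigned int seq;

	do {
		seq = raw_read_seqcount_begin(&o->seq);
		*a = o->a;
		*b = o->b;
	} while (raw_read_seqcount_retry(&o->seq, seq));
}

/* writer: must not be preempted by a reader spinning on the count (PREEMPT_RT!) */
static void obj_write(struct obj *o, unsigned long a, unsigned long b)
{
	raw_write_seqcount_begin(&o->seq);
	o->a = a;
	o->b = b;
	raw_write_seqcount_end(&o->seq);
}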

>
> As you can see above the writer side therefore has to ensure that it
> cannot preempted on PREEMPT_RT, which limits the possibilities what you
> can do inside a preemption (or interrupt) disabled section on RT enabled
> kernels. See Documentation/locking/locktypes.rst for further information.

It's going to be used for THP, which is currently incompatible with
PREEMPT_RT (disabled in the Kconfig). But preemption is also disabled
because we're using bit_spin_lock(), which implies a preempt_disable().

Certainly worth documenting!

Thanks for your input!

--
Thanks,

David / dhildenb


2021-12-17 22:18:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 1:47 PM David Hildenbrand <[email protected]> wrote:
>
> For now I have not heard a compelling argument why the mapcount is
> dubious, I repeat:
>
> * mapcount can only increase due to fork()
> * mapcount can decrease due to unmap / zap

And to answer the "why is this dubious", let's just look at your
actual code that I reacted to:

+ vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
+ if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
+ page_mapcount(vmf->page) > 1) {

Note how you don't just check page_mapcount(). Why not? Because
mapcount is completely immaterial if it's not a PageAnon page, so you
test for that.

So even when you do the mapcount read as one atomic thing, it's one
atomic thing that depends on _other_ things, and all these checks are
not atomic.

But a PageAnon() page can actually become a swap-backed page, and as
far as I can tell, your code doesn't have any locking to protect
against that.

So now you need not only the mmap_sem (to protect against fork), you
also need the page lock (to protect against rmap changing the type of
page).

I don't see you taking the page lock anywhere. Maybe the page table
lock ends up serializing sufficiently with the rmap code that it ends
up working.

In the do_wp_page() path, we currently do those kinds of racy checks
too, but then we do a trylock_page, and re-do them. And any time
there is any question about things, we fall back to copying - because
a copy is always safe.

Well, it's always safe if we have the rule that "once we've pinned
things, we don't cause them to be COW again".

But that "it's safe if" was exactly my (b) case.

That's why I much prefer the model I'm trying to push - it's
conceptually quite simple. I can literally explain mine at a
conceptual level with that "break pre-existing COW, make sure no
future COW" model.

In contrast, I look at your page_mapcount() code, and I go "there are
no conceptual rules here, and the actual implementation details look
dodgy".

I personally like having clear conceptual rules - as opposed to random
implementation details.

Linus

2021-12-17 22:29:52

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 17.12.21 22:50, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 1:47 PM David Hildenbrand <[email protected]> wrote:
>>
>> For now I have not heard a compelling argument why the mapcount is
>> dubious, I repeat:
>>
>> * mapcount can only increase due to fork()
>> * mapcount can decrease due to unmap / zap
>>
>> We can protect against the transition == 1 -> >1 using the mmap_lock.
>>
>> For COW the mapcount is the only thing that matters *if we take GUP* out
>> of the equation. And that's exactly what we do.
>
> What do you have against just doing what we already do in other parts,
> that a/b thing?

Let me put it this way: I just want to get all of this fixed for good. I
don't particularly care how. *But* I will fight for something that is
superior, logically makes sense (at least to me :) ) and is not super
complicated. And I also call avoiding unnecessary COW "superior".

I do know that what this series proposes fixes the CVE: GUP after fork.
I do know that the part 2 we'll be sending out next year will fix
everything else we discovered so far, and it will rely on this as a
basis, to not reintroduce any other COW issues we've seen so far.


If someone can propose something comparable that makes all discovered
problems go away I'll be *extremely* happy. We have reproducers for all
issues, so it's easy to verify, and I'm planning on extending the
selftests to cover even more corner cases.


So far, I am not convinced that using the mapcount is dubious or
problematic, I just don't see how. COW is about sharing pages between
processes, which is exactly what the mapcount expresses. It's a pure
optimization for exactly that purpose.

GUP is the problem, not COW, not the mapcount. To me the mapcount is the
only thing that makes sense in COW+unsharing logic, and GUP has to be
taught to identify it and resolve it -> unshare when it detects a shared
anonymous page.


>
> Which avoids the whole mmap_sem issue. That was a big issue for the
> rdma people, afaik.

While I do care about future use cases, I cannot possibly see fork() not
requiring the mmap_lock in the foreseeable future. Just so much depends
on it as of now.

And after all, fixing everything what we discovered so far is more
important to me than something like that for the future. We have other
problems to solve in that regard.

----------------------------------------------------------------------

I didn't want to talk about hugetlb here, but I will, just because it's a
good example of why using the refcount is just wrong -- because unnecessary
COWs are just absolutely problematic.

Assume you have a R/O mapped huge page that is only mapped into your
process and you get a write fault. What should your COW logic do?

a) Rely on the mapcount? Yes, iff GUP has been taught to unshare
properly, because then it expresses exactly what we want to know.
mapcount == 1 -> reuse. mapcount > 1 -> COW.

b) Rely on the refcount? If we have a speculative reference on the page
we would COW. As huge pages are a scarce resource, we can easily just
not have a free huge page anymore and crash the application. The app
didn't do anything wrong.

So teaching the hugetlb COW code to rely on the refcount would just be
highly fragile.

--
Thanks,

David / dhildenb


2021-12-17 22:44:54

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 17.12.21 23:18, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 1:47 PM David Hildenbrand <[email protected]> wrote:
>>
>> For now I have not heard a compelling argument why the mapcount is
>> dubious, I repeat:
>>
>> * mapcount can only increase due to fork()
>> * mapcount can decrease due to unmap / zap
>
> And to answer the "why is this dubious", let's just look at your
> actual code that I reacted to:
>
> + vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
> + if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
> + page_mapcount(vmf->page) > 1) {
>
> Note how you don't just check page_mapcount(). Why not? Because
> mapcount is completely immaterial if it's not a PageAnon page, so you
> test for that.
>
> So even when you do the mapcount read as one atomic thing, it's one
> atomic thing that depends on _other_ things, and all these checks are
> not atomic.
>
> But a PageAnon() page can actually become a swap-backed page, and as
> far as I can tell, your code doesn't have any locking to protect
> against that.

The pages stay PageAnon(). swap-backed pages simply set a bit IIRC.
mapcount still applies.

>
> So now you need not only the mmap_sem (to protect against fork), you
> also need the page lock (to protect against rmap changing the type of
> page).

No, I don't think so. But I'm happy to be proven wrong because I might
just be missing something important.

>
> I don't see you taking the page lock anywhere. Maybe the page table
> lock ends up serializing sufficiently with the rmap code that it ends
> up working
>
> In the do_wp_page() path, we currently do those kinds of racy checks
> too, but then we do a trylock_page, and re-do them. And at any time
> there is any question about things, we fall back to copying - because
> a copy is always safe.

Yes, I studied that code in detail as well.

>
> Well, it's always safe if we have the rule that "once we've pinned
> things, we don't cause them to be COW again".

We should also be handling FOLL_GET, but that's a completely different
discussion.

>
> But that "it's safe if" was exactly my (b) case.
>
> That's why I much prefer the model I'm trying to push - it's
> conceptually quite simple. I can literally explain mine at a
> conceptual level with that "break pre-existing COW, make sure no
> future COW" model.

:)

We really might be talking about the same thing; my point is just
that the mapcount is the right thing to use for making the decision
whether to break COW -> trigger unsharing.

>
> In contrast, I look at your page_mapcount() code, and I go "there is
> no conceptual rules here, and the actual implementation details look
> dodgy".
>
> I personally like having clear conceptual rules - as opposed to random
> implementation details.

Oh, don't get me wrong, me too. But for me it all just makes perfect sense.

What we document is:

"The fault is an unsharing request to unshare a shared anonymous page
(-> mapped R/O). Does not apply to KSM."

And the code checks for exactly that. And in that context the mapcount
just expresses exactly what we want. Again, unless I am missing
something important that you raised above.


Anyhow, it's late in Germany. thanks for the discussion Linus!

--
Thanks,

David / dhildenb


2021-12-17 22:59:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 2:29 PM David Hildenbrand <[email protected]> wrote:
>
> While I do care about future use cases, I cannot possibly see fork() not
> requiring the mmap_lock in the foreseeable future. Just so much depends
> on it as of now.

It's not that *fork()* depends on it.

Of course fork() takes the mmap_sem.

It's that fast-gup really really doesn't want it, and can't take it.

So any fast-gup user fundamentally cannot look at mapcount(), because
that would be fundamentally wrong and racy, and could race with fork.

And yet, as far as I can tell, that's *exactly* what your gup patches
do, with gup_pte_range() adding

+ if (!pte_write(pte) && gup_must_unshare(flags, page, false)) {
+ put_compound_head(head, 1, flags);
+ goto pte_unmap;
+ }

which looks at the page mapcount without holding the mmap sem at all.

And see my other email - I think there are other examples of your
patches looking at data that isn't stable because you don't hold the
right locks.

And you can't even do the optimistic case without taking the lock,
because in your world, a COW that optimistically copies in the case of
a race condition is fundamentally *wrong* and buggy. Because in your
world-view, GUP and COW are very different and have different rules,
but you need things to be *exact*, and they aren't.

And none of this is anything at least I can think about, because I
don't see what the "design" is.

I really have a hard time following what the rules actually are. You
seem to think that "page_mapcount()" is a really simple rule, and I
fundamentally disagree. It's a _very_ complicated thing indeed, with
locking issues, AND YOU ACTIVELY VIOLATE THE LOCKING RULES!

See why I'm so unhappy?

We *did* do the page_mapcount() thing. It was bad. It forced COW to
always take the page lock. There's a very real reason why I'm pushing
my "let's have a _design_ here", instead of your "let's look at
page_mapcount without even doing the locking".

And yes, I *know* that fork() takes the mmap_sem, and likely always
will. That really isn't the problem here. The problem is that your
page_mapcount() paths DO NOT take that lock.

Btw, maybe I'm misreading things. I looked at the individual patches,
I didn't apply them, maybe I missed something. But I don't think I am.

Linus

2021-12-17 23:26:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 2:43 PM David Hildenbrand <[email protected]> wrote:
>
> The pages stay PageAnon(). swap-backed pages simply set a bit IIRC.
> mapcount still applies.

Our code-base is too large for me to remember all the details, but if
we still end up having PageAnon for swapbacked pages, then mapcount
can increase from another process faulting in an pte with that swap
entry.

And mmap_sem doesn't protect against that. Again, page_lock() does.

And taking the page lock was a big performance issue.

One of the reasons that new COW handling is so nice is that you can do
things like

if (!trylock_page(page))
goto copy;

exactly because in the a/b world order, the copy case is always safe.

In your model, as far as I can tell, you leave the page read-only and
a subsequent COW fault _can_ happen, which means that now the
subsequent COW needs to b every very careful, because if it ever
copies a page that was GUP'ed, you just broke the rules.

So COWing too much is a bug (because it breaks the page from the GUP),
but COWing too little is an even worse problem (because it means that
now the GUP user can see data it shouldn't have seen).

Our old code literally COWed too little. It's why all those changes
happened in the first place.

This is why I'm pushing that whole story line of

(1) COW is based purely on refcounting, because that's the only thing
that obviously can never COW too little.

(2) GUP pre-COWs (the thing I called the "(a)" rule earlier) and then
makes sure to not mark pinned pages COW again (that "(b)" rule).

and here "don't use page_mapcount()" really is about that (1).

You do seem to have kept (1) in that your COW rules don't seem to
change (but maybe I missed it), but because your GUP-vs-COW semantics
are very different indeed, I'm not at all convinced about (2).

Linus

2021-12-17 23:30:03

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 17.12.21 23:58, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 2:29 PM David Hildenbrand <[email protected]> wrote:
>>
>> While I do care about future use cases, I cannot possibly see fork() not
>> requiring the mmap_lock in the foreseeable future. Just so much depends
>> on it as of now.
>
> It's not that *fork()* depends on it.
>
> Of course fork() takes the mmap_sem.
>
> It's that fast-gup really really doesn't want it, and can't take it.

Oh, sorry, I was misreading your mail.

>
> So any fast-gup user fundamentally cannot look at mapcount(), because
> that would be fundamentally wrong and racy, and could race with fork.
>

So we're concerned about fork() racing with gup-fast-only. gup-fast-only
runs essentially lockless.

As we read an atomic mapcount, the relevant thing to happen would be
that we read a mapcount of 1 and decide to share the page, but there is
concurrent fork() such that the mapcount is increased.

So the parent process has to be the only process owning that page for
this to trigger (mapcount == 1). In that situation, we would pin the
page in gup-fast-only.

BUT this is just like GUP before fork(), and it is caught using
mm->write_protect_seq, so we'd immediately unpin the page and not
actually return it from get-user-pages-fast. No harm done AFAIKS.
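
(For context, here is the mechanism I mean: a condensed sketch of the
lockless entry point as it looks around v5.16, simplified from
lockless_pages_from_mm() in mm/gup.c -- details trimmed, so don't read
it as the literal source:)

static unsigned long lockless_pages_from_mm(unsigned long start,
                                            unsigned long end,
                                            unsigned int gup_flags,
                                            struct page **pages)
{
        unsigned long flags;
        int nr_pinned = 0;
        unsigned seq;

        /* snapshot the seqcount that fork() bumps around its COW run */
        if (gup_flags & FOLL_PIN) {
                seq = raw_read_seqcount(&current->mm->write_protect_seq);
                if (seq & 1)
                        return 0;
        }

        local_irq_save(flags);
        gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
        local_irq_restore(flags);

        /* fork() ran concurrently: unpin everything, fall back to slow GUP */
        if ((gup_flags & FOLL_PIN) &&
            read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
                unpin_user_pages(pages, nr_pinned);
                return 0;
        }
        return nr_pinned;
}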


> And yet, as far as I can tell, that's *exactly* what your gup patches
> do, with gup_pte_range() adding
>
> +        if (!pte_write(pte) && gup_must_unshare(flags, page, false)) {
> +                put_compound_head(head, 1, flags);
> +                goto pte_unmap;
> +        }
>
> which looks at the page mapcount without holding the mmap sem at all.
>
> And see my other email - I think there are other examples of your
> patches looking at data that isn't stable because you don't hold the
> right locks.

We rely on PageAnon(), PageKsm() and the mapcount. To my understanding,
they are stable for our use in pagefault handling code under mmap_lock
and in gup-fast because of above reasoning.

>
> And you can't even do the optimistic case without taking the lock,
> because in your world, a COW that optimistically copies in the case of
> a race condition is fundamentally *wrong* and buggy. Because in your
> world-view, GUP and COW are very different and have different rules,
> but you need things to be *exact*, and they aren't.
>
> And none of this is anything at least I can think about, because I
> don't see what the "design" is.
>
> I really have a hard time following what the rules actually are. You
> seem to think that "page_mapcount()" is a really simple rule, and I
> fundamentally disagree. It's a _very_ complicated thing indeed, with
> locking issues, AND YOU ACTIVELY VIOLATE THE LOCKING RULES!
>
> See why I'm so unhappy?

I see why you're unhappy, and I appreciate the productive discussion :)
But I think we just have to complete the big picture of what we're
proposing and how the mapcount is safe to be used for this purpose.

I mean, I'm happy if you actually find a flaw in the current design
proposal.

>
> We *did* do the page_mapcount() thing. It was bad. It forced COW to
> always take the page lock. There's a very real reason why I'm pushing
> my "let's have a _design_ here", instead of your "let's look at
> page_mapcount without even doing the locking".

The locking semantics just have to be clarified and written in stone --
if we don't find any flaws.

But this will be my last mail for today, have a nice weekend Linus!

--
Thanks,

David / dhildenb


2021-12-17 23:53:25

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)


> On Dec 17, 2021, at 3:29 PM, David Hildenbrand <[email protected]> wrote:
>
> On 17.12.21 23:58, Linus Torvalds wrote:
>>
>> And you can't even do the optimistic case without taking the lock,
>> because in your world, a COW that optimistically copies in the case of
>> a race condition is fundamentally *wrong* and buggy. Because in your
>> world-view, GUP and COW are very different and have different rules,
>> but you need things to be *exact*, and they aren’t.

I understand the discussion mainly revolves around correctness, which is
obviously the most important property, but I would like to mention
that having transient get_page() calls causing unnecessary COWs can
cause hard-to-analyze and hard-to-avoid performance degradation. COW
means a page copy, a TLB flush and potentially a TLB shootdown, which
is the most painful, specifically on VMs.

So I think that any solution should be able to limit the cases/number
of unnecessary COW operations to be minimal.

2021-12-18 00:50:35

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 10:04:11PM +0100, David Hildenbrand wrote:

> > To remind all, the GUP users, like RDMA, VFIO use
> > FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
>
> I heard that statement often. Can you point me at the code?
>
> VFIO: drivers/vfio/vfio_iommu_type1.c
>
> vaddr_get_pfns() will end up doing a
> pin_user_pages_remote(FOLL_LONGTERM) without
> FOLL_FORCE|FOLL_WRITE.

> Is that added automatically internally?

No, it is just that VFIO is broken in this regard. AFAIK VFIO users
rarely use the read-only mode and haven't noticed this bug yet.

You can search for FOLL_FORCE and see the drivers that do it this way:

drivers/misc/habanalabs/common/memory.c: FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
drivers/infiniband/core/umem.c: gup_flags |= FOLL_FORCE;
drivers/media/v4l2-core/videobuf-dma-sg.c: unsigned int flags = FOLL_FORCE;
drivers/media/common/videobuf2/frame_vector.c: FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,

[etc]

No doubt there are others that do it right and wrong, this is badly
documented and misunderstood.
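
(For illustration, the "do it this way" pattern is roughly the
following -- a sketch in the style of the umem.c call, not literal
driver code:)

        /*
         * Force the COW break up front and keep the page pinned long-term.
         * The device may only ever read, but FOLL_WRITE|FOLL_FORCE is what
         * ties the pin to the page the process itself keeps seeing.
         */
        npinned = pin_user_pages_fast(start, npages,
                                      FOLL_WRITE | FOLL_FORCE | FOLL_LONGTERM,
                                      pages);
        if (npinned < 0)
                return npinned;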

> But this series really just wants to fix the security issue as "part 1".
> Without any more breakages.

Sure, I'm just trying to understand why this example was brought up.

Jason

2021-12-18 00:50:46

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 09:15:45PM +0000, Nadav Amit wrote:

> I think that there is an assumption that once a page is COW-broken,
> it would never have another write-fault that might lead to COW
> breaking later.

Yes, that is what Linus has been explaining, AFAICT

> AFAIK at least after userfaultfd-WP followed by
> userfaultfd-write-unprotect a page might be write-protected and
> go through do_wp_page() a second time to be COW-broken again. In
> such case, I think the FOLL_FORCE|FOLL_WRITE would not help.

Right, and this is a good reason why refcount is running into trouble,
it COW's too much in cases like that because userfaultfd-WP doesn't
align to the first assumption.

Jason

2021-12-18 01:54:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

[ Going back in the thread to this one ]

On Fri, Dec 17, 2021 at 1:15 PM Nadav Amit <[email protected]> wrote:
>
> I think that there is an assumption that once a page is COW-broken,
> it would never have another write-fault that might lead to COW
> breaking later.

Right. I do think there are problems in the current code, I just think
that the patches are a step back.

The problems with the current code are of two kinds:

- I think the largepage code (either THP or explicit hugetlb) doesn't
do as good a job of this whole COW handling as the regular pages do

- some of the "you can make pages read-only again explicitly" kinds of loads.

But honestly, at least for the second case, if somebody does a GUP,
and then starts playing mprotect games on the same virtual memory area
that they did a GUP on, and are surprised when they get another COW
fault that breaks their own connection with a page they did a GUP on
earlier, that's their own fault.

So I think there's some of "If you broke it, you get to keep both
pieces". Literally, in this case. You have your GUP page that you
looked up, and you have your virtual address page that you caused COW
on with mprotect() by making it read-only and then read-write again,
then you have two different pages, and at some point it really is just
"Well, don't do that then".

But yes, there's also some of "some code probably didn't get fully
converted to the new world order". So if VFIO only uses
FOLL_LONGTERM, and didn't ask for the COW breaking, then yes, VFIO
will see page incoherencies. But that should be an issue of "VFIO
should do the right thing".

So part of it is a combination of "if you do crazy things, you'll get
crazy results". And some of it is some kernel pinning code that
doesn't do the right thing to actually make sure it gets a shared page
to be pinned.

And then there's THP and HUGETLB, that I do think needs fixing and
aren't about those two kinds of cases.

I think we never got around to just doing the same thing we did for
regular pages. I think the hugepage code simply doesn't follow that
"COW on GUP, mark to not COW later" pattern.

Linus

2021-12-18 02:18:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 5:53 PM Linus Torvalds
<[email protected]> wrote:
>
>
> And then there's THP and HUGETLB, that I do think needs fixing and
> aren't about those two kinds of cases.
>
> I think we never got around to just doing the same thing we did for
> regular pages. I think the hugepage code simply doesn't follow that
> "COW on GUP, mark to not COW later" pattern.

In particular, do_huge_pmd_wp_page() has this pattern:

        /*
         * We can only reuse the page if nobody else maps the huge page or it's
         * part.
         */
        if (reuse_swap_page(page, NULL)) {
                ... mark it writable ...

and that never got converted to "only mark it writable if we actually
have exclusive access to this huge page".

So the problem is literally that reuse_swap_page() uses that
"page_mapcount()" logic, and doesn't take into account that the page
is actually used by a GUP reference.

Which is exactly why David then sees that "oh, we got a GUP reference
to it, and now we're seeing the writes come through". Because that
code looks at mapcount, and it shouldn't.

I think the hugepage code should use the exact same logic that the
regular wp fault code does.

Linus

2021-12-18 02:50:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 6:17 PM Linus Torvalds
<[email protected]> wrote:
>
> I think the hugepage code should use the exact same logic that the
> regular wp fault code does.

IOW, I think that this stupid (AND UNTESTED) patch should likely just
fix David's test-case with the hugepage and splice thing..

Or at least be somewhat close. But it should be paired with the GUP
side doing the right thing too, of course. Maybe it already does,
maybe it doesn't, I didn't check...

And maybe there's something subtle about the page_count() of a THP
entry. Again, I can't really claim to have tested this all, but I'm
hoping this makes somebody go "Ahh, now I see what Linus means"

Linus


Attachments:
patch.diff (1.25 kB)

2021-12-18 03:05:15

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 05:53:45PM -0800, Linus Torvalds wrote:

> But honestly, at least for the second case, if somebody does a GUP,
> and then starts playing mprotect games on the same virtual memory area
> that they did a GUP on, and are surprised when they get another COW
> fault that breaks their own connection with a page they did a GUP on
> earlier, that's their own fault.

I've been told there are real workloads that do this.

Something like qemu will use GUP with VFIO to insert PCI devices into
the guest and GUP with RDMA to do fast network copy of VM memory
during VM migration.

qemu also uses the WP games to implement dirty tracking of VM memory
during migration (and more? I'm not sure). It expects that during all
of this nothing will COW the pages, as the two kinds of DMA must
always go to the pages mapped to KVM.

The big trouble here is this all worked before, so it is a userspace
visible regression.

Can this be made to work at all? I wonder if qemu uses MAP_SHARED, eg
via a memfd or something, does the COW then go away naturally?

Jason

2021-12-18 03:30:39

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)



> On Dec 17, 2021, at 7:05 PM, Jason Gunthorpe <[email protected]> wrote:
>
> On Fri, Dec 17, 2021 at 05:53:45PM -0800, Linus Torvalds wrote:
>
>> But honestly, at least for the second case, if somebody does a GUP,
>> and then starts playing mprotect games on the same virtual memory area
>> that they did a GUP on, and are surprised when they get another COW
>> fault that breaks their own connection with a page they did a GUP on
>> earlier, that's their own fault.
>
> I've been told there are real workloads that do this.
>
> Something like qemu will use GUP with VFIO to insert PCI devices into
> the guest and GUP with RDMA to do fast network copy of VM memory
> during VM migration.
>
> qemu also uses the WP games to implement dirty tracking of VM memory
> during migration (and more? I'm not sure). It expects that during all
> of this nothing will COW the pages, as the two kinds of DMA must
> always go to the pages mapped to KVM.
>
> The big trouble here is this all worked before, so it is a userspace
> visible regression.

In such a case, I do think it makes sense to fail uffd-wp (when
page_count() > 1), and in a prototype I am working on I do something
like that. Otherwise, if the page is written and you use uffd for
dirty tracking, what do you actually achieve?

You can return EAGAIN (which is documented and actually returned
while “mmap_changing”) in such case. This would not break userspace,
but indeed still likely to cause a performance regression.


Attachments:
signature.asc (833.00 B)
Message signed with OpenPGP

2021-12-18 03:42:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 6:42 PM Linus Torvalds
<[email protected]> wrote:
>
> IOW, I think that this stupid (AND UNTESTED) patch should likely just
> fix David's test-case with the hugepage and splice thing..

Looking at that patch, the page lock is entirely pointless.

It existed because of that broken reuse_swap_page() that tried to
count page mappings etc, but once you get rid of that - like we got
rid of it for the regular pages - it's not even needed.

So as we hold the page table lock, and see a page_count() of 1, we
could be done without any page lock at all. So that whole
trylock/unlock is actually unnecessary.
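
(Roughly, i.e. an untested sketch of that shape: the reuse arm of
do_huge_pmd_wp_page() with the reuse_swap_page() test replaced by the
refcount test -- not necessarily the exact attached diff:)

        /*
         * vmf->ptl (the pmd lock) is held; a page_count() of 1 means we
         * hold the only reference, so nobody can be racing us for the page.
         */
        if (page_count(page) == 1) {
                pmd_t entry;

                entry = pmd_mkyoung(vmf->orig_pmd);
                entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
                if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
                        update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
                spin_unlock(vmf->ptl);
                return VM_FAULT_WRITE;
        }
        spin_unlock(vmf->ptl);
        /* ... otherwise fall back to copying the huge page ... */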

That said, it's possibly worth re-using any swap cache pages at this
point, and that would want the page lock. So some complexity in this
area is likely worth it. Similar to how we did it in commit
f4c4a3f48480 ("mm: free idle swap cache page after COW") for regular
pages.

So that patch is not great, but I think it works as a guiding one.

And notice how *simple* it is. It doesn't require careful counting of
swap entries that depend on page locking.

Linus

2021-12-18 03:45:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 7:30 PM Nadav Amit <[email protected]> wrote:
>
> In such a case, I do think it makes sense to fail uffd-wp (when
> page_count() > 1), and in a prototype I am working on I do something
> like that.

Ack. If uffd-wp finds a page that is pinned, just skip it as not
write-protectable.

Because some of the pinners might be writing to it, of course - just
not through the page tables.

So that sounds like the right thing to do. I _think_ we discussed this
the last time this came up. I have some dim memory of that. Jason,
ring a bell?

Linus

2021-12-18 04:03:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 3:53 PM Nadav Amit <[email protected]> wrote:
>
> I understand the discussion mainly revolves around correctness, which is
> obviously the most important property, but I would like to mention
> that having transient get_page() calls causing unnecessary COWs can
> cause hard-to-analyze and hard-to-avoid performance degradation.

Note that the COW itself is pretty cheap. Yes, there's the page
allocation and copy, but it's mostly a local thing.

So that falls under the "good to avoid" heading, but in the end it's
not an immense deal.

In contrast, the page lock has been an actual big user-visible latency
issue, to the point of correctness.

A couple of years ago, we literally had NMI watchdog timeouts due to
the page wait-queues growing basically boundlessly. This was some
customer internal benchmark code that I never saw, so it wasn't
*quite* clear exactly what was going on, but we ended up having to
split up the page wait list traversal using bookmark entries, because
it was such a huge latency issue.

That was mostly NUMA balancing faults, I think, but the point I'm
making is that avoiding the page lock can be a *much* bigger deal than
avoiding some local allocation and copying of a page of data. There
are real loads where the page-lock gets insanely bad, and I think it's
because we use it much too much.

See commit 2554db916586 ("sched/wait: Break up long wake list walk")
for some of that saga.

So I really think that having to serialize with the page lock in order
to do some "exact page use counting" is a false economy. Yes, maybe
you'd be able to avoid a COW or two, but at what locking cost?

Linus

2021-12-18 04:52:20

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)


> On Dec 17, 2021, at 8:02 PM, Linus Torvalds <[email protected]> wrote:
>
> On Fri, Dec 17, 2021 at 3:53 PM Nadav Amit <[email protected]> wrote:
>>
>> I understand the discussion mainly revolves around correctness, which is
>> obviously the most important property, but I would like to mention
>> that having transient get_page() calls causing unnecessary COWs can
>> cause hard-to-analyze and hard-to-avoid performance degradation.
>
> Note that the COW itself is pretty cheap. Yes, there's the page
> allocation and copy, but it's mostly a local thing.

I don’t know about the page-lock overhead, but I understand your argument.

Having said that, I do know a bit about TLB flushes, which you did not
mention as overheads of COW. Such flushes can be quite expensive on
multithreaded workloads (specifically on VMs, but let's put those aside).

Take for instance memcached and assume you overcommit memory with a very fast
swap (e.g., pmem, zram, perhaps even slower). Now, it turns out memcached
often accesses a page first for read and shortly after for write. I
encountered, in a similar scenario, that the page reference that
lru_cache_add() takes during the first faultin event (for read), causes a COW
on a write page-fault that happens shortly after [1]. So on memcached I
assume this would also trigger frequent unnecessary COWs.

Besides page allocation and copy, COW would then require a TLB flush, which,
when performed locally, might not be too bad (~200 cycles). But if memcached
has many threads, as it usually does, then you need a TLB shootdown and this
one can be expensive (microseconds). If you start getting a TLB shootdown
storm, you may avoid some IPIs since you see that other CPUs already queued
IPIs for the target CPU. But then the kernel would flush the entire TLB on
the target CPU, as it realizes that multiple TLB flushes were queued,
and as it assumes that a full TLB flush would be cheaper.

[ I can try to run a benchmark during the weekend to measure the impact, as I
did not really measure the impact on memcached before/after 5.8. ]

So I am in no position to prioritize one overhead over the other, but I do
not think that COW can be characterized as mostly-local and cheap in the
case of multithreaded workloads.
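
(To be concrete about where that flush comes from: it is roughly this
spot in wp_page_copy(), where the old pte is cleared before the private
copy is installed -- a simplified sketch of the mainline code:)

        /*
         * Installing the private copy: clearing the old pte is what forces
         * the TLB flush, and on a multithreaded mm that becomes a shootdown.
         */
        entry = mk_pte(new_page, vma->vm_page_prot);
        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
        ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
        page_add_new_anon_rmap(new_page, vma, vmf->address, false);
        lru_cache_add_inactive_or_unevictable(new_page, vma);
        set_pte_at_notify(mm, vmf->address, vmf->pte, entry);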


[1] https://lore.kernel.org/linux-mm/[email protected]/

2021-12-18 05:04:02

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Sat, Dec 18, 2021 at 04:52:13AM +0000, Nadav Amit wrote:
> Take for instance memcached and assume you overcommit memory with a very fast
> swap (e.g., pmem, zram, perhaps even slower). Now, it turns out memcached
> often accesses a page first for read and shortly after for write. I
> encountered, in a similar scenario, that the page reference that
> lru_cache_add() takes during the first faultin event (for read), causes a COW
> on a write page-fault that happens shortly after [1]. So on memcached I
> assume this would also trigger frequent unnecessary COWs.

Why are we comparing page_count() against 1 and not 1 + PageLRU(page)?
Having a reference from the LRU should be expected. Is it because of
some race that we'd need to take the page lock to protect against?

2021-12-18 05:23:32

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)



> On Dec 17, 2021, at 9:03 PM, Matthew Wilcox <[email protected]> wrote:
>
> On Sat, Dec 18, 2021 at 04:52:13AM +0000, Nadav Amit wrote:
>> Take for instance memcached and assume you overcommit memory with a very fast
>> swap (e.g., pmem, zram, perhaps even slower). Now, it turns out memcached
>> often accesses a page first for read and shortly after for write. I
>> encountered, in a similar scenario, that the page reference that
>> lru_cache_add() takes during the first faultin event (for read), causes a COW
>> on a write page-fault that happens shortly after [1]. So on memcached I
>> assume this would also trigger frequent unnecessary COWs.
>
> Why are we comparing page_count() against 1 and not 1 + PageLRU(page)?
> Having a reference from the LRU should be expected. Is it because of
> some race that we'd need to take the page lock to protect against?
>

IIUC, the reference that is taken on the page is taken before SetPageLRU()
is called and the reference is later dropped:

lru_add_drain()
  lru_add_drain_cpu()
    __pagevec_lru_add()
      __pagevec_lru_add_fn()
        SetPageLRU()    <- sets the LRU
      release_pages()   <- drops the reference

It is one scenario I encountered. There might be others that take transient
references on pages that cause unnecessary COWs. I think David and Andrea
had few in mind. To trigger a COW bug I once used mlock()/munlock() that
take such transient reference. But who knows how many other cases exist
(KSM? vmscan?)


2021-12-18 09:58:05

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 18.12.21 00:20, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 2:43 PM David Hildenbrand <[email protected]> wrote:
>>
>> The pages stay PageAnon(). swap-backed pages simply set a bit IIRC.
>> mapcount still applies.
>
> Our code-base is too large for me to remember all the details, but if
> we still end up having PageAnon for swapbacked pages, then mapcount
> can increase from another process faulting in an pte with that swap
> entry.

"Our code-base is too large for me to remember all the details". I
second that.

You might have a valid point with the mapcount regarding concurrent
swapin in the current code. I'll have to think further about whether it
could be a problem and whether it can be handled without heavy
synchronization (I think the concern is that gup unsharing could miss
doing an unshare because it doesn't detect page sharers that are not
expressed in the mapcount but only via swap entries, while we see
mapcount == 1).

Do you have any other concerns regarding the semantics/stability
regarding the following points (as discussed, fork() is not the issue
because it can be handled via write_protect_seq or something comparable.
handling per-process thingies is not the problem):

a) Using PageAnon(): It cannot possibly change in the pagefault path or
in the gup-fast-only path (otherwise there would be use-after-free
already).
b) Using PageKsm(): It cannot possibly change in the pagefault path or
in the gup-fast path (otherwise there would be use-after-free
already).
c) Using mapcount: It cannot possibly change in the way we care about or
cannot detect (mapcount going from == 1 to > 1 concurrently) in the
pagefault path or in the gup-fast path due to fork().

Your point for c) is that we might currently not handle swap
correctly. Any other concerns, especially regarding the mapcount, or is
that it?


IIUC, any GUP approach to detect necessary unsharing would at least
require a check for a) and b). What we're arguing about is c).

>
> And mmap_sem doesn't protect against that. Again, page_lock() does.
>
> And taking the page lock was a big performance issue.
>
> One of the reasons that new COW handling is so nice is that you can do
> things like
>
>         if (!trylock_page(page))
>                 goto copy;
>
> exactly because in the a/b world order, the copy case is always safe.
>
> In your model, as far as I can tell, you leave the page read-only and
> a subsequent COW fault _can_ happen, which means that now the
> subsequent COW needs to b every very careful, because if it ever
> copies a page that was GUP'ed, you just broke the rules.
>
> So COWing too much is a bug (because it breaks the page from the GUP),
> but COWing too little is an even worse problem (because it measn that
> now the GUP user can see data it shouldn't have seen).

Good summary, I'll extend below.

>
> Our old code literally COWed too little. It's why all those changes
> happened in the first place.

Let's see if we can agree on some things to get a common understanding.


What can happen with COW is:

1) Missed COW

We miss a COW, therefore someone has access to a wrong page.

This is the security issue as in patch #11. The security issue
documented in [1].

2) Unnecessary COW

We do a COW, but there are no other valid users, so it's just overhead +
noise.

The performance issue documented in section 5 in [1].

3) Wrong COW

We do a COW but there are other valid users (-> GUP).

The memory corruption issue documented in section 2 and 3 in [1].

Most notably, the io_uring reproducer which races with the
page_maybe_dma_pinned() check in current code can trigger this easily,
and exactly this issues is what gives me nightmares. [2]


Does that make sense? If we agree on the above, then here is how the
currently discussed approaches differ:

page_count != 1:
* 1) cannot happen
* 2) can happen easily (speculative references due to pagecache,
     migration, daemon, pagevec, ...)
* 3) can happen in the current code

mapcount > 1:
* 1) your concern is that this can happen due to concurrent swapin
* 2) cannot happen.
* 3) your concern is that this can happen due to concurrent swapin


If we can agree on that, I can see why you dislike mapcount, can you see
why I dislike page_count?

Ideally we'd really have a fast and reliable check for "is this page
shared and could get used by multiple processes -- either multiple
processes are already mapping it R/O or could map it via the swap R/O
later".


> This is why I'm pushing that whole story line of
>
> (1) COW is based purely on refcounting, because that's the only thing
> that obviously can never COW too little.

I am completely missing how 2) or 3) could *ever* be handled properly
for page_count != 1. 3) is obviously more important and gives me nightmares.


And that's what I'm trying to communicate the whole time: page_count is
absolutely fragile, because anything that results in a page getting
mapped R/O into a page table can trigger 3). And as [2] proves that can
even happen with *swap*.

(see how we're running into the same swap issues with both approaches?
Stupid swap :) )

>
> (2) GUP pre-COWs (the thing I called the "(a)" rule earlier) and then
> makes sure to not mark pinned pages COW again (that "(b)" rule).
>
> and here "don't use page_mapcount()" really is about that (1).
>
> You do seem to have kept (1) in that your COW rules don't seem to
> change (but maybe I missed it), but because your GUP-vs-COW semantics
> are very different indeed, I'm not at all convinced about (2).

Oh yes, sorry, not in the context of this series. The point is that the
current page_count != 1 covers mapcount > 1, so we can adjust that
separately later.


You mentioned "design", so let's assume we have a nice function:

/*
 * Check if an anon page is shared or exclusively used by a single
 * process: if shared, the page is shared by multiple processes either
 * mapping the page R/O ("active sharing") or having swap entries that
 * could result in the page getting mapped R/O ("inactive sharing").
 *
 * This function is safe to be called under mmap_lock in read/write mode
 * because it prevents concurrent fork() sharing the page.
 * This function is safe to be called from gup-fast-only in IRQ context,
 * as it detects concurrent fork() sharing the page.
 */
bool page_anon_shared();


Can we agree that that would be a suitable function for (1) and (2)
instead of using either the page_count or the mapcount directly? (yes,
how to actually make it reliable due to swapin is to be discussed, but
it might be a problem worth solving if that's the way to go)

For hugetlb, this would really have to use the mapcount as explained
(after all, fortunately there is no swap ...).
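
(To make that concrete, the trivial cases of such a helper could look
like the following. This is purely hypothetical -- page_anon_shared()
does not exist anywhere, and the sketch deliberately leaves out exactly
the swap hole we are discussing:)

        /*
         * Hypothetical sketch only: the hard part ("inactive sharing" via
         * swap entries that can be mapped R/O later) is not handled here.
         */
        static inline bool page_anon_shared(struct page *page)
        {
                VM_BUG_ON_PAGE(!PageAnon(page), page);

                if (PageKsm(page))
                        return true;    /* always potentially shared */
                if (PageHuge(page))
                        return page_mapcount(page) > 1;
                return page_mapcount(page) > 1 /* || has swap sharers ... */;
        }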



[1]
https://lore.kernel.org/all/[email protected]/

[2]
https://gitlab.com/aarcange/kernel-testcases-for-v5.11/-/blob/main/io_uring_swap.c
--
Thanks,

David / dhildenb


2021-12-18 10:07:06

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 18.12.21 03:42, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 6:17 PM Linus Torvalds
> <[email protected]> wrote:
>>
>> I think the hugepage code should use the exact same logic that the
>> regular wp fault code does.
>
> IOW, I think that this stupid (AND UNTESTED) patch should likely just
> fix David's test-case with the hugepage and splice thing..
>
> Or at least be somewhat close. But it should be paired with the GUP
> side doing the right thing too, of course. Maybe it already does,
> maybe it doesn't, I didn't check...
>
> And maybe there's something subtle about the page_count() of a THP
> entry. Again, I can't really claim to have tested this all, but I'm
> hoping this makes somebody go "Ahh, now I see what Linus means"

Not the reaction you are hoping for: "Gah, Linus still doesn't see why
the page_count is just wrong". :)

See the mail I just wrote, let's get a common understanding of how our
check should actually look like.

--
Thanks,

David / dhildenb


2021-12-18 14:24:08

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v1 02/11] mm: thp: consolidate mapcount logic on THP split

On Fri, Dec 17, 2021 at 12:30:40PM +0100, David Hildenbrand wrote:
> Let's consolidate the mapcount logic to make it easier to understand and
> to prepare for further changes.
>
> Reviewed-by: Peter Xu <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2021-12-18 14:31:32

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v1 03/11] mm: simplify hugetlb and file-THP handling in __page_mapcount()

On Fri, Dec 17, 2021 at 12:30:41PM +0100, David Hildenbrand wrote:
> Let's return early for hugetlb, which really only relies on the compound
> mapcount so far and does not support PageDoubleMap() yet. Use the chance
> to cleanup the file-THP case to make it easier to grasp. While at it, use
> head_compound_mapcount().
>
> This is a preparation for further changes.
>
> Reviewed-by: Peter Xu <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/util.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/mm/util.c b/mm/util.c
> index 741ba32a43ac..3239e75c148d 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page)
>  {
>          int ret;
>
> -        ret = atomic_read(&page->_mapcount) + 1;
> +        if (PageHuge(page))
> +                return compound_mapcount(page);

It would be nice to make PageHuge() inlinable first. It's a shame that we
need to do a function call for the PageHuge() check.

Otherwise, looks good:

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2021-12-18 14:35:12

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v1 04/11] mm: thp: simlify total_mapcount()

On Fri, Dec 17, 2021 at 12:30:42PM +0100, David Hildenbrand wrote:
> Let's simplify a bit, returning for PageHuge() early and using
> head_compound_page() as we are only getting called for HEAD pages.
>
> Note the VM_BUG_ON_PAGE(PageTail(page), page) check at the beginning of
> total_mapcount().
>
> This is a preparation for further changes.
>
> Reviewed-by: Peter Xu <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2021-12-18 18:37:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 9:03 PM Matthew Wilcox <[email protected]> wrote:
>
> Why are we comparing page_count() against 1 and not 1 + PageLRU(page)?
> Having a reference from the LRU should be expected. Is it because of
> some race that we'd need to take the page lock to protect against?

The LRU doesn't actually count towards a reference - the LRU list is
maintained independently of the lifetime of the page (and is torn down
on last release - which wouldn't work if the LRU list itself held a
ref to the page).

But at least some of the code that gathers up pages to then put them
on the LRU list takes a ref to the pages before passing them off, just
to guarantee that they stay around during the operation.

So yes, various things can increment page counts in a transitory manner.

I still *much* prefer a reliable COW over one that doesn't happen enough.

The page count can have these (on the whole fairly rare) blips. That's
ok. The page count is still *reliable*, in ways that the mapcount can
never be. The mapcount fundamentally doesn't show "other non-mapped
users".

So Nadav is correct that unnecessary cow events will cause extra work
(and the TLB flush is a good point - just marking a page writable
as-is is much cheaper).

But I'm looking at the actual code, and the actual logic, and I am
dismissing the whole mapcount games completely.

David has a 10-patch series (plus one test) of complex, grotty,
hard-to-understand code with new flags.

I posted a patch that removed 10 lines, and fixes the problem case his
test-case was designed for.

I think that really speaks to the issues.

My approach is *simpler* and a hell of a lot more robust. And as
mentioned, I can explain it.

And christ the thing I'm advocating for is WHAT WE ALREADY DO FOR
99.99% of all cases. Why? Because it's literally how the regular COW
paths work TODAY.

And we had benchmarks show performance improvements (or no movement at
all) from when we made those changes. Not the downsides that people
claim.

It's only the THP paths that are broken (and possibly some individual
mis-uses of GUP - people have mentioned virtio).

So now people are trying to do a fragile, complex thing that was
shown to be problematic for the common case, and they are doing it for
the insanely rare case? When a ten-line removal patch fixes that one
too?

Linus

PS. Yes, yes, that 10-line removal patch is obviously still not
tested, it's still likely incomplete because the THP case needs to do
the page-pinning logic on the other side too, so I'm very obviously
over-simplifying. But the fact that the *normal* pages already do this
correctly - and don't use mapcount - should really make people go
"Hmm".

2021-12-18 18:42:40

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 07:38:39PM -0800, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 7:30 PM Nadav Amit <[email protected]> wrote:
> >
> > In such a case, I do think it makes sense to fail uffd-wp (when
> > page_count() > 1), and in a prototype I am working on I do something
> > like that.
>
Ack. If uffd-wp finds a page that is pinned, just skip it as not
> write-protectable.
>
> Because some of the pinners might be writing to it, of course - just
> not through the page tables.

That doesn't address the qemu use case though. The RDMA pin is the
'coherent r/o pin' we discussed before, which requires that the pages
remain un-write-protected and the HW DMA is read only.

The VFIO pin will enable dirty page tracking in the system IOMMU so it
gets the same effect from qemu's perspective as the CPU WP is doing.

In these operations every single page of the guest will be pinned, so
skip it just means userfault fd wp doesn't work at all.

Qemu needs some solution to be able to dirty track the CPU memory for
migration..

> So that sounds like the right thing to do. I _think_ we discussed this
> the last time this came up. I have some dim memory of that. Jason,
> ring a bell?

We talked about clear_refs alot, but it was never really clear the use
case, I think. Plus that discussion never finalized to anything.

David's latest summary seems accurate, if I paraphrase at a high
level, Linus's approach always does enough COWs but might do extra and
David's approach tries to do exactly the right number of COWs.

It looks like to have the same functionality with Linus's approach we
need to have a way for userspace to opt out of COW and work in an
entirely deterministic non-COW world. WP&GUP can never work together
otherwise which leaves qemu stranded.

Or, we follow David's approach and make COW be precise and accept the
complexity..

Jason

2021-12-18 18:49:20

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 18.12.21 19:42, Jason Gunthorpe wrote:
> On Fri, Dec 17, 2021 at 07:38:39PM -0800, Linus Torvalds wrote:
>> On Fri, Dec 17, 2021 at 7:30 PM Nadav Amit <[email protected]> wrote:
>>>
>>> In such a case, I do think it makes sense to fail uffd-wp (when
>>> page_count() > 1), and in a prototype I am working on I do something
>>> like that.
>>
>> Ack. If uffd-wp finds a page that is pinned, just skip it as not
>> write-protectable.
>>
>> Because some of the pinners might be writing to it, of course - just
>> not through the page tables.
>
> That doesn't address the qemu use case though. The RDMA pin is the
> 'coherent r/o pin' we discussed before, which requires that the pages
> remain un-write-protected and the HW DMA is read only.
>
> The VFIO pin will enable dirty page tracking in the system IOMMU so it
> gets the same effect from qemu's perspective as the CPU WP is doing.
>
> In these operations every single page of the guest will be pinned, so
> skip it just means userfault fd wp doesn't work at all.
>
> Qemu needs some solution to be able to dirty track the CPU memory for
> migration..
>
>> So that sounds like the right thing to do. I _think_ we discussed this
>> the last time this came up. I have some dim memory of that. Jason,
>> ring a bell?
>
> We talked about clear_refs alot, but it was never really clear the use
> case, I think. Plus that discussion never finialized to anything.
>
> David's latest summary seems accurate, if I paraphrase at a high
> level, Linus's approach always does enough COWs but might do extra and
> David's approach tries to do exactly the right number of COWs.
>
> It looks like to have the same functionality with Linus's approach we
> need to have a way for userspace to opt out of COW and work in an
> entirely deterministic non-COW world. WP&GUP can never work together
> otherwise which leaves qemu stranded.
>
> Or, we follow David's approach and make COW be precise and accept the
> complexity..

Thanks Jason,

I would really enjoy us discussion how we can eventually make it
*precise* COW model work instead of living with a broken MM subsystem,
as all the reproducers show. IMHO we should stop throwing more band-aids
at it.

Is my approach complete? Sounds like it's not because Linus raised a
good point that the mapcount in the current state might not be stable
for our use case. I'm very happy that he reviewed this series.

I have some ideas to make the "_mapcount" of anonymous pages express
exactly that: how many active (PTE mapped) users do we have and how many
inactive (swap entries, migration entries) do we have. We can certainly
discuss any such approaches, but first there should be the will to try
getting it right ...

--
Thanks,

David / dhildenb


2021-12-18 19:21:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

[ Cutting down ruthlessly to the core of the issue ]

On Sat, Dec 18, 2021 at 1:58 AM David Hildenbrand <[email protected]> wrote:
>
> 1) Missed COW
>
> 2) Unnecessary COW
>
> 3) Wrong COW

> Does that make sense? If we agree on the above, then here is how the
> currently discussed approaches differ:
>
> page_count != 1:
> * 1) cannot happen
> * 2) can happen easily (speculative references due to pagecache,
> migration, daemon, pagevec, ...)
> * 3) can happen in the current code

I claim that (1) "cannot happen" is a huge mother of a deal. It's
*LITERALLY* the bug you are chasing, and it's the security issue, so
on a bug scale, it's about the worst there is.

I further then claim that (2) "happen easily" is you just making
things up. Yes, it can happen. But no, it's not actually that common,
and since (2) is harmless from a correctness standpoint, it is purely
about performance.

And as mentioned, not using the mapcount actually makes *common*
operations much simpler and faster. You don't need the page lock to
serialize the mapcount.

So (2) is a performance argument, and you haven't actually shown it to
be a problem.

Which really only leaves (3). Which I've already explained what the
fix is: don't ever mark pages that shouldn't be COW'ed as being COW
pages.

(3) is really that simple, although it ended up depending on Jason and
John Hubbard and others doing that FOLL_PIN logic to distinguish "I
just want to see a random page, and I don't care about COW" from "I
want to get a page, and that page needs to be coherent with this VM
and not be COW'ed away"

So I'm not claiming (3) is "trivial", but at the same time it's
certainly not some fundamentally complicated thing, and it's easy to
explain what is going on.

> mapcount > 1:
> * 1) your concern is that this can happen due to concurrent swapin
> * 2) cannot happen.
> * 3) your concern is that this can happen due to concurrent swapin

No, my concern about (1) is that IT IS WRONG.

"mapcount" means nothing for COW. I even gave you an example of
exactly where it means nothing. It's crazy. It's illogical. And it's
complicated as hell.

The fact that only one user maps a page is simply not meaningful. That
page can have other users that you don't know anything about, and that
don't show up in the mapcount.

That page can be swapcached, in which case mapcount can change
radically in ways that you earlier indicated cannot happen. You were
wrong.

But even if you fix that - by taking the page lock in every single
place - there are still *other* users that for all you know may want
the old contents. You don't know.

The only thing that says "no other users" is the page count. Not the mapcount.

In other words, I claim that

(a) mapcount is fundamentally the wrong thing to test. You can be the
only mapper, without being the "owner" of the page.

(b) it's *LITERALLY* the direct and present source of that bug in the
testcase you added, where a page with a mapcount of 1 has other
concurrent users and needs to be COW'ed but isn't.

(c) it's complicated and expensive to calculate (where one big part
of the expense is the page lock synchronization requirements, but
there are others)

And this all happens for that "case (1)", which is the worst and
scariest of them all.

In contrast to that, your argument that "(2) cannot happen" is a total
non-argument. (2) isn't the problem.

And I claim that (3) can happen because you're testing the wrong
counter, so who knows if the COW is wrong or not?

> I am completely missing how 2) or 3) could *ever* be handled properly
> for page_count != 1. 3) is obviously more important and gives me nightmares.

Ok, so if I tell you how (2) and (3) are handled properly, you will
just admit you were wrong?

Here's how they are handled properly with page counts. I have told you
this before, but I'll summarize:

(2) is handled semantically properly by definition - it may be
"unnecessary", but it has no semantic meaning

This is an IMPORTANT thing to realize. The fact is, (2) is not in the
same class as (1) or (3).

And honestly - we've been doing this for all the common cases already
since at least 5.9, and your performance argument simply has not
really reared its head. Which makes the whole argument moot. I claim
that it simplifies lots of common operations and avoids having to
serialize on a lock that has been a real and major problem. You claim
it's extra overhead and can cause extra COW events. Neither of us has any
numbers worth anything, but at least I can point to the fact that all
the *normal* VM paths have been doing the thing I advocate for many
releases now, and the sky most definitely is NOT falling.

So that only leaves (3).

Handling (3) really is so conceptually simple that I feel silly for
repeating it: if you don't want a COW to happen, then you mark the
page as being not-COW.

That sounds so simple as to be stupid. But it really is the solution.
It's what that pinning logic does, and keeps that "page may be pinned"
state around, and then operations like fork() that would otherwise
create a COW mapping of it will just not do it.

So that incredibly simple approach does require actual code: it
requires that explicit "fork() needs to copy instead of COW" code, it
requires that "if it's pinned, we don't make a new swapcache entry out
of it". So it's real code, and it's a real issue, but it's
conceptually absolutely trivial, and the code is usually really simple
to understand too.

So you have a *trivial* concept, and you have simple code that could
be described to a slightly developmentally challenged waterfowl. If
you're one of the programmers doing the "explain your code to a rubber
ducky", you can look at code like this:

                /*
                 * Anonymous process memory has backing store?
                 * Try to allocate it some swap space here.
                 * Lazyfree page could be freed directly
                 */
                if (PageAnon(page) && PageSwapBacked(page)) {
                        if (!PageSwapCache(page)) {
                                if (!(sc->gfp_mask & __GFP_IO))
                                        goto keep_locked;
                                if (page_maybe_dma_pinned(page))
                                        goto keep_locked;

and you can explain that page_maybe_dma_pinned() test to your rubber
ducky, and that rubber ducky will literally nod its head. It gets it.
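
(The fork() side of that rule reads much the same way. Roughly, the
helper as it exists in this timeframe -- quoted from memory, so treat
it as a sketch:)

        /*
         * Should fork() copy this page right now instead of COW-sharing it,
         * because somebody may have pinned it for DMA?
         */
        static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
                                                  struct page *page)
        {
                if (!is_cow_mapping(vma->vm_flags))
                        return false;
                if (!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags))
                        return false;
                return page_maybe_dma_pinned(page);
        }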

To recap:
(1) is important, and page_count() is the only thing that guarantees
"you get full access to a page only when it's *obviously* exclusively
yours".
(2) is NOT important, but could be a performance issue, but we have
real data from the past year that it isn't.
(3) is important, and has a really spectacularly simple conceptual
fix with quite simple code too.

In contrast, with the "mapcount" games you can't even explain why they
should work, and the patches I see are actively buggy because
everything is so subtle.

Linus

2021-12-18 19:53:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Sat, Dec 18, 2021 at 11:21 AM Linus Torvalds
<[email protected]> wrote:
>
> To recap:
> (1) is important, and page_count() is the only thing that guarantees
> "you get full access to a page only when it's *obviously* exclusively
> yours".
> (2) is NOT important, but could be a performance issue, but we have
> real data from the past year that it isn't.
> (3) is important, and has a really spectacularly simple conceptual
> fix with quite simple code too.
>
> In contrast, with the "mapcount" games you can't even explain why they
> should work, and the patches I see are actively buggy because
> everything is so subtle.

So to challenge you, please explain exactly how mapcount works to
solve (1) and (3), and how it incidentally guarantees that (2) doesn't
happen.

And that really involves explaining the actual code too. I can explain
the high-level concepts in literally a couple of sentences.

For (1), "the page_count()==1 guarantees you are the only owner, so a
COW event can re-use the page" really explains it. And the code is
pretty simple too. There's nothing subtle about "goto copy" when
pagecount is not 1. And even the locking is simple: "we hold the page
table lock, we found a page, it has only one ref to it, we own it"

Our VM is *incredibly* complicated. There really are serious
advantages to having simple rules in place.

And for (2), the simple rule is "yeah, we can cause spurious cow
events". That's not only simple to explain, it's simple to code for.
Suddenly you don't need to worry. "Copying the page is always safe".
That's a really really powerful statement.

Now, admittedly (3) is the one that ends up being more complicated,
but the *concept* sure is simple. "If you don't want to COW this page,
then don't mark it for COW".

The *code* for (3) is admittedly a bit more complicated. The "don't
mark it for COW" is simple to say, but we do have that fairly odd
locking thing with fork() doing a seqcount_write_begin/end, and then
GUP does the read-seqcount thing with retry. So it's a bit unusual,
and I don't think we have that particular pattern anywhere else, but
it's one well-defined lock and while unusual it's not *complicated* as
far as kernel locking rules go. It's unusual and perhaps not trivial,
but in the end those seqcount code sequences are maybe 10 lines total,
and they don't interact with anything else.
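
(For reference, the write side in copy_page_range() really is about
that small -- trimmed:)

        if (is_cow) {
                /*
                 * fork() holds mmap_lock for write; flipping the seqcount to
                 * odd is what gup-fast's retry check observes.
                 */
                mmap_assert_write_locked(src_mm);
                raw_write_seqcount_begin(&src_mm->write_protect_seq);
        }

        /* ... copy the page tables; pinned pages get copied, not shared ... */

        if (is_cow)
                raw_write_seqcount_end(&src_mm->write_protect_seq);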

And yes, the "don't mark it for COW" means that write-protecting
something is special, mainly because we sadly do not have extra bits
in the page tables. It would be *really* easy if we could just hide
this "don't COW this page" in the page table. Truly trivial. We don't,
because of portability across different architectures ;(

So I'll freely give you that my (3) is somewhat painful, but it's
painful with a really simple concept.

And the places that get (3) wrong are generally places that nobody has
been able to care about. I didn't realize the problem with creating a
swap page after the fact for a while, so that commit feb889fb40fa
("mm: don't put pinned pages into the swap cache") came later, but
it's literally a very simple two-liner.

The commit message for commit feb889fb40fa may be worth reading. It
very much explains the spirit of the thing, and is much longer than
the trivial patch itself.

Simple and clear concepts matter. Code gets complicated even then, but
complex code with complex concepts is a bad combination.

Linus

2021-12-18 21:49:12

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)



> On Dec 18, 2021, at 10:42 AM, Jason Gunthorpe <[email protected]> wrote:
>
> On Fri, Dec 17, 2021 at 07:38:39PM -0800, Linus Torvalds wrote:
>> On Fri, Dec 17, 2021 at 7:30 PM Nadav Amit <[email protected]> wrote:
>>>
>>> In such a case, I do think it makes sense to fail uffd-wp (when
>>> page_count() > 1), and in a prototype I am working on I do something
>>> like that.
>>
>> Ack. If uffd-wp finds a page that is pinned, just skip it as not
>> write-protectable.
>>
>> Because some of the pinners might be writing to it, of course - just
>> not through the page tables.
>
> That doesn't address the qemu use case though. The RDMA pin is the
> 'coherent r/o pin' we discussed before, which requires that the pages
> remain un-write-protected and the HW DMA is read only.
>
> The VFIO pin will enable dirty page tracking in the system IOMMU so it
> gets the same effect from qemu's perspective as the CPU WP is doing.
>
> In these operations every single page of the guest will be pinned, so
> skip it just means userfault fd wp doesn't work at all.
>
> Qemu needs some solution to be able to dirty track the CPU memory for
> migration..

My bad. I misunderstood the scenario.

Yes, I guess that you pin the pages early for RDMA registration, which
is also something you may do for IO-uring buffers. This would render
userfaultfd unusable.

I do not see how it can be solved without custom, potentially
complicated logic, which the page_count() approach wants to avoid.

The only thing I can think of is requiring the pinned regions to be
first madvise’d with MADV_DONTFORK and not COW’ing in such case.
But this would break existing code though.

2021-12-18 22:52:09

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 17, 2021 at 12:45:45PM -0800, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 12:42 PM David Hildenbrand <[email protected]> wrote:
> >
> > > Then somebody else modified that page, and you got exactly what you
> > > asked for - a COW event. The original R/O pin has the original page
> > > that it asked for, and can read it just fine.
> >
> > Where in the code did I ask for a COW event? I asked for a R/O pin, not
> > any kind of memory protection.
>
> Why didn't you ask for a shared pin, if that is what you want?
>
> We already support that.
>
> If you don't like the read-only pins, don't use them. It's that simple.

So you are saying that if a GUP user wants to see changes made by
userspace to the page after the GUP, it must ask for FOLL_WRITE, even if it
doesn't intend to write to the page?

That's news to me.

Or did I misunderstand you?

--
Kirill A. Shutemov

2021-12-18 22:53:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Sat, Dec 18, 2021 at 1:49 PM Nadav Amit <[email protected]> wrote:
>
> Yes, I guess that you pin the pages early for RDMA registration, which
> is also something you may do for IO-uring buffers. This would render
> userfaultfd unusable.

I think this is all on userfaultfd.

That code literally stole two of the bits from the page table layout -
bits that we could have used for better things.

And guess what? Because it required those two bits in the page tables,
and because that's non-portable, it turns out that UFFD_WP can only be
enabled and only works on x86-64 in the first place.

So UFFD_WP is fundamentally non-portable. Don't use it.

Anyway, the good news is that I think that exactly because uffd_wp
stole two bits from the page table layout, it already has all the
knowledge it needs to handle this entirely on its own. It's just too
lazy to do so now.

In particular, it has that special UFFD_WP bit that basically says
"this page is actually writable, but I've made it read-only just to
get the fault for soft-dirty".

And the hint here is that if the page truly *was* writable, then COW
just shouldn't happen, and all that the page fault code should do is
set soft-dirty and return with the page set writable.

And if the page was *not* writable, then UFFD_WP wasn't actually
needed in the first place, but the uffd code just sets it blindly.

Notice? It _should_ be just an operation based purely on the page
table contents, never even looking at the page AT ALL. Not even the
page count, much less some mapcount thing.

Done right, that soft-dirty thing could work even with no page backing
at all, I think.

But as far as I know, we've actually never seen a workload that does
all this, so.. Does anybody even have a test-case?

Because I do think that UFFD_WP really should never really look at the
page, and this issue is actually independent of the "page_count() vs
page_mapcount()" discussion.

(Somewhat related aside: Looking up the page is actually one of the
more expensive operations of a page fault and a lot of other page
table manipulation functions - it's where most of the cache misses
happen. That's true on the page fault side, but it's also true for
things like copy_page_range() etc)

Linus

2021-12-18 23:12:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Sat, Dec 18, 2021 at 2:52 PM Kirill A. Shutemov <[email protected]> wrote:
>
> So you are saying that if a GUP user wants to see changes made by
> userspace to the page after the GUP, it must ask for FOLL_WRITE, even if it
> doesn't intend to write to the page?

Yup. Put the onus very clearly on GUP.

It's a very special operation, and it's one of the operations that
cause a lot of problems for the VM code. It's by no means the _only_
one - we've got a lot of other things that cause issues - but I think
it's very clear that GUP is very very special, and nobody should say
"I want GUP to do whatever".

There are two cases for GUP:

- look up the page as it is *NOW*

- look up the page in order to see any future state on it (and
possibly modify it)

that "any future state" is a fundamentally much heavier operation.
It's the one that really *has* to get rid of any sharing, and it has
to make sure no sharing happens in the future either.

So if it is an anonymous page, it basically needs to act like a write.
Even if that page is then used only for reading.

Note that here that "if it's anonymous" is kind of a big deal. If it's
a shared file-mapped page, at that point it's "just another
reference". It's potentially problematic even in that case (think of
"truncate()" that tries to force-unmap all pages from VM's), but for
the shared case the answer is "if you truncate it and disassociate the
page from the mapping, it's _your_ problem.

And once it acts as a write, and once it does that COW and we have
exclusive access to it, it might as well be just writable and dirty.
You've done the expensive part already. You've forced it to be private
to that VM.

And this was all triggered by the user doing something very special,
so no amount of "but POSIX" or whatever matters.

GUP is not great. If you use GUP, you get to deal with the downsides.

Linus

2021-12-19 00:19:40

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)



> On Dec 18, 2021, at 2:53 PM, Linus Torvalds <[email protected]> wrote:
>
> On Sat, Dec 18, 2021 at 1:49 PM Nadav Amit <[email protected]> wrote:
>>
>> Yes, I guess that you pin the pages early for RDMA registration, which
>> is also something you may do for IO-uring buffers. This would render
>> userfaultfd unusable.
>
> I think this is all on userfaultfd.
>
> That code literally stole two of the bits from the page table layout -
> bits that we could have used for better things.
>
> And guess what? Because it required those two bits in the page tables,
> and because that's non-portable, it turns out that UFFD_WP can only be
> enabled and only works on x86-64 in the first place.
>
> So UFFD_WP is fundamentally non-portable. Don't use it.

I have always felt that the PTE software-bits limit is very artificial.
We can just allocate two adjacent pages when needed, one for PTEs and
one for extra software bits. A software bit in the PTE can indicate that
“extra software bits” are relevant (to save cache misses), and a bit
in the PTEs' page-struct indicates whether there is an adjacent “extra
software bits” page.

I’ve done something similar once in a research project. It is
rather similar to what is done for PTI at the PGD level.
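
A rough sketch of what such a side allocation could look like (all helper
names here are hypothetical; wiring this into pte_alloc_one()/pte_free()
and the fault paths is the real work and is not shown):

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Pair each PTE page with a shadow page holding one byte of software
 * state per PTE slot.  The shadow page is stashed in the PTE page's
 * struct page, so looking it up costs at most one extra cache miss.
 */
static int pte_page_alloc_shadow(struct page *pte_page)
{
        struct page *shadow = alloc_page(GFP_KERNEL | __GFP_ZERO);

        if (!shadow)
                return -ENOMEM;

        set_page_private(pte_page, (unsigned long)shadow);
        return 0;
}

/* One software byte per PTE slot; indexed like the PTE itself. */
static u8 *pte_extra_bits(struct page *pte_page, unsigned long address)
{
        struct page *shadow = (struct page *)page_private(pte_page);

        return (u8 *)page_address(shadow) + pte_index(address);
}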


>
> Anyway, the good news is that I think that exactly because uffd_wp
> stole two bits from the page table layout, it already has all the
> knowledge it needs to handle this entirely on its own. It's just too
> lazy to do so now.
>
> In particular, it has that special UFFD_WP bit that basically says
> "this page is actually writable, but I've made it read-only just to
> get the fault for soft-dirty".
>
> And the hint here is that if the page truly *was* writable, then COW
> just shouldn't happen, and all that the page fault code should do is
> set soft-dirty and return with the page set writable.
>
> And if the page was *not* writable, then UFFD_WP wasn't actually
> needed in the first place, but the uffd code just sets it blindly.

I don’t think that I am following. The write-protection of UFFD means
that the userspace wants to intervene before anything else (including
COW).

UFFD_WP indications are recorded per PTE (i.e., not VMA). So if
userspace wants to intervene on write, it must use UFFD_WP even if
the page is write-protected. The kernel then has to keep the UFFD_WP
indication to call userspace upon a write.


>
> Notice? It _should_ be just an operation based purely on the page
> table contents, never even looking at the page AT ALL. Not even the
> page count, much less some mapcount thing.
>
> Done right, that soft-dirty thing could work even with no page backing
> at all, I think.
>
> But as far as I know, we've actually never seen a workload that does
> all this, so.. Does anybody even have a test-case?
>
> Because I do think that UFFD_WP really should never really look at the
> page, and this issue is actually independent of the "page_count() vs
> page_mapcount()" discussion.

I can think of two examples for reasonable flows of UFFD:

[ M = Monitor thread; F = Faulting thread ]

(A) Userspace page-fault tracking (e.g., for memory migration):

1. M: WP memory.
2. F: WP page-fault: provide UFFD notification.
3. M: Unprotect the page.
4. M: Wake the faulting thread (usually as part of the unprotect)
5. F: Retry the page-fault (and handle COW).

(B) Userspace memory snapshots:

1. M: Write-protect memory.
2. M: Copy the memory to disk.
3. M: Write-unprotect memory (e.g., speculatively as you expect a page
to be written to and do not want to pay the #PF price).

[ notice that the un-protection is logical, not really in the PTEs]

4. F: Get a page-fault (later) and handle it (because it might or
might not need COW)
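
For reference, a minimal monitor-side sketch of the protect/unprotect steps
in (A) and (B), using the uffd-wp range ioctl (error handling and the
UFFDIO_REGISTER setup are omitted; 'uffd' is assumed to be registered with
UFFDIO_REGISTER_MODE_WP for the range):

#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>

static void wp_range(int uffd, void *addr, size_t len, int protect)
{
        struct uffdio_writeprotect wp = {
                .range = {
                        .start = (unsigned long)addr,
                        .len   = len,
                },
                .mode = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0,
        };

        /* Unprotecting without DONTWAKE also wakes the faulting thread
         * (step 4 in flow A). */
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}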

There may be “crazier” flows (e.g., wake the faulting thread and
emulate the instruction that triggered the write with ptrace), but
let’s put those aside.

IIUC the solution you propose, it tries to address flows such as (A).

I am not sure whether the proposal is to change the write-protection
API to only provide notifications (i.e., not block to after page-fault
as done today), but I do not see how it addresses (B).

I am not saying it is impossible, but I think that the solution would
complicate the code by making UFFD a special case.

2021-12-19 00:54:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Sat, Dec 18, 2021 at 4:19 PM Nadav Amit <[email protected]> wrote:
>
> I have always felt that the PTE software-bits limit is very artificial.
> We can just allocate two adjacent pages when needed, one for PTEs and
> one for extra software bits. A software bit in the PTE can indicate
> “extra software bits” are relevant (to save cache-misses), and a bit
> in the PTEs' page-struct indicate whether there is adjacent “extra
> software bits” page.

Hmm. That doesn't sound very bad, no. And it would be nice to have
more software bits (and have them portably).

> I don’t think that I am following. The write-protection of UFFD means
> that the userspace wants to intervene before anything else (including
> COW).

The point I was making (badly) is that UFFD_WP is only needed for
the case where the pte isn't already non-writable for other reasons.

> UFFD_WP indications are recorded per PTE (i.e., not VMA).

The changing of those bits is basically a bastardized 'mprotect()',
and does already require the vma to be marked VM_UFFD_WP.

And the way you set (or clear) the bits is with a range operation. It
really could have been done with mprotect(), and with actual explicit
vma bits.

The fact that it now uses the page table bit is rather random. I think
it would actually be cleaner to make that userfaultfd_writeprotect
truly *be* a vma range.

Right now it's kind of "half this, half that".

Of course, it's possible that because of this situation, some users do
a lot of fine-grained VM_UFFD_WP setting, and they kind of expect to
not have issues with lots of vma fragments. So practical concerns may
have made the implementation set in stone.

(I have only ever seen the kernel side of uffd, not the actual user
side, so I'm not sure about the use patterns).

That said, your suggestion of a shadow sw page table bit thing would
also work. And it would solve some problems we have in core areas
(notably "page_special()" which right now has that
ARCH_HAS_PTE_SPECIAL thing).

It would make it really easy to have that "this page table entry is
pinned" flag too.

Linus

2021-12-19 06:02:08

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)


> On Dec 18, 2021, at 4:35 PM, Linus Torvalds <[email protected]> wrote:
>
> (I have only ever seen the kernel side of uffd, not the actual user
> side, so I'm not sure about the use patterns).

I use it at a very fine granularity, and I suspect QEMU and CRIU do so
too.

>
> That said, your suggestion of a shadow sw page table bit thing would
> also work. And it would solve some problems we have in core areas
> (notably "page_special()" which right now has that
> ARCH_HAS_PTE_SPECIAL thing).
>
> It would make it really easy to have that "this page table entry is
> pinned" flag too.

I found my old messy code for the software-PTE thing.

I see that eventually I decided to hold a pointer to the “extra PTEs”
of each page in the PMD-page-struct. [ I also implemented the 2-adjacent
pages approach but this code is long gone. ]

My rationale was that:

1. It does not bound you to have the same size for PTE and “extra-PTE”
2. The PMD-page struct is anyhow hot (since you acquired the PTL)
3. Allocating “extra-PTE” dynamically does not require rewiring the
page-tables, which requires a TLB flush.

I think there is a place to hold a pointer in the PMD-page-struct
(_pt_pad_1, we just need to keep the lowest bit clear so the kernel
won’t mistake it for a compound page).

I still don’t know what exactly you have in mind for making use
out of it for the COW issue. Keeping a pin-count (which requires
internal API changes for unpin_user_page() and friends?) or having
“was ever pinned” sticky bit? And then changing
page_needs_cow_for_dma() to look at the PTE so copy_present_pte()
would break the COW eagerly?

Anyhow, I can clean it up and send (although it is rather simple
and I ignored many things, such as THP, remap, etc.), but I am not
sure I have the time now to fully address the COW problem. I will
wait for Monday for David’s response.

2021-12-19 08:02:07

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 12/18/21 22:02, Nadav Amit wrote:
>
>> On Dec 18, 2021, at 4:35 PM, Linus Torvalds <[email protected]> wrote:
>>
>> (I have only ever seen the kernel side of uffd, not the actual user
>> side, so I'm not sure about the use patterns).
>
> I use it at a very fine granularity, and I suspect QEMU and CRIU do so
> too.
>
>>
>> That said, your suggestion of a shadow sw page table bit thing would
>> also work. And it would solve some problems we have in core areas
>> (notably "page_special()" which right now has that
>> ARCH_HAS_PTE_SPECIAL thing).
>>
>> It would make it really easy to have that "this page table entry is
>> pinned" flag too.
>
> I found my old messy code for the software-PTE thing.
>
> I see that eventually I decided to hold a pointer to the “extra PTEs”
> of each page in the PMD-page-struct. [ I also implemented the 2-adjacent
> pages approach but this code is long gone. ]
>
> My rationale was that:
>
> 1. It does not bound you to have the same size for PTE and “extra-PTE”
> 2. The PMD-page struct is anyhow hot (since you acquired the PTL)
> 3. Allocating “extra-PTE” dynamically does not require rewiring the
> page-tables, which requires a TLB flush.
>
> I think there is a place to hold a pointer in the PMD-page-struct
> (_pt_pad_1, we just need to keep the lowest bit clear so the kernel
> won’t mistake it for a compound page).
>
> I still don’t know what exactly you have in mind for making use
> out of it for the COW issue. Keeping a pin-count (which requires
> internal API changes for unpin_user_page() and friends?) or having
> “was ever pinned” sticky bit? And then changing
> page_needs_cow_for_dma() to look at the PTE so copy_present_pte()
> would break the COW eagerly?
>
> Anyhow, I can clean it up and send (although it is rather simple
> and I ignored many things, such as THP, remap, etc.), but I am not
> sure I have the time now to fully address the COW problem. I will
> wait for Monday for David’s response.
>

Hi Nadav,

A couple of thoughts about this part of the design:

a) The PMD-page-struct approach won't help as much, because (assuming
that we're using it in an attempt to get a true, perfect pin count), you
are combining the pin counts of a PMD's worth of pages. OTOH...maybe
that actually *is* OK, assuming you don't overflow--except that you can
only answer the "is it dma-pinned?" question at a PMD level. That's a
contradiction of your stated desire above to have very granular control.

Also, because of not having bit 0 available in page._pt_pad_1, I think
the count would have to be implemented as adding and subtracting 2,
instead of 1 (in order to keep the value even), further reducing the
counter range.

b) If, instead, you used your older 2-adjacent pages approach, then
Linus' comment makes more sense here: we could use the additional struct
page to hold an exact pin count, per page. That way, you can actually
build a wrapper function such as:

page_really_is_dma_pinned()

...and/or simply get a stronger "maybe" for page_maybe_dma_pinned().

Furthermore, this direction is extensible and supports solving other "I
am out of space in struct page" problems, at the cost of more pages, of
course.

As an aside, I find it instructive that we're talking about this
approach, instead of extending struct page. The lesson I'm taking away
is: allocating more space for some cases (2x pages) is better than
having *all* struct pages be larger than they are now.

Anyway, the pin count implementation would look somewhat like the
existing hpage_pincount, which similarly has ample space for a separate,
exact pin count. In other words, this sort of thing (mostly-pseudo
code):


diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..646761388025 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -938,6 +938,16 @@ static inline bool hpage_pincount_available(struct page *page)
return PageCompound(page) && compound_order(page) > 1;
}

+static inline bool shadow_page_pincount_available(struct page *page)
+{
+ /*
+ * TODO: Nadav: connect this up with the shadow page table
+ * implementation, and return an appropriate answer.
+ */
+
+ return false; // hardcoded for now, for compile testing
+}
+
static inline int head_compound_pincount(struct page *head)
{
return atomic_read(compound_pincount_ptr(head));
@@ -950,6 +960,13 @@ static inline int compound_pincount(struct page *page)
return head_compound_pincount(page);
}

+static inline int shadow_page_pincount(struct page *page)
+{
+ VM_BUG_ON_PAGE(!shadow_page_pincount_available(page), page);
+
+ return atomic_read(shadow_page_pincount_ptr(page));
+}
+
static inline void set_compound_order(struct page *page, unsigned int order)
{
page[1].compound_order = order;
@@ -1326,6 +1343,9 @@ static inline bool page_maybe_dma_pinned(struct page *page)
if (hpage_pincount_available(page))
return compound_pincount(page) > 0;

+ if (shadow_page_pincount_available(page))
+ return shadow_page_pincount(page) > 0;
+
/*
* page_ref_count() is signed. If that refcount overflows, then
* page_ref_count() returns a negative value, and callers will avoid

c) The "was it ever pinned" sticky bit is not a workable concept, at the
struct page level. A counter is required, in order to allow pages to
live out their normal lives to their fullest potential. The only time we
even temporarily got away with this kind of stickiness was at a higher
level, and only per-process, not per-physical-page. Processes come and
go, but the struct pages are more or less forever, so once you mark one
sticky like this, it's out of play.

thanks,
--
John Hubbard
NVIDIA

2021-12-19 08:43:36

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 18.12.21 20:52, Linus Torvalds wrote:
> On Sat, Dec 18, 2021 at 11:21 AM Linus Torvalds
> <[email protected]> wrote:
>>
>> To recap:
>> (1) is important, and page_count() is the only thing that guarantees
>> "you get full access to a page only when it's *obviously* exclusively
>> yours".
>> (2) is NOT important, but could be a performance issue, but we have
>> real data from the past year that it isn't.
>> (3) is important, and has a really spectacularly simple conceptual
>> fix with quite simple code too.
>>
>> In contrast, with the "mapcount" games you can't even explain why they
>> should work, and the patches I see are actively buggy because
>> everything is so subtle.
>
> So to challenge you, please explain exactly how mapcount works to
> solve (1) and (3), and how it incidentally guarantees that (2) doesn't
> happen.

Oh, there is no need for additional challenges, I've been challenged
with this problem for way too long already ;)

And again, I appreciate this discussion and your feedback. I want to get
all of this fixed ASAP, even if it's not going to be the way I propose
as I raised. Any input is appreciated (as long as people don't scream at
me).


To get to your point: I thought about your remarks with the swapcount
and it makes sense. The mapcount itself is not expressive enough to
catch mapcount == 1 vs. mapcount > 1.

What *would* work to have precise COW I think is having the active and
inactive count instead of just the active (mapcount) part, whereby:

active: page tables mapping this page -> mapcount
inactive: page tables referencing this page via swap/migration entry

An alternative would be to just know that there are inactive users. We'd
have to read both values atomically in one shot.

There would be ways to store that information in the _mapcount but it
certainly adds a lot of complexity, and ...

>
> And that really involves explaining the actual code too. I can explain
> the high-level concepts in literally a couple of sentences.
>
> For (1), "the page_count()==1 guarantees you are the only owner, so a
> COW event can re-use the page" really explains it. And the code is
> pretty simple too. There's nothing subtle about "goto copy" when
> pagecount is not 1. And even the locking is simple: "we hold the page
> table lock, we found a page, it has only one ref to it, we own it"
>
> Our VM is *incredibly* complicated. There really are serious
> advantages to having simple rules in place.

... you have a point here.

That said, I hope we can agree that the "page_count" approach is not the
perfect solution. I hope we can at least tweak it for now to get rid of
3) Wrong COW.

>
> And for (2), the simple rule is "yeah, we can cause spurious cow
> events". That's not only simple to explain, it's simple to code for.
> Suddenly you don't need to worry. "Copying the page is always safe".
> That's a really really powerful statement.
>
> Now, admittedly (3) is the one that ends up being more complicated,
> but the *concept* sure is simple. "If you don't want to COW this page,
> then don't mark it for COW".
>
> The *code* for (3) is admittedly a bit more complicated. The "don't
> mark it for COW" is simple to say, but we do have that fairly odd
> locking thing with fork() doing a seqcount_write_begin/end, and then
> GUP does the read-seqcount thing with retry. So it's a bit unusual,
> and I don't think we have that particular pattern anywhere else, but
> it's one well-defined lock and while unusual it's not *complicated* as
> far as kernel locking rules go. It's unusual and perhaps not trivial,
> but in the end those seqcount code sequences are maybe 10 lines total,
> and they don't interact with anything else.
>
> And yes, the "don't mark it for COW" means that write-protecting
> something is special, mainly because we sadly do not have extra bits
> in the page tables. It would be *really* easy if we could just hide
> this "don't COW this page" in the page table. Truly trivial. We don't,
> because of portability across different architectures ;(
>
> So I'll freely give you that my (3) is somewhat painful, but it's
> painful with a really simple concept.

Thanks for admitting that!

I might have had an idea yesterday on how to fix most of the issues
without relying on the mapcount, doing it similar (but slightly
different) as you propose here. Let's call it a mixture of the unsharing
approach and your approach. I cannot promise anything, so ...

... I'll go playing with it and share some details ASAP. At least it
sounds comparatively simple in my head.

>
> And the places that get (3) wrong are generally places that nobody has
> been able to care about. I didn't realize the problem with creating a
> swap page after the fact for a while, so that commit feb889fb40fa
> ("mm: don't put pinned pages into the swap cache") came later, but
> it's literally a very simple two-liner.
>

Just to give you my perspective:


Personally I don't care too much about 2). The only reasons why I somewhat
care about "Unnecessary COW" are:
* It's challenging for hugetlb use, as I explained. We might still want
to use the mapcount there.
* It's mostly a symptom of our eventually too simple COW logic that
effectively leads to 3).

While I do care about 1) (Missed COW) for our customers, I *especially*
care about 3) (Wrong COW), simply because silent memory corruptions in
user space are not acceptable.

As you say, fixing 1) the "page_count" way might be easy, at least for THP.

Simple example: have swapping enabled and register a fixed io_uring
buffer at the wrong time. Fixed io_uring buffers are now a commodity
feature for unprivileged user space ...

So that's why I so deeply care about all of this.

--
Thanks,

David / dhildenb


2021-12-19 11:30:38

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Sun, Dec 19, 2021 at 12:01:59AM -0800, John Hubbard wrote:
> On 12/18/21 22:02, Nadav Amit wrote:
> > I found my old messy code for the software-PTE thing.
> >
> > I see that eventually I decided to hold a pointer to the “extra PTEs”
> > of each page in the PMD-page-struct. [ I also implemented the 2-adjacent
> > pages approach but this code is long gone. ]
>
> a) The PMD-page-struct approach won't help as much, because (assuming
> that we're using it in an attempt to get a true, perfect pin count), you
> are combining the pin counts of a PMD's worth of pages. OTOH...maybe
> that actually *is* OK, assuming you don't overflow--except that you can
> only answer the "is it dma-pinned?" question at a PMD level. That's a
> contradiction of your stated desire above to have very granular control.
>
> Also, because of not having bit 0 available in page._pt_pad_1, I think
> the count would have to be implemented as adding and subtracting 2,
> instead of 1 (in order to keep the value even), further reducing the
> counter range.

I think you misunderstood Nadav's approach. He's talking about making
an extra side-allocation per PMD if you're using uffd, and storing
extra information in it. I think it's a worthwhile approach.

2021-12-19 17:36:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Sat, Dec 18, 2021 at 10:02 PM Nadav Amit <[email protected]> wrote:
>
> I found my old messy code for the software-PTE thing.
>
> I see that eventually I decided to hold a pointer to the “extra PTEs”
> of each page in the PMD-page-struct. [ I also implemented the 2-adjacent
> pages approach but this code is long gone. ]

Ok, I understand why that ends up being the choice, but it makes it
too ugly and messy to look up to be worth it, I think.

> I still don’t know what exactly you have in mind for making use
> out of it for the COW issue.

So the truly fundamental question for COW (and for a long-term GUP) is
fairly simple:

- Is the page I have truly owned exclusively by this VM?

If that _isn't_ the case, you absolutely have to COW.

If that _is_ the case, you can re-use the page.

That is really it, boiled down to the pure basics.

And if you aren't sure whether you are the ultimate and only authority
over the page, then COW is the "safer" option, in that breaking
sharing is fundamentally better than over-sharing.

Now, the reason I like "page_count()==1" is that it is a 100% certain
way to know that you own the page absolutely and clearly.

There is no question what-so-ever about it.

And the reason I hate "page_mapcount()==1" with a passion is that it
is NOTHING OF THE KIND. It is an entirely meaningless number. It
doesn't mean anything at all.

Even if the page mapcount is exactly right, it could easily and
trivially be a result of "fork, then unmap in either parent or child".

Now that page_mapcount() is unquestionably 1, but despite that, at
some point the page was shared by another VM, and you can not know
whether you really have exclusive access.

And that "even if page mapcount is exactly right" is a big issue in
itself, as I hope I've explained.

It requires page locking, it requires that you take swapcache users
into account, it is just a truly messy and messed up thing.

There really is absolutely no reason for page_mapcount to exist. It's
a mistake. We have it for completely broken historical reasons.

It's WRONG.

Now, if "page_count()==1" is so great, what is the issue? Problem solved.

No, while page_count()==1 is one really fundamental marker (unlike the
mapcount), it does have problems too.

Because yes, "page_count()==1" does mean that you have truly exclusive
ownership of the page, but the reverse is not true.

The way the current regular VM code handles that "the reverse is not
true" is by making "the page is writable" be the second way you can
say "you clearly have full ownership of the page".

So that's why you then have the "maybe_pinned()" thing in fork() and
in swap cache creation that keeps such a page writable, and doesn't do
the virtual copy and make it read-only again.
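
Roughly, the fork()-side heuristic being referred to looks like the
following (a paraphrase of page_needs_cow_for_dma(), simplified and not
the literal code; it may give false positives but must never give false
negatives):

#include <linux/mm.h>
#include <linux/sched/coredump.h>

static bool fork_should_copy_immediately(struct vm_area_struct *src_vma,
                                         struct page *page)
{
        /* Only private, COW-able mappings are affected. */
        if (!is_cow_mapping(src_vma->vm_flags))
                return false;

        /* Cheap per-mm filter: has this mm ever pinned anything? */
        if (!test_bit(MMF_HAS_PINNED, &src_vma->vm_mm->flags))
                return false;

        /* Heuristic: refcount elevated enough to suggest a long-term pin. */
        return page_maybe_dma_pinned(page);
}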

But that's also why it has problems with write-protect (whether
mprotect or uddf_wp).

Anyway, that was a long explanation to make the thinking clear, and
finally come to the actual answer to your question:

Adding another bit in the page tables - *purely* to say "this VM owns
the page outright" - would be fairly powerful. And fairly simple.

Then any COW event will set that bit - because when you actually COW,
the page you install is *yours*. No questions asked.

And fork() would simply clear that bit (unless the page was one of the
pinned pages that we simply copy).

See how simple that kind of concept is.
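
As a concept-only sketch (pte_exclusive() is a hypothetical software bit
meaning "this VM owns the page outright", not an existing helper), the
reuse decision on a write fault boils down to:

static bool wp_can_reuse_page(pte_t pte, struct page *page)
{
        /* Set whenever a COW copy was installed for this VM. */
        if (pte_exclusive(pte))
                return true;

        /* Otherwise only reuse what is provably the sole reference. */
        return page_count(page) == 1;
}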

And please, see how INCREDIBLY BROKEN page_mapcount() is. It really
fundamentally is pure and utter garbage. It in no way says "I have
exclusive ownership of this page", because even if the mapcount is 1
*now*, it could have been something else earlier, and some other VM
could have gotten a reference to it before the current VM did so.

This is why I will categorically NAK any stupid attempt to re-introduce
page_mapcount() for COW or GUP handling. It's unacceptably
fundamentally broken.

Btw, the extra bit doesn't really have to be in the page tables. It
could be a bit in the page itself. We could add another page bit that
we just clear when we do the "add ref to page as you make a virtual
copy during fork() etc".

And no, we can't use "pincount" either, because it's not exact. The
fact that the page count is so elevated that we think it's pinned is a
_heuristic_, and that's ok when you have the opposite problem, and ask
"*might* this page be pinned". You want to never get a false negative,
but it can get a false positive.

Linus

2021-12-19 17:44:19

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

>
> Btw, the extra bit doesn't really have to be in the page tables. It
> could be a bit in the page itself. We could add another page bit that
> we just clear when we do the "add ref to page as you make a virtual
> copy during fork() etc".

^ I'm playing with the idea of using a page bit to express: "This page
is exclusive". On a COW fault, if that bit is set, I can simply reuse
the page.

The semantics under which the bit gets set are slightly different from
what you describe, and I'm playing with additional unsharing (on GUP
R/O) that avoids mapping the copied page similarly R/O and simply sets
the bit.

But the general idea could fly I think, devil's in the detail ...

--
Thanks,

David / dhildenb


2021-12-19 17:45:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

David, you said that you were working on some alternative model. Is it
perhaps along these same lines below?

I was thinking that a bit in the page tables to say "this page is
exclusive to this VM" would be a really simple thing to deal with for
fork() and swapout and friends.

But we don't have such a bit in general, since many architectures have
very limited sets of SW bits, and even when they exist we've spent
them on things like UFFD_WP.

But the more I think about the "bit doesn't even have to be in the
page tables", the more I think maybe that's the solution.

A bit in the 'struct page' itself.

For hugepages, you'd have to distribute said bit when you split the hugepage.

But other than that it looks quite simple: anybody who does a virtual
copy will inevitably be messing with the page refcount, so clearing
the "exclusive ownership" bit wouldn't be costly: the 'struct page'
cacheline is already getting dirtied.

Or what was your model you were implying you were thinking about in
your other email? You said

"I might have had an idea yesterday on how to fix most of the issues
without relying on the mapcount, doing it similar [..]"

but I didn't then reply to that email because I had just written this
other long email to Nadav.

Linus

On Sun, Dec 19, 2021 at 9:27 AM Linus Torvalds
<[email protected]> wrote:
>
> Adding another bit in the page tables - *purely* to say "this VM owns
> the page outright" - would be fairly powerful. And fairly simple.
>
> Then any COW event will set that bit - because when you actually COW,
> the page you install is *yours*. No questions asked.
>
[ snip snip ]
>
> Btw, the extra bit doesn't really have to be in the page tables. It
> could be a bit in the page itself. We could add another page bit that
> we just clear when we do the "add ref to page as you make a virtual
> copy during fork() etc".
>
> And no, we can't use "pincount" either, because it's not exact. The
> fact that the page count is so elevated that we think it's pinned is a
> _heuristic_, and that's ok when you have the opposite problem, and ask
> "*might* this page be pinned". You want to never get a false negative,
> but it can get a false positive.
>
> Linus

2021-12-19 17:59:58

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 19.12.21 18:44, Linus Torvalds wrote:
> David, you said that you were working on some alternative model. Is it
> perhaps along these same lines below?
>
> I was thinking that a bit in the page tables to say "this page is
> exclusive to this VM" would be a really simple thing to deal with for
> fork() and swapout and friends.
>
> But we don't have such a bit in general, since many architectures have
> very limited sets of SW bits, and even when they exist we've spent
> them on things like UFFD_WP.
>
> But the more I think about the "bit doesn't even have to be in the
> page tables", the more I think maybe that's the solution.
>
> A bit in the 'struct page' itself.
>

Exactly what I am prototyping right now.

> For hugepages, you'd have to distribute said bit when you split the hugepage.

Yes, that's one tricky part ...

>
> But other than that it looks quite simple: anybody who does a virtual
> copy will inevitably be messing with the page refcount, so clearing
> the "exclusive ownership" bit wouldn't be costly: the 'struct page'
> cacheline is already getting dirtied.
>
> Or what was your model you were implying you were thinking about in
> your other email? You said

I'm playing with the idea of not setting the bit always during COW but
only on GUP request to set the bit (either manually if possible or via
FOLL_UNSHARE). That's a bit more tricky but allows for decoupling that
approach completely from the page_pin() counter.

fork() is allowed to clear the bit if page_count() == 1 and share the
page. So no GUP -> no fork() performance changes (!). Otherwise the bit
can only vanish if we swapout/migrate the page: in which case there are
no additional GUP/references on the page that rely on it!
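
Sketched out (the "exclusive" page bit and its helpers are hypothetical
here, and the copy path is only hinted at), the fork() rule above would
look like:

static void fork_handle_exclusive_anon_page(struct page *page)
{
        if (!PageExclusive(page))               /* hypothetical test */
                return;

        if (page_count(page) == 1) {
                /* No GUP reference: drop exclusivity and share read-only. */
                ClearPageExclusive(page);       /* hypothetical clear */
        } else {
                /*
                 * Possibly pinned: copy the page into the child so the
                 * parent's page stays exclusive to the parent.
                 */
        }
}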

The bit can be set directly if we have to copy the page in the fault
handler (COW or unshare). Outside of COW/Unshare code, the bit can only
be set if page_count() == 1 and we sync against fork(). (and that's the
problem for gup-fast-only that I'm investigating right now, because it
would then always have to fall back to the slow variant if the bit isn't
already set)

So the bit can "vanish" whenever there is no additional reference on the
page. GUP syncs against fork() and can thereby set the bit/request to
set the bit.

I'm trying to decouple it completely from the page_pin() counter to also
be able to handle FOLL_GET (O_DIRECT reproducers unfortunately) correctly.

Not set in stone, just an idea I'm playing with right now ... and I
have to triple-check whether
* page is PTE mapped in the page table I'm walking
* page_count() == 1
really means "this is the only reference". I do strongly believe
so ... :)

--
Thanks,

David / dhildenb


2021-12-19 21:12:18

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote:
> On 19.12.21 18:44, Linus Torvalds wrote:
> > David, you said that you were working on some alternative model. Is it
> > perhaps along these same lines below?
> >
> > I was thinking that a bit in the page tables to say "this page is
> > exclusive to this VM" would be a really simple thing to deal with for
> > fork() and swapout and friends.
> >
> > But we don't have such a bit in general, since many architectures have
> > very limited sets of SW bits, and even when they exist we've spent
> > them on things like UFFD_WP.
> >
> > But the more I think about the "bit doesn't even have to be in the
> > page tables", the more I think maybe that's the solution.
> >
> > A bit in the 'struct page' itself.
> >
>
> Exactly what I am prototyping right now.
>
> > For hugepages, you'd have to distribute said bit when you split the hugepage.
>
> Yes, that's one tricky part ...

That part shouldn't be that tricky ...

Can we get rid of ->mapcount altogether? Three states:
- Not mapped
- Mapped exactly once
- Possibly mapped more than once
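
(As a sketch, assuming this were tracked per folio:)

/* Three-state mapping information, per folio, replacing ->mapcount. */
enum folio_map_state {
        FOLIO_UNMAPPED,         /* not mapped anywhere */
        FOLIO_MAPPED_ONCE,      /* exactly one page-table mapping */
        FOLIO_MAPPED_SHARED,    /* possibly mapped more than once */
};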

I appreciate "Not mapped" is not a state that anon pages can
meaningfully have (maybe when they go into the swap cache?)

And this information would only be present on the head page (ie stored
per folio). If one VMA has multiple PTEs that map the same folio,
then hopefully that only counts as mapped once.

I must admit about half this conversation is going over my head. I need
more time to understand all the constraints than exists between emails
:-)

2021-12-19 21:27:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Sun, Dec 19, 2021 at 1:12 PM Matthew Wilcox <[email protected]> wrote:
>
> Can we get rid of ->mapcount altogether? Three states:
> - Not mapped
> - Mapped exactly once
> - Possibly mapped more than once

I don't think even that is useful. We should get rid of mapcount entirely.

It doesn't actually help to know "mapped exactly once", exactly
because even when that's true, there may be non-mapped references to
the page.

Mapped references really aren't that special in general.

One case where it *can* be special is on virtually indexed cache
architectures, where "is this mapped anywhere else" can be an issue
for cache flushing.

There the page_mapcount() can actually really matter, but it's such an
odd case that I'm not convinced it should be something the kernel VM
code should bend over backwards for.

And the count could be useful for 'rmap' operations, where you can
stop walking the rmap once you've found all mapped cases (page
migration being one case of this). But again, I'm not convinced the
serialization is worth it, or that it's a noticeable win.

However, I'm not 100% convinced it's worth it even there, and I'm not
sure we necessarily use it there.

So in general, I think page_mapcount() can be useful as a count for
those things that are _literally_ about "where is this page mapped".
Page migration, virtual cache coherency, things like that can
literally be about "how many different virtual mappings can we find".

It's just that pages can have a number of non-mapped users too, so
mapcount isn't all that meaningful in general.

And you can look it up with rmap too, and so I do think it would be
worth discussing whether we really should strive to maintain
'mapcount' at all.

> I appreciate "Not mapped" is not a state that anon pages can
> meaningfully have (maybe when they go into the swap cache?)

Absolutely. And we can keep references around to an anonymous page
even without it having any mapping or swap cache at all (ie "gup +
unmap").

So "Not mapped at all" is a possible case, without the page being free'd.

Linus

2021-12-19 21:48:10

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Sun, Dec 19, 2021 at 01:27:07PM -0800, Linus Torvalds wrote:
> On Sun, Dec 19, 2021 at 1:12 PM Matthew Wilcox <[email protected]> wrote:
> >
> > Can we get rid of ->mapcount altogether? Three states:
> > - Not mapped
> > - Mapped exactly once
> > - Possibly mapped more than once
>
> I don't think even that is useful. We should get rid of mapcount entirely.
>
> It doesn't actually help to know "mapped exactly once", exactly
> because even when that's true, there may be non-mapped references to
> the page.
>
> Mapped references really aren't that special in general.
>
> One case where it *can* be special is on virtually indexed cache
> architectures, where "is this mapped anywhere else" can be an issue
> for cache flushing.
>
> There the page_mapcount() can actually really matter, but it's such an
> odd case that I'm not convinced it should be something the kernel VM
> code should bend over backwards for.
>
> And the count could be useful for 'rmap' operations, where you can
> stop walking the rmap once you've found all mapped cases (page
> migration being one case of this). But again, I'm not convinced the
> serialization is worth it, or that it's a noticeable win.
>
> However, I'm not 100% convinced it's worth it even there, and I'm not
> sure we necessarily use it there.
>
> So in general, I think page_mapcount() can be useful as a count for
> those things that are _literally_ about "where is this page mapped".
> Page migration, virtual cache coherency, things like that can
> literally be about "how many different virtual mappings can we find".
>
> It's just that pages can have a number of non-mapped users too, so
> mapcount isn't all that meaningful in general.
>
> And you can look it up with rmap too, and so I do think it would be
> worth discussing whether we really should strive to maintain
> 'mapcount' at all.

Yes, agreed, I was thinking that we could use "not mapped at all"
as an optimisation to avoid doing rmap walks. eg __unmap_and_move().

Perhaps more interestingly in truncate_cleanup_page():
        if (page_mapped(page))
                unmap_mapping_page(page);
where we can skip the i_mmap rbtree walk if we know the page isn't
mapped. I'd be willing to give up that optimisation if we had "this
page was never mapped" (ie if page_mapped() was allowed to return
false positives).

2021-12-19 21:53:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Sun, Dec 19, 2021 at 1:48 PM Matthew Wilcox <[email protected]> wrote:
>
> Yes, agreed, I was thinking that we could use "not mapped at all"
> as an optimisation to avoid doing rmap walks. eg __unmap_and_move().

So the thing is, it's a very dodgy optimization for a rather simple
reason: what if somebody pages the page in?

So even "not mapped at all" is questionable.

You have to check that it's also not a swapcache page, and hold the
page lock for that check, at the very least.

And by then, you're really in a very unusual situation - and my gut
feel says not one worth optimizing for (because anon pages are
_usually_ mapped at least once).

But I dunno - it might depend on your load. Maybe you have some very
special load that happens to trigger this case a lot?

Linus

2021-12-19 22:02:14

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Sun, Dec 19, 2021 at 01:53:36PM -0800, Linus Torvalds wrote:
> On Sun, Dec 19, 2021 at 1:48 PM Matthew Wilcox <[email protected]> wrote:
> >
> > Yes, agreed, I was thinking that we could use "not mapped at all"
> > as an optimisation to avoid doing rmap walks. eg __unmap_and_move().
>
> So the thing is, it's a very dodgy optimization for a rather simple
> reason: what if somebody pages the page in?
>
> So even "not mapped at all" is questionable.
>
> You have to check that it's also not a swapcache page, and hold the
> page lock for that check, at the very least.
>
> And by then, you're really in a very unusual situation - and my gut
> feel says not one worth optimizing for (because anon pages are
> _usually_ mapped at least once).

I'd like to get rid of ->mapcount for file pages too. And those are
definitely never mapped in the majority of cases.

2021-12-19 22:12:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Sun, Dec 19, 2021 at 2:02 PM Matthew Wilcox <[email protected]> wrote:
>
> I'd like to get rid of ->mapcount for file pages too. And those are
> definitely never mapped in the majority of cases.

Fair enough.

You'd probably be better off checking "is this mapping mapped" though.
Because otherwise you have to get the page lock to serialize each
page.

Linus

2021-12-19 22:26:28

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Sun, Dec 19, 2021 at 02:12:04PM -0800, Linus Torvalds wrote:
> On Sun, Dec 19, 2021 at 2:02 PM Matthew Wilcox <[email protected]> wrote:
> >
> > I'd like to get rid of ->mapcount for file pages too. And those are
> > definitely never mapped in the majority of cases.
>
> Fair enough.
>
> You'd probably be better off checking "is this mapping mapped" though.
> Because otherwise you have to get the page lock to serialize each
> page.

Truncate already has the page locked, eg
  truncate_inode_pages_range()
    find_lock_entries()
    truncate_cleanup_page()
      if (page_mapped(page))
        unmap_mapping_page(page)

I think anyone calling unmap_mapping_page() really ought to have the
page lock. Oh, we actually have an assert already to that effect ;-)
VM_BUG_ON(!PageLocked(page));


2021-12-20 18:37:47

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Sun, Dec 19, 2021 at 09:12:01PM +0000, Matthew Wilcox wrote:
> Can we get rid of ->mapcount altogether? Three states:

This might be a step in the right direction?

From f723fb7cf2519428eee75e9e779907f80258f302 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <[email protected]>
Date: Mon, 20 Dec 2021 13:24:04 -0500
Subject: [PATCH] mm: reuse_swap_page() no longer needs to return map_swapcount

All callers of reuse_swap_page() currently pass NULL, indicating that
they don't use the painstakingly calculated map_swapcount. That lets
us further remove it from page_trans_huge_map_swapcount() and
page_trans_huge_mapcount().

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
I don't know if this is a helpful patch to add at this point or
whether it will get in the way of David's much more extensive work.

include/linux/mm.h | 10 +++-------
include/linux/swap.h | 6 +++---
mm/huge_memory.c | 32 +++++++++++---------------------
mm/khugepaged.c | 2 +-
mm/memory.c | 2 +-
mm/swapfile.c | 33 +++++++++++----------------------
6 files changed, 30 insertions(+), 55 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..286eb4155c80 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -840,19 +840,15 @@ static inline int page_mapcount(struct page *page)

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
int total_mapcount(struct page *page);
-int page_trans_huge_mapcount(struct page *page, int *total_mapcount);
+int page_trans_huge_mapcount(struct page *page);
#else
static inline int total_mapcount(struct page *page)
{
return page_mapcount(page);
}
-static inline int page_trans_huge_mapcount(struct page *page,
- int *total_mapcount)
+static inline int page_trans_huge_mapcount(struct page *page)
{
- int mapcount = page_mapcount(page);
- if (total_mapcount)
- *total_mapcount = mapcount;
- return mapcount;
+ return page_mapcount(page);
}
#endif

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d1ea44b31f19..1d38d9475c4d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -514,7 +514,7 @@ extern int __swp_swapcount(swp_entry_t entry);
extern int swp_swapcount(swp_entry_t entry);
extern struct swap_info_struct *page_swap_info(struct page *);
extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
-extern bool reuse_swap_page(struct page *, int *);
+extern bool reuse_swap_page(struct page *);
extern int try_to_free_swap(struct page *);
struct backing_dev_info;
extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
@@ -680,8 +680,8 @@ static inline int swp_swapcount(swp_entry_t entry)
return 0;
}

-#define reuse_swap_page(page, total_map_swapcount) \
- (page_trans_huge_mapcount(page, total_map_swapcount) == 1)
+#define reuse_swap_page(page) \
+ (page_trans_huge_mapcount(page) == 1)

static inline int try_to_free_swap(struct page *page)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5483347291c..6ed86a8f6a5b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1322,7 +1322,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
* We can only reuse the page if nobody else maps the huge page or it's
* part.
*/
- if (reuse_swap_page(page, NULL)) {
+ if (reuse_swap_page(page)) {
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -2542,38 +2542,28 @@ int total_mapcount(struct page *page)
* need full accuracy to avoid breaking page pinning, because
* page_trans_huge_mapcount() is slower than page_mapcount().
*/
-int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
+int page_trans_huge_mapcount(struct page *page)
{
- int i, ret, _total_mapcount, mapcount;
+ int i, ret;

/* hugetlbfs shouldn't call it */
VM_BUG_ON_PAGE(PageHuge(page), page);

- if (likely(!PageTransCompound(page))) {
- mapcount = atomic_read(&page->_mapcount) + 1;
- if (total_mapcount)
- *total_mapcount = mapcount;
- return mapcount;
- }
+ if (likely(!PageTransCompound(page)))
+ return atomic_read(&page->_mapcount) + 1;

page = compound_head(page);

- _total_mapcount = ret = 0;
+ ret = 0;
for (i = 0; i < thp_nr_pages(page); i++) {
- mapcount = atomic_read(&page[i]._mapcount) + 1;
+ int mapcount = atomic_read(&page[i]._mapcount) + 1;
ret = max(ret, mapcount);
- _total_mapcount += mapcount;
}
- if (PageDoubleMap(page)) {
+
+ if (PageDoubleMap(page))
ret -= 1;
- _total_mapcount -= thp_nr_pages(page);
- }
- mapcount = compound_mapcount(page);
- ret += mapcount;
- _total_mapcount += mapcount;
- if (total_mapcount)
- *total_mapcount = _total_mapcount;
- return ret;
+
+ return ret + compound_mapcount(page);
}

/* Racy check whether the huge page can be split */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e99101162f1a..11794bdf513a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -681,7 +681,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
goto out;
}
if (!pte_write(pteval) && PageSwapCache(page) &&
- !reuse_swap_page(page, NULL)) {
+ !reuse_swap_page(page)) {
/*
* Page is in the swap cache and cannot be re-used.
* It cannot be collapsed into a THP.
diff --git a/mm/memory.c b/mm/memory.c
index 8f1de811a1dc..dd85fd07cb24 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3626,7 +3626,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
pte = mk_pte(page, vma->vm_page_prot);
- if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
+ if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e59e08ef46e1..bc0810c3b2a5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1601,31 +1601,30 @@ static bool page_swapped(struct page *page)
return false;
}

-static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
+static int page_trans_huge_map_swapcount(struct page *page,
int *total_swapcount)
{
- int i, map_swapcount, _total_mapcount, _total_swapcount;
+ int i, map_swapcount, _total_swapcount;
unsigned long offset = 0;
struct swap_info_struct *si;
struct swap_cluster_info *ci = NULL;
unsigned char *map = NULL;
- int mapcount, swapcount = 0;
+ int swapcount = 0;

/* hugetlbfs shouldn't call it */
VM_BUG_ON_PAGE(PageHuge(page), page);

if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!PageTransCompound(page))) {
- mapcount = page_trans_huge_mapcount(page, total_mapcount);
if (PageSwapCache(page))
swapcount = page_swapcount(page);
if (total_swapcount)
*total_swapcount = swapcount;
- return mapcount + swapcount;
+ return swapcount + page_trans_huge_mapcount(page);
}

page = compound_head(page);

- _total_mapcount = _total_swapcount = map_swapcount = 0;
+ _total_swapcount = map_swapcount = 0;
if (PageSwapCache(page)) {
swp_entry_t entry;

@@ -1639,8 +1638,7 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
if (map)
ci = lock_cluster(si, offset);
for (i = 0; i < HPAGE_PMD_NR; i++) {
- mapcount = atomic_read(&page[i]._mapcount) + 1;
- _total_mapcount += mapcount;
+ int mapcount = atomic_read(&page[i]._mapcount) + 1;
if (map) {
swapcount = swap_count(map[offset + i]);
_total_swapcount += swapcount;
@@ -1648,15 +1646,9 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
map_swapcount = max(map_swapcount, mapcount + swapcount);
}
unlock_cluster(ci);
- if (PageDoubleMap(page)) {
+ if (PageDoubleMap(page))
map_swapcount -= 1;
- _total_mapcount -= HPAGE_PMD_NR;
- }
- mapcount = compound_mapcount(page);
- map_swapcount += mapcount;
- _total_mapcount += mapcount;
- if (total_mapcount)
- *total_mapcount = _total_mapcount;
+ map_swapcount += compound_mapcount(page);
if (total_swapcount)
*total_swapcount = _total_swapcount;

@@ -1673,17 +1665,14 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
* reuse_swap_page() returns false, but it may be always overwritten
* (see the other implementation for CONFIG_SWAP=n).
*/
-bool reuse_swap_page(struct page *page, int *total_map_swapcount)
+bool reuse_swap_page(struct page *page)
{
- int count, total_mapcount, total_swapcount;
+ int count, total_swapcount;

VM_BUG_ON_PAGE(!PageLocked(page), page);
if (unlikely(PageKsm(page)))
return false;
- count = page_trans_huge_map_swapcount(page, &total_mapcount,
- &total_swapcount);
- if (total_map_swapcount)
- *total_map_swapcount = total_mapcount + total_swapcount;
+ count = page_trans_huge_map_swapcount(page, &total_swapcount);
if (count == 1 && PageSwapCache(page) &&
(likely(!PageTransCompound(page)) ||
/* The remaining swap count will be freed soon */
--
2.33.0


2021-12-20 18:53:09

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Mon, Dec 20, 2021 at 06:37:30PM +0000, Matthew Wilcox wrote:
> +++ b/mm/memory.c
> @@ -3626,7 +3626,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
> pte = mk_pte(page, vma->vm_page_prot);
> - if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
> + if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> vmf->flags &= ~FAULT_FLAG_WRITE;
> ret |= VM_FAULT_WRITE;
[...]
> @@ -1673,17 +1665,14 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
> * reuse_swap_page() returns false, but it may be always overwritten
> * (see the other implementation for CONFIG_SWAP=n).
> */
> -bool reuse_swap_page(struct page *page, int *total_map_swapcount)
> +bool reuse_swap_page(struct page *page)
> {
> - int count, total_mapcount, total_swapcount;
> + int count, total_swapcount;
>
> VM_BUG_ON_PAGE(!PageLocked(page), page);
> if (unlikely(PageKsm(page)))
> return false;
> - count = page_trans_huge_map_swapcount(page, &total_mapcount,
> - &total_swapcount);
> - if (total_map_swapcount)
> - *total_map_swapcount = total_mapcount + total_swapcount;
> + count = page_trans_huge_map_swapcount(page, &total_swapcount);
> if (count == 1 && PageSwapCache(page) &&
> (likely(!PageTransCompound(page)) ||
> /* The remaining swap count will be freed soon */

It makes me wonder if reuse_swap_page() can also be based on refcount
instead of mapcount?

2021-12-20 19:15:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Mon, Dec 20, 2021 at 10:37 AM Matthew Wilcox <[email protected]> wrote:
>
> This might be a step in the right direction?
>
> Subject: [PATCH] mm: reuse_swap_page() no longer needs to return map_swapcount

Well, that patch seems to be a no-op removal of dead code, so absolutely yes.

That said, I think it would be good to split it up. I looked at that
patch and went "is that really a no-op" to the point of recreating it.

I think it would be good to make it multiple patches that are each
individually trivial. IOW, start with

(1) remove second argument to reuse_swap_page() that is always NULL,
without making any other changes

(2) that now made 'total_mapcount' unused in reuse_swap_page(),
remove it as an argument from page_trans_huge_map_swapcount()

(3) that now made 'total_mapcount' unused in
page_trans_huge_mapcount(), remove it as an argument there too.

because as it stands, that patch of yours looks like it is changing a
lot of things, and I think it would be clearer to remove one thing at
a time as it becomes obviously not used.

Hmm?

Linus

2021-12-20 19:39:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Mon, Dec 20, 2021 at 10:53 AM Matthew Wilcox <[email protected]> wrote:
>
> It makes me wonder if reuse_swap_page() can also be based on refcount
> instead of mapcount?

I suspect it doesn't even need refcount.

For regular pages, after we've copied the page, all we do right now is

        if (page_copied)
                free_swap_cache(old_page);

which is basically just an optimistic trylock_page() followed by
try_to_free_swap().

And that then pretty much simply checks "are there any swap users
left" and deletes it from the swap cache if not.

The "free_swap_cache()" thing is actually just an optimization to
avoid having memory pressure do it later. So it doesn't have to be
exact.

In fact, I thought that swap is so unusual that it's not even needed
at all, but I was wrong. See how this was re-introduced in commit
f4c4a3f48480 ("mm: free idle swap cache page after COW") because yes,
some loads still have swap space allocated.

In theory, it would probably be a good idea at COW time to see if the
page ref is 2, and if it's a swap cache page, and try to do that swap
cache removal even earlier, so that the page actually gets re-used
(instead of copied and then the swap entry removed).

But swap is such a non-issue these days that I doubt it matters, and
it's probably better to keep the swap handling in the unusual path.

So mapcount and refcount aren't what matters for the swap cache.

The swap count obviously *does* matter - because it means that some
mapping has a reference to this swap entry (not as a page, but as an
actual swap pointer).

But the mapcount is irrelevant - any users that have the swap page
actually mapped don't actually need it to be a swapcache page.

Even the refcount doesn't really matter, afaik. The only "refcount" we
care about is that swapcount - that's what actually reference counts
the swap entries.

try_to_free_swap() does check for one particular kind of reference: it
does a check for PageWriteback(). We don't want to remove the thing
from the swap cache if it's under active IO.

(This codepath does need the page lock, though, thus all those
"page_trylock()" things).

Linus

2021-12-20 21:02:15

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Mon, Dec 20, 2021 at 11:15:14AM -0800, Linus Torvalds wrote:
> Well, that patch seems to be a no-op removal of dead code, so absolutely yes.
>
> That said, I think it would be good to split it up. I looked at that
> patch and went "is that really a no-op" to the point of recreating it.
>
> I think it would be good to make it multiple patches that are each
> individually trivial. IOW, start with
>
> (1) remove second argument to reuse_swap_page() that is always NULL,
> without making any other changes
>
> (2) that now made 'total_mapcount' unused in reuse_swap_page(),
> remove it as an argument from page_trans_huge_map_swapcount()
>
> (3) that now made 'total_mapcount' unused in
> page_trans_huge_mapcount(), remove it as an argument there too.

Hah, that was actually how I did it originally (without actually
committing at each step, and with a few "Oh, hang on, now we can avoid
calculating this too" stops and restarts along the way), but I thought
it all hung together logically as a single change. It's hard to see
things from the other person's perspective at times.

2021-12-20 21:28:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Mon, Dec 20, 2021 at 1:02 PM Matthew Wilcox <[email protected]> wrote:
>
> Hah, that was actually how I did it originally (without actually
> committing at each step, and with a few "Oh, hang on, now we can avoid
> calculating this too" stops and restarts along the way), but I thought
> it all hung together logically as a single change. It's hard to see
> things from the other person's perspective at times.

In just about any other area, I wouldn't mind one bigger patch that
just removes code that isn't used.

But when it's in the vm code, and it's pretty grotty, I do prefer
seeing three patches that individually are much easier to see that
"yeah, this doesn't actually change anything at all".

The combined patch may be exactly the same thing, it's just much
harder to see that "oh, now it's not used any more".

That was perhaps especially true since a number of the changes also
ended up doing statement simplification when the old layout made no
sense any more with part of the results not used.

So your 3-patch series was much easier to look at and go "Yeah, I
believe each of these patches is a no-op".

So ACK on all those patches.

Linus

2021-12-21 01:03:19

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote:

> handler (COW or unshare). Outside of COW/Unshare code, the bit can only
> be set if page_count() == 1 and we sync against fork(). (and that's the
> problem for gup-fast-only that I'm investigating right now, because it
> would then always have to fallback to the slow variant if the bit isn't
> already set)

I'm having a hard time imagining how gup_fast can maintain any sort of
bit - it lacks all forms of locks so how can we do an atomic test and
set between two pieces of data?

I think the point of Linus's plan really is that the new bit is
derived from page_count, we get the set the new bit when we observe
page_count==1 in various situations and we clear the new bit whenever
we write protect with the intent to copy.

GUP doesn't interact with this bit. A writable page would still be the
second way you can say "you clearly have full ownership of the page",
so GUP just checks writability and bumps the refcount. The challenge
of GUP remains to sanely sequence it with things that are doing WP.

The elevated GUP refcount prevents the page from getting the bit set
again, regardless of what happens to it.

Then on the WP sides.. Obviously we clear the bit when applying a WP
for copy. So all the bad GUP cases are halted now, as with a cleared
bit and a != 1 refcount COW must happen.
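
As a purely conceptual sketch of those two rules (hypothetical helpers;
SetPageAnonExclusive()/ClearPageAnonExclusive() are just the names used
in this thread, not existing API):

	/* Set side: only when we can observe we are the sole owner,
	 * under the PT lock and synchronized against fork(). */
	static void note_anon_exclusive_sketch(struct page *page)
	{
		if (page_count(page) == 1)
			SetPageAnonExclusive(page);
	}

	/* Clear side: whenever we write-protect with the intent to copy
	 * (fork() COW setup), so that a != 1 refcount forces a real COW. */
	static void wrprotect_for_copy_sketch(struct page *page)
	{
		ClearPageAnonExclusive(page);
		/* ... the actual write protection of the PTE follows ... */
	}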

Then, it is also the case that most often a read-only page will have
this bit cleared, UFFD WP being the exception.

UFFD WP works fine as it will have the bit set in the cases we care
about and COW will not happen.

If the bit is not set then everything works as is today and you get
extra COWs. We still have to fix the places that are missing the extra
COWs by making them check the page ref.

It seems this new bit is acting as a 'COW disable', so, the accuracy
of COW vs GUP&speculative pagerefs now relies on setting the bit as
aggressively as possible when it is safe and cheap to do so.

If I got it right this is why it is not just mapcount reduced to 1
bit. It is quite different, even though "this VM owns the page
outright" sure sounds like "mapcount == 1"..

It seems like an interesting direction - the security properties seem
good as we only have to take care of sites applying WP to decide what
to do with the extra bit, and all places that set the bit to 1 do so
after testing refcount under various locks preventing PTE WP.

That just leaves the THP splitting.. I suppose we get the PTL, then
compute the current value of the new bit based on refcount and diffuse
it to all tail pages, then update the PMD and release the PTL. Safe
against concurrent WP - don't need DoubleMap horrors because it isn't
a counter.

> Not set in stone, just an idea of what I'm playing with right now ... and I
> have to triple-check if
> * page is PTE mapped in the page table I'm walking
> * page_count() == 1
> Really means that "this is the only reference.". I do strongly believe
> so .. :)

AFAIK the only places that can break this are places putting struct
page memory into special PTEs. Which is horrific and is just bugs, but
I think I've seen it from time to time :(

ZONE_DEVICE is also messed up, of course, but that is just more
reasons ZONE_DEVICE refcounting needs fixing and you should ignore it.

Jason

2021-12-21 03:29:52

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Mon, Dec 20, 2021 at 09:03:12PM -0400, Jason Gunthorpe wrote:
> That just leave the THP splitting.. I suppose we get the PTL, then
> compute the current value of the new bit based on refcount and diffuse
> it to all tail pages, then update the PMD and release the PTL. Safe
> against concurrent WP - don't need DoubleMap horrors because it isn't
> a counter.

One of the things I've been trying to figure out is how we do
can_split_huge_page(). Maybe an rmap walk to figure out how many
refcounts we would subtract if we did unmap it from everywhere it's
currently mapped? (just to be clear, we call unmap_page() as the
next thing, so I don't mind warming up the rbtree cachelines
if it's mapped anywhere)

2021-12-21 08:58:41

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 21.12.21 02:03, Jason Gunthorpe wrote:
> On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote:
>
>> handler (COW or unshare). Outside of COW/Unshare code, the bit can only
>> be set if page_count() == 1 and we sync against fork(). (and that's the
>> problem for gup-fast-only that I'm investigating right now, because it
>> would then always have to fallback to the slow variant if the bit isn't
>> already set)
>

[in the meantime I figured out which pageflag we can reuse for anon
pages, which is at least one step in the right direction]

> I'm having a hard time imagining how gup_fast can maintain any sort of
> bit - it lacks all forms of locks so how can we do an atomic test and
> set between two pieces of data?

And exactly that is to be figured out.

Note that I am trying to make also any kind of R/O pins on an anonymous
page work as expected as well, to fix any kind of GUP after fork() and
GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page
similarly has to make sure that the page is exclusive -- even if it's
mapped R/O (!).

In the pagefault handler we can then always reuse a PageAnonExclusive()
page, because we know it's exclusive and it will stay exclusive because
concurrent fork() isn't possible.

>
> I think the point of Linus's plan really is that the new bit is
> derived from page_count, we get the set the new bit when we observe
> page_count==1 in various situations and we clear the new bit whenever
> we write protect with the intent to copy.

Here is one problem I'm fighting with:


Assume we set the bit whenever we create a new anon page (either due to
COW, ordinary fault, unsharing request, ..., even if it's mapped R/O
first). We know the page is exclusive at that point because we created
it and fork() could not happen yet.

fork() is the only code that can share the page between processes and
turn it non-exclusive.

We can only clear the bit during fork() -- to turn the exclusive page
into a shared one and map it R/O into both processes -- when we are
sure that *nobody* concurrently takes a reference on the page that
would be problematic (-> GUP).

So to clear the bit during fork, we have to
(1) Check against page_count == 1
(2) Synchronize against GUP

(2) is easy using the mmap_lock and the mm->write_protect_seq


BUT, it would mean that whenever we fork() and there is one additional
reference on a page (even if it's from the swapcache), we would slow
down fork() even if there was never any GUP. This would apply to any
process out there that does a fork() ...


So the idea is to only mark a page exclusive as soon as someone needs
the page to be exclusive and stay exclusive (-> e.g., GUP with FOLL_PIN
or selected FOLL_GET like O_DIRECT). This can happen in my current
approach using two ways:

(1) Set the bit when we know we are the only users

We can set PageAnonExclusive() in case *we sync against fork* and the
page cannot get unmapped (pt lock) when:
* The page is mapped writable
* The page is mapped readable and page_count == 1

This should work during ordinary GUP in many cases.

If we cannot set the page exclusive, we have to trigger a page fault.

(2) During pagefaults when FOLL_FAULT_UNSHARE is set.

GUP will set FOLL_FAULT_UNSHARE for a pagefault when required (again
e.g., FOLL_PIN or selected FOLL_GET users), and manual setting of the
bit failed. The page fault will then try once again to set the bit if
there is a page mapped, and if that fails, do the COW/unshare and set
the bit.
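
A rough sketch of how (1) and (2) fit together from GUP's point of view
(hypothetical helper; PageAnonExclusive() and FAULT_FLAG_UNSHARE are the
names used in this discussion, not existing API):

	/* (1) Try to mark the page exclusive while it cannot get unmapped.
	 * Assumes the PT lock is held and that we sync against fork(). */
	static bool gup_try_set_anon_exclusive_sketch(struct page *page, pte_t pte)
	{
		if (PageAnonExclusive(page))
			return true;
		if (pte_write(pte) || page_count(page) == 1) {
			SetPageAnonExclusive(page);
			return true;
		}
		/* (2) The caller must retry via a FAULT_FLAG_UNSHARE fault,
		 * which either marks the mapped page exclusive or replaces it
		 * by an exclusive copy and marks that one. */
		return false;
	}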


The above should work fairly reliably with GUP. But indeed,
gup-fast-only is the problem. I'm still investigating what kind of
lightweight synchronization we could do against fork() such that we
wouldn't try setting a page PageAnonExclusive() while fork()
concurrently shares the page.

We could eventually use the page lock and do a try_lock(), both in
fork() and in gup-fast-only. fork() would only clear the bit if the
try_lock() succeeded. gup-fast-only would only be able to set the bit
and not fallback to the slow path if try_lock() succeeded.
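
A minimal sketch of that trylock idea, just to make the two sides
explicit (purely hypothetical, and it leaves out the page_count checks
discussed above):

	/* fork() side, under the PT lock: only share if we won the trylock. */
	if (trylock_page(page)) {
		ClearPageAnonExclusive(page);	/* page may now be shared R/O */
		unlock_page(page);
	} else {
		/* gup-fast may be racing: don't share, copy for the child */
	}

	/* gup-fast-only side: only set the bit if we won the trylock. */
	if (trylock_page(page)) {
		SetPageAnonExclusive(page);	/* fork() cannot clear it concurrently */
		unlock_page(page);
	} else {
		/* fall back to the slow GUP path */
	}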


But I'm still investigating if there are better alternatives ...

>
> GUP doesn't interact with this bit. A writable page would still be the
> second way you can say "you clearly have full ownership of the page',
> so GUP just checks writability and bumps the refcount. The challenge
> of GUP reamins to sanely sequence it with things that are doing WP.
>
> The elevated GUP refcount prevents the page from getting the bit set
> again, regardless of what happens to it.
>
> Then on the WP sides.. Obviously we clear the bit when applying a WP
> for copy. So all the bad GUP cases are halted now, as with a cleared
> bit and a != 1 refcount COW must happen.
>
> Then, it is also the case that most often a read-only page will have
> this bit cleared, UFFD WP being the exception.
>
> UFFD WP works fine as it will have the bit set in the cases we care
> about and COW will not happen.
>
> If the bit is not set then everything works as is today and you get
> extra COWs. We still have to fix the things that are missing the extra
> COWs to check the page ref to fix this.
>
> It seems this new bit is acting as a 'COW disable', so, the accuracy
> of COW vs GUP&speculative pagerefs now relies on setting the bit as
> aggressively as possible when it is safe and cheap to do so.

But we really want to avoid degrading fork() for everybody that doesn't
do heavy GUP ...

>
> If I got it right this is why it is not just mapcount reduced to 1
> bit. It is quite different, even though "this VM owns the page
> outright" sure sounds like "mapcount == 1"..
>
> It seems like an interesting direction - the security properties seem
> good as we only have to take care of sites applying WP to decide what
> to do with the extra bit, and all places that set the bit to 1 do so
> after testing refcount under various locks preventing PTE WP.
>
> That just leave the THP splitting.. I suppose we get the PTL, then
> compute the current value of the new bit based on refcount and diffuse
> it to all tail pages, then update the PMD and release the PTL. Safe
> against concurrent WP - don't need DoubleMap horrors because it isn't
> a counter.
>
>> Not set in stone, just an idea of what I'm playing with right now ... and I
>> have to triple-check if
>> * page is PTE mapped in the page table I'm walking
>> * page_count() == 1
>> Really means that "this is the only reference.". I do strongly believe
>> so .. :)
>
> AFAIK the only places that can break this are places putting struct
> page memory into special PTEs. Which is horrific and is just bugs, but
> I think I've seen it from time to time :(

As we only care about anon pages, I think that doesn't apply. At least
that's what I hope.


--
Thanks,

David / dhildenb


2021-12-21 14:28:18

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Tue, Dec 21, 2021 at 09:58:32AM +0100, David Hildenbrand wrote:
> > I'm having a hard time imagining how gup_fast can maintain any sort of
> > bit - it lacks all forms of locks so how can we do an atomic test and
> > set between two pieces of data?
>
> And exactly that is to be figured out.
>
> Note that I am trying to make also any kind of R/O pins on an anonymous
> page work as expected as well, to fix any kind of GUP after fork() and
> GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page
> similarly has to make sure that the page is exclusive -- even if it's
> mapped R/O (!).

Why? AFAIK we don't have bugs here. If the page is RO and has an
elevated refcount it cannot be 'PageAnonExclusive' and so any place
that wants to drop the WP just cannot. What is the issue?

> BUT, it would mean that whenever we fork() and there is one additional
> reference on a page (even if it's from the swapcache), we would slow
> down fork() even if there was never any GUP. This would apply to any
> process out there that does a fork() ...

You mean because we'd copy?

Is this common? Linus' prior email was talking as though swap is so
rare we shouldn't optimize for it?

> So the idea is to mark a page only exclusive as soon as someone needs
> the page to be exclusive and stay exclusive (-> e.g., GUP with FOLL_PIN
> or selected FOLL_GET like O_DIRECT). This can happen in my current
> approach using two ways:
>
> (1) Set the bit when we know we are the only users
>
> We can set PageAnonExclusive() in case *we sync against fork* and the
> page cannot get unmapped (pt lock) when:
> * The page is mapped writable
> * The page is mapped readable and page_count == 1

I'm still not sure I see that all this complexity is netting a gain?

> If we cannot set the page exclusive, we have to trigger a page fault.
>
> (2) During pagefaults when FOLL_FAULT_UNSHARE is set.

Why do we need FOLL_FAULT_UNSHARE ? AFAICT that was part of this
series because of mapcount, once the hugetlb COW is fixed to use
refcount properly, as Linus showed, the bugs this was trying to fix go
away.

And as discussed before it is OK if READ gup becomes incoherent, that
is its defined semantic.

> The above should work fairly reliable with GUP. But indeed,
> gup-fast-only is the problem. I'm still investigating what kind of
> lightweight synchronization we could do against fork() such that we
> wouldn't try setting a page PageAnonExclusive() while fork()
> concurrently shares the page.
>
> We could eventually use the page lock and do a try_lock(), both in
> fork() and in gup-fast-only. fork() would only clear the bit if the
> try_lock() succeeded. gup-fast-only would only be able to set the bit
> and not fallback to the slow path if try_lock() succeeded.

I suspect that is worse than just having fork clear the bit and leave
GUP as-is. A trylock is an atomic; clearing PageAnonExclusive does not
need to be atomic, it is protected by the PTL.

> > Then on the WP sides.. Obviously we clear the bit when applying a WP
> > for copy. So all the bad GUP cases are halted now, as with a cleared
> > bit and a != 1 refcount COW must happen.

> But we really want to avoid degrading fork() for everybody that doesn't
> do heavy GUP ...

fork() already has to dirty the struct page cache line for refcount,
setting a flag seems minor at that point? At least we shouldn't
discard this nice understandable approach without a measurement....

Remember fork is already incrementing mapcount, so if we kill mapcount it is
a win for fork to replace the mapcount atomic with a non-atomic flag.

> > AFAIK the only places that can break this are places putting struct
> > page memory into special PTEs. Which is horrific and is just bugs, but
> > I think I've seen it from time to time :(
>
> As we only care about anon pages, I think that doesn't apply. At least
> that's what I hope.

You are optimistic :)

Jason

2021-12-21 15:19:47

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 21.12.21 15:28, Jason Gunthorpe wrote:
> On Tue, Dec 21, 2021 at 09:58:32AM +0100, David Hildenbrand wrote:
>>> I'm having a hard time imagining how gup_fast can maintain any sort of
>>> bit - it lacks all forms of locks so how can we do an atomic test and
>>> set between two pieces of data?
>>
>> And exactly that is to be figured out.
>>
>> Note that I am trying to make also any kind of R/O pins on an anonymous
>> page work as expected as well, to fix any kind of GUP after fork() and
>> GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page
>> similarly has to make sure that the page is exclusive -- even if it's
>> mapped R/O (!).
>
> Why? AFAIK we don't have bugs here. If the page is RO and has an
> elevated refcount it cannot be 'PageAnonExclusive' and so any place
> that wants to drop the WP just cannot. What is the issue?

Sure it can.

1. Map page R/W
2. Pin it R/W
3. Swapout
4. Read access

Page is now mapped R/O and *has to be* marked PageAnonExclusive(), to
properly skip the COW fault. That's literally 60% of the reproducers we
have that need fixing.


But what I think you actually mean is if we want to get R/O pins right.
>
>> BUT, it would mean that whenever we fork() and there is one additional
>> reference on a page (even if it's from the swapcache), we would slow
>> down fork() even if there was never any GUP. This would apply to any
>> process out there that does a fork() ...
>
> You mean because we'd copy?

Yes.

>
> Is this common? Linus' prior email was talking as though swap is so
> rare we shouldn't optimize for it?

At least in the enterprise segment having swap enabled is mostly a hard
documented requirement. On customer installations swap is still common,
and even gets replaced by zswap, which is enabled automatically in many
installations ...

So in the world I live and work in, swap is used frequently.

>
>> So the idea is to mark a page only exclusive as soon as someone needs
>> the page to be exclusive and stay exclusive (-> e.g., GUP with FOLL_PIN
>> or selected FOLL_GET like O_DIRECT). This can happen in my current
>> approach using two ways:
>>
>> (1) Set the bit when we know we are the only users
>>
>> We can set PageAnonExclusive() in case *we sync against fork* and the
>> page cannot get unmapped (pt lock) when:
>> * The page is mapped writable
>> * The page is mapped readable and page_count == 1
>
> I'm still not sure I see that all this complexity is netting a gain?

Avoid copy on fork().

>
>> If we cannot set the page exclusive, we have to trigger a page fault.
>>
>> (2) During pagefaults when FOLL_FAULT_UNSHARE is set.
>
> Why do we need FOLL_FAULT_UNSHARE ? AFAICT that was part of this
> series because of mapcount, once the hugetlb COW is fixed to use
> refcount properly, as Linus showed, the bugs this was trying to fix go
> away.

The purpose of FOLL_FAULT_UNSHARE in the !mapcount version is to cleanly
support R/O pins without the need for FOLL_WRITE.

And it's comparatively easy to add on top. This is not the core of the
complexity, really.

>
> And as discussed before it is OK if READ gup becomes incoherent, that
> is its defined semantic.

And that's where I still disagree.

But anyhow, this is really more about FOLL_FAULT_UNSHARE, which is
pretty easy and natural to add on top and just gets this right.

>
>> The above should work fairly reliable with GUP. But indeed,
>> gup-fast-only is the problem. I'm still investigating what kind of
>> lightweight synchronization we could do against fork() such that we
>> wouldn't try setting a page PageAnonExclusive() while fork()
>> concurrently shares the page.
>>
>> We could eventually use the page lock and do a try_lock(), both in
>> fork() and in gup-fast-only. fork() would only clear the bit if the
>> try_lock() succeeded. gup-fast-only would only be able to set the bit
>> and not fallback to the slow path if try_lock() succeeded.
>
> I suspect that is worse than just having fork clear the bit and leave
> GUP as-is. try lock is an atomic, clearing PageAnonExclusive does not
> need to be atomic, it is protected by the PTL.

There are 2 models, leaving FOLL_FAULT_UNSHARE out of the picture for now:

1) Whenever mapping an anonymous page R/W (after COW, during ordinary
fault, on swapin), we mark the page exclusive. We must never lose the
PageAnonExclusive bit, not during migration, not during swapout.

fork() will process the bit for each and every process, even if there
was no GUP, and will copy if there are additional references.

2) Whenever GUP wants to pin/ref a page, we try marking it exclusive. We
can lose the PageAnonExclusive bit during migration and swapout, because
that can only happen when there are no additional references.

fork() will process the bit only if there was GUP. Ordinary fork() is
left unchanged.


Getting R/O supported in the same way just means that we have to check
on a R/O pin if the page is PageAnonExclusive, and if that's not the
case, trigger a FOLL_FAULT_UNSHARE fault. That's really the only
"complexity" on top, which, without the mapcount, is really easy.

--
Thanks,

David / dhildenb


2021-12-21 17:05:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Tue, Dec 21, 2021 at 12:58 AM David Hildenbrand <[email protected]> wrote:
>
> On 21.12.21 02:03, Jason Gunthorpe wrote:
>
> > I'm having a hard time imagining how gup_fast can maintain any sort of
> > bit - it lacks all forms of locks so how can we do an atomic test and
> > set between two pieces of data?
>
> And exactly that is to be figured out.

So my preference would be to just always maintain the "exclusive to
this VM" bit in the 'struct page', because that makes things easier to
think about.

[ Of course - the bit could be reversed, and be a 'not exclusive to
this VM' bit, semantically the set-or-cleared issue doesn't matter.
Also, when I talk about some "exclusive to this VM" bit, I'm purely
talking about pages that are marked PageAnon(), so the bit may or may
not even exist for other page types ]

And then all GUP-fast would need to do is to refuse to look up a page
that isn't exclusive to that VM. We already have the situation that
GUP-fast can fail for non-writable pages etc, so it's just another
test.
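
Concretely, that extra test would sit next to the existing checks in
the gup-fast PTE walk, something along the lines of (sketch only,
ignoring the compound-head details of the real code; PageAnonExclusive()
is the bit being proposed here):

	/* In the gup-fast PTE walk, after speculatively grabbing the page: */
	if (PageAnon(page) && !PageAnonExclusive(page)) {
		put_page(page);		/* drop the speculative reference */
		goto pte_unmap;		/* and fall back to slow GUP */
	}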

> Note that I am trying to make also any kind of R/O pins on an anonymous
> page work as expected as well, to fix any kind of GUP after fork() and
> GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page
> similarly has to make sure that the page is exclusive -- even if it's
> mapped R/O (!).

I do think the existing "maybe_pinned()" logic is fine for that. The
"exclusive to this VM" bit can be used to *help* that decision -
because only an exclusive page can be pinned - but I don't think it
should _replace_ that logic.

There's a quite fundamental difference between

(a) COW and GUP: these two operations _have_ to know that they get an
exclusive page in order to re-use or look up the page respectively

(b) the pre-cow logic in fork() or the "add this to the swap cache"
logic in vmscan that decides whether a page can be turned into a COW
page by adding a reference count to it (whether due to fork or swap
cache doesn't matter - the end result is the same).

The difference is that in (a) the thing we *have* to get right is
whether a page is exclusively owned by that VM or not. We can COW too
much, but we can never share a page unless it's exclusive. That's true
whether it's pinned or not.

In (b), the "have to get right" is different. In (b), it's perfectly
ok to COW an exclusive page and turn it non-exclusive. But we must
never COW a pinned page.

So (a) and (b) are very different situations, and have different logic.

If we always maintain an exclusive bit for AnonPage pages, then both
(a) and (b) can use that bit, but they'll use it very differently. In
(a) we'll refuse to look it up and will force a 'handle_mm_fault()' to
get an exclusive copy. And in (b), we just use it as a "we know only
exclusive pages can be pinned", so it's just another check for
page_needs_cow_for_dma(), the same way we currently check
"MMF_HAS_PINNED" to narrow down the whole "page count indicates this
may be a pinned page" question.

And the "page is exclusive" would actually be the *common* case for
almost all pages. Any time you've written to a page and you haven't
forked after the write (and it hasn't been turned into a swap page),
that page would be exclusive to that VM.

Doesn't this seem like really straightforward semantics to maintain
(and think about)?

I'd like the exclusive page bit to *not* be directly about "has this
page been pinned" exactly because we already have too many special
cases for GUP. It would be nicer to have a page bit that has very
clear semantics even in the absence of GUP.

Linus

2021-12-21 17:40:38

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 21.12.21 18:05, Linus Torvalds wrote:
> On Tue, Dec 21, 2021 at 12:58 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 21.12.21 02:03, Jason Gunthorpe wrote:
>>
>>> I'm having a hard time imagining how gup_fast can maintain any sort of
>>> bit - it lacks all forms of locks so how can we do an atomic test and
>>> set between two pieces of data?
>>
>> And exactly that is to be figured out.
>
> So my preference would be to just always maintain the "exclusive to
> this VM" bit in the 'struct page', because that makes things easier to
> think about.
>
> [ Of course - the bit could be reversed, and be a 'not exclusive to
> this VM' bit, semantically the set-or-cleared issue doesn't matter.
> Also, when I talk about some "exclusive to this VM" bit, I'm purely
> talking about pages that are marked PageAnon(), so the bit may or may
> not even exist for other pager types ]

Yes, whatever I say applies to PageAnon() only -- including the
(overloaded bit), called PG_anon_exclusive now.

>
> And then all GUP-fast would need to do is to refuse to look up a page
> that isn't exclusive to that VM. We already have the situation that
> GUP-fast can fail for non-writable pages etc, so it's just another
> test.

Right, the simplest way is simply failing GUP fast if the bit isn't set,
forcing it into the slow path. If that primarily happens for R/O
pins after fork(), fine with me.

>
>> Note that I am trying to make also any kind of R/O pins on an anonymous
>> page work as expected as well, to fix any kind of GUP after fork() and
>> GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page
>> similarly has to make sure that the page is exclusive -- even if it's
>> mapped R/O (!).
>
> I do think the existing "maybe_pinned()" logic is fine for that. The
> "exclusive to this VM" bit can be used to *help* that decision -
> because only an exclusive page can be pinned - but I don't think it
> should _replace_ that logic.

The issue is that O_DIRECT uses FOLL_GET and cannot easily be changed to
FOLL_PIN unfortunately. So I'm *trying* to make it more generic such
that such corner cases can be handled as well correctly. But yeah, I'll
see where this goes ... O_DIRECT has to be fixed one way or the other.

John H. mentioned that he wants to look into converting that to
FOLL_PIN. So maybe that will work eventually.

>
> There's a quite fundamental difference between
>
> (a) COW and GUP: these two operations _have_ to know that they get an
> exclusive page in order to re-use or look up the page respectively
>
> (b) the pre-cow logic in fork() or the "add this to the swap cache"
> logic in vmscan that decides whether a page can be turned into a COW
> page by adding a reference count to it (whether due to fork or swap
> cache doesn't matter - the end result is the same).
>
> The difference is that in (a) the thing we *have* to get right is
> whether a page is exclusively owned by that VM or not. We can COW too
> much, but we can never share a page unless it's exclusive. That's true
> whether it's pinned or not.

Exactly. Once a page is "exclusive" it must not get shared *unless* we
can turn it into a "shared" page during fork().

There are some ugly corner cases that will require some thought.

>
> In (b), the "have to get right" is different. In (b), it's perfectly
> ok to COW an exclusive page and turn it non-exclusive. But we must
> never COW a pinned page.
>
> So (a) and (b) are very different situations, and have different logic.
>
> If we always maintain an exclusive bit for AnonPage pages, then both
> (a) and (b) can use that bit, but they'll use it very differently. In
> (a) we'll refuse to look it up and will force a 'handle_mm_fault()' to
> get an exclusive copy. And in (b), we just use it as a "we know only
> exclusive pages can be pinned", so it's just another check for
> page_needs_cow_for_dma(), the same way we currently check
> "MMF_HAS_PINNED" to narrow down the whole "page count indicates this
> may be a pinned page" question.

If we use page_needs_cow_for_dma() for that purpose we can still have
other references from our process referencing the page, including right
now O_DIRECT ones. So the safest thing to do would be relying on the
same logic as we do in the COW path regarding the pagecount ... but that
might result in unnecessary copies as I mentioned.

It would be perfect if just anything that modifies page content would be
using FOLL_PIN, unfortunately that's not reality ...


>
> And the "page is exclusive" would actually be the *common* case for
> almost all pages. Any time you've written to a page and you haven't
> forked after the write (and it hasn't been turned into a swap page),
> that page would be exclusive to that VM.

Yes. Essentially every time we create a new anonymous page it would end
up as exclusive. Or if we're in a fault and can convert the "shared"
page into an "exclusive" page (essentially the COW reuse logic).

>
> Doesn't this seem like really straightforward semantics to maintain
> (and think about)?
>
> I'd like the exclusive page bit to *not* be directly about "has this
> page been pinned" exactly because we already have too many special
> cases for GUP. It would be nicer to have a page bit that has very
> clear semantics even in the absence of GUP.

What adds complexity to correctly maintain the "exclusive" state are at
least:
* KSM (might be harder, have to think about it)
* migration (might be easy to just copy the bit)
* fork() with migration/swap entries that reference a page that is
"exclusive". I'll have to think about that more.

So I have plenty of stuff to look into.


Just so we're on the same page what I'd like to achieve with anonymous
pages:

1) If we take a R/W pin on an anonymous page, we will always pin an
"exclusive page".

2) If we take a R/O pin on an anonymous page, we will always pin an
"exclusive page", even if the page is mapped R/O.

3) "Exclusive" pages cannot be turned "shared" during fork (and ksm? :/
) if pinned.

4) "Exclusive" pages can be turned "shared" during fork if not pinned.

5) "Exclusive" pages will never be COWed but remain there for all
eternity, until unmapped ... well or until converted into "shared" again
if possible


Ideally we'd handle O_DIRECT ... :(


2) is certainly the cherry on top. But it just means that R/O pins don't
have to be the weird kid. And yes, achieving 2) would require
FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do
what existing COW logic does, just bypass the "map writable" and
"trigger write fault" semantics.

I hope we agree that R/O pins don't have to be the weird kid if we can
"get it right" with the same approach.


--
Thanks,

David / dhildenb


2021-12-21 18:05:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Tue, Dec 21, 2021 at 9:40 AM David Hildenbrand <[email protected]> wrote:
>
> > I do think the existing "maybe_pinned()" logic is fine for that. The
> > "exclusive to this VM" bit can be used to *help* that decision -
> > because only an exclusive page can be pinned - but I don't think it
> > should _replace_ that logic.
>
> The issue is that O_DIRECT uses FOLL_GET and cannot easily be changed to
> FOLL_PIN unfortunately. So I'm *trying* to make it more generic such
> that such corner cases can be handled as well correctly. But yeah, I'll
> see where this goes ... O_DIRECT has to be fixed one way or the other.
>
> John H. mentioned that he wants to look into converting that to
> FOLL_PIN. So maybe that will work eventually.

I'd really prefer that as the plan.

What exactly is the issue with O_DIRECT? Is it purely that it uses
"put_page()" instead of "unpin", or what?

I really think that if people look up pages and expect those pages to
stay coherent with the VM they looked it up for, they _have_ to
actively tell the VM layer - which means using FOLL_PIN.

Note that this is in absolutely no way a "new" issue. It has *always*
been true. If some O_DIORECT path depends on pinning behavior, it has
never worked correctly, and it is entirely on O_DIRECT, and not at all
a VM issue. We've had people doing GUP games forever, and being burnt
by those games not working reliably.

GUP (before we even had the notion of pinning) would always just take
a reference to the page, but it would not guarantee that that exact
page then kept an association with the VM.

Now, in *practice* this all works if:

(a) the GUP user had always written to the page since the fork
(either explicitly, or with FOLL_WRITE obviously acting as such)

(b) the GUP user never forks afterwards until the IO is done

(c) the GUP user plays no other VM games on that address

and it's also very possible that it has worked by pure luck (ie we've
had a lot of random code that actively mis-used things and it would
work in practice just because COW would happen to cut the right
direction etc).

Is there some particular GUP user you happen to care about more than
others? I think it's a valid option to try to fix things up one by
one, even if you don't perhaps fix _all_ cases.

Linus

2021-12-21 18:07:15

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Tue 21-12-21 18:40:30, David Hildenbrand wrote:
> On 21.12.21 18:05, Linus Torvalds wrote:
> > On Tue, Dec 21, 2021 at 12:58 AM David Hildenbrand <[email protected]> wrote:
> >> Note that I am trying to make also any kind of R/O pins on an anonymous
> >> page work as expected as well, to fix any kind of GUP after fork() and
> >> GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page
> >> similarly has to make sure that the page is exclusive -- even if it's
> >> mapped R/O (!).
> >
> > I do think the existing "maybe_pinned()" logic is fine for that. The
> > "exclusive to this VM" bit can be used to *help* that decision -
> > because only an exclusive page can be pinned - but I don't think it
> > should _replace_ that logic.
>
> The issue is that O_DIRECT uses FOLL_GET and cannot easily be changed to
> FOLL_PIN unfortunately. So I'm *trying* to make it more generic such
> that such corner cases can be handled as well correctly. But yeah, I'll
> see where this goes ... O_DIRECT has to be fixed one way or the other.
>
> John H. mentioned that he wants to look into converting that to
> FOLL_PIN. So maybe that will work eventually.

For the record, we always intended (and still intend) to make O_DIRECT use
FOLL_PIN. Just it is tricky because some users mix pages pinned with GUP
and pages acquired through get_page() in a single bio (such as zero page)
and thus it is non-trivial to do the right thing on IO completion (unpin or
just put_page).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-12-21 18:28:50

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 21.12.21 19:00, Linus Torvalds wrote:
> On Tue, Dec 21, 2021 at 9:40 AM David Hildenbrand <[email protected]> wrote:
>>
>>> I do think the existing "maybe_pinned()" logic is fine for that. The
>>> "exclusive to this VM" bit can be used to *help* that decision -
>>> because only an exclusive page can be pinned - but I don't think it
>>> should _replace_ that logic.
>>
>> The issue is that O_DIRECT uses FOLL_GET and cannot easily be changed to
>> FOLL_PIN unfortunately. So I'm *trying* to make it more generic such
>> that such corner cases can be handled as well correctly. But yeah, I'll
>> see where this goes ... O_DIRECT has to be fixed one way or the other.
>>
>> John H. mentioned that he wants to look into converting that to
>> FOLL_PIN. So maybe that will work eventually.
>
> I'd really prefer that as the plan.
>
> What exactly is the issue with O_DIRECT? Is it purely that it uses
> "put_page()" instead of "unpin", or what?
>
> I really think that if people look up pages and expect those pages to
> stay coherent with the VM they looked it up for, they _have_ to
> actively tell the VM layer - which means using FOLL_PIN.
>
> Note that this is in absolutely no way a "new" issue. It has *always*
> been true. If some O_DIORECT path depends on pinning behavior, it has
> never worked correctly, and it is entirely on O_DIRECT, and not at all
> a VM issue. We've had people doing GUP games forever, and being burnt
> by those games not working reliably.
>
> GUP (before we even had the notion of pinning) would always just take
> a reference to the page, but it would not guarantee that that exact
> page then kept an association with the VM.
>
> Now, in *practice* this all works if:
>
> (a) the GUP user had always written to the page since the fork
> (either explicitly, or with FOLL_WRITE obviously acting as such)
>
> (b) the GUP user never forks afterwards until the IO is done
>
> (c) the GUP user plays no other VM games on that address
>
> and it's also very possible that it has worked by pure luck (ie we've
> had a lot of random code that actively mis-used things and it would
> work in practice just because COW would happen to cut the right
> direction etc).
>
> Is there some particular GUP user you happen to care about more than
> others? I think it's a valid option to try to fix things up one by
> one, even if you don't perhaps fix _all_ cases.

Yes, of course. The important part for me is to have a rough idea of how
to tackle all pieces and have a reliable design/approach. Besides the
security issue, highest priority is getting R/W pins (FOLL_WRITE) right,
including O_DIRECT, because that can silently break existing use cases.

Lower priority is getting R/O pins on anonymous memory right, because
that never worked reliably. Lowest priority is getting R/O pins on
MAP_PRIVATE file memory right.

I'd appreciate if someone could work on the O_DIRECT FOLL_PIN conversion
while I struggle with PageAnonExclusive() and R/W pins :)

[noting that I'll not get too much done within the next 2 weeks]

--
Thanks,

David / dhildenb


2021-12-21 18:31:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Tue, Dec 21, 2021 at 10:07 AM Jan Kara <[email protected]> wrote:
>
> For record we always intended (and still intend) to make O_DIRECT use
> FOLL_PIN. Just it is tricky because some users mix pages pinned with GUP
> and pages acquired through get_page() in a single bio (such as zero page)
> and thus it is non-trivial to do the right thing on IO completion (unpin or
> just put_page).

Side note: the new "exclusive VM" bit wouldn't _solve_ this issue, but
it might make it much easier to debug and catch.

If we only set the exclusive VM bit on pages that get mapped into user
space, and we guarantee that GUP only looks up such pages, then we can
also add a debug test to the "unpin" case that the bit is still set.

And that would catch anybody who ends up using other pages for
unpin(), and you could have a WARN_ON() for it (obviously also trigger
on the page count being too small to unpin).
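
A sketch of such a check on the unpin side (hypothetical, assuming the
bit exists and GUP only ever returned exclusive anon pages; the real
release logic is left out):

	void unpin_user_page(struct page *page)
	{
		/* A valid pin must still reference a page that is exclusive
		 * to the VM it was pinned for (for anon pages). */
		VM_WARN_ON_ONCE_PAGE(PageAnon(page) && !PageAnonExclusive(page), page);
		/* ... existing pin/refcount release ... */
	}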

That way, at least from a kernel debugging and development standpoint
it would make it easy to see "ok, this unpinning got a page that
wasn't pinned", and it would help find these cases where some
situation had used just a get_page() rather than a pin to get a page
pointer.

No?

Linus

2021-12-21 18:51:14

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 21.12.21 19:30, Linus Torvalds wrote:
> On Tue, Dec 21, 2021 at 10:07 AM Jan Kara <[email protected]> wrote:
>>
>> For record we always intended (and still intend) to make O_DIRECT use
>> FOLL_PIN. Just it is tricky because some users mix pages pinned with GUP
>> and pages acquired through get_page() in a single bio (such as zero page)
>> and thus it is non-trivial to do the right thing on IO completion (unpin or
>> just put_page).
>
> Side note: the new "exclusive VM" bit wouldn't _solve_ this issue, but
> it might make it much easier to debug and catch.
>
> If we only set the exclusive VM bit on pages that get mapped into user
> space, and we guarantee that GUP only looks up such pages, then we can
> also add a debug test to the "unpin" case that the bit is still set.
>
> And that would catch anybody who ends up using other pages for
> unpin(), and you could have a WARN_ON() for it (obviously also trigger
> on the page count being too small to unpin).

It would also catch someone wrongly dropping the exclusive flag while
there are users (pins) relying on the page staying exclusive.

>
> That way, at least from a kernel debugging and development standpoint
> it would make it easy to see "ok, this unpinning got a page that
> wasn't pinned"

For that purpose the pincount would already kind of work. Not precise,
but at least something ("this page cannot possibly have been pinned").


--
Thanks,

David / dhildenb


2021-12-21 19:06:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Tue, Dec 21, 2021 at 10:51 AM David Hildenbrand <[email protected]> wrote:
>
> For that purpose the pincount would already kind of work. Not precise,
> but at least something ("this page cannot possibly have been pinned").

That part actually exists already, ie put_page_refs() has this:

#ifdef CONFIG_DEBUG_VM
	if (VM_WARN_ON_ONCE_PAGE(page_ref_count(page) < refs, page))
		return;
#endif

And yeah, it shouldn't have that '#ifdef CONFIG_DEBUG_VM' there, but I
think it's because the non-CONFIG_DEBUG_VM #define for
VM_WARN_ON_ONCE_PAGE() is broken, and doesn't return 0.
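
i.e. the !CONFIG_DEBUG_VM stub would need to stay usable inside an
if(), something along the lines of (sketch, not the actual header):

	/* !CONFIG_DEBUG_VM: still type-checks the condition, but always
	 * reads as "no warning triggered". */
	#define VM_WARN_ON_ONCE_PAGE(cond, page)	\
		({ BUILD_BUG_ON_INVALID(cond); false; })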

Linus

2021-12-21 19:07:13

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:

> 2) is certainly the cherry on top. But it just means that R/O pins don't
> have to be the weird kid. And yes, achieving 2) would require
> FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do
> what existing COW logic does, just bypass the "map writable" and
> "trigger write fault" semantics.

I still don't agree with this - when you come to patches can you have
this work at the end and under a good cover letter? Maybe it will make
more sense then.

Thanks,
Jason

2021-12-21 21:11:46

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 12/21/21 10:28, David Hildenbrand wrote:
...
> I'd appreciate if someone could work on the O_DIRECT FOLL_PIN conversion
> while I struggle with PageAnonExclusive() and R/W pins :)
>

Yes, I'll sign up for that (unless someone else who is faster is already
working on it). I've tried a couple times in the past, but without the
proper level of determination to see it through. So this time will be
different. :)

> [noting that I'll not get too much done within the next 2 weeks]
>

Likewise. Starting in early January.


thanks,
--
John Hubbard
NVIDIA

2021-12-21 21:16:30

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 12/21/21 10:30, Linus Torvalds wrote:
> On Tue, Dec 21, 2021 at 10:07 AM Jan Kara <[email protected]> wrote:
>>
>> For record we always intended (and still intend) to make O_DIRECT use
>> FOLL_PIN. Just it is tricky because some users mix pages pinned with GUP
>> and pages acquired through get_page() in a single bio (such as zero page)
>> and thus it is non-trivial to do the right thing on IO completion (unpin or
>> just put_page).
>
> Side note: the new "exclusive VM" bit wouldn't _solve_ this issue, but
> it might make it much easier to debug and catch.
>
> If we only set the exclusive VM bit on pages that get mapped into user
> space, and we guarantee that GUP only looks up such pages, then we can
> also add a debug test to the "unpin" case that the bit is still set.
>
> And that would catch anybody who ends up using other pages for
> unpin(), and you could have a WARN_ON() for it (obviously also trigger
> on the page count being too small to unpin).
>
> That way, at least from a kernel debugging and development standpoint
> it would make it easy to see "ok, this unpinning got a page that
> wasn't pinned", and it would help find these cases where some
> situation had used just a get_page() rather than a pin to get a page
> pointer.
>
> No?
>
> Linus

Yes, this is especially welcome, because it means that after enough time
sitting in the -mm tree, we can reasonably expect to catch the most important
cases, if any were missed. That makes it a whole other level of useful, as
compared to local testing hacks.


thanks,
--
John Hubbard
NVIDIA

2021-12-21 23:54:08

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Tue, Dec 21, 2021 at 04:19:33PM +0100, David Hildenbrand wrote:

> >> Note that I am trying to make also any kind of R/O pins on an anonymous
> >> page work as expected as well, to fix any kind of GUP after fork() and
> >> GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page
> >> similarly has to make sure that the page is exclusive -- even if it's
> >> mapped R/O (!).
> >
> > Why? AFAIK we don't have bugs here. If the page is RO and has an
> > elevated refcount it cannot be 'PageAnonExclusive' and so any place
> > that wants to drop the WP just cannot. What is the issue?

> But what I think you actually mean is if we want to get R/O pins
> right.

What I meant was a page that is GUP'd RO, is not PageAnonExclusive and
has an elevated refcount. Those cannot be transformed to
PageAnonExclusive, or re-used during COW, but also they don't have
problems today. Either the places are like O_DIRECT read and are tolerant
of a false COW, or they are broken like VFIO and should be using
FOLL_FORCE|FOLL_WRITE, which turns them into a WRITE and then we know
they get PageAnonExclusive.

So, the swap issue is fixed directly with PageAnonExclusive and no
change to READ GUP is required, at least in your #1 scenario, AFAICT..

> There are 2 models, leaving FOLL_FAULT_UNSHARE out of the picture for now:
>
> 1) Whenever mapping an anonymous page R/W (after COW, during ordinary
> fault, on swapin), we mark the page exclusive. We must never lose the
> PageAnonExclusive bit, not during migration, not during swapout.

I prefer this one as well.

It allows us to keep Linus's simple logic that refcount == 1 means
always safe to re-use, no matter what.

And refcount != 1 goes on to consider the additional bit to decide
what to do. The simple bit really means 'we know this page has one PTE
so ignore the refcount for COW reuse decisions'.
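
Put as a conceptual sketch (not kernel code; PageAnonExclusive() is the
bit under discussion):

	static bool can_reuse_anon_page_sketch(struct page *page)
	{
		if (page_count(page) == 1)
			return true;		/* single reference: trivially ours */
		/* Elevated refcount: the bit decides, the refcount is ignored. */
		return PageAnonExclusive(page);
	}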

> fork() will process the bit for each and every process, even if there
> was no GUP, and will copy if there are additional references.

Yes, just like it does today already for mapcount.

> 2) Whenever GUP wants to pin/ref a page, we try marking it exclusive. We
> can lose the PageAnonExclusive bit during migration and swapout, because
> that can only happen when there are no additional references.

I haven't thought of a way this is achievable.

At least not without destroying GUP fast..

Idea #2 is really a "this page is GUP'd" flag with some sneaky logic
to clear it. That comes along with all the races too because as an
idea it is fundamentally about GUP which runs without locks.

Jason

2021-12-21 23:59:24

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:

> What adds complexity to correctly maintain the "exclusive" state are at
> least:
> * KSM (might be harder, have to think about it)

I know little about it, but isn't KSM like fork where it is trying to
WP pages with the intention of copying them? Shouldn't KSM completely
reject WP'ing a page that is under any kind of writable GUP?

Jason

2021-12-22 08:30:12

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 22.12.21 00:59, Jason Gunthorpe wrote:
> On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
>
>> What adds complexity to correctly maintain the "exclusive" state are at
>> least:
>> * KSM (might be harder, have to think about it)
>
> I know little about it, but isn't KSM like fork where it is trying to
> WP pages with the intention of copying them? Shouldn't KSM completely
> reject WP'ing a page that is under any kind of writable GUP?

I think KSM will, similar to fork(), always have to try removing
PageAnonExclusive() while synchronizing against concurrent GUP pins. If
that fails, the page cannot be converted to KSM and consequently not be
shared.

That will need some rework of KSM AFAIU, but shouldn't be impossible to do.

--
Thanks,

David / dhildenb


2021-12-22 08:51:35

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 21.12.21 20:07, Jason Gunthorpe wrote:
> On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
>
>> 2) is certainly the cherry on top. But it just means that R/O pins don't
>> have to be the weird kid. And yes, achieving 2) would require
>> FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do
>> what existing COW logic does, just bypass the "map writable" and
>> "trigger write fault" semantics.
>
> I still don't agree with this - when you come to patches can you have
> this work at the end and under a good cover letter? Maybe it will make
> more sense then.

Yes. But really, I think it's the logical consequence of what Linus said
[1]:

"And then all GUP-fast would need to do is to refuse to look up a page
that isn't exclusive to that VM. We already have the situation that
GUP-fast can fail for non-writable pages etc, so it's just another
test."

We must not FOLL_PIN a page that is not exclusive (not only on gup-fast,
but really, on any gup). If we special case R/O FOLL_PIN, we cannot
enable the sanity check on unpin as suggested by Linus [2]:

"If we only set the exclusive VM bit on pages that get mapped into
user space, and we guarantee that GUP only looks up such pages, then
we can also add a debug test to the "unpin" case that the bit is
still set."

There are really only two feasible options I see when we want to take a
R/O FOLL_PIN on a !PageAnonExclusive() anon page

(1) Fail the pinning completely. This implies that we'll have to fail
O_DIRECT once converted to FOLL_PIN.
(2) Request to mark the page PageAnonExclusive() via a
FAULT_FLAG_UNSHARE and let it succeed.


Anything else would require additional accounting that, as we already
discussed in the past, is hard -- for example, differentiating R/O from
R/W pins would require two pin counters.

The only impact would be that FOLL_PIN after fork() has to go via a
FAULT_FLAG_UNSHARE once, to turn the page PageAnonExclusive. IMHO this
is the right thing to do for FOLL_LONGTERM. For !FOLL_LONGTERM it would
be nice to optimize this, to *not* do that, but again ... this would
require even more counters I think, for example, to differentiate
between "R/W short/long-term or R/O long-term pin" and "R/O short-term pin".

So unless we discover a way to do additional accounting for ordinary 4k
pages, I think we really can only do (1) or (2) to make sure we never
ever pin a !PageAnonExclusive() page.


[1]
https://lkml.kernel.org/r/CAHk-=wgQq3H6wfkW7+MmduVgBOqHeiXQN97yCMd+m1mM-1xCLQ@mail.gmail.com
[2]
https://lkml.kernel.org/r/CAHk-=wiyxQ==vnHFHW99S_OPwA=u1Qrfg2OGr_6zPcBAuhQY2w@mail.gmail.com


--
Thanks,

David / dhildenb


2021-12-22 09:58:44

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 22.12.21 09:51, David Hildenbrand wrote:
> On 21.12.21 20:07, Jason Gunthorpe wrote:
>> On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
>>
>>> 2) is certainly the cherry on top. But it just means that R/O pins don't
>>> have to be the weird kid. And yes, achieving 2) would require
>>> FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do
>>> what existing COW logic does, just bypass the "map writable" and
>>> "trigger write fault" semantics.
>>
>> I still don't agree with this - when you come to patches can you have
>> this work at the end and under a good cover letter? Maybe it will make
>> more sense then.
>
> Yes. But really, I think it's the logical consequence of what Linus said
> [1]:
>
> "And then all GUP-fast would need to do is to refuse to look up a page
> that isn't exclusive to that VM. We already have the situation that
> GUP-fast can fail for non-writable pages etc, so it's just another
> test."
>
> We must not FOLL_PIN a page that is not exclusive (not only on gup-fast,
> but really, on any gup). If we special case R/O FOLL_PIN, we cannot
> enable the sanity check on unpin as suggested by Linus [2]:
>
> "If we only set the exclusive VM bit on pages that get mapped into
> user space, and we guarantee that GUP only looks up such pages, then
> we can also add a debug test to the "unpin" case that the bit is
> still set."
>
> There are really only two feasible options I see when we want to take a
> R/O FOLL_PIN on a !PageAnonExclusive() anon page
>
> (1) Fail the pinning completely. This implies that we'll have to fail
> O_DIRECT once converted to FOLL_PIN.
> (2) Request to mark the page PageAnonExclusive() via a
> FAULT_FLAG_UNSHARE and let it succeed.
>
>
> Anything else would require additional accounting that we already
> discussed in the past is hard -- for example, to differentiate R/O from
> R/W pins requiring two pin counters.
>
> The only impact would be that FOLL_PIN after fork() has to go via a
> FAULT_FLAG_UNSHARE once, to turn the page PageAnonExclusive. IMHO this
> is the right thing to do for FOLL_LONGTERM. For !FOLL_LONGTERM it would
> be nice to optimize this, to *not* do that, but again ... this would
> require even more counters I think, for example, to differentiate
> between "R/W short/long-term or R/O long-term pin" and "R/O short-term pin".
>
> So unless we discover a way to do additional accounting for ordinary 4k
> pages, I think we really can only do (1) or (2) to make sure we never
> ever pin a !PageAnonExclusive() page.

BTW, I just wondered whether the optimization should actually be that R/O
short-term FOLL_PIN users use FOLL_GET instead. So O_DIRECT with R/O would
already be doing the right thing.

And it somewhat aligns with what we found: only R/W short-term FOLL_GET
is problematic, where we can lose writes to the page from the device via
O_DIRECT.

IIUC, our COW logic makes sure that a shared anonymous page that might
still be used by a R/O FOLL_GET cannot be modified, because any attempt
to modify it would result in a copy.

But I might be missing something, just an idea.


--
Thanks,

David / dhildenb


2021-12-22 12:42:14

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Wed 22-12-21 10:58:36, David Hildenbrand wrote:
> On 22.12.21 09:51, David Hildenbrand wrote:
> > On 21.12.21 20:07, Jason Gunthorpe wrote:
> >> On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
> >>
> >>> 2) is certainly the cherry on top. But it just means that R/O pins don't
> >>> have to be the weird kid. And yes, achieving 2) would require
> >>> FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do
> >>> what existing COW logic does, just bypass the "map writable" and
> >>> "trigger write fault" semantics.
> >>
> >> I still don't agree with this - when you come to patches can you have
> >> this work at the end and under a good cover letter? Maybe it will make
> >> more sense then.
> >
> > Yes. But really, I think it's the logical consequence of what Linus said
> > [1]:
> >
> > "And then all GUP-fast would need to do is to refuse to look up a page
> > that isn't exclusive to that VM. We already have the situation that
> > GUP-fast can fail for non-writable pages etc, so it's just another
> > test."
> >
> > We must not FOLL_PIN a page that is not exclusive (not only on gup-fast,
> > but really, on any gup). If we special case R/O FOLL_PIN, we cannot
> > enable the sanity check on unpin as suggested by Linus [2]:
> >
> > "If we only set the exclusive VM bit on pages that get mapped into
> > user space, and we guarantee that GUP only looks up such pages, then
> > we can also add a debug test to the "unpin" case that the bit is
> > still set."
> >
> > There are really only two feasible options I see when we want to take a
> > R/O FOLL_PIN on a !PageAnonExclusive() anon page
> >
> > (1) Fail the pinning completely. This implies that we'll have to fail
> > O_DIRECT once converted to FOLL_PIN.
> > (2) Request to mark the page PageAnonExclusive() via a
> > FAULT_FLAG_UNSHARE and let it succeed.
> >
> >
> > Anything else would require additional accounting that we already
> > discussed in the past is hard -- for example, to differentiate R/O from
> > R/W pins requiring two pin counters.
> >
> > The only impact would be that FOLL_PIN after fork() has to go via a
> > FAULT_FLAG_UNSHARE once, to turn the page PageAnonExclusive. IMHO this
> > is the right thing to do for FOLL_LONGTERM. For !FOLL_LONGTERM it would
> > be nice to optimize this, to *not* do that, but again ... this would
> > require even more counters I think, for example, to differentiate
> > between "R/W short/long-term or R/O long-term pin" and "R/O short-term pin".
> >
> > So unless we discover a way to do additional accounting for ordinary 4k
> > pages, I think we really can only do (1) or (2) to make sure we never
> > ever pin a !PageAnonExclusive() page.
>
> BTW, I just wondered if the optimization should actually be that R/O
> short-term FOLL_PIN users should actually be using FOLL_GET instead. So
> O_DIRECT with R/O would already be doing the right thing.
>
> And it somewhat aligns with what we found: only R/W short-term FOLL_GET
> is problematic, where we can lose writes to the page from the device via
> O_DIRECT.
>
> IIUC, our COW logic makes sure that a shared anonymous page that might
> still be used by a R/O FOLL_GET cannot be modified, because any attempt
> to modify it would result in a copy.

Well, we defined FOLL_PIN to mean the intent that the caller wants to access
not only page state (for which FOLL_GET is enough, and there are some users
- mostly inside mm - who need this) but also page data. Eventually, we even
wanted to make FOLL_GET unavailable to broad areas of the kernel (and keep it
internal to MM for its dirty deeds ;)) to reduce the misuse of GUP.

For file pages we need this data vs no-data access distinction so that
filesystems can detect when someone can be accessing page data although the
page is unmapped. Practically, filesystems care most about when someone
can be *modifying* page data (we need to make sure data is stable e.g. when
writing back data to disk or doing data checksumming or other operations)
so using FOLL_GET when wanting to only read page data should be OK for
filesystems but honestly I would be reluctant to break the rule of "use
FOLL_PIN when wanting to access page data" to keep things simple and
reasonably easy to understand for parties such as filesystem developers or
driver developers who all need to interact with pinned pages...
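
In code, the rule is simply the following (a minimal sketch using the
existing API; error handling trimmed, helper name made up):

static int touch_user_data(unsigned long uaddr, struct page **pages, int nr)
{
	int got;

	/* We will access (here: modify) page data, so pin, with FOLL_WRITE. */
	got = pin_user_pages_fast(uaddr, nr, FOLL_WRITE, pages);
	if (got != nr) {
		if (got > 0)
			unpin_user_pages(pages, got);
		return -EFAULT;
	}

	/* ... read/write the data, e.g. set up DMA targeting these pages ... */

	/* Mark the pages dirty and drop the pins once the data access is done. */
	unpin_user_pages_dirty_lock(pages, got, true);
	return 0;
}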

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-12-22 12:44:25

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Tue 21-12-21 19:59:16, Jason Gunthorpe wrote:
> On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
>
> > What adds complexity to correctly maintain the "exclusive" state are at
> > least:
> > * KSM (might be harder, have to think about it)
>
> I know little about it, but isn't KSM like fork where it is trying to
> WP pages with the intention of copying them? Shouldn't KSM completely
> reject WP'ing a page that is under any kind of writable GUP?

I know little about KSM as well but I think fundamentally it has similar
requirements for anon pages as filesystems have for page cache pages e.g.
when doing block deduplication or data checksumming... I.e., it needs to
make sure data in the page is stable and nobody can modify it.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-12-22 13:09:53

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

>> IIUC, our COW logic makes sure that a shared anonymous page that might
>> still be used by a R/O FOLL_GET cannot be modified, because any attempt
>> to modify it would result in a copy.
>
> Well, we defined FOLL_PIN to mean the intent that the caller wants to access
> not only page state (for which is enough FOLL_GET and there are some users
> - mostly inside mm - who need this) but also page data. Eventually, we even
> wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it
> internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
>
> For file pages we need this data vs no-data access distinction so that
> filesystems can detect when someone can be accessing page data although the
> page is unmapped. Practically, filesystems care most about when someone
> can be *modifying* page data (we need to make sure data is stable e.g. when
> writing back data to disk or doing data checksumming or other operations)
> so using FOLL_GET when wanting to only read page data should be OK for
> filesystems but honestly I would be reluctant to break the rule of "use
> FOLL_PIN when wanting to access page data" to keep things simple and
> reasonably easy to understand for parties such as filesystem developers or
> driver developers who all need to interact with pinned pages...

Right, from an API perspective we really want people to use FOLL_PIN.

To optimize this case in particular, it would help if we had the FOLL
flags on the unpin path. Then we could just decide internally "well,
short-term R/O FOLL_PIN can be really lightweight, we can treat this like
a FOLL_GET instead". And we would need that as well if we were to keep
different counters for R/O vs. R/W pins.
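
Something like the below is what I mean by having the flags on the unpin
side -- purely illustrative, no such helper exists today:

static void unpin_user_pages_flags(struct page **pages, unsigned long npages,
				   unsigned int gup_flags)
{
	/* If the pages were only acquired FOLL_GET-style, drop plain refs. */
	if (!(gup_flags & FOLL_PIN)) {
		while (npages--)
			put_page(pages[npages]);
		return;
	}

	/* Real pins: rebalance the pin counters. */
	unpin_user_pages(pages, npages);
}

GUP could then internally downgrade a short-term R/O FOLL_PIN to a plain
reference, because the release side would be able to tell the difference.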

--
Thanks,

David / dhildenb


2021-12-22 14:42:59

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Wed 22-12-21 14:09:41, David Hildenbrand wrote:
> >> IIUC, our COW logic makes sure that a shared anonymous page that might
> >> still be used by a R/O FOLL_GET cannot be modified, because any attempt
> >> to modify it would result in a copy.
> >
> > Well, we defined FOLL_PIN to mean the intent that the caller wants to access
> > not only page state (for which is enough FOLL_GET and there are some users
> > - mostly inside mm - who need this) but also page data. Eventually, we even
> > wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it
> > internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
> >
> > For file pages we need this data vs no-data access distinction so that
> > filesystems can detect when someone can be accessing page data although the
> > page is unmapped. Practically, filesystems care most about when someone
> > can be *modifying* page data (we need to make sure data is stable e.g. when
> > writing back data to disk or doing data checksumming or other operations)
> > so using FOLL_GET when wanting to only read page data should be OK for
> > filesystems but honestly I would be reluctant to break the rule of "use
> > FOLL_PIN when wanting to access page data" to keep things simple and
> > reasonably easy to understand for parties such as filesystem developers or
> > driver developers who all need to interact with pinned pages...
>
> Right, from an API perspective we really want people to use FOLL_PIN.
>
> To optimize this case in particular it would help if we would have the
> FOLL flags on the unpin path. Then we could just decide internally
> "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
> this like a FOLL_GET instead". And we would need that as well if we were
> to keep different counters for R/O vs. R/W pinned.

Well, I guess the question here is: Which GUP user needs only R/O access to
page data and is so performance critical that it would be worth it to
sacrifice API clarity for speed? I'm not aware of any but I was not looking
really hard...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-12-22 14:48:41

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On 22.12.21 15:42, Jan Kara wrote:
> On Wed 22-12-21 14:09:41, David Hildenbrand wrote:
>>>> IIUC, our COW logic makes sure that a shared anonymous page that might
>>>> still be used by a R/O FOLL_GET cannot be modified, because any attempt
>>>> to modify it would result in a copy.
>>>
>>> Well, we defined FOLL_PIN to mean the intent that the caller wants to access
>>> not only page state (for which is enough FOLL_GET and there are some users
>>> - mostly inside mm - who need this) but also page data. Eventually, we even
>>> wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it
>>> internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
>>>
>>> For file pages we need this data vs no-data access distinction so that
>>> filesystems can detect when someone can be accessing page data although the
>>> page is unmapped. Practically, filesystems care most about when someone
>>> can be *modifying* page data (we need to make sure data is stable e.g. when
>>> writing back data to disk or doing data checksumming or other operations)
>>> so using FOLL_GET when wanting to only read page data should be OK for
>>> filesystems but honestly I would be reluctant to break the rule of "use
>>> FOLL_PIN when wanting to access page data" to keep things simple and
>>> reasonably easy to understand for parties such as filesystem developers or
>>> driver developers who all need to interact with pinned pages...
>>
>> Right, from an API perspective we really want people to use FOLL_PIN.
>>
>> To optimize this case in particular it would help if we would have the
>> FOLL flags on the unpin path. Then we could just decide internally
>> "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
>> this like a FOLL_GET instead". And we would need that as well if we were
>> to keep different counters for R/O vs. R/W pinned.
>
> Well, I guess the question here is: Which GUP user needs only R/O access to
> page data and is so performance critical that it would be worth it to
> sacrifice API clarity for speed? I'm not aware of any but I was not looking
> really hard...

I'd be interested in examples as well. Maybe databases that use O_DIRECT
after fork()?


--
Thanks,

David / dhildenb


2021-12-22 16:08:50

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Wed 22-12-21 15:48:34, David Hildenbrand wrote:
> On 22.12.21 15:42, Jan Kara wrote:
> > On Wed 22-12-21 14:09:41, David Hildenbrand wrote:
> >>>> IIUC, our COW logic makes sure that a shared anonymous page that might
> >>>> still be used by a R/O FOLL_GET cannot be modified, because any attempt
> >>>> to modify it would result in a copy.
> >>>
> >>> Well, we defined FOLL_PIN to mean the intent that the caller wants to access
> >>> not only page state (for which is enough FOLL_GET and there are some users
> >>> - mostly inside mm - who need this) but also page data. Eventually, we even
> >>> wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it
> >>> internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
> >>>
> >>> For file pages we need this data vs no-data access distinction so that
> >>> filesystems can detect when someone can be accessing page data although the
> >>> page is unmapped. Practically, filesystems care most about when someone
> >>> can be *modifying* page data (we need to make sure data is stable e.g. when
> >>> writing back data to disk or doing data checksumming or other operations)
> >>> so using FOLL_GET when wanting to only read page data should be OK for
> >>> filesystems but honestly I would be reluctant to break the rule of "use
> >>> FOLL_PIN when wanting to access page data" to keep things simple and
> >>> reasonably easy to understand for parties such as filesystem developers or
> >>> driver developers who all need to interact with pinned pages...
> >>
> >> Right, from an API perspective we really want people to use FOLL_PIN.
> >>
> >> To optimize this case in particular it would help if we would have the
> >> FOLL flags on the unpin path. Then we could just decide internally
> >> "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
> >> this like a FOLL_GET instead". And we would need that as well if we were
> >> to keep different counters for R/O vs. R/W pinned.
> >
> > Well, I guess the question here is: Which GUP user needs only R/O access to
> > page data and is so performance critical that it would be worth it to
> > sacrifice API clarity for speed? I'm not aware of any but I was not looking
> > really hard...
>
> I'd be interested in examples as well. Maybe databases that use O_DIRECT
> after fork()?

Well, but O_DIRECT reads must use FOLL_PIN in any case because they modify
page data (and so we need to detect them both for COW and filesystem needs).
O_DIRECT writes could use FOLL_GET but at this point I'm not convinced it
is worth it.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-12-22 16:44:57

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Wed, Dec 22, 2021 at 05:08:46PM +0100, Jan Kara wrote:
> On Wed 22-12-21 15:48:34, David Hildenbrand wrote:
> > On 22.12.21 15:42, Jan Kara wrote:
> > > On Wed 22-12-21 14:09:41, David Hildenbrand wrote:
> > >>>> IIUC, our COW logic makes sure that a shared anonymous page that might
> > >>>> still be used by a R/O FOLL_GET cannot be modified, because any attempt
> > >>>> to modify it would result in a copy.
> > >>>
> > >>> Well, we defined FOLL_PIN to mean the intent that the caller wants to access
> > >>> not only page state (for which is enough FOLL_GET and there are some users
> > >>> - mostly inside mm - who need this) but also page data. Eventually, we even
> > >>> wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it
> > >>> internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
> > >>>
> > >>> For file pages we need this data vs no-data access distinction so that
> > >>> filesystems can detect when someone can be accessing page data although the
> > >>> page is unmapped. Practically, filesystems care most about when someone
> > >>> can be *modifying* page data (we need to make sure data is stable e.g. when
> > >>> writing back data to disk or doing data checksumming or other operations)
> > >>> so using FOLL_GET when wanting to only read page data should be OK for
> > >>> filesystems but honestly I would be reluctant to break the rule of "use
> > >>> FOLL_PIN when wanting to access page data" to keep things simple and
> > >>> reasonably easy to understand for parties such as filesystem developers or
> > >>> driver developers who all need to interact with pinned pages...
> > >>
> > >> Right, from an API perspective we really want people to use FOLL_PIN.
> > >>
> > >> To optimize this case in particular it would help if we would have the
> > >> FOLL flags on the unpin path. Then we could just decide internally
> > >> "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
> > >> this like a FOLL_GET instead". And we would need that as well if we were
> > >> to keep different counters for R/O vs. R/W pinned.
> > >
> > > Well, I guess the question here is: Which GUP user needs only R/O access to
> > > page data and is so performance critical that it would be worth it to
> > > sacrifice API clarity for speed? I'm not aware of any but I was not looking
> > > really hard...
> >
> > I'd be interested in examples as well. Maybe databases that use O_DIRECT
> > after fork()?
>
> Well, but O_DIRECT reads must use FOLL_PIN in any case because they modify
> page data (and so we need to detect them both for COW and filesystem needs).
> O_DIRECT writes could use FOLL_GET but at this point I'm not convinced it
> is worth it.

Wow, I didn't realise the plan was to make FOLL_PIN the "default".
I hoped it was weird crap that was going away soon. Looks like we'd
better fix all the bugs in it then ...

2021-12-22 18:40:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Wed, Dec 22, 2021 at 8:08 AM Jan Kara <[email protected]> wrote:
>
> Well, but O_DIRECT reads must use FOLL_PIN in any case because they modify
> page data (and so we need to detect them both for COW and filesystem needs).

Well, O_DIRECT reads do, but not necessarily writes.

And hey, even reads have been dodgy in the past when we didn't really
have the pinning logic - there's been a lot of users that just wanted
it to work for their particular use-case rather than in general and in
all situations..

Linus

2021-12-23 00:21:26

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote:
> Right, from an API perspective we really want people to use FOLL_PIN.
>
> To optimize this case in particular it would help if we would have the
> FOLL flags on the unpin path. Then we could just decide internally
> "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
> this like a FOLL_GET instead". And we would need that as well if we were
> to keep different counters for R/O vs. R/W pinned.

FYI, in my current tree, there's a gup_put_folio() which replaces
put_compound_head:

static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
{
	if (flags & FOLL_PIN) {
		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
		if (hpage_pincount_available(&folio->page))
			hpage_pincount_sub(&folio->page, refs);
		else
			refs *= GUP_PIN_COUNTING_BIAS;
	}

	folio_put_refs(folio, refs);
}

That can become non-static if it's needed. I'm still working on that
series, because I'd like to get it to a point where we return one
folio pointer instead of N page pointers. Not quite there yet.

2021-12-23 12:54:45

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Wed 22-12-21 10:40:18, Linus Torvalds wrote:
> On Wed, Dec 22, 2021 at 8:08 AM Jan Kara <[email protected]> wrote:
> >
> > Well, but O_DIRECT reads must use FOLL_PIN in any case because they modify
> > page data (and so we need to detect them both for COW and filesystem needs).
>
> Well, O_DIRECT reads do, but not necessarily writes.

I agree.

> And hey, even reads have been dodgy in the past when we didn't really
> have the pinning logic - there's been a lot of users that just wanted
> it to work for their particular use-case rather than in general and in
> all situations..

Yes, but currently a malicious user can take the system down (BUG_ON) or
cause DIF/DIX failures if he is nasty and tries hard enough with O_DIRECT
reads (practically, the window is small so I haven't really seen a report
that I could trace to O_DIRECT reads but in principle the problem is the
same as with pinning & dirtying done e.g. by video capture drivers, and
there we've seen these problems happen). So forcing pinning for O_DIRECT
reads is IMO mandatory.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-12-23 17:25:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Thu, Dec 23, 2021 at 4:54 AM Jan Kara <[email protected]> wrote:
>
> So forcing pinning for O_DIRECT reads is IMO mandatory.

I don't disagree.

And I do think the eventual aim should be to do it for writes too, even
if they don't necessarily require it (since they write _from_ the VM
data, not _to_ the VM data - the "read vs. write" direction has always
been confusing when it comes to GUP).
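
To spell out the direction thing (the helper is made up for illustration;
only FOLL_WRITE is a real flag):

static unsigned int o_direct_gup_flags(bool is_read_syscall)
{
	/*
	 * read(2) + O_DIRECT: the device writes *into* the user pages,
	 * so GUP has to ask for write access.
	 * write(2) + O_DIRECT: the device only reads *from* the user
	 * pages, so no FOLL_WRITE is needed as far as COW is concerned.
	 */
	return is_read_syscall ? FOLL_WRITE : 0;
}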

Partly just for consistency in the IO paths - I think people want to
share as much as possible in there - but also just to make sure that
we're all done with the "wrong-way-cow" kind of issues for good.

If we get to the point where the legacy GUP is used only for very
special things (looking up physical pages for debug and trace purposes
etc), I think that would be lovely.

That may be a pretty long-range goal, though.

Linus

2021-12-24 02:53:16

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Thu, Dec 23, 2021 at 12:21:06AM +0000, Matthew Wilcox wrote:
> On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote:
> > Right, from an API perspective we really want people to use FOLL_PIN.
> >
> > To optimize this case in particular it would help if we would have the
> > FOLL flags on the unpin path. Then we could just decide internally
> > "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
> > this like a FOLL_GET instead". And we would need that as well if we were
> > to keep different counters for R/O vs. R/W pinned.
>
> FYI, in my current tree, there's a gup_put_folio() which replaces
> put_compound_head:
>
> static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
> {
> 	if (flags & FOLL_PIN) {
> 		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
> 		if (hpage_pincount_available(&folio->page))
> 			hpage_pincount_sub(&folio->page, refs);
> 		else
> 			refs *= GUP_PIN_COUNTING_BIAS;
> 	}
>
> 	folio_put_refs(folio, refs);
> }
>
> That can become non-static if it's needed. I'm still working on that
> series, because I'd like to get it to a point where we return one
> folio pointer instead of N page pointers. Not quite there yet.

I'm keen to see what that looks like; every driver I'm working on that
calls PUP goes through gyrations to recover contiguous pages, so this
is most welcome!
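
For reference, the gyrations today look roughly like this (a sketch with
error paths trimmed, assuming a page-aligned user address and a made-up
helper name):

static int pin_and_map(struct device *dev, unsigned long uaddr, int nr_pages,
		       struct page **pages, struct sg_table *sgt)
{
	int got, nents;

	got = pin_user_pages_fast(uaddr, nr_pages,
				  FOLL_WRITE | FOLL_LONGTERM, pages);
	if (got != nr_pages) {
		if (got > 0)
			unpin_user_pages(pages, got);
		return -EFAULT;
	}

	/* Recover contiguity: coalesce adjacent pages into fewer SG entries. */
	if (sg_alloc_table_from_pages(sgt, pages, got, 0,
				      (unsigned long)got * PAGE_SIZE,
				      GFP_KERNEL)) {
		unpin_user_pages(pages, got);
		return -ENOMEM;
	}

	nents = dma_map_sg(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE);
	if (!nents) {
		sg_free_table(sgt);
		unpin_user_pages(pages, got);
		return -EIO;
	}
	return nents;
}

A folio/bio_vec based return value would let most of that middle step go
away.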

Jason

2021-12-24 04:53:58

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Thu, Dec 23, 2021 at 10:53:09PM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 23, 2021 at 12:21:06AM +0000, Matthew Wilcox wrote:
> > On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote:
> > > Right, from an API perspective we really want people to use FOLL_PIN.
> > >
> > > To optimize this case in particular it would help if we would have the
> > > FOLL flags on the unpin path. Then we could just decide internally
> > > "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
> > > this like a FOLL_GET instead". And we would need that as well if we were
> > > to keep different counters for R/O vs. R/W pinned.
> >
> > FYI, in my current tree, there's a gup_put_folio() which replaces
> > put_compound_head:
> >
> > static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
> > {
> > 	if (flags & FOLL_PIN) {
> > 		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
> > 		if (hpage_pincount_available(&folio->page))
> > 			hpage_pincount_sub(&folio->page, refs);
> > 		else
> > 			refs *= GUP_PIN_COUNTING_BIAS;
> > 	}
> >
> > 	folio_put_refs(folio, refs);
> > }
> >
> > That can become non-static if it's needed. I'm still working on that
> > series, because I'd like to get it to a point where we return one
> > folio pointer instead of N page pointers. Not quite there yet.
>
> I'm keen to see what that looks like, every driver I'm working on that
> calls PUP goes through gyrations to recover contiguous pages, so this
> is most welcomed!

I'm about to take some time off, so alas, you won't see it any time
soon. It'd be good to talk with some of the interested users because
it's actually a pretty tricky problem. We can't just return an array
of the struct folios because the actual memory you want to access
might be anywhere in that folio, and you don't want to have to redo
the lookup just to find out which subpages of the folio are meant.

So I'm currently thinking about returning a bio_vec:

struct bio_vec {
	struct page *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

In the iomap patchset which should go upstream in the next merge window,
you can iterate over a bio like this:

struct folio_iter fi;

bio_for_each_folio_all(fi, bio)
	iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);

There aren't any equivalent helpers for a bvec yet, but obviously we can
add them so that you can iterate over each folio in a contiguous range.

But now that each component in it is variable length, the caller can't
know how large an array of bio_vecs to allocate.

1. The callee can allocate the array and let the caller free it when it's
finished
2. The caller passes in a (small, fixed-size, on-stack) array of bio_vecs
over (potentially) multiple calls.
3. The caller can overallocate and ignore that most of the array isn't
used.

Any preferences? I don't like #3.

2022-01-04 00:33:59

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

On Fri, Dec 24, 2021 at 04:53:38AM +0000, Matthew Wilcox wrote:
> On Thu, Dec 23, 2021 at 10:53:09PM -0400, Jason Gunthorpe wrote:
> > On Thu, Dec 23, 2021 at 12:21:06AM +0000, Matthew Wilcox wrote:
> > > On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote:
> > > > Right, from an API perspective we really want people to use FOLL_PIN.
> > > >
> > > > To optimize this case in particular it would help if we would have the
> > > > FOLL flags on the unpin path. Then we could just decide internally
> > > > "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
> > > > this like a FOLL_GET instead". And we would need that as well if we were
> > > > to keep different counters for R/O vs. R/W pinned.
> > >
> > > FYI, in my current tree, there's a gup_put_folio() which replaces
> > > put_compound_head:
> > >
> > > static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
> > > {
> > > 	if (flags & FOLL_PIN) {
> > > 		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
> > > 		if (hpage_pincount_available(&folio->page))
> > > 			hpage_pincount_sub(&folio->page, refs);
> > > 		else
> > > 			refs *= GUP_PIN_COUNTING_BIAS;
> > > 	}
> > >
> > > 	folio_put_refs(folio, refs);
> > > }
> > >
> > > That can become non-static if it's needed. I'm still working on that
> > > series, because I'd like to get it to a point where we return one
> > > folio pointer instead of N page pointers. Not quite there yet.
> >
> > I'm keen to see what that looks like, every driver I'm working on that
> > calls PUP goes through gyrations to recover contiguous pages, so this
> > is most welcomed!
>
> I'm about to take some time off, so alas, you won't see it any time
> soon. It'd be good to talk with some of the interested users because
> it's actually a pretty tricky problem.

Sure, it is a good idea

> We can't just return an array of the struct folios because the
> actual memory you want to access might be anywhere in that folio,
> and you don't want to have to redo the lookup just to find out which
> subpages of the folio are meant.

Yep

> So I'm currently thinking about returning a bio_vec:
>
> struct bio_vec {
> 	struct page *bv_page;
> 	unsigned int bv_len;
> 	unsigned int bv_offset;
> };

The cases I'm looking at basically want an efficient list of physical
addresses + lengths. They don't care about pages or folios, eg often
the next step is to build a SGL and DMA map it which largely ignores
all of that.

As the memory used to hold the output of pin_user_pages() is all
temporary, there is a sensitivity to allocating the memory quickly, but
also to having enough of it so that we don't have to do redundant work
in pin_user_pages() - e.g. traversing the same PMD table again and
again.

> But now that each component in it is variable length, the caller can't
> know how large an array of bio_vecs to allocate.

And the array entry is now 2x the size and there is no way to scatter
the array to 4k segments?

> 1. The callee can allocate the array and let the caller free it when it's
> finished

It is not bad, but a bit tricky: a lot of the GUP code executes with
IRQs disabled, so it has to use a pre-allocating scheme. We also
can't scan everything twice and hope it didn't change, so exact
preallocation doesn't seem likely either.

> 2. The caller passes in a (small, fixed-size, on-stack) array of bio_vecs
> over (potentially) multiple calls.

It is slow, because we do redundant work traversing the same locks and
page tables again and again..

> 3. The caller can overallocate and ignore that most of the array isn't
> used.
>
> Any preferences? I don't like #3.

#3 is OK for my applications because we immediately turn around and
copy the output to something else and free the memory anyhow...

However, being an array means we can't reliably allocate more than 4k,
and with 16 bytes per entry that isn't even enough to cover a full PTE
table (256 entries vs. 512 PTEs).

What would be nice for these cases is if the caller could supply an
array of 4k pages and GUP would fill them in. In many cases we'd
probably pass in up to 2M worth of pages or something.

There is some batching balance here where we want to minimize the
temporary memory consumed by GUP's output (and the time to allocate
it!) but also minimize the redundant work inside GUP repeatedly
walking the same tables and locks.

eg ideally GUP would stop at some natural alignment boundary if it
could tell it can't fully fill the buffer. Then the next iteration
would not redo the same locks.

I was once thinking about something like storing an array of PFNs and
using the high bits to encode that the PFN is not 4k. It would allow
efficient packing of the common fragmented cases. To make it work
you'd need to have each 4k page grow the pfn list up from the start
and the pfn sizes down from the end. A size entry is only consumed if
the pfn bits can't encode the size directly so the packing can be a
perfect 8 bytes per PFN for the common 4k and 2M aligned cases.
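
Roughly like this -- the encoding, names and limits are completely made up,
just to illustrate the packing idea:

#define PFN_RUN_SHIFT		52	/* low 52 bits hold the PFN */
#define PFN_RUN_INLINE_MAX	((1ULL << (64 - PFN_RUN_SHIFT)) - 2)
#define PFN_RUN_EXTERNAL	(PFN_RUN_INLINE_MAX + 1)

struct pfn_chunk {
	u64 *entries;		/* one 4k page: 512 u64 slots */
	unsigned int nr_pfns;	/* PFN entries used from the start */
	unsigned int nr_sizes;	/* size entries used from the end */
};

/* Append a run of nr_pages 4k pages starting at pfn; false if the chunk is full. */
static bool pfn_chunk_add(struct pfn_chunk *c, u64 pfn, u64 nr_pages)
{
	bool external = nr_pages > PFN_RUN_INLINE_MAX;
	unsigned int needed = external ? 2 : 1;

	if (c->nr_pfns + c->nr_sizes + needed > 512)
		return false;

	/* Common 4k/2M runs encode their length in the high bits: 8 bytes total. */
	c->entries[c->nr_pfns++] = pfn |
		(external ? PFN_RUN_EXTERNAL : nr_pages) << PFN_RUN_SHIFT;
	/* Oversized runs consume one extra size entry from the end of the page. */
	if (external)
		c->entries[511 - c->nr_sizes++] = nr_pages;
	return true;
}

Whether that bookkeeping is worth it vs. just handing GUP an array of 4k
output pages to fill is exactly the batching question above.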

Jason