2021-01-15 19:07:34

by Axel Rasmussen

Subject: [PATCH 0/9] userfaultfd: add minor fault handling

Changelog
=========

RFC->v1:
- Rebased onto Peter Xu's patches for disabling huge PMD sharing for certain
userfaultfd-registered areas.
- Added commits which update documentation, and add a self test which exercises
the new feature.
- Fixed reporting CONTINUE as a supported ioctl even for non-MINOR ranges.

Overview
========

This series adds a new userfaultfd registration mode,
UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
By "minor" fault, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
One of the mappings is registered with userfaultfd (in minor mode), and the
other is not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what I'm
calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
have huge_pte_none(), but find_lock_page() finds an existing page.

We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
userspace resolves the fault by either a) doing nothing if the contents are
already correct, or b) updating the underlying contents using the second,
non-UFFD mapping (via memcpy/memset or similar, or something fancier like
RDMA, etc.). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
"I have ensured the page contents are correct, carry on setting up the mapping".

Use Case
========

Consider the use case of VM live migration (e.g. under QEMU/KVM):

1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".

2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.

3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.

4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".

We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
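
For instance (a sketch; page_is_stale() and fetch_from_source() are
hypothetical helpers, <errno.h> is assumed, and uffd_area / alias_area are
the two mappings described above), the background pass might look like:

    for (unsigned long off = 0; off < len; off += page_size) {
        if (page_is_stale(off))
            fetch_from_source(alias_area + off, page_size);

        struct uffdio_continue cont = {
            .range = { .start = (unsigned long)uffd_area + off,
                       .len = page_size },
            /* mode 0 also wakes anyone already blocked on this page. */
        };
        if (ioctl(uffd, UFFDIO_CONTINUE, &cont) && errno != EEXIST)
            break;  /* EEXIST just means it was already resolved on-demand */
    }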

Interaction with Existing APIs
==============================

Because it's possible to combine registration modes (e.g. a single VMA can be
userfaultfd-registered MINOR | MISSING), and because it's up to userspace how to
resolve faults once they are received, I spent some time thinking through how
the existing API interacts with the new feature.

UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:

- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.

UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults. Without
modifications, the existing codepath assumes a new page needs to be allocated.
This is okay, since userspace must have a second non-UFFD-registered mapping
anyway, so there isn't much reason to use these in any case (just use
memcpy() or memset() or similar).

- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).
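
In practice, a manager that combines MINOR | MISSING can treat some of these
errors as benign races rather than hard failures. A sketch (names as in the
uapi; dst and src are assumed to be suitably aligned):

    struct uffdio_copy copy = {
        .dst = dst,
        .src = (unsigned long)src,
        .len = page_size,
    };
    /* If the page turned out to be present already (a minor, rather than
     * missing, fault raced with us), EEXIST is expected and harmless. */
    if (ioctl(uffd, UFFDIO_COPY, &copy) && errno != EEXIST)
        perror("UFFDIO_COPY");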

Dependencies
============

I've included 4 commits from Peter Xu's larger series
(https://lore.kernel.org/patchwork/cover/1366017/) in this series. My changes
depend on his work, to disable huge PMD sharing for MINOR registered userfaultfd
areas. I included the 4 commits directly because a) it lets this series just be
applied and work as-is, and b) they are fairly standalone, and could potentially
be merged even without the rest of the larger series Peter submitted. Thanks
Peter!

Also, although it doesn't affect minor fault handling, I did notice that the
userfaultfd self test sometimes experienced memory corruption
(https://lore.kernel.org/patchwork/cover/1356755/). For anyone testing this
series, it may be useful to apply that series first to fix the selftest
flakiness. That series doesn't have to be merged into mainline / maintainer
branches before mine, though.

Future Work
===========

Currently the patchset only supports hugetlbfs. There is no reason it can't work
with shmem, but I expect hugetlbfs to be much more commonly used since we're
talking about backing guest memory for VMs. I plan to implement shmem support in
a follow-up patch series.

Axel Rasmussen (5):
userfaultfd: add minor fault registration mode
userfaultfd: disable huge PMD sharing for MINOR registered VMAs
userfaultfd: add UFFDIO_CONTINUE ioctl
userfaultfd: update documentation to describe minor fault handling
userfaultfd/selftests: add test exercising minor fault handling

Peter Xu (4):
hugetlb: Pass vma into huge_pte_alloc()
hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled
mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h
hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp

Documentation/admin-guide/mm/userfaultfd.rst | 105 ++++++----
arch/arm64/mm/hugetlbpage.c | 5 +-
arch/ia64/mm/hugetlbpage.c | 3 +-
arch/mips/mm/hugetlbpage.c | 4 +-
arch/parisc/mm/hugetlbpage.c | 2 +-
arch/powerpc/mm/hugetlbpage.c | 3 +-
arch/s390/mm/hugetlbpage.c | 2 +-
arch/sh/mm/hugetlbpage.c | 2 +-
arch/sparc/mm/hugetlbpage.c | 2 +-
fs/proc/task_mmu.c | 1 +
fs/userfaultfd.c | 190 ++++++++++++++++---
include/linux/hugetlb.h | 22 ++-
include/linux/mm.h | 1 +
include/linux/mmu_notifier.h | 1 +
include/linux/userfaultfd_k.h | 29 ++-
include/trace/events/mmflags.h | 1 +
include/uapi/linux/userfaultfd.h | 36 +++-
mm/hugetlb.c | 61 ++++--
mm/userfaultfd.c | 88 ++++++---
tools/testing/selftests/vm/userfaultfd.c | 147 +++++++++++++-
20 files changed, 570 insertions(+), 135 deletions(-)

--
2.30.0.284.gd98b1dd5eaa7-goog


2021-01-15 19:07:45

by Axel Rasmussen

Subject: [PATCH 7/9] userfaultfd: add UFFDIO_CONTINUE ioctl

This ioctl is how userspace ought to resolve "minor" userfaults. The
idea is, userspace is notified that a minor fault has occurred. It might
change the contents of the page using its second non-UFFD mapping, or
not. Then, it calls UFFDIO_CONTINUE to tell the kernel "I have ensured
the page contents are correct, carry on setting up the mapping".

Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs. ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.

It turns out hugetlb_mcopy_atomic_pte() already does something very close to
what we want, if an existing page is provided via `struct page **pagep`. We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case. (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)
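
As with the other resolution ioctls, waking can be decoupled from resolving;
a sketch of the two-step pattern using the uapi added below (start and len
are assumed to describe a registered range):

    /* Resolve without waking... */
    struct uffdio_continue cont = {
        .range = { .start = start, .len = len },
        .mode = UFFDIO_CONTINUE_MODE_DONTWAKE,
    };
    ioctl(uffd, UFFDIO_CONTINUE, &cont);

    /* ...then, after resolving everything, wake the range explicitly. */
    struct uffdio_range wake = { .start = start, .len = len };
    ioctl(uffd, UFFDIO_WAKE, &wake);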

Signed-off-by: Axel Rasmussen <[email protected]>
---
fs/userfaultfd.c | 67 +++++++++++++++++++++++++
include/linux/userfaultfd_k.h | 2 +
include/uapi/linux/userfaultfd.h | 21 +++++++-
mm/hugetlb.c | 11 ++--
mm/userfaultfd.c | 86 ++++++++++++++++++++++++--------
5 files changed, 158 insertions(+), 29 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 19d6925be03f..f0eb2d63289f 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1530,6 +1530,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_WP))
ioctls_out &= ~((__u64)1 << _UFFDIO_WRITEPROTECT);

+ /* CONTINUE ioctl is only supported for MINOR ranges. */
+ if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR))
+ ioctls_out &= ~((__u64)1 << _UFFDIO_CONTINUE);
+
/*
* Now that we scanned all vmas we can already tell
* userland which ioctls methods are guaranteed to
@@ -1883,6 +1887,66 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
return ret;
}

+static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
+{
+ __s64 ret;
+ struct uffdio_continue uffdio_continue;
+ struct uffdio_continue __user *user_uffdio_continue;
+ struct userfaultfd_wake_range range;
+
+ user_uffdio_continue = (struct uffdio_continue __user *)arg;
+
+ ret = -EAGAIN;
+ if (READ_ONCE(ctx->mmap_changing))
+ goto out;
+
+ ret = -EFAULT;
+ if (copy_from_user(&uffdio_continue, user_uffdio_continue,
+ /* don't copy the output fields */
+ sizeof(uffdio_continue) - (sizeof(__s64))))
+ goto out;
+
+ ret = validate_range(ctx->mm, &uffdio_continue.range.start,
+ uffdio_continue.range.len);
+ if (ret)
+ goto out;
+
+ ret = -EINVAL;
+ /* double check for wraparound just in case. */
+ if (uffdio_continue.range.start + uffdio_continue.range.len <=
+ uffdio_continue.range.start) {
+ goto out;
+ }
+ if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
+ goto out;
+
+ if (mmget_not_zero(ctx->mm)) {
+ ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
+ uffdio_continue.range.len,
+ &ctx->mmap_changing);
+ mmput(ctx->mm);
+ } else {
+ return -ESRCH;
+ }
+
+ if (unlikely(put_user(ret, &user_uffdio_continue->mapped)))
+ return -EFAULT;
+ if (ret < 0)
+ goto out;
+
+ /* len == 0 would wake all */
+ BUG_ON(!ret);
+ range.len = ret;
+ if (!(uffdio_continue.mode & UFFDIO_CONTINUE_MODE_DONTWAKE)) {
+ range.start = uffdio_continue.range.start;
+ wake_userfault(ctx, &range);
+ }
+ ret = range.len == uffdio_continue.range.len ? 0 : -EAGAIN;
+
+out:
+ return ret;
+}
+
static inline unsigned int uffd_ctx_features(__u64 user_features)
{
/*
@@ -1967,6 +2031,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
case UFFDIO_WRITEPROTECT:
ret = userfaultfd_writeprotect(ctx, arg);
break;
+ case UFFDIO_CONTINUE:
+ ret = userfaultfd_continue(ctx, arg);
+ break;
}
return ret;
}
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ed157804ca02..49d7e7b4581f 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -41,6 +41,8 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long len,
bool *mmap_changing);
+extern ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long dst_start,
+ unsigned long len, bool *mmap_changing);
extern int mwriteprotect_range(struct mm_struct *dst_mm,
unsigned long start, unsigned long len,
bool enable_wp, bool *mmap_changing);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 1cc2cd8a5279..9a48305f4bdd 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -40,10 +40,12 @@
((__u64)1 << _UFFDIO_WAKE | \
(__u64)1 << _UFFDIO_COPY | \
(__u64)1 << _UFFDIO_ZEROPAGE | \
- (__u64)1 << _UFFDIO_WRITEPROTECT)
+ (__u64)1 << _UFFDIO_WRITEPROTECT | \
+ (__u64)1 << _UFFDIO_CONTINUE)
#define UFFD_API_RANGE_IOCTLS_BASIC \
((__u64)1 << _UFFDIO_WAKE | \
- (__u64)1 << _UFFDIO_COPY)
+ (__u64)1 << _UFFDIO_COPY | \
+ (__u64)1 << _UFFDIO_CONTINUE)

/*
* Valid ioctl command number range with this API is from 0x00 to
@@ -59,6 +61,7 @@
#define _UFFDIO_COPY (0x03)
#define _UFFDIO_ZEROPAGE (0x04)
#define _UFFDIO_WRITEPROTECT (0x06)
+#define _UFFDIO_CONTINUE (0x07)
#define _UFFDIO_API (0x3F)

/* userfaultfd ioctl ids */
@@ -77,6 +80,8 @@
struct uffdio_zeropage)
#define UFFDIO_WRITEPROTECT _IOWR(UFFDIO, _UFFDIO_WRITEPROTECT, \
struct uffdio_writeprotect)
+#define UFFDIO_CONTINUE _IOR(UFFDIO, _UFFDIO_CONTINUE, \
+ struct uffdio_continue)

/* read() structure */
struct uffd_msg {
@@ -268,6 +273,18 @@ struct uffdio_writeprotect {
__u64 mode;
};

+struct uffdio_continue {
+ struct uffdio_range range;
+#define UFFDIO_CONTINUE_MODE_DONTWAKE ((__u64)1<<0)
+ __u64 mode;
+
+ /*
+ * Fields below here are written by the ioctl and must be at the end:
+ * the copy_from_user will not read past here.
+ */
+ __s64 mapped;
+};
+
/*
* Flags for the userfaultfd(2) system call itself.
*/
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2b3741d6130c..84392d5fa079 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4666,12 +4666,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
spinlock_t *ptl;
int ret;
struct page *page;
+ bool new_page = false;

if (!*pagep) {
ret = -ENOMEM;
page = alloc_huge_page(dst_vma, dst_addr, 0);
if (IS_ERR(page))
goto out;
+ new_page = true;

ret = copy_huge_page_from_user(page,
(const void __user *) src_addr,
@@ -4699,10 +4701,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
mapping = dst_vma->vm_file->f_mapping;
idx = vma_hugecache_offset(h, dst_vma, dst_addr);

- /*
- * If shared, add to page cache
- */
- if (vm_shared) {
+ /* Add shared, newly allocated pages to the page cache. */
+ if (vm_shared && new_page) {
size = i_size_read(mapping->host) >> huge_page_shift(h);
ret = -EFAULT;
if (idx >= size)
@@ -4762,7 +4762,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
update_mmu_cache(dst_vma, dst_addr, dst_pte);

spin_unlock(ptl);
- set_page_huge_active(page);
+ if (new_page)
+ set_page_huge_active(page);
if (vm_shared)
unlock_page(page);
ret = 0;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index b2ce61c1b50d..0ecc50525dd4 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -197,6 +197,16 @@ static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
return pmd_alloc(mm, pud, address);
}

+/* The mode of operation for __mcopy_atomic and its helpers. */
+enum mcopy_atomic_mode {
+ /* A normal copy_from_user into the destination range. */
+ MCOPY_ATOMIC_NORMAL,
+ /* Don't copy; map the destination range to the zero page. */
+ MCOPY_ATOMIC_ZEROPAGE,
+ /* Just setup the dst_vma, without modifying the underlying page(s). */
+ MCOPY_ATOMIC_CONTINUE,
+};
+
#ifdef CONFIG_HUGETLB_PAGE
/*
* __mcopy_atomic processing for HUGETLB vmas. Note that this routine is
@@ -207,7 +217,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- bool zeropage)
+ enum mcopy_atomic_mode mode)
{
int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED;
int vm_shared = dst_vma->vm_flags & VM_SHARED;
@@ -227,7 +237,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
* by THP. Since we can not reliably insert a zero page, this
* feature is not supported.
*/
- if (zeropage) {
+ if (mode == MCOPY_ATOMIC_ZEROPAGE) {
mmap_read_unlock(dst_mm);
return -EINVAL;
}
@@ -273,8 +283,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
}

while (src_addr < src_start + len) {
- pte_t dst_pteval;
-
BUG_ON(dst_addr >= dst_start + len);

/*
@@ -297,12 +305,22 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
goto out_unlock;
}

- err = -EEXIST;
- dst_pteval = huge_ptep_get(dst_pte);
- if (!huge_pte_none(dst_pteval)) {
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- i_mmap_unlock_read(mapping);
- goto out_unlock;
+ if (mode == MCOPY_ATOMIC_CONTINUE) {
+ /* hugetlb_mcopy_atomic_pte unlocks the page. */
+ page = find_lock_page(mapping, idx);
+ if (!page) {
+ err = -EFAULT;
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+ goto out_unlock;
+ }
+ } else {
+ if (!huge_pte_none(huge_ptep_get(dst_pte))) {
+ err = -EEXIST;
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+ goto out_unlock;
+ }
}

err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
@@ -408,7 +426,7 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- bool zeropage);
+ enum mcopy_atomic_mode mode);
#endif /* CONFIG_HUGETLB_PAGE */

static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
@@ -417,7 +435,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
unsigned long dst_addr,
unsigned long src_addr,
struct page **page,
- bool zeropage,
+ enum mcopy_atomic_mode mode,
bool wp_copy)
{
ssize_t err;
@@ -433,22 +451,38 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
* and not in the radix tree.
*/
if (!(dst_vma->vm_flags & VM_SHARED)) {
- if (!zeropage)
+ switch (mode) {
+ case MCOPY_ATOMIC_NORMAL:
err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
dst_addr, src_addr, page,
wp_copy);
- else
+ break;
+ case MCOPY_ATOMIC_ZEROPAGE:
err = mfill_zeropage_pte(dst_mm, dst_pmd,
dst_vma, dst_addr);
+ break;
+ /* It only makes sense to CONTINUE for shared memory. */
+ case MCOPY_ATOMIC_CONTINUE:
+ err = -EINVAL;
+ break;
+ }
} else {
VM_WARN_ON_ONCE(wp_copy);
- if (!zeropage)
+ switch (mode) {
+ case MCOPY_ATOMIC_NORMAL:
err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
dst_vma, dst_addr,
src_addr, page);
- else
+ break;
+ case MCOPY_ATOMIC_ZEROPAGE:
err = shmem_mfill_zeropage_pte(dst_mm, dst_pmd,
dst_vma, dst_addr);
+ break;
+ case MCOPY_ATOMIC_CONTINUE:
+ /* FIXME: Add minor fault interception for shmem. */
+ err = -EINVAL;
+ break;
+ }
}

return err;
@@ -458,7 +492,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- bool zeropage,
+ enum mcopy_atomic_mode mcopy_mode,
bool *mmap_changing,
__u64 mode)
{
@@ -527,7 +561,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
*/
if (is_vm_hugetlb_page(dst_vma))
return __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
- src_start, len, zeropage);
+ src_start, len, mcopy_mode);

if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
goto out_unlock;
@@ -577,7 +611,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
BUG_ON(pmd_trans_huge(*dst_pmd));

err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
- src_addr, &page, zeropage, wp_copy);
+ src_addr, &page, mcopy_mode, wp_copy);
cond_resched();

if (unlikely(err == -ENOENT)) {
@@ -626,14 +660,22 @@ ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
unsigned long src_start, unsigned long len,
bool *mmap_changing, __u64 mode)
{
- return __mcopy_atomic(dst_mm, dst_start, src_start, len, false,
- mmap_changing, mode);
+ return __mcopy_atomic(dst_mm, dst_start, src_start, len,
+ MCOPY_ATOMIC_NORMAL, mmap_changing, mode);
}

ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
unsigned long len, bool *mmap_changing)
{
- return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
+ return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
+ mmap_changing, 0);
+}
+
+ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
+ unsigned long len, bool *mmap_changing)
+{
+ return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE,
+ mmap_changing, 0);
}

int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
--
2.30.0.284.gd98b1dd5eaa7-goog

2021-01-15 19:08:00

by Axel Rasmussen

Subject: [PATCH 4/9] hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp

From: Peter Xu <[email protected]>

Huge pmd sharing for hugetlbfs is racy with userfaultfd-wp, because
userfaultfd-wp tracks its state in individual page table entries, which
therefore cannot be shared between address ranges.

Walk the hugetlb range and unshare all such mappings, if any exist, right
before UFFDIO_REGISTER succeeds and returns to userspace.

This pairs with want_pmd_share() in the hugetlb code, so that huge pmd sharing
is completely disabled for any userfaultfd-wp registered range.

Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Axel Rasmussen <[email protected]>
---
fs/userfaultfd.c | 43 ++++++++++++++++++++++++++++++++++++
include/linux/mmu_notifier.h | 1 +
2 files changed, 44 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 894cc28142e7..dfe5f7a21883 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -15,6 +15,7 @@
#include <linux/sched/signal.h>
#include <linux/sched/mm.h>
#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
#include <linux/poll.h>
#include <linux/slab.h>
#include <linux/seq_file.h>
@@ -1190,6 +1191,45 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
}
}

+/*
+ * This function will unconditionally remove all the shared pmd pgtable entries
+ * within the specific vma for a hugetlbfs memory range.
+ */
+static void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
+{
+ struct hstate *h = hstate_vma(vma);
+ unsigned long sz = huge_page_size(h);
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_notifier_range range;
+ unsigned long address;
+ spinlock_t *ptl;
+ pte_t *ptep;
+
+ /*
+ * No need to call adjust_range_if_pmd_sharing_possible(), because
+ * we're going to operate on the whole vma
+ */
+ mmu_notifier_range_init(&range, MMU_NOTIFY_HUGETLB_UNSHARE,
+ 0, vma, mm, vma->vm_start, vma->vm_end);
+ mmu_notifier_invalidate_range_start(&range);
+ i_mmap_lock_write(vma->vm_file->f_mapping);
+ for (address = vma->vm_start; address < vma->vm_end; address += sz) {
+ ptep = huge_pte_offset(mm, address, sz);
+ if (!ptep)
+ continue;
+ ptl = huge_pte_lock(h, mm, ptep);
+ huge_pmd_unshare(mm, vma, &address, ptep);
+ spin_unlock(ptl);
+ }
+ flush_hugetlb_tlb_range(vma, vma->vm_start, vma->vm_end);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
+ /*
+ * No need to call mmu_notifier_invalidate_range(), see
+ * Documentation/vm/mmu_notifier.rst.
+ */
+ mmu_notifier_invalidate_range_end(&range);
+}
+
static void __wake_userfault(struct userfaultfd_ctx *ctx,
struct userfaultfd_wake_range *range)
{
@@ -1448,6 +1488,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
vma->vm_flags = new_flags;
vma->vm_userfaultfd_ctx.ctx = ctx;

+ if (is_vm_hugetlb_page(vma) && uffd_disable_huge_pmd_share(vma))
+ hugetlb_unshare_all_pmds(vma);
+
skip:
prev = vma;
start = vma->vm_end;
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b8200782dede..ff50c8528113 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -51,6 +51,7 @@ enum mmu_notifier_event {
MMU_NOTIFY_SOFT_DIRTY,
MMU_NOTIFY_RELEASE,
MMU_NOTIFY_MIGRATE,
+ MMU_NOTIFY_HUGETLB_UNSHARE,
};

#define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
--
2.30.0.284.gd98b1dd5eaa7-goog

2021-01-15 19:08:15

by Axel Rasmussen

Subject: [PATCH 8/9] userfaultfd: update documentation to describe minor fault handling

Reword / reorganize things a little bit into "lists", so new features /
modes / ioctls can sort of just be appended.

Describe how UFFDIO_REGISTER_MODE_MINOR and UFFDIO_CONTINUE can be used
to intercept and resolve minor faults. Make it clear that COPY and
ZEROPAGE are used for MISSING faults, whereas CONTINUE is used for MINOR
faults.

Signed-off-by: Axel Rasmussen <[email protected]>
---
Documentation/admin-guide/mm/userfaultfd.rst | 105 +++++++++++--------
1 file changed, 64 insertions(+), 41 deletions(-)

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 65eefa66c0ba..67f2c68e65a2 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -63,36 +63,36 @@ the generic ioctl available.

The ``uffdio_api.features`` bitmask returned by the ``UFFDIO_API`` ioctl
defines what memory types are supported by the ``userfaultfd`` and what
-events, except page fault notifications, may be generated.
-
-If the kernel supports registering ``userfaultfd`` ranges on hugetlbfs
-virtual memory areas, ``UFFD_FEATURE_MISSING_HUGETLBFS`` will be set in
-``uffdio_api.features``. Similarly, ``UFFD_FEATURE_MISSING_SHMEM`` will be
-set if the kernel supports registering ``userfaultfd`` ranges on shared
-memory (covering all shmem APIs, i.e. tmpfs, ``IPCSHM``, ``/dev/zero``,
-``MAP_SHARED``, ``memfd_create``, etc).
-
-The userland application that wants to use ``userfaultfd`` with hugetlbfs
-or shared memory need to set the corresponding flag in
-``uffdio_api.features`` to enable those features.
-
-If the userland desires to receive notifications for events other than
-page faults, it has to verify that ``uffdio_api.features`` has appropriate
-``UFFD_FEATURE_EVENT_*`` bits set. These events are described in more
-detail below in `Non-cooperative userfaultfd`_ section.
-
-Once the ``userfaultfd`` has been enabled the ``UFFDIO_REGISTER`` ioctl should
-be invoked (if present in the returned ``uffdio_api.ioctls`` bitmask) to
-register a memory range in the ``userfaultfd`` by setting the
+events, except page fault notifications, may be generated:
+
+- The ``UFFD_FEATURE_EVENT_*`` flags indicate that various events
+ other than page faults are supported. These events are described in more
+ detail below in the `Non-cooperative userfaultfd`_ section.
+
+- ``UFFD_FEATURE_MISSING_HUGETLBFS`` and ``UFFD_FEATURE_MISSING_SHMEM``
+ indicate that the kernel supports ``UFFDIO_REGISTER_MODE_MISSING``
+ registrations for hugetlbfs and shared memory (covering all shmem APIs,
+ i.e. tmpfs, ``IPCSHM``, ``/dev/zero``, ``MAP_SHARED``, ``memfd_create``,
+ etc) virtual memory areas, respectively.
+
+- ``UFFD_FEATURE_MINOR_FAULT_HUGETLBFS`` indicates that the kernel
+ supports ``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs
+ virtual memory areas.
+
+The userland application should set the feature flags it intends to use
+when invoking the ``UFFDIO_API`` ioctl, to request that those features be
+enabled if supported.
+
+Once the ``userfaultfd`` API has been enabled the ``UFFDIO_REGISTER``
+ioctl should be invoked (if present in the returned ``uffdio_api.ioctls``
+bitmask) to register a memory range in the ``userfaultfd`` by setting the
uffdio_register structure accordingly. The ``uffdio_register.mode``
bitmask will specify to the kernel which kind of faults to track for
-the range (``UFFDIO_REGISTER_MODE_MISSING`` would track missing
-pages). The ``UFFDIO_REGISTER`` ioctl will return the
+the range. The ``UFFDIO_REGISTER`` ioctl will return the
``uffdio_register.ioctls`` bitmask of ioctls that are suitable to resolve
userfaults on the range registered. Not all ioctls will necessarily be
-supported for all memory types depending on the underlying virtual
-memory backend (anonymous memory vs tmpfs vs real filebacked
-mappings).
+supported for all memory types (e.g. anonymous memory vs. shmem vs.
+hugetlbfs), or all types of intercepted faults.

Userland can use the ``uffdio_register.ioctls`` to manage the virtual
address space in the background (to add or potentially also remove
@@ -100,21 +100,44 @@ memory from the ``userfaultfd`` registered range). This means a userfault
could be triggering just before userland maps in the background the
user-faulted page.

-The primary ioctl to resolve userfaults is ``UFFDIO_COPY``. That
-atomically copies a page into the userfault registered range and wakes
-up the blocked userfaults
-(unless ``uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE`` is set).
-Other ioctl works similarly to ``UFFDIO_COPY``. They're atomic as in
-guaranteeing that nothing can see an half copied page since it'll
-keep userfaulting until the copy has finished.
+Resolving Userfaults
+--------------------
+
+There are three basic ways to resolve userfaults:
+
+- ``UFFDIO_COPY`` atomically copies some existing page contents from
+ userspace.
+
+- ``UFFDIO_ZEROPAGE`` atomically zeros the new page.
+
+- ``UFFDIO_CONTINUE`` maps an existing, previously-populated page.
+
+These operations are atomic in the sense that they guarantee nothing can
+see a half-populated page, since readers will keep userfaulting until the
+operation has finished.
+
+By default, these wake up userfaults blocked on the range in question.
+They support a ``UFFDIO_*_MODE_DONTWAKE`` ``mode`` flag, which indicates
+that waking will be done separately at some later time.
+
+Which of these are used depends on the kind of fault:
+
+- For ``UFFDIO_REGISTER_MODE_MISSING`` faults, a new page has to be
+ provided. This can be done with either ``UFFDIO_COPY`` or
+ ``UFFDIO_ZEROPAGE``. The default (non-userfaultfd) behavior would be to
+ provide a zero page, but in userfaultfd this is left up to userspace.
+
+- For ``UFFDIO_REGISTER_MODE_MINOR`` faults, an existing page already
+ exists. Userspace needs to ensure its contents are correct (if it needs
+ to be modified, by writing directly to the non-userfaultfd-registered
+ side of shared memory), and then issue ``UFFDIO_CONTINUE`` to resolve
+ the fault.

Notes:

-- If you requested ``UFFDIO_REGISTER_MODE_MISSING`` when registering then
- you must provide some kind of page in your thread after reading from
- the uffd. You must provide either ``UFFDIO_COPY`` or ``UFFDIO_ZEROPAGE``.
- The normal behavior of the OS automatically providing a zero page on
- an anonymous mmaping is not in place.
+- You can tell which kind of fault occurred by examining
+ ``pagefault.flags`` within the ``uffd_msg``, checking for the
+ ``UFFD_PAGEFAULT_FLAG_*`` flags.

- None of the page-delivering ioctls default to the range that you
registered with. You must fill in all fields for the appropriate
@@ -122,9 +145,9 @@ Notes:

- You get the address of the access that triggered the missing page
event out of a struct uffd_msg that you read in the thread from the
- uffd. You can supply as many pages as you want with ``UFFDIO_COPY`` or
- ``UFFDIO_ZEROPAGE``. Keep in mind that unless you used DONTWAKE then
- the first of any of those IOCTLs wakes up the faulting thread.
+ uffd. You can supply as many pages as you want with these IOCTLs.
+ Keep in mind that unless you used DONTWAKE then the first of any of
+ those IOCTLs wakes up the faulting thread.

- Be sure to test for all errors including
(``pollfd[0].revents & POLLERR``). This can happen, e.g. when ranges
--
2.30.0.284.gd98b1dd5eaa7-goog

2021-01-15 19:08:41

by Axel Rasmussen

Subject: [PATCH 2/9] hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled

From: Peter Xu <[email protected]>

Huge pmd sharing can cause problems for userfaultfd. Userfaultfd runs its
logic based on the special bits in page table entries, while huge pmd sharing
can share page table entries across different address ranges. That can cause
issues in either of these cases:

- When sharing huge pmd page tables for an uffd write protected range, the
newly mapped huge pmd range will also be write protected unexpectedly, or,

- When we try to write protect a huge-pmd-shared range, we'll first do
huge_pmd_unshare() in hugetlb_change_protection(); however, that also means
the UFFDIO_WRITEPROTECT could be silently skipped for the shared region,
which could lead to data loss.

While at it, a few other things are done as well:

- Move want_pmd_share() from mm/hugetlb.c into linux/hugetlb.h, because
that's definitely something that arch code would like to use too

- ARM64 currently checks directly against CONFIG_ARCH_WANT_HUGE_PMD_SHARE
when trying to share huge pmds. Switch to the want_pmd_share() helper.

Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Axel Rasmussen <[email protected]>
---
arch/arm64/mm/hugetlbpage.c | 3 +--
include/linux/hugetlb.h | 12 ++++++++++++
include/linux/userfaultfd_k.h | 9 +++++++++
mm/hugetlb.c | 5 ++---
4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 5b32ec888698..1a8ce0facfe8 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -284,8 +284,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
*/
ptep = pte_alloc_map(mm, pmdp, addr);
} else if (sz == PMD_SIZE) {
- if (IS_ENABLED(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) &&
- pud_none(READ_ONCE(*pudp)))
+ if (want_pmd_share(vma) && pud_none(READ_ONCE(*pudp)))
ptep = huge_pmd_share(mm, addr, pudp);
else
ptep = (pte_t *)pmd_alloc(mm, pudp, addr);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1e0abb609976..4959e94e78b1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -11,6 +11,7 @@
#include <linux/kref.h>
#include <linux/pgtable.h>
#include <linux/gfp.h>
+#include <linux/userfaultfd_k.h>

struct ctl_table;
struct user_struct;
@@ -947,4 +948,15 @@ static inline __init void hugetlb_cma_check(void)
}
#endif

+static inline bool want_pmd_share(struct vm_area_struct *vma)
+{
+#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
+ if (uffd_disable_huge_pmd_share(vma))
+ return false;
+ return true;
+#else
+ return false;
+#endif
+}
+
#endif /* _LINUX_HUGETLB_H */
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index a8e5f3ea9bb2..c63ccdae3eab 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -52,6 +52,15 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
return vma->vm_userfaultfd_ctx.ctx == vm_ctx.ctx;
}

+/*
+ * Never enable huge pmd sharing on uffd-wp registered vmas, because uffd-wp
+ * protect information is per pgtable entry.
+ */
+static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & VM_UFFD_WP;
+}
+
static inline bool userfaultfd_missing(struct vm_area_struct *vma)
{
return vma->vm_flags & VM_UFFD_MISSING;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 16a8d5ac68c0..1ad91d94cbe2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5371,7 +5371,7 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
*addr = ALIGN(*addr, HPAGE_SIZE * PTRS_PER_PTE) - HPAGE_SIZE;
return 1;
}
-#define want_pmd_share() (1)
+
#else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
{
@@ -5388,7 +5388,6 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
unsigned long *start, unsigned long *end)
{
}
-#define want_pmd_share() (0)
#endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */

#ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
@@ -5410,7 +5409,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
pte = (pte_t *)pud;
} else {
BUG_ON(sz != PMD_SIZE);
- if (want_pmd_share() && pud_none(*pud))
+ if (want_pmd_share(vma) && pud_none(*pud))
pte = huge_pmd_share(mm, addr, pud);
else
pte = (pte_t *)pmd_alloc(mm, pud, addr);
--
2.30.0.284.gd98b1dd5eaa7-goog

2021-01-15 19:08:48

by Axel Rasmussen

Subject: [PATCH 6/9] userfaultfd: disable huge PMD sharing for MINOR registered VMAs

As the comment says: for the MINOR fault use case, although the page
might be present and populated in the other (non-UFFD-registered) half
of the shared mapping, it may be out of date, and we explicitly want
userspace to get a minor fault so it can check and potentially update
the page's contents.

Huge PMD sharing would prevent these faults from occurring for
suitably aligned areas, so disable it upon UFFD registration.

Signed-off-by: Axel Rasmussen <[email protected]>
---
include/linux/userfaultfd_k.h | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 7aa1461e1a8b..ed157804ca02 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -53,12 +53,18 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
}

/*
- * Never enable huge pmd sharing on uffd-wp registered vmas, because uffd-wp
- * protect information is per pgtable entry.
+ * Never enable huge pmd sharing on some uffd registered vmas:
+ *
+ * - VM_UFFD_WP VMAs, because write protect information is per pgtable entry.
+ *
+ * - VM_UFFD_MINOR VMAs, because we explicitly want minor faults to occur even
+ * when the other half of a shared mapping is populated (even though the page
+ * is there, in our use case it may be out of date, so userspace needs to
+ * check for this and possibly update it).
*/
static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
{
- return vma->vm_flags & VM_UFFD_WP;
+ return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
}

static inline bool userfaultfd_missing(struct vm_area_struct *vma)
--
2.30.0.284.gd98b1dd5eaa7-goog

2021-01-15 19:09:16

by Axel Rasmussen

Subject: [PATCH 5/9] userfaultfd: add minor fault registration mode

This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example: when working with hugetlbfs, huge_pte_none() is true for the
faulting address, but find_lock_page() nevertheless finds an existing page in
the page cache.

This commit adds the new registration mode, and sets the relevant flag
on the VMAs being registered. In the hugetlb fault path, if we find
that we have huge_pte_none(), but find_lock_page() does indeed find an
existing page, then we have a "minor" fault, and if the VMA has the
userfaultfd registration flag, we call into userfaultfd to handle it.

Why add a new registration mode, as opposed to adding a feature to
MISSING registration, like UFFD_FEATURE_SIGBUS?

- The semantics are significantly different. UFFDIO_COPY or
UFFDIO_ZEROPAGE do not make sense for these minor faults; userspace
would instead just memset() or memcpy() or whatever via the non-UFFD
mapping. Unlike MISSING registration, MINOR registration only makes
sense for shared memory (hugetlbfs or shmem [to be supported in future
commits]).

- Doing so would make handle_userfault()'s "reason" argument confusing.
We'd pass in "MISSING" even if the pages weren't really missing.

Signed-off-by: Axel Rasmussen <[email protected]>
---
fs/proc/task_mmu.c | 1 +
fs/userfaultfd.c | 80 +++++++++++++++++++-------------
include/linux/mm.h | 1 +
include/linux/userfaultfd_k.h | 12 ++++-
include/trace/events/mmflags.h | 1 +
include/uapi/linux/userfaultfd.h | 15 +++++-
mm/hugetlb.c | 31 +++++++++++++
7 files changed, 107 insertions(+), 34 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee5a235b3056..108faf719a83 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
[ilog2(VM_MTE)] = "mt",
[ilog2(VM_MTE_ALLOWED)] = "",
#endif
+ [ilog2(VM_UFFD_MINOR)] = "ui",
#ifdef CONFIG_ARCH_HAS_PKEYS
/* These come out via ProtectionKey: */
[ilog2(VM_PKEY_BIT0)] = "",
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index dfe5f7a21883..19d6925be03f 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -29,6 +29,8 @@
#include <linux/security.h>
#include <linux/hugetlb.h>

+#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
+
int sysctl_unprivileged_userfaultfd __read_mostly;

static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
@@ -197,24 +199,21 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
msg_init(&msg);
msg.event = UFFD_EVENT_PAGEFAULT;
msg.arg.pagefault.address = address;
+ /*
+ * These flags indicate why the userfault occurred:
+ * - UFFD_PAGEFAULT_FLAG_WP indicates a write protect fault.
+ * - UFFD_PAGEFAULT_FLAG_MINOR indicates a minor fault.
+ * - Neither of these flags being set indicates a MISSING fault.
+ *
+ * Separately, UFFD_PAGEFAULT_FLAG_WRITE indicates it was a write
+ * fault. Otherwise, it was a read fault.
+ */
if (flags & FAULT_FLAG_WRITE)
- /*
- * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
- * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WRITE
- * was not set in a UFFD_EVENT_PAGEFAULT, it means it
- * was a read fault, otherwise if set it means it's
- * a write fault.
- */
msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WRITE;
if (reason & VM_UFFD_WP)
- /*
- * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
- * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WP was
- * not set in a UFFD_EVENT_PAGEFAULT, it means it was
- * a missing fault, otherwise if set it means it's a
- * write protect fault.
- */
msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WP;
+ if (reason & VM_UFFD_MINOR)
+ msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_MINOR;
if (features & UFFD_FEATURE_THREAD_ID)
msg.arg.pagefault.feat.ptid = task_pid_vnr(current);
return msg;
@@ -401,8 +400,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)

BUG_ON(ctx->mm != mm);

- VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
- VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
+ /* Any unrecognized flag is a bug. */
+ VM_BUG_ON(reason & ~__VM_UFFD_FLAGS);
+ /* 0 or > 1 flags set is a bug; we expect exactly 1. */
+ VM_BUG_ON(!reason || !!(reason & (reason - 1)));

if (ctx->features & UFFD_FEATURE_SIGBUS)
goto out;
@@ -612,7 +613,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
- vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+ vma->vm_flags &= ~__VM_UFFD_FLAGS;
}
mmap_write_unlock(mm);

@@ -644,7 +645,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
octx = vma->vm_userfaultfd_ctx.ctx;
if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
- vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+ vma->vm_flags &= ~__VM_UFFD_FLAGS;
return 0;
}

@@ -726,7 +727,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
} else {
/* Drop uffd context if remap feature not enabled */
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
- vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+ vma->vm_flags &= ~__VM_UFFD_FLAGS;
}
}

@@ -867,12 +868,12 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
for (vma = mm->mmap; vma; vma = vma->vm_next) {
cond_resched();
BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
- !!(vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+ !!(vma->vm_flags & __VM_UFFD_FLAGS));
if (vma->vm_userfaultfd_ctx.ctx != ctx) {
prev = vma;
continue;
}
- new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
+ new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
new_flags, vma->anon_vma,
vma->vm_file, vma->vm_pgoff,
@@ -1300,9 +1301,26 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
unsigned long vm_flags)
{
/* FIXME: add WP support to hugetlbfs and shmem */
- return vma_is_anonymous(vma) ||
- ((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
- !(vm_flags & VM_UFFD_WP));
+ if (vm_flags & VM_UFFD_WP) {
+ if (is_vm_hugetlb_page(vma) || vma_is_shmem(vma))
+ return false;
+ }
+
+ if (vm_flags & VM_UFFD_MINOR) {
+ /*
+ * The use case for minor registration (intercepting minor
+ * faults) is to handle the case where a page is present, but
+ * needs to be modified before it can be used. This requires
+ * two mappings: one with UFFD registration, and one without.
+ * So, it only makes sense to do this with shared memory.
+ */
+ /* FIXME: Add minor fault interception for shmem. */
+ if (!(is_vm_hugetlb_page(vma) && (vma->vm_flags & VM_SHARED)))
+ return false;
+ }
+
+ return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+ vma_is_shmem(vma);
}

static int userfaultfd_register(struct userfaultfd_ctx *ctx,
@@ -1328,14 +1346,15 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
ret = -EINVAL;
if (!uffdio_register.mode)
goto out;
- if (uffdio_register.mode & ~(UFFDIO_REGISTER_MODE_MISSING|
- UFFDIO_REGISTER_MODE_WP))
+ if (uffdio_register.mode & ~UFFD_API_REGISTER_MODES)
goto out;
vm_flags = 0;
if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
vm_flags |= VM_UFFD_MISSING;
if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)
vm_flags |= VM_UFFD_WP;
+ if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR)
+ vm_flags |= VM_UFFD_MINOR;

ret = validate_range(mm, &uffdio_register.range.start,
uffdio_register.range.len);
@@ -1379,7 +1398,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
cond_resched();

BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
- !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+ !!(cur->vm_flags & __VM_UFFD_FLAGS));

/* check not compatible vmas */
ret = -EINVAL;
@@ -1459,8 +1478,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
start = vma->vm_start;
vma_end = min(end, vma->vm_end);

- new_flags = (vma->vm_flags &
- ~(VM_UFFD_MISSING|VM_UFFD_WP)) | vm_flags;
+ new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
prev = vma_merge(mm, prev, start, vma_end, new_flags,
vma->anon_vma, vma->vm_file, vma->vm_pgoff,
vma_policy(vma),
@@ -1582,7 +1600,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
cond_resched();

BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
- !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+ !!(cur->vm_flags & __VM_UFFD_FLAGS));

/*
* Check not compatible vmas, not strictly required
@@ -1633,7 +1651,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
wake_userfault(vma->vm_userfaultfd_ctx.ctx, &range);
}

- new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
+ new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
prev = vma_merge(mm, prev, start, vma_end, new_flags,
vma->anon_vma, vma->vm_file, vma->vm_pgoff,
vma_policy(vma),
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..1d7041bd3148 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -276,6 +276,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
#define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
#define VM_UFFD_WP 0x00001000 /* wrprotect pages tracking */
+#define VM_UFFD_MINOR 0x00002000 /* minor fault interception */

#define VM_LOCKED 0x00002000
#define VM_IO 0x00004000 /* Memory mapped I/O or similar */
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index c63ccdae3eab..7aa1461e1a8b 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -71,6 +71,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
return vma->vm_flags & VM_UFFD_WP;
}

+static inline bool userfaultfd_minor(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & VM_UFFD_MINOR;
+}
+
static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
pte_t pte)
{
@@ -85,7 +90,7 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,

static inline bool userfaultfd_armed(struct vm_area_struct *vma)
{
- return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
+ return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR);
}

extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
@@ -132,6 +137,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
return false;
}

+static inline bool userfaultfd_minor(struct vm_area_struct *vma)
+{
+ return false;
+}
+
static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
pte_t pte)
{
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 67018d367b9f..2d583ffd4100 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -151,6 +151,7 @@ IF_HAVE_PG_ARCH_2(PG_arch_2, "arch_2" )
{VM_PFNMAP, "pfnmap" }, \
{VM_DENYWRITE, "denywrite" }, \
{VM_UFFD_WP, "uffd_wp" }, \
+ {VM_UFFD_MINOR, "uffd_minor" }, \
{VM_LOCKED, "locked" }, \
{VM_IO, "io" }, \
{VM_SEQ_READ, "seqread" }, \
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 5f2d88212f7c..1cc2cd8a5279 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -19,15 +19,19 @@
* means the userland is reading).
*/
#define UFFD_API ((__u64)0xAA)
+#define UFFD_API_REGISTER_MODES (UFFDIO_REGISTER_MODE_MISSING | \
+ UFFDIO_REGISTER_MODE_WP | \
+ UFFDIO_REGISTER_MODE_MINOR)
#define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \
UFFD_FEATURE_EVENT_FORK | \
UFFD_FEATURE_EVENT_REMAP | \
- UFFD_FEATURE_EVENT_REMOVE | \
+ UFFD_FEATURE_EVENT_REMOVE | \
UFFD_FEATURE_EVENT_UNMAP | \
UFFD_FEATURE_MISSING_HUGETLBFS | \
UFFD_FEATURE_MISSING_SHMEM | \
UFFD_FEATURE_SIGBUS | \
- UFFD_FEATURE_THREAD_ID)
+ UFFD_FEATURE_THREAD_ID | \
+ UFFD_FEATURE_MINOR_FAULT_HUGETLBFS)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -127,6 +131,7 @@ struct uffd_msg {
/* flags for UFFD_EVENT_PAGEFAULT */
#define UFFD_PAGEFAULT_FLAG_WRITE (1<<0) /* If this was a write fault */
#define UFFD_PAGEFAULT_FLAG_WP (1<<1) /* If reason is VM_UFFD_WP */
+#define UFFD_PAGEFAULT_FLAG_MINOR (1<<2) /* If reason is VM_UFFD_MINOR */

struct uffdio_api {
/* userland asks for an API number and the features to enable */
@@ -171,6 +176,10 @@ struct uffdio_api {
*
* UFFD_FEATURE_THREAD_ID pid of the page faulted task_struct will
* be returned, if feature is not requested 0 will be returned.
+ *
+ * UFFD_FEATURE_MINOR_FAULT_HUGETLBFS indicates that minor faults
+ * can be intercepted (via REGISTER_MODE_MINOR) for
+ * hugetlbfs-backed pages.
*/
#define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
#define UFFD_FEATURE_EVENT_FORK (1<<1)
@@ -181,6 +190,7 @@ struct uffdio_api {
#define UFFD_FEATURE_EVENT_UNMAP (1<<6)
#define UFFD_FEATURE_SIGBUS (1<<7)
#define UFFD_FEATURE_THREAD_ID (1<<8)
+#define UFFD_FEATURE_MINOR_FAULT_HUGETLBFS (1<<9)
__u64 features;

__u64 ioctls;
@@ -195,6 +205,7 @@ struct uffdio_register {
struct uffdio_range range;
#define UFFDIO_REGISTER_MODE_MISSING ((__u64)1<<0)
#define UFFDIO_REGISTER_MODE_WP ((__u64)1<<1)
+#define UFFDIO_REGISTER_MODE_MINOR ((__u64)1<<2)
__u64 mode;

/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 61d6346ed009..2b3741d6130c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4377,6 +4377,37 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}
}

+ /* Check for page in userfault range. */
+ if (!new_page && userfaultfd_minor(vma)) {
+ u32 hash;
+ struct vm_fault vmf = {
+ .vma = vma,
+ .address = haddr,
+ .flags = flags,
+ /*
+ * Hard to debug if it ends up being used by a callee
+ * that assumes something about the other uninitialized
+ * fields... same as in memory.c
+ */
+ };
+
+ unlock_page(page);
+
+ /*
+ * hugetlb_fault_mutex and i_mmap_rwsem must be dropped before
+ * handling userfault. Reacquire after handling fault to make
+ * calling code simpler.
+ */
+
+ hash = hugetlb_fault_mutex_hash(mapping, idx);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+ ret = handle_userfault(&vmf, VM_UFFD_MINOR);
+ i_mmap_lock_read(mapping);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
+ goto out;
+ }
+
/*
* If we are going to COW a private mapping later, we examine the
* pending reservations for this page now. This will ensure that
--
2.30.0.284.gd98b1dd5eaa7-goog

2021-01-15 19:09:17

by Axel Rasmussen

Subject: [PATCH 3/9] mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h

From: Peter Xu <[email protected]>

Prepare for it to be called outside of mm/hugetlb.c.

Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Axel Rasmussen <[email protected]>
---
include/linux/hugetlb.h | 8 ++++++++
mm/hugetlb.c | 8 --------
2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 4959e94e78b1..4c8e3447ae6a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -959,4 +959,12 @@ static inline bool want_pmd_share(struct vm_area_struct *vma)
#endif
}

+#ifndef __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
+/*
+ * ARCHes with special requirements for evicting HUGETLB backing TLB entries can
+ * implement this.
+ */
+#define flush_hugetlb_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
+#endif
+
#endif /* _LINUX_HUGETLB_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1ad91d94cbe2..61d6346ed009 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4924,14 +4924,6 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
return i ? i : err;
}

-#ifndef __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
-/*
- * ARCHes with special requirements for evicting HUGETLB backing TLB entries can
- * implement this.
- */
-#define flush_hugetlb_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
-#endif
-
unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot)
{
--
2.30.0.284.gd98b1dd5eaa7-goog

2021-01-15 19:09:58

by Axel Rasmussen

Subject: [PATCH 9/9] userfaultfd/selftests: add test exercising minor fault handling

Fix a dormant bug in userfaultfd_events_test(), where we did
`return faulting_process(0)` instead of `exit(faulting_process(0))`.
This caused the forked process to keep running, trying to execute any
further test cases after the events test in parallel with the "real"
process.

Add a simple test case which exercises minor faults. In short, it does
the following:

1. "Sets up" an area (area_dst) and a second shared mapping to the same
underlying pages (area_dst_alias).

2. Register one of these areas with userfaultfd, in minor fault mode.

3. Start a second thread to handle any minor faults.

4. Populate the underlying pages with the non-UFFD-registered side of
the mapping. Basically, memset() each page with some arbitrary
contents.

5. Then, using the UFFD-registered mapping, read all of the page
contents, asserting that the contents match expectations (we expect that
the minor fault handling thread has modified the page contents before
resolving the fault).

The minor fault handling thread, upon receiving an event, flips all the
bits (~) in that page, just to prove that it can modify it in some
arbitrary way. Then it issues a UFFDIO_CONTINUE ioctl, to setup the
mapping and resolve the fault. The reading thread should wake up and see
this modification.

Currently the minor fault test is only enabled in hugetlb_shared mode,
as this is the only configuration the kernel feature supports.

Signed-off-by: Axel Rasmussen <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 147 ++++++++++++++++++++++-
1 file changed, 143 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 92b8ec423201..73a72a3c4189 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -81,6 +81,8 @@ static volatile bool test_uffdio_copy_eexist = true;
static volatile bool test_uffdio_zeropage_eexist = true;
/* Whether to test uffd write-protection */
static bool test_uffdio_wp = false;
+/* Whether to test uffd minor faults */
+static bool test_uffdio_minor = false;

static bool map_shared;
static int huge_fd;
@@ -96,6 +98,7 @@ struct uffd_stats {
int cpu;
unsigned long missing_faults;
unsigned long wp_faults;
+ unsigned long minor_faults;
};

/* pthread_mutex_t starts at page offset 0 */
@@ -153,17 +156,19 @@ static void uffd_stats_reset(struct uffd_stats *uffd_stats,
uffd_stats[i].cpu = i;
uffd_stats[i].missing_faults = 0;
uffd_stats[i].wp_faults = 0;
+ uffd_stats[i].minor_faults = 0;
}
}

static void uffd_stats_report(struct uffd_stats *stats, int n_cpus)
{
int i;
- unsigned long long miss_total = 0, wp_total = 0;
+ unsigned long long miss_total = 0, wp_total = 0, minor_total = 0;

for (i = 0; i < n_cpus; i++) {
miss_total += stats[i].missing_faults;
wp_total += stats[i].wp_faults;
+ minor_total += stats[i].minor_faults;
}

printf("userfaults: %llu missing (", miss_total);
@@ -172,6 +177,9 @@ static void uffd_stats_report(struct uffd_stats *stats, int n_cpus)
printf("\b), %llu wp (", wp_total);
for (i = 0; i < n_cpus; i++)
printf("%lu+", stats[i].wp_faults);
+ printf("\b), %llu minor (", minor_total);
+ for (i = 0; i < n_cpus; i++)
+ printf("%lu+", stats[i].minor_faults);
printf("\b)\n");
}

@@ -328,7 +336,7 @@ static struct uffd_test_ops shmem_uffd_test_ops = {
};

static struct uffd_test_ops hugetlb_uffd_test_ops = {
- .expected_ioctls = UFFD_API_RANGE_IOCTLS_BASIC,
+ .expected_ioctls = UFFD_API_RANGE_IOCTLS_BASIC & ~(1 << _UFFDIO_CONTINUE),
.allocate_area = hugetlb_allocate_area,
.release_pages = hugetlb_release_pages,
.alias_mapping = hugetlb_alias_mapping,
@@ -362,6 +370,22 @@ static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
}
}

+static void continue_range(int ufd, __u64 start, __u64 len)
+{
+ struct uffdio_continue req;
+
+ req.range.start = start;
+ req.range.len = len;
+ req.mode = 0;
+
+ if (ioctl(ufd, UFFDIO_CONTINUE, &req)) {
+ fprintf(stderr,
+ "UFFDIO_CONTINUE failed for address 0x%" PRIx64 "\n",
+ (uint64_t)start);
+ exit(1);
+ }
+}
+
static void *locking_thread(void *arg)
{
unsigned long cpu = (unsigned long) arg;
@@ -569,8 +593,32 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
}

if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {
+ /* Write protect page faults */
wp_range(uffd, msg->arg.pagefault.address, page_size, false);
stats->wp_faults++;
+ } else if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR) {
+ uint8_t *area;
+ int b;
+
+ /*
+ * Minor page faults
+ *
+ * To prove we can modify the original range for testing
+ * purposes, we're going to bit flip this range before
+ * continuing.
+ *
+ * Note that this requires all minor page fault tests operate on
+ * area_dst (non-UFFD-registered) and area_dst_alias
+ * (UFFD-registered).
+ */
+
+ area = (uint8_t *)(area_dst +
+ ((char *)msg->arg.pagefault.address -
+ area_dst_alias));
+ for (b = 0; b < page_size; ++b)
+ area[b] = ~area[b];
+ continue_range(uffd, msg->arg.pagefault.address, page_size);
+ stats->minor_faults++;
} else {
/* Missing page faults */
if (bounces & BOUNCE_VERIFY &&
@@ -1112,7 +1160,7 @@ static int userfaultfd_events_test(void)
}

if (!pid)
- return faulting_process(0);
+ exit(faulting_process(0));

waitpid(pid, &err, 0);
if (err) {
@@ -1215,6 +1263,95 @@ static int userfaultfd_sig_test(void)
return userfaults != 0;
}

+static int userfaultfd_minor_test(void)
+{
+ struct uffdio_register uffdio_register;
+ unsigned long expected_ioctls;
+ unsigned long p;
+ pthread_t uffd_mon;
+ uint8_t expected_byte;
+ void *expected_page;
+ char c;
+ struct uffd_stats stats = { 0 };
+
+ if (!test_uffdio_minor)
+ return 0;
+
+ printf("testing minor faults: ");
+ fflush(stdout);
+
+ if (uffd_test_ops->release_pages(area_dst))
+ return 1;
+
+ if (userfaultfd_open(0))
+ return 1;
+
+ uffdio_register.range.start = (unsigned long)area_dst_alias;
+ uffdio_register.range.len = nr_pages * page_size;
+ uffdio_register.mode = UFFDIO_REGISTER_MODE_MINOR;
+ if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
+ fprintf(stderr, "register failure\n");
+ exit(1);
+ }
+
+ expected_ioctls = uffd_test_ops->expected_ioctls;
+ expected_ioctls |= 1 << _UFFDIO_CONTINUE;
+ if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) {
+ fprintf(stderr, "unexpected missing ioctl(s)\n");
+ exit(1);
+ }
+
+ /*
+ * After registering with UFFD, populate the non-UFFD-registered side of
+ * the shared mapping. This should *not* trigger any UFFD minor faults.
+ */
+ for (p = 0; p < nr_pages; ++p) {
+ memset(area_dst + (p * page_size), p % ((uint8_t)-1),
+ page_size);
+ }
+
+ if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) {
+ perror("uffd_poll_thread create");
+ exit(1);
+ }
+
+ /*
+ * Read each of the pages back using the UFFD-registered mapping. We
+ * expect that the first time we touch a page, it will result in a minor
+ * fault. uffd_poll_thread will resolve the fault by bit-flipping the
+ * page's contents, and then issuing a CONTINUE ioctl.
+ */
+
+ if (posix_memalign(&expected_page, page_size, page_size)) {
+ fprintf(stderr, "out of memory\n");
+ return 1;
+ }
+
+ for (p = 0; p < nr_pages; ++p) {
+ expected_byte = ~((uint8_t)(p % ((uint8_t)-1)));
+ memset(expected_page, expected_byte, page_size);
+ if (my_bcmp(expected_page, area_dst_alias + (p * page_size),
+ page_size)) {
+ fprintf(stderr,
+ "unexpected page contents after minor fault\n");
+ exit(1);
+ }
+ }
+
+ if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) {
+ perror("pipe write");
+ exit(1);
+ }
+ if (pthread_join(uffd_mon, NULL))
+ return 1;
+
+ close(uffd);
+
+ uffd_stats_report(&stats, 1);
+
+ return stats.minor_faults != nr_pages;
+}
+
static int userfaultfd_stress(void)
{
void *area;
@@ -1413,7 +1550,7 @@ static int userfaultfd_stress(void)

close(uffd);
return userfaultfd_zeropage_test() || userfaultfd_sig_test()
- || userfaultfd_events_test();
+ || userfaultfd_events_test() || userfaultfd_minor_test();
}

/*
@@ -1454,6 +1591,8 @@ static void set_test_type(const char *type)
map_shared = true;
test_type = TEST_HUGETLB;
uffd_test_ops = &hugetlb_uffd_test_ops;
+ /* Minor faults require shared hugetlb; only enable here. */
+ test_uffdio_minor = true;
} else if (!strcmp(type, "shmem")) {
map_shared = true;
test_type = TEST_SHMEM;
--
2.30.0.284.gd98b1dd5eaa7-goog

2021-01-21 19:00:48

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 2/9] hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled

On Fri, Jan 15, 2021 at 11:04:44AM -0800, Axel Rasmussen wrote:

[...]

> @@ -947,4 +948,15 @@ static inline __init void hugetlb_cma_check(void)
> }
> #endif
>
> +static inline bool want_pmd_share(struct vm_area_struct *vma)
> +{
> +#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
> + if (uffd_disable_huge_pmd_share(vma))
> + return false;
> + return true;
> +#else
> + return false;
> +#endif
> +}

I got syzbot complaints about this chunk when compiling for some other archs
outside x86, probably because CONFIG_ARCH_WANT_HUGE_PMD_SHARE can be defined
without CONFIG_USERFAULTFD. I don't know why it didn't complain here, so just
an FYI that I've pushed a new version to the online repo, where this should be
fixed. For example, for this patch:

https://github.com/xzpeter/linux/commit/0fd01db2c9ac3ab8600f3f7df23cf28fddcfee1b

static inline bool want_pmd_share(struct vm_area_struct *vma)
{
#ifdef CONFIG_USERFAULTFD
	if (uffd_disable_huge_pmd_share(vma))
		return false;
#endif

#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
	return true;
#else
	return false;
#endif
}

Thanks,

--
Peter Xu

2021-01-21 19:09:17

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 6/9] userfaultfd: disable huge PMD sharing for MINOR registered VMAs

On Fri, Jan 15, 2021 at 11:04:48AM -0800, Axel Rasmussen wrote:
> As the comment says: for the MINOR fault use case, although the page
> might be present and populated in the other (non-UFFD-registered) half
> of the shared mapping, it may be out of date, and we explicitly want
> userspace to get a minor fault so it can check and potentially update
> the page's contents.
>
> Huge PMD sharing would prevent these faults from occurring for
> suitably aligned areas, so disable it upon UFFD registration.
>
> Signed-off-by: Axel Rasmussen <[email protected]>

Reviewed-by: Peter Xu <[email protected]>

--
Peter Xu

2021-01-21 19:16:17

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 0/9] userfaultfd: add minor fault handling

On Fri, Jan 15, 2021 at 11:04:42AM -0800, Axel Rasmussen wrote:
> UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults. Without
> modifications, the existing codepath assumes a new page needs to be allocated.
> This is okay, since userspace must have a second non-UFFD-registered mapping
> anyway, thus there isn't much reason to want to use these in any case (just
> memcpy or memset or similar).
>
> - If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.

When a minor fault occurs, the dst VM will report the address to the src. The
src could check whether dst contains the latest data on that (pmd) page and
either:

- it's the latest: tell dst, and dst does UFFDIO_CONTINUE

- it's not the latest: tell dst (probably along with the page data? if
hugetlbfs doesn't support double mapping, we'd need to batch all the dirty
small pages in one shot), and dst does whatever is needed to replace the page

Then, I'm wondering what would be the way to replace an old page... is that
one FALLOC_FL_PUNCH_HOLE plus one UFFDIO_COPY at the end?

Thanks,

--
Peter Xu

2021-01-21 19:35:35

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 5/9] userfaultfd: add minor fault registration mode

Hi, Axel,

On Fri, Jan 15, 2021 at 11:04:47AM -0800, Axel Rasmussen wrote:
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index c63ccdae3eab..7aa1461e1a8b 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -71,6 +71,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> return vma->vm_flags & VM_UFFD_WP;
> }
>
> +static inline bool userfaultfd_minor(struct vm_area_struct *vma)
> +{
> + return vma->vm_flags & VM_UFFD_MINOR;
> +}
> +
> static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> pte_t pte)
> {
> @@ -85,7 +90,7 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
>
> static inline bool userfaultfd_armed(struct vm_area_struct *vma)
> {
> - return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
> + return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR);
> }

Maybe move __VM_UFFD_FLAGS into this header, so it can be used here too?
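
Something like the following, assuming __VM_UFFD_FLAGS expands to
(VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR) - a sketch, not a tested
patch:

	static inline bool userfaultfd_armed(struct vm_area_struct *vma)
	{
		return vma->vm_flags & __VM_UFFD_FLAGS;
	}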

[...]

> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 5f2d88212f7c..1cc2cd8a5279 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -19,15 +19,19 @@
> * means the userland is reading).
> */
> #define UFFD_API ((__u64)0xAA)
> +#define UFFD_API_REGISTER_MODES (UFFDIO_REGISTER_MODE_MISSING | \
> + UFFDIO_REGISTER_MODE_WP | \
> + UFFDIO_REGISTER_MODE_MINOR)
> #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \
> UFFD_FEATURE_EVENT_FORK | \
> UFFD_FEATURE_EVENT_REMAP | \
> - UFFD_FEATURE_EVENT_REMOVE | \
> + UFFD_FEATURE_EVENT_REMOVE | \
> UFFD_FEATURE_EVENT_UNMAP | \
> UFFD_FEATURE_MISSING_HUGETLBFS | \
> UFFD_FEATURE_MISSING_SHMEM | \
> UFFD_FEATURE_SIGBUS | \
> - UFFD_FEATURE_THREAD_ID)
> + UFFD_FEATURE_THREAD_ID | \
> + UFFD_FEATURE_MINOR_FAULT_HUGETLBFS)

I'd remove the "_FAULT" to align with the missing features...

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 61d6346ed009..2b3741d6130c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4377,6 +4377,37 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> }
> }
>
> + /* Check for page in userfault range. */
> + if (!new_page && userfaultfd_minor(vma)) {
> + u32 hash;
> + struct vm_fault vmf = {
> + .vma = vma,
> + .address = haddr,
> + .flags = flags,
> + /*
> + * Hard to debug if it ends up being used by a callee
> + * that assumes something about the other uninitialized
> + * fields... same as in memory.c
> + */
> + };
> +
> + unlock_page(page);
> +
> + /*
> + * hugetlb_fault_mutex and i_mmap_rwsem must be dropped before
> + * handling userfault. Reacquire after handling fault to make
> + * calling code simpler.
> + */
> +
> + hash = hugetlb_fault_mutex_hash(mapping, idx);
> + mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> + i_mmap_unlock_read(mapping);
> + ret = handle_userfault(&vmf, VM_UFFD_MINOR);
> + i_mmap_lock_read(mapping);
> + mutex_lock(&hugetlb_fault_mutex_table[hash]);
> + goto out;

I figure it would be easier if the whole chunk were put into the else block
right after find_lock_page(); would that work the same?

Otherwise it's not obviously clear when we'll go into this block - basically,
because of the dependency on the new_page variable and when it's unset.
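
I.e., something with a shape like this (just a sketch of the suggestion, not
tested):

	page = find_lock_page(mapping, idx);
	if (!page) {
		/* ... existing allocation / VM_UFFD_MISSING path ... */
	} else {
		if (userfaultfd_minor(vma)) {
			/*
			 * The page exists in the page cache and the VMA is
			 * registered for minor faults: drop the locks and
			 * hand the fault to userspace, as in the chunk above.
			 */
			unlock_page(page);
			/* drop hugetlb_fault_mutex / i_mmap_rwsem, then
			 * ret = handle_userfault(&vmf, VM_UFFD_MINOR) */
			goto out;
		}
		/* ... */
	}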

Thanks,

--
Peter Xu

2021-01-21 22:30:37

by Axel Rasmussen

[permalink] [raw]
Subject: Re: [PATCH 0/9] userfaultfd: add minor fault handling

On Thu, Jan 21, 2021 at 11:12 AM Peter Xu <[email protected]> wrote:
>
> On Fri, Jan 15, 2021 at 11:04:42AM -0800, Axel Rasmussen wrote:
> > UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults. Without
> > modifications, the existing codepath assumes a new page needs to be allocated.
> > This is okay, since userspace must have a second non-UFFD-registered mapping
> > anyway, thus there isn't much reason to want to use these in any case (just
> > memcpy or memset or similar).
> >
> > - If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
>
> When a minor fault occurs, the dst VM will report the address to the src.
> The src could check whether dst contains the latest data on that (pmd) page
> and either:
>
> - it's the latest: tell dst, and dst does UFFDIO_CONTINUE
>
> - it's not the latest: tell dst (probably along with the page data? if
> hugetlbfs doesn't support double mapping, we'd need to batch all the dirty
> small pages in one shot), and dst does whatever is needed to replace the page
>
> Then, I'm wondering what would be the way to replace an old page... is that
> one FALLOC_FL_PUNCH_HOLE plus one UFFDIO_COPY at the end?

When I wrote this, my thinking was that users of this feature would
have two mappings, one of which is not UFFD registered at all. So, to
replace the existing page contents, userspace would just write to the
non-UFFD mapping (with memcpy() or whatever else, or we could get
fancy and imagine using some RDMA technology to copy the page over the
network from the live migration source directly in place). After
performing the write, we just UFFDIO_CONTINUE.
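
Roughly, the resolution path is the following (a sketch; here "area" is the
UFFD-registered mapping, "area_alias" the non-registered one, and
offset/latest_data stand in for whatever bookkeeping the VMM actually does):

	struct uffdio_continue req;

	/* Bring the page up to date via the non-UFFD mapping... */
	memcpy(area_alias + offset, latest_data, page_size);

	/* ...then tell the kernel to install the mapping. */
	req.range.start = (unsigned long)(area + offset);
	req.range.len = page_size;
	req.mode = 0;
	ioctl(uffd, UFFDIO_CONTINUE, &req);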

I believe FALLOC_FL_PUNCH_HOLE / MADV_REMOVE doesn't work with
hugetlbfs? Once shmem support is implemented, I would expect
FALLOC_FL_PUNCH_HOLE + UFFDIO_COPY to work, but I wonder if such an
operation would be more expensive than just copying using the other
side of the shared mapping?
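
For comparison, the punch-hole sequence being discussed would be something
like this (again just a sketch; fd is the file backing the shared mapping,
and offset/latest_data are illustrative):

	struct uffdio_copy copy;

	/* Drop the stale page... */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  offset, page_size);

	/* ...then refill the now-missing range with UFFDIO_COPY. */
	copy.dst = (unsigned long)(area + offset);
	copy.src = (unsigned long)latest_data;
	copy.len = page_size;
	copy.mode = 0;
	ioctl(uffd, UFFDIO_COPY, &copy);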

>
> Thanks,
>
> --
> Peter Xu
>

2021-01-21 22:41:02

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 0/9] userfaultfd: add minor fault handling

On Thu, Jan 21, 2021 at 02:13:50PM -0800, Axel Rasmussen wrote:
> When I wrote this, my thinking was that users of this feature would
> have two mappings, one of which is not UFFD registered at all. So, to
> replace the existing page contents, userspace would just write to the
> non-UFFD mapping (with memcpy() or whatever else, or we could get
> fancy and imagine using some RDMA technology to copy the page over the
> network from the live migration source directly in place). After
> performing the write, we just UFFDIO_CONTINUE.
>
> I believe FALLOC_FL_PUNCH_HOLE / MADV_REMOVE doesn't work with
> hugetlbfs? Once shmem support is implemented, I would expect
> FALLOC_FL_PUNCH_HOLE + UFFDIO_COPY to work, but I wonder if such an
> operation would be more expensive than just copying using the other
> side of the shared mapping?

IIUC hugetlb supports that (hugetlbfs_punch_hole()). But I agree with you
that what you said should be good enough. Thanks,

--
Peter Xu

2021-01-21 22:51:49

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 7/9] userfaultfd: add UFFDIO_CONTINUE ioctl

On Fri, Jan 15, 2021 at 11:04:49AM -0800, Axel Rasmussen wrote:
> This ioctl is how userspace ought to resolve "minor" userfaults. The
> idea is, userspace is notified that a minor fault has occurred. It might
> change the contents of the page using its second non-UFFD mapping, or
> not. Then, it calls UFFDIO_CONTINUE to tell the kernel "I have ensured
> the page contents are correct, carry on setting up the mapping".
>
> Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
> MINOR registered VMAs. ZEROPAGE maps the VMA to the zero page; but in
> the minor fault case, we already have some pre-existing underlying page.
> Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
> We'd just use memcpy() or similar instead.
>
> It turns out hugetlb_mcopy_atomic_pte() already does something very close to
> what we want, if an existing page is provided via `struct page **pagep`. We
> already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
> just extend that design: add an enum for the three modes of operation,
> and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
> case. (Basically, look up the existing page, and avoid adding the
> existing page to the page cache or calling set_page_huge_active() on
> it.)
>
> Signed-off-by: Axel Rasmussen <[email protected]>
> ---
> fs/userfaultfd.c | 67 +++++++++++++++++++++++++
> include/linux/userfaultfd_k.h | 2 +
> include/uapi/linux/userfaultfd.h | 21 +++++++-
> mm/hugetlb.c | 11 ++--
> mm/userfaultfd.c | 86 ++++++++++++++++++++++++--------
> 5 files changed, 158 insertions(+), 29 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 19d6925be03f..f0eb2d63289f 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1530,6 +1530,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_WP))
> ioctls_out &= ~((__u64)1 << _UFFDIO_WRITEPROTECT);
>
> + /* CONTINUE ioctl is only supported for MINOR ranges. */
> + if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR))
> + ioctls_out &= ~((__u64)1 << _UFFDIO_CONTINUE);
> +
> /*
> * Now that we scanned all vmas we can already tell
> * userland which ioctls methods are guaranteed to
> @@ -1883,6 +1887,66 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> return ret;
> }
>
> +static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
> +{
> + __s64 ret;
> + struct uffdio_continue uffdio_continue;
> + struct uffdio_continue __user *user_uffdio_continue;
> + struct userfaultfd_wake_range range;
> +
> + user_uffdio_continue = (struct uffdio_continue __user *)arg;
> +
> + ret = -EAGAIN;
> + if (READ_ONCE(ctx->mmap_changing))
> + goto out;
> +
> + ret = -EFAULT;
> + if (copy_from_user(&uffdio_continue, user_uffdio_continue,
> + /* don't copy the output fields */
> + sizeof(uffdio_continue) - (sizeof(__s64))))
> + goto out;
> +
> + ret = validate_range(ctx->mm, &uffdio_continue.range.start,
> + uffdio_continue.range.len);
> + if (ret)
> + goto out;
> +
> + ret = -EINVAL;
> + /* double check for wraparound just in case. */
> + if (uffdio_continue.range.start + uffdio_continue.range.len <=
> + uffdio_continue.range.start) {
> + goto out;
> + }
> + if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
> + goto out;
> +
> + if (mmget_not_zero(ctx->mm)) {
> + ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
> + uffdio_continue.range.len,
> + &ctx->mmap_changing);
> + mmput(ctx->mm);
> + } else {
> + return -ESRCH;
> + }
> +
> + if (unlikely(put_user(ret, &user_uffdio_continue->mapped)))
> + return -EFAULT;
> + if (ret < 0)
> + goto out;
> +
> + /* len == 0 would wake all */
> + BUG_ON(!ret);
> + range.len = ret;
> + if (!(uffdio_continue.mode & UFFDIO_CONTINUE_MODE_DONTWAKE)) {
> + range.start = uffdio_continue.range.start;
> + wake_userfault(ctx, &range);
> + }
> + ret = range.len == uffdio_continue.range.len ? 0 : -EAGAIN;
> +
> +out:
> + return ret;
> +}
> +
> static inline unsigned int uffd_ctx_features(__u64 user_features)
> {
> /*
> @@ -1967,6 +2031,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
> case UFFDIO_WRITEPROTECT:
> ret = userfaultfd_writeprotect(ctx, arg);
> break;
> + case UFFDIO_CONTINUE:
> + ret = userfaultfd_continue(ctx, arg);
> + break;
> }
> return ret;
> }
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index ed157804ca02..49d7e7b4581f 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -41,6 +41,8 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
> unsigned long dst_start,
> unsigned long len,
> bool *mmap_changing);
> +extern ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long dst_start,
> + unsigned long len, bool *mmap_changing);
> extern int mwriteprotect_range(struct mm_struct *dst_mm,
> unsigned long start, unsigned long len,
> bool enable_wp, bool *mmap_changing);
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 1cc2cd8a5279..9a48305f4bdd 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -40,10 +40,12 @@
> ((__u64)1 << _UFFDIO_WAKE | \
> (__u64)1 << _UFFDIO_COPY | \
> (__u64)1 << _UFFDIO_ZEROPAGE | \
> - (__u64)1 << _UFFDIO_WRITEPROTECT)
> + (__u64)1 << _UFFDIO_WRITEPROTECT | \
> + (__u64)1 << _UFFDIO_CONTINUE)
> #define UFFD_API_RANGE_IOCTLS_BASIC \
> ((__u64)1 << _UFFDIO_WAKE | \
> - (__u64)1 << _UFFDIO_COPY)
> + (__u64)1 << _UFFDIO_COPY | \
> + (__u64)1 << _UFFDIO_CONTINUE)
>
> /*
> * Valid ioctl command number range with this API is from 0x00 to
> @@ -59,6 +61,7 @@
> #define _UFFDIO_COPY (0x03)
> #define _UFFDIO_ZEROPAGE (0x04)
> #define _UFFDIO_WRITEPROTECT (0x06)
> +#define _UFFDIO_CONTINUE (0x07)
> #define _UFFDIO_API (0x3F)
>
> /* userfaultfd ioctl ids */
> @@ -77,6 +80,8 @@
> struct uffdio_zeropage)
> #define UFFDIO_WRITEPROTECT _IOWR(UFFDIO, _UFFDIO_WRITEPROTECT, \
> struct uffdio_writeprotect)
> +#define UFFDIO_CONTINUE _IOR(UFFDIO, _UFFDIO_CONTINUE, \
> + struct uffdio_continue)
>
> /* read() structure */
> struct uffd_msg {
> @@ -268,6 +273,18 @@ struct uffdio_writeprotect {
> __u64 mode;
> };
>
> +struct uffdio_continue {
> + struct uffdio_range range;
> +#define UFFDIO_CONTINUE_MODE_DONTWAKE ((__u64)1<<0)
> + __u64 mode;
> +
> + /*
> + * Fields below here are written by the ioctl and must be at the end:
> + * the copy_from_user will not read past here.
> + */
> + __s64 mapped;
> +};
> +
> /*
> * Flags for the userfaultfd(2) system call itself.
> */
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2b3741d6130c..84392d5fa079 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4666,12 +4666,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> spinlock_t *ptl;
> int ret;
> struct page *page;
> + bool new_page = false;
>
> if (!*pagep) {
> ret = -ENOMEM;
> page = alloc_huge_page(dst_vma, dst_addr, 0);
> if (IS_ERR(page))
> goto out;
> + new_page = true;
>
> ret = copy_huge_page_from_user(page,
> (const void __user *) src_addr,
> @@ -4699,10 +4701,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> mapping = dst_vma->vm_file->f_mapping;
> idx = vma_hugecache_offset(h, dst_vma, dst_addr);
>
> - /*
> - * If shared, add to page cache
> - */
> - if (vm_shared) {
> + /* Add shared, newly allocated pages to the page cache. */
> + if (vm_shared && new_page) {

hugetlb_mcopy_atomic_pte() can be called with *page being set, when
copy_huge_page_from_user(allow_pagefault=true) is called outside of this
function before the retry.

IIUC this change could break that case, because new_page will also be false
then, so we won't insert the page into the page cache (while we should).

So instead of the new_page flag, I think we may want to pass the new
mcopy_atomic_mode into this function; otherwise I don't see how we can
distinguish these cases.

> size = i_size_read(mapping->host) >> huge_page_shift(h);
> ret = -EFAULT;
> if (idx >= size)
> @@ -4762,7 +4762,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> update_mmu_cache(dst_vma, dst_addr, dst_pte);
>
> spin_unlock(ptl);
> - set_page_huge_active(page);
> + if (new_page)
> + set_page_huge_active(page);

Same here.

> if (vm_shared)
> unlock_page(page);
> ret = 0;
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index b2ce61c1b50d..0ecc50525dd4 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -197,6 +197,16 @@ static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
> return pmd_alloc(mm, pud, address);
> }
>
> +/* The mode of operation for __mcopy_atomic and its helpers. */
> +enum mcopy_atomic_mode {
> + /* A normal copy_from_user into the destination range. */
> + MCOPY_ATOMIC_NORMAL,
> + /* Don't copy; map the destination range to the zero page. */
> + MCOPY_ATOMIC_ZEROPAGE,
> + /* Just setup the dst_vma, without modifying the underlying page(s). */
> + MCOPY_ATOMIC_CONTINUE,
> +};
> +
> #ifdef CONFIG_HUGETLB_PAGE
> /*
> * __mcopy_atomic processing for HUGETLB vmas. Note that this routine is
> @@ -207,7 +217,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> unsigned long dst_start,
> unsigned long src_start,
> unsigned long len,
> - bool zeropage)
> + enum mcopy_atomic_mode mode)
> {
> int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED;
> int vm_shared = dst_vma->vm_flags & VM_SHARED;
> @@ -227,7 +237,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> * by THP. Since we can not reliably insert a zero page, this
> * feature is not supported.
> */
> - if (zeropage) {
> + if (mode == MCOPY_ATOMIC_ZEROPAGE) {
> mmap_read_unlock(dst_mm);
> return -EINVAL;
> }
> @@ -273,8 +283,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> }
>
> while (src_addr < src_start + len) {
> - pte_t dst_pteval;
> -
> BUG_ON(dst_addr >= dst_start + len);
>
> /*
> @@ -297,12 +305,22 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> goto out_unlock;
> }
>
> - err = -EEXIST;
> - dst_pteval = huge_ptep_get(dst_pte);
> - if (!huge_pte_none(dst_pteval)) {
> - mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> - i_mmap_unlock_read(mapping);
> - goto out_unlock;
> + if (mode == MCOPY_ATOMIC_CONTINUE) {
> + /* hugetlb_mcopy_atomic_pte unlocks the page. */
> + page = find_lock_page(mapping, idx);

If my above understanding is right, we may also consider moving this
find_lock_page() into hugetlb_mcopy_atomic_pte() directly; then, when we pass
in hugetlb_mcopy_atomic_pte(page==NULL, mode==MCOPY_ATOMIC_CONTINUE), we'll
fetch the page from the page cache instead of allocating one.

> + if (!page) {
> + err = -EFAULT;
> + mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> + i_mmap_unlock_read(mapping);
> + goto out_unlock;
> + }
> + } else {
> + if (!huge_pte_none(huge_ptep_get(dst_pte))) {
> + err = -EEXIST;
> + mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> + i_mmap_unlock_read(mapping);
> + goto out_unlock;
> + }
> }
>
> err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
> @@ -408,7 +426,7 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> unsigned long dst_start,
> unsigned long src_start,
> unsigned long len,
> - bool zeropage);
> + enum mcopy_atomic_mode mode);
> #endif /* CONFIG_HUGETLB_PAGE */
>
> static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
> @@ -417,7 +435,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
> unsigned long dst_addr,
> unsigned long src_addr,
> struct page **page,
> - bool zeropage,
> + enum mcopy_atomic_mode mode,
> bool wp_copy)
> {
> ssize_t err;
> @@ -433,22 +451,38 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
> * and not in the radix tree.
> */
> if (!(dst_vma->vm_flags & VM_SHARED)) {
> - if (!zeropage)
> + switch (mode) {
> + case MCOPY_ATOMIC_NORMAL:
> err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> dst_addr, src_addr, page,
> wp_copy);
> - else
> + break;
> + case MCOPY_ATOMIC_ZEROPAGE:
> err = mfill_zeropage_pte(dst_mm, dst_pmd,
> dst_vma, dst_addr);
> + break;
> + /* It only makes sense to CONTINUE for shared memory. */
> + case MCOPY_ATOMIC_CONTINUE:
> + err = -EINVAL;
> + break;
> + }
> } else {
> VM_WARN_ON_ONCE(wp_copy);
> - if (!zeropage)
> + switch (mode) {
> + case MCOPY_ATOMIC_NORMAL:
> err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
> dst_vma, dst_addr,
> src_addr, page);
> - else
> + break;
> + case MCOPY_ATOMIC_ZEROPAGE:
> err = shmem_mfill_zeropage_pte(dst_mm, dst_pmd,
> dst_vma, dst_addr);
> + break;
> + case MCOPY_ATOMIC_CONTINUE:
> + /* FIXME: Add minor fault interception for shmem. */
> + err = -EINVAL;
> + break;
> + }
> }
>
> return err;
> @@ -458,7 +492,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> unsigned long dst_start,
> unsigned long src_start,
> unsigned long len,
> - bool zeropage,
> + enum mcopy_atomic_mode mcopy_mode,
> bool *mmap_changing,
> __u64 mode)
> {
> @@ -527,7 +561,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> */
> if (is_vm_hugetlb_page(dst_vma))
> return __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
> - src_start, len, zeropage);
> + src_start, len, mcopy_mode);
>
> if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
> goto out_unlock;
> @@ -577,7 +611,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> BUG_ON(pmd_trans_huge(*dst_pmd));
>
> err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
> - src_addr, &page, zeropage, wp_copy);
> + src_addr, &page, mcopy_mode, wp_copy);
> cond_resched();
>
> if (unlikely(err == -ENOENT)) {
> @@ -626,14 +660,22 @@ ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
> unsigned long src_start, unsigned long len,
> bool *mmap_changing, __u64 mode)
> {
> - return __mcopy_atomic(dst_mm, dst_start, src_start, len, false,
> - mmap_changing, mode);
> + return __mcopy_atomic(dst_mm, dst_start, src_start, len,
> + MCOPY_ATOMIC_NORMAL, mmap_changing, mode);
> }
>
> ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
> unsigned long len, bool *mmap_changing)
> {
> - return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
> + return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
> + mmap_changing, 0);
> +}
> +
> +ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
> + unsigned long len, bool *mmap_changing)
> +{
> + return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE,
> + mmap_changing, 0);
> }
>
> int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> --
> 2.30.0.284.gd98b1dd5eaa7-goog
>

--
Peter Xu

2021-01-22 01:04:17

by Axel Rasmussen

[permalink] [raw]
Subject: Re: [PATCH 7/9] userfaultfd: add UFFDIO_CONTINUE ioctl

On Thu, Jan 21, 2021 at 2:47 PM Peter Xu <[email protected]> wrote:
>
> On Fri, Jan 15, 2021 at 11:04:49AM -0800, Axel Rasmussen wrote:

[...]

> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 2b3741d6130c..84392d5fa079 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4666,12 +4666,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > spinlock_t *ptl;
> > int ret;
> > struct page *page;
> > + bool new_page = false;
> >
> > if (!*pagep) {
> > ret = -ENOMEM;
> > page = alloc_huge_page(dst_vma, dst_addr, 0);
> > if (IS_ERR(page))
> > goto out;
> > + new_page = true;
> >
> > ret = copy_huge_page_from_user(page,
> > (const void __user *) src_addr,
> > @@ -4699,10 +4701,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > mapping = dst_vma->vm_file->f_mapping;
> > idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> >
> > - /*
> > - * If shared, add to page cache
> > - */
> > - if (vm_shared) {
> > + /* Add shared, newly allocated pages to the page cache. */
> > + if (vm_shared && new_page) {
>
> hugetlb_mcopy_atomic_pte() can be called with *page being set, when
> copy_huge_page_from_user(allow_pagefault=true) is called outside of this
> function before the retry.
>
> IIUC this change could break that case, because new_page will also be false
> then, so we won't insert the page into the page cache (while we should).
>
> So instead of the new_page flag, I think we may want to pass the new
> mcopy_atomic_mode into this function; otherwise I don't see how we can
> distinguish these cases.

Ah, indeed, I hadn't fully considered this case. You're absolutely
right that in this path hugetlb_mcopy_atomic_pte will have allocated,
but not fully set up, a page, and it needs to notice that this
happened and deal with it.

I thought about some alternatives, like only ever calling
copy_huge_page_from_user in one of these functions or the other, but I
think that would be too messy. Another alternative would be to have
hugetlb_mcopy_atomic_pte take some new struct instead of a struct page **,
recording not just the page pointer but also what initialization has been
done / needs to be done ... but this seems messier than just exposing /
passing mcopy_atomic_mode.

I'll send a v2 which implements this suggestion. :)
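
Roughly, I'm imagining the signature growing the mode parameter, something
like this (a sketch; the exact form is to be worked out in v2):

	int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
				     struct vm_area_struct *dst_vma,
				     unsigned long dst_addr,
				     unsigned long src_addr,
				     enum mcopy_atomic_mode mode,
				     struct page **pagep);

so the function can tell a CONTINUE lookup apart from a retry with a
pre-allocated page.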

Thanks for the thorough review!

[...]

> > @@ -297,12 +305,22 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > goto out_unlock;
> > }
> >
> > - err = -EEXIST;
> > - dst_pteval = huge_ptep_get(dst_pte);
> > - if (!huge_pte_none(dst_pteval)) {
> > - mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> > - i_mmap_unlock_read(mapping);
> > - goto out_unlock;
> > + if (mode == MCOPY_ATOMIC_CONTINUE) {
> > + /* hugetlb_mcopy_atomic_pte unlocks the page. */
> > + page = find_lock_page(mapping, idx);
>
> If my above understanding is right, we may also consider moving this
> find_lock_page() into hugetlb_mcopy_atomic_pte() directly; then, when we pass
> in hugetlb_mcopy_atomic_pte(page==NULL, mode==MCOPY_ATOMIC_CONTINUE), we'll
> fetch the page from the page cache instead of allocating one.

Agreed, if we expose mcopy_atomic_mode anyway, this is a better place
for find_lock_page.
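
I.e., roughly (a sketch of how that branch might look, not tested):

	if (mode == MCOPY_ATOMIC_CONTINUE) {
		ret = -EFAULT;
		/* Fetch the existing page from the page cache... */
		page = find_lock_page(mapping, idx);
		if (!page)
			goto out;
	} else if (!*pagep) {
		ret = -ENOMEM;
		/* ...instead of allocating and filling a new one. */
		page = alloc_huge_page(dst_vma, dst_addr, 0);
		if (IS_ERR(page))
			goto out;
		/* copy_huge_page_from_user(), etc., as before */
	}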
