2021-02-19 00:50:58

by Axel Rasmussen

[permalink] [raw]
Subject: [PATCH v7 0/6] userfaultfd: add minor fault handling

Base
====

This series is based on linux-next/akpm. Additionally, this series depends on
Peter Xu's series to allow disabling huge pmd sharing.

[1] https://lore.kernel.org/patchwork/cover/1382204/

Changelog
=========

v6->v7:
- Based upon discussion, switched back to the VM_* flags approach which was used
in v5, instead of implementing this as an API feature. Switched to using a
high bit (instead of brokenly conflicting with VM_LOCKED), which implies
introducing CONFIG_HAVE_ARCH_USERFAULTFD_MINOR and selecting it only on 64-bit
architectures (x86_64 and arm64 for now).

v5->v6:
- Fixed the condition guarding a second case where we unlock_page() in
hugetlb_mcopy_atomic_pte().
- Significantly refactored how minor registration works. Because there are no
VM_* flags available to use, it has to be a userfaultfd API feature, rather
than a registration mode. This has a few knock on consequences worth calling
out:
- userfaultfd_minor() can no longer be inline, because we have to inspect
the userfaultfd_ctx, which is only defined in fs/userfaultfd.c. This means
slightly more overhead (1 function call) on all hugetlbfs minor faults.
- vma_can_userfault() no longer changes. It seems valid to me to create an
FD with the minor fault feature enabled, and then register e.g. some
non-hugetlbfs region in MISSING mode, fully expecting to not get any minor
faults for it, alongside some other region which you *do* want minor
faults for. So, at registration time, either should be accepted.
- Since I'm no longer adding a new registration mode, I'm no longer
introducing __VM_UFFD_FLAGS or UFFD_API_REGISTER_MODES, and all the
related cleanups have been reverted.

v4->v5:
- Typo fix in the documentation update.
- Removed comment in vma_can_userfault. The same information is better covered
in the documentation update, so the comment is unnecessary (and slightly
confusing as written).
- Reworded comment for MCOPY_ATOMIC_CONTINUE mode.
- For non-shared CONTINUE, only make the PTE(s) non-writable, don't change flags
on the VMA.
- In hugetlb_mcopy_atomic_pte, always unlock the page in MCOPY_ATOMIC_CONTINUE,
even if we don't have VM_SHARED.
- In hugetlb_mcopy_atomic_pte, introduce "bool is_continue" to make that kind of
mode check more terse.
- Merged two nested if()s into a single expression in __mcopy_atomic_hugetlb.
- Moved "return -EINVAL if MCOPY_CONTINUE isn't supported for this vma type" up
one level, into __mcopy_atomic.
- Rebased onto linux-next/akpm, instead of the latest 5.11 RC. Resolved
conflicts with Mike's recent hugetlb changes.

v3->v4:
- Relaxed restriction for minor registration to allow any hugetlb VMAs, not
just those with VM_SHARED. Fixed setting VM_WRITE flag in a CONTINUE ioctl
for non-VM_SHARED VMAs.
- Reordered if() branches in hugetlb_mcopy_atomic_pte, so the conditions are
simpler and easier to read.
- Reverted most of the mfill_atomic_pte change (the anon / shmem path). Just
return -EINVAL for CONTINUE, and set zeropage = (mode ==
MCOPY_ATOMIC_ZEROPAGE), so we can keep the delta small.
- Split out adding #ifdef CONFIG_USERFAULTFD to a separate patch (instead of
lumping it together with adding UFFDIO_CONTINUE).
- Fixed signature of hugetlb_mcopy_atomic_pte for !CONFIG_HUGETLB_PAGE
(signature must be the same in either case).
- Rebased onto a newer version of Peter's patches to disable huge PMD sharing.

v2->v3:
- Added #ifdef CONFIG_USERFAULTFD around hugetlb helper functions, to fix build
errors when building without CONFIG_USERFAULTFD set.

v1->v2:
- Fixed a bug in the hugetlb_mcopy_atomic_pte retry case. We now plumb in the
enum mcopy_atomic_mode, so we can differentiate between the three cases this
function needs to handle:
1) We're doing a COPY op, and need to allocate a page, add to cache, etc.
2) We're doing a COPY op, but allocation in this function failed previously;
we're in the retry path. The page was allocated, but not e.g. added to page
cache, so that still needs to be done.
3) We're doing a CONTINUE op, we need to look up an existing page instead of
allocating a new one.
- Rebased onto a newer version of Peter's patches to disable huge PMD sharing,
which fixes syzbot complaints on some non-x86 architectures.
- Moved __VM_UFFD_FLAGS into userfaultfd_k.h, so inline helpers can use it.
- Renamed UFFD_FEATURE_MINOR_FAULT_HUGETLBFS to UFFD_FEATURE_MINOR_HUGETLBFS,
for consistency with other existing feature flags.
- Moved the userfaultfd_minor hook in hugetlb.c into the else block, so we don't
have to explicitly check for !new_page.

RFC->v1:
- Rebased onto Peter Xu's patches for disabling huge PMD sharing for certain
userfaultfd-registered areas.
- Added commits which update documentation, and add a self test which exercises
the new feature.
- Fixed reporting CONTINUE as a supported ioctl even for non-MINOR ranges.

Overview
========

This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS. When
enabled (via the UFFDIO_API ioctl), this feature means that any hugetlbfs VMAs
registered with UFFDIO_REGISTER_MODE_MISSING will *also* get events for "minor"
faults. By "minor" fault, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
One of the mappings is registered with userfaultfd (in minor mode), and the
other is not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what I'm
calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
have huge_pte_none(), but find_lock_page() finds an existing page.

We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
userspace resolves the fault by either a) doing nothing if the contents are
already correct, or b) updating the underlying contents using the second,
non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
"I have ensured the page contents are correct, carry on setting up the mapping".

Use Case
========

Consider the use case of VM live migration (e.g. under QEMU/KVM):

1. While a VM is still running, we copy the contents of its memory to a
target machine. The pages are populated on the target by writing to the
non-UFFD mapping, using the setup described above. The VM is still running
(and therefore its memory is likely changing), so this may be repeated
several times, until we decide the target is "up to date enough".

2. We pause the VM on the source, and start executing on the target machine.
During this gap, the VM's user(s) will *see* a pause, so it is desirable to
minimize this window.

3. Between the last time any page was copied from the source to the target, and
when the VM was paused, the contents of that page may have changed - and
therefore the copy we have on the target machine is out of date. Although we
can keep track of which pages are out of date, for VMs with large amounts of
memory, it is "slow" to transfer this information to the target machine. We
want to resume execution before such a transfer would complete.

4. So, the guest begins executing on the target machine. The first time it
touches its memory (via the UFFD-registered mapping), userspace wants to
intercept this fault. Userspace checks whether or not the page is up to date,
and if not, copies the updated page from the source machine, via the non-UFFD
mapping. Finally, whether a copy was performed or not, userspace issues a
UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
are correct, carry on setting up the mapping".

We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.

Interaction with Existing APIs
==============================

Because this is a feature, a registered VMA could potentially receive both
missing and minor faults. I spent some time thinking through how the existing
API interacts with the new feature:

UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:

- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.

UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults. Without
modifications, the existing codepath assumes a new page needs to be allocated.
This is okay, since userspace must have a second non-UFFD-registered mapping
anyway, thus there isn't much reason to want to use these in any case (just
memcpy or memset or similar).

- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
-ENOENT in that case (regardless of the kind of fault).

Future Work
===========

Currently the patchset only supports hugetlbfs. There is no reason it can't work
with shmem, but I expect hugetlbfs to be much more commonly used since we're
talking about backing guest memory for VMs. I plan to implement shmem support in
a follow-up patch series.

Axel Rasmussen (6):
userfaultfd: add minor fault registration mode
userfaultfd: disable huge PMD sharing for MINOR registered VMAs
userfaultfd: hugetlbfs: only compile UFFD helpers if config enabled
userfaultfd: add UFFDIO_CONTINUE ioctl
userfaultfd: update documentation to describe minor fault handling
userfaultfd/selftests: add test exercising minor fault handling

Documentation/admin-guide/mm/userfaultfd.rst | 107 +++++++-----
arch/arm64/Kconfig | 1 +
arch/x86/Kconfig | 1 +
fs/proc/task_mmu.c | 3 +
fs/userfaultfd.c | 146 +++++++++++++----
include/linux/hugetlb.h | 7 +
include/linux/mm.h | 7 +
include/linux/userfaultfd_k.h | 46 +++++-
include/trace/events/mmflags.h | 9 +-
include/uapi/linux/userfaultfd.h | 36 +++-
init/Kconfig | 5 +
mm/hugetlb.c | 75 +++++++--
mm/userfaultfd.c | 37 +++--
tools/testing/selftests/vm/userfaultfd.c | 164 ++++++++++++++++++-
14 files changed, 529 insertions(+), 115 deletions(-)

--
2.30.0.617.g56c4b15f3c-goog


2021-02-19 00:51:32

by Axel Rasmussen

[permalink] [raw]
Subject: [PATCH v7 3/6] userfaultfd: hugetlbfs: only compile UFFD helpers if config enabled

For background, mm/userfaultfd.c provides a general mcopy_atomic
implementation. But some types of memory (i.e., hugetlb and shmem) need
a slightly different implementation, so they provide their own helpers
for this. In other words, userfaultfd is the only caller of these
functions.

This patch achieves two things:

1. Don't spend time compiling code which will end up never being
referenced anyway (a small build time optimization).

2. In patches later in this series, we extend the signature of these
helpers with UFFD-specific state (a mode enumeration). Once this
happens, we *have to* either not compile the helpers, or unconditionally
define the UFFD-only state (which seems messier to me). This includes
the declarations in the headers, as otherwise they'd yield warnings
about implicitly defining the type of those arguments.

Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: Axel Rasmussen <[email protected]>
---
include/linux/hugetlb.h | 4 ++++
mm/hugetlb.c | 2 ++
2 files changed, 6 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ef5b55dbeb9a..7e6d2f126df3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -134,11 +134,13 @@ void hugetlb_show_meminfo(void);
unsigned long hugetlb_total_pages(void);
vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, unsigned int flags);
+#ifdef CONFIG_USERFAULTFD
int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
struct page **pagep);
+#endif /* CONFIG_USERFAULTFD */
bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
struct vm_area_struct *vma,
vm_flags_t vm_flags);
@@ -310,6 +312,7 @@ static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb,
BUG();
}

+#ifdef CONFIG_USERFAULTFD
static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
pte_t *dst_pte,
struct vm_area_struct *dst_vma,
@@ -320,6 +323,7 @@ static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
BUG();
return 0;
}
+#endif /* CONFIG_USERFAULTFD */

static inline pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
unsigned long sz)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0388107da4b1..301b6b64c04e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4624,6 +4624,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return ret;
}

+#ifdef CONFIG_USERFAULTFD
/*
* Used by userfaultfd UFFDIO_COPY. Based on mcopy_atomic_pte with
* modifications for huge pages.
@@ -4754,6 +4755,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
put_page(page);
goto out;
}
+#endif /* CONFIG_USERFAULTFD */

static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
int refs, struct page **pages,
--
2.30.0.617.g56c4b15f3c-goog

2021-02-19 00:52:01

by Axel Rasmussen

[permalink] [raw]
Subject: [PATCH v7 2/6] userfaultfd: disable huge PMD sharing for MINOR registered VMAs

As the comment says: for the MINOR fault use case, although the page
might be present and populated in the other (non-UFFD-registered) half
of the mapping, it may be out of date, and we explicitly want userspace
to get a minor fault so it can check and potentially update the page's
contents.

Huge PMD sharing would prevent these faults from occurring for
suitably aligned areas, so disable it upon UFFD registration.

Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: Axel Rasmussen <[email protected]>
---
include/linux/userfaultfd_k.h | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 0390e5ac63b3..e060d5f77cc5 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -56,12 +56,19 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
}

/*
- * Never enable huge pmd sharing on uffd-wp registered vmas, because uffd-wp
- * protect information is per pgtable entry.
+ * Never enable huge pmd sharing on some uffd registered vmas:
+ *
+ * - VM_UFFD_WP VMAs, because write protect information is per pgtable entry.
+ *
+ * - VM_UFFD_MINOR VMAs, because otherwise we would never get minor faults for
+ * VMAs which share huge pmds. (If you have two mappings to the same
+ * underlying pages, and fault in the non-UFFD-registered one with a write,
+ * with huge pmd sharing this would *also* setup the second UFFD-registered
+ * mapping, and we'd not get minor faults.)
*/
static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
{
- return vma->vm_flags & VM_UFFD_WP;
+ return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
}

static inline bool userfaultfd_missing(struct vm_area_struct *vma)
--
2.30.0.617.g56c4b15f3c-goog

2021-02-19 00:52:17

by Axel Rasmussen

[permalink] [raw]
Subject: [PATCH v7 4/6] userfaultfd: add UFFDIO_CONTINUE ioctl

This ioctl is how userspace ought to resolve "minor" userfaults. The
idea is, userspace is notified that a minor fault has occurred. It might
change the contents of the page using its second non-UFFD mapping, or
not. Then, it calls UFFDIO_CONTINUE to tell the kernel "I have ensured
the page contents are correct, carry on setting up the mapping".

Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs. ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.

It turns out hugetlb_mcopy_atomic_pte() already does very close to what
we want, if an existing page is provided via `struct page **pagep`. We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case. (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)

Signed-off-by: Axel Rasmussen <[email protected]>
---
fs/userfaultfd.c | 67 ++++++++++++++++++++++++++++++++
include/linux/hugetlb.h | 3 ++
include/linux/userfaultfd_k.h | 18 +++++++++
include/uapi/linux/userfaultfd.h | 21 +++++++++-
mm/hugetlb.c | 41 ++++++++++++-------
mm/userfaultfd.c | 37 +++++++++++-------
6 files changed, 157 insertions(+), 30 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 041449e47870..3b42c09eb043 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1488,6 +1488,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_WP))
ioctls_out &= ~((__u64)1 << _UFFDIO_WRITEPROTECT);

+ /* CONTINUE ioctl is only supported for MINOR ranges. */
+ if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR))
+ ioctls_out &= ~((__u64)1 << _UFFDIO_CONTINUE);
+
/*
* Now that we scanned all vmas we can already tell
* userland which ioctls methods are guaranteed to
@@ -1841,6 +1845,66 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
return ret;
}

+static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
+{
+ __s64 ret;
+ struct uffdio_continue uffdio_continue;
+ struct uffdio_continue __user *user_uffdio_continue;
+ struct userfaultfd_wake_range range;
+
+ user_uffdio_continue = (struct uffdio_continue __user *)arg;
+
+ ret = -EAGAIN;
+ if (READ_ONCE(ctx->mmap_changing))
+ goto out;
+
+ ret = -EFAULT;
+ if (copy_from_user(&uffdio_continue, user_uffdio_continue,
+ /* don't copy the output fields */
+ sizeof(uffdio_continue) - (sizeof(__s64))))
+ goto out;
+
+ ret = validate_range(ctx->mm, &uffdio_continue.range.start,
+ uffdio_continue.range.len);
+ if (ret)
+ goto out;
+
+ ret = -EINVAL;
+ /* double check for wraparound just in case. */
+ if (uffdio_continue.range.start + uffdio_continue.range.len <=
+ uffdio_continue.range.start) {
+ goto out;
+ }
+ if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
+ goto out;
+
+ if (mmget_not_zero(ctx->mm)) {
+ ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
+ uffdio_continue.range.len,
+ &ctx->mmap_changing);
+ mmput(ctx->mm);
+ } else {
+ return -ESRCH;
+ }
+
+ if (unlikely(put_user(ret, &user_uffdio_continue->mapped)))
+ return -EFAULT;
+ if (ret < 0)
+ goto out;
+
+ /* len == 0 would wake all */
+ BUG_ON(!ret);
+ range.len = ret;
+ if (!(uffdio_continue.mode & UFFDIO_CONTINUE_MODE_DONTWAKE)) {
+ range.start = uffdio_continue.range.start;
+ wake_userfault(ctx, &range);
+ }
+ ret = range.len == uffdio_continue.range.len ? 0 : -EAGAIN;
+
+out:
+ return ret;
+}
+
static inline unsigned int uffd_ctx_features(__u64 user_features)
{
/*
@@ -1928,6 +1992,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
case UFFDIO_WRITEPROTECT:
ret = userfaultfd_writeprotect(ctx, arg);
break;
+ case UFFDIO_CONTINUE:
+ ret = userfaultfd_continue(ctx, arg);
+ break;
}
return ret;
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 7e6d2f126df3..8f2a4fc11b1f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -11,6 +11,7 @@
#include <linux/kref.h>
#include <linux/pgtable.h>
#include <linux/gfp.h>
+#include <linux/userfaultfd_k.h>

struct ctl_table;
struct user_struct;
@@ -139,6 +140,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
+ enum mcopy_atomic_mode mode,
struct page **pagep);
#endif /* CONFIG_USERFAULTFD */
bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
@@ -318,6 +320,7 @@ static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
+ enum mcopy_atomic_mode mode,
struct page **pagep)
{
BUG();
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index e060d5f77cc5..794d1538b8ba 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -37,6 +37,22 @@ extern int sysctl_unprivileged_userfaultfd;

extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);

+/*
+ * The mode of operation for __mcopy_atomic and its helpers.
+ *
+ * This is almost an implementation detail (mcopy_atomic below doesn't take this
+ * as a parameter), but it's exposed here because memory-kind-specific
+ * implementations (e.g. hugetlbfs) need to know the mode of operation.
+ */
+enum mcopy_atomic_mode {
+ /* A normal copy_from_user into the destination range. */
+ MCOPY_ATOMIC_NORMAL,
+ /* Don't copy; map the destination range to the zero page. */
+ MCOPY_ATOMIC_ZEROPAGE,
+ /* Just install pte(s) with the existing page(s) in the page cache. */
+ MCOPY_ATOMIC_CONTINUE,
+};
+
extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
unsigned long src_start, unsigned long len,
bool *mmap_changing, __u64 mode);
@@ -44,6 +60,8 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long len,
bool *mmap_changing);
+extern ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long dst_start,
+ unsigned long len, bool *mmap_changing);
extern int mwriteprotect_range(struct mm_struct *dst_mm,
unsigned long start, unsigned long len,
bool enable_wp, bool *mmap_changing);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index f24dd4fcbad9..bafbeb1a2624 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -40,10 +40,12 @@
((__u64)1 << _UFFDIO_WAKE | \
(__u64)1 << _UFFDIO_COPY | \
(__u64)1 << _UFFDIO_ZEROPAGE | \
- (__u64)1 << _UFFDIO_WRITEPROTECT)
+ (__u64)1 << _UFFDIO_WRITEPROTECT | \
+ (__u64)1 << _UFFDIO_CONTINUE)
#define UFFD_API_RANGE_IOCTLS_BASIC \
((__u64)1 << _UFFDIO_WAKE | \
- (__u64)1 << _UFFDIO_COPY)
+ (__u64)1 << _UFFDIO_COPY | \
+ (__u64)1 << _UFFDIO_CONTINUE)

/*
* Valid ioctl command number range with this API is from 0x00 to
@@ -59,6 +61,7 @@
#define _UFFDIO_COPY (0x03)
#define _UFFDIO_ZEROPAGE (0x04)
#define _UFFDIO_WRITEPROTECT (0x06)
+#define _UFFDIO_CONTINUE (0x07)
#define _UFFDIO_API (0x3F)

/* userfaultfd ioctl ids */
@@ -77,6 +80,8 @@
struct uffdio_zeropage)
#define UFFDIO_WRITEPROTECT _IOWR(UFFDIO, _UFFDIO_WRITEPROTECT, \
struct uffdio_writeprotect)
+#define UFFDIO_CONTINUE _IOR(UFFDIO, _UFFDIO_CONTINUE, \
+ struct uffdio_continue)

/* read() structure */
struct uffd_msg {
@@ -268,6 +273,18 @@ struct uffdio_writeprotect {
__u64 mode;
};

+struct uffdio_continue {
+ struct uffdio_range range;
+#define UFFDIO_CONTINUE_MODE_DONTWAKE ((__u64)1<<0)
+ __u64 mode;
+
+ /*
+ * Fields below here are written by the ioctl and must be at the end:
+ * the copy_from_user will not read past here.
+ */
+ __s64 mapped;
+};
+
/*
* Flags for the userfaultfd(2) system call itself.
*/
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 301b6b64c04e..2553b0d525aa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -39,7 +39,6 @@
#include <linux/hugetlb.h>
#include <linux/hugetlb_cgroup.h>
#include <linux/node.h>
-#include <linux/userfaultfd_k.h>
#include <linux/page_owner.h>
#include "internal.h"

@@ -4634,8 +4633,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
+ enum mcopy_atomic_mode mode,
struct page **pagep)
{
+ bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
struct address_space *mapping;
pgoff_t idx;
unsigned long size;
@@ -4645,8 +4646,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
spinlock_t *ptl;
int ret;
struct page *page;
+ int writable;

- if (!*pagep) {
+ mapping = dst_vma->vm_file->f_mapping;
+ idx = vma_hugecache_offset(h, dst_vma, dst_addr);
+
+ if (is_continue) {
+ ret = -EFAULT;
+ page = find_lock_page(mapping, idx);
+ *pagep = NULL;
+ if (!page)
+ goto out;
+ } else if (!*pagep) {
ret = -ENOMEM;
page = alloc_huge_page(dst_vma, dst_addr, 0);
if (IS_ERR(page))
@@ -4675,13 +4686,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
*/
__SetPageUptodate(page);

- mapping = dst_vma->vm_file->f_mapping;
- idx = vma_hugecache_offset(h, dst_vma, dst_addr);
-
- /*
- * If shared, add to page cache
- */
- if (vm_shared) {
+ /* Add shared, newly allocated pages to the page cache. */
+ if (vm_shared && !is_continue) {
size = i_size_read(mapping->host) >> huge_page_shift(h);
ret = -EFAULT;
if (idx >= size)
@@ -4726,8 +4732,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
}

- _dst_pte = make_huge_pte(dst_vma, page, dst_vma->vm_flags & VM_WRITE);
- if (dst_vma->vm_flags & VM_WRITE)
+ /* For CONTINUE on a non-shared VMA, don't set VM_WRITE for CoW. */
+ if (is_continue && !vm_shared)
+ writable = 0;
+ else
+ writable = dst_vma->vm_flags & VM_WRITE;
+
+ _dst_pte = make_huge_pte(dst_vma, page, writable);
+ if (writable)
_dst_pte = huge_pte_mkdirty(_dst_pte);
_dst_pte = pte_mkyoung(_dst_pte);

@@ -4741,15 +4753,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
update_mmu_cache(dst_vma, dst_addr, dst_pte);

spin_unlock(ptl);
- SetHPageMigratable(page);
- if (vm_shared)
+ if (!is_continue)
+ SetHPageMigratable(page);
+ if (vm_shared || is_continue)
unlock_page(page);
ret = 0;
out:
return ret;
out_release_unlock:
spin_unlock(ptl);
- if (vm_shared)
+ if (vm_shared || is_continue)
unlock_page(page);
out_release_nounlock:
put_page(page);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index b2ce61c1b50d..ce6cb4760d2c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -207,7 +207,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- bool zeropage)
+ enum mcopy_atomic_mode mode)
{
int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED;
int vm_shared = dst_vma->vm_flags & VM_SHARED;
@@ -227,7 +227,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
* by THP. Since we can not reliably insert a zero page, this
* feature is not supported.
*/
- if (zeropage) {
+ if (mode == MCOPY_ATOMIC_ZEROPAGE) {
mmap_read_unlock(dst_mm);
return -EINVAL;
}
@@ -273,8 +273,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
}

while (src_addr < src_start + len) {
- pte_t dst_pteval;
-
BUG_ON(dst_addr >= dst_start + len);

/*
@@ -297,16 +295,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
goto out_unlock;
}

- err = -EEXIST;
- dst_pteval = huge_ptep_get(dst_pte);
- if (!huge_pte_none(dst_pteval)) {
+ if (mode != MCOPY_ATOMIC_CONTINUE &&
+ !huge_pte_none(huge_ptep_get(dst_pte))) {
+ err = -EEXIST;
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_unlock_read(mapping);
goto out_unlock;
}

err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
- dst_addr, src_addr, &page);
+ dst_addr, src_addr, mode, &page);

mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_unlock_read(mapping);
@@ -408,7 +406,7 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- bool zeropage);
+ enum mcopy_atomic_mode mode);
#endif /* CONFIG_HUGETLB_PAGE */

static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
@@ -458,7 +456,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- bool zeropage,
+ enum mcopy_atomic_mode mcopy_mode,
bool *mmap_changing,
__u64 mode)
{
@@ -469,6 +467,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
long copied;
struct page *page;
bool wp_copy;
+ bool zeropage = (mcopy_mode == MCOPY_ATOMIC_ZEROPAGE);

/*
* Sanitize the command parameters:
@@ -527,10 +526,12 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
*/
if (is_vm_hugetlb_page(dst_vma))
return __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
- src_start, len, zeropage);
+ src_start, len, mcopy_mode);

if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
goto out_unlock;
+ if (mcopy_mode == MCOPY_ATOMIC_CONTINUE)
+ goto out_unlock;

/*
* Ensure the dst_vma has a anon_vma or this page
@@ -626,14 +627,22 @@ ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
unsigned long src_start, unsigned long len,
bool *mmap_changing, __u64 mode)
{
- return __mcopy_atomic(dst_mm, dst_start, src_start, len, false,
- mmap_changing, mode);
+ return __mcopy_atomic(dst_mm, dst_start, src_start, len,
+ MCOPY_ATOMIC_NORMAL, mmap_changing, mode);
}

ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
unsigned long len, bool *mmap_changing)
{
- return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
+ return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
+ mmap_changing, 0);
+}
+
+ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
+ unsigned long len, bool *mmap_changing)
+{
+ return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE,
+ mmap_changing, 0);
}

int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
--
2.30.0.617.g56c4b15f3c-goog

2021-02-19 00:52:50

by Axel Rasmussen

[permalink] [raw]
Subject: [PATCH v7 5/6] userfaultfd: update documentation to describe minor fault handling

Reword / reorganize things a little bit into "lists", so new features /
modes / ioctls can sort of just be appended.

Describe how UFFDIO_REGISTER_MODE_MINOR and UFFDIO_CONTINUE can be used
to intercept and resolve minor faults. Make it clear that COPY and
ZEROPAGE are used for MISSING faults, whereas CONTINUE is used for MINOR
faults.

Signed-off-by: Axel Rasmussen <[email protected]>
---
Documentation/admin-guide/mm/userfaultfd.rst | 107 ++++++++++++-------
1 file changed, 66 insertions(+), 41 deletions(-)

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 65eefa66c0ba..3aa38e8b8361 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -63,36 +63,36 @@ the generic ioctl available.

The ``uffdio_api.features`` bitmask returned by the ``UFFDIO_API`` ioctl
defines what memory types are supported by the ``userfaultfd`` and what
-events, except page fault notifications, may be generated.
-
-If the kernel supports registering ``userfaultfd`` ranges on hugetlbfs
-virtual memory areas, ``UFFD_FEATURE_MISSING_HUGETLBFS`` will be set in
-``uffdio_api.features``. Similarly, ``UFFD_FEATURE_MISSING_SHMEM`` will be
-set if the kernel supports registering ``userfaultfd`` ranges on shared
-memory (covering all shmem APIs, i.e. tmpfs, ``IPCSHM``, ``/dev/zero``,
-``MAP_SHARED``, ``memfd_create``, etc).
-
-The userland application that wants to use ``userfaultfd`` with hugetlbfs
-or shared memory need to set the corresponding flag in
-``uffdio_api.features`` to enable those features.
-
-If the userland desires to receive notifications for events other than
-page faults, it has to verify that ``uffdio_api.features`` has appropriate
-``UFFD_FEATURE_EVENT_*`` bits set. These events are described in more
-detail below in `Non-cooperative userfaultfd`_ section.
-
-Once the ``userfaultfd`` has been enabled the ``UFFDIO_REGISTER`` ioctl should
-be invoked (if present in the returned ``uffdio_api.ioctls`` bitmask) to
-register a memory range in the ``userfaultfd`` by setting the
+events, except page fault notifications, may be generated:
+
+- The ``UFFD_FEATURE_EVENT_*`` flags indicate that various other events
+ other than page faults are supported. These events are described in more
+ detail below in the `Non-cooperative userfaultfd`_ section.
+
+- ``UFFD_FEATURE_MISSING_HUGETLBFS`` and ``UFFD_FEATURE_MISSING_SHMEM``
+ indicate that the kernel supports ``UFFDIO_REGISTER_MODE_MISSING``
+ registrations for hugetlbfs and shared memory (covering all shmem APIs,
+ i.e. tmpfs, ``IPCSHM``, ``/dev/zero``, ``MAP_SHARED``, ``memfd_create``,
+ etc) virtual memory areas, respectively.
+
+- ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports
+ ``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory
+ areas.
+
+The userland application should set the feature flags it intends to use
+when invoking the ``UFFDIO_API`` ioctl, to request that those features be
+enabled if supported.
+
+Once the ``userfaultfd`` API has been enabled the ``UFFDIO_REGISTER``
+ioctl should be invoked (if present in the returned ``uffdio_api.ioctls``
+bitmask) to register a memory range in the ``userfaultfd`` by setting the
uffdio_register structure accordingly. The ``uffdio_register.mode``
bitmask will specify to the kernel which kind of faults to track for
-the range (``UFFDIO_REGISTER_MODE_MISSING`` would track missing
-pages). The ``UFFDIO_REGISTER`` ioctl will return the
+the range. The ``UFFDIO_REGISTER`` ioctl will return the
``uffdio_register.ioctls`` bitmask of ioctls that are suitable to resolve
userfaults on the range registered. Not all ioctls will necessarily be
-supported for all memory types depending on the underlying virtual
-memory backend (anonymous memory vs tmpfs vs real filebacked
-mappings).
+supported for all memory types (e.g. anonymous memory vs. shmem vs.
+hugetlbfs), or all types of intercepted faults.

Userland can use the ``uffdio_register.ioctls`` to manage the virtual
address space in the background (to add or potentially also remove
@@ -100,21 +100,46 @@ memory from the ``userfaultfd`` registered range). This means a userfault
could be triggering just before userland maps in the background the
user-faulted page.

-The primary ioctl to resolve userfaults is ``UFFDIO_COPY``. That
-atomically copies a page into the userfault registered range and wakes
-up the blocked userfaults
-(unless ``uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE`` is set).
-Other ioctl works similarly to ``UFFDIO_COPY``. They're atomic as in
-guaranteeing that nothing can see an half copied page since it'll
-keep userfaulting until the copy has finished.
+Resolving Userfaults
+--------------------
+
+There are three basic ways to resolve userfaults:
+
+- ``UFFDIO_COPY`` atomically copies some existing page contents from
+ userspace.
+
+- ``UFFDIO_ZEROPAGE`` atomically zeros the new page.
+
+- ``UFFDIO_CONTINUE`` maps an existing, previously-populated page.
+
+These operations are atomic in the sense that they guarantee nothing can
+see a half-populated page, since readers will keep userfaulting until the
+operation has finished.
+
+By default, these wake up userfaults blocked on the range in question.
+They support a ``UFFDIO_*_MODE_DONTWAKE`` ``mode`` flag, which indicates
+that waking will be done separately at some later time.
+
+Which ioctl to choose depends on the kind of page fault, and what we'd
+like to do to resolve it:
+
+- For ``UFFDIO_REGISTER_MODE_MISSING`` faults, the fault needs to be
+ resolved by either providing a new page (``UFFDIO_COPY``), or mapping
+ the zero page (``UFFDIO_ZEROPAGE``). By default, the kernel would map
+ the zero page for a missing fault. With userfaultfd, userspace can
+ decide what content to provide before the faulting thread continues.
+
+- For ``UFFDIO_REGISTER_MODE_MINOR`` faults, there is an existing page (in
+ the page cache). Userspace has the option of modifying the page's
+ contents before resolving the fault. Once the contents are correct
+ (modified or not), userspace asks the kernel to map the page and let the
+ faulting thread continue with ``UFFDIO_CONTINUE``.

Notes:

-- If you requested ``UFFDIO_REGISTER_MODE_MISSING`` when registering then
- you must provide some kind of page in your thread after reading from
- the uffd. You must provide either ``UFFDIO_COPY`` or ``UFFDIO_ZEROPAGE``.
- The normal behavior of the OS automatically providing a zero page on
- an anonymous mmaping is not in place.
+- You can tell which kind of fault occurred by examining
+ ``pagefault.flags`` within the ``uffd_msg``, checking for the
+ ``UFFD_PAGEFAULT_FLAG_*`` flags.

- None of the page-delivering ioctls default to the range that you
registered with. You must fill in all fields for the appropriate
@@ -122,9 +147,9 @@ Notes:

- You get the address of the access that triggered the missing page
event out of a struct uffd_msg that you read in the thread from the
- uffd. You can supply as many pages as you want with ``UFFDIO_COPY`` or
- ``UFFDIO_ZEROPAGE``. Keep in mind that unless you used DONTWAKE then
- the first of any of those IOCTLs wakes up the faulting thread.
+ uffd. You can supply as many pages as you want with these IOCTLs.
+ Keep in mind that unless you used DONTWAKE then the first of any of
+ those IOCTLs wakes up the faulting thread.

- Be sure to test for all errors including
(``pollfd[0].revents & POLLERR``). This can happen, e.g. when ranges
--
2.30.0.617.g56c4b15f3c-goog

2021-02-19 00:53:15

by Axel Rasmussen

[permalink] [raw]
Subject: [PATCH v7 6/6] userfaultfd/selftests: add test exercising minor fault handling

Fix a dormant bug in userfaultfd_events_test(), where we did
`return faulting_process(0)` instead of `exit(faulting_process(0))`.
This caused the forked process to keep running, trying to execute any
further test cases after the events test in parallel with the "real"
process.

Add a simple test case which exercises minor faults. In short, it does
the following:

1. "Sets up" an area (area_dst) and a second shared mapping to the same
underlying pages (area_dst_alias).

2. Register one of these areas with userfaultfd, in minor fault mode.

3. Start a second thread to handle any minor faults.

4. Populate the underlying pages with the non-UFFD-registered side of
the mapping. Basically, memset() each page with some arbitrary
contents.

5. Then, using the UFFD-registered mapping, read all of the page
contents, asserting that the contents match expectations (we expect
the minor fault handling thread can modify the page contents before
resolving the fault).

The minor fault handling thread, upon receiving an event, flips all the
bits (~) in that page, just to prove that it can modify it in some
arbitrary way. Then it issues a UFFDIO_CONTINUE ioctl, to setup the
mapping and resolve the fault. The reading thread should wake up and see
this modification.

Currently the minor fault test is only enabled in hugetlb_shared mode,
as this is the only configuration the kernel feature supports.

Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: Axel Rasmussen <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 164 ++++++++++++++++++++++-
1 file changed, 158 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 92b8ec423201..f5ab5e0312e7 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -81,6 +81,8 @@ static volatile bool test_uffdio_copy_eexist = true;
static volatile bool test_uffdio_zeropage_eexist = true;
/* Whether to test uffd write-protection */
static bool test_uffdio_wp = false;
+/* Whether to test uffd minor faults */
+static bool test_uffdio_minor = false;

static bool map_shared;
static int huge_fd;
@@ -96,6 +98,7 @@ struct uffd_stats {
int cpu;
unsigned long missing_faults;
unsigned long wp_faults;
+ unsigned long minor_faults;
};

/* pthread_mutex_t starts at page offset 0 */
@@ -153,17 +156,19 @@ static void uffd_stats_reset(struct uffd_stats *uffd_stats,
uffd_stats[i].cpu = i;
uffd_stats[i].missing_faults = 0;
uffd_stats[i].wp_faults = 0;
+ uffd_stats[i].minor_faults = 0;
}
}

static void uffd_stats_report(struct uffd_stats *stats, int n_cpus)
{
int i;
- unsigned long long miss_total = 0, wp_total = 0;
+ unsigned long long miss_total = 0, wp_total = 0, minor_total = 0;

for (i = 0; i < n_cpus; i++) {
miss_total += stats[i].missing_faults;
wp_total += stats[i].wp_faults;
+ minor_total += stats[i].minor_faults;
}

printf("userfaults: %llu missing (", miss_total);
@@ -172,6 +177,9 @@ static void uffd_stats_report(struct uffd_stats *stats, int n_cpus)
printf("\b), %llu wp (", wp_total);
for (i = 0; i < n_cpus; i++)
printf("%lu+", stats[i].wp_faults);
+ printf("\b), %llu minor (", minor_total);
+ for (i = 0; i < n_cpus; i++)
+ printf("%lu+", stats[i].minor_faults);
printf("\b)\n");
}

@@ -328,7 +336,7 @@ static struct uffd_test_ops shmem_uffd_test_ops = {
};

static struct uffd_test_ops hugetlb_uffd_test_ops = {
- .expected_ioctls = UFFD_API_RANGE_IOCTLS_BASIC,
+ .expected_ioctls = UFFD_API_RANGE_IOCTLS_BASIC & ~(1 << _UFFDIO_CONTINUE),
.allocate_area = hugetlb_allocate_area,
.release_pages = hugetlb_release_pages,
.alias_mapping = hugetlb_alias_mapping,
@@ -362,6 +370,22 @@ static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
}
}

+static void continue_range(int ufd, __u64 start, __u64 len)
+{
+ struct uffdio_continue req;
+
+ req.range.start = start;
+ req.range.len = len;
+ req.mode = 0;
+
+ if (ioctl(ufd, UFFDIO_CONTINUE, &req)) {
+ fprintf(stderr,
+ "UFFDIO_CONTINUE failed for address 0x%" PRIx64 "\n",
+ (uint64_t)start);
+ exit(1);
+ }
+}
+
static void *locking_thread(void *arg)
{
unsigned long cpu = (unsigned long) arg;
@@ -569,8 +593,32 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
}

if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {
+ /* Write protect page faults */
wp_range(uffd, msg->arg.pagefault.address, page_size, false);
stats->wp_faults++;
+ } else if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR) {
+ uint8_t *area;
+ int b;
+
+ /*
+ * Minor page faults
+ *
+ * To prove we can modify the original range for testing
+ * purposes, we're going to bit flip this range before
+ * continuing.
+ *
+ * Note that this requires all minor page fault tests operate on
+ * area_dst (non-UFFD-registered) and area_dst_alias
+ * (UFFD-registered).
+ */
+
+ area = (uint8_t *)(area_dst +
+ ((char *)msg->arg.pagefault.address -
+ area_dst_alias));
+ for (b = 0; b < page_size; ++b)
+ area[b] = ~area[b];
+ continue_range(uffd, msg->arg.pagefault.address, page_size);
+ stats->minor_faults++;
} else {
/* Missing page faults */
if (bounces & BOUNCE_VERIFY &&
@@ -779,7 +827,7 @@ static int stress(struct uffd_stats *uffd_stats)
return 0;
}

-static int userfaultfd_open(int features)
+static int userfaultfd_open_ext(uint64_t *features)
{
struct uffdio_api uffdio_api;

@@ -792,7 +840,7 @@ static int userfaultfd_open(int features)
uffd_flags = fcntl(uffd, F_GETFD, NULL);

uffdio_api.api = UFFD_API;
- uffdio_api.features = features;
+ uffdio_api.features = *features;
if (ioctl(uffd, UFFDIO_API, &uffdio_api)) {
fprintf(stderr, "UFFDIO_API failed.\nPlease make sure to "
"run with either root or ptrace capability.\n");
@@ -804,9 +852,15 @@ static int userfaultfd_open(int features)
return 1;
}

+ *features = uffdio_api.features;
return 0;
}

+static int userfaultfd_open(uint64_t features)
+{
+ return userfaultfd_open_ext(&features);
+}
+
sigjmp_buf jbuf, *sigbuf;

static void sighndl(int sig, siginfo_t *siginfo, void *ptr)
@@ -1112,7 +1166,7 @@ static int userfaultfd_events_test(void)
}

if (!pid)
- return faulting_process(0);
+ exit(faulting_process(0));

waitpid(pid, &err, 0);
if (err) {
@@ -1215,6 +1269,102 @@ static int userfaultfd_sig_test(void)
return userfaults != 0;
}

+static int userfaultfd_minor_test(void)
+{
+ struct uffdio_register uffdio_register;
+ unsigned long expected_ioctls;
+ unsigned long p;
+ pthread_t uffd_mon;
+ uint8_t expected_byte;
+ void *expected_page;
+ char c;
+ struct uffd_stats stats = { 0 };
+ uint64_t features = UFFD_FEATURE_MINOR_HUGETLBFS;
+
+ if (!test_uffdio_minor)
+ return 0;
+
+ printf("testing minor faults: ");
+ fflush(stdout);
+
+ if (uffd_test_ops->release_pages(area_dst))
+ return 1;
+
+ if (userfaultfd_open_ext(&features))
+ return 1;
+ /* If kernel reports the feature isn't supported, skip the test. */
+ if (!(features & UFFD_FEATURE_MINOR_HUGETLBFS)) {
+ printf("skipping test due to lack of feature support\n");
+ fflush(stdout);
+ return 0;
+ }
+
+ uffdio_register.range.start = (unsigned long)area_dst_alias;
+ uffdio_register.range.len = nr_pages * page_size;
+ uffdio_register.mode = UFFDIO_REGISTER_MODE_MINOR;
+ if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
+ fprintf(stderr, "register failure\n");
+ exit(1);
+ }
+
+ expected_ioctls = uffd_test_ops->expected_ioctls;
+ expected_ioctls |= 1 << _UFFDIO_CONTINUE;
+ if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) {
+ fprintf(stderr, "unexpected missing ioctl(s)\n");
+ exit(1);
+ }
+
+ /*
+ * After registering with UFFD, populate the non-UFFD-registered side of
+ * the shared mapping. This should *not* trigger any UFFD minor faults.
+ */
+ for (p = 0; p < nr_pages; ++p) {
+ memset(area_dst + (p * page_size), p % ((uint8_t)-1),
+ page_size);
+ }
+
+ if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) {
+ perror("uffd_poll_thread create");
+ exit(1);
+ }
+
+ /*
+ * Read each of the pages back using the UFFD-registered mapping. We
+ * expect that the first time we touch a page, it will result in a minor
+ * fault. uffd_poll_thread will resolve the fault by bit-flipping the
+ * page's contents, and then issuing a CONTINUE ioctl.
+ */
+
+ if (posix_memalign(&expected_page, page_size, page_size)) {
+ fprintf(stderr, "out of memory\n");
+ return 1;
+ }
+
+ for (p = 0; p < nr_pages; ++p) {
+ expected_byte = ~((uint8_t)(p % ((uint8_t)-1)));
+ memset(expected_page, expected_byte, page_size);
+ if (my_bcmp(expected_page, area_dst_alias + (p * page_size),
+ page_size)) {
+ fprintf(stderr,
+ "unexpected page contents after minor fault\n");
+ exit(1);
+ }
+ }
+
+ if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) {
+ perror("pipe write");
+ exit(1);
+ }
+ if (pthread_join(uffd_mon, NULL))
+ return 1;
+
+ close(uffd);
+
+ uffd_stats_report(&stats, 1);
+
+ return stats.missing_faults != 0 || stats.minor_faults != nr_pages;
+}
+
static int userfaultfd_stress(void)
{
void *area;
@@ -1413,7 +1563,7 @@ static int userfaultfd_stress(void)

close(uffd);
return userfaultfd_zeropage_test() || userfaultfd_sig_test()
- || userfaultfd_events_test();
+ || userfaultfd_events_test() || userfaultfd_minor_test();
}

/*
@@ -1454,6 +1604,8 @@ static void set_test_type(const char *type)
map_shared = true;
test_type = TEST_HUGETLB;
uffd_test_ops = &hugetlb_uffd_test_ops;
+ /* Minor faults require shared hugetlb; only enable here. */
+ test_uffdio_minor = true;
} else if (!strcmp(type, "shmem")) {
map_shared = true;
test_type = TEST_SHMEM;
--
2.30.0.617.g56c4b15f3c-goog

2021-02-23 15:42:19

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] userfaultfd: add UFFDIO_CONTINUE ioctl

On Thu, Feb 18, 2021 at 04:48:22PM -0800, Axel Rasmussen wrote:
> @@ -4645,8 +4646,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> spinlock_t *ptl;
> int ret;
> struct page *page;
> + int writable;
>
> - if (!*pagep) {
> + mapping = dst_vma->vm_file->f_mapping;
> + idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> +
> + if (is_continue) {
> + ret = -EFAULT;
> + page = find_lock_page(mapping, idx);
> + *pagep = NULL;

Why set *pagep to NULL? Shouldn't it be NULL always?.. If that's the case,
maybe WARN_ON_ONCE(*pagep) suite more.

Otherwise the patch looks good to me.

Thanks,

--
Peter Xu

2021-02-23 18:09:03

by Axel Rasmussen

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] userfaultfd: add UFFDIO_CONTINUE ioctl

On Tue, Feb 23, 2021 at 7:38 AM Peter Xu <[email protected]> wrote:
>
> On Thu, Feb 18, 2021 at 04:48:22PM -0800, Axel Rasmussen wrote:
> > @@ -4645,8 +4646,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > spinlock_t *ptl;
> > int ret;
> > struct page *page;
> > + int writable;
> >
> > - if (!*pagep) {
> > + mapping = dst_vma->vm_file->f_mapping;
> > + idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> > +
> > + if (is_continue) {
> > + ret = -EFAULT;
> > + page = find_lock_page(mapping, idx);
> > + *pagep = NULL;
>
> Why set *pagep to NULL? Shouldn't it be NULL always?.. If that's the case,
> maybe WARN_ON_ONCE(*pagep) suite more.

Right, the caller should be passing in NULL in the
MCOPY_ATOMIC_CONTINUE case. Looking more closely at the caller
(__mcopy_atomic_hugetlb), it already has a BUG_ON(page), so at best
this assignment is redundant, and at worst it might actually cover up
a real bug (say the caller mistakenly *did* pass in some page, we'd
set it to NULL and the BUG_ON wouldn't trigger).

So, I'll just remove this - I don't think an additional WARN_ON_ONCE
is needed given the existing BUG_ON.

>
> Otherwise the patch looks good to me.

Shall I add a R-B? :)

Thanks for taking the time to review Peter!

>
> Thanks,
>
> --
> Peter Xu
>

2021-02-23 18:32:53

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v7 4/6] userfaultfd: add UFFDIO_CONTINUE ioctl

On Tue, Feb 23, 2021 at 10:05:49AM -0800, Axel Rasmussen wrote:
> On Tue, Feb 23, 2021 at 7:38 AM Peter Xu <[email protected]> wrote:
> >
> > On Thu, Feb 18, 2021 at 04:48:22PM -0800, Axel Rasmussen wrote:
> > > @@ -4645,8 +4646,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > > spinlock_t *ptl;
> > > int ret;
> > > struct page *page;
> > > + int writable;
> > >
> > > - if (!*pagep) {
> > > + mapping = dst_vma->vm_file->f_mapping;
> > > + idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> > > +
> > > + if (is_continue) {
> > > + ret = -EFAULT;
> > > + page = find_lock_page(mapping, idx);
> > > + *pagep = NULL;
> >
> > Why set *pagep to NULL? Shouldn't it be NULL always?.. If that's the case,
> > maybe WARN_ON_ONCE(*pagep) suite more.
>
> Right, the caller should be passing in NULL in the
> MCOPY_ATOMIC_CONTINUE case. Looking more closely at the caller
> (__mcopy_atomic_hugetlb), it already has a BUG_ON(page), so at best
> this assignment is redundant, and at worst it might actually cover up
> a real bug (say the caller mistakenly *did* pass in some page, we'd
> set it to NULL and the BUG_ON wouldn't trigger).
>
> So, I'll just remove this - I don't think an additional WARN_ON_ONCE
> is needed given the existing BUG_ON.

It's still okay to have the WARN_ON_ONCE; it gives a direct hint that *pagep
should never be set for uffdio_continue. No strong opinion.

>
> >
> > Otherwise the patch looks good to me.
>
> Shall I add a R-B? :)

Yes, as long as "*pagep = NULL" dropped, please feel free to. :)

Thanks,

--
Peter Xu

2021-02-25 08:20:41

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v7 2/6] userfaultfd: disable huge PMD sharing for MINOR registered VMAs

On 2/18/21 4:48 PM, Axel Rasmussen wrote:
> As the comment says: for the MINOR fault use case, although the page
> might be present and populated in the other (non-UFFD-registered) half
> of the mapping, it may be out of date, and we explicitly want userspace
> to get a minor fault so it can check and potentially update the page's
> contents.
>
> Huge PMD sharing would prevent these faults from occurring for
> suitably aligned areas, so disable it upon UFFD registration.
>
> Reviewed-by: Peter Xu <[email protected]>
> Signed-off-by: Axel Rasmussen <[email protected]>

Thanks,

Reviewed-by: Mike Kravetz <[email protected]>
--
Mike Kravetz