Overview
========

This series adds a new userfaultfd registration mode,
UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
By "minor" fault, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
One of the mappings is registered with userfaultfd (in minor mode), and the
other is not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what I'm
calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
have huge_pte_none(), but find_lock_page() finds an existing page.

We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
userspace resolves the fault by either a) doing nothing if the contents are
already correct, or b) updating the underlying contents using the second,
non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
etc.). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
"I have ensured the page contents are correct, carry on setting up the mapping".
Use Case
========

Consider the use case of VM live migration (e.g. under QEMU/KVM):

1. While a VM is still running, we copy the contents of its memory to a
   target machine. The pages are populated on the target by writing to the
   non-UFFD mapping, using the setup described above. The VM is still running
   (and therefore its memory is likely changing), so this may be repeated
   several times, until we decide the target is "up to date enough".

2. We pause the VM on the source, and start executing on the target machine.
   During this gap, the VM's user(s) will *see* a pause, so it is desirable to
   minimize this window.

3. Between the last time any page was copied from the source to the target, and
   when the VM was paused, the contents of that page may have changed - and
   therefore the copy we have on the target machine is out of date. Although we
   can keep track of which pages are out of date, for VMs with large amounts of
   memory, it is "slow" to transfer this information to the target machine. We
   want to resume execution before such a transfer would complete.

4. So, the guest begins executing on the target machine. The first time it
   touches its memory (via the UFFD-registered mapping), userspace wants to
   intercept this fault. Userspace checks whether or not the page is up to date,
   and if not, copies the updated page from the source machine, via the non-UFFD
   mapping. Finally, whether a copy was performed or not, userspace issues a
   UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
   are correct, carry on setting up the mapping" (see the handler sketch below).

We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
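
As a rough illustration of step 4 (and of the background copying note above),
a handler thread might look like the following sketch. page_is_up_to_date()
and fetch_page_from_source() are hypothetical helpers, and error handling is
omitted:

#include <linux/userfaultfd.h>
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helpers provided by the migration code. */
extern bool page_is_up_to_date(size_t offset);
extern void fetch_page_from_source(void *dst, size_t len);

static void handle_minor_faults(int uffd, char *uffd_map, char *plain_map,
                                size_t page_size)
{
        struct pollfd pfd = { .fd = uffd, .events = POLLIN };
        struct uffd_msg msg;

        for (;;) {
                poll(&pfd, 1, -1);
                if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                        continue;
                if (msg.event != UFFD_EVENT_PAGEFAULT ||
                    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR))
                        continue;

                unsigned long addr =
                        msg.arg.pagefault.address & ~(page_size - 1);
                size_t off = addr - (unsigned long)uffd_map;

                /* Fix up the target's copy via the non-UFFD mapping if it is
                 * out of date (hypothetical helpers). */
                if (!page_is_up_to_date(off))
                        fetch_page_from_source(plain_map + off, page_size);

                /* "I have ensured the page contents are correct, carry on
                 * setting up the mapping." */
                struct uffdio_continue cont = {
                        .range = { .start = addr, .len = page_size },
                        .mode = 0,
                };
                ioctl(uffd, UFFDIO_CONTINUE, &cont);
        }
}
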
Interaction with Existing APIs
==============================

Because it's possible to combine registration modes (e.g. a single VMA can be
userfaultfd-registered MINOR | MISSING), and because it's up to userspace how to
resolve faults once they are received, I spent some time thinking through how
the existing API interacts with the new feature.

UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:

- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.

UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults. Without
modifications, the existing codepath assumes a new page needs to be allocated.
This is okay, since userspace must have a second non-UFFD-registered mapping
anyway, thus there isn't much reason to want to use these in any case (just
memcpy or memset or similar).

- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
  in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
  -ENOENT in that case (regardless of the kind of fault).
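
For example, with a combined MINOR | MISSING registration, a manager could
first try UFFDIO_COPY and fall back to UFFDIO_CONTINUE when the -EEXIST result
indicates the fault was actually a minor one. A sketch (it assumes the
contents have already been made correct via the non-UFFD mapping, and omits
error handling):

#include <errno.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

static int resolve_fault(int uffd, unsigned long dst, unsigned long src,
                         unsigned long len)
{
        struct uffdio_copy copy = {
                .dst = dst, .src = src, .len = len, .mode = 0,
        };

        if (ioctl(uffd, UFFDIO_COPY, &copy) == 0)
                return 0;       /* MISSING fault: a new page was allocated */

        if (copy.copy == -EEXIST) {
                /* A page already exists, so this was a minor fault; the
                 * contents must already be correct (fixed up, if needed,
                 * via the non-UFFD mapping) before continuing. */
                struct uffdio_continue cont = {
                        .range = { .start = dst, .len = len },
                        .mode = 0,
                };
                return ioctl(uffd, UFFDIO_CONTINUE, &cont);
        }

        return -1;
}
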
Remaining Work
==============

This patchset doesn't include updates to userfaultfd's documentation or
selftests. These will be added before I send a non-RFC version of this series
(I want to find out if there are strong objections to the API surface before
spending the time to document it).

Currently the patchset only supports hugetlbfs. There is no reason it can't work
with shmem, but I expect hugetlbfs to be much more commonly used since we're
talking about backing guest memory for VMs. I plan to implement shmem support in
a follow-up patch series.

Axel Rasmussen (2):
userfaultfd: add minor fault registration mode
userfaultfd: add UFFDIO_CONTINUE ioctl
fs/proc/task_mmu.c | 1 +
fs/userfaultfd.c | 143 ++++++++++++++++++++++++-------
include/linux/mm.h | 1 +
include/linux/userfaultfd_k.h | 14 ++-
include/trace/events/mmflags.h | 1 +
include/uapi/linux/userfaultfd.h | 36 +++++++-
mm/hugetlb.c | 42 +++++++--
mm/userfaultfd.c | 86 ++++++++++++++-----
8 files changed, 261 insertions(+), 63 deletions(-)
--
2.29.2.729.g45daf8777d-goog
This ioctl is how userspace ought to resolve "minor" userfaults. The
idea is, userspace is notified that a minor fault has occurred. It might
change the contents of the page using its second non-UFFD mapping, or
not. Then, it calls UFFDIO_CONTINUE to tell the kernel "I have ensured
the page contents are correct, carry on setting up the mapping".

Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR-registered VMAs. ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.

It turns out hugetlb_mcopy_atomic_pte() already does very close to what
we want, if an existing page is provided via `struct page **pagep`. We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case. (Basically, look up the existing page. This in turn causes
hugetlb_mcopy_atomic_pte() to not (double-)add the page to the page
cache, and it also avoids set_page_huge_active(), as that would have
been done when the page was allocated.)

Signed-off-by: Axel Rasmussen <[email protected]>
---
fs/userfaultfd.c | 63 +++++++++++++++++++++++
include/linux/userfaultfd_k.h | 2 +
include/uapi/linux/userfaultfd.h | 21 +++++++-
mm/hugetlb.c | 11 ++--
mm/userfaultfd.c | 86 ++++++++++++++++++++++++--------
5 files changed, 154 insertions(+), 29 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 0a661422eb19..08ff561426c1 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1840,6 +1840,66 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
return ret;
}
+static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
+{
+ __s64 ret;
+ struct uffdio_continue uffdio_continue;
+ struct uffdio_continue __user *user_uffdio_continue;
+ struct userfaultfd_wake_range range;
+
+ user_uffdio_continue = (struct uffdio_continue __user *)arg;
+
+ ret = -EAGAIN;
+ if (READ_ONCE(ctx->mmap_changing))
+ goto out;
+
+ ret = -EFAULT;
+ if (copy_from_user(&uffdio_continue, user_uffdio_continue,
+ /* don't copy the output fields */
+ sizeof(uffdio_continue) - (sizeof(__s64))))
+ goto out;
+
+ ret = validate_range(ctx->mm, &uffdio_continue.range.start,
+ uffdio_continue.range.len);
+ if (ret)
+ goto out;
+
+ ret = -EINVAL;
+ /* double check for wraparound just in case. */
+ if (uffdio_continue.range.start + uffdio_continue.range.len <=
+ uffdio_continue.range.start) {
+ goto out;
+ }
+ if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
+ goto out;
+
+ if (mmget_not_zero(ctx->mm)) {
+ ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
+ uffdio_continue.range.len,
+ &ctx->mmap_changing);
+ mmput(ctx->mm);
+ } else {
+ return -ESRCH;
+ }
+
+ if (unlikely(put_user(ret, &user_uffdio_continue->mapped)))
+ return -EFAULT;
+ if (ret < 0)
+ goto out;
+
+ /* len == 0 would wake all */
+ BUG_ON(!ret);
+ range.len = ret;
+ if (!(uffdio_continue.mode & UFFDIO_CONTINUE_MODE_DONTWAKE)) {
+ range.start = uffdio_continue.range.start;
+ wake_userfault(ctx, &range);
+ }
+ ret = range.len == uffdio_continue.range.len ? 0 : -EAGAIN;
+
+out:
+ return ret;
+}
+
static inline unsigned int uffd_ctx_features(__u64 user_features)
{
/*
@@ -1924,6 +1984,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
case UFFDIO_WRITEPROTECT:
ret = userfaultfd_writeprotect(ctx, arg);
break;
+ case UFFDIO_CONTINUE:
+ ret = userfaultfd_continue(ctx, arg);
+ break;
}
return ret;
}
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 0c8c5fa5efc8..b1f5aaec8dad 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -41,6 +41,8 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long len,
bool *mmap_changing);
+extern ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long dst_start,
+ unsigned long len, bool *mmap_changing);
extern int mwriteprotect_range(struct mm_struct *dst_mm,
unsigned long start, unsigned long len,
bool enable_wp, bool *mmap_changing);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 1cc2cd8a5279..9a48305f4bdd 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -40,10 +40,12 @@
((__u64)1 << _UFFDIO_WAKE | \
(__u64)1 << _UFFDIO_COPY | \
(__u64)1 << _UFFDIO_ZEROPAGE | \
- (__u64)1 << _UFFDIO_WRITEPROTECT)
+ (__u64)1 << _UFFDIO_WRITEPROTECT | \
+ (__u64)1 << _UFFDIO_CONTINUE)
#define UFFD_API_RANGE_IOCTLS_BASIC \
((__u64)1 << _UFFDIO_WAKE | \
- (__u64)1 << _UFFDIO_COPY)
+ (__u64)1 << _UFFDIO_COPY | \
+ (__u64)1 << _UFFDIO_CONTINUE)
/*
* Valid ioctl command number range with this API is from 0x00 to
@@ -59,6 +61,7 @@
#define _UFFDIO_COPY (0x03)
#define _UFFDIO_ZEROPAGE (0x04)
#define _UFFDIO_WRITEPROTECT (0x06)
+#define _UFFDIO_CONTINUE (0x07)
#define _UFFDIO_API (0x3F)
/* userfaultfd ioctl ids */
@@ -77,6 +80,8 @@
struct uffdio_zeropage)
#define UFFDIO_WRITEPROTECT _IOWR(UFFDIO, _UFFDIO_WRITEPROTECT, \
struct uffdio_writeprotect)
+#define UFFDIO_CONTINUE _IOR(UFFDIO, _UFFDIO_CONTINUE, \
+ struct uffdio_continue)
/* read() structure */
struct uffd_msg {
@@ -268,6 +273,18 @@ struct uffdio_writeprotect {
__u64 mode;
};
+struct uffdio_continue {
+ struct uffdio_range range;
+#define UFFDIO_CONTINUE_MODE_DONTWAKE ((__u64)1<<0)
+ __u64 mode;
+
+ /*
+ * Fields below here are written by the ioctl and must be at the end:
+ * the copy_from_user will not read past here.
+ */
+ __s64 mapped;
+};
+
/*
* Flags for the userfaultfd(2) system call itself.
*/
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0ba8f2f5a4ae..889b9cf7f107 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4666,12 +4666,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
spinlock_t *ptl;
int ret;
struct page *page;
+ bool new_page = false;
if (!*pagep) {
ret = -ENOMEM;
page = alloc_huge_page(dst_vma, dst_addr, 0);
if (IS_ERR(page))
goto out;
+ new_page = true;
ret = copy_huge_page_from_user(page,
(const void __user *) src_addr,
@@ -4699,10 +4701,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
mapping = dst_vma->vm_file->f_mapping;
idx = vma_hugecache_offset(h, dst_vma, dst_addr);
- /*
- * If shared, add to page cache
- */
- if (vm_shared) {
+ /* Add shared, newly allocated pages to the page cache. */
+ if (vm_shared && new_page) {
size = i_size_read(mapping->host) >> huge_page_shift(h);
ret = -EFAULT;
if (idx >= size)
@@ -4762,7 +4762,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
update_mmu_cache(dst_vma, dst_addr, dst_pte);
spin_unlock(ptl);
- set_page_huge_active(page);
+ if (new_page)
+ set_page_huge_active(page);
if (vm_shared)
unlock_page(page);
ret = 0;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9a3d451402d7..15ba09bb4ceb 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -197,6 +197,16 @@ static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
return pmd_alloc(mm, pud, address);
}
+/* The mode of operation for __mcopy_atomic and its helpers. */
+enum mcopy_atomic_mode {
+ /* A normal copy_from_user into the destination range. */
+ MCOPY_ATOMIC_NORMAL,
+ /* Don't copy; map the destination range to the zero page. */
+ MCOPY_ATOMIC_ZEROPAGE,
+ /* Just setup the dst_vma, without modifying the underlying page(s). */
+ MCOPY_ATOMIC_CONTINUE,
+};
+
#ifdef CONFIG_HUGETLB_PAGE
/*
* __mcopy_atomic processing for HUGETLB vmas. Note that this routine is
@@ -207,7 +217,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- bool zeropage)
+ enum mcopy_atomic_mode mode)
{
int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED;
int vm_shared = dst_vma->vm_flags & VM_SHARED;
@@ -227,7 +237,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
* by THP. Since we can not reliably insert a zero page, this
* feature is not supported.
*/
- if (zeropage) {
+ if (mode == MCOPY_ATOMIC_ZEROPAGE) {
mmap_read_unlock(dst_mm);
return -EINVAL;
}
@@ -273,8 +283,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
}
while (src_addr < src_start + len) {
- pte_t dst_pteval;
-
BUG_ON(dst_addr >= dst_start + len);
/*
@@ -297,12 +305,22 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
goto out_unlock;
}
- err = -EEXIST;
- dst_pteval = huge_ptep_get(dst_pte);
- if (!huge_pte_none(dst_pteval)) {
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- i_mmap_unlock_read(mapping);
- goto out_unlock;
+ if (mode == MCOPY_ATOMIC_CONTINUE) {
+ /* hugetlb_mcopy_atomic_pte unlocks the page. */
+ page = find_lock_page(mapping, idx);
+ if (!page) {
+ err = -EFAULT;
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+ goto out_unlock;
+ }
+ } else {
+ if (!huge_pte_none(huge_ptep_get(dst_pte))) {
+ err = -EEXIST;
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+ goto out_unlock;
+ }
}
err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
@@ -408,7 +426,7 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- bool zeropage);
+ enum mcopy_atomic_mode mode);
#endif /* CONFIG_HUGETLB_PAGE */
static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
@@ -417,7 +435,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
unsigned long dst_addr,
unsigned long src_addr,
struct page **page,
- bool zeropage,
+ enum mcopy_atomic_mode mode,
bool wp_copy)
{
ssize_t err;
@@ -433,22 +451,38 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
* and not in the radix tree.
*/
if (!(dst_vma->vm_flags & VM_SHARED)) {
- if (!zeropage)
+ switch (mode) {
+ case MCOPY_ATOMIC_NORMAL:
err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
dst_addr, src_addr, page,
wp_copy);
- else
+ break;
+ case MCOPY_ATOMIC_ZEROPAGE:
err = mfill_zeropage_pte(dst_mm, dst_pmd,
dst_vma, dst_addr);
+ break;
+ /* It only makes sense to CONTINUE for shared memory. */
+ case MCOPY_ATOMIC_CONTINUE:
+ err = -EINVAL;
+ break;
+ }
} else {
VM_WARN_ON_ONCE(wp_copy);
- if (!zeropage)
+ switch (mode) {
+ case MCOPY_ATOMIC_NORMAL:
err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
dst_vma, dst_addr,
src_addr, page);
- else
+ break;
+ case MCOPY_ATOMIC_ZEROPAGE:
err = shmem_mfill_zeropage_pte(dst_mm, dst_pmd,
dst_vma, dst_addr);
+ break;
+ case MCOPY_ATOMIC_CONTINUE:
+ /* FIXME: Add minor fault interception for shmem. */
+ err = -EINVAL;
+ break;
+ }
}
return err;
@@ -458,7 +492,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- bool zeropage,
+ enum mcopy_atomic_mode mcopy_mode,
bool *mmap_changing,
__u64 mode)
{
@@ -527,7 +561,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
*/
if (is_vm_hugetlb_page(dst_vma))
return __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
- src_start, len, zeropage);
+ src_start, len, mcopy_mode);
if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
goto out_unlock;
@@ -577,7 +611,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
BUG_ON(pmd_trans_huge(*dst_pmd));
err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
- src_addr, &page, zeropage, wp_copy);
+ src_addr, &page, mcopy_mode, wp_copy);
cond_resched();
if (unlikely(err == -ENOENT)) {
@@ -626,14 +660,22 @@ ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
unsigned long src_start, unsigned long len,
bool *mmap_changing, __u64 mode)
{
- return __mcopy_atomic(dst_mm, dst_start, src_start, len, false,
- mmap_changing, mode);
+ return __mcopy_atomic(dst_mm, dst_start, src_start, len,
+ MCOPY_ATOMIC_NORMAL, mmap_changing, mode);
}
ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
unsigned long len, bool *mmap_changing)
{
- return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
+ return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
+ mmap_changing, 0);
+}
+
+ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
+ unsigned long len, bool *mmap_changing)
+{
+ return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE,
+ mmap_changing, 0);
}
int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
--
2.29.2.729.g45daf8777d-goog
This feature allows userspace to intercept "minor" faults. By "minor"
fault, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory). One of the mappings is registered with userfaultfd (in minor
mode), and the other is not. Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents. The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.

This commit adds the new registration mode, and sets the relevant flag
on the VMAs being registered. In the hugetlb fault path, if we find
that we have huge_pte_none(), but find_lock_page() does indeed find an
existing page, then we have a "minor" fault, and if the VMA has the
userfaultfd registration flag, we call into userfaultfd to handle it.

Why add a new registration mode, as opposed to adding a feature to
MISSING registration, like UFFD_FEATURE_SIGBUS?

- The semantics are significantly different. UFFDIO_COPY or
  UFFDIO_ZEROPAGE do not make sense for these minor faults; userspace
  would instead just memset() or memcpy() or whatever via the non-UFFD
  mapping. Unlike MISSING registration, MINOR registration only makes
  sense for shared memory (hugetlbfs or shmem [to be supported in future
  commits]).

- Doing so would make handle_userfault()'s "reason" argument confusing.
  We'd pass in "MISSING" even if the pages weren't really missing.

Signed-off-by: Axel Rasmussen <[email protected]>
---
fs/proc/task_mmu.c | 1 +
fs/userfaultfd.c | 80 +++++++++++++++++++-------------
include/linux/mm.h | 1 +
include/linux/userfaultfd_k.h | 12 ++++-
include/trace/events/mmflags.h | 1 +
include/uapi/linux/userfaultfd.h | 15 +++++-
mm/hugetlb.c | 31 +++++++++++++
7 files changed, 107 insertions(+), 34 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee5a235b3056..108faf719a83 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
[ilog2(VM_MTE)] = "mt",
[ilog2(VM_MTE_ALLOWED)] = "",
#endif
+ [ilog2(VM_UFFD_MINOR)] = "ui",
#ifdef CONFIG_ARCH_HAS_PKEYS
/* These come out via ProtectionKey: */
[ilog2(VM_PKEY_BIT0)] = "",
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 894cc28142e7..0a661422eb19 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -28,6 +28,8 @@
#include <linux/security.h>
#include <linux/hugetlb.h>
+#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
+
int sysctl_unprivileged_userfaultfd __read_mostly;
static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
@@ -196,24 +198,21 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
msg_init(&msg);
msg.event = UFFD_EVENT_PAGEFAULT;
msg.arg.pagefault.address = address;
+ /*
+ * These flags indicate why the userfault occurred:
+ * - UFFD_PAGEFAULT_FLAG_WP indicates a write protect fault.
+ * - UFFD_PAGEFAULT_FLAG_MINOR indicates a minor fault.
+ * - Neither of these flags being set indicates a MISSING fault.
+ *
+ * Separately, UFFD_PAGEFAULT_FLAG_WRITE indicates it was a write
+ * fault. Otherwise, it was a read fault.
+ */
if (flags & FAULT_FLAG_WRITE)
- /*
- * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
- * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WRITE
- * was not set in a UFFD_EVENT_PAGEFAULT, it means it
- * was a read fault, otherwise if set it means it's
- * a write fault.
- */
msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WRITE;
if (reason & VM_UFFD_WP)
- /*
- * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
- * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WP was
- * not set in a UFFD_EVENT_PAGEFAULT, it means it was
- * a missing fault, otherwise if set it means it's a
- * write protect fault.
- */
msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WP;
+ if (reason & VM_UFFD_MINOR)
+ msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_MINOR;
if (features & UFFD_FEATURE_THREAD_ID)
msg.arg.pagefault.feat.ptid = task_pid_vnr(current);
return msg;
@@ -400,8 +399,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
BUG_ON(ctx->mm != mm);
- VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
- VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
+ /* Any unrecognized flag is a bug. */
+ VM_BUG_ON(reason & ~__VM_UFFD_FLAGS);
+ /* 0 or > 1 flags set is a bug; we expect exactly 1. */
+ VM_BUG_ON(!reason || !!(reason & (reason - 1)));
if (ctx->features & UFFD_FEATURE_SIGBUS)
goto out;
@@ -611,7 +612,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
- vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+ vma->vm_flags &= ~__VM_UFFD_FLAGS;
}
mmap_write_unlock(mm);
@@ -643,7 +644,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
octx = vma->vm_userfaultfd_ctx.ctx;
if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
- vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+ vma->vm_flags &= ~__VM_UFFD_FLAGS;
return 0;
}
@@ -725,7 +726,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
} else {
/* Drop uffd context if remap feature not enabled */
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
- vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+ vma->vm_flags &= ~__VM_UFFD_FLAGS;
}
}
@@ -866,12 +867,12 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
for (vma = mm->mmap; vma; vma = vma->vm_next) {
cond_resched();
BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
- !!(vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+ !!(vma->vm_flags & __VM_UFFD_FLAGS));
if (vma->vm_userfaultfd_ctx.ctx != ctx) {
prev = vma;
continue;
}
- new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
+ new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
new_flags, vma->anon_vma,
vma->vm_file, vma->vm_pgoff,
@@ -1260,9 +1261,26 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
unsigned long vm_flags)
{
/* FIXME: add WP support to hugetlbfs and shmem */
- return vma_is_anonymous(vma) ||
- ((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
- !(vm_flags & VM_UFFD_WP));
+ if (vm_flags & VM_UFFD_WP) {
+ if (is_vm_hugetlb_page(vma) || vma_is_shmem(vma))
+ return false;
+ }
+
+ if (vm_flags & VM_UFFD_MINOR) {
+ /*
+ * The use case for minor registration (intercepting minor
+ * faults) is to handle the case where a page is present, but
+ * needs to be modified before it can be used. This requires
+ * two mappings: one with UFFD registration, and one without.
+ * So, it only makes sense to do this with shared memory.
+ */
+ /* FIXME: Add minor fault interception for shmem. */
+ if (!(is_vm_hugetlb_page(vma) && (vma->vm_flags & VM_SHARED)))
+ return false;
+ }
+
+ return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+ vma_is_shmem(vma);
}
static int userfaultfd_register(struct userfaultfd_ctx *ctx,
@@ -1288,14 +1306,15 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
ret = -EINVAL;
if (!uffdio_register.mode)
goto out;
- if (uffdio_register.mode & ~(UFFDIO_REGISTER_MODE_MISSING|
- UFFDIO_REGISTER_MODE_WP))
+ if (uffdio_register.mode & ~UFFD_API_REGISTER_MODES)
goto out;
vm_flags = 0;
if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
vm_flags |= VM_UFFD_MISSING;
if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)
vm_flags |= VM_UFFD_WP;
+ if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR)
+ vm_flags |= VM_UFFD_MINOR;
ret = validate_range(mm, &uffdio_register.range.start,
uffdio_register.range.len);
@@ -1339,7 +1358,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
cond_resched();
BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
- !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+ !!(cur->vm_flags & __VM_UFFD_FLAGS));
/* check not compatible vmas */
ret = -EINVAL;
@@ -1419,8 +1438,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
start = vma->vm_start;
vma_end = min(end, vma->vm_end);
- new_flags = (vma->vm_flags &
- ~(VM_UFFD_MISSING|VM_UFFD_WP)) | vm_flags;
+ new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
prev = vma_merge(mm, prev, start, vma_end, new_flags,
vma->anon_vma, vma->vm_file, vma->vm_pgoff,
vma_policy(vma),
@@ -1539,7 +1557,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
cond_resched();
BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
- !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+ !!(cur->vm_flags & __VM_UFFD_FLAGS));
/*
* Check not compatible vmas, not strictly required
@@ -1590,7 +1608,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
wake_userfault(vma->vm_userfaultfd_ctx.ctx, &range);
}
- new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
+ new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
prev = vma_merge(mm, prev, start, vma_end, new_flags,
vma->anon_vma, vma->vm_file, vma->vm_pgoff,
vma_policy(vma),
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..1d7041bd3148 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -276,6 +276,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
#define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
#define VM_UFFD_WP 0x00001000 /* wrprotect pages tracking */
+#define VM_UFFD_MINOR 0x00002000 /* minor fault interception */
#define VM_LOCKED 0x00002000
#define VM_IO 0x00004000 /* Memory mapped I/O or similar */
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index a8e5f3ea9bb2..0c8c5fa5efc8 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -62,6 +62,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
return vma->vm_flags & VM_UFFD_WP;
}
+static inline bool userfaultfd_minor(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & VM_UFFD_MINOR;
+}
+
static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
pte_t pte)
{
@@ -76,7 +81,7 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
static inline bool userfaultfd_armed(struct vm_area_struct *vma)
{
- return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
+ return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR);
}
extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
@@ -123,6 +128,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
return false;
}
+static inline bool userfaultfd_minor(struct vm_area_struct *vma)
+{
+ return false;
+}
+
static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
pte_t pte)
{
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 67018d367b9f..2d583ffd4100 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -151,6 +151,7 @@ IF_HAVE_PG_ARCH_2(PG_arch_2, "arch_2" )
{VM_PFNMAP, "pfnmap" }, \
{VM_DENYWRITE, "denywrite" }, \
{VM_UFFD_WP, "uffd_wp" }, \
+ {VM_UFFD_MINOR, "uffd_minor" }, \
{VM_LOCKED, "locked" }, \
{VM_IO, "io" }, \
{VM_SEQ_READ, "seqread" }, \
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 5f2d88212f7c..1cc2cd8a5279 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -19,15 +19,19 @@
* means the userland is reading).
*/
#define UFFD_API ((__u64)0xAA)
+#define UFFD_API_REGISTER_MODES (UFFDIO_REGISTER_MODE_MISSING | \
+ UFFDIO_REGISTER_MODE_WP | \
+ UFFDIO_REGISTER_MODE_MINOR)
#define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \
UFFD_FEATURE_EVENT_FORK | \
UFFD_FEATURE_EVENT_REMAP | \
- UFFD_FEATURE_EVENT_REMOVE | \
+ UFFD_FEATURE_EVENT_REMOVE | \
UFFD_FEATURE_EVENT_UNMAP | \
UFFD_FEATURE_MISSING_HUGETLBFS | \
UFFD_FEATURE_MISSING_SHMEM | \
UFFD_FEATURE_SIGBUS | \
- UFFD_FEATURE_THREAD_ID)
+ UFFD_FEATURE_THREAD_ID | \
+ UFFD_FEATURE_MINOR_FAULT_HUGETLBFS)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -127,6 +131,7 @@ struct uffd_msg {
/* flags for UFFD_EVENT_PAGEFAULT */
#define UFFD_PAGEFAULT_FLAG_WRITE (1<<0) /* If this was a write fault */
#define UFFD_PAGEFAULT_FLAG_WP (1<<1) /* If reason is VM_UFFD_WP */
+#define UFFD_PAGEFAULT_FLAG_MINOR (1<<2) /* If reason is VM_UFFD_MINOR */
struct uffdio_api {
/* userland asks for an API number and the features to enable */
@@ -171,6 +176,10 @@ struct uffdio_api {
*
* UFFD_FEATURE_THREAD_ID pid of the page faulted task_struct will
* be returned, if feature is not requested 0 will be returned.
+ *
+ * UFFD_FEATURE_MINOR_FAULT_HUGETLBFS indicates that minor faults
+ * can be intercepted (via REGISTER_MODE_MINOR) for
+ * hugetlbfs-backed pages.
*/
#define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
#define UFFD_FEATURE_EVENT_FORK (1<<1)
@@ -181,6 +190,7 @@ struct uffdio_api {
#define UFFD_FEATURE_EVENT_UNMAP (1<<6)
#define UFFD_FEATURE_SIGBUS (1<<7)
#define UFFD_FEATURE_THREAD_ID (1<<8)
+#define UFFD_FEATURE_MINOR_FAULT_HUGETLBFS (1<<9)
__u64 features;
__u64 ioctls;
@@ -195,6 +205,7 @@ struct uffdio_register {
struct uffdio_range range;
#define UFFDIO_REGISTER_MODE_MISSING ((__u64)1<<0)
#define UFFDIO_REGISTER_MODE_WP ((__u64)1<<1)
+#define UFFDIO_REGISTER_MODE_MINOR ((__u64)1<<2)
__u64 mode;
/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a2602969873d..0ba8f2f5a4ae 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4377,6 +4377,37 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}
}
+ /* Check for page in userfault range. */
+ if (!new_page && userfaultfd_minor(vma)) {
+ u32 hash;
+ struct vm_fault vmf = {
+ .vma = vma,
+ .address = haddr,
+ .flags = flags,
+ /*
+ * Hard to debug if it ends up being used by a callee
+ * that assumes something about the other uninitialized
+ * fields... same as in memory.c
+ */
+ };
+
+ unlock_page(page);
+
+ /*
+ * hugetlb_fault_mutex and i_mmap_rwsem must be dropped before
+ * handling userfault. Reacquire after handling fault to make
+ * calling code simpler.
+ */
+
+ hash = hugetlb_fault_mutex_hash(mapping, idx);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+ ret = handle_userfault(&vmf, VM_UFFD_MINOR);
+ i_mmap_lock_read(mapping);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
+ goto out;
+ }
+
/*
* If we are going to COW a private mapping later, we examine the
* pending reservations for this page now. This will ensure that
--
2.29.2.729.g45daf8777d-goog
* Axel Rasmussen ([email protected]) wrote:
> Overview
> ========
>
> This series adds a new userfaultfd registration mode,
> UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
> By "minor" fault, I mean the following situation:
>
> Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
> One of the mappings is registered with userfaultfd (in minor mode), and the
> other is not. Via the non-UFFD mapping, the underlying pages have already been
> allocated & filled with some contents. The UFFD mapping has not yet been
> faulted in; when it is touched for the first time, this results in what I'm
> calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
> have huge_pte_none(), but find_lock_page() finds an existing page.
>
> We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
> userspace resolves the fault by either a) doing nothing if the contents are
> already correct, or b) updating the underlying contents using the second,
> non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
> or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
> "I have ensured the page contents are correct, carry on setting up the mapping".
>
> Use Case
> ========
>
> Consider the use case of VM live migration (e.g. under QEMU/KVM):
>
> 1. While a VM is still running, we copy the contents of its memory to a
> target machine. The pages are populated on the target by writing to the
> non-UFFD mapping, using the setup described above. The VM is still running
> (and therefore its memory is likely changing), so this may be repeated
> several times, until we decide the target is "up to date enough".
>
> 2. We pause the VM on the source, and start executing on the target machine.
> During this gap, the VM's user(s) will *see* a pause, so it is desirable to
> minimize this window.
>
> 3. Between the last time any page was copied from the source to the target, and
> when the VM was paused, the contents of that page may have changed - and
> therefore the copy we have on the target machine is out of date. Although we
> can keep track of which pages are out of date, for VMs with large amounts of
> memory, it is "slow" to transfer this information to the target machine. We
> want to resume execution before such a transfer would complete.
>
> 4. So, the guest begins executing on the target machine. The first time it
> touches its memory (via the UFFD-registered mapping), userspace wants to
> intercept this fault. Userspace checks whether or not the page is up to date,
> and if not, copies the updated page from the source machine, via the non-UFFD
> mapping. Finally, whether a copy was performed or not, userspace issues a
> UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
> are correct, carry on setting up the mapping".
>
> We don't have to do all of the final updates on-demand. The userfaultfd manager
> can, in the background, also copy over updated pages once it receives the map of
> which pages are up-to-date or not.
Yes, this would make the handover during postcopy of large VMs a heck of
a lot faster; and probably simpler; the cleanup code that tidies up the
re-dirty pages is pretty messy.
Dave
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
* Axel Rasmussen ([email protected]) wrote:
> This feature allows userspace to intercept "minor" faults. By "minor"
> fault, I mean the following situation:
>
> Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
> memory). One of the mappings is registered with userfaultfd (in minor
> mode), and the other is not. Via the non-UFFD mapping, the underlying
> pages have already been allocated & filled with some contents. The UFFD
> mapping has not yet been faulted in; when it is touched for the first
> time, this results in what I'm calling a "minor" fault. As a concrete
> example, when working with hugetlbfs, we have huge_pte_none(), but
> find_lock_page() finds an existing page.
>
> This commit adds the new registration mode, and sets the relevant flag
> on the VMAs being registered. In the hugetlb fault path, if we find
> that we have huge_pte_none(), but find_lock_page() does indeed find an
> existing page, then we have a "minor" fault, and if the VMA has the
> userfaultfd registration flag, we call into userfaultfd to handle it.
>
> Why add a new registration mode, as opposed to adding a feature to
> MISSING registration, like UFFD_FEATURE_SIGBUS?
>
> - The semantics are significantly different. UFFDIO_COPY or
> UFFDIO_ZEROPAGE do not make sense for these minor faults; userspace
> would instead just memset() or memcpy() or whatever via the non-UFFD
> mapping. Unlike MISSING registration, MINOR registration only makes
> sense for shared memory (hugetlbfs or shmem [to be supported in future
> commits]).
Is there a reason that UFFDIO_COPY can't work with this and a non-shared
mapping? and/or the fallocate or hole punching we currently use?
Dave
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
* Axel Rasmussen ([email protected]) wrote:
> On Mon, Jan 11, 2021 at 3:58 AM Dr. David Alan Gilbert
> <[email protected]> wrote:
> >
> > * Axel Rasmussen ([email protected]) wrote:
> > > This feature allows userspace to intercept "minor" faults. By "minor"
> > > fault, I mean the following situation:
> > >
> > > Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
> > > memory). One of the mappings is registered with userfaultfd (in minor
> > > mode), and the other is not. Via the non-UFFD mapping, the underlying
> > > pages have already been allocated & filled with some contents. The UFFD
> > > mapping has not yet been faulted in; when it is touched for the first
> > > time, this results in what I'm calling a "minor" fault. As a concrete
> > > example, when working with hugetlbfs, we have huge_pte_none(), but
> > > find_lock_page() finds an existing page.
> > >
> > > This commit adds the new registration mode, and sets the relevant flag
> > > on the VMAs being registered. In the hugetlb fault path, if we find
> > > that we have huge_pte_none(), but find_lock_page() does indeed find an
> > > existing page, then we have a "minor" fault, and if the VMA has the
> > > userfaultfd registration flag, we call into userfaultfd to handle it.
> > >
> > > Why add a new registration mode, as opposed to adding a feature to
> > > MISSING registration, like UFFD_FEATURE_SIGBUS?
> > >
> > > - The semantics are significantly different. UFFDIO_COPY or
> > > UFFDIO_ZEROPAGE do not make sense for these minor faults; userspace
> > > would instead just memset() or memcpy() or whatever via the non-UFFD
> > > mapping. Unlike MISSING registration, MINOR registration only makes
> > > sense for shared memory (hugetlbfs or shmem [to be supported in future
> > > commits]).
> >
> > Is there a reason that UFFDIO_COPY can't work with this and a non-shared
> > mapping? and/or the fallocate or hole punching we currently use?
>
> It could work, with a small modification. Mainly I just didn't see it
> being useful since memcpy() or similar works in this case. (We don't
> have the atomicity problem UFFDIO_COPY is trying to solve, since we're
> modifying via a non-UFFD VMA.)
I guess I was thinking how easily to modify my existing postcopy code;
> Maybe another (weaker?) argument against it is, if it happily accepts
> VMAs both with and without backing pages, it may be easier to misuse.
> E.g., if you mistakenly give it the wrong address, and it overwrites
> some existing page you didn't mean to modify.
>
> I don't feel particularly strongly, if it strikes you as cleaner to
> just support it, I'll do so in the next revision.
Not necessarily; I was just thinking still in the way I'm used to with
the existing interface.
Dave
> > >
> > > if (ctx->features & UFFD_FEATURE_SIGBUS)
> > > goto out;
> > > @@ -611,7 +612,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
> > > for (vma = mm->mmap; vma; vma = vma->vm_next)
> > > if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
> > > vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> > > - vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> > > + vma->vm_flags &= ~__VM_UFFD_FLAGS;
> > > }
> > > mmap_write_unlock(mm);
> > >
> > > @@ -643,7 +644,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
> > > octx = vma->vm_userfaultfd_ctx.ctx;
> > > if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
> > > vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> > > - vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> > > + vma->vm_flags &= ~__VM_UFFD_FLAGS;
> > > return 0;
> > > }
> > >
> > > @@ -725,7 +726,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
> > > } else {
> > > /* Drop uffd context if remap feature not enabled */
> > > vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> > > - vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> > > + vma->vm_flags &= ~__VM_UFFD_FLAGS;
> > > }
> > > }
> > >
> > > @@ -866,12 +867,12 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
> > > for (vma = mm->mmap; vma; vma = vma->vm_next) {
> > > cond_resched();
> > > BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
> > > - !!(vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> > > + !!(vma->vm_flags & __VM_UFFD_FLAGS));
> > > if (vma->vm_userfaultfd_ctx.ctx != ctx) {
> > > prev = vma;
> > > continue;
> > > }
> > > - new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
> > > + new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
> > > prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
> > > new_flags, vma->anon_vma,
> > > vma->vm_file, vma->vm_pgoff,
> > > @@ -1260,9 +1261,26 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
> > > unsigned long vm_flags)
> > > {
> > > /* FIXME: add WP support to hugetlbfs and shmem */
> > > - return vma_is_anonymous(vma) ||
> > > - ((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
> > > - !(vm_flags & VM_UFFD_WP));
> > > + if (vm_flags & VM_UFFD_WP) {
> > > + if (is_vm_hugetlb_page(vma) || vma_is_shmem(vma))
> > > + return false;
> > > + }
> > > +
> > > + if (vm_flags & VM_UFFD_MINOR) {
> > > + /*
> > > + * The use case for minor registration (intercepting minor
> > > + * faults) is to handle the case where a page is present, but
> > > + * needs to be modified before it can be used. This requires
> > > + * two mappings: one with UFFD registration, and one without.
> > > + * So, it only makes sense to do this with shared memory.
> > > + */
> > > + /* FIXME: Add minor fault interception for shmem. */
> > > + if (!(is_vm_hugetlb_page(vma) && (vma->vm_flags & VM_SHARED)))
> > > + return false;
> > > + }
> > > +
> > > + return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
> > > + vma_is_shmem(vma);
> > > }
> > >
> > > static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > @@ -1288,14 +1306,15 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > ret = -EINVAL;
> > > if (!uffdio_register.mode)
> > > goto out;
> > > - if (uffdio_register.mode & ~(UFFDIO_REGISTER_MODE_MISSING|
> > > - UFFDIO_REGISTER_MODE_WP))
> > > + if (uffdio_register.mode & ~UFFD_API_REGISTER_MODES)
> > > goto out;
> > > vm_flags = 0;
> > > if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
> > > vm_flags |= VM_UFFD_MISSING;
> > > if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)
> > > vm_flags |= VM_UFFD_WP;
> > > + if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR)
> > > + vm_flags |= VM_UFFD_MINOR;
> > >
> > > ret = validate_range(mm, &uffdio_register.range.start,
> > > uffdio_register.range.len);
> > > @@ -1339,7 +1358,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > cond_resched();
> > >
> > > BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
> > > - !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> > > + !!(cur->vm_flags & __VM_UFFD_FLAGS));
> > >
> > > /* check not compatible vmas */
> > > ret = -EINVAL;
> > > @@ -1419,8 +1438,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > start = vma->vm_start;
> > > vma_end = min(end, vma->vm_end);
> > >
> > > - new_flags = (vma->vm_flags &
> > > - ~(VM_UFFD_MISSING|VM_UFFD_WP)) | vm_flags;
> > > + new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
> > > prev = vma_merge(mm, prev, start, vma_end, new_flags,
> > > vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> > > vma_policy(vma),
> > > @@ -1539,7 +1557,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> > > cond_resched();
> > >
> > > BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
> > > - !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> > > + !!(cur->vm_flags & __VM_UFFD_FLAGS));
> > >
> > > /*
> > > * Check not compatible vmas, not strictly required
> > > @@ -1590,7 +1608,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> > > wake_userfault(vma->vm_userfaultfd_ctx.ctx, &range);
> > > }
> > >
> > > - new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
> > > + new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
> > > prev = vma_merge(mm, prev, start, vma_end, new_flags,
> > > vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> > > vma_policy(vma),
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index ecdf8a8cd6ae..1d7041bd3148 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -276,6 +276,7 @@ extern unsigned int kobjsize(const void *objp);
> > > #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
> > > #define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
> > > #define VM_UFFD_WP 0x00001000 /* wrprotect pages tracking */
> > > +#define VM_UFFD_MINOR 0x00002000 /* minor fault interception */
> > >
> > > #define VM_LOCKED 0x00002000
> > > #define VM_IO 0x00004000 /* Memory mapped I/O or similar */
> > > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > > index a8e5f3ea9bb2..0c8c5fa5efc8 100644
> > > --- a/include/linux/userfaultfd_k.h
> > > +++ b/include/linux/userfaultfd_k.h
> > > @@ -62,6 +62,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> > > return vma->vm_flags & VM_UFFD_WP;
> > > }
> > >
> > > +static inline bool userfaultfd_minor(struct vm_area_struct *vma)
> > > +{
> > > + return vma->vm_flags & VM_UFFD_MINOR;
> > > +}
> > > +
> > > static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> > > pte_t pte)
> > > {
> > > @@ -76,7 +81,7 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
> > >
> > > static inline bool userfaultfd_armed(struct vm_area_struct *vma)
> > > {
> > > - return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
> > > + return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR);
> > > }
> > >
> > > extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
> > > @@ -123,6 +128,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> > > return false;
> > > }
> > >
> > > +static inline bool userfaultfd_minor(struct vm_area_struct *vma)
> > > +{
> > > + return false;
> > > +}
> > > +
> > > static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> > > pte_t pte)
> > > {
> > > diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> > > index 67018d367b9f..2d583ffd4100 100644
> > > --- a/include/trace/events/mmflags.h
> > > +++ b/include/trace/events/mmflags.h
> > > @@ -151,6 +151,7 @@ IF_HAVE_PG_ARCH_2(PG_arch_2, "arch_2" )
> > > {VM_PFNMAP, "pfnmap" }, \
> > > {VM_DENYWRITE, "denywrite" }, \
> > > {VM_UFFD_WP, "uffd_wp" }, \
> > > + {VM_UFFD_MINOR, "uffd_minor" }, \
> > > {VM_LOCKED, "locked" }, \
> > > {VM_IO, "io" }, \
> > > {VM_SEQ_READ, "seqread" }, \
> > > diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> > > index 5f2d88212f7c..1cc2cd8a5279 100644
> > > --- a/include/uapi/linux/userfaultfd.h
> > > +++ b/include/uapi/linux/userfaultfd.h
> > > @@ -19,15 +19,19 @@
> > > * means the userland is reading).
> > > */
> > > #define UFFD_API ((__u64)0xAA)
> > > +#define UFFD_API_REGISTER_MODES (UFFDIO_REGISTER_MODE_MISSING | \
> > > + UFFDIO_REGISTER_MODE_WP | \
> > > + UFFDIO_REGISTER_MODE_MINOR)
> > > #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \
> > > UFFD_FEATURE_EVENT_FORK | \
> > > UFFD_FEATURE_EVENT_REMAP | \
> > > - UFFD_FEATURE_EVENT_REMOVE | \
> > > + UFFD_FEATURE_EVENT_REMOVE | \
> > > UFFD_FEATURE_EVENT_UNMAP | \
> > > UFFD_FEATURE_MISSING_HUGETLBFS | \
> > > UFFD_FEATURE_MISSING_SHMEM | \
> > > UFFD_FEATURE_SIGBUS | \
> > > - UFFD_FEATURE_THREAD_ID)
> > > + UFFD_FEATURE_THREAD_ID | \
> > > + UFFD_FEATURE_MINOR_FAULT_HUGETLBFS)
> > > #define UFFD_API_IOCTLS \
> > > ((__u64)1 << _UFFDIO_REGISTER | \
> > > (__u64)1 << _UFFDIO_UNREGISTER | \
> > > @@ -127,6 +131,7 @@ struct uffd_msg {
> > > /* flags for UFFD_EVENT_PAGEFAULT */
> > > #define UFFD_PAGEFAULT_FLAG_WRITE (1<<0) /* If this was a write fault */
> > > #define UFFD_PAGEFAULT_FLAG_WP (1<<1) /* If reason is VM_UFFD_WP */
> > > +#define UFFD_PAGEFAULT_FLAG_MINOR (1<<2) /* If reason is VM_UFFD_MINOR */
> > >
> > > struct uffdio_api {
> > > /* userland asks for an API number and the features to enable */
> > > @@ -171,6 +176,10 @@ struct uffdio_api {
> > > *
> > > * UFFD_FEATURE_THREAD_ID pid of the page faulted task_struct will
> > > * be returned, if feature is not requested 0 will be returned.
> > > + *
> > > + * UFFD_FEATURE_MINOR_FAULT_HUGETLBFS indicates that minor faults
> > > + * can be intercepted (via REGISTER_MODE_MINOR) for
> > > + * hugetlbfs-backed pages.
> > > */
> > > #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
> > > #define UFFD_FEATURE_EVENT_FORK (1<<1)
> > > @@ -181,6 +190,7 @@ struct uffdio_api {
> > > #define UFFD_FEATURE_EVENT_UNMAP (1<<6)
> > > #define UFFD_FEATURE_SIGBUS (1<<7)
> > > #define UFFD_FEATURE_THREAD_ID (1<<8)
> > > +#define UFFD_FEATURE_MINOR_FAULT_HUGETLBFS (1<<9)
> > > __u64 features;
> > >
> > > __u64 ioctls;
> > > @@ -195,6 +205,7 @@ struct uffdio_register {
> > > struct uffdio_range range;
> > > #define UFFDIO_REGISTER_MODE_MISSING ((__u64)1<<0)
> > > #define UFFDIO_REGISTER_MODE_WP ((__u64)1<<1)
> > > +#define UFFDIO_REGISTER_MODE_MINOR ((__u64)1<<2)
> > > __u64 mode;
> > >
> > > /*
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index a2602969873d..0ba8f2f5a4ae 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -4377,6 +4377,37 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> > > }
> > > }
> > >
> > > + /* Check for page in userfault range. */
> > > + if (!new_page && userfaultfd_minor(vma)) {
> > > + u32 hash;
> > > + struct vm_fault vmf = {
> > > + .vma = vma,
> > > + .address = haddr,
> > > + .flags = flags,
> > > + /*
> > > + * Hard to debug if it ends up being used by a callee
> > > + * that assumes something about the other uninitialized
> > > + * fields... same as in memory.c
> > > + */
> > > + };
> > > +
> > > + unlock_page(page);
> > > +
> > > + /*
> > > + * hugetlb_fault_mutex and i_mmap_rwsem must be dropped before
> > > + * handling userfault. Reacquire after handling fault to make
> > > + * calling code simpler.
> > > + */
> > > +
> > > + hash = hugetlb_fault_mutex_hash(mapping, idx);
> > > + mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> > > + i_mmap_unlock_read(mapping);
> > > + ret = handle_userfault(&vmf, VM_UFFD_MINOR);
> > > + i_mmap_lock_read(mapping);
> > > + mutex_lock(&hugetlb_fault_mutex_table[hash]);
> > > + goto out;
> > > + }
> > > +
> > > /*
> > > * If we are going to COW a private mapping later, we examine the
> > > * pending reservations for this page now. This will ensure that
> > > --
> > > 2.29.2.729.g45daf8777d-goog
> > >
> > --
> > Dr. David Alan Gilbert / [email protected] / Manchester, UK
> >
>
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
On Mon, Jan 11, 2021 at 3:58 AM Dr. David Alan Gilbert
<[email protected]> wrote:
>
> * Axel Rasmussen ([email protected]) wrote:
> > This feature allows userspace to intercept "minor" faults. By "minor"
> > fault, I mean the following situation:
> >
> > Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
> > memory). One of the mappings is registered with userfaultfd (in minor
> > mode), and the other is not. Via the non-UFFD mapping, the underlying
> > pages have already been allocated & filled with some contents. The UFFD
> > mapping has not yet been faulted in; when it is touched for the first
> > time, this results in what I'm calling a "minor" fault. As a concrete
> > example, when working with hugetlbfs, we have huge_pte_none(), but
> > find_lock_page() finds an existing page.
> >
> > This commit adds the new registration mode, and sets the relevant flag
> > on the VMAs being registered. In the hugetlb fault path, if we find
> > that we have huge_pte_none(), but find_lock_page() does indeed find an
> > existing page, then we have a "minor" fault, and if the VMA has the
> > userfaultfd registration flag, we call into userfaultfd to handle it.
> >
> > Why add a new registration mode, as opposed to adding a feature to
> > MISSING registration, like UFFD_FEATURE_SIGBUS?
> >
> > - The semantics are significantly different. UFFDIO_COPY or
> > UFFDIO_ZEROPAGE do not make sense for these minor faults; userspace
> > would instead just memset() or memcpy() or whatever via the non-UFFD
> > mapping. Unlike MISSING registration, MINOR registration only makes
> > sense for shared memory (hugetlbfs or shmem [to be supported in future
> > commits]).
>
> Is there a reason that UFFDIO_COPY can't work with this and a non-shared
> mapping? and/or the fallocate or hole punching we currently use?
It could work, with a small modification. Mainly I just didn't see it
being useful since memcpy() or similar works in this case. (We don't
have the atomicity problem UFFDIO_COPY is trying to solve, since we're
modifying via a non-UFFD VMA.)
Maybe another (weaker?) argument against it is that, if it happily accepts
VMAs both with and without backing pages, it may be easier to misuse.
E.g., if you mistakenly give it the wrong address, it could overwrite
some existing page you didn't mean to modify.
I don't feel particularly strongly, if it strikes you as cleaner to
just support it, I'll do so in the next revision.
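For concreteness, the two-mapping setup I keep assuming above looks roughly
like this from userspace (a sketch only, error handling omitted;
UFFD_FEATURE_MINOR_FAULT_HUGETLBFS and UFFDIO_REGISTER_MODE_MINOR are the
names proposed in this series, everything else is the existing userfaultfd
API):

#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/*
 * Map the same hugetlbfs file twice; register only one mapping MINOR.
 * len must be a multiple of the huge page size.
 */
static int setup_minor(const char *hugetlb_path, size_t len,
                       void **uffd_map, void **plain_map)
{
        int fd = open(hugetlb_path, O_RDWR);
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = {
                .api = UFFD_API,
                .features = UFFD_FEATURE_MINOR_FAULT_HUGETLBFS,
        };
        struct uffdio_register reg;

        ioctl(uffd, UFFDIO_API, &api);

        *plain_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        *uffd_map  = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        reg.range.start = (unsigned long)*uffd_map;
        reg.range.len = len;
        reg.mode = UFFDIO_REGISTER_MODE_MINOR;
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /*
         * Populate or update page contents via *plain_map (memcpy, memset,
         * ...); first touches of *uffd_map now show up as minor faults.
         */
        return uffd;
}

All writes to page contents go through *plain_map, which is why a plain
memcpy() is enough and there is no atomicity problem to solve.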
>
> Dave
>
> > - Doing so would make handle_userfault()'s "reason" argument confusing.
> > We'd pass in "MISSING" even if the pages weren't really missing.
> >
> > Signed-off-by: Axel Rasmussen <[email protected]>
> > ---
> > fs/proc/task_mmu.c | 1 +
> > fs/userfaultfd.c | 80 +++++++++++++++++++-------------
> > include/linux/mm.h | 1 +
> > include/linux/userfaultfd_k.h | 12 ++++-
> > include/trace/events/mmflags.h | 1 +
> > include/uapi/linux/userfaultfd.h | 15 +++++-
> > mm/hugetlb.c | 31 +++++++++++++
> > 7 files changed, 107 insertions(+), 34 deletions(-)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index ee5a235b3056..108faf719a83 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
> > [ilog2(VM_MTE)] = "mt",
> > [ilog2(VM_MTE_ALLOWED)] = "",
> > #endif
> > + [ilog2(VM_UFFD_MINOR)] = "ui",
> > #ifdef CONFIG_ARCH_HAS_PKEYS
> > /* These come out via ProtectionKey: */
> > [ilog2(VM_PKEY_BIT0)] = "",
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 894cc28142e7..0a661422eb19 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -28,6 +28,8 @@
> > #include <linux/security.h>
> > #include <linux/hugetlb.h>
> >
> > +#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
> > +
> > int sysctl_unprivileged_userfaultfd __read_mostly;
> >
> > static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
> > @@ -196,24 +198,21 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
> > msg_init(&msg);
> > msg.event = UFFD_EVENT_PAGEFAULT;
> > msg.arg.pagefault.address = address;
> > + /*
> > + * These flags indicate why the userfault occurred:
> > + * - UFFD_PAGEFAULT_FLAG_WP indicates a write protect fault.
> > + * - UFFD_PAGEFAULT_FLAG_MINOR indicates a minor fault.
> > + * - Neither of these flags being set indicates a MISSING fault.
> > + *
> > + * Separately, UFFD_PAGEFAULT_FLAG_WRITE indicates it was a write
> > + * fault. Otherwise, it was a read fault.
> > + */
> > if (flags & FAULT_FLAG_WRITE)
> > - /*
> > - * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
> > - * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WRITE
> > - * was not set in a UFFD_EVENT_PAGEFAULT, it means it
> > - * was a read fault, otherwise if set it means it's
> > - * a write fault.
> > - */
> > msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WRITE;
> > if (reason & VM_UFFD_WP)
> > - /*
> > - * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
> > - * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WP was
> > - * not set in a UFFD_EVENT_PAGEFAULT, it means it was
> > - * a missing fault, otherwise if set it means it's a
> > - * write protect fault.
> > - */
> > msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WP;
> > + if (reason & VM_UFFD_MINOR)
> > + msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_MINOR;
> > if (features & UFFD_FEATURE_THREAD_ID)
> > msg.arg.pagefault.feat.ptid = task_pid_vnr(current);
> > return msg;
> > @@ -400,8 +399,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >
> > BUG_ON(ctx->mm != mm);
> >
> > - VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
> > - VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
> > + /* Any unrecognized flag is a bug. */
> > + VM_BUG_ON(reason & ~__VM_UFFD_FLAGS);
> > + /* 0 or > 1 flags set is a bug; we expect exactly 1. */
> > + VM_BUG_ON(!reason || !!(reason & (reason - 1)));
> >
> > if (ctx->features & UFFD_FEATURE_SIGBUS)
> > goto out;
> > @@ -611,7 +612,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
> > for (vma = mm->mmap; vma; vma = vma->vm_next)
> > if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
> > vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> > - vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> > + vma->vm_flags &= ~__VM_UFFD_FLAGS;
> > }
> > mmap_write_unlock(mm);
> >
> > @@ -643,7 +644,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
> > octx = vma->vm_userfaultfd_ctx.ctx;
> > if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
> > vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> > - vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> > + vma->vm_flags &= ~__VM_UFFD_FLAGS;
> > return 0;
> > }
> >
> > @@ -725,7 +726,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
> > } else {
> > /* Drop uffd context if remap feature not enabled */
> > vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> > - vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> > + vma->vm_flags &= ~__VM_UFFD_FLAGS;
> > }
> > }
> >
> > @@ -866,12 +867,12 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
> > for (vma = mm->mmap; vma; vma = vma->vm_next) {
> > cond_resched();
> > BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
> > - !!(vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> > + !!(vma->vm_flags & __VM_UFFD_FLAGS));
> > if (vma->vm_userfaultfd_ctx.ctx != ctx) {
> > prev = vma;
> > continue;
> > }
> > - new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
> > + new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
> > prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
> > new_flags, vma->anon_vma,
> > vma->vm_file, vma->vm_pgoff,
> > @@ -1260,9 +1261,26 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
> > unsigned long vm_flags)
> > {
> > /* FIXME: add WP support to hugetlbfs and shmem */
> > - return vma_is_anonymous(vma) ||
> > - ((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
> > - !(vm_flags & VM_UFFD_WP));
> > + if (vm_flags & VM_UFFD_WP) {
> > + if (is_vm_hugetlb_page(vma) || vma_is_shmem(vma))
> > + return false;
> > + }
> > +
> > + if (vm_flags & VM_UFFD_MINOR) {
> > + /*
> > + * The use case for minor registration (intercepting minor
> > + * faults) is to handle the case where a page is present, but
> > + * needs to be modified before it can be used. This requires
> > + * two mappings: one with UFFD registration, and one without.
> > + * So, it only makes sense to do this with shared memory.
> > + */
> > + /* FIXME: Add minor fault interception for shmem. */
> > + if (!(is_vm_hugetlb_page(vma) && (vma->vm_flags & VM_SHARED)))
> > + return false;
> > + }
> > +
> > + return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
> > + vma_is_shmem(vma);
> > }
> >
> > static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > @@ -1288,14 +1306,15 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > ret = -EINVAL;
> > if (!uffdio_register.mode)
> > goto out;
> > - if (uffdio_register.mode & ~(UFFDIO_REGISTER_MODE_MISSING|
> > - UFFDIO_REGISTER_MODE_WP))
> > + if (uffdio_register.mode & ~UFFD_API_REGISTER_MODES)
> > goto out;
> > vm_flags = 0;
> > if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
> > vm_flags |= VM_UFFD_MISSING;
> > if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)
> > vm_flags |= VM_UFFD_WP;
> > + if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR)
> > + vm_flags |= VM_UFFD_MINOR;
> >
> > ret = validate_range(mm, &uffdio_register.range.start,
> > uffdio_register.range.len);
> > @@ -1339,7 +1358,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > cond_resched();
> >
> > BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
> > - !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> > + !!(cur->vm_flags & __VM_UFFD_FLAGS));
> >
> > /* check not compatible vmas */
> > ret = -EINVAL;
> > @@ -1419,8 +1438,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > start = vma->vm_start;
> > vma_end = min(end, vma->vm_end);
> >
> > - new_flags = (vma->vm_flags &
> > - ~(VM_UFFD_MISSING|VM_UFFD_WP)) | vm_flags;
> > + new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
> > prev = vma_merge(mm, prev, start, vma_end, new_flags,
> > vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> > vma_policy(vma),
> > @@ -1539,7 +1557,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> > cond_resched();
> >
> > BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
> > - !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> > + !!(cur->vm_flags & __VM_UFFD_FLAGS));
> >
> > /*
> > * Check not compatible vmas, not strictly required
> > @@ -1590,7 +1608,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> > wake_userfault(vma->vm_userfaultfd_ctx.ctx, &range);
> > }
> >
> > - new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
> > + new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
> > prev = vma_merge(mm, prev, start, vma_end, new_flags,
> > vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> > vma_policy(vma),
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ecdf8a8cd6ae..1d7041bd3148 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -276,6 +276,7 @@ extern unsigned int kobjsize(const void *objp);
> > #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
> > #define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
> > #define VM_UFFD_WP 0x00001000 /* wrprotect pages tracking */
> > +#define VM_UFFD_MINOR 0x00002000 /* minor fault interception */
> >
> > #define VM_LOCKED 0x00002000
> > #define VM_IO 0x00004000 /* Memory mapped I/O or similar */
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index a8e5f3ea9bb2..0c8c5fa5efc8 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -62,6 +62,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> > return vma->vm_flags & VM_UFFD_WP;
> > }
> >
> > +static inline bool userfaultfd_minor(struct vm_area_struct *vma)
> > +{
> > + return vma->vm_flags & VM_UFFD_MINOR;
> > +}
> > +
> > static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> > pte_t pte)
> > {
> > @@ -76,7 +81,7 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
> >
> > static inline bool userfaultfd_armed(struct vm_area_struct *vma)
> > {
> > - return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
> > + return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR);
> > }
> >
> > extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
> > @@ -123,6 +128,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> > return false;
> > }
> >
> > +static inline bool userfaultfd_minor(struct vm_area_struct *vma)
> > +{
> > + return false;
> > +}
> > +
> > static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> > pte_t pte)
> > {
> > diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> > index 67018d367b9f..2d583ffd4100 100644
> > --- a/include/trace/events/mmflags.h
> > +++ b/include/trace/events/mmflags.h
> > @@ -151,6 +151,7 @@ IF_HAVE_PG_ARCH_2(PG_arch_2, "arch_2" )
> > {VM_PFNMAP, "pfnmap" }, \
> > {VM_DENYWRITE, "denywrite" }, \
> > {VM_UFFD_WP, "uffd_wp" }, \
> > + {VM_UFFD_MINOR, "uffd_minor" }, \
> > {VM_LOCKED, "locked" }, \
> > {VM_IO, "io" }, \
> > {VM_SEQ_READ, "seqread" }, \
> > diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> > index 5f2d88212f7c..1cc2cd8a5279 100644
> > --- a/include/uapi/linux/userfaultfd.h
> > +++ b/include/uapi/linux/userfaultfd.h
> > @@ -19,15 +19,19 @@
> > * means the userland is reading).
> > */
> > #define UFFD_API ((__u64)0xAA)
> > +#define UFFD_API_REGISTER_MODES (UFFDIO_REGISTER_MODE_MISSING | \
> > + UFFDIO_REGISTER_MODE_WP | \
> > + UFFDIO_REGISTER_MODE_MINOR)
> > #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \
> > UFFD_FEATURE_EVENT_FORK | \
> > UFFD_FEATURE_EVENT_REMAP | \
> > - UFFD_FEATURE_EVENT_REMOVE | \
> > + UFFD_FEATURE_EVENT_REMOVE | \
> > UFFD_FEATURE_EVENT_UNMAP | \
> > UFFD_FEATURE_MISSING_HUGETLBFS | \
> > UFFD_FEATURE_MISSING_SHMEM | \
> > UFFD_FEATURE_SIGBUS | \
> > - UFFD_FEATURE_THREAD_ID)
> > + UFFD_FEATURE_THREAD_ID | \
> > + UFFD_FEATURE_MINOR_FAULT_HUGETLBFS)
> > #define UFFD_API_IOCTLS \
> > ((__u64)1 << _UFFDIO_REGISTER | \
> > (__u64)1 << _UFFDIO_UNREGISTER | \
> > @@ -127,6 +131,7 @@ struct uffd_msg {
> > /* flags for UFFD_EVENT_PAGEFAULT */
> > #define UFFD_PAGEFAULT_FLAG_WRITE (1<<0) /* If this was a write fault */
> > #define UFFD_PAGEFAULT_FLAG_WP (1<<1) /* If reason is VM_UFFD_WP */
> > +#define UFFD_PAGEFAULT_FLAG_MINOR (1<<2) /* If reason is VM_UFFD_MINOR */
> >
> > struct uffdio_api {
> > /* userland asks for an API number and the features to enable */
> > @@ -171,6 +176,10 @@ struct uffdio_api {
> > *
> > * UFFD_FEATURE_THREAD_ID pid of the page faulted task_struct will
> > * be returned, if feature is not requested 0 will be returned.
> > + *
> > + * UFFD_FEATURE_MINOR_FAULT_HUGETLBFS indicates that minor faults
> > + * can be intercepted (via REGISTER_MODE_MINOR) for
> > + * hugetlbfs-backed pages.
> > */
> > #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
> > #define UFFD_FEATURE_EVENT_FORK (1<<1)
> > @@ -181,6 +190,7 @@ struct uffdio_api {
> > #define UFFD_FEATURE_EVENT_UNMAP (1<<6)
> > #define UFFD_FEATURE_SIGBUS (1<<7)
> > #define UFFD_FEATURE_THREAD_ID (1<<8)
> > +#define UFFD_FEATURE_MINOR_FAULT_HUGETLBFS (1<<9)
> > __u64 features;
> >
> > __u64 ioctls;
> > @@ -195,6 +205,7 @@ struct uffdio_register {
> > struct uffdio_range range;
> > #define UFFDIO_REGISTER_MODE_MISSING ((__u64)1<<0)
> > #define UFFDIO_REGISTER_MODE_WP ((__u64)1<<1)
> > +#define UFFDIO_REGISTER_MODE_MINOR ((__u64)1<<2)
> > __u64 mode;
> >
> > /*
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a2602969873d..0ba8f2f5a4ae 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4377,6 +4377,37 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> > }
> > }
> >
> > + /* Check for page in userfault range. */
> > + if (!new_page && userfaultfd_minor(vma)) {
> > + u32 hash;
> > + struct vm_fault vmf = {
> > + .vma = vma,
> > + .address = haddr,
> > + .flags = flags,
> > + /*
> > + * Hard to debug if it ends up being used by a callee
> > + * that assumes something about the other uninitialized
> > + * fields... same as in memory.c
> > + */
> > + };
> > +
> > + unlock_page(page);
> > +
> > + /*
> > + * hugetlb_fault_mutex and i_mmap_rwsem must be dropped before
> > + * handling userfault. Reacquire after handling fault to make
> > + * calling code simpler.
> > + */
> > +
> > + hash = hugetlb_fault_mutex_hash(mapping, idx);
> > + mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> > + i_mmap_unlock_read(mapping);
> > + ret = handle_userfault(&vmf, VM_UFFD_MINOR);
> > + i_mmap_lock_read(mapping);
> > + mutex_lock(&hugetlb_fault_mutex_table[hash]);
> > + goto out;
> > + }
> > +
> > /*
> > * If we are going to COW a private mapping later, we examine the
> > * pending reservations for this page now. This will ensure that
> > --
> > 2.29.2.729.g45daf8777d-goog
> >
> --
> Dr. David Alan Gilbert / [email protected] / Manchester, UK
>
On 1/7/21 11:04 AM, Axel Rasmussen wrote:
> Overview
> ========
>
> This series adds a new userfaultfd registration mode,
> UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
> By "minor" fault, I mean the following situation:
>
> Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
> One of the mappings is registered with userfaultfd (in minor mode), and the
> other is not. Via the non-UFFD mapping, the underlying pages have already been
> allocated & filled with some contents. The UFFD mapping has not yet been
> faulted in; when it is touched for the first time, this results in what I'm
> calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
> have huge_pte_none(), but find_lock_page() finds an existing page.
>
> We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
> userspace resolves the fault by either a) doing nothing if the contents are
> already correct, or b) updating the underlying contents using the second,
> non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
> or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
> "I have ensured the page contents are correct, carry on setting up the mapping".
>
One quick thought.
This is not going to work as expected with hugetlbfs pmd sharing. If you
are not familiar with hugetlbfs pmd sharing, you are not alone. :)
pmd sharing is enabled for x86 and arm64 architectures. If there are multiple
shared mappings of the same underlying hugetlbfs file or shared memory segment
that are 'suitably aligned', then the PMD pages associated with those regions
are shared by all the mappings. Suitably aligned means 'on a 1GB boundary'
and 1GB in size.
When pmds are shared, your mappings will never see a 'minor fault'. This
is because the PMD (page table entries) is shared.
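A simplified sketch of just that size/alignment condition (illustrative
only; the kernel's actual check also requires VM_MAYSHARE and that the
mappings agree on file-offset alignment):

#include <stdbool.h>
#include <stdint.h>

#define PUD_REGION      (1ULL << 30)    /* 1GB on x86_64/arm64 with 4K pages */

/* Can this mapping contain at least one whole, aligned 1GB region? */
static bool may_share_pmds(uint64_t map_start, uint64_t map_len)
{
        /* first 1GB-aligned address inside the mapping */
        uint64_t base = (map_start + PUD_REGION - 1) & ~(PUD_REGION - 1);

        return base + PUD_REGION <= map_start + map_len;
}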
--
Mike Kravetz
On Mon, Jan 11, 2021 at 5:49 PM Peter Xu <[email protected]> wrote:
>
> On Mon, Jan 11, 2021 at 04:13:41PM -0800, Mike Kravetz wrote:
> > On 1/11/21 3:08 PM, Peter Xu wrote:
> > > On Mon, Jan 11, 2021 at 02:42:48PM -0800, Mike Kravetz wrote:
> > >> On 1/7/21 11:04 AM, Axel Rasmussen wrote:
> > >>> Overview
> > >>> ========
> > >>>
> > >>> This series adds a new userfaultfd registration mode,
> > >>> UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
> > >>> By "minor" fault, I mean the following situation:
> > >>>
> > >>> Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
> > >>> One of the mappings is registered with userfaultfd (in minor mode), and the
> > >>> other is not. Via the non-UFFD mapping, the underlying pages have already been
> > >>> allocated & filled with some contents. The UFFD mapping has not yet been
> > >>> faulted in; when it is touched for the first time, this results in what I'm
> > >>> calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
> > >>> have huge_pte_none(), but find_lock_page() finds an existing page.
> > >>>
> > >>> We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
> > >>> userspace resolves the fault by either a) doing nothing if the contents are
> > >>> already correct, or b) updating the underlying contents using the second,
> > >>> non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
> > >>> or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
> > >>> "I have ensured the page contents are correct, carry on setting up the mapping".
> > >>>
> > >>
> > >> One quick thought.
> > >>
> > >> This is not going to work as expected with hugetlbfs pmd sharing. If you
> > >> are not familiar with hugetlbfs pmd sharing, you are not alone. :)
> > >>
> > >> pmd sharing is enabled for x86 and arm64 architectures. If there are multiple
> > >> shared mappings of the same underlying hugetlbfs file or shared memory segment
> > >> that are 'suitably aligned', then the PMD pages associated with those regions
> > >> are shared by all the mappings. Suitably aligned means 'on a 1GB boundary'
> > >> and 1GB in size.
> > >>
> > >> When pmds are shared, your mappings will never see a 'minor fault'. This
> > >> is because the PMD (page table entries) is shared.
> > >
> > > Thanks for raising this, Mike.
> > >
> > > I've got a few patches that plan to disable huge pmd sharing for uffd in
> > > general, e.g.:
> > >
> > > https://github.com/xzpeter/linux/commit/f9123e803d9bdd91bf6ef23b028087676bed1540
> > > https://github.com/xzpeter/linux/commit/aa9aeb5c4222a2fdb48793cdbc22902288454a31
> > >
> > > I believe we don't want that for missing mode too, but it's just not extremely
> > > important for missing mode yet, because in missing mode we normally monitor all
> > > the processes that will be using the registered mm range. For example, in QEMU
> > > postcopy migration with vhost-user hugetlbfs files as backends, we'll monitor
> > > both the QEMU process and the DPDK program, so that either of the programs will
> > > trigger a missing fault even if pmd shared between them. However again I think
> > > it's not ideal since uffd (even if missing mode) is pgtable-based, so sharing
> > > could always be too tricky.
> > >
> > > They're not yet posted to public yet since that's part of uffd-wp support for
> > > hugetlbfs (along with shmem). So just raise this up to avoid potential
> > > duplicated work before I post the patchset.
> > >
> > > (Will read into details soon; probably too many things piled up...)
> >
> > Thanks for the heads up about this Peter.
> >
> > I know Oracle DB really wants shared pmds -and- UFFD. I need to get details
> > of their exact usage model. I know they primarily use SIGBUS, but use
> > MISSING_HUGETLBFS as well. We may need to be more selective in when to
> > disable.
>
> After a second thought, indeed it's possible to use it that way with pmd
> sharing. Actually we don't need to generate a fault for every page, if what
> we want to do is simply "initialize the pages using some data" on the
> registered ranges. That should also be the case even for qemu+dpdk: if
> e.g. qemu has faulted in a page, it'll be nicer if dpdk can avoid faulting
> it in again (so when huge pmd sharing is enabled we can even avoid the page
> fault to install the pte, as long as the page cache page already exists).
> It should be similarly beneficial if the other process is not faulting in
> but proactively filling the holes using UFFDIO_COPY, whether for the current
> process or for itself; that sounds like a valid scenario for Google too when
> a VM migrates.
Exactly right, but I'm a little unsure how to get it to work. There
are two different cases:
- Allocate + populate a page in the background (not on demand) during
postcopy (i.e., after the VM has started executing on the migration
target). In this case, we can be certain that the page contents are up
to date, since execution on the source was already paused. In this
case PMD sharing would actually be nice, because it would mean the VM
would never fault on this page going forward.
- Allocate + populate a page during precopy (i.e., while the VM is
still executing on the migration source). In this case, we *don't*
want PMD sharing, because we need to intercept the first time this
page is touched, verify it's up to date, and copy over the updated
data if not.
Another related situation to consider is, at some point on the target
machine, we'll receive the "dirty map" indicating which pages are out
of date or not. My original thinking was, when the VM faults on any of
these pages, from this point forward we'd just look at the map and
then UFFDIO_CONTINUE if things were up to date. But you're right that
a possible optimization is, once we receive the map, just immediately
"enable PMD sharing" on these pages, so the VM will never fault on
them.
But, this is all kind of speculative. I don't know of any existing API
for *userspace* to take an existing shared memory mapping without PMD
sharing, and "turn on" PMD sharing for particular page(s).
For now, I'll plan on disabling PMD sharing for MINOR registered
ranges. Thanks, Peter and Mike!
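For reference, the demand path I described above would look roughly like the
loop below. This is a sketch only, not our actual code: UFFDIO_CONTINUE and
UFFD_PAGEFAULT_FLAG_MINOR are the names proposed in this series, and the
dirty-bitmap layout (one bit per page, 64 pages per word) is just an
assumption for illustration.

#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/*
 * dirty: one bit per page, set if the target's copy is stale.
 * fetch_from_source: however up-to-date contents arrive (network, RDMA,
 * ...); it must write through the non-UFFD mapping.
 */
static void handle_minor_faults(int uffd, char *uffd_map, char *plain_map,
                                const unsigned long *dirty, size_t page_size,
                                void (*fetch_from_source)(size_t off, void *dst))
{
        struct uffd_msg msg;

        while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
                struct uffdio_continue cont;
                size_t off, pgno;

                if (msg.event != UFFD_EVENT_PAGEFAULT ||
                    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR))
                        continue;

                off = (msg.arg.pagefault.address - (unsigned long)uffd_map)
                        & ~(page_size - 1);
                pgno = off / page_size;

                if (dirty[pgno / 64] & (1UL << (pgno % 64)))
                        /* stale: refresh contents via the non-UFFD mapping */
                        fetch_from_source(off, plain_map + off);

                cont.range.start = (unsigned long)uffd_map + off;
                cont.range.len = page_size;
                cont.mode = 0;
                ioctl(uffd, UFFDIO_CONTINUE, &cont);
        }
}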
>
> I've modified my local tree to only disable pmd sharing for uffd-wp but keep
> missing mode as-is [1]. A new helper uffd_disable_huge_pmd_share() is
> introduced in patch "hugetlb/userfaultfd: Forbid huge pmd sharing when uffd
> enabled", so should be easier if we would like to add minor mode too.
>
> Thanks!
>
> [1] https://github.com/xzpeter/linux/commits/uffd-wp-shmem-hugetlbfs
>
> --
> Peter Xu
>
On Mon, Jan 11, 2021 at 02:42:48PM -0800, Mike Kravetz wrote:
> On 1/7/21 11:04 AM, Axel Rasmussen wrote:
> > Overview
> > ========
> >
> > This series adds a new userfaultfd registration mode,
> > UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
> > By "minor" fault, I mean the following situation:
> >
> > Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
> > One of the mappings is registered with userfaultfd (in minor mode), and the
> > other is not. Via the non-UFFD mapping, the underlying pages have already been
> > allocated & filled with some contents. The UFFD mapping has not yet been
> > faulted in; when it is touched for the first time, this results in what I'm
> > calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
> > have huge_pte_none(), but find_lock_page() finds an existing page.
> >
> > We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
> > userspace resolves the fault by either a) doing nothing if the contents are
> > already correct, or b) updating the underlying contents using the second,
> > non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
> > or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
> > "I have ensured the page contents are correct, carry on setting up the mapping".
> >
>
> One quick thought.
>
> This is not going to work as expected with hugetlbfs pmd sharing. If you
> are not familiar with hugetlbfs pmd sharing, you are not alone. :)
>
> pmd sharing is enabled for x86 and arm64 architectures. If there are multiple
> shared mappings of the same underlying hugetlbfs file or shared memory segment
> that are 'suitably aligned', then the PMD pages associated with those regions
> are shared by all the mappings. Suitably aligned means 'on a 1GB boundary'
> and 1GB in size.
>
> When pmds are shared, your mappings will never see a 'minor fault'. This
> is because the PMD (page table entries) is shared.
Thanks for raising this, Mike.
I've got a few patches that plan to disable huge pmd sharing for uffd in
general, e.g.:
https://github.com/xzpeter/linux/commit/f9123e803d9bdd91bf6ef23b028087676bed1540
https://github.com/xzpeter/linux/commit/aa9aeb5c4222a2fdb48793cdbc22902288454a31
I believe we don't want that for missing mode too, but it's just not extremely
important for missing mode yet, because in missing mode we normally monitor all
the processes that will be using the registered mm range. For example, in QEMU
postcopy migration with vhost-user hugetlbfs files as backends, we'll monitor
both the QEMU process and the DPDK program, so that either of the programs will
trigger a missing fault even if pmd shared between them. However again I think
it's not ideal since uffd (even if missing mode) is pgtable-based, so sharing
could always be too tricky.
They're not posted publicly yet, since they're part of uffd-wp support for
hugetlbfs (along with shmem). So I'm just raising this up to avoid potential
duplicated work before I post the patchset.
(Will read into details soon; probably too many things piled up...)
Thanks,
--
Peter Xu
On 1/11/21 3:08 PM, Peter Xu wrote:
> On Mon, Jan 11, 2021 at 02:42:48PM -0800, Mike Kravetz wrote:
>> On 1/7/21 11:04 AM, Axel Rasmussen wrote:
>>> Overview
>>> ========
>>>
>>> This series adds a new userfaultfd registration mode,
>>> UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
>>> By "minor" fault, I mean the following situation:
>>>
>>> Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
>>> One of the mappings is registered with userfaultfd (in minor mode), and the
>>> other is not. Via the non-UFFD mapping, the underlying pages have already been
>>> allocated & filled with some contents. The UFFD mapping has not yet been
>>> faulted in; when it is touched for the first time, this results in what I'm
>>> calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
>>> have huge_pte_none(), but find_lock_page() finds an existing page.
>>>
>>> We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
>>> userspace resolves the fault by either a) doing nothing if the contents are
>>> already correct, or b) updating the underlying contents using the second,
>>> non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
>>> or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
>>> "I have ensured the page contents are correct, carry on setting up the mapping".
>>>
>>
>> One quick thought.
>>
>> This is not going to work as expected with hugetlbfs pmd sharing. If you
>> are not familiar with hugetlbfs pmd sharing, you are not alone. :)
>>
>> pmd sharing is enabled for x86 and arm64 architectures. If there are multiple
>> shared mappings of the same underlying hugetlbfs file or shared memory segment
>> that are 'suitably aligned', then the PMD pages associated with those regions
>> are shared by all the mappings. Suitably aligned means 'on a 1GB boundary'
>> and 1GB in size.
>>
>> When pmds are shared, your mappings will never see a 'minor fault'. This
>> is because the PMD (page table entries) is shared.
>
> Thanks for raising this, Mike.
>
> I've got a few patches that plan to disable huge pmd sharing for uffd in
> general, e.g.:
>
> https://github.com/xzpeter/linux/commit/f9123e803d9bdd91bf6ef23b028087676bed1540
> https://github.com/xzpeter/linux/commit/aa9aeb5c4222a2fdb48793cdbc22902288454a31
>
> I believe we don't want that for missing mode too, but it's just not extremely
> important for missing mode yet, because in missing mode we normally monitor all
> the processes that will be using the registered mm range. For example, in QEMU
> postcopy migration with vhost-user hugetlbfs files as backends, we'll monitor
> both the QEMU process and the DPDK program, so that either of the programs will
> trigger a missing fault even if pmd shared between them. However again I think
> it's not ideal since uffd (even if missing mode) is pgtable-based, so sharing
> could always be too tricky.
>
> They're not yet posted to public yet since that's part of uffd-wp support for
> hugetlbfs (along with shmem). So just raise this up to avoid potential
> duplicated work before I post the patchset.
>
> (Will read into details soon; probably too many things piled up...)
Thanks for the heads up about this Peter.
I know Oracle DB really wants shared pmds -and- UFFD. I need to get details
of their exact usage model. I know they primarily use SIGBUS, but use
MISSING_HUGETLBFS as well. We may need to be more selective in when to
disable.
--
Mike Kravetz
On Mon, Jan 11, 2021 at 04:13:41PM -0800, Mike Kravetz wrote:
> On 1/11/21 3:08 PM, Peter Xu wrote:
> > On Mon, Jan 11, 2021 at 02:42:48PM -0800, Mike Kravetz wrote:
> >> On 1/7/21 11:04 AM, Axel Rasmussen wrote:
> >>> Overview
> >>> ========
> >>>
> >>> This series adds a new userfaultfd registration mode,
> >>> UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
> >>> By "minor" fault, I mean the following situation:
> >>>
> >>> Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
> >>> One of the mappings is registered with userfaultfd (in minor mode), and the
> >>> other is not. Via the non-UFFD mapping, the underlying pages have already been
> >>> allocated & filled with some contents. The UFFD mapping has not yet been
> >>> faulted in; when it is touched for the first time, this results in what I'm
> >>> calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
> >>> have huge_pte_none(), but find_lock_page() finds an existing page.
> >>>
> >>> We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
> >>> userspace resolves the fault by either a) doing nothing if the contents are
> >>> already correct, or b) updating the underlying contents using the second,
> >>> non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
> >>> or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
> >>> "I have ensured the page contents are correct, carry on setting up the mapping".
> >>>
> >>
> >> One quick thought.
> >>
> >> This is not going to work as expected with hugetlbfs pmd sharing. If you
> >> are not familiar with hugetlbfs pmd sharing, you are not alone. :)
> >>
> >> pmd sharing is enabled for x86 and arm64 architectures. If there are multiple
> >> shared mappings of the same underlying hugetlbfs file or shared memory segment
> >> that are 'suitably aligned', then the PMD pages associated with those regions
> >> are shared by all the mappings. Suitably aligned means 'on a 1GB boundary'
> >> and 1GB in size.
> >>
> >> When pmds are shared, your mappings will never see a 'minor fault'. This
> >> is because the PMD (page table entries) is shared.
> >
> > Thanks for raising this, Mike.
> >
> > I've got a few patches that plan to disable huge pmd sharing for uffd in
> > general, e.g.:
> >
> > https://github.com/xzpeter/linux/commit/f9123e803d9bdd91bf6ef23b028087676bed1540
> > https://github.com/xzpeter/linux/commit/aa9aeb5c4222a2fdb48793cdbc22902288454a31
> >
> > I believe we don't want that for missing mode too, but it's just not extremely
> > important for missing mode yet, because in missing mode we normally monitor all
> > the processes that will be using the registered mm range. For example, in QEMU
> > postcopy migration with vhost-user hugetlbfs files as backends, we'll monitor
> > both the QEMU process and the DPDK program, so that either of the programs will
> > trigger a missing fault even if pmd shared between them. However again I think
> > it's not ideal since uffd (even if missing mode) is pgtable-based, so sharing
> > could always be too tricky.
> >
> > They're not yet posted to public yet since that's part of uffd-wp support for
> > hugetlbfs (along with shmem). So just raise this up to avoid potential
> > duplicated work before I post the patchset.
> >
> > (Will read into details soon; probably too many things piled up...)
>
> Thanks for the heads up about this Peter.
>
> I know Oracle DB really wants shared pmds -and- UFFD. I need to get details
> of their exact usage model. I know they primarily use SIGBUS, but use
> MISSING_HUGETLBFS as well. We may need to be more selective in when to
> disable.
After a second thought, indeed it's possible to use it that way with pmd
sharing. Actually we don't need to generate a fault for every page, if what
we want to do is simply "initialize the pages using some data" on the
registered ranges. That should also be the case even for qemu+dpdk: if
e.g. qemu has faulted in a page, it'll be nicer if dpdk can avoid faulting
it in again (so when huge pmd sharing is enabled we can even avoid the page
fault to install the pte, as long as the page cache page already exists).
It should be similarly beneficial if the other process is not faulting in
but proactively filling the holes using UFFDIO_COPY, whether for the current
process or for itself; that sounds like a valid scenario for Google too when
a VM migrates.
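Roughly, that proactive fill is just the existing UFFDIO_COPY flow, e.g.
something like the sketch below (per page, with len being the huge page size
for hugetlbfs):

#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

static int prefill_page(int uffd, unsigned long dst, const void *src,
                        unsigned long len)
{
        struct uffdio_copy copy = {
                .dst = dst,                     /* address in the registered range */
                .src = (unsigned long)src,
                .len = len,                     /* huge page sized for hugetlbfs */
                .mode = 0,
        };

        /*
         * EEXIST means the page was already installed (e.g. the other
         * process faulted it in first); callers usually treat that as done.
         */
        return ioctl(uffd, UFFDIO_COPY, &copy);
}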
I've modified my local tree to only disable pmd sharing for uffd-wp and keep
missing mode as-is [1]. A new helper uffd_disable_huge_pmd_share() is
introduced in the patch "hugetlb/userfaultfd: Forbid huge pmd sharing when
uffd enabled", so it should be straightforward to add minor mode too if we
want.
Thanks!
[1] https://github.com/xzpeter/linux/commits/uffd-wp-shmem-hugetlbfs
--
Peter Xu