2015-05-14 17:34:54

by Andrea Arcangeli

Subject: [PATCH 00/23] userfaultfd v4

Hello everyone,

This is the latest userfaultfd patchset against mm-v4.1-rc3
2015-05-14-10:04.

The postcopy live migration feature on the qemu side is mostly ready
to be merged and it depends entirely on the userfaultfd syscall being
merged as well. So it'd be great if this patchset could be reviewed
for merging in -mm.

Userfaults allow the implementation of on demand paging from userland
and, more generally, they let userland take control of page fault
behavior more efficiently than what was available before
(the PROT_NONE + SIGSEGV trap).

The use cases are:

1) KVM postcopy live migration (one form of cloud memory
externalization).

KVM postcopy live migration is the primary driver of this work:

http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/
http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

2) postcopy live migration of binaries inside Linux containers:

http://thread.gmane.org/gmane.linux.kernel.mm/132662

3) KVM postcopy live snapshotting (allowing the memory usage to be
limited/throttled, unlike fork would allow, plus avoiding the fork
overhead in the first place).

While the wrprotect tracking is not implemented yet, the syscall API
already contemplates wrprotect fault tracking and is generic enough to
allow its later implementation in a backwards compatible fashion.

4) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method
should be extended to also work on tmpfs; then the
uffdio_register.ioctls will notify userland that UFFDIO_COPY is
available even when the registered virtual memory range is tmpfs
backed.

5) alternate mechanism to notify web browsers or apps on embedded
devices that volatile pages have been reclaimed. This basically
avoids the need to run a syscall before the app can access the
virtual regions marked volatile with the CPU. This depends on point
4) being fulfilled first, as volatile pages happily apply to tmpfs.

Even though no real use case has requested it yet, userfaults also
allow distributed shared memory to be implemented in such a way that
readonly shared mappings can exist simultaneously on different hosts
and become exclusive at the first wrprotect fault.

The development version can also be cloned here:

git clone --reference linux -b userfault git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

Slides from the LSF/MM summit (but beware that they're not up to date):

https://www.kernel.org/pub/linux/kernel/people/andrea/userfaultfd/userfaultfd-LSFMM-2015.pdf

Comments welcome.

Thanks,
Andrea

Changelog of the major changes since the last RFC v3:

o The API has been slightly modified to avoid having to introduce a
second revision of the API later in order to support the
non-cooperative usage.

o Various mixed fixes thanks to the feedback from Dave Hansen and
David Gilbert.

The most notable one is the use of mm_users instead of mm_count to
pin the mm, to avoid crashes in code that assumed the vma still
existed (in the userfaultfd_release method and in the various
ioctls). exit_mmap doesn't even set mm->mmap to NULL, so unless I
introduce a userfaultfd_exit to call in mmput, I have to pin
mm_users to be safe. This is a visible change mainly for the
non-cooperative usage.

o userfaults are now woken immediately even if they haven't been
"read" yet; this can lead to POLLIN false positives (so I only allow
poll if the fd is open in nonblocking mode, to be sure it won't hang).

http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=f222d9de0a5302dc8ac62d6fab53a84251098751
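
As an illustration (a sketch, not code from the patchset), a reader
thread on a nonblocking userfaultfd simply retries when read returns
-EAGAIN after such a false-positive POLLIN wakeup; "ufd" is assumed to
be a userfaultfd opened with O_NONBLOCK, error handling omitted:

    struct pollfd pfd = { .fd = ufd, .events = POLLIN };

    for (;;) {
            __u64 addr;

            poll(&pfd, 1, -1);
            if (read(ufd, &addr, sizeof(addr)) < 0) {
                    if (errno == EAGAIN)
                            /* false positive: fault already resolved */
                            continue;
                    break;
            }
            /* ... resolve the userfault at addr ... */
    }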

o optimize read to return entries in O(1); poll, which was already
O(1), becomes lockless. This required splitting the waitqueue in
two, one for pending faults and one for non-pending faults, and the
faults are refiled across the two waitqueues when they're read. Both
waitqueues are protected by a single lock (the fault_pending_wqh
one) to be simpler and faster at runtime.

http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=9aa033ed43a1134c2223dac8c5d9e02e0100fca1

o Allocate the ctx with kmem_cache_alloc.

http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=f5a8db16d2876eed8906a4d36f1d0e06ca5490f6

o Originally qemu had two bitflags for each page and kept 3 states (of
the 4 possible with two bits) for each page in order to deal with
the races that can happen if one thread is reading the userfaults
and another thread is calling the UFFDIO_COPY ioctl in the
background. This patch solves all races in the kernel so the two
bits per page can be dropped from the qemu codebase. I started
documenting the races that can materialize by using 2 threads
(instead of running the workload single threaded with a single poll
event loop) and how userland had to solve them, until I decided it
was simpler to fix the race in the kernel by running an ad-hoc
pagetable walk inside the wait_event()-kind-of-section. This
simplified qemu significantly (hundreds of lines of code involving a
mutex have been deleted and that mutex disappeared as well) and it
doesn't make the kernel much more complicated.

http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=41efeae4e93f0296436f2a9fc6b28b6b0158512a

After this patch the only reason left to call UFFDIO_WAKE is to
handle the userfaults in batches, in combination with the DONT_WAKE
flag of UFFDIO_COPY, as sketched below.
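
A rough sketch of that batching pattern (struct and flag names as in
the UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI patch later in this series;
"ufd", "fault_addr", "pages" and the range variables are placeholders,
error handling omitted):

    /* resolve a batch of faults without waking the faulting threads */
    for (i = 0; i < nr_faults; i++) {
            struct uffdio_copy copy = {
                    .dst = fault_addr[i],
                    .src = (unsigned long) pages[i],
                    .len = page_size,
                    .mode = UFFDIO_COPY_MODE_DONTWAKE,
            };
            ioctl(ufd, UFFDIO_COPY, &copy);
    }

    /* then issue a single wakeup covering the registered range */
    struct uffdio_range range = { .start = area_start, .len = area_len };
    ioctl(ufd, UFFDIO_WAKE, &range);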

o I removed the read recursion from mcopy_atomic. This avoids
depending on the write-starvation behavior of rwsem for safety. After
this change the rwsem is free to stop any further down_read if
there's a down_write waiting on the lock.

http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b1e3a08acc9e3f6c2614e89fc3b8e338daa58e18

o Extended the Documentation/vm/userfaultfd.txt file to explain how
QEMU/KVM uses userfaultfd to implement postcopy live migration.

http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=016f9523b7b2238851533736e84452cb00b2ddcd

Andrea Arcangeli (22):
userfaultfd: linux/Documentation/vm/userfaultfd.txt
userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key
userfaultfd: uAPI
userfaultfd: linux/userfaultfd_k.h
userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct
userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP
userfaultfd: call handle_userfault() for userfaultfd_missing() faults
userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx
userfaultfd: prevent khugepaged to merge if userfaultfd is armed
userfaultfd: add new syscall to provide memory externalization
userfaultfd: Rename uffd_api.bits into .features fixup
userfaultfd: change the read API to return a uffd_msg
userfaultfd: wake pending userfaults
userfaultfd: optimize read() and poll() to be O(1)
userfaultfd: allocate the userfaultfd_ctx cacheline aligned
userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
userfaultfd: buildsystem activation
userfaultfd: activate syscall
userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI
userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE
preparation
userfaultfd: avoid mmap_sem read recursion in mcopy_atomic
userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE

Pavel Emelyanov (1):
userfaultfd: Rename uffd_api.bits into .features

Documentation/ioctl/ioctl-number.txt | 1 +
Documentation/vm/userfaultfd.txt | 142 ++++
arch/powerpc/include/asm/systbl.h | 1 +
arch/powerpc/include/uapi/asm/unistd.h | 1 +
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
fs/Makefile | 1 +
fs/proc/task_mmu.c | 2 +
fs/userfaultfd.c | 1236 ++++++++++++++++++++++++++++++++
include/linux/mm.h | 4 +-
include/linux/mm_types.h | 11 +
include/linux/syscalls.h | 1 +
include/linux/userfaultfd_k.h | 85 +++
include/linux/wait.h | 5 +-
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/userfaultfd.h | 161 +++++
init/Kconfig | 11 +
kernel/fork.c | 3 +-
kernel/sched/wait.c | 7 +-
kernel/sys_ni.c | 1 +
mm/Makefile | 1 +
mm/huge_memory.c | 75 +-
mm/madvise.c | 3 +-
mm/memory.c | 16 +
mm/mempolicy.c | 4 +-
mm/mlock.c | 3 +-
mm/mmap.c | 40 +-
mm/mprotect.c | 3 +-
mm/userfaultfd.c | 309 ++++++++
net/sunrpc/sched.c | 2 +-
30 files changed, 2082 insertions(+), 50 deletions(-)
create mode 100644 Documentation/vm/userfaultfd.txt
create mode 100644 fs/userfaultfd.c
create mode 100644 include/linux/userfaultfd_k.h
create mode 100644 include/uapi/linux/userfaultfd.h
create mode 100644 mm/userfaultfd.c

Credits: partially funded by the Orbit EU project.


2015-05-14 17:33:25

by Andrea Arcangeli

Subject: [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt

Add documentation.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
Documentation/vm/userfaultfd.txt | 140 +++++++++++++++++++++++++++++++++++++++
1 file changed, 140 insertions(+)
create mode 100644 Documentation/vm/userfaultfd.txt

diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
new file mode 100644
index 0000000..c2f5145
--- /dev/null
+++ b/Documentation/vm/userfaultfd.txt
@@ -0,0 +1,140 @@
+= Userfaultfd =
+
+== Objective ==
+
+Userfaults allow the implementation of on-demand paging from userland
+and more generally they allow userland to take control of various
+memory page faults, something otherwise only the kernel code could do.
+
+For example userfaults allow a proper and more optimal implementation
+of the PROT_NONE+SIGSEGV trick.
+
+== Design ==
+
+Userfaults are delivered and resolved through the userfaultfd syscall.
+
+The userfaultfd (aside from registering and unregistering virtual
+memory ranges) provides two primary functionalities:
+
+1) read/POLLIN protocol to notify a userland thread of the faults
+ happening
+
+2) various UFFDIO_* ioctls that can manage the virtual memory regions
+ registered in the userfaultfd, allowing userland to efficiently
+ resolve the userfaults it receives via 1) or to manage the virtual
+ memory in the background
+
+The real advantage of userfaults compared to regular virtual memory
+management (mremap/mprotect) is that the userfaults in all their
+operations never involve heavyweight structures like vmas (in fact the
+userfaultfd runtime load never takes the mmap_sem for writing).
+
+Vmas are not suitable for page- (or hugepage) granular fault tracking
+when dealing with virtual address spaces that could span
+Terabytes. Too many vmas would be needed for that.
+
+The userfaultfd, once opened by invoking the syscall, can also be
+passed using unix domain sockets to a manager process, so the same
+manager process could handle the userfaults of a multitude of
+different processes without them being aware of what is going on
+(well of course unless they later try to use the userfaultfd
+themselves on the same region the manager is already tracking, which
+is a corner case that would currently return -EBUSY).
+
+== API ==
+
+When first opened the userfaultfd must be enabled by invoking the
+UFFDIO_API ioctl with a uffdio_api.api value set to UFFD_API (or
+a later API version), which selects the read/POLLIN protocol
+userland intends to speak on the UFFD. The UFFDIO_API ioctl, if
+successful (i.e. if the requested uffdio_api.api is spoken also by the
+running kernel), will return in uffdio_api.features and
+uffdio_api.ioctls two 64bit bitmasks: respectively the activated
+features of the read(2) protocol and the generic ioctls available.
+
+Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
+be invoked (if present in the returned uffdio_api.ioctls bitmask) to
+register a memory range in the userfaultfd by setting the
+uffdio_register structure accordingly. The uffdio_register.mode
+bitmask will specify to the kernel which kind of faults to track for
+the range (UFFDIO_REGISTER_MODE_MISSING would track missing
+pages). The UFFDIO_REGISTER ioctl will return the
+uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
+userfaults on the range registered. Not all ioctls will necessarily be
+supported for all memory types depending on the underlying virtual
+memory backend (anonymous memory vs tmpfs vs real filebacked
+mappings).
+
+Userland can use the uffdio_register.ioctls to manage the virtual
+address space in the background (to add or potentially also remove
+memory from the userfaultfd registered range). This means a userfault
+could be triggering just before userland maps the user-faulted page
+in the background.
+
+The primary ioctl to resolve userfaults is UFFDIO_COPY. It
+atomically copies a page into the userfault registered range and wakes
+up the blocked userfaults (unless uffdio_copy.mode &
+UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to
+UFFDIO_COPY.
+
+== QEMU/KVM ==
+
+QEMU/KVM is using the userfaultfd syscall to implement postcopy live
+migration. Postcopy live migration is one form of memory
+externalization consisting of a virtual machine running with part or
+all of its memory residing on a different node in the cloud. The
+userfaultfd abstraction is generic enough that not a single line of
+KVM kernel code had to be modified in order to add postcopy live
+migration to QEMU.
+
+Guest async page faults, FOLL_NOWAIT and all other GUP features work
+just fine in combination with userfaults. Userfaults trigger async
+page faults in the guest scheduler so those guest processes that
+aren't waiting for userfaults (i.e. network bound) can keep running in
+the guest vcpus.
+
+It is generally beneficial to run one pass of precopy live migration
+just before starting postcopy live migration, in order to avoid
+generating userfaults for readonly guest regions.
+
+The implementation of postcopy live migration currently uses one
+single bidirectional socket but in the future two different sockets
+will be used (to reduce the latency of the userfaults to the minimum
+possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
+
+The QEMU in the source node writes all pages that it knows are missing
+in the destination node into the socket, and the migration thread of
+the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
+ioctls on the userfaultfd in order to map the received pages into the
+guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).
+
+A different postcopy thread in the destination node listens on the
+userfaultfd with poll() in parallel. When a POLLIN event is
+generated after a userfault triggers, the postcopy thread read()s from
+the userfaultfd and receives the fault address (or -EAGAIN in case the
+userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run
+by the parallel QEMU migration thread).
+
+After the QEMU postcopy thread (running in the destination node) gets
+the userfault address it writes the information about the missing page
+into the socket. The QEMU source node receives the information and
+roughly "seeks" to that page address and continues sending all
+remaining missing pages from that new page offset. Soon after that
+(just the time to flush the tcp_wmem queue through the network) the
+migration thread in the QEMU running in the destination node will
+receive the page that triggered the userfault and it'll map it as
+usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
+was spontaneously sent by the source or if it was an urgent page
+requested through a userfault).
+
+Once the userfaults start, the QEMU in the destination node doesn't
+need to keep around any per-page state bitmap relative to the live
+migration; only a single per-page bitmap has to be maintained in
+the QEMU running in the source node, to know which pages are still
+missing in the destination node. The bitmap in the source node is
+checked to find which missing pages to send in round robin and we seek
+over it when receiving incoming userfaults. After sending each page of
+course the bitmap is updated accordingly. It's also useful to avoid
+sending the same page twice (in case the userfault is read by the
+postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
+thread).
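
For readers who prefer code to prose, the register/read/copy cycle
described in the API section above looks roughly like the sketch
below. It is only illustrative: it assumes __NR_userfaultfd is wired
up as in the syscall patches of this series, UFFDIO_COPY and struct
uffdio_copy come from later patches, read() here returns the raw
__u64 fault address as in the early patches (a later patch in the
series changes the read format to a uffd_msg structure), "area",
"len", "page" and "page_size" are placeholders, and all error
handling is omitted.

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/userfaultfd.h>

    static void handle_misses(void *area, unsigned long len,
                              void *page, unsigned long page_size)
    {
            /* blocking reads; pass O_NONBLOCK too if poll() is desired */
            int ufd = syscall(__NR_userfaultfd, O_CLOEXEC);
            struct uffdio_api api = { .api = UFFD_API };
            struct uffdio_register reg = {
                    .range = {
                            .start = (unsigned long) area,
                            .len = len,
                    },
                    .mode = UFFDIO_REGISTER_MODE_MISSING,
            };
            __u64 addr;

            /* handshake on the protocol version, then arm the range */
            ioctl(ufd, UFFDIO_API, &api);
            ioctl(ufd, UFFDIO_REGISTER, &reg);

            while (read(ufd, &addr, sizeof(addr)) == sizeof(addr)) {
                    struct uffdio_copy copy = {
                            /* strip the UFFD_BIT_* flags below PAGE_SHIFT */
                            .dst = addr & ~(__u64)(page_size - 1),
                            /* "page" already holds the fetched contents */
                            .src = (unsigned long) page,
                            .len = page_size,
                    };

                    ioctl(ufd, UFFDIO_COPY, &copy);
            }
    }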

2015-05-14 17:31:59

by Andrea Arcangeli

Subject: [PATCH 02/23] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key

userfaultfd needs to wake all waiters in a waitqueue (pass 0 as the nr
parameter), instead of the current hardcoded 1 (which would wake just
the first waiter in the head list).

Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/wait.h | 5 +++--
kernel/sched/wait.c | 7 ++++---
net/sunrpc/sched.c | 2 +-
3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index d69ac4e..c48c0e1 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -147,7 +147,8 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)

typedef int wait_bit_action_f(struct wait_bit_key *);
void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+ void *key);
void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
@@ -179,7 +180,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
#define wake_up_poll(x, m) \
__wake_up(x, TASK_NORMAL, 1, (void *) (m))
#define wake_up_locked_poll(x, m) \
- __wake_up_locked_key((x), TASK_NORMAL, (void *) (m))
+ __wake_up_locked_key((x), TASK_NORMAL, 1, (void *) (m))
#define wake_up_interruptible_poll(x, m) \
__wake_up(x, TASK_INTERRUPTIBLE, 1, (void *) (m))
#define wake_up_interruptible_sync_poll(x, m) \
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 2ccec98..d48513e 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -106,9 +106,10 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
}
EXPORT_SYMBOL_GPL(__wake_up_locked);

-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+ void *key)
{
- __wake_up_common(q, mode, 1, 0, key);
+ __wake_up_common(q, mode, nr, 0, key);
}
EXPORT_SYMBOL_GPL(__wake_up_locked_key);

@@ -283,7 +284,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
if (!list_empty(&wait->task_list))
list_del_init(&wait->task_list);
else if (waitqueue_active(q))
- __wake_up_locked_key(q, mode, key);
+ __wake_up_locked_key(q, mode, 1, key);
spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL(abort_exclusive_wait);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 337ca85..b140c09 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -297,7 +297,7 @@ static int rpc_complete_task(struct rpc_task *task)
clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
ret = atomic_dec_and_test(&task->tk_count);
if (waitqueue_active(wq))
- __wake_up_locked_key(wq, TASK_NORMAL, &k);
+ __wake_up_locked_key(wq, TASK_NORMAL, 1, &k);
spin_unlock_irqrestore(&wq->lock, flags);
return ret;
}

2015-05-14 17:31:28

by Andrea Arcangeli

Subject: [PATCH 03/23] userfaultfd: uAPI

Defines the uAPI of the userfaultfd, notably the ioctl numbers and protocol.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
Documentation/ioctl/ioctl-number.txt | 1 +
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/userfaultfd.h | 81 ++++++++++++++++++++++++++++++++++++
3 files changed, 83 insertions(+)
create mode 100644 include/uapi/linux/userfaultfd.h

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index ec7c81b..ed950ad 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -302,6 +302,7 @@ Code Seq#(hex) Include File Comments
0xA3 80-8F Port ACL in development:
<mailto:[email protected]>
0xA3 90-9F linux/dtlk.h
+0xAA 00-3F linux/uapi/linux/userfaultfd.h
0xAB 00-1F linux/nbd.h
0xAC 00-1F linux/raw.h
0xAD 00 Netfilter device in development:
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 4842a98..47547ea 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -452,3 +452,4 @@ header-y += xfrm.h
header-y += xilinx-v4l2-controls.h
header-y += zorro.h
header-y += zorro_ids.h
+header-y += userfaultfd.h
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
new file mode 100644
index 0000000..9a8cd56
--- /dev/null
+++ b/include/uapi/linux/userfaultfd.h
@@ -0,0 +1,81 @@
+/*
+ * include/linux/userfaultfd.h
+ *
+ * Copyright (C) 2007 Davide Libenzi <[email protected]>
+ * Copyright (C) 2015 Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_H
+#define _LINUX_USERFAULTFD_H
+
+#define UFFD_API ((__u64)0xAA)
+/* FIXME: add "|UFFD_BIT_WP" to UFFD_API_BITS after implementing it */
+#define UFFD_API_BITS (UFFD_BIT_WRITE)
+#define UFFD_API_IOCTLS \
+ ((__u64)1 << _UFFDIO_REGISTER | \
+ (__u64)1 << _UFFDIO_UNREGISTER | \
+ (__u64)1 << _UFFDIO_API)
+#define UFFD_API_RANGE_IOCTLS \
+ ((__u64)1 << _UFFDIO_WAKE)
+
+/*
+ * Valid ioctl command number range with this API is from 0x00 to
+ * 0x3F. UFFDIO_API is the fixed number, everything else can be
+ * changed by implementing a different UFFD_API. If sticking to the
+ * same UFFD_API more ioctl can be added and userland will be aware of
+ * which ioctl the running kernel implements through the ioctl command
+ * bitmask written by the UFFDIO_API.
+ */
+#define _UFFDIO_REGISTER (0x00)
+#define _UFFDIO_UNREGISTER (0x01)
+#define _UFFDIO_WAKE (0x02)
+#define _UFFDIO_API (0x3F)
+
+/* userfaultfd ioctl ids */
+#define UFFDIO 0xAA
+#define UFFDIO_API _IOWR(UFFDIO, _UFFDIO_API, \
+ struct uffdio_api)
+#define UFFDIO_REGISTER _IOWR(UFFDIO, _UFFDIO_REGISTER, \
+ struct uffdio_register)
+#define UFFDIO_UNREGISTER _IOR(UFFDIO, _UFFDIO_UNREGISTER, \
+ struct uffdio_range)
+#define UFFDIO_WAKE _IOR(UFFDIO, _UFFDIO_WAKE, \
+ struct uffdio_range)
+
+/*
+ * Valid bits below PAGE_SHIFT in the userfault address read through
+ * the read() syscall.
+ */
+#define UFFD_BIT_WRITE (1<<0) /* this was a write fault, MISSING or WP */
+#define UFFD_BIT_WP (1<<1) /* handle_userfault() reason VM_UFFD_WP */
+#define UFFD_BITS 2 /* two above bits used for UFFD_BIT_* mask */
+
+struct uffdio_api {
+ /* userland asks for an API number */
+ __u64 api;
+
+ /* kernel answers below with the available features for the API */
+ __u64 bits;
+ __u64 ioctls;
+};
+
+struct uffdio_range {
+ __u64 start;
+ __u64 len;
+};
+
+struct uffdio_register {
+ struct uffdio_range range;
+#define UFFDIO_REGISTER_MODE_MISSING ((__u64)1<<0)
+#define UFFDIO_REGISTER_MODE_WP ((__u64)1<<1)
+ __u64 mode;
+
+ /*
+ * kernel answers which ioctl commands are available for the
+ * range, keep at the end as the last 8 bytes aren't read.
+ */
+ __u64 ioctls;
+};
+
+#endif /* _LINUX_USERFAULTFD_H */
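
As the comment above explains, userland discovers which ioctls the
running kernel implements by testing the returned bitmasks against the
_UFFDIO_* command numbers. An illustrative fragment ("ufd" being the
userfaultfd file descriptor, inside a function returning int):

    int have_register;
    struct uffdio_api api = { .api = UFFD_API };

    if (ioctl(ufd, UFFDIO_API, &api))
            return -1;      /* API revision not spoken by the kernel */

    have_register = !!(api.ioctls & ((__u64)1 << _UFFDIO_REGISTER));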

2015-05-14 17:36:24

by Andrea Arcangeli

Subject: [PATCH 04/23] userfaultfd: linux/userfaultfd_k.h

Kernel header defining the methods needed by the VM common code to
interact with the userfaultfd.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/userfaultfd_k.h | 79 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 79 insertions(+)
create mode 100644 include/linux/userfaultfd_k.h

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
new file mode 100644
index 0000000..e1e4360
--- /dev/null
+++ b/include/linux/userfaultfd_k.h
@@ -0,0 +1,79 @@
+/*
+ * include/linux/userfaultfd_k.h
+ *
+ * Copyright (C) 2015 Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_K_H
+#define _LINUX_USERFAULTFD_K_H
+
+#ifdef CONFIG_USERFAULTFD
+
+#include <linux/userfaultfd.h> /* linux/include/uapi/linux/userfaultfd.h */
+
+#include <linux/fcntl.h>
+
+/*
+ * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
+ * new flags, since they might collide with O_* ones. We want
+ * to re-use O_* flags that couldn't possibly have a meaning
+ * from userfaultfd, in order to leave a free define-space for
+ * shared O_* flags.
+ */
+#define UFFD_CLOEXEC O_CLOEXEC
+#define UFFD_NONBLOCK O_NONBLOCK
+
+#define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
+
+extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags, unsigned long reason);
+
+/* mm helpers */
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+ struct vm_userfaultfd_ctx vm_ctx)
+{
+ return vma->vm_userfaultfd_ctx.ctx == vm_ctx.ctx;
+}
+
+static inline bool userfaultfd_missing(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & VM_UFFD_MISSING;
+}
+
+static inline bool userfaultfd_armed(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
+}
+
+#else /* CONFIG_USERFAULTFD */
+
+/* mm helpers */
+static inline int handle_userfault(struct vm_area_struct *vma,
+ unsigned long address,
+ unsigned int flags,
+ unsigned long reason)
+{
+ return VM_FAULT_SIGBUS;
+}
+
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+ struct vm_userfaultfd_ctx vm_ctx)
+{
+ return true;
+}
+
+static inline bool userfaultfd_missing(struct vm_area_struct *vma)
+{
+ return false;
+}
+
+static inline bool userfaultfd_armed(struct vm_area_struct *vma)
+{
+ return false;
+}
+
+#endif /* CONFIG_USERFAULTFD */
+
+#endif /* _LINUX_USERFAULTFD_K_H */
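
The intended call pattern from the page fault handlers (wired up in a
later patch of this series) is along these lines, taken from the
anonymous memory fault path:

    /* Deliver the page fault to userland, checked under the PT lock */
    if (userfaultfd_missing(vma)) {
            pte_unmap_unlock(page_table, ptl);
            return handle_userfault(vma, address, flags,
                                    VM_UFFD_MISSING);
    }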

2015-05-14 17:31:32

by Andrea Arcangeli

Subject: [PATCH 05/23] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct

This adds the vm_userfaultfd_ctx to the vm_area_struct.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/mm_types.h | 11 +++++++++++
kernel/fork.c | 1 +
2 files changed, 12 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7..2836da7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -265,6 +265,16 @@ struct vm_region {
* this region */
};

+#ifdef CONFIG_USERFAULTFD
+#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) { NULL, })
+struct vm_userfaultfd_ctx {
+ struct userfaultfd_ctx *ctx;
+};
+#else /* CONFIG_USERFAULTFD */
+#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) {})
+struct vm_userfaultfd_ctx {};
+#endif /* CONFIG_USERFAULTFD */
+
/*
* This struct defines a memory VMM memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
@@ -331,6 +341,7 @@ struct vm_area_struct {
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
};

struct core_thread {
diff --git a/kernel/fork.c b/kernel/fork.c
index 1481cdd..430141b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -451,6 +451,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
goto fail_nomem_anon_vma_fork;
tmp->vm_flags &= ~VM_LOCKED;
tmp->vm_next = tmp->vm_prev = NULL;
+ tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
file = tmp->vm_file;
if (file) {
struct inode *inode = file_inode(file);

2015-05-14 17:31:53

by Andrea Arcangeli

Subject: [PATCH 06/23] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP

These two flags get set in vma->vm_flags to tell the VM common code
whether the userfaultfd is armed and in which mode (only tracking
missing faults, only tracking wrprotect faults, or both). If neither
flag is set, the userfaultfd is not armed on the vma.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/proc/task_mmu.c | 2 ++
include/linux/mm.h | 2 ++
kernel/fork.c | 2 +-
3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6dee68d..58be92e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -597,6 +597,8 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
[ilog2(VM_HUGEPAGE)] = "hg",
[ilog2(VM_NOHUGEPAGE)] = "nh",
[ilog2(VM_MERGEABLE)] = "mg",
+ [ilog2(VM_UFFD_MISSING)]= "um",
+ [ilog2(VM_UFFD_WP)] = "uw",
};
size_t i;

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2923a51..7fccd22 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -123,8 +123,10 @@ extern unsigned int kobjsize(const void *objp);
#define VM_MAYSHARE 0x00000080

#define VM_GROWSDOWN 0x00000100 /* general info on the segment */
+#define VM_UFFD_MISSING 0x00000200 /* missing pages tracking */
#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
#define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
+#define VM_UFFD_WP 0x00001000 /* wrprotect pages tracking */

#define VM_LOCKED 0x00002000
#define VM_IO 0x00004000 /* Memory mapped I/O or similar */
diff --git a/kernel/fork.c b/kernel/fork.c
index 430141b..56d1ddf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -449,7 +449,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
tmp->vm_mm = mm;
if (anon_vma_fork(tmp, mpnt))
goto fail_nomem_anon_vma_fork;
- tmp->vm_flags &= ~VM_LOCKED;
+ tmp->vm_flags &= ~(VM_LOCKED|VM_UFFD_MISSING|VM_UFFD_WP);
tmp->vm_next = tmp->vm_prev = NULL;
tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
file = tmp->vm_file;

2015-05-14 17:33:57

by Andrea Arcangeli

Subject: [PATCH 07/23] userfaultfd: call handle_userfault() for userfaultfd_missing() faults

This is where the page fault paths are modified to call
handle_userfault() if userfaultfd_missing() is true (i.e. if
vma->vm_flags has VM_UFFD_MISSING set).

handle_userfault() then takes care of blocking the page fault and
delivering it to userland.

The fault flags must also be passed as a parameter so the "read|write"
kind of fault can be reported to userland.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 69 ++++++++++++++++++++++++++++++++++++++------------------
mm/memory.c | 16 +++++++++++++
2 files changed, 63 insertions(+), 22 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cb8904c..c221be3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
#include <linux/pagemap.h>
#include <linux/migrate.h>
#include <linux/hashtable.h>
+#include <linux/userfaultfd_k.h>

#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -717,7 +718,8 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long haddr, pmd_t *pmd,
- struct page *page, gfp_t gfp)
+ struct page *page, gfp_t gfp,
+ unsigned int flags)
{
struct mem_cgroup *memcg;
pgtable_t pgtable;
@@ -725,12 +727,16 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,

VM_BUG_ON_PAGE(!PageCompound(page), page);

- if (mem_cgroup_try_charge(page, mm, gfp, &memcg))
- return VM_FAULT_OOM;
+ if (mem_cgroup_try_charge(page, mm, gfp, &memcg)) {
+ put_page(page);
+ count_vm_event(THP_FAULT_FALLBACK);
+ return VM_FAULT_FALLBACK;
+ }

pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable)) {
mem_cgroup_cancel_charge(page, memcg);
+ put_page(page);
return VM_FAULT_OOM;
}

@@ -750,6 +756,21 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
pte_free(mm, pgtable);
} else {
pmd_t entry;
+
+ /* Deliver the page fault to userland */
+ if (userfaultfd_missing(vma)) {
+ int ret;
+
+ spin_unlock(ptl);
+ mem_cgroup_cancel_charge(page, memcg);
+ put_page(page);
+ pte_free(mm, pgtable);
+ ret = handle_userfault(vma, haddr, flags,
+ VM_UFFD_MISSING);
+ VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+ return ret;
+ }
+
entry = mk_huge_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
page_add_new_anon_rmap(page, vma, haddr);
@@ -760,6 +781,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
atomic_long_inc(&mm->nr_ptes);
spin_unlock(ptl);
+ count_vm_event(THP_FAULT_ALLOC);
}

return 0;
@@ -771,19 +793,16 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
}

/* Caller must hold page table lock. */
-static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
struct page *zero_page)
{
pmd_t entry;
- if (!pmd_none(*pmd))
- return false;
entry = mk_pmd(zero_page, vma->vm_page_prot);
entry = pmd_mkhuge(entry);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, haddr, pmd, entry);
atomic_long_inc(&mm->nr_ptes);
- return true;
}

int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -806,6 +825,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
pgtable_t pgtable;
struct page *zero_page;
bool set;
+ int ret;
pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable))
return VM_FAULT_OOM;
@@ -816,14 +836,28 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_FALLBACK;
}
ptl = pmd_lock(mm, pmd);
- set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
- zero_page);
- spin_unlock(ptl);
+ ret = 0;
+ set = false;
+ if (pmd_none(*pmd)) {
+ if (userfaultfd_missing(vma)) {
+ spin_unlock(ptl);
+ ret = handle_userfault(vma, haddr, flags,
+ VM_UFFD_MISSING);
+ VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+ } else {
+ set_huge_zero_page(pgtable, mm, vma,
+ haddr, pmd,
+ zero_page);
+ spin_unlock(ptl);
+ set = true;
+ }
+ } else
+ spin_unlock(ptl);
if (!set) {
pte_free(mm, pgtable);
put_huge_zero_page();
}
- return 0;
+ return ret;
}
gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma), 0);
page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
@@ -831,14 +865,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}
- if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, gfp))) {
- put_page(page);
- count_vm_event(THP_FAULT_FALLBACK);
- return VM_FAULT_FALLBACK;
- }
-
- count_vm_event(THP_FAULT_ALLOC);
- return 0;
+ return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, gfp, flags);
}

int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -873,16 +900,14 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
*/
if (is_huge_zero_pmd(pmd)) {
struct page *zero_page;
- bool set;
/*
* get_huge_zero_page() will never allocate a new page here,
* since we already have a zero page to copy. It just takes a
* reference.
*/
zero_page = get_huge_zero_page();
- set = set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+ set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
zero_page);
- BUG_ON(!set); /* unexpected !pmd_none(dst_pmd) */
ret = 0;
goto out_unlock;
}
diff --git a/mm/memory.c b/mm/memory.c
index be9f075..790e6f7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -61,6 +61,7 @@
#include <linux/string.h>
#include <linux/dma-debug.h>
#include <linux/debugfs.h>
+#include <linux/userfaultfd_k.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -2680,6 +2681,12 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!pte_none(*page_table))
goto unlock;
+ /* Deliver the page fault to userland, check inside PT lock */
+ if (userfaultfd_missing(vma)) {
+ pte_unmap_unlock(page_table, ptl);
+ return handle_userfault(vma, address, flags,
+ VM_UFFD_MISSING);
+ }
goto setpte;
}

@@ -2707,6 +2714,15 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (!pte_none(*page_table))
goto release;

+ /* Deliver the page fault to userland, check inside PT lock */
+ if (userfaultfd_missing(vma)) {
+ pte_unmap_unlock(page_table, ptl);
+ mem_cgroup_cancel_charge(page, memcg);
+ page_cache_release(page);
+ return handle_userfault(vma, address, flags,
+ VM_UFFD_MISSING);
+ }
+
inc_mm_counter_fast(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address);
mem_cgroup_commit_charge(page, memcg, false);

2015-05-14 17:32:05

by Andrea Arcangeli

Subject: [PATCH 08/23] userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx

vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
must be aware of, so that vmas can be merged back the way they were
originally before arming the userfaultfd on some memory range.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/mm.h | 2 +-
mm/madvise.c | 3 ++-
mm/mempolicy.c | 4 ++--
mm/mlock.c | 3 ++-
mm/mmap.c | 40 +++++++++++++++++++++++++++-------------
mm/mprotect.c | 3 ++-
6 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7fccd22..76376e0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1763,7 +1763,7 @@ extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
extern struct vm_area_struct *vma_merge(struct mm_struct *,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
- struct mempolicy *);
+ struct mempolicy *, struct vm_userfaultfd_ctx);
extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
extern int split_vma(struct mm_struct *,
struct vm_area_struct *, unsigned long addr, int new_below);
diff --git a/mm/madvise.c b/mm/madvise.c
index 22e8f0c..d215ea9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -111,7 +111,8 @@ static long madvise_behavior(struct vm_area_struct *vma,

pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
- vma->vm_file, pgoff, vma_policy(vma));
+ vma->vm_file, pgoff, vma_policy(vma),
+ vma->vm_userfaultfd_ctx);
if (*prev) {
vma = *prev;
goto success;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7477432..eeec77a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -722,8 +722,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
pgoff = vma->vm_pgoff +
((vmstart - vma->vm_start) >> PAGE_SHIFT);
prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
- vma->anon_vma, vma->vm_file, pgoff,
- new_pol);
+ vma->anon_vma, vma->vm_file, pgoff,
+ new_pol, vma->vm_userfaultfd_ctx);
if (prev) {
vma = prev;
next = vma->vm_next;
diff --git a/mm/mlock.c b/mm/mlock.c
index 6fd2cf1..25936680 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -510,7 +510,8 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,

pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
- vma->vm_file, pgoff, vma_policy(vma));
+ vma->vm_file, pgoff, vma_policy(vma),
+ vma->vm_userfaultfd_ctx);
if (*prev) {
vma = *prev;
goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index 9c1c5b9..5f09a7e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -41,6 +41,7 @@
#include <linux/notifier.h>
#include <linux/memory.h>
#include <linux/printk.h>
+#include <linux/userfaultfd_k.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -919,7 +920,8 @@ again: remove_next = 1 + (end > next->vm_end);
* per-vma resources, so we don't attempt to merge those.
*/
static inline int is_mergeable_vma(struct vm_area_struct *vma,
- struct file *file, unsigned long vm_flags)
+ struct file *file, unsigned long vm_flags,
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
{
/*
* VM_SOFTDIRTY should not prevent from VMA merging, if we
@@ -935,6 +937,8 @@ static inline int is_mergeable_vma(struct vm_area_struct *vma,
return 0;
if (vma->vm_ops && vma->vm_ops->close)
return 0;
+ if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
+ return 0;
return 1;
}

@@ -965,9 +969,11 @@ static inline int is_mergeable_anon_vma(struct anon_vma *anon_vma1,
*/
static int
can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
- struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
+ struct anon_vma *anon_vma, struct file *file,
+ pgoff_t vm_pgoff,
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
{
- if (is_mergeable_vma(vma, file, vm_flags) &&
+ if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
if (vma->vm_pgoff == vm_pgoff)
return 1;
@@ -984,9 +990,11 @@ can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
*/
static int
can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
- struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
+ struct anon_vma *anon_vma, struct file *file,
+ pgoff_t vm_pgoff,
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
{
- if (is_mergeable_vma(vma, file, vm_flags) &&
+ if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
pgoff_t vm_pglen;
vm_pglen = vma_pages(vma);
@@ -1029,7 +1037,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
struct vm_area_struct *prev, unsigned long addr,
unsigned long end, unsigned long vm_flags,
struct anon_vma *anon_vma, struct file *file,
- pgoff_t pgoff, struct mempolicy *policy)
+ pgoff_t pgoff, struct mempolicy *policy,
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
{
pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
struct vm_area_struct *area, *next;
@@ -1056,14 +1065,17 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
if (prev && prev->vm_end == addr &&
mpol_equal(vma_policy(prev), policy) &&
can_vma_merge_after(prev, vm_flags,
- anon_vma, file, pgoff)) {
+ anon_vma, file, pgoff,
+ vm_userfaultfd_ctx)) {
/*
* OK, it can. Can we now merge in the successor as well?
*/
if (next && end == next->vm_start &&
mpol_equal(policy, vma_policy(next)) &&
can_vma_merge_before(next, vm_flags,
- anon_vma, file, pgoff+pglen) &&
+ anon_vma, file,
+ pgoff+pglen,
+ vm_userfaultfd_ctx) &&
is_mergeable_anon_vma(prev->anon_vma,
next->anon_vma, NULL)) {
/* cases 1, 6 */
@@ -1084,7 +1096,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
if (next && end == next->vm_start &&
mpol_equal(policy, vma_policy(next)) &&
can_vma_merge_before(next, vm_flags,
- anon_vma, file, pgoff+pglen)) {
+ anon_vma, file, pgoff+pglen,
+ vm_userfaultfd_ctx)) {
if (prev && addr < prev->vm_end) /* case 4 */
err = vma_adjust(prev, prev->vm_start,
addr, prev->vm_pgoff, NULL);
@@ -1570,8 +1583,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
/*
* Can we just expand an old mapping?
*/
- vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff,
- NULL);
+ vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
+ NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
if (vma)
goto out;

@@ -2757,7 +2770,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)

/* Can we just expand an old private anonymous mapping? */
vma = vma_merge(mm, prev, addr, addr + len, flags,
- NULL, NULL, pgoff, NULL);
+ NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX);
if (vma)
goto out;

@@ -2913,7 +2926,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
return NULL; /* should never get here */
new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
- vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
+ vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+ vma->vm_userfaultfd_ctx);
if (new_vma) {
/*
* Source vma may have been merged into new_vma
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e7d6f11..ef5be8e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -292,7 +292,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
*/
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*pprev = vma_merge(mm, *pprev, start, end, newflags,
- vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
+ vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+ vma->vm_userfaultfd_ctx);
if (*pprev) {
vma = *pprev;
goto success;

2015-05-14 17:34:57

by Andrea Arcangeli

Subject: [PATCH 09/23] userfaultfd: prevent khugepaged to merge if userfaultfd is armed

If userfaultfd is armed on a certain vma we can't "fill" the holes
with zeroes or we'll break the userland on-demand paging. If the
userfault is armed, the holes are really missing information (not
zeroes) that userland has to load from the network or elsewhere.

The same issue happens for wrprotected ptes that we can't just convert
into a single writable pmd_trans_huge.

We could however in theory still merge across zeropages if only
VM_UFFD_MISSING is set (so if VM_UFFD_WP is not set)... that could be
slightly improved but it'd be much more complex code for a tiny corner
case.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c221be3..9671f51 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2198,7 +2198,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
_pte++, address += PAGE_SIZE) {
pte_t pteval = *_pte;
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
- if (++none_or_zero <= khugepaged_max_ptes_none)
+ if (!userfaultfd_armed(vma) &&
+ ++none_or_zero <= khugepaged_max_ptes_none)
continue;
else
goto out;
@@ -2651,7 +2652,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
_pte++, _address += PAGE_SIZE) {
pte_t pteval = *_pte;
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
- if (++none_or_zero <= khugepaged_max_ptes_none)
+ if (!userfaultfd_armed(vma) &&
+ ++none_or_zero <= khugepaged_max_ptes_none)
continue;
else
goto out_unmap;

2015-05-14 17:32:44

by Andrea Arcangeli

Subject: [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization

Once a userfaultfd has been created and certain regions of the process
virtual address space have been registered with it, the thread
responsible for doing the memory externalization can manage the page
faults in userland by talking to the kernel using the userfaultfd
protocol.

poll() can be used to know when there are new pending userfaults to be
read (POLLIN).

Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/userfaultfd.c | 1008 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 1008 insertions(+)
create mode 100644 fs/userfaultfd.c

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
new file mode 100644
index 0000000..1c9be61
--- /dev/null
+++ b/fs/userfaultfd.c
@@ -0,0 +1,1008 @@
+/*
+ * fs/userfaultfd.c
+ *
+ * Copyright (C) 2007 Davide Libenzi <[email protected]>
+ * Copyright (C) 2008-2009 Red Hat, Inc.
+ * Copyright (C) 2015 Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ * Some part derived from fs/eventfd.c (anon inode setup) and
+ * mm/ksm.c (mm hashing).
+ */
+
+#include <linux/hashtable.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/file.h>
+#include <linux/bug.h>
+#include <linux/anon_inodes.h>
+#include <linux/syscalls.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/mempolicy.h>
+#include <linux/ioctl.h>
+#include <linux/security.h>
+
+enum userfaultfd_state {
+ UFFD_STATE_WAIT_API,
+ UFFD_STATE_RUNNING,
+};
+
+struct userfaultfd_ctx {
+ /* pseudo fd refcounting */
+ atomic_t refcount;
+ /* waitqueue head for the userfaultfd page faults */
+ wait_queue_head_t fault_wqh;
+ /* waitqueue head for the pseudo fd to wakeup poll/read */
+ wait_queue_head_t fd_wqh;
+ /* userfaultfd syscall flags */
+ unsigned int flags;
+ /* state machine */
+ enum userfaultfd_state state;
+ /* released */
+ bool released;
+ /* mm with one or more vmas attached to this userfaultfd_ctx */
+ struct mm_struct *mm;
+};
+
+struct userfaultfd_wait_queue {
+ unsigned long address;
+ wait_queue_t wq;
+ bool pending;
+ struct userfaultfd_ctx *ctx;
+};
+
+struct userfaultfd_wake_range {
+ unsigned long start;
+ unsigned long len;
+};
+
+static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
+ int wake_flags, void *key)
+{
+ struct userfaultfd_wake_range *range = key;
+ int ret;
+ struct userfaultfd_wait_queue *uwq;
+ unsigned long start, len;
+
+ uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+ ret = 0;
+ /* don't wake the pending ones to avoid reads to block */
+ if (uwq->pending && !ACCESS_ONCE(uwq->ctx->released))
+ goto out;
+ /* len == 0 means wake all */
+ start = range->start;
+ len = range->len;
+ if (len && (start > uwq->address || start + len <= uwq->address))
+ goto out;
+ ret = wake_up_state(wq->private, mode);
+ if (ret)
+ /* wake only once, autoremove behavior */
+ list_del_init(&wq->task_list);
+out:
+ return ret;
+}
+
+/**
+ * userfaultfd_ctx_get - Acquires a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to the userfaultfd context.
+ *
+ * Returns: In case of success, returns not zero.
+ */
+static void userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
+{
+ if (!atomic_inc_not_zero(&ctx->refcount))
+ BUG();
+}
+
+/**
+ * userfaultfd_ctx_put - Releases a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to userfaultfd context.
+ *
+ * The userfaultfd context reference must have been previously acquired either
+ * with userfaultfd_ctx_get() or userfaultfd_ctx_fdget().
+ */
+static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
+{
+ if (atomic_dec_and_test(&ctx->refcount)) {
+ VM_BUG_ON(spin_is_locked(&ctx->fault_pending_wqh.lock));
+ VM_BUG_ON(waitqueue_active(&ctx->fault_pending_wqh));
+ VM_BUG_ON(spin_is_locked(&ctx->fault_wqh.lock));
+ VM_BUG_ON(waitqueue_active(&ctx->fault_wqh));
+ VM_BUG_ON(spin_is_locked(&ctx->fd_wqh.lock));
+ VM_BUG_ON(waitqueue_active(&ctx->fd_wqh));
+ mmput(ctx->mm);
+ kfree(ctx);
+ }
+}
+
+static inline unsigned long userfault_address(unsigned long address,
+ unsigned int flags,
+ unsigned long reason)
+{
+ BUILD_BUG_ON(PAGE_SHIFT < UFFD_BITS);
+ address &= PAGE_MASK;
+ if (flags & FAULT_FLAG_WRITE)
+ /*
+ * Encode "write" fault information in the LSB of the
+ * address read by userland, without depending on
+ * FAULT_FLAG_WRITE kernel internal value.
+ */
+ address |= UFFD_BIT_WRITE;
+ if (reason & VM_UFFD_WP)
+ /*
+ * Encode "reason" fault information as bit number 1
+ * in the address read by userland. If bit number 1 is
+ * clear it means the reason is a VM_FAULT_MISSING
+ * fault.
+ */
+ address |= UFFD_BIT_WP;
+ return address;
+}
+
+/*
+ * The locking rules involved in returning VM_FAULT_RETRY depending on
+ * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
+ * FAULT_FLAG_KILLABLE are not straightforward. The "Caution"
+ * recommendation in __lock_page_or_retry is not an understatement.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
+ * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
+ * not set.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set but FAULT_FLAG_KILLABLE is not
+ * set, VM_FAULT_RETRY can still be returned if and only if there are
+ * fatal_signal_pending()s, and the mmap_sem must be released before
+ * returning it.
+ */
+int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags, unsigned long reason)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct userfaultfd_ctx *ctx;
+ struct userfaultfd_wait_queue uwq;
+
+ BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+ ctx = vma->vm_userfaultfd_ctx.ctx;
+ if (!ctx)
+ return VM_FAULT_SIGBUS;
+
+ BUG_ON(ctx->mm != mm);
+
+ VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
+ VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
+
+ /*
+ * If it's already released don't get it. This avoids to loop
+ * in __get_user_pages if userfaultfd_release waits on the
+ * caller of handle_userfault to release the mmap_sem.
+ */
+ if (unlikely(ACCESS_ONCE(ctx->released)))
+ return VM_FAULT_SIGBUS;
+
+ /*
+ * Check that we can return VM_FAULT_RETRY.
+ *
+ * NOTE: it should become possible to return VM_FAULT_RETRY
+ * even if FAULT_FLAG_TRIED is set without leading to gup()
+ * -EBUSY failures, if the userfaultfd is to be extended for
+ * VM_UFFD_WP tracking and we intend to arm the userfault
+ * without first stopping userland access to the memory. For
+ * VM_UFFD_MISSING userfaults this is enough for now.
+ */
+ if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
+ /*
+ * Validate the invariant that nowait must allow retry
+ * to be sure not to return SIGBUS erroneously on
+ * nowait invocations.
+ */
+ BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
+#ifdef CONFIG_DEBUG_VM
+ if (printk_ratelimit()) {
+ printk(KERN_WARNING
+ "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
+ dump_stack();
+ }
+#endif
+ return VM_FAULT_SIGBUS;
+ }
+
+ /*
+ * Handle nowait, not much to do other than tell it to retry
+ * and wait.
+ */
+ if (flags & FAULT_FLAG_RETRY_NOWAIT)
+ return VM_FAULT_RETRY;
+
+ /* take the reference before dropping the mmap_sem */
+ userfaultfd_ctx_get(ctx);
+
+ /* be gentle and immediately relinquish the mmap_sem */
+ up_read(&mm->mmap_sem);
+
+ init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
+ uwq.wq.private = current;
+ uwq.address = userfault_address(address, flags, reason);
+ uwq.pending = true;
+ uwq.ctx = ctx;
+
+ spin_lock(&ctx->fault_wqh.lock);
+ /*
+ * After the __add_wait_queue the uwq is visible to userland
+ * through poll/read().
+ */
+ __add_wait_queue(&ctx->fault_wqh, &uwq.wq);
+ for (;;) {
+ set_current_state(TASK_KILLABLE);
+ if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
+ fatal_signal_pending(current))
+ break;
+ spin_unlock(&ctx->fault_wqh.lock);
+
+ wake_up_poll(&ctx->fd_wqh, POLLIN);
+ schedule();
+
+ spin_lock(&ctx->fault_wqh.lock);
+ }
+ __remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
+ __set_current_state(TASK_RUNNING);
+ spin_unlock(&ctx->fault_wqh.lock);
+
+ /*
+ * ctx may go away after this if the userfault pseudo fd is
+ * already released.
+ */
+ userfaultfd_ctx_put(ctx);
+
+ return VM_FAULT_RETRY;
+}
+
+static int userfaultfd_release(struct inode *inode, struct file *file)
+{
+ struct userfaultfd_ctx *ctx = file->private_data;
+ struct mm_struct *mm = ctx->mm;
+ struct vm_area_struct *vma, *prev;
+ /* len == 0 means wake all */
+ struct userfaultfd_wake_range range = { .len = 0, };
+ unsigned long new_flags;
+
+ ACCESS_ONCE(ctx->released) = true;
+
+ /*
+ * Flush page faults out of all CPUs. NOTE: all page faults
+ * must be retried without returning VM_FAULT_SIGBUS if
+ * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
+ * changes while handle_userfault released the mmap_sem. So
+ * it's critical that released is set to true (above), before
+ * taking the mmap_sem for writing.
+ */
+ down_write(&mm->mmap_sem);
+ prev = NULL;
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ cond_resched();
+ BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
+ !!(vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+ if (vma->vm_userfaultfd_ctx.ctx != ctx) {
+ prev = vma;
+ continue;
+ }
+ new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
+ prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
+ new_flags, vma->anon_vma,
+ vma->vm_file, vma->vm_pgoff,
+ vma_policy(vma),
+ NULL_VM_UFFD_CTX);
+ if (prev)
+ vma = prev;
+ else
+ prev = vma;
+ vma->vm_flags = new_flags;
+ vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+ }
+ up_write(&mm->mmap_sem);
+
+ /*
+ * After no new page faults can wait on this fault_wqh, flush
+ * the last page faults that may have been already waiting on
+ * the fault_wqh.
+ */
+ spin_lock(&ctx->fault_wqh.lock);
+ __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, &range);
+ spin_unlock(&ctx->fault_wqh.lock);
+
+ wake_up_poll(&ctx->fd_wqh, POLLHUP);
+ userfaultfd_ctx_put(ctx);
+ return 0;
+}
+
+/* fault_wqh.lock must be hold by the caller */
+static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
+ struct userfaultfd_wait_queue **uwq)
+{
+ wait_queue_t *wq;
+ struct userfaultfd_wait_queue *_uwq;
+ unsigned int ret = 0;
+
+ VM_BUG_ON(!spin_is_locked(&ctx->fault_wqh.lock));
+
+ list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+ _uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+ if (_uwq->pending) {
+ ret = POLLIN;
+ if (!uwq)
+ /*
+ * If there's at least a pending and
+ * we don't care which one it is,
+ * break immediately and leverage the
+ * efficiency of the LIFO walk.
+ */
+ break;
+ /*
+ * If we need to find which one was pending we
+ * keep walking until we find the first not
+ * pending one, so we read() them in FIFO order.
+ */
+ *uwq = _uwq;
+ } else
+ /*
+ * break the loop at the first not pending
+ * one, there cannot be pending userfaults
+ * after the first not pending one, because
+ * all new pending ones are inserted at the
+ * head and we walk it in LIFO.
+ */
+ break;
+ }
+
+ return ret;
+}
+
+static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
+{
+ struct userfaultfd_ctx *ctx = file->private_data;
+ unsigned int ret;
+
+ poll_wait(file, &ctx->fd_wqh, wait);
+
+ switch (ctx->state) {
+ case UFFD_STATE_WAIT_API:
+ return POLLERR;
+ case UFFD_STATE_RUNNING:
+ spin_lock(&ctx->fault_wqh.lock);
+ ret = find_userfault(ctx, NULL);
+ spin_unlock(&ctx->fault_wqh.lock);
+ return ret;
+ default:
+ BUG();
+ }
+}
+
+static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
+ __u64 *addr)
+{
+ ssize_t ret;
+ DECLARE_WAITQUEUE(wait, current);
+ struct userfaultfd_wait_queue *uwq = NULL;
+
+ /* always take the fd_wqh lock before the fault_wqh lock */
+ spin_lock(&ctx->fd_wqh.lock);
+ __add_wait_queue(&ctx->fd_wqh, &wait);
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ spin_lock(&ctx->fault_wqh.lock);
+ if (find_userfault(ctx, &uwq)) {
+ /*
+ * The fault_wqh.lock prevents the uwq to
+ * disappear from under us.
+ */
+ uwq->pending = false;
+ /* careful to always initialize addr if ret == 0 */
+ *addr = uwq->address;
+ spin_unlock(&ctx->fault_wqh.lock);
+ ret = 0;
+ break;
+ }
+ spin_unlock(&ctx->fault_wqh.lock);
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+ if (no_wait) {
+ ret = -EAGAIN;
+ break;
+ }
+ spin_unlock(&ctx->fd_wqh.lock);
+ schedule();
+ spin_lock_irq(&ctx->fd_wqh.lock);
+ }
+ __remove_wait_queue(&ctx->fd_wqh, &wait);
+ __set_current_state(TASK_RUNNING);
+ spin_unlock_irq(&ctx->fd_wqh.lock);
+
+ return ret;
+}
+
+static ssize_t userfaultfd_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct userfaultfd_ctx *ctx = file->private_data;
+ ssize_t _ret, ret = 0;
+ /* careful to always initialize addr if ret == 0 */
+ __u64 uninitialized_var(addr);
+ int no_wait = file->f_flags & O_NONBLOCK;
+
+ if (ctx->state == UFFD_STATE_WAIT_API)
+ return -EINVAL;
+ BUG_ON(ctx->state != UFFD_STATE_RUNNING);
+
+ for (;;) {
+ if (count < sizeof(addr))
+ return ret ? ret : -EINVAL;
+ _ret = userfaultfd_ctx_read(ctx, no_wait, &addr);
+ if (_ret < 0)
+ return ret ? ret : _ret;
+ if (put_user(addr, (__u64 __user *) buf))
+ return ret ? ret : -EFAULT;
+ ret += sizeof(addr);
+ buf += sizeof(addr);
+ count -= sizeof(addr);
+ /*
+ * Allow to read more than one fault at time but only
+ * block if waiting for the very first one.
+ */
+ no_wait = O_NONBLOCK;
+ }
+}
+
+static void __wake_userfault(struct userfaultfd_ctx *ctx,
+ struct userfaultfd_wake_range *range)
+{
+ unsigned long start, end;
+
+ start = range->start;
+ end = range->start + range->len;
+
+ spin_lock(&ctx->fault_wqh.lock);
+ /* wake all in the range and autoremove */
+ __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
+ spin_unlock(&ctx->fault_wqh.lock);
+}
+
+static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
+ struct userfaultfd_wake_range *range)
+{
+ if (waitqueue_active(&ctx->fault_wqh))
+ __wake_userfault(ctx, range);
+}
+
+static __always_inline int validate_range(struct mm_struct *mm,
+ __u64 start, __u64 len)
+{
+ __u64 task_size = mm->task_size;
+
+ if (start & ~PAGE_MASK)
+ return -EINVAL;
+ if (len & ~PAGE_MASK)
+ return -EINVAL;
+ if (!len)
+ return -EINVAL;
+ if (start < mmap_min_addr)
+ return -EINVAL;
+ if (start >= task_size)
+ return -EINVAL;
+ if (len > task_size - start)
+ return -EINVAL;
+ return 0;
+}
+
+static int userfaultfd_register(struct userfaultfd_ctx *ctx,
+ unsigned long arg)
+{
+ struct mm_struct *mm = ctx->mm;
+ struct vm_area_struct *vma, *prev, *cur;
+ int ret;
+ struct uffdio_register uffdio_register;
+ struct uffdio_register __user *user_uffdio_register;
+ unsigned long vm_flags, new_flags;
+ bool found;
+ unsigned long start, end, vma_end;
+
+ user_uffdio_register = (struct uffdio_register __user *) arg;
+
+ ret = -EFAULT;
+ if (copy_from_user(&uffdio_register, user_uffdio_register,
+ sizeof(uffdio_register)-sizeof(__u64)))
+ goto out;
+
+ ret = -EINVAL;
+ if (!uffdio_register.mode)
+ goto out;
+ if (uffdio_register.mode & ~(UFFDIO_REGISTER_MODE_MISSING|
+ UFFDIO_REGISTER_MODE_WP))
+ goto out;
+ vm_flags = 0;
+ if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
+ vm_flags |= VM_UFFD_MISSING;
+ if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
+ vm_flags |= VM_UFFD_WP;
+ /*
+ * FIXME: remove the below error constraint by
+ * implementing the wprotect tracking mode.
+ */
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ret = validate_range(mm, uffdio_register.range.start,
+ uffdio_register.range.len);
+ if (ret)
+ goto out;
+
+ start = uffdio_register.range.start;
+ end = start + uffdio_register.range.len;
+
+ down_write(&mm->mmap_sem);
+ vma = find_vma_prev(mm, start, &prev);
+
+ ret = -ENOMEM;
+ if (!vma)
+ goto out_unlock;
+
+ /* check that there's at least one vma in the range */
+ ret = -EINVAL;
+ if (vma->vm_start >= end)
+ goto out_unlock;
+
+ /*
+ * Search for incompatible vmas.
+ *
+ * FIXME: this shall be relaxed later so that it doesn't fail
+ * on tmpfs backed vmas (in addition to the current allowance
+ * on anonymous vmas).
+ */
+ found = false;
+ for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
+ cond_resched();
+
+ BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
+ !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+
+ /* check for incompatible vmas */
+ ret = -EINVAL;
+ if (cur->vm_ops)
+ goto out_unlock;
+
+ /*
+ * Check that this vma isn't already owned by a
+ * different userfaultfd. We can't allow more than one
+ * userfaultfd to own a single vma simultaneously or we
+ * wouldn't know which one to deliver the userfaults to.
+ */
+ ret = -EBUSY;
+ if (cur->vm_userfaultfd_ctx.ctx &&
+ cur->vm_userfaultfd_ctx.ctx != ctx)
+ goto out_unlock;
+
+ found = true;
+ }
+ BUG_ON(!found);
+
+ if (vma->vm_start < start)
+ prev = vma;
+
+ ret = 0;
+ do {
+ cond_resched();
+
+ BUG_ON(vma->vm_ops);
+ BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
+ vma->vm_userfaultfd_ctx.ctx != ctx);
+
+ /*
+ * Nothing to do: this vma is already registered into this
+ * userfaultfd and with the right tracking mode too.
+ */
+ if (vma->vm_userfaultfd_ctx.ctx == ctx &&
+ (vma->vm_flags & vm_flags) == vm_flags)
+ goto skip;
+
+ if (vma->vm_start > start)
+ start = vma->vm_start;
+ vma_end = min(end, vma->vm_end);
+
+ new_flags = (vma->vm_flags & ~vm_flags) | vm_flags;
+ prev = vma_merge(mm, prev, start, vma_end, new_flags,
+ vma->anon_vma, vma->vm_file, vma->vm_pgoff,
+ vma_policy(vma),
+ ((struct vm_userfaultfd_ctx){ ctx }));
+ if (prev) {
+ vma = prev;
+ goto next;
+ }
+ if (vma->vm_start < start) {
+ ret = split_vma(mm, vma, start, 1);
+ if (ret)
+ break;
+ }
+ if (vma->vm_end > end) {
+ ret = split_vma(mm, vma, end, 0);
+ if (ret)
+ break;
+ }
+ next:
+ /*
+ * In the vma_merge() successful mprotect-like case 8:
+ * the next vma was merged into the current one and
+ * the current one has not been updated yet.
+ */
+ vma->vm_flags = new_flags;
+ vma->vm_userfaultfd_ctx.ctx = ctx;
+
+ skip:
+ prev = vma;
+ start = vma->vm_end;
+ vma = vma->vm_next;
+ } while (vma && vma->vm_start < end);
+out_unlock:
+ up_write(&mm->mmap_sem);
+ if (!ret) {
+ /*
+ * Now that we scanned all vmas we can already tell
+ * userland which ioctls methods are guaranteed to
+ * succeed on this range.
+ */
+ if (put_user(UFFD_API_RANGE_IOCTLS,
+ &user_uffdio_register->ioctls))
+ ret = -EFAULT;
+ }
+out:
+ return ret;
+}
+
+static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
+ unsigned long arg)
+{
+ struct mm_struct *mm = ctx->mm;
+ struct vm_area_struct *vma, *prev, *cur;
+ int ret;
+ struct uffdio_range uffdio_unregister;
+ unsigned long new_flags;
+ bool found;
+ unsigned long start, end, vma_end;
+ const void __user *buf = (void __user *)arg;
+
+ ret = -EFAULT;
+ if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
+ goto out;
+
+ ret = validate_range(mm, uffdio_unregister.start,
+ uffdio_unregister.len);
+ if (ret)
+ goto out;
+
+ start = uffdio_unregister.start;
+ end = start + uffdio_unregister.len;
+
+ down_write(&mm->mmap_sem);
+ vma = find_vma_prev(mm, start, &prev);
+
+ ret = -ENOMEM;
+ if (!vma)
+ goto out_unlock;
+
+ /* check that there's at least one vma in the range */
+ ret = -EINVAL;
+ if (vma->vm_start >= end)
+ goto out_unlock;
+
+ /*
+ * Search for incompatible vmas.
+ *
+ * FIXME: this shall be relaxed later so that it doesn't fail
+ * on tmpfs backed vmas (in addition to the current allowance
+ * on anonymous vmas).
+ */
+ found = false;
+ ret = -EINVAL;
+ for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
+ cond_resched();
+
+ BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
+ !!(cur->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
+
+ /*
+ * Check for incompatible vmas. This is not strictly
+ * required here, as incompatible vmas cannot have a
+ * userfaultfd_ctx registered on them, but it provides
+ * stricter behavior so that unregistration errors are
+ * noticed.
+ */
+ if (cur->vm_ops)
+ goto out_unlock;
+
+ found = true;
+ }
+ BUG_ON(!found);
+
+ if (vma->vm_start < start)
+ prev = vma;
+
+ ret = 0;
+ do {
+ cond_resched();
+
+ BUG_ON(vma->vm_ops);
+
+ /*
+ * Nothing to do: this vma is not registered with any
+ * userfaultfd, so there is nothing to unregister here.
+ */
+ if (!vma->vm_userfaultfd_ctx.ctx)
+ goto skip;
+
+ if (vma->vm_start > start)
+ start = vma->vm_start;
+ vma_end = min(end, vma->vm_end);
+
+ new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
+ prev = vma_merge(mm, prev, start, vma_end, new_flags,
+ vma->anon_vma, vma->vm_file, vma->vm_pgoff,
+ vma_policy(vma),
+ NULL_VM_UFFD_CTX);
+ if (prev) {
+ vma = prev;
+ goto next;
+ }
+ if (vma->vm_start < start) {
+ ret = split_vma(mm, vma, start, 1);
+ if (ret)
+ break;
+ }
+ if (vma->vm_end > end) {
+ ret = split_vma(mm, vma, end, 0);
+ if (ret)
+ break;
+ }
+ next:
+ /*
+ * In the vma_merge() successful mprotect-like case 8:
+ * the next vma was merged into the current one and
+ * the current one has not been updated yet.
+ */
+ vma->vm_flags = new_flags;
+ vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+
+ skip:
+ prev = vma;
+ start = vma->vm_end;
+ vma = vma->vm_next;
+ } while (vma && vma->vm_start < end);
+out_unlock:
+ up_write(&mm->mmap_sem);
+out:
+ return ret;
+}
+
+/*
+ * This is mostly needed to re-wakeup those userfaults that were still
+ * pending when userland wake them up the first time. We don't wake
+ * the pending one to avoid blocking reads to block, or non blocking
+ * read to return -EAGAIN, if used with POLLIN, to avoid userland
+ * doubts on why POLLIN wasn't reliable.
+ */
+static int userfaultfd_wake(struct userfaultfd_ctx *ctx,
+ unsigned long arg)
+{
+ int ret;
+ struct uffdio_range uffdio_wake;
+ struct userfaultfd_wake_range range;
+ const void __user *buf = (void __user *)arg;
+
+ ret = -EFAULT;
+ if (copy_from_user(&uffdio_wake, buf, sizeof(uffdio_wake)))
+ goto out;
+
+ ret = validate_range(ctx->mm, uffdio_wake.start, uffdio_wake.len);
+ if (ret)
+ goto out;
+
+ range.start = uffdio_wake.start;
+ range.len = uffdio_wake.len;
+
+ /*
+ * len == 0 means wake all and we don't want to wake all here,
+ * so check it again to be sure.
+ */
+ VM_BUG_ON(!range.len);
+
+ wake_userfault(ctx, &range);
+ ret = 0;
+
+out:
+ return ret;
+}
+
+/*
+ * userland asks for a certain API version and we return which bits
+ * and ioctl commands are implemented in this kernel for such API
+ * version or -EINVAL if unknown.
+ */
+static int userfaultfd_api(struct userfaultfd_ctx *ctx,
+ unsigned long arg)
+{
+ struct uffdio_api uffdio_api;
+ void __user *buf = (void __user *)arg;
+ int ret;
+
+ ret = -EINVAL;
+ if (ctx->state != UFFD_STATE_WAIT_API)
+ goto out;
+ ret = -EFAULT;
+ if (copy_from_user(&uffdio_api, buf, sizeof(__u64)))
+ goto out;
+ if (uffdio_api.api != UFFD_API) {
+ /* careful not to leak info, we only read the first 8 bytes */
+ memset(&uffdio_api, 0, sizeof(uffdio_api));
+ if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
+ goto out;
+ ret = -EINVAL;
+ goto out;
+ }
+ /* careful not to leak info, we only read the first 8 bytes */
+ uffdio_api.bits = UFFD_API_BITS;
+ uffdio_api.ioctls = UFFD_API_IOCTLS;
+ ret = -EFAULT;
+ if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
+ goto out;
+ ctx->state = UFFD_STATE_RUNNING;
+ ret = 0;
+out:
+ return ret;
+}
+
+static long userfaultfd_ioctl(struct file *file, unsigned cmd,
+ unsigned long arg)
+{
+ int ret = -EINVAL;
+ struct userfaultfd_ctx *ctx = file->private_data;
+
+ switch(cmd) {
+ case UFFDIO_API:
+ ret = userfaultfd_api(ctx, arg);
+ break;
+ case UFFDIO_REGISTER:
+ ret = userfaultfd_register(ctx, arg);
+ break;
+ case UFFDIO_UNREGISTER:
+ ret = userfaultfd_unregister(ctx, arg);
+ break;
+ case UFFDIO_WAKE:
+ ret = userfaultfd_wake(ctx, arg);
+ break;
+ }
+ return ret;
+}
+
+#ifdef CONFIG_PROC_FS
+static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+ struct userfaultfd_ctx *ctx = f->private_data;
+ wait_queue_t *wq;
+ struct userfaultfd_wait_queue *uwq;
+ unsigned long pending = 0, total = 0;
+
+ spin_lock(&ctx->fault_wqh.lock);
+ list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+ uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+ if (uwq->pending)
+ pending++;
+ total++;
+ }
+ spin_unlock(&ctx->fault_wqh.lock);
+
+ /*
+ * If more protocols are added in the future, they will all be
+ * shown separated by a space, like this:
+ * protocols: aa:... bb:...
+ */
+ seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nAPI:\t%Lx:%x:%Lx\n",
+ pending, total, UFFD_API, UFFD_API_BITS,
+ UFFD_API_IOCTLS|UFFD_API_RANGE_IOCTLS);
+}
+#endif
+
+static const struct file_operations userfaultfd_fops = {
+#ifdef CONFIG_PROC_FS
+ .show_fdinfo = userfaultfd_show_fdinfo,
+#endif
+ .release = userfaultfd_release,
+ .poll = userfaultfd_poll,
+ .read = userfaultfd_read,
+ .unlocked_ioctl = userfaultfd_ioctl,
+ .compat_ioctl = userfaultfd_ioctl,
+ .llseek = noop_llseek,
+};
+
+/**
+ * userfaultfd_file_create - Creates an userfaultfd file pointer.
+ * @flags: Flags for the userfaultfd file.
+ *
+ * This function creates an userfaultfd file pointer, w/out installing
+ * it into the fd table. This is useful when the userfaultfd file is
+ * used during the initialization of data structures that require
+ * extra setup after the userfaultfd creation. So the userfaultfd
+ * creation is split into the file pointer creation phase, and the
+ * file descriptor installation phase. In this way races with
+ * userspace closing the newly installed file descriptor can be
+ * avoided. Returns an userfaultfd file pointer, or a proper error
+ * pointer.
+ */
+static struct file *userfaultfd_file_create(int flags)
+{
+ struct file *file;
+ struct userfaultfd_ctx *ctx;
+
+ BUG_ON(!current->mm);
+
+ /* Check the UFFD_* constants for consistency. */
+ BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
+ BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
+
+ file = ERR_PTR(-EINVAL);
+ if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
+ goto out;
+
+ file = ERR_PTR(-ENOMEM);
+ ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ goto out;
+
+ atomic_set(&ctx->refcount, 1);
+ init_waitqueue_head(&ctx->fault_wqh);
+ init_waitqueue_head(&ctx->fd_wqh);
+ ctx->flags = flags;
+ ctx->state = UFFD_STATE_WAIT_API;
+ ctx->released = false;
+ ctx->mm = current->mm;
+ /* prevent the mm struct from being freed */
+ atomic_inc(&ctx->mm->mm_users);
+
+ file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, ctx,
+ O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
+ if (IS_ERR(file))
+ kfree(ctx);
+out:
+ return file;
+}
+
+SYSCALL_DEFINE1(userfaultfd, int, flags)
+{
+ int fd, error;
+ struct file *file;
+
+ error = get_unused_fd_flags(flags & UFFD_SHARED_FCNTL_FLAGS);
+ if (error < 0)
+ return error;
+ fd = error;
+
+ file = userfaultfd_file_create(flags);
+ if (IS_ERR(file)) {
+ error = PTR_ERR(file);
+ goto err_put_unused_fd;
+ }
+ fd_install(fd, file);
+
+ return fd;
+
+err_put_unused_fd:
+ put_unused_fd(fd);
+
+ return error;
+}

2015-05-14 17:36:19

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 11/23] userfaultfd: Rename uffd_api.bits into .features

From: Pavel Emelyanov <[email protected]>

This is (or seems to be) the minimal change required to unblock
standard uffd usage from the non-cooperative one. More bits can now
be added to the features field, indicating e.g. UFFD_FEATURE_FORK and
the others needed for the latter use case.
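
Purely as an illustration (not part of the patch), a userland program
that already created a ufd with the userfaultfd syscall and uses the
uapi header from this series would consume the renamed field roughly
like this (fragment, headers and error handling omitted):

	struct uffdio_api uffdio_api = { .api = UFFD_API };
	int have_write_bit;

	if (ioctl(ufd, UFFDIO_API, &uffdio_api) == -1)
		return -1;	/* API handshake failed */
	/* "features" replaces the old "bits" field */
	have_write_bit = !!(uffdio_api.features & UFFD_FEATURE_WRITE_BIT);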

Signed-off-by: Pavel Emelyanov <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/userfaultfd.c | 4 ++--
include/uapi/linux/userfaultfd.h | 10 ++++++++--
2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1c9be61..9085365 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -856,7 +856,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
goto out;
}
/* careful not to leak info, we only read the first 8 bytes */
- uffdio_api.bits = UFFD_API_BITS;
+ uffdio_api.features = UFFD_API_FEATURES;
uffdio_api.ioctls = UFFD_API_IOCTLS;
ret = -EFAULT;
if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
@@ -913,7 +913,7 @@ static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
* protocols: aa:... bb:...
*/
seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nAPI:\t%Lx:%x:%Lx\n",
- pending, total, UFFD_API, UFFD_API_BITS,
+ pending, total, UFFD_API, UFFD_API_FEATURES,
UFFD_API_IOCTLS|UFFD_API_RANGE_IOCTLS);
}
#endif
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 9a8cd56..5e1c2f7 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -11,7 +11,7 @@

#define UFFD_API ((__u64)0xAA)
/* FIXME: add "|UFFD_BIT_WP" to UFFD_API_BITS after implementing it */
-#define UFFD_API_BITS (UFFD_BIT_WRITE)
+#define UFFD_API_FEATURES (UFFD_FEATURE_WRITE_BIT)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -51,12 +51,18 @@
#define UFFD_BIT_WP (1<<1) /* handle_userfault() reason VM_UFFD_WP */
#define UFFD_BITS 2 /* two above bits used for UFFD_BIT_* mask */

+/*
+ * Features reported in uffdio_api.features field
+ */
+#define UFFD_FEATURE_WRITE_BIT (1<<0) /* Corresponds to UFFD_BIT_WRITE */
+#define UFFD_FEATURE_WP_BIT (1<<1) /* Corresponds to UFFD_BIT_WP */
+
struct uffdio_api {
/* userland asks for an API number */
__u64 api;

/* kernel answers below with the available features for the API */
- __u64 bits;
+ __u64 features;
__u64 ioctls;
};

2015-05-14 17:36:16

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 12/23] userfaultfd: Rename uffd_api.bits into .features fixup

Update comment.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/uapi/linux/userfaultfd.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 5e1c2f7..03f21cb 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -10,7 +10,7 @@
#define _LINUX_USERFAULTFD_H

#define UFFD_API ((__u64)0xAA)
-/* FIXME: add "|UFFD_BIT_WP" to UFFD_API_BITS after implementing it */
+/* FIXME: add "|UFFD_FEATURE_WP" to UFFD_API_FEATURES after implementing it */
#define UFFD_API_FEATURES (UFFD_FEATURE_WRITE_BIT)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \

2015-05-14 17:33:56

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 13/23] userfaultfd: change the read API to return a uffd_msg

I had requests to return the full address (not the page aligned one)
to userland.

It's not entirely clear how the page offset could be relevant, because
userfaults aren't like SIGBUS, which can sigjump to a different place
and actually skip resolving the fault depending on a page
offset. There's currently no real way to skip the fault, especially
because after a UFFDIO_COPY|ZEROPAGE the fault is optimized to be
retried within the kernel without having to return to userland first
(not even self-modifying code replacing the .text that touched the
faulting address would prevent the fault from being repeated). Userland
cannot skip repeating the fault, even more so if the fault was
triggered by a KVM secondary page fault, or by any get_user_pages or
copy-user inside some syscall that will return to kernel code. The
second time around FAULT_FLAG_RETRY_NOWAIT won't be set, leading to a
SIGBUS being raised, because the userfault can't wait if it cannot
release the mmap_sem first (and FAULT_FLAG_RETRY_NOWAIT is required
for that).

Still, returning a proper structure to userland during the read() on
the uffd allows the current UFFD_API to be used for the future
non-cooperative extensions too, and it looks cleaner as well. Once we
get additional fields, there's no point in returning the fault address
page aligned anymore just to reuse the bits below PAGE_SHIFT.

The only downside is that the read() syscall will read 32 bytes
instead of 8, but that's not going to be a measurable overhead.

The total number of new events, or of new future bits for already
shipped events, is limited to 64 by the features field of the
uffdio_api structure. If more are ever needed, a bump of UFFD_API
will be required.
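
As an illustration only (not part of the patch), a fault-handling
thread that already ran UFFDIO_API and UFFDIO_REGISTER would consume
the new 32-byte messages roughly as follows; resolve_fault() and
page_size are placeholders for the application's own resolution logic
(e.g. the UFFDIO_COPY ioctl added later in this series):

	struct uffd_msg msg;
	ssize_t len;

	len = read(ufd, &msg, sizeof(msg));
	if (len != sizeof(msg))
		return -1;	/* error: the uffd only returns whole messages */
	if (msg.event != UFFD_EVENT_PAGEFAULT)
		return -1;	/* only pagefault events exist so far */
	/* the address is no longer page aligned by the kernel */
	resolve_fault(msg.arg.pagefault.address & ~(page_size - 1),
		      msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE);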

Signed-off-by: Andrea Arcangeli <[email protected]>
---
Documentation/vm/userfaultfd.txt | 12 +++---
fs/userfaultfd.c | 79 +++++++++++++++++++++++-----------------
include/uapi/linux/userfaultfd.h | 64 ++++++++++++++++++++++++--------
3 files changed, 102 insertions(+), 53 deletions(-)

diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
index c2f5145..3557edd 100644
--- a/Documentation/vm/userfaultfd.txt
+++ b/Documentation/vm/userfaultfd.txt
@@ -46,11 +46,13 @@ is a corner case that would currently return -EBUSY).
When first opened the userfaultfd must be enabled invoking the
UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
a later API version) which will specify the read/POLLIN protocol
-userland intends to speak on the UFFD. The UFFDIO_API ioctl if
-successful (i.e. if the requested uffdio_api.api is spoken also by the
-running kernel), will return into uffdio_api.features and
-uffdio_api.ioctls two 64bit bitmasks of respectively the activated
-feature of the read(2) protocol and the generic ioctl available.
+userland intends to speak on the UFFD and the uffdio_api.features
+userland needs to be enabled. The UFFDIO_API ioctl if successful
+(i.e. if the requested uffdio_api.api is spoken also by the running
+kernel and the requested features are going to be enabled) will return
+into uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of
+respectively all the available features of the read(2) protocol and
+the generic ioctl available.

Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
be invoked (if present in the returned uffdio_api.ioctls bitmask) to
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 9085365..b45cefe 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -50,7 +50,7 @@ struct userfaultfd_ctx {
};

struct userfaultfd_wait_queue {
- unsigned long address;
+ struct uffd_msg msg;
wait_queue_t wq;
bool pending;
struct userfaultfd_ctx *ctx;
@@ -77,7 +77,8 @@ static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
/* len == 0 means wake all */
start = range->start;
len = range->len;
- if (len && (start > uwq->address || start + len <= uwq->address))
+ if (len && (start > uwq->msg.arg.pagefault.address ||
+ start + len <= uwq->msg.arg.pagefault.address))
goto out;
ret = wake_up_state(wq->private, mode);
if (ret)
@@ -122,28 +123,43 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
}
}

-static inline unsigned long userfault_address(unsigned long address,
- unsigned int flags,
- unsigned long reason)
+static inline void msg_init(struct uffd_msg *msg)
{
- BUILD_BUG_ON(PAGE_SHIFT < UFFD_BITS);
- address &= PAGE_MASK;
+ BUILD_BUG_ON(sizeof(struct uffd_msg) != 32);
+ /*
+ * Must use memset to zero out the paddings or kernel data is
+ * leaked to userland.
+ */
+ memset(msg, 0, sizeof(struct uffd_msg));
+}
+
+static inline struct uffd_msg userfault_msg(unsigned long address,
+ unsigned int flags,
+ unsigned long reason)
+{
+ struct uffd_msg msg;
+ msg_init(&msg);
+ msg.event = UFFD_EVENT_PAGEFAULT;
+ msg.arg.pagefault.address = address;
if (flags & FAULT_FLAG_WRITE)
/*
- * Encode "write" fault information in the LSB of the
- * address read by userland, without depending on
- * FAULT_FLAG_WRITE kernel internal value.
+ * If UFFD_FEATURE_PAGEFAULT_FLAG_WRITE was set in the
+ * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WRITE
+ * was not set in a UFFD_EVENT_PAGEFAULT, it means it
+ * was a read fault, otherwise if set it means it's
+ * a write fault.
*/
- address |= UFFD_BIT_WRITE;
+ msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WRITE;
if (reason & VM_UFFD_WP)
/*
- * Encode "reason" fault information as bit number 1
- * in the address read by userland. If bit number 1 is
- * clear it means the reason is a VM_FAULT_MISSING
- * fault.
+ * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
+ * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WP was
+ * not set in a UFFD_EVENT_PAGEFAULT, it means it was
+ * a missing fault, otherwise if set it means it's a
+ * write protect fault.
*/
- address |= UFFD_BIT_WP;
- return address;
+ msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WP;
+ return msg;
}

/*
@@ -229,7 +245,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,

init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
uwq.wq.private = current;
- uwq.address = userfault_address(address, flags, reason);
+ uwq.msg = userfault_msg(address, flags, reason);
uwq.pending = true;
uwq.ctx = ctx;

@@ -385,7 +401,7 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
}

static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
- __u64 *addr)
+ struct uffd_msg *msg)
{
ssize_t ret;
DECLARE_WAITQUEUE(wait, current);
@@ -403,8 +419,8 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
* disappear from under us.
*/
uwq->pending = false;
- /* careful to always initialize addr if ret == 0 */
- *addr = uwq->address;
+ /* careful to always initialize msg if ret == 0 */
+ *msg = uwq->msg;
spin_unlock(&ctx->fault_wqh.lock);
ret = 0;
break;
@@ -434,8 +450,7 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
{
struct userfaultfd_ctx *ctx = file->private_data;
ssize_t _ret, ret = 0;
- /* careful to always initialize addr if ret == 0 */
- __u64 uninitialized_var(addr);
+ struct uffd_msg msg;
int no_wait = file->f_flags & O_NONBLOCK;

if (ctx->state == UFFD_STATE_WAIT_API)
@@ -443,16 +458,16 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
BUG_ON(ctx->state != UFFD_STATE_RUNNING);

for (;;) {
- if (count < sizeof(addr))
+ if (count < sizeof(msg))
return ret ? ret : -EINVAL;
- _ret = userfaultfd_ctx_read(ctx, no_wait, &addr);
+ _ret = userfaultfd_ctx_read(ctx, no_wait, &msg);
if (_ret < 0)
return ret ? ret : _ret;
- if (put_user(addr, (__u64 __user *) buf))
+ if (copy_to_user((__u64 __user *) buf, &msg, sizeof(msg)))
return ret ? ret : -EFAULT;
- ret += sizeof(addr);
- buf += sizeof(addr);
- count -= sizeof(addr);
+ ret += sizeof(msg);
+ buf += sizeof(msg);
+ count -= sizeof(msg);
/*
* Allow to read more than one fault at time but only
* block if waiting for the very first one.
@@ -845,17 +860,15 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
if (ctx->state != UFFD_STATE_WAIT_API)
goto out;
ret = -EFAULT;
- if (copy_from_user(&uffdio_api, buf, sizeof(__u64)))
+ if (copy_from_user(&uffdio_api, buf, sizeof(uffdio_api)))
goto out;
- if (uffdio_api.api != UFFD_API) {
- /* careful not to leak info, we only read the first 8 bytes */
+ if (uffdio_api.api != UFFD_API || uffdio_api.features) {
memset(&uffdio_api, 0, sizeof(uffdio_api));
if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
goto out;
ret = -EINVAL;
goto out;
}
- /* careful not to leak info, we only read the first 8 bytes */
uffdio_api.features = UFFD_API_FEATURES;
uffdio_api.ioctls = UFFD_API_IOCTLS;
ret = -EFAULT;
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 03f21cb..8e42bc3 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -10,8 +10,12 @@
#define _LINUX_USERFAULTFD_H

#define UFFD_API ((__u64)0xAA)
-/* FIXME: add "|UFFD_FEATURE_WP" to UFFD_API_FEATURES after implementing it */
-#define UFFD_API_FEATURES (UFFD_FEATURE_WRITE_BIT)
+/*
+ * After implementing the respective features it will become:
+ * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \
+ * UFFD_FEATURE_EVENT_FORK)
+ */
+#define UFFD_API_FEATURES (0)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -43,26 +47,56 @@
#define UFFDIO_WAKE _IOR(UFFDIO, _UFFDIO_WAKE, \
struct uffdio_range)

-/*
- * Valid bits below PAGE_SHIFT in the userfault address read through
- * the read() syscall.
- */
-#define UFFD_BIT_WRITE (1<<0) /* this was a write fault, MISSING or WP */
-#define UFFD_BIT_WP (1<<1) /* handle_userfault() reason VM_UFFD_WP */
-#define UFFD_BITS 2 /* two above bits used for UFFD_BIT_* mask */
+/* read() structure */
+struct uffd_msg {
+ __u8 event;
+
+ union {
+ struct {
+ __u32 flags;
+ __u64 address;
+ } pagefault;
+
+ struct {
+ /* unused reserved fields */
+ __u64 reserved1;
+ __u64 reserved2;
+ __u64 reserved3;
+ } reserved;
+ } arg;
+};

/*
- * Features reported in uffdio_api.features field
+ * Start at 0x12 and not at 0 to be more strict against bugs.
*/
-#define UFFD_FEATURE_WRITE_BIT (1<<0) /* Corresponds to UFFD_BIT_WRITE */
-#define UFFD_FEATURE_WP_BIT (1<<1) /* Corresponds to UFFD_BIT_WP */
+#define UFFD_EVENT_PAGEFAULT 0x12
+#if 0 /* not available yet */
+#define UFFD_EVENT_FORK 0x13
+#endif
+
+/* flags for UFFD_EVENT_PAGEFAULT */
+#define UFFD_PAGEFAULT_FLAG_WRITE (1<<0) /* If this was a write fault */
+#define UFFD_PAGEFAULT_FLAG_WP (1<<1) /* If reason is VM_UFFD_WP */

struct uffdio_api {
- /* userland asks for an API number */
+ /* userland asks for an API number and the features to enable */
__u64 api;
-
- /* kernel answers below with the available features for the API */
+ /*
+ * Kernel answers below with all the available features for
+ * the API, this notifies userland of which events and/or
+ * which flags for each event are enabled in the current
+ * kernel.
+ *
+ * Note: UFFD_EVENT_PAGEFAULT and UFFD_PAGEFAULT_FLAG_WRITE
+ * are to be considered implicitly always enabled in all kernels as
+ * long as the uffdio_api.api requested matches UFFD_API.
+ */
+#if 0 /* not available yet */
+#define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
+#define UFFD_FEATURE_EVENT_FORK (1<<1)
+#endif
__u64 features;
+
__u64 ioctls;
};

2015-05-14 17:35:00

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 14/23] userfaultfd: wake pending userfaults

This is an optimization, but it's a userland-visible one and it
affects the API.

The downside of this optimization is that if you call poll() and you
get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault
may have been woken by a different thread before read(ufd) comes
around. In short this means that poll() isn't really usable if the
userfaultfd is opened in blocking mode.

Userfaults won't wait in a "pending" state to be read anymore, and any
UFFDIO_WAKE or similar operation that has the objective of waking
userfaults after their resolution will wake all blocked userfaults for
the resolved range, including those that haven't been read() by
userland yet.

The behavior of poll() becomes non-standard, but this obviates the
need for "spurious" UFFDIO_WAKEs and it lets the userland threads
restart immediately without requiring an UFFDIO_WAKE. This is even
more significant in case of repeated faults on the same address from
multiple threads.

This optimization is justified by the measurement that spurious
UFFDIO_WAKEs account for between 5% and 10% of the total userfaults
in heavy workloads, so it's worth optimizing them away.
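
In practice (illustration only, not part of the patch) this means a
tracker has to create the userfaultfd with UFFD_NONBLOCK (== O_NONBLOCK)
and treat -EAGAIN after POLLIN as a spurious wakeup; handle_msg() is a
placeholder for the application logic and headers are omitted:

	struct pollfd pfd = { .fd = ufd, .events = POLLIN };
	struct uffd_msg msg;
	ssize_t len;

	for (;;) {
		if (poll(&pfd, 1, -1) <= 0)
			continue;	/* e.g. EINTR */
		len = read(ufd, &msg, sizeof(msg));
		if (len < 0 && errno == EAGAIN)
			continue;	/* userfault already woken, POLLIN was spurious */
		if (len != sizeof(msg))
			break;		/* unexpected error */
		handle_msg(&msg);
	}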

Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/userfaultfd.c | 65 +++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 43 insertions(+), 22 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b45cefe..50edbd8 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -52,6 +52,10 @@ struct userfaultfd_ctx {
struct userfaultfd_wait_queue {
struct uffd_msg msg;
wait_queue_t wq;
+ /*
+ * Only relevant when queued in fault_wqh and only used by the
+ * read operation to avoid reading the same userfault twice.
+ */
bool pending;
struct userfaultfd_ctx *ctx;
};
@@ -71,9 +75,6 @@ static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,

uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
ret = 0;
- /* don't wake the pending ones to avoid reads to block */
- if (uwq->pending && !ACCESS_ONCE(uwq->ctx->released))
- goto out;
/* len == 0 means wake all */
start = range->start;
len = range->len;
@@ -183,12 +184,14 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
struct mm_struct *mm = vma->vm_mm;
struct userfaultfd_ctx *ctx;
struct userfaultfd_wait_queue uwq;
+ int ret;

BUG_ON(!rwsem_is_locked(&mm->mmap_sem));

+ ret = VM_FAULT_SIGBUS;
ctx = vma->vm_userfaultfd_ctx.ctx;
if (!ctx)
- return VM_FAULT_SIGBUS;
+ goto out;

BUG_ON(ctx->mm != mm);

@@ -201,7 +204,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
* caller of handle_userfault to release the mmap_sem.
*/
if (unlikely(ACCESS_ONCE(ctx->released)))
- return VM_FAULT_SIGBUS;
+ goto out;

/*
* Check that we can return VM_FAULT_RETRY.
@@ -227,15 +230,16 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
dump_stack();
}
#endif
- return VM_FAULT_SIGBUS;
+ goto out;
}

/*
* Handle nowait, not much to do other than tell it to retry
* and wait.
*/
+ ret = VM_FAULT_RETRY;
if (flags & FAULT_FLAG_RETRY_NOWAIT)
- return VM_FAULT_RETRY;
+ goto out;

/* take the reference before dropping the mmap_sem */
userfaultfd_ctx_get(ctx);
@@ -255,21 +259,23 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
* through poll/read().
*/
__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
- for (;;) {
- set_current_state(TASK_KILLABLE);
- if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
- fatal_signal_pending(current))
- break;
- spin_unlock(&ctx->fault_wqh.lock);
+ set_current_state(TASK_KILLABLE);
+ spin_unlock(&ctx->fault_wqh.lock);

+ if (likely(!ACCESS_ONCE(ctx->released) &&
+ !fatal_signal_pending(current))) {
wake_up_poll(&ctx->fd_wqh, POLLIN);
schedule();
+ ret |= VM_FAULT_MAJOR;
+ }

+ __set_current_state(TASK_RUNNING);
+ /* see finish_wait() comment for why list_empty_careful() */
+ if (!list_empty_careful(&uwq.wq.task_list)) {
spin_lock(&ctx->fault_wqh.lock);
+ list_del_init(&uwq.wq.task_list);
+ spin_unlock(&ctx->fault_wqh.lock);
}
- __remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
- __set_current_state(TASK_RUNNING);
- spin_unlock(&ctx->fault_wqh.lock);

/*
* ctx may go away after this if the userfault pseudo fd is
@@ -277,7 +283,8 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
*/
userfaultfd_ctx_put(ctx);

- return VM_FAULT_RETRY;
+out:
+ return ret;
}

static int userfaultfd_release(struct inode *inode, struct file *file)
@@ -391,6 +398,12 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
case UFFD_STATE_WAIT_API:
return POLLERR;
case UFFD_STATE_RUNNING:
+ /*
+ * poll() never guarantees that read won't block.
+ * userfaults can be woken before they're read().
+ */
+ if (unlikely(!(file->f_flags & O_NONBLOCK)))
+ return POLLERR;
spin_lock(&ctx->fault_wqh.lock);
ret = find_userfault(ctx, NULL);
spin_unlock(&ctx->fault_wqh.lock);
@@ -806,11 +819,19 @@ out:
}

/*
- * This is mostly needed to re-wakeup those userfaults that were still
- * pending when userland wake them up the first time. We don't wake
- * the pending one to avoid blocking reads to block, or non blocking
- * read to return -EAGAIN, if used with POLLIN, to avoid userland
- * doubts on why POLLIN wasn't reliable.
+ * userfaultfd_wake is needed in case an userfault is in flight by the
+ * time a UFFDIO_COPY (or other ioctl variants) completes. The page
+ * may be well get mapped and the page fault if repeated wouldn't lead
+ * to a userfault anymore, but before scheduling in TASK_KILLABLE mode
+ * handle_userfault() doesn't recheck the pagetables and it doesn't
+ * serialize against UFFDO_COPY (or other ioctl variants). Ultimately
+ * the knowledge of which pages are mapped is left to userland who is
+ * responsible for handling the race between read() userfaults and
+ * background UFFDIO_COPY (or other ioctl variants), if done by
+ * separate concurrent threads.
+ *
+ * userfaultfd_wake may be used in combination with the
+ * UFFDIO_*_MODE_DONTWAKE to wakeup userfaults in batches.
*/
static int userfaultfd_wake(struct userfaultfd_ctx *ctx,
unsigned long arg)

2015-05-14 17:39:54

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 15/23] userfaultfd: optimize read() and poll() to be O(1)

This makes read() O(1), and poll(), which was already O(1), becomes
lockless.
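
For readers not following the diff closely, the scheme can be sketched
(simplified kernel-side C, locking and wakeup omitted; "struct uwq" is
a stand-in for userfaultfd_wait_queue) as two intrusive lists per
context:

	struct uwq {				/* stand-in for userfaultfd_wait_queue */
		struct uffd_msg msg;
		struct list_head list;		/* stand-in for wq.task_list */
	};

	struct uffd_ctx_model {
		struct list_head fault_pending;	/* faults not read() yet */
		struct list_head fault;		/* read() but not woken yet */
	};

	/* poll(): O(1), only checks whether something is pending */
	static bool have_pending(struct uffd_ctx_model *ctx)
	{
		return !list_empty(&ctx->fault_pending);
	}

	/*
	 * read(): O(1). New faults are added at the head, so the oldest
	 * one sits at the tail; pop it and refile it to the second list.
	 */
	static struct uwq *read_one(struct uffd_ctx_model *ctx)
	{
		struct uwq *uwq;

		if (list_empty(&ctx->fault_pending))
			return NULL;
		uwq = list_last_entry(&ctx->fault_pending, struct uwq, list);
		list_del(&uwq->list);
		list_add(&uwq->list, &ctx->fault);
		return uwq;
	}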

Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/userfaultfd.c | 172 +++++++++++++++++++++++++++++++------------------------
1 file changed, 98 insertions(+), 74 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 50edbd8..3d26f41 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -35,7 +35,9 @@ enum userfaultfd_state {
struct userfaultfd_ctx {
/* pseudo fd refcounting */
atomic_t refcount;
- /* waitqueue head for the userfaultfd page faults */
+ /* waitqueue head for the pending (i.e. not read) userfaults */
+ wait_queue_head_t fault_pending_wqh;
+ /* waitqueue head for the userfaults */
wait_queue_head_t fault_wqh;
/* waitqueue head for the pseudo fd to wakeup poll/read */
wait_queue_head_t fd_wqh;
@@ -52,11 +54,6 @@ struct userfaultfd_ctx {
struct userfaultfd_wait_queue {
struct uffd_msg msg;
wait_queue_t wq;
- /*
- * Only relevant when queued in fault_wqh and only used by the
- * read operation to avoid reading the same userfault twice.
- */
- bool pending;
struct userfaultfd_ctx *ctx;
};

@@ -250,17 +247,21 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
uwq.wq.private = current;
uwq.msg = userfault_msg(address, flags, reason);
- uwq.pending = true;
uwq.ctx = ctx;

- spin_lock(&ctx->fault_wqh.lock);
+ spin_lock(&ctx->fault_pending_wqh.lock);
/*
* After the __add_wait_queue the uwq is visible to userland
* through poll/read().
*/
- __add_wait_queue(&ctx->fault_wqh, &uwq.wq);
+ __add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
+ /*
+ * The smp_mb() after __set_current_state prevents the reads
+ * following the spin_unlock from happening before the list_add in
+ * __add_wait_queue.
+ */
set_current_state(TASK_KILLABLE);
- spin_unlock(&ctx->fault_wqh.lock);
+ spin_unlock(&ctx->fault_pending_wqh.lock);

if (likely(!ACCESS_ONCE(ctx->released) &&
!fatal_signal_pending(current))) {
@@ -270,11 +271,28 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
}

__set_current_state(TASK_RUNNING);
- /* see finish_wait() comment for why list_empty_careful() */
+
+ /*
+ * Here we race with the list_del; list_add in
+ * userfaultfd_ctx_read(), however because we don't ever run
+ * list_del_init() to refile across the two lists, the prev
+ * and next pointers will never point to self. list_add also
+ * would never let any of the two pointers to point to
+ * self. So list_empty_careful won't risk seeing both pointers
+ * pointing to self at any time during the list refile. The
+ * only case where list_del_init() is called is the full
+ * removal in the wake function and there we don't re-list_add
+ * and it's fine not to block on the spinlock. The uwq on this
+ * kernel stack can be released after the list_del_init.
+ */
if (!list_empty_careful(&uwq.wq.task_list)) {
- spin_lock(&ctx->fault_wqh.lock);
- list_del_init(&uwq.wq.task_list);
- spin_unlock(&ctx->fault_wqh.lock);
+ spin_lock(&ctx->fault_pending_wqh.lock);
+ /*
+ * No need of list_del_init(), the uwq on the stack
+ * will be freed shortly anyway.
+ */
+ list_del(&uwq.wq.task_list);
+ spin_unlock(&ctx->fault_pending_wqh.lock);
}

/*
@@ -332,59 +350,38 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
up_write(&mm->mmap_sem);

/*
- * After no new page faults can wait on this fault_wqh, flush
+ * After no new page faults can wait on this fault_*wqh, flush
* the last page faults that may have been already waiting on
- * the fault_wqh.
+ * the fault_*wqh.
*/
- spin_lock(&ctx->fault_wqh.lock);
+ spin_lock(&ctx->fault_pending_wqh.lock);
+ __wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, 0, &range);
__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, &range);
- spin_unlock(&ctx->fault_wqh.lock);
+ spin_unlock(&ctx->fault_pending_wqh.lock);

wake_up_poll(&ctx->fd_wqh, POLLHUP);
userfaultfd_ctx_put(ctx);
return 0;
}

-/* fault_wqh.lock must be held by the caller */
-static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
- struct userfaultfd_wait_queue **uwq)
+/* fault_pending_wqh.lock must be held by the caller */
+static inline struct userfaultfd_wait_queue *find_userfault(
+ struct userfaultfd_ctx *ctx)
{
wait_queue_t *wq;
- struct userfaultfd_wait_queue *_uwq;
- unsigned int ret = 0;
-
- VM_BUG_ON(!spin_is_locked(&ctx->fault_wqh.lock));
+ struct userfaultfd_wait_queue *uwq;

- list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
- _uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
- if (_uwq->pending) {
- ret = POLLIN;
- if (!uwq)
- /*
- * If there's at least a pending and
- * we don't care which one it is,
- * break immediately and leverage the
- * efficiency of the LIFO walk.
- */
- break;
- /*
- * If we need to find which one was pending we
- * keep walking until we find the first not
- * pending one, so we read() them in FIFO order.
- */
- *uwq = _uwq;
- } else
- /*
- * break the loop at the first not pending
- * one, there cannot be pending userfaults
- * after the first not pending one, because
- * all new pending ones are inserted at the
- * head and we walk it in LIFO.
- */
- break;
- }
+ VM_BUG_ON(!spin_is_locked(&ctx->fault_pending_wqh.lock));

- return ret;
+ uwq = NULL;
+ if (!waitqueue_active(&ctx->fault_pending_wqh))
+ goto out;
+ /* walk in reverse to provide FIFO behavior to read userfaults */
+ wq = list_last_entry(&ctx->fault_pending_wqh.task_list,
+ typeof(*wq), task_list);
+ uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+out:
+ return uwq;
}

static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
@@ -404,9 +401,20 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
*/
if (unlikely(!(file->f_flags & O_NONBLOCK)))
return POLLERR;
- spin_lock(&ctx->fault_wqh.lock);
- ret = find_userfault(ctx, NULL);
- spin_unlock(&ctx->fault_wqh.lock);
+ /*
+ * Lockless access to see if there are pending faults.
+ * __pollwait's last action is the add_wait_queue, but
+ * the spin_unlock would allow the waitqueue_active to
+ * pass above the actual list_add inside
+ * add_wait_queue critical section. So use a full
+ * memory barrier to serialize the list_add write of
+ * add_wait_queue() with the waitqueue_active read
+ * below.
+ */
+ ret = 0;
+ smp_mb();
+ if (waitqueue_active(&ctx->fault_pending_wqh))
+ ret = POLLIN;
return ret;
default:
BUG();
@@ -418,27 +426,34 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
{
ssize_t ret;
DECLARE_WAITQUEUE(wait, current);
- struct userfaultfd_wait_queue *uwq = NULL;
+ struct userfaultfd_wait_queue *uwq;

- /* always take the fd_wqh lock before the fault_wqh lock */
+ /* always take the fd_wqh lock before the fault_pending_wqh lock */
spin_lock(&ctx->fd_wqh.lock);
__add_wait_queue(&ctx->fd_wqh, &wait);
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
- spin_lock(&ctx->fault_wqh.lock);
- if (find_userfault(ctx, &uwq)) {
+ spin_lock(&ctx->fault_pending_wqh.lock);
+ uwq = find_userfault(ctx);
+ if (uwq) {
/*
- * The fault_wqh.lock prevents the uwq to
- * disappear from under us.
+ * The fault_pending_wqh.lock prevents the uwq
+ * to disappear from under us.
+ *
+ * Refile this userfault from
+ * fault_pending_wqh to fault_wqh, it's not
+ * pending anymore after we read it.
*/
- uwq->pending = false;
+ list_del(&uwq->wq.task_list);
+ __add_wait_queue(&ctx->fault_wqh, &uwq->wq);
+
/* careful to always initialize msg if ret == 0 */
*msg = uwq->msg;
- spin_unlock(&ctx->fault_wqh.lock);
+ spin_unlock(&ctx->fault_pending_wqh.lock);
ret = 0;
break;
}
- spin_unlock(&ctx->fault_wqh.lock);
+ spin_unlock(&ctx->fault_pending_wqh.lock);
if (signal_pending(current)) {
ret = -ERESTARTSYS;
break;
@@ -497,16 +512,21 @@ static void __wake_userfault(struct userfaultfd_ctx *ctx,
start = range->start;
end = range->start + range->len;

- spin_lock(&ctx->fault_wqh.lock);
+ spin_lock(&ctx->fault_pending_wqh.lock);
/* wake all in the range and autoremove */
- __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
- spin_unlock(&ctx->fault_wqh.lock);
+ if (waitqueue_active(&ctx->fault_pending_wqh))
+ __wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, 0,
+ range);
+ if (waitqueue_active(&ctx->fault_wqh))
+ __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
+ spin_unlock(&ctx->fault_pending_wqh.lock);
}

static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
struct userfaultfd_wake_range *range)
{
- if (waitqueue_active(&ctx->fault_wqh))
+ if (waitqueue_active(&ctx->fault_pending_wqh) ||
+ waitqueue_active(&ctx->fault_wqh))
__wake_userfault(ctx, range);
}

@@ -932,14 +952,17 @@ static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
struct userfaultfd_wait_queue *uwq;
unsigned long pending = 0, total = 0;

- spin_lock(&ctx->fault_wqh.lock);
+ spin_lock(&ctx->fault_pending_wqh.lock);
+ list_for_each_entry(wq, &ctx->fault_pending_wqh.task_list, task_list) {
+ uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+ pending++;
+ total++;
+ }
list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
- if (uwq->pending)
- pending++;
total++;
}
- spin_unlock(&ctx->fault_wqh.lock);
+ spin_unlock(&ctx->fault_pending_wqh.lock);

/*
* If more protocols will be added, there will be all shown
@@ -999,6 +1022,7 @@ static struct file *userfaultfd_file_create(int flags)
goto out;

atomic_set(&ctx->refcount, 1);
+ init_waitqueue_head(&ctx->fault_pending_wqh);
init_waitqueue_head(&ctx->fault_wqh);
init_waitqueue_head(&ctx->fd_wqh);
ctx->flags = flags;

2015-05-14 17:32:09

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 16/23] userfaultfd: allocate the userfaultfd_ctx cacheline aligned

Use a dedicated slab cache to guarantee cacheline alignment.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/userfaultfd.c | 39 +++++++++++++++++++++++++++++++--------
1 file changed, 31 insertions(+), 8 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3d26f41..5542fe7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -27,20 +27,26 @@
#include <linux/ioctl.h>
#include <linux/security.h>

+static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
+
enum userfaultfd_state {
UFFD_STATE_WAIT_API,
UFFD_STATE_RUNNING,
};

+/*
+ * Start with fault_pending_wqh and fault_wqh so they're more likely
+ * to be in the same cacheline.
+ */
struct userfaultfd_ctx {
- /* pseudo fd refcounting */
- atomic_t refcount;
/* waitqueue head for the pending (i.e. not read) userfaults */
wait_queue_head_t fault_pending_wqh;
/* waitqueue head for the userfaults */
wait_queue_head_t fault_wqh;
/* waitqueue head for the pseudo fd to wakeup poll/read */
wait_queue_head_t fd_wqh;
+ /* pseudo fd refcounting */
+ atomic_t refcount;
/* userfaultfd syscall flags */
unsigned int flags;
/* state machine */
@@ -117,7 +123,7 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
VM_BUG_ON(spin_is_locked(&ctx->fd_wqh.lock));
VM_BUG_ON(waitqueue_active(&ctx->fd_wqh));
mmput(ctx->mm);
- kfree(ctx);
+ kmem_cache_free(userfaultfd_ctx_cachep, ctx);
}
}

@@ -987,6 +993,15 @@ static const struct file_operations userfaultfd_fops = {
.llseek = noop_llseek,
};

+static void init_once_userfaultfd_ctx(void *mem)
+{
+ struct userfaultfd_ctx *ctx = (struct userfaultfd_ctx *) mem;
+
+ init_waitqueue_head(&ctx->fault_pending_wqh);
+ init_waitqueue_head(&ctx->fault_wqh);
+ init_waitqueue_head(&ctx->fd_wqh);
+}
+
/**
* userfaultfd_file_create - Creates an userfaultfd file pointer.
* @flags: Flags for the userfaultfd file.
@@ -1017,14 +1032,11 @@ static struct file *userfaultfd_file_create(int flags)
goto out;

file = ERR_PTR(-ENOMEM);
- ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+ ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL);
if (!ctx)
goto out;

atomic_set(&ctx->refcount, 1);
- init_waitqueue_head(&ctx->fault_pending_wqh);
- init_waitqueue_head(&ctx->fault_wqh);
- init_waitqueue_head(&ctx->fd_wqh);
ctx->flags = flags;
ctx->state = UFFD_STATE_WAIT_API;
ctx->released = false;
@@ -1035,7 +1047,7 @@ static struct file *userfaultfd_file_create(int flags)
file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, ctx,
O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
if (IS_ERR(file))
- kfree(ctx);
+ kmem_cache_free(userfaultfd_ctx_cachep, ctx);
out:
return file;
}
@@ -1064,3 +1076,14 @@ err_put_unused_fd:

return error;
}
+
+static int __init userfaultfd_init(void)
+{
+ userfaultfd_ctx_cachep = kmem_cache_create("userfaultfd_ctx_cache",
+ sizeof(struct userfaultfd_ctx),
+ 0,
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC,
+ init_once_userfaultfd_ctx);
+ return 0;
+}
+__initcall(userfaultfd_init);

2015-05-14 17:31:40

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 17/23] userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read

Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.

Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in the kernel.

Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API perspective; furthermore, solving it in
the kernel allows removing a whole bunch of mutex and bitmap code from
qemu, making it faster. The cost of __get_user_pages_fast should be
insignificant compared to the overhead in userland of maintaining
those structures, considering it scales perfectly and the pagetables
are already hot in the CPU cache.

Applying this patch is backwards compatible with respect to the
userfaultfd userland API; however, reverting this change wouldn't be
backwards compatible anymore.

Without this patch qemu, in the background transfer thread, has to
read the old state and do UFFDIO_WAKE if old_state was MISSING but
became REQUESTED by the time it tried to set it to RECEIVED (signaling
that the other side received a userfault).

vcpu                    background_thr              userfault_thr
-----                   -----                       -----
vcpu0 handle_mm_fault()

                        postcopy_place_page
                        read old_state -> MISSING
                        UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)

vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked

                                                    poll() -> POLLIN
                                                    read() -> 0x7fb76a139000
                                                    postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED

                        tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
                        /* check that no userfault raced with UFFDIO_COPY */
                        if (old_state == MISSING && tmp_state == REQUESTED)
                                UFFDIO_WAKE from background thread

And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:

vcpu                    background_thr              userfault_thr
-----                   -----                       -----
vcpu0 handle_mm_fault()

                        postcopy_place_page
                        read old_state -> MISSING
                        UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
                        tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED

vcpu0 fault at 0x7fb76a139000 enters handle_userfault
poll() is kicked

                                                    poll() -> POLLIN
                                                    read() -> 0x7fb76a139000

                                                    if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
                                                            UFFDIO_WAKE from userfault thread

This patch removes the need for both the UFFDIO_WAKE and the
associated per-page tristate.
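
With the check done in-kernel, the background transfer thread reduces
to something along these lines (illustration only; UFFDIO_COPY and
struct uffdio_copy are introduced later in this series, so the field
names below are assumptions, and guest_addr/incoming_page/page_size
are placeholders):

	struct uffdio_copy copy = {
		.dst = (unsigned long) guest_addr,
		.src = (unsigned long) incoming_page,
		.len = page_size,
		.mode = 0,	/* no DONTWAKE: wake any in-flight userfault */
	};

	if (ioctl(ufd, UFFDIO_COPY, &copy) == -1)
		return -1;
	/* no per-page MISSING/REQUESTED/RECEIVED tristate, no UFFDIO_WAKE */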

Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/userfaultfd.c | 81 +++++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 66 insertions(+), 15 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 5542fe7..6772c22 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -167,6 +167,67 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
}

/*
+ * Verify the pagetables are still not ok after having registered into
+ * the fault_pending_wqh to avoid userland having to UFFDIO_WAKE any
+ * userfault that has already been resolved, if userfaultfd_read and
+ * UFFDIO_COPY|ZEROPAGE are being run simultaneously on two different
+ * threads.
+ */
+static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
+ unsigned long address,
+ unsigned long flags,
+ unsigned long reason)
+{
+ struct mm_struct *mm = ctx->mm;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd, _pmd;
+ pte_t *pte;
+ bool ret = true;
+
+ VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ goto out;
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ goto out;
+ pmd = pmd_offset(pud, address);
+ /*
+ * READ_ONCE must function as a barrier with narrower scope
+ * and it must be equivalent to:
+ * _pmd = *pmd; barrier();
+ *
+ * This is to deal with the instability (as in
+ * pmd_trans_unstable) of the pmd.
+ */
+ _pmd = READ_ONCE(*pmd);
+ if (!pmd_present(_pmd))
+ goto out;
+
+ ret = false;
+ if (pmd_trans_huge(_pmd))
+ goto out;
+
+ /*
+ * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
+ * and use the standard pte_offset_map() instead of parsing _pmd.
+ */
+ pte = pte_offset_map(pmd, address);
+ /*
+ * Lockless access: we're in a wait_event so it's ok if it
+ * changes under us.
+ */
+ if (pte_none(*pte))
+ ret = true;
+ pte_unmap(pte);
+
+out:
+ return ret;
+}
+
+/*
* The locking rules involved in returning VM_FAULT_RETRY depending on
* FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
* FAULT_FLAG_KILLABLE are not straightforward. The "Caution"
@@ -188,6 +249,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
struct userfaultfd_ctx *ctx;
struct userfaultfd_wait_queue uwq;
int ret;
+ bool must_wait;

BUG_ON(!rwsem_is_locked(&mm->mmap_sem));

@@ -247,9 +309,6 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
/* take the reference before dropping the mmap_sem */
userfaultfd_ctx_get(ctx);

- /* be gentle and immediately relinquish the mmap_sem */
- up_read(&mm->mmap_sem);
-
init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
uwq.wq.private = current;
uwq.msg = userfault_msg(address, flags, reason);
@@ -269,7 +328,10 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
set_current_state(TASK_KILLABLE);
spin_unlock(&ctx->fault_pending_wqh.lock);

- if (likely(!ACCESS_ONCE(ctx->released) &&
+ must_wait = userfaultfd_must_wait(ctx, address, flags, reason);
+ up_read(&mm->mmap_sem);
+
+ if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&
!fatal_signal_pending(current))) {
wake_up_poll(&ctx->fd_wqh, POLLIN);
schedule();
@@ -845,17 +907,6 @@ out:
}

/*
- * userfaultfd_wake is needed in case an userfault is in flight by the
- * time a UFFDIO_COPY (or other ioctl variants) completes. The page
- * may be well get mapped and the page fault if repeated wouldn't lead
- * to a userfault anymore, but before scheduling in TASK_KILLABLE mode
- * handle_userfault() doesn't recheck the pagetables and it doesn't
- * serialize against UFFDO_COPY (or other ioctl variants). Ultimately
- * the knowledge of which pages are mapped is left to userland who is
- * responsible for handling the race between read() userfaults and
- * background UFFDIO_COPY (or other ioctl variants), if done by
- * separate concurrent threads.
- *
* userfaultfd_wake may be used in combination with the
* UFFDIO_*_MODE_DONTWAKE to wakeup userfaults in batches.
*/

2015-05-14 17:41:21

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 18/23] userfaultfd: buildsystem activation

This allows userfaultfd to be selected at configuration time so that
it gets built.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/Makefile | 1 +
init/Kconfig | 11 +++++++++++
2 files changed, 12 insertions(+)

diff --git a/fs/Makefile b/fs/Makefile
index cb92fd4..53e59b2 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_ANON_INODES) += anon_inodes.o
obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
+obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
obj-$(CONFIG_AIO) += aio.o
obj-$(CONFIG_FS_DAX) += dax.o
obj-$(CONFIG_FILE_LOCKING) += locks.o
diff --git a/init/Kconfig b/init/Kconfig
index 88ad043..e1c9d9e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1644,6 +1644,17 @@ config ADVISE_SYSCALLS
applications use these syscalls, you can disable this option to save
space.

+config USERFAULTFD
+ bool "Enable userfaultfd() system call"
+ select ANON_INODES
+ default y
+ depends on MMU
+ help
+ Enable the userfaultfd() system call that allows to intercept and
+ handle page faults in userland.
+
+ If unsure, say Y.
+
config PCI_QUIRKS
default y
bool "Enable PCI quirk workarounds" if EXPERT

2015-05-14 17:39:50

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 19/23] userfaultfd: activate syscall

This activates the userfaultfd syscall.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/powerpc/include/asm/systbl.h | 1 +
arch/powerpc/include/uapi/asm/unistd.h | 1 +
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 1 +
kernel/sys_ni.c | 1 +
6 files changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index f1863a1..4741b15 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -368,3 +368,4 @@ SYSCALL_SPU(memfd_create)
SYSCALL_SPU(bpf)
COMPAT_SYS(execveat)
PPC64ONLY(switch_endian)
+SYSCALL_SPU(userfaultfd)
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index e4aa173..6ad58d4 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -386,5 +386,6 @@
#define __NR_bpf 361
#define __NR_execveat 362
#define __NR_switch_endian 363
+#define __NR_userfaultfd 364

#endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index ef8187f..dcc18ea 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
356 i386 memfd_create sys_memfd_create
357 i386 bpf sys_bpf
358 i386 execveat sys_execveat stub32_execveat
+359 i386 userfaultfd sys_userfaultfd
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 9ef32d5..81c4906 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
320 common kexec_file_load sys_kexec_file_load
321 common bpf sys_bpf
322 64 execveat stub_execveat
+323 common userfaultfd sys_userfaultfd

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index bb51bec..8ee1d0c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -810,6 +810,7 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
asmlinkage long sys_eventfd(unsigned int count);
asmlinkage long sys_eventfd2(unsigned int count, int flags);
asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
+asmlinkage long sys_userfaultfd(int flags);
asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7995ef5..8b3e10ea 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -218,6 +218,7 @@ cond_syscall(compat_sys_timerfd_gettime);
cond_syscall(sys_eventfd);
cond_syscall(sys_eventfd2);
cond_syscall(sys_memfd_create);
+cond_syscall(sys_userfaultfd);

/* performance counters: */
cond_syscall(sys_perf_event_open);

2015-05-14 17:39:47

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 20/23] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI

This implements the uABI of UFFDIO_COPY and UFFDIO_ZEROPAGE.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/uapi/linux/userfaultfd.h | 42 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 8e42bc3..c8a543f 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -21,7 +21,9 @@
(__u64)1 << _UFFDIO_UNREGISTER | \
(__u64)1 << _UFFDIO_API)
#define UFFD_API_RANGE_IOCTLS \
- ((__u64)1 << _UFFDIO_WAKE)
+ ((__u64)1 << _UFFDIO_WAKE | \
+ (__u64)1 << _UFFDIO_COPY | \
+ (__u64)1 << _UFFDIO_ZEROPAGE)

/*
* Valid ioctl command number range with this API is from 0x00 to
@@ -34,6 +36,8 @@
#define _UFFDIO_REGISTER (0x00)
#define _UFFDIO_UNREGISTER (0x01)
#define _UFFDIO_WAKE (0x02)
+#define _UFFDIO_COPY (0x03)
+#define _UFFDIO_ZEROPAGE (0x04)
#define _UFFDIO_API (0x3F)

/* userfaultfd ioctl ids */
@@ -46,6 +50,10 @@
struct uffdio_range)
#define UFFDIO_WAKE _IOR(UFFDIO, _UFFDIO_WAKE, \
struct uffdio_range)
+#define UFFDIO_COPY _IOWR(UFFDIO, _UFFDIO_COPY, \
+ struct uffdio_copy)
+#define UFFDIO_ZEROPAGE _IOWR(UFFDIO, _UFFDIO_ZEROPAGE, \
+ struct uffdio_zeropage)

/* read() structure */
struct uffd_msg {
@@ -118,4 +126,36 @@ struct uffdio_register {
__u64 ioctls;
};

+struct uffdio_copy {
+ __u64 dst;
+ __u64 src;
+ __u64 len;
+ /*
+ * There will be a wrprotection flag later that allows to map
+ * pages wrprotected on the fly. And such a flag will be
+ * available if the wrprotection ioctl are implemented for the
+ * range according to the uffdio_register.ioctls.
+ */
+#define UFFDIO_COPY_MODE_DONTWAKE ((__u64)1<<0)
+ __u64 mode;
+
+ /*
+ * "copy" is written by the ioctl and must be at the end: the
+ * copy_from_user will not read the last 8 bytes.
+ */
+ __s64 copy;
+};
+
+struct uffdio_zeropage {
+ struct uffdio_range range;
+#define UFFDIO_ZEROPAGE_MODE_DONTWAKE ((__u64)1<<0)
+ __u64 mode;
+
+ /*
+ * "zeropage" is written by the ioctl and must be at the end:
+ * the copy_from_user will not read the last 8 bytes.
+ */
+ __s64 zeropage;
+};
+
#endif /* _LINUX_USERFAULTFD_H */

2015-05-14 17:31:44

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 21/23] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation

This implements mcopy_atomic and mfill_zeropage, the lowlevel VM
methods that are invoked respectively by the UFFDIO_COPY and
UFFDIO_ZEROPAGE userfaultfd commands.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/userfaultfd_k.h | 6 +
mm/Makefile | 1 +
mm/userfaultfd.c | 269 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 276 insertions(+)
create mode 100644 mm/userfaultfd.c

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index e1e4360..587480a 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -30,6 +30,12 @@
extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags, unsigned long reason);

+extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
+ unsigned long src_start, unsigned long len);
+extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
+ unsigned long dst_start,
+ unsigned long len);
+
/* mm helpers */
static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
struct vm_userfaultfd_ctx vm_ctx)
diff --git a/mm/Makefile b/mm/Makefile
index 98c4eae..b424d5e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -78,3 +78,4 @@ obj-$(CONFIG_CMA) += cma.o
obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
+obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
new file mode 100644
index 0000000..c54c761
--- /dev/null
+++ b/mm/userfaultfd.c
@@ -0,0 +1,269 @@
+/*
+ * mm/userfaultfd.c
+ *
+ * Copyright (C) 2015 Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/mmu_notifier.h>
+#include <asm/tlbflush.h>
+#include "internal.h"
+
+static int mcopy_atomic_pte(struct mm_struct *dst_mm,
+ pmd_t *dst_pmd,
+ struct vm_area_struct *dst_vma,
+ unsigned long dst_addr,
+ unsigned long src_addr)
+{
+ struct mem_cgroup *memcg;
+ pte_t _dst_pte, *dst_pte;
+ spinlock_t *ptl;
+ struct page *page;
+ void *page_kaddr;
+ int ret;
+
+ ret = -ENOMEM;
+ page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, dst_addr);
+ if (!page)
+ goto out;
+
+ page_kaddr = kmap(page);
+ ret = -EFAULT;
+ if (copy_from_user(page_kaddr, (const void __user *) src_addr,
+ PAGE_SIZE))
+ goto out_kunmap_release;
+ kunmap(page);
+
+ /*
+ * The memory barrier inside __SetPageUptodate makes sure that
+ * preceeding stores to the page contents become visible before
+ * the set_pte_at() write.
+ */
+ __SetPageUptodate(page);
+
+ ret = -ENOMEM;
+ if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg))
+ goto out_release;
+
+ _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+ if (dst_vma->vm_flags & VM_WRITE)
+ _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
+
+ ret = -EEXIST;
+ dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+ if (!pte_none(*dst_pte))
+ goto out_release_uncharge_unlock;
+
+ inc_mm_counter(dst_mm, MM_ANONPAGES);
+ page_add_new_anon_rmap(page, dst_vma, dst_addr);
+ mem_cgroup_commit_charge(page, memcg, false);
+ lru_cache_add_active_or_unevictable(page, dst_vma);
+
+ set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(dst_vma, dst_addr, dst_pte);
+
+ pte_unmap_unlock(dst_pte, ptl);
+ ret = 0;
+out:
+ return ret;
+out_release_uncharge_unlock:
+ pte_unmap_unlock(dst_pte, ptl);
+ mem_cgroup_cancel_charge(page, memcg);
+out_release:
+ page_cache_release(page);
+ goto out;
+out_kunmap_release:
+ kunmap(page);
+ goto out_release;
+}
+
+static int mfill_zeropage_pte(struct mm_struct *dst_mm,
+ pmd_t *dst_pmd,
+ struct vm_area_struct *dst_vma,
+ unsigned long dst_addr)
+{
+ pte_t _dst_pte, *dst_pte;
+ spinlock_t *ptl;
+ int ret;
+
+ _dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
+ dst_vma->vm_page_prot));
+ ret = -EEXIST;
+ dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+ if (!pte_none(*dst_pte))
+ goto out_unlock;
+ set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(dst_vma, dst_addr, dst_pte);
+ ret = 0;
+out_unlock:
+ pte_unmap_unlock(dst_pte, ptl);
+ return ret;
+}
+
+static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd = NULL;
+
+ pgd = pgd_offset(mm, address);
+ pud = pud_alloc(mm, pgd, address);
+ if (pud)
+ /*
+ * Note that we didn't run this because the pmd was
+ * missing, the *pmd may be already established and in
+ * turn it may also be a trans_huge_pmd.
+ */
+ pmd = pmd_alloc(mm, pud, address);
+ return pmd;
+}
+
+static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
+ unsigned long dst_start,
+ unsigned long src_start,
+ unsigned long len,
+ bool zeropage)
+{
+ struct vm_area_struct *dst_vma;
+ ssize_t err;
+ pmd_t *dst_pmd;
+ unsigned long src_addr, dst_addr;
+ long copied = 0;
+
+ /*
+ * Sanitize the command parameters:
+ */
+ BUG_ON(dst_start & ~PAGE_MASK);
+ BUG_ON(len & ~PAGE_MASK);
+
+ /* Does the address range wrap, or is the span zero-sized? */
+ BUG_ON(src_start + len <= src_start);
+ BUG_ON(dst_start + len <= dst_start);
+
+ down_read(&dst_mm->mmap_sem);
+
+ /*
+ * Make sure the vma is not shared, that the dst range is
+ * both valid and fully within a single existing vma.
+ */
+ err = -EINVAL;
+ dst_vma = find_vma(dst_mm, dst_start);
+ if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+ goto out;
+ if (dst_start < dst_vma->vm_start ||
+ dst_start + len > dst_vma->vm_end)
+ goto out;
+
+ /*
+ * Be strict and only allow __mcopy_atomic on userfaultfd
+ * registered ranges to prevent userland errors going
+ * unnoticed. As far as the VM consistency is concerned, it
+ * would be perfectly safe to remove this check, but there's
+ * no useful usage for __mcopy_atomic ouside of userfaultfd
+ * registered ranges. This is after all why these are ioctls
+ * belonging to the userfaultfd and not syscalls.
+ */
+ if (!dst_vma->vm_userfaultfd_ctx.ctx)
+ goto out;
+
+ /*
+ * FIXME: only allow copying on anonymous vmas, tmpfs should
+ * be added.
+ */
+ if (dst_vma->vm_ops)
+ goto out;
+
+ /*
+ * Ensure the dst_vma has a anon_vma or this page
+ * would get a NULL anon_vma when moved in the
+ * dst_vma.
+ */
+ err = -ENOMEM;
+ if (unlikely(anon_vma_prepare(dst_vma)))
+ goto out;
+
+ for (src_addr = src_start, dst_addr = dst_start;
+ src_addr < src_start + len; ) {
+ pmd_t dst_pmdval;
+ BUG_ON(dst_addr >= dst_start + len);
+ dst_pmd = mm_alloc_pmd(dst_mm, dst_addr);
+ if (unlikely(!dst_pmd)) {
+ err = -ENOMEM;
+ break;
+ }
+
+ dst_pmdval = pmd_read_atomic(dst_pmd);
+ /*
+ * If the dst_pmd is mapped as THP don't
+ * override it and just be strict.
+ */
+ if (unlikely(pmd_trans_huge(dst_pmdval))) {
+ err = -EEXIST;
+ break;
+ }
+ if (unlikely(pmd_none(dst_pmdval)) &&
+ unlikely(__pte_alloc(dst_mm, dst_vma, dst_pmd,
+ dst_addr))) {
+ err = -ENOMEM;
+ break;
+ }
+ /* If an huge pmd materialized from under us fail */
+ if (unlikely(pmd_trans_huge(*dst_pmd))) {
+ err = -EFAULT;
+ break;
+ }
+
+ BUG_ON(pmd_none(*dst_pmd));
+ BUG_ON(pmd_trans_huge(*dst_pmd));
+
+ if (!zeropage)
+ err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
+ dst_addr, src_addr);
+ else
+ err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma,
+ dst_addr);
+
+ cond_resched();
+
+ if (!err) {
+ dst_addr += PAGE_SIZE;
+ src_addr += PAGE_SIZE;
+ copied += PAGE_SIZE;
+
+ if (fatal_signal_pending(current))
+ err = -EINTR;
+ }
+ if (err)
+ break;
+ }
+
+out:
+ up_read(&dst_mm->mmap_sem);
+ BUG_ON(copied < 0);
+ BUG_ON(err > 0);
+ BUG_ON(!copied && !err);
+ return copied ? copied : err;
+}
+
+ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
+ unsigned long src_start, unsigned long len)
+{
+ return __mcopy_atomic(dst_mm, dst_start, src_start, len, false);
+}
+
+ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
+ unsigned long len)
+{
+ return __mcopy_atomic(dst_mm, start, 0, len, true);
+}

2015-05-14 17:36:29

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic

If the rwsem starves writers, the mmap_sem read recursion wasn't
strictly a bug, but lockdep doesn't like it, and avoiding it means not
depending on lowlevel implementation details of the lock.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/userfaultfd.c | 92 ++++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 66 insertions(+), 26 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index c54c761..f8fe494 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -21,26 +21,39 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
- unsigned long src_addr)
+ unsigned long src_addr,
+ struct page **pagep)
{
struct mem_cgroup *memcg;
pte_t _dst_pte, *dst_pte;
spinlock_t *ptl;
- struct page *page;
void *page_kaddr;
int ret;
+ struct page *page;

- ret = -ENOMEM;
- page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, dst_addr);
- if (!page)
- goto out;
-
- page_kaddr = kmap(page);
- ret = -EFAULT;
- if (copy_from_user(page_kaddr, (const void __user *) src_addr,
- PAGE_SIZE))
- goto out_kunmap_release;
- kunmap(page);
+ if (!*pagep) {
+ ret = -ENOMEM;
+ page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, dst_addr);
+ if (!page)
+ goto out;
+
+ page_kaddr = kmap_atomic(page);
+ ret = copy_from_user(page_kaddr,
+ (const void __user *) src_addr,
+ PAGE_SIZE);
+ kunmap_atomic(page_kaddr);
+
+ /* fallback to copy_from_user outside mmap_sem */
+ if (unlikely(ret)) {
+ ret = -EFAULT;
+ *pagep = page;
+ /* don't free the page */
+ goto out;
+ }
+ } else {
+ page = *pagep;
+ *pagep = NULL;
+ }

/*
* The memory barrier inside __SetPageUptodate makes sure that
@@ -82,9 +95,6 @@ out_release_uncharge_unlock:
out_release:
page_cache_release(page);
goto out;
-out_kunmap_release:
- kunmap(page);
- goto out_release;
}

static int mfill_zeropage_pte(struct mm_struct *dst_mm,
@@ -139,7 +149,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
ssize_t err;
pmd_t *dst_pmd;
unsigned long src_addr, dst_addr;
- long copied = 0;
+ long copied;
+ struct page *page;

/*
* Sanitize the command parameters:
@@ -151,6 +162,11 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
BUG_ON(src_start + len <= src_start);
BUG_ON(dst_start + len <= dst_start);

+ src_addr = src_start;
+ dst_addr = dst_start;
+ copied = 0;
+ page = NULL;
+retry:
down_read(&dst_mm->mmap_sem);

/*
@@ -160,10 +176,10 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
err = -EINVAL;
dst_vma = find_vma(dst_mm, dst_start);
if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
- goto out;
+ goto out_unlock;
if (dst_start < dst_vma->vm_start ||
dst_start + len > dst_vma->vm_end)
- goto out;
+ goto out_unlock;

/*
* Be strict and only allow __mcopy_atomic on userfaultfd
@@ -175,14 +191,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
* belonging to the userfaultfd and not syscalls.
*/
if (!dst_vma->vm_userfaultfd_ctx.ctx)
- goto out;
+ goto out_unlock;

/*
* FIXME: only allow copying on anonymous vmas, tmpfs should
* be added.
*/
if (dst_vma->vm_ops)
- goto out;
+ goto out_unlock;

/*
* Ensure the dst_vma has a anon_vma or this page
@@ -191,12 +207,13 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
*/
err = -ENOMEM;
if (unlikely(anon_vma_prepare(dst_vma)))
- goto out;
+ goto out_unlock;

- for (src_addr = src_start, dst_addr = dst_start;
- src_addr < src_start + len; ) {
+ while (src_addr < src_start + len) {
pmd_t dst_pmdval;
+
BUG_ON(dst_addr >= dst_start + len);
+
dst_pmd = mm_alloc_pmd(dst_mm, dst_addr);
if (unlikely(!dst_pmd)) {
err = -ENOMEM;
@@ -229,13 +246,33 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,

if (!zeropage)
err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
- dst_addr, src_addr);
+ dst_addr, src_addr, &page);
else
err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma,
dst_addr);

cond_resched();

+ if (unlikely(err == -EFAULT)) {
+ void *page_kaddr;
+
+ BUILD_BUG_ON(zeropage);
+ up_read(&dst_mm->mmap_sem);
+ BUG_ON(!page);
+
+ page_kaddr = kmap(page);
+ err = copy_from_user(page_kaddr,
+ (const void __user *) src_addr,
+ PAGE_SIZE);
+ kunmap(page);
+ if (unlikely(err)) {
+ err = -EFAULT;
+ goto out;
+ }
+ goto retry;
+ } else
+ BUG_ON(page);
+
if (!err) {
dst_addr += PAGE_SIZE;
src_addr += PAGE_SIZE;
@@ -248,8 +285,11 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
break;
}

-out:
+out_unlock:
up_read(&dst_mm->mmap_sem);
+out:
+ if (page)
+ page_cache_release(page);
BUG_ON(copied < 0);
BUG_ON(err > 0);
BUG_ON(!copied && !err);

2015-05-14 17:33:03

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 23/23] userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE

These two ioctls allow either atomically copying pages or mapping
zeropages into the virtual address space. They are used by the thread
that opened the userfaultfd to resolve the userfaults.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/userfaultfd.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 96 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6772c22..65704cb 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -942,6 +942,96 @@ out:
return ret;
}

+static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
+ unsigned long arg)
+{
+ __s64 ret;
+ struct uffdio_copy uffdio_copy;
+ struct uffdio_copy __user *user_uffdio_copy;
+ struct userfaultfd_wake_range range;
+
+ user_uffdio_copy = (struct uffdio_copy __user *) arg;
+
+ ret = -EFAULT;
+ if (copy_from_user(&uffdio_copy, user_uffdio_copy,
+ /* don't copy "copy" last field */
+ sizeof(uffdio_copy)-sizeof(__s64)))
+ goto out;
+
+ ret = validate_range(ctx->mm, uffdio_copy.dst, uffdio_copy.len);
+ if (ret)
+ goto out;
+ /*
+ * double check for wraparound just in case. copy_from_user()
+ * will later check uffdio_copy.src + uffdio_copy.len to fit
+ * in the userland range.
+ */
+ ret = -EINVAL;
+ if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
+ goto out;
+ if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE)
+ goto out;
+
+ ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
+ uffdio_copy.len);
+ if (unlikely(put_user(ret, &user_uffdio_copy->copy)))
+ return -EFAULT;
+ if (ret < 0)
+ goto out;
+ BUG_ON(!ret);
+ /* len == 0 would wake all */
+ range.len = ret;
+ if (!(uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE)) {
+ range.start = uffdio_copy.dst;
+ wake_userfault(ctx, &range);
+ }
+ ret = range.len == uffdio_copy.len ? 0 : -EAGAIN;
+out:
+ return ret;
+}
+
+static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
+ unsigned long arg)
+{
+ __s64 ret;
+ struct uffdio_zeropage uffdio_zeropage;
+ struct uffdio_zeropage __user *user_uffdio_zeropage;
+ struct userfaultfd_wake_range range;
+
+ user_uffdio_zeropage = (struct uffdio_zeropage __user *) arg;
+
+ ret = -EFAULT;
+ if (copy_from_user(&uffdio_zeropage, user_uffdio_zeropage,
+ /* don't copy "zeropage" last field */
+ sizeof(uffdio_zeropage)-sizeof(__s64)))
+ goto out;
+
+ ret = validate_range(ctx->mm, uffdio_zeropage.range.start,
+ uffdio_zeropage.range.len);
+ if (ret)
+ goto out;
+ ret = -EINVAL;
+ if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE)
+ goto out;
+
+ ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
+ uffdio_zeropage.range.len);
+ if (unlikely(put_user(ret, &user_uffdio_zeropage->zeropage)))
+ return -EFAULT;
+ if (ret < 0)
+ goto out;
+ /* len == 0 would wake all */
+ BUG_ON(!ret);
+ range.len = ret;
+ if (!(uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_DONTWAKE)) {
+ range.start = uffdio_zeropage.range.start;
+ wake_userfault(ctx, &range);
+ }
+ ret = range.len == uffdio_zeropage.range.len ? 0 : -EAGAIN;
+out:
+ return ret;
+}
+
/*
* userland asks for a certain API version and we return which bits
* and ioctl commands are implemented in this kernel for such API
@@ -997,6 +1087,12 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
case UFFDIO_WAKE:
ret = userfaultfd_wake(ctx, arg);
break;
+ case UFFDIO_COPY:
+ ret = userfaultfd_copy(ctx, arg);
+ break;
+ case UFFDIO_ZEROPAGE:
+ ret = userfaultfd_zeropage(ctx, arg);
+ break;
}
return ret;
}

2015-05-14 17:49:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization

On Thu, May 14, 2015 at 10:31 AM, Andrea Arcangeli <[email protected]> wrote:
> +static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
> + struct userfaultfd_wake_range *range)
> +{
> + if (waitqueue_active(&ctx->fault_wqh))
> + __wake_userfault(ctx, range);
> +}

Pretty much every single time people use this "if
(waitqueue_active())" model, it tends to be a bug, because it means
that there is zero serialization with people who are just about to go
to sleep. It's fundamentally racy against all the "wait_event()" loops
that carefully do memory barriers between testing conditions and going
to sleep, because the memory barriers now don't exist on the waking
side.

So I'm making a new rule: if you use waitqueue_active(), I want an
explanation for why it's not racy with the waiter. A big comment about
the memory ordering, or about higher-level locks that are held by the
caller, or something.

Linus

2015-05-15 16:05:00

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization

On Thu, May 14, 2015 at 10:49:06AM -0700, Linus Torvalds wrote:
> On Thu, May 14, 2015 at 10:31 AM, Andrea Arcangeli <[email protected]> wrote:
> > +static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
> > + struct userfaultfd_wake_range *range)
> > +{
> > + if (waitqueue_active(&ctx->fault_wqh))
> > + __wake_userfault(ctx, range);
> > +}
>
> Pretty much every single time people use this "if
> (waitqueue_active())" model, it tends to be a bug, because it means
> that there is zero serialization with people who are just about to go
> to sleep. It's fundamentally racy against all the "wait_event()" loops
> that carefully do memory barriers between testing conditions and going
> to sleep, because the memory barriers now don't exist on the waking
> side.

As far as 10/23 is concerned, the __wake_userfault taking the locks
would also ignore any "incoming" not yet "read(2)" userfault. "read(2)"
as in syscall read. So there was no race to worry about as "incoming"
userfaults would be ignored anyway by the wake.

The only case that had to be reliable in not missing wakeup events was
the "release" file operation. But that doesn't use waitqueue_active
and it relies on ctx->released combined with mmap_sem taken for writing.

http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/tree/fs/userfaultfd.c?h=userfault&id=2f73ffa8267e41f04fe6b0d93d23feed45c0980a#n267

However later (after I started documenting what userland should do) I
started to question this old model of leaving the race handling to
userland. It's way too complex to leave the race handling to
userland. Furthermore it's inefficient, because there would be lots of
spurious userfaults that could have been woken up within the kernel
before userland had a chance to read them.

So then I handled the race in the kernel in patch 17/23. That allowed
dropping hundreds of lines of locking code from qemu (two bits per page
and a mutex and lots of complexity) and then I didn't need to document
the complex rules described in this commit header:

http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=5a2b3614e107482da9a1fcfcd120d30eb62d45dc

You can see from the patch that it complicated the kernel by adding a
pagetable walk, but it's worth it because it's simpler than solving it
in userland, plus it's solved at once for all users.

The only con is that the old model would allow userland to be way
more consistent in enforcing asserts, as it had full control.

With the current model POLLIN may be returned from poll() even though
the later read returns -EAGAIN, so it basically requires a nonblocking
open if people use poll(). (Blocking reads without poll are still
fine.)
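
To illustrate, a minimal sketch (not from the patchset) of the poll()
loop userland would use, assuming the userfaultfd was opened with
O_NONBLOCK; the EAGAIN branch is the POLLIN false positive described
above:

#include <poll.h>
#include <unistd.h>
#include <errno.h>
#include <linux/userfaultfd.h>

static void uffd_poll_loop(int uffd)
{
	struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
	struct uffd_msg msg;
	ssize_t n;

	for (;;) {
		if (poll(&pollfd, 1, -1) < 0)
			break;
		n = read(uffd, &msg, sizeof(msg));
		if (n < 0 && errno == EAGAIN)
			/* false positive POLLIN: the fault was already woken */
			continue;
		if (n != sizeof(msg))
			break;
		/* resolve msg (e.g. with UFFDIO_COPY) here */
	}
}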

Now back to your question: the waitqueue_active is still there in the
current model as well (since 17/23).

Since 17/23, losing a wakeup (as far as qemu is concerned) would mean
that qemu would read the address of the fault that didn't get woken up
because of the race, then it would send a page request to the source
node (it has no state in the destination node, where the userfault runs,
to know if it was a "dup"), which would discard the request (not
sending the page twice) after noticing in its simple per-page bitmap
that the page was already sent. So with the new model after 17/23,
losing a wakeup is a bug for qemu.

The wait_event is like this:

handle_userfault (wait_event)
-------
spin_lock(&ctx->fault_pending_wqh.lock);
/*
* After the __add_wait_queue the uwq is visible to userland
* through poll/read().
*/
__add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
/*
* The smp_mb() after __set_current_state prevents the reads
* following the spin_unlock to happen before the list_add in
* __add_wait_queue.
*/
set_current_state(TASK_KILLABLE);
spin_unlock(&ctx->fault_pending_wqh.lock);

must_wait = userfaultfd_must_wait(ctx, address, flags, reason);

up_read(&mm->mmap_sem);

if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&
!fatal_signal_pending(current))) {
wake_up_poll(&ctx->fd_wqh, POLLIN);
schedule();

The wakeup side is:

userfaultfd_copy
-----------
ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
uffdio_copy.len);
if (unlikely(put_user(ret, &user_uffdio_copy->copy)))
return -EFAULT;
if (ret < 0)
goto out;
BUG_ON(!ret);
/* len == 0 would wake all */
range.len = ret;
if (!(uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE)) {
range.start = uffdio_copy.dst;
wake_userfault(ctx, &range);


wake_userfault is the function that uses waitqueue_active, so the
problem materializes if waitqueue_active moves before
mcopy_atomic; precisely, if it moves before the set_pte_at below:

mcopy_atomic_pte
------
set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);

/* No need to invalidate - it was non-present before */
update_mmu_cache(dst_vma, dst_addr, dst_pte);

pte_unmap_unlock(dst_pte, ptl);

The unlock would allow the read to be reordered before it, even on
x86. There's an up_read in between as well, but it has the same
problem. So you're right that theoretically we can miss a wakeup.

In practice it sounds unlikely because of the sheer size of
mcopy_atomic, and we never experienced it, but it's still a bug.

> So I'm making a new rule: if you use waitqueue_active(), I want an
> explanation for why it's not racy with the waiter. A big comment
> about the memory ordering, or about higher-level locks that are held
> by the caller, or something. Linus

To fix it I added this along a comment:

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index c89e96f..6be316b 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -593,6 +593,21 @@ static void __wake_userfault(struct userfaultfd_ctx *ctx,
static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
struct userfaultfd_wake_range *range)
{
+ /*
+ * To be sure waitqueue_active() is not reordered by the CPU
+ * before the pagetable update, use an explicit SMP memory
+ * barrier here. PT lock release or up_read(mmap_sem) still
+ * have release semantics that can allow the
+ * waitqueue_active() to be reordered before the pte update.
+ */
+ smp_mb();
+
+ /*
+ * Use waitqueue_active because it's very frequent to
+ * change the address space atomically even if there are no
+ * userfaults yet. So we take the spinlock only when we're
+ * sure we've userfaults to wake.
+ */
if (waitqueue_active(&ctx->fault_pending_wqh) ||
waitqueue_active(&ctx->fault_wqh))
__wake_userfault(ctx, range);

The wait_event/handle_userfault side already has an smp_mb() to prevent
the lockless pagetable walk from being reordered before the list_add in
__add_wait_queue (needed as well precisely because of the
waitqueue_active optimization that I'd like to keep).

Thanks,
Andrea

2015-05-15 18:22:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization

On Fri, May 15, 2015 at 9:04 AM, Andrea Arcangeli <[email protected]> wrote:
>
> To fix it I added this along a comment:

Ok, this looks good as an explanation/fix for the races (and also as an
example of my worry about waitqueue_active() use in general).

However, it now makes me suspect that the optimistic "let's check if
they are even active" may not be worth it any more. You're adding a
"smp_mb()" in order to avoid taking the real lock. Although I guess
there are two locks there (one for each wait-queue) so maybe it's
worth it.

Linus

2015-05-18 14:25:08

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [PATCH 00/23] userfaultfd v4

On 05/14/2015 08:30 PM, Andrea Arcangeli wrote:
> Hello everyone,
>
> This is the latest userfaultfd patchset against mm-v4.1-rc3
> 2015-05-14-10:04.
>
> The postcopy live migration feature on the qemu side is mostly ready
> to be merged and it entirely depends on the userfaultfd syscall to be
> merged as well. So it'd be great if this patchset could be reviewed
> for merging in -mm.
>
> Userfaults allow to implement on demand paging from userland and more
> generally they allow userland to more efficiently take control of the
> behavior of page faults than what was available before
> (PROT_NONE + SIGSEGV trap).

Not to spam with 23 e-mails, all patches are

Acked-by: Pavel Emelyanov <[email protected]>

Thanks!

-- Pavel

2015-05-19 21:38:10

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 00/23] userfaultfd v4

On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli <[email protected]> wrote:

> This is the latest userfaultfd patchset against mm-v4.1-rc3
> 2015-05-14-10:04.

It would be useful to have some userfaultfd testcases in
tools/testing/selftests/. Partly as an aid to arch maintainers when
enabling this. And also as a standalone thing to give people a
practical way of exercising this interface.

What are your thoughts on enabling userfaultfd for other architectures,
btw? Are there good use cases, are people working on it, etc?


Also, I assume a manpage is in the works? Sooner rather than later
would be good - Michael's review of proposed kernel interfaces has
often been valuable.

2015-05-19 21:59:47

by Richard Weinberger

[permalink] [raw]
Subject: Re: [PATCH 00/23] userfaultfd v4

On Tue, May 19, 2015 at 11:38 PM, Andrew Morton
<[email protected]> wrote:
> On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli <[email protected]> wrote:
>
>> This is the latest userfaultfd patchset against mm-v4.1-rc3
>> 2015-05-14-10:04.
>
> It would be useful to have some userfaultfd testcases in
> tools/testing/selftests/. Partly as an aid to arch maintainers when
> enabling this. And also as a standalone thing to give people a
> practical way of exercising this interface.
>
> What are your thoughts on enabling userfaultfd for other architectures,
> btw? Are there good use cases, are people working on it, etc?

UML is using SIGSEGV for page faults.
i.e. the UML process receives a SIGSEGV, learns the faulting address
from the mcontext, and resolves the fault by installing a new mapping.

If userfaultfd is faster than the SIGSEGV notification it could speed
up UML a bit.
For UML I'm only interested in the notification, not the resolving
part. The "missing" data is present, only a new mapping is needed.
No copy of data.

Andrea, what do you think?

--
Thanks,
//richard

2015-05-20 13:24:56

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 00/23] userfaultfd v4

Hi Andrew,

On Tue, May 19, 2015 at 02:38:01PM -0700, Andrew Morton wrote:
> On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli <[email protected]> wrote:
>
> > This is the latest userfaultfd patchset against mm-v4.1-rc3
> > 2015-05-14-10:04.
>
> It would be useful to have some userfaultfd testcases in
> tools/testing/selftests/. Partly as an aid to arch maintainers when
> enabling this. And also as a standalone thing to give people a
> practical way of exercising this interface.

Agreed.

I was also thinking about writing a trinity module for it. I wrote one
for an older version, but it was much easier to do back then, before we
had ioctls; now it's more tricky because the ioctls require the fd to be
open first etc., so it's not enough to just call a syscall with a flood
of supervised-random params anymore.
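
Just to illustrate the flow such a selftest would exercise (open the
fd, UFFDIO_API handshake, UFFDIO_REGISTER a range, then resolve one
missing fault with UFFDIO_COPY), a rough sketch - assuming the uAPI from
this series, a defined __NR_userfaultfd for the build host, and with
most error handling skipped:

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static long page_size;

static void *fault_handler(void *arg)
{
	int uffd = (int)(long) arg;
	struct uffd_msg msg;
	struct uffdio_copy copy;
	char *src;

	/* source page whose contents will back the faulting address */
	src = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(src, 0xaa, page_size);

	/* blocking read(): wait for the first missing-page userfault */
	if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
		return NULL;

	copy.dst = msg.arg.pagefault.address & ~(page_size - 1);
	copy.src = (unsigned long) src;
	copy.len = page_size;
	copy.mode = 0;
	/* atomically map the page and wake the blocked faulting thread */
	ioctl(uffd, UFFDIO_COPY, &copy);
	return NULL;
}

int main(void)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg;
	pthread_t thr;
	char *area;
	int uffd;

	page_size = sysconf(_SC_PAGESIZE);

	uffd = syscall(__NR_userfaultfd, 0);
	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
		return 1;

	area = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	reg.range.start = (unsigned long) area;
	reg.range.len = page_size;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		return 1;

	pthread_create(&thr, NULL, fault_handler, (void *)(long) uffd);

	/* first touch blocks in handle_userfault() until UFFDIO_COPY runs */
	printf("first byte: %x\n", (unsigned char) area[0]);
	pthread_join(thr, NULL);
	return 0;
}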

> What are your thoughts on enabling userfaultfd for other architectures,
> btw? Are there good use cases, are people working on it, etc?

powerpc should be enabled and functional already. There's not much
arch-dependent code in it, so in theory if the postcopy live migration
patchset is applied to qemu, it should work on powerpc out of the
box. Nobody has tested it yet, but I don't expect trouble on the kernel side.

Adding support for each of the other archs is just a few-line patch
that defines the syscall number. I didn't do that out of tree because
every time a new syscall materialized I would get more rejects during
rebase.

> Also, I assume a manpage is in the works? Sooner rather than later
> would be good - Michael's review of proposed kernel interfaces has
> often been valuable.

Yes, the manpage was certainly planned. It would require updates as we
keep adding features (like the wrprotect tracking, the non-cooperative
usage, and extending the availability of the ioctls to tmpfs). We can
definitely write a manpage with the current features.

Ok, so I'll continue working on the testcase and on the manpage.

Thanks!!
Andrea

2015-05-20 14:17:37

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 00/23] userfaultfd v4

Hello Richard,

On Tue, May 19, 2015 at 11:59:42PM +0200, Richard Weinberger wrote:
> On Tue, May 19, 2015 at 11:38 PM, Andrew Morton
> <[email protected]> wrote:
> > On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli <[email protected]> wrote:
> >
> >> This is the latest userfaultfd patchset against mm-v4.1-rc3
> >> 2015-05-14-10:04.
> >
> > It would be useful to have some userfaultfd testcases in
> > tools/testing/selftests/. Partly as an aid to arch maintainers when
> > enabling this. And also as a standalone thing to give people a
> > practical way of exercising this interface.
> >
> > What are your thoughts on enabling userfaultfd for other architectures,
> > btw? Are there good use cases, are people working on it, etc?
>
> UML is using SIGSEGV for page faults.
> i.e. the UML processes receives a SIGSEGV, learns the faulting address
> from the mcontext
> and resolves the fault by installing a new mapping.
>
> If userfaultfd is faster that the SIGSEGV notification it could speed
> up UML a bit.
> For UML I'm only interested in the notification, not the resolving
> part. The "missing"
> data is present, only a new mapping is needed. No copy of data.
>
> Andrea, what do you think?

I think you need some kind of UFFDIO_MPROTECT ioctl, the same ioctl
the wrprotect tracking also needs. At the moment the future plans focus
mostly on wrprotection tracking, but it could be extended
to protnone tracking, either with the same feature flag as
wrprotection (with a generic UFFDIO_MPROTECT) or with two separate
feature flags and two separate ioctls.

Your pages are not missing, like in the postcopy live snapshotting
case the pages are not missing. The userfaultfd memory protection
ioctl will not modify the VMA, but it'll just selectively mark
pte/trans_huge_pmd wrprotected/protnone in order to get the faults. In
the case of postcopy live snapshotting a single ioctl call will mark
the entire guest address space readonly.

For live snapshotting the fault resolution is a no brainer: when you
get the fault the page is still readable and it just needs to be
copied off by the live snapshotting thread to a different location,
and then the UFFDIO_MPROTECT will be called again to make the page
writable and wake the blocked fault.

For the protnone case, you need to modify the page before waking the
blocked userfault; you can't just remove the protnone, or other threads
could modify it (if there are other threads). You'd need a further
ioctl to copy the page off to a different place by using its kernel
address (the userland address is not mapped) and copy it back to
overwrite the original page.

Alternatively once we extend the handle_userfault to tmpfs you could
map the page in two virtual mappings and track the faults in one
mapping (where the tracked app runs) and read/write the page contents
in the other mapping that isn't tracked by the userfault.

These are the first thoughts that come to mind without knowing
exactly what you need to do after you get the fault address, and
without knowing exactly why you need to mark the region PROT_NONE.

There will be some complications in adding the wrprotection/protnone
feature: if faults could already happen when the wrprotect/protnone is
armed, the handle_userfault() could be invoked in a retry-fault, that
is not ok without allowing the userfault to return VM_FAULT_RETRY even
during a refault (i.e. FAULT_FLAG_TRIED set but FAULT_FLAG_ALLOW_RETRY
not set). The invariants of vma->vm_page_prot and pte/trans_huge_pmd
permissions must also not break anywhere. These are the two main
reasons why these features that requires to flip protection bits are
left implemented later and made visible later with uffdio_api.feature
flags and/or through uffdio_register.ioctl during UFFDIO_REGISTER.

2015-05-21 13:26:01

by Kirill Smelkov

[permalink] [raw]
Subject: Re: [PATCH 00/23] userfaultfd v4

Hello up there,

On Thu, May 14, 2015 at 07:30:57PM +0200, Andrea Arcangeli wrote:
> Hello everyone,
>
> This is the latest userfaultfd patchset against mm-v4.1-rc3
> 2015-05-14-10:04.
>
> The postcopy live migration feature on the qemu side is mostly ready
> to be merged and it entirely depends on the userfaultfd syscall to be
> merged as well. So it'd be great if this patchset could be reviewed
> for merging in -mm.
>
> Userfaults allow to implement on demand paging from userland and more
> generally they allow userland to more efficiently take control of the
> behavior of page faults than what was available before
> (PROT_NONE + SIGSEGV trap).
>
> The use cases are:

[...]

> Even though there wasn't a real use case requesting it yet, it also
> allows to implement distributed shared memory in a way that readonly
> shared mappings can exist simultaneously in different hosts and they
> can be become exclusive at the first wrprotect fault.

Sorry for maybe speaking up too late, but here is an additional real
potential use-case which in my view overlaps with the above:

Recently we needed to implement persistency for NumPy arrays - that is,
to track changes made to array memory and transactionally either abandon
the changes on transaction abort, or store them back to storage on
transaction commit.

Since arrays can be large, it would be slow and thus not practical to
keep a copy of the original data and compare memory against it to find
which array parts have been changed.

So I've implemented a scheme where array data is initially PROT_READ
protected; then we catch SIGSEGV, and if it is a write and the area
belongs to array data, we mark that page as PROT_WRITE and continue. At
commit time we know which parts were modified.
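
To illustrate the trick, a minimal sketch of the write-tracking part
(addr_belongs_to_array() and mark_page_dirty() are hypothetical
stand-ins for the real bookkeeping; detecting that the fault was a
write is arch-specific - on x86 it comes from the error code in the
ucontext - and is omitted here):

#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL

/* hypothetical helpers provided by the array bookkeeping */
extern int addr_belongs_to_array(void *addr);
extern void mark_page_dirty(void *page);

static void segv_handler(int sig, siginfo_t *si, void *uctx)
{
	void *page = (void *)((uintptr_t) si->si_addr & ~(PAGE_SIZE - 1));

	(void) sig;
	(void) uctx;

	if (!addr_belongs_to_array(si->si_addr))
		abort();	/* a real crash, not one of our faults */

	/* first write to a PROT_READ page: remember it and allow the write */
	mark_page_dirty(page);
	mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
}

static void setup_write_tracking(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGSEGV, &sa, NULL);
}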

Also, since arrays could be large - bigger than RAM - and only sparse
parts of them may be needed to get the required information, for reading
it also makes sense to lazily load data in the SIGSEGV handler, with
initial PROT_NONE protection.

This is very similar to how memory mapped files work, but adds
transactionality which, as far as I know, is not provided by any
currently in-kernel filesystem on Linux.

The system is done as files, and arrays are then built on top of files
memory-mapped this way. So from now on we can forget about NumPy
arrays and only talk about files, their mapping, lazy loading and
transactionally storing in-memory changes back to file storage.

To get this working, a custom user-space virtual memory manager was
rolled, which manages RAM memory "pages" and file mappings into the
virtual address-space, tracks page protection and does SIGSEGV handling
appropriately.


The gist of virtual memory-manager is this:

https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c (vma_on_pagefault)


For operations it currently needs

- establishing virtual memory areas and connecting to tracking it

- changing pages protection

PROT_NONE or absent - initially
PROT_NONE -> PROT_READ - after read
PROT_READ -> PROT_READWRITE - after write
PROT_READWRITE -> PROT_READ - after commit
PROT_READWRITE -> PROT_NONE or absent (again) - after abort
PROT_READ -> PROT_NONE or absent (again) - on reclaim

- working with aliasable memory (thus taken from tmpfs)

there could be two overlapping-in-file mapping for file (array)
requested at different time, and changes from one mapping should
propagate to another one -> for common parts only 1 page should
be memory-mapped into 2 places in address-space.

so what is currently lacking on userfaultfd side is:

- ability to remove / make PROT_NONE already mapped pages
(UFFDIO_REMAP was recently dropped)

- ability to arbitrarily change pages protection (e.g. RW -> R)

- inject aliasable memory from tmpfs (or better hugetlbfs) and into
several places (UFFDIO_REMAP + some mapping copy semantic).


The code is ugly because it is only a prototype. You can clone/read it
all from here:

https://lab.nexedi.cn/kirr/wendelin.core

The virtual memory manager even has tests, and from them it can be seen
how the system is supposed to work (after each access - which pages are
mapped, where, and how):

https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/tests/test_virtmem.c

The performance currently is not great, partly because of page clearing
when getting ram from tmpfs, and partly because of mprotect/SIGSEGV/vmas
overhead and other dumb things on my side.

I still wanted to show the case, as userfaultfd here has the potential
to remove the kernel-related overhead.

Thanks beforehand for feedback,

Kirill


P.S. some context

http://www.wendelin.io/NXD-Wendelin.Core.Non.Secret/asEntireHTML

2015-05-21 15:53:45

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 00/23] userfaultfd v4

Hi Kirill,

On Thu, May 21, 2015 at 04:11:11PM +0300, Kirill Smelkov wrote:
> Sorry for maybe speaking up too late, but here is additional real

Not too late, in fact I don't think there's any change required for
this at this stage, but it'd be great if you could help me to review.

> Since arrays can be large, it would be slow and thus not practical to
[..]
> So I've implemented a scheme where array data is initially PROT_READ
> protected, then we catch SIGSEGV, if it is write and area belongs to array

In the case of postcopy live migration (for qemu and/or containers) and
postcopy live snapshotting, splitting the vmas is not an option
because we may run out of them.

If your PROT_READ areas are limited perhaps this isn't an issue, but
with hundreds-of-GB guests (currently plenty in production) that need to
live migrate fully reliably and fast, the vmas could exceed the limit
if we were to use mprotect. If your arrays are very large and the
PROT_READ areas aren't limited, userfaultfd isn't only an
optimization for you too, it's actually a must to avoid a potential
-ENOMEM.

> Also, since arrays could be large - bigger than RAM, and only sparse
> parts of it could be needed to get needed information, for reading it
> also makes sense to lazily load data in SIGSEGV handler with initial
> PROT_NONE protection.

Similarly I heard somebody wrote a fastresume to load the suspended
(on disk) guest ram using userfaultfd. That is a slightly less
fundamental case than postcopy because you could do it also with
MAP_SHARED, but it's still interesting in that it allows compressing or
decompressing the suspended ram on the fly, with lz4 for example,
something MAP_PRIVATE/MAP_SHARED wouldn't do (plus there's the
additional benefit of not having an orphaned inode left open even if
the file is deleted, which would prevent unmounting the filesystem for
the whole lifetime of the guest).

> This is very similar to how memory mapped files work, but adds
> transactionality which, as far as I know, is not provided by any
> currently in-kernel filesystem on Linux.

That's another benefit yes.

> The gist of virtual memory-manager is this:
>
> https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
> https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c (vma_on_pagefault)

I'll check it more in detail ASAP, thanks for the pointers!

> For operations it currently needs
>
> - establishing virtual memory areas and connecting to tracking it

That's the UFFDIO_REGISTER/UNREGISTER.

> - changing pages protection
>
> PROT_NONE or absent - initially

absent is what works with -mm already. The lazy loading already works.

> PROT_NONE -> PROT_READ - after read

Current UFFDIO_COPY will map it using vma->vm_page_prot.

We'll need a new flag for UFFDIO_COPY to map it readonly. This is
already contemplated:

/*
* There will be a wrprotection flag later that allows to map
* pages wrprotected on the fly. And such a flag will be
* available if the wrprotection ioctl are implemented for the
* range according to the uffdio_register.ioctls.
*/
#define UFFDIO_COPY_MODE_DONTWAKE ((__u64)1<<0)
__u64 mode;

If the memory protection framework exists (either through the
uffdio_register.ioctl out value, or through uffdio_api.features
out-only value) you can pass a new flag (MODE_WP) above to transition
from "absent" to "PROT_READ".

> PROT_READ -> PROT_READWRITE - after write

This will need to add UFFDIO_MPROTECT.

> PROT_READWRITE -> PROT_READ - after commit

UFFDIO_MPROTECT again (but harder if going from rw to ro, because of a
slight mess to solve with regard to FAULT_FLAG_TRIED, in case you want
to run this UFFDIO_MPROTECT without stopping the threads that are
accessing the memory concurrently).

And this should only work if the uffdio_register.mode had MODE_WP set,
so we don't run into the races created by COWs (gup vs fork race).

> PROT_READWRITE -> PROT_NONE or absent (again) - after abort

UFFDIO_MPROTECT again, but you won't be able to read the page contents
inside the memory manager thread (the one working with
userfaultfd).

The manager is at all times forbidden to touch the memory it is
tracking with userfaultfd (if it does it'll deadlock, but kill -9 will
get rid of it). gdb, ironically because it is using an underoptimized
access_process_vm, wouldn't hang, because FAULT_FLAG_ALLOW_RETRY won't
be set in handle_userfault in the gdb context, and it'll just receive a
sigbus if by mistake the user tries to touch the memory. Even if it
will hang later once get_user_pages_locked|unlocked gets used there too,
kill -9 would solve gdb too.

Back to the problem of accessing the UFFDIO_MPROTECT(PROT_NONE)
memory: to do that a new ioctl would be required. I'd rather not go
back to the route of UFFDIO_REMAP, but it could copy the data using
the kernel address.

It could simply be a reverse UFFDIO_COPY. We could add a
UFFDIO_COPY_MODE_REVERSE flag to the "mode" of UFFDIO_COPY to mean
"read the source from the kernel address and write the destination to
the user address". By default it reads the source from the user address
and writes the destination to the kernel address (to be atomic).

If you want to put data back before lifting the PROT_NONE, UFFDIO_COPY
could be used in the standard way but with a
UFFDIO_COPY_MODE_OVERWRITE flag that just overwrites the contents of
the old page if it's not mapped (protnone), or just gets rid of the old
page (currently it'd return -EEXIST if the pte is not none).

So the process would be:

UFFDIO_COPY(dst_tmpaddr, src_addr, mode=REVERSE)
UFFDIO_COPY(src_addr, dst_tmpaddr, mode=OVERWRITE)

Then if you also set mode=READONLY in the last UFFDIO_COPY, it'll
create a wrprotected mapping atomically before giving visibility to
the new page contents:

UFFDIO_COPY(src_addr, dst_tmpaddr, mode=OVERWRITE|WP)

> PROT_READ -> PROT_NONE or absent (again) - on reclaim

Same as above.

> - working with aliasable memory (thus taken from tmpfs)
>
> there could be two overlapping-in-file mapping for file (array)
> requested at different time, and changes from one mapping should
> propagate to another one -> for common parts only 1 page should
> be memory-mapped into 2 places in address-space.

Why isn't the manager thread taking care of calling UFFDIO_MPROTECT in
two places?

And UFFDIO_COPY would fill the page and replace the old page, and the
effect would be visible as far as the "data" is concerned, but the
protection bits would more naturally be different for each
mapping, just as a double mmap call is also required to map such an area
in two places.

You could have a MAP_PRIVATE vma with PROT_READ; you can't create a
writable pte in it just because you called
UFFDIO_MPROTECT(PROT_READ|PROT_WRITE) in a different mapping of the
same tmpfs page.

NOTE: the availability of UFFDIO_MPROTECT|COPY on tmpfs areas would
still depend on UFFDIO_REGISTER returning the respective ioctl id in
uffdio_register.ioctl (the out value of the register ioctl).

> so what is currently lacking on userfaultfd side is:
>
> - ability to remove / make PROT_NONE already mapped pages
> (UFFDIO_REMAP was recently dropped)
>
> - ability to arbitrarily change pages protection (e.g. RW -> R)
>
> - inject aliasable memory from tmpfs (or better hugetlbfs) and into
> several places (UFFDIO_REMAP + some mapping copy semantic).

I think UFFDIO_COPY if added with OVERWRITE|REVERSE|WP flags is an ok
substitute for UFFDIO_REMAP.

If UFFDIO_COPY sees the page is protnone during the REVERSE copy (to
extract the memory atomically), it can also skip the tlb flush (and
obviously there's no tlb flush in the reverse direction). If the page
was not protnone, it can turn it into protnone, do a tlb flush, and then
copy it to the destination address using the userland mapping.

UFFDIO_MPROTECT is definitely necessary for postcopy live snapshotting
too (the reverse UFFDIO_COPY is not, it never deals with
PROT_NONE and it never cares about missing faults).

MPROTECT(PROT_NONE) so far seems needed only by this and perhaps UML
(and perhaps qemu linux-user).

I posted in another email why these features aren't implemented yet

==
There will be some complications in adding the wrprotection/protnone
feature: if faults could already happen when the wrprotect/protnone is
armed, the handle_userfault() could be invoked in a retry-fault, that
is not ok without allowing the userfault to return VM_FAULT_RETRY even
during a refault (i.e. FAULT_FLAG_TRIED set but FAULT_FLAG_ALLOW_RETRY
not set). The invariants of vma->vm_page_prot and pte/trans_huge_pmd
permissions must also not break anywhere. These are the two main
reasons why these features that requires to flip protection bits are
left implemented later and made visible later with uffdio_api.feature
flags and/or through uffdio_register.ioctl during UFFDIO_REGISTER.
==

> The performance currently is not great, partly because of page clearing
> when getting ram from tmpfs, and partly because of mprotect/SIGSEGV/vmas
> overhead and other dumb things on my side.

Also, the page faults get slowed down when the rbtree grows a lot;
userfaultfd won't let the rbtree grow.

> I still wanted to show the case, as userfaultd here has potential to
> remove overhead related to kernel.

That's very useful and interesting feedback!

Could you review the API to be sure we don't have to modify it when we
extend it like described above?

1) tmpfs returning uffdio_register.ioctl |=
UFFDIO_MPROTECT|UFFDIO_COPY when enabled

2) UFFDIO_MPROTECT(PROT_NONE|READ|WRITE|EXEC) and in turn
UFFDIO_COPY_MODE_REVERSE|UFFDIO_COPY_MODE_OVERWRITE|UFFDIO_COPY_MODE_WP
being available if uffdio_register.ioctl includes UFFDIO_MPROTECT
(and uffdio_api.features will then include
UFFD_FEATURE_PAGEFAULT_WP to signal the uffd_msg.pagefault.flag WP
is available [bit 1], and UFFDIO_REGISTER_MODE_WP can be used in
uffdio_register.mode)

All of it could just check for uffdio_api.features &
UFFD_FEATURE_PAGEFAULT_WP being set, but you'd still have to check for
UFFDIO_MPROTECT being set in uffdio_register.ioctl for tmpfs areas (or
to know it's not available yet on hugetlbfs), so I think it's more
robust to check for the UFFDIO_MPROTECT ioctl being set in
uffdio_register.ioctl to assume all mprotection and writeprotect
tracking features are available for that specific range. The feature
flag will just tell you that UFFDIO_REGISTER_MODE_WP can be used in the
register ioctl, which is something you need to know beforehand in order
to "arm" the VMA for wrprotect faults.

For your usage I think you probably want to set
UFFDIO_REGISTER_MODE_WP|UFFDIO_REGISTER_MODE_MISSING and you'll be
told through the uffd_msg pagefault flags if it's a WP or MISSING fault.
You won't be told whether it's missing because of PROT_NONE or simply
absent.

On a side note: all of the above is completely orthogonal to the
non-cooperative usage: as far as memory protection features go, it
doesn't need any; it just needs to track more events like fork/mremap to
adjust its offsets, as the memory manager is not part of the app and
has no way to orchestrate by other means.

Doing it all at once (non-cooperative + full memory protection) looked
like too much. We should just try to get the API right in a way that
won't require a UFFD_API bump passed to uffdio_api.api. Even then, if an
API bump is required, that's not a big deal; until recently the
non-cooperative usage did bump the API, but we accommodated the
read(2) API to avoid it.

Thinking of the worst case scenario, if the API gets bumped the only
thing that has to remain fixed is the ioctl number of UFFDIO_API
and the uffdio_api structure. Everything else can be upgraded without
risk of ABI breakage; even the ioctl numbers can be reused (except
UFFDIO_API itself). When the non-cooperative usage bumped the API it
actually kept all ioctls the same except the read(2) format.

Comments welcome, thanks!
Andrea

2015-05-22 16:33:41

by Kirill Smelkov

[permalink] [raw]
Subject: Re: [PATCH 00/23] userfaultfd v4

Hi Andrea,

On Thu, May 21, 2015 at 05:52:51PM +0200, Andrea Arcangeli wrote:
> Hi Kirill,
>
> On Thu, May 21, 2015 at 04:11:11PM +0300, Kirill Smelkov wrote:
> > Sorry for maybe speaking up too late, but here is additional real
>
> Not too late, in fact I don't think there's any change required for
> this at this stage, but it'd be great if you could help me to review.

Thanks


> > Since arrays can be large, it would be slow and thus not practical to
> [..]
> > So I've implemented a scheme where array data is initially PROT_READ
> > protected, then we catch SIGSEGV, if it is write and area belongs to array
>
> In the case of postcopy live migration (for qemu and/or containers) and
> postcopy live snapshotting, splitting the vmas is not an option
> because we may run out of them.
>
> If your PROT_READ areas are limited perhaps this isn't an issue but
> with hundreds-of-GB guests (currently plenty in production) that need to
> live migrate fully reliably and fast, the vmas could exceed the limit
> if we were to use mprotect. If your arrays are very large and the
> PROT_READ areas aren't limited, using userfaultfd isn't only an
> optimization for you, it's actually a must to avoid a potential
> -ENOMEM.

I understand. To somewhat mitigate this issue, for every array/file I
try to allocate RAM pages from a separate file on tmpfs at the same
offset. This way, if we allocate a lot of pages and mmap them in with
PROT_READ and they are adjacent to each other, the kernel will merge
the adjacent vmas into one vma:

https://lab.nexedi.cn/kirr/wendelin.core/blob/ca064f75/include/wendelin/bigfile/ram.h#L100
https://lab.nexedi.cn/kirr/wendelin.core/blob/ca064f75/bigfile/ram_shmfs.c#L102
https://lab.nexedi.cn/kirr/wendelin.core/blob/ca064f75/bigfile/virtmem.c#L435
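
Roughly, the trick is to always mmap a page at the same offset it has
in its per-array tmpfs file, so neighbouring pages produce mergeable
vmas (simplified sketch, the names are made up):

	/* ram_fd: one tmpfs file per array; off: page-aligned offset of the
	 * page inside the array; addr: the matching address inside the
	 * array's reserved address-space window */
	void *p = mmap(addr, page_size, PROT_READ,
		       MAP_SHARED | MAP_FIXED, ram_fd, off);
	if (p == MAP_FAILED)
		err(1, "mmap");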

I agree this is only a half-measure - the file parts accessed could be
sparse, and there is also an in-shmfs overhead of maintaining tables of
which real pages are allocated to which part of the file on shmfs.

So yes, if userfaultfd allows us to bypass the vma layer and work
directly on page tables, this helps.


> > Also, since arrays could be large - bigger than RAM, and only sparse
> > parts of it could be needed to get needed information, for reading it
> > also makes sense to lazily load data in SIGSEGV handler with initial
> > PROT_NONE protection.
>
> Similarly I heard somebody wrote a fastresume to load the suspended
> (on disk) guest ram using userfaultfd. That is a slightly less
> fundamental case than postcopy because you could do it also with
> MAP_SHARED, but it's still interesting in allowing to compress or
> decompress the suspended ram on the fly with lz4 for example,
> something MAP_PRIVATE/MAP_SHARED wouldn't do (plus there's the
> additional benefit of not having an orphaned inode left open even if
> the file is deleted, which would prevent unmounting the filesystem for
> the whole lifetime of the guest).

I see. Just a note - transparent compression/decompression could be
achieved with MAP_SHARED if the compression is performed by the
underlying filesystem - e.g. one implemented with FUSE.

( I have not measured performance though )


> > This is very similar to how memory mapped files work, but adds
> > transactionality which, as far as I know, is not provided by any
> > currently in-kernel filesystem on Linux.
>
> That's another benefit yes.
>
> > The gist of virtual memory-manager is this:
> >
> > https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
> > https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c (vma_on_pagefault)
>
> I'll check it more in detail ASAP, thanks for the pointers!
>
> > For operations it currently needs
> >
> > - establishing virtual memory areas and connecting to tracking it
>
> That's the UFFDIO_REGISTER/UNREGISTER.

Yes

> > - changing pages protection
> >
> > PROT_NONE or absent - initially
>
> absent is what works with -mm already. The lazy loading already works.

Yes

> > PROT_NONE -> PROT_READ - after read
>
> Current UFFDIO_COPY will map it using vma->vm_page_prot.
>
> We'll need a new flag for UFFDIO_COPY to map it readonly. This is
> already contemplated:
>
> /*
> * There will be a wrprotection flag later that allows to map
> * pages wrprotected on the fly. And such a flag will be
> * available if the wrprotection ioctl are implemented for the
> * range according to the uffdio_register.ioctls.
> */
> #define UFFDIO_COPY_MODE_DONTWAKE ((__u64)1<<0)
> __u64 mode;
>
> If the memory protection framework exists (either through the
> uffdio_register.ioctl out value, or through uffdio_api.features
> out-only value) you can pass a new flag (MODE_WP) above to transition
> from "absent" to "PROT_READ".

Yes. The same probably applies to UFFDIO_ZEROPAGE (to map in the
zeropage as read-only on read, if that part of the file is currently a
hole).

So we settle on adding

UFFDIO_COPY_MODE_WP and
UFFDIO_ZEROPAGE_MODE_WP
?

Or maybe: why do we have both COPY_DONTWAKE and ZEROPAGE_DONTWAKE (and
previously REMAP_MODE_DONTWAKE), and now we would duplicate _MODE_WP to
all of them?

Maybe it makes sense to move those common flags - related to waking up
or not, and to mapping in as R or RW (and maybe others in the future) -
to a common place.


> > PROT_READ -> PROT_READWRITE - after write
>
> This will need to add UFFDIO_MPROTECT.

Yes

> > PROT_READWRITE -> PROT_READ - after commit
>
> UFFDIO_MPROTECT again (but harder if going from rw to ro, because of a
> slight mess to solve with regard to FAULT_FLAG_TRIED, in case you want
> to run this UFFDIO_MPROTECT without stopping the threads that are
> accessing the memory concurrently).

Yes. I understand the race of a manager making pages RW -> R while
another thread in the user process simultaneously writes to the same
page.

On the user level this race can be solved this way: before commit,
changed pages are first marked as R and only then written to storage.

- If a write races with the RW->R protection change, it's a client
problem (one thread still modifying data while another thread triggers
the commit) - we are ok with committing whatever had already been
written at the moment the RW->R change happened.

- If a write happens after the RW->R protection change has been made
(even if it is racy), we are ok as long as the userfaultfd handler gets
a proper notification of the write to the WP area.

So if the "slight mess" can be solved on the kernel side, userspace is
ok living with the potential race and solving it itself.

> And this should only work if the uffdio_register.mode had MODE_WP set,
> so we don't run into the races created by COWs (gup vs fork race).

It is ok to start by registering with

(UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP)

for the whole area at the beginning. In other words we are telling
userfaultfd "we want to handle all kinds of faults - both reads and writes".


> > PROT_READWRITE -> PROT_NONE or absent (again) - after abort
>
> UFFDIO_MPROTECT again,

My idea here is that on transaction abort we just don't need that
changed memory - we can both forget the changes and free the
appropriate pages - if in the next transaction someone needs the file
data of that same part again, it just reloads the usual way from the
file.

So what is needed here is maybe

UFFDIO_MFREE ( make the pte absent, and free(*) the referenced page
(*) free maybe = decrement its refcount )

But for generality, I agree it makes sense to have a way to just
MPROTECT with PROT_NONE without freeing the page.

> but you won't be able to read the page contents
> inside the memory manager thread (the one working with
> userfaultfd).

With PROT_NONE I see.

> The manager at all times is forbidden to touch the memory it is
> tracking with userfaultfd (if it does it'll deadlock, but kill -9 will
> get rid of it). gdb, ironically, because it is using an underoptimized
> access_process_vm, wouldn't hang, because FAULT_FLAG_RETRY won't be set
> in handle_userfault in the gdb context, and it'll just receive a
> sigbus if by mistake the user tries to touch the memory. Even if it
> will hang later as get_user_pages_locked|unlocked gets used there too,
> kill -9 would solve gdb too.

Wait. I only partly understand, because I don't have much experience in
mm. But are you saying the manager cannot access the memory it tracks,
even if the pages have PROT_READ or PROT_READWRITE protection and we
know they were already loaded, i.e. they are not missing?

If yes, then for sure there needs to be a way to get the data back from
memory to the manager, to implement storing the changes back.

And maybe for some cases it would make sense to first set the
protection to PROT_NONE, so the manager knows the client cannot mess
with the data while it accesses it.


> Back to the problem of accessing the UFFDIO_MPROTECT(PROT_NONE)
> memory: to do that a new ioctl should be required. I'd rather not go
> back to the route of UFFDIO_REMAP, but it could copy the data using
> the kernel address.
>
> It could be simply a reverse UFFDIO_COPY. We could add a
> UFFDIO_COPY_MODE_REVERSE flag to the "mode" of UFFDIO_COPY to mean
> "read source from kernel address and write destination in user
> address". By default it reads the source from the user address and
> writes the destination via the kernel address (to be atomic).

Yes, with the small clarification that "write to kernel address" means
a write to the "kernel address associated with the address in the
destination mm".


> If you want to put data back before lifting the PROT_NONE, UFFDIO_COPY
> could be used in the standard way but with a
> UFFDIO_COPY_MODE_OVERWRITE flag that just overwrites the contents of
> the old page if it's not mapped (protnone), or just get rid of the old
> page (currently it'd return -EEXIST if the pte is not none).
>
> So the process would be:
>
> UFFDIO_COPY(dst_tmpaddr, src_addr, mode=REVERSE)
> UFFDIO_COPY(src_addr, dst_tmpaddr, mode=OVERWRITE)
>
> Then if you also set mode=READONLY in the last UFFDIO_COPY, it'll
> create a wrprotected mapping atomically before giving visibility to
> the new page contents:
>
> UFFDIO_COPY(src_addr, dst_tmpaddr, mode=OVERWRITE|WP)

I agree in general.

The only thing which confuses me a bit is the REVERSE flag and the
dst/src meaning being swapped. I would rather keep the src / dst
ordering and have COPY_TO and COPY_FROM in the mode, or have both
UFFDIO_COPY_TO and UFFDIO_COPY_FROM, to clarify the API.

But this is only a style point and does not change the semantics.
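
To be sure I read the sequence right, here is how I would expect to use
it (none of REVERSE/OVERWRITE/WP exist in the current patchset; the bit
values below are placeholders only to keep the fragment self-contained):

	#define UFFDIO_COPY_MODE_REVERSE	((__u64)1 << 1)	/* placeholder */
	#define UFFDIO_COPY_MODE_OVERWRITE	((__u64)1 << 2)	/* placeholder */
	#define UFFDIO_COPY_MODE_WP		((__u64)1 << 3)	/* placeholder */

	struct uffdio_copy cp = { .len = page_size };

	/* 1) extract the current page contents into a manager-side buffer */
	cp.dst = (unsigned long) tmp_buf;
	cp.src = managed_addr;
	cp.mode = UFFDIO_COPY_MODE_REVERSE;
	ioctl(uffd, UFFDIO_COPY, &cp);

	/* 2) put the (possibly modified) contents back atomically, leaving
	 *    the page wrprotected so further writes fault again */
	cp.dst = managed_addr;
	cp.src = (unsigned long) tmp_buf;
	cp.mode = UFFDIO_COPY_MODE_OVERWRITE | UFFDIO_COPY_MODE_WP;
	ioctl(uffd, UFFDIO_COPY, &cp);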


But I wonder again - is it maybe possible to get at the managed memory
content (with PROT_READ or PROT_READWRITE) without copying?


> > PROT_READ -> PROT_NONE or absent (again) - on reclaim
>
> Same as above.

The same as above about abort applies to reclaim - we just need
to forget and free memory here - so UFFDIO_MFREE.


> > - working with aliasable memory (thus taken from tmpfs)
> >
> > there could be two overlapping-in-file mapping for file (array)
> > requested at different time, and changes from one mapping should
> > propagate to another one -> for common parts only 1 page should
> > be memory-mapped into 2 places in address-space.
>
> Why isn't the manager thread taking care of calling UFFDIO_MPROTECT in
> two places?
>
> And UFFDIO_COPY would fill the page and replace the old page and the
> effect would be visible as far as the "data" is concerned, but the
> protection bits would be more naturally different for each
> mapping, like a double mmap call is also required to map such an area
> in two places.

Because if we have a write in one place, the manager will
write-unprotect it, and would not get notified of further writes to the
same area.

And I need further writes to be visible in the other mappings instantly.
Here is a simplified example:

In [1]: from numpy import *
In [2]: A = zeros(10) # it simulates big array backed by manager

In [3]: a = A[0:5] # one part is mapped with start/stop = (0,5)

# something unrelated is done in between

In [4]: b = A[3:10] # another part is mapped with start/stop = (3,10)

# NOTE a and b overlaps in array/backing file.
# but since we already allocated address-space for a and did other
# things, address space beyond a could be already allocated, so we
# cannot just extend it and make b referencing addresses starting at
# a tail.

In [5]: a
Out[5]: array([ 0., 0., 0., 0., 0.])

In [6]: b
Out[6]: array([ 0., 0., 0., 0., 0., 0., 0.])

In [7]: a[4] = 1 # first change to a; the manager removes WP

In [8]: a
Out[8]: array([ 0., 0., 0., 0., 1.])

In [9]: b
Out[9]: array([ 0., 1., 0., 0., 0., 0., 0.]) # propagated to b

# write again - the manager does not get a WP fault, but the change have to
# propagate again
In [10]: a[4] = 2

In [11]: a
Out[11]: array([ 0., 0., 0., 0., 2.])

In [12]: b
Out[12]: array([ 0., 2., 0., 0., 0., 0., 0.])

so if we resolve the fault for a[4], the exact page mapped in should
also eventually be mapped into b[1].
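
In plain mmap terms (page granularity instead of numpy indices, the
file name is made up) the two views are just two MAP_SHARED windows
into the same tmpfs file:

	long psz = sysconf(_SC_PAGESIZE);
	int fd = open("/dev/shm/A", O_RDWR | O_CREAT, 0600);

	ftruncate(fd, 10 * psz);			/* the whole "array" */

	char *a = mmap(NULL, 5 * psz, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);		/* A[0:5]  */
	char *b = mmap(NULL, 7 * psz, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 3 * psz);	/* A[3:10] */

	a[4 * psz] = 1;		/* the same page now backs b + 1 * psz */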

> You could have a MAP_PRIVATE vma with PROT_READ, you can't create a
> writable pte into it, just because you called
> UFFDIO_MPROTECT(PROT_READ|PROT_WRITE) in a different mapping of the
> same tmpfs page.

I don't fully understand here, but I use MAP_SHARED so changes
propagate back to the tmpfs file and are visible in the other mappings.

https://lab.nexedi.cn/kirr/wendelin.core/blob/ca064f75/bigfile/ram_shmfs.c#L89

> NOTE: the availability of the UFFDIO_MPROTECT|COPY on tmpfs areas would
> still depend on UFFDIO_REGISTER to return the respective ioctl id in
> the uffdio_register.ioctl (out value of the register ioctl).

It is ok to check. But I'd like to note: here we mmap two overlapping
parts of a tmpfs file into two regions, and register both regions with
userfaultfd.


> > so what is currently lacking on userfaultfd side is:
> >
> > - ability to remove / make PROT_NONE already mapped pages
> > (UFFDIO_REMAP was recently dropped)
> >
> > - ability to arbitrarily change pages protection (e.g. RW -> R)
> >
> > - inject aliasable memory from tmpfs (or better hugetlbfs) and into
> > several places (UFFDIO_REMAP + some mapping copy semantic).
>
> I think UFFDIO_COPY if added with OVERWRITE|REVERSE|WP flags is an ok
> substitute for UFFDIO_REMAP.
>
> If UFFDIO_COPY sees the page is protnone during the REVERSE copy (to
> extract the memory atomically), it can also skip the tlb flush (and
> obviously there's no tlb flush in the reverse direction). If the page
> was not protnone, it can turn it in protnone, do a tlb flush, and then
> copy it to the destination address using the userland mapping.

We are also ok if the page is PROT_READ - in this case there is no need
to do a tlb flush - nothing can change the page while we are copying it -
we only have to take care not to allow changing the protection to
PROT_READWRITE in the process.

> UFFDIO_MPROTECT is definitely necessary for postcopy live snapshotting
> too (the reverse UFFDIO_COPY is not, it never deals with
> PROT_NONE and it never cares about missing faults).
>
> MPROTECT(PROT_NONE) so far seems needed only by this and perhaps UML
> (and perhaps qemu linux-user).
>
> I posted in another email why these features aren't implemented yet
>
> ==
> There will be some complications in adding the wrprotection/protnone
> feature: if faults could already happen when the wrprotect/protnone is
> armed, the handle_userfault() could be invoked in a retry-fault, that
> is not ok without allowing the userfault to return VM_FAULT_RETRY even
> during a refault (i.e. FAULT_FLAG_TRIED set but FAULT_FLAG_ALLOW_RETRY
> not set). The invariants of vma->vm_page_prot and pte/trans_huge_pmd
> permissions must also not break anywhere. These are the two main
> reasons why these features that require flipping protection bits are
> left to be implemented later and made visible later with uffdio_api.feature
> flags and/or through uffdio_register.ioctl during UFFDIO_REGISTER.
> ==

I understand, maybe not in full detail though. The changes would anyway
be needed to make userfaultfd capable of generic memory management,
instead of only populating memory in one way.

And as I noted above, besides UFFDIO_COPY (both directions, overwrite)
and UFFDIO_MPROTECT, UFFDIO_MFREE is also needed.

> > The performance currently is not great, partly because of page clearing
> > when getting ram from tmpfs, and partly because of mprotect/SIGSEGV/vmas
> > overhead and other dumb things on my side.
>
> Also the page faults get slowed down when the rbtree grows a lot;
> userfaultfd won't let the rbtree grow.

Yes. This is the same issue as with splitting vmas or not.

> > I still wanted to show the case, as userfaultfd here has potential to
> > remove overhead related to the kernel.
>
> That's very useful and interesting feedback!
>
> Could you review the API to be sure we don't have to modify it when we
> extend it as described above?
>
> 1) tmpfs returning uffdio_register.ioctl |=
> UFFDIO_MPROTECT|UFFDIO_COPY when enabled

and UFFDIO_MFREE, maybe with the ability to automatically punch a hole
for the freed memory (we need to return the memory to the system on
reclaim, and preferably on abort).


> 2) UFFDIO_MPROTECT(PROT_NONE|READ|WRITE|EXEC) and in turn
> UFFDIO_COPY_MODE_REVERSE|UFFDIO_COPY_MODE_OVERWRITE|UFFDIO_COPY_MODE_WP
> being available if uffdio_register.ioctl includes UFFDIO_MPROTECT

UFFDIO_MPROTECT -> UFFDIO_MPROTECT(PROT_NONE|READ|WRITE|EXEC) looks ok.

But imho the various copy modes could be allowed independently of
mprotect - e.g. for a manager to get back managed memory. Thus, maybe

UFFDIO_COPY_MODE_REVERSE (or analogues) and UFFDIO_COPY_MODE_OVERWRITE

should be available whenever just UFFDIO_COPY is set.


From the userspace point of view, it would be simpler to operate if all
those operations were always allowed, though.

> (and uffdio_api.features will then include
> UFFD_FEATURE_PAGEFAULT_WP to signal the uffd_msg.pagefault.flag WP
> is available [bit 1], and UFFDIO_REGISTER_MODE_WP can be used in
> uffdio_register.mode)

Yes, though it does not fully represent the capability - e.g.
MPROTECT(PROT_NONE) should work too, so it should maybe be
UFFD_FEATURE_PAGEFAULT_PROTECT.

Again, from the client's point of view it would be simpler if those
features were in the base API - i.e. always available.

> All of it could just check for uffdio_api.features &
> UFFD_FEATURE_PAGEFAULT_WP being set, but you'd still have to check for
> UFFDIO_MPROTECT being set in uffdio_register.ioctl for tmpfs areas (or
> to know it's not available yet on hugetlbfs), so I think it's more
> robust to check the UFFDIO_MPROTECT ioctl being set in
> uffdio_register.ioctl to assume all mprotection and writeprotect
> tracking features are available for that specific range. The feature
> flag will just tell that UFFDIO_REGISTER_MODE_WP can be used in the
> register ioctl, which is something you need to know beforehand in order
> to "arm" the VMA for wrprotect faults.

ok

> For your usage I think you probably want to set
> UFFDIO_REGISTER_MODE_WP|UFFDIO_REGISTER_MODE_MISSING and you'll be
> told through uffd_msg.pagefault.flags whether it's a WP or MISSING fault.

ok

> You won't be told whether it's missing because of PROT_NONE or because
> the page is absent.

That does not look so good - we would like to know whether a page is
already mapped there or not. Is it possible to distinguish these cases
too?

>
> On a side note: all of the above is completely orthogonal to the
> non-cooperative usage: as far as memory protection features go, it
> doesn't need any, it just needs to track more events like fork/mremap
> to adjust its offsets, as the memory manager is not part of the app and
> it has no way to orchestrate by other means.
>
> Doing it all at once (non-cooperative + full memory protection) looked
> too much. We should just try to get the API right in a way that won't
> require a UFFD_API bump passed to uffdio_api.api. Even then, if an
> api bump is required, that's not a big deal; until recently the
> non-cooperative usage already did the API bump but we accommodated the
> read(2) API to avoid it.
>
> Thinking of the worst case scenario, if the API gets bumped the only
> thing that has to remain fixed is the ioctl number of the UFFDIO_API
> and the uffdio_api structure. Everything else can be upgraded without
> risk of ABI breakage, even the ioctl numbers can be reused (except the
> very UFFDIO_API). When the non-cooperative usage bumped the API it
> actually kept all ioctls the same except the read(2) format.

I agree we can change the API / ABI with versioning, but it is better
to try to get it right from the beginning and not fix it up with
several API versions on top.

Thanks for your comments and feedback,
Kirill

2015-05-22 20:18:27

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic

On Thu, 14 May 2015 19:31:19 +0200 Andrea Arcangeli <[email protected]> wrote:

> If the rwsem starves writers it wasn't strictly a bug but lockdep
> doesn't like it and this avoids depending on lowlevel implementation
> details of the lock.
>
> ...
>
> @@ -229,13 +246,33 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>
> if (!zeropage)
> err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> - dst_addr, src_addr);
> + dst_addr, src_addr, &page);
> else
> err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma,
> dst_addr);
>
> cond_resched();
>
> + if (unlikely(err == -EFAULT)) {
> + void *page_kaddr;
> +
> + BUILD_BUG_ON(zeropage);

I'm not sure what this is trying to do. BUILD_BUG_ON(local_variable)?

It goes bang in my build. I'll just delete it.

> + up_read(&dst_mm->mmap_sem);
> + BUG_ON(!page);
> +
> + page_kaddr = kmap(page);
> + err = copy_from_user(page_kaddr,
> + (const void __user *) src_addr,
> + PAGE_SIZE);
> + kunmap(page);
> + if (unlikely(err)) {
> + err = -EFAULT;
> + goto out;
> + }
> + goto retry;
> + } else
> + BUG_ON(page);
> +

2015-05-22 20:48:42

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic

On Fri, May 22, 2015 at 01:18:22PM -0700, Andrew Morton wrote:
> On Thu, 14 May 2015 19:31:19 +0200 Andrea Arcangeli <[email protected]> wrote:
>
> > If the rwsem starves writers it wasn't strictly a bug but lockdep
> > doesn't like it and this avoids depending on lowlevel implementation
> > details of the lock.
> >
> > ...
> >
> > @@ -229,13 +246,33 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> >
> > if (!zeropage)
> > err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> > - dst_addr, src_addr);
> > + dst_addr, src_addr, &page);
> > else
> > err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma,
> > dst_addr);
> >
> > cond_resched();
> >
> > + if (unlikely(err == -EFAULT)) {
> > + void *page_kaddr;
> > +
> > + BUILD_BUG_ON(zeropage);
>
> I'm not sure what this is trying to do. BUILD_BUG_ON(local_variable)?
>
> It goes bang in my build. I'll just delete it.

Yes, it has to be a false positive failure, so it's fine to drop
it. My gcc 4.8.4 can look inside the called static function and see that
only mcopy_atomic_pte can return -EFAULT. RHEL7 (4.8.3) gcc didn't
complain either. Perhaps to make the BUILD_BUG_ON work with older gcc
it requires a local variable set explicitly in the callee, but it's
not worth it.

It would be bad if we end up in the -EFAULT path in the zeropage case
(if somebody later adds an apparently innocent -EFAULT retval and
unexpectedly ends up in the mcopy_atomic_pte retry logic), but it's
not important, the caller should be reviewed before improvising new
retvals anyway.

The retry loop addition and the BUILD_BUG_ON are all about the
copy_from_user run while we already hold the mmap_sem (potentially of
a different process in the non-cooperative case, but it's a problem if
it's the current task's mmap_sem in case the rwlock implementation
changes to avoid write starvation and becomes non-reentrant). lockdep
definitely complains (even if I think in practice it'd be safe to
read-lock recurse; we only ever got lockdep complaints, never deadlocks
in fact). I didn't want to call gup_fast as copy_from_user is faster
and I get a usable user mapping with a likely-hot TLB entry too. The
lockdep warnings we hit were, I think, associated with NUMA hinting
faults or something infrequent like that; the fast path doesn't need
to retry.

Thanks,
Andrea

2015-05-22 21:18:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic


There's a more serious failure with i386 allmodconfig:

fs/userfaultfd.c:145:2: note: in expansion of macro 'BUILD_BUG_ON'
BUILD_BUG_ON(sizeof(struct uffd_msg) != 32);

I'm surprised the feature is even reachable on i386 builds?

2015-05-23 01:04:46

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic

On Fri, May 22, 2015 at 02:18:30PM -0700, Andrew Morton wrote:
>
> There's a more serious failure with i386 allmodconfig:
>
> fs/userfaultfd.c:145:2: note: in expansion of macro 'BUILD_BUG_ON'
> BUILD_BUG_ON(sizeof(struct uffd_msg) != 32);
>
> I'm surprised the feature is even reachable on i386 builds?

Unless we risk running out of vma->vm_flags bits there's no particular
reason not to enable it on 32bit (even if we run out, making vm_flags
an unsigned long long is a few-line patch). Certainly it's less
useful on 32bit as there's a 3G limit, but the max vmas per process are
still a small fraction of that. Especially if used for the on-demand
notification of page reclaim of volatile pages, it could end up useful
on arm32 (the S6 is 64bit I think and the latest Snapdragon is too, so
perhaps it's too late anyway, but again it's no big deal).

Removing the BUILD_BUG_ON I think is not ok here because, while I'm ok
with supporting 32bit archs, I don't want translation: the 64bit kernel
should talk with the 32bit app directly without a layer in between.

I tried to avoid using packed because without packed I could not get
the alignment wrong (and a future union also couldn't get it wrong),
and I could avoid those reserved1/2/3, but it's more robust to use it
in combination with the BUILD_BUG_ON to detect right away problems like
this with 32bit builds that align things differently.

I'm actually surprised the buildbot that sends me email about all
archs didn't send me anything about this for 32bit x86. Perhaps I'm
overlooking something, or x86 32bit (or any other 32bit arch for that
matter) isn't being checked? This is actually a fairly recent change;
perhaps the buildbot was shut down recently? That buildbot was very
useful for detecting problems like this.

===
From 2f0a48670dc515932dec8b983871ec35caeba553 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <[email protected]>
Date: Sat, 23 May 2015 02:26:32 +0200
Subject: [PATCH] userfaultfd: update the uffd_msg structure to be the same on
32/64bit

Avoiding the use of packed allowed the code to be nicer and avoided
the reserved1/2/3, but the structure must be the same on 32bit and
64bit archs so x86 applications built with the 32bit ABI can run on
a 64bit kernel without requiring translation of the data read
through the read syscall.

$ gcc -m64 p.c && ./a.out
32
0
16
8
8
16
24
$ gcc -m32 p.c && ./a.out
32
0
16
8
8
16
24

#include <stdio.h>
#include <linux/userfaultfd.h>

int main()
{
	printf("%lu\n", sizeof(struct uffd_msg));
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->event);
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->arg.pagefault.address);
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->arg.pagefault.flags);
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->arg.reserved.reserved1);
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->arg.reserved.reserved2);
	printf("%lu\n", (unsigned long) &((struct uffd_msg *) 0)->arg.reserved.reserved3);
	return 0;
}

Reported-by: Andrew Morton <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/uapi/linux/userfaultfd.h | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index c8a543f..00d28e2 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -59,9 +59,13 @@
struct uffd_msg {
__u8 event;

+ __u8 reserved1;
+ __u16 reserved2;
+ __u32 reserved3;
+
union {
struct {
- __u32 flags;
+ __u64 flags;
__u64 address;
} pagefault;

@@ -72,7 +76,7 @@ struct uffd_msg {
__u64 reserved3;
} reserved;
} arg;
-};
+} __attribute__((packed));

/*
* Start at 0x12 and not at 0 to be more strict against bugs.

2015-06-23 19:00:32

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization

On 05/14/2015 10:31 AM, Andrea Arcangeli wrote:
> +static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
> + int wake_flags, void *key)
> +{
> + struct userfaultfd_wake_range *range = key;
> + int ret;
> + struct userfaultfd_wait_queue *uwq;
> + unsigned long start, len;
> +
> + uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> + ret = 0;
> + /* don't wake the pending ones to avoid reads to block */
> + if (uwq->pending && !ACCESS_ONCE(uwq->ctx->released))
> + goto out;
> + /* len == 0 means wake all */
> + start = range->start;
> + len = range->len;
> + if (len && (start > uwq->address || start + len <= uwq->address))
> + goto out;
> + ret = wake_up_state(wq->private, mode);
> + if (ret)
> + /* wake only once, autoremove behavior */
> + list_del_init(&wq->task_list);
> +out:
> + return ret;
> +}
...
> +static __always_inline int validate_range(struct mm_struct *mm,
> + __u64 start, __u64 len)
> +{
> + __u64 task_size = mm->task_size;
> +
> + if (start & ~PAGE_MASK)
> + return -EINVAL;
> + if (len & ~PAGE_MASK)
> + return -EINVAL;
> + if (!len)
> + return -EINVAL;
> + if (start < mmap_min_addr)
> + return -EINVAL;
> + if (start >= task_size)
> + return -EINVAL;
> + if (len > task_size - start)
> + return -EINVAL;
> + return 0;
> +}

Hey Andrea,

Down in userfaultfd_wake_function(), it looks like you intended for a
len=0 to mean "wake all". But the validate_range() that we do from
userspace has a !len check in it, which keeps us from passing a len=0 in
from userspace.

Was that "wake all" for some internal use, or is the check too strict?

I was trying to use the wake ioctl after an madvise() (as opposed to
filling things in using a userfd copy).

2015-06-23 21:41:58

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization

Hi Dave,

On Tue, Jun 23, 2015 at 12:00:19PM -0700, Dave Hansen wrote:
> Down in userfaultfd_wake_function(), it looks like you intended for a
> len=0 to mean "wake all". But the validate_range() that we do from
> userspace has a !len check in it, which keeps us from passing a len=0 in
> from userspace.
> Was that "wake all" for some internal use, or is the check too strict?

It's for internal use, for userfaultfd_release which has to wake them
all (after setting ctx->released) if the uffd is closed. It avoids
enlarging the structure by depending on the invariant that userland
cannot pass len=0.

If we accepted len=0 from userland as valid, it'd be safer if it did
nothing, like in madvise; I doubt we want to expose this non-standard
kernel-internal behavior to userland.

> I was trying to use the wake ioctl after an madvise() (as opposed to
> filling things in using a userfd copy).

madvise will return 0 if len=0, mremap would return -EINVAL if new_len
is zero, and mmap also returns -EINVAL if len is 0; not all MM syscalls
are as permissive as madvise. Can't you pass the same len you pass to
madvise to UFFDIO_WAKE (or just skip the call if the madvise len is
zero)?
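
Something along these lines should work (sketch; addr/len are whatever
you passed to madvise, and len must be non-zero and page aligned):

	struct uffdio_range range = {
		.start = (unsigned long) addr,
		.len = len,
	};

	if (len && ioctl(uffd, UFFDIO_WAKE, &range) == -1)
		err(1, "UFFDIO_WAKE");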

Thanks,
Andrea

2015-08-11 10:08:37

by Bharata B Rao

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall

On Thu, May 14, 2015 at 07:31:16PM +0200, Andrea Arcangeli wrote:
> This activates the userfaultfd syscall.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> arch/powerpc/include/asm/systbl.h | 1 +
> arch/powerpc/include/uapi/asm/unistd.h | 1 +
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 1 +
> include/linux/syscalls.h | 1 +
> kernel/sys_ni.c | 1 +
> 6 files changed, 6 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
> index f1863a1..4741b15 100644
> --- a/arch/powerpc/include/asm/systbl.h
> +++ b/arch/powerpc/include/asm/systbl.h
> @@ -368,3 +368,4 @@ SYSCALL_SPU(memfd_create)
> SYSCALL_SPU(bpf)
> COMPAT_SYS(execveat)
> PPC64ONLY(switch_endian)
> +SYSCALL_SPU(userfaultfd)
> diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
> index e4aa173..6ad58d4 100644
> --- a/arch/powerpc/include/uapi/asm/unistd.h
> +++ b/arch/powerpc/include/uapi/asm/unistd.h
> @@ -386,5 +386,6 @@
> #define __NR_bpf 361
> #define __NR_execveat 362
> #define __NR_switch_endian 363
> +#define __NR_userfaultfd 364

Maybe it is a bit late to bring this up, but I needed the following fix
to the userfault21 branch of your git tree to compile on powerpc.
----

powerpc: Bump up __NR_syscalls to account for __NR_userfaultfd

From: Bharata B Rao <[email protected]>

With userfaultfd syscall, the number of syscalls will be 365 on PowerPC.
Reflect the same in __NR_syscalls.

Signed-off-by: Bharata B Rao <[email protected]>
---
arch/powerpc/include/asm/unistd.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index f4f8b66..4a055b6 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
#include <uapi/asm/unistd.h>


-#define __NR_syscalls 364
+#define __NR_syscalls 365

#define __NR__exit __NR_exit
#define NR_syscalls __NR_syscalls

2015-08-11 13:48:32

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall

Hello Bharata,

On Tue, Aug 11, 2015 at 03:37:29PM +0530, Bharata B Rao wrote:
> Maybe it is a bit late to bring this up, but I needed the following fix
> to the userfault21 branch of your git tree to compile on powerpc.

Not late, just in time. I increased the number of syscalls in earlier
versions, it must have gotten lost during a rejecting rebase, sorry.

I applied it to my tree and it can be applied to -mm and linux-next,
thanks!

The syscall for arm32 is also ready and on its way to the arm tree, and
the testsuite worked fine there. ppc should also work fine; if you
could confirm that, it'd be interesting. Just beware that I got a typo
in the testcase:

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 76071b1..925c3c9 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -70,7 +70,7 @@
#define __NR_userfaultfd 323
#elif defined(__i386__)
#define __NR_userfaultfd 374
-#elif defined(__powewrpc__)
+#elif defined(__powerpc__)
#define __NR_userfaultfd 364
#else
#error "missing __NR_userfaultfd definition"



> ----
>
> powerpc: Bump up __NR_syscalls to account for __NR_userfaultfd
>
> From: Bharata B Rao <[email protected]>
>
> With userfaultfd syscall, the number of syscalls will be 365 on PowerPC.
> Reflect the same in __NR_syscalls.
>
> Signed-off-by: Bharata B Rao <[email protected]>
> ---
> arch/powerpc/include/asm/unistd.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
> index f4f8b66..4a055b6 100644
> --- a/arch/powerpc/include/asm/unistd.h
> +++ b/arch/powerpc/include/asm/unistd.h
> @@ -12,7 +12,7 @@
> #include <uapi/asm/unistd.h>
>
>
> -#define __NR_syscalls 364
> +#define __NR_syscalls 365
>
> #define __NR__exit __NR_exit
> #define NR_syscalls __NR_syscalls

Reviewed-by: Andrea Arcangeli <[email protected]>

2015-08-12 05:24:06

by Bharata B Rao

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall

On Tue, Aug 11, 2015 at 03:48:26PM +0200, Andrea Arcangeli wrote:
> Hello Bharata,
>
> On Tue, Aug 11, 2015 at 03:37:29PM +0530, Bharata B Rao wrote:
> > Maybe it is a bit late to bring this up, but I needed the following fix
> > to the userfault21 branch of your git tree to compile on powerpc.
>
> Not late, just in time. I increased the number of syscalls in earlier
> versions, it must have gotten lost during a rejecting rebase, sorry.
>
> I applied it to my tree and it can be applied to -mm and linux-next,
> thanks!
>
> The syscall for arm32 is also ready and on its way to the arm tree, and
> the testsuite worked fine there. ppc should also work fine; if you
> could confirm that, it'd be interesting. Just beware that I got a typo
> in the testcase:

The testsuite passes on powerpc.

--------------------
running userfaultfd
--------------------
nr_pages: 2040, nr_pages_per_cpu: 170
bounces: 31, mode: rnd racing ver poll, userfaults: 80 43 23 23 15 16 12 1 2 96 13 128
bounces: 30, mode: racing ver poll, userfaults: 35 54 62 49 47 48 2 8 0 78 1 0
bounces: 29, mode: rnd ver poll, userfaults: 114 153 70 106 78 57 143 92 114 96 1 0
bounces: 28, mode: ver poll, userfaults: 96 81 5 45 83 19 98 28 1 145 23 2
bounces: 27, mode: rnd racing poll, userfaults: 54 65 60 54 45 49 1 2 1 2 71 20
bounces: 26, mode: racing poll, userfaults: 90 83 35 29 37 35 30 42 3 4 49 6
bounces: 25, mode: rnd poll, userfaults: 52 50 178 112 51 41 23 42 18 99 59 0
bounces: 24, mode: poll, userfaults: 136 101 83 260 84 29 16 88 1 6 160 57
bounces: 23, mode: rnd racing ver, userfaults: 141 197 158 183 39 49 3 52 8 3 6 0
bounces: 22, mode: racing ver, userfaults: 242 266 244 180 162 32 87 43 31 40 34 0
bounces: 21, mode: rnd ver, userfaults: 636 158 175 24 253 104 48 8 0 0 0 0
bounces: 20, mode: ver, userfaults: 531 204 225 117 129 107 11 143 76 31 1 0
bounces: 19, mode: rnd racing, userfaults: 303 169 225 145 59 219 37 0 0 0 0 0
bounces: 18, mode: racing, userfaults: 374 372 37 144 126 90 25 12 15 17 0 0
bounces: 17, mode: rnd, userfaults: 313 412 134 108 80 99 7 56 85 0 0 0
bounces: 16, mode:, userfaults: 431 58 87 167 120 113 98 60 14 8 48 0
bounces: 15, mode: rnd racing ver poll, userfaults: 41 40 25 28 37 24 0 0 0 0 180 75
bounces: 14, mode: racing ver poll, userfaults: 43 53 30 28 25 15 19 0 0 0 0 30
bounces: 13, mode: rnd ver poll, userfaults: 136 91 114 91 92 79 114 77 75 68 1 2
bounces: 12, mode: ver poll, userfaults: 92 120 114 76 153 75 132 157 83 81 10 1
bounces: 11, mode: rnd racing poll, userfaults: 50 72 69 52 53 48 46 59 57 51 37 1
bounces: 10, mode: racing poll, userfaults: 33 49 38 68 35 63 57 49 49 47 25 10
bounces: 9, mode: rnd poll, userfaults: 167 150 67 123 39 75 1 2 9 125 1 1
bounces: 8, mode: poll, userfaults: 147 102 20 87 5 27 118 14 104 40 21 28
bounces: 7, mode: rnd racing ver, userfaults: 305 254 208 74 59 96 36 14 11 7 4 5
bounces: 6, mode: racing ver, userfaults: 290 114 191 94 162 114 34 6 6 32 23 2
bounces: 5, mode: rnd ver, userfaults: 370 381 22 273 21 106 17 55 0 0 0 0
bounces: 4, mode: ver, userfaults: 328 279 179 191 74 86 95 15 13 10 0 0
bounces: 3, mode: rnd racing, userfaults: 222 215 164 70 5 20 179 0 34 3 0 0
bounces: 2, mode: racing, userfaults: 316 385 112 160 225 5 30 49 42 2 4 0
bounces: 1, mode: rnd, userfaults: 273 139 253 176 163 71 85 2 0 0 0 0
bounces: 0, mode:, userfaults: 165 212 633 13 24 66 24 27 15 0 10 1
[PASS]

Regards,
Bharata.

by Michael Kerrisk (man-pages)

[permalink] [raw]
Subject: Re: [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt

Hi Andrea,

On 09/11/2015 10:47 AM, Michael Kerrisk (man-pages) wrote:
> On 05/14/2015 07:30 PM, Andrea Arcangeli wrote:
>> Add documentation.
>
> Hi Andrea,
>
> I do not recall... Did you write a man page also for this new system call?

No response to my last mail, so I'll try again... Did you
write any man page for this interface?

Thanks,

Michael


>> Signed-off-by: Andrea Arcangeli <[email protected]>
>> ---
>> Documentation/vm/userfaultfd.txt | 140 +++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 140 insertions(+)
>> create mode 100644 Documentation/vm/userfaultfd.txt
>>
>> diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
>> new file mode 100644
>> index 0000000..c2f5145
>> --- /dev/null
>> +++ b/Documentation/vm/userfaultfd.txt
>> @@ -0,0 +1,140 @@
>> += Userfaultfd =
>> +
>> +== Objective ==
>> +
>> +Userfaults allow the implementation of on-demand paging from userland
>> +and more generally they allow userland to take control of various
>> +memory page faults, something otherwise only the kernel code could do.
>> +
>> +For example userfaults allows a proper and more optimal implementation
>> +of the PROT_NONE+SIGSEGV trick.
>> +
>> +== Design ==
>> +
>> +Userfaults are delivered and resolved through the userfaultfd syscall.
>> +
>> +The userfaultfd (aside from registering and unregistering virtual
>> +memory ranges) provides two primary functionalities:
>> +
>> +1) read/POLLIN protocol to notify a userland thread of the faults
>> + happening
>> +
>> +2) various UFFDIO_* ioctls that can manage the virtual memory regions
>> + registered in the userfaultfd that allows userland to efficiently
>> + resolve the userfaults it receives via 1) or to manage the virtual
>> + memory in the background
>> +
>> +The real advantage of userfaults if compared to regular virtual memory
>> +management of mremap/mprotect is that the userfaults in all their
>> +operations never involve heavyweight structures like vmas (in fact the
>> +userfaultfd runtime load never takes the mmap_sem for writing).
>> +
>> +Vmas are not suitable for page- (or hugepage) granular fault tracking
>> +when dealing with virtual address spaces that could span
>> +Terabytes. Too many vmas would be needed for that.
>> +
>> +The userfaultfd once opened by invoking the syscall, can also be
>> +passed using unix domain sockets to a manager process, so the same
>> +manager process could handle the userfaults of a multitude of
>> +different processes without them being aware about what is going on
>> +(well of course unless they later try to use the userfaultfd
>> +themselves on the same region the manager is already tracking, which
>> +is a corner case that would currently return -EBUSY).
>> +
>> +== API ==
>> +
>> +When first opened the userfaultfd must be enabled invoking the
>> +UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
>> +a later API version) which will specify the read/POLLIN protocol
>> +userland intends to speak on the UFFD. The UFFDIO_API ioctl if
>> +successful (i.e. if the requested uffdio_api.api is spoken also by the
>> +running kernel), will return into uffdio_api.features and
>> +uffdio_api.ioctls two 64bit bitmasks of respectively the activated
>> +feature of the read(2) protocol and the generic ioctl available.
>> +
>> +Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
>> +be invoked (if present in the returned uffdio_api.ioctls bitmask) to
>> +register a memory range in the userfaultfd by setting the
>> +uffdio_register structure accordingly. The uffdio_register.mode
>> +bitmask will specify to the kernel which kind of faults to track for
>> +the range (UFFDIO_REGISTER_MODE_MISSING would track missing
>> +pages). The UFFDIO_REGISTER ioctl will return the
>> +uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
>> +userfaults on the range registered. Not all ioctls will necessarily be
>> +supported for all memory types depending on the underlying virtual
>> +memory backend (anonymous memory vs tmpfs vs real filebacked
>> +mappings).
>> +
>> +Userland can use the uffdio_register.ioctls to manage the virtual
>> +address space in the background (to add or potentially also remove
>> +memory from the userfaultfd registered range). This means a userfault
>> +could be triggering just before userland maps in the background the
>> +user-faulted page.
>> +
>> +The primary ioctl to resolve userfaults is UFFDIO_COPY. That
>> +atomically copies a page into the userfault registered range and wakes
>> +up the blocked userfaults (unless uffdio_copy.mode &
>> +UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
>> +UFFDIO_COPY.
>> +
>> +== QEMU/KVM ==
>> +
>> +QEMU/KVM is using the userfaultfd syscall to implement postcopy live
>> +migration. Postcopy live migration is one form of memory
>> +externalization consisting of a virtual machine running with part or
>> +all of its memory residing on a different node in the cloud. The
>> +userfaultfd abstraction is generic enough that not a single line of
>> +KVM kernel code had to be modified in order to add postcopy live
>> +migration to QEMU.
>> +
>> +Guest async page faults, FOLL_NOWAIT and all other GUP features work
>> +just fine in combination with userfaults. Userfaults trigger async
>> +page faults in the guest scheduler so those guest processes that
>> +aren't waiting for userfaults (i.e. network bound) can keep running in
>> +the guest vcpus.
>> +
>> +It is generally beneficial to run one pass of precopy live migration
>> +just before starting postcopy live migration, in order to avoid
>> +generating userfaults for readonly guest regions.
>> +
>> +The implementation of postcopy live migration currently uses one
>> +single bidirectional socket but in the future two different sockets
>> +will be used (to reduce the latency of the userfaults to the minimum
>> +possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
>> +
>> +The QEMU in the source node writes all pages that it knows are missing
>> +in the destination node, into the socket, and the migration thread of
>> +the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
>> +ioctls on the userfaultfd in order to map the received pages into the
>> +guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).
>> +
>> +A different postcopy thread in the destination node listens with
>> +poll() to the userfaultfd in parallel. When a POLLIN event is
>> +generated after a userfault triggers, the postcopy thread read() from
>> +the userfaultfd and receives the fault address (or -EAGAIN in case the
>> +userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run
>> +by the parallel QEMU migration thread).
>> +
>> +After the QEMU postcopy thread (running in the destination node) gets
>> +the userfault address it writes the information about the missing page
>> +into the socket. The QEMU source node receives the information and
>> +roughly "seeks" to that page address and continues sending all
>> +remaining missing pages from that new page offset. Soon after that
>> +(just the time to flush the tcp_wmem queue through the network) the
>> +migration thread in the QEMU running in the destination node will
>> +receive the page that triggered the userfault and it'll map it as
>> +usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
>> +was spontaneously sent by the source or if it was an urgent page
>> +requested through an userfault).
>> +
>> +By the time the userfaults start, the QEMU in the destination node
>> +doesn't need to keep any per-page state bitmap relative to the live
>> +migration around and a single per-page bitmap has to be maintained in
>> +the QEMU running in the source node to know which pages are still
>> +missing in the destination node. The bitmap in the source node is
>> +checked to find which missing pages to send in round robin and we seek
>> +over it when receiving incoming userfaults. After sending each page of
>> +course the bitmap is updated accordingly. It's also useful to avoid
>> +sending the same page twice (in case the userfault is read by the
>> +postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
>> +thread).
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-api" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2015-12-04 17:55:10

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt

Hello Michael,

On Fri, Dec 04, 2015 at 04:50:03PM +0100, Michael Kerrisk (man-pages) wrote:
> Hi Andrea,
>
> On 09/11/2015 10:47 AM, Michael Kerrisk (man-pages) wrote:
> > On 05/14/2015 07:30 PM, Andrea Arcangeli wrote:
> >> Add documentation.
> >
> > Hi Andrea,
> >
> > I do not recall... Did you write a man page also for this new system call?
>
> No response to my last mail, so I'll try again... Did you
> write any man page for this interface?

I wish I could answer with the manpage itself to give a more
satisfactory answer, but the answer is still no at this time. Right now
the write protection tracking feature has been posted to linux-mm and
I'm currently reviewing it. It's worth documenting that part too in
the manpage, as it's going to happen sooner rather than later.

The lack of a manpage so far didn't prevent userland from using it
(qemu postcopy is already in upstream qemu and it depends on
userfaultfd), nor did it prevent review of the code or other kernel
contributors from extending the syscall API. Other users have started
testing the syscall too. This is just to explain why unfortunately the
manpage didn't get top priority yet; nevertheless the manpage should
happen too and it's important. Advice on how to proceed is welcome.

Thanks,
Andrea