2016-03-20 12:42:42

by Michael Rapoport

Subject: [PATCH 0/5] userfaultfd: extension for non cooperative uffd usage

Hi,

This set is to address the issues that appear in userfaultfd usage
scenarios when the task monitoring the uffd and the mm-owner do not
cooperate with each other on VM changes such as remaps, madvises and
fork()-s.

The patches are essentially the same as in the previous respin [1];
they've just been rebased on the current tree.

[1] http://thread.gmane.org/gmane.linux.kernel.mm/132662

Pavel Emelyanov (5):
uffd: Split the find_userfault() routine
uffd: Add ability to report non-PF events from uffd descriptor
uffd: Add fork() event
uffd: Add mremap() event
uffd: Add madvise() event for MADV_DONTNEED request

fs/userfaultfd.c | 319 ++++++++++++++++++++++++++++++++++++++-
include/linux/userfaultfd_k.h | 41 +++++
include/uapi/linux/userfaultfd.h | 28 +++-
kernel/fork.c | 10 +-
mm/madvise.c | 2 +
mm/mremap.c | 17 ++-
6 files changed, 395 insertions(+), 22 deletions(-)

--
1.9.1


2016-03-20 12:42:49

by Michael Rapoport

Subject: [PATCH 1/5] uffd: Split the find_userfault() routine

From: Pavel Emelyanov <[email protected]>

A later patch needs to look up userfaultfd_wait_queue-s in a different
wait queue, so split the lookup into a helper that takes the wait queue
head.

Signed-off-by: Pavel Emelyanov <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
---
fs/userfaultfd.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 66cdb44..4f0b53d 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -483,25 +483,30 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
}

/* fault_pending_wqh.lock must be hold by the caller */
-static inline struct userfaultfd_wait_queue *find_userfault(
- struct userfaultfd_ctx *ctx)
+static inline struct userfaultfd_wait_queue *find_userfault_in(
+ wait_queue_head_t *wqh)
{
wait_queue_t *wq;
struct userfaultfd_wait_queue *uwq;

- VM_BUG_ON(!spin_is_locked(&ctx->fault_pending_wqh.lock));
+ VM_BUG_ON(!spin_is_locked(&wqh->lock));

uwq = NULL;
- if (!waitqueue_active(&ctx->fault_pending_wqh))
+ if (!waitqueue_active(wqh))
goto out;
/* walk in reverse to provide FIFO behavior to read userfaults */
- wq = list_last_entry(&ctx->fault_pending_wqh.task_list,
- typeof(*wq), task_list);
+ wq = list_last_entry(&wqh->task_list, typeof(*wq), task_list);
uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
out:
return uwq;
}

+static inline struct userfaultfd_wait_queue *find_userfault(
+ struct userfaultfd_ctx *ctx)
+{
+ return find_userfault_in(&ctx->fault_pending_wqh);
+}
+
static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
{
struct userfaultfd_ctx *ctx = file->private_data;
--
1.9.1
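
A minimal userspace sketch, not part of the patch, of why taking the
last entry gives FIFO order: __add_wait_queue() inserts right behind the
list head, so the oldest waiter ends up at the tail. The struct node
type and the helper names are made up for illustration; only the
head-insert/tail-read pattern mirrors find_userfault_in().

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct node {
	struct node *next, *prev;
	int id;			/* stands in for a queued userfault */
};

static void list_init(struct node *head)
{
	head->next = head->prev = head;
}

/* Insert right behind the head, which is where __add_wait_queue() puts entries. */
static void list_add_head(struct node *head, struct node *n)
{
	assert(head != NULL && n != NULL);
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink an entry and leave it pointing at itself. */
static void list_remove(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = n;
}

/* Counterpart of find_userfault_in(): NULL if empty, else the oldest entry. */
static struct node *find_oldest(struct node *head)
{
	if (head->next == head)
		return NULL;
	return head->prev;	/* last entry == first one queued */
}
```

Because the helper only needs the list head, the same lookup works for
any queue, which is the point of passing a wait_queue_head_t instead of
the whole ctx.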

2016-03-20 12:43:04

by Michael Rapoport

Subject: [PATCH 2/5] uffd: Add ability to report non-PF events from uffd descriptor

From: Pavel Emelyanov <[email protected]>

The custom events are queued in ctx->event_wqh so as not to disturb the
fast-path-ed PF queue-wait-wakeup functions.

The events to be generated (other than PF-s) are requested in the
UFFD_API ioctl with the uffd_api.features bits. The bits known to the
kernel are then turned on and reported back to user-space.

Signed-off-by: Pavel Emelyanov <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
---
fs/userfaultfd.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 97 insertions(+), 2 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 4f0b53d..c8e7039 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -12,6 +12,7 @@
* mm/ksm.c (mm hashing).
*/

+#include <linux/list.h>
#include <linux/hashtable.h>
#include <linux/sched.h>
#include <linux/mm.h>
@@ -45,18 +46,23 @@ struct userfaultfd_ctx {
wait_queue_head_t fault_wqh;
/* waitqueue head for the pseudo fd to wakeup poll/read */
wait_queue_head_t fd_wqh;
+ /* waitqueue head for events */
+ wait_queue_head_t event_wqh;
/* a refile sequence protected by fault_pending_wqh lock */
struct seqcount refile_seq;
/* pseudo fd refcounting */
atomic_t refcount;
/* userfaultfd syscall flags */
unsigned int flags;
+ /* features requested from the userspace */
+ unsigned int features;
/* state machine */
enum userfaultfd_state state;
/* released */
bool released;
/* mm with one ore more vmas attached to this userfaultfd_ctx */
struct mm_struct *mm;
+
};

struct userfaultfd_wait_queue {
@@ -135,6 +141,8 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
VM_BUG_ON(waitqueue_active(&ctx->fault_pending_wqh));
VM_BUG_ON(spin_is_locked(&ctx->fault_wqh.lock));
VM_BUG_ON(waitqueue_active(&ctx->fault_wqh));
+ VM_BUG_ON(spin_is_locked(&ctx->event_wqh.lock));
+ VM_BUG_ON(waitqueue_active(&ctx->event_wqh));
VM_BUG_ON(spin_is_locked(&ctx->fd_wqh.lock));
VM_BUG_ON(waitqueue_active(&ctx->fd_wqh));
mmput(ctx->mm);
@@ -423,6 +431,59 @@ out:
return ret;
}

+static int __maybe_unused userfaultfd_event_wait_completion(
+ struct userfaultfd_ctx *ctx,
+ struct userfaultfd_wait_queue *ewq)
+{
+ int ret = 0;
+
+ ewq->ctx = ctx;
+ init_waitqueue_entry(&ewq->wq, current);
+
+ spin_lock(&ctx->event_wqh.lock);
+ /*
+ * After the __add_wait_queue the ewq is visible to userland
+ * through poll/read().
+ */
+ __add_wait_queue(&ctx->event_wqh, &ewq->wq);
+ for (;;) {
+ set_current_state(TASK_KILLABLE);
+ if (ewq->msg.event == 0)
+ break;
+ if (ACCESS_ONCE(ctx->released) ||
+ fatal_signal_pending(current)) {
+ ret = -1;
+ __remove_wait_queue(&ctx->event_wqh, &ewq->wq);
+ break;
+ }
+
+ spin_unlock(&ctx->event_wqh.lock);
+
+ wake_up_poll(&ctx->fd_wqh, POLLIN);
+ schedule();
+
+ spin_lock(&ctx->event_wqh.lock);
+ }
+ __set_current_state(TASK_RUNNING);
+ spin_unlock(&ctx->event_wqh.lock);
+
+ /*
+ * ctx may go away after this if the userfault pseudo fd is
+ * already released.
+ */
+
+ userfaultfd_ctx_put(ctx);
+ return ret;
+}
+
+static void userfaultfd_event_complete(struct userfaultfd_ctx *ctx,
+ struct userfaultfd_wait_queue *ewq)
+{
+ ewq->msg.event = 0;
+ wake_up_locked(&ctx->event_wqh);
+ __remove_wait_queue(&ctx->event_wqh, &ewq->wq);
+}
+
static int userfaultfd_release(struct inode *inode, struct file *file)
{
struct userfaultfd_ctx *ctx = file->private_data;
@@ -507,6 +568,12 @@ static inline struct userfaultfd_wait_queue *find_userfault(
return find_userfault_in(&ctx->fault_pending_wqh);
}

+static inline struct userfaultfd_wait_queue *find_userfault_evt(
+ struct userfaultfd_ctx *ctx)
+{
+ return find_userfault_in(&ctx->event_wqh);
+}
+
static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
{
struct userfaultfd_ctx *ctx = file->private_data;
@@ -538,6 +605,9 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
smp_mb();
if (waitqueue_active(&ctx->fault_pending_wqh))
ret = POLLIN;
+ else if (waitqueue_active(&ctx->event_wqh))
+ ret = POLLIN;
+
return ret;
default:
BUG();
@@ -601,6 +671,19 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
break;
}
spin_unlock(&ctx->fault_pending_wqh.lock);
+
+ spin_lock(&ctx->event_wqh.lock);
+ uwq = find_userfault_evt(ctx);
+ if (uwq) {
+ *msg = uwq->msg;
+
+ userfaultfd_event_complete(ctx, uwq);
+ spin_unlock(&ctx->event_wqh.lock);
+ ret = 0;
+ break;
+ }
+ spin_unlock(&ctx->event_wqh.lock);
+
if (signal_pending(current)) {
ret = -ERESTARTSYS;
break;
@@ -1133,6 +1216,14 @@ out:
return ret;
}

+static inline unsigned int uffd_ctx_features(__u64 user_features)
+{
+ /*
+ * For the current set of features the bits just coincide
+ */
+ return (unsigned int)user_features;
+}
+
/*
* userland asks for a certain API version and we return which bits
* and ioctl commands are implemented in this kernel for such API
@@ -1151,19 +1242,21 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
ret = -EFAULT;
if (copy_from_user(&uffdio_api, buf, sizeof(uffdio_api)))
goto out;
- if (uffdio_api.api != UFFD_API || uffdio_api.features) {
+ if (uffdio_api.api != UFFD_API ||
+ (uffdio_api.features & ~UFFD_API_FEATURES)) {
memset(&uffdio_api, 0, sizeof(uffdio_api));
if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
goto out;
ret = -EINVAL;
goto out;
}
- uffdio_api.features = UFFD_API_FEATURES;
+ uffdio_api.features &= UFFD_API_FEATURES;
uffdio_api.ioctls = UFFD_API_IOCTLS;
ret = -EFAULT;
if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
goto out;
ctx->state = UFFD_STATE_RUNNING;
+ ctx->features = uffd_ctx_features(uffdio_api.features);
ret = 0;
out:
return ret;
@@ -1250,6 +1343,7 @@ static void init_once_userfaultfd_ctx(void *mem)

init_waitqueue_head(&ctx->fault_pending_wqh);
init_waitqueue_head(&ctx->fault_wqh);
+ init_waitqueue_head(&ctx->event_wqh);
init_waitqueue_head(&ctx->fd_wqh);
seqcount_init(&ctx->refile_seq);
}
@@ -1290,6 +1384,7 @@ static struct file *userfaultfd_file_create(int flags)

atomic_set(&ctx->refcount, 1);
ctx->flags = flags;
+ ctx->features = 0;
ctx->state = UFFD_STATE_WAIT_API;
ctx->released = false;
ctx->mm = current->mm;
--
1.9.1
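
A minimal userspace model, not part of the patch, of the feature
negotiation that userfaultfd_api() performs after this change: unknown
feature bits are rejected with -EINVAL and a zeroed reply, otherwise the
requested bits are echoed back and stored in the context. The constants
are copied from the uapi header as it looks once the whole series is
applied; struct api_req and negotiate() are hypothetical names used only
for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Values as defined in the uapi header once the whole series is applied. */
#define UFFD_API			((uint64_t)0xAA)
#define UFFD_FEATURE_EVENT_FORK		((uint64_t)1 << 1)
#define UFFD_FEATURE_EVENT_REMAP	((uint64_t)1 << 2)
#define UFFD_FEATURE_EVENT_MADVDONTNEED	((uint64_t)1 << 3)
#define UFFD_API_FEATURES	(UFFD_FEATURE_EVENT_FORK | \
				 UFFD_FEATURE_EVENT_REMAP | \
				 UFFD_FEATURE_EVENT_MADVDONTNEED)

/* Hypothetical stand-in for struct uffdio_api (same three fields). */
struct api_req {
	uint64_t api;
	uint64_t features;
	uint64_t ioctls;
};

/* Returns 0 and stores the enabled bits, or -EINVAL with a zeroed request. */
static int negotiate(struct api_req *req, unsigned int *ctx_features)
{
	assert(req != NULL && ctx_features != NULL);

	if (req->api != UFFD_API || (req->features & ~UFFD_API_FEATURES)) {
		memset(req, 0, sizeof(*req));
		return -EINVAL;
	}
	req->features &= UFFD_API_FEATURES;
	/* the kernel also fills in req->ioctls here; omitted in this sketch */
	*ctx_features = (unsigned int)req->features;
	return 0;
}
```

Note the behaviour change visible in the diff: before this patch any
non-zero features value was rejected and the kernel reported the full
UFFD_API_FEATURES mask; afterwards only the bits the caller asked for
are reported back.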

2016-03-20 12:43:14

by Michael Rapoport

Subject: [PATCH 5/5] uffd: Add madvise() event for MADV_DONTNEED request

From: Pavel Emelyanov <[email protected]>

If a page is punched out of the address space, the uffd reader
should know this and zeromap the respective area when a later
#PF event arrives for it.

Signed-off-by: Pavel Emelyanov <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
---
fs/userfaultfd.c | 26 ++++++++++++++++++++++++++
include/linux/userfaultfd_k.h | 12 ++++++++++++
include/uapi/linux/userfaultfd.h | 9 ++++++++-
mm/madvise.c | 2 ++
4 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index a7771bd..e65ca84 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -599,6 +599,32 @@ void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx vm_ctx,
userfaultfd_event_wait_completion(ctx, &ewq);
}

+void madvise_userfault_dontneed(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end)
+{
+ struct userfaultfd_ctx *ctx;
+ struct userfaultfd_wait_queue ewq;
+
+ ctx = vma->vm_userfaultfd_ctx.ctx;
+ if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_MADVDONTNEED))
+ return;
+
+ userfaultfd_ctx_get(ctx);
+ *prev = NULL; /* We wait for ACK w/o the mmap semaphore */
+ up_read(&vma->vm_mm->mmap_sem);
+
+ msg_init(&ewq.msg);
+
+ ewq.msg.event = UFFD_EVENT_MADVDONTNEED;
+ ewq.msg.arg.madv_dn.start = start;
+ ewq.msg.arg.madv_dn.end = end;
+
+ userfaultfd_event_wait_completion(ctx, &ewq);
+
+ down_read(&vma->vm_mm->mmap_sem);
+}
+
static int userfaultfd_release(struct inode *inode, struct file *file)
{
struct userfaultfd_ctx *ctx = file->private_data;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 42ea277..7e22a3d 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -62,6 +62,11 @@ extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx,
unsigned long from, unsigned long to,
unsigned long len);

+extern void madvise_userfault_dontneed(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start,
+ unsigned long end);
+
#else /* CONFIG_USERFAULTFD */

/* mm helpers */
@@ -109,6 +114,13 @@ static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx ctx,
unsigned long len)
{
}
+
+static inline void madvise_userfault_dontneed(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start,
+ unsigned long end)
+{
+}
#endif /* CONFIG_USERFAULTFD */

#endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 46bbb6f..cbcb3a5 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -16,7 +16,7 @@
* After implementing the respective features it will become:
* #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP)
*/
-#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP)
+#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP|UFFD_FEATURE_EVENT_MADVDONTNEED)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -81,6 +81,11 @@ struct uffd_msg {
} remap;

struct {
+ __u64 start;
+ __u64 end;
+ } madv_dn;
+
+ struct {
/* unused reserved fields */
__u64 reserved1;
__u64 reserved2;
@@ -95,6 +100,7 @@ struct uffd_msg {
#define UFFD_EVENT_PAGEFAULT 0x12
#define UFFD_EVENT_FORK 0x13
#define UFFD_EVENT_REMAP 0x14
+#define UFFD_EVENT_MADVDONTNEED 0x15

/* flags for UFFD_EVENT_PAGEFAULT */
#define UFFD_PAGEFAULT_FLAG_WRITE (1<<0) /* If this was a write fault */
@@ -118,6 +124,7 @@ struct uffdio_api {
#endif
#define UFFD_FEATURE_EVENT_FORK (1<<1)
#define UFFD_FEATURE_EVENT_REMAP (1<<2)
+#define UFFD_FEATURE_EVENT_MADVDONTNEED (1<<3)
__u64 features;

__u64 ioctls;
diff --git a/mm/madvise.c b/mm/madvise.c
index a011473..7b66d6b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -10,6 +10,7 @@
#include <linux/syscalls.h>
#include <linux/mempolicy.h>
#include <linux/page-isolation.h>
+#include <linux/userfaultfd_k.h>
#include <linux/hugetlb.h>
#include <linux/falloc.h>
#include <linux/sched.h>
@@ -476,6 +477,7 @@ static long madvise_dontneed(struct vm_area_struct *vma,
return -EINVAL;

zap_page_range(vma, start, end - start, NULL);
+ madvise_userfault_dontneed(vma, prev, start, end);
return 0;
}

--
1.9.1
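
A minimal sketch, not part of the patch, of what a monitor might do with
this event: remember that the range no longer has contents to restore,
so that a later fault there is answered with a zero page instead of
saved data. The struct region bookkeeping, the fixed 4096-byte page size
and both function names are assumptions made for illustration; the
[start, end) interpretation follows the zap_page_range() call in
madvise_dontneed().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */
#define SKETCH_NPAGES    16u	/* size of the tracked area, in pages */

struct region {
	uint64_t base;			/* page-aligned start of the tracked area */
	bool has_data[SKETCH_NPAGES];	/* true: the monitor still holds contents */
};

/* Apply a UFFD_EVENT_MADVDONTNEED message covering [start, end). */
static void on_madv_dontneed(struct region *r, uint64_t start, uint64_t end)
{
	uint64_t limit = r->base + (uint64_t)SKETCH_NPAGES * SKETCH_PAGE_SIZE;
	uint64_t addr;

	assert(start % SKETCH_PAGE_SIZE == 0 && start <= end);
	if (start < r->base)
		start = r->base;
	if (end > limit)
		end = limit;
	for (addr = start; addr < end; addr += SKETCH_PAGE_SIZE)
		r->has_data[(addr - r->base) / SKETCH_PAGE_SIZE] = false;
}

/* On a later #PF: true means "zeromap", false means "copy saved contents". */
static bool fault_wants_zeropage(const struct region *r, uint64_t addr)
{
	uint64_t idx = (addr - r->base) / SKETCH_PAGE_SIZE;

	assert(addr >= r->base && idx < SKETCH_NPAGES);
	return !r->has_data[idx];
}
```

In a real monitor the "zeromap" branch would presumably end in a
UFFDIO_ZEROPAGE ioctl and the other branch in UFFDIO_COPY; the sketch
only models the decision.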

2016-03-20 12:43:19

by Michael Rapoport

Subject: [PATCH 4/5] uffd: Add mremap() event

From: Pavel Emelyanov <[email protected]>

The event denotes that an area [start:end] moves to a different
location. A length change isn't reported: the "new" addresses, if
they appear on the uffd reader side, will not contain any data and
the reader can just zeromap them.

Waiting for the event ACK is also done outside of the mmap sem, as
for the fork event.

Signed-off-by: Pavel Emelyanov <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
---
fs/userfaultfd.c | 37 +++++++++++++++++++++++++++++++++++++
include/linux/userfaultfd_k.h | 17 +++++++++++++++++
include/uapi/linux/userfaultfd.h | 10 +++++++++-
mm/mremap.c | 17 ++++++++++++-----
4 files changed, 75 insertions(+), 6 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 565d8f2..a7771bd 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -562,6 +562,43 @@ void dup_userfaultfd_complete(struct list_head *fcs)
}
}

+void mremap_userfaultfd_prep(struct vm_area_struct *vma,
+ struct vm_userfaultfd_ctx *vm_ctx)
+{
+ struct userfaultfd_ctx *ctx;
+
+ ctx = vma->vm_userfaultfd_ctx.ctx;
+ if (ctx && (ctx->features & UFFD_FEATURE_EVENT_REMAP)) {
+ vm_ctx->ctx = ctx;
+ userfaultfd_ctx_get(ctx);
+ }
+}
+
+void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx vm_ctx,
+ unsigned long from, unsigned long to,
+ unsigned long len)
+{
+ struct userfaultfd_ctx *ctx = vm_ctx.ctx;
+ struct userfaultfd_wait_queue ewq;
+
+ if (!ctx)
+ return;
+
+ if (to & ~PAGE_MASK) {
+ userfaultfd_ctx_put(ctx);
+ return;
+ }
+
+ msg_init(&ewq.msg);
+
+ ewq.msg.event = UFFD_EVENT_REMAP;
+ ewq.msg.arg.remap.from = from;
+ ewq.msg.arg.remap.to = to;
+ ewq.msg.arg.remap.len = len;
+
+ userfaultfd_event_wait_completion(ctx, &ewq);
+}
+
static int userfaultfd_release(struct inode *inode, struct file *file)
{
struct userfaultfd_ctx *ctx = file->private_data;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 0c7b723..42ea277 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -56,6 +56,12 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
extern void dup_userfaultfd_complete(struct list_head *);

+extern void mremap_userfaultfd_prep(struct vm_area_struct *,
+ struct vm_userfaultfd_ctx *);
+extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx,
+ unsigned long from, unsigned long to,
+ unsigned long len);
+
#else /* CONFIG_USERFAULTFD */

/* mm helpers */
@@ -92,6 +98,17 @@ static inline void dup_userfaultfd_complete(struct list_head *)
{
}

+static inline void mremap_userfaultfd_prep(struct vm_area_struct *vma,
+ struct vm_userfaultfd_ctx *ctx)
+{
+}
+
+static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx ctx,
+ unsigned long from,
+ unsigned long to,
+ unsigned long len)
+{
+}
#endif /* CONFIG_USERFAULTFD */

#endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index d89eef6..46bbb6f 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -16,7 +16,7 @@
* After implementing the respective features it will become:
* #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP)
*/
-#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK)
+#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_EVENT_REMAP)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -75,6 +75,12 @@ struct uffd_msg {
} fork;

struct {
+ __u64 from;
+ __u64 to;
+ __u64 len;
+ } remap;
+
+ struct {
/* unused reserved fields */
__u64 reserved1;
__u64 reserved2;
@@ -88,6 +94,7 @@ struct uffd_msg {
*/
#define UFFD_EVENT_PAGEFAULT 0x12
#define UFFD_EVENT_FORK 0x13
+#define UFFD_EVENT_REMAP 0x14

/* flags for UFFD_EVENT_PAGEFAULT */
#define UFFD_PAGEFAULT_FLAG_WRITE (1<<0) /* If this was a write fault */
@@ -110,6 +117,7 @@ struct uffdio_api {
#define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
#endif
#define UFFD_FEATURE_EVENT_FORK (1<<1)
+#define UFFD_FEATURE_EVENT_REMAP (1<<2)
__u64 features;

__u64 ioctls;
diff --git a/mm/mremap.c b/mm/mremap.c
index 3fa0a467..3581f31 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -22,6 +22,7 @@
#include <linux/mmu_notifier.h>
#include <linux/uaccess.h>
#include <linux/mm-arch-hooks.h>
+#include <linux/userfaultfd_k.h>

#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
@@ -234,7 +235,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,

static unsigned long move_vma(struct vm_area_struct *vma,
unsigned long old_addr, unsigned long old_len,
- unsigned long new_len, unsigned long new_addr, bool *locked)
+ unsigned long new_len, unsigned long new_addr,
+ bool *locked, struct vm_userfaultfd_ctx *uf)
{
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *new_vma;
@@ -293,6 +295,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
old_addr = new_addr;
new_addr = err;
} else {
+ mremap_userfaultfd_prep(new_vma, uf);
arch_remap(mm, old_addr, old_addr + old_len,
new_addr, new_addr + new_len);
}
@@ -397,7 +400,8 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
}

static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
- unsigned long new_addr, unsigned long new_len, bool *locked)
+ unsigned long new_addr, unsigned long new_len, bool *locked,
+ struct vm_userfaultfd_ctx *uf)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
@@ -442,7 +446,7 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
if (offset_in_page(ret))
goto out1;

- ret = move_vma(vma, addr, old_len, new_len, new_addr, locked);
+ ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, uf);
if (!(offset_in_page(ret)))
goto out;
out1:
@@ -481,6 +485,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
unsigned long ret = -EINVAL;
unsigned long charged = 0;
bool locked = false;
+ struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX;

if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
return ret;
@@ -506,7 +511,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,

if (flags & MREMAP_FIXED) {
ret = mremap_to(addr, old_len, new_addr, new_len,
- &locked);
+ &locked, &uf);
goto out;
}

@@ -575,7 +580,8 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
goto out;
}

- ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked);
+ ret = move_vma(vma, addr, old_len, new_len, new_addr,
+ &locked, &uf);
}
out:
if (offset_in_page(ret)) {
@@ -585,5 +591,6 @@ out:
up_write(&current->mm->mmap_sem);
if (locked && new_len > old_len)
mm_populate(new_addr + old_len, new_len - old_len);
+ mremap_userfaultfd_complete(uf, addr, new_addr, old_len);
return ret;
}
--
1.9.1

2016-03-20 12:43:24

by Michael Rapoport

Subject: [PATCH 3/5] uffd: Add fork() event

From: Pavel Emelyanov <[email protected]>

When an mm with uffd-ed vmas fork()-s, the respective vmas
notify their uffds with an event which contains a descriptor
for the new uffd. This new descriptor can then be used to get
events from the child and populate its mm with data. Note
that there can be different uffd-s controlling different
vmas within one mm, so first we collect all those
uffds (and ctx-s) in a list and then notify them all one by
one, but only once per fork().

The context is created at fork() time but the descriptor, file
struct and anon inode object are created at event read time. So
some trickery is added to userfaultfd_ctx_read() to handle
the ctx queues' locking vs file creation.

Another thing worth noting is that the task that fork()-s
waits for the uffd event to get processed WITHOUT the mmap sem.

Signed-off-by: Pavel Emelyanov <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
---
fs/userfaultfd.c | 146 ++++++++++++++++++++++++++++++++++++++-
include/linux/userfaultfd_k.h | 12 ++++
include/uapi/linux/userfaultfd.h | 13 ++--
kernel/fork.c | 10 ++-
4 files changed, 169 insertions(+), 12 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index c8e7039..565d8f2 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -65,6 +65,12 @@ struct userfaultfd_ctx {

};

+struct userfaultfd_fork_ctx {
+ struct userfaultfd_ctx *orig;
+ struct userfaultfd_ctx *new;
+ struct list_head list;
+};
+
struct userfaultfd_wait_queue {
struct uffd_msg msg;
wait_queue_t wq;
@@ -431,9 +437,8 @@ out:
return ret;
}

-static int __maybe_unused userfaultfd_event_wait_completion(
- struct userfaultfd_ctx *ctx,
- struct userfaultfd_wait_queue *ewq)
+static int userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
+ struct userfaultfd_wait_queue *ewq)
{
int ret = 0;

@@ -484,6 +489,79 @@ static void userfaultfd_event_complete(struct userfaultfd_ctx *ctx,
__remove_wait_queue(&ctx->event_wqh, &ewq->wq);
}

+int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
+{
+ struct userfaultfd_ctx *ctx = NULL, *octx;
+ struct userfaultfd_fork_ctx *fctx;
+
+ octx = vma->vm_userfaultfd_ctx.ctx;
+ if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+ vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+ vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+ return 0;
+ }
+
+ list_for_each_entry(fctx, fcs, list)
+ if (fctx->orig == octx) {
+ ctx = fctx->new;
+ break;
+ }
+
+ if (!ctx) {
+ fctx = kmalloc(sizeof(*fctx), GFP_KERNEL);
+ if (!fctx)
+ return -ENOMEM;
+
+ ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL);
+ if (!ctx) {
+ kfree(fctx);
+ return -ENOMEM;
+ }
+
+ atomic_set(&ctx->refcount, 1);
+ ctx->flags = octx->flags;
+ ctx->state = UFFD_STATE_RUNNING;
+ ctx->features = octx->features;
+ ctx->released = false;
+ ctx->mm = vma->vm_mm;
+ atomic_inc(&ctx->mm->mm_users);
+
+ userfaultfd_ctx_get(octx);
+ fctx->orig = octx;
+ fctx->new = ctx;
+ list_add_tail(&fctx->list, fcs);
+ }
+
+ vma->vm_userfaultfd_ctx.ctx = ctx;
+ return 0;
+}
+
+static int dup_fctx(struct userfaultfd_fork_ctx *fctx)
+{
+ struct userfaultfd_ctx *ctx = fctx->orig;
+ struct userfaultfd_wait_queue ewq;
+
+ msg_init(&ewq.msg);
+
+ ewq.msg.event = UFFD_EVENT_FORK;
+ ewq.msg.arg.reserved.reserved1 = (__u64)fctx->new;
+
+ return userfaultfd_event_wait_completion(ctx, &ewq);
+}
+
+void dup_userfaultfd_complete(struct list_head *fcs)
+{
+ int ret = 0;
+ struct userfaultfd_fork_ctx *fctx, *n;
+
+ list_for_each_entry_safe(fctx, n, fcs, list) {
+ if (!ret)
+ ret = dup_fctx(fctx);
+ list_del(&fctx->list);
+ kfree(fctx);
+ }
+}
+
static int userfaultfd_release(struct inode *inode, struct file *file)
{
struct userfaultfd_ctx *ctx = file->private_data;
@@ -614,12 +692,49 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
}
}

+static const struct file_operations userfaultfd_fops;
+
+static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
+ struct userfaultfd_ctx *new,
+ struct uffd_msg *msg)
+{
+ int fd;
+ struct file *file;
+ unsigned int flags = new->flags & UFFD_SHARED_FCNTL_FLAGS;
+
+ fd = get_unused_fd_flags(flags);
+ if (fd < 0)
+ return fd;
+
+ file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, new,
+ O_RDWR | flags);
+ if (IS_ERR(file)) {
+ put_unused_fd(fd);
+ return PTR_ERR(file);
+ }
+
+ fd_install(fd, file);
+ msg->arg.reserved.reserved1 = 0;
+ msg->arg.fork.ufd = fd;
+
+ return 0;
+}
+
static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
struct uffd_msg *msg)
{
ssize_t ret;
DECLARE_WAITQUEUE(wait, current);
struct userfaultfd_wait_queue *uwq;
+ /*
+ * Handling fork event requires sleeping operations, so
+ * we drop the event_wqh lock, then do these ops, then
+ * lock it back and wake up the waiter. While the lock is
+ * dropped the ewq may go away so we keep track of it
+ * carefully.
+ */
+ LIST_HEAD(fork_event);
+ struct userfaultfd_ctx *fork_nctx = NULL;

/* always take the fd_wqh lock before the fault_pending_wqh lock */
spin_lock(&ctx->fd_wqh.lock);
@@ -677,6 +792,14 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
if (uwq) {
*msg = uwq->msg;

+ if (uwq->msg.event == UFFD_EVENT_FORK) {
+ fork_nctx = (struct userfaultfd_ctx *)uwq->msg.arg.reserved.reserved1;
+ list_move(&uwq->wq.task_list, &fork_event);
+ spin_unlock(&ctx->event_wqh.lock);
+ ret = 0;
+ break;
+ }
+
userfaultfd_event_complete(ctx, uwq);
spin_unlock(&ctx->event_wqh.lock);
ret = 0;
@@ -700,6 +823,23 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
__set_current_state(TASK_RUNNING);
spin_unlock(&ctx->fd_wqh.lock);

+ if (!ret && msg->event == UFFD_EVENT_FORK) {
+ ret = resolve_userfault_fork(ctx, fork_nctx, msg);
+
+ if (!ret) {
+ spin_lock(&ctx->event_wqh.lock);
+ if (!list_empty(&fork_event)) {
+ uwq = list_first_entry(&fork_event,
+ typeof(*uwq),
+ wq.task_list);
+ list_del(&uwq->wq.task_list);
+ __add_wait_queue(&ctx->event_wqh, &uwq->wq);
+ userfaultfd_event_complete(ctx, uwq);
+ }
+ spin_unlock(&ctx->event_wqh.lock);
+ }
+ }
+
return ret;
}

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 587480a..0c7b723 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -53,6 +53,9 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
}

+extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
+extern void dup_userfaultfd_complete(struct list_head *);
+
#else /* CONFIG_USERFAULTFD */

/* mm helpers */
@@ -80,6 +83,15 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
return false;
}

+static inline int dup_userfaultfd(struct vm_area_struct *, struct list_head *)
+{
+ return 0;
+}
+
+static inline void dup_userfaultfd_complete(struct list_head *)
+{
+}
+
#endif /* CONFIG_USERFAULTFD */

#endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 9057d7a..d89eef6 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -14,10 +14,9 @@
#define UFFD_API ((__u64)0xAA)
/*
* After implementing the respective features it will become:
- * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \
- * UFFD_FEATURE_EVENT_FORK)
+ * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP)
*/
-#define UFFD_API_FEATURES (0)
+#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -72,6 +71,10 @@ struct uffd_msg {
} pagefault;

struct {
+ __u32 ufd;
+ } fork;
+
+ struct {
/* unused reserved fields */
__u64 reserved1;
__u64 reserved2;
@@ -84,9 +87,7 @@ struct uffd_msg {
* Start at 0x12 and not at 0 to be more strict against bugs.
*/
#define UFFD_EVENT_PAGEFAULT 0x12
-#if 0 /* not available yet */
#define UFFD_EVENT_FORK 0x13
-#endif

/* flags for UFFD_EVENT_PAGEFAULT */
#define UFFD_PAGEFAULT_FLAG_WRITE (1<<0) /* If this was a write fault */
@@ -107,8 +108,8 @@ struct uffdio_api {
*/
#if 0 /* not available yet */
#define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
-#define UFFD_FEATURE_EVENT_FORK (1<<1)
#endif
+#define UFFD_FEATURE_EVENT_FORK (1<<1)
__u64 features;

__u64 ioctls;
diff --git a/kernel/fork.c b/kernel/fork.c
index accb722..0624762 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -55,6 +55,7 @@
#include <linux/rmap.h>
#include <linux/ksm.h>
#include <linux/acct.h>
+#include <linux/userfaultfd_k.h>
#include <linux/tsacct_kern.h>
#include <linux/cn_proc.h>
#include <linux/freezer.h>
@@ -408,6 +409,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
struct rb_node **rb_link, *rb_parent;
int retval;
unsigned long charge;
+ LIST_HEAD(uf);

uprobe_start_dup_mmap();
down_write(&oldmm->mmap_sem);
@@ -461,12 +463,13 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
if (retval)
goto fail_nomem_policy;
tmp->vm_mm = mm;
+ retval = dup_userfaultfd(tmp, &uf);
+ if (retval)
+ goto fail_nomem_anon_vma_fork;
if (anon_vma_fork(tmp, mpnt))
goto fail_nomem_anon_vma_fork;
- tmp->vm_flags &=
- ~(VM_LOCKED|VM_LOCKONFAULT|VM_UFFD_MISSING|VM_UFFD_WP);
+ tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
tmp->vm_next = tmp->vm_prev = NULL;
- tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
file = tmp->vm_file;
if (file) {
struct inode *inode = file_inode(file);
@@ -522,6 +525,7 @@ out:
up_write(&mm->mmap_sem);
flush_tlb_mm(oldmm);
up_write(&oldmm->mmap_sem);
+ dup_userfaultfd_complete(&uf);
uprobe_end_dup_mmap();
return retval;
fail_nomem_anon_vma_fork:
--
1.9.1
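
A minimal userspace model, not part of the patch, of the "only once per
fork()" bookkeeping in dup_userfaultfd(): every vma is looked up in a
per-fork list of (parent ctx, child ctx) pairs, and a new child ctx is
created only the first time a given parent ctx is seen. Contexts are
reduced to integer ids, and the fixed-size array, struct names and
dup_ctx() are assumptions made to keep the sketch self-contained.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PAIRS 8

/* Plays the role of userfaultfd_fork_ctx: parent ctx -> child ctx. */
struct fork_pair {
	int orig;
	int child;
};

struct fork_list {
	struct fork_pair pairs[MAX_PAIRS];
	size_t count;
	int next_child;		/* next unused child ctx id */
};

/*
 * Return the child ctx id for a vma whose parent ctx is 'orig', creating
 * the pair on first sight. Returns -1 if the list is full.
 */
static int dup_ctx(struct fork_list *fl, int orig)
{
	size_t i;

	assert(fl != NULL);
	for (i = 0; i < fl->count; i++)
		if (fl->pairs[i].orig == orig)
			return fl->pairs[i].child;

	if (fl->count == MAX_PAIRS)
		return -1;

	fl->pairs[fl->count].orig = orig;
	fl->pairs[fl->count].child = fl->next_child;
	fl->count++;
	return fl->next_child++;
}
```

After all vmas have been walked, the list holds one entry per distinct
parent ctx, which is what dup_userfaultfd_complete() iterates over to
send one UFFD_EVENT_FORK per uffd.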

2016-03-20 12:55:23

by kernel test robot

Subject: Re: [PATCH 3/5] uffd: Add fork() event

Hi Pavel,

[auto build test ERROR on next-20160318]
[also build test ERROR on v4.5]
[cannot apply to v4.5-rc7 v4.5-rc6 v4.5-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url: https://github.com/0day-ci/linux/commits/Mike-Rapoport/userfaultfd-extension-for-non-cooperative-uffd-usage/20160320-204520
config: i386-tinyconfig (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=i386

All errors (new ones prefixed by >>):

In file included from kernel/fork.c:58:0:
include/linux/userfaultfd_k.h: In function 'dup_userfaultfd':
>> include/linux/userfaultfd_k.h:86:42: error: parameter name omitted
static inline int dup_userfaultfd(struct vm_area_struct *, struct list_head *)
^
include/linux/userfaultfd_k.h:86:67: error: parameter name omitted
static inline int dup_userfaultfd(struct vm_area_struct *, struct list_head *)
^
include/linux/userfaultfd_k.h: In function 'dup_userfaultfd_complete':
include/linux/userfaultfd_k.h:91:52: error: parameter name omitted
static inline void dup_userfaultfd_complete(struct list_head *)
^

vim +86 include/linux/userfaultfd_k.h

80
81 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
82 {
83 return false;
84 }
85
> 86 static inline int dup_userfaultfd(struct vm_area_struct *, struct list_head *)
87 {
88 return 0;
89 }

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
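
The errors above come from the !CONFIG_USERFAULTFD stubs: a prototype
may leave parameters unnamed, but a function definition in the C dialect
the kernel builds with may not. A minimal sketch of stubs that would
compile, assuming the parameter names vma and fcs (taken from the real
implementations in fs/userfaultfd.c); the forward declarations and
dup_mmap_sketch() are stand-ins so the fragment builds on its own, and
the (void) casts are only there to keep a standalone build warning-free.

```c
#include <assert.h>
#include <stddef.h>

/* Opaque forward declarations are enough for this sketch. */
struct vm_area_struct;
struct list_head;

static inline int dup_userfaultfd(struct vm_area_struct *vma,
				  struct list_head *fcs)
{
	(void)vma;	/* unused in the !CONFIG_USERFAULTFD stub */
	(void)fcs;
	return 0;
}

static inline void dup_userfaultfd_complete(struct list_head *fcs)
{
	(void)fcs;
}

/* Hypothetical caller standing in for dup_mmap(). */
static int dup_mmap_sketch(void)
{
	int ret = dup_userfaultfd(NULL, NULL);

	assert(ret == 0);	/* the stub cannot fail */
	dup_userfaultfd_complete(NULL);
	return ret;
}
```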



2016-03-21 10:09:23

by Pavel Emelianov

Subject: Re: [PATCH 0/5] userfaultfd: extension for non cooperative uffd usage

On 03/20/2016 03:42 PM, Mike Rapoport wrote:
> Hi,
>
> This set is to address the issues that appear in userfaultfd usage
> scenarios when the task monitoring the uffd and the mm-owner do not
> cooperate with each other on VM changes such as remaps, madvises and
> fork()-s.
>
> The patches are essentially the same as in the previous respin [1];
> they've just been rebased on the current tree.
>
> [1] http://thread.gmane.org/gmane.linux.kernel.mm/132662

Thanks, Mike!

Acked-by: Pavel Emelyanov <[email protected]>

2016-04-06 06:14:07

by Mike Rapoport

Subject: Re: [PATCH 0/5] userfaultfd: extension for non cooperative uffd usage

On Mon, Mar 21, 2016 at 11:54 AM, Pavel Emelyanov <[email protected]> wrote:
> On 03/20/2016 03:42 PM, Mike Rapoport wrote:
>> Hi,
>>
>> This set is to address the issues that appear in userfaultfd usage
>> scenarios when the task monitoring the uffd and the mm-owner do not
>> cooperate with each other on VM changes such as remaps, madvises and
>> fork()-s.
>>
>> The patches are essentially the same as in the previous respin [1];
>> they've just been rebased on the current tree.
>>
>> [1] http://thread.gmane.org/gmane.linux.kernel.mm/132662
>
> Thanks, Mike!
>
> Acked-by: Pavel Emelyanov <[email protected]>
>

Any updates/comments on this?

--
Sincerely yours,
Mike.

2016-04-20 09:58:41

by Pavel Emelianov

Subject: Re: [PATCH 0/5] userfaultfd: extension for non cooperative uffd usage

On 03/20/2016 03:42 PM, Mike Rapoport wrote:
> Hi,
>
> This set is to address the issues that appear in userfaultfd usage
> scenarios when the task monitoring the uffd and the mm-owner do not
> cooperate with each other on VM changes such as remaps, madvises and
> fork()-s.
>
> The patches are essentially the same as in the previous respin [1];
> they've just been rebased on the current tree.

Hi, Andrea.

Hopefully one day after LSFMM is a good time to try to get a bit of
your attention to this set :)

-- Pavel

2016-04-22 16:06:08

by Andrea Arcangeli

Subject: Re: [PATCH 0/5] userfaultfd: extension for non cooperative uffd usage

Hello Pavel and Mike,

On Wed, Apr 20, 2016 at 12:44:48PM +0300, Pavel Emelyanov wrote:
> On 03/20/2016 03:42 PM, Mike Rapoport wrote:
> > Hi,
> >
> > This set is to address the issues that appear in userfaultfd usage
> > scenarios when the task monitoring the uffd and the mm-owner do not
> > cooperate with each other on VM changes such as remaps, madvises and
> > fork()-s.
> >
> > The patches are essentially the same as in the previous respin [1];
> > they've just been rebased on the current tree.

Thanks for rebasing and submitting these new features!

>
> Hi, Andrea.
>
> Hopefully one day after LSFMM is a good time to try to get a bit of
> your attention to this set :)

Yes, at first glance this patchset looks fine. In fact I already
merged it in my tree at the time of the last post. I just haven't had
much time to review it in detail yet, as I did with the wrprotect
tracking one; this is why I didn't answer yet, sorry.

As said, I already reviewed the wrprotect tracking feature in detail
and it requires a few (but non-trivial) fixes, and I was planning to
fix that part first as the developer who sent the first implementation
a few months ago got busy with something else. But until those bugs
get fixed I cannot ship it in my tree, nor send it on the way to -mm.

The other main reason for the delay is that I got sidetracked by other
issues (one internal); the other notable one is the failure in
postcopy caused by the new THP refcounting introduced in 4.5 with THP
enabled, which apparently isn't due to the huge zeropage (tested with
use_zero_page = 0) nor to MADV_DONTNEED. I'm also unconvinced it's a
bug only in the userfaultfd interaction with the new THP refcounting;
perhaps it's something more generic that just happens to be reproduced
more easily by the heavy postcopy load, which makes it even higher
priority to track down.

I'm afraid until that regression is fixed, I'll have to concentrate on
fixing that. At least I found a way to reproduce it faster so I'm
optimistic it won't take long ;).

Andrea