2014-10-03 17:11:23

by Andrea Arcangeli

Subject: [PATCH 00/17] RFC: userfault v2

Hello everyone,

There's a large To/Cc list for this RFC because it adds two new
syscalls (userfaultfd and remap_anon_pages) and
MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on the changes are
welcome sooner rather than later.

The major change compared to the previous RFC I sent a few months ago
is that the userfaultfd protocol now supports dynamic range
registration. Each process can therefore have an unlimited number of
userfault ranges, and each shared library can use its own userfaultfd
on its own memory independently of other shared libraries or the main
program. This functionality was suggested by Andy Lutomirski (more
details on this are in the commit header of the last patch of this
patchset).

In addition the mmap_sem complexities have been sorted out. In fact
the real userfault patchset starts from patch number 7. Patches 1-6
will be submitted separately for merging, and if applied standalone
they provide a scalability improvement by reducing the mmap_sem hold
times during I/O. I included patches 1-6 here too because they're a
hard dependency for the userfault patchset. The userfaultfd syscall
depends on the first fault always having FAULT_FLAG_ALLOW_RETRY set
(the later retry faults don't matter; it's fine to clear
FAULT_FLAG_ALLOW_RETRY on the retry faults, following the current
model).

The combination of these features is what I would propose for
implementing postcopy live migration in qemu, and in general demand
paging of remote memory hosted in different cloud nodes.

If the access could ever happen in kernel context through syscalls
(not just from userland context), then userfaultfd has to be used on
top of MADV_USERFAULT, to make the userfault unnoticeable to the
syscall (no error will be returned). This latter feature is more
advanced than what volatile ranges alone could do with SIGBUS so far
(but it's optional: if the process doesn't register the memory in a
userfaultfd, the regular SIGBUS will fire, and if the fd is closed
SIGBUS will also fire for any blocked userfault that was waiting for a
userfaultfd_write ack).
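
As a rough illustration only (not part of the patchset), a minimal
registration sequence could look like the sketch below; it assumes a
kernel and userspace headers built with these patches (which provide
MADV_USERFAULT and __NR_userfaultfd), and it goes through syscall(2)
since there is no glibc wrapper yet:

/* Hypothetical registration sketch, error handling omitted. */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int setup_userfault_area(void *area, size_t size)
{
	/* arm the anonymous range: missing-page faults now go to userland */
	if (madvise(area, size, MADV_USERFAULT))
		return -1;
	/* open the userfaultfd that will receive those faults */
	return syscall(__NR_userfaultfd, 0);
}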

userfaultfd is also a generic enough feature that it allows KVM to
implement postcopy live migration without having to modify a single
line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
other GUP features work just fine in combination with userfaults
(userfaults trigger async page faults in the guest scheduler, so the
guest processes that aren't waiting for userfaults can keep running in
the guest vcpus).

remap_anon_pages is the syscall to use to resolve the userfaults (it's
not mandatory: vmsplice will likely still be used in the case of local
postcopy live migration just to upgrade the qemu binary, but
remap_anon_pages is faster and ideal for transferring memory across
the network; it's zerocopy and doesn't touch the vma, it only holds
the mmap_sem for reading).

The current behavior of remap_anon_pages is very strict, to avoid any
chance of memory corruption going unnoticed. mremap is not strict like
that: if there's a synchronization bug it would, for example, silently
drop the destination range, resulting in subtle memory
corruption. remap_anon_pages would return -EEXIST in that case. If
there are holes in the source range remap_anon_pages will return
-ENOENT.

If remap_anon_pages is always used with 2M naturally aligned
addresses, transparent hugepages will not be split. If there could be
4k (or any size) holes in the 2M (or any size) source range,
remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
relax some of its strict checks (-ENOENT won't be returned if
RAP_ALLOW_SRC_HOLES is set; remap_anon_pages will then just behave as
a noop on any hole in the source range). This flag is generally useful
when implementing userfaults with THP granularity, but it shouldn't be
set when doing userfaults with PAGE_SIZE granularity if the developer
wants to benefit from the strict -ENOENT behavior.
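
As an illustration only, the fault-resolving thread could move a
freshly received page into place roughly as below. The (dst, src, len,
flags) argument order is an assumption here; the authoritative
prototype is in the sys_remap_anon_pages patch, and the headers are
assumed to come from this patchset:

#include <errno.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

static void resolve_fault(unsigned long dst, unsigned long src, size_t len)
{
	/* dst: faulting address read from the userfaultfd (page aligned)
	 * src: anonymous staging range already filled with the page data */
	if (syscall(__NR_remap_anon_pages, dst, src, len, 0) == -1) {
		if (errno == EEXIST)	/* dst already mapped: sync bug */
			abort();
		if (errno == ENOENT)	/* hole in the source range */
			abort();
	}
	/* With THP granularity pass RAP_ALLOW_SRC_HOLES instead of 0, so
	 * holes in the (2M) source range become noops instead of errors. */
}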

The remap_anon_pages syscall API is not vectored, as I expect it to be
used mainly for demand paging (where there can be just one faulting
range per userfault) or for large ranges (with the THP model as an
alternative to zapping re-dirtied pages with MADV_DONTNEED at 4k
granularity before starting the guest in the destination node), where
vectoring isn't going to provide much of a performance advantage
(thanks to the coarser THP granularity).

On the rmap side remap_anon_pages doesn't add much complexity: there
is no need for nonlinear anon vmas to support it because I added the
constraint that it will fail if the mapcount is more than 1. So in
general the source range of remap_anon_pages should be marked
MADV_DONTFORK to prevent any risk of failure if the process ever
forks (as qemu can in some cases).
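
For instance (a sketch with qemu-like usage in mind, not from the
patchset), the staging area used as the remap_anon_pages source could
be set up once as:

	/* keep the source range out of any child so mapcount stays 1 */
	if (madvise(src_area, src_len, MADV_DONTFORK))
		perror("madvise(MADV_DONTFORK)");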

The MADV_USERFAULT feature should be generic enough that it can
provide the userfaults to the Android volatile range feature too, on
access of reclaimed volatile pages, or it could be used for other
similar things with tmpfs in the future (I've been discussing how to
extend it to tmpfs, for example). Currently if MADV_USERFAULT is set
on a non-anonymous vma, it will return -EINVAL, and that's enough to
provide backwards compatibility once MADV_USERFAULT is extended to
tmpfs. An orthogonal problem then will be to identify the optimal
mechanism to atomically resolve a tmpfs-backed userfault (like
remap_anon_pages does optimally for anonymous memory), but that's
beyond the scope of the userfault functionality (in theory
remap_anon_pages is also orthogonal and I could split it off into a
separate patchset if somebody prefers). Of course remap_file_pages
should do it fine too, but it would create rmap nonlinearity, which
isn't optimal.

The code can be found here:

git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault

The branch is rebased so you can get updates for example with:

git fetch && git checkout -f origin/userfault

Comments welcome, thanks!
Andrea

Andrea Arcangeli (15):
mm: gup: add get_user_pages_locked and get_user_pages_unlocked
mm: gup: use get_user_pages_unlocked within get_user_pages_fast
mm: gup: make get_user_pages_fast and __get_user_pages_fast latency
conscious
mm: gup: use get_user_pages_fast and get_user_pages_unlocked
mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
mm: madvise MADV_USERFAULT
mm: PT lock: export double_pt_lock/unlock
mm: rmap preparation for remap_anon_pages
mm: swp_entry_swapcount
mm: sys_remap_anon_pages
waitqueue: add nr wake parameter to __wake_up_locked_key
userfaultfd: add new syscall to provide memory externalization
userfaultfd: make userfaultfd_write non blocking
powerpc: add remap_anon_pages and userfaultfd
userfaultfd: implement USERFAULTFD_RANGE_REGISTER|UNREGISTER

Andres Lagar-Cavilla (2):
mm: gup: add FOLL_TRIED
kvm: Faults which trigger IO release the mmap_sem

arch/alpha/include/uapi/asm/mman.h | 3 +
arch/mips/include/uapi/asm/mman.h | 3 +
arch/mips/mm/gup.c | 8 +-
arch/parisc/include/uapi/asm/mman.h | 3 +
arch/powerpc/include/asm/systbl.h | 2 +
arch/powerpc/include/asm/unistd.h | 2 +-
arch/powerpc/include/uapi/asm/unistd.h | 2 +
arch/powerpc/mm/gup.c | 6 +-
arch/s390/kvm/kvm-s390.c | 4 +-
arch/s390/mm/gup.c | 6 +-
arch/sh/mm/gup.c | 6 +-
arch/sparc/mm/gup.c | 6 +-
arch/x86/mm/gup.c | 235 +++++++----
arch/x86/syscalls/syscall_32.tbl | 2 +
arch/x86/syscalls/syscall_64.tbl | 2 +
arch/xtensa/include/uapi/asm/mman.h | 3 +
drivers/dma/iovlock.c | 10 +-
drivers/iommu/amd_iommu_v2.c | 6 +-
drivers/media/pci/ivtv/ivtv-udma.c | 6 +-
drivers/scsi/st.c | 10 +-
drivers/video/fbdev/pvr2fb.c | 5 +-
fs/Makefile | 1 +
fs/proc/task_mmu.c | 5 +-
fs/userfaultfd.c | 722 +++++++++++++++++++++++++++++++++
include/linux/huge_mm.h | 11 +-
include/linux/ksm.h | 4 +-
include/linux/mm.h | 15 +-
include/linux/mm_types.h | 13 +-
include/linux/swap.h | 6 +
include/linux/syscalls.h | 5 +
include/linux/userfaultfd.h | 55 +++
include/linux/wait.h | 5 +-
include/uapi/asm-generic/mman-common.h | 3 +
init/Kconfig | 11 +
kernel/sched/wait.c | 7 +-
kernel/sys_ni.c | 2 +
mm/fremap.c | 506 +++++++++++++++++++++++
mm/gup.c | 182 ++++++++-
mm/huge_memory.c | 208 ++++++++--
mm/ksm.c | 2 +-
mm/madvise.c | 22 +-
mm/memory.c | 14 +
mm/mempolicy.c | 4 +-
mm/mlock.c | 3 +-
mm/mmap.c | 39 +-
mm/mprotect.c | 3 +-
mm/mremap.c | 2 +-
mm/nommu.c | 23 ++
mm/process_vm_access.c | 7 +-
mm/rmap.c | 9 +
mm/swapfile.c | 13 +
mm/util.c | 10 +-
net/ceph/pagevec.c | 9 +-
net/sunrpc/sched.c | 2 +-
virt/kvm/async_pf.c | 4 +-
virt/kvm/kvm_main.c | 4 +-
56 files changed, 2025 insertions(+), 236 deletions(-)
create mode 100644 fs/userfaultfd.c
create mode 100644 include/linux/userfaultfd.h


2014-10-03 17:09:00

by Andrea Arcangeli

Subject: [PATCH 11/17] mm: swp_entry_swapcount

Provide a new swapfile method for remap_anon_pages to verify that the
swap entry is mapped in only one vma before relocating the swap entry
to a different virtual address. Otherwise, if the swap entry is mapped
in multiple vmas, when the page is swapped back in it could get mapped
in a nonlinear way in some anon_vma.
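
The intended caller-side check (a sketch, not part of this patch; the
real caller and its error code live in the sys_remap_anon_pages patch)
looks roughly like:

	/* refuse to move a swap entry other vmas could fault back in */
	if (swp_entry_swapcount(entry) > 1)
		return -EBUSY;	/* errno here is only illustrative */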

Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/swap.h | 6 ++++++
mm/swapfile.c | 13 +++++++++++++
2 files changed, 19 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8197452..af9977c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -458,6 +458,7 @@ extern unsigned int count_swap_pages(int, int);
extern sector_t map_swap_page(struct page *, struct block_device **);
extern sector_t swapdev_block(int, pgoff_t);
extern int page_swapcount(struct page *);
+extern int swp_entry_swapcount(swp_entry_t entry);
extern struct swap_info_struct *page_swap_info(struct page *);
extern int reuse_swap_page(struct page *);
extern int try_to_free_swap(struct page *);
@@ -559,6 +560,11 @@ static inline int page_swapcount(struct page *page)
return 0;
}

+static inline int swp_entry_swapcount(swp_entry_t entry)
+{
+ return 0;
+}
+
#define reuse_swap_page(page) (page_mapcount(page) == 1)

static inline int try_to_free_swap(struct page *page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8798b2e..4cc9af6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -874,6 +874,19 @@ int page_swapcount(struct page *page)
return count;
}

+int swp_entry_swapcount(swp_entry_t entry)
+{
+ int count = 0;
+ struct swap_info_struct *p;
+
+ p = swap_info_get(entry);
+ if (p) {
+ count = swap_count(p->swap_map[swp_offset(entry)]);
+ spin_unlock(&p->lock);
+ }
+ return count;
+}
+
/*
* We can write to an anon page without COW if there are no other references
* to it. And as a side-effect, free up its swap: because the old content

2014-10-03 17:09:05

by Andrea Arcangeli

Subject: [PATCH 03/17] mm: gup: use get_user_pages_unlocked within get_user_pages_fast

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/mips/mm/gup.c | 8 +++-----
arch/powerpc/mm/gup.c | 6 ++----
arch/s390/kvm/kvm-s390.c | 4 +---
arch/s390/mm/gup.c | 6 ++----
arch/sh/mm/gup.c | 6 ++----
arch/sparc/mm/gup.c | 6 ++----
arch/x86/mm/gup.c | 7 +++----
7 files changed, 15 insertions(+), 28 deletions(-)

diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 06ce17c..20884f5 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -301,11 +301,9 @@ slow_irqon:
start += nr << PAGE_SHIFT;
pages += nr;

- down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, start,
- (end - start) >> PAGE_SHIFT,
- write, 0, pages, NULL);
- up_read(&mm->mmap_sem);
+ ret = get_user_pages_unlocked(current, mm, start,
+ (end - start) >> PAGE_SHIFT,
+ write, 0, pages);

/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index d874668..b70c34a 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -215,10 +215,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
start += nr << PAGE_SHIFT;
pages += nr;

- down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, start,
- nr_pages - nr, write, 0, pages, NULL);
- up_read(&mm->mmap_sem);
+ ret = get_user_pages_unlocked(current, mm, start,
+ nr_pages - nr, write, 0, pages);

/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 81b0e11..37ca29a 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1092,9 +1092,7 @@ long kvm_arch_fault_in_page(struct kvm_vcpu *vcpu, gpa_t gpa, int writable)
hva = gmap_fault(gpa, vcpu->arch.gmap);
if (IS_ERR_VALUE(hva))
return (long)hva;
- down_read(&mm->mmap_sem);
- rc = get_user_pages(current, mm, hva, 1, writable, 0, NULL, NULL);
- up_read(&mm->mmap_sem);
+ rc = get_user_pages_unlocked(current, mm, hva, 1, writable, 0, NULL);

return rc < 0 ? rc : 0;
}
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 639fce46..5c586c7 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -235,10 +235,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
/* Try to get the remaining pages with get_user_pages */
start += nr << PAGE_SHIFT;
pages += nr;
- down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, start,
- nr_pages - nr, write, 0, pages, NULL);
- up_read(&mm->mmap_sem);
+ ret = get_user_pages_unlocked(current, mm, start,
+ nr_pages - nr, write, 0, pages);
/* Have to be a bit careful with return values */
if (nr > 0)
ret = (ret < 0) ? nr : ret + nr;
diff --git a/arch/sh/mm/gup.c b/arch/sh/mm/gup.c
index 37458f3..e15f52a 100644
--- a/arch/sh/mm/gup.c
+++ b/arch/sh/mm/gup.c
@@ -257,10 +257,8 @@ slow_irqon:
start += nr << PAGE_SHIFT;
pages += nr;

- down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, start,
- (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
- up_read(&mm->mmap_sem);
+ ret = get_user_pages_unlocked(current, mm, start,
+ (end - start) >> PAGE_SHIFT, write, 0, pages);

/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 1aed043..fa7de7d 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -219,10 +219,8 @@ slow:
start += nr << PAGE_SHIFT;
pages += nr;

- down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, start,
- (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
- up_read(&mm->mmap_sem);
+ ret = get_user_pages_unlocked(current, mm, start,
+ (end - start) >> PAGE_SHIFT, write, 0, pages);

/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 207d9aef..2ab183b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -388,10 +388,9 @@ slow_irqon:
start += nr << PAGE_SHIFT;
pages += nr;

- down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, start,
- (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
- up_read(&mm->mmap_sem);
+ ret = get_user_pages_unlocked(current, mm, start,
+ (end - start) >> PAGE_SHIFT,
+ write, 0, pages);

/* Have to be a bit careful with return values */
if (nr > 0) {

2014-10-03 17:09:26

by Andrea Arcangeli

Subject: [PATCH 13/17] waitqueue: add nr wake parameter to __wake_up_locked_key

Userfaultfd needs to wake all waiters on a waitqueue (by passing 0 as
the nr parameter), instead of the currently hardcoded 1 (which would
wake just the first waiter in the head list).

Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/wait.h | 5 +++--
kernel/sched/wait.c | 7 ++++---
net/sunrpc/sched.c | 2 +-
3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 6fb1ba5..f8271cb 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -144,7 +144,8 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)

typedef int wait_bit_action_f(struct wait_bit_key *);
void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+ void *key);
void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
@@ -175,7 +176,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
#define wake_up_poll(x, m) \
__wake_up(x, TASK_NORMAL, 1, (void *) (m))
#define wake_up_locked_poll(x, m) \
- __wake_up_locked_key((x), TASK_NORMAL, (void *) (m))
+ __wake_up_locked_key((x), TASK_NORMAL, 1, (void *) (m))
#define wake_up_interruptible_poll(x, m) \
__wake_up(x, TASK_INTERRUPTIBLE, 1, (void *) (m))
#define wake_up_interruptible_sync_poll(x, m) \
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 15cab1a..d848738 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -105,9 +105,10 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
}
EXPORT_SYMBOL_GPL(__wake_up_locked);

-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+ void *key)
{
- __wake_up_common(q, mode, 1, 0, key);
+ __wake_up_common(q, mode, nr, 0, key);
}
EXPORT_SYMBOL_GPL(__wake_up_locked_key);

@@ -282,7 +283,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
if (!list_empty(&wait->task_list))
list_del_init(&wait->task_list);
else if (waitqueue_active(q))
- __wake_up_locked_key(q, mode, key);
+ __wake_up_locked_key(q, mode, 1, key);
spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL(abort_exclusive_wait);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9358c79..39b7496 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -297,7 +297,7 @@ static int rpc_complete_task(struct rpc_task *task)
clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
ret = atomic_dec_and_test(&task->tk_count);
if (waitqueue_active(wq))
- __wake_up_locked_key(wq, TASK_NORMAL, &k);
+ __wake_up_locked_key(wq, TASK_NORMAL, 1, &k);
spin_unlock_irqrestore(&wq->lock, flags);
return ret;
}

2014-10-03 17:10:01

by Andrea Arcangeli

Subject: [PATCH 14/17] userfaultfd: add new syscall to provide memory externalization

Once a userfaultfd is created, MADV_USERFAULT regions talk through the
userfaultfd protocol with the thread responsible for doing the memory
externalization of the process.

The protocol starts with userland writing the requested/preferred
USERFAULTFD_PROTOCOL version into the userfault fd (a 64bit write). If
the kernel knows it, it acks it by allowing userland to read 64bit
from the userfault fd containing the same 64bit USERFAULTFD_PROTOCOL
version that userland asked for. Otherwise userland will read the
__u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and it will have
to try again by writing an older protocol version, if also suitable
for its usage, and read it back again until it stops reading
-1ULL. After that the userfaultfd protocol starts.

The protocol then consists of 64bit-sized reads from the userfault fd
that provide userland with the fault addresses. After a userfault
address has been read and the fault has been resolved by userland, the
application must write back 128bits in the form of a [ start, end ]
range (64bit each) that tells the kernel such a range has been
mapped. Multiple read userfaults can be resolved in a single range
write. poll() can be used to know when there are new userfaults to
read (POLLIN) and when there are threads waiting for a wakeup through
a range write (POLLOUT).
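
A rough userland sketch of that protocol follows (illustration only,
not part of the patch: it assumes the 0xaa protocol value and the
__NR_userfaultfd number from this patchset, and omits error handling
and poll() based multiplexing):

#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

static int handle_userfaults(long page_size)
{
	uint64_t proto = 0xaa;		/* USERFAULTFD_PROTOCOL */
	int ufd = syscall(__NR_userfaultfd, 0);

	if (ufd < 0)
		return -1;
	/* handshake: propose a version, read back the ack (or -1ULL) */
	write(ufd, &proto, sizeof(proto));
	read(ufd, &proto, sizeof(proto));
	if (proto == (uint64_t)-1)
		return -1;	/* unknown protocol, retry with an older one */

	for (;;) {
		uint64_t addr, range[2];

		/* each 64bit read returns one pending fault address */
		if (read(ufd, &addr, sizeof(addr)) != sizeof(addr))
			break;
		/* ... resolve the fault, e.g. remap_anon_pages() the page
		 * into place, then wake whoever is blocked on it ... */
		range[0] = addr & ~(page_size - 1);
		range[1] = range[0] + page_size;
		write(ufd, range, sizeof(range));
	}
	return 0;
}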

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
fs/Makefile | 1 +
fs/userfaultfd.c | 643 +++++++++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 1 +
include/linux/userfaultfd.h | 42 +++
init/Kconfig | 11 +
kernel/sys_ni.c | 1 +
mm/huge_memory.c | 24 +-
mm/memory.c | 5 +-
10 files changed, 720 insertions(+), 10 deletions(-)
create mode 100644 fs/userfaultfd.c
create mode 100644 include/linux/userfaultfd.h

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 2d0594c..782038c 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -364,3 +364,4 @@
355 i386 getrandom sys_getrandom
356 i386 memfd_create sys_memfd_create
357 i386 remap_anon_pages sys_remap_anon_pages
+358 i386 userfaultfd sys_userfaultfd
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 41e8f3e..3d5601f 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -328,6 +328,7 @@
319 common memfd_create sys_memfd_create
320 common kexec_file_load sys_kexec_file_load
321 common remap_anon_pages sys_remap_anon_pages
+322 common userfaultfd sys_userfaultfd

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 90c8852..00dfe77 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_ANON_INODES) += anon_inodes.o
obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
+obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
obj-$(CONFIG_AIO) += aio.o
obj-$(CONFIG_FILE_LOCKING) += locks.o
obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
new file mode 100644
index 0000000..62b827e
--- /dev/null
+++ b/fs/userfaultfd.c
@@ -0,0 +1,643 @@
+/*
+ * fs/userfaultfd.c
+ *
+ * Copyright (C) 2007 Davide Libenzi <[email protected]>
+ * Copyright (C) 2008-2009 Red Hat, Inc.
+ * Copyright (C) 2014 Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ * Some part derived from fs/eventfd.c (anon inode setup) and
+ * mm/ksm.c (mm hashing).
+ */
+
+#include <linux/hashtable.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/file.h>
+#include <linux/bug.h>
+#include <linux/anon_inodes.h>
+#include <linux/syscalls.h>
+#include <linux/userfaultfd.h>
+
+struct userfaultfd_ctx {
+ /* pseudo fd refcounting */
+ atomic_t refcount;
+ /* waitqueue head for the userfaultfd page faults */
+ wait_queue_head_t fault_wqh;
+ /* waitqueue head for the pseudo fd to wakeup poll/read */
+ wait_queue_head_t fd_wqh;
+ /* userfaultfd syscall flags */
+ unsigned int flags;
+ /* state machine */
+ unsigned int state;
+ /* released */
+ bool released;
+};
+
+struct userfaultfd_wait_queue {
+ unsigned long address;
+ wait_queue_t wq;
+ bool pending;
+ struct userfaultfd_ctx *ctx;
+};
+
+#define USERFAULTFD_PROTOCOL ((__u64) 0xaa)
+#define USERFAULTFD_UNKNOWN_PROTOCOL ((__u64) -1ULL)
+
+enum {
+ USERFAULTFD_STATE_ASK_PROTOCOL,
+ USERFAULTFD_STATE_ACK_PROTOCOL,
+ USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL,
+ USERFAULTFD_STATE_RUNNING,
+};
+
+/**
+ * struct mm_slot - userlandfd information per mm that is being scanned
+ * @link: link to the mm_slots hash list
+ * @mm: the mm that this information is valid for
+ * @ctx: userfaultfd context for this mm
+ */
+struct mm_slot {
+ struct hlist_node link;
+ struct mm_struct *mm;
+ struct userfaultfd_ctx ctx;
+ struct rcu_head rcu_head;
+};
+
+#define MM_USERLANDFD_HASH_BITS 10
+static DEFINE_HASHTABLE(mm_userlandfd_hash, MM_USERLANDFD_HASH_BITS);
+
+static DEFINE_MUTEX(mm_userlandfd_mutex);
+
+static struct mm_slot *get_mm_slot(struct mm_struct *mm)
+{
+ struct mm_slot *slot;
+
+ hash_for_each_possible_rcu(mm_userlandfd_hash, slot, link,
+ (unsigned long)mm)
+ if (slot->mm == mm)
+ return slot;
+
+ return NULL;
+}
+
+static void insert_to_mm_userlandfd_hash(struct mm_struct *mm,
+ struct mm_slot *mm_slot)
+{
+ mm_slot->mm = mm;
+ hash_add_rcu(mm_userlandfd_hash, &mm_slot->link, (unsigned long)mm);
+}
+
+static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
+ int wake_flags, void *key)
+{
+ unsigned long *range = key;
+ int ret;
+ struct userfaultfd_wait_queue *uwq;
+
+ uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+ ret = 0;
+ /* don't wake the pending ones to avoid reads to block */
+ if (uwq->pending && !ACCESS_ONCE(uwq->ctx->released))
+ goto out;
+ if (range[0] > uwq->address || range[1] <= uwq->address)
+ goto out;
+ ret = wake_up_state(wq->private, mode);
+ if (ret)
+ /* wake only once, autoremove behavior */
+ list_del_init(&wq->task_list);
+out:
+ return ret;
+}
+
+/**
+ * userfaultfd_ctx_get - Acquires a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to the userfaultfd context.
+ *
+ * Returns: In case of success, returns not zero.
+ */
+static int userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
+{
+ /*
+ * If it's already released don't get it. This can race
+ * against userfaultfd_release, if the race triggers it'll be
+ * handled safely by the handle_userfault main loop
+ * (userfaultfd_release will take the mmap_sem for writing to
+ * flush out all in-flight userfaults). This check is only an
+ * optimization.
+ */
+ if (unlikely(ACCESS_ONCE(ctx->released)))
+ return 0;
+ return atomic_inc_not_zero(&ctx->refcount);
+}
+
+static void userfaultfd_free(struct userfaultfd_ctx *ctx)
+{
+ struct mm_slot *mm_slot = container_of(ctx, struct mm_slot, ctx);
+
+ mutex_lock(&mm_userlandfd_mutex);
+ hash_del_rcu(&mm_slot->link);
+ mutex_unlock(&mm_userlandfd_mutex);
+
+ kfree_rcu(mm_slot, rcu_head);
+}
+
+/**
+ * userfaultfd_ctx_put - Releases a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to userfaultfd context.
+ *
+ * The userfaultfd context reference must have been previously acquired either
+ * with userfaultfd_ctx_get() or userfaultfd_ctx_fdget().
+ */
+static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
+{
+ if (atomic_dec_and_test(&ctx->refcount))
+ userfaultfd_free(ctx);
+}
+
+/*
+ * The locking rules involved in returning VM_FAULT_RETRY depending on
+ * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
+ * FAULT_FLAG_KILLABLE are not straightforward. The "Caution"
+ * recommendation in __lock_page_or_retry is not an understatement.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
+ * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
+ * not set.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set but FAULT_FLAG_KILLABLE is not
+ * set, VM_FAULT_RETRY can still be returned if and only if there are
+ * fatal_signal_pending()s, and the mmap_sem must be released before
+ * returning it.
+ */
+int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct mm_slot *slot;
+ struct userfaultfd_ctx *ctx;
+ struct userfaultfd_wait_queue uwq;
+ int ret;
+
+ BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+ rcu_read_lock();
+ slot = get_mm_slot(mm);
+ if (!slot) {
+ rcu_read_unlock();
+ return VM_FAULT_SIGBUS;
+ }
+ ctx = &slot->ctx;
+ if (!userfaultfd_ctx_get(ctx)) {
+ rcu_read_unlock();
+ return VM_FAULT_SIGBUS;
+ }
+ rcu_read_unlock();
+
+ init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
+ uwq.wq.private = current;
+ uwq.address = address;
+ uwq.pending = true;
+ uwq.ctx = ctx;
+
+ spin_lock(&ctx->fault_wqh.lock);
+ /*
+ * After the __add_wait_queue the uwq is visible to userland
+ * through poll/read().
+ */
+ __add_wait_queue(&ctx->fault_wqh, &uwq.wq);
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (fatal_signal_pending(current)) {
+ /*
+ * If we have to fail because the task is
+ * killed just retry the fault either by
+ * returning to userland or through
+ * VM_FAULT_RETRY if we come from a page fault
+ * and a fatal signal is pending.
+ */
+ ret = 0;
+ if (flags & FAULT_FLAG_KILLABLE) {
+ /*
+ * If FAULT_FLAG_KILLABLE is set and
+ * there's a fatal signal pending, we
+ * can return VM_FAULT_RETRY
+ * regardless of whether
+ * FAULT_FLAG_ALLOW_RETRY is set, as
+ * long as we release the mmap_sem.
+ * The page fault will then return
+ * straight to userland to handle the
+ * fatal signal.
+ */
+ up_read(&mm->mmap_sem);
+ ret = VM_FAULT_RETRY;
+ }
+ break;
+ }
+ if (!uwq.pending || ACCESS_ONCE(ctx->released)) {
+ ret = 0;
+ if (flags & FAULT_FLAG_ALLOW_RETRY) {
+ ret = VM_FAULT_RETRY;
+ if (!(flags & FAULT_FLAG_RETRY_NOWAIT))
+ up_read(&mm->mmap_sem);
+ }
+ break;
+ }
+ if (((FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT) &
+ flags) ==
+ (FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT)) {
+ ret = VM_FAULT_RETRY;
+ /*
+ * The mmap_sem must not be released if
+ * FAULT_FLAG_RETRY_NOWAIT is set even though we
+ * return VM_FAULT_RETRY (FOLL_NOWAIT case).
+ */
+ break;
+ }
+ spin_unlock(&ctx->fault_wqh.lock);
+ up_read(&mm->mmap_sem);
+
+ wake_up_poll(&ctx->fd_wqh, POLLIN);
+ schedule();
+
+ down_read(&mm->mmap_sem);
+ spin_lock(&ctx->fault_wqh.lock);
+ }
+ __remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
+ __set_current_state(TASK_RUNNING);
+ spin_unlock(&ctx->fault_wqh.lock);
+
+ /*
+ * ctx may go away after this if the userfault pseudo fd is
+ * released by another CPU.
+ */
+ userfaultfd_ctx_put(ctx);
+
+ return ret;
+}
+
+static int userfaultfd_release(struct inode *inode, struct file *file)
+{
+ struct userfaultfd_ctx *ctx = file->private_data;
+ struct mm_slot *mm_slot = container_of(ctx, struct mm_slot, ctx);
+ __u64 range[2] = { 0ULL, -1ULL };
+
+ ACCESS_ONCE(ctx->released) = true;
+
+ /*
+ * Flush page faults out of all CPUs to avoid race conditions
+ * against ctx->released. All page faults must be retried
+ * without returning VM_FAULT_SIGBUS if the get_mm_slot and
+ * userfaultfd_ctx_get both succeed but ctx->released is set.
+ */
+ down_write(&mm_slot->mm->mmap_sem);
+ up_write(&mm_slot->mm->mmap_sem);
+
+ spin_lock(&ctx->fault_wqh.lock);
+ __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
+ spin_unlock(&ctx->fault_wqh.lock);
+
+ wake_up_poll(&ctx->fd_wqh, POLLHUP);
+ userfaultfd_ctx_put(ctx);
+ return 0;
+}
+
+static inline unsigned long find_userfault(struct userfaultfd_ctx *ctx,
+ struct userfaultfd_wait_queue **uwq,
+ unsigned int events_filter)
+{
+ wait_queue_t *wq;
+ struct userfaultfd_wait_queue *_uwq;
+ unsigned int events = 0;
+
+ BUG_ON(!events_filter);
+
+ spin_lock(&ctx->fault_wqh.lock);
+ list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+ _uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+ if (_uwq->pending) {
+ if (!(events & POLLIN) && (events_filter & POLLIN)) {
+ events |= POLLIN;
+ if (uwq)
+ *uwq = _uwq;
+ }
+ } else if (events_filter & POLLOUT)
+ events |= POLLOUT;
+ if (events == events_filter)
+ break;
+ }
+ spin_unlock(&ctx->fault_wqh.lock);
+
+ return events;
+}
+
+static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
+{
+ struct userfaultfd_ctx *ctx = file->private_data;
+
+ poll_wait(file, &ctx->fd_wqh, wait);
+
+ switch (ctx->state) {
+ case USERFAULTFD_STATE_ASK_PROTOCOL:
+ return POLLOUT;
+ case USERFAULTFD_STATE_ACK_PROTOCOL:
+ return POLLIN;
+ case USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL:
+ return POLLIN;
+ case USERFAULTFD_STATE_RUNNING:
+ return find_userfault(ctx, NULL, POLLIN|POLLOUT);
+ default:
+ BUG();
+ }
+}
+
+static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
+ __u64 *addr)
+{
+ ssize_t ret;
+ DECLARE_WAITQUEUE(wait, current);
+ struct userfaultfd_wait_queue *uwq = NULL;
+
+ if (ctx->state == USERFAULTFD_STATE_ASK_PROTOCOL) {
+ return -EINVAL;
+ } else if (ctx->state == USERFAULTFD_STATE_ACK_PROTOCOL) {
+ *addr = USERFAULTFD_PROTOCOL;
+ ctx->state = USERFAULTFD_STATE_RUNNING;
+ return 0;
+ } else if (ctx->state == USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL) {
+ *addr = USERFAULTFD_UNKNOWN_PROTOCOL;
+ ctx->state = USERFAULTFD_STATE_ASK_PROTOCOL;
+ return 0;
+ }
+ BUG_ON(ctx->state != USERFAULTFD_STATE_RUNNING);
+
+ spin_lock(&ctx->fd_wqh.lock);
+ __add_wait_queue(&ctx->fd_wqh, &wait);
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ /* always take the fd_wqh lock before the fault_wqh lock */
+ if (find_userfault(ctx, &uwq, POLLIN)) {
+ uwq->pending = false;
+ *addr = uwq->address;
+ ret = 0;
+ break;
+ }
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+ if (no_wait) {
+ ret = -EAGAIN;
+ break;
+ }
+ spin_unlock(&ctx->fd_wqh.lock);
+ schedule();
+ spin_lock_irq(&ctx->fd_wqh.lock);
+ }
+ __remove_wait_queue(&ctx->fd_wqh, &wait);
+ __set_current_state(TASK_RUNNING);
+ if (ret == 0) {
+ if (waitqueue_active(&ctx->fd_wqh))
+ wake_up_locked_poll(&ctx->fd_wqh, POLLOUT);
+ }
+ spin_unlock_irq(&ctx->fd_wqh.lock);
+
+ return ret;
+}
+
+static ssize_t userfaultfd_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct userfaultfd_ctx *ctx = file->private_data;
+ ssize_t ret;
+ /* careful to always initialize addr if ret == 0 */
+ __u64 uninitialized_var(addr);
+
+ if (count < sizeof(addr))
+ return -EINVAL;
+ ret = userfaultfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &addr);
+ if (ret < 0)
+ return ret;
+
+ return put_user(addr, (__u64 __user *) buf) ? -EFAULT : sizeof(addr);
+}
+
+static int wake_userfault(struct userfaultfd_ctx *ctx, __u64 *range)
+{
+ wait_queue_t *wq;
+ struct userfaultfd_wait_queue *uwq;
+ int ret = -ENOENT;
+
+ spin_lock(&ctx->fault_wqh.lock);
+ list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+ uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+ if (uwq->pending)
+ continue;
+ if (uwq->address >= range[0] &&
+ uwq->address < range[1]) {
+ ret = 0;
+ /* wake all in the range and autoremove */
+ __wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0,
+ range);
+ break;
+ }
+ }
+ spin_unlock(&ctx->fault_wqh.lock);
+
+ return ret;
+}
+
+static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct userfaultfd_ctx *ctx = file->private_data;
+ ssize_t res;
+ __u64 range[2];
+ DECLARE_WAITQUEUE(wait, current);
+
+ if (ctx->state == USERFAULTFD_STATE_ASK_PROTOCOL) {
+ __u64 protocol;
+ if (count < sizeof(__u64))
+ return -EINVAL;
+ if (copy_from_user(&protocol, buf, sizeof(protocol)))
+ return -EFAULT;
+ if (protocol != USERFAULTFD_PROTOCOL) {
+ /* we'll offer the supported protocol in the ack */
+ printk_once(KERN_INFO
+ "userfaultfd protocol not available\n");
+ ctx->state = USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL;
+ } else
+ ctx->state = USERFAULTFD_STATE_ACK_PROTOCOL;
+ return sizeof(protocol);
+ } else if (ctx->state == USERFAULTFD_STATE_ACK_PROTOCOL)
+ return -EINVAL;
+
+ BUG_ON(ctx->state != USERFAULTFD_STATE_RUNNING);
+
+ if (count < sizeof(range))
+ return -EINVAL;
+ if (copy_from_user(&range, buf, sizeof(range)))
+ return -EFAULT;
+ if (range[0] >= range[1])
+ return -ERANGE;
+
+ spin_lock(&ctx->fd_wqh.lock);
+ __add_wait_queue(&ctx->fd_wqh, &wait);
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ /* always take the fd_wqh lock before the fault_wqh lock */
+ if (find_userfault(ctx, NULL, POLLOUT)) {
+ if (!wake_userfault(ctx, range)) {
+ res = sizeof(range);
+ break;
+ }
+ }
+ if (signal_pending(current)) {
+ res = -ERESTARTSYS;
+ break;
+ }
+ if (file->f_flags & O_NONBLOCK) {
+ res = -EAGAIN;
+ break;
+ }
+ spin_unlock(&ctx->fd_wqh.lock);
+ schedule();
+ spin_lock(&ctx->fd_wqh.lock);
+ }
+ __remove_wait_queue(&ctx->fd_wqh, &wait);
+ __set_current_state(TASK_RUNNING);
+ spin_unlock(&ctx->fd_wqh.lock);
+
+ return res;
+}
+
+#ifdef CONFIG_PROC_FS
+static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+ struct userfaultfd_ctx *ctx = f->private_data;
+ int ret;
+ wait_queue_t *wq;
+ struct userfaultfd_wait_queue *uwq;
+ unsigned long pending = 0, total = 0;
+
+ spin_lock(&ctx->fault_wqh.lock);
+ list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+ uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+ if (uwq->pending)
+ pending++;
+ total++;
+ }
+ spin_unlock(&ctx->fault_wqh.lock);
+
+ /*
+ * If more protocols will be added, there will be all shown
+ * separated by a space. Like this:
+ * protocols: 0xaa 0xbb
+ */
+ ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nprotocols:\t%Lx\n",
+ pending, total, USERFAULTFD_PROTOCOL);
+
+ return ret;
+}
+#endif
+
+static const struct file_operations userfaultfd_fops = {
+#ifdef CONFIG_PROC_FS
+ .show_fdinfo = userfaultfd_show_fdinfo,
+#endif
+ .release = userfaultfd_release,
+ .poll = userfaultfd_poll,
+ .read = userfaultfd_read,
+ .write = userfaultfd_write,
+ .llseek = noop_llseek,
+};
+
+/**
+ * userfaultfd_file_create - Creates an userfaultfd file pointer.
+ * @flags: Flags for the userfaultfd file.
+ *
+ * This function creates an userfaultfd file pointer, w/out installing
+ * it into the fd table. This is useful when the userfaultfd file is
+ * used during the initialization of data structures that require
+ * extra setup after the userfaultfd creation. So the userfaultfd
+ * creation is split into the file pointer creation phase, and the
+ * file descriptor installation phase. In this way races with
+ * userspace closing the newly installed file descriptor can be
+ * avoided. Returns an userfaultfd file pointer, or a proper error
+ * pointer.
+ */
+static struct file *userfaultfd_file_create(int flags)
+{
+ struct file *file;
+ struct mm_slot *mm_slot;
+
+ /* Check the UFFD_* constants for consistency. */
+ BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
+ BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
+
+ file = ERR_PTR(-EINVAL);
+ if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
+ goto out;
+
+ mm_slot = kmalloc(sizeof(*mm_slot), GFP_KERNEL);
+ file = ERR_PTR(-ENOMEM);
+ if (!mm_slot)
+ goto out;
+
+ mutex_lock(&mm_userlandfd_mutex);
+ file = ERR_PTR(-EBUSY);
+ if (get_mm_slot(current->mm))
+ goto out_free_unlock;
+
+ atomic_set(&mm_slot->ctx.refcount, 1);
+ init_waitqueue_head(&mm_slot->ctx.fault_wqh);
+ init_waitqueue_head(&mm_slot->ctx.fd_wqh);
+ mm_slot->ctx.flags = flags;
+ mm_slot->ctx.state = USERFAULTFD_STATE_ASK_PROTOCOL;
+ mm_slot->ctx.released = false;
+
+ file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops,
+ &mm_slot->ctx,
+ O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
+ if (IS_ERR(file))
+ out_free_unlock:
+ kfree(mm_slot);
+ else
+ insert_to_mm_userlandfd_hash(current->mm,
+ mm_slot);
+ mutex_unlock(&mm_userlandfd_mutex);
+out:
+ return file;
+}
+
+SYSCALL_DEFINE1(userfaultfd, int, flags)
+{
+ int fd, error;
+ struct file *file;
+
+ error = get_unused_fd_flags(flags & UFFD_SHARED_FCNTL_FLAGS);
+ if (error < 0)
+ return error;
+ fd = error;
+
+ file = userfaultfd_file_create(flags);
+ if (IS_ERR(file)) {
+ error = PTR_ERR(file);
+ goto err_put_unused_fd;
+ }
+ fd_install(fd, file);
+
+ return fd;
+
+err_put_unused_fd:
+ put_unused_fd(fd);
+
+ return error;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3d4bb05..c5cd88d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -811,6 +811,7 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
asmlinkage long sys_eventfd(unsigned int count);
asmlinkage long sys_eventfd2(unsigned int count, int flags);
asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
+asmlinkage long sys_userfaultfd(int flags);
asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/include/linux/userfaultfd.h b/include/linux/userfaultfd.h
new file mode 100644
index 0000000..b7caef5
--- /dev/null
+++ b/include/linux/userfaultfd.h
@@ -0,0 +1,42 @@
+/*
+ * include/linux/userfaultfd.h
+ *
+ * Copyright (C) 2007 Davide Libenzi <[email protected]>
+ * Copyright (C) 2014 Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_H
+#define _LINUX_USERFAULTFD_H
+
+#include <linux/fcntl.h>
+
+/*
+ * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
+ * new flags, since they might collide with O_* ones. We want
+ * to re-use O_* flags that couldn't possibly have a meaning
+ * from userfaultfd, in order to leave a free define-space for
+ * shared O_* flags.
+ */
+#define UFFD_CLOEXEC O_CLOEXEC
+#define UFFD_NONBLOCK O_NONBLOCK
+
+#define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
+
+#ifdef CONFIG_USERFAULTFD
+
+int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags);
+
+#else /* CONFIG_USERFAULTFD */
+
+static int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags)
+{
+ return VM_FAULT_SIGBUS;
+}
+
+#endif
+
+#endif /* _LINUX_USERFAULTFD_H */
diff --git a/init/Kconfig b/init/Kconfig
index e84c642..d57127e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1518,6 +1518,17 @@ config EVENTFD

If unsure, say Y.

+config USERFAULTFD
+ bool "Enable userfaultfd() system call"
+ select ANON_INODES
+ default y
+ depends on MMU
+ help
+ Enable the userfaultfd() system call that allows to trap and
+ handle page faults in userland.
+
+ If unsure, say Y.
+
config SHMEM
bool "Use full shmem filesystem" if EXPERT
default y
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 2bc7bef..fe6ab0c 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -200,6 +200,7 @@ cond_syscall(compat_sys_timerfd_gettime);
cond_syscall(sys_eventfd);
cond_syscall(sys_eventfd2);
cond_syscall(sys_memfd_create);
+cond_syscall(sys_userfaultfd);

/* performance counters: */
cond_syscall(sys_perf_event_open);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c66428..10e6408 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
#include <linux/pagemap.h>
#include <linux/migrate.h>
#include <linux/hashtable.h>
+#include <linux/userfaultfd.h>

#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -713,7 +714,7 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long haddr, pmd_t *pmd,
- struct page *page)
+ struct page *page, unsigned int flags)
{
struct mem_cgroup *memcg;
pgtable_t pgtable;
@@ -753,11 +754,15 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,

/* Deliver the page fault to userland */
if (vma->vm_flags & VM_USERFAULT) {
+ int ret;
+
spin_unlock(ptl);
mem_cgroup_cancel_charge(page, memcg);
put_page(page);
pte_free(mm, pgtable);
- return VM_FAULT_SIGBUS;
+ ret = handle_userfault(vma, haddr, flags);
+ VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+ return ret;
}

entry = mk_huge_pmd(page, vma->vm_page_prot);
@@ -837,16 +842,19 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
ret = 0;
set = false;
if (pmd_none(*pmd)) {
- if (vma->vm_flags & VM_USERFAULT)
- ret = VM_FAULT_SIGBUS;
- else {
+ if (vma->vm_flags & VM_USERFAULT) {
+ spin_unlock(ptl);
+ ret = handle_userfault(vma, haddr, flags);
+ VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+ } else {
set_huge_zero_page(pgtable, mm, vma,
haddr, pmd,
zero_page);
+ spin_unlock(ptl);
set = true;
}
- }
- spin_unlock(ptl);
+ } else
+ spin_unlock(ptl);
if (!set) {
pte_free(mm, pgtable);
put_huge_zero_page();
@@ -859,7 +867,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}
- return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
+ return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, flags);
}

int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
diff --git a/mm/memory.c b/mm/memory.c
index 16e4c8a..e80772b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -61,6 +61,7 @@
#include <linux/string.h>
#include <linux/dma-debug.h>
#include <linux/debugfs.h>
+#include <linux/userfaultfd.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -2648,7 +2649,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Deliver the page fault to userland, check inside PT lock */
if (vma->vm_flags & VM_USERFAULT) {
pte_unmap_unlock(page_table, ptl);
- return VM_FAULT_SIGBUS;
+ return handle_userfault(vma, address, flags);
}
goto setpte;
}
@@ -2682,7 +2683,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_unmap_unlock(page_table, ptl);
mem_cgroup_cancel_charge(page, memcg);
page_cache_release(page);
- return VM_FAULT_SIGBUS;
+ return handle_userfault(vma, address, flags);
}

inc_mm_counter_fast(mm, MM_ANONPAGES);

2014-10-03 17:10:11

by Andrea Arcangeli

Subject: [PATCH 01/17] mm: gup: add FOLL_TRIED

From: Andres Lagar-Cavilla <[email protected]>

Reviewed-by: Radim Krčmář <[email protected]>
Signed-off-by: Andres Lagar-Cavilla <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/mm.h | 1 +
mm/gup.c | 4 ++++
2 files changed, 5 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..0f4196a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1985,6 +1985,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
+#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */

typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..af7ea3e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
fault_flags |= FAULT_FLAG_ALLOW_RETRY;
if (*flags & FOLL_NOWAIT)
fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
+ if (*flags & FOLL_TRIED) {
+ VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+ fault_flags |= FAULT_FLAG_TRIED;
+ }

ret = handle_mm_fault(mm, vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {

2014-10-03 17:10:08

by Andrea Arcangeli

Subject: [PATCH 16/17] powerpc: add remap_anon_pages and userfaultfd

Add the syscall numbers.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/powerpc/include/asm/systbl.h | 2 ++
arch/powerpc/include/asm/unistd.h | 2 +-
arch/powerpc/include/uapi/asm/unistd.h | 2 ++
3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 7d8a600..ef03a80 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -365,3 +365,5 @@ SYSCALL_SPU(renameat2)
SYSCALL_SPU(seccomp)
SYSCALL_SPU(getrandom)
SYSCALL_SPU(memfd_create)
+SYSCALL_SPU(remap_anon_pages)
+SYSCALL_SPU(userfaultfd)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 4e9af3f..36b79c3 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
#include <uapi/asm/unistd.h>


-#define __NR_syscalls 361
+#define __NR_syscalls 363

#define __NR__exit __NR_exit
#define NR_syscalls __NR_syscalls
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index 0688fc0..5514c57 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -383,5 +383,7 @@
#define __NR_seccomp 358
#define __NR_getrandom 359
#define __NR_memfd_create 360
+#define __NR_remap_anon_pages 361
+#define __NR_userfaultfd 362

#endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */

2014-10-03 17:09:58

by Andrea Arcangeli

Subject: [PATCH 15/17] userfaultfd: make userfaultfd_write non blocking

It is generally inefficient to ask for the wakeup of a userfault range
where not a single userfault address was read earlier through
userfaultfd_read and is in turn waiting for a wakeup. However it may
come in handy to wake up the same userfault range twice in case
multiple threads fault on the same address. But we should still return
an error, so that an application which thinks this occurrence can
never happen will know it hit a bug. So just return -ENOENT instead of
blocking.
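
From userland's point of view (illustration only, not part of the
patch), the write side can now treat -ENOENT as a diagnostic instead
of risking a block:

	/* sketch: wake [start, end) after resolving the faults in it */
	uint64_t range[2] = { start, end };

	if (write(ufd, range, sizeof(range)) < 0 && errno == ENOENT)
		/* nobody was waiting in this range: likely an application
		 * bug if every wakeup is expected to match a prior read */
		fprintf(stderr, "spurious wakeup for %llx-%llx\n",
			(unsigned long long)start, (unsigned long long)end);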

Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/userfaultfd.c | 34 +++++-----------------------------
1 file changed, 5 insertions(+), 29 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 62b827e..2667d0d 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -458,9 +458,7 @@ static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
struct userfaultfd_ctx *ctx = file->private_data;
- ssize_t res;
__u64 range[2];
- DECLARE_WAITQUEUE(wait, current);

if (ctx->state == USERFAULTFD_STATE_ASK_PROTOCOL) {
__u64 protocol;
@@ -488,34 +486,12 @@ static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
if (range[0] >= range[1])
return -ERANGE;

- spin_lock(&ctx->fd_wqh.lock);
- __add_wait_queue(&ctx->fd_wqh, &wait);
- for (;;) {
- set_current_state(TASK_INTERRUPTIBLE);
- /* always take the fd_wqh lock before the fault_wqh lock */
- if (find_userfault(ctx, NULL, POLLOUT)) {
- if (!wake_userfault(ctx, range)) {
- res = sizeof(range);
- break;
- }
- }
- if (signal_pending(current)) {
- res = -ERESTARTSYS;
- break;
- }
- if (file->f_flags & O_NONBLOCK) {
- res = -EAGAIN;
- break;
- }
- spin_unlock(&ctx->fd_wqh.lock);
- schedule();
- spin_lock(&ctx->fd_wqh.lock);
- }
- __remove_wait_queue(&ctx->fd_wqh, &wait);
- __set_current_state(TASK_RUNNING);
- spin_unlock(&ctx->fd_wqh.lock);
+ /* always take the fd_wqh lock before the fault_wqh lock */
+ if (find_userfault(ctx, NULL, POLLOUT))
+ if (!wake_userfault(ctx, range))
+ return sizeof(range);

- return res;
+ return -ENOENT;
}

#ifdef CONFIG_PROC_FS

2014-10-03 17:11:27

by Andrea Arcangeli

Subject: [PATCH 04/17] mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious

This teaches gup_fast and __gup_fast to re-enable irqs and
cond_resched() if possible every BATCH_PAGES pages.

This must be implemented by other archs as well, and it's a
requirement before converting more get_user_pages() callers to
get_user_pages_fast() as an optimization (instead of using
get_user_pages_unlocked, which would be slower).

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/mm/gup.c | 234 ++++++++++++++++++++++++++++++++++--------------------
1 file changed, 149 insertions(+), 85 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 2ab183b..917d8c1 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -12,6 +12,12 @@

#include <asm/pgtable.h>

+/*
+ * Keep irq disabled for no more than BATCH_PAGES pages.
+ * Matches PTRS_PER_PTE (or half in non-PAE kernels).
+ */
+#define BATCH_PAGES 512
+
static inline pte_t gup_get_pte(pte_t *ptep)
{
#ifndef CONFIG_X86_PAE
@@ -250,6 +256,40 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
return 1;
}

+static inline int __get_user_pages_fast_batch(unsigned long start,
+ unsigned long end,
+ int write, struct page **pages)
+{
+ struct mm_struct *mm = current->mm;
+ unsigned long next;
+ unsigned long flags;
+ pgd_t *pgdp;
+ int nr = 0;
+
+ /*
+ * This doesn't prevent pagetable teardown, but does prevent
+ * the pagetables and pages from being freed on x86.
+ *
+ * So long as we atomically load page table pointers versus teardown
+ * (which we do on x86, with the above PAE exception), we can follow the
+ * address down to the the page and take a ref on it.
+ */
+ local_irq_save(flags);
+ pgdp = pgd_offset(mm, start);
+ do {
+ pgd_t pgd = *pgdp;
+
+ next = pgd_addr_end(start, end);
+ if (pgd_none(pgd))
+ break;
+ if (!gup_pud_range(pgd, start, next, write, pages, &nr))
+ break;
+ } while (pgdp++, start = next, start != end);
+ local_irq_restore(flags);
+
+ return nr;
+}
+
/*
* Like get_user_pages_fast() except its IRQ-safe in that it won't fall
* back to the regular GUP.
@@ -257,31 +297,55 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages)
{
- struct mm_struct *mm = current->mm;
- unsigned long addr, len, end;
- unsigned long next;
- unsigned long flags;
- pgd_t *pgdp;
- int nr = 0;
+ unsigned long len, end, batch_pages;
+ int nr, ret;

start &= PAGE_MASK;
- addr = start;
len = (unsigned long) nr_pages << PAGE_SHIFT;
end = start + len;
+ /*
+ * get_user_pages() handles nr_pages == 0 gracefully, but
+ * gup_fast starts walking the first pagetable in a do {}
+ * while() fashion so it's not robust to handle nr_pages ==
+ * 0. There's no point in being permissive about end < start
+ * either. So this check verifies both nr_pages being non
+ * zero, and that "end" didn't overflow.
+ */
+ VM_BUG_ON(end <= start);
if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
(void __user *)start, len)))
return 0;

- /*
- * XXX: batch / limit 'nr', to avoid large irq off latency
- * needs some instrumenting to determine the common sizes used by
- * important workloads (eg. DB2), and whether limiting the batch size
- * will decrease performance.
- *
- * It seems like we're in the clear for the moment. Direct-IO is
- * the main guy that batches up lots of get_user_pages, and even
- * they are limited to 64-at-a-time which is not so many.
- */
+ ret = 0;
+ for (;;) {
+ batch_pages = nr_pages;
+ if (batch_pages > BATCH_PAGES && !irqs_disabled())
+ batch_pages = BATCH_PAGES;
+ len = (unsigned long) batch_pages << PAGE_SHIFT;
+ end = start + len;
+ nr = __get_user_pages_fast_batch(start, end, write, pages);
+ VM_BUG_ON(nr > batch_pages);
+ nr_pages -= nr;
+ ret += nr;
+ if (!nr_pages || nr != batch_pages)
+ break;
+ start += len;
+ pages += batch_pages;
+ }
+
+ return ret;
+}
+
+static inline int get_user_pages_fast_batch(unsigned long start,
+ unsigned long end,
+ int write, struct page **pages)
+{
+ struct mm_struct *mm = current->mm;
+ unsigned long next;
+ pgd_t *pgdp;
+ int nr = 0;
+ unsigned long orig_start = start;
+
/*
* This doesn't prevent pagetable teardown, but does prevent
* the pagetables and pages from being freed on x86.
@@ -290,18 +354,24 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
* (which we do on x86, with the above PAE exception), we can follow the
* address down to the the page and take a ref on it.
*/
- local_irq_save(flags);
- pgdp = pgd_offset(mm, addr);
+ local_irq_disable();
+ pgdp = pgd_offset(mm, start);
do {
pgd_t pgd = *pgdp;

- next = pgd_addr_end(addr, end);
- if (pgd_none(pgd))
+ next = pgd_addr_end(start, end);
+ if (pgd_none(pgd)) {
+ VM_BUG_ON(nr >= (end-orig_start) >> PAGE_SHIFT);
break;
- if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+ }
+ if (!gup_pud_range(pgd, start, next, write, pages, &nr)) {
+ VM_BUG_ON(nr >= (end-orig_start) >> PAGE_SHIFT);
break;
- } while (pgdp++, addr = next, addr != end);
- local_irq_restore(flags);
+ }
+ } while (pgdp++, start = next, start != end);
+ local_irq_enable();
+
+ cond_resched();

return nr;
}
@@ -326,80 +396,74 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages)
{
struct mm_struct *mm = current->mm;
- unsigned long addr, len, end;
- unsigned long next;
- pgd_t *pgdp;
- int nr = 0;
+ unsigned long len, end, batch_pages;
+ int nr, ret;
+ unsigned long orig_start;

start &= PAGE_MASK;
- addr = start;
+ orig_start = start;
len = (unsigned long) nr_pages << PAGE_SHIFT;

end = start + len;
- if (end < start)
- goto slow_irqon;
+ /*
+ * get_user_pages() handles nr_pages == 0 gracefully, but
+ * gup_fast starts walking the first pagetable in a do {}
+ * while() fashion so it's not robust to handle nr_pages ==
+ * 0. There's no point in being permissive about end < start
+ * either. So this check verifies both nr_pages being non
+ * zero, and that "end" didn't overflow.
+ */
+ VM_BUG_ON(end <= start);

+ nr = ret = 0;
#ifdef CONFIG_X86_64
if (end >> __VIRTUAL_MASK_SHIFT)
goto slow_irqon;
#endif
+ for (;;) {
+ batch_pages = min(nr_pages, BATCH_PAGES);
+ len = (unsigned long) batch_pages << PAGE_SHIFT;
+ end = start + len;
+ nr = get_user_pages_fast_batch(start, end, write, pages);
+ VM_BUG_ON(nr > batch_pages);
+ nr_pages -= nr;
+ ret += nr;
+ if (!nr_pages)
+ break;
+ if (nr < batch_pages)
+ goto slow_irqon;
+ start += len;
+ pages += batch_pages;
+ }

- /*
- * XXX: batch / limit 'nr', to avoid large irq off latency
- * needs some instrumenting to determine the common sizes used by
- * important workloads (eg. DB2), and whether limiting the batch size
- * will decrease performance.
- *
- * It seems like we're in the clear for the moment. Direct-IO is
- * the main guy that batches up lots of get_user_pages, and even
- * they are limited to 64-at-a-time which is not so many.
- */
- /*
- * This doesn't prevent pagetable teardown, but does prevent
- * the pagetables and pages from being freed on x86.
- *
- * So long as we atomically load page table pointers versus teardown
- * (which we do on x86, with the above PAE exception), we can follow the
- * address down to the the page and take a ref on it.
- */
- local_irq_disable();
- pgdp = pgd_offset(mm, addr);
- do {
- pgd_t pgd = *pgdp;
-
- next = pgd_addr_end(addr, end);
- if (pgd_none(pgd))
- goto slow;
- if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
- goto slow;
- } while (pgdp++, addr = next, addr != end);
- local_irq_enable();
-
- VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
- return nr;
-
- {
- int ret;
+ VM_BUG_ON(ret != (end - orig_start) >> PAGE_SHIFT);
+ return ret;

-slow:
- local_irq_enable();
slow_irqon:
- /* Try to get the remaining pages with get_user_pages */
- start += nr << PAGE_SHIFT;
- pages += nr;
-
- ret = get_user_pages_unlocked(current, mm, start,
- (end - start) >> PAGE_SHIFT,
- write, 0, pages);
-
- /* Have to be a bit careful with return values */
- if (nr > 0) {
- if (ret < 0)
- ret = nr;
- else
- ret += nr;
- }
+ /* Try to get the remaining pages with get_user_pages */
+ start += nr << PAGE_SHIFT;
+ pages += nr;

- return ret;
+ /*
+ * "nr" was the get_user_pages_fast_batch last retval, "ret"
+ * was the sum of all get_user_pages_fast_batch retvals, now
+ * "nr" becomes the sum of all get_user_pages_fast_batch
+ * retvals and "ret" will become the get_user_pages_unlocked
+ * retval.
+ */
+ nr = ret;
+
+ ret = get_user_pages_unlocked(current, mm, start,
+ (end - start) >> PAGE_SHIFT,
+ write, 0, pages);
+
+ /* Have to be a bit careful with return values */
+ if (nr > 0) {
+ if (ret < 0)
+ ret = nr;
+ else
+ ret += nr;
}
+
+ return ret;
}

2014-10-03 17:11:32

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 06/17] kvm: Faults which trigger IO release the mmap_sem

From: Andres Lagar-Cavilla <[email protected]>

When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
has been swapped out or is behind a filemap, this will trigger async
readahead and return immediately. The rationale is that KVM will kick
back the guest with an "async page fault" and allow for some other
guest process to take over.

If async PFs are enabled the fault is retried asap from an async
workqueue. If not, it's retried immediately in the same code path. In
either case the retry will not relinquish the mmap semaphore and will
block on the IO. This is a bad thing, as other mmap semaphore users
now stall as a function of swap or filemap latency.

This patch ensures that both the regular and async PF paths re-enter
the fault, allowing the mmap semaphore to be relinquished while
waiting for IO.
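
A minimal sketch of the pattern being applied (illustrative only, the
real change is in the diff below): get_user_pages_unlocked() takes the
mmap_sem itself and is allowed to drop it across the fault with
FAULT_FLAG_ALLOW_RETRY, so other mmap_sem users no longer stall behind
the IO:

	/* before: mmap_sem held across the whole fault, including IO wait */
	down_read(&mm->mmap_sem);
	get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
	up_read(&mm->mmap_sem);

	/*
	 * after: the helper takes mmap_sem and may drop it while the
	 * page is read in, then re-takes it to complete the fault
	 */
	get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL);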

Reviewed-by: Radim Krčmář <[email protected]>
Signed-off-by: Andres Lagar-Cavilla <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
---
virt/kvm/async_pf.c | 4 +---
virt/kvm/kvm_main.c | 4 ++--
2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index d6a3d09..44660ae 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)

might_sleep();

- down_read(&mm->mmap_sem);
- get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
- up_read(&mm->mmap_sem);
+ get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL);
kvm_async_page_present_sync(vcpu, apf);

spin_lock(&vcpu->async_pf.lock);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 95519bc..921bce7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1170,8 +1170,8 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
addr, write_fault, page);
up_read(&current->mm->mmap_sem);
} else
- npages = get_user_pages_fast(addr, 1, write_fault,
- page);
+ npages = get_user_pages_unlocked(current, current->mm, addr, 1,
+ write_fault, 0, page);
if (npages != 1)
return npages;

2014-10-03 17:22:19

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 09/17] mm: PT lock: export double_pt_lock/unlock

Those two helpers are needed by remap_anon_pages.
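
A hedged usage sketch (not part of this patch; it matches how the
later remap_anon_pages patch uses these helpers): the two PTE locks
are taken in a stable order so two ptes living in different
page-table pages can be locked together without ABBA deadlocks:

	spinlock_t *dst_ptl = pte_lockptr(mm, dst_pmd);
	spinlock_t *src_ptl = pte_lockptr(mm, src_pmd);

	double_pt_lock(dst_ptl, src_ptl);
	/* both ptes are now stable and can be compared/moved */
	double_pt_unlock(dst_ptl, src_ptl);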

Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/mm.h | 4 ++++
mm/fremap.c | 29 +++++++++++++++++++++++++++++
2 files changed, 33 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf3df07..71dbe03 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1408,6 +1408,10 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
}
#endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */

+/* mm/fremap.c */
+extern void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern void double_pt_unlock(spinlock_t *ptl1, spinlock_t *ptl2);
+
#if USE_SPLIT_PTE_PTLOCKS
#if ALLOC_SPLIT_PTLOCKS
void __init ptlock_cache_init(void);
diff --git a/mm/fremap.c b/mm/fremap.c
index 72b8fa3..1e509f7 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -281,3 +281,32 @@ out_freed:

return err;
}
+
+void double_pt_lock(spinlock_t *ptl1,
+ spinlock_t *ptl2)
+ __acquires(ptl1)
+ __acquires(ptl2)
+{
+ spinlock_t *ptl_tmp;
+
+ if (ptl1 > ptl2) {
+ /* exchange ptl1 and ptl2 */
+ ptl_tmp = ptl1;
+ ptl1 = ptl2;
+ ptl2 = ptl_tmp;
+ }
+ /* lock in virtual address order to avoid lock inversion */
+ spin_lock(ptl1);
+ if (ptl1 != ptl2)
+ spin_lock_nested(ptl2, SINGLE_DEPTH_NESTING);
+}
+
+void double_pt_unlock(spinlock_t *ptl1,
+ spinlock_t *ptl2)
+ __releases(ptl1)
+ __releases(ptl2)
+{
+ spin_unlock(ptl1);
+ if (ptl1 != ptl2)
+ spin_unlock(ptl2);
+}

2014-10-03 17:23:21

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits

We are running out of bits in the 32bit vm_flags; widening the type
is a noop change for 64bit archs.
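
Illustrative only (VM_EXAMPLE_HIGH_FLAG is hypothetical): once
vm_flags_t is unsigned long long, a flag past bit 31 remains
representable on 32bit archs too:

	/* a new flag that would not fit in a 32bit vm_flags */
	#define VM_EXAMPLE_HIGH_FLAG	((vm_flags_t)1ULL << 32)

	vm_flags_t new_flags = vma->vm_flags | VM_EXAMPLE_HIGH_FLAG;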

Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/proc/task_mmu.c | 4 ++--
include/linux/huge_mm.h | 4 ++--
include/linux/ksm.h | 4 ++--
include/linux/mm_types.h | 2 +-
mm/huge_memory.c | 2 +-
mm/ksm.c | 2 +-
mm/madvise.c | 2 +-
mm/mremap.c | 2 +-
8 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c341568..ee1c3a2 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -532,11 +532,11 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
/*
* Don't forget to update Documentation/ on changes.
*/
- static const char mnemonics[BITS_PER_LONG][2] = {
+ static const char mnemonics[BITS_PER_LONG+1][2] = {
/*
* In case if we meet a flag we don't know about.
*/
- [0 ... (BITS_PER_LONG-1)] = "??",
+ [0 ... (BITS_PER_LONG)] = "??",

[ilog2(VM_READ)] = "rd",
[ilog2(VM_WRITE)] = "wr",
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 63579cb..3aa10e0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -121,7 +121,7 @@ extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
#error "hugepages can't be allocated by the buddy allocator"
#endif
extern int hugepage_madvise(struct vm_area_struct *vma,
- unsigned long *vm_flags, int advice);
+ vm_flags_t *vm_flags, int advice);
extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
unsigned long start,
unsigned long end,
@@ -183,7 +183,7 @@ static inline int split_huge_page(struct page *page)
#define split_huge_page_pmd_mm(__mm, __address, __pmd) \
do { } while (0)
static inline int hugepage_madvise(struct vm_area_struct *vma,
- unsigned long *vm_flags, int advice)
+ vm_flags_t *vm_flags, int advice)
{
BUG();
return 0;
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 3be6bb1..8b35253 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -18,7 +18,7 @@ struct mem_cgroup;

#ifdef CONFIG_KSM
int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, int advice, unsigned long *vm_flags);
+ unsigned long end, int advice, vm_flags_t *vm_flags);
int __ksm_enter(struct mm_struct *mm);
void __ksm_exit(struct mm_struct *mm);

@@ -94,7 +94,7 @@ static inline int PageKsm(struct page *page)

#ifdef CONFIG_MMU
static inline int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, int advice, unsigned long *vm_flags)
+ unsigned long end, int advice, vm_flags_t *vm_flags)
{
return 0;
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e0b286..2c876d1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -217,7 +217,7 @@ struct page_frag {
#endif
};

-typedef unsigned long __nocast vm_flags_t;
+typedef unsigned long long __nocast vm_flags_t;

/*
* A region containing a mapping of a non-memory backed file under NOMMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9a21d06..e913a19 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1942,7 +1942,7 @@ out:
#define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)

int hugepage_madvise(struct vm_area_struct *vma,
- unsigned long *vm_flags, int advice)
+ vm_flags_t *vm_flags, int advice)
{
switch (advice) {
case MADV_HUGEPAGE:
diff --git a/mm/ksm.c b/mm/ksm.c
index fb75902..faf319e 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1736,7 +1736,7 @@ static int ksm_scan_thread(void *nothing)
}

int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, int advice, unsigned long *vm_flags)
+ unsigned long end, int advice, vm_flags_t *vm_flags)
{
struct mm_struct *mm = vma->vm_mm;
int err;
diff --git a/mm/madvise.c b/mm/madvise.c
index 0938b30..d5aee71 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -49,7 +49,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
int error = 0;
pgoff_t pgoff;
- unsigned long new_flags = vma->vm_flags;
+ vm_flags_t new_flags = vma->vm_flags;

switch (behavior) {
case MADV_NORMAL:
diff --git a/mm/mremap.c b/mm/mremap.c
index 05f1180..fa7db87 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -239,7 +239,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
{
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *new_vma;
- unsigned long vm_flags = vma->vm_flags;
+ vm_flags_t vm_flags = vma->vm_flags;
unsigned long new_pgoff;
unsigned long moved_len;
unsigned long excess = 0;

2014-10-03 18:01:35

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

remap_anon_pages (unlike remap_file_pages) tries to be non-intrusive
in the rmap code.

As far as the rmap code is concerned, remap_anon_pages only alters the
page->mapping and page->index. It does so while holding the page
lock. However there are a few places that, in the presence of anon
pages, are allowed to do rmap walks without the page lock
(split_huge_page and page_referenced_anon). Those places that do rmap
walks without taking the page lock first must be updated to re-check
that the page->mapping didn't change after they obtained the anon_vma
lock. remap_anon_pages takes the anon_vma lock for writing before
altering the page->mapping, so if the page->mapping is still the same
after obtaining the anon_vma lock (without the page lock), the rmap
walks can go ahead safely (and remap_anon_pages will wait for them to
complete before proceeding).
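
In isolation the re-check looks roughly like this (a sketch of the
pattern the diff below adds to split_huge_page and
page_lock_anon_vma_read):

	struct address_space *mapping;
	struct anon_vma *anon_vma;

	for (;;) {
		mapping = ACCESS_ONCE(page->mapping);
		anon_vma = page_get_anon_vma(page);
		if (!anon_vma)
			break;	/* page was unmapped from under us */
		anon_vma_lock_write(anon_vma);
		/*
		 * remap_anon_pages may have changed page->mapping
		 * before we got the lock; if so drop this anon_vma
		 * and retry with the new one.
		 */
		if (mapping == ACCESS_ONCE(page->mapping))
			break;
		anon_vma_unlock_write(anon_vma);
		put_anon_vma(anon_vma);
	}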

remap_anon_pages serializes against itself with the page lock.

All other places taking the anon_vma lock while holding the mmap_sem
for writing, don't need to check if the page->mapping has changed
after taking the anon_vma lock, regardless of the page lock, because
remap_anon_pages holds the mmap_sem for reading.

Overall this looks a fairly small change to the rmap code, notably
less intrusive than the nonlinear vmas created by remap_file_pages.

There's one constraint enforced to allow this simplification: the
source pages passed to remap_anon_pages must be mapped in only one
vma, but this is not a limitation when it is used to handle userland
page faults with MADV_USERFAULT. The source addresses passed to
remap_anon_pages should be set VM_DONTCOPY with MADV_DONTFORK, to
avoid any risk of the mapcount of the pages increasing if fork runs
in parallel in another thread before or while remap_anon_pages runs.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 24 ++++++++++++++++++++----
mm/rmap.c | 9 +++++++++
2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b402d60..4277ed7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1921,6 +1921,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
{
struct anon_vma *anon_vma;
int ret = 1;
+ struct address_space *mapping;

BUG_ON(is_huge_zero_page(page));
BUG_ON(!PageAnon(page));
@@ -1932,10 +1933,24 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
* page_lock_anon_vma_read except the write lock is taken to serialise
* against parallel split or collapse operations.
*/
- anon_vma = page_get_anon_vma(page);
- if (!anon_vma)
- goto out;
- anon_vma_lock_write(anon_vma);
+ for (;;) {
+ mapping = ACCESS_ONCE(page->mapping);
+ anon_vma = page_get_anon_vma(page);
+ if (!anon_vma)
+ goto out;
+ anon_vma_lock_write(anon_vma);
+ /*
+ * We don't hold the page lock here so
+ * remap_anon_pages_huge_pmd can change the anon_vma
+ * from under us until we obtain the anon_vma
+ * lock. Verify that we obtained the anon_vma lock
+ * before remap_anon_pages did.
+ */
+ if (likely(mapping == ACCESS_ONCE(page->mapping)))
+ break;
+ anon_vma_unlock_write(anon_vma);
+ put_anon_vma(anon_vma);
+ }

ret = 0;
if (!PageCompound(page))
@@ -2460,6 +2475,7 @@ static void collapse_huge_page(struct mm_struct *mm,
* Prevent all access to pagetables with the exception of
* gup_fast later hanlded by the ptep_clear_flush and the VM
* handled by the anon_vma lock + PG_lock.
+ * remap_anon_pages is prevented to race as well by the mmap_sem.
*/
down_write(&mm->mmap_sem);
if (unlikely(khugepaged_test_exit(mm)))
diff --git a/mm/rmap.c b/mm/rmap.c
index 3e8491c..6d875eb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -450,6 +450,7 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
struct anon_vma *root_anon_vma;
unsigned long anon_mapping;

+repeat:
rcu_read_lock();
anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
@@ -488,6 +489,14 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
rcu_read_unlock();
anon_vma_lock_read(anon_vma);

+ /* check if remap_anon_pages changed the anon_vma */
+ if (unlikely((unsigned long) ACCESS_ONCE(page->mapping) != anon_mapping)) {
+ anon_vma_unlock_read(anon_vma);
+ put_anon_vma(anon_vma);
+ anon_vma = NULL;
+ goto repeat;
+ }
+
if (atomic_dec_and_test(&anon_vma->refcount)) {
/*
* Oops, we held the last refcount, release the lock

2014-10-03 18:02:13

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 05/17] mm: gup: use get_user_pages_fast and get_user_pages_unlocked

Just an optimization.
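
The conversion is mechanical; a sketch of the forms involved
(illustrative, the per-driver changes are below): when the task and mm
are current and current->mm the lockless fast path is used, otherwise
get_user_pages_unlocked still keeps the FAULT_FLAG_ALLOW_RETRY benefit:

	/* before */
	down_read(&current->mm->mmap_sem);
	ret = get_user_pages(current, current->mm, start, nr_pages,
			     write, 0, pages, NULL);
	up_read(&current->mm->mmap_sem);

	/* after, when operating on current->mm without "force" */
	ret = get_user_pages_fast(start, nr_pages, write, pages);

	/* after, when operating on another mm or when "force" is needed */
	ret = get_user_pages_unlocked(tsk, mm, start, nr_pages,
				      write, force, pages);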

Signed-off-by: Andrea Arcangeli <[email protected]>
---
drivers/dma/iovlock.c | 10 ++--------
drivers/iommu/amd_iommu_v2.c | 6 ++----
drivers/media/pci/ivtv/ivtv-udma.c | 6 ++----
drivers/scsi/st.c | 10 ++--------
drivers/video/fbdev/pvr2fb.c | 5 +----
mm/process_vm_access.c | 7 ++-----
mm/util.c | 10 ++--------
net/ceph/pagevec.c | 9 ++++-----
8 files changed, 17 insertions(+), 46 deletions(-)

diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
index bb48a57..12ea7c3 100644
--- a/drivers/dma/iovlock.c
+++ b/drivers/dma/iovlock.c
@@ -95,17 +95,11 @@ struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len)
pages += page_list->nr_pages;

/* pin pages down */
- down_read(&current->mm->mmap_sem);
- ret = get_user_pages(
- current,
- current->mm,
+ ret = get_user_pages_fast(
(unsigned long) iov[i].iov_base,
page_list->nr_pages,
1, /* write */
- 0, /* force */
- page_list->pages,
- NULL);
- up_read(&current->mm->mmap_sem);
+ page_list->pages);

if (ret != page_list->nr_pages)
goto unpin;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 5f578e8..6963b73 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -519,10 +519,8 @@ static void do_fault(struct work_struct *work)

write = !!(fault->flags & PPR_FAULT_WRITE);

- down_read(&fault->state->mm->mmap_sem);
- npages = get_user_pages(NULL, fault->state->mm,
- fault->address, 1, write, 0, &page, NULL);
- up_read(&fault->state->mm->mmap_sem);
+ npages = get_user_pages_unlocked(NULL, fault->state->mm,
+ fault->address, 1, write, 0, &page);

if (npages == 1) {
put_page(page);
diff --git a/drivers/media/pci/ivtv/ivtv-udma.c b/drivers/media/pci/ivtv/ivtv-udma.c
index 7338cb2..96d866b 100644
--- a/drivers/media/pci/ivtv/ivtv-udma.c
+++ b/drivers/media/pci/ivtv/ivtv-udma.c
@@ -124,10 +124,8 @@ int ivtv_udma_setup(struct ivtv *itv, unsigned long ivtv_dest_addr,
}

/* Get user pages for DMA Xfer */
- down_read(&current->mm->mmap_sem);
- err = get_user_pages(current, current->mm,
- user_dma.uaddr, user_dma.page_count, 0, 1, dma->map, NULL);
- up_read(&current->mm->mmap_sem);
+ err = get_user_pages_unlocked(current, current->mm,
+ user_dma.uaddr, user_dma.page_count, 0, 1, dma->map);

if (user_dma.page_count != err) {
IVTV_DEBUG_WARN("failed to map user pages, returned %d instead of %d\n",
diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index aff9689..c89dcfa 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -4536,18 +4536,12 @@ static int sgl_map_user_pages(struct st_buffer *STbp,
return -ENOMEM;

/* Try to fault in all of the necessary pages */
- down_read(&current->mm->mmap_sem);
/* rw==READ means read from drive, write into memory area */
- res = get_user_pages(
- current,
- current->mm,
+ res = get_user_pages_fast(
uaddr,
nr_pages,
rw == READ,
- 0, /* don't force */
- pages,
- NULL);
- up_read(&current->mm->mmap_sem);
+ pages);

/* Errors and no page mapped should return here */
if (res < nr_pages)
diff --git a/drivers/video/fbdev/pvr2fb.c b/drivers/video/fbdev/pvr2fb.c
index 167cfff..ff81f65 100644
--- a/drivers/video/fbdev/pvr2fb.c
+++ b/drivers/video/fbdev/pvr2fb.c
@@ -686,10 +686,7 @@ static ssize_t pvr2fb_write(struct fb_info *info, const char *buf,
if (!pages)
return -ENOMEM;

- down_read(&current->mm->mmap_sem);
- ret = get_user_pages(current, current->mm, (unsigned long)buf,
- nr_pages, WRITE, 0, pages, NULL);
- up_read(&current->mm->mmap_sem);
+ ret = get_user_pages_fast((unsigned long)buf, nr_pages, WRITE, pages);

if (ret < nr_pages) {
nr_pages = ret;
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 5077afc..b159769 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -99,11 +99,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
size_t bytes;

/* Get the pages we're interested in */
- down_read(&mm->mmap_sem);
- pages = get_user_pages(task, mm, pa, pages,
- vm_write, 0, process_pages, NULL);
- up_read(&mm->mmap_sem);
-
+ pages = get_user_pages_unlocked(task, mm, pa, pages,
+ vm_write, 0, process_pages);
if (pages <= 0)
return -EFAULT;

diff --git a/mm/util.c b/mm/util.c
index 093c973..1b93f2d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -247,14 +247,8 @@ int __weak get_user_pages_fast(unsigned long start,
int nr_pages, int write, struct page **pages)
{
struct mm_struct *mm = current->mm;
- int ret;
-
- down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, start, nr_pages,
- write, 0, pages, NULL);
- up_read(&mm->mmap_sem);
-
- return ret;
+ return get_user_pages_unlocked(current, mm, start, nr_pages,
+ write, 0, pages);
}
EXPORT_SYMBOL_GPL(get_user_pages_fast);

diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 5550130..5504783 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -23,17 +23,16 @@ struct page **ceph_get_direct_page_vector(const void __user *data,
if (!pages)
return ERR_PTR(-ENOMEM);

- down_read(&current->mm->mmap_sem);
while (got < num_pages) {
- rc = get_user_pages(current, current->mm,
- (unsigned long)data + ((unsigned long)got * PAGE_SIZE),
- num_pages - got, write_page, 0, pages + got, NULL);
+ rc = get_user_pages_fast((unsigned long)data +
+ ((unsigned long)got * PAGE_SIZE),
+ num_pages - got,
+ write_page, pages + got);
if (rc < 0)
break;
BUG_ON(rc == 0);
got += rc;
}
- up_read(&current->mm->mmap_sem);
if (rc < 0)
goto fail;
return pages;

2014-10-03 18:02:51

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 02/17] mm: gup: add get_user_pages_locked and get_user_pages_unlocked

We can leverage the VM_FAULT_RETRY functionality in the page fault
paths better by using either get_user_pages_locked or
get_user_pages_unlocked.

The former allows converting get_user_pages invocations into calls
that pass a "&locked" parameter, so the caller can know whether the
mmap_sem was dropped during the call. Example from:

down_read(&mm->mmap_sem);
do_something()
get_user_pages(tsk, mm, ..., pages, NULL);
up_read(&mm->mmap_sem);

to:

int locked = 1;
down_read(&mm->mmap_sem);
do_something()
get_user_pages_locked(tsk, mm, ..., pages, &locked);
if (locked)
up_read(&mm->mmap_sem);

The latter is suitable only as a drop in replacement of the form:

down_read(&mm->mmap_sem);
get_user_pages(tsk, mm, ..., pages, NULL);
up_read(&mm->mmap_sem);

into:

get_user_pages_unlocked(tsk, mm, ..., pages);

Here tsk, mm, the intermediate "..." parameters and "pages" can be any
value as before. Only the last parameter of get_user_pages (vmas) must
be NULL for get_user_pages_locked|unlocked to be usable (the latter
original form wouldn't have been safe anyway if vmas wasn't NULL; for
the former we just make it explicit by dropping the parameter).

If vmas is not NULL these two methods cannot be used.

This patch then applies the new forms in various places, in some cases
also replacing the call with get_user_pages_fast whenever tsk and mm
are current and current->mm. get_user_pages_unlocked differs from
get_user_pages_fast only if mm is not current->mm (like when
get_user_pages works on some other process mm). Whenever tsk and mm
match current and current->mm, get_user_pages_fast must always be
used to increase performance and get the page locklessly (with only
irqs disabled).
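
In short, the rule for new callers would be (sketch):

	if (tsk == current && mm == current->mm && !force)
		ret = get_user_pages_fast(start, nr_pages, write, pages);
	else
		ret = get_user_pages_unlocked(tsk, mm, start, nr_pages,
					      write, force, pages);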

Signed-off-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Andres Lagar-Cavilla <[email protected]>
Reviewed-by: Peter Feiner <[email protected]>
---
include/linux/mm.h | 7 +++
mm/gup.c | 178 +++++++++++++++++++++++++++++++++++++++++++++++++----
mm/nommu.c | 23 +++++++
3 files changed, 197 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0f4196a..8900ba9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1196,6 +1196,13 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
int write, int force, struct page **pages,
struct vm_area_struct **vmas);
+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages,
+ int *locked);
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages);
int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
struct kvec;
diff --git a/mm/gup.c b/mm/gup.c
index af7ea3e..6f2f757 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -580,6 +580,166 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
return 0;
}

+static inline long __get_user_pages_locked(struct task_struct *tsk,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long nr_pages,
+ int write, int force,
+ struct page **pages,
+ struct vm_area_struct **vmas,
+ int *locked,
+ bool notify_drop)
+{
+ int flags = FOLL_TOUCH;
+ long ret, pages_done;
+ bool lock_dropped;
+
+ if (locked) {
+ /* if VM_FAULT_RETRY can be returned, vmas become invalid */
+ BUG_ON(vmas);
+ /* check caller initialized locked */
+ BUG_ON(*locked != 1);
+ }
+
+ if (pages)
+ flags |= FOLL_GET;
+ if (write)
+ flags |= FOLL_WRITE;
+ if (force)
+ flags |= FOLL_FORCE;
+
+ pages_done = 0;
+ lock_dropped = false;
+ for (;;) {
+ ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
+ vmas, locked);
+ if (!locked)
+ /* VM_FAULT_RETRY couldn't trigger, bypass */
+ return ret;
+
+ /* VM_FAULT_RETRY cannot return errors */
+ if (!*locked) {
+ BUG_ON(ret < 0);
+ BUG_ON(ret >= nr_pages);
+ }
+
+ if (!pages)
+ /* If it's a prefault don't insist harder */
+ return ret;
+
+ if (ret > 0) {
+ nr_pages -= ret;
+ pages_done += ret;
+ if (!nr_pages)
+ break;
+ }
+ if (*locked) {
+ /* VM_FAULT_RETRY didn't trigger */
+ if (!pages_done)
+ pages_done = ret;
+ break;
+ }
+ /* VM_FAULT_RETRY triggered, so seek to the faulting offset */
+ pages += ret;
+ start += ret << PAGE_SHIFT;
+
+ /*
+ * Repeat on the address that fired VM_FAULT_RETRY
+ * without FAULT_FLAG_ALLOW_RETRY but with
+ * FAULT_FLAG_TRIED.
+ */
+ *locked = 1;
+ lock_dropped = true;
+ down_read(&mm->mmap_sem);
+ ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
+ pages, NULL, NULL);
+ if (ret != 1) {
+ BUG_ON(ret > 1);
+ if (!pages_done)
+ pages_done = ret;
+ break;
+ }
+ nr_pages--;
+ pages_done++;
+ if (!nr_pages)
+ break;
+ pages++;
+ start += PAGE_SIZE;
+ }
+ if (notify_drop && lock_dropped && *locked) {
+ /*
+ * We must let the caller know we temporarily dropped the lock
+ * and so the critical section protected by it was lost.
+ */
+ up_read(&mm->mmap_sem);
+ *locked = 0;
+ }
+ return pages_done;
+}
+
+/*
+ * We can leverage the VM_FAULT_RETRY functionality in the page fault
+ * paths better by using either get_user_pages_locked() or
+ * get_user_pages_unlocked().
+ *
+ * get_user_pages_locked() is suitable to replace the form:
+ *
+ * down_read(&mm->mmap_sem);
+ * do_something()
+ * get_user_pages(tsk, mm, ..., pages, NULL);
+ * up_read(&mm->mmap_sem);
+ *
+ * to:
+ *
+ * int locked = 1;
+ * down_read(&mm->mmap_sem);
+ * do_something()
+ * get_user_pages_locked(tsk, mm, ..., pages, &locked);
+ * if (locked)
+ * up_read(&mm->mmap_sem);
+ */
+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages,
+ int *locked)
+{
+ return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
+ pages, NULL, locked, true);
+}
+EXPORT_SYMBOL(get_user_pages_locked);
+
+/*
+ * get_user_pages_unlocked() is suitable to replace the form:
+ *
+ * down_read(&mm->mmap_sem);
+ * get_user_pages(tsk, mm, ..., pages, NULL);
+ * up_read(&mm->mmap_sem);
+ *
+ * with:
+ *
+ * get_user_pages_unlocked(tsk, mm, ..., pages);
+ *
+ * It is functionally equivalent to get_user_pages_fast so
+ * get_user_pages_fast should be used instead, if the two parameters
+ * "tsk" and "mm" are respectively equal to current and current->mm,
+ * or if "force" shall be set to 1 (get_user_pages_fast misses the
+ * "force" parameter).
+ */
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages)
+{
+ long ret;
+ int locked = 1;
+ down_read(&mm->mmap_sem);
+ ret = __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
+ pages, NULL, &locked, false);
+ if (locked)
+ up_read(&mm->mmap_sem);
+ return ret;
+}
+EXPORT_SYMBOL(get_user_pages_unlocked);
+
/*
* get_user_pages() - pin user pages in memory
* @tsk: the task_struct to use for page fault accounting, or
@@ -629,22 +789,18 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
* use the correct cache flushing APIs.
*
* See also get_user_pages_fast, for performance critical applications.
+ *
+ * get_user_pages should be phased out in favor of
+ * get_user_pages_locked|unlocked or get_user_pages_fast. Nothing
+ * should use get_user_pages because it cannot pass
+ * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
*/
long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, unsigned long nr_pages, int write,
int force, struct page **pages, struct vm_area_struct **vmas)
{
- int flags = FOLL_TOUCH;
-
- if (pages)
- flags |= FOLL_GET;
- if (write)
- flags |= FOLL_WRITE;
- if (force)
- flags |= FOLL_FORCE;
-
- return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
- NULL);
+ return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
+ pages, vmas, NULL, false);
}
EXPORT_SYMBOL(get_user_pages);

diff --git a/mm/nommu.c b/mm/nommu.c
index a881d96..3918b0f 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -213,6 +213,29 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
}
EXPORT_SYMBOL(get_user_pages);

+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages,
+ int *locked)
+{
+ return get_user_pages(tsk, mm, start, nr_pages, write, force,
+ pages, NULL);
+}
+EXPORT_SYMBOL(get_user_pages_locked);
+
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages)
+{
+ long ret;
+ down_read(&mm->mmap_sem);
+ ret = get_user_pages(tsk, mm, start, nr_pages, write, force,
+ pages, NULL);
+ up_read(&mm->mmap_sem);
+ return ret;
+}
+EXPORT_SYMBOL(get_user_pages_unlocked);
+
/**
* follow_pfn - look up PFN at a user virtual address
* @vma: memory mapping

2014-10-03 18:03:33

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 12/17] mm: sys_remap_anon_pages

This new syscall will move anon pages across vmas, atomically and
without touching the vmas.

It only works on non-shared anonymous pages because those can be
relocated without generating non-linear anon_vmas in the rmap code.

It is the ideal mechanism to handle userspace page faults. Normally
the destination vma will have VM_USERFAULT set with
madvise(MADV_USERFAULT) while the source vma will have VM_DONTCOPY
set with madvise(MADV_DONTFORK).

Setting MADV_DONTFORK on the source vma prevents remap_anon_pages
from failing if the process forks during the userland page fault.

The thread triggering the sigbus signal handler by touching an
unmapped hole in the MADV_USERFAULT region should take care to
receive the data belonging to the faulting virtual address into the
source vma. The data can come from the network, storage or any other
I/O device. After the data has been safely received in the private
area in the source vma, it will call remap_anon_pages to atomically
map the page at the faulting address in the destination vma, and
finally it will return from the signal handler.

It is an alternative to mremap.

It only works if the vma protection bits are identical in the source
and destination vma.

It can remap non shared anonymous pages within the same vma too.

If the source virtual memory range has any unmapped holes, or if the
destination virtual memory range is not a whole unmapped hole,
remap_anon_pages will fail respectively with -ENOENT or -EEXIST. This
provides a very strict behavior to avoid any chance of memory
corruption going unnoticed if there are userland race conditions. Only
one thread should resolve the userland page fault at any given time
for any given faulting address. This means that if two threads try to
both call remap_anon_pages on the same destination address at the same
time, the second thread will get an explicit error from this syscall.

The syscall will return "len" if successful. The syscall however can
be interrupted by fatal signals or errors. If interrupted it will
return the number of bytes successfully remapped before the
interruption if any, or the negative error if none. It will never
return zero: either it will return an error or an amount of bytes
successfully moved. If the retval reports a "short" remap, the
remap_anon_pages syscall should be repeated by userland with
src+retval, dst+retval, len-retval if it wants to know about the
error that interrupted it.
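
For example a userland caller that wants to handle short returns could
loop like this (a sketch; the syscall number matches the x86_64 table
added by this patch):

	#include <unistd.h>
	#include <sys/syscall.h>

	#define SYS_remap_anon_pages 321

	static long remap_all(unsigned long dst, unsigned long src,
			      unsigned long len, unsigned long flags)
	{
		while (len) {
			long ret = syscall(SYS_remap_anon_pages, dst, src,
					   len, flags);
			if (ret < 0)
				return ret; /* error, nothing was remapped */
			/* short remap: skip what was moved and retry */
			dst += ret;
			src += ret;
			len -= ret;
		}
		return 0;
	}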

The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
errors from materializing if there are holes in the source virtual
range that is being remapped. The holes will be accounted as
successfully remapped in the retval of the syscall. This is mostly
useful to remap hugepage naturally aligned virtual regions without
knowing if there are transparent hugepages in the regions or not,
while preventing the risk of having to split the hugepmd during the
remap.

The main difference with mremap is that, if used to fill holes in
unmapped anonymous memory vmas (in combination with MADV_USERFAULT),
remap_anon_pages won't create lots of unmergeable vmas. mremap instead
would create lots of vmas (because of non-linear vma->vm_pgoff),
leading to -ENOMEM failures (the number of vmas is limited).

MADV_USERFAULT and remap_anon_pages() can be tested with a program
like below:

===
#define _GNU_SOURCE
#include <sys/mman.h>
#include <pthread.h>
#include <strings.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>

#define USE_USERFAULT
#define THP

#define MADV_USERFAULT 18

#define SIZE (1024*1024*1024)

#define SYS_remap_anon_pages 321

static volatile unsigned char *c, *tmp;

void userfault_sighandler(int signum, siginfo_t *info, void *ctx)
{
unsigned char *addr = info->si_addr;
int len = 4096;
int ret;

addr = (unsigned char *) ((unsigned long) addr & ~((getpagesize())-1));
#ifdef THP
addr = (unsigned char *) ((unsigned long) addr & ~((2*1024*1024)-1));
len = 2*1024*1024;
#endif
if (addr >= c && addr < c + SIZE) {
unsigned long offset = addr - c;
ret = syscall(SYS_remap_anon_pages, c+offset, tmp+offset, len, 0);
if (ret != len)
perror("sigbus remap_anon_pages"), exit(1);
//printf("sigbus offset %lu\n", offset);
return;
}

printf("sigbus error addr %p c %p tmp %p\n", addr, c, tmp), exit(1);
}

int main()
{
struct sigaction sa;
int ret;
unsigned long i;
#ifndef THP
/*
* Fails with THP due lack of alignment because of memset
* pre-filling the destination
*/
c = mmap(0, SIZE, PROT_READ|PROT_WRITE,
MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
if (c == MAP_FAILED)
perror("mmap"), exit(1);
tmp = mmap(0, SIZE, PROT_READ|PROT_WRITE,
MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
if (tmp == MAP_FAILED)
perror("mmap"), exit(1);
#else
ret = posix_memalign((void **)&c, 2*1024*1024, SIZE);
if (ret)
perror("posix_memalign"), exit(1);
ret = posix_memalign((void **)&tmp, 2*1024*1024, SIZE);
if (ret)
perror("posix_memalign"), exit(1);
#endif
/*
* MADV_USERFAULT must run before memset, to avoid THP 2m
* faults to map memory into "tmp", if "tmp" isn't allocated
* with hugepage alignment.
*/
if (madvise((void *)c, SIZE, MADV_USERFAULT))
perror("madvise"), exit(1);
memset((void *)tmp, 0xaa, SIZE);

sa.sa_sigaction = userfault_sighandler;
sigemptyset(&sa.sa_mask);
sa.sa_flags = SA_SIGINFO;
sigaction(SIGBUS, &sa, NULL);

#ifndef USE_USERFAULT
ret = syscall(SYS_remap_anon_pages, c, tmp, SIZE, 0);
if (ret != SIZE)
perror("remap_anon_pages"), exit(1);
#endif

for (i = 0; i < SIZE; i += 4096) {
if ((i/4096) % 2) {
/* exercise read and write MADV_USERFAULT */
c[i+1] = 0xbb;
}
if (c[i] != 0xaa)
printf("error %x offset %lu\n", c[i], i), exit(1);
}
printf("remap_anon_pages functions correctly\n");

return 0;
}
===

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/huge_mm.h | 7 +
include/linux/syscalls.h | 4 +
kernel/sys_ni.c | 1 +
mm/fremap.c | 477 +++++++++++++++++++++++++++++++++++++++
mm/huge_memory.c | 110 +++++++++
7 files changed, 601 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 028b781..2d0594c 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -363,3 +363,4 @@
354 i386 seccomp sys_seccomp
355 i386 getrandom sys_getrandom
356 i386 memfd_create sys_memfd_create
+357 i386 remap_anon_pages sys_remap_anon_pages
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 35dd922..41e8f3e 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -327,6 +327,7 @@
318 common getrandom sys_getrandom
319 common memfd_create sys_memfd_create
320 common kexec_file_load sys_kexec_file_load
+321 common remap_anon_pages sys_remap_anon_pages

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3aa10e0..8a85fc9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -33,6 +33,13 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, pgprot_t newprot,
int prot_numa);
+extern int remap_anon_pages_huge_pmd(struct mm_struct *mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd,
+ pmd_t dst_pmdval,
+ struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma,
+ unsigned long dst_addr,
+ unsigned long src_addr);

enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 0f86d85..3d4bb05 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -451,6 +451,10 @@ asmlinkage long sys_mremap(unsigned long addr,
asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
unsigned long prot, unsigned long pgoff,
unsigned long flags);
+asmlinkage long sys_remap_anon_pages(unsigned long dst_start,
+ unsigned long src_start,
+ unsigned long len,
+ unsigned long flags);
asmlinkage long sys_msync(unsigned long start, size_t len, int flags);
asmlinkage long sys_fadvise64(int fd, loff_t offset, size_t len, int advice);
asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 391d4dd..2bc7bef 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -178,6 +178,7 @@ cond_syscall(sys_mincore);
cond_syscall(sys_madvise);
cond_syscall(sys_mremap);
cond_syscall(sys_remap_file_pages);
+cond_syscall(sys_remap_anon_pages);
cond_syscall(compat_sys_move_pages);
cond_syscall(compat_sys_migrate_pages);

diff --git a/mm/fremap.c b/mm/fremap.c
index 1e509f7..9337637 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -310,3 +310,480 @@ void double_pt_unlock(spinlock_t *ptl1,
if (ptl1 != ptl2)
spin_unlock(ptl2);
}
+
+#define RAP_ALLOW_SRC_HOLES (1UL<<0)
+
+/*
+ * The mmap_sem for reading is held by the caller. Just move the page
+ * from src_pmd to dst_pmd if possible, and return true if succeeded
+ * in moving the page.
+ */
+static int remap_anon_pages_pte(struct mm_struct *mm,
+ pte_t *dst_pte, pte_t *src_pte, pmd_t *src_pmd,
+ struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma,
+ unsigned long dst_addr,
+ unsigned long src_addr,
+ spinlock_t *dst_ptl,
+ spinlock_t *src_ptl,
+ unsigned long flags)
+{
+ struct page *src_page;
+ swp_entry_t entry;
+ pte_t orig_src_pte, orig_dst_pte;
+ struct anon_vma *src_anon_vma, *dst_anon_vma;
+
+ spin_lock(dst_ptl);
+ orig_dst_pte = *dst_pte;
+ spin_unlock(dst_ptl);
+ if (!pte_none(orig_dst_pte))
+ return -EEXIST;
+
+ spin_lock(src_ptl);
+ orig_src_pte = *src_pte;
+ spin_unlock(src_ptl);
+ if (pte_none(orig_src_pte)) {
+ if (!(flags & RAP_ALLOW_SRC_HOLES))
+ return -ENOENT;
+ else
+ /* nothing to do to remap an hole */
+ return 0;
+ }
+
+ if (pte_present(orig_src_pte)) {
+ /*
+ * Pin the page while holding the lock to be sure the
+ * page isn't freed under us
+ */
+ spin_lock(src_ptl);
+ if (!pte_same(orig_src_pte, *src_pte)) {
+ spin_unlock(src_ptl);
+ return -EAGAIN;
+ }
+ src_page = vm_normal_page(src_vma, src_addr, orig_src_pte);
+ if (!src_page || !PageAnon(src_page) ||
+ page_mapcount(src_page) != 1) {
+ spin_unlock(src_ptl);
+ return -EBUSY;
+ }
+
+ get_page(src_page);
+ spin_unlock(src_ptl);
+
+ /* block all concurrent rmap walks */
+ lock_page(src_page);
+
+ /*
+ * page_referenced_anon walks the anon_vma chain
+ * without the page lock. Serialize against it with
+ * the anon_vma lock, the page lock is not enough.
+ */
+ src_anon_vma = page_get_anon_vma(src_page);
+ if (!src_anon_vma) {
+ /* page was unmapped from under us */
+ unlock_page(src_page);
+ put_page(src_page);
+ return -EAGAIN;
+ }
+ anon_vma_lock_write(src_anon_vma);
+
+ double_pt_lock(dst_ptl, src_ptl);
+
+ if (!pte_same(*src_pte, orig_src_pte) ||
+ !pte_same(*dst_pte, orig_dst_pte) ||
+ page_mapcount(src_page) != 1) {
+ double_pt_unlock(dst_ptl, src_ptl);
+ anon_vma_unlock_write(src_anon_vma);
+ put_anon_vma(src_anon_vma);
+ unlock_page(src_page);
+ put_page(src_page);
+ return -EAGAIN;
+ }
+
+ BUG_ON(!PageAnon(src_page));
+ /* the PT lock is enough to keep the page pinned now */
+ put_page(src_page);
+
+ dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+ ACCESS_ONCE(src_page->mapping) = ((struct address_space *)
+ dst_anon_vma);
+ ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma,
+ dst_addr);
+
+ if (!pte_same(ptep_clear_flush(src_vma, src_addr, src_pte),
+ orig_src_pte))
+ BUG();
+
+ orig_dst_pte = mk_pte(src_page, dst_vma->vm_page_prot);
+ orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
+ dst_vma);
+
+ set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
+
+ double_pt_unlock(dst_ptl, src_ptl);
+
+ anon_vma_unlock_write(src_anon_vma);
+ put_anon_vma(src_anon_vma);
+
+ /* unblock rmap walks */
+ unlock_page(src_page);
+
+ mmu_notifier_invalidate_page(mm, src_addr);
+ } else {
+ if (pte_file(orig_src_pte))
+ return -EFAULT;
+
+ entry = pte_to_swp_entry(orig_src_pte);
+ if (non_swap_entry(entry)) {
+ if (is_migration_entry(entry)) {
+ migration_entry_wait(mm, src_pmd, src_addr);
+ return -EAGAIN;
+ }
+ return -EFAULT;
+ }
+
+ if (swp_entry_swapcount(entry) != 1)
+ return -EBUSY;
+
+ double_pt_lock(dst_ptl, src_ptl);
+
+ if (!pte_same(*src_pte, orig_src_pte) ||
+ !pte_same(*dst_pte, orig_dst_pte) ||
+ swp_entry_swapcount(entry) != 1) {
+ double_pt_unlock(dst_ptl, src_ptl);
+ return -EAGAIN;
+ }
+
+ if (pte_val(ptep_get_and_clear(mm, src_addr, src_pte)) !=
+ pte_val(orig_src_pte))
+ BUG();
+ set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
+
+ double_pt_unlock(dst_ptl, src_ptl);
+ }
+
+ return 0;
+}
+
+static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd = NULL;
+
+ pgd = pgd_offset(mm, address);
+ pud = pud_alloc(mm, pgd, address);
+ if (pud)
+ /*
+ * Note that we didn't run this because the pmd was
+ * missing, the *pmd may be already established and in
+ * turn it may also be a trans_huge_pmd.
+ */
+ pmd = pmd_alloc(mm, pud, address);
+ return pmd;
+}
+
+/**
+ * sys_remap_anon_pages - remap arbitrary anonymous pages of an existing vma
+ * @dst_start: start of the destination virtual memory range
+ * @src_start: start of the source virtual memory range
+ * @len: length of the virtual memory range
+ *
+ * sys_remap_anon_pages remaps arbitrary anonymous pages atomically in
+ * zero copy. It only works on non shared anonymous pages because
+ * those can be relocated without generating non linear anon_vmas in
+ * the rmap code.
+ *
+ * It is the ideal mechanism to handle userspace page faults. Normally
+ * the destination vma will have VM_USERFAULT set with
+ * madvise(MADV_USERFAULT) while the source vma will have VM_DONTCOPY
+ * set with madvise(MADV_DONTFORK).
+ *
+ * The thread receiving the page during the userland page fault
+ * (MADV_USERFAULT) will receive the faulting page in the source vma
+ * through the network, storage or any other I/O device (MADV_DONTFORK
+ * in the source vma avoids remap_anon_pages to fail with -EBUSY if
+ * the process forks before remap_anon_pages is called), then it will
+ * call remap_anon_pages to map the page in the faulting address in
+ * the destination vma.
+ *
+ * This syscall works purely via pagetables, so it's the most
+ * efficient way to move physical non shared anonymous pages across
+ * different virtual addresses. Unlike mremap()/mmap()/munmap() it
+ * does not create any new vmas. The mapping in the destination
+ * address is atomic.
+ *
+ * It only works if the vma protection bits are identical from the
+ * source and destination vma.
+ *
+ * It can remap non shared anonymous pages within the same vma too.
+ *
+ * If the source virtual memory range has any unmapped holes, or if
+ * the destination virtual memory range is not a whole unmapped hole,
+ * remap_anon_pages will fail respectively with -ENOENT or
+ * -EEXIST. This provides a very strict behavior to avoid any chance
+ * of memory corruption going unnoticed if there are userland race
+ * conditions. Only one thread should resolve the userland page fault
+ * at any given time for any given faulting address. This means that
+ * if two threads try to both call remap_anon_pages on the same
+ * destination address at the same time, the second thread will get an
+ * explicit error from this syscall.
+ *
+ * The syscall retval will return "len" is succesful. The syscall
+ * however can be interrupted by fatal signals or errors. If
+ * interrupted it will return the number of bytes successfully
+ * remapped before the interruption if any, or the negative error if
+ * none. It will never return zero. Either it will return an error or
+ * an amount of bytes successfully moved. If the retval reports a
+ * "short" remap, the remap_anon_pages syscall should be repeated by
+ * userland with src+retval, dst+reval, len-retval if it wants to know
+ * about the error that interrupted it.
+ *
+ * The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
+ * errors to materialize if there are holes in the source virtual
+ * range that is being remapped. The holes will be accounted as
+ * successfully remapped in the retval of the syscall. This is mostly
+ * useful to remap hugepage naturally aligned virtual regions without
+ * knowing if there are transparent hugepage in the regions or not,
+ * but preventing the risk of having to split the hugepmd during the
+ * remap.
+ *
+ * If there's any rmap walk that is taking the anon_vma locks without
+ * first obtaining the page lock (for example split_huge_page and
+ * page_referenced_anon), they will have to verify if the
+ * page->mapping has changed after taking the anon_vma lock. If it
+ * changed they should release the lock and retry obtaining a new
+ * anon_vma, because it means the anon_vma was changed by
+ * remap_anon_pages before the lock could be obtained. This is the
+ * only additional complexity added to the rmap code to provide this
+ * anonymous page remapping functionality.
+ */
+SYSCALL_DEFINE4(remap_anon_pages,
+ unsigned long, dst_start, unsigned long, src_start,
+ unsigned long, len, unsigned long, flags)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *src_vma, *dst_vma;
+ long err = -EINVAL;
+ pmd_t *src_pmd, *dst_pmd;
+ pte_t *src_pte, *dst_pte;
+ spinlock_t *dst_ptl, *src_ptl;
+ unsigned long src_addr, dst_addr;
+ int thp_aligned = -1;
+ long moved = 0;
+
+ /*
+ * Sanitize the syscall parameters:
+ */
+ if (src_start & ~PAGE_MASK)
+ return err;
+ if (dst_start & ~PAGE_MASK)
+ return err;
+ if (len & ~PAGE_MASK)
+ return err;
+ if (flags & ~RAP_ALLOW_SRC_HOLES)
+ return err;
+
+ /* Does the address range wrap, or is the span zero-sized? */
+ if (unlikely(src_start + len <= src_start))
+ return err;
+ if (unlikely(dst_start + len <= dst_start))
+ return err;
+
+ down_read(&mm->mmap_sem);
+
+ /*
+ * Make sure the vma is not shared, that the src and dst remap
+ * ranges are both valid and fully within a single existing
+ * vma.
+ */
+ src_vma = find_vma(mm, src_start);
+ if (!src_vma || (src_vma->vm_flags & VM_SHARED))
+ goto out;
+ if (src_start < src_vma->vm_start ||
+ src_start + len > src_vma->vm_end)
+ goto out;
+
+ dst_vma = find_vma(mm, dst_start);
+ if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+ goto out;
+ if (dst_start < dst_vma->vm_start ||
+ dst_start + len > dst_vma->vm_end)
+ goto out;
+
+ if (pgprot_val(src_vma->vm_page_prot) !=
+ pgprot_val(dst_vma->vm_page_prot))
+ goto out;
+
+ /* only allow remapping if both are mlocked or both aren't */
+ if ((src_vma->vm_flags & VM_LOCKED) ^ (dst_vma->vm_flags & VM_LOCKED))
+ goto out;
+
+ /*
+ * Ensure the dst_vma has a anon_vma or this page
+ * would get a NULL anon_vma when moved in the
+ * dst_vma.
+ */
+ err = -ENOMEM;
+ if (unlikely(anon_vma_prepare(dst_vma)))
+ goto out;
+
+ for (src_addr = src_start, dst_addr = dst_start;
+ src_addr < src_start + len; ) {
+ spinlock_t *ptl;
+ pmd_t dst_pmdval;
+ BUG_ON(dst_addr >= dst_start + len);
+ src_pmd = mm_find_pmd(mm, src_addr);
+ if (unlikely(!src_pmd)) {
+ if (!(flags & RAP_ALLOW_SRC_HOLES)) {
+ err = -ENOENT;
+ break;
+ } else {
+ src_pmd = mm_alloc_pmd(mm, src_addr);
+ if (unlikely(!src_pmd)) {
+ err = -ENOMEM;
+ break;
+ }
+ }
+ }
+ dst_pmd = mm_alloc_pmd(mm, dst_addr);
+ if (unlikely(!dst_pmd)) {
+ err = -ENOMEM;
+ break;
+ }
+
+ dst_pmdval = pmd_read_atomic(dst_pmd);
+ /*
+ * If the dst_pmd is mapped as THP don't
+ * override it and just be strict.
+ */
+ if (unlikely(pmd_trans_huge(dst_pmdval))) {
+ err = -EEXIST;
+ break;
+ }
+ if (pmd_trans_huge_lock(src_pmd, src_vma, &ptl) == 1) {
+ /*
+ * Check if we can move the pmd without
+ * splitting it. First check the address
+ * alignment to be the same in src/dst. These
+ * checks don't actually need the PT lock but
+ * it's good to do it here to optimize this
+ * block away at build time if
+ * CONFIG_TRANSPARENT_HUGEPAGE is not set.
+ */
+ if (thp_aligned == -1)
+ thp_aligned = ((src_addr & ~HPAGE_PMD_MASK) ==
+ (dst_addr & ~HPAGE_PMD_MASK));
+ if (!thp_aligned || (src_addr & ~HPAGE_PMD_MASK) ||
+ !pmd_none(dst_pmdval) ||
+ src_start + len - src_addr < HPAGE_PMD_SIZE) {
+ spin_unlock(ptl);
+ /* Fall through */
+ split_huge_page_pmd(src_vma, src_addr,
+ src_pmd);
+ } else {
+ BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
+ err = remap_anon_pages_huge_pmd(mm,
+ dst_pmd,
+ src_pmd,
+ dst_pmdval,
+ dst_vma,
+ src_vma,
+ dst_addr,
+ src_addr);
+ cond_resched();
+
+ if (!err) {
+ dst_addr += HPAGE_PMD_SIZE;
+ src_addr += HPAGE_PMD_SIZE;
+ moved += HPAGE_PMD_SIZE;
+ }
+
+ if ((!err || err == -EAGAIN) &&
+ fatal_signal_pending(current))
+ err = -EINTR;
+
+ if (err && err != -EAGAIN)
+ break;
+
+ continue;
+ }
+ }
+
+ if (pmd_none(*src_pmd)) {
+ if (!(flags & RAP_ALLOW_SRC_HOLES)) {
+ err = -ENOENT;
+ break;
+ } else {
+ if (unlikely(__pte_alloc(mm, src_vma, src_pmd,
+ src_addr))) {
+ err = -ENOMEM;
+ break;
+ }
+ }
+ }
+
+ /*
+ * We held the mmap_sem for reading so MADV_DONTNEED
+ * can zap transparent huge pages under us, or the
+ * transparent huge page fault can establish new
+ * transparent huge pages under us.
+ */
+ if (unlikely(pmd_trans_unstable(src_pmd))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (unlikely(pmd_none(dst_pmdval)) &&
+ unlikely(__pte_alloc(mm, dst_vma, dst_pmd,
+ dst_addr))) {
+ err = -ENOMEM;
+ break;
+ }
+ /* If an huge pmd materialized from under us fail */
+ if (unlikely(pmd_trans_huge(*dst_pmd))) {
+ err = -EFAULT;
+ break;
+ }
+
+ BUG_ON(pmd_none(*dst_pmd));
+ BUG_ON(pmd_none(*src_pmd));
+ BUG_ON(pmd_trans_huge(*dst_pmd));
+ BUG_ON(pmd_trans_huge(*src_pmd));
+
+ dst_pte = pte_offset_map(dst_pmd, dst_addr);
+ src_pte = pte_offset_map(src_pmd, src_addr);
+ dst_ptl = pte_lockptr(mm, dst_pmd);
+ src_ptl = pte_lockptr(mm, src_pmd);
+
+ err = remap_anon_pages_pte(mm,
+ dst_pte, src_pte, src_pmd,
+ dst_vma, src_vma,
+ dst_addr, src_addr,
+ dst_ptl, src_ptl, flags);
+
+ pte_unmap(dst_pte);
+ pte_unmap(src_pte);
+ cond_resched();
+
+ if (!err) {
+ dst_addr += PAGE_SIZE;
+ src_addr += PAGE_SIZE;
+ moved += PAGE_SIZE;
+ }
+
+ if ((!err || err == -EAGAIN) &&
+ fatal_signal_pending(current))
+ err = -EINTR;
+
+ if (err && err != -EAGAIN)
+ break;
+ }
+
+out:
+ up_read(&mm->mmap_sem);
+ BUG_ON(moved < 0);
+ BUG_ON(err > 0);
+ BUG_ON(!moved && !err);
+ return moved ? moved : err;
+}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4277ed7..9c66428 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1555,6 +1555,116 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
}

/*
+ * The PT lock for src_pmd and the mmap_sem for reading are held by
+ * the caller, but it must return after releasing the
+ * page_table_lock. We're guaranteed the src_pmd is a pmd_trans_huge
+ * until the PT lock of the src_pmd is released. Just move the page
+ * from src_pmd to dst_pmd if possible. Return zero if succeeded in
+ * moving the page, -EAGAIN if it needs to be repeated by the caller,
+ * or other errors in case of failure.
+ */
+int remap_anon_pages_huge_pmd(struct mm_struct *mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd,
+ pmd_t dst_pmdval,
+ struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma,
+ unsigned long dst_addr,
+ unsigned long src_addr)
+{
+ pmd_t _dst_pmd, src_pmdval;
+ struct page *src_page;
+ struct anon_vma *src_anon_vma, *dst_anon_vma;
+ spinlock_t *src_ptl, *dst_ptl;
+ pgtable_t pgtable;
+
+ src_pmdval = *src_pmd;
+ src_ptl = pmd_lockptr(mm, src_pmd);
+
+ BUG_ON(!pmd_trans_huge(src_pmdval));
+ BUG_ON(pmd_trans_splitting(src_pmdval));
+ BUG_ON(!pmd_none(dst_pmdval));
+ BUG_ON(!spin_is_locked(src_ptl));
+ BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+ src_page = pmd_page(src_pmdval);
+ BUG_ON(!PageHead(src_page));
+ BUG_ON(!PageAnon(src_page));
+ if (unlikely(page_mapcount(src_page) != 1)) {
+ spin_unlock(src_ptl);
+ return -EBUSY;
+ }
+
+ get_page(src_page);
+ spin_unlock(src_ptl);
+
+ mmu_notifier_invalidate_range_start(mm, src_addr,
+ src_addr + HPAGE_PMD_SIZE);
+
+ /* block all concurrent rmap walks */
+ lock_page(src_page);
+
+ /*
+ * split_huge_page walks the anon_vma chain without the page
+ * lock. Serialize against it with the anon_vma lock, the page
+ * lock is not enough.
+ */
+ src_anon_vma = page_get_anon_vma(src_page);
+ if (!src_anon_vma) {
+ unlock_page(src_page);
+ put_page(src_page);
+ mmu_notifier_invalidate_range_end(mm, src_addr,
+ src_addr + HPAGE_PMD_SIZE);
+ return -EAGAIN;
+ }
+ anon_vma_lock_write(src_anon_vma);
+
+ dst_ptl = pmd_lockptr(mm, dst_pmd);
+ double_pt_lock(src_ptl, dst_ptl);
+ if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
+ !pmd_same(*dst_pmd, dst_pmdval) ||
+ page_mapcount(src_page) != 1)) {
+ double_pt_unlock(src_ptl, dst_ptl);
+ anon_vma_unlock_write(src_anon_vma);
+ put_anon_vma(src_anon_vma);
+ unlock_page(src_page);
+ put_page(src_page);
+ mmu_notifier_invalidate_range_end(mm, src_addr,
+ src_addr + HPAGE_PMD_SIZE);
+ return -EAGAIN;
+ }
+
+ BUG_ON(!PageHead(src_page));
+ BUG_ON(!PageAnon(src_page));
+ /* the PT lock is enough to keep the page pinned now */
+ put_page(src_page);
+
+ dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+ ACCESS_ONCE(src_page->mapping) = (struct address_space *) dst_anon_vma;
+ ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma, dst_addr);
+
+ if (!pmd_same(pmdp_clear_flush(src_vma, src_addr, src_pmd),
+ src_pmdval))
+ BUG();
+ _dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
+ _dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
+ set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
+
+ pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+ pgtable_trans_huge_deposit(mm, dst_pmd, pgtable);
+ double_pt_unlock(src_ptl, dst_ptl);
+
+ anon_vma_unlock_write(src_anon_vma);
+ put_anon_vma(src_anon_vma);
+
+ /* unblock rmap walks */
+ unlock_page(src_page);
+
+ mmu_notifier_invalidate_range_end(mm, src_addr,
+ src_addr + HPAGE_PMD_SIZE);
+ return 0;
+}
+
+/*
* Returns 1 if a given pmd maps a stable (not under splitting) thp.
* Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
*

2014-10-03 18:04:15

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 17/17] userfaultfd: implement USERFAULTFD_RANGE_REGISTER|UNREGISTER

This adds two commands to the userfaultfd protocol.

To register memory regions into userfaultfd you can write 16 bytes as:

[ start|0x1, end ]

to unregister write:

[ start|0x2, end ]

End is "start+len" (not start+len-1). Same as vma->vm_end.

This also enforces the constraint that start and end must both be page
aligned (so the last two bits become available to implement the
USERFAULTFD_RANGE_REGISTER|UNREGISTER commands).
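
For instance, registering and unregistering a page aligned range could
look like this (a sketch, assuming the fd already went through the
protocol handshake introduced earlier in the series):

	#include <stdint.h>
	#include <unistd.h>

	#define USERFAULTFD_RANGE_REGISTER	0x1ULL
	#define USERFAULTFD_RANGE_UNREGISTER	0x2ULL

	/* start and start+len must both be page aligned */
	static int uffd_register(int ufd, uint64_t start, uint64_t len)
	{
		uint64_t msg[2] = { start | USERFAULTFD_RANGE_REGISTER,
				    start + len };
		return write(ufd, msg, sizeof(msg)) == sizeof(msg) ? 0 : -1;
	}

	static int uffd_unregister(int ufd, uint64_t start, uint64_t len)
	{
		uint64_t msg[2] = { start | USERFAULTFD_RANGE_UNREGISTER,
				    start + len };
		return write(ufd, msg, sizeof(msg)) == sizeof(msg) ? 0 : -1;
	}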

This way there can be multiple userfaultfd for each process and each
one can register into its own virtual memory ranges.

If a userfaultfd tries to register a virtual memory range that is
already registered with a different userfaultfd, -EBUSY will be
returned by the write() syscall.

A userfaultfd can register allocated ranges that don't have
MADV_USERFAULT set, but if MADV_USERFAULT is not set, no userfault
will fire on those.

The userfaultfd protocol engages only if MADV_USERFAULT is set on the
virtual memory range and a userfaultfd is registered on the same
range.

If only MADV_USERFAULT is set and there's no userfaultfd registered on
a memory range, only a SIGBUS will be raised and the page fault will
not engage the userfaultfd protocol.

This also makes handle_userfault() safe against race conditions with
regard to the mmap_sem by requiring FAULT_FLAG_ALLOW_RETRY to be set
the first time a fault is raised by any thread. In turn, to work
reliably, the userfaultfd depends on the gup_locked|unlocked patchset
being applied.

If get_user_pages() is run on virtual memory ranges registered into
the userfaultfd, handle_userfault() will return VM_FAULT_SIGBUS and
gup() will return -EFAULT, because get_user_pages() doesn't allow
handle_userfault() to release the mmap_sem and in turn we cannot
safely engage the userfaultfd protocol. So the remaining
get_user_pages() calls must be restricted to memory ranges that we
know are not tracked through the userfaultfd protocol for the
userfaultfd to be reliable.

The only exception to this, a get_user_pages() that can safely run
into a userfaultfd and trigger a -EFAULT, is ptrace. ptrace would
otherwise hang, so it's actually ok if it gets a -EFAULT instead of
hanging. But it would also be ok to phase out that get_user_pages()
completely and have ptrace hang on the userfault (the hang can be
resolved by sending SIGKILL to gdb or whatever process is calling
ptrace). We could also decide to retain the current -EFAULT behavior
of ptrace by using get_user_pages_locked with a NULL locked parameter
so the FAULT_FLAG_ALLOW_RETRY flag will not be set. Either way would
be safe.
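
For instance, that last option would look roughly like the sketch
below at the ptrace call site; the exact get_user_pages_locked()
prototype isn't quoted in this message, so the signature used here is
an assumption:

/* Kernel-side sketch, not a complete patch: with a NULL "locked"
 * pointer FAULT_FLAG_ALLOW_RETRY is never set, so a fault on a
 * userfaultfd-registered range fails with -EFAULT instead of
 * blocking, preserving the current ptrace behavior. */
ret = get_user_pages_locked(tsk, mm, addr, 1, write, 0 /* force */,
			    &page, NULL /* locked */);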

Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/userfaultfd.c | 411 +++++++++++++++++++++++++++-----------------
include/linux/mm.h | 2 +-
include/linux/mm_types.h | 11 ++
include/linux/userfaultfd.h | 19 +-
mm/madvise.c | 3 +-
mm/mempolicy.c | 4 +-
mm/mlock.c | 3 +-
mm/mmap.c | 39 +++--
mm/mprotect.c | 3 +-
9 files changed, 320 insertions(+), 175 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 2667d0d..49bbd3b 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -23,6 +23,7 @@
#include <linux/anon_inodes.h>
#include <linux/syscalls.h>
#include <linux/userfaultfd.h>
+#include <linux/mempolicy.h>

struct userfaultfd_ctx {
/* pseudo fd refcounting */
@@ -37,6 +38,8 @@ struct userfaultfd_ctx {
unsigned int state;
/* released */
bool released;
+ /* mm with one or more vmas attached to this userfaultfd_ctx */
+ struct mm_struct *mm;
};

struct userfaultfd_wait_queue {
@@ -49,6 +52,10 @@ struct userfaultfd_wait_queue {
#define USERFAULTFD_PROTOCOL ((__u64) 0xaa)
#define USERFAULTFD_UNKNOWN_PROTOCOL ((__u64) -1ULL)

+#define USERFAULTFD_RANGE_REGISTER ((__u64) 0x1)
+#define USERFAULTFD_RANGE_UNREGISTER ((__u64) 0x2)
+#define USERFAULTFD_RANGE_MASK (~((__u64) 0x3))
+
enum {
USERFAULTFD_STATE_ASK_PROTOCOL,
USERFAULTFD_STATE_ACK_PROTOCOL,
@@ -56,43 +63,6 @@ enum {
USERFAULTFD_STATE_RUNNING,
};

-/**
- * struct mm_slot - userlandfd information per mm that is being scanned
- * @link: link to the mm_slots hash list
- * @mm: the mm that this information is valid for
- * @ctx: userfaultfd context for this mm
- */
-struct mm_slot {
- struct hlist_node link;
- struct mm_struct *mm;
- struct userfaultfd_ctx ctx;
- struct rcu_head rcu_head;
-};
-
-#define MM_USERLANDFD_HASH_BITS 10
-static DEFINE_HASHTABLE(mm_userlandfd_hash, MM_USERLANDFD_HASH_BITS);
-
-static DEFINE_MUTEX(mm_userlandfd_mutex);
-
-static struct mm_slot *get_mm_slot(struct mm_struct *mm)
-{
- struct mm_slot *slot;
-
- hash_for_each_possible_rcu(mm_userlandfd_hash, slot, link,
- (unsigned long)mm)
- if (slot->mm == mm)
- return slot;
-
- return NULL;
-}
-
-static void insert_to_mm_userlandfd_hash(struct mm_struct *mm,
- struct mm_slot *mm_slot)
-{
- mm_slot->mm = mm;
- hash_add_rcu(mm_userlandfd_hash, &mm_slot->link, (unsigned long)mm);
-}
-
static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
int wake_flags, void *key)
{
@@ -122,30 +92,10 @@ out:
*
* Returns: In case of success, returns not zero.
*/
-static int userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
+static void userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
{
- /*
- * If it's already released don't get it. This can race
- * against userfaultfd_release, if the race triggers it'll be
- * handled safely by the handle_userfault main loop
- * (userfaultfd_release will take the mmap_sem for writing to
- * flush out all in-flight userfaults). This check is only an
- * optimization.
- */
- if (unlikely(ACCESS_ONCE(ctx->released)))
- return 0;
- return atomic_inc_not_zero(&ctx->refcount);
-}
-
-static void userfaultfd_free(struct userfaultfd_ctx *ctx)
-{
- struct mm_slot *mm_slot = container_of(ctx, struct mm_slot, ctx);
-
- mutex_lock(&mm_userlandfd_mutex);
- hash_del_rcu(&mm_slot->link);
- mutex_unlock(&mm_userlandfd_mutex);
-
- kfree_rcu(mm_slot, rcu_head);
+ if (!atomic_inc_not_zero(&ctx->refcount))
+ BUG();
}

/**
@@ -158,8 +108,10 @@ static void userfaultfd_free(struct userfaultfd_ctx *ctx)
*/
static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
{
- if (atomic_dec_and_test(&ctx->refcount))
- userfaultfd_free(ctx);
+ if (atomic_dec_and_test(&ctx->refcount)) {
+ mmdrop(ctx->mm);
+ kfree(ctx);
+ }
}

/*
@@ -181,25 +133,55 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags)
{
struct mm_struct *mm = vma->vm_mm;
- struct mm_slot *slot;
struct userfaultfd_ctx *ctx;
struct userfaultfd_wait_queue uwq;
- int ret;

BUG_ON(!rwsem_is_locked(&mm->mmap_sem));

- rcu_read_lock();
- slot = get_mm_slot(mm);
- if (!slot) {
- rcu_read_unlock();
+ ctx = vma->vm_userfaultfd_ctx.ctx;
+ if (!ctx)
return VM_FAULT_SIGBUS;
- }
- ctx = &slot->ctx;
- if (!userfaultfd_ctx_get(ctx)) {
- rcu_read_unlock();
+
+ BUG_ON(ctx->mm != mm);
+
+ /*
+ * If it's already released don't get it. This avoids looping
+ * in __get_user_pages if userfaultfd_release waits on the
+ * caller of handle_userfault to release the mmap_sem.
+ */
+ if (unlikely(ACCESS_ONCE(ctx->released)))
+ return VM_FAULT_SIGBUS;
+
+ /* check that we can return VM_FAULT_RETRY */
+ if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
+ /*
+ * Validate the invariant that nowait must allow retry
+ * to be sure not to return SIGBUS erroneously on
+ * nowait invocations.
+ */
+ BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
+#ifdef CONFIG_DEBUG_VM
+ if (printk_ratelimit()) {
+ printk(KERN_WARNING
+ "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
+ dump_stack();
+ }
+#endif
return VM_FAULT_SIGBUS;
}
- rcu_read_unlock();
+
+ /*
+ * Handle nowait, not much to do other than tell it to retry
+ * and wait.
+ */
+ if (flags & FAULT_FLAG_RETRY_NOWAIT)
+ return VM_FAULT_RETRY;
+
+ /* take the reference before dropping the mmap_sem */
+ userfaultfd_ctx_get(ctx);
+
+ /* be gentle and immediately relinquish the mmap_sem */
+ up_read(&mm->mmap_sem);

init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
uwq.wq.private = current;
@@ -214,60 +196,15 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
*/
__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
for (;;) {
- set_current_state(TASK_INTERRUPTIBLE);
- if (fatal_signal_pending(current)) {
- /*
- * If we have to fail because the task is
- * killed just retry the fault either by
- * returning to userland or through
- * VM_FAULT_RETRY if we come from a page fault
- * and a fatal signal is pending.
- */
- ret = 0;
- if (flags & FAULT_FLAG_KILLABLE) {
- /*
- * If FAULT_FLAG_KILLABLE is set we
- * and there's a fatal signal pending
- * can return VM_FAULT_RETRY
- * regardless if
- * FAULT_FLAG_ALLOW_RETRY is set or
- * not as long as we release the
- * mmap_sem. The page fault will
- * return stright to userland then to
- * handle the fatal signal.
- */
- up_read(&mm->mmap_sem);
- ret = VM_FAULT_RETRY;
- }
- break;
- }
- if (!uwq.pending || ACCESS_ONCE(ctx->released)) {
- ret = 0;
- if (flags & FAULT_FLAG_ALLOW_RETRY) {
- ret = VM_FAULT_RETRY;
- if (!(flags & FAULT_FLAG_RETRY_NOWAIT))
- up_read(&mm->mmap_sem);
- }
- break;
- }
- if (((FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT) &
- flags) ==
- (FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT)) {
- ret = VM_FAULT_RETRY;
- /*
- * The mmap_sem must not be released if
- * FAULT_FLAG_RETRY_NOWAIT is set despite we
- * return VM_FAULT_RETRY (FOLL_NOWAIT case).
- */
+ set_current_state(TASK_KILLABLE);
+ if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
+ fatal_signal_pending(current))
break;
- }
spin_unlock(&ctx->fault_wqh.lock);
- up_read(&mm->mmap_sem);

wake_up_poll(&ctx->fd_wqh, POLLIN);
schedule();

- down_read(&mm->mmap_sem);
spin_lock(&ctx->fault_wqh.lock);
}
__remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
@@ -276,30 +213,53 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,

/*
* ctx may go away after this if the userfault pseudo fd is
- * released by another CPU.
+ * already released.
*/
userfaultfd_ctx_put(ctx);

- return ret;
+ return VM_FAULT_RETRY;
}

static int userfaultfd_release(struct inode *inode, struct file *file)
{
struct userfaultfd_ctx *ctx = file->private_data;
- struct mm_slot *mm_slot = container_of(ctx, struct mm_slot, ctx);
+ struct mm_struct *mm = ctx->mm;
+ struct vm_area_struct *vma, *prev;
__u64 range[2] = { 0ULL, -1ULL };

ACCESS_ONCE(ctx->released) = true;

/*
- * Flush page faults out of all CPUs to avoid race conditions
- * against ctx->released. All page faults must be retried
- * without returning VM_FAULT_SIGBUS if the get_mm_slot and
- * userfaultfd_ctx_get both succeeds but ctx->released is set.
+ * Flush page faults out of all CPUs. NOTE: all page faults
+ * must be retried without returning VM_FAULT_SIGBUS if
+ * userfaultfd_ctx_get() succeeds but vma->vm_userfaultfd_ctx
+ * changes while handle_userfault released the mmap_sem. So
+ * it's critical that released is set to true (above), before
+ * taking the mmap_sem for writing.
*/
- down_write(&mm_slot->mm->mmap_sem);
- up_write(&mm_slot->mm->mmap_sem);
+ down_write(&mm->mmap_sem);
+ prev = NULL;
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (vma->vm_userfaultfd_ctx.ctx != ctx)
+ continue;
+ prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
+ vma->vm_flags, vma->anon_vma,
+ vma->vm_file, vma->vm_pgoff,
+ vma_policy(vma),
+ NULL_VM_USERFAULTFD_CTX);
+ if (prev)
+ vma = prev;
+ else
+ prev = vma;
+ vma->vm_userfaultfd_ctx = NULL_VM_USERFAULTFD_CTX;
+ }
+ up_write(&mm->mmap_sem);

+ /*
+ * After no new page faults can wait on this fault_wqh, flush
+ * the last page faults that may have been already waiting on
+ * the fault_wqh.
+ */
spin_lock(&ctx->fault_wqh.lock);
__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
spin_unlock(&ctx->fault_wqh.lock);
@@ -454,6 +414,140 @@ static int wake_userfault(struct userfaultfd_ctx *ctx, __u64 *range)
return ret;
}

+static ssize_t userfaultfd_range_register(struct userfaultfd_ctx *ctx,
+ unsigned long start,
+ unsigned long end)
+{
+ struct mm_struct *mm = ctx->mm;
+ struct vm_area_struct *vma, *prev;
+ int ret;
+
+ down_write(&mm->mmap_sem);
+ vma = find_vma(mm, start);
+ if (!vma)
+ return -ENOMEM;
+ if (vma->vm_start >= end)
+ return -EINVAL;
+
+ prev = vma->vm_prev;
+ if (vma->vm_start < start)
+ prev = vma;
+
+ ret = 0;
+ /* we got an overlap so start the splitting */
+ do {
+ if (vma->vm_userfaultfd_ctx.ctx == ctx)
+ goto next;
+ if (vma->vm_userfaultfd_ctx.ctx) {
+ ret = -EBUSY;
+ break;
+ }
+ prev = vma_merge(mm, prev, start, end, vma->vm_flags,
+ vma->anon_vma, vma->vm_file, vma->vm_pgoff,
+ vma_policy(vma),
+ ((struct vm_userfaultfd_ctx){ ctx }));
+ if (prev) {
+ vma = prev;
+ vma->vm_userfaultfd_ctx.ctx = ctx;
+ goto next;
+ }
+ if (vma->vm_start < start) {
+ ret = split_vma(mm, vma, start, 1);
+ if (ret < 0)
+ break;
+ }
+ if (vma->vm_end > end) {
+ ret = split_vma(mm, vma, end, 0);
+ if (ret < 0)
+ break;
+ }
+ vma->vm_userfaultfd_ctx.ctx = ctx;
+ next:
+ start = vma->vm_end;
+ vma = vma->vm_next;
+ } while (vma && vma->vm_start < end);
+ up_write(&mm->mmap_sem);
+
+ return ret;
+}
+
+static ssize_t userfaultfd_range_unregister(struct userfaultfd_ctx *ctx,
+ unsigned long start,
+ unsigned long end)
+{
+ struct mm_struct *mm = ctx->mm;
+ struct vm_area_struct *vma, *prev;
+ int ret;
+
+ down_write(&mm->mmap_sem);
+ vma = find_vma(mm, start);
+ if (!vma)
+ return -ENOMEM;
+ if (vma->vm_start >= end)
+ return -EINVAL;
+
+ prev = vma->vm_prev;
+ if (vma->vm_start < start)
+ prev = vma;
+
+ ret = 0;
+ /* we got an overlap so start the splitting */
+ do {
+ if (!vma->vm_userfaultfd_ctx.ctx)
+ goto next;
+ if (vma->vm_userfaultfd_ctx.ctx != ctx) {
+ ret = -EBUSY;
+ break;
+ }
+ prev = vma_merge(mm, prev, start, end, vma->vm_flags,
+ vma->anon_vma, vma->vm_file, vma->vm_pgoff,
+ vma_policy(vma),
+ NULL_VM_USERFAULTFD_CTX);
+ if (prev) {
+ vma = prev;
+ vma->vm_userfaultfd_ctx = NULL_VM_USERFAULTFD_CTX;
+ goto next;
+ }
+ if (vma->vm_start < start) {
+ ret = split_vma(mm, vma, start, 1);
+ if (ret < 0)
+ break;
+ }
+ if (vma->vm_end > end) {
+ ret = split_vma(mm, vma, end, 0);
+ if (ret < 0)
+ break;
+ }
+ vma->vm_userfaultfd_ctx.ctx = NULL;
+ next:
+ start = vma->vm_end;
+ vma = vma->vm_next;
+ } while (vma && vma->vm_start < end);
+ up_write(&mm->mmap_sem);
+
+ return ret;
+}
+
+static ssize_t userfaultfd_handle_range(struct userfaultfd_ctx *ctx,
+ __u64 *range)
+{
+ unsigned long start, end;
+
+ start = range[0] & USERFAULTFD_RANGE_MASK;
+ end = range[1];
+ BUG_ON(end <= start);
+ if (end > TASK_SIZE)
+ return -ENOMEM;
+
+ if (range[0] & USERFAULTFD_RANGE_REGISTER) {
+ BUG_ON(range[0] & USERFAULTFD_RANGE_UNREGISTER);
+ return userfaultfd_range_register(ctx, start, end);
+ } else {
+ BUG_ON(!(range[0] & USERFAULTFD_RANGE_UNREGISTER));
+ return userfaultfd_range_unregister(ctx, start, end);
+ }
+}
+
static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
@@ -483,9 +577,24 @@ static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
return -EINVAL;
if (copy_from_user(&range, buf, sizeof(range)))
return -EFAULT;
- if (range[0] >= range[1])
+ /* the range mask requires 2 bits */
+ BUILD_BUG_ON(PAGE_SHIFT < 2);
+ if (range[0] & ~PAGE_MASK & USERFAULTFD_RANGE_MASK)
+ return -EINVAL;
+ if ((range[0] & ~USERFAULTFD_RANGE_MASK) == ~USERFAULTFD_RANGE_MASK)
+ return -EINVAL;
+ if (range[1] & ~PAGE_MASK)
+ return -EINVAL;
+ if ((range[0] & PAGE_MASK) >= (range[1] & PAGE_MASK))
return -ERANGE;

+ /* handle the register/unregister commands */
+ if (range[0] & ~USERFAULTFD_RANGE_MASK) {
+ ssize_t ret = userfaultfd_handle_range(ctx, range);
+ BUG_ON(ret > 0);
+ return ret < 0 ? ret : sizeof(range);
+ }
+
/* always take the fd_wqh lock before the fault_wqh lock */
if (find_userfault(ctx, NULL, POLLOUT))
if (!wake_userfault(ctx, range))
@@ -552,7 +661,9 @@ static const struct file_operations userfaultfd_fops = {
static struct file *userfaultfd_file_create(int flags)
{
struct file *file;
- struct mm_slot *mm_slot;
+ struct userfaultfd_ctx *ctx;
+
+ BUG_ON(!current->mm);

/* Check the UFFD_* constants for consistency. */
BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
@@ -562,33 +673,25 @@ static struct file *userfaultfd_file_create(int flags)
if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
goto out;

- mm_slot = kmalloc(sizeof(*mm_slot), GFP_KERNEL);
+ ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
file = ERR_PTR(-ENOMEM);
- if (!mm_slot)
+ if (!ctx)
goto out;

- mutex_lock(&mm_userlandfd_mutex);
- file = ERR_PTR(-EBUSY);
- if (get_mm_slot(current->mm))
- goto out_free_unlock;
-
- atomic_set(&mm_slot->ctx.refcount, 1);
- init_waitqueue_head(&mm_slot->ctx.fault_wqh);
- init_waitqueue_head(&mm_slot->ctx.fd_wqh);
- mm_slot->ctx.flags = flags;
- mm_slot->ctx.state = USERFAULTFD_STATE_ASK_PROTOCOL;
- mm_slot->ctx.released = false;
-
- file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops,
- &mm_slot->ctx,
+ atomic_set(&ctx->refcount, 1);
+ init_waitqueue_head(&ctx->fault_wqh);
+ init_waitqueue_head(&ctx->fd_wqh);
+ ctx->flags = flags;
+ ctx->state = USERFAULTFD_STATE_ASK_PROTOCOL;
+ ctx->released = false;
+ ctx->mm = current->mm;
+ /* prevent the mm struct from being freed */
+ atomic_inc(&ctx->mm->mm_count);
+
+ file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, ctx,
O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
if (IS_ERR(file))
- out_free_unlock:
- kfree(mm_slot);
- else
- insert_to_mm_userlandfd_hash(current->mm,
- mm_slot);
- mutex_unlock(&mm_userlandfd_mutex);
+ kfree(ctx);
out:
return file;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 71dbe03..cd60938 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1779,7 +1779,7 @@ extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
extern struct vm_area_struct *vma_merge(struct mm_struct *,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
- struct mempolicy *);
+ struct mempolicy *, struct vm_userfaultfd_ctx);
extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
extern int split_vma(struct mm_struct *,
struct vm_area_struct *, unsigned long addr, int new_below);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2c876d1..bb78fa8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -238,6 +238,16 @@ struct vm_region {
* this region */
};

+#ifdef CONFIG_USERFAULTFD
+#define NULL_VM_USERFAULTFD_CTX ((struct vm_userfaultfd_ctx) { NULL, })
+struct vm_userfaultfd_ctx {
+ struct userfaultfd_ctx *ctx;
+};
+#else /* CONFIG_USERFAULTFD */
+#define NULL_VM_USERFAULTFD_CTX ((struct vm_userfaultfd_ctx) {})
+struct vm_userfaultfd_ctx {};
+#endif /* CONFIG_USERFAULTFD */
+
/*
* This struct defines a memory VMM memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
@@ -308,6 +318,7 @@ struct vm_area_struct {
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
};

struct core_thread {
diff --git a/include/linux/userfaultfd.h b/include/linux/userfaultfd.h
index b7caef5..25f49db 100644
--- a/include/linux/userfaultfd.h
+++ b/include/linux/userfaultfd.h
@@ -29,14 +29,27 @@
int handle_userfault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags);

+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+ struct vm_userfaultfd_ctx vm_ctx)
+{
+ return vma->vm_userfaultfd_ctx.ctx == vm_ctx.ctx;
+}
+
#else /* CONFIG_USERFAULTFD */

-static int handle_userfault(struct vm_area_struct *vma, unsigned long address,
- unsigned int flags)
+static inline int handle_userfault(struct vm_area_struct *vma,
+ unsigned long address,
+ unsigned int flags)
{
return VM_FAULT_SIGBUS;
}

-#endif
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+ struct vm_userfaultfd_ctx vm_ctx)
+{
+ return true;
+}
+
+#endif /* CONFIG_USERFAULTFD */

#endif /* _LINUX_USERFAULTFD_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index 24620c0..4bb9a68 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -117,7 +117,8 @@ static long madvise_behavior(struct vm_area_struct *vma,

pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
- vma->vm_file, pgoff, vma_policy(vma));
+ vma->vm_file, pgoff, vma_policy(vma),
+ vma->vm_userfaultfd_ctx);
if (*prev) {
vma = *prev;
goto success;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8f5330d..bf54e9c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -769,8 +769,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
pgoff = vma->vm_pgoff +
((vmstart - vma->vm_start) >> PAGE_SHIFT);
prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
- vma->anon_vma, vma->vm_file, pgoff,
- new_pol);
+ vma->anon_vma, vma->vm_file, pgoff,
+ new_pol, vma->vm_userfaultfd_ctx);
if (prev) {
vma = prev;
next = vma->vm_next;
diff --git a/mm/mlock.c b/mm/mlock.c
index ce84cb0..ccb537e 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -566,7 +566,8 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,

pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
- vma->vm_file, pgoff, vma_policy(vma));
+ vma->vm_file, pgoff, vma_policy(vma),
+ vma->vm_userfaultfd_ctx);
if (*prev) {
vma = *prev;
goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index c0a3637..303f45b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -41,6 +41,7 @@
#include <linux/notifier.h>
#include <linux/memory.h>
#include <linux/printk.h>
+#include <linux/userfaultfd.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -901,7 +902,8 @@ again: remove_next = 1 + (end > next->vm_end);
* per-vma resources, so we don't attempt to merge those.
*/
static inline int is_mergeable_vma(struct vm_area_struct *vma,
- struct file *file, unsigned long vm_flags)
+ struct file *file, unsigned long vm_flags,
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
{
/*
* VM_SOFTDIRTY should not prevent from VMA merging, if we
@@ -917,6 +919,8 @@ static inline int is_mergeable_vma(struct vm_area_struct *vma,
return 0;
if (vma->vm_ops && vma->vm_ops->close)
return 0;
+ if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
+ return 0;
return 1;
}

@@ -947,9 +951,11 @@ static inline int is_mergeable_anon_vma(struct anon_vma *anon_vma1,
*/
static int
can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
- struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
+ struct anon_vma *anon_vma, struct file *file,
+ pgoff_t vm_pgoff,
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
{
- if (is_mergeable_vma(vma, file, vm_flags) &&
+ if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
if (vma->vm_pgoff == vm_pgoff)
return 1;
@@ -966,9 +972,11 @@ can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
*/
static int
can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
- struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
+ struct anon_vma *anon_vma, struct file *file,
+ pgoff_t vm_pgoff,
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
{
- if (is_mergeable_vma(vma, file, vm_flags) &&
+ if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
pgoff_t vm_pglen;
vm_pglen = vma_pages(vma);
@@ -1011,7 +1019,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
struct vm_area_struct *prev, unsigned long addr,
unsigned long end, unsigned long vm_flags,
struct anon_vma *anon_vma, struct file *file,
- pgoff_t pgoff, struct mempolicy *policy)
+ pgoff_t pgoff, struct mempolicy *policy,
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
{
pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
struct vm_area_struct *area, *next;
@@ -1038,14 +1047,17 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
if (prev && prev->vm_end == addr &&
mpol_equal(vma_policy(prev), policy) &&
can_vma_merge_after(prev, vm_flags,
- anon_vma, file, pgoff)) {
+ anon_vma, file, pgoff,
+ vm_userfaultfd_ctx)) {
/*
* OK, it can. Can we now merge in the successor as well?
*/
if (next && end == next->vm_start &&
mpol_equal(policy, vma_policy(next)) &&
can_vma_merge_before(next, vm_flags,
- anon_vma, file, pgoff+pglen) &&
+ anon_vma, file,
+ pgoff+pglen,
+ vm_userfaultfd_ctx) &&
is_mergeable_anon_vma(prev->anon_vma,
next->anon_vma, NULL)) {
/* cases 1, 6 */
@@ -1066,7 +1078,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
if (next && end == next->vm_start &&
mpol_equal(policy, vma_policy(next)) &&
can_vma_merge_before(next, vm_flags,
- anon_vma, file, pgoff+pglen)) {
+ anon_vma, file, pgoff+pglen,
+ vm_userfaultfd_ctx)) {
if (prev && addr < prev->vm_end) /* case 4 */
err = vma_adjust(prev, prev->vm_start,
addr, prev->vm_pgoff, NULL);
@@ -1548,7 +1561,8 @@ munmap_back:
/*
* Can we just expand an old mapping?
*/
- vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
+ vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
+ NULL, file, pgoff, NULL, NULL_VM_USERFAULTFD_CTX);
if (vma)
goto out;

@@ -2670,7 +2684,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)

/* Can we just expand an old private anonymous mapping? */
vma = vma_merge(mm, prev, addr, addr + len, flags,
- NULL, NULL, pgoff, NULL);
+ NULL, NULL, pgoff, NULL, NULL_VM_USERFAULTFD_CTX);
if (vma)
goto out;

@@ -2829,7 +2843,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
return NULL; /* should never get here */
new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
- vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
+ vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+ vma->vm_userfaultfd_ctx);
if (new_vma) {
/*
* Source vma may have been merged into new_vma
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557..2ee5aa7 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -294,7 +294,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
*/
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*pprev = vma_merge(mm, *pprev, start, end, newflags,
- vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
+ vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+ vma->vm_userfaultfd_ctx);
if (*pprev) {
vma = *pprev;
goto success;

2014-10-03 18:15:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 01/17] mm: gup: add FOLL_TRIED

This needs more explanation than that one-liner comment. Make the
commit message explain why the new FOLL_TRIED flag exists.

Linus

On Fri, Oct 3, 2014 at 10:07 AM, Andrea Arcangeli <[email protected]> wrote:
> From: Andres Lagar-Cavilla <[email protected]>
>
> Reviewed-by: Radim Krčmář <[email protected]>
> Signed-off-by: Andres Lagar-Cavilla <[email protected]>
> Signed-off-by: Andrea Arcangeli <[email protected]>

2014-10-03 18:22:46

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH 08/17] mm: madvise MADV_USERFAULT

MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
userland touches a still unmapped virtual address, a sigbus signal is
sent instead of allocating a new page. The sigbus signal handler will
then resolve the page fault in userland by calling the
remap_anon_pages syscall.
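
As a rough userland sketch of that flow (the function names below are
made up, the SIGBUS handler body and the remap_anon_pages() call it
would make are left out, and the MADV_USERFAULT value is the one this
patch adds to the asm-generic header):

#include <stddef.h>
#include <signal.h>
#include <sys/mman.h>

#ifndef MADV_USERFAULT
#define MADV_USERFAULT 18	/* value this patch uses on most arches */
#endif

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	/* si->si_addr is the faulting address: the handler would fetch
	 * the page contents and map them with remap_anon_pages(). */
}

static void *setup_userfault_area(size_t size)
{
	struct sigaction sa = {
		.sa_sigaction = sigbus_handler,
		.sa_flags = SA_SIGINFO,
	};
	void *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (area == MAP_FAILED)
		return NULL;
	sigaction(SIGBUS, &sa, NULL);
	if (madvise(area, size, MADV_USERFAULT)) {
		munmap(area, size);
		return NULL;
	}
	return area;
}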

This functionality is needed to reliably implement postcopy live
migration in KVM (without having to use a special chardevice that
would disable all advanced Linux VM features, like swapping, KSM, THP,
automatic NUMA balancing, etc...).

MADV_USERFAULT could also be used to offload parts of anonymous memory
regions to remote nodes or to implement network distributed shared
memory.

Here I enlarged the vm_flags to 64bit as we have run out of bits (a
noop on 64bit kernels). An alternative is to find some combination of flags
that are mutually exclusive if set.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/alpha/include/uapi/asm/mman.h | 3 ++
arch/mips/include/uapi/asm/mman.h | 3 ++
arch/parisc/include/uapi/asm/mman.h | 3 ++
arch/xtensa/include/uapi/asm/mman.h | 3 ++
fs/proc/task_mmu.c | 1 +
include/linux/mm.h | 1 +
include/uapi/asm-generic/mman-common.h | 3 ++
mm/huge_memory.c | 60 +++++++++++++++++++++-------------
mm/madvise.c | 17 ++++++++++
mm/memory.c | 13 ++++++++
10 files changed, 85 insertions(+), 22 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 0086b47..a10313c 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -60,6 +60,9 @@
overrides the coredump filter bits */
#define MADV_DODUMP 17 /* Clear the MADV_NODUMP flag */

+#define MADV_USERFAULT 18 /* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19 /* Don't trigger user faults */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index cfcb876..d9d11a4 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -84,6 +84,9 @@
overrides the coredump filter bits */
#define MADV_DODUMP 17 /* Clear the MADV_NODUMP flag */

+#define MADV_USERFAULT 18 /* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19 /* Don't trigger user faults */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 294d251..7bc7b7b 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -66,6 +66,9 @@
overrides the coredump filter bits */
#define MADV_DODUMP 70 /* Clear the MADV_NODUMP flag */

+#define MADV_USERFAULT 71 /* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 72 /* Don't trigger user faults */
+
/* compatibility flags */
#define MAP_FILE 0
#define MAP_VARIABLE 0
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 00eed67..5448d88 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -90,6 +90,9 @@
overrides the coredump filter bits */
#define MADV_DODUMP 17 /* Clear the MADV_NODUMP flag */

+#define MADV_USERFAULT 18 /* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19 /* Don't trigger user faults */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee1c3a2..6033cb8 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -568,6 +568,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
[ilog2(VM_HUGEPAGE)] = "hg",
[ilog2(VM_NOHUGEPAGE)] = "nh",
[ilog2(VM_MERGEABLE)] = "mg",
+ [ilog2(VM_USERFAULT)] = "uf",
};
size_t i;

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8900ba9..bf3df07 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -139,6 +139,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_HUGEPAGE 0x20000000 /* MADV_HUGEPAGE marked this vma */
#define VM_NOHUGEPAGE 0x40000000 /* MADV_NOHUGEPAGE marked this vma */
#define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
+#define VM_USERFAULT 0x100000000ULL /* Trigger user faults if not mapped */

#if defined(CONFIG_X86)
# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ddc3b36..dbf1e70 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -52,6 +52,9 @@
overrides the coredump filter bits */
#define MADV_DODUMP 17 /* Clear the MADV_DONTDUMP flag */

+#define MADV_USERFAULT 18 /* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19 /* Don't trigger user faults */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e913a19..b402d60 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -721,12 +721,16 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,

VM_BUG_ON_PAGE(!PageCompound(page), page);

- if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg))
- return VM_FAULT_OOM;
+ if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg)) {
+ put_page(page);
+ count_vm_event(THP_FAULT_FALLBACK);
+ return VM_FAULT_FALLBACK;
+ }

pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable)) {
mem_cgroup_cancel_charge(page, memcg);
+ put_page(page);
return VM_FAULT_OOM;
}

@@ -746,6 +750,16 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
pte_free(mm, pgtable);
} else {
pmd_t entry;
+
+ /* Deliver the page fault to userland */
+ if (vma->vm_flags & VM_USERFAULT) {
+ spin_unlock(ptl);
+ mem_cgroup_cancel_charge(page, memcg);
+ put_page(page);
+ pte_free(mm, pgtable);
+ return VM_FAULT_SIGBUS;
+ }
+
entry = mk_huge_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
page_add_new_anon_rmap(page, vma, haddr);
@@ -756,6 +770,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
atomic_long_inc(&mm->nr_ptes);
spin_unlock(ptl);
+ count_vm_event(THP_FAULT_ALLOC);
}

return 0;
@@ -776,20 +791,17 @@ static inline struct page *alloc_hugepage_vma(int defrag,
}

/* Caller must hold page table lock. */
-static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
struct page *zero_page)
{
pmd_t entry;
- if (!pmd_none(*pmd))
- return false;
entry = mk_pmd(zero_page, vma->vm_page_prot);
entry = pmd_wrprotect(entry);
entry = pmd_mkhuge(entry);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, haddr, pmd, entry);
atomic_long_inc(&mm->nr_ptes);
- return true;
}

int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -811,6 +823,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
pgtable_t pgtable;
struct page *zero_page;
bool set;
+ int ret;
pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable))
return VM_FAULT_OOM;
@@ -821,14 +834,24 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_FALLBACK;
}
ptl = pmd_lock(mm, pmd);
- set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
- zero_page);
+ ret = 0;
+ set = false;
+ if (pmd_none(*pmd)) {
+ if (vma->vm_flags & VM_USERFAULT)
+ ret = VM_FAULT_SIGBUS;
+ else {
+ set_huge_zero_page(pgtable, mm, vma,
+ haddr, pmd,
+ zero_page);
+ set = true;
+ }
+ }
spin_unlock(ptl);
if (!set) {
pte_free(mm, pgtable);
put_huge_zero_page();
}
- return 0;
+ return ret;
}
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
vma, haddr, numa_node_id(), 0);
@@ -836,14 +859,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}
- if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
- put_page(page);
- count_vm_event(THP_FAULT_FALLBACK);
- return VM_FAULT_FALLBACK;
- }
-
- count_vm_event(THP_FAULT_ALLOC);
- return 0;
+ return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
}

int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -878,16 +894,14 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
*/
if (is_huge_zero_pmd(pmd)) {
struct page *zero_page;
- bool set;
/*
* get_huge_zero_page() will never allocate a new page here,
* since we already have a zero page to copy. It just takes a
* reference.
*/
zero_page = get_huge_zero_page();
- set = set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+ set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
zero_page);
- BUG_ON(!set); /* unexpected !pmd_none(dst_pmd) */
ret = 0;
goto out_unlock;
}
@@ -2148,7 +2162,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
_pte++, address += PAGE_SIZE) {
pte_t pteval = *_pte;
if (pte_none(pteval)) {
- if (++none <= khugepaged_max_ptes_none)
+ if (!(vma->vm_flags & VM_USERFAULT) &&
+ ++none <= khugepaged_max_ptes_none)
continue;
else
goto out;
@@ -2569,7 +2584,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
_pte++, _address += PAGE_SIZE) {
pte_t pteval = *_pte;
if (pte_none(pteval)) {
- if (++none <= khugepaged_max_ptes_none)
+ if (!(vma->vm_flags & VM_USERFAULT) &&
+ ++none <= khugepaged_max_ptes_none)
continue;
else
goto out_unmap;
diff --git a/mm/madvise.c b/mm/madvise.c
index d5aee71..24620c0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -93,6 +93,21 @@ static long madvise_behavior(struct vm_area_struct *vma,
if (error)
goto out;
break;
+ case MADV_USERFAULT:
+ if (vma->vm_ops) {
+ error = -EINVAL;
+ goto out;
+ }
+ new_flags |= VM_USERFAULT;
+ break;
+ case MADV_NOUSERFAULT:
+ if (vma->vm_ops) {
+ WARN_ON(new_flags & VM_USERFAULT);
+ error = -EINVAL;
+ goto out;
+ }
+ new_flags &= ~VM_USERFAULT;
+ break;
}

if (new_flags == vma->vm_flags) {
@@ -408,6 +423,8 @@ madvise_behavior_valid(int behavior)
case MADV_HUGEPAGE:
case MADV_NOHUGEPAGE:
#endif
+ case MADV_USERFAULT:
+ case MADV_NOUSERFAULT:
case MADV_DONTDUMP:
case MADV_DODUMP:
return 1;
diff --git a/mm/memory.c b/mm/memory.c
index e229970..16e4c8a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2645,6 +2645,11 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!pte_none(*page_table))
goto unlock;
+ /* Deliver the page fault to userland, check inside PT lock */
+ if (vma->vm_flags & VM_USERFAULT) {
+ pte_unmap_unlock(page_table, ptl);
+ return VM_FAULT_SIGBUS;
+ }
goto setpte;
}

@@ -2672,6 +2677,14 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (!pte_none(*page_table))
goto release;

+ /* Deliver the page fault to userland, check inside PT lock */
+ if (vma->vm_flags & VM_USERFAULT) {
+ pte_unmap_unlock(page_table, ptl);
+ mem_cgroup_cancel_charge(page, memcg);
+ page_cache_release(page);
+ return VM_FAULT_SIGBUS;
+ }
+
inc_mm_counter_fast(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address);
mem_cgroup_commit_charge(page, memcg, false);

2014-10-03 18:23:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 04/17] mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious

On Fri, Oct 3, 2014 at 10:07 AM, Andrea Arcangeli <[email protected]> wrote:
> This teaches gup_fast and __gup_fast to re-enable irqs and
> cond_resched() if possible every BATCH_PAGES.

This is disgusting.

Many (most?) __gup_fast() users just want a single page, and the
stupid overhead of the multi-page version is already unnecessary.
This just makes things much worse.

Quite frankly, we should make a single-page version of __gup_fast(),
and convert existing users to use that. After that, the few multi-page
users could have this extra latency control stuff.

And yes, the single-page version of get_user_pages_fast() is actually
latency-critical. shared futexes hit it hard, and yes, I've seen this
in profiles.

Linus

2014-10-03 18:31:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

On Fri, Oct 3, 2014 at 10:08 AM, Andrea Arcangeli <[email protected]> wrote:
>
> Overall this looks a fairly small change to the rmap code, notably
> less intrusive than the nonlinear vmas created by remap_file_pages.

Considering that remap_file_pages() was an unmitigated disaster, and
-mm has a patch to remove it entirely, I'm not at all convinced this
is a good argument.

We thought remap_file_pages() was a good idea, and it really really
really wasn't. Almost nobody used it, why would the anonymous page
case be any different?

Linus

2014-10-03 20:56:36

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 01/17] mm: gup: add FOLL_TRIED

> This needs more explanation than that one-liner comment. Make the
> commit message explain why the new FOLL_TRIED flag exists.

This patch actually is extracted from a 3.18 commit in the KVM tree,
https://git.kernel.org/cgit/virt/kvm/kvm.git/commit/?h=next&id=234b239b.

Here is how that patch uses the flag:

/*
* The previous call has now waited on the IO. Now we can
* retry and complete. Pass TRIED to ensure we do not re
* schedule async IO (see e.g. filemap_fault).
*/
down_read(&mm->mmap_sem);
npages = __get_user_pages(tsk, mm, addr, 1, flags | FOLL_TRIED,
pagep, NULL, NULL);

2014-10-03 23:42:14

by Mike Hommey

[permalink] [raw]
Subject: Re: [PATCH 08/17] mm: madvise MADV_USERFAULT

On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> userland touches a still unmapped virtual address, a sigbus signal is
> sent instead of allocating a new page. The sigbus signal handler will
> then resolve the page fault in userland by calling the
> remap_anon_pages syscall.

What does "unmapped virtual address" mean in this context?

Mike

2014-10-04 13:13:36

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 12/17] mm: sys_remap_anon_pages

Andrea Arcangeli <[email protected]> writes:

> This new syscall will move anon pages across vmas, atomically and
> without touching the vmas.
>
> It only works on non shared anonymous pages because those can be
> relocated without generating non linear anon_vmas in the rmap code.

...

> It is an alternative to mremap.

Why a new syscall? Couldn't mremap do this transparently?

-Andi

--
[email protected] -- Speaking for myself only

2014-10-06 08:57:00

by Dr. David Alan Gilbert

[permalink] [raw]
Subject: Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

* Linus Torvalds ([email protected]) wrote:
> On Fri, Oct 3, 2014 at 10:08 AM, Andrea Arcangeli <[email protected]> wrote:
> >
> > Overall this looks a fairly small change to the rmap code, notably
> > less intrusive than the nonlinear vmas created by remap_file_pages.
>
> Considering that remap_file_pages() was an unmitigated disaster, and
> -mm has a patch to remove it entirely, I'm not at all convinced this
> is a good argument.
>
> We thought remap_file_pages() was a good idea, and it really really
> really wasn't. Almost nobody used it, why would the anonymous page
> case be any different?

I've posted code that uses this interface to qemu-devel and it works nicely;
so chalk up at least one user.

For the postcopy case I'm using it for, we need to place a page
atomically; some thread might try and access it, and it must either
1) get caught by userfault etc or
2) succeed in its access

and we'll have that happening somewhere between thousands and millions of times
to pages in no particular order, so we need to avoid creating millions of mappings.

Dave



>
> Linus
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK

2014-10-06 14:15:23

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 04/17] mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious

Hello,

On Fri, Oct 03, 2014 at 11:23:53AM -0700, Linus Torvalds wrote:
> On Fri, Oct 3, 2014 at 10:07 AM, Andrea Arcangeli <[email protected]> wrote:
> > This teaches gup_fast and __gup_fast to re-enable irqs and
> > cond_resched() if possible every BATCH_PAGES.
>
> This is disgusting.
>
> Many (most?) __gup_fast() users just want a single page, and the
> stupid overhead of the multi-page version is already unnecessary.
> This just makes things much worse.
>
> Quite frankly, we should make a single-page version of __gup_fast(),
> and convert existing users to use that. After that, the few multi-page
> users could have this extra latency control stuff.

Ok. I couldn't think of a better way to add the latency control other
than to reduce nr_pages in an outer loop instead of altering the inner
calls, but this is what I got after implementing it... If somebody has
a cleaner way to implement the latency control stuff that's welcome
and I'd be glad to replace it.

> And yes, the single-page version of get_user_pages_fast() is actually
> latency-critical. shared futexes hit it hard, and yes, I've seen this
> in profiles.

KVM would save a few cycles from a single-page version too. I just
thought further optimizations could be added later and this was better
than nothing.

Considering I've no better idea how to implement the latency control
stuff, for now I'll just drop this controversial patch, and I'll
convert those get_user_pages to gup_unlocked instead of converting
them to gup_fast, which is more than enough to obtain the mmap_sem
holding scalability improvement (that also solves the mmap_sem trouble
for the userfaultfd). gup_unlocked isn't as good as gup_fast but it's
at least better than the current get_user_pages().
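
To make that conversion concrete, here is a minimal sketch of what one
such call site change might look like; the get_user_pages_unlocked()
prototype comes from patches 1-6, which aren't quoted in this mail, so
treat the exact signature as an assumption:

/* Before: the call site holds mmap_sem across the whole operation,
 * including any I/O needed to fault the page in. */
down_read(&current->mm->mmap_sem);
ret = get_user_pages(current, current->mm, addr, 1, 1, 0, &page, NULL);
up_read(&current->mm->mmap_sem);

/* After: the helper takes and drops mmap_sem internally, so the
 * semaphore isn't held across the I/O and handle_userfault() gets the
 * FAULT_FLAG_ALLOW_RETRY behavior it needs on the first fault. */
ret = get_user_pages_unlocked(current, current->mm, addr, 1, 1, 0, &page);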

I got into this gup_fast latency control stuff purely because there
were a few get_user_pages that could have been converted to
get_user_pages_fast as they were using "current" and "current->mm" as the
first two parameters, except for the risk of disabling irq for
long. So I tried to do the right thing and fix gup_fast but I'll leave
this further optimization queued for later.

About the missing commit header for the other patch: Paolo already
replied to it. To clarify this a bit further, in short I expect the
FOLL_TRIED flag to be merged through the KVM git tree, which already
contains it. I'll add a comment to the commit header to specify
that. Sorry for the confusion about that patch.

Thanks,
Andrea

2014-10-06 16:42:54

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

Hello,

On Mon, Oct 06, 2014 at 09:55:41AM +0100, Dr. David Alan Gilbert wrote:
> * Linus Torvalds ([email protected]) wrote:
> > On Fri, Oct 3, 2014 at 10:08 AM, Andrea Arcangeli <[email protected]> wrote:
> > >
> > > Overall this looks a fairly small change to the rmap code, notably
> > > less intrusive than the nonlinear vmas created by remap_file_pages.
> >
> > Considering that remap_file_pages() was an unmitigated disaster, and
> > -mm has a patch to remove it entirely, I'm not at all convinced this
> > is a good argument.
> >
> > We thought remap_file_pages() was a good idea, and it really really
> > really wasn't. Almost nobody used it, why would the anonymous page
> > case be any different?
>
> I've posted code that uses this interface to qemu-devel and it works nicely;
> so chalk up at least one user.
>
> For the postcopy case I'm using it for, we need to place a page
> atomically; some thread might try and access it, and it must either
> 1) get caught by userfault etc or
> 2) succeed in its access
>
> and we'll have that happening somewhere between thousands and millions of times
> to pages in no particular order, so we need to avoid creating millions of mappings.

Yes, that's our current use case.

Of course if somebody has better ideas on how to resolve an anonymous
userfault they're welcome.

How to resolve a userfault is orthogonal to how to detect it, how to
notify userland about it and how to be notified when the userfault has
been resolved. The latter is what the userfault and userfaultfd
do. The former is what remap_anon_pages is used for, but we could use
something else too if there are better ways. mremap would clearly work
too, but it would be less strict (it could lead to silent data
corruption if there are bugs in the userland code), it would be slower
and it would eventually hit a -ENOMEM failure because there would be
too many vmas.

I could in theory drop remap_anon_pages from this patchset, but
without an optimal way to resolve a userfault, the rest isn't so
useful.

We're currently discussing on what would be the best way to resolve a
MAP_SHARED userfault on tmpfs in fact (that's not sorted yet), but so
far, it seems remap_anon_pages fits the bill for anonymous memory.

remap_anon_pages is not as problematic to maintain as remap_file_pages
for the reason explained in the commit header, but there are other
reasons: it doesn't require a special pte_file and it changes nothing of
how anonymous page faults work. All it requires is a loop to catch a
changed page->index (previously page->index couldn't change, now it
can, that's the only thing it changes).

remap_file_pages' complexity derives from not being allowed to change
page->index during a move because the mapcount may be bigger than 1,
while changing page->index is precisely what remap_anon_pages does.

As long as this "rmap preparation" is the only constraints that
remap_anon_pages introduces in terms of rmap, it looks a nice
not-too-intrusive solution to resolve anonymous userfaults
efficiently.

Introducing remap_anon_pages in fact doesn't reduce the
simplification derived from the removal of remap_file_pages.

Removing remap_anon_pages later, by contrast, would have no benefit
other than removing this very patch 10/17.

In short remap_anon_pages does this (heavily simplified):

pte = *src_pte;
*src_pte = 0;
pte_page(pte)->index = adjusted according to src_vma/dst_vma->vm_pgoff
*dst_pte = pte;

It guarantees not to modify the vmas and in turn it doesn't require to
take the mmap_sem for writing.

To use remap_anon_pages, each thread has to create its own temporary
vma with MADV_DONTFORK set on it (not formally required by the syscall's
strict checks, but then the application must never fork if
MADV_DONTFORK isn't set or remap_anon_pages could return -EBUSY:
there's no risk of silent data corruption even if the thread forks
without setting MADV_DONTFORK) as the source region where it receives
data through the network. Then, after the data is fully received,
remap_anon_pages atomically moves the page from the temporary vma to
the address where the userfault triggered (while other threads may be
attempting to access the userfault address too, thanks to
remap_anon_pages' atomic behavior they won't risk ever seeing partial
data coming from the network).

remap_anon_pages as a side effect creates a hole in the temporary
(source) vma, so the next recv() syscall receiving data from the
network will fault in a new anonymous page without requiring any
further malloc/free or other kind of vma mangling.
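
Put together as a very rough userland sketch: the remap_anon_pages()
calling convention and syscall number are assumptions here (patch
12/17 isn't quoted in this mail), and wait_for_userfault()/recv_full()
are made-up helpers standing in for the userfaultfd protocol read and
the network receive:

#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helpers: block until a userfault address is reported,
 * and receive exactly "len" bytes from the migration socket. */
extern unsigned long wait_for_userfault(void);
extern void recv_full(int sock, void *buf, size_t len);

static void postcopy_receive_loop(int sock, size_t page_size)
{
	/* per-thread temporary source vma, never inherited across fork */
	void *tmp = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	madvise(tmp, page_size, MADV_DONTFORK);

	for (;;) {
		unsigned long dst = wait_for_userfault();

		/* recv faults in a fresh anonymous page backing tmp */
		recv_full(sock, tmp, page_size);
		/* atomically move the filled page to the faulting
		 * address; the (dst, src, len, flags) order and the
		 * __NR_remap_anon_pages macro are assumed to come from
		 * the patched kernel headers */
		if (syscall(__NR_remap_anon_pages, dst,
			    (unsigned long) tmp, page_size, 0) == -1)
			break;
		/* tmp is a hole again: the next recv_full() allocates
		 * a new page without any further vma mangling */
	}
}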

Thanks,
Andrea

2014-10-06 17:01:11

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 12/17] mm: sys_remap_anon_pages

Hi,

On Sat, Oct 04, 2014 at 06:13:27AM -0700, Andi Kleen wrote:
> Andrea Arcangeli <[email protected]> writes:
>
> > This new syscall will move anon pages across vmas, atomically and
> > without touching the vmas.
> >
> > It only works on non shared anonymous pages because those can be
> > relocated without generating non linear anon_vmas in the rmap code.
>
> ...
>
> > It is an alternative to mremap.
>
> Why a new syscall? Couldn't mremap do this transparently?

The difference between remap_anon_pages and mremap is that mremap
fundamentally moves vmas and not pages (the pages just move along
because they remain attached to their respective vmas), while
remap_anon_pages moves anonymous pages zerocopy across vmas but never
touches any vma.

mremap for example would also nuke the source vma; remap_anon_pages
just moves the pages inside the vmas instead, so it doesn't require
allocating new vmas in the area that receives the data.

We could certainly change mremap to try to detect when the mapcount of
an anonymous page is 1, downgrade the mmap_sem to down_read and then
behave like remap_anon_pages internally by updating the page->index, if
all pages in the range can be updated. However, to provide the same
strict checks that remap_anon_pages does and to leave the source vma
intact, mremap would need new flags that would alter the normal mremap
semantics (which silently wipe out the destination range and get rid of
the source range), and it would require running a
remap_anon_pages-detection routine that isn't zero cost.

Unless we add even more flags to mremap, we wouldn't have the absolute
guarantee that the vma tree is not altered in case userland is not
doing all things right (like if userland forgot MADV_DONTFORK).

Separating the two looked better, mremap was never meant to be
efficient at moving 1 page at time (or 1 THP at time).

Embedding remap_anon_pages inside mremap didn't look worthwhile
considering that, as a result, mremap would run slower when it cannot
behave like remap_anon_pages and it would also run slower than
remap_anon_pages when it could.

Thanks,
Andrea

2014-10-06 17:25:13

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 08/17] mm: madvise MADV_USERFAULT

Hi,

On Sat, Oct 04, 2014 at 08:13:36AM +0900, Mike Hommey wrote:
> On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> > userland touches a still unmapped virtual address, a sigbus signal is
> > sent instead of allocating a new page. The sigbus signal handler will
> > then resolve the page fault in userland by calling the
> > remap_anon_pages syscall.
>
> What does "unmapped virtual address" mean in this context?

To clarify this I added this in a second sentence in the commit
header:

"still unmapped virtual address" of the previous sentence in this
context means that the pte/trans_huge_pmd is null. It means it's an
hole inside the anonymous vma (the kind of hole that doesn't account
for RSS but only virtual size of the process). It is the same state
all anonymous virtual memory is, right after mmap. The same state that
if you read from it, will map a zeropage into the faulting virtual
address. If the page is swapped out, it will not trigger userfaults.

If something isn't clear let me know.

Thanks,
Andrea

2014-10-07 09:05:19

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits

On Fri, Oct 03, 2014 at 07:07:57PM +0200, Andrea Arcangeli wrote:
> We run out of 32bits in vm_flags, noop change for 64bit archs.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> fs/proc/task_mmu.c | 4 ++--
> include/linux/huge_mm.h | 4 ++--
> include/linux/ksm.h | 4 ++--
> include/linux/mm_types.h | 2 +-
> mm/huge_memory.c | 2 +-
> mm/ksm.c | 2 +-
> mm/madvise.c | 2 +-
> mm/mremap.c | 2 +-
> 8 files changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index c341568..ee1c3a2 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -532,11 +532,11 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
> /*
> * Don't forget to update Documentation/ on changes.
> */
> - static const char mnemonics[BITS_PER_LONG][2] = {
> + static const char mnemonics[BITS_PER_LONG+1][2] = {

I believe here and below it should be BITS_PER_LONG_LONG instead: it
will catch unknown vm flags. And the +1 is not needed on 64-bit systems.

> /*
> * In case if we meet a flag we don't know about.
> */
> - [0 ... (BITS_PER_LONG-1)] = "??",
> + [0 ... (BITS_PER_LONG)] = "??",
>
> [ilog2(VM_READ)] = "rd",
> [ilog2(VM_WRITE)] = "wr",
--
Kirill A. Shutemov

2014-10-07 10:38:01

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 08/17] mm: madvise MADV_USERFAULT

On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> userland touches a still unmapped virtual address, a sigbus signal is
> sent instead of allocating a new page. The sigbus signal handler will
> then resolve the page fault in userland by calling the
> remap_anon_pages syscall.

Hm. I wonder if this functionality really fits the madvise(2) interface:
as far as I understand it, it provides a way to give a *hint* to the
kernel which may or may not trigger an action from the kernel side. I
don't think an application will behave reasonably if the kernel ignores
the *advice* and allocates memory instead of sending SIGBUS.

I would suggest to consider to use some other interface for the
functionality: a new syscall or, perhaps, mprotect().

--
Kirill A. Shutemov

2014-10-07 10:47:21

by Dr. David Alan Gilbert

[permalink] [raw]
Subject: Re: [PATCH 08/17] mm: madvise MADV_USERFAULT

* Kirill A. Shutemov ([email protected]) wrote:
> On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> > userland touches a still unmapped virtual address, a sigbus signal is
> > sent instead of allocating a new page. The sigbus signal handler will
> > then resolve the page fault in userland by calling the
> > remap_anon_pages syscall.
>
> Hm. I wonder if this functionality really fits the madvise(2) interface:
> as far as I understand it, it provides a way to give a *hint* to the
> kernel which may or may not trigger an action from the kernel side. I
> don't think an application will behave reasonably if the kernel ignores
> the *advice* and allocates memory instead of sending SIGBUS.

Aren't DONTNEED and DONTDUMP similar cases of madvise operations that are
expected to do what they say ?

> I would suggest to consider to use some other interface for the
> functionality: a new syscall or, perhaps, mprotect().

Dave

> --
> Kirill A. Shutemov
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK

2014-10-07 10:53:51

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT

On Tue, Oct 07, 2014 at 11:46:04AM +0100, Dr. David Alan Gilbert wrote:
> * Kirill A. Shutemov ([email protected]) wrote:
> > On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> > > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> > > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> > > userland touches a still unmapped virtual address, a sigbus signal is
> > > sent instead of allocating a new page. The sigbus signal handler will
> > > then resolve the page fault in userland by calling the
> > > remap_anon_pages syscall.
> >
> > Hm. I wonder if this functionality really fits the madvise(2) interface:
> > as far as I understand it, it provides a way to give a *hint* to the
> > kernel which may or may not trigger an action from the kernel side. I
> > don't think an application will behave reasonably if the kernel ignores
> > the *advice* and allocates memory instead of sending SIGBUS.
>
> Aren't DONTNEED and DONTDUMP similar cases of madvise operations that are
> expected to do what they say ?

No. If the kernel ignored MADV_DONTNEED or MADV_DONTDUMP it would not
affect correctness; the behaviour would just be suboptimal: more memory
used than needed, or wasted space in the coredump.

--
Kirill A. Shutemov

2014-10-07 11:02:16

by Dr. David Alan Gilbert

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT

* Kirill A. Shutemov ([email protected]) wrote:
> On Tue, Oct 07, 2014 at 11:46:04AM +0100, Dr. David Alan Gilbert wrote:
> > * Kirill A. Shutemov ([email protected]) wrote:
> > > On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> > > > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> > > > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> > > > userland touches a still unmapped virtual address, a sigbus signal is
> > > > sent instead of allocating a new page. The sigbus signal handler will
> > > > then resolve the page fault in userland by calling the
> > > > remap_anon_pages syscall.
> > >
> > > Hm. I wounder if this functionality really fits madvise(2) interface: as
> > > far as I understand it, it provides a way to give a *hint* to kernel which
> > > may or may not trigger an action from kernel side. I don't think an
> > > application will behaive reasonably if kernel ignore the *advise* and will
> > > not send SIGBUS, but allocate memory.
> >
> > Aren't DONTNEED and DONTDUMP similar cases of madvise operations that are
> > expected to do what they say ?
>
> No. If kernel would ignore MADV_DONTNEED or MADV_DONTDUMP it will not
> affect correctness, just behaviour will be suboptimal: more than needed
> memory used or wasted space in coredump.

That's not how the manpage reads for DONTNEED; it calls it out as a special
case near the top, and explicitly says what will happen if you read the
area marked as DONTNEED.

It looks like there are openssl patches that use DONTDUMP to explicitly
make sure keys etc don't land in cores.

Dave

>
> --
> Kirill A. Shutemov
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK

2014-10-07 11:11:17

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH 10/17] mm: rmap preparation for remap_anon_pages

On Fri, Oct 03, 2014 at 07:08:00PM +0200, Andrea Arcangeli wrote:
> There's one constraint enforced to allow this simplification: the
> source pages passed to remap_anon_pages must be mapped only in one
> vma, but this is not a limitation when used to handle userland page
> faults with MADV_USERFAULT. The source addresses passed to
> remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK to
> avoid any risk of the mapcount of the pages increasing, if fork runs
> in parallel in another thread, before or while remap_anon_pages runs.

Have you considered triggering COW instead of adding a limitation on the
pages' mapcount? The limitation looks artificial from an interface POV.

--
Kirill A. Shutemov

2014-10-07 11:30:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT

On Tue, Oct 07, 2014 at 12:01:02PM +0100, Dr. David Alan Gilbert wrote:
> * Kirill A. Shutemov ([email protected]) wrote:
> > On Tue, Oct 07, 2014 at 11:46:04AM +0100, Dr. David Alan Gilbert wrote:
> > > * Kirill A. Shutemov ([email protected]) wrote:
> > > > On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> > > > > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> > > > > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> > > > > userland touches a still unmapped virtual address, a sigbus signal is
> > > > > sent instead of allocating a new page. The sigbus signal handler will
> > > > > then resolve the page fault in userland by calling the
> > > > > remap_anon_pages syscall.
> > > >
> > > > Hm. I wounder if this functionality really fits madvise(2) interface: as
> > > > far as I understand it, it provides a way to give a *hint* to kernel which
> > > > may or may not trigger an action from kernel side. I don't think an
> > > > application will behaive reasonably if kernel ignore the *advise* and will
> > > > not send SIGBUS, but allocate memory.
> > >
> > > Aren't DONTNEED and DONTDUMP similar cases of madvise operations that are
> > > expected to do what they say ?
> >
> > No. If kernel would ignore MADV_DONTNEED or MADV_DONTDUMP it will not
> > affect correctness, just behaviour will be suboptimal: more than needed
> > memory used or wasted space in coredump.
>
> That's not how the manpage reads for DONTNEED; it calls it out as a special
> case near the top, and explicitly says what will happen if you read the
> area marked as DONTNEED.

You are right. MADV_DONTNEED doesn't fit the interface either. That's bad
and we can't fix it. But it's not a reason to make the same mistake again.

Read the next sentence: "The kernel is free to ignore the advice."

Note, POSIX_MADV_DONTNEED has totally different semantics.

> It looks like there are openssl patches that use DONTDUMP to explicitly
> make sure keys etc don't land in cores.

That's nice to have. But openssl works on systems without the interface,
meaning it's not essential for functionality.

--
Kirill A. Shutemov

2014-10-07 12:48:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli <[email protected]> wrote:
>
> Of course if somebody has better ideas on how to resolve an anonymous
> userfault they're welcome.

So I'd *much* rather have a "write()" style interface (ie _copying_
bytes from user space into a newly allocated page that gets mapped)
than a "remap page" style interface

remapping anonymous pages involves page table games that really aren't
necessarily a good idea, and tlb invalidates for the old page etc.
Just don't do it.

Linus

2014-10-07 13:25:48

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 08/17] mm: madvise MADV_USERFAULT

Hi Kirill,

On Tue, Oct 07, 2014 at 01:36:45PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> > userland touches a still unmapped virtual address, a sigbus signal is
> > sent instead of allocating a new page. The sigbus signal handler will
> > then resolve the page fault in userland by calling the
> > remap_anon_pages syscall.
>
> Hm. I wounder if this functionality really fits madvise(2) interface: as
> far as I understand it, it provides a way to give a *hint* to kernel which
> may or may not trigger an action from kernel side. I don't think an
> application will behaive reasonably if kernel ignore the *advise* and will
> not send SIGBUS, but allocate memory.
>
> I would suggest to consider to use some other interface for the
> functionality: a new syscall or, perhaps, mprotect().

I didn't feel like adding PROT_USERFAULT to mprotect, which looks
hardwired to just these flags:

PROT_NONE The memory cannot be accessed at all.

PROT_READ The memory can be read.

PROT_WRITE The memory can be modified.

PROT_EXEC The memory can be executed.

Normally mprotect doesn't just alter the vmas, it also alters the
pte/hugepmd protection bits; that's something that is never needed
with VM_USERFAULT, so I didn't feel VM_USERFAULT is a protection
change to the VMA.

mprotect is also hardwired to mangle only the VM_READ|WRITE|EXEC
flags, while madvise is ideal to set arbitrary vma flags.

From an implementation standpoint the perfect place to set a flag in a
vma is madvise. This is what MADV_DONTFORK (it sets VM_DONTCOPY)
already does too in an identical way to MADV_USERFAULT/VM_USERFAULT.
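
To make the comparison concrete, here is a minimal, purely illustrative
userland sketch of the proposed usage, mirroring how MADV_DONTFORK is used
today (MADV_USERFAULT is the flag introduced by this patchset, not an
existing API):

	#include <stddef.h>
	#include <sys/mman.h>

	/* set up an anonymous area whose not-yet-mapped pages raise
	 * userfaults (SIGBUS, or the userfaultfd protocol if one is
	 * registered) instead of being allocated on first touch */
	static void *setup_userfault_area(size_t size)
	{
		void *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (area == MAP_FAILED)
			return NULL;
		madvise(area, size, MADV_USERFAULT); /* proposed flag */
		madvise(area, size, MADV_DONTFORK);  /* keep mapcount == 1 */
		return area;
	}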

MADV_DONTFORK is as critical as MADV_USERFAULT because people depend
on it, for example to prevent the O_DIRECT vs fork race condition that
results in silent data corruption during I/O with threads that may
fork. The other reason why MADV_DONTFORK is critical is that fork()
would otherwise fail with OOM unless full overcommit is enabled
(i.e. pci hotplug crashes the guest if you forget to set
MADV_DONTFORK).

Another madvise that would generate a failure if not obeyed by the
kernel is MADV_DONTNEED: if it did nothing it could lead to OOM
killing. We don't inflate virt balloons using munmap, just to make
an example. Various other apps (maybe JVM garbage collection too)
make extensive use of MADV_DONTNEED and depend on it.

That said, I can change it to mprotect; the only thing that I don't
like is that it'll result in a less clean patch, and I can't possibly
see a practical risk in keeping it simpler with madvise, as long as we
always return -EINVAL whenever we encounter a vma type that cannot
raise userfaults yet (that is something I already enforced).

Yet another option would be to drop MADV_USERFAULT and
vm_flags&VM_USERFAULT entirely and in turn the ability to handle
userfaults with SIGBUS, and retain only the userfaultfd. The new
userfaultfd protocol requires registering each created userfaultfd
into its own private virtual memory ranges (that is to allow an
unlimited number of userfaultfd per process). Currently the
userfaultfd engages iff the fault address intersects both the
MADV_USERFAULT range and the userfaultfd registered ranges. So I could
drop MADV_USERFAULT and VM_USERFAULT and just check for
vma->vm_userfaultfd_ctx!=NULL to know if the userfaultfd protocol
needs to be engaged during the first page fault for a still unmapped
virtual address. I just thought it would be more flexible to also
allow SIGBUS without forcing people to use userfaultfd (that's in fact
the only reason to still retain madvise(MADV_USERFAULT)!).

Earlier volatile pages patches only supported the SIGBUS behavior for
example, and I didn't intend to force them to use userfaultfd if
they're guaranteed to access the memory with the CPU and never through
a kernel syscall (that is something the app can enforce by
design). userfaultfd becomes necessary the moment you want to handle
userfaults through syscalls/gup etc... qemu obviously requires
userfaultfd and it never uses the userfaultfd-less SIGBUS behavior, as
it touches the memory in all possible ways (first and foremost with
the KVM page fault that uses almost all variants of gup..).

So here somebody should comment and choose between:

1) set VM_USERFAULT with mprotect(PROT_USERFAULT) instead of
the current madvise(MADV_USERFAULT)

2) drop MADV_USERFAULT and VM_USERFAULT and force the usage of the
userfaultfd protocol as the only way for userland to catch
userfaults (each userfaultfd must already register itself into its
own virtual memory ranges so it's a trivial change for userfaultfd
users that deletes just 1 or 2 lines of userland code, but it would
prevent to use the SIGBUS behavior with info->si_addr=faultaddr for
other users)

3) keep things as they are now: use MADV_USERFAULT for SIGBUS
userfaults, with optional intersection between the
vm_flags&VM_USERFAULT ranges and the userfaultfd registered ranges
with vma->vm_userfaultfd_ctx!=NULL to know if to engage the
userfaultfd protocol instead of the plain SIGBUS

I will update the code according to the feedback, so please comment.

I implemented 3) because I thought it provided the most flexibility
for userland to choose whether to engage in the userfaultfd protocol or
to stay simple with SIGBUS if the app doesn't need to access the
userfault virtual memory from kernel code. It also provides the
cleanest and simplest implementation to set the VM_USERFAULT flag
with madvise.

My second choice would be 2). We could always add MADV_USERFAULT later,
except then we'd be forced to set and clear VM_USERFAULT within the
userfaultfd registration to remain backwards compatible. The main con,
and the reason I didn't pick 2), is that it wouldn't be a drop-in
replacement for volatile pages, which would then be forced to use the
userfaultfd protocol too.

I don't like 1) very much, mostly because the changes to mprotect would
just make things more complex on the implementation side with purely
conceptual benefits, but then it's possible too and it's feature
equivalent to 3) as far as volatile pages are concerned, so I'm
overall fine with this change if that's the preferred way.

Thanks,
Andrea

2014-10-07 13:38:25

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH 10/17] mm: rmap preparation for remap_anon_pages

Hi Kirill,

On Tue, Oct 07, 2014 at 02:10:26PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 03, 2014 at 07:08:00PM +0200, Andrea Arcangeli wrote:
> > There's one constraint enforced to allow this simplification: the
> > source pages passed to remap_anon_pages must be mapped only in one
> > vma, but this is not a limitation when used to handle userland page
> > faults with MADV_USERFAULT. The source addresses passed to
> > remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK to
> > avoid any risk of the mapcount of the pages increasing, if fork runs
> > in parallel in another thread, before or while remap_anon_pages runs.
>
> Have you considered triggering COW instead of adding limitation on
> pages' mapcount? The limitation looks artificial from interface POV.

I haven't considered it, mostly because I see it as a feature that it
returns -EBUSY. I prefer to avoid the risk of userland getting a
successful retval but internally the kernel silently behaving
non-zerocopy by mistake because some userland bug forgot to set
MADV_DONTFORK on the src_vma.

COW would not be zerocopy so it's not ok. We get sub-1msec latency for
userfaults through 10gbit and we don't want to risk wasting CPU
caches.

I did however consider extending the strict behavior (i.e. the
feature) later in a backwards compatible way. We could provide a
non-zerocopy behavior with a RAP_ALLOW_COW flag that would then turn
the -EBUSY error into a copy.

It's also more complex to implement the COW now, and it would make the
code that really matters harder to review. So it may be preferable to
extend this later in a backwards compatible way with a new
RAP_ALLOW_COW flag.

The current handling of the flags is already written in a way that should
allow backwards compatible extension with RAP_ALLOW_*:

#define RAP_ALLOW_SRC_HOLES (1UL<<0)

SYSCALL_DEFINE4(remap_anon_pages,
		unsigned long, dst_start, unsigned long, src_start,
		unsigned long, len, unsigned long, flags)
[..]
	long err = -EINVAL;
[..]
	if (flags & ~RAP_ALLOW_SRC_HOLES)
		return err;

2014-10-07 15:10:45

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

Hello,

On Tue, Oct 07, 2014 at 08:47:59AM -0400, Linus Torvalds wrote:
> On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli <[email protected]> wrote:
> >
> > Of course if somebody has better ideas on how to resolve an anonymous
> > userfault they're welcome.
>
> So I'd *much* rather have a "write()" style interface (ie _copying_
> bytes from user space into a newly allocated page that gets mapped)
> than a "remap page" style interface
>
> remapping anonymous pages involves page table games that really aren't
> necessarily a good idea, and tlb invalidates for the old page etc.
> Just don't do it.

I see what you mean. The only con I see is that we then couldn't use
recv(tmp_addr, PAGE_SIZE) followed by remap_anon_pages(faultaddr, tmp_addr,
PAGE_SIZE, ..) and retain the zerocopy behavior. Or how could we?
There's no recvfile(userfaultfd, socketfd, PAGE_SIZE).

Ideally if we could prevent the page data coming from the network to
ever become visible in the kernel we could avoid the TLB flush and
also be zerocopy but I can't see how we could achieve that.

The page data could come through a ssh pipe or anything (qemu supports
all kind of network transports for live migration), this is why
leaving the network protocol into userland is preferable.

As things stand now, I'm afraid that with a write() syscall we couldn't do
it zerocopy. We'd still need to receive the memory in a temporary page
and then copy it to a kernel page (invisible to userland while we
write to it) to later map into the userfault address.

If it wasn't for the TLB flush of the old page, the remap_anon_pages
variant would be more optimal than doing a copy through a write
syscall. Is the copy cheaper than a TLB flush? I probably naively
assumed the TLB flush was always cheaper.

Another idea that comes to mind, to add the ability to switch
between copy and TLB flush, is a RAP_FORCE_COPY flag that
would then do a copy inside remap_anon_pages and leave the original
page mapped in place (such a flag would also disable the -EBUSY
error if page_mapcount is > 1).
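
For illustration only, a rough userland sketch of the two resolution modes
discussed here, assuming the proposed remap_anon_pages(dst, src, len, flags)
interface from this patchset and treating RAP_FORCE_COPY as a purely
hypothetical flag (faultaddr, tmp_addr, socketfd and force_copy are assumed
to be set up elsewhere):

	/* page data for the faulting address arrives in a scratch buffer */
	recv(socketfd, tmp_addr, PAGE_SIZE, MSG_WAITALL);

	if (!force_copy)
		/* zerocopy: move the page into place; TLB flush of the old
		 * page, -EBUSY if the page is mapped more than once */
		remap_anon_pages(faultaddr, tmp_addr, PAGE_SIZE, 0);
	else
		/* hypothetical alternative: copy into a fresh page, skip the
		 * TLB flush, leave the original page mapped at tmp_addr */
		remap_anon_pages(faultaddr, tmp_addr, PAGE_SIZE,
				 RAP_FORCE_COPY);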

So then if the RAP_FORCE_COPY flag is set remap_anon_pages would
behave like you suggested (but with a mremap-like interface, instead
of a write syscall) and we could benchmark the difference between copy
and TLB flush too. We could even periodically benchmark it at runtime
and switch over the faster method (the more CPUs there are in the host
and the more threads the process has, the faster the copy will be
compared to the TLB flush).

Of course in terms of API I could implement the exact same mechanism
as described above for remap_anon_pages inside a write() to the
userfaultfd (it's a pseudo inode). It'd need two different commands to
prepare for the coming write (with a len multiple of PAGE_SIZE) to
know the address where the page should be mapped into and if to behave
zerocopy or if to skip the TLB flush and copy.

Because the copy vs TLB flush trade off is possible to achieve with
both interfaces, I think it really boils down to choosing between a
mremap like interface, or file+commands protocol interface. I tend to
like mremap more, that's why I opted for a remap_anon_pages syscall
kept orthogonal to the userfaultfd functionality (remap_anon_pages
could be also used standalone as an accelerated mremap in some
circumstances) but nothing prevents to just embed the same mechanism
inside userfaultfd if a file+commands API is preferable. Or we could
add a different syscall (separated from userfaultfd) that creates
another pseudofd to write a command plus the page data into it. Just I
wouldn't see the point of creating a pseudofd just to copy a page
atomically, the write() syscall would look more ideal if the
userfaultfd is already open for other reasons and the pseudofd
overhead is required anyway.

The last thing to keep in mind is that when using userfaults with SIGBUS and
without userfaultfd, remap_anon_pages would still have been useful, so
if we retain the SIGBUS behavior for volatile pages and we don't force
the usage of userfaultfd, it may be cleaner not to use userfaultfd
but a separate pseudofd to do the write() syscall. Otherwise
the app would need to open the userfaultfd to resolve the fault even
though it's not using the userfaultfd protocol, which doesn't look like an
intuitive interface to me.

Comments welcome.

Thanks,
Andrea

2014-10-07 15:23:00

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 08/17] mm: madvise MADV_USERFAULT

On Tue, Oct 07, 2014 at 03:24:58PM +0200, Andrea Arcangeli wrote:
> Hi Kirill,
>
> On Tue, Oct 07, 2014 at 01:36:45PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> > > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> > > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> > > userland touches a still unmapped virtual address, a sigbus signal is
> > > sent instead of allocating a new page. The sigbus signal handler will
> > > then resolve the page fault in userland by calling the
> > > remap_anon_pages syscall.
> >
> > Hm. I wounder if this functionality really fits madvise(2) interface: as
> > far as I understand it, it provides a way to give a *hint* to kernel which
> > may or may not trigger an action from kernel side. I don't think an
> > application will behaive reasonably if kernel ignore the *advise* and will
> > not send SIGBUS, but allocate memory.
> >
> > I would suggest to consider to use some other interface for the
> > functionality: a new syscall or, perhaps, mprotect().
>
> I didn't feel like adding PROT_USERFAULT to mprotect, which looks
> hardwired to just these flags:

PROT_NOALLOC, maybe?

>
> PROT_NONE The memory cannot be accessed at all.
>
> PROT_READ The memory can be read.
>
> PROT_WRITE The memory can be modified.
>
> PROT_EXEC The memory can be executed.

To be complete: PROT_GROWSDOWN, PROT_GROWSUP and unused PROT_SEM.

> So here somebody should comment and choose between:
>
> 1) set VM_USERFAULT with mprotect(PROT_USERFAULT) instead of
> the current madvise(MADV_USERFAULT)
>
> 2) drop MADV_USERFAULT and VM_USERFAULT and force the usage of the
> userfaultfd protocol as the only way for userland to catch
> userfaults (each userfaultfd must already register itself into its
> own virtual memory ranges so it's a trivial change for userfaultfd
> users that deletes just 1 or 2 lines of userland code, but it would
> prevent to use the SIGBUS behavior with info->si_addr=faultaddr for
> other users)
>
> 3) keep things as they are now: use MADV_USERFAULT for SIGBUS
> userfaults, with optional intersection between the
> vm_flags&VM_USERFAULT ranges and the userfaultfd registered ranges
> with vma->vm_userfaultfd_ctx!=NULL to know if to engage the
> userfaultfd protocol instead of the plain SIGBUS

4) new syscall?

> I will update the code accordingly to feedback, so please comment.

I don't have a strong opinion on this. I just *feel* it doesn't fit advice
semantics.

The only userspace interface I've designed was not proven good by time.
I would listen to what senior maintainers say. :)

--
Kirill A. Shutemov

2014-10-07 15:54:10

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

On Tue, Oct 07, 2014 at 04:19:13PM +0200, Andrea Arcangeli wrote:
> mremap like interface, or file+commands protocol interface. I tend to
> like mremap more, that's why I opted for a remap_anon_pages syscall
> kept orthogonal to the userfaultfd functionality (remap_anon_pages
> could be also used standalone as an accelerated mremap in some
> circumstances) but nothing prevents to just embed the same mechanism

Sorry for the self followup, but something else comes to mind to
elaborate this further.

In terms of interfaces, the most efficient I could think of to minimize
the kernel enter/exits would be to append the "source address" of the
data received from the network transport to the userfaultfd_write()
command (by appending 8 bytes to the wakeup command). That said,
mixing the mechanism to be notified about userfaults with the
mechanism to resolve a userfault looks like a complication to me. I kind
of liked keeping the userfaultfd protocol very simple and doing
just its thing. The userfaultfd doesn't need to know how the userfault
was resolved; even mremap would work theoretically (until we run out
of vmas). I thought it was simpler to keep it that way. However, if we
want to resolve the fault with a "write()" syscall this may be the
most efficient way to do it, as we're already doing a write() into the
pseudofd to wake up the page fault that contains the destination
address; I just need to append the source address to the wakeup command.
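
Purely as an illustration of that idea (the structure name and layout are
made up; nothing like this exists in the current patchset), the record
written into the pseudofd would grow from just a destination address to a
pair of addresses:

	/* hypothetical resolve-and-wake record for userfaultfd_write() */
	struct uffd_wake_and_copy {
		__u64 dst_addr;	/* faulting address to wake up */
		__u64 src_addr;	/* where the received page data lives */
	};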

I probably grossly overestimated the benefits of resolving the
userfault with a zerocopy page move, sorry. So if we entirely drop the
zerocopy behavior and the TLB flush of the old page like you
suggested, the way to keep the userfaultfd mechanism decoupled from
the userfault resolution mechanism would be to implement an
atomic-copy syscall. That would work for SIGBUS userfaults too without
requiring a pseudofd then. It would be enough then to call
mcopy_atomic(userfault_addr, tmp_addr, len) with the only constraint
that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic
wouldn't page fault or call GUP into the destination address (it can't,
otherwise the in-flight partial copy would be visible to the process,
breaking the atomicity of the copy), but it would fill in the
pte/trans_huge_pmd with the same strict behavior that remap_anon_pages
currently has (in turn it would by design bypass the VM_USERFAULT
check and be ideal for resolving userfaults).
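
As a hypothetical sketch of how this could slot into a fault-handling
thread (mcopy_atomic doesn't exist; its mcopy_atomic(dst, src, len) form is
taken from the paragraph above, the userfaultfd read/write format is
simplified, and userfaultfd, socketfd and tmp_addr are assumed to be set up
elsewhere):

	uint64_t faultaddr;

	/* read the address of a blocked userfault from the userfaultfd */
	read(userfaultfd, &faultaddr, sizeof(faultaddr));

	/* receive the page contents into a scratch buffer */
	recv(socketfd, tmp_addr, PAGE_SIZE, MSG_WAITALL);

	/* allocate a fresh page, copy tmp_addr into it while it is still
	 * invisible to the process, then map it at faultaddr in one shot */
	mcopy_atomic((void *)faultaddr, tmp_addr, PAGE_SIZE);

	/* wake up the page fault blocked on faultaddr */
	write(userfaultfd, &faultaddr, sizeof(faultaddr));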

mcopy_atomic could then also be extended to tmpfs, and it would work
without requiring the source page to be a tmpfs page too and without
having to convert page types on the fly.

If I add mcopy_atomic, the patch in subject (10/17) can be dropped of
course so it'd be even less intrusive than the current
remap_anon_pages and it would require zero TLB flush during its
runtime (it would just require an atomic copy).

So should I try to embed mcopy_atomic inside userfault_write, or should
I expose it to userland as a standalone new syscall? Or should I do
something different? Comments?

Thanks,
Andrea

2014-10-07 15:55:24

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

On Tue, Oct 7, 2014 at 8:52 AM, Andrea Arcangeli <[email protected]> wrote:
> On Tue, Oct 07, 2014 at 04:19:13PM +0200, Andrea Arcangeli wrote:
>> mremap like interface, or file+commands protocol interface. I tend to
>> like mremap more, that's why I opted for a remap_anon_pages syscall
>> kept orthogonal to the userfaultfd functionality (remap_anon_pages
>> could be also used standalone as an accelerated mremap in some
>> circumstances) but nothing prevents to just embed the same mechanism
>
> Sorry for the self followup, but something else comes to mind to
> elaborate this further.
>
> In term of interfaces, the most efficient I could think of to minimize
> the enter/exit kernel, would be to append the "source address" of the
> data received from the network transport, to the userfaultfd_write()
> command (by appending 8 bytes to the wakeup command). Said that,
> mixing the mechanism to be notified about userfaults with the
> mechanism to resolve an userfault to me looks a complication. I kind
> of liked to keep the userfaultfd protocol is very simple and doing
> just its thing. The userfaultfd doesn't need to know how the userfault
> was resolved, even mremap would work theoretically (until we run out
> of vmas). I thought it was simpler to keep it that way. However if we
> want to resolve the fault with a "write()" syscall this may be the
> most efficient way to do it, as we're already doing a write() into the
> pseudofd to wakeup the page fault that contains the destination
> address, I just need to append the source address to the wakeup command.
>
> I probably grossly overestimated the benefits of resolving the
> userfault with a zerocopy page move, sorry. So if we entirely drop the
> zerocopy behavior and the TLB flush of the old page like you
> suggested, the way to keep the userfaultfd mechanism decoupled from
> the userfault resolution mechanism would be to implement an
> atomic-copy syscall. That would work for SIGBUS userfaults too without
> requiring a pseudofd then. It would be enough then to call
> mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraints
> that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic
> wouldn't page fault or call GUP into the destination address (it can't
> otherwise the in-flight partial copy would be visible to the process,
> breaking the atomicity of the copy), but it would fill in the
> pte/trans_huge_pmd with the same strict behavior that remap_anon_pages
> currently has (in turn it would by design bypass the VM_USERFAULT
> check and be ideal for resolving userfaults).

At the risk of asking a possibly useless question, would it make sense
to splice data into a userfaultfd?

--Andy

>
> mcopy_atomic could then be also extended to tmpfs and it would work
> without requiring the source page to be a tmpfs page too without
> having to convert page types on the fly.
>
> If I add mcopy_atomic, the patch in subject (10/17) can be dropped of
> course so it'd be even less intrusive than the current
> remap_anon_pages and it would require zero TLB flush during its
> runtime (it would just require an atomic copy).
>
> So should I try to embed a mcopy_atomic inside userfault_write or can
> I expose it to userland as a standalone new syscall? Or should I do
> something different? Comments?
>
> Thanks,
> Andrea



--
Andy Lutomirski
AMA Capital Management, LLC

2014-10-07 16:13:27

by Peter Feiner

[permalink] [raw]
Subject: Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

On Tue, Oct 07, 2014 at 05:52:47PM +0200, Andrea Arcangeli wrote:
> I probably grossly overestimated the benefits of resolving the
> userfault with a zerocopy page move, sorry. [...]

For posterity, I think it's worth noting that the most expensive aspect of a TLB
shootdown is the interprocessor interrupt necessary to flush other CPUs' TLBs.
On a many-core machine, copying 4K of data looks pretty cheap compared to
taking an interrupt and invalidating TLBs on many cores :-)

> [...] So if we entirely drop the
> zerocopy behavior and the TLB flush of the old page like you
> suggested, the way to keep the userfaultfd mechanism decoupled from
> the userfault resolution mechanism would be to implement an
> atomic-copy syscall. That would work for SIGBUS userfaults too without
> requiring a pseudofd then. It would be enough then to call
> mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraints
> that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic
> wouldn't page fault or call GUP into the destination address (it can't
> otherwise the in-flight partial copy would be visible to the process,
> breaking the atomicity of the copy), but it would fill in the
> pte/trans_huge_pmd with the same strict behavior that remap_anon_pages
> currently has (in turn it would by design bypass the VM_USERFAULT
> check and be ideal for resolving userfaults).
>
> mcopy_atomic could then be also extended to tmpfs and it would work
> without requiring the source page to be a tmpfs page too without
> having to convert page types on the fly.
>
> If I add mcopy_atomic, the patch in subject (10/17) can be dropped of
> course so it'd be even less intrusive than the current
> remap_anon_pages and it would require zero TLB flush during its
> runtime (it would just require an atomic copy).

I like this new approach. It will be good to have a single interface for
resolving anon and tmpfs userfaults.

> So should I try to embed a mcopy_atomic inside userfault_write or can
> I expose it to userland as a standalone new syscall? Or should I do
> something different? Comments?

One interesting (ab)use of userfault_write would be that the faulting process
and the fault-handling process could be different, which would be necessary
for post-copy live migration in CRIU (http://criu.org).

Aside from the aesthetic difference, I can't think of any advantage in favor of
a syscall.

Peter

2014-10-07 16:56:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

On Tue, Oct 7, 2014 at 10:19 AM, Andrea Arcangeli <[email protected]> wrote:
>
> I see what you mean. The only cons I see is that we couldn't use then
> recv(tmp_addr, PAGE_SIZE), remap_anon_pages(faultaddr, tmp_addr,
> PAGE_SIZE, ..) and retain the zerocopy behavior. Or how could we?
> There's no recvfile(userfaultfd, socketfd, PAGE_SIZE).

You're doing completely faulty math, and you haven't thought it through.

Your "zero-copy" case is no such thing. Who cares if some packet
receive is zero-copy, when you need to set up the anonymous page to
*receive* the zero copy into, which involves page allocation, page
zeroing, page table setup with VM and page table locking, etc etc.

The thing is, the whole concept of "zero-copy" is pure and utter
bullshit. Sun made a big deal about the whole concept back in the
nineties, and IT DOES NOT WORK. It's a scam. Don't buy into it. It's
crap. It's made-up and not real.

Then, once you've allocated and cleared the page, mapped it in, your
"zero-copy" model involves looking up the page in the page tables
again (locking etc), then doing that zero-copy to the page. Then, when
you remap it, you look it up in the page tables AGAIN, with locking,
move it around, have to clear the old page table entry (which involves
a locked cmpxchg64), a TLB flush with most likely a cross-CPU IPI -
since the people who do this are all threaded and want many CPU's, and
then you insert the page into the new place.

That's *insane*. It's crap. All just to try to avoid one page copy.

Don't do it. remapping games really are complete BS. They never beat
just copying the data. It's that simple.

> As things stands now, I'm afraid with a write() syscall we couldn't do
> it zerocopy.

Really, you need to rethink your whole "zerocopy" model. It's broken.
Nobody sane cares. You've bought into a model that Sun already showed
doesn't work.

The only time zero-copy works is in random benchmarks that are very
careful to not touch the data at all at any point, and also try to
make sure that the benchmark is very much single-threaded so that you
never have the whole cross-CPU IPI issue for the TLB invalidate. Then,
and only then, can zero-copy win. And it's just not a realistic
scenario.

> If it wasn't for the TLB flush of the old page, the remap_anon_pages
> variant would be more optimal than doing a copy through a write
> syscall. Is the copy cheaper than a TLB flush? I probably naively
> assumed the TLB flush was always cheaper.

A local TLB flush is cheap. That's not the problem. The problem is the
setup of the page, and the clearing of the page, and the cross-CPU TLB
flush. And the page table locking, etc etc.

So no, I'm not AT ALL worried about a single "invlpg" instruction.
That's nothing. Local CPU TLB flushes of single pages are basically
free. But that really isn't where the costs are.

Quite frankly, the *only* time page remapping has ever made sense is
when it is used for realloc() kind of purposes, where you need to
remap pages not because of zero-copy, but because you need to change
the virtual address space layout. And then you make sure it's not a
common operation, because you're not doing it as a performance win,
you're doing it because you're changing your virtual layout.

Really. Zerocopy is for benchmarks, and for things like "splice()"
when you can avoid the page tables *entirely*. But the notion that
page remapping of user pages is faster than a copy is pure and utter
garbage. It's simply not true.

So I really think you should aim for a "write()" kind of interface.

With write, you may not get the zero-copy, but on the other hand it
allows you to re-use the source buffer many times without having to
allocate new pages and map it in etc. So a "read()+write()" loop (or,
quite commonly a "generate data computationally from a compressed
source + write()" loop) is actually much more efficient than the
zero-copy remapping, because you don't have all the complexities and
overheads in creating the source page.

It is possible that that could involve "splice()" too, although I
don't really think the source data tends to be in page-aligned chunks.
But hey, splice() at least *can* be faster than copying (and then we
have vmsplice() not because it's magically faster, but because it can
under certain circumstances be worth it, and it kind of made sense to
allow the interface, but I really don't think it's used very much or
very useful).

Linus

2014-10-07 17:15:48

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

On 07/10/2014 19:07, Dr. David Alan Gilbert wrote:
>> >
>> > So I'd *much* rather have a "write()" style interface (ie _copying_
>> > bytes from user space into a newly allocated page that gets mapped)
>> > than a "remap page" style interface
> Something like that might work for the postcopy case; it doesn't work
> for some of the other uses that need to stop a page being changed by the
> guest, but then need to somehow get a copy of that page internally to QEMU,
> and perhaps provide it back later.

I cannot parse this. Which uses do you have in mind? Is it
QEMU-specific or is it for other applications of userfaults?

As long as the page is atomically mapped, I'm not sure what the
difference from remap_anon_pages is (as far as the destination page is
concerned). Are you thinking of having userfaults enabled on the source
as well?

Paolo

> remap_anon_pages worked for those cases
> as well; I can't think of another current way of doing it in userspace.

2014-10-07 17:26:43

by Dr. David Alan Gilbert

[permalink] [raw]
Subject: Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

* Paolo Bonzini ([email protected]) wrote:
> Il 07/10/2014 19:07, Dr. David Alan Gilbert ha scritto:
> >> >
> >> > So I'd *much* rather have a "write()" style interface (ie _copying_
> >> > bytes from user space into a newly allocated page that gets mapped)
> >> > than a "remap page" style interface
> > Something like that might work for the postcopy case; it doesn't work
> > for some of the other uses that need to stop a page being changed by the
> > guest, but then need to somehow get a copy of that page internally to QEMU,
> > and perhaps provide it back later.
>
> I cannot parse this. Which uses do you have in mind? Is it for
> QEMU-specific or is it for other applications of userfaults?

> As long as the page is atomically mapped, I'm not sure what the
> difference from remap_anon_pages are (as far as the destination page is
> concerned). Are you thinking of having userfaults enabled on the source
> as well?

What I'm talking about here is when I want to stop a page being accessed by the
guest, do something with the data in qemu, and give it back to the guest sometime
later.

The main example is: memory pools for guests, where you swap RAM between a series of
VM hosts. You have to take the page out and send it over the wire; sometime later, if
the guest tries to access it, you userfault and pull it back.
(There's at least one or two companies selling something like this, and at least
one Linux based implementation with its own much more involved kernel hacks.)

Dave
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK

2014-10-07 17:44:29

by Dr. David Alan Gilbert

[permalink] [raw]
Subject: Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages

* Linus Torvalds ([email protected]) wrote:
> On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli <[email protected]> wrote:
> >
> > Of course if somebody has better ideas on how to resolve an anonymous
> > userfault they're welcome.
>
> So I'd *much* rather have a "write()" style interface (ie _copying_
> bytes from user space into a newly allocated page that gets mapped)
> than a "remap page" style interface

Something like that might work for the postcopy case; it doesn't work
for some of the other uses that need to stop a page being changed by the
guest, but then need to somehow get a copy of that page internally to QEMU,
and perhaps provide it back later. remap_anon_pages worked for those cases
as well; I can't think of another current way of doing it in userspace.

I'm thinking here of systems for making VMs with memory larger than a single
host; that's something that's not as well thought out. I've also seen people
writing emulation that want to trap and emulate some page accesses while
still having the original data available to the emulator itself.

So yes, OK for now, but the result is less general.

Dave


> remapping anonymous pages involves page table games that really aren't
> necessarily a good idea, and tlb invalidates for the old page etc.
> Just don't do it.
>
> Linus
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK

2014-10-27 09:35:23

by Hailiang Zhang

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

Hi Andrea,

Thanks for your hard work on userfault;)

This is really a useful API.

I want to confirm one question:
Can we support distinguishing between writing and reading memory for userfault?
That is, can we decide whether writing a page, reading a page, or both trigger the userfault?

I think this would help support vhost-scsi and ivshmem for migration,
as we could trace dirty pages in userspace.

Actually, I'm trying to realize live memory snapshot based on pre-copy and userfault,
but reading memory from the migration thread will also trigger a userfault.
It would be easy to implement live memory snapshot if we supported configuring
userfault for writing memory only.


Thanks,
zhanghailiang

On 2014/10/4 1:07, Andrea Arcangeli wrote:
> Hello everyone,
>
> There's a large To/Cc list for this RFC because this adds two new
> syscalls (userfaultfd and remap_anon_pages) and
> MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes are welcome
> sooner than later.
>
> The major change compared to the previous RFC I sent a few months ago
> is that the userfaultfd protocol now supports dynamic range
> registration. So you can have an unlimited number of userfaults for
> each process, so each shared library can use its own userfaultfd on
> its own memory independently from other shared libraries or the main
> program. This functionality was suggested from Andy Lutomirski (more
> details on this are in the commit header of the last patch of this
> patchset).
>
> In addition the mmap_sem complexities has been sorted out. In fact the
> real userfault patchset starts from patch number 7. Patches 1-6 will
> be submitted separately for merging and if applied standalone they
> provide a scalability improvement by reducing the mmap_sem hold times
> during I/O. I included patch 1-6 here too because they're an hard
> dependency for the userfault patchset. The userfaultfd syscall depends
> on the first fault to always have FAULT_FLAG_ALLOW_RETRY set (the
> later retry faults don't matter, it's fine to clear
> FAULT_FLAG_ALLOW_RETRY with the retry faults, following the current
> model).
>
> The combination of these features are what I would propose to
> implement postcopy live migration in qemu, and in general demand
> paging of remote memory, hosted in different cloud nodes.
>
> If the access could ever happen in kernel context through syscalls
> (not not just from userland context), then userfaultfd has to be used
> on top of MADV_USERFAULT, to make the userfault unnoticeable to the
> syscall (no error will be returned). This latter feature is more
> advanced than what volatile ranges alone could do with SIGBUS so far
> (but it's optional, if the process doesn't register the memory in a
> userfaultfd, the regular SIGBUS will fire, if the fd is closed SIGBUS
> will also fire for any blocked userfault that was waiting a
> userfaultfd_write ack).
>
> userfaultfd is also a generic enough feature, that it allows KVM to
> implement postcopy live migration without having to modify a single
> line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
> other GUP features works just fine in combination with userfaults
> (userfaults trigger async page faults in the guest scheduler so those
> guest processes that aren't waiting for userfaults can keep running in
> the guest vcpus).
>
> remap_anon_pages is the syscall to use to resolve the userfaults (it's
> not mandatory, vmsplice will likely still be used in the case of local
> postcopy live migration just to upgrade the qemu binary, but
> remap_anon_pages is faster and ideal for transferring memory across
> the network, it's zerocopy and doesn't touch the vma: it only holds
> the mmap_sem for reading).
>
> The current behavior of remap_anon_pages is very strict to avoid any
> chance of memory corruption going unnoticed. mremap is not strict like
> that: if there's a synchronization bug it would drop the destination
> range silently resulting in subtle memory corruption for
> example. remap_anon_pages would return -EEXIST in that case. If there
> are holes in the source range remap_anon_pages will return -ENOENT.
>
> If remap_anon_pages is used always with 2M naturally aligned
> addresses, transparent hugepages will not be splitted. In there could
> be 4k (or any size) holes in the 2M (or any size) source range,
> remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
> relax some of its strict checks (-ENOENT won't be returned if
> RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as
> a noop on any hole in the source range). This flag is generally useful
> when implementing userfaults with THP granularity, but it shouldn't be
> set if doing the userfaults with PAGE_SIZE granularity if the
> developer wants to benefit from the strict -ENOENT behavior.
>
> The remap_anon_pages syscall API is not vectored, as I expect it to be
> used mainly for demand paging (where there can be just one faulting
> range per userfault) or for large ranges (with the THP model as an
> alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
> granularity before starting the guest in the destination node) where
> vectoring isn't going to provide much performance advantages (thanks
> to the THP coarser granularity).
>
> On the rmap side remap_anon_pages doesn't add much complexity: there's
> no need of nonlinear anon vmas to support it because I added the
> constraint that it will fail if the mapcount is more than 1. So in
> general the source range of remap_anon_pages should be marked
> MADV_DONTFORK to prevent any risk of failure if the process ever
> forks (like qemu can in some case).
>
> The MADV_USERFAULT feature should be generic enough that it can
> provide the userfaults to the Android volatile range feature too, on
> access of reclaimed volatile pages. Or it could be used for other
> similar things with tmpfs in the future. I've been discussing how to
> extend it to tmpfs for example. Currently if MADV_USERFAULT is set on
> a non-anonymous vma, it will return -EINVAL and that's enough to
> provide backwards compatibility once MADV_USERFAULT will be extended
> to tmpfs. An orthogonal problem then will be to identify the optimal
> mechanism to atomically resolve a tmpfs backed userfault (like
> remap_anon_pages does it optimally for anonymous memory) but that's
> beyond the scope of the userfault functionality (in theory
> remap_anon_pages is also orthogonal and I could split it off in a
> separate patchset if somebody prefers). Of course remap_file_pages
> should do it fine too, but it would create rmap nonlinearity which
> isn't optimal.
>
> The code can be found here:
>
> git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault
>
> The branch is rebased so you can get updates for example with:
>
> git fetch && git checkout -f origin/userfault
>
> Comments welcome, thanks!
> Andrea
>
> Andrea Arcangeli (15):
> mm: gup: add get_user_pages_locked and get_user_pages_unlocked
> mm: gup: use get_user_pages_unlocked within get_user_pages_fast
> mm: gup: make get_user_pages_fast and __get_user_pages_fast latency
> conscious
> mm: gup: use get_user_pages_fast and get_user_pages_unlocked
> mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
> mm: madvise MADV_USERFAULT
> mm: PT lock: export double_pt_lock/unlock
> mm: rmap preparation for remap_anon_pages
> mm: swp_entry_swapcount
> mm: sys_remap_anon_pages
> waitqueue: add nr wake parameter to __wake_up_locked_key
> userfaultfd: add new syscall to provide memory externalization
> userfaultfd: make userfaultfd_write non blocking
> powerpc: add remap_anon_pages and userfaultfd
> userfaultfd: implement USERFAULTFD_RANGE_REGISTER|UNREGISTER
>
> Andres Lagar-Cavilla (2):
> mm: gup: add FOLL_TRIED
> kvm: Faults which trigger IO release the mmap_sem
>
> arch/alpha/include/uapi/asm/mman.h | 3 +
> arch/mips/include/uapi/asm/mman.h | 3 +
> arch/mips/mm/gup.c | 8 +-
> arch/parisc/include/uapi/asm/mman.h | 3 +
> arch/powerpc/include/asm/systbl.h | 2 +
> arch/powerpc/include/asm/unistd.h | 2 +-
> arch/powerpc/include/uapi/asm/unistd.h | 2 +
> arch/powerpc/mm/gup.c | 6 +-
> arch/s390/kvm/kvm-s390.c | 4 +-
> arch/s390/mm/gup.c | 6 +-
> arch/sh/mm/gup.c | 6 +-
> arch/sparc/mm/gup.c | 6 +-
> arch/x86/mm/gup.c | 235 +++++++----
> arch/x86/syscalls/syscall_32.tbl | 2 +
> arch/x86/syscalls/syscall_64.tbl | 2 +
> arch/xtensa/include/uapi/asm/mman.h | 3 +
> drivers/dma/iovlock.c | 10 +-
> drivers/iommu/amd_iommu_v2.c | 6 +-
> drivers/media/pci/ivtv/ivtv-udma.c | 6 +-
> drivers/scsi/st.c | 10 +-
> drivers/video/fbdev/pvr2fb.c | 5 +-
> fs/Makefile | 1 +
> fs/proc/task_mmu.c | 5 +-
> fs/userfaultfd.c | 722 +++++++++++++++++++++++++++++++++
> include/linux/huge_mm.h | 11 +-
> include/linux/ksm.h | 4 +-
> include/linux/mm.h | 15 +-
> include/linux/mm_types.h | 13 +-
> include/linux/swap.h | 6 +
> include/linux/syscalls.h | 5 +
> include/linux/userfaultfd.h | 55 +++
> include/linux/wait.h | 5 +-
> include/uapi/asm-generic/mman-common.h | 3 +
> init/Kconfig | 11 +
> kernel/sched/wait.c | 7 +-
> kernel/sys_ni.c | 2 +
> mm/fremap.c | 506 +++++++++++++++++++++++
> mm/gup.c | 182 ++++++++-
> mm/huge_memory.c | 208 ++++++++--
> mm/ksm.c | 2 +-
> mm/madvise.c | 22 +-
> mm/memory.c | 14 +
> mm/mempolicy.c | 4 +-
> mm/mlock.c | 3 +-
> mm/mmap.c | 39 +-
> mm/mprotect.c | 3 +-
> mm/mremap.c | 2 +-
> mm/nommu.c | 23 ++
> mm/process_vm_access.c | 7 +-
> mm/rmap.c | 9 +
> mm/swapfile.c | 13 +
> mm/util.c | 10 +-
> net/ceph/pagevec.c | 9 +-
> net/sunrpc/sched.c | 2 +-
> virt/kvm/async_pf.c | 4 +-
> virt/kvm/kvm_main.c | 4 +-
> 56 files changed, 2025 insertions(+), 236 deletions(-)
> create mode 100644 fs/userfaultfd.c
> create mode 100644 include/linux/userfaultfd.h
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> .
>

2014-10-29 17:47:09

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

Hi Zhanghailiang,

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
> Hi Andrea,
>
> Thanks for your hard work on userfault;)
>
> This is really a useful API.
>
> I want to confirm a question:
> Can we support distinguishing between writing and reading memory for userfault?
> That is, we can decide whether writing a page, reading a page or both trigger userfault.
>
> I think this will help supporting vhost-scsi,ivshmem for migration,
> we can trace dirty page in userspace.
>
> Actually, i'm trying to relize live memory snapshot based on pre-copy and userfault,
> but reading memory from migration thread will also trigger userfault.
> It will be easy to implement live memory snapshot, if we support configuring
> userfault for writing memory only.

This mail is going to be long enough already, so I'll just assume tracking
dirty memory in userland (instead of doing it in the kernel) is a worthy
feature to have here.

After some chat during the KVMForum I've already been thinking it
could be beneficial for some usages to give userland the information
about the fault being a read or a write, combined with the ability of
mapping pages wrprotected with mcopy_atomic (that would work without
false positives only with MADV_DONTFORK also set, but it's already set
in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
checked also in the wrprotect faults, not just in the not-present
faults, but it's not a massive change. Returning the read/write
information is also not a massive change. This will then pay off mostly
if there's also a way to remove the memory atomically (kind of like
remap_anon_pages).

Would that be enough? I mean, are you still ok if a non-present read
fault traps too (you'd be notified it's a read) and you get
notifications for both wrprotect and non-present faults?

The question then is how you mark the memory readonly to let the
wrprotect faults trap if the memory already existed and you didn't map
it yourself in the guest with mcopy_atomic with a readonly flag.

My current plan would be:

- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
fast path check in the not-present and wrprotect page faults

- if VM_USERFAULT is set, find if there's a userfaultfd registered
into that vma too

if yes engage userfaultfd protocol

otherwise raise SIGBUS (single threaded apps should be fine with
SIGBUS and it'll avoid them having to spawn a thread in order to talk
the userfaultfd protocol)

- if userfaultfd protocol is engaged, return read|write fault + fault
address to read(ufd) syscalls

- leave the "userfault" resolution mechanism independent of the
userfaultfd protocol so we keep the two problems separated and we
don't mix them in the same API which makes it even harder to
finalize it.

add mcopy_atomic (with a flag to map the page readonly too)

The alternative would be to hide mcopy_atomic (and even
remap_anon_pages, in order to "remove" the memory atomically for
the externalization into the cloud) as userfaultfd commands to
write into the fd. But then there would be not much point in keeping
MADV_USERFAULT around if I did so and I could just remove it
too; it doesn't look clean having to open the userfaultfd just
to issue a hidden mcopy_atomic.

So it becomes a decision on whether the basic SIGBUS mode for single
threaded apps should be supported or not. As long as we support
SIGBUS too and we don't force userfaultfd as the only
mechanism to be notified about userfaults, having a separate
mcopy_atomic syscall sounds cleaner.

Perhaps mcopy_atomic could be used in other cases that may arise
later that may not be connected with the userfault.

Questions to double check the above plan is ok:

1) should I drop the SIGBUS behavior and MADV_USERFAULT?

2) should I hide mcopy_atomic as a write into the userfaultfd?

NOTE: even if I hide mcopy_atomic as a userfaultfd command to write
into the fd, the buffer pointer passed to the write() syscall would
still _not_ point to the data like in a regular write, but it
would be a pointer to a command structure that points to the source
and destination data of the "hidden" mcopy_atomic; the only
advantage is that perhaps I could wake up the blocked page faults
without requiring an additional syscall.

The standalone mcopy_atomic would still require a write into the
userfaultfd as it happens now after remap_anon_pages returns, in
order to wakeup the stopped page faults.

3) should I add a registration command to trap only write faults?

The protocol can always be extended later anyway in a backwards
compatible way but it's better if we get it fully featured from the
start.

For completeness, some answers for other questions I've seen floating
around but that weren't posted on the list yet (you can skip reading
the below part if not interested):

- open("/dev/userfault") instead of sys_userfaultfd(), I don't see the
benefit: userfaultfd is just like eventfd in terms of kernel API and
registering a /dev/ device actually sounds trickier. userfault is a
core VM feature and generally we prefer syscalls for core VM
features instead of running ioctl on some chardev that may or may
not exist. (like we did with /dev/ksm -> MADV_MERGEABLE)

- there was a suggestion during KVMForum about allowing an external
program to attach to any MM. Like ptrace. So you could have a single
process managing all userfaults for different processes. However
because I cannot allow multiple userfaultfd to register into the
same range, this doesn't look very reliable (ptrace is kind of an
optional/debug feature while if userfault goes wrong and returns
-EBUSY things go bad) and there may be other complications. If I
allowed multiple userfaultfds to register into the same range, I
wouldn't even know which one to deliver the userfault to; that would be
erratic behavior. Currently it'd return -EBUSY if the app has a bug and does
that, but maybe later this can be relaxed to allow higher
scalability with a flag (userfaultfd gets flags as parameters), but
it still would need to be the same logic that manages userfaults and
the only point of allowing multiple ufd to map the same range would
be SMP scalability. So I tend to see the userfaultfd as a MM local
thing. The thread managing the userfaults can still talk with
another process in the local machine using pipes or sockets if it
needs to.

- the userfaultfd protocol version handshake was done this way because
it looked more reliable.

Of course we could pass the protocol version as a parameter to
userfaultfd too, but running the syscall multiple times until
-EPROTO is no longer returned doesn't seem any better than writing
the wanted protocol into the fd until you read it back instead of
-1ULL. It just looked more reliable not to have to run the syscall
again and again while depending on -EPROTO or some other
-Esomething.
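For example, the userland side of that handshake could look roughly like
this (a sketch only; the protocol numbers are invented and only the
write-then-read-back loop matters):

#include <stdint.h>
#include <unistd.h>

/* try the protocols we speak, newest first, until the kernel echoes one back */
static int uffd_handshake(int ufd)
{
	uint64_t wanted[] = { 2, 1 };	/* hypothetical protocol versions */
	uint64_t ack;

	for (unsigned int i = 0; i < sizeof(wanted) / sizeof(wanted[0]); i++) {
		if (write(ufd, &wanted[i], sizeof(wanted[i])) != sizeof(wanted[i]))
			return -1;
		if (read(ufd, &ack, sizeof(ack)) != sizeof(ack))
			return -1;
		if (ack != (uint64_t)-1)
			return (int)ack;	/* kernel accepted this protocol */
		/* -1ULL read back: protocol unknown to the kernel, try the next */
	}
	return -1;
}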

Thanks,
Andrea

2014-10-29 17:57:26

by Peter Maydell

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2

On 29 October 2014 17:46, Andrea Arcangeli <[email protected]> wrote:
> After some chat during the KVMForum I've been already thinking it
> could be beneficial for some usage to give userland the information
> about the fault being read or write

...I wonder if that would let us replace the current nasty
mess we use in linux-user to detect read vs write faults
(which uses a bunch of architecture-specific hacks including
in some cases "look at the insn that triggered this SEGV and
decode it to see if it was a load or a store"; see the
various cpu_signal_handler() implementations in user-exec.c).

-- PMM

2014-10-30 11:33:14

by Hailiang Zhang

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

On 2014/10/30 1:46, Andrea Arcangeli wrote:
> Hi Zhanghailiang,
>
> On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
>> Hi Andrea,
>>
>> Thanks for your hard work on userfault;)
>>
>> This is really a useful API.
>>
>> I want to confirm a question:
>> Can we support distinguishing between writing and reading memory for userfault?
>> That is, we can decide whether writing a page, reading a page or both trigger userfault.
>>
>> I think this will help supporting vhost-scsi,ivshmem for migration,
>> we can trace dirty page in userspace.
>>
>> Actually, i'm trying to relize live memory snapshot based on pre-copy and userfault,
>> but reading memory from migration thread will also trigger userfault.
>> It will be easy to implement live memory snapshot, if we support configuring
>> userfault for writing memory only.
>
> Mail is going to be long enough already so I'll just assume tracking
> dirty memory in userland (instead of doing it in kernel) is worthy
> feature to have here.
>
> After some chat during the KVMForum I've been already thinking it
> could be beneficial for some usage to give userland the information
> about the fault being read or write, combined with the ability of
> mapping pages wrprotected to mcopy_atomic (that would work without
> false positives only with MADV_DONTFORK also set, but it's already set
> in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
> checked also in the wrprotect faults, not just in the not present
> faults, but it's not a massive change. Returning the read/write
> information is also a not massive change. This will then payoff mostly
> if there's also a way to remove the memory atomically (kind of
> remap_anon_pages).
>
> Would that be enough? I mean are you still ok if non present read
> fault traps too (you'd be notified it's a read) and you get
> notification for both wrprotect and non present faults?
>
Hi Andrea,

Thanks for your reply, and your patience;)

Er, maybe I didn't describe it clearly. What I really need for live memory snapshot
is only the wrprotect fault, like KVM's dirty tracking mechanism, *only tracing write actions*.

My initial solution scheme for live memory snapshot is:
(1) pause the VM
(2) use userfaultfd to mark all of the VM's memory wrprotected (read-only)
(3) save the device state to the snapshot file
(4) resume the VM
(5) the snapshot thread begins to save pages of memory to the snapshot file
(6) the VM keeps running, and it is OK for the VM or other threads to read RAM (no fault trap),
but if the VM tries to write a page (dirty the page), there will be
a userfault trap notification.
(7) a fault-handle thread reads the page request from userfaultfd,
copies the content of the page to some buffer, and then removes the page's
wrprotect limit (still using the userfaultfd to tell the kernel).
(8) after step (7), the VM can continue to write the page, which can now be written.
(9) the snapshot thread saves the page cached in step (7)
(10) repeat steps (5)-(9) until all of the VM's memory is saved to the snapshot file.

So, what I need from userfault is support for wrprotect faults only. I don't
want to get notifications for not-present read faults; they would influence
the VM's performance and the efficiency of doing the snapshot.
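To make steps (6)-(8) concrete, a very rough sketch of such a fault-handle
thread follows; everything here is hypothetical (the wrprotect-only
registration, the record read from the fd and the resolve command do not
exist in the current patchset), it only sketches the read-fault /
copy-page / drop-wrprotect loop:

#include <stdint.h>
#include <unistd.h>

struct uffd_wp_fault { uint64_t addr; };	/* made-up read() record */
struct uffd_wp_resolve { uint64_t addr; };	/* made-up write() command */

/* provided elsewhere by the snapshot code (hypothetical helper) */
extern void save_page_to_snapshot_buffer(void *page);

static void *fault_handle_thread(void *arg)
{
	int ufd = *(int *)arg;
	struct uffd_wp_fault fault;

	while (read(ufd, &fault, sizeof(fault)) == sizeof(fault)) {
		/* (7) save the still-clean page contents before the guest dirties it */
		save_page_to_snapshot_buffer((void *)(unsigned long)fault.addr);

		/* (8) lift the write protection so the faulting vcpu can continue */
		struct uffd_wp_resolve done = { .addr = fault.addr };
		write(ufd, &done, sizeof(done));
	}
	return NULL;
}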

Also, I think this feature will benefit the migration of ivshmem and vhost-scsi,
which have no dirty-page tracking now.

> The question then is how you mark the memory readonly to let the
> wrprotect faults trap if the memory already existed and you didn't map
> it yourself in the guest with mcopy_atomic with a readonly flag.
>
> My current plan would be:
>
> - keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
> fast path check in the not-present and wrprotect page fault
>
> - if VM_USERFAULT is set, find if there's a userfaultfd registered
> into that vma too
>
> if yes engage userfaultfd protocol
>
> otherwise raise SIGBUS (single threaded apps should be fine with
> SIGBUS and it'll avoid them to spawn a thread in order to talk the
> userfaultfd protocol)
>
> - if userfaultfd protocol is engaged, return read|write fault + fault
> address to read(ufd) syscalls
>
> - leave the "userfault" resolution mechanism independent of the
> userfaultfd protocol so we keep the two problems separated and we
> don't mix them in the same API which makes it even harder to
> finalize it.
>
> add mcopy_atomic (with a flag to map the page readonly too)
>
> The alternative would be to hide mcopy_atomic (and even
> remap_anon_pages in order to "remove" the memory atomically for
> the externalization into the cloud) as userfaultfd commands to
> write into the fd. But then there would be no much point to keep
> MADV_USERFAULT around if I do so and I could just remove it
> too or it doesn't look clean having to open the userfaultfd just
> to issue an hidden mcopy_atomic.
>
> So it becomes a decision if the basic SIGBUS mode for single
> threaded apps should be supported or not. As long as we support
> SIGBUS too and we don't force to use userfaultfd as the only
> mechanism to be notified about userfaults, having a separate
> mcopy_atomic syscall sounds cleaner.
>
> Perhaps mcopy_atomic could be used in other cases that may arise
> later that may not be connected with the userfault.
>
> Questions to double check the above plan is ok:
>
> 1) should I drop the SIGBUS behavior and MADV_USERFAULT?
>
> 2) should I hide mcopy_atomic as a write into the userfaultfd?
>
> NOTE: even if I hide mcopy_atomic as a userfaultfd command to write
> into the fd, the buffer pointer passed to write() syscall would
> still _not_ be pointing to the data like a regular write, but it
> would be a pointer to a command structure that points to the source
> and destination data of the "hidden" mcopy_atomic, the only
> advantage is that perhaps I could wakeup the blocked page faults
> without requiring an additional syscall.
>
> The standalone mcopy_atomic would still require a write into the
> userfaultfd as it happens now after remap_anon_pages returns, in
> order to wakeup the stopped page faults.
>
> 3) should I add a registration command to trap only write faults?
>

Sure, that is what I really need;)


Best Regards,
zhanghailiang

> The protocol can always be extended later anyway in a backwards
> compatible way but it's better if we get it fully featured from the
> start.
>
> For completeness, some answers for other questions I've seen floating
> around but that weren't posted on the list yet (you can skip reading
> the below part if not interested):
>
> - open("/dev/userfault") instead of sys_userfaultfd(), I don't see the
> benefit: userfaultfd is just like eventfd in terms of kernel API and
> registering a /dev/ device actually sounds trickier. userfault is a
> core VM feature and generally we prefer syscalls for core VM
> features instead of running ioctl on some chardev that may or may
> not exist. (like we did with /dev/ksm -> MADV_MERGEABLE)
>
> - there was a suggestion during KVMForum about allowing an external
> program to attach to any MM. Like ptrace. So you could have a single
> process managing all userfaults for different processes. However
> because I cannot allow multiple userfaultfd to register into the
> same range, this doesn't look very reliable (ptrace is kind of an
> optional/debug feature while if userfault goes wrong and returns
> -EBUSY things go bad) and there may be other complications. If I'd
> allow multiple userfaultfd to register into the same range, I
> wouldn't even know who to deliver the userfault to. It is an erratic
> behavior. Currently it'd return -EBUSY if the app has a bug and does
> that, but maybe later this can be relaxed to allow higher
> scalability with a flag (userfaultfd gets flags as parameters), but
> it still would need to be the same logic that manages userfaults and
> the only point of allowing multiple ufd to map the same range would
> be SMP scalability. So I tend to see the userfaultfd as a MM local
> thing. The thread managing the userfaults can still talk with
> another process in the local machine using pipes or sockets if it
> needs to.
>
> - the userfaultfd protocol version handshake was done this way because
> it looked more reliable.
>
> Of course we could pass the version of the protocol as parameter to
> userfaultfd too, but running the syscall multiple times until
> -EPROTO didn't return anymore doesn't seem any better than writing
> into the fd the wanted protocol until you read it back instead of
> -1ULL. It just looked more reliable not having to run the syscall
> again and again while depending on -EPROTO or some other
> -Esomething.
>
> Thanks,
> Andrea
>
> .
>

2014-10-30 12:50:49

by Dr. David Alan Gilbert

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

* zhanghailiang ([email protected]) wrote:
> On 2014/10/30 1:46, Andrea Arcangeli wrote:
> >Hi Zhanghailiang,
> >
> >On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
> >>Hi Andrea,
> >>
> >>Thanks for your hard work on userfault;)
> >>
> >>This is really a useful API.
> >>
> >>I want to confirm a question:
> >>Can we support distinguishing between writing and reading memory for userfault?
> >>That is, we can decide whether writing a page, reading a page or both trigger userfault.
> >>
> >>I think this will help supporting vhost-scsi,ivshmem for migration,
> >>we can trace dirty page in userspace.
> >>
> >>Actually, i'm trying to relize live memory snapshot based on pre-copy and userfault,
> >>but reading memory from migration thread will also trigger userfault.
> >>It will be easy to implement live memory snapshot, if we support configuring
> >>userfault for writing memory only.
> >
> >Mail is going to be long enough already so I'll just assume tracking
> >dirty memory in userland (instead of doing it in kernel) is worthy
> >feature to have here.
> >
> >After some chat during the KVMForum I've been already thinking it
> >could be beneficial for some usage to give userland the information
> >about the fault being read or write, combined with the ability of
> >mapping pages wrprotected to mcopy_atomic (that would work without
> >false positives only with MADV_DONTFORK also set, but it's already set
> >in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
> >checked also in the wrprotect faults, not just in the not present
> >faults, but it's not a massive change. Returning the read/write
> >information is also a not massive change. This will then payoff mostly
> >if there's also a way to remove the memory atomically (kind of
> >remap_anon_pages).
> >
> >Would that be enough? I mean are you still ok if non present read
> >fault traps too (you'd be notified it's a read) and you get
> >notification for both wrprotect and non present faults?
> >
> Hi Andrea,
>
> Thanks for your reply, and your patience;)
>
> Er, maybe i didn't describe clearly. What i really need for live memory snapshot
> is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing write action*.
>
> My initial solution scheme for live memory snapshot is:
> (1) pause VM
> (2) using userfaultfd to mark all memory of VM is wrprotect (readonly)
> (3) save deivce state to snapshot file
> (4) resume VM
> (5) snapshot thread begin to save page of memory to snapshot file
> (6) VM is going to run, and it is OK for VM or other thread to read ram (no fault trap),
> but if VM try to write page (dirty the page), there will be
> a userfault trap notification.
> (7) a fault-handle-thread reads the page request from userfaultfd,
> it will copy content of the page to some buffers, and then remove the page's
> wrprotect limit(still using the userfaultfd to tell kernel).
> (8) after step (7), VM can continue to write the page which is now can be write.
> (9) snapshot thread save the page cached in step (7)
> (10) repeat step (5)~(9) until all VM's memory is saved to snapshot file.

Hmm, I can see the same process being useful for fault-tolerance schemes
like COLO, which need a memory state snapshot.

> So, what i need for userfault is supporting only wrprotect fault. i don't
> want to get notification for non present reading faults, it will influence
> VM's performance and the efficiency of doing snapshot.

What pages would be non-present at this point - just balloon?

Dave

> Also, i think this feature will benefit for migration of ivshmem and vhost-scsi
> which have no dirty-page-tracing now.
>
> >The question then is how you mark the memory readonly to let the
> >wrprotect faults trap if the memory already existed and you didn't map
> >it yourself in the guest with mcopy_atomic with a readonly flag.
> >
> >My current plan would be:
> >
> >- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
> > fast path check in the not-present and wrprotect page fault
> >
> >- if VM_USERFAULT is set, find if there's a userfaultfd registered
> > into that vma too
> >
> > if yes engage userfaultfd protocol
> >
> > otherwise raise SIGBUS (single threaded apps should be fine with
> > SIGBUS and it'll avoid them to spawn a thread in order to talk the
> > userfaultfd protocol)
> >
> >- if userfaultfd protocol is engaged, return read|write fault + fault
> > address to read(ufd) syscalls
> >
> >- leave the "userfault" resolution mechanism independent of the
> > userfaultfd protocol so we keep the two problems separated and we
> > don't mix them in the same API which makes it even harder to
> > finalize it.
> >
> > add mcopy_atomic (with a flag to map the page readonly too)
> >
> > The alternative would be to hide mcopy_atomic (and even
> > remap_anon_pages in order to "remove" the memory atomically for
> > the externalization into the cloud) as userfaultfd commands to
> > write into the fd. But then there would be no much point to keep
> > MADV_USERFAULT around if I do so and I could just remove it
> > too or it doesn't look clean having to open the userfaultfd just
> > to issue an hidden mcopy_atomic.
> >
> > So it becomes a decision if the basic SIGBUS mode for single
> > threaded apps should be supported or not. As long as we support
> > SIGBUS too and we don't force to use userfaultfd as the only
> > mechanism to be notified about userfaults, having a separate
> > mcopy_atomic syscall sounds cleaner.
> >
> > Perhaps mcopy_atomic could be used in other cases that may arise
> > later that may not be connected with the userfault.
> >
> >Questions to double check the above plan is ok:
> >
> >1) should I drop the SIGBUS behavior and MADV_USERFAULT?
> >
> >2) should I hide mcopy_atomic as a write into the userfaultfd?
> >
> > NOTE: even if I hide mcopy_atomic as a userfaultfd command to write
> > into the fd, the buffer pointer passed to write() syscall would
> > still _not_ be pointing to the data like a regular write, but it
> > would be a pointer to a command structure that points to the source
> > and destination data of the "hidden" mcopy_atomic, the only
> > advantage is that perhaps I could wakeup the blocked page faults
> > without requiring an additional syscall.
> >
> > The standalone mcopy_atomic would still require a write into the
> > userfaultfd as it happens now after remap_anon_pages returns, in
> > order to wakeup the stopped page faults.
> >
> >3) should I add a registration command to trap only write faults?
> >
>
> Sure, that is what i really need;)
>
>
> Best Regards,
> zhanghailiang
>
> > The protocol can always be extended later anyway in a backwards
> > compatible way but it's better if we get it fully featured from the
> > start.
> >
> >For completeness, some answers for other questions I've seen floating
> >around but that weren't posted on the list yet (you can skip reading
> >the below part if not interested):
> >
> >- open("/dev/userfault") instead of sys_userfaultfd(), I don't see the
> > benefit: userfaultfd is just like eventfd in terms of kernel API and
> > registering a /dev/ device actually sounds trickier. userfault is a
> > core VM feature and generally we prefer syscalls for core VM
> > features instead of running ioctl on some chardev that may or may
> > not exist. (like we did with /dev/ksm -> MADV_MERGEABLE)
> >
> >- there was a suggestion during KVMForum about allowing an external
> > program to attach to any MM. Like ptrace. So you could have a single
> > process managing all userfaults for different processes. However
> > because I cannot allow multiple userfaultfd to register into the
> > same range, this doesn't look very reliable (ptrace is kind of an
> > optional/debug feature while if userfault goes wrong and returns
> > -EBUSY things go bad) and there may be other complications. If I'd
> > allow multiple userfaultfd to register into the same range, I
> > wouldn't even know who to deliver the userfault to. It is an erratic
> > behavior. Currently it'd return -EBUSY if the app has a bug and does
> > that, but maybe later this can be relaxed to allow higher
> > scalability with a flag (userfaultfd gets flags as parameters), but
> > it still would need to be the same logic that manages userfaults and
> > the only point of allowing multiple ufd to map the same range would
> > be SMP scalability. So I tend to see the userfaultfd as a MM local
> > thing. The thread managing the userfaults can still talk with
> > another process in the local machine using pipes or sockets if it
> > needs to.
> >
> >- the userfaultfd protocol version handshake was done this way because
> > it looked more reliable.
> >
> > Of course we could pass the version of the protocol as parameter to
> > userfaultfd too, but running the syscall multiple times until
> > -EPROTO didn't return anymore doesn't seem any better than writing
> > into the fd the wanted protocol until you read it back instead of
> > -1ULL. It just looked more reliable not having to run the syscall
> > again and again while depending on -EPROTO or some other
> > -Esomething.
> >
> >Thanks,
> >Andrea
> >
> >.
> >
>
>
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK

2014-10-31 01:28:48

by Hailiang Zhang

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

On 2014/10/30 20:49, Dr. David Alan Gilbert wrote:
> * zhanghailiang ([email protected]) wrote:
>> On 2014/10/30 1:46, Andrea Arcangeli wrote:
>>> Hi Zhanghailiang,
>>>
>>> On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
>>>> Hi Andrea,
>>>>
>>>> Thanks for your hard work on userfault;)
>>>>
>>>> This is really a useful API.
>>>>
>>>> I want to confirm a question:
>>>> Can we support distinguishing between writing and reading memory for userfault?
>>>> That is, we can decide whether writing a page, reading a page or both trigger userfault.
>>>>
>>>> I think this will help supporting vhost-scsi,ivshmem for migration,
>>>> we can trace dirty page in userspace.
>>>>
>>>> Actually, i'm trying to relize live memory snapshot based on pre-copy and userfault,
>>>> but reading memory from migration thread will also trigger userfault.
>>>> It will be easy to implement live memory snapshot, if we support configuring
>>>> userfault for writing memory only.
>>>
>>> Mail is going to be long enough already so I'll just assume tracking
>>> dirty memory in userland (instead of doing it in kernel) is worthy
>>> feature to have here.
>>>
>>> After some chat during the KVMForum I've been already thinking it
>>> could be beneficial for some usage to give userland the information
>>> about the fault being read or write, combined with the ability of
>>> mapping pages wrprotected to mcopy_atomic (that would work without
>>> false positives only with MADV_DONTFORK also set, but it's already set
>>> in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
>>> checked also in the wrprotect faults, not just in the not present
>>> faults, but it's not a massive change. Returning the read/write
>>> information is also a not massive change. This will then payoff mostly
>>> if there's also a way to remove the memory atomically (kind of
>>> remap_anon_pages).
>>>
>>> Would that be enough? I mean are you still ok if non present read
>>> fault traps too (you'd be notified it's a read) and you get
>>> notification for both wrprotect and non present faults?
>>>
>> Hi Andrea,
>>
>> Thanks for your reply, and your patience;)
>>
>> Er, maybe i didn't describe clearly. What i really need for live memory snapshot
>> is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing write action*.
>>
>> My initial solution scheme for live memory snapshot is:
>> (1) pause VM
>> (2) using userfaultfd to mark all memory of VM is wrprotect (readonly)
>> (3) save deivce state to snapshot file
>> (4) resume VM
>> (5) snapshot thread begin to save page of memory to snapshot file
>> (6) VM is going to run, and it is OK for VM or other thread to read ram (no fault trap),
>> but if VM try to write page (dirty the page), there will be
>> a userfault trap notification.
>> (7) a fault-handle-thread reads the page request from userfaultfd,
>> it will copy content of the page to some buffers, and then remove the page's
>> wrprotect limit(still using the userfaultfd to tell kernel).
>> (8) after step (7), VM can continue to write the page which is now can be write.
>> (9) snapshot thread save the page cached in step (7)
>> (10) repeat step (5)~(9) until all VM's memory is saved to snapshot file.
>
> Hmm, I can see the same process being useful for the fault-tolerance schemes
> like COLO, it needs a memory state snapshot.
>
>> So, what i need for userfault is supporting only wrprotect fault. i don't
>> want to get notification for non present reading faults, it will influence
>> VM's performance and the efficiency of doing snapshot.
>
> What pages would be non-present at this point - just balloon?
>

Er, sorry, it should be 'non-present page faults';)

> Dave
>
>> Also, i think this feature will benefit for migration of ivshmem and vhost-scsi
>> which have no dirty-page-tracing now.
>>
>>> The question then is how you mark the memory readonly to let the
>>> wrprotect faults trap if the memory already existed and you didn't map
>>> it yourself in the guest with mcopy_atomic with a readonly flag.
>>>
>>> My current plan would be:
>>>
>>> - keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
>>> fast path check in the not-present and wrprotect page fault
>>>
>>> - if VM_USERFAULT is set, find if there's a userfaultfd registered
>>> into that vma too
>>>
>>> if yes engage userfaultfd protocol
>>>
>>> otherwise raise SIGBUS (single threaded apps should be fine with
>>> SIGBUS and it'll avoid them to spawn a thread in order to talk the
>>> userfaultfd protocol)
>>>
>>> - if userfaultfd protocol is engaged, return read|write fault + fault
>>> address to read(ufd) syscalls
>>>
>>> - leave the "userfault" resolution mechanism independent of the
>>> userfaultfd protocol so we keep the two problems separated and we
>>> don't mix them in the same API which makes it even harder to
>>> finalize it.
>>>
>>> add mcopy_atomic (with a flag to map the page readonly too)
>>>
>>> The alternative would be to hide mcopy_atomic (and even
>>> remap_anon_pages in order to "remove" the memory atomically for
>>> the externalization into the cloud) as userfaultfd commands to
>>> write into the fd. But then there would be no much point to keep
>>> MADV_USERFAULT around if I do so and I could just remove it
>>> too or it doesn't look clean having to open the userfaultfd just
>>> to issue an hidden mcopy_atomic.
>>>
>>> So it becomes a decision if the basic SIGBUS mode for single
>>> threaded apps should be supported or not. As long as we support
>>> SIGBUS too and we don't force to use userfaultfd as the only
>>> mechanism to be notified about userfaults, having a separate
>>> mcopy_atomic syscall sounds cleaner.
>>>
>>> Perhaps mcopy_atomic could be used in other cases that may arise
>>> later that may not be connected with the userfault.
>>>
>>> Questions to double check the above plan is ok:
>>>
>>> 1) should I drop the SIGBUS behavior and MADV_USERFAULT?
>>>
>>> 2) should I hide mcopy_atomic as a write into the userfaultfd?
>>>
>>> NOTE: even if I hide mcopy_atomic as a userfaultfd command to write
>>> into the fd, the buffer pointer passed to write() syscall would
>>> still _not_ be pointing to the data like a regular write, but it
>>> would be a pointer to a command structure that points to the source
>>> and destination data of the "hidden" mcopy_atomic, the only
>>> advantage is that perhaps I could wakeup the blocked page faults
>>> without requiring an additional syscall.
>>>
>>> The standalone mcopy_atomic would still require a write into the
>>> userfaultfd as it happens now after remap_anon_pages returns, in
>>> order to wakeup the stopped page faults.
>>>
>>> 3) should I add a registration command to trap only write faults?
>>>
>>
>> Sure, that is what i really need;)
>>
>>
>> Best Regards,
>> zhanghailiang
>>
>>> The protocol can always be extended later anyway in a backwards
>>> compatible way but it's better if we get it fully featured from the
>>> start.
>>>
>>> For completeness, some answers for other questions I've seen floating
>>> around but that weren't posted on the list yet (you can skip reading
>>> the below part if not interested):
>>>
>>> - open("/dev/userfault") instead of sys_userfaultfd(), I don't see the
>>> benefit: userfaultfd is just like eventfd in terms of kernel API and
>>> registering a /dev/ device actually sounds trickier. userfault is a
>>> core VM feature and generally we prefer syscalls for core VM
>>> features instead of running ioctl on some chardev that may or may
>>> not exist. (like we did with /dev/ksm -> MADV_MERGEABLE)
>>>
>>> - there was a suggestion during KVMForum about allowing an external
>>> program to attach to any MM. Like ptrace. So you could have a single
>>> process managing all userfaults for different processes. However
>>> because I cannot allow multiple userfaultfd to register into the
>>> same range, this doesn't look very reliable (ptrace is kind of an
>>> optional/debug feature while if userfault goes wrong and returns
>>> -EBUSY things go bad) and there may be other complications. If I'd
>>> allow multiple userfaultfd to register into the same range, I
>>> wouldn't even know who to deliver the userfault to. It is an erratic
>>> behavior. Currently it'd return -EBUSY if the app has a bug and does
>>> that, but maybe later this can be relaxed to allow higher
>>> scalability with a flag (userfaultfd gets flags as parameters), but
>>> it still would need to be the same logic that manages userfaults and
>>> the only point of allowing multiple ufd to map the same range would
>>> be SMP scalability. So I tend to see the userfaultfd as a MM local
>>> thing. The thread managing the userfaults can still talk with
>>> another process in the local machine using pipes or sockets if it
>>> needs to.
>>>
>>> - the userfaultfd protocol version handshake was done this way because
>>> it looked more reliable.
>>>
>>> Of course we could pass the version of the protocol as parameter to
>>> userfaultfd too, but running the syscall multiple times until
>>> -EPROTO didn't return anymore doesn't seem any better than writing
>>> into the fd the wanted protocol until you read it back instead of
>>> -1ULL. It just looked more reliable not having to run the syscall
>>> again and again while depending on -EPROTO or some other
>>> -Esomething.
>>>
>>> Thanks,
>>> Andrea
>>>
>>> .
>>>
>>
>>
> --
> Dr. David Alan Gilbert / [email protected] / Manchester, UK
>
> .
>

2014-10-31 02:23:34

by Peter Feiner

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote:
> On 2014/10/30 1:46, Andrea Arcangeli wrote:
> >On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
> >>I want to confirm a question:
> >>Can we support distinguishing between writing and reading memory for userfault?
> >>That is, we can decide whether writing a page, reading a page or both trigger userfault.
> >Mail is going to be long enough already so I'll just assume tracking
> >dirty memory in userland (instead of doing it in kernel) is worthy
> >feature to have here.

I'll open that can of worms :-)

> [...]
> Er, maybe i didn't describe clearly. What i really need for live memory snapshot
> is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing write action*.
>
> So, what i need for userfault is supporting only wrprotect fault. i don't
> want to get notification for non present reading faults, it will influence
> VM's performance and the efficiency of doing snapshot.

Given that you do care about performance, Zhanghailiang, I don't think that a
userfault handler is a good place to track dirty memory. Every dirtying write
will block on the userfault handler, which is an expensively slow proposition
compared to an in-kernel approach.

> Also, i think this feature will benefit for migration of ivshmem and vhost-scsi
> which have no dirty-page-tracing now.

I do agree wholeheartedly with you here. Manually tracking non-guest writes
adds to the complexity of device emulation code. A central fault-driven means
for dirty tracking writes from the guest and host would be a welcome
simplification to implementing pre-copy migration. Indeed, that's exactly what
I'm working on! I'm using the softdirty bit, which was introduced recently for
CRIU migration, to replace the use of KVM's dirty logging and manual dirty
tracking by the VMM during pre-copy migration. See
Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. To
make softdirty usable for live migration, I've added an API to atomically
test-and-clear the bit and write protect the page.
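(For reference, a minimal sketch of the existing, non-atomic soft-dirty
interface from Documentation/vm/soft-dirty.txt and pagemap.txt; the atomic
test-and-clear plus write-protect API mentioned above is new work and not
shown here.)

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* bit 55 of a pagemap entry is the soft-dirty bit */
static int page_soft_dirty(int pagemap_fd, unsigned long vaddr,
			   unsigned long page_size)
{
	uint64_t entry;
	off_t off = (off_t)(vaddr / page_size) * sizeof(entry);

	if (pread(pagemap_fd, &entry, sizeof(entry), off) != sizeof(entry))
		return -1;
	return (entry >> 55) & 1;
}

/* writing "4" to /proc/<pid>/clear_refs clears the soft-dirty bits */
static int clear_soft_dirty(pid_t pid)
{
	char path[64];
	int fd, ret;

	snprintf(path, sizeof(path), "/proc/%d/clear_refs", (int)pid);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ret = (write(fd, "4", 1) == 1) ? 0 : -1;
	close(fd);
	return ret;
}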

2014-10-31 03:32:46

by Hailiang Zhang

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

On 2014/10/31 10:23, Peter Feiner wrote:
> On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote:
>> On 2014/10/30 1:46, Andrea Arcangeli wrote:
>>> On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
>>>> I want to confirm a question:
>>>> Can we support distinguishing between writing and reading memory for userfault?
>>>> That is, we can decide whether writing a page, reading a page or both trigger userfault.
>>> Mail is going to be long enough already so I'll just assume tracking
>>> dirty memory in userland (instead of doing it in kernel) is worthy
>>> feature to have here.
>
> I'll open that can of worms :-)
>
>> [...]
>> Er, maybe i didn't describe clearly. What i really need for live memory snapshot
>> is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing write action*.
>>
>> So, what i need for userfault is supporting only wrprotect fault. i don't
>> want to get notification for non present reading faults, it will influence
>> VM's performance and the efficiency of doing snapshot.
>
> Given that you do care about performance Zhanghailiang, I don't think that a
> userfault handler is a good place to track dirty memory. Every dirtying write
> will block on the userfault handler, which is an expensively slow proposition
> compared to an in-kernel approach.
>

Agreed, but for doing a live memory snapshot (the VM is running while we do the snapshot),
we have to do this (block the write action), because we have to save the page before it
is dirtied by the write. This is the difference compared to pre-copy migration.

>> Also, i think this feature will benefit for migration of ivshmem and vhost-scsi
>> which have no dirty-page-tracing now.
>
> I do agree wholeheartedly with you here. Manually tracking non-guest writes
> adds to the complexity of device emulation code. A central fault-driven means
> for dirty tracking writes from the guest and host would be a welcome
> simplification to implementing pre-copy migration. Indeed, that's exactly what
> I'm working on! I'm using the softdirty bit, which was introduced recently for
> CRIU migration, to replace the use of KVM's dirty logging and manual dirty
> tracking by the VMM during pre-copy migration. See

Great! Do you plan to submit your patches to the community? I mean, is your work based on
qemu, or on an independent tool (CRIU migration?) for live migration?
Maybe I could fix the migration problem for ivshmem in qemu now,
based on the softdirty mechanism.

> Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. To

I have read them cursorily; soft-dirty is indeed useful for pre-copy, but it seems that
it cannot meet my need for snapshot.

> make softdirty usable for live migration, I've added an API to atomically
> test-and-clear the bit and write protect the page.

Where can I find the API? Has it been merged into the kernel's master branch already?


Thanks,
zhanghailiang

2014-10-31 04:41:15

by Hailiang Zhang

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

On 2014/10/31 11:29, zhanghailiang wrote:
> On 2014/10/31 10:23, Peter Feiner wrote:
>> On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote:
>>> On 2014/10/30 1:46, Andrea Arcangeli wrote:
>>>> On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
>>>>> I want to confirm a question:
>>>>> Can we support distinguishing between writing and reading memory for userfault?
>>>>> That is, we can decide whether writing a page, reading a page or both trigger userfault.
>>>> Mail is going to be long enough already so I'll just assume tracking
>>>> dirty memory in userland (instead of doing it in kernel) is worthy
>>>> feature to have here.
>>
>> I'll open that can of worms :-)
>>
>>> [...]
>>> Er, maybe i didn't describe clearly. What i really need for live memory snapshot
>>> is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing write action*.
>>>
>>> So, what i need for userfault is supporting only wrprotect fault. i don't
>>> want to get notification for non present reading faults, it will influence
>>> VM's performance and the efficiency of doing snapshot.
>>
>> Given that you do care about performance Zhanghailiang, I don't think that a
>> userfault handler is a good place to track dirty memory. Every dirtying write
>> will block on the userfault handler, which is an expensively slow proposition
>> compared to an in-kernel approach.
>>
>
> Agreed, but for doing live memory snapshot (VM is running when do snapsphot),
> we have to do this (block the write action), because we have to save the page before it
> is dirtied by writing action. This is the difference, compared to pre-copy migration.
>

Again;) For snapshot, I don't use its dirty tracking ability; I just use it to block the write action,
save the page, and then remove the write protection.

>>> Also, i think this feature will benefit for migration of ivshmem and vhost-scsi
>>> which have no dirty-page-tracing now.
>>
>> I do agree wholeheartedly with you here. Manually tracking non-guest writes
>> adds to the complexity of device emulation code. A central fault-driven means
>> for dirty tracking writes from the guest and host would be a welcome
>> simplification to implementing pre-copy migration. Indeed, that's exactly what
>> I'm working on! I'm using the softdirty bit, which was introduced recently for
>> CRIU migration, to replace the use of KVM's dirty logging and manual dirty
>> tracking by the VMM during pre-copy migration. See
>
> Great! Do you plan to issue your patches to community? I mean is your work based on
> qemu? or an independent tool (CRIU migration?) for live-migration?
> Maybe i could fix the migration problem for ivshmem in qemu now,
> based on softdirty mechanism.
>
>> Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. To
>
> I have read them cursorily, it is useful for pre-copy indeed. But it seems that
> it can not meet my need for snapshot.
>
>> make softdirty usable for live migration, I've added an API to atomically
>> test-and-clear the bit and write protect the page.
>
> How can i find the API? Is it been merged in kernel's master branch already?
>
>
> Thanks,
> zhanghailiang
>
> .
>

2014-10-31 05:17:50

by Andres Lagar-Cavilla

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

On Thu, Oct 30, 2014 at 9:38 PM, zhanghailiang
<[email protected]> wrote:
> On 2014/10/31 11:29, zhanghailiang wrote:
>>
>> On 2014/10/31 10:23, Peter Feiner wrote:
>>>
>>> On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote:
>>>>
>>>> On 2014/10/30 1:46, Andrea Arcangeli wrote:
>>>>>
>>>>> On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
>>>>>>
>>>>>> I want to confirm a question:
>>>>>> Can we support distinguishing between writing and reading memory for
>>>>>> userfault?
>>>>>> That is, we can decide whether writing a page, reading a page or both
>>>>>> trigger userfault.
>>>>>
>>>>> Mail is going to be long enough already so I'll just assume tracking
>>>>> dirty memory in userland (instead of doing it in kernel) is worthy
>>>>> feature to have here.
>>>
>>>
>>> I'll open that can of worms :-)
>>>
>>>> [...]
>>>> Er, maybe i didn't describe clearly. What i really need for live memory
>>>> snapshot
>>>> is only wrprotect fault, like kvm's dirty tracing mechanism, *only
>>>> tracing write action*.
>>>>
>>>> So, what i need for userfault is supporting only wrprotect fault. i
>>>> don't
>>>> want to get notification for non present reading faults, it will
>>>> influence
>>>> VM's performance and the efficiency of doing snapshot.
>>>
>>>
>>> Given that you do care about performance Zhanghailiang, I don't think
>>> that a
>>> userfault handler is a good place to track dirty memory. Every dirtying
>>> write
>>> will block on the userfault handler, which is an expensively slow
>>> proposition
>>> compared to an in-kernel approach.
>>>
>>
>> Agreed, but for doing live memory snapshot (VM is running when do
>> snapsphot),
>> we have to do this (block the write action), because we have to save the
>> page before it
>> is dirtied by writing action. This is the difference, compared to pre-copy
>> migration.
>>
>
> Again;) For snapshot, i don't use its dirty tracing ability, i just use it
> to block write action,
> and save page, and then i will remove its write protect.

You could do a CoW in the kernel, post a notification, keep going, and
expose an interface for user-space to mmap the preserved copy. Getting
the life-cycle of the preserved page(s) right is tricky, but doable.
Anyway, it's easy to hand-wave without knowing your specific
requirements.

Opening the discussion a bit, this does look similar to the xen-access
interface, in which a xen domain vcpu could be stopped in its tracks
while user-space was notified of (and acknowledged) a variety of
scenarios: the page was written to, the page was read from, the vcpu is
attempting to execute from the page, etc. Very applicable to anti-viruses
right away; for example you can enforce W^X properties on pages.

I don't know that Andrea wants to open the game so broadly for
userfault, and the code right now is very specific to triggering on
pte_none(), but that's a nice reward down this road.

Andres

>
>>>> Also, i think this feature will benefit for migration of ivshmem and
>>>> vhost-scsi
>>>> which have no dirty-page-tracing now.
>>>
>>>
>>> I do agree wholeheartedly with you here. Manually tracking non-guest
>>> writes
>>> adds to the complexity of device emulation code. A central fault-driven
>>> means
>>> for dirty tracking writes from the guest and host would be a welcome
>>> simplification to implementing pre-copy migration. Indeed, that's exactly
>>> what
>>> I'm working on! I'm using the softdirty bit, which was introduced
>>> recently for
>>> CRIU migration, to replace the use of KVM's dirty logging and manual
>>> dirty
>>> tracking by the VMM during pre-copy migration. See
>>
>>
>> Great! Do you plan to issue your patches to community? I mean is your work
>> based on
>> qemu? or an independent tool (CRIU migration?) for live-migration?
>> Maybe i could fix the migration problem for ivshmem in qemu now,
>> based on softdirty mechanism.
>>
>>> Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't
>>> familiar. To
>>
>>
>> I have read them cursorily, it is useful for pre-copy indeed. But it seems
>> that
>> it can not meet my need for snapshot.
>>
>>> make softdirty usable for live migration, I've added an API to atomically
>>> test-and-clear the bit and write protect the page.
>>
>>
>> How can i find the API? Is it been merged in kernel's master branch
>> already?
>>
>>
>> Thanks,
>> zhanghailiang
>>
>> .
>>
>



--
Andres Lagar-Cavilla | Google Kernel Team | [email protected]

2014-10-31 08:12:58

by Hailiang Zhang

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

On 2014/10/31 13:17, Andres Lagar-Cavilla wrote:
> On Thu, Oct 30, 2014 at 9:38 PM, zhanghailiang
> <[email protected]> wrote:
>> On 2014/10/31 11:29, zhanghailiang wrote:
>>>
>>> On 2014/10/31 10:23, Peter Feiner wrote:
>>>>
>>>> On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote:
>>>>>
>>>>> On 2014/10/30 1:46, Andrea Arcangeli wrote:
>>>>>>
>>>>>> On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
>>>>>>>
>>>>>>> I want to confirm a question:
>>>>>>> Can we support distinguishing between writing and reading memory for
>>>>>>> userfault?
>>>>>>> That is, we can decide whether writing a page, reading a page or both
>>>>>>> trigger userfault.
>>>>>>
>>>>>> Mail is going to be long enough already so I'll just assume tracking
>>>>>> dirty memory in userland (instead of doing it in kernel) is worthy
>>>>>> feature to have here.
>>>>
>>>>
>>>> I'll open that can of worms :-)
>>>>
>>>>> [...]
>>>>> Er, maybe i didn't describe clearly. What i really need for live memory
>>>>> snapshot
>>>>> is only wrprotect fault, like kvm's dirty tracing mechanism, *only
>>>>> tracing write action*.
>>>>>
>>>>> So, what i need for userfault is supporting only wrprotect fault. i
>>>>> don't
>>>>> want to get notification for non present reading faults, it will
>>>>> influence
>>>>> VM's performance and the efficiency of doing snapshot.
>>>>
>>>>
>>>> Given that you do care about performance Zhanghailiang, I don't think
>>>> that a
>>>> userfault handler is a good place to track dirty memory. Every dirtying
>>>> write
>>>> will block on the userfault handler, which is an expensively slow
>>>> proposition
>>>> compared to an in-kernel approach.
>>>>
>>>
>>> Agreed, but for doing live memory snapshot (VM is running when do
>>> snapsphot),
>>> we have to do this (block the write action), because we have to save the
>>> page before it
>>> is dirtied by writing action. This is the difference, compared to pre-copy
>>> migration.
>>>
>>
>> Again;) For snapshot, i don't use its dirty tracing ability, i just use it
>> to block write action,
>> and save page, and then i will remove its write protect.
>
> You could do a CoW in the kernel, post a notification, keep going, and
> expose an interface for user-space to mmap the preserved copy. Getting
> the life-cycle of the preserved page(s) right is tricky, but doable.
> Anyway, it's easy to hand-wave without knowing your specific
> requirements.
>

Yes, what I need is very much like a user-space COW feature, but I don't want to modify
any KVM code to realize COW; userfault is a more generic and more graceful way.
Besides, I'm not an expert in the kernel:(

> Opening the discussion a bit, this does look similar to the xen-access
> interface, in which a xen domain vcpu could be stopped in its tracks

Right;)

> while user-space was notified (and acknowledged) a variety of
> scenarios: page was written to, page was read from, vcpu is attempting
> to execute from page, etc. Very applicable to anti-viruses right away,
> for example you can enforce W^X properties on pages.
>
> I don't know that Andrea wants to open the game so broadly for
> userfault, and the code right now is very specific to triggering on
> pte_none(), but that's a nice reward down this road.
>

I hope he will consider it. IMHO, it is a good extension for userfault
(write fault);)

Best Regards,
zhanghailiang

>>
>>>>> Also, i think this feature will benefit for migration of ivshmem and
>>>>> vhost-scsi
>>>>> which have no dirty-page-tracing now.
>>>>
>>>>
>>>> I do agree wholeheartedly with you here. Manually tracking non-guest
>>>> writes
>>>> adds to the complexity of device emulation code. A central fault-driven
>>>> means
>>>> for dirty tracking writes from the guest and host would be a welcome
>>>> simplification to implementing pre-copy migration. Indeed, that's exactly
>>>> what
>>>> I'm working on! I'm using the softdirty bit, which was introduced
>>>> recently for
>>>> CRIU migration, to replace the use of KVM's dirty logging and manual
>>>> dirty
>>>> tracking by the VMM during pre-copy migration. See
>>>
>>>
>>> Great! Do you plan to issue your patches to community? I mean is your work
>>> based on
>>> qemu? or an independent tool (CRIU migration?) for live-migration?
>>> Maybe i could fix the migration problem for ivshmem in qemu now,
>>> based on softdirty mechanism.
>>>
>>>> Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't
>>>> familiar. To
>>>
>>>
>>> I have read them cursorily, it is useful for pre-copy indeed. But it seems
>>> that
>>> it can not meet my need for snapshot.
>>>
>>>> make softdirty usable for live migration, I've added an API to atomically
>>>> test-and-clear the bit and write protect the page.
>>>
>>>
>>> How can i find the API? Is it been merged in kernel's master branch
>>> already?
>>>
>>>
>>> Thanks,
>>> zhanghailiang
>>>
>>> .
>>>
>>
>
>
>

2014-10-31 19:39:38

by Peter Feiner

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

On Fri, Oct 31, 2014 at 11:29:49AM +0800, zhanghailiang wrote:
> Agreed, but for doing live memory snapshot (VM is running when do snapsphot),
> we have to do this (block the write action), because we have to save the page before it
> is dirtied by writing action. This is the difference, compared to pre-copy migration.

Ah ha, I understand the difference now. I suppose that you have considered
doing a traditional pre-copy migration (that is, passes over memory saving
dirty pages, followed by a pause and a final dump of remaining dirty pages) to
a file. Your approach has the advantage of having the VM pause time bounded by
the time it takes to handle the userfault and do the write, as opposed to
pre-copy migration which has a pause time bounded by the time it takes to do
the final dump of dirty pages, which, in the worst case, is the time it takes
to dump all of the guest memory!

You could use the old fork & dump trick. Given that the guest's memory is
backed by a private VMA (which, as of a year ago when I last looked, is always
the case for QEMU), you can have the kernel do the write protection for you.
Essentially, you fork Qemu and, in the child process, dump the guest memory
then exit. If the parent (including the guest) writes to guest memory, then it
will fault and the kernel will copy the page.

The fork & dump approach will give you the best performance w.r.t. guest pause
times (i.e., just pausing for the COW fault handler), but it does have the
distinct disadvantage of potentially using 2x the guest memory (i.e., if the
parent process races ahead and writes to all of the pages before you finish the
dump). To mitigate memory copying, you could madvise MADV_DONTNEED the child
memory as you copy it.
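A rough sketch of the fork & dump trick (guest_ram/ram_size/snap_fd are
placeholders, error handling is omitted, and it's meant to be called from a
snapshot thread so the vcpus keep running while the child dumps):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void fork_and_dump(int snap_fd, char *guest_ram, size_t ram_size,
			  size_t page_size)
{
	pid_t pid = fork();

	if (pid == 0) {
		/* child: its private-VMA view of guest RAM is frozen by COW */
		for (size_t off = 0; off < ram_size; off += page_size) {
			write(snap_fd, guest_ram + off, page_size);
			/* drop the copied page to mitigate the 2x memory worst case */
			madvise(guest_ram + off, page_size, MADV_DONTNEED);
		}
		_exit(0);
	}

	/* parent: the guest keeps running; its writes COW against the child's view */
	waitpid(pid, NULL, 0);
}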

> Great! Do you plan to issue your patches to community? I mean is your work based on
> qemu? or an independent tool (CRIU migration?) for live-migration?
> Maybe i could fix the migration problem for ivshmem in qemu now,
> based on softdirty mechanism.

I absolutely plan on releasing these patches :-) CRIU was the first open-source
userland I had planned on integrating with. At Google, I'm working with our
home-grown Qemu replacement. However, I'd be happy to help with an effort to
get softdirty integrated in Qemu in the future.

> >Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. To
>
> I have read them cursorily, it is useful for pre-copy indeed. But it seems that
> it can not meet my need for snapshot.

> >make softdirty usable for live migration, I've added an API to atomically
> >test-and-clear the bit and write protect the page.
>
> How can i find the API? Is it been merged in kernel's master branch already?

Negative. I'll be sure to CC you when I start sending this stuff upstream.

Peter

2014-11-01 08:56:47

by Hailiang Zhang

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

On 2014/11/1 3:39, Peter Feiner wrote:
> On Fri, Oct 31, 2014 at 11:29:49AM +0800, zhanghailiang wrote:
>> Agreed, but for doing live memory snapshot (VM is running when do snapsphot),
>> we have to do this (block the write action), because we have to save the page before it
>> is dirtied by writing action. This is the difference, compared to pre-copy migration.
>
> Ah ha, I understand the difference now. I suppose that you have considered
> doing a traditional pre-copy migration (that is, passes over memory saving
> dirty pages, followed by a pause and a final dump of remaining dirty pages) to
> a file. Your approach has the advantage of having the VM pause time bounded by
> the time it takes to handle the userfault and do the write, as opposed to
> pre-copy migration which has a pause time bounded by the time it takes to do
> the final dump of dirty pages, which, in the worst case, is the time it takes
> to dump all of the guest memory!
>

Right! Strictly speaking, migrating a VM's state into a file (fd) is not a snapshot,
because its point in time is not fixed (it depends on when the migration finishes).
A VM's snapshot should have a fixed point in time: the moment when I fire the snapshot
command.
A snapshot is very much like taking a photo, capturing the VM's state at that moment;)

> You could use the old fork & dump trick. Given that the guest's memory is
> backed by private VMA (as of a year ago when I last looked, is always the case
> for QEMU), you can have the kernel do the write protection for you.
> Essentially, you fork Qemu and, in the child process, dump the guest memory
> then exit. If the parent (including the guest) writes to guest memory, then it
> will fault and the kernel will copy the page.
>

It is difficult to fork the qemu process, which is multi-threaded and holds
all kinds of locks. Actually, this scheme was discussed in the community a long
time ago and was not accepted.

> The fork & dump approach will give you the best performance w.r.t. guest pause
> times (i.e., just pausing for the COW fault handler), but it does have the
> distinct disadvantage of potentially using 2x the guest memory (i.e., if the

Agreed! This is the second reason why the community did not accept it.

> parent process races ahead and writes to all of the pages before you finish the
> dump). To mitigate memory copying, you could madvise MADV_DONTNEED the child
> memory as you copy it.
>

IMHO, the scheme I mentioned in the previous email may be the simplest and
most efficient way, if userfault could support wrprotect-only faults.
We can also do some optimization to reduce the influence on the VM when doing a snapshot,
such as caching the requested pages in a memory buffer, etc.

>> Great! Do you plan to issue your patches to community? I mean is your work based on
>> qemu? or an independent tool (CRIU migration?) for live-migration?
>> Maybe i could fix the migration problem for ivshmem in qemu now,
>> based on softdirty mechanism.
>
> I absolutely plan on releasing these patches :-) CRIU was the first open-source
> userland I had planned on integrating with. At Google, I'm working with our
> home-grown Qemu replacement. However, I'd be happy to help with an effort to
> get softdirty integrated in Qemu in the future.
>

Great;)

>>> Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. To
>>
>> I have read them cursorily, it is useful for pre-copy indeed. But it seems that
>> it can not meet my need for snapshot.
>
>>> make softdirty usable for live migration, I've added an API to atomically
>>> test-and-clear the bit and write protect the page.
>>
>> How can i find the API? Is it been merged in kernel's master branch already?
>
> Negative. I'll be sure to CC you when I start sending this stuff upstream.
>
>

OK, I look forward to it:)

2014-11-06 20:08:18

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits

On Fri, Oct 3, 2014 at 9:07 PM, Andrea Arcangeli <[email protected]> wrote:
> We run out of 32bits in vm_flags, noop change for 64bit archs.

What? Again?
As far as I can see there are some free bits: 0x200, 0x1000, 0x80000

I prefer to reserve 0x02000000 for VM_ARCH_2

>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> fs/proc/task_mmu.c | 4 ++--
> include/linux/huge_mm.h | 4 ++--
> include/linux/ksm.h | 4 ++--
> include/linux/mm_types.h | 2 +-
> mm/huge_memory.c | 2 +-
> mm/ksm.c | 2 +-
> mm/madvise.c | 2 +-
> mm/mremap.c | 2 +-
> 8 files changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index c341568..ee1c3a2 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -532,11 +532,11 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
> /*
> * Don't forget to update Documentation/ on changes.
> */
> - static const char mnemonics[BITS_PER_LONG][2] = {
> + static const char mnemonics[BITS_PER_LONG+1][2] = {
> /*
> * In case if we meet a flag we don't know about.
> */
> - [0 ... (BITS_PER_LONG-1)] = "??",
> + [0 ... (BITS_PER_LONG)] = "??",
>
> [ilog2(VM_READ)] = "rd",
> [ilog2(VM_WRITE)] = "wr",
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 63579cb..3aa10e0 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -121,7 +121,7 @@ extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
> #error "hugepages can't be allocated by the buddy allocator"
> #endif
> extern int hugepage_madvise(struct vm_area_struct *vma,
> - unsigned long *vm_flags, int advice);
> + vm_flags_t *vm_flags, int advice);
> extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
> unsigned long start,
> unsigned long end,
> @@ -183,7 +183,7 @@ static inline int split_huge_page(struct page *page)
> #define split_huge_page_pmd_mm(__mm, __address, __pmd) \
> do { } while (0)
> static inline int hugepage_madvise(struct vm_area_struct *vma,
> - unsigned long *vm_flags, int advice)
> + vm_flags_t *vm_flags, int advice)
> {
> BUG();
> return 0;
> diff --git a/include/linux/ksm.h b/include/linux/ksm.h
> index 3be6bb1..8b35253 100644
> --- a/include/linux/ksm.h
> +++ b/include/linux/ksm.h
> @@ -18,7 +18,7 @@ struct mem_cgroup;
>
> #ifdef CONFIG_KSM
> int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> - unsigned long end, int advice, unsigned long *vm_flags);
> + unsigned long end, int advice, vm_flags_t *vm_flags);
> int __ksm_enter(struct mm_struct *mm);
> void __ksm_exit(struct mm_struct *mm);
>
> @@ -94,7 +94,7 @@ static inline int PageKsm(struct page *page)
>
> #ifdef CONFIG_MMU
> static inline int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> - unsigned long end, int advice, unsigned long *vm_flags)
> + unsigned long end, int advice, vm_flags_t *vm_flags)
> {
> return 0;
> }
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6e0b286..2c876d1 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -217,7 +217,7 @@ struct page_frag {
> #endif
> };
>
> -typedef unsigned long __nocast vm_flags_t;
> +typedef unsigned long long __nocast vm_flags_t;
>
> /*
> * A region containing a mapping of a non-memory backed file under NOMMU
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d9a21d06..e913a19 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1942,7 +1942,7 @@ out:
> #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
>
> int hugepage_madvise(struct vm_area_struct *vma,
> - unsigned long *vm_flags, int advice)
> + vm_flags_t *vm_flags, int advice)
> {
> switch (advice) {
> case MADV_HUGEPAGE:
> diff --git a/mm/ksm.c b/mm/ksm.c
> index fb75902..faf319e 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1736,7 +1736,7 @@ static int ksm_scan_thread(void *nothing)
> }
>
> int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> - unsigned long end, int advice, unsigned long *vm_flags)
> + unsigned long end, int advice, vm_flags_t *vm_flags)
> {
> struct mm_struct *mm = vma->vm_mm;
> int err;
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 0938b30..d5aee71 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -49,7 +49,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
> struct mm_struct *mm = vma->vm_mm;
> int error = 0;
> pgoff_t pgoff;
> - unsigned long new_flags = vma->vm_flags;
> + vm_flags_t new_flags = vma->vm_flags;
>
> switch (behavior) {
> case MADV_NORMAL:
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 05f1180..fa7db87 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -239,7 +239,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> {
> struct mm_struct *mm = vma->vm_mm;
> struct vm_area_struct *new_vma;
> - unsigned long vm_flags = vma->vm_flags;
> + vm_flags_t vm_flags = vma->vm_flags;
> unsigned long new_pgoff;
> unsigned long moved_len;
> unsigned long excess = 0;
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]

2014-11-12 07:19:46

by Hailiang Zhang

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

Hi Andrea,

Is there any news about this discussion? ;)

Do you plan to support 'wrprotect-only faults' in the userfault API?

Thanks,
zhanghailiang

On 2014/10/30 19:31, zhanghailiang wrote:
> On 2014/10/30 1:46, Andrea Arcangeli wrote:
>> Hi Zhanghailiang,
>>
>> On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
>>> Hi Andrea,
>>>
>>> Thanks for your hard work on userfault;)
>>>
>>> This is really a useful API.
>>>
>>> I want to confirm a question:
>>> Can we support distinguishing between writing and reading memory for userfault?
>>> That is, we can decide whether writing a page, reading a page or both trigger userfault.
>>>
>>> I think this will help supporting vhost-scsi,ivshmem for migration,
>>> we can trace dirty page in userspace.
>>>
>>> Actually, I'm trying to realize live memory snapshot based on pre-copy and userfault,
>>> but reading memory from migration thread will also trigger userfault.
>>> It will be easy to implement live memory snapshot, if we support configuring
>>> userfault for writing memory only.
>>
>> Mail is going to be long enough already so I'll just assume tracking
>> dirty memory in userland (instead of doing it in kernel) is worthy
>> feature to have here.
>>
>> After some chat during the KVMForum I've been already thinking it
>> could be beneficial for some usage to give userland the information
>> about the fault being read or write, combined with the ability of
>> mapping pages wrprotected to mcopy_atomic (that would work without
>> false positives only with MADV_DONTFORK also set, but it's already set
>> in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
>> checked also in the wrprotect faults, not just in the not present
>> faults, but it's not a massive change. Returning the read/write
>> information is also a not massive change. This will then payoff mostly
>> if there's also a way to remove the memory atomically (kind of
>> remap_anon_pages).
>>
>> Would that be enough? I mean are you still ok if non present read
>> fault traps too (you'd be notified it's a read) and you get
>> notification for both wrprotect and non present faults?
>>
> Hi Andrea,
>
> Thanks for your reply, and your patience;)
>
> Er, maybe i didn't describe clearly. What i really need for live memory snapshot
> is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing write action*.
>
> My initial solution scheme for live memory snapshot is:
> (1) pause VM
> (2) using userfaultfd to mark all memory of VM is wrprotect (readonly)
> (3) save device state to snapshot file
> (4) resume VM
> (5) snapshot thread begin to save page of memory to snapshot file
> (6) VM is going to run, and it is OK for VM or other thread to read ram (no fault trap),
> but if VM try to write page (dirty the page), there will be
> a userfault trap notification.
> (7) a fault-handle-thread reads the page request from userfaultfd,
> it will copy content of the page to some buffers, and then remove the page's
> wrprotect limit(still using the userfaultfd to tell kernel).
> (8) after step (7), the VM can continue to write to the page, which is now writable.
> (9) the snapshot thread saves the page cached in step (7)
> (10) repeat step (5)~(9) until all VM's memory is saved to snapshot file.
>
> So, what i need for userfault is supporting only wrprotect fault. i don't
> want to get notification for non present reading faults, it will influence
> VM's performance and the efficiency of doing snapshot.
>
> Also, i think this feature will benefit for migration of ivshmem and vhost-scsi
> which have no dirty-page-tracing now.
>
>> The question then is how you mark the memory readonly to let the
>> wrprotect faults trap if the memory already existed and you didn't map
>> it yourself in the guest with mcopy_atomic with a readonly flag.
>>
>> My current plan would be:
>>
>> - keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
>> fast path check in the not-present and wrprotect page fault
>>
>> - if VM_USERFAULT is set, find if there's a userfaultfd registered
>> into that vma too
>>
>> if yes engage userfaultfd protocol
>>
>> otherwise raise SIGBUS (single threaded apps should be fine with
>> SIGBUS and it'll avoid them to spawn a thread in order to talk the
>> userfaultfd protocol)
>>
>> - if userfaultfd protocol is engaged, return read|write fault + fault
>> address to read(ufd) syscalls
>>
>> - leave the "userfault" resolution mechanism independent of the
>> userfaultfd protocol so we keep the two problems separated and we
>> don't mix them in the same API which makes it even harder to
>> finalize it.
>>
>> add mcopy_atomic (with a flag to map the page readonly too)
>>
>> The alternative would be to hide mcopy_atomic (and even
>> remap_anon_pages in order to "remove" the memory atomically for
>> the externalization into the cloud) as userfaultfd commands to
>> write into the fd. But then there would be no much point to keep
>> MADV_USERFAULT around if I do so and I could just remove it
>> too or it doesn't look clean having to open the userfaultfd just
>> to issue an hidden mcopy_atomic.
>>
>> So it becomes a decision if the basic SIGBUS mode for single
>> threaded apps should be supported or not. As long as we support
>> SIGBUS too and we don't force to use userfaultfd as the only
>> mechanism to be notified about userfaults, having a separate
>> mcopy_atomic syscall sounds cleaner.
>>
>> Perhaps mcopy_atomic could be used in other cases that may arise
>> later that may not be connected with the userfault.
>>
>> Questions to double check the above plan is ok:
>>
>> 1) should I drop the SIGBUS behavior and MADV_USERFAULT?
>>
>> 2) should I hide mcopy_atomic as a write into the userfaultfd?
>>
>> NOTE: even if I hide mcopy_atomic as a userfaultfd command to write
>> into the fd, the buffer pointer passed to write() syscall would
>> still _not_ be pointing to the data like a regular write, but it
>> would be a pointer to a command structure that points to the source
>> and destination data of the "hidden" mcopy_atomic, the only
>> advantage is that perhaps I could wakeup the blocked page faults
>> without requiring an additional syscall.
>>
>> The standalone mcopy_atomic would still require a write into the
>> userfaultfd as it happens now after remap_anon_pages returns, in
>> order to wakeup the stopped page faults.
>>
>> 3) should I add a registration command to trap only write faults?
>>
>
> Sure, that is what i really need;)
>
>
> Best Regards,
> zhanghailiang
>
>> The protocol can always be extended later anyway in a backwards
>> compatible way but it's better if we get it fully featured from the
>> start.
>>
>> For completeness, some answers for other questions I've seen floating
>> around but that weren't posted on the list yet (you can skip reading
>> the below part if not interested):
>>
>> - open("/dev/userfault") instead of sys_userfaultfd(), I don't see the
>> benefit: userfaultfd is just like eventfd in terms of kernel API and
>> registering a /dev/ device actually sounds trickier. userfault is a
>> core VM feature and generally we prefer syscalls for core VM
>> features instead of running ioctl on some chardev that may or may
>> not exist. (like we did with /dev/ksm -> MADV_MERGEABLE)
>>
>> - there was a suggestion during KVMForum about allowing an external
>> program to attach to any MM. Like ptrace. So you could have a single
>> process managing all userfaults for different processes. However
>> because I cannot allow multiple userfaultfd to register into the
>> same range, this doesn't look very reliable (ptrace is kind of an
>> optional/debug feature while if userfault goes wrong and returns
>> -EBUSY things go bad) and there may be other complications. If I'd
>> allow multiple userfaultfd to register into the same range, I
>> wouldn't even know who to deliver the userfault to. It is an erratic
>> behavior. Currently it'd return -EBUSY if the app has a bug and does
>> that, but maybe later this can be relaxed to allow higher
>> scalability with a flag (userfaultfd gets flags as parameters), but
>> it still would need to be the same logic that manages userfaults and
>> the only point of allowing multiple ufd to map the same range would
>> be SMP scalability. So I tend to see the userfaultfd as a MM local
>> thing. The thread managing the userfaults can still talk with
>> another process in the local machine using pipes or sockets if it
>> needs to.
>>
>> - the userfaultfd protocol version handshake was done this way because
>> it looked more reliable.
>>
>> Of course we could pass the version of the protocol as parameter to
>> userfaultfd too, but running the syscall multiple times until
>> -EPROTO didn't return anymore doesn't seem any better than writing
>> into the fd the wanted protocol until you read it back instead of
>> -1ULL. It just looked more reliable not having to run the syscall
>> again and again while depending on -EPROTO or some other
>> -Esomething.
>>
>> Thanks,
>> Andrea
>>
>> .
>>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> .
>

2014-11-19 18:50:29

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

Hi Zhang,

On Fri, Oct 31, 2014 at 09:26:09AM +0800, zhanghailiang wrote:
> On 2014/10/30 20:49, Dr. David Alan Gilbert wrote:
> > * zhanghailiang ([email protected]) wrote:
> >> On 2014/10/30 1:46, Andrea Arcangeli wrote:
> >>> Hi Zhanghailiang,
> >>>
> >>> On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
> >>>> Hi Andrea,
> >>>>
> >>>> Thanks for your hard work on userfault;)
> >>>>
> >>>> This is really a useful API.
> >>>>
> >>>> I want to confirm a question:
> >>>> Can we support distinguishing between writing and reading memory for userfault?
> >>>> That is, we can decide whether writing a page, reading a page or both trigger userfault.
> >>>>
> >>>> I think this will help supporting vhost-scsi,ivshmem for migration,
> >>>> we can trace dirty page in userspace.
> >>>>
> >>>> Actually, I'm trying to realize live memory snapshot based on pre-copy and userfault,
> >>>> but reading memory from migration thread will also trigger userfault.
> >>>> It will be easy to implement live memory snapshot, if we support configuring
> >>>> userfault for writing memory only.
> >>>
> >>> Mail is going to be long enough already so I'll just assume tracking
> >>> dirty memory in userland (instead of doing it in kernel) is worthy
> >>> feature to have here.
> >>>
> >>> After some chat during the KVMForum I've been already thinking it
> >>> could be beneficial for some usage to give userland the information
> >>> about the fault being read or write, combined with the ability of
> >>> mapping pages wrprotected to mcopy_atomic (that would work without
> >>> false positives only with MADV_DONTFORK also set, but it's already set
> >>> in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
> >>> checked also in the wrprotect faults, not just in the not present
> >>> faults, but it's not a massive change. Returning the read/write
> >>> information is also a not massive change. This will then payoff mostly
> >>> if there's also a way to remove the memory atomically (kind of
> >>> remap_anon_pages).
> >>>
> >>> Would that be enough? I mean are you still ok if non present read
> >>> fault traps too (you'd be notified it's a read) and you get
> >>> notification for both wrprotect and non present faults?
> >>>
> >> Hi Andrea,
> >>
> >> Thanks for your reply, and your patience;)
> >>
> >> Er, maybe i didn't describe clearly. What i really need for live memory snapshot
> >> is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing write action*.
> >>
> >> My initial solution scheme for live memory snapshot is:
> >> (1) pause VM
> >> (2) using userfaultfd to mark all memory of VM is wrprotect (readonly)
> >> (3) save device state to snapshot file
> >> (4) resume VM
> >> (5) snapshot thread begin to save page of memory to snapshot file
> >> (6) VM is going to run, and it is OK for VM or other thread to read ram (no fault trap),
> >> but if VM try to write page (dirty the page), there will be
> >> a userfault trap notification.
> >> (7) a fault-handle-thread reads the page request from userfaultfd,
> >> it will copy content of the page to some buffers, and then remove the page's
> >> wrprotect limit(still using the userfaultfd to tell kernel).
> >> (8) after step (7), the VM can continue to write to the page, which is now writable.
> >> (9) the snapshot thread saves the page cached in step (7)
> >> (10) repeat step (5)~(9) until all VM's memory is saved to snapshot file.
> >
> > Hmm, I can see the same process being useful for the fault-tolerance schemes
> > like COLO, it needs a memory state snapshot.
> >
> >> So, what i need for userfault is supporting only wrprotect fault. i don't
> >> want to get notification for non present reading faults, it will influence
> >> VM's performance and the efficiency of doing snapshot.
> >
> > What pages would be non-present at this point - just balloon?
> >
>
> Er, sorry, it should be 'no-present page faults';)

Could you elaborate? If balloon pages or not-yet-allocated pages in
the guest fault too (in addition to the wrprotect faults), it doesn't
sound like a big deal, as it's not so common (balloon faults especially
shouldn't happen except while the balloon deflates during the live
snapshotting). We could bypass non-present faults though, and only
track strict wrprotect faults.

2014-11-20 02:55:29

by Hailiang Zhang

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

On 2014/11/20 2:49, Andrea Arcangeli wrote:
> Hi Zhang,
>
> On Fri, Oct 31, 2014 at 09:26:09AM +0800, zhanghailiang wrote:
>> On 2014/10/30 20:49, Dr. David Alan Gilbert wrote:
>>> * zhanghailiang ([email protected]) wrote:
>>>> On 2014/10/30 1:46, Andrea Arcangeli wrote:
>>>>> Hi Zhanghailiang,
>>>>>
>>>>> On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
>>>>>> Hi Andrea,
>>>>>>
>>>>>> Thanks for your hard work on userfault;)
>>>>>>
>>>>>> This is really a useful API.
>>>>>>
>>>>>> I want to confirm a question:
>>>>>> Can we support distinguishing between writing and reading memory for userfault?
>>>>>> That is, we can decide whether writing a page, reading a page or both trigger userfault.
>>>>>>
>>>>>> I think this will help supporting vhost-scsi,ivshmem for migration,
>>>>>> we can trace dirty page in userspace.
>>>>>>
>>>>>> Actually, I'm trying to realize live memory snapshot based on pre-copy and userfault,
>>>>>> but reading memory from migration thread will also trigger userfault.
>>>>>> It will be easy to implement live memory snapshot, if we support configuring
>>>>>> userfault for writing memory only.
>>>>>
>>>>> Mail is going to be long enough already so I'll just assume tracking
>>>>> dirty memory in userland (instead of doing it in kernel) is worthy
>>>>> feature to have here.
>>>>>
>>>>> After some chat during the KVMForum I've been already thinking it
>>>>> could be beneficial for some usage to give userland the information
>>>>> about the fault being read or write, combined with the ability of
>>>>> mapping pages wrprotected to mcopy_atomic (that would work without
>>>>> false positives only with MADV_DONTFORK also set, but it's already set
>>>>> in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
>>>>> checked also in the wrprotect faults, not just in the not present
>>>>> faults, but it's not a massive change. Returning the read/write
>>>>> information is also a not massive change. This will then payoff mostly
>>>>> if there's also a way to remove the memory atomically (kind of
>>>>> remap_anon_pages).
>>>>>
>>>>> Would that be enough? I mean are you still ok if non present read
>>>>> fault traps too (you'd be notified it's a read) and you get
>>>>> notification for both wrprotect and non present faults?
>>>>>
>>>> Hi Andrea,
>>>>
>>>> Thanks for your reply, and your patience;)
>>>>
>>>> Er, maybe i didn't describe clearly. What i really need for live memory snapshot
>>>> is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing write action*.
>>>>
>>>> My initial solution scheme for live memory snapshot is:
>>>> (1) pause VM
>>>> (2) using userfaultfd to mark all memory of VM is wrprotect (readonly)
>>>> (3) save device state to snapshot file
>>>> (4) resume VM
>>>> (5) snapshot thread begin to save page of memory to snapshot file
>>>> (6) VM is going to run, and it is OK for VM or other thread to read ram (no fault trap),
>>>> but if VM try to write page (dirty the page), there will be
>>>> a userfault trap notification.
>>>> (7) a fault-handle-thread reads the page request from userfaultfd,
>>>> it will copy content of the page to some buffers, and then remove the page's
>>>> wrprotect limit(still using the userfaultfd to tell kernel).
>>>> (8) after step (7), the VM can continue to write to the page, which is now writable.
>>>> (9) the snapshot thread saves the page cached in step (7)
>>>> (10) repeat step (5)~(9) until all VM's memory is saved to snapshot file.
>>>
>>> Hmm, I can see the same process being useful for the fault-tolerance schemes
>>> like COLO, it needs a memory state snapshot.
>>>
>>>> So, what i need for userfault is supporting only wrprotect fault. i don't
>>>> want to get notification for non present reading faults, it will influence
>>>> VM's performance and the efficiency of doing snapshot.
>>>
>>> What pages would be non-present at this point - just balloon?
>>>
>>
>> Er, sorry, it should be 'no-present page faults';)
>
> Could you elaborate? The balloon pages or not yet allocated pages in
> the guest, if they fault too (in addition to the wrprotect faults) it
> doesn't sound a big deal, as it's not so common (balloon especially
> shouldn't happen except during balloon deflating during the live
> snapshotting). We could bypass non-present faults though, and only
> track strict wrprotect faults.
>

Yes, you are right. This is what I really want: bypass all non-present faults
and only track strict wrprotect faults. ;)

So, do you plan to support that in the userfault API?

Thanks,
zhanghailiang

2014-11-20 17:30:31

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

Hi,

On Fri, Oct 31, 2014 at 12:39:32PM -0700, Peter Feiner wrote:
> On Fri, Oct 31, 2014 at 11:29:49AM +0800, zhanghailiang wrote:
> > Agreed, but for doing live memory snapshot (the VM is running while the snapshot is taken),
> > we have to do this (block the write action), because we have to save the page before it
> > is dirtied by the writing action. This is the difference, compared to pre-copy migration.
>
> Ah ha, I understand the difference now. I suppose that you have considered
> doing a traditional pre-copy migration (that is, passes over memory saving
> dirty pages, followed by a pause and a final dump of remaining dirty pages) to
> a file. Your approach has the advantage of having the VM pause time bounded by
> the time it takes to handle the userfault and do the write, as opposed to
> pre-copy migration which has a pause time bounded by the time it takes to do
> the final dump of dirty pages, which, in the worst case, is the time it takes
> to dump all of the guest memory!

This sounds like a really similar issue to live migration: one can
implement a precopy live snapshot, a precopy+postcopy live snapshot,
or a pure postcopy live snapshot.

The decision on the amount of precopy done before engaging postcopy
(zero passes, 1 pass, or more passes) would have similar tradeoffs
too, except instead of having to re-transmit the re-dirtied pages over
the wire, it would need to overwrite them to disk.

The more precopy passes, the longer it takes for the live snapshotting
process to finish and the more I/O there will be (for live migration it'd
be network bandwidth usage instead of amount of I/O), but the shorter
the postcopy runtime will be (and the shorter the postcopy runtime is, the
fewer userfaults will end up triggering on writes, in turn reducing
the slowdown and the artificial fault latency introduced to the guest
runtime). But the more precopy passes the more overwriting will happen
during the "longer" precopy stage and the more overall load there will
be for the host (the otherwise idle part of the host).

For the postcopy live snapshot the wrprotect faults are quite
equivalent to the not-present faults of postcopy live migration logic.

> You could use the old fork & dump trick. Given that the guest's memory is
> backed by private VMA (as of a year ago when I last looked, is always the case
> for QEMU), you can have the kernel do the write protection for you.
> Essentially, you fork Qemu and, in the child process, dump the guest memory
> then exit. If the parent (including the guest) writes to guest memory, then it
> will fault and the kernel will copy the page.
>
> The fork & dump approach will give you the best performance w.r.t. guest pause
> times (i.e., just pausing for the COW fault handler), but it does have the
> distinct disadvantage of potentially using 2x the guest memory (i.e., if the
> parent process races ahead and writes to all of the pages before you finish the
> dump). To mitigate memory copying, you could madvise MADV_DONTNEED the child
> memory as you copy it.

This is a very good point. fork must be evaluated first because it
literally already provides you with a readonly snapshot of the guest
memory.
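
To make the fork & dump idea above a bit more concrete, here is a
minimal sketch (the ram pointer, its size and the snapshot fd are
placeholders, and this deliberately ignores the MADV_DONTFORK and
O_DIRECT problems discussed below):

/* Sketch only: the child dumps a COW-frozen copy of guest RAM while
 * the parent (and the guest) keeps running. */
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static pid_t fork_and_dump(char *ram, size_t ram_size, int snapshot_fd)
{
	size_t off;
	pid_t pid = fork();

	if (pid)
		/* parent: guest keeps running, writes hit plain COW faults */
		return pid;

	/* child: its private anon memory is a frozen snapshot */
	for (off = 0; off < ram_size; off += 4096) {
		if (write(snapshot_fd, ram + off, 4096) != 4096)
			_exit(1);
		/* drop the copied page to mitigate the 2x memory worst
		 * case mentioned above */
		madvise(ram + off, 4096, MADV_DONTNEED);
	}
	_exit(0);
}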

The memory cons mentioned above could lead to -ENOMEM if too many
guests run live snapshots at the same time on the same host, unless
overcommit_memory is set to 1 (it is 0 by default). Even then, if too
many live snapshots are running in parallel, you could hit the OOM
killer if there are just a bit too many faults at the same time, or
you could hit heavy swapping, which isn't ideal either.

In fact the -ENOMEM avoidance (with qemu failing) is one of the two
critical reasons why qemu always sets the guest memory as
MADV_DONTFORK. But that's not the only reason.

To use the fork() trick you'd need to undo the MADV_DONTFORK first, but
that would open another problem: there's a race condition between
fork(), O_DIRECT and the <4k hard block size of virtio-blk. If any
read() syscall with O_DIRECT and len=512 is in flight while fork() is
running (think of the aio running in parallel with the live snapshot
thread that forks the child to dump the snapshot), and the guest CPU
writes to any 512-byte fragment of the same page that is the
destination buffer of that read(len=512) (two different 512-byte areas
of the same guest page), the O_DIRECT write to guest memory will get lost.

So to use fork we'd need to fix this longstanding race (I tried, but in
the end we declared it a userland issue because it's not exploitable
to bypass permissions or corrupt kernel or unrelated memory). Or you'd
need to add locking between the dataplane/aio threads and the live
snapshot thread to ensure no direct I/O is ever in flight while
fork runs.

Stopping the O_DIRECT, however, would only help with qemu TCG; with
KVM it's not even enough to stop the O_DIRECT reads. KVM would use
gup(write=1) from the async-pf all the time... and then the shadow
pagetables would go out of sync (it won't destabilize the host of
course, but the guest memory would be corrupt and the guest would
misbehave). In short all vcpus would need to be halted too, in addition
to all direct I/O. Possibly those get stopped anyway before starting
the snapshot (they certainly are stopped before starting postcopy
live migration :).

Even if it were possible to serialize things in qemu to prevent the
race, unless we first fix the fork vs O_DIRECT race in the host kernel, I
wouldn't feel safe removing MADV_DONTFORK and depending on fork for
the snapshotting. This is also because fork may still be used by qemu
in pci hotplug (to fork+exec, but it cannot vfork because it has to
alter the signal handlers first). fork() is something people may do
without realizing it could trigger memory corruption in the
parent.

(If we'd use fork instead of userfaultfd for this, it'd also be nice
to add a madvise for THP, that will alter the COW faults on THP pages,
to copy only 4k and split the pmd into 512 ptes by default leaving 511
not-cowed pte readonly. The split_huge_page design change that adds a
failure path to split_huge_page would prevent having to split the
trans_huge_pmd on the child too, so it would be more efficient and
straightforward change than it would be if we added such a new madvise
right now. Redis would then use that new madvise too instead of
MADV_NOHUGEPAGE, as it uses fork for something similar as the above
live snapshot already and it creates random access cows that with THP
are copying and allocating more memory than with 4k pages. And
hopefully it's already getting the O_DIRECT race right if it happens
to use O_DIRECT + threads too.)

wrprotect userfaults would eliminate the need for fork, and they would
limit the maximal amount of memory allocated by the live snapshot to
the maximal number of asynchronous page faults multiplied by the
number of vcpus multiplied by the page size, and you can increase it
further by creating some buffering but you can still throttle it fine
to keep the buffer limited in size (not like with fork where the
buffer is potentially as large as the entire virtual machine
size). Furthermore you could in theory split the THP and map writable
only the 4k you copy during the wrprotect userfault (COW would always
cow the entire 2m increasing the latency of the fault in the parent).
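
To give a rough sense of scale (the numbers here are only an example,
not measurements): with 16 vcpus, a cap of 8 outstanding asynchronous
page faults per vcpu and 4k pages, the bound would be 16 * 8 * 4096
bytes = 512KiB of copied pages outstanding at any given time, plus
whatever explicit buffering you decide to allow on top, instead of
potentially the whole guest RAM as in the fork case.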

The problem is, there is no kernel syscall that allows you to mark all
trans_huge_pmds and ptes wrprotected without altering the vma too, and
we cannot mangle the vmas for doing these things as we could end up
with too many vmas and a -ENOMEM failure (not to mention the inefficiency
of such a load). So at the moment just adding the wrprotect faults to
the userfaultfd protocol isn't going to move the needle until you also
add new commands or syscalls to mangle the pte/pmd wrprotect bits.
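
To illustrate the kind of operation I mean (the name and signature
below are made up for the sake of discussion, nothing like this exists
in this patchset), think of something along the lines of:

/* Hypothetical: flip the wrprotect bit on the ptes/trans_huge_pmds
 * covering [start, start+len) without touching the vma, taking the
 * mmap_sem only for reading. */
long mwriteprotect(unsigned long start, unsigned long len,
		   int wrprotect /* 1 = mark readonly, 0 = make writable */);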

The real thing to decide in API terms is whether those new operations
(which would likely be implemented in fremap.c) should be exposed to
userland as standalone syscalls that work similarly to
mremap/mprotect but never actually touch any vma and only hold the
mmap_sem for reading, or whether they should be embedded in the userfaultfd
wire protocol as additional commands written into the ufd.

You'd need one syscall to mark all guest memory readonly. Then the
same syscall could be invoked on the 4k region that triggered the
wrprotect userfault to mark it writable again, just before writing the
same 4k region into the ufd to wake up and retry the page fault.
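
With such a primitive (still purely hypothetical) one wrprotect
userfault of the live snapshot could be resolved roughly as in the
sketch below; the read/write formats on the ufd are simplified to a
bare address and are not the final wire protocol:

#include <string.h>
#include <unistd.h>

/* hypothetical syscall sketched above */
extern long mwriteprotect(unsigned long start, unsigned long len,
			  int wrprotect);

/* Sketch only: one iteration of the snapshot fault-handling thread. */
static void handle_one_wrprotect_fault(int ufd, char *snapshot_buf)
{
	unsigned long addr;

	/* the faulting address reported by the wrprotect userfault */
	if (read(ufd, &addr, sizeof(addr)) != sizeof(addr))
		return;
	addr &= ~4095UL;

	/* copy the still-readonly page away for the snapshot thread */
	memcpy(snapshot_buf, (void *)addr, 4096);

	/* make the 4k region writable again... */
	mwriteprotect(addr, 4096, 0);

	/* ...then wake up and retry the blocked page fault */
	write(ufd, &addr, sizeof(addr));
}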

The advantage of embedding all pte/pmd mangling inside special ufd
commands is that sometimes the ufd write needed to wake up the page
fault could be skipped (i.e. we could resolve the userfault with 1
syscall instead of 2). The downside is that it forces us to change the
userfault protocol every time we want to add a new command. If instead
we keep the ufd purely as a page fault event notification/wakeup
mechanism without allowing it to change the pte/pmds (and we leave the
task of mangling the pte/pmds to new syscalls), we could more easily
create a long-lived userfault protocol that provides all features now,
and we could extend the syscalls to mangle the address space with more
flexibility later.

Also, the more commands are only available through userfaultfd, the
less interesting the MADV_USERFAULT SIGBUS usage becomes, as SIGBUS
would only provide a reduced set of features that cannot be had
without dealing with a userfaultfd. For example, should a wrprotect
fault (as a result of the task forking) raise SIGBUS or not, if only
MADV_USERFAULT is set and the userfaultfd is closed?

Until we added the wrprotect faults into the equation,
MADV_USERFAULT+SIGBUS was functionally equivalent (just less efficient
for multithreaded programs and incapable of dealing with kernel
access). Once we add wrprotect faults I'm uncertain it is worth
retaining MADV_USERFAULT and SIGBUS.

The fast path branch that userfaultfd requires in the page fault does:

	/* Deliver the page fault to userland, check inside PT lock */
	if (vma->vm_flags & VM_USERFAULT) {
		pte_unmap_unlock(page_table, ptl);
		return handle_userfault(vma, address, flags);
	}

It'd be trivial to change it to:

	if (vma->vm_userfault_ctx != NULL_VM_USERFAULTFD_CTX) {
		pte_unmap_unlock(page_table, ptl);
		return handle_userfault(vma, address, flags);
	}

(and the whole block would still be optimized away at build time with
CONFIG_USERFAULT=n for embedded systems without virt needs)

In short we need to decide 1) whether to retain the MADV_USERFAULT+SIGBUS
behavior, and 2) whether to expose the new commands needed to flip the
wrprotect bits without altering the vmas and to copy the pages
atomically as standalone syscalls or as new commands of the
userfaultfd wire protocol.

As for the question of whether I intend to add wrprotect faults, the
answer is that I certainly do. I think it is a good idea regardless of
the live snapshotting usage, because it would also allow distributed
shared memory to be implemented more efficiently (mapping the memory
readonly while shared and making it exclusive again on write access),
if anybody dares.

In theory the userfaultfd could also pre-cow the page and return you
the page through the read() syscall if we add a new protocol later to
accelerate it. But I think the current protocol shouldn't go that far
and we should aim for a protocol that is usable by all, even if some
more operations will have to happen in userland than in the
accelerated version specific for live snapshotting (the in-kernel cow
wouldn't necessarily provide a significant speedup anyway).

Supporting only wrprotect faults should be fine by just defining two
bits during registration, one for not-present and one for wrprotect
faults; then it's up to you whether you set one of the two or both (at
least one has to be set).

> I absolutely plan on releasing these patches :-) CRIU was the first open-source
> userland I had planned on integrating with. At Google, I'm working with our
> home-grown Qemu replacement. However, I'd be happy to help with an effort to
> get softdirty integrated in Qemu in the future.

Improving precopy by removing the software-driven log buffer sounds
like a great improvement to precopy. But I tend to agree with Zhang
that it's orthogonal to the actual postcopy stage, which requires
blocking the page fault, not just tracking it later (for both live
migration and live snapshot). precopy is not being obsoleted by
postcopy, they just work in tandem; that applies especially to live
migration, and I see no reason why the same shouldn't apply to live
snapshotting, as said above.

Comments welcome, thanks!
Andrea

2014-11-20 17:39:14

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

Hi,

On Thu, Nov 20, 2014 at 10:54:29AM +0800, zhanghailiang wrote:
> Yes, you are right. This is what i really want, bypass all non-present faults
> and only track strict wrprotect faults. ;)
>
> So, do you plan to support that in the userfault API?

Yes I think it's good idea to support wrprotect/COW faults too.

I just wanted to understand if there was any other reason why you
needed only wrprotect faults, because the non-present faults didn't
look like a big performance concern if they triggered in addition to
wrprotect faults, but it's certainly ok to optimize them away so it's
fully optimal.

All it takes to differentiate the behavior is one more bit
during registration, so you can select non-present faults, wrprotect
faults or both. postcopy live migration would select only non-present
faults, postcopy live snapshot would select only wrprotect faults, and
anything like distributed shared memory supporting shared readonly
access and exclusive write access would select both flags.
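
In pseudo-code the registration could look like the sketch below; the
flag names and the register_range() helper are placeholders (only the
idea of two independent bits per registered range matters here):

#include <stddef.h>

/* Hypothetical flag names, one bit per fault type. */
#define UFFD_MODE_MISSING	(1UL << 0)	/* not-present faults */
#define UFFD_MODE_WP		(1UL << 1)	/* wrprotect faults */

/* placeholder for however a range ends up being registered into the
 * ufd (a write into the fd, an ioctl, ...) */
extern int register_range(int ufd, void *start, size_t len,
			   unsigned long mode);

static void examples(int ufd, void *start, size_t len)
{
	/* postcopy live migration: only not-present faults */
	register_range(ufd, start, len, UFFD_MODE_MISSING);
	/* postcopy live snapshot: only wrprotect faults */
	register_range(ufd, start, len, UFFD_MODE_WP);
	/* distributed shared memory: both */
	register_range(ufd, start, len, UFFD_MODE_MISSING | UFFD_MODE_WP);
}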

I just sent an (unfortunately) longish but way more detailed email
about live snapshotting with userfaultfd but I just wanted to give a
shorter answer here too :).

Thanks,
Andrea

2014-11-21 07:27:50

by Hailiang Zhang

[permalink] [raw]
Subject: Re: [PATCH 00/17] RFC: userfault v2

On 2014/11/21 1:38, Andrea Arcangeli wrote:
> Hi,
>
> On Thu, Nov 20, 2014 at 10:54:29AM +0800, zhanghailiang wrote:
>> Yes, you are right. This is what i really want, bypass all non-present faults
>> and only track strict wrprotect faults. ;)
>>
>> So, do you plan to support that in the userfault API?
>
> Yes I think it's good idea to support wrprotect/COW faults too.
>

Great! Then I can expect your patches. ;)

> I just wanted to understand if there was any other reason why you
> needed only wrprotect faults, because the non-present faults didn't
> look like a big performance concern if they triggered in addition to
> wrprotect faults, but it's certainly ok to optimize them away so it's
> fully optimal.
>

Er, you have got the answer: nothing special, it's only for optimality.

> All it takes to differentiate the behavior should be one more bit
> during registration so you can select non-present, wrprotect faults or
> both. postcopy live migration would select only non-present faults,
> postcopy live snapshot would select only wrprotect faults, anything
> like distributed shared memory supporting shared readonly access and
> exclusive write access, would select both flags.
>

It is really flexible this way.

> I just sent an (unfortunately) longish but way more detailed email
> about live snapshotting with userfaultfd but I just wanted to give a
> shorter answer here too :).
>

Thanks for your explanation, and your patience. It is really useful;
now I know more details about why the 'fork & dump live snapshot'
scenario is not acceptable. Thanks :)

> Thanks,
> Andrea
>
> .
>

2014-11-21 20:15:32

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2

Hi Peter,

On Wed, Oct 29, 2014 at 05:56:59PM +0000, Peter Maydell wrote:
> On 29 October 2014 17:46, Andrea Arcangeli <[email protected]> wrote:
> > After some chat during the KVMForum I've been already thinking it
> > could be beneficial for some usage to give userland the information
> > about the fault being read or write
>
> ...I wonder if that would let us replace the current nasty
> mess we use in linux-user to detect read vs write faults
> (which uses a bunch of architecture-specific hacks including
> in some cases "look at the insn that triggered this SEGV and
> decode it to see if it was a load or a store"; see the
> various cpu_signal_handler() implementations in user-exec.c).

There's currently no plan to deliver to userland read access
notifications of a present page, simply because the task of the
userfaultfd is to handle the page fault in userland, but if the page
is mapped and readable it won't fault in the first place :). I just
mean it's not like a gdb read watchpoint.

Even if the region were set to PROT_NONE it would still SEGV
without triggering a userfault (after all pte_present would still be
true because the page is still mapped despite not being readable, so
in any case it wouldn't be considered a not-present page fault).

If you temporarily remove the page (which requires an unavoidable TLB
flush, also considering that if the page was previously mapped the TLB
could still resolve it for reads), then it would work, because the plan
is to provide read/write fault information through the userfaultfd.

In theory it would be possible to deliver PROT_NONE faults through
userfault too but it doesn't make much sense because PROT_NONE still
requires a TLB flush, in addition to the vma
modifications/splitting/rbtree-rebalance and the mmap_sem for writing
as well.

Temporarily removing/moving the page with remap_anon_pages should be
much better than using PROT_NONE for this (or an alternative syscall
name to differentiate it further from remap_file_pages, or an
equivalent userfaultfd command if we decide to hide the pte/pmd
mangling as userfaultfd commands instead of adding new standalone
syscalls). Its only constraint would be that you must mark the region
MADV_DONTFORK if you intend linux-user to ever fork, or it won't work
reliably (that constraint is there to eliminate the need for additional
rmap complexity, precisely so that it doesn't turn into something more
intrusive like remap_file_pages). I assume that would be a fine
constraint for linux-user.
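
As a rough sketch of what that could look like from linux-user's side,
assuming the remap_anon_pages(dst, src, len, flags) interface proposed
in this series (the syscall number is a placeholder because the
syscall isn't merged, and the scratch area management is
oversimplified):

#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* placeholder wrapper: __NR_remap_anon_pages doesn't exist yet */
static long remap_anon_pages(void *dst, void *src, size_t len)
{
	return syscall(/* __NR_remap_anon_pages */ -1, dst, src, len, 0UL);
}

/* empty anonymous scratch mapping; both it and the guest region are
 * assumed to be marked MADV_DONTFORK as noted above */
static char *scratch;

static void remove_guest_page(char *guest_page)
{
	/* move the page out: the next access to guest_page becomes a
	 * not-present fault, delivered through the userfaultfd if the
	 * region is registered */
	remap_anon_pages(scratch, guest_page, 4096);
}

static void resolve_guest_fault(char *guest_page)
{
	/* move the page back in, filling the hole so the blocked
	 * fault can be retried successfully */
	remap_anon_pages(guest_page, scratch, 4096);
}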

Thanks,
Andrea

2014-11-21 23:06:10

by Peter Maydell

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2

On 21 November 2014 20:14, Andrea Arcangeli <[email protected]> wrote:
> Hi Peter,
>
> On Wed, Oct 29, 2014 at 05:56:59PM +0000, Peter Maydell wrote:
>> On 29 October 2014 17:46, Andrea Arcangeli <[email protected]> wrote:
>> > After some chat during the KVMForum I've been already thinking it
>> > could be beneficial for some usage to give userland the information
>> > about the fault being read or write
>>
>> ...I wonder if that would let us replace the current nasty
>> mess we use in linux-user to detect read vs write faults
>> (which uses a bunch of architecture-specific hacks including
>> in some cases "look at the insn that triggered this SEGV and
>> decode it to see if it was a load or a store"; see the
>> various cpu_signal_handler() implementations in user-exec.c).
>
> There's currently no plan to deliver to userland read access
> notifications of a present page, simply because the task of the
> userfaultfd is to handle the page fault in userland, but if the page
> is mapped and readable it won't fault in the first place :). I just
> mean it's not like gdb read watch.

If it's mapped and readable-but-not-writable then it should still
fault on write accesses, though? These are cases we currently get
SEGV for, anyway.

> Even if the region would be set to PROT_NONE it would still SEGV
> without triggering an userfault (after all pte_present would still
> true because the page is still mapped despite not being readable, so
> in any case it wouldn't be considered a not-present page fault).

Ah, I guess we have a terminology difference. I was considering
"page fault" to mean (roughly) "anything that causes the CPU to
take an exception on an attempted load/store" and expected that
userfaultfd would notify userspace of any of those. (Well, not
alignment faults, maybe, but I'm definitely surprised that
access permission issues don't get reported the same way as
page-completely-missing issues. In other words I was expecting
that this was "everything previously reported via SIGSEGV or
SIGBUS now comes via userfaultfd".)

> Temporarily removing/moving the page with remap_anon_pages shall be
> much better than using PROT_NONE for this (or alternative syscall name
> to differentiate it further from remap_file_pages, or equivalent
> userfaultfd command if we decide to hide the pte/pmd mangling as
> userfaultfd commands instead of adding new standalone syscalls).

We don't use PROT_NONE for the linux-user situation, we just use
mprotect() to remove the PAGE_WRITE permission so it's still
readable.

I suspect actually linux-user would be better off implementing
something like "if this is a page which we've mapped read-only
because we translated code out of it, then go ahead and remap
it r/w and throw away the translation and retry the access,
otherwise report SEGV to the guest", because taking SEGVs shouldn't
be a fast path in the guest binary. That would let us work without
architecture-specific junk and without requiring new kernel
features either. So you can ignore this whole tangent thread :-)

thanks
-- PMM

2014-11-25 19:46:05

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2

On Fri, Nov 21, 2014 at 11:05:45PM +0000, Peter Maydell wrote:
> If it's mapped and readable-but-not-writable then it should still
> fault on write accesses, though? These are cases we currently get
> SEGV for, anyway.

Yes then it'll work just fine.

> Ah, I guess we have a terminology difference. I was considering
> "page fault" to mean (roughly) "anything that causes the CPU to
> take an exception on an attempted load/store" and expected that
> userfaultfd would notify userspace of any of those. (Well, not
> alignment faults, maybe, but I'm definitely surprised that
> access permission issues don't get reported the same way as
> page-completely-missing issues. In other words I was expecting
> that this was "everything previously reported via SIGSEGV or
> SIGBUS now comes via userfaultfd".)

Just not PROT_NONE SIGSEGV faults, i.e. PROT_NONE would still SIGSEGV
currently, because it's not a not-present fault (the page is present,
just not mapped readable) nor is it a wrprotect fault (it is
trapped with the vma vm_flags permission bits before the
actual page fault handler is invoked). userfaultfd hooks into the
common code of the page fault handler.

> > Temporarily removing/moving the page with remap_anon_pages shall be
> > much better than using PROT_NONE for this (or alternative syscall name
> > to differentiate it further from remap_file_pages, or equivalent
> > userfaultfd command if we decide to hide the pte/pmd mangling as
> > userfaultfd commands instead of adding new standalone syscalls).
>
> We don't use PROT_NONE for the linux-user situation, we just use
> mprotect() to remove the PAGE_WRITE permission so it's still
> readable.

Like said above it'll work just fine then.

> I suspect actually linux-user would be better off implementing
> something like "if this is a page which we've mapped read-only
> because we translated code out of it, then go ahead and remap
> it r/w and throw away the translation and retry the access,
> otherwise report SEGV to the guest", because taking SEGVs shouldn't
> be a fast path in the guest binary. That would let us work without
> architecture-specific junk and without requiring new kernel
> features either. So you can ignore this whole tangent thread :-)

You might get a significant boost if you use userfaultfd.

For postcopy live snapshot and postcopy live migration the main
benefit is the removal of mprotect as a whole, and the performance
improvement is a secondary benefit.

You can cap the max size of the JIT translation cache (and in turn the
maximal number of vmas generated by the mprotects) but we can't cap
the address space fragmentation. The faults may invoke way too many
mprotect calls and we may fragment the vmas so much that we get
-ENOMEM.

Marking a page wrprotected, however, is always tricky, no matter if it's
fork doing it or KSM or something else. KSM just skips pages that could
be under gup pins and retries them at the next pass. Fork simply won't
work right currently, and it needs MADV_DONTFORK to avoid the
wrprotection entirely wherever you may use O_DIRECT mixed with threads
and fork.

For this new vma-less syscall (or ufd command) the best we could do is
to print a warning if any page marked wrprotected could be under a GUP
pin (the warning could generate false positives as a result of
speculative cache lookups that run a lockless get_page_unless_zero() on
any pfn).

To avoid races in the postcopy live snapshot feature, I think it should
be enough to wait for all in-flight I/O to complete before marking the
guest address space readonly (the KVM gup() side can be taken care of
by marking the shadow MMU readonly, which is a must anyway; the mmu
notifier will take care of that part).

The postcopy live snapshot will have to copy the page, so it's
effectively a COW in userland, and in turn it must ensure there's no
O_DIRECT in flight still writing to the page (despite it being marked
readonly) while the wrprotection syscall runs.

For your case there's probably no gup() in the equation unless you use
O_DIRECT (I don't think linux-user uses the in-kernel shadow MMU), so
you don't have to worry about those races and it's just
simpler.