FAULT_FLAG_ALLOW_RETRY allows the page fault handler to drop the mmap_sem
(held for reading) while waiting for I/O completion, which reduces the
mmap_sem contention for writers. The problem is that right now practically
no get_user_pages caller uses FAULT_FLAG_ALLOW_RETRY, so we're not
leveraging that nifty feature.
Andres fixed it for the KVM page fault. However get_user_pages_fast
remains uncovered, and 99% of the other get_user_pages callers aren't
using it either (the only exception being FOLL_NOWAIT in KVM, which is
truly nonblocking and in fact doesn't even release the mmap_sem).
So this patchset extends the optimization Andres did in the KVM page
fault to the whole kernel. It makes the most important places (including
gup_fast) use FAULT_FLAG_ALLOW_RETRY to reduce the mmap_sem hold
times during I/O.
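To make the mechanism concrete, a minimal sketch of the retry protocol
(the helper name and the error handling are illustrative, not taken from
these patches) looks roughly like this:

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/errno.h>

/* Illustrative only: fault in one address, allowing the fault handler
 * to drop mmap_sem (VM_FAULT_RETRY) while it waits for I/O. */
static int fault_in_addr_sketch(struct mm_struct *mm, unsigned long addr)
{
	unsigned int flags = FAULT_FLAG_ALLOW_RETRY;
	struct vm_area_struct *vma;
	int ret;

retry:
	down_read(&mm->mmap_sem);
	vma = find_vma(mm, addr);
	if (!vma || vma->vm_start > addr) {
		up_read(&mm->mmap_sem);
		return -EFAULT;
	}
	ret = handle_mm_fault(mm, vma, addr, flags);
	if (ret & VM_FAULT_RETRY) {
		/*
		 * handle_mm_fault released mmap_sem for us; retry the
		 * fault once more, this time synchronously (no
		 * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_TRIED set).
		 */
		flags = FAULT_FLAG_TRIED;
		goto retry;
	}
	up_read(&mm->mmap_sem);
	return (ret & VM_FAULT_ERROR) ? -EFAULT : 0;
}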
The few places that remain uncovered are drivers like v4l and other
exceptions that tend to work on their own memory rather than on random
user memory (unlike, for example, O_DIRECT, which uses gup_fast and is
fully covered by this patchset).
A follow-up patch should probably also add a printk_once warning to
get_user_pages, which should become obsolete and be phased out
eventually. The "vmas" parameter of get_user_pages makes it
fundamentally incompatible with FAULT_FLAG_ALLOW_RETRY (the vmas array
becomes meaningless the moment the mmap_sem is released).
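To illustrate the incompatibility, here is a hypothetical caller (not
one of the call sites touched by this series) that shows why a returned
vmas array cannot survive a call that drops the lock:

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/printk.h>
#include <linux/errno.h>

static int vmas_vs_retry_sketch(struct task_struct *tsk,
				struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vmas[1];
	struct page *pages[1];
	long ret;

	down_read(&mm->mmap_sem);
	ret = get_user_pages(tsk, mm, addr, 1, 0, 0, pages, vmas);
	/*
	 * If get_user_pages were allowed to drop and retake mmap_sem
	 * internally, vmas[0] could already point to a freed or merged
	 * vma here, and the dereference below would be a use after free.
	 */
	if (ret == 1) {
		pr_info("vm_flags=%lx\n", vmas[0]->vm_flags);
		put_page(pages[0]);
	}
	up_read(&mm->mmap_sem);
	return ret == 1 ? 0 : -EFAULT;
}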
While this is just an optimization for now, it becomes an absolute
requirement for userfaultfd. Userfaultfd allows blocking the page
fault, and in order to do so I need to drop the mmap_sem
first. So this patchset also ensures that, for all memory where a
userfaultfd could be registered by KVM, the very first fault (no
matter if it is a regular page fault or a get_user_pages) always
has FAULT_FLAG_ALLOW_RETRY set. The userfaultfd then blocks and is
woken only when the pagetable is already mapped. The second fault
attempt after the wakeup doesn't need FAULT_FLAG_ALLOW_RETRY, so it's
ok to retry without it.
So I need this merged before I can attempt to merge userfaultfd.
This has been running fully stable on a heavy KVM postcopy live
migration workload that also exercises the new userfaultfd API, which
allows an unlimited number of userfaultfds per process, each of which
can register and unregister memory ranges at any time, so each thread
or shared library can handle userfaults in its own private memory
independently of the others and of the main process. This is also the
same load that exposed the NFS silent memory corruption; it uses
O_DIRECT on NFS as well, so get_user_pages_fast and all sorts of
get_user_pages callers are exercised by both NFS and KVM at the same
time on the userfaultfd-backed memory.
Reviews would be welcome, thanks,
Andrea
Andrea Arcangeli (3):
mm: gup: add get_user_pages_locked and get_user_pages_unlocked
mm: gup: use get_user_pages_fast and get_user_pages_unlocked
mm: gup: use get_user_pages_unlocked within get_user_pages_fast
Andres Lagar-Cavilla (1):
mm: gup: add FOLL_TRIED
arch/mips/mm/gup.c | 8 +-
arch/powerpc/mm/gup.c | 6 +-
arch/s390/kvm/kvm-s390.c | 4 +-
arch/s390/mm/gup.c | 6 +-
arch/sh/mm/gup.c | 6 +-
arch/sparc/mm/gup.c | 6 +-
arch/x86/mm/gup.c | 7 +-
drivers/dma/iovlock.c | 10 +-
drivers/iommu/amd_iommu_v2.c | 6 +-
drivers/media/pci/ivtv/ivtv-udma.c | 6 +-
drivers/misc/sgi-gru/grufault.c | 3 +-
drivers/scsi/st.c | 10 +-
drivers/video/fbdev/pvr2fb.c | 5 +-
include/linux/mm.h | 8 ++
mm/gup.c | 182 ++++++++++++++++++++++++++++++++++---
mm/mempolicy.c | 2 +-
mm/nommu.c | 23 +++++
mm/process_vm_access.c | 7 +-
mm/util.c | 10 +-
net/ceph/pagevec.c | 9 +-
20 files changed, 236 insertions(+), 88 deletions(-)
We can leverage the VM_FAULT_RETRY functionality in the page fault
paths better by using either get_user_pages_locked or
get_user_pages_unlocked.
The former allows converting get_user_pages invocations; the caller
passes a "&locked" parameter to learn whether the mmap_sem was dropped
during the call. For example, from:
down_read(&mm->mmap_sem);
do_something()
get_user_pages(tsk, mm, ..., pages, NULL);
up_read(&mm->mmap_sem);
to:
int locked = 1;
down_read(&mm->mmap_sem);
do_something()
get_user_pages_locked(tsk, mm, ..., pages, &locked);
if (locked)
up_read(&mm->mmap_sem);
The latter is suitable only as a drop-in replacement of the form:
down_read(&mm->mmap_sem);
get_user_pages(tsk, mm, ..., pages, NULL);
up_read(&mm->mmap_sem);
into:
get_user_pages_unlocked(tsk, mm, ..., pages);
Here tsk, mm, the intermediate "..." parameters and "pages" can be any
values, as before. Only the last parameter of get_user_pages (vmas) must
be NULL for get_user_pages_locked|unlocked to be usable (the original
form of the latter wouldn't have been safe anyway if vmas wasn't NULL;
for the former we just make it explicit by dropping the parameter).
If vmas is not NULL these two methods cannot be used.
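As a concrete illustration of the first form, a hypothetical caller
(not one of the call sites converted here) that pins a single page
while doing other work under the lock could look like:

#include <linux/mm.h>
#include <linux/sched.h>

static long pin_one_page_locked_sketch(struct task_struct *tsk,
				       struct mm_struct *mm,
				       unsigned long addr,
				       struct page **page)
{
	int locked = 1;
	long ret;

	down_read(&mm->mmap_sem);
	/* ... other work that needs mmap_sem ... */
	ret = get_user_pages_locked(tsk, mm, addr, 1, 1, 0, page, &locked);
	/*
	 * "locked" is cleared if gup had to drop mmap_sem (e.g. while
	 * waiting for I/O); only release the lock if we still hold it.
	 */
	if (locked)
		up_read(&mm->mmap_sem);
	return ret;
}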
This patch then applies the new forms in various places, in some cases
also replacing the call with get_user_pages_fast whenever tsk and mm are
current and current->mm. get_user_pages_unlocked differs from
get_user_pages_fast only when mm is not current->mm (i.e. when
get_user_pages works on some other process's mm). Whenever tsk and mm
match current and current->mm, get_user_pages_fast must always be used
instead, to increase performance and get the pages locklessly (with just
irqs disabled).
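Restated as a hypothetical helper (just the selection rule above, not
code from the patch):

#include <linux/mm.h>
#include <linux/sched.h>

static long pin_user_page_sketch(struct task_struct *tsk,
				 struct mm_struct *mm, unsigned long addr,
				 int write, struct page **page)
{
	/* The lockless fast path only works on the current process's mm. */
	if (tsk == current && mm == current->mm)
		return get_user_pages_fast(addr, 1, write, page);
	return get_user_pages_unlocked(tsk, mm, addr, 1, write, 0, page);
}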
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/mm.h | 7 +++
mm/gup.c | 178 +++++++++++++++++++++++++++++++++++++++++++++++++----
mm/nommu.c | 23 +++++++
3 files changed, 197 insertions(+), 11 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0f4196a..8900ba9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1196,6 +1196,13 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
int write, int force, struct page **pages,
struct vm_area_struct **vmas);
+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages,
+ int *locked);
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages);
int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
struct kvec;
diff --git a/mm/gup.c b/mm/gup.c
index af7ea3e..b6d0076 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -580,6 +580,166 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
return 0;
}
+static inline long __get_user_pages_locked(struct task_struct *tsk,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long nr_pages,
+ int write, int force,
+ struct page **pages,
+ struct vm_area_struct **vmas,
+ int *locked,
+ bool notify_drop)
+{
+ int flags = FOLL_TOUCH;
+ long ret, pages_done;
+ bool lock_dropped;
+
+ if (locked) {
+ /* if VM_FAULT_RETRY can be returned, vmas become invalid */
+ BUG_ON(vmas);
+ /* check caller initialized locked */
+ BUG_ON(*locked != 1);
+ }
+
+ if (pages)
+ flags |= FOLL_GET;
+ if (write)
+ flags |= FOLL_WRITE;
+ if (force)
+ flags |= FOLL_FORCE;
+
+ pages_done = 0;
+ lock_dropped = false;
+ for (;;) {
+ ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
+ vmas, locked);
+ if (!locked)
+ /* VM_FAULT_RETRY couldn't trigger, bypass */
+ return ret;
+
+ /* VM_FAULT_RETRY cannot return errors */
+ if (!*locked) {
+ BUG_ON(ret < 0);
+ BUG_ON(nr_pages == 1 && ret);
+ }
+
+ if (!pages)
+ /* If it's a prefault don't insist harder */
+ return ret;
+
+ if (ret > 0) {
+ nr_pages -= ret;
+ pages_done += ret;
+ if (!nr_pages)
+ break;
+ }
+ if (*locked) {
+ /* VM_FAULT_RETRY didn't trigger */
+ if (!pages_done)
+ pages_done = ret;
+ break;
+ }
+ /* VM_FAULT_RETRY triggered, so seek to the faulting offset */
+ pages += ret;
+ start += ret << PAGE_SHIFT;
+
+ /*
+ * Repeat on the address that fired VM_FAULT_RETRY
+ * without FAULT_FLAG_ALLOW_RETRY but with
+ * FAULT_FLAG_TRIED.
+ */
+ *locked = 1;
+ lock_dropped = true;
+ down_read(&mm->mmap_sem);
+ ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
+ pages, NULL, NULL);
+ if (ret != 1) {
+ BUG_ON(ret > 1);
+ if (!pages_done)
+ pages_done = ret;
+ break;
+ }
+ nr_pages--;
+ pages_done++;
+ if (!nr_pages)
+ break;
+ pages++;
+ start += PAGE_SIZE;
+ }
+ if (notify_drop && lock_dropped && *locked) {
+ /*
+ * We must let the caller know we temporarily dropped the lock
+ * and so the critical section protected by it was lost.
+ */
+ up_read(&mm->mmap_sem);
+ *locked = 0;
+ }
+ return pages_done;
+}
+
+/*
+ * We can leverage the VM_FAULT_RETRY functionality in the page fault
+ * paths better by using either get_user_pages_locked() or
+ * get_user_pages_unlocked().
+ *
+ * get_user_pages_locked() is suitable to replace the form:
+ *
+ * down_read(&mm->mmap_sem);
+ * do_something()
+ * get_user_pages(tsk, mm, ..., pages, NULL);
+ * up_read(&mm->mmap_sem);
+ *
+ * to:
+ *
+ * int locked = 1;
+ * down_read(&mm->mmap_sem);
+ * do_something()
+ * get_user_pages_locked(tsk, mm, ..., pages, &locked);
+ * if (locked)
+ * up_read(&mm->mmap_sem);
+ */
+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages,
+ int *locked)
+{
+ return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
+ pages, NULL, locked, true);
+}
+EXPORT_SYMBOL(get_user_pages_locked);
+
+/*
+ * get_user_pages_unlocked() is suitable to replace the form:
+ *
+ * down_read(&mm->mmap_sem);
+ * get_user_pages(tsk, mm, ..., pages, NULL);
+ * up_read(&mm->mmap_sem);
+ *
+ * with:
+ *
+ * get_user_pages_unlocked(tsk, mm, ..., pages);
+ *
+ * It is functionally equivalent to get_user_pages_fast so
+ * get_user_pages_fast should be used instead, if the two parameters
+ * "tsk" and "mm" are respectively equal to current and current->mm,
+ * or if "force" shall be set to 1 (get_user_pages_fast misses the
+ * "force" parameter).
+ */
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages)
+{
+ long ret;
+ int locked = 1;
+ down_read(&mm->mmap_sem);
+ ret = __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
+ pages, NULL, &locked, false);
+ if (locked)
+ up_read(&mm->mmap_sem);
+ return ret;
+}
+EXPORT_SYMBOL(get_user_pages_unlocked);
+
/*
* get_user_pages() - pin user pages in memory
* @tsk: the task_struct to use for page fault accounting, or
@@ -629,22 +789,18 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
* use the correct cache flushing APIs.
*
* See also get_user_pages_fast, for performance critical applications.
+ *
+ * get_user_pages should be phased out in favor of
+ * get_user_pages_locked|unlocked or get_user_pages_fast. Nothing
+ * should use get_user_pages because it cannot pass
+ * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
*/
long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, unsigned long nr_pages, int write,
int force, struct page **pages, struct vm_area_struct **vmas)
{
- int flags = FOLL_TOUCH;
-
- if (pages)
- flags |= FOLL_GET;
- if (write)
- flags |= FOLL_WRITE;
- if (force)
- flags |= FOLL_FORCE;
-
- return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
- NULL);
+ return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
+ pages, vmas, NULL, false);
}
EXPORT_SYMBOL(get_user_pages);
diff --git a/mm/nommu.c b/mm/nommu.c
index a881d96..3918b0f 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -213,6 +213,29 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
}
EXPORT_SYMBOL(get_user_pages);
+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages,
+ int *locked)
+{
+ return get_user_pages(tsk, mm, start, nr_pages, write, force,
+ pages, NULL);
+}
+EXPORT_SYMBOL(get_user_pages_locked);
+
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ int write, int force, struct page **pages)
+{
+ long ret;
+ down_read(&mm->mmap_sem);
+ ret = get_user_pages(tsk, mm, start, nr_pages, write, force,
+ pages, NULL);
+ up_read(&mm->mmap_sem);
+ return ret;
+}
+EXPORT_SYMBOL(get_user_pages_unlocked);
+
/**
* follow_pfn - look up PFN at a user virtual address
* @vma: memory mapping
From: Andres Lagar-Cavilla <[email protected]>
Reviewed-by: Radim Krčmář <[email protected]>
Signed-off-by: Andres Lagar-Cavilla <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/mm.h | 1 +
mm/gup.c | 4 ++++
2 files changed, 5 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..0f4196a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1985,6 +1985,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
+#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..af7ea3e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
fault_flags |= FAULT_FLAG_ALLOW_RETRY;
if (*flags & FOLL_NOWAIT)
fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
+ if (*flags & FOLL_TRIED) {
+ VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+ fault_flags |= FAULT_FLAG_TRIED;
+ }
ret = handle_mm_fault(mm, vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {
Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/mips/mm/gup.c | 8 +++-----
arch/powerpc/mm/gup.c | 6 ++----
arch/s390/kvm/kvm-s390.c | 4 +---
arch/s390/mm/gup.c | 6 ++----
arch/sh/mm/gup.c | 6 ++----
arch/sparc/mm/gup.c | 6 ++----
arch/x86/mm/gup.c | 7 +++----
7 files changed, 15 insertions(+), 28 deletions(-)
diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 06ce17c..20884f5 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -301,11 +301,9 @@ slow_irqon:
start += nr << PAGE_SHIFT;
pages += nr;
- down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, start,
- (end - start) >> PAGE_SHIFT,
- write, 0, pages, NULL);
- up_read(&mm->mmap_sem);
+ ret = get_user_pages_unlocked(current, mm, start,
+ (end - start) >> PAGE_SHIFT,
+ write, 0, pages);
/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index d874668..b70c34a 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -215,10 +215,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
start += nr << PAGE_SHIFT;
pages += nr;
- down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, start,
- nr_pages - nr, write, 0, pages, NULL);
- up_read(&mm->mmap_sem);
+ ret = get_user_pages_unlocked(current, mm, start,
+ nr_pages - nr, write, 0, pages);
/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 81b0e11..37ca29a 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1092,9 +1092,7 @@ long kvm_arch_fault_in_page(struct kvm_vcpu *vcpu, gpa_t gpa, int writable)
hva = gmap_fault(gpa, vcpu->arch.gmap);
if (IS_ERR_VALUE(hva))
return (long)hva;
- down_read(&mm->mmap_sem);
- rc = get_user_pages(current, mm, hva, 1, writable, 0, NULL, NULL);
- up_read(&mm->mmap_sem);
+ rc = get_user_pages_unlocked(current, mm, hva, 1, writable, 0, NULL);
return rc < 0 ? rc : 0;
}
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 639fce46..5c586c7 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -235,10 +235,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
/* Try to get the remaining pages with get_user_pages */
start += nr << PAGE_SHIFT;
pages += nr;
- down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, start,
- nr_pages - nr, write, 0, pages, NULL);
- up_read(&mm->mmap_sem);
+ ret = get_user_pages_unlocked(current, mm, start,
+ nr_pages - nr, write, 0, pages);
/* Have to be a bit careful with return values */
if (nr > 0)
ret = (ret < 0) ? nr : ret + nr;
diff --git a/arch/sh/mm/gup.c b/arch/sh/mm/gup.c
index 37458f3..e15f52a 100644
--- a/arch/sh/mm/gup.c
+++ b/arch/sh/mm/gup.c
@@ -257,10 +257,8 @@ slow_irqon:
start += nr << PAGE_SHIFT;
pages += nr;
- down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, start,
- (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
- up_read(&mm->mmap_sem);
+ ret = get_user_pages_unlocked(current, mm, start,
+ (end - start) >> PAGE_SHIFT, write, 0, pages);
/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 1aed043..fa7de7d 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -219,10 +219,8 @@ slow:
start += nr << PAGE_SHIFT;
pages += nr;
- down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, start,
- (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
- up_read(&mm->mmap_sem);
+ ret = get_user_pages_unlocked(current, mm, start,
+ (end - start) >> PAGE_SHIFT, write, 0, pages);
/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 207d9aef..2ab183b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -388,10 +388,9 @@ slow_irqon:
start += nr << PAGE_SHIFT;
pages += nr;
- down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, start,
- (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
- up_read(&mm->mmap_sem);
+ ret = get_user_pages_unlocked(current, mm, start,
+ (end - start) >> PAGE_SHIFT,
+ write, 0, pages);
/* Have to be a bit careful with return values */
if (nr > 0) {
Just an optimization.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
drivers/dma/iovlock.c | 10 ++--------
drivers/iommu/amd_iommu_v2.c | 6 ++----
drivers/media/pci/ivtv/ivtv-udma.c | 6 ++----
drivers/misc/sgi-gru/grufault.c | 3 +--
drivers/scsi/st.c | 10 ++--------
drivers/video/fbdev/pvr2fb.c | 5 +----
mm/mempolicy.c | 2 +-
mm/process_vm_access.c | 7 ++-----
mm/util.c | 10 ++--------
net/ceph/pagevec.c | 9 ++++-----
10 files changed, 19 insertions(+), 49 deletions(-)
diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
index bb48a57..12ea7c3 100644
--- a/drivers/dma/iovlock.c
+++ b/drivers/dma/iovlock.c
@@ -95,17 +95,11 @@ struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len)
pages += page_list->nr_pages;
/* pin pages down */
- down_read(¤t->mm->mmap_sem);
- ret = get_user_pages(
- current,
- current->mm,
+ ret = get_user_pages_fast(
(unsigned long) iov[i].iov_base,
page_list->nr_pages,
1, /* write */
- 0, /* force */
- page_list->pages,
- NULL);
- up_read(¤t->mm->mmap_sem);
+ page_list->pages);
if (ret != page_list->nr_pages)
goto unpin;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 5f578e8..6963b73 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -519,10 +519,8 @@ static void do_fault(struct work_struct *work)
write = !!(fault->flags & PPR_FAULT_WRITE);
- down_read(&fault->state->mm->mmap_sem);
- npages = get_user_pages(NULL, fault->state->mm,
- fault->address, 1, write, 0, &page, NULL);
- up_read(&fault->state->mm->mmap_sem);
+ npages = get_user_pages_unlocked(NULL, fault->state->mm,
+ fault->address, 1, write, 0, &page);
if (npages == 1) {
put_page(page);
diff --git a/drivers/media/pci/ivtv/ivtv-udma.c b/drivers/media/pci/ivtv/ivtv-udma.c
index 7338cb2..96d866b 100644
--- a/drivers/media/pci/ivtv/ivtv-udma.c
+++ b/drivers/media/pci/ivtv/ivtv-udma.c
@@ -124,10 +124,8 @@ int ivtv_udma_setup(struct ivtv *itv, unsigned long ivtv_dest_addr,
}
/* Get user pages for DMA Xfer */
- down_read(¤t->mm->mmap_sem);
- err = get_user_pages(current, current->mm,
- user_dma.uaddr, user_dma.page_count, 0, 1, dma->map, NULL);
- up_read(¤t->mm->mmap_sem);
+ err = get_user_pages_unlocked(current, current->mm,
+ user_dma.uaddr, user_dma.page_count, 0, 1, dma->map);
if (user_dma.page_count != err) {
IVTV_DEBUG_WARN("failed to map user pages, returned %d instead of %d\n",
diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
index f74fc0c..cd20669 100644
--- a/drivers/misc/sgi-gru/grufault.c
+++ b/drivers/misc/sgi-gru/grufault.c
@@ -198,8 +198,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
#else
*pageshift = PAGE_SHIFT;
#endif
- if (get_user_pages
- (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
+ if (get_user_pages_fast(vaddr, 1, write, &page) <= 0)
return -EFAULT;
*paddr = page_to_phys(page);
put_page(page);
diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index aff9689..c89dcfa 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -4536,18 +4536,12 @@ static int sgl_map_user_pages(struct st_buffer *STbp,
return -ENOMEM;
/* Try to fault in all of the necessary pages */
- down_read(¤t->mm->mmap_sem);
/* rw==READ means read from drive, write into memory area */
- res = get_user_pages(
- current,
- current->mm,
+ res = get_user_pages_fast(
uaddr,
nr_pages,
rw == READ,
- 0, /* don't force */
- pages,
- NULL);
- up_read(¤t->mm->mmap_sem);
+ pages);
/* Errors and no page mapped should return here */
if (res < nr_pages)
diff --git a/drivers/video/fbdev/pvr2fb.c b/drivers/video/fbdev/pvr2fb.c
index 167cfff..ff81f65 100644
--- a/drivers/video/fbdev/pvr2fb.c
+++ b/drivers/video/fbdev/pvr2fb.c
@@ -686,10 +686,7 @@ static ssize_t pvr2fb_write(struct fb_info *info, const char *buf,
if (!pages)
return -ENOMEM;
- down_read(¤t->mm->mmap_sem);
- ret = get_user_pages(current, current->mm, (unsigned long)buf,
- nr_pages, WRITE, 0, pages, NULL);
- up_read(¤t->mm->mmap_sem);
+ ret = get_user_pages_fast((unsigned long)buf, nr_pages, WRITE, pages);
if (ret < nr_pages) {
nr_pages = ret;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8f5330d..6606c10 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -881,7 +881,7 @@ static int lookup_node(struct mm_struct *mm, unsigned long addr)
struct page *p;
int err;
- err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
+ err = get_user_pages_fast(addr & PAGE_MASK, 1, 0, &p);
if (err >= 0) {
err = page_to_nid(p);
put_page(p);
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 5077afc..b159769 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -99,11 +99,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
size_t bytes;
/* Get the pages we're interested in */
- down_read(&mm->mmap_sem);
- pages = get_user_pages(task, mm, pa, pages,
- vm_write, 0, process_pages, NULL);
- up_read(&mm->mmap_sem);
-
+ pages = get_user_pages_unlocked(task, mm, pa, pages,
+ vm_write, 0, process_pages);
if (pages <= 0)
return -EFAULT;
diff --git a/mm/util.c b/mm/util.c
index 093c973..1b93f2d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -247,14 +247,8 @@ int __weak get_user_pages_fast(unsigned long start,
int nr_pages, int write, struct page **pages)
{
struct mm_struct *mm = current->mm;
- int ret;
-
- down_read(&mm->mmap_sem);
- ret = get_user_pages(current, mm, start, nr_pages,
- write, 0, pages, NULL);
- up_read(&mm->mmap_sem);
-
- return ret;
+ return get_user_pages_unlocked(current, mm, start, nr_pages,
+ write, 0, pages);
}
EXPORT_SYMBOL_GPL(get_user_pages_fast);
diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 5550130..5504783 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -23,17 +23,16 @@ struct page **ceph_get_direct_page_vector(const void __user *data,
if (!pages)
return ERR_PTR(-ENOMEM);
- down_read(¤t->mm->mmap_sem);
while (got < num_pages) {
- rc = get_user_pages(current, current->mm,
- (unsigned long)data + ((unsigned long)got * PAGE_SIZE),
- num_pages - got, write_page, 0, pages + got, NULL);
+ rc = get_user_pages_fast((unsigned long)data +
+ ((unsigned long)got * PAGE_SIZE),
+ num_pages - got,
+ write_page, pages + got);
if (rc < 0)
break;
BUG_ON(rc == 0);
got += rc;
}
- up_read(¤t->mm->mmap_sem);
if (rc < 0)
goto fail;
return pages;
On Wed, Oct 01, 2014 at 10:56:36AM +0200, Andrea Arcangeli wrote:
> diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
> index f74fc0c..cd20669 100644
> --- a/drivers/misc/sgi-gru/grufault.c
> +++ b/drivers/misc/sgi-gru/grufault.c
> @@ -198,8 +198,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
> #else
> *pageshift = PAGE_SHIFT;
> #endif
> - if (get_user_pages
> - (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
> + if (get_user_pages_fast(vaddr, 1, write, &page) <= 0)
> return -EFAULT;
> *paddr = page_to_phys(page);
> put_page(page);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 8f5330d..6606c10 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -881,7 +881,7 @@ static int lookup_node(struct mm_struct *mm, unsigned long addr)
> struct page *p;
> int err;
>
> - err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
> + err = get_user_pages_fast(addr & PAGE_MASK, 1, 0, &p);
> if (err >= 0) {
> err = page_to_nid(p);
> put_page(p);
I just noticed I need to revert the above two changes... (neither was
exercised during the testing).
On Wed, Oct 01, 2014 at 10:56:35AM +0200, Andrea Arcangeli wrote:
> +static inline long __get_user_pages_locked(struct task_struct *tsk,
> + struct mm_struct *mm,
> + unsigned long start,
> + unsigned long nr_pages,
> + int write, int force,
> + struct page **pages,
> + struct vm_area_struct **vmas,
> + int *locked,
> + bool notify_drop)
> +{
> + int flags = FOLL_TOUCH;
> + long ret, pages_done;
> + bool lock_dropped;
> +
> + if (locked) {
> + /* if VM_FAULT_RETRY can be returned, vmas become invalid */
> + BUG_ON(vmas);
> + /* check caller initialized locked */
> + BUG_ON(*locked != 1);
> + }
> +
> + if (pages)
> + flags |= FOLL_GET;
> + if (write)
> + flags |= FOLL_WRITE;
> + if (force)
> + flags |= FOLL_FORCE;
> +
> + pages_done = 0;
> + lock_dropped = false;
> + for (;;) {
> + ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
> + vmas, locked);
> + if (!locked)
> + /* VM_FAULT_RETRY couldn't trigger, bypass */
> + return ret;
> +
> + /* VM_FAULT_RETRY cannot return errors */
> + if (!*locked) {
> + BUG_ON(ret < 0);
> + BUG_ON(nr_pages == 1 && ret);
If I understand correctly, this second BUG_ON is asserting that when
__get_user_pages is asked for a single page and successfully gets the
page, then it shouldn't have dropped the mmap_sem. If that's the case, then
you could generalize this assertion to
BUG_ON(nr_pages == ret);
Otherwise, looks good!
Peter
Hi Andrea,
On 10/01/2014 04:56 AM, Andrea Arcangeli wrote:
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 8f5330d..6606c10 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -881,7 +881,7 @@ static int lookup_node(struct mm_struct *mm, unsigned long addr)
> struct page *p;
> int err;
>
> - err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
> + err = get_user_pages_fast(addr & PAGE_MASK, 1, 0, &p);
> if (err >= 0) {
> err = page_to_nid(p);
> put_page(p);
This change looks bogus. mmap_sem might get locked in do_get_mempolicy(), and with this
change we'll try locking it again in get_user_pages_fast.
Thanks,
Sasha
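For reference, a simplified reconstruction of the recursion (condensed;
the *_sketch helpers are hypothetical stand-ins for the mm/mempolicy.c
code):

#include <linux/mm.h>
#include <linux/sched.h>

/* Hypothetical stand-in for lookup_node() after the change. */
static int lookup_node_sketch(unsigned long addr)
{
	struct page *p;
	int err;

	/*
	 * The gup_fast slow path falls back to
	 * down_read(&current->mm->mmap_sem) when the lockless walk
	 * fails -- a second read acquisition by the same task.
	 */
	err = get_user_pages_fast(addr & PAGE_MASK, 1, 0, &p);
	if (err > 0) {
		err = page_to_nid(p);
		put_page(p);
	}
	return err;
}

/* Hypothetical stand-in for the do_get_mempolicy() MPOL_F_ADDR path. */
static long do_get_mempolicy_sketch(unsigned long addr)
{
	struct mm_struct *mm = current->mm;
	long node;

	down_read(&mm->mmap_sem);		/* outer read lock */
	node = lookup_node_sketch(addr);	/* may re-take mmap_sem for read */
	up_read(&mm->mmap_sem);
	return node;
}

A second down_read() by a task that already holds the rwsem for read is
not safe in general: if a writer queued on mmap_sem in between, the
inner down_read() blocks behind it and the task deadlocks against
itself.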
On Wed, Oct 1, 2014 at 8:51 AM, Peter Feiner <[email protected]> wrote:
> On Wed, Oct 01, 2014 at 10:56:35AM +0200, Andrea Arcangeli wrote:
>> +static inline long __get_user_pages_locked(struct task_struct *tsk,
>> + struct mm_struct *mm,
>> + unsigned long start,
>> + unsigned long nr_pages,
>> + int write, int force,
>> + struct page **pages,
>> + struct vm_area_struct **vmas,
>> + int *locked,
>> + bool notify_drop)
>> +{
>> + int flags = FOLL_TOUCH;
>> + long ret, pages_done;
>> + bool lock_dropped;
>> +
>> + if (locked) {
>> + /* if VM_FAULT_RETRY can be returned, vmas become invalid */
>> + BUG_ON(vmas);
>> + /* check caller initialized locked */
>> + BUG_ON(*locked != 1);
>> + }
>> +
>> + if (pages)
>> + flags |= FOLL_GET;
>> + if (write)
>> + flags |= FOLL_WRITE;
>> + if (force)
>> + flags |= FOLL_FORCE;
>> +
>> + pages_done = 0;
>> + lock_dropped = false;
>> + for (;;) {
>> + ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
>> + vmas, locked);
>> + if (!locked)
>> + /* VM_FAULT_RETRY couldn't trigger, bypass */
>> + return ret;
>> +
>> + /* VM_FAULT_RETRY cannot return errors */
>> + if (!*locked) {
>> + BUG_ON(ret < 0);
>> + BUG_ON(nr_pages == 1 && ret);
>
> If I understand correctly, this second BUG_ON is asserting that when
> __get_user_pages is asked for a single page and successfully gets the
> page, then it shouldn't have dropped the mmap_sem. If that's the case, then
> you could generalize this assertion to
>
> BUG_ON(nr_pages == ret);
Even more strict:
BUG_ON(ret >= nr_pages);
Reviewed-by: Andres Lagar-Cavilla <[email protected]>
>
> Otherwise, looks good!
>
> Peter
--
Andres Lagar-Cavilla | Google Kernel Team | [email protected]
On Wed, Oct 01, 2014 at 10:06:27AM -0700, Andres Lagar-Cavilla wrote:
> On Wed, Oct 1, 2014 at 8:51 AM, Peter Feiner <[email protected]> wrote:
> > On Wed, Oct 01, 2014 at 10:56:35AM +0200, Andrea Arcangeli wrote:
> >> + /* VM_FAULT_RETRY cannot return errors */
> >> + if (!*locked) {
> >> + BUG_ON(ret < 0);
> >> + BUG_ON(nr_pages == 1 && ret);
> >
> > If I understand correctly, this second BUG_ON is asserting that when
> > __get_user_pages is asked for a single page and successfully gets the
> > page, then it shouldn't have dropped the mmap_sem. If that's the case, then
> > you could generalize this assertion to
> >
> > BUG_ON(nr_pages == ret);
Agreed.
>
> Even more strict:
> BUG_ON(ret >= nr_pages);
Agreed too, plus this should be quicker than my weaker check.
Maybe some of the BUG_ONs can be deleted later or converted to
VM_BUG_ON, but initially I feel safer with the BUG_ON considering that
this is a slow path.
> Reviewed-by: Andres Lagar-Cavilla <[email protected]>
Thanks!
On Wed, Oct 01, 2014 at 10:56:34AM +0200, Andrea Arcangeli wrote:
> From: Andres Lagar-Cavilla <[email protected]>
This needs a changelog....
> Reviewed-by: Radim Krčmář <[email protected]>
> Signed-off-by: Andres Lagar-Cavilla <[email protected]>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> include/linux/mm.h | 1 +
> mm/gup.c | 4 ++++
> 2 files changed, 5 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8981cc8..0f4196a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1985,6 +1985,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
> #define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
> #define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
> #define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
> +#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
>
> typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
> void *data);
> diff --git a/mm/gup.c b/mm/gup.c
> index 91d044b..af7ea3e 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
> fault_flags |= FAULT_FLAG_ALLOW_RETRY;
> if (*flags & FOLL_NOWAIT)
> fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
> + if (*flags & FOLL_TRIED) {
> + VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
> + fault_flags |= FAULT_FLAG_TRIED;
> + }
>
> ret = handle_mm_fault(mm, vma, address, fault_flags);
> if (ret & VM_FAULT_ERROR) {
On Wed, Oct 01, 2014 at 10:56:35AM +0200, Andrea Arcangeli wrote:
> +static inline long __get_user_pages_locked(struct task_struct *tsk,
> + struct mm_struct *mm,
> + unsigned long start,
> + unsigned long nr_pages,
> + int write, int force,
> + struct page **pages,
> + struct vm_area_struct **vmas,
> + int *locked,
> + bool notify_drop)
You might want to consider __always_inline to make sure it does indeed
get inlined and constant propagation works for @locked and @notify_drop.
On Wed, Oct 01, 2014 at 10:56:35AM +0200, Andrea Arcangeli wrote:
> +static inline long __get_user_pages_locked(struct task_struct *tsk,
> + struct mm_struct *mm,
> + unsigned long start,
> + unsigned long nr_pages,
> + int write, int force,
> + struct page **pages,
> + struct vm_area_struct **vmas,
> + int *locked,
> + bool notify_drop)
> +{
> + if (notify_drop && lock_dropped && *locked) {
> + /*
> + * We must let the caller know we temporarily dropped the lock
> + * and so the critical section protected by it was lost.
> + */
> + up_read(&mm->mmap_sem);
> + *locked = 0;
> + }
> + return pages_done;
> +}
> +long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
> + unsigned long start, unsigned long nr_pages,
> + int write, int force, struct page **pages,
> + int *locked)
> +{
> + return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> + pages, NULL, locked, true);
> +}
> +long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
> + unsigned long start, unsigned long nr_pages,
> + int write, int force, struct page **pages)
> +{
> + long ret;
> + int locked = 1;
> + down_read(&mm->mmap_sem);
> + ret = __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> + pages, NULL, &locked, false);
> + if (locked)
> + up_read(&mm->mmap_sem);
> + return ret;
> +}
> long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> unsigned long start, unsigned long nr_pages, int write,
> int force, struct page **pages, struct vm_area_struct **vmas)
> {
> + return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> + pages, vmas, NULL, false);
> }
I'm wondering about that notify_drop parameter, what's the added
benefit? If you look at these 3 callers we can do away with it, since in
the second called where we have locked but !notify_drop we seem to do
the exact same thing afterwards anyway.
On Wed, Oct 01, 2014 at 10:56:36AM +0200, Andrea Arcangeli wrote:
> Just an optimization.
Does it make sense to split the thing in two? One where you apply
_unlocked and then one where you apply _fast?
On Thu, Oct 09, 2014 at 12:52:45PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 01, 2014 at 10:56:36AM +0200, Andrea Arcangeli wrote:
> > Just an optimization.
>
> Does it make sense to split the thing in two? One where you apply
> _unlocked and then one where you apply _fast?
Yes but I already dropped the _fast optimization, as the latency
enhancements to gup_fast were NAKed earlier in this thread. So this
patch has already been updated to only apply _unlocked.
http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=bc2e0473b601c6a330ddb4adbcf4c048b2233d4e
On Thu, Oct 09, 2014 at 12:47:23PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 01, 2014 at 10:56:35AM +0200, Andrea Arcangeli wrote:
> > +static inline long __get_user_pages_locked(struct task_struct *tsk,
> > + struct mm_struct *mm,
> > + unsigned long start,
> > + unsigned long nr_pages,
> > + int write, int force,
> > + struct page **pages,
> > + struct vm_area_struct **vmas,
> > + int *locked,
> > + bool notify_drop)
>
> You might want to consider __always_inline to make sure it does indeed
> get inlined and constant propagation works for @locked and @notify_drop.
Ok, that's included in the last patchset submit.
Thanks,
Andrea
On Thu, Oct 09, 2014 at 12:50:37PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 01, 2014 at 10:56:35AM +0200, Andrea Arcangeli wrote:
>
> > +static inline long __get_user_pages_locked(struct task_struct *tsk,
> > + struct mm_struct *mm,
> > + unsigned long start,
> > + unsigned long nr_pages,
> > + int write, int force,
> > + struct page **pages,
> > + struct vm_area_struct **vmas,
> > + int *locked,
> > + bool notify_drop)
> > +{
>
> > + if (notify_drop && lock_dropped && *locked) {
> > + /*
> > + * We must let the caller know we temporarily dropped the lock
> > + * and so the critical section protected by it was lost.
> > + */
> > + up_read(&mm->mmap_sem);
> > + *locked = 0;
> > + }
> > + return pages_done;
> > +}
>
> > +long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
> > + unsigned long start, unsigned long nr_pages,
> > + int write, int force, struct page **pages,
> > + int *locked)
> > +{
> > + return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> > + pages, NULL, locked, true);
> > +}
>
> > +long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
> > + unsigned long start, unsigned long nr_pages,
> > + int write, int force, struct page **pages)
> > +{
> > + long ret;
> > + int locked = 1;
> > + down_read(&mm->mmap_sem);
> > + ret = __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> > + pages, NULL, &locked, false);
> > + if (locked)
> > + up_read(&mm->mmap_sem);
> > + return ret;
> > +}
>
> > long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > unsigned long start, unsigned long nr_pages, int write,
> > int force, struct page **pages, struct vm_area_struct **vmas)
> > {
> > + return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> > + pages, vmas, NULL, false);
> > }
>
> I'm wondering about that notify_drop parameter, what's the added
> benefit? If you look at these 3 callers we can do away with it, since in
> the second caller, where we have locked but !notify_drop, we seem to do
The second and third callers pass notify_drop=false, so the
notify_drop parameter is always a noop for them. They certainly could
get away without it.
> the exact same thing afterwards anyway.
It makes a difference only to the first caller; if it weren't for the
first caller, notify_drop could be dropped. The first caller does this:
return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
pages, NULL, locked, true, FOLL_TOUCH);
^ notify_drop = true
Without "notify_drop=true" the first caller could make its own
respective caller think the lock has never been dropped, just because
it is locked by the time get_user_pages_locked returned. But the
caller must be made aware that the lock has been dropped during the
call and in turn any "vma" it got before inside the mmap_sem critical
section is now stale. That's all notify_drop achieves.
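A hypothetical caller (not one of the three in this patch) makes the
hazard concrete:

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/errno.h>

static long locked_caller_sketch(struct task_struct *tsk,
				 struct mm_struct *mm, unsigned long addr,
				 struct page **page)
{
	struct vm_area_struct *vma;
	int locked = 1;
	long ret;

	down_read(&mm->mmap_sem);
	vma = find_vma(mm, addr);	/* only valid while mmap_sem is held */
	if (!vma || vma->vm_start > addr) {
		up_read(&mm->mmap_sem);
		return -EFAULT;
	}
	ret = get_user_pages_locked(tsk, mm, addr, 1, 1, 0, page, &locked);
	if (!locked)
		/* lock was dropped and released: "vma" is stale, don't reuse it */
		return ret;
	/* lock was never dropped: "vma" is still safe to dereference here */
	up_read(&mm->mmap_sem);
	return ret;
}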
Thanks,
Andrea