2018-09-21 22:42:07

by Keith Busch

Subject: [PATCHv3 0/6] mm: faster get user pages

Changes since v2:

Combine only the output parameters that need tracking into a struct,
and squash to just one final kernel patch.

Fixed compile bugs for all configs.

Keith Busch (6):
mm/gup_benchmark: Time put_page
mm/gup_benchmark: Add additional pinning methods
tools/gup_benchmark: Fix 'write' flag usage
tools/gup_benchmark: Allow user specified file
tools/gup_benchmark: Add parameter for hugetlb
mm/gup: Cache dev_pagemap while pinning pages

include/linux/huge_mm.h | 8 +--
include/linux/mm.h | 19 ++++++-
mm/gup.c | 90 +++++++++++++++++-------------
mm/gup_benchmark.c | 36 ++++++++++--
mm/huge_memory.c | 38 ++++++-------
mm/nommu.c | 4 +-
tools/testing/selftests/vm/gup_benchmark.c | 40 +++++++++++--
7 files changed, 154 insertions(+), 81 deletions(-)

--
2.14.4



2018-09-21 22:40:30

by Keith Busch

Subject: [PATCHv3 3/6] tools/gup_benchmark: Fix 'write' flag usage

If the '-w' parameter was provided, the benchmark would exit due to a
missing 'break'.

Cc: Kirill Shutemov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
tools/testing/selftests/vm/gup_benchmark.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index c2f785ded9b9..b2082df8beb4 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -60,6 +60,7 @@ int main(int argc, char **argv)
break;
case 'w':
write = 1;
+ break;
default:
return -1;
}
--
2.14.4
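
Editorial note: the fall-through fixed by this patch can be reproduced in isolation. The sketch below is not the selftest's code; parse_opt_buggy() and parse_opt_fixed() are hypothetical stand-ins for the option switch in main(), reduced to the 'w' case.

```c
/*
 * Hypothetical reduction of the option switch in gup_benchmark's main().
 * Both helpers return 0 when the option is accepted and -1 when rejected.
 */
static int parse_opt_buggy(int opt, int *write)
{
	switch (opt) {
	case 'w':
		*write = 1;
		/* fall through: without a 'break' we land in 'default' */
	default:
		return -1;
	}
	return 0;
}

static int parse_opt_fixed(int opt, int *write)
{
	switch (opt) {
	case 'w':
		*write = 1;
		break;		/* the one-line fix from the patch */
	default:
		return -1;
	}
	return 0;
}
```

With the buggy variant the flag is set but the option is still rejected, which matches the behaviour described in the commit message: the benchmark exits as soon as '-w' is parsed.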


2018-09-21 22:40:38

by Keith Busch

Subject: [PATCHv3 5/6] tools/gup_benchmark: Add parameter for hugetlb

Cc: Kirill Shutemov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
tools/testing/selftests/vm/gup_benchmark.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index f2c99e2436f8..5d96e2b3d2f1 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -38,7 +38,7 @@ int main(int argc, char **argv)
char *file = NULL;
char *p;

- while ((opt = getopt(argc, argv, "m:r:n:f:tTLU")) != -1) {
+ while ((opt = getopt(argc, argv, "m:r:n:f:tTLUH")) != -1) {
switch (opt) {
case 'm':
size = atoi(optarg) * MB;
@@ -64,6 +64,9 @@ int main(int argc, char **argv)
case 'w':
write = 1;
break;
+ case 'H':
+ flags |= MAP_HUGETLB;
+ break;
case 'f':
file = optarg;
flags &= ~(MAP_PRIVATE | MAP_ANONYMOUS);
--
2.14.4


2018-09-21 22:41:01

by Keith Busch

Subject: [PATCHv3 1/6] mm/gup_benchmark: Time put_page

We'd like to measure time to unpin user pages, so this adds a second
benchmark timer on put_page, separate from get_page.

Adding the field breaks this ioctl ABI, but that should be okay since
this is an in-tree kernel selftest.

Cc: Kirill Shutemov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
mm/gup_benchmark.c | 8 ++++++--
tools/testing/selftests/vm/gup_benchmark.c | 6 ++++--
2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 6a473709e9b6..76cd35e477af 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -8,7 +8,8 @@
#define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)

struct gup_benchmark {
- __u64 delta_usec;
+ __u64 get_delta_usec;
+ __u64 put_delta_usec;
__u64 addr;
__u64 size;
__u32 nr_pages_per_call;
@@ -47,14 +48,17 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
}
end_time = ktime_get();

- gup->delta_usec = ktime_us_delta(end_time, start_time);
+ gup->get_delta_usec = ktime_us_delta(end_time, start_time);
gup->size = addr - gup->addr;

+ start_time = ktime_get();
for (i = 0; i < nr_pages; i++) {
if (!pages[i])
break;
put_page(pages[i]);
}
+ end_time = ktime_get();
+ gup->put_delta_usec = ktime_us_delta(end_time, start_time);

kvfree(pages);
return 0;
diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index 36df55132036..bdcb97acd0ac 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -17,7 +17,8 @@
#define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)

struct gup_benchmark {
- __u64 delta_usec;
+ __u64 get_delta_usec;
+ __u64 put_delta_usec;
__u64 addr;
__u64 size;
__u32 nr_pages_per_call;
@@ -81,7 +82,8 @@ int main(int argc, char **argv)
if (ioctl(fd, GUP_FAST_BENCHMARK, &gup))
perror("ioctl"), exit(1);

- printf("Time: %lld us", gup.delta_usec);
+ printf("Time: get:%lld put:%lld us", gup.get_delta_usec,
+ gup.put_delta_usec);
if (gup.size != size)
printf(", truncated (size: %lld)", gup.size);
printf("\n");
--
2.14.4
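
Editorial note: the idea of this patch, one timer per phase instead of a single timer around everything, can be sketched in userspace. This is only an illustration under stated assumptions: struct demo_times, demo_now_usec() and demo_time_phases() are hypothetical names, and clock_gettime(CLOCK_MONOTONIC) stands in for the kernel's ktime_get().

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Hypothetical result record; mirrors the two fields added by the patch. */
struct demo_times {
	uint64_t get_delta_usec;
	uint64_t put_delta_usec;
};

/* Monotonic clock in microseconds (userspace stand-in for ktime_get()). */
static uint64_t demo_now_usec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000 + (uint64_t)ts.tv_nsec / 1000;
}

/* Time the "get" and "put" phases separately, restarting the timer between them. */
static void demo_time_phases(void (*get_phase)(void *),
			     void (*put_phase)(void *),
			     void *arg, struct demo_times *out)
{
	uint64_t start;

	start = demo_now_usec();
	get_phase(arg);
	out->get_delta_usec = demo_now_usec() - start;

	start = demo_now_usec();
	put_phase(arg);
	out->put_delta_usec = demo_now_usec() - start;
}

/* Trivial phase used for illustration: counts how often it was called. */
static void demo_count_call(void *arg)
{
	++*(int *)arg;
}
```

Restarting the timer before the second phase is the important part; reusing the first start time would fold the pin cost into the unpin figure.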


2018-09-21 22:41:11

by Keith Busch

Subject: [PATCHv3 2/6] mm/gup_benchmark: Add additional pinning methods

This patch provides new gup benchmark ioctl commands to run different
user page pinning methods, get_user_pages_longterm and get_user_pages,
in addition to the existing get_user_pages_fast.

Cc: Kirill Shutemov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
mm/gup_benchmark.c | 28 ++++++++++++++++++++++++++--
tools/testing/selftests/vm/gup_benchmark.c | 13 +++++++++++--
2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 76cd35e477af..e6d9ce001ffa 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -6,6 +6,8 @@
#include <linux/debugfs.h>

#define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
+#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
+#define GUP_BENCHMARK _IOWR('g', 3, struct gup_benchmark)

struct gup_benchmark {
__u64 get_delta_usec;
@@ -41,7 +43,23 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = (next - addr) / PAGE_SIZE;
}

- nr = get_user_pages_fast(addr, nr, gup->flags & 1, pages + i);
+ switch (cmd) {
+ case GUP_FAST_BENCHMARK:
+ nr = get_user_pages_fast(addr, nr, gup->flags & 1,
+ pages + i);
+ break;
+ case GUP_LONGTERM_BENCHMARK:
+ nr = get_user_pages_longterm(addr, nr, gup->flags & 1,
+ pages + i, NULL);
+ break;
+ case GUP_BENCHMARK:
+ nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
+ NULL);
+ break;
+ default:
+ return -1;
+ }
+
if (nr <= 0)
break;
i += nr;
@@ -70,8 +88,14 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
struct gup_benchmark gup;
int ret;

- if (cmd != GUP_FAST_BENCHMARK)
+ switch (cmd) {
+ case GUP_FAST_BENCHMARK:
+ case GUP_LONGTERM_BENCHMARK:
+ case GUP_BENCHMARK:
+ break;
+ default:
return -EINVAL;
+ }

if (copy_from_user(&gup, (void __user *)arg, sizeof(gup)))
return -EFAULT;
diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index bdcb97acd0ac..c2f785ded9b9 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -15,6 +15,8 @@
#define PAGE_SIZE sysconf(_SC_PAGESIZE)

#define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
+#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
+#define GUP_BENCHMARK _IOWR('g', 3, struct gup_benchmark)

struct gup_benchmark {
__u64 get_delta_usec;
@@ -30,9 +32,10 @@ int main(int argc, char **argv)
struct gup_benchmark gup;
unsigned long size = 128 * MB;
int i, fd, opt, nr_pages = 1, thp = -1, repeats = 1, write = 0;
+ int cmd = GUP_FAST_BENCHMARK;
char *p;

- while ((opt = getopt(argc, argv, "m:r:n:tT")) != -1) {
+ while ((opt = getopt(argc, argv, "m:r:n:tTLU")) != -1) {
switch (opt) {
case 'm':
size = atoi(optarg) * MB;
@@ -49,6 +52,12 @@ int main(int argc, char **argv)
case 'T':
thp = 0;
break;
+ case 'L':
+ cmd = GUP_LONGTERM_BENCHMARK;
+ break;
+ case 'U':
+ cmd = GUP_BENCHMARK;
+ break;
case 'w':
write = 1;
default:
@@ -79,7 +88,7 @@ int main(int argc, char **argv)

for (i = 0; i < repeats; i++) {
gup.size = size;
- if (ioctl(fd, GUP_FAST_BENCHMARK, &gup))
+ if (ioctl(fd, cmd, &gup))
perror("ioctl"), exit(1);

printf("Time: get:%lld put:%lld us", gup.get_delta_usec,
--
2.14.4
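
Editorial note: the three commands differ only in the number passed to _IOWR(), and _IOWR() also encodes the size of the argument struct. That second point is why patch 1/6 describes adding a field as an ABI break. The sketch below is illustrative only: the DEMO_* names and both struct layouts are assumptions (the thread shows only part of struct gup_benchmark), and demo_cmd_known() merely mimics the validation switch added to gup_benchmark_ioctl().

```c
#include <stdint.h>
#include <sys/ioctl.h>

/* Assumed layout before patch 1/6: a single timing field. */
struct demo_gup_v1 {
	uint64_t delta_usec;
	uint64_t addr;
	uint64_t size;
	uint32_t nr_pages_per_call;
	uint32_t flags;
};

/* Assumed layout after patch 1/6: separate get and put timings. */
struct demo_gup_v2 {
	uint64_t get_delta_usec;
	uint64_t put_delta_usec;
	uint64_t addr;
	uint64_t size;
	uint32_t nr_pages_per_call;
	uint32_t flags;
};

#define DEMO_FAST_V1		_IOWR('g', 1, struct demo_gup_v1)
#define DEMO_FAST_BENCHMARK	_IOWR('g', 1, struct demo_gup_v2)
#define DEMO_LONGTERM_BENCHMARK	_IOWR('g', 2, struct demo_gup_v2)
#define DEMO_GUP_BENCHMARK	_IOWR('g', 3, struct demo_gup_v2)

/* Accept the three known commands and reject everything else. */
static int demo_cmd_known(unsigned long cmd)
{
	switch (cmd) {
	case DEMO_FAST_BENCHMARK:
	case DEMO_LONGTERM_BENCHMARK:
	case DEMO_GUP_BENCHMARK:
		return 1;
	default:
		return 0;
	}
}
```

Because the struct size is part of the command value, a binary built against the old layout issues a command the new kernel code no longer recognises, so the mismatch surfaces as a rejected ioctl rather than as silently misread fields.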


2018-09-21 22:42:34

by Keith Busch

Subject: [PATCHv3 4/6] tools/gup_benchmark: Allow user specified file

The gup benchmark by default maps anonymous memory. This patch allows a
user to specify a file to map, providing a means to test various
file backings, like device and filesystem DAX.

Cc: Kirill Shutemov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
tools/testing/selftests/vm/gup_benchmark.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index b2082df8beb4..f2c99e2436f8 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -33,9 +33,12 @@ int main(int argc, char **argv)
unsigned long size = 128 * MB;
int i, fd, opt, nr_pages = 1, thp = -1, repeats = 1, write = 0;
int cmd = GUP_FAST_BENCHMARK;
+ int file_map = -1;
+ int flags = MAP_ANONYMOUS | MAP_PRIVATE;
+ char *file = NULL;
char *p;

- while ((opt = getopt(argc, argv, "m:r:n:tTLU")) != -1) {
+ while ((opt = getopt(argc, argv, "m:r:n:f:tTLU")) != -1) {
switch (opt) {
case 'm':
size = atoi(optarg) * MB;
@@ -61,11 +64,22 @@ int main(int argc, char **argv)
case 'w':
write = 1;
break;
+ case 'f':
+ file = optarg;
+ flags &= ~(MAP_PRIVATE | MAP_ANONYMOUS);
+ flags |= MAP_SHARED;
+ break;
default:
return -1;
}
}

+ if (file) {
+ file_map = open(file, O_RDWR|O_CREAT);
+ if (file_map < 0)
+ perror("open"), exit(file_map);
+ }
+
gup.nr_pages_per_call = nr_pages;
gup.flags = write;

@@ -73,8 +87,7 @@ int main(int argc, char **argv)
if (fd == -1)
perror("open"), exit(1);

- p = mmap(NULL, size, PROT_READ | PROT_WRITE,
- MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, file_map, 0);
if (p == MAP_FAILED)
perror("mmap"), exit(1);
gup.addr = (unsigned long)p;
--
2.14.4


2018-09-21 22:44:31

by Keith Busch

Subject: [PATCHv3 6/6] mm/gup: Cache dev_pagemap while pinning pages

Pinning pages from ZONE_DEVICE memory needs to check the backing device's
live-ness, which is tracked in the device's dev_pagemap metadata. This
metadata is stored in a radix tree and looking it up adds measurable
software overhead.

This patch avoids repeating this relatively costly operation when
dev_pagemap is used by caching the last dev_pagemap when getting user
pages. The gup_benchmark reports this reduces the time to get user pages
to as low as 1/3 of the previous time.

The cached value is combined with other output parameters into a context
struct to keep the parameter count down.

Cc: Kirill Shutemov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
include/linux/huge_mm.h | 8 ++---
include/linux/mm.h | 19 +++++++++--
mm/gup.c | 90 +++++++++++++++++++++++++++----------------------
mm/huge_memory.c | 38 +++++++++------------
mm/nommu.c | 4 +--
5 files changed, 88 insertions(+), 71 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 99c19b06d9a4..5cbabdebe9af 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -213,9 +213,9 @@ static inline int hpage_nr_pages(struct page *page)
}

struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
- pmd_t *pmd, int flags);
+ pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
- pud_t *pud, int flags);
+ pud_t *pud, int flags, struct dev_pagemap **pgmap);

extern vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);

@@ -344,13 +344,13 @@ static inline void mm_put_huge_zero_page(struct mm_struct *mm)
}

static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
- unsigned long addr, pmd_t *pmd, int flags)
+ unsigned long addr, pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
{
return NULL;
}

static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
- unsigned long addr, pud_t *pud, int flags)
+ unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap)
{
return NULL;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a61ebe8ad4ca..79c80496dd50 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2534,15 +2534,28 @@ static inline vm_fault_t vmf_error(int err)
return VM_FAULT_SIGBUS;
}

+struct follow_page_context {
+ struct dev_pagemap *pgmap;
+ unsigned int page_mask;
+};
+
struct page *follow_page_mask(struct vm_area_struct *vma,
unsigned long address, unsigned int foll_flags,
- unsigned int *page_mask);
+ struct follow_page_context *ctx);

static inline struct page *follow_page(struct vm_area_struct *vma,
unsigned long address, unsigned int foll_flags)
{
- unsigned int unused_page_mask;
- return follow_page_mask(vma, address, foll_flags, &unused_page_mask);
+ struct page *page;
+ struct follow_page_context ctx = {
+ .pgmap = NULL,
+ .page_mask = 0,
+ };
+
+ page = follow_page_mask(vma, address, foll_flags, &ctx);
+ if (ctx.pgmap)
+ put_dev_pagemap(ctx.pgmap);
+ return page;
}

#define FOLL_WRITE 0x01 /* check pte is writable */
diff --git a/mm/gup.c b/mm/gup.c
index 1abc8b4afff6..124e7293e381 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -71,10 +71,10 @@ static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
}

static struct page *follow_page_pte(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, unsigned int flags)
+ unsigned long address, pmd_t *pmd, unsigned int flags,
+ struct dev_pagemap **pgmap)
{
struct mm_struct *mm = vma->vm_mm;
- struct dev_pagemap *pgmap = NULL;
struct page *page;
spinlock_t *ptl;
pte_t *ptep, pte;
@@ -116,8 +116,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
* Only return device mapping pages in the FOLL_GET case since
* they are only valid while holding the pgmap reference.
*/
- pgmap = get_dev_pagemap(pte_pfn(pte), NULL);
- if (pgmap)
+ *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
+ if (*pgmap)
page = pte_page(pte);
else
goto no_page;
@@ -156,9 +156,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
get_page(page);

/* drop the pgmap reference now that we hold the page */
- if (pgmap) {
- put_dev_pagemap(pgmap);
- pgmap = NULL;
+ if (*pgmap) {
+ put_dev_pagemap(*pgmap);
+ *pgmap = NULL;
}
}
if (flags & FOLL_TOUCH) {
@@ -210,7 +210,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,

static struct page *follow_pmd_mask(struct vm_area_struct *vma,
unsigned long address, pud_t *pudp,
- unsigned int flags, unsigned int *page_mask)
+ unsigned int flags,
+ struct follow_page_context *ctx)
{
pmd_t *pmd, pmdval;
spinlock_t *ptl;
@@ -258,13 +259,13 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
}
if (pmd_devmap(pmdval)) {
ptl = pmd_lock(mm, pmd);
- page = follow_devmap_pmd(vma, address, pmd, flags);
+ page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
spin_unlock(ptl);
if (page)
return page;
}
if (likely(!pmd_trans_huge(pmdval)))
- return follow_page_pte(vma, address, pmd, flags);
+ return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);

if ((flags & FOLL_NUMA) && pmd_protnone(pmdval))
return no_page_table(vma, flags);
@@ -284,7 +285,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
}
if (unlikely(!pmd_trans_huge(*pmd))) {
spin_unlock(ptl);
- return follow_page_pte(vma, address, pmd, flags);
+ return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
if (flags & FOLL_SPLIT) {
int ret;
@@ -307,18 +308,18 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
}

return ret ? ERR_PTR(ret) :
- follow_page_pte(vma, address, pmd, flags);
+ follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
page = follow_trans_huge_pmd(vma, address, pmd, flags);
spin_unlock(ptl);
- *page_mask = HPAGE_PMD_NR - 1;
+ ctx->page_mask = HPAGE_PMD_NR - 1;
return page;
}

-
static struct page *follow_pud_mask(struct vm_area_struct *vma,
unsigned long address, p4d_t *p4dp,
- unsigned int flags, unsigned int *page_mask)
+ unsigned int flags,
+ struct follow_page_context *ctx)
{
pud_t *pud;
spinlock_t *ptl;
@@ -344,7 +345,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
}
if (pud_devmap(*pud)) {
ptl = pud_lock(mm, pud);
- page = follow_devmap_pud(vma, address, pud, flags);
+ page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap);
spin_unlock(ptl);
if (page)
return page;
@@ -352,13 +353,13 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
if (unlikely(pud_bad(*pud)))
return no_page_table(vma, flags);

- return follow_pmd_mask(vma, address, pud, flags, page_mask);
+ return follow_pmd_mask(vma, address, pud, flags, ctx);
}

-
static struct page *follow_p4d_mask(struct vm_area_struct *vma,
unsigned long address, pgd_t *pgdp,
- unsigned int flags, unsigned int *page_mask)
+ unsigned int flags,
+ struct follow_page_context *ctx)
{
p4d_t *p4d;
struct page *page;
@@ -378,7 +379,7 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
return page;
return no_page_table(vma, flags);
}
- return follow_pud_mask(vma, address, p4d, flags, page_mask);
+ return follow_pud_mask(vma, address, p4d, flags, ctx);
}

/**
@@ -396,13 +397,13 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
*/
struct page *follow_page_mask(struct vm_area_struct *vma,
unsigned long address, unsigned int flags,
- unsigned int *page_mask)
+ struct follow_page_context *ctx)
{
pgd_t *pgd;
struct page *page;
struct mm_struct *mm = vma->vm_mm;

- *page_mask = 0;
+ ctx->page_mask = 0;

/* make this handle hugepd */
page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
@@ -431,7 +432,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
return no_page_table(vma, flags);
}

- return follow_p4d_mask(vma, address, pgd, flags, page_mask);
+ return follow_p4d_mask(vma, address, pgd, flags, ctx);
}

static int get_gate_page(struct mm_struct *mm, unsigned long address,
@@ -659,9 +660,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas, int *nonblocking)
{
- long i = 0;
- unsigned int page_mask;
+ long ret = 0, i = 0;
struct vm_area_struct *vma = NULL;
+ struct follow_page_context ctx = {};

if (!nr_pages)
return 0;
@@ -691,12 +692,14 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
pages ? &pages[i] : NULL);
if (ret)
return i ? : ret;
- page_mask = 0;
+ ctx.page_mask = 0;
goto next_page;
}

- if (!vma || check_vma_flags(vma, gup_flags))
- return i ? : -EFAULT;
+ if (!vma || check_vma_flags(vma, gup_flags)) {
+ ret = -EFAULT;
+ goto out;
+ }
if (is_vm_hugetlb_page(vma)) {
i = follow_hugetlb_page(mm, vma, pages, vmas,
&start, &nr_pages, i,
@@ -709,23 +712,26 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
* If we have a pending SIGKILL, don't keep faulting pages and
* potentially allocating memory.
*/
- if (unlikely(fatal_signal_pending(current)))
- return i ? i : -ERESTARTSYS;
+ if (unlikely(fatal_signal_pending(current))) {
+ ret = -ERESTARTSYS;
+ goto out;
+ }
cond_resched();
- page = follow_page_mask(vma, start, foll_flags, &page_mask);
+
+ page = follow_page_mask(vma, start, foll_flags, &ctx);
if (!page) {
- int ret;
ret = faultin_page(tsk, vma, start, &foll_flags,
nonblocking);
switch (ret) {
case 0:
goto retry;
+ case -EBUSY:
+ ret = 0;
+ /* FALLTHRU */
case -EFAULT:
case -ENOMEM:
case -EHWPOISON:
- return i ? i : ret;
- case -EBUSY:
- return i;
+ goto out;
case -ENOENT:
goto next_page;
}
@@ -737,27 +743,31 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
*/
goto next_page;
} else if (IS_ERR(page)) {
- return i ? i : PTR_ERR(page);
+ ret = PTR_ERR(page);
+ goto out;
}
if (pages) {
pages[i] = page;
flush_anon_page(vma, page, start);
flush_dcache_page(page);
- page_mask = 0;
+ ctx.page_mask = 0;
}
next_page:
if (vmas) {
vmas[i] = vma;
- page_mask = 0;
+ ctx.page_mask = 0;
}
- page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
+ page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
if (page_increm > nr_pages)
page_increm = nr_pages;
i += page_increm;
start += page_increm * PAGE_SIZE;
nr_pages -= page_increm;
} while (nr_pages);
- return i;
+out:
+ if (ctx.pgmap)
+ put_dev_pagemap(ctx.pgmap);
+ return i ? i : ret;
}

static bool vma_permits_fault(struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 533f9b00147d..9839bf91b057 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -851,13 +851,23 @@ static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
update_mmu_cache_pmd(vma, addr, pmd);
}

+static struct page *pagemap_page(unsigned long pfn, struct dev_pagemap **pgmap)
+{
+ struct page *page;
+
+ *pgmap = get_dev_pagemap(pfn, *pgmap);
+ if (!*pgmap)
+ return ERR_PTR(-EFAULT);
+ page = pfn_to_page(pfn);
+ get_page(page);
+ return page;
+}
+
struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
- pmd_t *pmd, int flags)
+ pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
{
unsigned long pfn = pmd_pfn(*pmd);
struct mm_struct *mm = vma->vm_mm;
- struct dev_pagemap *pgmap;
- struct page *page;

assert_spin_locked(pmd_lockptr(mm, pmd));

@@ -886,14 +896,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
return ERR_PTR(-EEXIST);

pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
- pgmap = get_dev_pagemap(pfn, NULL);
- if (!pgmap)
- return ERR_PTR(-EFAULT);
- page = pfn_to_page(pfn);
- get_page(page);
- put_dev_pagemap(pgmap);
-
- return page;
+ return pagemap_page(pfn, pgmap);
}

int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -1000,12 +1003,10 @@ static void touch_pud(struct vm_area_struct *vma, unsigned long addr,
}

struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
- pud_t *pud, int flags)
+ pud_t *pud, int flags, struct dev_pagemap **pgmap)
{
unsigned long pfn = pud_pfn(*pud);
struct mm_struct *mm = vma->vm_mm;
- struct dev_pagemap *pgmap;
- struct page *page;

assert_spin_locked(pud_lockptr(mm, pud));

@@ -1028,14 +1029,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
return ERR_PTR(-EEXIST);

pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
- pgmap = get_dev_pagemap(pfn, NULL);
- if (!pgmap)
- return ERR_PTR(-EFAULT);
- page = pfn_to_page(pfn);
- get_page(page);
- put_dev_pagemap(pgmap);
-
- return page;
+ return pagemap_page(pfn, pgmap);
}

int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
diff --git a/mm/nommu.c b/mm/nommu.c
index e4aac33216ae..a795c70cf21e 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1711,9 +1711,9 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,

struct page *follow_page_mask(struct vm_area_struct *vma,
unsigned long address, unsigned int flags,
- unsigned int *page_mask)
+ struct follow_page_context *ctx)
{
- *page_mask = 0;
+ ctx->page_mask = 0;
return NULL;
}

--
2.14.4
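
Editorial note: the caching pattern this patch relies on is that the lookup takes the previously returned map as a hint, returns it unchanged when the new pfn still falls inside it, and only drops it and repeats the slow lookup when it does not. The following is a userspace mock of that calling convention, not kernel code: struct demo_pagemap, demo_get(), demo_put() and the two-entry table are all hypothetical, and a linear scan stands in for the radix tree.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct dev_pagemap: a pfn range plus a refcount. */
struct demo_pagemap {
	unsigned long start_pfn;
	unsigned long end_pfn;
	int refs;
};

static struct demo_pagemap demo_maps[] = {
	{ 0, 99, 0 },
	{ 100, 199, 0 },
};

/* Counts how often the expensive path was taken. */
static unsigned long demo_slow_lookups;

static void demo_put(struct demo_pagemap *pgmap)
{
	pgmap->refs--;
}

/*
 * Return a referenced map covering @pfn, reusing @cached when possible.
 * The caller keeps passing the previous return value back in.
 */
static struct demo_pagemap *demo_get(unsigned long pfn,
				     struct demo_pagemap *cached)
{
	size_t i;

	if (cached) {
		/* Cache hit: we already hold a reference, nothing to look up. */
		if (pfn >= cached->start_pfn && pfn <= cached->end_pfn)
			return cached;
		/* Cache miss: drop the stale reference before the slow path. */
		demo_put(cached);
	}

	demo_slow_lookups++;
	for (i = 0; i < sizeof(demo_maps) / sizeof(demo_maps[0]); i++) {
		if (pfn >= demo_maps[i].start_pfn &&
		    pfn <= demo_maps[i].end_pfn) {
			demo_maps[i].refs++;
			return &demo_maps[i];
		}
	}
	return NULL;
}
```

Walking 150 consecutive pfns through demo_get() takes the slow path only twice, once per map, and leaves a single reference for the caller to drop at the end. That final drop corresponds to the put_dev_pagemap() call at the 'out' label in __get_user_pages().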


2018-10-02 10:55:23

by Kirill A. Shutemov

Subject: Re: [PATCHv3 1/6] mm/gup_benchmark: Time put_page

On Fri, Sep 21, 2018 at 10:39:51PM +0000, Keith Busch wrote:
> We'd like to measure time to unpin user pages, so this adds a second
> benchmark timer on put_page, separate from get_page.
>
> Adding the field breaks this ioctl ABI, but that should be okay since
> this is an in-tree kernel selftest.
>
> Cc: Kirill Shutemov <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Dan Williams <[email protected]>
> Signed-off-by: Keith Busch <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2018-10-02 10:57:22

by Kirill A. Shutemov

Subject: Re: [PATCHv3 2/6] mm/gup_benchmark: Add additional pinning methods

On Fri, Sep 21, 2018 at 10:39:52PM +0000, Keith Busch wrote:
> This patch provides new gup benchmark ioctl commands to run different
> user page pinning methods, get_user_pages_longterm and get_user_pages,
> in addition to the existing get_user_pages_fast.
>
> Cc: Kirill Shutemov <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Dan Williams <[email protected]>
> Signed-off-by: Keith Busch <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2018-10-02 10:58:21

by Kirill A. Shutemov

Subject: Re: [PATCHv3 3/6] tools/gup_benchmark: Fix 'write' flag usage

On Fri, Sep 21, 2018 at 10:39:53PM +0000, Keith Busch wrote:
> If the '-w' parameter was provided, the benchmark would exit due to a
> missing 'break'.
>
> Cc: Kirill Shutemov <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Dan Williams <[email protected]>
> Signed-off-by: Keith Busch <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2018-10-02 11:04:28

by Kirill A. Shutemov

Subject: Re: [PATCHv3 4/6] tools/gup_benchmark: Allow user specified file

On Fri, Sep 21, 2018 at 10:39:54PM +0000, Keith Busch wrote:
> The gup benchmark by default maps anonymous memory. This patch allows a
> user to specify a file to map, providing a means to test various
> file backings, like device and filesystem DAX.
>
> Cc: Kirill Shutemov <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Dan Williams <[email protected]>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> tools/testing/selftests/vm/gup_benchmark.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
> index b2082df8beb4..f2c99e2436f8 100644
> --- a/tools/testing/selftests/vm/gup_benchmark.c
> +++ b/tools/testing/selftests/vm/gup_benchmark.c
> @@ -33,9 +33,12 @@ int main(int argc, char **argv)
> unsigned long size = 128 * MB;
> int i, fd, opt, nr_pages = 1, thp = -1, repeats = 1, write = 0;
> int cmd = GUP_FAST_BENCHMARK;
> + int file_map = -1;
> + int flags = MAP_ANONYMOUS | MAP_PRIVATE;
> + char *file = NULL;
> char *p;
>
> - while ((opt = getopt(argc, argv, "m:r:n:tTLU")) != -1) {
> + while ((opt = getopt(argc, argv, "m:r:n:f:tTLU")) != -1) {
> switch (opt) {
> case 'm':
> size = atoi(optarg) * MB;
> @@ -61,11 +64,22 @@ int main(int argc, char **argv)
> case 'w':
> write = 1;
> break;
> + case 'f':
> + file = optarg;
> + flags &= ~(MAP_PRIVATE | MAP_ANONYMOUS);
> + flags |= MAP_SHARED;

Why do we want to assume a shared mapping if a file is passed? A private
file mapping is also a valid target for the benchmark.

Maybe add a separate option for shared? It would keep the options more
independent.

BTW, we can make /dev/zero the default file and drop MAP_ANONYMOUS from
the flags: a private mapping of /dev/zero produces an anonymous mapping.
Then there is no need to mask out MAP_ANONYMOUS on -f, and no branch on
'if (file)' below.

> + break;
> default:
> return -1;
> }
> }
>
> + if (file) {
> + file_map = open(file, O_RDWR|O_CREAT);
> + if (file_map < 0)
> + perror("open"), exit(file_map);
> + }
> +
> gup.nr_pages_per_call = nr_pages;
> gup.flags = write;
>
> @@ -73,8 +87,7 @@ int main(int argc, char **argv)
> if (fd == -1)
> perror("open"), exit(1);
>
> - p = mmap(NULL, size, PROT_READ | PROT_WRITE,
> - MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> + p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, file_map, 0);
> if (p == MAP_FAILED)
> perror("mmap"), exit(1);
> gup.addr = (unsigned long)p;
> --
> 2.14.4
>

--
Kirill A. Shutemov
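
Editorial note: the two suggestions in this reply, an independent "shared" switch and /dev/zero as the default backing file, could look roughly like the sketch below. It is a guess at the shape of the change, not code from any posted version; demo_backing_file(), demo_mmap_flags() and demo_map() are hypothetical helpers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fall back to /dev/zero when the user gave no file. */
static const char *demo_backing_file(const char *file)
{
	return file ? file : "/dev/zero";
}

/* Shared vs. private is chosen on its own, not implied by the file option. */
static int demo_mmap_flags(bool shared)
{
	return shared ? MAP_SHARED : MAP_PRIVATE;
}

/* Map @size bytes of the chosen file; returns MAP_FAILED on any error. */
static void *demo_map(const char *file, bool shared, size_t size)
{
	void *p;
	int fd = open(demo_backing_file(file), O_RDWR);

	if (fd < 0)
		return MAP_FAILED;
	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 demo_mmap_flags(shared), fd, 0);
	close(fd);	/* the mapping stays valid after the descriptor is closed */
	return p;
}
```

With this shape there is no MAP_ANONYMOUS to mask out and no special case for a missing file: calling demo_map(NULL, false, size) maps /dev/zero privately, which behaves like the anonymous mapping the benchmark used before.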

2018-10-02 11:06:04

by Kirill A. Shutemov

Subject: Re: [PATCHv3 5/6] tools/gup_benchmark: Add parameter for hugetlb

On Fri, Sep 21, 2018 at 10:39:55PM +0000, Keith Busch wrote:

-ENOMSG

> Cc: Kirill Shutemov <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Dan Williams <[email protected]>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> tools/testing/selftests/vm/gup_benchmark.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
> index f2c99e2436f8..5d96e2b3d2f1 100644
> --- a/tools/testing/selftests/vm/gup_benchmark.c
> +++ b/tools/testing/selftests/vm/gup_benchmark.c
> @@ -38,7 +38,7 @@ int main(int argc, char **argv)
> char *file = NULL;
> char *p;
>
> - while ((opt = getopt(argc, argv, "m:r:n:f:tTLU")) != -1) {
> + while ((opt = getopt(argc, argv, "m:r:n:f:tTLUH")) != -1) {
> switch (opt) {
> case 'm':
> size = atoi(optarg) * MB;
> @@ -64,6 +64,9 @@ int main(int argc, char **argv)
> case 'w':
> write = 1;
> break;
> + case 'H':
> + flags |= MAP_HUGETLB;
> + break;
> case 'f':
> file = optarg;
> flags &= ~(MAP_PRIVATE | MAP_ANONYMOUS);
> --
> 2.14.4
>

--
Kirill A. Shutemov

2018-10-02 11:26:49

by Kirill A. Shutemov

Subject: Re: [PATCHv3 6/6] mm/gup: Cache dev_pagemap while pinning pages

On Fri, Sep 21, 2018 at 10:39:56PM +0000, Keith Busch wrote:
> Pinning pages from ZONE_DEVICE memory needs to check the backing device's
> live-ness, which is tracked in the device's dev_pagemap metadata. This
> metadata is stored in a radix tree and looking it up adds measurable
> software overhead.
>
> This patch avoids repeating this relatively costly operation when
> dev_pagemap is used by caching the last dev_pagemap when getting user
> pages. The gup_benchmark reports this reduces the time to get user pages
> to as low as 1/3 of the previous time.
>
> The cached value is combined with other output parameters into a context
> struct to keep the parameter count down.
>
> Cc: Kirill Shutemov <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Dan Williams <[email protected]>
> Signed-off-by: Keith Busch <[email protected]>
> ---

....

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a61ebe8ad4ca..79c80496dd50 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2534,15 +2534,28 @@ static inline vm_fault_t vmf_error(int err)
> return VM_FAULT_SIGBUS;
> }
>
> +struct follow_page_context {
> + struct dev_pagemap *pgmap;
> + unsigned int page_mask;
> +};
> +
> struct page *follow_page_mask(struct vm_area_struct *vma,
> unsigned long address, unsigned int foll_flags,
> - unsigned int *page_mask);
> + struct follow_page_context *ctx);
>
> static inline struct page *follow_page(struct vm_area_struct *vma,
> unsigned long address, unsigned int foll_flags)
> {
> - unsigned int unused_page_mask;
> - return follow_page_mask(vma, address, foll_flags, &unused_page_mask);
> + struct page *page;
> + struct follow_page_context ctx = {
> + .pgmap = NULL,
> + .page_mask = 0,
> + };
> +
> + page = follow_page_mask(vma, address, foll_flags, &ctx);
> + if (ctx.pgmap)
> + put_dev_pagemap(ctx.pgmap);
> + return page;
> }

Do we still want to keep the function as inline? I don't think so.
Let's move it into mm/gup.c and make struct follow_page_context private to
the file.

>
> #define FOLL_WRITE 0x01 /* check pte is writable */
> diff --git a/mm/gup.c b/mm/gup.c
> index 1abc8b4afff6..124e7293e381 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -71,10 +71,10 @@ static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> }
>
> static struct page *follow_page_pte(struct vm_area_struct *vma,
> - unsigned long address, pmd_t *pmd, unsigned int flags)
> + unsigned long address, pmd_t *pmd, unsigned int flags,
> + struct dev_pagemap **pgmap)
> {
> struct mm_struct *mm = vma->vm_mm;
> - struct dev_pagemap *pgmap = NULL;
> struct page *page;
> spinlock_t *ptl;
> pte_t *ptep, pte;
> @@ -116,8 +116,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> * Only return device mapping pages in the FOLL_GET case since
> * they are only valid while holding the pgmap reference.
> */
> - pgmap = get_dev_pagemap(pte_pfn(pte), NULL);
> - if (pgmap)
> + *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
> + if (*pgmap)
> page = pte_page(pte);
> else
> goto no_page;

Hm. Shouldn't the get_dev_pagemap() call be under if (!*pgmap)?

... ah, never mind. I got confused by the get_dev_pagemap() interface.

> static bool vma_permits_fault(struct vm_area_struct *vma,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 533f9b00147d..9839bf91b057 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -851,13 +851,23 @@ static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
> update_mmu_cache_pmd(vma, addr, pmd);
> }
>
> +static struct page *pagemap_page(unsigned long pfn, struct dev_pagemap **pgmap)

The function name doesn't reflect the fact that it takes a pin on the page.
Maybe pagemap_get_page()?

> +{
> + struct page *page;
> +
> + *pgmap = get_dev_pagemap(pfn, *pgmap);
> + if (!*pgmap)
> + return ERR_PTR(-EFAULT);
> + page = pfn_to_page(pfn);
> + get_page(page);
> + return page;
> +}
> +

--
Kirill A. Shutemov

2018-10-02 16:42:53

by Dave Hansen

Subject: Re: [PATCHv3 6/6] mm/gup: Cache dev_pagemap while pinning pages

On 10/02/2018 04:26 AM, Kirill A. Shutemov wrote:
>> + page = follow_page_mask(vma, address, foll_flags, &ctx);
>> + if (ctx.pgmap)
>> + put_dev_pagemap(ctx.pgmap);
>> + return page;
>> }
> Do we still want to keep the function as inline? I don't think so.
> Let's move it into mm/gup.c and make struct follow_page_context private to
> the file.

Yeah, and let's have a put_follow_page_context() that does the
put_dev_pagemap() rather than spreading that if() to each call site.
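
Editorial note: a helper along the lines suggested here might look like the sketch below. It is a userspace mock under stated assumptions, not the code that was eventually merged: struct demo_pagemap and demo_put_pagemap() stand in for struct dev_pagemap and put_dev_pagemap(), and demo_put_follow_ctx() is a hypothetical name for the proposed put_follow_page_context().

```c
#include <stddef.h>

/* Hypothetical stand-in for struct dev_pagemap: just a reference count. */
struct demo_pagemap {
	int refs;
};

/* Mirrors the two fields of struct follow_page_context from the patch. */
struct demo_follow_ctx {
	struct demo_pagemap *pgmap;
	unsigned int page_mask;
};

static void demo_put_pagemap(struct demo_pagemap *pgmap)
{
	pgmap->refs--;
}

/*
 * Drop the cached map reference, if any. Clearing the pointer makes the
 * helper safe to call more than once on the same context.
 */
static void demo_put_follow_ctx(struct demo_follow_ctx *ctx)
{
	if (ctx->pgmap) {
		demo_put_pagemap(ctx->pgmap);
		ctx->pgmap = NULL;
	}
}
```

Each call site then ends with a single unconditional call to the helper, instead of repeating the 'if (ctx.pgmap)' check seen in follow_page() and __get_user_pages().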


2018-10-02 16:47:10

by Keith Busch

Subject: Re: [PATCHv3 6/6] mm/gup: Cache dev_pagemap while pinning pages

On Tue, Oct 02, 2018 at 08:49:39AM -0700, Dave Hansen wrote:
> On 10/02/2018 04:26 AM, Kirill A. Shutemov wrote:
> >> + page = follow_page_mask(vma, address, foll_flags, &ctx);
> >> + if (ctx.pgmap)
> >> + put_dev_pagemap(ctx.pgmap);
> >> + return page;
> >> }
> > Do we still want to keep the function as inline? I don't think so.
> > Let's move it into mm/gup.c and make struct follow_page_context private to
> > the file.
>
> Yeah, and let's have a put_follow_page_context() that does the
> put_dev_pagemap() rather than spreading that if() to each call site.

Thanks for all the feedback. I will make a new version, but with the
gup_benchmark part split into an independent set since it is logically
separate from the final patch.