2018-11-10 08:54:19

by John Hubbard

Subject: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

From: John Hubbard <[email protected]>

Hi, here is fodder for conversation during LPC.

Changes since v1:

a) Uses a simpler set/clear/test bit approach in the page flags.

b) Added some initial performance results in the cover letter here, below.

c) Rebased to latest linux.git

d) Puts pages back on the LRU when it's done with them.

Here is an updated proposal for tracking pages that have had
get_user_pages*() called on them. This is in support of fixing the problem
discussed in [1]. This RFC only shows how to set up a reliable
PageDmaPinned flag. What to *do* with that flag is left for a later
discussion.

The sequence would be:

-- apply patches 1-2,

-- convert the rest of the subsystems to call put_user_page*(). Patch #3
is an example of that. It converts infiniband.

-- apply patches 4-6,

-- apply more patches, to actually use the new PageDmaPinned flag.

One question up front was, "how do we ensure that either put_user_page()
or put_page() is called, depending on whether the page came from
get_user_pages() or not?" From this series, you can see that:

-- It's possible to assert within put_page() that we are probably in the
right place. In practice, this assertion definitely helps.

-- In the other direction, if put_user_page() is called when put_page()
should have been used instead, then a "clear" report of LRU list
corruption shows up reliably, because the first thing put_user_page()
attempts is to decrement the lru's list.prev pointer, so you'll
see pointer values that are one less than an aligned pointer value.
This is not great, but it's usable. So I think that the conversion
will turn into an exercise in getting code coverage, and that
should work out.
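
As a rough sketch of the calling convention that this series is driving
toward (hypothetical driver-style code, not taken from the series; NPAGES
and user_addr are made-up names):

	struct page *pages[NPAGES];
	int i, got;

	got = get_user_pages_fast(user_addr, NPAGES, 1, pages);

	/* ... set up and perform DMA to/from the pinned pages ... */

	for (i = 0; i < got; i++)
		put_user_page(pages[i]);	/* instead of put_page() */

or, equivalently, put_user_pages(pages, got) to release them all at once.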

I have lots of other patches, not shown here, in various stages of polish, to
convert enough of the kernel in order to run fio [2]. I did that, and got some
rather noisy performance results.

Here they are again:

Performance notes:

1. These fio results are noisy. The standard deviation is large enough
that some of this could be noise. To highlight that, I did 5 runs
each, with and without the patch, and while there is definitely a
performance drop on average, there is also overlap in the
results. In other words, the best "with patch" run is about the same as
the worst "without patch" run.

2. Initial profiling shows that we're adding about 1% total to this
particular test run... I think. I may have to narrow this down some more,
but I don't seem to see any real lock contention. Hints or ideas on
measurement methods are welcome, btw.

-- 0.59% in put_user_page
-- 0.2% (or 0.54%, depending on how you read the perf out below) in
get_user_pages_fast:


1.36%--iov_iter_get_pages
|
--1.27%--get_user_pages_fast
|
--0.54%--pin_page_for_dma

0.59%--put_user_page

1.34% 0.03% fio [kernel.vmlinux] [k] _raw_spin_lock
0.95% 0.55% fio [kernel.vmlinux] [k] do_raw_spin_lock
0.17% 0.03% fio [kernel.vmlinux] [k] isolate_lru_page
0.06% 0.00% fio [kernel.vmlinux] [k] putback_lru_page

3. Here's some raw fio data: one run without the patch, and two with the patch:

------------------------------------------------------
WITHOUT the patch:
------------------------------------------------------
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov 6 20:18:06 2018
read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)
slat (usec): min=25, max=4870, avg=68.19, stdev=85.21
clat (usec): min=74, max=19814, avg=4525.40, stdev=1844.03
lat (usec): min=183, max=19927, avg=4593.69, stdev=1866.65
clat percentiles (usec):
| 1.00th=[ 3687], 5.00th=[ 3720], 10.00th=[ 3720], 20.00th=[ 3752],
| 30.00th=[ 3752], 40.00th=[ 3752], 50.00th=[ 3752], 60.00th=[ 3785],
| 70.00th=[ 4178], 80.00th=[ 4490], 90.00th=[ 6652], 95.00th=[ 8225],
| 99.00th=[13173], 99.50th=[14353], 99.90th=[16581], 99.95th=[17171],
| 99.99th=[18220]
bw ( KiB/s): min=49920, max=59320, per=100.00%, avg=55742.24, stdev=2224.20, samples=37
iops : min=12480, max=14830, avg=13935.35, stdev=556.05, samples=37
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=68.78%, 10=28.14%, 20=3.08%
cpu : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=54.4MiB/s (57.0MB/s), 54.4MiB/s-54.4MiB/s (57.0MB/s-57.0MB/s), io=1024MiB (1074MB), run=18826-18826msec

Disk stats (read/write):
nvme0n1: ios=259490/1, merge=0/0, ticks=14822/0, in_queue=19241, util=100.00%

------------------------------------------------------
With the patch applied:
------------------------------------------------------
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=51.2MiB/s,w=0KiB/s][r=13.1k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=2568: Tue Nov 6 20:03:50 2018
read: IOPS=12.8k, BW=50.1MiB/s (52.5MB/s)(1024MiB/20453msec)
slat (usec): min=33, max=4365, avg=74.05, stdev=85.79
clat (usec): min=39, max=19818, avg=4916.61, stdev=1961.79
lat (usec): min=100, max=20002, avg=4990.78, stdev=1985.23
clat percentiles (usec):
| 1.00th=[ 4047], 5.00th=[ 4080], 10.00th=[ 4080], 20.00th=[ 4080],
| 30.00th=[ 4113], 40.00th=[ 4113], 50.00th=[ 4113], 60.00th=[ 4146],
| 70.00th=[ 4178], 80.00th=[ 4817], 90.00th=[ 7308], 95.00th=[ 8717],
| 99.00th=[14091], 99.50th=[15270], 99.90th=[17433], 99.95th=[18220],
| 99.99th=[19006]
bw ( KiB/s): min=45370, max=55784, per=100.00%, avg=51332.33, stdev=1843.77, samples=40
iops : min=11342, max=13946, avg=12832.83, stdev=460.92, samples=40
lat (usec) : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=96.44%, 20=3.53%
cpu : usr=2.91%, sys=95.18%, ctx=398, majf=0, minf=73
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=50.1MiB/s (52.5MB/s), 50.1MiB/s-50.1MiB/s (52.5MB/s-52.5MB/s), io=1024MiB (1074MB), run=20453-20453msec

Disk stats (read/write):
nvme0n1: ios=261399/0, merge=0/0, ticks=16019/0, in_queue=20910, util=100.00%

------------------------------------------------------
OR, here's a better run WITH the patch applied, and you can see that this is nearly as good
as the "without" case:
------------------------------------------------------

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov 6 20:01:33 2018
read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)
slat (usec): min=30, max=12458, avg=69.71, stdev=88.01
clat (usec): min=39, max=25590, avg=4687.42, stdev=1925.29
lat (usec): min=97, max=25704, avg=4757.25, stdev=1946.06
clat percentiles (usec):
| 1.00th=[ 3884], 5.00th=[ 3884], 10.00th=[ 3916], 20.00th=[ 3916],
| 30.00th=[ 3916], 40.00th=[ 3916], 50.00th=[ 3949], 60.00th=[ 3949],
| 70.00th=[ 3982], 80.00th=[ 4555], 90.00th=[ 6915], 95.00th=[ 8848],
| 99.00th=[13566], 99.50th=[14877], 99.90th=[16909], 99.95th=[17695],
| 99.99th=[24249]
bw ( KiB/s): min=48905, max=58016, per=100.00%, avg=53855.79, stdev=2115.03, samples=38
iops : min=12226, max=14504, avg=13463.79, stdev=528.76, samples=38
lat (usec) : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=71.80%, 10=24.66%, 20=3.51%, 50=0.02%
cpu : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=52.5MiB/s (55.1MB/s), 52.5MiB/s-52.5MiB/s (55.1MB/s-55.1MB/s), io=1024MiB (1074MB), run=19499-19499msec

Disk stats (read/write):
nvme0n1: ios=260720/0, merge=0/0, ticks=15036/0, in_queue=19876, util=100.00%


[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] fio: https://github.com/axboe/fio

John Hubbard (6):
mm/gup: finish consolidating error handling
mm: introduce put_user_page*(), placeholder versions
infiniband/mm: convert put_page() to put_user_page*()
mm: introduce page->dma_pinned_flags, _count
mm: introduce zone_gup_lock, for dma-pinned pages
mm: track gup pages with page->dma_pinned_* fields

drivers/infiniband/core/umem.c | 7 +-
drivers/infiniband/core/umem_odp.c | 2 +-
drivers/infiniband/hw/hfi1/user_pages.c | 11 +-
drivers/infiniband/hw/mthca/mthca_memfree.c | 6 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 11 +-
drivers/infiniband/hw/qib/qib_user_sdma.c | 6 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 7 +-
include/linux/mm.h | 13 ++
include/linux/mm_types.h | 22 +++-
include/linux/mmzone.h | 6 +
include/linux/page-flags.h | 61 +++++++++
mm/gup.c | 58 +++++++-
mm/memcontrol.c | 8 ++
mm/page_alloc.c | 1 +
mm/swap.c | 138 ++++++++++++++++++++
15 files changed, 320 insertions(+), 37 deletions(-)

--
2.19.1



2018-11-10 08:51:53

by John Hubbard

Subject: [PATCH v2 6/6] mm: track gup pages with page->dma_pinned_* fields

From: John Hubbard <[email protected]>

This patch sets and restores the new page->dma_pinned_flags and
page->dma_pinned_count fields, but does not actually use them for
anything yet.

In order to use these fields at all, the page must be removed from
any LRU list that it's on. The patch also adds some precautions that
prevent the page from getting moved back onto an LRU, once it is
in this state.

This is in preparation to fix some problems that came up when using
devices (NICs, GPUs, for example) that set up direct access to a chunk
of system (CPU) memory, so that they can DMA to/from that memory.

Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
include/linux/mm.h | 19 +++++----------
mm/gup.c | 55 +++++++++++++++++++++++++++++++++++++++++--
mm/memcontrol.c | 8 +++++++
mm/swap.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 125 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 09fbb2c81aba..6c64b1e0b777 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -950,6 +950,10 @@ static inline void put_page(struct page *page)
{
page = compound_head(page);

+ VM_BUG_ON_PAGE(PageDmaPinned(page) &&
+ page_ref_count(page) <
+ atomic_read(&page->dma_pinned_count),
+ page);
/*
* For devmap managed pages we need to catch refcount transition from
* 2 to 1, when refcount reach one it means the page is free and we
@@ -964,21 +968,10 @@ static inline void put_page(struct page *page)
}

/*
- * put_user_page() - release a page that had previously been acquired via
- * a call to one of the get_user_pages*() functions.
- *
* Pages that were pinned via get_user_pages*() must be released via
- * either put_user_page(), or one of the put_user_pages*() routines
- * below. This is so that eventually, pages that are pinned via
- * get_user_pages*() can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special
- * handling.
+ * one of these put_user_pages*() routines:
*/
-static inline void put_user_page(struct page *page)
-{
- put_page(page);
-}
-
+void put_user_page(struct page *page);
void put_user_pages_dirty(struct page **pages, unsigned long npages);
void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
void put_user_pages(struct page **pages, unsigned long npages);
diff --git a/mm/gup.c b/mm/gup.c
index 55a41dee0340..ec1b26591532 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -25,6 +25,50 @@ struct follow_page_context {
unsigned int page_mask;
};

+static void pin_page_for_dma(struct page *page)
+{
+ int ret = 0;
+ struct zone *zone;
+
+ page = compound_head(page);
+ zone = page_zone(page);
+
+ spin_lock(zone_gup_lock(zone));
+
+ if (PageDmaPinned(page)) {
+ /* Page was not on an LRU list, because it was DMA-pinned. */
+ VM_BUG_ON_PAGE(PageLRU(page), page);
+
+ atomic_inc(&page->dma_pinned_count);
+ goto unlock_out;
+ }
+
+ /*
+ * Note that page->dma_pinned_flags is unioned with page->lru.
+ * The rules are: reading PageDmaPinned(page) is allowed even if
+ * PageLRU(page) is true. That works because of pointer alignment:
+ * the PageDmaPinned bit is less than the pointer alignment, so
+ * either the page is on an LRU, or (maybe) the PageDmaPinned
+ * bit is set.
+ *
+ * However, SetPageDmaPinned requires that the page is both locked,
+ * and also, removed from the LRU.
+ *
+ * The other flag, PageDmaPinnedWasLru, is not used for
+ * synchronization, and so is only read or written after we are
+ * certain that the full page->dma_pinned_flags field is available.
+ */
+ ret = isolate_lru_page(page);
+ if (ret == 0)
+ SetPageDmaPinnedWasLru(page);
+
+ atomic_set(&page->dma_pinned_count, 1);
+ SetPageDmaPinned(page);
+
+unlock_out:
+ spin_unlock(zone_gup_lock(zone));
+}
+
static struct page *no_page_table(struct vm_area_struct *vma,
unsigned int flags)
{
@@ -670,7 +714,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas, int *nonblocking)
{
- long ret = 0, i = 0;
+ long ret = 0, i = 0, j;
struct vm_area_struct *vma = NULL;
struct follow_page_context ctx = { NULL };

@@ -774,6 +818,10 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
nr_pages -= page_increm;
} while (nr_pages);
out:
+ if (pages)
+ for (j = 0; j < i; j++)
+ pin_page_for_dma(pages[j]);
+
if (ctx.pgmap)
put_dev_pagemap(ctx.pgmap);
return i ? i : ret;
@@ -1852,7 +1900,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages)
{
unsigned long addr, len, end;
- int nr = 0, ret = 0;
+ int nr = 0, ret = 0, i;

start &= PAGE_MASK;
addr = start;
@@ -1873,6 +1921,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
ret = nr;
}

+ for (i = 0; i < nr; i++)
+ pin_page_for_dma(pages[i]);
+
if (nr < nr_pages) {
/* Try to get the remaining pages with get_user_pages */
start += nr << PAGE_SHIFT;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6e1469b80cb7..fbe61d13036f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2335,6 +2335,12 @@ static void lock_page_lru(struct page *page, int *isolated)
if (PageLRU(page)) {
struct lruvec *lruvec;

+ /*
+ * LRU and PageDmaPinned are mutually exclusive: they use the
+ * same fields in struct page, but for different purposes.
+ */
+ VM_BUG_ON_PAGE(PageDmaPinned(page), page);
+
lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_lru(page));
@@ -2352,6 +2358,8 @@ static void unlock_page_lru(struct page *page, int isolated)

lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
VM_BUG_ON_PAGE(PageLRU(page), page);
+ VM_BUG_ON_PAGE(PageDmaPinned(page), page);
+
SetPageLRU(page);
add_page_to_lru_list(page, lruvec, page_lru(page));
}
diff --git a/mm/swap.c b/mm/swap.c
index bb8c32595e5f..79f874ce78c3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -151,6 +151,51 @@ static void __put_user_pages_dirty(struct page **pages,
}
}

+/*
+ * put_user_page() - release a page that had previously been acquired via
+ * a call to one of the get_user_pages*() functions.
+ *
+ * Usage: Pages that were pinned via get_user_pages*() must be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below. This is so that eventually, pages that are pinned via
+ * get_user_pages*() can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special
+ * handling.
+ */
+void put_user_page(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+
+ page = compound_head(page);
+
+ if (atomic_dec_and_test(&page->dma_pinned_count)) {
+ spin_lock(zone_gup_lock(zone));
+ /* Re-check while holding the lock, because
+ * pin_page_for_dma() or get_page() may have snuck in right
+ * after the atomic_dec_and_test, and raised the count
+ * above zero again. If so, just leave the flag set. And
+ * because the atomic_dec_and_test above already got the
+ * accounting correct, no other action is required.
+ */
+ VM_BUG_ON_PAGE(PageLRU(page), page);
+ VM_BUG_ON_PAGE(!PageDmaPinned(page), page);
+
+ if (atomic_read(&page->dma_pinned_count) == 0) {
+ ClearPageDmaPinned(page);
+
+ if (PageDmaPinnedWasLru(page)) {
+ ClearPageDmaPinnedWasLru(page);
+ putback_lru_page(page);
+ }
+ }
+
+ spin_unlock(zone_gup_lock(zone));
+ }
+
+ put_page(page);
+}
+EXPORT_SYMBOL(put_user_page);
+
/*
* put_user_pages_dirty() - for each page in the @pages array, make
* that page (or its head page, if a compound page) dirty, if it was
@@ -903,6 +948,12 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
VM_BUG_ON_PAGE(!PageHead(page), page);
VM_BUG_ON_PAGE(PageCompound(page_tail), page);
VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+
+ /*
+ * LRU and PageDmaPinned are mutually exclusive: they use the
+ * same fields in struct page, but for different purposes.
+ */
+ VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
VM_BUG_ON(NR_CPUS != 1 &&
!spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));

@@ -942,6 +993,13 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,

VM_BUG_ON_PAGE(PageLRU(page), page);

+ /*
+ * LRU and PageDmaPinned are mutually exclusive: they use the
+ * same fields in struct page, but for different purposes.
+ */
+ if (PageDmaPinned(page))
+ return;
+
SetPageLRU(page);
/*
* Page becomes evictable in two ways:
--
2.19.1


2018-11-10 08:51:53

by John Hubbard

Subject: [PATCH v2 5/6] mm: introduce zone_gup_lock, for dma-pinned pages

From: John Hubbard <[email protected]>

The page->dma_pinned_flags and _count fields require
lock protection. A lock at approximately the granularity
of the zone_lru_lock is called for, but adding to the
locking contention of zone_lru_lock is undesirable,
because that is a pre-existing hot spot. Fortunately,
these new dma_pinned_* fields can use an independent
lock, so this patch creates an entirely new lock, right
next to the zone_lru_lock.

Why "zone_gup_lock"?

Most of the naming refers to "DMA-pinned pages", but
"zone DMA lock" has other meanings already, so this is
called zone_gup_lock instead. The "dma pinning" is a result
of get_user_pages (gup) being called, so the name still
helps explain its use.

Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
include/linux/mmzone.h | 6 ++++++
mm/page_alloc.c | 1 +
2 files changed, 7 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 847705a6d0ec..125a6f34f6ba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -660,6 +660,7 @@ typedef struct pglist_data {
enum zone_type kswapd_classzone_idx;

int kswapd_failures; /* Number of 'reclaimed == 0' runs */
+ spinlock_t pinned_dma_lock;

#ifdef CONFIG_COMPACTION
int kcompactd_max_order;
@@ -729,6 +730,11 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
return &zone->zone_pgdat->lru_lock;
}

+static inline spinlock_t *zone_gup_lock(struct zone *zone)
+{
+ return &zone->zone_pgdat->pinned_dma_lock;
+}
+
static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
{
return &pgdat->lruvec;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a919ba5cb3c8..7cc0d9bdba17 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6305,6 +6305,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)

pgdat_page_ext_init(pgdat);
spin_lock_init(&pgdat->lru_lock);
+ spin_lock_init(&pgdat->pinned_dma_lock);
lruvec_init(node_lruvec(pgdat));
}

--
2.19.1


2018-11-10 08:52:03

by John Hubbard

Subject: [PATCH v2 4/6] mm: introduce page->dma_pinned_flags, _count

From: John Hubbard <[email protected]>

Add two struct page fields that, combined, are unioned with
struct page->lru. There is no change in the size of
struct page. These new fields are for type safety and clarity.

Also add page flag accessors to test, set and clear the new
page->dma_pinned_flags field.

The page->dma_pinned_count field will be used in upcoming
patches.

Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
include/linux/mm_types.h | 22 ++++++++++----
include/linux/page-flags.h | 61 ++++++++++++++++++++++++++++++++++++++
2 files changed, 77 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..017ab82e36ca 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -78,12 +78,22 @@ struct page {
*/
union {
struct { /* Page cache and anonymous pages */
- /**
- * @lru: Pageout list, eg. active_list protected by
- * zone_lru_lock. Sometimes used as a generic list
- * by the page owner.
- */
- struct list_head lru;
+ union {
+ /**
+ * @lru: Pageout list, eg. active_list protected
+ * by zone_lru_lock. Sometimes used as a
+ * generic list by the page owner.
+ */
+ struct list_head lru;
+ /* Used by get_user_pages*(). Pages may not be
+ * on an LRU while these dma_pinned_* fields
+ * are in use.
+ */
+ struct {
+ unsigned long dma_pinned_flags;
+ atomic_t dma_pinned_count;
+ };
+ };
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
pgoff_t index; /* Our offset within mapping. */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 50ce1bddaf56..3190b6b6a82f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -437,6 +437,67 @@ static __always_inline int __PageMovable(struct page *page)
PAGE_MAPPING_MOVABLE;
}

+/*
+ * Because page->dma_pinned_flags is unioned with page->lru, any page that
+ * uses these flags must NOT be on an LRU. That's partly enforced by
+ * ClearPageDmaPinned, which gives the page back to LRU.
+ *
+ * PageDmaPinned is checked without knowing whether it is a tail page or a
+ * PageDmaPinned page. For that reason, PageDmaPinned avoids PageTail (the 0th
+ * bit in the first union of struct page), and instead uses bit 1 (0x2),
+ * rather than bit 0.
+ *
+ * PageDmaPinned can only be used if no other systems are using the same bit
+ * across the first struct page union. In this regard, it is similar to
+ * PageTail, and in fact, because of PageTail's constraint that bit 0 be left
+ * alone, bit 1 is also left alone so far: other union elements (ignoring tail
+ * pages) put pointers there, and pointer alignment leaves the lower two bits
+ * available.
+ *
+ * So, constraints include:
+ *
+ * -- Only use PageDmaPinned on non-tail pages.
+ * -- Remove the page from any LRU list first.
+ */
+#define PAGE_DMA_PINNED 0x2UL
+#define PAGE_DMA_PINNED_WAS_LRU 0x4UL
+
+static __always_inline int PageDmaPinned(struct page *page)
+{
+ VM_BUG_ON(page != compound_head(page));
+ return test_bit(PAGE_DMA_PINNED, &page->dma_pinned_flags);
+}
+
+static __always_inline void SetPageDmaPinned(struct page *page)
+{
+ VM_BUG_ON(page != compound_head(page));
+ set_bit(PAGE_DMA_PINNED, &page->dma_pinned_flags);
+}
+
+static __always_inline void ClearPageDmaPinned(struct page *page)
+{
+ VM_BUG_ON(page != compound_head(page));
+ clear_bit(PAGE_DMA_PINNED, &page->dma_pinned_flags);
+}
+
+static __always_inline int PageDmaPinnedWasLru(struct page *page)
+{
+ VM_BUG_ON(page != compound_head(page));
+ return test_bit(PAGE_DMA_PINNED_WAS_LRU, &page->dma_pinned_flags);
+}
+
+static __always_inline void SetPageDmaPinnedWasLru(struct page *page)
+{
+ VM_BUG_ON(page != compound_head(page));
+ set_bit(PAGE_DMA_PINNED_WAS_LRU, &page->dma_pinned_flags);
+}
+
+static __always_inline void ClearPageDmaPinnedWasLru(struct page *page)
+{
+ VM_BUG_ON(page != compound_head(page));
+ clear_bit(PAGE_DMA_PINNED_WAS_LRU, &page->dma_pinned_flags);
+}
+
#ifdef CONFIG_KSM
/*
* A KSM page is one of those write-protected "shared pages" or "merged pages"
--
2.19.1


2018-11-10 08:52:09

by John Hubbard

Subject: [PATCH v2 3/6] infiniband/mm: convert put_page() to put_user_page*()

From: John Hubbard <[email protected]>

For infiniband code that retains pages via get_user_pages*(),
release those pages via the new put_user_page(), or
put_user_pages*(), instead of put_page().

This is a tiny part of the second step of fixing the problem described
in [1]. The steps are:

1) Provide put_user_page*() routines, intended to be used
for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
invoke put_user_page*(), instead of put_page(). This involves dozens of
call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem. Again, [1] provides details as to why that is
desirable.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Cc: Doug Ledford <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Mike Marciniszyn <[email protected]>
Cc: Dennis Dalessandro <[email protected]>
Cc: Christian Benvenuti <[email protected]>

Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Dennis Dalessandro <[email protected]>
Acked-by: Jason Gunthorpe <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
drivers/infiniband/core/umem.c | 7 ++++---
drivers/infiniband/core/umem_odp.c | 2 +-
drivers/infiniband/hw/hfi1/user_pages.c | 11 ++++-------
drivers/infiniband/hw/mthca/mthca_memfree.c | 6 +++---
drivers/infiniband/hw/qib/qib_user_pages.c | 11 ++++-------
drivers/infiniband/hw/qib/qib_user_sdma.c | 6 +++---
drivers/infiniband/hw/usnic/usnic_uiom.c | 7 ++++---
7 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index c6144df47ea4..c2898bc7b3b2 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -58,9 +58,10 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {

page = sg_page(sg);
- if (!PageDirty(page) && umem->writable && dirty)
- set_page_dirty_lock(page);
- put_page(page);
+ if (umem->writable && dirty)
+ put_user_pages_dirty_lock(&page, 1);
+ else
+ put_user_page(page);
}

sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 2b4c5e7dd5a1..8b5116b52d2a 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -667,7 +667,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
ret = -EFAULT;
break;
}
- put_page(local_page_list[j]);
+ put_user_page(local_page_list[j]);
continue;
}

diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index e341e6dcc388..99ccc0483711 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -121,13 +121,10 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
size_t npages, bool dirty)
{
- size_t i;
-
- for (i = 0; i < npages; i++) {
- if (dirty)
- set_page_dirty_lock(p[i]);
- put_page(p[i]);
- }
+ if (dirty)
+ put_user_pages_dirty_lock(p, npages);
+ else
+ put_user_pages(p, npages);

if (mm) { /* during close after signal, mm can be NULL */
down_write(&mm->mmap_sem);
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c
index cc9c0c8ccba3..b8b12effd009 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -481,7 +481,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,

ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
if (ret < 0) {
- put_page(pages[0]);
+ put_user_page(pages[0]);
goto out;
}

@@ -489,7 +489,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
mthca_uarc_virt(dev, uar, i));
if (ret) {
pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
- put_page(sg_page(&db_tab->page[i].mem));
+ put_user_page(sg_page(&db_tab->page[i].mem));
goto out;
}

@@ -555,7 +555,7 @@ void mthca_cleanup_user_db_tab(struct mthca_dev *dev, struct mthca_uar *uar,
if (db_tab->page[i].uvirt) {
mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, uar, i), 1);
pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
- put_page(sg_page(&db_tab->page[i].mem));
+ put_user_page(sg_page(&db_tab->page[i].mem));
}
}

diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 16543d5e80c3..1a5c64c8695f 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -40,13 +40,10 @@
static void __qib_release_user_pages(struct page **p, size_t num_pages,
int dirty)
{
- size_t i;
-
- for (i = 0; i < num_pages; i++) {
- if (dirty)
- set_page_dirty_lock(p[i]);
- put_page(p[i]);
- }
+ if (dirty)
+ put_user_pages_dirty_lock(p, num_pages);
+ else
+ put_user_pages(p, num_pages);
}

/*
diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c b/drivers/infiniband/hw/qib/qib_user_sdma.c
index 926f3c8eba69..4a4b802b011f 100644
--- a/drivers/infiniband/hw/qib/qib_user_sdma.c
+++ b/drivers/infiniband/hw/qib/qib_user_sdma.c
@@ -321,7 +321,7 @@ static int qib_user_sdma_page_to_frags(const struct qib_devdata *dd,
* the caller can ignore this page.
*/
if (put) {
- put_page(page);
+ put_user_page(page);
} else {
/* coalesce case */
kunmap(page);
@@ -635,7 +635,7 @@ static void qib_user_sdma_free_pkt_frag(struct device *dev,
kunmap(pkt->addr[i].page);

if (pkt->addr[i].put_page)
- put_page(pkt->addr[i].page);
+ put_user_page(pkt->addr[i].page);
else
__free_page(pkt->addr[i].page);
} else if (pkt->addr[i].kvaddr) {
@@ -710,7 +710,7 @@ static int qib_user_sdma_pin_pages(const struct qib_devdata *dd,
/* if error, return all pages not managed by pkt */
free_pages:
while (i < j)
- put_page(pages[i++]);
+ put_user_page(pages[i++]);

done:
return ret;
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 49275a548751..2ef8d31dc838 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -77,9 +77,10 @@ static void usnic_uiom_put_pages(struct list_head *chunk_list, int dirty)
for_each_sg(chunk->page_list, sg, chunk->nents, i) {
page = sg_page(sg);
pa = sg_phys(sg);
- if (!PageDirty(page) && dirty)
- set_page_dirty_lock(page);
- put_page(page);
+ if (dirty)
+ put_user_pages_dirty_lock(&page, 1);
+ else
+ put_user_page(page);
usnic_dbg("pa: %pa\n", &pa);
}
kfree(chunk);
--
2.19.1


2018-11-10 08:52:20

by John Hubbard

Subject: [PATCH v2 2/6] mm: introduce put_user_page*(), placeholder versions

From: John Hubbard <[email protected]>

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations,
as a replacement for release_pages(), and also as a replacement
for open-coded loops that release multiple pages.
These may be used for subsequent performance improvements,
via batching of pages to be released.

This is the first step of fixing the problem described in [1]. The steps
are:

1) (This patch): provide put_user_page*() routines, intended to be used
for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
invoke put_user_page*(), instead of put_page(). This involves dozens of
call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem. Again, [1] provides details as to why that is
desirable.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Ralph Campbell <[email protected]>

Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
include/linux/mm.h | 20 ++++++++++++
mm/swap.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..09fbb2c81aba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -963,6 +963,26 @@ static inline void put_page(struct page *page)
__put_page(page);
}

+/*
+ * put_user_page() - release a page that had previously been acquired via
+ * a call to one of the get_user_pages*() functions.
+ *
+ * Pages that were pinned via get_user_pages*() must be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below. This is so that eventually, pages that are pinned via
+ * get_user_pages*() can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special
+ * handling.
+ */
+static inline void put_user_page(struct page *page)
+{
+ put_page(page);
+}
+
+void put_user_pages_dirty(struct page **pages, unsigned long npages);
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
+void put_user_pages(struct page **pages, unsigned long npages);
+
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define SECTION_IN_PAGE_FLAGS
#endif
diff --git a/mm/swap.c b/mm/swap.c
index aa483719922e..bb8c32595e5f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -133,6 +133,86 @@ void put_pages_list(struct list_head *pages)
}
EXPORT_SYMBOL(put_pages_list);

+typedef int (*set_dirty_func)(struct page *page);
+
+static void __put_user_pages_dirty(struct page **pages,
+ unsigned long npages,
+ set_dirty_func sdf)
+{
+ unsigned long index;
+
+ for (index = 0; index < npages; index++) {
+ struct page *page = compound_head(pages[index]);
+
+ if (!PageDirty(page))
+ sdf(page);
+
+ put_user_page(page);
+ }
+}
+
+/*
+ * put_user_pages_dirty() - for each page in the @pages array, make
+ * that page (or its head page, if a compound page) dirty, if it was
+ * previously listed as clean. Then, release the page using
+ * put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * set_page_dirty(), which does not lock the page, is used here.
+ * Therefore, it is the caller's responsibility to ensure that this is
+ * safe. If not, then put_user_pages_dirty_lock() should be called instead.
+ *
+ * @pages: array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ */
+void put_user_pages_dirty(struct page **pages, unsigned long npages)
+{
+ __put_user_pages_dirty(pages, npages, set_page_dirty);
+}
+EXPORT_SYMBOL(put_user_pages_dirty);
+
+/*
+ * put_user_pages_dirty_lock() - for each page in the @pages array, make
+ * that page (or its head page, if a compound page) dirty, if it was
+ * previously listed as clean. Then, release the page using
+ * put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * This is just like put_user_pages_dirty(), except that it invokes
+ * set_page_dirty_lock(), instead of set_page_dirty().
+ *
+ * @pages: array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ */
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
+{
+ __put_user_pages_dirty(pages, npages, set_page_dirty_lock);
+}
+EXPORT_SYMBOL(put_user_pages_dirty_lock);
+
+/*
+ * put_user_pages() - for each page in the @pages array, release the page
+ * using put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * @pages: array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ */
+void put_user_pages(struct page **pages, unsigned long npages)
+{
+ unsigned long index;
+
+ for (index = 0; index < npages; index++)
+ put_user_page(pages[index]);
+}
+EXPORT_SYMBOL(put_user_pages);
+
/*
* get_kernel_pages() - pin kernel pages in memory
* @kiov: An array of struct kvec structures
--
2.19.1


2018-11-10 08:54:12

by John Hubbard

Subject: [PATCH v2 1/6] mm/gup: finish consolidating error handling

From: John Hubbard <[email protected]>

An upcoming patch wants to be able to operate on each page that
get_user_pages has retrieved. In order to do that, it's best to
have a common exit point from the routine. Most of this has been
taken care of by commit df06b37ffe5a4 ("mm/gup: cache dev_pagemap while
pinning pages"), but there was one case remaining.

Also, there was still an unnecessary shadow declaration (with a
different type) of the "ret" variable, which this commit removes.

Cc: Keith Busch <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
mm/gup.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f76e77a2d34b..55a41dee0340 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -696,12 +696,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
if (!vma || start >= vma->vm_end) {
vma = find_extend_vma(mm, start);
if (!vma && in_gate_area(mm, start)) {
- int ret;
ret = get_gate_page(mm, start & PAGE_MASK,
gup_flags, &vma,
pages ? &pages[i] : NULL);
if (ret)
- return i ? : ret;
+ goto out;
ctx.page_mask = 0;
goto next_page;
}
--
2.19.1


2018-11-11 14:11:32

by Mike Rapoport

Subject: Re: [PATCH v2 2/6] mm: introduce put_user_page*(), placeholder versions

Hi,

On Sat, Nov 10, 2018 at 12:50:37AM -0800, [email protected] wrote:
> From: John Hubbard <[email protected]>
>
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().
>
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
>
> This is the first step of fixing the problem described in [1]. The steps
> are:
>
> 1) (This patch): provide put_user_page*() routines, intended to be used
> for releasing pages that were pinned via get_user_pages*().
>
> 2) Convert all of the call sites for get_user_pages*(), to
> invoke put_user_page*(), instead of put_page(). This involves dozens of
> call sites, and will take some time.
>
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
> implement tracking of these pages. This tracking will be separate from
> the existing struct page refcounting.
>
> 4) Use the tracking and identification of these pages, to implement
> special handling (especially in writeback paths) when the pages are
> backed by a filesystem. Again, [1] provides details as to why that is
> desirable.
>
> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Christopher Lameter <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Jerome Glisse <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Ralph Campbell <[email protected]>
>
> Reviewed-by: Jan Kara <[email protected]>
> Signed-off-by: John Hubbard <[email protected]>
> ---
> include/linux/mm.h | 20 ++++++++++++
> mm/swap.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 100 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5411de93a363..09fbb2c81aba 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -963,6 +963,26 @@ static inline void put_page(struct page *page)
> __put_page(page);
> }
>
> +/*
> + * put_user_page() - release a page that had previously been acquired via
> + * a call to one of the get_user_pages*() functions.

If your intention was to make it a kernel-doc comment, it should start with

/**

and the @page parameter description should follow the brief description.
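
For illustration, the kernel-doc form would look roughly like this (the
wording is just an example, not proposed final text):

/**
 * put_user_page() - release a page acquired via a get_user_pages*() call
 * @page: the page to release
 *
 * Pages that were pinned via get_user_pages*() must be released via
 * either put_user_page(), or one of the put_user_pages*() routines.
 */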

> + *
> + * Pages that were pinned via get_user_pages*() must be released via
> + * either put_user_page(), or one of the put_user_pages*() routines
> + * below. This is so that eventually, pages that are pinned via
> + * get_user_pages*() can be separately tracked and uniquely handled. In
> + * particular, interactions with RDMA and filesystems need special
> + * handling.
> + */
> +static inline void put_user_page(struct page *page)
> +{
> + put_page(page);
> +}
> +
> +void put_user_pages_dirty(struct page **pages, unsigned long npages);
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
> +void put_user_pages(struct page **pages, unsigned long npages);
> +
> #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> #define SECTION_IN_PAGE_FLAGS
> #endif
> diff --git a/mm/swap.c b/mm/swap.c
> index aa483719922e..bb8c32595e5f 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -133,6 +133,86 @@ void put_pages_list(struct list_head *pages)
> }
> EXPORT_SYMBOL(put_pages_list);
>
> +typedef int (*set_dirty_func)(struct page *page);
> +
> +static void __put_user_pages_dirty(struct page **pages,
> + unsigned long npages,
> + set_dirty_func sdf)
> +{
> + unsigned long index;
> +
> + for (index = 0; index < npages; index++) {
> + struct page *page = compound_head(pages[index]);
> +
> + if (!PageDirty(page))
> + sdf(page);
> +
> + put_user_page(page);
> + }
> +}
> +
> +/*

/**

> + * put_user_pages_dirty() - for each page in the @pages array, make
> + * that page (or its head page, if a compound page) dirty, if it was
> + * previously listed as clean. Then, release the page using
> + * put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * set_page_dirty(), which does not lock the page, is used here.
> + * Therefore, it is the caller's responsibility to ensure that this is
> + * safe. If not, then put_user_pages_dirty_lock() should be called instead.
> + *
> + * @pages: array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *

The parameter descriptions should be after the brief function description
and before the details

> + */
> +void put_user_pages_dirty(struct page **pages, unsigned long npages)
> +{
> + __put_user_pages_dirty(pages, npages, set_page_dirty);
> +}
> +EXPORT_SYMBOL(put_user_pages_dirty);
> +
> +/*
> + * put_user_pages_dirty_lock() - for each page in the @pages array, make
> + * that page (or its head page, if a compound page) dirty, if it was
> + * previously listed as clean. Then, release the page using
> + * put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * This is just like put_user_pages_dirty(), except that it invokes
> + * set_page_dirty_lock(), instead of set_page_dirty().
> + *
> + * @pages: array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.

Same here

> + *
> + */
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
> +{
> + __put_user_pages_dirty(pages, npages, set_page_dirty_lock);
> +}
> +EXPORT_SYMBOL(put_user_pages_dirty_lock);
> +
> +/*
> + * put_user_pages() - for each page in the @pages array, release the page
> + * using put_user_page().
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * @pages: array of pages to be marked dirty and released.
> + * @npages: number of pages in the @pages array.
> + *

And here.

> + */
> +void put_user_pages(struct page **pages, unsigned long npages)
> +{
> + unsigned long index;
> +
> + for (index = 0; index < npages; index++)
> + put_user_page(pages[index]);
> +}
> +EXPORT_SYMBOL(put_user_pages);
> +
> /*
> * get_kernel_pages() - pin kernel pages in memory
> * @kiov: An array of struct kvec structures
> --
> 2.19.1
>

--
Sincerely yours,
Mike.


2018-11-12 13:58:52

by Jan Kara

Subject: Re: [PATCH v2 6/6] mm: track gup pages with page->dma_pinned_* fields


Just as a side note, can you please CC me on the whole series next time?
Because this time I had to look up e.g. the introductory email in the
mailing list... Thanks!

On Sat 10-11-18 00:50:41, [email protected] wrote:
> From: John Hubbard <[email protected]>
>
> This patch sets and restores the new page->dma_pinned_flags and
> page->dma_pinned_count fields, but does not actually use them for
> anything yet.
>
> In order to use these fields at all, the page must be removed from
> any LRU list that it's on. The patch also adds some precautions that
> prevent the page from getting moved back onto an LRU, once it is
> in this state.
>
> This is in preparation to fix some problems that came up when using
> devices (NICs, GPUs, for example) that set up direct access to a chunk
> of system (CPU) memory, so that they can DMA to/from that memory.
>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Christopher Lameter <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Jan Kara <[email protected]>
> Signed-off-by: John Hubbard <[email protected]>
> ---
> include/linux/mm.h | 19 +++++----------
> mm/gup.c | 55 +++++++++++++++++++++++++++++++++++++++++--
> mm/memcontrol.c | 8 +++++++
> mm/swap.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 125 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 09fbb2c81aba..6c64b1e0b777 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -950,6 +950,10 @@ static inline void put_page(struct page *page)
> {
> page = compound_head(page);
>
> + VM_BUG_ON_PAGE(PageDmaPinned(page) &&
> + page_ref_count(page) <
> + atomic_read(&page->dma_pinned_count),
> + page);
> /*
> * For devmap managed pages we need to catch refcount transition from
> * 2 to 1, when refcount reach one it means the page is free and we
> @@ -964,21 +968,10 @@ static inline void put_page(struct page *page)
> }
>
> /*
> - * put_user_page() - release a page that had previously been acquired via
> - * a call to one of the get_user_pages*() functions.
> - *
> * Pages that were pinned via get_user_pages*() must be released via
> - * either put_user_page(), or one of the put_user_pages*() routines
> - * below. This is so that eventually, pages that are pinned via
> - * get_user_pages*() can be separately tracked and uniquely handled. In
> - * particular, interactions with RDMA and filesystems need special
> - * handling.
> + * one of these put_user_pages*() routines:
> */
> -static inline void put_user_page(struct page *page)
> -{
> - put_page(page);
> -}
> -
> +void put_user_page(struct page *page);
> void put_user_pages_dirty(struct page **pages, unsigned long npages);
> void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
> void put_user_pages(struct page **pages, unsigned long npages);
> diff --git a/mm/gup.c b/mm/gup.c
> index 55a41dee0340..ec1b26591532 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -25,6 +25,50 @@ struct follow_page_context {
> unsigned int page_mask;
> };
>
> +static void pin_page_for_dma(struct page *page)
> +{
> + int ret = 0;
> + struct zone *zone;
> +
> + page = compound_head(page);
> + zone = page_zone(page);
> +
> + spin_lock(zone_gup_lock(zone));

I think you'll need an irqsafe lock here, as get_user_pages_fast() can get
called from interrupt context in some cases. And so can put_user_page()...

<snip>

> +/*
> + * put_user_page() - release a page that had previously been acquired via
> + * a call to one of the get_user_pages*() functions.
> + *
> + * Usage: Pages that were pinned via get_user_pages*() must be released via
> + * either put_user_page(), or one of the put_user_pages*() routines
> + * below. This is so that eventually, pages that are pinned via
> + * get_user_pages*() can be separately tracked and uniquely handled. In
> + * particular, interactions with RDMA and filesystems need special
> + * handling.
> + */
> +void put_user_page(struct page *page)
> +{
> + struct zone *zone = page_zone(page);
> +
> + page = compound_head(page);
> +
> + if (atomic_dec_and_test(&page->dma_pinned_count)) {
> + spin_lock(zone_gup_lock(zone));
> + /* Re-check while holding the lock, because
> + * pin_page_for_dma() or get_page() may have snuck in right
> + * after the atomic_dec_and_test, and raised the count
> + * above zero again. If so, just leave the flag set. And
> + * because the atomic_dec_and_test above already got the
> + * accounting correct, no other action is required.
> + */
> + VM_BUG_ON_PAGE(PageLRU(page), page);
> + VM_BUG_ON_PAGE(!PageDmaPinned(page), page);
> +
> + if (atomic_read(&page->dma_pinned_count) == 0) {

We have atomic_dec_and_lock[_irqsave]() exactly for constructs like this;
see the sketch after the quoted code below.

> + ClearPageDmaPinned(page);
> +
> + if (PageDmaPinnedWasLru(page)) {
> + ClearPageDmaPinnedWasLru(page);
> + putback_lru_page(page);
> + }
> + }
> +
> + spin_unlock(zone_gup_lock(zone));
> + }
> +
> + put_page(page);
> +}
> +EXPORT_SYMBOL(put_user_page);
> +
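
For illustration only, a rough, untested sketch of what that could look
like (this also folds in the irqsafe locking point from above; everything
else keeps the names used in the posted patch):

void put_user_page(struct page *page)
{
	unsigned long flags;
	struct zone *zone;

	page = compound_head(page);
	zone = page_zone(page);

	/*
	 * atomic_dec_and_lock_irqsave() takes the lock only when the
	 * count actually drops to zero, so the separate re-check under
	 * the lock is no longer needed.
	 */
	if (atomic_dec_and_lock_irqsave(&page->dma_pinned_count,
					zone_gup_lock(zone), flags)) {
		VM_BUG_ON_PAGE(PageLRU(page), page);
		VM_BUG_ON_PAGE(!PageDmaPinned(page), page);

		ClearPageDmaPinned(page);
		if (PageDmaPinnedWasLru(page)) {
			ClearPageDmaPinnedWasLru(page);
			putback_lru_page(page);
		}

		spin_unlock_irqrestore(zone_gup_lock(zone), flags);
	}

	put_page(page);
}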

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-11-12 15:47:33

by Keith Busch

Subject: Re: [PATCH v2 1/6] mm/gup: finish consolidating error handling

On Sat, Nov 10, 2018 at 12:50:36AM -0800, [email protected] wrote:
> From: John Hubbard <[email protected]>
>
> An upcoming patch wants to be able to operate on each page that
> get_user_pages has retrieved. In order to do that, it's best to
> have a common exit point from the routine. Most of this has been
> taken care of by commit df06b37ffe5a4 ("mm/gup: cache dev_pagemap while
> pinning pages"), but there was one case remaining.
>
> Also, there was still an unnecessary shadow declaration (with a
> different type) of the "ret" variable, which this commit removes.
>
> Cc: Keith Busch <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Signed-off-by: John Hubbard <[email protected]>
> ---
> mm/gup.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index f76e77a2d34b..55a41dee0340 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -696,12 +696,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> if (!vma || start >= vma->vm_end) {
> vma = find_extend_vma(mm, start);
> if (!vma && in_gate_area(mm, start)) {
> - int ret;
> ret = get_gate_page(mm, start & PAGE_MASK,
> gup_flags, &vma,
> pages ? &pages[i] : NULL);
> if (ret)
> - return i ? : ret;
> + goto out;
> ctx.page_mask = 0;
> goto next_page;
> }

This also fixes a potentially leaked dev_pagemap reference count if a
failure occurs when an iteration crosses a vma boundary. I don't think
it's normal to have different vma's on a user's mapped zone device memory,
but it's good to fix anyway.

Reviewed-by: Keith Busch <[email protected]>

2018-11-12 16:17:24

by Dan Williams

Subject: Re: [PATCH v2 1/6] mm/gup: finish consolidating error handling

On Mon, Nov 12, 2018 at 7:45 AM Keith Busch <[email protected]> wrote:
>
> On Sat, Nov 10, 2018 at 12:50:36AM -0800, [email protected] wrote:
> > From: John Hubbard <[email protected]>
> >
> > An upcoming patch wants to be able to operate on each page that
> > get_user_pages has retrieved. In order to do that, it's best to
> > have a common exit point from the routine. Most of this has been
> > taken care of by commit df06b37ffe5a4 ("mm/gup: cache dev_pagemap while
> > pinning pages"), but there was one case remaining.
> >
> > Also, there was still an unnecessary shadow declaration (with a
> > different type) of the "ret" variable, which this commit removes.
> >
> > Cc: Keith Busch <[email protected]>
> > Cc: Dan Williams <[email protected]>
> > Cc: Kirill A. Shutemov <[email protected]>
> > Cc: Dave Hansen <[email protected]>
> > Signed-off-by: John Hubbard <[email protected]>
> > ---
> > mm/gup.c | 3 +--
> > 1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index f76e77a2d34b..55a41dee0340 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -696,12 +696,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > if (!vma || start >= vma->vm_end) {
> > vma = find_extend_vma(mm, start);
> > if (!vma && in_gate_area(mm, start)) {
> > - int ret;
> > ret = get_gate_page(mm, start & PAGE_MASK,
> > gup_flags, &vma,
> > pages ? &pages[i] : NULL);
> > if (ret)
> > - return i ? : ret;
> > + goto out;
> > ctx.page_mask = 0;
> > goto next_page;
> > }
>
> This also fixes a potentially leaked dev_pagemap reference count if a
> failure occurs when an iteration crosses a vma boundary. I don't think
> it's normal to have different vma's on a users mapped zone device memory,
> but good to fix anyway.

Does not sound abnormal to me; we should promote this as a fix for the
current cycle, with an updated changelog.

2018-11-15 00:46:52

by John Hubbard

Subject: Re: [PATCH v2 1/6] mm/gup: finish consolidating error handling

On 11/12/18 8:14 AM, Dan Williams wrote:
> On Mon, Nov 12, 2018 at 7:45 AM Keith Busch <[email protected]> wrote:
>>
>> On Sat, Nov 10, 2018 at 12:50:36AM -0800, [email protected] wrote:
>>> From: John Hubbard <[email protected]>
>>>
>>> An upcoming patch wants to be able to operate on each page that
>>> get_user_pages has retrieved. In order to do that, it's best to
>>> have a common exit point from the routine. Most of this has been
>>> taken care of by commit df06b37ffe5a4 ("mm/gup: cache dev_pagemap while
>>> pinning pages"), but there was one case remaining.
>>>
>>> Also, there was still an unnecessary shadow declaration (with a
>>> different type) of the "ret" variable, which this commit removes.
>>>
>>> Cc: Keith Busch <[email protected]>
>>> Cc: Dan Williams <[email protected]>
>>> Cc: Kirill A. Shutemov <[email protected]>
>>> Cc: Dave Hansen <[email protected]>
>>> Signed-off-by: John Hubbard <[email protected]>
>>> ---
>>> mm/gup.c | 3 +--
>>> 1 file changed, 1 insertion(+), 2 deletions(-)
>>>
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index f76e77a2d34b..55a41dee0340 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -696,12 +696,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>>> if (!vma || start >= vma->vm_end) {
>>> vma = find_extend_vma(mm, start);
>>> if (!vma && in_gate_area(mm, start)) {
>>> - int ret;
>>> ret = get_gate_page(mm, start & PAGE_MASK,
>>> gup_flags, &vma,
>>> pages ? &pages[i] : NULL);
>>> if (ret)
>>> - return i ? : ret;
>>> + goto out;
>>> ctx.page_mask = 0;
>>> goto next_page;
>>> }
>>
>> This also fixes a potentially leaked dev_pagemap reference count if a
>> failure occurs when an iteration crosses a vma boundary. I don't think
>> it's normal to have different vmas on a user's mapped zone device memory,
>> but good to fix anyway.
>
> Does not sound abnormal to me; we should promote this as a fix for the
> current cycle with an updated changelog.
>

Andrew, should I send this patch separately, or do you have what you
need already?

thanks,
--
John Hubbard
NVIDIA

2018-11-15 06:30:32

by Chen, Rong A

[permalink] [raw]
Subject: [LKP] [mm] 0e9755bfa2: kernel_BUG_at_include/linux/mm.h

FYI, we noticed the following commit (built with gcc-7):

commit: 0e9755bfa2a6b02331e6a9453795507c097ca37f ("[PATCH v2 6/6] mm: track gup pages with page->dma_pinned_* fields")
url: https://github.com/0day-ci/linux/commits/john-hubbard-gmail-com/RFC-gup-dma-tracking-dma-pinned-pages/20181111-102330


in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 2G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+---------------------------------------------------------+------------+------------+
| | df8bea547b | 0e9755bfa2 |
+---------------------------------------------------------+------------+------------+
| boot_successes | 2 | 0 |
| boot_failures | 6 | 15 |
| invoked_oom-killer:gfp_mask=0x | 6 | |
| Mem-Info | 6 | |
| Out_of_memory_and_no_killable_processes | 6 | |
| Kernel_panic-not_syncing:System_is_deadlocked_on_memory | 6 | |
| kernel_BUG_at_include/linux/mm.h | 0 | 15 |
| invalid_opcode:#[##] | 0 | 15 |
| RIP:remove_arg_zero | 0 | 2 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 15 |
| RIP:__access_remote_vm | 0 | 12 |
| RIP:wp_page_copy | 0 | 3 |
+---------------------------------------------------------+------------+------------+



[ 9.236464] kernel BUG at include/linux/mm.h:956!
[ 9.236497] invalid opcode: 0000 [#1] SMP PTI
[ 9.237988] raw: 001fffc000080026 dead000000000104 dead000000000007 ffff91910a765bb1
[ 9.239418] CPU: 1 PID: 194 Comm: systemd-sysv-ge Not tainted 4.20.0-rc1-00233-g0e9755bf #4
[ 9.239419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 9.239424] RIP: 0010:__access_remote_vm+0x33a/0x390
[ 9.239427] Code: ff 48 83 e8 01 48 39 c7 0f 84 49 fe ff ff 0f 0b 8b 57 34 8b 47 10 39 c2 0f 8d 45 fe ff ff 48 c7 c6 10 62 08 96 e8 d6 3c ff ff <0f> 0b 41 0f b7 44 0d fe 66 89 44 0e fe e9 f2 fd ff ff 48 c7 c6 a0
[ 9.239428] RSP: 0018:ffffa03e40557d68 EFLAGS: 00010286
[ 9.239429] RAX: 0000000000000021 RBX: 0000000000000294 RCX: 0000000000000021
[ 9.239430] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000202
[ 9.239431] RBP: 0000000000000294 R08: 000000032015c592 R09: 0000000000000021
[ 9.239431] R10: 000000000007ffd4 R11: 0000000000000000 R12: 00007ffece5e8d5e
[ 9.239432] R13: ffff91917fac4000 R14: 0000000000000d5e R15: 0000000000000294
[ 9.239434] FS: 00007f659aebb8c0(0000) GS:ffff91916bd00000(0000) knlGS:0000000000000000
[ 9.239434] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9.239435] CR2: 00007f659a513d30 CR3: 000000007f99e004 CR4: 00000000000606e0
[ 9.239441] Call Trace:
[ 9.242370] raw: 00000007fffffffe 0000000000000000 0000000500000002 ffff91915d117000
[ 9.243210] environ_read+0x161/0x1f0
[ 9.244466] page dumped because: VM_BUG_ON_PAGE(PageDmaPinned(page) && page_ref_count(page) < atomic_read(&page->dma_pinned_count))
[ 9.247107] __vfs_read+0x36/0x190
[ 9.247116] vfs_read+0x9b/0x140
[ 9.247118] ksys_read+0x52/0xc0
[ 9.247121] do_syscall_64+0x5b/0x180
[ 9.247125] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 9.247128] RIP: 0033:0x7f659a5a86d0
[ 9.247130] Code: b6 fe ff ff 48 8d 3d 17 be 08 00 48 83 ec 08 e8 06 db 01 00 66 0f 1f 44 00 00 83 3d 39 30 2c 00 00 75 10 b8 00 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 de 9b 01 00 48 89 04 24
[ 9.247131] RSP: 002b:00007ffd88cd1f08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 9.247133] RAX: ffffffffffffffda RBX: 0000564c0244e060 RCX: 00007f659a5a86d0
[ 9.247139] RDX: 0000000000000400 RSI: 0000564c0244e290 RDI: 0000000000000003
[ 9.248735] page->mem_cgroup:ffff91915d117000
[ 9.249581] RBP: 0000000000000d68 R08: 0000000000000003 R09: 0000000000000410
[ 9.249582] R10: 0000000000080000 R11: 0000000000000246 R12: 00007f659a863440
[ 9.249582] R13: 00007f659a862900 R14: 0000000000000009 R15: 0000000000000000
[ 9.249585] Modules linked in:
[ 9.249616] ---[ end trace 4023260a226f976e ]---
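
The failed assertion, reconstructed from the "page dumped because" line above
(paraphrased -- the exact statement is in this patchset's version of
include/linux/mm.h, line 956):

	VM_BUG_ON_PAGE(PageDmaPinned(page) &&
		       page_ref_count(page) <
				atomic_read(&page->dma_pinned_count), page);

i.e. a page flagged as DMA-pinned should never hold fewer total references
than its recorded pin count.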


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
lkp


Attachments:
(No filename) (4.75 kB)
config-4.20.0-rc1-00233-g0e9755bf (202.51 kB)
job-script (4.42 kB)
dmesg.xz (12.77 kB)

2018-11-19 19:06:08

by Tom Talpey

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

John, thanks for the discussion at LPC. One of the concerns we
raised however was the performance test. The numbers below are
rather obviously tainted. I think we need to get a better baseline
before concluding anything...

Here's my main concern:

On 11/10/2018 3:50 AM, [email protected] wrote:
> From: John Hubbard <[email protected]>
>...
> ------------------------------------------------------
> WITHOUT the patch:
> ------------------------------------------------------
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
> fio-3.3
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 00m:00s]
> reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov 6 20:18:06 2018
> read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)

~14000 4KB read IOPS is really, really low for an NVMe disk.

> cpu : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72

CPU is obviously the limiting factor. At these IOPS, it should be far
less.
> ------------------------------------------------------
> OR, here's a better run WITH the patch applied, and you can see that this is nearly as good
> as the "without" case:
> ------------------------------------------------------
>
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
> fio-3.3
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 00m:00s]
> reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov 6 20:01:33 2018
> read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)

Similar low IOPS.

> cpu : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73

Similar CPU saturation.

>

I get nearly 400,000 4KB IOPS on my tiny desktop, which has a 25W
i7-7500 and a Samsung PM961 128GB NVMe (stock Bionic 4.15 kernel
and fio version 3.1). Even then, the CPU saturates, so it's not
necessarily a perfect test. I'd like to see your runs both get to
"max" IOPS, i.e. CPU < 100%, and compare the CPU numbers. This would
give the best comparison for making a decision.

Can you confirm what type of hardware you're running this test on?
CPU, memory speed and capacity, and NVMe device especially?

Tom.

2018-11-21 06:10:03

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

On 11/19/18 10:57 AM, Tom Talpey wrote:
> John, thanks for the discussion at LPC. One of the concerns we
> raised however was the performance test. The numbers below are
> rather obviously tainted. I think we need to get a better baseline
> before concluding anything...
>
> Here's my main concern:
>

Hi Tom,

Thanks again for looking at this!


> On 11/10/2018 3:50 AM, [email protected] wrote:
>> From: John Hubbard <[email protected]>
>> ...
>> ------------------------------------------------------
>> WITHOUT the patch:
>> ------------------------------------------------------
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
>> fio-3.3
>> Starting 1 process
>> Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 00m:00s]
>> reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov  6 20:18:06 2018
>>     read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)
>
> ~14000 4KB read IOPS is really, really low for an NVMe disk.

Yes, but Jan Kara's original config file for fio is *intended* to highlight
the get_user_pages/put_user_pages changes. It was *not* intended to get max
performance, as you can see by the numjobs and direct IO parameters:

cat fio.conf
[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64


So I'm thinking that this is not a "tainted" test, but rather, we're constraining
things a lot with these choices. It's hard to find a good test config to run that
allows decisions, but so far, I'm not really seeing anything that says "this
is so bad that we can't afford to fix the brokenness." I think.

After talking with you and reading this email, I did a bunch more test runs,
varying the following fio parameters:

-- direct
-- numjobs
-- iodepth

...with both the baseline 4.20-rc3 kernel, and with my patches applied. (btw, if
anyone cares, I'll post a github link that has a complete, testable patchset--not
ready for submission as such, but it works cleanly and will allow others to
attempt to reproduce my results).

What I'm seeing is that I can get 10x or better improvements in IOPS and BW,
just by going to 10 threads and turning off direct IO--as expected. So in the end,
I increased the number of threads, and also increased iodepth a bit.


Test results below...


>
>>    cpu          : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72
>
> CPU is obviously the limiting factor. At these IOPS, it should be far
> less.
>> ------------------------------------------------------
>> OR, here's a better run WITH the patch applied, and you can see that this is nearly as good
>> as the "without" case:
>> ------------------------------------------------------
>>
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
>> fio-3.3
>> Starting 1 process
>> Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 00m:00s]
>> reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov  6 20:01:33 2018
>>     read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)
>
> Similar low IOPS.
>
>>    cpu          : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73
>
> Similar CPU saturation.
>
>>
>
> I get nearly 400,000 4KB IOPS on my tiny desktop, which has a 25W
> i7-7500 and a Samsung PM961 128GB NVMe (stock Bionic 4.15 kernel
> and fio version 3.1). Even then, the CPU saturates, so it's not
> necessarily a perfect test. I'd like to see your runs both get to
> "max" IOPS, i.e. CPU < 100%, and compare the CPU numbers. This would
> give the best comparison for making a decision.

I can get to CPU < 100% by increasing to 10 or 20 threads, although it
makes latency ever so much worse.

>
> Can you confirm what type of hardware you're running this test on?
> CPU, memory speed and capacity, and NVMe device especially?
>
> Tom.

Yes, it's a nice new system, I don't expect any strange perf problems:

CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
(Intel X299 chipset)
Block device: nvme-Samsung_SSD_970_EVO_250GB
DRAM: 32 GB

So, here's a comparison using 20 threads, direct IO, for the baseline vs.
patched kernel (below). Highlights:

-- IOPS are similar, around 60k.
-- BW gets worse, dropping from 290 to 220 MB/s.
-- CPU is well under 100%.
-- latency is incredibly long, but...20 threads.

Baseline:

$ ./run.sh
fio configuration:
[reader]
ioengine=libaio
blocksize=4096
size=1g
rw=read
group_reporting
iodepth=256
direct=1
numjobs=20
-------- Running fio:
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.3
Starting 20 processes
Jobs: 4 (f=4): [_(8),R(2),_(2),R(1),_(1),R(1),_(5)][95.9%][r=244MiB/s,w=0KiB/s][r=62.5k,w=0 IOPS][eta 00m:03s]
reader: (groupid=0, jobs=20): err= 0: pid=14499: Tue Nov 20 16:20:35 2018
read: IOPS=74.2k, BW=290MiB/s (304MB/s)(20.0GiB/70644msec)
slat (usec): min=26, max=48167, avg=249.27, stdev=1200.02
clat (usec): min=42, max=147792, avg=67108.56, stdev=18062.46
lat (usec): min=103, max=147943, avg=67358.10, stdev=18109.75
clat percentiles (msec):
| 1.00th=[ 21], 5.00th=[ 40], 10.00th=[ 41], 20.00th=[ 47],
| 30.00th=[ 58], 40.00th=[ 65], 50.00th=[ 70], 60.00th=[ 75],
| 70.00th=[ 79], 80.00th=[ 83], 90.00th=[ 89], 95.00th=[ 93],
| 99.00th=[ 104], 99.50th=[ 109], 99.90th=[ 121], 99.95th=[ 125],
| 99.99th=[ 134]
bw ( KiB/s): min= 9712, max=46362, per=5.11%, avg=15164.99, stdev=2242.15, samples=2742
iops : min= 2428, max=11590, avg=3790.94, stdev=560.53, samples=2742
lat (usec) : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.98%, 50=20.44%
lat (msec) : 100=76.95%, 250=1.61%
cpu : usr=1.00%, sys=57.65%, ctx=158367, majf=0, minf=5284
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=5242880,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
READ: bw=290MiB/s (304MB/s), 290MiB/s-290MiB/s (304MB/s-304MB/s), io=20.0GiB (21.5GB), run=70644-70644msec

Disk stats (read/write):
nvme0n1: ios=5240738/7, merge=0/7, ticks=1457727/5, in_queue=1547139, util=100.00%

--------------------------------------------------------------
Patched:

<redforge> fast_256GB $ ./run.sh
fio configuration:
[reader]
ioengine=libaio
blocksize=4096
size=1g
rw=read
group_reporting
iodepth=256
direct=1
numjobs=20
-------- Running fio:
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.3
Starting 20 processes
Jobs: 13 (f=8): [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0 IOPS][eta 00m:02s]
reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
slat (usec): min=26, max=50436, avg=337.21, stdev=1405.14
clat (usec): min=43, max=178839, avg=88963.96, stdev=21745.31
lat (usec): min=106, max=179041, avg=89301.43, stdev=21800.43
clat percentiles (msec):
| 1.00th=[ 50], 5.00th=[ 53], 10.00th=[ 55], 20.00th=[ 68],
| 30.00th=[ 79], 40.00th=[ 86], 50.00th=[ 93], 60.00th=[ 99],
| 70.00th=[ 103], 80.00th=[ 108], 90.00th=[ 114], 95.00th=[ 121],
| 99.00th=[ 134], 99.50th=[ 140], 99.90th=[ 150], 99.95th=[ 155],
| 99.99th=[ 163]
bw ( KiB/s): min= 4920, max=39733, per=5.07%, avg=11506.18, stdev=1540.18, samples=3650
iops : min= 1230, max= 9933, avg=2876.20, stdev=385.05, samples=3650
lat (usec) : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
lat (usec) : 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.23%, 50=1.13%
lat (msec) : 100=63.04%, 250=35.57%
cpu : usr=0.65%, sys=58.07%, ctx=188963, majf=0, minf=5303
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=5242880,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
READ: bw=222MiB/s (232MB/s), 222MiB/s-222MiB/s (232MB/s-232MB/s), io=20.0GiB (21.5GB), run=92385-92385msec

Disk stats (read/write):
nvme0n1: ios=5240550/7, merge=0/7, ticks=1513681/4, in_queue=1636411, util=100.00%


Thoughts?


thanks,
--
John Hubbard
NVIDIA

2018-11-21 19:39:59

by Tom Talpey

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

On 11/21/2018 1:09 AM, John Hubbard wrote:
> On 11/19/18 10:57 AM, Tom Talpey wrote:
>> ~14000 4KB read IOPS is really, really low for an NVMe disk.
>
> Yes, but Jan Kara's original config file for fio is *intended* to highlight
> the get_user_pages/put_user_pages changes. It was *not* intended to get max
> performance, as you can see by the numjobs and direct IO parameters:
>
> cat fio.conf
> [reader]
> direct=1
> ioengine=libaio
> blocksize=4096
> size=1g
> numjobs=1
> rw=read
> iodepth=64

To be clear - I used those identical parameters, on my lower-spec
machine, and got 400,000 4KB read IOPS. Those results are nearly 30x
higher than yours!

> So I'm thinking that this is not a "tainted" test, but rather, we're constraining
> things a lot with these choices. It's hard to find a good test config to run that
> allows decisions, but so far, I'm not really seeing anything that says "this
> is so bad that we can't afford to fix the brokenness." I think.

I'm not suggesting we tune the benchmark, I'm suggesting the results
on your system are not meaningful since they are orders of magnitude
low. And without meaningful data it's impossible to see the performance
impact of the change...

>> Can you confirm what type of hardware you're running this test on?
>> CPU, memory speed and capacity, and NVMe device especially?
>>
>> Tom.
>
> Yes, it's a nice new system, I don't expect any strange perf problems:
>
> CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
> (Intel X299 chipset)
> Block device: nvme-Samsung_SSD_970_EVO_250GB
> DRAM: 32 GB

The Samsung Evo 970 250GB is speced to yield 200,000 random read IOPS
with a 4KB QD32 workload:


https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs

And the I7-7800X is a 6-core processor (12 hyperthreads).

> So, here's a comparison using 20 threads, direct IO, for the baseline vs.
> patched kernel (below). Highlights:
>
> -- IOPS are similar, around 60k.
> -- BW gets worse, dropping from 290 to 220 MB/s.
> -- CPU is well under 100%.
> -- latency is incredibly long, but...20 threads.
>
> Baseline:
>
> $ ./run.sh
> fio configuration:
> [reader]
> ioengine=libaio
> blocksize=4096
> size=1g
> rw=read
> group_reporting
> iodepth=256
> direct=1
> numjobs=20

Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets.
That's going to cause tremendous queuing, and context switching, far
outside of the get_user_pages() change.

But even so, it only brings IOPS to 74.2K, which is still far short of
the device's 200K spec.
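
As a rough sanity check on those latency numbers, assuming all 20 jobs keep
their queues full: 20 x 256 = 5120 requests in flight, so Little's law
predicts about 5120/74200 ~= 69 ms average completion latency for the
baseline run and 5120/56800 ~= 90 ms for the patched run -- right in line
with the ~67 ms and ~89 ms clat averages reported earlier, meaning those
huge latencies are almost entirely queueing delay.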

Comparing anyway:


> Patched:
>
> -------- Running fio:
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
> ...
> fio-3.3
> Starting 20 processes
> Jobs: 13 (f=8): [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0 IOPS][eta 00m:02s]
> reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
> read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
> ...
> Thoughts?

Concern - the 74.2K IOPS unpatched drops to 56.8K patched!

What I'd really like to see is to go back to the original fio parameters
(1 thread, 64 iodepth) and try to get a result that gets at least close
to the speced 200K IOPS of the NVMe device. There seems to be something
wrong with yours, currently.

Then of course, the result with the patched get_user_pages, and
compare whichever of IOPS or CPU% changes, and how much.

If these are within a few percent, I agree it's good to go. If it's
roughly 25% like the result just above, that's a rocky road.

I can try this after the holiday on some basic hardware and might
be able to scrounge up better. Can you post that github link?

Tom.

2018-11-22 09:04:43

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

On 11/21/18 8:49 AM, Tom Talpey wrote:
> On 11/21/2018 1:09 AM, John Hubbard wrote:
>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>> ~14000 4KB read IOPS is really, really low for an NVMe disk.
>>
>> Yes, but Jan Kara's original config file for fio is *intended* to highlight
>> the get_user_pages/put_user_pages changes. It was *not* intended to get max
>> performance,  as you can see by the numjobs and direct IO parameters:
>>
>> cat fio.conf
>> [reader]
>> direct=1
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> numjobs=1
>> rw=read
>> iodepth=64
>
> To be clear - I used those identical parameters, on my lower-spec
> machine, and got 400,000 4KB read IOPS. Those results are nearly 30x
> higher than yours!

OK, then something really is wrong here...

>
>> So I'm thinking that this is not a "tainted" test, but rather, we're constraining
>> things a lot with these choices. It's hard to find a good test config to run that
>> allows decisions, but so far, I'm not really seeing anything that says "this
>> is so bad that we can't afford to fix the brokenness." I think.
>
> I'm not suggesting we tune the benchmark, I'm suggesting the results
> on your system are not meaningful since they are orders of magnitude
> low. And without meaningful data it's impossible to see the performance
> impact of the change...
>
>>> Can you confirm what type of hardware you're running this test on?
>>> CPU, memory speed and capacity, and NVMe device especially?
>>>
>>> Tom.
>>
>> Yes, it's a nice new system, I don't expect any strange perf problems:
>>
>> CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
>>      (Intel X299 chipset)
>> Block device: nvme-Samsung_SSD_970_EVO_250GB
>> DRAM: 32 GB
>
> The Samsung Evo 970 250GB is speced to yield 200,000 random read IOPS
> with a 4KB QD32 workload:
>
>
> https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs
>
> And the I7-7800X is a 6-core processor (12 hyperthreads).
>
>> So, here's a comparison using 20 threads, direct IO, for the baseline vs.
>> patched kernel (below). Highlights:
>>
>>     -- IOPS are similar, around 60k.
>>     -- BW gets worse, dropping from 290 to 220 MB/s.
>>     -- CPU is well under 100%.
>>     -- latency is incredibly long, but...20 threads.
>>
>> Baseline:
>>
>> $ ./run.sh
>> fio configuration:
>> [reader]
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> rw=read
>> group_reporting
>> iodepth=256
>> direct=1
>> numjobs=20
>
> Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets.
> That's going to cause tremendous queuing, and context switching, far
> outside of the get_user_pages() change.
>
> But even so, it only brings IOPS to 74.2K, which is still far short of
> the device's 200K spec.
>
> Comparing anyway:
>
>
>> Patched:
>>
>> -------- Running fio:
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
>> ...
>> fio-3.3
>> Starting 20 processes
>> Jobs: 13 (f=8): [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0 IOPS][eta 00m:02s]
>> reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
>>     read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
>> ...
>> Thoughts?
>
> Concern - the 74.2K IOPS unpatched drops to 56.8K patched!

ACK. :)

>
> What I'd really like to see is to go back to the original fio parameters
> (1 thread, 64 iodepth) and try to get a result that gets at least close
> to the speced 200K IOPS of the NVMe device. There seems to be something
> wrong with yours, currently.

I'll dig into what has gone wrong with the test. I see fio putting data files
in the right place, so the obvious "using the wrong drive" is (probably)
not it. Even though it really feels like that sort of thing. We'll see.

>
> Then of course, the result with the patched get_user_pages, and
> compare whichever of IOPS or CPU% changes, and how much.
>
> If these are within a few percent, I agree it's good to go. If it's
> roughly 25% like the result just above, that's a rocky road.
>
> I can try this after the holiday on some basic hardware and might
> be able to scrounge up better. Can you post that github link?
>

Here:

git@github.com:johnhubbard/linux (branch: gup_dma_testing)


--
thanks,
John Hubbard
NVIDIA

2018-11-28 01:30:51

by Tom Talpey

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

On 11/21/2018 5:06 PM, John Hubbard wrote:
> On 11/21/18 8:49 AM, Tom Talpey wrote:
>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>> ~14000 4KB read IOPS is really, really low for an NVMe disk.
>>>
>>> Yes, but Jan Kara's original config file for fio is *intended* to highlight
>>> the get_user_pages/put_user_pages changes. It was *not* intended to get max
>>> performance,  as you can see by the numjobs and direct IO parameters:
>>>
>>> cat fio.conf
>>> [reader]
>>> direct=1
>>> ioengine=libaio
>>> blocksize=4096
>>> size=1g
>>> numjobs=1
>>> rw=read
>>> iodepth=64
>>
>> To be clear - I used those identical parameters, on my lower-spec
>> machine, and got 400,000 4KB read IOPS. Those results are nearly 30x
>> higher than yours!
>
> OK, then something really is wrong here...
>
>>
>>> So I'm thinking that this is not a "tainted" test, but rather, we're constraining
>>> things a lot with these choices. It's hard to find a good test config to run that
>>> allows decisions, but so far, I'm not really seeing anything that says "this
>>> is so bad that we can't afford to fix the brokenness." I think.
>>
>> I'm not suggesting we tune the benchmark, I'm suggesting the results
>> on your system are not meaningful since they are orders of magnitude
>> low. And without meaningful data it's impossible to see the performance
>> impact of the change...
>>
>>>> Can you confirm what type of hardware you're running this test on?
>>>> CPU, memory speed and capacity, and NVMe device especially?
>>>>
>>>> Tom.
>>>
>>> Yes, it's a nice new system, I don't expect any strange perf problems:
>>>
>>> CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
>>>      (Intel X299 chipset)
>>> Block device: nvme-Samsung_SSD_970_EVO_250GB
>>> DRAM: 32 GB
>>
>> The Samsung Evo 970 250GB is speced to yield 200,000 random read IOPS
>> with a 4KB QD32 workload:
>>
>>
>> https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs
>>
>> And the I7-7800X is a 6-core processor (12 hyperthreads).
>>
>>> So, here's a comparison using 20 threads, direct IO, for the baseline vs.
>>> patched kernel (below). Highlights:
>>>
>>>     -- IOPS are similar, around 60k.
>>>     -- BW gets worse, dropping from 290 to 220 MB/s.
>>>     -- CPU is well under 100%.
>>>     -- latency is incredibly long, but...20 threads.
>>>
>>> Baseline:
>>>
>>> $ ./run.sh
>>> fio configuration:
>>> [reader]
>>> ioengine=libaio
>>> blocksize=4096
>>> size=1g
>>> rw=read
>>> group_reporting
>>> iodepth=256
>>> direct=1
>>> numjobs=20
>>
>> Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets.
>> That's going to cause tremendous queuing, and context switching, far
>> outside of the get_user_pages() change.
>>
>> But even so, it only brings IOPS to 74.2K, which is still far short of
>> the device's 200K spec.
>>
>> Comparing anyway:
>>
>>
>>> Patched:
>>>
>>> -------- Running fio:
>>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
>>> ...
>>> fio-3.3
>>> Starting 20 processes
>>> Jobs: 13 (f=8): [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0 IOPS][eta 00m:02s]
>>> reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
>>>     read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
>>> ...
>>> Thoughts?
>>
>> Concern - the 74.2K IOPS unpatched drops to 56.8K patched!
>
> ACK. :)
>
>>
>> What I'd really like to see is to go back to the original fio parameters
>> (1 thread, 64 iodepth) and try to get a result that gets at least close
>> to the speced 200K IOPS of the NVMe device. There seems to be something
>> wrong with yours, currently.
>
> I'll dig into what has gone wrong with the test. I see fio putting data files
> in the right place, so the obvious "using the wrong drive" is (probably)
> not it. Even though it really feels like that sort of thing. We'll see.
>
>>
>> Then of course, the result with the patched get_user_pages, and
>> compare whichever of IOPS or CPU% changes, and how much.
>>
>> If these are within a few percent, I agree it's good to go. If it's
>> roughly 25% like the result just above, that's a rocky road.
>>
>> I can try this after the holiday on some basic hardware and might
>> be able to scrounge up better. Can you post that github link?
>>
>
> Here:
>
> git@github.com:johnhubbard/linux (branch: gup_dma_testing)

I'm super-limited here this week hardware-wise and have not been able
to try testing with the patched kernel.

I was able to compare my earlier quick test with a Bionic 4.15 kernel
(400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
test, and without your change.

Say, that branch reports it has not had a commit since June 30. Is that
the right one? What about gup_dma_for_lpc_2018?

Tom.

2018-11-28 02:54:03

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

On 11/27/18 5:21 PM, Tom Talpey wrote:
> On 11/21/2018 5:06 PM, John Hubbard wrote:
>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
[...]
>>>
>>> What I'd really like to see is to go back to the original fio parameters
>>> (1 thread, 64 iodepth) and try to get a result that gets at least close
>>> to the speced 200K IOPS of the NVMe device. There seems to be something
>>> wrong with yours, currently.
>>
>> I'll dig into what has gone wrong with the test. I see fio putting data files
>> in the right place, so the obvious "using the wrong drive" is (probably)
>> not it. Even though it really feels like that sort of thing. We'll see.
>>
>>>
>>> Then of course, the result with the patched get_user_pages, and
>>> compare whichever of IOPS or CPU% changes, and how much.
>>>
>>> If these are within a few percent, I agree it's good to go. If it's
>>> roughly 25% like the result just above, that's a rocky road.
>>>
>>> I can try this after the holiday on some basic hardware and might
>>> be able to scrounge up better. Can you post that github link?
>>>
>>
>> Here:
>>
>>     git@github.com:johnhubbard/linux (branch: gup_dma_testing)
>
> I'm super-limited here this week hardware-wise and have not been able
> to try testing with the patched kernel.
>
> I was able to compare my earlier quick test with a Bionic 4.15 kernel
> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
> test, and without your change.
>

So just to double check (again): you are running fio with these parameters,
right?

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64



> Say, that branch reports it has not had a commit since June 30. Is that
> the right one? What about gup_dma_for_lpc_2018?
>

That's the right branch, but the AuthorDate for the head commit (only) somehow
got stuck in the past. I just now amended that patch with a new date and pushed
it, so the head commit now shows Nov 27:

https://github.com/johnhubbard/linux/commits/gup_dma_testing


The actual code is the same, though. (It is still based on Nov 19th's f2ce1065e767
commit.)
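
(For anyone re-fetching: the refresh was just an author-date amend plus a
force-push, roughly:

	git commit --amend --no-edit --date="$(date -R)"
	git push -f origin gup_dma_testing

so anyone who already fetched the branch will need to reset to the new head
rather than pull.)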


thanks,
--
John Hubbard
NVIDIA

2018-11-28 14:00:37

by Tom Talpey

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

On 11/27/2018 9:52 PM, John Hubbard wrote:
> On 11/27/18 5:21 PM, Tom Talpey wrote:
>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
> [...]
>>>>
>>>> What I'd really like to see is to go back to the original fio parameters
>>>> (1 thread, 64 iodepth) and try to get a result that gets at least close
>>>> to the speced 200K IOPS of the NVMe device. There seems to be something
>>>> wrong with yours, currently.
>>>
>>> I'll dig into what has gone wrong with the test. I see fio putting data files
>>> in the right place, so the obvious "using the wrong drive" is (probably)
>>> not it. Even though it really feels like that sort of thing. We'll see.
>>>
>>>>
>>>> Then of course, the result with the patched get_user_pages, and
>>>> compare whichever of IOPS or CPU% changes, and how much.
>>>>
>>>> If these are within a few percent, I agree it's good to go. If it's
>>>> roughly 25% like the result just above, that's a rocky road.
>>>>
>>>> I can try this after the holiday on some basic hardware and might
>>>> be able to scrounge up better. Can you post that github link?
>>>>
>>>
>>> Here:
>>>
>>>     git@github.com:johnhubbard/linux (branch: gup_dma_testing)
>>
>> I'm super-limited here this week hardware-wise and have not been able
>> to try testing with the patched kernel.
>>
>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>> test, and without your change.
>>
>
> So just to double check (again): you are running fio with these parameters,
> right?
>
> [reader]
> direct=1
> ioengine=libaio
> blocksize=4096
> size=1g
> numjobs=1
> rw=read
> iodepth=64

Correct, I copy/pasted these directly. I also ran with size=10g because
the 1g provides a really small sample set.

There was one other difference, your results indicated fio 3.3 was used.
My Bionic install has fio 3.1. I don't find that relevant because our
goal is to compare before/after, which I haven't done yet.

Tom.

>
>
>
>> Say, that branch reports it has not had a commit since June 30. Is that
>> the right one? What about gup_dma_for_lpc_2018?
>>
>
> That's the right branch, but the AuthorDate for the head commit (only) somehow
> got stuck in the past. I just now amended that patch with a new date and pushed
> it, so the head commit now shows Nov 27:
>
> https://github.com/johnhubbard/linux/commits/gup_dma_testing
>
>
> The actual code is the same, though. (It is still based on Nov 19th's f2ce1065e767
> commit.)
>
>
> thanks,
>

2018-11-30 01:39:59

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

On 11/28/18 5:59 AM, Tom Talpey wrote:
> On 11/27/2018 9:52 PM, John Hubbard wrote:
>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>> [...]
>>> I'm super-limited here this week hardware-wise and have not been able
>>> to try testing with the patched kernel.
>>>
>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>> test, and without your change.
>>>
>>
>> So just to double check (again): you are running fio with these parameters,
>> right?
>>
>> [reader]
>> direct=1
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> numjobs=1
>> rw=read
>> iodepth=64
>
> Correct, I copy/pasted these directly. I also ran with size=10g because
> the 1g provides a really small sample set.
>
> There was one other difference, your results indicated fio 3.3 was used.
> My Bionic install has fio 3.1. I don't find that relevant because our
> goal is to compare before/after, which I haven't done yet.
>

OK, the 50 MB/s was due to my particular .config. I had some expensive debug options
set in mm, fs and locking subsystems. Turning those off, I'm back up to the rated
speed of the Samsung NVMe device, so now we should have a clearer picture of the
performance that real users will see.
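
(The usual suspects for that kind of slowdown -- not necessarily the exact
options that were set here -- are debug switches such as:

	CONFIG_DEBUG_VM=y
	CONFIG_PROVE_LOCKING=y
	CONFIG_DEBUG_LOCK_ALLOC=y
	CONFIG_LOCK_STAT=y
	CONFIG_DEBUG_ATOMIC_SLEEP=y
	CONFIG_KASAN=y

each of which adds per-page checks or per-lock bookkeeping on the gup and
block I/O paths, so they're worth grepping a .config for before trusting any
throughput numbers.)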

Continuing on, then: running a before and after test, I don't see any significant
difference in the fio results:

fio.conf:

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64

---------------------------------------------------------
Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:

$ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
clat percentiles (usec):
| 1.00th=[ 322], 5.00th=[ 326], 10.00th=[ 326], 20.00th=[ 326],
| 30.00th=[ 326], 40.00th=[ 326], 50.00th=[ 326], 60.00th=[ 326],
| 70.00th=[ 326], 80.00th=[ 326], 90.00th=[ 326], 95.00th=[ 326],
| 99.00th=[ 379], 99.50th=[ 594], 99.90th=[ 603], 99.95th=[ 611],
| 99.99th=[12125]
bw ( KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, stdev=22112.64, samples=2
iops : min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2
lat (usec) : 250=0.08%, 500=99.30%, 750=0.59%
lat (msec) : 20=0.02%
cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1360-1360msec

Disk stats (read/write):
nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00%

---------------------------------------------------------
With patches applied:

<redforge> fast_256GB $ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
clat percentiles (usec):
| 1.00th=[ 322], 5.00th=[ 326], 10.00th=[ 326], 20.00th=[ 326],
| 30.00th=[ 326], 40.00th=[ 326], 50.00th=[ 326], 60.00th=[ 326],
| 70.00th=[ 326], 80.00th=[ 326], 90.00th=[ 326], 95.00th=[ 326],
| 99.00th=[ 379], 99.50th=[ 594], 99.90th=[ 603], 99.95th=[ 611],
| 99.99th=[12125]
bw ( KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, stdev=22112.64, samples=2
iops : min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2
lat (usec) : 250=0.08%, 500=99.30%, 750=0.59%
lat (msec) : 20=0.02%
cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1360-1360msec

Disk stats (read/write):
nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00%


thanks,
--
John Hubbard
NVIDIA

2018-11-30 02:20:34

by Tom Talpey

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

On 11/29/2018 8:39 PM, John Hubbard wrote:
> On 11/28/18 5:59 AM, Tom Talpey wrote:
>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>> [...]
>>>> I'm super-limited here this week hardware-wise and have not been able
>>>> to try testing with the patched kernel.
>>>>
>>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>>> test, and without your change.
>>>>
>>>
>>> So just to double check (again): you are running fio with these parameters,
>>> right?
>>>
>>> [reader]
>>> direct=1
>>> ioengine=libaio
>>> blocksize=4096
>>> size=1g
>>> numjobs=1
>>> rw=read
>>> iodepth=64
>>
>> Correct, I copy/pasted these directly. I also ran with size=10g because
>> the 1g provides a really small sample set.
>>
>> There was one other difference, your results indicated fio 3.3 was used.
>> My Bionic install has fio 3.1. I don't find that relevant because our
>> goal is to compare before/after, which I haven't done yet.
>>
>
> OK, the 50 MB/s was due to my particular .config. I had some expensive debug options
> set in mm, fs and locking subsystems. Turning those off, I'm back up to the rated
> speed of the Samsung NVMe device, so now we should have a clearer picture of the
> performance that real users will see.

Oh, good! I'm especially glad because I was having a heck of a time
reconfiguring the one machine I have available for this.

> Continuing on, then: running a before and after test, I don't see any significant
> difference in the fio results:

Excerpting from below:

> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
> cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73

vs

> With patches applied:
> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
> cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73

Perfect results, not CPU limited, and full IOPS.

Curiously identical, so I trust you've checked that you measured
both targets, but if so, I say it's good.

Tom.

>
> fio.conf:
>
> [reader]
> direct=1
> ioengine=libaio
> blocksize=4096
> size=1g
> numjobs=1
> rw=read
> iodepth=64
>
> ---------------------------------------------------------
> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>
> $ fio ./experimental-fio.conf
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
> fio-3.3
> Starting 1 process
> Jobs: 1 (f=1)
> reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
> slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
> clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
> lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
> clat percentiles (usec):
> | 1.00th=[ 322], 5.00th=[ 326], 10.00th=[ 326], 20.00th=[ 326],
> | 30.00th=[ 326], 40.00th=[ 326], 50.00th=[ 326], 60.00th=[ 326],
> | 70.00th=[ 326], 80.00th=[ 326], 90.00th=[ 326], 95.00th=[ 326],
> | 99.00th=[ 379], 99.50th=[ 594], 99.90th=[ 603], 99.95th=[ 611],
> | 99.99th=[12125]
> bw ( KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, stdev=22112.64, samples=2
> iops : min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2
> lat (usec) : 250=0.08%, 500=99.30%, 750=0.59%
> lat (msec) : 20=0.02%
> cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
> issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
> READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1360-1360msec
>
> Disk stats (read/write):
> nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00%
>
> ---------------------------------------------------------
> With patches applied:
>
> <redforge> fast_256GB $ fio ./experimental-fio.conf
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
> fio-3.3
> Starting 1 process
> Jobs: 1 (f=1)
> reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
> slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
> clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
> lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
> clat percentiles (usec):
> | 1.00th=[ 322], 5.00th=[ 326], 10.00th=[ 326], 20.00th=[ 326],
> | 30.00th=[ 326], 40.00th=[ 326], 50.00th=[ 326], 60.00th=[ 326],
> | 70.00th=[ 326], 80.00th=[ 326], 90.00th=[ 326], 95.00th=[ 326],
> | 99.00th=[ 379], 99.50th=[ 594], 99.90th=[ 603], 99.95th=[ 611],
> | 99.99th=[12125]
> bw ( KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, stdev=22112.64, samples=2
> iops : min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2
> lat (usec) : 250=0.08%, 500=99.30%, 750=0.59%
> lat (msec) : 20=0.02%
> cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
> issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
> READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1360-1360msec
>
> Disk stats (read/write):
> nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00%
>
>
> thanks,
>

2018-11-30 02:23:53

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

On 11/29/18 6:18 PM, Tom Talpey wrote:
> On 11/29/2018 8:39 PM, John Hubbard wrote:
>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>> [...]
>>>>> I'm super-limited here this week hardware-wise and have not been able
>>>>> to try testing with the patched kernel.
>>>>>
>>>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>>>> test, and without your change.
>>>>>
>>>>
>>>> So just to double check (again): you are running fio with these parameters,
>>>> right?
>>>>
>>>> [reader]
>>>> direct=1
>>>> ioengine=libaio
>>>> blocksize=4096
>>>> size=1g
>>>> numjobs=1
>>>> rw=read
>>>> iodepth=64
>>>
>>> Correct, I copy/pasted these directly. I also ran with size=10g because
>>> the 1g provides a really small sample set.
>>>
>>> There was one other difference, your results indicated fio 3.3 was used.
>>> My Bionic install has fio 3.1. I don't find that relevant because our
>>> goal is to compare before/after, which I haven't done yet.
>>>
>>
>> OK, the 50 MB/s was due to my particular .config. I had some expensive debug options
>> set in mm, fs and locking subsystems. Turning those off, I'm back up to the rated
>> speed of the Samsung NVMe device, so now we should have a clearer picture of the
>> performance that real users will see.
>
> Oh, good! I'm especially glad because I was having a heck of a time
> reconfiguring the one machine I have available for this.
>
>> Continuing on, then: running a before and after test, I don't see any significant
>> difference in the fio results:
>
> Excerpting from below:
>
>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>     read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>    cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>
> vs
>
>> With patches applied:
>>     read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>    cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>
> Perfect results, not CPU limited, and full IOPS.
>
> Curiously identical, so I trust you've checked that you measured
> both targets, but if so, I say it's good.
>

Argh, copy-paste error in the email. The real "before" is ever so slightly
better, at 194K IOPS and 759 MB/s:

$ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1715: Thu Nov 29 17:07:09 2018
read: IOPS=194k, BW=759MiB/s (795MB/s)(1024MiB/1350msec)
slat (nsec): min=1245, max=2812.7k, avg=1538.03, stdev=5519.61
clat (usec): min=148, max=755, avg=326.85, stdev=18.13
lat (usec): min=150, max=3483, avg=328.41, stdev=19.53
clat percentiles (usec):
| 1.00th=[ 322], 5.00th=[ 326], 10.00th=[ 326], 20.00th=[ 326],
| 30.00th=[ 326], 40.00th=[ 326], 50.00th=[ 326], 60.00th=[ 326],
| 70.00th=[ 326], 80.00th=[ 326], 90.00th=[ 326], 95.00th=[ 326],
| 99.00th=[ 355], 99.50th=[ 537], 99.90th=[ 553], 99.95th=[ 553],
| 99.99th=[ 619]
bw ( KiB/s): min=767816, max=783096, per=99.84%, avg=775456.00, stdev=10804.59, samples=2
iops : min=191954, max=195774, avg=193864.00, stdev=2701.15, samples=2
lat (usec) : 250=0.09%, 500=99.30%, 750=0.61%, 1000=0.01%
cpu : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=759MiB/s (795MB/s), 759MiB/s-759MiB/s (795MB/s-795MB/s), io=1024MiB (1074MB), run=1350-1350msec

Disk stats (read/write):
nvme0n1: ios=222853/0, merge=0/0, ticks=71410/0, in_queue=71935, util=100.00%

thanks,
--
John Hubbard
NVIDIA
>
>>
>> fio.conf:
>>
>> [reader]
>> direct=1
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> numjobs=1
>> rw=read
>> iodepth=64
>>
>> ---------------------------------------------------------
>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>
[deleted with prejudice. See the correction above, instead] --jhubbard
>>
>> ---------------------------------------------------------
>> With patches applied:
>>
>> <redforge> fast_256GB $ fio ./experimental-fio.conf
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
>> fio-3.3
>> Starting 1 process
>> Jobs: 1 (f=1)
>> reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
>>     read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>      slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
>>      clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
>>       lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
>>      clat percentiles (usec):
>>       |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
>>       | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
>>       | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
>>       | 99.00th=[  379], 99.50th=[  594], 99.90th=[  603], 99.95th=[  611],
>>       | 99.99th=[12125]
>>     bw (  KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, stdev=22112.64, samples=2
>>     iops        : min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2
>>    lat (usec)   : 250=0.08%, 500=99.30%, 750=0.59%
>>    lat (msec)   : 20=0.02%
>>    cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
>>       issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>>       latency   : target=0, window=0, percentile=100.00%, depth=64
>>
>> Run status group 0 (all jobs):
>>     READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1360-1360msec
>>
>> Disk stats (read/write):
>>    nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00%
>>
>>
>> thanks,
>>
>

2018-11-30 02:31:22

by Tom Talpey

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

On 11/29/2018 9:21 PM, John Hubbard wrote:
> On 11/29/18 6:18 PM, Tom Talpey wrote:
>> On 11/29/2018 8:39 PM, John Hubbard wrote:
>>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>>> [...]
>>>>>> I'm super-limited here this week hardware-wise and have not been able
>>>>>> to try testing with the patched kernel.
>>>>>>
>>>>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>>>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>>>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>>>>> test, and without your change.
>>>>>>
>>>>>
>>>>> So just to double check (again): you are running fio with these parameters,
>>>>> right?
>>>>>
>>>>> [reader]
>>>>> direct=1
>>>>> ioengine=libaio
>>>>> blocksize=4096
>>>>> size=1g
>>>>> numjobs=1
>>>>> rw=read
>>>>> iodepth=64
>>>>
>>>> Correct, I copy/pasted these directly. I also ran with size=10g because
>>>> the 1g provides a really small sample set.
>>>>
>>>> There was one other difference, your results indicated fio 3.3 was used.
>>>> My Bionic install has fio 3.1. I don't find that relevant because our
>>>> goal is to compare before/after, which I haven't done yet.
>>>>
>>>
>>> OK, the 50 MB/s was due to my particular .config. I had some expensive debug options
>>> set in mm, fs and locking subsystems. Turning those off, I'm back up to the rated
>>> speed of the Samsung NVMe device, so now we should have a clearer picture of the
>>> performance that real users will see.
>>
>> Oh, good! I'm especially glad because I was having a heck of a time
>> reconfiguring the one machine I have available for this.
>>
>>> Continuing on, then: running a before and after test, I don't see any significant
>>> difference in the fio results:
>>
>> Excerpting from below:
>>
>>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>>      read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>     cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>
>> vs
>>
>>> With patches applied:
>>>      read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>     cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>
>> Perfect results, not CPU limited, and full IOPS.
>>
>> Curiously identical, so I trust you've checked that you measured
>> both targets, but if so, I say it's good.
>>
>
> Argh, copy-paste error in the email. The real "before" is ever so slightly
> better, at 194K IOPS and 759 MB/s:

Definitely better - note the system CPU is lower, which is probably the
reason for the increased IOPS.

> cpu : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73

Good result - a correct implementation, and faster.

Tom.


>
> $ fio ./experimental-fio.conf
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
> fio-3.3
> Starting 1 process
> Jobs: 1 (f=1)
> reader: (groupid=0, jobs=1): err= 0: pid=1715: Thu Nov 29 17:07:09 2018
> read: IOPS=194k, BW=759MiB/s (795MB/s)(1024MiB/1350msec)
> slat (nsec): min=1245, max=2812.7k, avg=1538.03, stdev=5519.61
> clat (usec): min=148, max=755, avg=326.85, stdev=18.13
> lat (usec): min=150, max=3483, avg=328.41, stdev=19.53
> clat percentiles (usec):
> | 1.00th=[ 322], 5.00th=[ 326], 10.00th=[ 326], 20.00th=[ 326],
> | 30.00th=[ 326], 40.00th=[ 326], 50.00th=[ 326], 60.00th=[ 326],
> | 70.00th=[ 326], 80.00th=[ 326], 90.00th=[ 326], 95.00th=[ 326],
> | 99.00th=[ 355], 99.50th=[ 537], 99.90th=[ 553], 99.95th=[ 553],
> | 99.99th=[ 619]
> bw ( KiB/s): min=767816, max=783096, per=99.84%, avg=775456.00, stdev=10804.59, samples=2
> iops : min=191954, max=195774, avg=193864.00, stdev=2701.15, samples=2
> lat (usec) : 250=0.09%, 500=99.30%, 750=0.61%, 1000=0.01%
> cpu : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
> issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
> READ: bw=759MiB/s (795MB/s), 759MiB/s-759MiB/s (795MB/s-795MB/s), io=1024MiB (1074MB), run=1350-1350msec
>
> Disk stats (read/write):
> nvme0n1: ios=222853/0, merge=0/0, ticks=71410/0, in_queue=71935, util=100.00%
>
> thanks,
>
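
[For reference: the parameters quoted above correspond to a standalone fio
job file along the following lines. This is only a sketch; the filename=
target is an assumption and should point at the NVMe namespace actually
under test (nvme0n1 in the runs above). With it saved as
experimental-fio.conf, the runs above were started with
"fio ./experimental-fio.conf".]

    [reader]
    direct=1
    ioengine=libaio
    blocksize=4096
    size=1g
    numjobs=1
    rw=read
    iodepth=64
    filename=/dev/nvme0n1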

2018-11-30 03:01:19

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

On 11/29/18 6:30 PM, Tom Talpey wrote:
> On 11/29/2018 9:21 PM, John Hubbard wrote:
>> On 11/29/18 6:18 PM, Tom Talpey wrote:
>>> On 11/29/2018 8:39 PM, John Hubbard wrote:
>>>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>>>> [...]
>>> Excerpting from below:
>>>
>>>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>>>       read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>      cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>
>>> vs
>>>
>>>> With patches applied:
>>>>       read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>      cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>
>>> Perfect results, not CPU limited, and full IOPS.
>>>
>>> Curiously identical, so I trust you've checked that you measured
>>> both targets, but if so, I say it's good.
>>>
>>
>> Argh, copy-paste error in the email. The real "before" is ever so slightly
>> better, at 194K IOPS and 759 MB/s:
>
> Definitely better - note the system CPU is lower, which is probably the
> reason for the increased IOPS.
>
>>    cpu          : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
>
> Good result - a correct implementation, and faster.
>

Thanks, Tom, I really appreciate your experience and help on what performance
should look like here. (I'm sure you can guess that this is the first time
I've worked with fio, heh.)

I'll send out a new, non-RFC patchset soon, then.

thanks,
--
John Hubbard
NVIDIA
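
[For anyone eyeballing the corrected numbers in this thread, the quoted
IOPS, bandwidth, and runtime figures are mutually consistent. A quick
sanity-check sketch, using only the values quoted above:]

    # Values copied from the fio output quoted in this thread.
    # "before" (baseline): 194k IOPS, 759 MiB/s, 1024 MiB in 1350 msec
    # "after"  (patched):  193k IOPS, 753 MiB/s, 1024 MiB in 1360 msec
    block_size = 4096   # bytes, from the job file

    for label, iops, mib, msec in [("before", 194_000, 1024, 1350),
                                   ("after",  193_000, 1024, 1360)]:
        bw_from_iops = iops * block_size / 2**20    # MiB/s
        bw_from_runtime = mib / (msec / 1000.0)     # MiB/s
        print(f"{label}: {bw_from_iops:.0f} MiB/s from IOPS, "
              f"{bw_from_runtime:.0f} MiB/s from runtime")

    # Both work out to roughly 753-759 MiB/s, matching fio's reported
    # bandwidth, so the ~0.5% IOPS difference between the two runs is
    # real but small, as discussed above.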

2018-11-30 03:15:03

by Tom Talpey

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

On 11/29/2018 10:00 PM, John Hubbard wrote:
> On 11/29/18 6:30 PM, Tom Talpey wrote:
>> On 11/29/2018 9:21 PM, John Hubbard wrote:
>>> On 11/29/18 6:18 PM, Tom Talpey wrote:
>>>> On 11/29/2018 8:39 PM, John Hubbard wrote:
>>>>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>>>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>>>>> [...]
>>>> Excerpting from below:
>>>>
>>>>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>>>>       read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>>      cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>>
>>>> vs
>>>>
>>>>> With patches applied:
>>>>>       read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>>      cpu          : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>>
>>>> Perfect results, not CPU limited, and full IOPS.
>>>>
>>>> Curiously identical, so I trust you've checked that you measured
>>>> both targets, but if so, I say it's good.
>>>>
>>>
>>> Argh, copy-paste error in the email. The real "before" is ever so slightly
>>> better, at 194K IOPS and 759 MB/s:
>>
>> Definitely better - note the system CPU is lower, which is probably the
>> reason for the increased IOPS.
>>
>>>     cpu          : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
>>
>> Good result - a correct implementation, and faster.
>>
>
> Thanks, Tom, I really appreciate your experience and help on what performance
> should look like here. (I'm sure you can guess that this is the first time
> I've worked with fio, heh.)

No problem, happy to chip in. Feel free to add my

Tested-By: Tom Talpey <[email protected]>
(see the trailer example at the end of this message)

I know, that's not the personal email I'm posting from, but it's me.

I'll hopefully be trying the code with the Linux SMB client (cifs.ko)
next week; Long Li is implementing direct I/O in that, and we'll see how
it helps.

Mainly, I'm looking forward to seeing this enable RDMA-to-DAX.

Tom.

>
> I'll send out a new, non-RFC patchset soon, then.
>
> thanks,
>
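
[Trailer example, as referenced above: in the eventual non-RFC series,
a tag like this would normally be appended to each patch's changelog as
a trailer, alongside the author's Signed-off-by. The subject line below
is purely illustrative:]

    mm/gup: illustrative patch subject

    <patch changelog text>

    Signed-off-by: John Hubbard <[email protected]>
    Tested-By: Tom Talpey <[email protected]>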