2020-08-29 08:10:14

by John Hubbard

[permalink] [raw]
Subject: [PATCH v2 0/3] bio: Direct IO: convert to pin_user_pages_fast()

Hi,

Changes since v1:

* Now handles ITER_PIPE, by appying pin_user_page() to ITER_PIPE pages,
on the Direct IO path. Thanks to Al Viro for pointing me in the right
direction there.

* Removed the ceph and BIO_FOLL_PIN patches: the ceph improvements were
handled separately as a different patch entirely, by Jeff Layton. And
the BIO_FOLL_PIN idea turned out to be completely undesirable here.

Original cover letter, updated for v2:

This converts the Direct IO block/bio layer over to use FOLL_PIN pages
(those acquired via pin_user_pages*()). This effectively converts
several file systems (ext4, for example) that use the common Direct IO
routines. See "Remaining work", below for a bit more detail there.

Quite a few approaches have been considered over the years. This one is
inspired by Christoph Hellwig's July, 2019 observation that there are
only 5 ITER_ types, and we can simplify handling of them for Direct IO
[1]. After working through how bio submission and completion works, I
became convinced that this is the simplest and cleanest approach to
conversion.

Design notes ============

This whole approach depends on certain concepts:

1) Each struct bio instance must not mix different types of pages:
FOLL_PIN and non-FOLL_PIN pages. (By FOLL_PIN I'm referring to pages
that were acquired and pinned via pin_user_page*() routines.)
Fortunately, this is already an enforced constraint for bio's, as
evidenced by the existence and use of BIO_NO_PAGE_REF.

2) Christoph Hellwig's July, 2019 observation that there are
only 5 ITER_ types, and we can simplify handling of them for Direct IO
[1]. Accordingly, this series implements the following pseudocode:

Direct IO behavior:

ITER_IOVEC:
pin_user_pages_fast();
break;

ITER_PIPE:
for each page:
pin_user_page();
break;

ITER_KVEC: // already elevated page refcount, leave alone
ITER_BVEC: // already elevated page refcount, leave alone
ITER_DISCARD: // discard
return -EFAULT or -ENVALID;

...which works for callers that already have sorted out which case they
are in. Such as, Direct IO in the block/bio layers.

Now, this does leave ITER_KVEC and ITER_BVEC unconverted, but on the
other hand, it's not clear that these are actually affected in the real
world, by the get_user_pages()+filesystem interaction problems of [2].
If it turns out to matter, then those can be handled too, but it's just
more refactoring and surgery to do so.

Testing
=======

Performance: no obvious regressions from running fio (direct=1: Direct
IO) on both SSD and NVMe drives.

Functionality: selected non-destructive bare metal xfstests on xfs,
ext4, btrfs, orangefs filesystems, plus LTP tests.

Note that I have only a single x86 64-bit test machine, though.

Remaining work
==============

Non-converted call sites for iter_iov_get_pages*() at the
moment include: net, crypto, cifs, ceph, vhost, fuse, nfs/direct,
vhost/scsi. However, it's not clear which of those really have to be
converted, because some of them probably use ITER_BVEC or ITER_KVEC.

About-to-be-converted sites (in a subsequent patch) are: Direct IO for
filesystems that use the generic read/write functions.

[1] https://lore.kernel.org/kvm/[email protected]/

[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/


John Hubbard (3):
mm/gup: introduce pin_user_page()
iov_iter: introduce iov_iter_pin_user_pages*() routines
bio: convert get_user_pages_fast() --> pin_user_pages_fast()

block/bio.c | 24 +++++-----
block/blk-map.c | 6 +--
fs/direct-io.c | 28 +++++------
fs/iomap/direct-io.c | 2 +-
include/linux/mm.h | 2 +
include/linux/uio.h | 5 ++
lib/iov_iter.c | 110 +++++++++++++++++++++++++++++++++++++++----
mm/gup.c | 30 ++++++++++++
8 files changed, 169 insertions(+), 38 deletions(-)

--
2.28.0


2020-08-29 08:10:36

by John Hubbard

[permalink] [raw]
Subject: [PATCH v2 3/3] bio: convert get_user_pages_fast() --> pin_user_pages_fast()

Change generic block/bio Direct IO routines, to acquire FOLL_PIN user
pages via the recently added routines:

iov_iter_pin_user_pages()
iov_iter_pin_user_pages_alloc()
pin_user_page()

This effectively converts several file systems (ext4, for example) that
use the common Direct IO routines.

Change the corresponding page release calls from put_page() to
unpin_user_page().

Change bio_release_pages() to handle FOLL_PIN pages. In fact, that
is now the *only* type of pages it handles now.

Design notes
============

Quite a few approaches have been considered over the years. This one is
inspired by Christoph Hellwig's July, 2019 observation that there are
only 5 ITER_ types, and we can simplify handling of them for Direct IO
[1]. Accordingly, this patch implements the following pseudocode:

Direct IO behavior:

ITER_IOVEC:
pin_user_pages_fast();
break;

ITER_PIPE:
for each page:
pin_user_page();
break;

ITER_KVEC: // already elevated page refcount, leave alone
ITER_BVEC: // already elevated page refcount, leave alone
ITER_DISCARD: // discard
return -EFAULT or -ENVALID;

...which works for callers that already have sorted out which case they
are in. Such as, Direct IO in the block/bio layers.

Now, this does leave ITER_KVEC and ITER_BVEC unconverted, but on the
other hand, it's not clear that these are actually affected in the real
world, by the get_user_pages()+filesystem interaction problems of [2].
If it turns out to matter, then those can be handled too, but it's just
more refactoring and surgery to do so.

Page acquisition: The iov_iter_get_pages*() routines
above are at just the right level in the call stack: the callers already
know which system to use, and so it's a small change to just drop in the
replacement routines. And it's a fan-in/fan-out point: block/bio call
sites for Direct IO funnel their page acquisitions through the
iov_iter_get_pages*() routines, and there are many other callers of
those. And we can't convert all of the callers at once--too many
subsystems are involved, and it would be a too large and too risky
patch.

Page release: there are already separate release routines: put_page()
vs. unpin_user_page(), so it's already done there.

[1] https://lore.kernel.org/kvm/[email protected]/

[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/

Signed-off-by: John Hubbard <[email protected]>
---
block/bio.c | 24 ++++++++++++------------
block/blk-map.c | 6 +++---
fs/direct-io.c | 28 ++++++++++++++--------------
fs/iomap/direct-io.c | 2 +-
4 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index a9931f23d933..f54e9414e6d9 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -955,7 +955,7 @@ void bio_release_pages(struct bio *bio, bool mark_dirty)
bio_for_each_segment_all(bvec, bio, iter_all) {
if (mark_dirty && !PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
- put_page(bvec->bv_page);
+ unpin_user_page(bvec->bv_page);
}
}
EXPORT_SYMBOL_GPL(bio_release_pages);
@@ -986,9 +986,9 @@ static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
* @iter: iov iterator describing the region to be mapped
*
* Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * pages will have to be released using put_page() or unpin_user_page() when
+ * done. For multi-segment *iter, this function only adds pages from the next
+ * non-empty segment of the iov iterator.
*/
static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
{
@@ -1009,7 +1009,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);

- size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+ size = iov_iter_pin_user_pages(iter, pages, LONG_MAX, nr_pages, &offset);
if (unlikely(size <= 0))
return size ? size : -EFAULT;

@@ -1020,7 +1020,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)

if (__bio_try_merge_page(bio, page, len, offset, &same_page)) {
if (same_page)
- put_page(page);
+ unpin_user_page(page);
} else {
if (WARN_ON_ONCE(bio_full(bio, len)))
return -EINVAL;
@@ -1056,7 +1056,7 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);

- size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+ size = iov_iter_pin_user_pages(iter, pages, LONG_MAX, nr_pages, &offset);
if (unlikely(size <= 0))
return size ? size : -EFAULT;

@@ -1069,7 +1069,7 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
max_append_sectors, &same_page) != len)
return -EINVAL;
if (same_page)
- put_page(page);
+ unpin_user_page(page);
offset = 0;
}

@@ -1113,8 +1113,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
} else {
if (is_bvec)
ret = __bio_iov_bvec_add_pages(bio, iter);
- else
- ret = __bio_iov_iter_get_pages(bio, iter);
+ else
+ ret = __bio_iov_iter_get_pages(bio, iter);
}
} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));

@@ -1326,8 +1326,8 @@ void bio_set_pages_dirty(struct bio *bio)
* the BIO and re-dirty the pages in process context.
*
* It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on. It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on. It will run one unpin_user_page() against each page
+ * and will run one bio_put() against the BIO.
*/

static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk-map.c b/block/blk-map.c
index 6e804892d5ec..7a095b4947ea 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -275,7 +275,7 @@ static struct bio *bio_map_user_iov(struct request_queue *q,
size_t offs, added = 0;
int npages;

- bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
+ bytes = iov_iter_pin_user_pages_alloc(iter, &pages, LONG_MAX, &offs);
if (unlikely(bytes <= 0)) {
ret = bytes ? bytes : -EFAULT;
goto out_unmap;
@@ -298,7 +298,7 @@ static struct bio *bio_map_user_iov(struct request_queue *q,
if (!bio_add_hw_page(q, bio, page, n, offs,
max_sectors, &same_page)) {
if (same_page)
- put_page(page);
+ unpin_user_page(page);
break;
}

@@ -312,7 +312,7 @@ static struct bio *bio_map_user_iov(struct request_queue *q,
* release the pages we didn't map into the bio, if any
*/
while (j < npages)
- put_page(pages[j++]);
+ unpin_user_page(pages[j++]);
kvfree(pages);
/* couldn't stuff something into bio? */
if (bytes)
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 183299892465..b01c8d003bd3 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -170,7 +170,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
{
ssize_t ret;

- ret = iov_iter_get_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
+ ret = iov_iter_pin_user_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
&sdio->from);

if (ret < 0 && sdio->blocks_available && (dio->op == REQ_OP_WRITE)) {
@@ -182,7 +182,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
*/
if (dio->page_errors == 0)
dio->page_errors = ret;
- get_page(page);
+ pin_user_page(page);
dio->pages[0] = page;
sdio->head = 0;
sdio->tail = 1;
@@ -472,7 +472,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio)
{
while (sdio->head < sdio->tail)
- put_page(dio->pages[sdio->head++]);
+ unpin_user_page(dio->pages[sdio->head++]);
}

/*
@@ -739,7 +739,7 @@ static inline int dio_bio_add_page(struct dio_submit *sdio)
*/
if ((sdio->cur_page_len + sdio->cur_page_offset) == PAGE_SIZE)
sdio->pages_in_io--;
- get_page(sdio->cur_page);
+ pin_user_page(sdio->cur_page);
sdio->final_block_in_bio = sdio->cur_page_block +
(sdio->cur_page_len >> sdio->blkbits);
ret = 0;
@@ -853,13 +853,13 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
*/
if (sdio->cur_page) {
ret = dio_send_cur_page(dio, sdio, map_bh);
- put_page(sdio->cur_page);
+ unpin_user_page(sdio->cur_page);
sdio->cur_page = NULL;
if (ret)
return ret;
}

- get_page(page); /* It is in dio */
+ pin_user_page(page); /* It is in dio */
sdio->cur_page = page;
sdio->cur_page_offset = offset;
sdio->cur_page_len = len;
@@ -874,7 +874,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
ret = dio_send_cur_page(dio, sdio, map_bh);
if (sdio->bio)
dio_bio_submit(dio, sdio);
- put_page(sdio->cur_page);
+ unpin_user_page(sdio->cur_page);
sdio->cur_page = NULL;
}
return ret;
@@ -974,7 +974,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,

ret = get_more_blocks(dio, sdio, map_bh);
if (ret) {
- put_page(page);
+ unpin_user_page(page);
goto out;
}
if (!buffer_mapped(map_bh))
@@ -1019,7 +1019,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,

/* AKPM: eargh, -ENOTBLK is a hack */
if (dio->op == REQ_OP_WRITE) {
- put_page(page);
+ unpin_user_page(page);
return -ENOTBLK;
}

@@ -1032,7 +1032,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
if (sdio->block_in_file >=
i_size_aligned >> blkbits) {
/* We hit eof */
- put_page(page);
+ unpin_user_page(page);
goto out;
}
zero_user(page, from, 1 << blkbits);
@@ -1072,7 +1072,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
sdio->next_block_for_io,
map_bh);
if (ret) {
- put_page(page);
+ unpin_user_page(page);
goto out;
}
sdio->next_block_for_io += this_chunk_blocks;
@@ -1087,8 +1087,8 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
break;
}

- /* Drop the ref which was taken in get_user_pages() */
- put_page(page);
+ /* Drop the ref which was taken in pin_user_pages() */
+ unpin_user_page(page);
}
out:
return ret;
@@ -1327,7 +1327,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
ret2 = dio_send_cur_page(dio, &sdio, &map_bh);
if (retval == 0)
retval = ret2;
- put_page(sdio.cur_page);
+ unpin_user_page(sdio.cur_page);
sdio.cur_page = NULL;
}
if (sdio.bio)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c1aafb2ab990..390f611528ea 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -194,7 +194,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;

- get_page(page);
+ pin_user_page(page);
__bio_add_page(bio, page, len, 0);
bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
iomap_dio_submit_bio(dio, iomap, bio, pos);
--
2.28.0

2020-08-29 08:10:38

by John Hubbard

[permalink] [raw]
Subject: [PATCH v2 2/3] iov_iter: introduce iov_iter_pin_user_pages*() routines

The new routines are:
iov_iter_pin_user_pages()
iov_iter_pin_user_pages_alloc()

and those correspond to these pre-existing routines:
iov_iter_get_pages()
iov_iter_get_pages_alloc()

Also, pipe_get_pages() and related are changed so as to pass
down a "use_pup" (use pin_user_page() instead of get_page()) bool
argument.

Unlike the iov_iter_get_pages*() routines, the
iov_iter_pin_user_pages*() routines assert that only ITER_IOVEC or
ITER_PIPE items are passed in. They then call pin_user_page*(), instead
of get_user_pages_fast() or get_page().

Why: In order to incrementally change Direct IO callers from calling
get_user_pages_fast() and put_page(), over to calling
pin_user_pages_fast() and unpin_user_page(), there need to be mid-level
routines that specifically call one or the other systems, for both page
acquisition and page release.

Signed-off-by: John Hubbard <[email protected]>
---
include/linux/uio.h | 5 ++
lib/iov_iter.c | 110 ++++++++++++++++++++++++++++++++++++++++----
2 files changed, 107 insertions(+), 8 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 3835a8a8e9ea..29b0504a27cc 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -229,6 +229,11 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages);

const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);

+ssize_t iov_iter_pin_user_pages(struct iov_iter *i, struct page **pages,
+ size_t maxsize, unsigned int maxpages, size_t *start);
+ssize_t iov_iter_pin_user_pages_alloc(struct iov_iter *i, struct page ***pages,
+ size_t maxsize, size_t *start);
+
static inline size_t iov_iter_count(const struct iov_iter *i)
{
return i->count;
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 5e40786c8f12..f25555eb3279 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1269,7 +1269,8 @@ static inline ssize_t __pipe_get_pages(struct iov_iter *i,
size_t maxsize,
struct page **pages,
int iter_head,
- size_t *start)
+ size_t *start,
+ bool use_pup)
{
struct pipe_inode_info *pipe = i->pipe;
unsigned int p_mask = pipe->ring_size - 1;
@@ -1280,7 +1281,11 @@ static inline ssize_t __pipe_get_pages(struct iov_iter *i,
maxsize = n;
n += *start;
while (n > 0) {
- get_page(*pages++ = pipe->bufs[iter_head & p_mask].page);
+ if (use_pup)
+ pin_user_page(*pages++ = pipe->bufs[iter_head & p_mask].page);
+ else
+ get_page(*pages++ = pipe->bufs[iter_head & p_mask].page);
+
iter_head++;
n -= PAGE_SIZE;
}
@@ -1290,7 +1295,7 @@ static inline ssize_t __pipe_get_pages(struct iov_iter *i,

static ssize_t pipe_get_pages(struct iov_iter *i,
struct page **pages, size_t maxsize, unsigned maxpages,
- size_t *start)
+ size_t *start, bool use_pup)
{
unsigned int iter_head, npages;
size_t capacity;
@@ -1306,8 +1311,51 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
npages = pipe_space_for_user(iter_head, i->pipe->tail, i->pipe);
capacity = min(npages, maxpages) * PAGE_SIZE - *start;

- return __pipe_get_pages(i, min(maxsize, capacity), pages, iter_head, start);
+ return __pipe_get_pages(i, min(maxsize, capacity), pages, iter_head,
+ start, use_pup);
+}
+
+ssize_t iov_iter_pin_user_pages(struct iov_iter *i,
+ struct page **pages, size_t maxsize, unsigned int maxpages,
+ size_t *start)
+{
+ size_t skip = i->iov_offset;
+ const struct iovec *iov;
+ struct iovec v;
+
+ if (unlikely(iov_iter_is_pipe(i)))
+ return pipe_get_pages(i, pages, maxsize, maxpages, start, true);
+ if (unlikely(iov_iter_is_discard(i)))
+ return -EFAULT;
+ if (WARN_ON_ONCE(!iter_is_iovec(i)))
+ return -EFAULT;
+
+ if (unlikely(!maxsize))
+ return 0;
+ maxsize = min(maxsize, i->count);
+
+ iterate_iovec(i, maxsize, v, iov, skip, ({
+ unsigned long addr = (unsigned long)v.iov_base;
+ size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
+ int n;
+ int res;
+
+ if (len > maxpages * PAGE_SIZE)
+ len = maxpages * PAGE_SIZE;
+ addr &= ~(PAGE_SIZE - 1);
+ n = DIV_ROUND_UP(len, PAGE_SIZE);
+
+ res = pin_user_pages_fast(addr, n,
+ iov_iter_rw(i) != WRITE ? FOLL_WRITE : 0,
+ pages);
+ if (unlikely(res < 0))
+ return res;
+ return (res == n ? len : res * PAGE_SIZE) - *start;
+ 0;
+ }))
+ return 0;
}
+EXPORT_SYMBOL(iov_iter_pin_user_pages);

ssize_t iov_iter_get_pages(struct iov_iter *i,
struct page **pages, size_t maxsize, unsigned maxpages,
@@ -1317,7 +1365,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
maxsize = i->count;

if (unlikely(iov_iter_is_pipe(i)))
- return pipe_get_pages(i, pages, maxsize, maxpages, start);
+ return pipe_get_pages(i, pages, maxsize, maxpages, start, false);
if (unlikely(iov_iter_is_discard(i)))
return -EFAULT;

@@ -1357,7 +1405,7 @@ static struct page **get_pages_array(size_t n)

static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
struct page ***pages, size_t maxsize,
- size_t *start)
+ size_t *start, bool use_pup)
{
struct page **p;
unsigned int iter_head, npages;
@@ -1380,7 +1428,7 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
p = get_pages_array(npages);
if (!p)
return -ENOMEM;
- n = __pipe_get_pages(i, maxsize, p, iter_head, start);
+ n = __pipe_get_pages(i, maxsize, p, iter_head, start, use_pup);
if (n > 0)
*pages = p;
else
@@ -1388,6 +1436,52 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
return n;
}

+ssize_t iov_iter_pin_user_pages_alloc(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ size_t *start)
+{
+ struct page **p;
+ size_t skip = i->iov_offset;
+ const struct iovec *iov;
+ struct iovec v;
+
+ if (unlikely(iov_iter_is_pipe(i)))
+ return pipe_get_pages_alloc(i, pages, maxsize, start, true);
+ if (unlikely(iov_iter_is_discard(i)))
+ return -EFAULT;
+ if (WARN_ON_ONCE(!iter_is_iovec(i)))
+ return -EFAULT;
+
+ if (unlikely(!maxsize))
+ return 0;
+ maxsize = min(maxsize, i->count);
+
+ iterate_iovec(i, maxsize, v, iov, skip, ({
+ unsigned long addr = (unsigned long)v.iov_base;
+ size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
+ int n;
+ int res;
+
+ addr &= ~(PAGE_SIZE - 1);
+ n = DIV_ROUND_UP(len, PAGE_SIZE);
+ p = get_pages_array(n);
+ if (!p)
+ return -ENOMEM;
+
+ res = pin_user_pages_fast(addr, n,
+ iov_iter_rw(i) != WRITE ? FOLL_WRITE : 0, p);
+ if (unlikely(res < 0)) {
+ kvfree(p);
+ return res;
+ }
+ *pages = p;
+ return (res == n ? len : res * PAGE_SIZE) - *start;
+ 0;
+ }))
+ return 0;
+}
+EXPORT_SYMBOL(iov_iter_pin_user_pages_alloc);
+
ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
struct page ***pages, size_t maxsize,
size_t *start)
@@ -1398,7 +1492,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
maxsize = i->count;

if (unlikely(iov_iter_is_pipe(i)))
- return pipe_get_pages_alloc(i, pages, maxsize, start);
+ return pipe_get_pages_alloc(i, pages, maxsize, start, false);
if (unlikely(iov_iter_is_discard(i)))
return -EFAULT;

--
2.28.0

2020-08-29 08:11:22

by John Hubbard

[permalink] [raw]
Subject: [PATCH v2 1/3] mm/gup: introduce pin_user_page()

pin_user_page() is the FOLL_PIN equivalent of get_page().

This was always a missing piece of the pin/unpin API calls (early
reviewers of pin_user_pages() asked about it, in fact), but until now,
it just wasn't needed. Finally though, now that the Direct IO pieces in
block/bio are about to be converted to use FOLL_PIN, it turns out that
there are some cases in which get_page() and get_user_pages_fast() were
both used. Converting those sites requires a drop-in replacement for
get_page(), which this patch supplies.

[1] and [2] provide some background about pin_user_pages() in general.

[1] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/

[2] Documentation/core-api/pin_user_pages.rst

Signed-off-by: John Hubbard <[email protected]>
---
include/linux/mm.h | 2 ++
mm/gup.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 32 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 97c83773b6f0..51c6ae4b63f7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1152,6 +1152,8 @@ static inline void get_page(struct page *page)
page_ref_inc(page);
}

+void pin_user_page(struct page *page);
+
bool __must_check try_grab_page(struct page *page, unsigned int flags);

static inline __must_check bool try_get_page(struct page *page)
diff --git a/mm/gup.c b/mm/gup.c
index ae096ea7583f..2cae5bbbc862 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -123,6 +123,36 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
return NULL;
}

+/*
+ * pin_user_page() - elevate the page refcount, and mark as FOLL_PIN
+ *
+ * This the FOLL_PIN equivalent of get_page(). It is intended for use when the
+ * page will be released via unpin_user_page().
+ */
+void pin_user_page(struct page *page)
+{
+ int refs = 1;
+
+ page = compound_head(page);
+
+ VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
+
+ if (hpage_pincount_available(page))
+ hpage_pincount_add(page, 1);
+ else
+ refs = GUP_PIN_COUNTING_BIAS;
+
+ /*
+ * Similar to try_grab_compound_head(): even if using the
+ * hpage_pincount_add/_sub() routines, be sure to
+ * *also* increment the normal page refcount field at least
+ * once, so that the page really is pinned.
+ */
+ page_ref_add(page, refs);
+
+ mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED, 1);
+}
+
/**
* try_grab_page() - elevate a page's refcount by a flag-dependent amount
*
--
2.28.0

2020-08-29 14:58:16

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] mm/gup: introduce pin_user_page()

On Sat, Aug 29, 2020 at 01:08:51AM -0700, John Hubbard wrote:
> pin_user_page() is the FOLL_PIN equivalent of get_page().
>
> This was always a missing piece of the pin/unpin API calls (early
> reviewers of pin_user_pages() asked about it, in fact), but until now,
> it just wasn't needed. Finally though, now that the Direct IO pieces in
> block/bio are about to be converted to use FOLL_PIN, it turns out that
> there are some cases in which get_page() and get_user_pages_fast() were
> both used. Converting those sites requires a drop-in replacement for
> get_page(), which this patch supplies.

I find the name really confusing vs pin_user_pages*, as it suggests as
single version of the same. It also seems partially wrong, at least
in the direct I/O case as the only thing pinned here is the zero page.

So maybe pin_kernel_page is a better name together with an explanation?
Or just pin_page?

2020-08-29 15:02:24

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] iov_iter: introduce iov_iter_pin_user_pages*() routines

On Sat, Aug 29, 2020 at 01:08:52AM -0700, John Hubbard wrote:
> The new routines are:
> iov_iter_pin_user_pages()
> iov_iter_pin_user_pages_alloc()
>
> and those correspond to these pre-existing routines:
> iov_iter_get_pages()
> iov_iter_get_pages_alloc()
>
> Also, pipe_get_pages() and related are changed so as to pass
> down a "use_pup" (use pin_user_page() instead of get_page()) bool
> argument.
>
> Unlike the iov_iter_get_pages*() routines, the
> iov_iter_pin_user_pages*() routines assert that only ITER_IOVEC or
> ITER_PIPE items are passed in. They then call pin_user_page*(), instead
> of get_user_pages_fast() or get_page().
>
> Why: In order to incrementally change Direct IO callers from calling
> get_user_pages_fast() and put_page(), over to calling
> pin_user_pages_fast() and unpin_user_page(), there need to be mid-level
> routines that specifically call one or the other systems, for both page
> acquisition and page release.
>
> Signed-off-by: John Hubbard <[email protected]>
> ---
> include/linux/uio.h | 5 ++
> lib/iov_iter.c | 110 ++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 107 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 3835a8a8e9ea..29b0504a27cc 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -229,6 +229,11 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages);
>
> const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
>
> +ssize_t iov_iter_pin_user_pages(struct iov_iter *i, struct page **pages,
> + size_t maxsize, unsigned int maxpages, size_t *start);
> +ssize_t iov_iter_pin_user_pages_alloc(struct iov_iter *i, struct page ***pages,
> + size_t maxsize, size_t *start);
> +
> static inline size_t iov_iter_count(const struct iov_iter *i)
> {
> return i->count;
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 5e40786c8f12..f25555eb3279 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1269,7 +1269,8 @@ static inline ssize_t __pipe_get_pages(struct iov_iter *i,
> size_t maxsize,
> struct page **pages,
> int iter_head,
> - size_t *start)
> + size_t *start,
> + bool use_pup)
> {
> struct pipe_inode_info *pipe = i->pipe;
> unsigned int p_mask = pipe->ring_size - 1;
> @@ -1280,7 +1281,11 @@ static inline ssize_t __pipe_get_pages(struct iov_iter *i,
> maxsize = n;
> n += *start;
> while (n > 0) {
> - get_page(*pages++ = pipe->bufs[iter_head & p_mask].page);
> + if (use_pup)
> + pin_user_page(*pages++ = pipe->bufs[iter_head & p_mask].page);
> + else
> + get_page(*pages++ = pipe->bufs[iter_head & p_mask].page);

Maybe this would become a little more readable with a local variable
and a little more verbosity:

struct page *page = pipe->bufs[iter_head & p_mask].page;

if (use_pup)
pin_user_page(page);
else
get_page(page);

*pages++ = page;

2020-08-29 15:03:22

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] bio: convert get_user_pages_fast() --> pin_user_pages_fast()

> - size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> + size = iov_iter_pin_user_pages(iter, pages, LONG_MAX, nr_pages, &offset);

This is really a comment to the previous patch, but I only spotted it
here: I think the right name is iov_iter_pin_pages, as bvec, kvec and
pipe aren't usually user pages. Same as iov_iter_get_pages vs
get_user_pages. Same for the _alloc variant.

> + * here on. It will run one unpin_user_page() against each page
> + * and will run one bio_put() against the BIO.

Nit: the ant and the will still fit on the previous line.

2020-08-29 21:58:26

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] mm/gup: introduce pin_user_page()

On 8/29/20 7:54 AM, Christoph Hellwig wrote:
> On Sat, Aug 29, 2020 at 01:08:51AM -0700, John Hubbard wrote:
>> pin_user_page() is the FOLL_PIN equivalent of get_page().
>>
>> This was always a missing piece of the pin/unpin API calls (early
>> reviewers of pin_user_pages() asked about it, in fact), but until now,
>> it just wasn't needed. Finally though, now that the Direct IO pieces in
>> block/bio are about to be converted to use FOLL_PIN, it turns out that
>> there are some cases in which get_page() and get_user_pages_fast() were
>> both used. Converting those sites requires a drop-in replacement for
>> get_page(), which this patch supplies.
>
> I find the name really confusing vs pin_user_pages*, as it suggests as
> single version of the same. It also seems partially wrong, at least
> in the direct I/O case as the only thing pinned here is the zero page.
>
> So maybe pin_kernel_page is a better name together with an explanation?
> Or just pin_page?
>

Yes. Both its behavior and use are closer to get_page() than it is to
get_user_pages(). So pin_page() is just right.


thanks,
--
John Hubbard
NVIDIA

2020-08-29 22:00:20

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] iov_iter: introduce iov_iter_pin_user_pages*() routines

On 8/29/20 7:58 AM, Christoph Hellwig wrote:
> On Sat, Aug 29, 2020 at 01:08:52AM -0700, John Hubbard wrote:
...
>> @@ -1280,7 +1281,11 @@ static inline ssize_t __pipe_get_pages(struct iov_iter *i,
>> maxsize = n;
>> n += *start;
>> while (n > 0) {
>> - get_page(*pages++ = pipe->bufs[iter_head & p_mask].page);
>> + if (use_pup)
>> + pin_user_page(*pages++ = pipe->bufs[iter_head & p_mask].page);
>> + else
>> + get_page(*pages++ = pipe->bufs[iter_head & p_mask].page);
>
> Maybe this would become a little more readable with a local variable
> and a little more verbosity:
>
> struct page *page = pipe->bufs[iter_head & p_mask].page;
>
> if (use_pup)
> pin_user_page(page);
> else
> get_page(page);
>
> *pages++ = page;
>

Yes, that is cleaner, I'll change to that, thanks.

thanks,
--
John Hubbard
NVIDIA

2020-08-29 22:09:42

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] bio: convert get_user_pages_fast() --> pin_user_pages_fast()

On 8/29/20 8:02 AM, Christoph Hellwig wrote:
>> - size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
>> + size = iov_iter_pin_user_pages(iter, pages, LONG_MAX, nr_pages, &offset);
>
> This is really a comment to the previous patch, but I only spotted it
> here: I think the right name is iov_iter_pin_pages, as bvec, kvec and
> pipe aren't usually user pages. Same as iov_iter_get_pages vs
> get_user_pages. Same for the _alloc variant.
>

Yes, it is clearly misnamed now! Will fix.

>> + * here on. It will run one unpin_user_page() against each page
>> + * and will run one bio_put() against the BIO.
>
> Nit: the ant and the will still fit on the previous line.
>

Sorry about that, *usually* my text editor does the Right Thing for
those, I must have interfered with the natural flow of things. :)


thanks,
--
John Hubbard
NVIDIA

2020-08-30 20:21:00

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] iov_iter: introduce iov_iter_pin_user_pages*() routines

On Sat, Aug 29, 2020 at 01:08:52AM -0700, John Hubbard wrote:
> The new routines are:
> iov_iter_pin_user_pages()
> iov_iter_pin_user_pages_alloc()
>
> and those correspond to these pre-existing routines:
> iov_iter_get_pages()
> iov_iter_get_pages_alloc()
>
> Also, pipe_get_pages() and related are changed so as to pass
> down a "use_pup" (use pin_user_page() instead of get_page()) bool
> argument.
>
> Unlike the iov_iter_get_pages*() routines, the
> iov_iter_pin_user_pages*() routines assert that only ITER_IOVEC or
> ITER_PIPE items are passed in. They then call pin_user_page*(), instead
> of get_user_pages_fast() or get_page().
>
> Why: In order to incrementally change Direct IO callers from calling
> get_user_pages_fast() and put_page(), over to calling
> pin_user_pages_fast() and unpin_user_page(), there need to be mid-level
> routines that specifically call one or the other systems, for both page
> acquisition and page release.

Hmm... Do you plan to kill iov_iter_get_pages* off, eventually getting
rid of that use_pup argument?

2020-08-30 20:45:48

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] iov_iter: introduce iov_iter_pin_user_pages*() routines

On 8/30/20 1:17 PM, Al Viro wrote:
...
>> Why: In order to incrementally change Direct IO callers from calling
>> get_user_pages_fast() and put_page(), over to calling
>> pin_user_pages_fast() and unpin_user_page(), there need to be mid-level
>> routines that specifically call one or the other systems, for both page
>> acquisition and page release.
>
> Hmm... Do you plan to kill iov_iter_get_pages* off, eventually getting
> rid of that use_pup argument?
>

Yes. That is definitely something I'm interested in doing, and in fact,
I started to write words to that effect into the v1 cover letter. I lost
confidence at the last minute, after poking around the remaining call
sites (which are mostly network file systems, plus notably io_uring),
and wondering if I really understood what the hell I was doing. :)

So I decided to reduce the scope of the proposal, until I got some
feedback. Which I now have!

Looking at this again, I see that there are actually *very* few
ITER_KVEC and ITER_BVEC callers, so...yes, maybe this can all be
collapsed down to calling the new functions, which would always use pup,
and lead to the simplification you asked about.

Any advance warnings, advice, design thoughts are definitely welcome
here.


thanks,
--
John Hubbard
NVIDIA