2020-08-31 07:17:44

by John Hubbard

Subject: [PATCH v3 0/3] bio: Direct IO: convert to pin_user_pages_fast()

bio: convert get_user_pages_fast() --> pin_user_pages_fast()

Change the generic block/bio Direct IO routines to acquire FOLL_PIN user
pages via the recently added routines:

iov_iter_pin_pages()
iov_iter_pin_pages_alloc()
pin_page()

This effectively converts several file systems (ext4, for example) that
use the common Direct IO routines.

Change the corresponding page release calls from put_page() to
unpin_user_page().

Change bio_release_pages() to handle FOLL_PIN pages. In fact, after this
patch, that is the *only* type of pages that bio_release_pages()
handles.

Design notes
============

Quite a few approaches have been considered over the years. This one is
inspired by Christoph Hellwig's July 2019 observation that there are
only 5 ITER_ types, and we can simplify handling of them for Direct IO
[1]. Accordingly, this patch implements the following pseudocode:

Direct IO behavior:

ITER_IOVEC:
    pin_user_pages_fast();
    break;

ITER_PIPE:
    for each page:
        pin_page();
    break;

ITER_KVEC:    // already elevated page refcount, leave alone
ITER_BVEC:    // already elevated page refcount, leave alone
ITER_DISCARD: // discard
    return -EFAULT or -EINVAL;

...which works for callers that have already sorted out which case they
are in, such as Direct IO in the block/bio layers.

Note that this does leave ITER_KVEC and ITER_BVEC unconverted, for now.
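
As a rough C sketch of that dispatch (simplified; pipe_get_pages() and
__pin_user_pages_from_iovec() are stand-in names here, not the exact
helpers that patch 2/3 adds to lib/iov_iter.c):

    ssize_t iov_iter_pin_pages(struct iov_iter *i, struct page **pages,
                               size_t maxsize, unsigned int maxpages,
                               size_t *start)
    {
        /* ITER_KVEC, ITER_BVEC, ITER_DISCARD: callers sort these out first. */
        if (unlikely(iov_iter_is_kvec(i) || iov_iter_is_bvec(i) ||
                     iov_iter_is_discard(i)))
            return -EFAULT;

        if (iov_iter_is_pipe(i)) {
            ssize_t bytes = pipe_get_pages(i, pages, maxsize, maxpages,
                                           start);

            if (bytes > 0) {
                unsigned int k, npages = DIV_ROUND_UP(*start + bytes,
                                                      PAGE_SIZE);

                /* Pipe pages already exist: mark each one as FOLL_PIN. */
                for (k = 0; k < npages; k++)
                    pin_page(pages[k]);
            }
            return bytes;
        }

        /* ITER_IOVEC: user-space memory, so take real FOLL_PIN references. */
        return __pin_user_pages_from_iovec(i, pages, maxsize, maxpages,
                                           start);
    }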

Page acquisition: the iov_iter_get_pages*() routines are at just the
right level in the call stack: the callers already know which case they
are in, so it's a small change to just drop in the replacement routines.
They are also a fan-in/fan-out point: block/bio call sites for Direct IO
funnel their page acquisitions through the iov_iter_get_pages*()
routines, which have many other callers as well. We can't convert all of
those callers at once--too many subsystems are involved, and the patch
would be too large and too risky.

Page release: separate release routines already exist (put_page() vs.
unpin_user_page()), so the converted call sites simply switch from one
to the other.
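
To make the acquire/release pairing explicit, a converted block/bio
Direct IO caller ends up following this pattern (a minimal sketch; error
handling and the I/O submission itself are omitted):

    /* Acquire: takes a FOLL_PIN reference on each page. */
    bytes = iov_iter_pin_pages(iter, pages, LONG_MAX, nr_pages, &offset);

    /* ... add the pages to a bio and submit the I/O ... */

    /* Release: must now be unpin_user_page(), never put_page(). */
    for (k = 0; k < npages; k++)
        unpin_user_page(pages[k]);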

[1] https://lore.kernel.org/kvm/[email protected]/

[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/



John Hubbard (3):
mm/gup: introduce pin_page()
iov_iter: introduce iov_iter_pin_pages*() routines
bio: convert get_user_pages_fast() --> pin_user_pages_fast()

block/bio.c | 24 ++++-----
block/blk-map.c | 6 +--
fs/direct-io.c | 28 +++++------
fs/iomap/direct-io.c | 2 +-
include/linux/mm.h | 2 +
include/linux/uio.h | 5 ++
lib/iov_iter.c | 113 ++++++++++++++++++++++++++++++++++++++++---
mm/gup.c | 33 +++++++++++++
8 files changed, 175 insertions(+), 38 deletions(-)

--
2.28.0


2020-08-31 07:17:49

by John Hubbard

Subject: [PATCH v3 3/3] bio: convert get_user_pages_fast() --> pin_user_pages_fast()

Change the generic block/bio Direct IO routines to acquire FOLL_PIN user
pages via the recently added routines:

iov_iter_pin_pages()
iov_iter_pin_pages_alloc()
pin_page()

This effectively converts several file systems (ext4, for example) that
use the common Direct IO routines.

Change the corresponding page release calls from put_page() to
unpin_user_page().

Change bio_release_pages() to handle FOLL_PIN pages. In fact, after this
patch, that is the *only* type of pages that bio_release_pages()
handles.

Design notes
============

Quite a few approaches have been considered over the years. This one is
inspired by Christoph Hellwig's July 2019 observation that there are
only 5 ITER_ types, and we can simplify handling of them for Direct IO
[1]. Accordingly, this patch implements the following pseudocode:

Direct IO behavior:

ITER_IOVEC:
    pin_user_pages_fast();
    break;

ITER_PIPE:
    for each page:
        pin_page();
    break;

ITER_KVEC:    // already elevated page refcount, leave alone
ITER_BVEC:    // already elevated page refcount, leave alone
ITER_DISCARD: // discard
    return -EFAULT or -EINVAL;

...which works for callers that have already sorted out which case they
are in, such as Direct IO in the block/bio layers.

Note that this does leave ITER_KVEC and ITER_BVEC unconverted, for now.

Page acquisition: the iov_iter_get_pages*() routines are at just the
right level in the call stack: the callers already know which case they
are in, so it's a small change to just drop in the replacement routines.
They are also a fan-in/fan-out point: block/bio call sites for Direct IO
funnel their page acquisitions through the iov_iter_get_pages*()
routines, which have many other callers as well. We can't convert all of
those callers at once--too many subsystems are involved, and the patch
would be too large and too risky.

Page release: separate release routines already exist (put_page() vs.
unpin_user_page()), so the converted call sites simply switch from one
to the other.

[1] https://lore.kernel.org/kvm/[email protected]/

[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/

Cc: Christoph Hellwig <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
block/bio.c | 24 ++++++++++++------------
block/blk-map.c | 6 +++---
fs/direct-io.c | 28 ++++++++++++++--------------
fs/iomap/direct-io.c | 2 +-
4 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index a9931f23d933..76c5843f6050 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -955,7 +955,7 @@ void bio_release_pages(struct bio *bio, bool mark_dirty)
bio_for_each_segment_all(bvec, bio, iter_all) {
if (mark_dirty && !PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
- put_page(bvec->bv_page);
+ unpin_user_page(bvec->bv_page);
}
}
EXPORT_SYMBOL_GPL(bio_release_pages);
@@ -986,9 +986,9 @@ static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
* @iter: iov iterator describing the region to be mapped
*
* Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * pages will have to be released using put_page() or unpin_user_page() when
+ * done. For multi-segment *iter, this function only adds pages from the next
+ * non-empty segment of the iov iterator.
*/
static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
{
@@ -1009,7 +1009,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);

- size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+ size = iov_iter_pin_pages(iter, pages, LONG_MAX, nr_pages, &offset);
if (unlikely(size <= 0))
return size ? size : -EFAULT;

@@ -1020,7 +1020,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)

if (__bio_try_merge_page(bio, page, len, offset, &same_page)) {
if (same_page)
- put_page(page);
+ unpin_user_page(page);
} else {
if (WARN_ON_ONCE(bio_full(bio, len)))
return -EINVAL;
@@ -1056,7 +1056,7 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);

- size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+ size = iov_iter_pin_pages(iter, pages, LONG_MAX, nr_pages, &offset);
if (unlikely(size <= 0))
return size ? size : -EFAULT;

@@ -1069,7 +1069,7 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
max_append_sectors, &same_page) != len)
return -EINVAL;
if (same_page)
- put_page(page);
+ unpin_user_page(page);
offset = 0;
}

@@ -1113,8 +1113,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
} else {
if (is_bvec)
ret = __bio_iov_bvec_add_pages(bio, iter);
- else
- ret = __bio_iov_iter_get_pages(bio, iter);
+ else
+ ret = __bio_iov_iter_get_pages(bio, iter);
}
} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));

@@ -1326,8 +1326,8 @@ void bio_set_pages_dirty(struct bio *bio)
* the BIO and re-dirty the pages in process context.
*
* It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on. It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on. It will run one unpin_user_page() against each page, and will run
+ * one bio_put() against the BIO.
*/

static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk-map.c b/block/blk-map.c
index 6e804892d5ec..2057a8e5b4bc 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -275,7 +275,7 @@ static struct bio *bio_map_user_iov(struct request_queue *q,
size_t offs, added = 0;
int npages;

- bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
+ bytes = iov_iter_pin_pages_alloc(iter, &pages, LONG_MAX, &offs);
if (unlikely(bytes <= 0)) {
ret = bytes ? bytes : -EFAULT;
goto out_unmap;
@@ -298,7 +298,7 @@ static struct bio *bio_map_user_iov(struct request_queue *q,
if (!bio_add_hw_page(q, bio, page, n, offs,
max_sectors, &same_page)) {
if (same_page)
- put_page(page);
+ unpin_user_page(page);
break;
}

@@ -312,7 +312,7 @@ static struct bio *bio_map_user_iov(struct request_queue *q,
* release the pages we didn't map into the bio, if any
*/
while (j < npages)
- put_page(pages[j++]);
+ unpin_user_page(pages[j++]);
kvfree(pages);
/* couldn't stuff something into bio? */
if (bytes)
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 183299892465..5130ba32ae91 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -170,7 +170,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
{
ssize_t ret;

- ret = iov_iter_get_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
+ ret = iov_iter_pin_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
&sdio->from);

if (ret < 0 && sdio->blocks_available && (dio->op == REQ_OP_WRITE)) {
@@ -182,7 +182,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
*/
if (dio->page_errors == 0)
dio->page_errors = ret;
- get_page(page);
+ pin_page(page);
dio->pages[0] = page;
sdio->head = 0;
sdio->tail = 1;
@@ -472,7 +472,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio)
{
while (sdio->head < sdio->tail)
- put_page(dio->pages[sdio->head++]);
+ unpin_user_page(dio->pages[sdio->head++]);
}

/*
@@ -739,7 +739,7 @@ static inline int dio_bio_add_page(struct dio_submit *sdio)
*/
if ((sdio->cur_page_len + sdio->cur_page_offset) == PAGE_SIZE)
sdio->pages_in_io--;
- get_page(sdio->cur_page);
+ pin_page(sdio->cur_page);
sdio->final_block_in_bio = sdio->cur_page_block +
(sdio->cur_page_len >> sdio->blkbits);
ret = 0;
@@ -853,13 +853,13 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
*/
if (sdio->cur_page) {
ret = dio_send_cur_page(dio, sdio, map_bh);
- put_page(sdio->cur_page);
+ unpin_user_page(sdio->cur_page);
sdio->cur_page = NULL;
if (ret)
return ret;
}

- get_page(page); /* It is in dio */
+ pin_page(page); /* It is in dio */
sdio->cur_page = page;
sdio->cur_page_offset = offset;
sdio->cur_page_len = len;
@@ -874,7 +874,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
ret = dio_send_cur_page(dio, sdio, map_bh);
if (sdio->bio)
dio_bio_submit(dio, sdio);
- put_page(sdio->cur_page);
+ unpin_user_page(sdio->cur_page);
sdio->cur_page = NULL;
}
return ret;
@@ -974,7 +974,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,

ret = get_more_blocks(dio, sdio, map_bh);
if (ret) {
- put_page(page);
+ unpin_user_page(page);
goto out;
}
if (!buffer_mapped(map_bh))
@@ -1019,7 +1019,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,

/* AKPM: eargh, -ENOTBLK is a hack */
if (dio->op == REQ_OP_WRITE) {
- put_page(page);
+ unpin_user_page(page);
return -ENOTBLK;
}

@@ -1032,7 +1032,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
if (sdio->block_in_file >=
i_size_aligned >> blkbits) {
/* We hit eof */
- put_page(page);
+ unpin_user_page(page);
goto out;
}
zero_user(page, from, 1 << blkbits);
@@ -1072,7 +1072,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
sdio->next_block_for_io,
map_bh);
if (ret) {
- put_page(page);
+ unpin_user_page(page);
goto out;
}
sdio->next_block_for_io += this_chunk_blocks;
@@ -1087,8 +1087,8 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
break;
}

- /* Drop the ref which was taken in get_user_pages() */
- put_page(page);
+ /* Drop the ref which was taken in pin_user_pages() */
+ unpin_user_page(page);
}
out:
return ret;
@@ -1327,7 +1327,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
ret2 = dio_send_cur_page(dio, &sdio, &map_bh);
if (retval == 0)
retval = ret2;
- put_page(sdio.cur_page);
+ unpin_user_page(sdio.cur_page);
sdio.cur_page = NULL;
}
if (sdio.bio)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c1aafb2ab990..ea29f0892a8c 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -194,7 +194,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;

- get_page(page);
+ pin_page(page);
__bio_add_page(bio, page, len, 0);
bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
iomap_dio_submit_bio(dio, iomap, bio, pos);
--
2.28.0

2020-08-31 07:18:04

by John Hubbard

Subject: [PATCH v3 1/3] mm/gup: introduce pin_page()

pin_page() is the FOLL_PIN equivalent of get_page().

This has always been a missing piece of the pin/unpin API (early
reviewers of pin_user_pages() asked about it, in fact), but until now it
just wasn't needed. Now that the Direct IO pieces in block/bio are about
to be converted to use FOLL_PIN, though, it turns out that some call
sites use both get_page() and get_user_pages_fast(). Converting those
sites requires a drop-in replacement for get_page(), which this patch
supplies.
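
For example, the conversion pattern at such a site (see patch 3/3 for
the real call sites in fs/direct-io.c) changes this:

    get_page(page);        /* extra reference held for the dio */
    ...
    put_page(page);

into this:

    pin_page(page);        /* extra FOLL_PIN reference held for the dio */
    ...
    unpin_user_page(page);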

[1] and [2] provide some background about the overall effort to convert
things to pin_user_page*() and unpin_user_page*().

[1] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/

[2] Documentation/core-api/pin_user_pages.rst

Cc: Christoph Hellwig <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
include/linux/mm.h | 2 ++
mm/gup.c | 33 +++++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ca6e6a81576b..24240cf66c44 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1154,6 +1154,8 @@ static inline void get_page(struct page *page)
page_ref_inc(page);
}

+void pin_page(struct page *page);
+
bool __must_check try_grab_page(struct page *page, unsigned int flags);

static inline __must_check bool try_get_page(struct page *page)
diff --git a/mm/gup.c b/mm/gup.c
index ae096ea7583f..a3a4bfae224a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -123,6 +123,39 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
return NULL;
}

+/*
+ * pin_page() - elevate the page refcount, and mark as FOLL_PIN
+ *
+ * This is the FOLL_PIN equivalent of get_page(). It is intended for use when the
+ * page will be released via unpin_user_page().
+ *
+ * Unlike pin_user_page*(), pin_page() may be used on nearly any page, not just
+ * userspace-allocated pages.
+ */
+void pin_page(struct page *page)
+{
+ int refs = 1;
+
+ page = compound_head(page);
+
+ VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
+
+ if (hpage_pincount_available(page))
+ hpage_pincount_add(page, 1);
+ else
+ refs = GUP_PIN_COUNTING_BIAS;
+
+ /*
+ * Similar to try_grab_compound_head(): even if using the
+ * hpage_pincount_add/_sub() routines, be sure to
+ * *also* increment the normal page refcount field at least
+ * once, so that the page really is pinned.
+ */
+ page_ref_add(page, refs);
+
+ mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED, 1);
+}
+
/**
* try_grab_page() - elevate a page's refcount by a flag-dependent amount
*
--
2.28.0

2020-08-31 16:54:23

by Ira Weiny

Subject: Re: [PATCH v3 3/3] bio: convert get_user_pages_fast() --> pin_user_pages_fast()

On Mon, Aug 31, 2020 at 12:14:39AM -0700, John Hubbard wrote:
> @@ -1113,8 +1113,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> } else {
> if (is_bvec)
> ret = __bio_iov_bvec_add_pages(bio, iter);
> - else
> - ret = __bio_iov_iter_get_pages(bio, iter);
> + else
> + ret = __bio_iov_iter_get_pages(bio, iter);

Why the white space change?

Ira

2020-08-31 22:40:07

by John Hubbard

Subject: Re: [PATCH v3 3/3] bio: convert get_user_pages_fast() --> pin_user_pages_fast()

On 8/31/20 9:52 AM, Ira Weiny wrote:
> On Mon, Aug 31, 2020 at 12:14:39AM -0700, John Hubbard wrote:
>> @@ -1113,8 +1113,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>> } else {
>> if (is_bvec)
>> ret = __bio_iov_bvec_add_pages(bio, iter);
>> - else
>> - ret = __bio_iov_iter_get_pages(bio, iter);
>> + else
>> + ret = __bio_iov_iter_get_pages(bio, iter);
>
> Why the white space change?
>

Yikes, that's an oversight, and...yes, it goes all the way back to v1.
Thanks for spotting it!

There is not supposed to be any diff at all, in that region of code.
I'll restore the hunk to its rightful, undisturbed state, in v4.


thanks,
--
John Hubbard
NVIDIA