2022-02-27 10:27:38

by jhubbard.send.patches

[permalink] [raw]
Subject: [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN

From: John Hubbard <[email protected]>

Hi,

The feedback on the RFC [1] prompted me to convert the core Direct IO
subsystem all at once. The key differences here, as compared to the RFC,
are:

* no dio_w_*() wrapper routines,

* no CONFIG parameter; and

* new iov_iter_pin_pages*() routines that pin pages without
affecting other callers of iov_iter_get_pages*(). Those other
callers (ceph, rds, net, ...) can be converted separately.

Also, many pre-existing callers of unpin_user_pages_dirty_lock() are
wrong, and this series adds a few more callers. So readers may naturally
wonder about that. I recently had a very productive discussion with Ted
Ts'o, who suggested a way to fix the problem, and I'm going to implement
it, next. However, I think it's best to do that fix separately from
this, probably layered on top, although it could go either before or
after.

As part of fixing the "get_user_pages() + file-backed memory" problem
[2], and to support various COW-related fixes as well [3], we need to
convert the Direct IO code from get_user_pages_fast(), to
pin_user_pages_fast(). Because pin_user_pages*() calls require a
corresponding call to unpin_user_page(), the conversion is more
elaborate than just substitution.

In the main patch (patch 4) I'm a little concerned about the
bio_map_user_iov() changes, because the sole caller,
blk_rq_map_user_iov(), has either a direct mapped case or a copy from
user case, and I'm still not sure that these are properly kept separate,
from an unpin pages point of view. So a close look there by reviewers
would be welcome.

Testing: this needs lots of filesystem testing.

In this patchset:

Patches 1, 2: provide a few new routines that will be used by
conversion: pin_user_page(), iov_iter_pin_pages(),
iov_iter_pin_pages_alloc().

Patch 3: provide a few asserts that only user space pages are being
passed in for Direct IO. (This patch could be folded into another
patch.)

Patch 4: Convert all Direct IO callers that use iomap, or
blockdev_direct_IO(), or bio_iov_iter_get_pages().

Patch 5, 6: convert a few other callers to the new system: NFS-Direct,
and fuse.

This is based on linux-next (next-20220225). I've also stashed it here:

https://github.com/johnhubbard/linux bio_pup_next_20220225


[1] https://lore.kernel.org/r/[email protected]

[2] https://lwn.net/Articles/753027/ "The trouble with get_user_pages()"

[3] https://lore.kernel.org/all/[email protected]/T/#u
(David Hildenbrand's mm/COW fixes)

John Hubbard (6):
mm/gup: introduce pin_user_page()
iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages
block, fs: assert that key paths use iovecs, and nothing else
block, bio, fs: convert most filesystems to pin_user_pages_fast()
NFS: direct-io: convert to FOLL_PIN pages
fuse: convert direct IO paths to use FOLL_PIN

block/bio.c | 29 ++++++++--------
block/blk-map.c | 6 ++--
fs/direct-io.c | 28 ++++++++--------
fs/fuse/dev.c | 7 ++--
fs/fuse/file.c | 38 +++++----------------
fs/iomap/direct-io.c | 2 +-
fs/nfs/direct.c | 15 +++------
include/linux/mm.h | 1 +
include/linux/uio.h | 4 +++
lib/iov_iter.c | 78 ++++++++++++++++++++++++++++++++++++++++++++
mm/gup.c | 34 +++++++++++++++++++
11 files changed, 170 insertions(+), 72 deletions(-)


base-commit: 06aeb1495c39c86ccfaf1adadc1d2200179f16eb
--
2.35.1


2022-02-27 10:54:09

by jhubbard.send.patches

[permalink] [raw]
Subject: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages

From: John Hubbard <[email protected]>

Convert nfs-direct to use pin_user_pages_fast(), and unpin_user_pages(),
in place of get_user_pages_fast() and put_page().

Signed-off-by: John Hubbard <[email protected]>
---
fs/nfs/direct.c | 15 ++++-----------
1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index eabfdab543c8..42111a75c0f7 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -177,13 +177,6 @@ ssize_t nfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
return nfs_file_direct_write(iocb, iter);
}

-static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
-{
- unsigned int i;
- for (i = 0; i < npages; i++)
- put_page(pages[i]);
-}
-
void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
struct nfs_direct_req *dreq)
{
@@ -367,7 +360,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
size_t pgbase;
unsigned npages, i;

- result = iov_iter_get_pages_alloc(iter, &pagevec,
+ result = iov_iter_pin_pages_alloc(iter, &pagevec,
rsize, &pgbase);
if (result < 0)
break;
@@ -398,7 +391,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
pos += req_len;
dreq->bytes_left -= req_len;
}
- nfs_direct_release_pages(pagevec, npages);
+ unpin_user_pages(pagevec, npages);
kvfree(pagevec);
if (result < 0)
break;
@@ -811,7 +804,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
size_t pgbase;
unsigned npages, i;

- result = iov_iter_get_pages_alloc(iter, &pagevec,
+ result = iov_iter_pin_pages_alloc(iter, &pagevec,
wsize, &pgbase);
if (result < 0)
break;
@@ -850,7 +843,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
pos += req_len;
dreq->bytes_left -= req_len;
}
- nfs_direct_release_pages(pagevec, npages);
+ unpin_user_pages(pagevec, npages);
kvfree(pagevec);
if (result < 0)
break;
--
2.35.1

2022-02-27 12:08:13

by jhubbard.send.patches

[permalink] [raw]
Subject: [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else

From: John Hubbard <[email protected]>

Upcoming changes to Direct IO will change it from acquiring pages via
get_user_pages_fast(), to calling pin_user_pages_fast() instead.

Place a few assertions at key points, that the pages are IOVEC (user
pages), to enforce the assumptions that there are no kernel or pipe or
other odd variations being passed.

Signed-off-by: John Hubbard <[email protected]>
---
block/bio.c | 4 ++++
fs/direct-io.c | 2 ++
2 files changed, 6 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index b15f5466ce08..4679d6539e2d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1167,6 +1167,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);

+ WARN_ON_ONCE(!iter_is_iovec(iter));
+
size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
if (unlikely(size <= 0))
return size ? size : -EFAULT;
@@ -1217,6 +1219,8 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);

+ WARN_ON_ONCE(!iter_is_iovec(iter));
+
size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
if (unlikely(size <= 0))
return size ? size : -EFAULT;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 38bca4980a1c..7dbbbfef300d 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -169,6 +169,8 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
{
ssize_t ret;

+ WARN_ON_ONCE(!iter_is_iovec(sdio->iter));
+
ret = iov_iter_get_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
&sdio->from);

--
2.35.1

2022-02-27 12:34:54

by jhubbard.send.patches

[permalink] [raw]
Subject: [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages

From: John Hubbard <[email protected]>

Provide two new specialized routines that only handle user space pages,
and invoke pin_user_pages_fast() on them: iov_iter_pin_pages() and
iov_iter_pin_pages_alloc().

This allows subsequent patches to convert various callers of
iov_iter_get_pages*(), to the new calls, without having to attempt a
mass conversion all at once.

Signed-off-by: John Hubbard <[email protected]>
---
include/linux/uio.h | 4 +++
lib/iov_iter.c | 78 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 82 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 739285fe5a2f..208020c2b75a 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -236,6 +236,10 @@ ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start);
ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
size_t maxsize, size_t *start);
+ssize_t iov_iter_pin_pages(struct iov_iter *i, struct page **pages,
+ size_t maxsize, unsigned int maxpages, size_t *start);
+ssize_t iov_iter_pin_pages_alloc(struct iov_iter *i, struct page ***pages,
+ size_t maxsize, size_t *start);
int iov_iter_npages(const struct iov_iter *i, int maxpages);
void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state);

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 6dd5330f7a99..e64e8e4edd0c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1560,6 +1560,41 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
}
EXPORT_SYMBOL(iov_iter_get_pages);

+ssize_t iov_iter_pin_pages(struct iov_iter *i,
+ struct page **pages, size_t maxsize, unsigned int maxpages,
+ size_t *start)
+{
+ size_t len;
+ int n, res;
+
+ if (maxsize > i->count)
+ maxsize = i->count;
+ if (!maxsize)
+ return 0;
+
+ WARN_ON_ONCE(!iter_is_iovec(i));
+
+ if (likely(iter_is_iovec(i))) {
+ unsigned int gup_flags = 0;
+ unsigned long addr;
+
+ if (iov_iter_rw(i) != WRITE)
+ gup_flags |= FOLL_WRITE;
+ if (i->nofault)
+ gup_flags |= FOLL_NOFAULT;
+
+ addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
+ n = DIV_ROUND_UP(len, PAGE_SIZE);
+ res = pin_user_pages_fast(addr, n, gup_flags, pages);
+ if (unlikely(res <= 0))
+ return res;
+ return (res == n ? len : res * PAGE_SIZE) - *start;
+ }
+
+ return -EFAULT;
+}
+EXPORT_SYMBOL(iov_iter_pin_pages);
+
static struct page **get_pages_array(size_t n)
{
return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
@@ -1696,6 +1731,49 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
}
EXPORT_SYMBOL(iov_iter_get_pages_alloc);

+ssize_t iov_iter_pin_pages_alloc(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ size_t *start)
+{
+ struct page **p;
+ size_t len;
+ int n, res;
+
+ if (maxsize > i->count)
+ maxsize = i->count;
+ if (!maxsize)
+ return 0;
+
+ WARN_ON_ONCE(!iter_is_iovec(i));
+
+ if (likely(iter_is_iovec(i))) {
+ unsigned int gup_flags = 0;
+ unsigned long addr;
+
+ if (iov_iter_rw(i) != WRITE)
+ gup_flags |= FOLL_WRITE;
+ if (i->nofault)
+ gup_flags |= FOLL_NOFAULT;
+
+ addr = first_iovec_segment(i, &len, start, maxsize, ~0U);
+ n = DIV_ROUND_UP(len, PAGE_SIZE);
+ p = get_pages_array(n);
+ if (!p)
+ return -ENOMEM;
+ res = pin_user_pages_fast(addr, n, gup_flags, p);
+ if (unlikely(res <= 0)) {
+ kvfree(p);
+ *pages = NULL;
+ return res;
+ }
+ *pages = p;
+ return (res == n ? len : res * PAGE_SIZE) - *start;
+ }
+
+ return -EFAULT;
+}
+EXPORT_SYMBOL(iov_iter_pin_pages_alloc);
+
size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
struct iov_iter *i)
{
--
2.35.1

2022-02-27 13:53:37

by jhubbard.send.patches

[permalink] [raw]
Subject: [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast()

From: John Hubbard <[email protected]>

Use pin_user_pages_fast(), pin_user_page(), and unpin_user_page() calls,
in place of get_user_pages_fast(), get_page() and put_page().

This converts the Direct IO parts of most filesystems over to using
FOLL_PIN (pin_user_page*()) page pinning.

Signed-off-by: John Hubbard <[email protected]>
---
block/bio.c | 25 ++++++++++++-------------
block/blk-map.c | 6 +++---
fs/direct-io.c | 26 +++++++++++++-------------
fs/iomap/direct-io.c | 2 +-
4 files changed, 29 insertions(+), 30 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4679d6539e2d..d078f992a9c9 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1102,7 +1102,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
bio_for_each_segment_all(bvec, bio, iter_all) {
if (mark_dirty && !PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
- put_page(bvec->bv_page);
+ unpin_user_page(bvec->bv_page);
}
}
EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1130,10 +1130,9 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)

static void bio_put_pages(struct page **pages, size_t size, size_t off)
{
- size_t i, nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);
+ size_t nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);

- for (i = 0; i < nr; i++)
- put_page(pages[i]);
+ unpin_user_pages(pages, nr);
}

#define PAGE_PTRS_PER_BVEC (sizeof(struct bio_vec) / sizeof(struct page *))
@@ -1144,9 +1143,9 @@ static void bio_put_pages(struct page **pages, size_t size, size_t off)
* @iter: iov iterator describing the region to be mapped
*
* Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * pages will have to be released using unpin_user_page() when done. For
+ * multi-segment *iter, this function only adds pages from the next non-empty
+ * segment of the iov iterator.
*/
static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
{
@@ -1169,7 +1168,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)

WARN_ON_ONCE(!iter_is_iovec(iter));

- size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+ size = iov_iter_pin_pages(iter, pages, LONG_MAX, nr_pages, &offset);
if (unlikely(size <= 0))
return size ? size : -EFAULT;

@@ -1180,7 +1179,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)

if (__bio_try_merge_page(bio, page, len, offset, &same_page)) {
if (same_page)
- put_page(page);
+ unpin_user_page(page);
} else {
if (WARN_ON_ONCE(bio_full(bio, len))) {
bio_put_pages(pages + i, left, offset);
@@ -1221,7 +1220,7 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)

WARN_ON_ONCE(!iter_is_iovec(iter));

- size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+ size = iov_iter_pin_pages(iter, pages, LONG_MAX, nr_pages, &offset);
if (unlikely(size <= 0))
return size ? size : -EFAULT;

@@ -1237,7 +1236,7 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
break;
}
if (same_page)
- put_page(page);
+ unpin_user_page(page);
offset = 0;
}

@@ -1434,8 +1433,8 @@ void bio_set_pages_dirty(struct bio *bio)
* the BIO and re-dirty the pages in process context.
*
* It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on. It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on. It will run one unpin_user_page() against each page and will run
+ * one bio_put() against the BIO.
*/

static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk-map.c b/block/blk-map.c
index c7f71d83eff1..ce450683994c 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -252,7 +252,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
size_t offs, added = 0;
int npages;

- bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
+ bytes = iov_iter_pin_pages_alloc(iter, &pages, LONG_MAX, &offs);
if (unlikely(bytes <= 0)) {
ret = bytes ? bytes : -EFAULT;
goto out_unmap;
@@ -275,7 +275,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
if (!bio_add_hw_page(rq->q, bio, page, n, offs,
max_sectors, &same_page)) {
if (same_page)
- put_page(page);
+ unpin_user_page(page);
break;
}

@@ -289,7 +289,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
* release the pages we didn't map into the bio, if any
*/
while (j < npages)
- put_page(pages[j++]);
+ unpin_user_page(pages[j++]);
kvfree(pages);
/* couldn't stuff something into bio? */
if (bytes)
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7dbbbfef300d..815407c26f57 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -171,7 +171,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)

WARN_ON_ONCE(!iter_is_iovec(sdio->iter));

- ret = iov_iter_get_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
+ ret = iov_iter_pin_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
&sdio->from);

if (ret < 0 && sdio->blocks_available && (dio->op == REQ_OP_WRITE)) {
@@ -183,7 +183,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
*/
if (dio->page_errors == 0)
dio->page_errors = ret;
- get_page(page);
+ pin_user_page(page);
dio->pages[0] = page;
sdio->head = 0;
sdio->tail = 1;
@@ -452,7 +452,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio)
{
while (sdio->head < sdio->tail)
- put_page(dio->pages[sdio->head++]);
+ unpin_user_page(dio->pages[sdio->head++]);
}

/*
@@ -717,7 +717,7 @@ static inline int dio_bio_add_page(struct dio_submit *sdio)
*/
if ((sdio->cur_page_len + sdio->cur_page_offset) == PAGE_SIZE)
sdio->pages_in_io--;
- get_page(sdio->cur_page);
+ pin_user_page(sdio->cur_page);
sdio->final_block_in_bio = sdio->cur_page_block +
(sdio->cur_page_len >> sdio->blkbits);
ret = 0;
@@ -832,13 +832,13 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
*/
if (sdio->cur_page) {
ret = dio_send_cur_page(dio, sdio, map_bh);
- put_page(sdio->cur_page);
+ unpin_user_page(sdio->cur_page);
sdio->cur_page = NULL;
if (ret)
return ret;
}

- get_page(page); /* It is in dio */
+ pin_user_page(page); /* It is in dio */
sdio->cur_page = page;
sdio->cur_page_offset = offset;
sdio->cur_page_len = len;
@@ -853,7 +853,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
ret = dio_send_cur_page(dio, sdio, map_bh);
if (sdio->bio)
dio_bio_submit(dio, sdio);
- put_page(sdio->cur_page);
+ unpin_user_page(sdio->cur_page);
sdio->cur_page = NULL;
}
return ret;
@@ -953,7 +953,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,

ret = get_more_blocks(dio, sdio, map_bh);
if (ret) {
- put_page(page);
+ unpin_user_page(page);
goto out;
}
if (!buffer_mapped(map_bh))
@@ -998,7 +998,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,

/* AKPM: eargh, -ENOTBLK is a hack */
if (dio->op == REQ_OP_WRITE) {
- put_page(page);
+ unpin_user_page(page);
return -ENOTBLK;
}

@@ -1011,7 +1011,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
if (sdio->block_in_file >=
i_size_aligned >> blkbits) {
/* We hit eof */
- put_page(page);
+ unpin_user_page(page);
goto out;
}
zero_user(page, from, 1 << blkbits);
@@ -1051,7 +1051,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
sdio->next_block_for_io,
map_bh);
if (ret) {
- put_page(page);
+ unpin_user_page(page);
goto out;
}
sdio->next_block_for_io += this_chunk_blocks;
@@ -1067,7 +1067,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
}

/* Drop the ref which was taken in get_user_pages() */
- put_page(page);
+ unpin_user_page(page);
}
out:
return ret;
@@ -1289,7 +1289,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
ret2 = dio_send_cur_page(dio, &sdio, &map_bh);
if (retval == 0)
retval = ret2;
- put_page(sdio.cur_page);
+ unpin_user_page(sdio.cur_page);
sdio.cur_page = NULL;
}
if (sdio.bio)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 67cf9c16f80c..d0986a31a9d1 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -192,7 +192,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;

- get_page(page);
+ pin_user_page(page);
__bio_add_page(bio, page, len, 0);
iomap_dio_submit_bio(iter, dio, bio, pos);
}
--
2.35.1

2022-02-27 15:48:05

by jhubbard.send.patches

[permalink] [raw]
Subject: [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN

From: John Hubbard <[email protected]>

Convert the fuse filesystem to support the new iov_iter_get_pages()
behavior. That routine now invokes pin_user_pages_fast(), which means
that such pages must be released via unpin_user_page(), rather than via
put_page().

This commit also removes any possibility of kernel pages being handled,
in the fuse_get_user_pages() call. Although this may seem like a steep
price to pay, Christoph Hellwig actually recommended it a few years ago
for nearly the same situation [1].

[1] https://lore.kernel.org/kvm/[email protected]/

Signed-off-by: John Hubbard <[email protected]>
---
fs/fuse/dev.c | 7 +++++--
fs/fuse/file.c | 38 +++++++++-----------------------------
2 files changed, 14 insertions(+), 31 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e1b4a846c90d..9db85c4d549a 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -675,7 +675,10 @@ static void fuse_copy_finish(struct fuse_copy_state *cs)
flush_dcache_page(cs->pg);
set_page_dirty_lock(cs->pg);
}
- put_page(cs->pg);
+ if (cs->pipebufs)
+ put_page(cs->pg);
+ else
+ unpin_user_page(cs->pg);
}
cs->pg = NULL;
}
@@ -730,7 +733,7 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
}
} else {
size_t off;
- err = iov_iter_get_pages(cs->iter, &page, PAGE_SIZE, 1, &off);
+ err = iov_iter_pin_pages(cs->iter, &page, PAGE_SIZE, 1, &off);
if (err < 0)
return err;
BUG_ON(!err);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 94747bac3489..ecfa5bdde919 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -611,18 +611,6 @@ void fuse_read_args_fill(struct fuse_io_args *ia, struct file *file, loff_t pos,
args->out_args[0].size = count;
}

-static void fuse_release_user_pages(struct fuse_args_pages *ap,
- bool should_dirty)
-{
- unsigned int i;
-
- for (i = 0; i < ap->num_pages; i++) {
- if (should_dirty)
- set_page_dirty_lock(ap->pages[i]);
- put_page(ap->pages[i]);
- }
-}
-
static void fuse_io_release(struct kref *kref)
{
kfree(container_of(kref, struct fuse_io_priv, refcnt));
@@ -720,7 +708,8 @@ static void fuse_aio_complete_req(struct fuse_mount *fm, struct fuse_args *args,
struct fuse_io_priv *io = ia->io;
ssize_t pos = -1;

- fuse_release_user_pages(&ia->ap, io->should_dirty);
+ unpin_user_pages_dirty_lock(ia->ap.pages, ia->ap.num_pages,
+ io->should_dirty);

if (err) {
/* Nothing */
@@ -1382,25 +1371,14 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
size_t nbytes = 0; /* # bytes already packed in req */
ssize_t ret = 0;

- /* Special case for kernel I/O: can copy directly into the buffer */
- if (iov_iter_is_kvec(ii)) {
- unsigned long user_addr = fuse_get_user_addr(ii);
- size_t frag_size = fuse_get_frag_size(ii, *nbytesp);
-
- if (write)
- ap->args.in_args[1].value = (void *) user_addr;
- else
- ap->args.out_args[0].value = (void *) user_addr;
-
- iov_iter_advance(ii, frag_size);
- *nbytesp = frag_size;
- return 0;
- }
+ /* Only user space buffers are allowed with fuse Direct IO. */
+ if (WARN_ON_ONCE(!iter_is_iovec(ii)))
+ return -EOPNOTSUPP;

while (nbytes < *nbytesp && ap->num_pages < max_pages) {
unsigned npages;
size_t start;
- ret = iov_iter_get_pages(ii, &ap->pages[ap->num_pages],
+ ret = iov_iter_pin_pages(ii, &ap->pages[ap->num_pages],
*nbytesp - nbytes,
max_pages - ap->num_pages,
&start);
@@ -1484,7 +1462,9 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
}

if (!io->async || nres < 0) {
- fuse_release_user_pages(&ia->ap, io->should_dirty);
+ unpin_user_pages_dirty_lock(ia->ap.pages,
+ ia->ap.num_pages,
+ io->should_dirty);
fuse_io_free(ia);
}
ia = NULL;
--
2.35.1

2022-02-27 22:53:54

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages

On 2/27/22 13:57, Jens Axboe wrote:
...
>> + return res;
>> + return (res == n ? len : res * PAGE_SIZE) - *start;
>
> Trying to be clever like that just makes the code a lot less readable. I
> should not have to reason about a return value. Same in the other
> function.
>

No argument there. This was shamelessly lifted from
iov_iter_get_pages(), and I initially calculated that keeping it
identical to that known-good code, where possible, was better than
fixing it up.

However, I'll go ahead and simplify it--with pleasure--based on this
feedback.


thanks,
--
John Hubbard
NVIDIA

2022-02-27 23:00:22

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast()

On 2/27/22 13:59, Jens Axboe wrote:
> On 2/27/22 2:34 AM, [email protected] wrote:
>> From: John Hubbard <[email protected]>
>>
>> Use pin_user_pages_fast(), pin_user_page(), and unpin_user_page() calls,
>> in place of get_user_pages_fast(), get_page() and put_page().
>>
>> This converts the Direct IO parts of most filesystems over to using
>> FOLL_PIN (pin_user_page*()) page pinning.
>
> The commit message needs to explain why a change is being made, not what
> is being done. The latter I can just look at the code for.
>
> Didn't even find it in in your cover letter, had to go to the original
> posting for that.
>

Sorry about that, I'll sit down and write up something thorough for this
patch. It is after the central thing, and it got neglected here.

thanks,
--
John Hubbard
NVIDIA

2022-02-28 00:03:28

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else

On Sun, Feb 27, 2022 at 01:34:31AM -0800, [email protected] wrote:
> From: John Hubbard <[email protected]>
>
> Upcoming changes to Direct IO will change it from acquiring pages via
> get_user_pages_fast(), to calling pin_user_pages_fast() instead.
>
> Place a few assertions at key points, that the pages are IOVEC (user
> pages), to enforce the assumptions that there are no kernel or pipe or
> other odd variations being passed.

Umm... And what should happen when O_DIRECT file gets passed to splice()?

2022-02-28 00:54:59

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else

> diff --git a/block/bio.c b/block/bio.c
> index b15f5466ce08..4679d6539e2d 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1167,6 +1167,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
> pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
>
> + WARN_ON_ONCE(!iter_is_iovec(iter));
> +

If these make sense, why aren't they also returning an error?

--
Jens Axboe

2022-02-28 02:18:35

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast()

On 2/27/22 2:34 AM, [email protected] wrote:
> From: John Hubbard <[email protected]>
>
> Use pin_user_pages_fast(), pin_user_page(), and unpin_user_page() calls,
> in place of get_user_pages_fast(), get_page() and put_page().
>
> This converts the Direct IO parts of most filesystems over to using
> FOLL_PIN (pin_user_page*()) page pinning.

The commit message needs to explain why a change is being made, not what
is being done. The latter I can just look at the code for.

Didn't even find it in in your cover letter, had to go to the original
posting for that.

--
Jens Axboe

2022-02-28 06:30:45

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else

On 2/27/22 13:58, Jens Axboe wrote:
>> diff --git a/block/bio.c b/block/bio.c
>> index b15f5466ce08..4679d6539e2d 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -1167,6 +1167,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>> BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
>> pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
>>
>> + WARN_ON_ONCE(!iter_is_iovec(iter));
>> +
>
> If these make sense, why aren't they also returning an error?
>

Oops. Got caught up in the luxury of local testing and watching the
logs, but also returning errors is of course the right answer. Will fix.

thanks,
--
John Hubbard
NVIDIA

2022-02-28 09:58:22

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else

On 2/27/22 14:15, Al Viro wrote:
> On Sun, Feb 27, 2022 at 01:34:31AM -0800, [email protected] wrote:
>> From: John Hubbard <[email protected]>
>>
>> Upcoming changes to Direct IO will change it from acquiring pages via
>> get_user_pages_fast(), to calling pin_user_pages_fast() instead.
>>
>> Place a few assertions at key points, that the pages are IOVEC (user
>> pages), to enforce the assumptions that there are no kernel or pipe or
>> other odd variations being passed.
>
> Umm... And what should happen when O_DIRECT file gets passed to splice()?

OK, right, this is not going to work as-is.

Any of the remaining iov_iter_get_pages*() callers that do direct IO
(vmsplice, ceph, ...) will end up calling bio_release_pages(), which is
now unconditionally calling unpin_user_pages(). So this is just
completely broken.

And so, once again, I'm back to, "must convert the whole pile of
iov_iter_get_page*() callers at once".

Back to the drawing board.

thanks,
--
John Hubbard
NVIDIA

2022-02-28 10:50:58

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else

On 2/27/22 14:15, Al Viro wrote:
> On Sun, Feb 27, 2022 at 01:34:31AM -0800, [email protected] wrote:
>> From: John Hubbard <[email protected]>
>>
>> Upcoming changes to Direct IO will change it from acquiring pages via
>> get_user_pages_fast(), to calling pin_user_pages_fast() instead.
>>
>> Place a few assertions at key points, that the pages are IOVEC (user
>> pages), to enforce the assumptions that there are no kernel or pipe or
>> other odd variations being passed.
>
> Umm... And what should happen when O_DIRECT file gets passed to splice()?

Hi Al,

First of all, full disclosure: I still haven't worked through how
splice() handles pages in all cases. I was hoping to defer it, by
limiting this series to not *all* of the original callers of
iov_iter_get_pages*().

This series leaves the splice() code pointing to iov_iter_get_pages(),
but maybe that's not possible after all.

Any advice or ideas about how to solve the O_DIRECT-to-splice() is very
welcome.


thanks,
--
John Hubbard
NVIDIA

2022-02-28 10:51:27

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages

> +ssize_t iov_iter_pin_pages(struct iov_iter *i,
> + struct page **pages, size_t maxsize, unsigned int maxpages,
> + size_t *start)
> +{
> + size_t len;
> + int n, res;
> +
> + if (maxsize > i->count)
> + maxsize = i->count;
> + if (!maxsize)
> + return 0;
> +
> + WARN_ON_ONCE(!iter_is_iovec(i));
> +
> + if (likely(iter_is_iovec(i))) {
> + unsigned int gup_flags = 0;
> + unsigned long addr;
> +
> + if (iov_iter_rw(i) != WRITE)
> + gup_flags |= FOLL_WRITE;
> + if (i->nofault)
> + gup_flags |= FOLL_NOFAULT;
> +
> + addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
> + n = DIV_ROUND_UP(len, PAGE_SIZE);
> + res = pin_user_pages_fast(addr, n, gup_flags, pages);
> + if (unlikely(res <= 0))
> + return res;
> + return (res == n ? len : res * PAGE_SIZE) - *start;

Trying to be clever like that just makes the code a lot less readable. I
should not have to reason about a return value. Same in the other
function.

--
Jens Axboe

2022-02-28 17:40:04

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN

On Sun, 27 Feb 2022 at 10:34, <[email protected]> wrote:
>
> From: John Hubbard <[email protected]>
>
> Convert the fuse filesystem to support the new iov_iter_get_pages()
> behavior. That routine now invokes pin_user_pages_fast(), which means
> that such pages must be released via unpin_user_page(), rather than via
> put_page().
>
> This commit also removes any possibility of kernel pages being handled,
> in the fuse_get_user_pages() call. Although this may seem like a steep
> price to pay, Christoph Hellwig actually recommended it a few years ago
> for nearly the same situation [1].

This might work for O_DIRECT, but fuse has this mode of operation
which turns normal "buffered" I/O into direct I/O. And that in turn
will break execve of such files.

So AFAICS we need to keep kvec handing in some way.

Thanks,
Miklos

2022-02-28 21:21:33

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN

On 2/28/22 07:59, Miklos Szeredi wrote:
> On Sun, 27 Feb 2022 at 10:34, <[email protected]> wrote:
>>
>> From: John Hubbard <[email protected]>
>>
>> Convert the fuse filesystem to support the new iov_iter_get_pages()
>> behavior. That routine now invokes pin_user_pages_fast(), which means
>> that such pages must be released via unpin_user_page(), rather than via
>> put_page().
>>
>> This commit also removes any possibility of kernel pages being handled,
>> in the fuse_get_user_pages() call. Although this may seem like a steep
>> price to pay, Christoph Hellwig actually recommended it a few years ago
>> for nearly the same situation [1].
>
> This might work for O_DIRECT, but fuse has this mode of operation
> which turns normal "buffered" I/O into direct I/O. And that in turn
> will break execve of such files.
>
> So AFAICS we need to keep kvec handing in some way.
>

Thanks for bringing that up! Do you have any hints for me, to jump start
a deeper look? And especially, sample programs that exercise this?


thanks,
--
John Hubbard
NVIDIA

2022-02-28 22:57:24

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages

On 2/27/22 13:57, Jens Axboe wrote:
>> +ssize_t iov_iter_pin_pages(struct iov_iter *i,
>> + struct page **pages, size_t maxsize, unsigned int maxpages,
>> + size_t *start)
>> +{
>> + size_t len;
>> + int n, res;
>> +
>> + if (maxsize > i->count)
>> + maxsize = i->count;
>> + if (!maxsize)
>> + return 0;
>> +
>> + WARN_ON_ONCE(!iter_is_iovec(i));
>> +
>> + if (likely(iter_is_iovec(i))) {
>> + unsigned int gup_flags = 0;
>> + unsigned long addr;
>> +
>> + if (iov_iter_rw(i) != WRITE)
>> + gup_flags |= FOLL_WRITE;
>> + if (i->nofault)
>> + gup_flags |= FOLL_NOFAULT;
>> +
>> + addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
>> + n = DIV_ROUND_UP(len, PAGE_SIZE);
>> + res = pin_user_pages_fast(addr, n, gup_flags, pages);
>> + if (unlikely(res <= 0))
>> + return res;
>> + return (res == n ? len : res * PAGE_SIZE) - *start;
>
> Trying to be clever like that just makes the code a lot less readable. I
> should not have to reason about a return value. Same in the other
> function.
>

Here is a differential patch on top of this one, and only showing one of
the two routines. How does this direction look to you?


diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index e64e8e4edd0c..8e96f1e9ebc6 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1588,7 +1588,17 @@ ssize_t iov_iter_pin_pages(struct iov_iter *i,
res = pin_user_pages_fast(addr, n, gup_flags, pages);
if (unlikely(res <= 0))
return res;
- return (res == n ? len : res * PAGE_SIZE) - *start;
+
+ /* Cap len at the number of pages that were actually pinned: */
+ if (res < n)
+ len = res * PAGE_SIZE;
+
+ /*
+ * The return value is the amount pinned in bytes that the
+ * caller will actually use. So, reduce it by the offset into
+ * the first page:
+ */
+ return len - *start;
}

return -EFAULT;

thanks,
--
John Hubbard
NVIDIA

2022-03-01 14:53:54

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN

On Mon, 28 Feb 2022 at 22:16, John Hubbard <[email protected]> wrote:
>
> On 2/28/22 07:59, Miklos Szeredi wrote:
> > On Sun, 27 Feb 2022 at 10:34, <[email protected]> wrote:
> >>
> >> From: John Hubbard <[email protected]>
> >>
> >> Convert the fuse filesystem to support the new iov_iter_get_pages()
> >> behavior. That routine now invokes pin_user_pages_fast(), which means
> >> that such pages must be released via unpin_user_page(), rather than via
> >> put_page().
> >>
> >> This commit also removes any possibility of kernel pages being handled,
> >> in the fuse_get_user_pages() call. Although this may seem like a steep
> >> price to pay, Christoph Hellwig actually recommended it a few years ago
> >> for nearly the same situation [1].
> >
> > This might work for O_DIRECT, but fuse has this mode of operation
> > which turns normal "buffered" I/O into direct I/O. And that in turn
> > will break execve of such files.
> >
> > So AFAICS we need to keep kvec handing in some way.
> >
>
> Thanks for bringing that up! Do you have any hints for me, to jump start

How about just leaving that special code in place? It bypasses page
refs and directly copies to the kernel buffer, so it should not have
any affect on the user page code.

> a deeper look? And especially, sample programs that exercise this?

Here's one:
# uncomment as appropriate:
#sudo dnf install fuse3-devel
#sudo apt install libfuse3-dev

cat <<EOF > fuse-dio-exec.c
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <errno.h>
#include <unistd.h>

static const char *filename = "/bin/true";

static int test_getattr(const char *path, struct stat *stbuf,
struct fuse_file_info *fi)
{
return lstat(filename, stbuf) == -1 ? -errno : 0;
}

static int test_open(const char *path, struct fuse_file_info *fi)
{
int res;

res = open(filename, fi->flags);
if (res == -1)
return -errno;

fi->fh = res;
fi->direct_io = 1;
return 0;
}

static int test_read(const char *path, char *buf, size_t size, off_t offset,
struct fuse_file_info *fi)
{
int res = pread(fi->fh, buf, size, offset);
return res == -1 ? -errno : res;
}

static int test_release(const char *path, struct fuse_file_info *fi)
{
close(fi->fh);
return 0;
}

static const struct fuse_operations test_oper = {
.getattr = test_getattr,
.open = test_open,
.release = test_release,
.read = test_read,
};

int main(int argc, char *argv[])
{
return fuse_main(argc, argv, &test_oper, NULL);
}
EOF

gcc -W fuse-dio-exec.c `pkg-config fuse3 --cflags --libs` -o fuse-dio-exec
touch /tmp/true

#run test:
./fuse-dio-exec /tmp/true
/tmp/true
umount /tmp/true

2022-03-02 09:58:38

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN

On 3/1/22 01:41, Miklos Szeredi wrote:
...
>>> This might work for O_DIRECT, but fuse has this mode of operation
>>> which turns normal "buffered" I/O into direct I/O. And that in turn
>>> will break execve of such files.
>>>
>>> So AFAICS we need to keep kvec handing in some way.
>>>
>>
>> Thanks for bringing that up! Do you have any hints for me, to jump start
>
> How about just leaving that special code in place? It bypasses page
> refs and directly copies to the kernel buffer, so it should not have
> any affect on the user page code.
>

Good idea, I'll go that direction.

>> a deeper look? And especially, sample programs that exercise this?
>
> Here's one:

This is really helpful, exactly what I was looking for.


thanks!
--
John Hubbard
NVIDIA

> # uncomment as appropriate:
> #sudo dnf install fuse3-devel
> #sudo apt install libfuse3-dev
>
> cat <<EOF > fuse-dio-exec.c
> #define FUSE_USE_VERSION 31
> #include <fuse.h>
> #include <errno.h>
> #include <unistd.h>
>
> static const char *filename = "/bin/true";
>
> static int test_getattr(const char *path, struct stat *stbuf,
> struct fuse_file_info *fi)
> {
> return lstat(filename, stbuf) == -1 ? -errno : 0;
> }
>
> static int test_open(const char *path, struct fuse_file_info *fi)
> {
> int res;
>
> res = open(filename, fi->flags);
> if (res == -1)
> return -errno;
>
> fi->fh = res;
> fi->direct_io = 1;
> return 0;
> }
>
> static int test_read(const char *path, char *buf, size_t size, off_t offset,
> struct fuse_file_info *fi)
> {
> int res = pread(fi->fh, buf, size, offset);
> return res == -1 ? -errno : res;
> }
>
> static int test_release(const char *path, struct fuse_file_info *fi)
> {
> close(fi->fh);
> return 0;
> }
>
> static const struct fuse_operations test_oper = {
> .getattr = test_getattr,
> .open = test_open,
> .release = test_release,
> .read = test_read,
> };
>
> int main(int argc, char *argv[])
> {
> return fuse_main(argc, argv, &test_oper, NULL);
> }
> EOF
>
> gcc -W fuse-dio-exec.c `pkg-config fuse3 --cflags --libs` -o fuse-dio-exec
> touch /tmp/true
>
> #run test:
> ./fuse-dio-exec /tmp/true
> /tmp/true
> umount /tmp/true

2022-08-29 20:14:33

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages

On 8/29/22 09:08, Jan Kara wrote:
>> However, the core block/bio conversion in patch 4 still does depend upon
>> a key assumption, which I got from a 2019 email discussion with
>> Christoph Hellwig and others here [1], which says:
>>
>> "All pages released by bio_release_pages should come from
>> get_get_user_pages...".
>>
>> I really hope that still holds true. Otherwise this whole thing is in
>> trouble.
>>
>> [1] https://lore.kernel.org/kvm/[email protected]/
>
> Well as far as I've checked that discussion, Christoph was aware of pipe
> pages etc. (i.e., bvecs) entering direct IO code. But he had some patches
> [2] which enabled GUP to work for bvecs as well (using the kernel mapping
> under the hood AFAICT from a quick glance at the series). I suppose we
> could also handle this in __iov_iter_get_pages_alloc() by grabbing pin
> reference instead of plain get_page() for the case of bvec iter. That way
> we should have only pinned pages in bio_release_pages() even for the bvec
> case.

OK, thanks, that looks viable. So, that approach assumes that the
remaining two cases in __iov_iter_get_pages_alloc() will never end up
being released via bio_release_pages():

iov_iter_is_pipe(i)
iov_iter_is_xarray(i)

I'm actually a little worried about ITER_XARRAY, which is a recent addition.
It seems to be used in ways that are similar to ITER_BVEC, and cephfs is
using it. It's probably OK for now, for this series, which doesn't yet
convert cephfs.


>
> [2] http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/gup-bvec

Yes, I had looked through that again before sending this. The problem
for me was that that it didn't have to deal with releasing pages
differently (and therefore, differentiating between FOLL_PIN and
FOLL_GET pages). But it did enable GUP to handle bvecs, so with that
applied, one could then make the original claim about bio_release_pages()
and GUP, yes.


thanks,

--
John Hubbard
NVIDIA