2023-01-26 14:17:28

by David Howells

Subject: [PATCH v11 0/8] iov_iter: Improve page extraction (pin or just list)

Hi Al, Christoph,

Here are patches to provide support for extracting pages from an iov_iter
and to use this in the extraction functions in the block layer bio code.

The patches make the following changes:

(1) Add a function, iov_iter_extract_pages(), to replace
iov_iter_get_pages*() that gets refs, pins or just lists the pages as
appropriate to the iterator type (a rough usage sketch follows this
list).

Add a function, iov_iter_extract_will_pin(), that will indicate from
the iterator type how the cleanup is to be performed, returning true
if the pages will need unpinning, false otherwise.

(2) Make the bio struct carry a pair of flags to indicate the cleanup
mode. BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (indicating
FOLL_GET was used) and BIO_PAGE_PINNED (indicating FOLL_PIN was used)
is added.

BIO_PAGE_REFFED will go away, but at the moment fs/direct-io.c sets it
and this series does not fully address that file.

(3) Stop the iomap direct I/O block-zeroing code taking a reference on
ZERO_PAGE, which can't go away.

(4) Add a function, bio_release_page(), to release a page appropriately to
the cleanup mode indicated by the BIO_PAGE_* flags.

(5) Make the iter-to-bio code use iov_iter_extract_pages() to retain the
pages appropriately and clean them up later.

(6) Fix bio_flagged() so that it doesn't prevent a gcc optimisation.
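
To illustrate, a rough usage sketch (a hypothetical caller, not part of
this series; the real conversions are in the bio patches, and the names
here are made up for illustration):

	static ssize_t dma_from_iter_sketch(struct iov_iter *iter)
	{
		struct page *stack_pages[16], **pages = stack_pages;
		size_t offset;
		ssize_t len;
		unsigned int npages;

		/* Pins user pages; merely lists kvec/bvec/xarray pages. */
		len = iov_iter_extract_pages(iter, &pages, SIZE_MAX, 16,
					     0, &offset);
		if (len <= 0)
			return len;
		npages = DIV_ROUND_UP(offset + len, PAGE_SIZE);

		/* ... set up DMA on pages[0..npages-1], starting at
		 * offset within pages[0] ... */

		/* Clean up according to the iterator type. */
		if (iov_iter_extract_will_pin(iter))
			unpin_user_pages(pages, npages);
		return len;
	}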

I've pushed the patches here also:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract

David

Changes:
========
ver #11)
- Fix iov_iter_extract_kvec_pages() to include the offset into the page in
the returned starting offset.
- Use __bitwise for the extraction flags

ver #10)
- Fix use of i->kvec in iov_iter_extract_bvec_pages() to be i->bvec.
- Drop bio_set_cleanup_mode(), open coding it instead.

ver #9)
- It's now not permitted to use FOLL_PIN outside of mm/, so:
- Change iov_iter_extract_mode() into iov_iter_extract_will_pin() and
return true/false instead of FOLL_PIN/0.
- Drop folio_put_unpin() and page_put_unpin() and instead call
unpin_user_page() (and put_page()) directly as necessary.
- Make __bio_release_pages() call bio_release_page() instead of
unpin_user_page() as there's no BIO_* -> FOLL_* translation to do.
- Drop the FOLL_* renumbering patch.
- Change extract_flags to extraction_flags.

ver #8)
- Import Christoph Hellwig's changes.
- Split the conversion-to-extraction patch.
- Drop the extract_flags arg from iov_iter_extract_mode().
- Don't default bios to BIO_PAGE_REFFED, but set explicitly.
- Switch FOLL_PIN and FOLL_GET when renumbering so PIN is at bit 0.
- Switch BIO_PAGE_PINNED and BIO_PAGE_REFFED so PINNED is at bit 0.
- We should always be using FOLL_PIN (not FOLL_GET) for DIO, so adjust the
patches for that.

ver #7)
- For now, drop the parts to pass the I/O direction to iov_iter_*pages*()
as it turned out to be a lot more complicated, with places not setting
IOCB_WRITE when they should, for example.
- Drop all the patches that changed things other than the block layer's
bio handling. The netfslib and cifs changes can go into a separate
patchset.
- Add support for extracting pages from KVEC-type iterators.
- When extracting from BVEC/KVEC, skip over empty vecs at the front.

ver #6)
- Fix write() syscall and co. not setting IOCB_WRITE.
- Added iocb_is_read() and iocb_is_write() to check IOCB_WRITE.
- Use op_is_write() in bio_copy_user_iov().
- Drop the iterator direction checks from smbd_recv().
- Define FOLL_SOURCE_BUF and FOLL_DEST_BUF and pass them in as part of
gup_flags to iov_iter_get/extract_pages*().
- Replace iov_iter_get_pages*2() with iov_iter_get_pages*() and remove the
old versions.
- Add back the function to indicate the cleanup mode.
- Drop the cleanup_mode return arg to iov_iter_extract_pages().
- Provide a helper to clean up a page.
- Renumbered FOLL_GET and FOLL_PIN and made BIO_PAGE_REFFED/PINNED have
the same numerical values, enforced with an assertion.
- Converted AF_ALG, SCSI vhost, generic DIO, FUSE, splice to pipe, 9P and
NFS.
- Added in the patches to make CIFS do top-to-bottom iterators and use
various of the added extraction functions.
- Added a pair of work-in-progress patches to make sk_buff fragments store
FOLL_GET and FOLL_PIN.

ver #5)
- Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED and split into own patch.
- Transcribe FOLL_GET/PIN into BIO_PAGE_REFFED/PINNED flags.
- Add patch to allow bio_flagged() to be combined by gcc.

ver #4)
- Drop the patch to move the FOLL_* flags to linux/mm_types.h as they're
no longer referenced by linux/uio.h.
- Add ITER_SOURCE/DEST cleanup patches.
- Make iov_iter/netfslib iter extraction patches use ITER_SOURCE/DEST.
- Allow additional gup_flags to be passed into iov_iter_extract_pages().
- Add struct bio patch.

ver #3)
- Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
to get/pin_user_pages_fast()[1].

ver #2)
- Rolled the extraction cleanup mode query function into the extraction
function, returning the indication through the argument list.
- Fixed patch 4 (extract to scatterlist) to actually use the new
extraction API.

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166722777223.2555743.162508599131141451.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869687556.3723671.10061142538708346995.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920902005.1461876.2786264600108839814.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997419665.9475.15014699817597102032.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344725490.2425628.13771289553670112965.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391047703.2311931.8115712773222260073.stgit@warthog.procyon.org.uk/ # v6
Link: https://lore.kernel.org/r/[email protected]/ # v7
Link: https://lore.kernel.org/r/[email protected]/ # v8
Link: https://lore.kernel.org/r/[email protected]/ # v9

Christoph Hellwig (1):
block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted
logic

David Howells (7):
iov_iter: Define flags to qualify page extraction.
iov_iter: Add a function to extract a page list from an iterator
iomap: Don't get a reference on ZERO_PAGE for direct I/O block
zeroing
block: Fix bio_flagged() so that gcc can better optimise it
block: Add BIO_PAGE_PINNED and associated infrastructure
block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
block: convert bio_map_user_iov to use iov_iter_extract_pages

block/bio.c | 33 ++--
block/blk-map.c | 26 +--
block/blk.h | 12 ++
fs/direct-io.c | 2 +
fs/iomap/direct-io.c | 1 -
include/linux/bio.h | 5 +-
include/linux/blk_types.h | 3 +-
include/linux/uio.h | 35 +++-
lib/iov_iter.c | 335 +++++++++++++++++++++++++++++++++++++-
9 files changed, 411 insertions(+), 41 deletions(-)



2023-01-26 14:17:37

by David Howells

Subject: [PATCH v11 1/8] iov_iter: Define flags to qualify page extraction.

Define flags to qualify page extraction to pass into iov_iter_*_pages*()
rather than passing in FOLL_* flags.

For now only a flag to allow peer-to-peer DMA is supported.
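
Callers then build the flags up front, in the spirit of the bio
conversion in this patch (maxsize stands in for the caller's size
limit):

	iov_iter_extraction_t extraction_flags = 0;

	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
		extraction_flags |= ITER_ALLOW_P2PDMA;

	size = iov_iter_get_pages(iter, pages, maxsize, nr_pages,
				  &offset, extraction_flags);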

Signed-off-by: David Howells <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
cc: Al Viro <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: [email protected]
cc: [email protected]
---

Notes:
ver #11)
- Use __bitwise for the extraction flags

ver #9)
- Change extract_flags to extraction_flags.

ver #7)
- Don't use FOLL_* as a parameter, but rather define constants
specifically to use with iov_iter_*_pages*().
- Drop the I/O direction constants for now.

block/bio.c | 6 +++---
block/blk-map.c | 8 ++++----
include/linux/uio.h | 10 ++++++++--
lib/iov_iter.c | 14 ++++++++------
4 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index ab59a491a883..b97f3991c904 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1245,11 +1245,11 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
*/
static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
{
+ iov_iter_extraction_t extraction_flags = 0;
unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
struct page **pages = (struct page **)bv;
- unsigned int gup_flags = 0;
ssize_t size, left;
unsigned len, i = 0;
size_t offset, trim;
@@ -1264,7 +1264,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);

if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
- gup_flags |= FOLL_PCI_P2PDMA;
+ extraction_flags |= ITER_ALLOW_P2PDMA;

/*
* Each segment in the iov is required to be a block size multiple.
@@ -1275,7 +1275,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
*/
size = iov_iter_get_pages(iter, pages,
UINT_MAX - bio->bi_iter.bi_size,
- nr_pages, &offset, gup_flags);
+ nr_pages, &offset, extraction_flags);
if (unlikely(size <= 0))
return size ? size : -EFAULT;

diff --git a/block/blk-map.c b/block/blk-map.c
index 19940c978c73..080dd60485be 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -265,9 +265,9 @@ static struct bio *blk_rq_map_bio_alloc(struct request *rq,
static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
gfp_t gfp_mask)
{
+ iov_iter_extraction_t extraction_flags = 0;
unsigned int max_sectors = queue_max_hw_sectors(rq->q);
unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
- unsigned int gup_flags = 0;
struct bio *bio;
int ret;
int j;
@@ -280,7 +280,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
return -ENOMEM;

if (blk_queue_pci_p2pdma(rq->q))
- gup_flags |= FOLL_PCI_P2PDMA;
+ extraction_flags |= ITER_ALLOW_P2PDMA;

while (iov_iter_count(iter)) {
struct page **pages, *stack_pages[UIO_FASTIOV];
@@ -291,10 +291,10 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
pages = stack_pages;
bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
- nr_vecs, &offs, gup_flags);
+ nr_vecs, &offs, extraction_flags);
} else {
bytes = iov_iter_get_pages_alloc(iter, &pages,
- LONG_MAX, &offs, gup_flags);
+ LONG_MAX, &offs, extraction_flags);
}
if (unlikely(bytes <= 0)) {
ret = bytes ? bytes : -EFAULT;
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 9f158238edba..bf77cd3d5fb1 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -13,6 +13,8 @@
struct page;
struct pipe_inode_info;

+typedef unsigned int iov_iter_extraction_t;
+
struct kvec {
void *iov_base; /* and that should *never* hold a userland pointer */
size_t iov_len;
@@ -252,12 +254,12 @@ void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *
loff_t start, size_t count);
ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start,
- unsigned gup_flags);
+ iov_iter_extraction_t extraction_flags);
ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start);
ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
struct page ***pages, size_t maxsize, size_t *start,
- unsigned gup_flags);
+ iov_iter_extraction_t extraction_flags);
ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
size_t maxsize, size_t *start);
int iov_iter_npages(const struct iov_iter *i, int maxpages);
@@ -360,4 +362,8 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
};
}

+/* Flags for iov_iter_get/extract_pages*() */
+/* Allow P2PDMA on the extracted pages */
+#define ITER_ALLOW_P2PDMA ((__force iov_iter_extraction_t)0x01)
+
#endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f9a3ff37ecd1..553afc870866 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1432,9 +1432,9 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
struct page ***pages, size_t maxsize,
unsigned int maxpages, size_t *start,
- unsigned int gup_flags)
+ iov_iter_extraction_t extraction_flags)
{
- unsigned int n;
+ unsigned int n, gup_flags = 0;

if (maxsize > i->count)
maxsize = i->count;
@@ -1442,6 +1442,8 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
return 0;
if (maxsize > MAX_RW_COUNT)
maxsize = MAX_RW_COUNT;
+ if (extraction_flags & ITER_ALLOW_P2PDMA)
+ gup_flags |= FOLL_PCI_P2PDMA;

if (likely(user_backed_iter(i))) {
unsigned long addr;
@@ -1495,14 +1497,14 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,

ssize_t iov_iter_get_pages(struct iov_iter *i,
struct page **pages, size_t maxsize, unsigned maxpages,
- size_t *start, unsigned gup_flags)
+ size_t *start, iov_iter_extraction_t extraction_flags)
{
if (!maxpages)
return 0;
BUG_ON(!pages);

return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages,
- start, gup_flags);
+ start, extraction_flags);
}
EXPORT_SYMBOL_GPL(iov_iter_get_pages);

@@ -1515,14 +1517,14 @@ EXPORT_SYMBOL(iov_iter_get_pages2);

ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
struct page ***pages, size_t maxsize,
- size_t *start, unsigned gup_flags)
+ size_t *start, iov_iter_extraction_t extraction_flags)
{
ssize_t len;

*pages = NULL;

len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
- gup_flags);
+ extraction_flags);
if (len <= 0) {
kvfree(*pages);
*pages = NULL;


2023-01-26 14:17:51

by David Howells

Subject: [PATCH v11 2/8] iov_iter: Add a function to extract a page list from an iterator

Add a function, iov_iter_extract_pages(), to extract a list of pages from
an iterator. The pages may be returned with a pin added or nothing,
depending on the type of iterator.

Add a second function, iov_iter_extract_will_pin(), to determine how the
cleanup should be done.

There are two cases:

(1) ITER_IOVEC or ITER_UBUF iterator.

Extracted pages will have pins (FOLL_PIN) obtained on them so that a
concurrent fork() forcibly copies the page; DMA is then done to/from
the parent's buffer and is unavailable to/unaffected by the child
process.

iov_iter_extract_will_pin() will return true for this case. The
caller should use something like unpin_user_page() to dispose of the
page.

(2) Any other sort of iterator.

No refs or pins are obtained on the pages; the caller is assumed to
manage page retention.

iov_iter_extract_will_pin() will return false. The pages don't need
additional disposal (see the sketch below).
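
A minimal sketch covering both cases (a hypothetical helper; the block
layer gains bio_release_page() for this later in the series):

	static void dispose_extracted_pages(const struct iov_iter *iter,
					    struct page **pages,
					    unsigned int npages)
	{
		/* Case (1): user-backed iterator - drop the pins taken. */
		if (iov_iter_extract_will_pin(iter))
			unpin_user_pages(pages, npages);
		/* Case (2): any other iterator - nothing was taken. */
	}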

Signed-off-by: David Howells <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
cc: Al Viro <[email protected]>
cc: John Hubbard <[email protected]>
cc: David Hildenbrand <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---

Notes:
ver #11)
- Fix iov_iter_extract_kvec_pages() to include the offset into the page in
the returned starting offset.
- Use __bitwise for the extraction flags

ver #10)
- Fix use of i->kvec in iov_iter_extract_bvec_pages() to be i->bvec.

ver #9)
- Rename iov_iter_extract_mode() to iov_iter_extract_will_pin() and make
it return true/false not FOLL_PIN/0 as FOLL_PIN is going to be made
private to mm/.
- Change extract_flags to extraction_flags.

ver #8)
- It seems that all DIO is supposed to be done under FOLL_PIN now, and not
FOLL_GET, so switch to only using pin_user_pages() for user-backed
iters.
- Wrap an argument in brackets in the iov_iter_extract_mode() macro.
- Drop the extract_flags argument to iov_iter_extract_mode() for now
[hch].

ver #7)
- Switch to passing in iter-specific flags rather than FOLL_* flags.
- Drop the direction flags for now.
- Use ITER_ALLOW_P2PDMA to request FOLL_PCI_P2PDMA.
- Disallow use of ITER_ALLOW_P2PDMA with non-user-backed iter.
- Add support for extraction from KVEC-type iters.
- Use iov_iter_advance() rather than open-coding it.
- Make BVEC- and KVEC-type skip over initial empty vectors.

ver #6)
- Add back the function to indicate the cleanup mode.
- Drop the cleanup_mode return arg to iov_iter_extract_pages().
- Pass FOLL_SOURCE/DEST_BUF in gup_flags. Check this against the iter
data_source.

ver #4)
- Use ITER_SOURCE/DEST instead of WRITE/READ.
- Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.

ver #3)
- Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
to get/pin_user_pages_fast()[1].

include/linux/uio.h | 27 +++-
lib/iov_iter.c | 321 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 347 insertions(+), 1 deletion(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index bf77cd3d5fb1..b1be128bb2fa 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -361,9 +361,34 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
.count = count
};
}
-
/* Flags for iov_iter_get/extract_pages*() */
/* Allow P2PDMA on the extracted pages */
#define ITER_ALLOW_P2PDMA ((__force iov_iter_extraction_t)0x01)

+ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
+ size_t maxsize, unsigned int maxpages,
+ iov_iter_extraction_t extraction_flags,
+ size_t *offset0);
+
+/**
+ * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
+ * @iter: The iterator
+ *
+ * Examine the iterator and indicate by returning true or false as to how, if
+ * at all, pages extracted from the iterator will be retained by the extraction
+ * function.
+ *
+ * %true indicates that the pages will have a pin placed in them that the
+ * caller must unpin. This must be done for DMA/async DIO to force fork()
+ * to forcibly copy a page for the child (the parent must retain the original
+ * page).
+ *
+ * %false indicates that no measures are taken and that it's up to the caller
+ * to retain the pages.
+ */
+static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
+{
+ return user_backed_iter(iter);
+}
+
#endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 553afc870866..d69a05950555 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1916,3 +1916,324 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
i->iov -= state->nr_segs - i->nr_segs;
i->nr_segs = state->nr_segs;
}
+
+/*
+ * Extract a list of contiguous pages from an ITER_PIPE iterator. This does
+ * not get references of its own on the pages, nor does it get a pin on them.
+ * If there's a partial page, it adds that first and will then allocate and add
+ * pages into the pipe to make up the buffer space to the amount required.
+ *
+ * The caller must hold the pipe locked and only transferring into a pipe is
+ * supported.
+ */
+static ssize_t iov_iter_extract_pipe_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ iov_iter_extraction_t extraction_flags,
+ size_t *offset0)
+{
+ unsigned int nr, offset, chunk, j;
+ struct page **p;
+ size_t left;
+
+ if (!sanity(i))
+ return -EFAULT;
+
+ offset = pipe_npages(i, &nr);
+ if (!nr)
+ return -EFAULT;
+ *offset0 = offset;
+
+ maxpages = min_t(size_t, nr, maxpages);
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ p = *pages;
+
+ left = maxsize;
+ for (j = 0; j < maxpages; j++) {
+ struct page *page = append_pipe(i, left, &offset);
+ if (!page)
+ break;
+ chunk = min_t(size_t, left, PAGE_SIZE - offset);
+ left -= chunk;
+ *p++ = page;
+ }
+ if (!j)
+ return -EFAULT;
+ return maxsize - left;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_XARRAY iterator. This does not
+ * get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ iov_iter_extraction_t extraction_flags,
+ size_t *offset0)
+{
+ struct page *page, **p;
+ unsigned int nr = 0, offset;
+ loff_t pos = i->xarray_start + i->iov_offset;
+ pgoff_t index = pos >> PAGE_SHIFT;
+ XA_STATE(xas, i->xarray, index);
+
+ offset = pos & ~PAGE_MASK;
+ *offset0 = offset;
+
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ p = *pages;
+
+ rcu_read_lock();
+ for (page = xas_load(&xas); page; page = xas_next(&xas)) {
+ if (xas_retry(&xas, page))
+ continue;
+
+ /* Has the page moved or been split? */
+ if (unlikely(page != xas_reload(&xas))) {
+ xas_reset(&xas);
+ continue;
+ }
+
+ p[nr++] = find_subpage(page, xas.xa_index);
+ if (nr == maxpages)
+ break;
+ }
+ rcu_read_unlock();
+
+ maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
+ iov_iter_advance(i, maxsize);
+ return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_BVEC iterator. This does
+ * not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ iov_iter_extraction_t extraction_flags,
+ size_t *offset0)
+{
+ struct page **p, *page;
+ size_t skip = i->iov_offset, offset;
+ int k;
+
+ for (;;) {
+ if (i->nr_segs == 0)
+ return 0;
+ maxsize = min(maxsize, i->bvec->bv_len - skip);
+ if (maxsize)
+ break;
+ i->iov_offset = 0;
+ i->nr_segs--;
+ i->bvec++;
+ skip = 0;
+ }
+
+ skip += i->bvec->bv_offset;
+ page = i->bvec->bv_page + skip / PAGE_SIZE;
+ offset = skip % PAGE_SIZE;
+ *offset0 = offset;
+
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ p = *pages;
+ for (k = 0; k < maxpages; k++)
+ p[k] = page + k;
+
+ maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+ iov_iter_advance(i, maxsize);
+ return maxsize;
+}
+
+/*
+ * Extract a list of virtually contiguous pages from an ITER_KVEC iterator.
+ * This does not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_kvec_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ iov_iter_extraction_t extraction_flags,
+ size_t *offset0)
+{
+ struct page **p, *page;
+ const void *kaddr;
+ size_t skip = i->iov_offset, offset, len;
+ int k;
+
+ for (;;) {
+ if (i->nr_segs == 0)
+ return 0;
+ maxsize = min(maxsize, i->kvec->iov_len - skip);
+ if (maxsize)
+ break;
+ i->iov_offset = 0;
+ i->nr_segs--;
+ i->kvec++;
+ skip = 0;
+ }
+
+ kaddr = i->kvec->iov_base + skip;
+ offset = (unsigned long)kaddr & ~PAGE_MASK;
+ *offset0 = offset;
+
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ p = *pages;
+
+ kaddr -= offset;
+ len = offset + maxsize;
+ for (k = 0; k < maxpages; k++) {
+ size_t seg = min_t(size_t, len, PAGE_SIZE);
+
+ if (is_vmalloc_or_module_addr(kaddr))
+ page = vmalloc_to_page(kaddr);
+ else
+ page = virt_to_page(kaddr);
+
+ p[k] = page;
+ len -= seg;
+ kaddr += PAGE_SIZE;
+ }
+
+ maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+ iov_iter_advance(i, maxsize);
+ return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get a pin on
+ * each of them. This should only be used if the iterator is user-backed
+ * (IOBUF/UBUF).
+ *
+ * It does not get refs on the pages, but the pages must be unpinned by the
+ * caller once the transfer is complete.
+ *
+ * This is safe to be used where background IO/DMA *is* going to be modifying
+ * the buffer; using a pin rather than a ref forces fork() to give the
+ * child a copy of the page.
+ */
+static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
+ struct page ***pages,
+ size_t maxsize,
+ unsigned int maxpages,
+ iov_iter_extraction_t extraction_flags,
+ size_t *offset0)
+{
+ unsigned long addr;
+ unsigned int gup_flags = FOLL_PIN;
+ size_t offset;
+ int res;
+
+ if (i->data_source == ITER_DEST)
+ gup_flags |= FOLL_WRITE;
+ if (extraction_flags & ITER_ALLOW_P2PDMA)
+ gup_flags |= FOLL_PCI_P2PDMA;
+ if (i->nofault)
+ gup_flags |= FOLL_NOFAULT;
+
+ addr = first_iovec_segment(i, &maxsize);
+ *offset0 = offset = addr % PAGE_SIZE;
+ addr &= PAGE_MASK;
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
+ if (unlikely(res <= 0))
+ return res;
+ maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+ iov_iter_advance(i, maxsize);
+ return maxsize;
+}
+
+/**
+ * iov_iter_extract_pages - Extract a list of contiguous pages from an iterator
+ * @i: The iterator to extract from
+ * @pages: Where to return the list of pages
+ * @maxsize: The maximum amount of iterator to extract
+ * @maxpages: The maximum size of the list of pages
+ * @extraction_flags: Flags to qualify request
+ * @offset0: Where to return the starting offset into (*@pages)[0]
+ *
+ * Extract a list of contiguous pages from the current point of the iterator,
+ * advancing the iterator. The maximum number of pages and the maximum amount
+ * of page contents can be set.
+ *
+ * If *@pages is NULL, a page list will be allocated to the required size and
+ * *@pages will be set to its base. If *@pages is not NULL, it will be assumed
+ * that the caller allocated a page list at least @maxpages in size and this
+ * will be filled in.
+ *
+ * @extraction_flags can have ITER_ALLOW_P2PDMA set to request peer-to-peer DMA
+ * be allowed on the pages extracted.
+ *
+ * The iov_iter_extract_will_pin() function can be used to query how cleanup
+ * should be performed.
+ *
+ * Extra refs or pins on the pages may be obtained as follows:
+ *
+ * (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF), pins will be
+ * added to the pages, but refs will not be taken.
+ * iov_iter_extract_will_pin() will return true.
+ *
+ * (*) If the iterator is ITER_PIPE, this must describe a destination for the
+ * data. Additional pages may be allocated and added to the pipe (which
+ * will hold the refs), but pins will not be obtained for the caller. The
+ * caller must hold the pipe lock. iov_iter_extract_will_pin() will
+ * return false.
+ *
+ * (*) If the iterator is ITER_KVEC, ITER_BVEC or ITER_XARRAY, the pages are
+ * merely listed; no extra refs or pins are obtained.
iov_iter_extract_will_pin() will return false.
+ *
+ * Note also:
+ *
+ * (*) Use with ITER_DISCARD is not supported as that has no content.
+ *
+ * On success, the function sets *@pages to the new pagelist, if allocated, and
+ * sets *offset0 to the offset into the first page.
+ *
+ * It may also return -ENOMEM and -EFAULT.
+ */
+ssize_t iov_iter_extract_pages(struct iov_iter *i,
+ struct page ***pages,
+ size_t maxsize,
+ unsigned int maxpages,
+ iov_iter_extraction_t extraction_flags,
+ size_t *offset0)
+{
+ maxsize = min_t(size_t, min_t(size_t, maxsize, i->count), MAX_RW_COUNT);
+ if (!maxsize)
+ return 0;
+
+ if (likely(user_backed_iter(i)))
+ return iov_iter_extract_user_pages(i, pages, maxsize,
+ maxpages, extraction_flags,
+ offset0);
+ if (iov_iter_is_kvec(i))
+ return iov_iter_extract_kvec_pages(i, pages, maxsize,
+ maxpages, extraction_flags,
+ offset0);
+ if (iov_iter_is_bvec(i))
+ return iov_iter_extract_bvec_pages(i, pages, maxsize,
+ maxpages, extraction_flags,
+ offset0);
+ if (iov_iter_is_pipe(i))
+ return iov_iter_extract_pipe_pages(i, pages, maxsize,
+ maxpages, extraction_flags,
+ offset0);
+ if (iov_iter_is_xarray(i))
+ return iov_iter_extract_xarray_pages(i, pages, maxsize,
+ maxpages, extraction_flags,
+ offset0);
+ return -EFAULT;
+}
+EXPORT_SYMBOL_GPL(iov_iter_extract_pages);


2023-01-26 14:17:58

by David Howells

Subject: [PATCH v11 4/8] block: Fix bio_flagged() so that gcc can better optimise it

Fix bio_flagged() so that multiple instances of it, such as:

if (bio_flagged(bio, BIO_PAGE_REFFED) ||
bio_flagged(bio, BIO_PAGE_PINNED))

can be combined by the gcc optimiser into a single test in assembly
(arguably, this is a compiler optimisation issue[1]).

The missed optimisation stems from bio_flagged() comparing the result of
the bitwise-AND to zero. This results in an out-of-line bio_release_pages()
being compiled to something like:

<+0>: mov 0x14(%rdi),%eax
<+3>: test $0x1,%al
<+5>: jne 0xffffffff816dac53 <bio_release_pages+11>
<+7>: test $0x2,%al
<+9>: je 0xffffffff816dac5c <bio_release_pages+20>
<+11>: movzbl %sil,%esi
<+15>: jmp 0xffffffff816daba1 <__bio_release_pages>
<+20>: jmp 0xffffffff81d0b800 <__x86_return_thunk>

However, the test is superfluous as the return type is bool. Removing it
results in:

<+0>: testb $0x3,0x14(%rdi)
<+4>: je 0xffffffff816e4af4 <bio_release_pages+15>
<+6>: movzbl %sil,%esi
<+10>: jmp 0xffffffff816dab7c <__bio_release_pages>
<+15>: jmp 0xffffffff81d0b7c0 <__x86_return_thunk>

instead.

Also, the MOVZBL instruction looks unnecessary[2] - I think it's just
're-booling' the mark_dirty parameter.
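
In other words, the fix lets gcc fold the two tests in
bio_release_pages() into the equivalent of:

	/* What the combined testb $0x3 above corresponds to: */
	if (bio->bi_flags & ((1U << BIO_PAGE_REFFED) | (1U << BIO_PAGE_PINNED)))
		__bio_release_pages(bio, mark_dirty);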

Signed-off-by: David Howells <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
cc: Jens Axboe <[email protected]>
cc: [email protected]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108370 [1]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108371 [2]
Link: https://lore.kernel.org/r/167391056756.2311931.356007731815807265.stgit@warthog.procyon.org.uk/ # v6
---
include/linux/bio.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index c1da63f6c808..10366b8bdb13 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -227,7 +227,7 @@ static inline void bio_cnt_set(struct bio *bio, unsigned int count)

static inline bool bio_flagged(struct bio *bio, unsigned int bit)
{
- return (bio->bi_flags & (1U << bit)) != 0;
+ return bio->bi_flags & (1U << bit);
}

static inline void bio_set_flag(struct bio *bio, unsigned int bit)


2023-01-26 14:18:03

by David Howells

Subject: [PATCH v11 5/8] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic

From: Christoph Hellwig <[email protected]>

Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted
meaning and is only set when a page reference has been acquired that needs
to be released by bio_release_pages().

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: David Howells <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
cc: Al Viro <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Jan Kara <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: [email protected]
---

Notes:
ver #8)
- Split out from another patch [hch].
- Don't default to BIO_PAGE_REFFED [hch].

ver #5)
- Split from patch that uses iov_iter_extract_pages().

block/bio.c | 2 +-
block/blk-map.c | 1 +
fs/direct-io.c | 2 ++
fs/iomap/direct-io.c | 1 -
include/linux/bio.h | 2 +-
include/linux/blk_types.h | 2 +-
6 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b97f3991c904..bf9bf53232be 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1198,7 +1198,6 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
bio->bi_io_vec = (struct bio_vec *)iter->bvec;
bio->bi_iter.bi_bvec_done = iter->iov_offset;
bio->bi_iter.bi_size = size;
- bio_set_flag(bio, BIO_NO_PAGE_REF);
bio_set_flag(bio, BIO_CLONED);
}

@@ -1343,6 +1342,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
return 0;
}

+ bio_set_flag(bio, BIO_PAGE_REFFED);
do {
ret = __bio_iov_iter_get_pages(bio, iter);
} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
diff --git a/block/blk-map.c b/block/blk-map.c
index 080dd60485be..f1f70b50388d 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -282,6 +282,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
if (blk_queue_pci_p2pdma(rq->q))
extraction_flags |= ITER_ALLOW_P2PDMA;

+ bio_set_flag(bio, BIO_PAGE_REFFED);
while (iov_iter_count(iter)) {
struct page **pages, *stack_pages[UIO_FASTIOV];
ssize_t bytes;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 03d381377ae1..07810465fc9d 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -403,6 +403,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
bio->bi_end_io = dio_bio_end_aio;
else
bio->bi_end_io = dio_bio_end_io;
+ /* for now require references for all pages */
+ bio_set_flag(bio, BIO_PAGE_REFFED);
sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
}
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 47db4ead1e74..c0e75900e754 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -202,7 +202,6 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;

- bio_set_flag(bio, BIO_NO_PAGE_REF);
__bio_add_page(bio, page, len, 0);
iomap_dio_submit_bio(iter, dio, bio, pos);
}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 10366b8bdb13..805957c99147 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -484,7 +484,7 @@ void zero_fill_bio(struct bio *bio);

static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
{
- if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+ if (bio_flagged(bio, BIO_PAGE_REFFED))
__bio_release_pages(bio, mark_dirty);
}

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..7daa261f4f98 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -318,7 +318,7 @@ struct bio {
* bio flags
*/
enum {
- BIO_NO_PAGE_REF, /* don't put release vec pages */
+ BIO_PAGE_REFFED, /* put pages in bio_release_pages() */
BIO_CLONED, /* doesn't own data */
BIO_BOUNCED, /* bio is a bounce bio */
BIO_QUIET, /* Make BIO Quiet */


2023-01-26 14:18:08

by David Howells

Subject: [PATCH v11 6/8] block: Add BIO_PAGE_PINNED and associated infrastructure

Add BIO_PAGE_PINNED to indicate that the pages in a bio are pinned
(FOLL_PIN) and that the pin will need removing.

Signed-off-by: David Howells <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
cc: Al Viro <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Jan Kara <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: [email protected]
---

Notes:
ver #10)
- Drop bio_set_cleanup_mode(), open coding it instead.

ver #9)
- Only consider pinning in bio_set_cleanup_mode(). Ref'ing pages in
struct bio is going away.
- page_put_unpin() is removed; call unpin_user_page() and put_page()
directly.
- Use bio_release_page() in __bio_release_pages().
- BIO_PAGE_PINNED and BIO_PAGE_REFFED can't both be set, so use if-else
when testing both of them.

ver #8)
- Move the infrastructure to clean up pinned pages to this patch [hch].
- Put BIO_PAGE_PINNED before BIO_PAGE_REFFED as the latter should
probably be removed at some point. FOLL_PIN can then be renumbered
first.

block/bio.c | 6 +++---
block/blk.h | 12 ++++++++++++
include/linux/bio.h | 3 ++-
include/linux/blk_types.h | 1 +
4 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index bf9bf53232be..547e38883934 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1176,7 +1176,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
bio_for_each_segment_all(bvec, bio, iter_all) {
if (mark_dirty && !PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
- put_page(bvec->bv_page);
+ bio_release_page(bio, bvec->bv_page);
}
}
EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1496,8 +1496,8 @@ void bio_set_pages_dirty(struct bio *bio)
* the BIO and re-dirty the pages in process context.
*
* It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on. It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on. It will unpin each page and will run one bio_put() against the
+ * BIO.
*/

static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk.h b/block/blk.h
index 4c3b3325219a..f02381405311 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -425,6 +425,18 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
struct page *page, unsigned int len, unsigned int offset,
unsigned int max_sectors, bool *same_page);

+/*
+ * Clean up a page appropriately, where the page may be pinned, may have a
+ * ref taken on it or neither.
+ */
+static inline void bio_release_page(struct bio *bio, struct page *page)
+{
+ if (bio_flagged(bio, BIO_PAGE_PINNED))
+ unpin_user_page(page);
+ else if (bio_flagged(bio, BIO_PAGE_REFFED))
+ put_page(page);
+}
+
struct request_queue *blk_alloc_queue(int node_id);

int disk_scan_partitions(struct gendisk *disk, fmode_t mode, void *owner);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 805957c99147..b2c09997d79c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -484,7 +484,8 @@ void zero_fill_bio(struct bio *bio);

static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
{
- if (bio_flagged(bio, BIO_PAGE_REFFED))
+ if (bio_flagged(bio, BIO_PAGE_REFFED) ||
+ bio_flagged(bio, BIO_PAGE_PINNED))
__bio_release_pages(bio, mark_dirty);
}

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 7daa261f4f98..a0e339ff3d09 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -318,6 +318,7 @@ struct bio {
* bio flags
*/
enum {
+ BIO_PAGE_PINNED, /* Unpin pages in bio_release_pages() */
BIO_PAGE_REFFED, /* put pages in bio_release_pages() */
BIO_CLONED, /* doesn't own data */
BIO_BOUNCED, /* bio is a bounce bio */


2023-01-26 14:18:10

by David Howells

Subject: [PATCH v11 3/8] iomap: Don't get a reference on ZERO_PAGE for direct I/O block zeroing

ZERO_PAGE can't go away; there's no need to hold an extra reference on it.

Signed-off-by: David Howells <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
cc: Al Viro <[email protected]>
cc: David Hildenbrand <[email protected]>
cc: [email protected]
---
fs/iomap/direct-io.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9804714b1751..47db4ead1e74 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -202,7 +202,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;

- get_page(page);
+ bio_set_flag(bio, BIO_NO_PAGE_REF);
__bio_add_page(bio, page, len, 0);
iomap_dio_submit_bio(iter, dio, bio, pos);
}


2023-01-26 14:18:29

by David Howells

Subject: [PATCH v11 7/8] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages

This will pin pages or leave them unaltered rather than getting a ref on
them as appropriate to the iterator.

The pages need to be pinned for DIO rather than having refs taken on them to
prevent VM copy-on-write from malfunctioning during a concurrent fork() (the
result of the I/O could otherwise end up being affected by/visible to the
child process).

Signed-off-by: David Howells <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
cc: Al Viro <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Jan Kara <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: [email protected]
---

Notes:
ver #10)
- Drop bio_set_cleanup_mode(), open coding it instead.

ver #8)
- Split the patch up a bit [hch].
- We should only be using pinned/non-pinned pages and not ref'd pages,
so adjust the comments appropriately.

ver #7)
- Don't treat BIO_PAGE_REFFED/PINNED as being the same as FOLL_GET/PIN.

ver #5)
- Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
BIO_* flags and got rid of bi_cleanup_mode.
- Replaced BIO_NO_PAGE_REF to BIO_PAGE_REFFED in the preceding patch.

block/bio.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 547e38883934..fc57f0aa098e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1212,7 +1212,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
}

if (same_page)
- put_page(page);
+ bio_release_page(bio, page);
return 0;
}

@@ -1226,7 +1226,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
queue_max_zone_append_sectors(q), &same_page) != len)
return -EINVAL;
if (same_page)
- put_page(page);
+ bio_release_page(bio, page);
return 0;
}

@@ -1237,10 +1237,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
* @bio: bio to add pages to
* @iter: iov iterator describing the region to be mapped
*
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Extracts pages from *iter and appends them to @bio's bvec array. The pages
+ * will have to be cleaned up in the way indicated by the BIO_PAGE_PINNED flag.
+ * For a multi-segment *iter, this function only adds pages from the next
+ * non-empty segment of the iov iterator.
*/
static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
{
@@ -1272,9 +1272,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
* result to ensure the bio's total size is correct. The remainder of
* the iov data will be picked up in the next bio iteration.
*/
- size = iov_iter_get_pages(iter, pages,
- UINT_MAX - bio->bi_iter.bi_size,
- nr_pages, &offset, extraction_flags);
+ size = iov_iter_extract_pages(iter, &pages,
+ UINT_MAX - bio->bi_iter.bi_size,
+ nr_pages, extraction_flags, &offset);
if (unlikely(size <= 0))
return size ? size : -EFAULT;

@@ -1307,7 +1307,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
iov_iter_revert(iter, left);
out:
while (i < nr_pages)
- put_page(pages[i++]);
+ bio_release_page(bio, pages[i++]);

return ret;
}
@@ -1342,7 +1342,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
return 0;
}

- bio_set_flag(bio, BIO_PAGE_REFFED);
+ if (iov_iter_extract_will_pin(iter))
+ bio_set_flag(bio, BIO_PAGE_PINNED);
do {
ret = __bio_iov_iter_get_pages(bio, iter);
} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));


2023-01-26 14:18:47

by David Howells

Subject: [PATCH v11 8/8] block: convert bio_map_user_iov to use iov_iter_extract_pages

This will pin pages or leave them unaltered rather than getting a ref on
them as appropriate to the iterator.

The pages need to be pinned for DIO rather than having refs taken on them
to prevent VM copy-on-write from malfunctioning during a concurrent fork()
(the result of the I/O could otherwise end up being visible to/affected by
the child process).

Signed-off-by: David Howells <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
cc: Al Viro <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Jan Kara <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: [email protected]
---

Notes:
ver #10)
- Drop bio_set_cleanup_mode(), open coding it instead.

ver #8)
- Split the patch up a bit [hch].
- We should only be using pinned/non-pinned pages and not ref'd pages,
so adjust the comments appropriately.

ver #7)
- Don't treat BIO_PAGE_REFFED/PINNED as being the same as FOLL_GET/PIN.

ver #5)
- Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
BIO_* flags and got rid of bi_cleanup_mode.
- Replaced BIO_NO_PAGE_REF to BIO_PAGE_REFFED in the preceding patch.

block/blk-map.c | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index f1f70b50388d..0f1593e144da 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -281,22 +281,21 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,

if (blk_queue_pci_p2pdma(rq->q))
extraction_flags |= ITER_ALLOW_P2PDMA;
+ if (iov_iter_extract_will_pin(iter))
+ bio_set_flag(bio, BIO_PAGE_PINNED);

- bio_set_flag(bio, BIO_PAGE_REFFED);
while (iov_iter_count(iter)) {
- struct page **pages, *stack_pages[UIO_FASTIOV];
+ struct page *stack_pages[UIO_FASTIOV];
+ struct page **pages = stack_pages;
ssize_t bytes;
size_t offs;
int npages;

- if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
- pages = stack_pages;
- bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
- nr_vecs, &offs, extraction_flags);
- } else {
- bytes = iov_iter_get_pages_alloc(iter, &pages,
- LONG_MAX, &offs, extraction_flags);
- }
+ if (nr_vecs > ARRAY_SIZE(stack_pages))
+ pages = NULL;
+
+ bytes = iov_iter_extract_pages(iter, &pages, LONG_MAX,
+ nr_vecs, extraction_flags, &offs);
if (unlikely(bytes <= 0)) {
ret = bytes ? bytes : -EFAULT;
goto out_unmap;
@@ -318,7 +317,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
if (!bio_add_hw_page(rq->q, bio, page, n, offs,
max_sectors, &same_page)) {
if (same_page)
- put_page(page);
+ bio_release_page(bio, page);
break;
}

@@ -330,7 +329,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
* release the pages we didn't map into the bio, if any
*/
while (j < npages)
- put_page(pages[j++]);
+ bio_release_page(bio, pages[j++]);
if (pages != stack_pages)
kvfree(pages);
/* couldn't stuff something into bio? */


2023-01-26 21:54:48

by Al Viro

Subject: Re: [PATCH v11 1/8] iov_iter: Define flags to qualify page extraction.

On Thu, Jan 26, 2023 at 02:16:19PM +0000, David Howells wrote:

> +typedef unsigned int iov_iter_extraction_t;

> +/* Flags for iov_iter_get/extract_pages*() */
> +/* Allow P2PDMA on the extracted pages */
> +#define ITER_ALLOW_P2PDMA ((__force iov_iter_extraction_t)0x01)

That __force makes sense only if you make it a bitwise type -
otherwise it's meaningless. IOW, either turn the typedef into

typedef unsigned int __bitwise iov_iter_extraction_t;

or lose __force in the cast...

2023-01-26 22:00:03

by Al Viro

Subject: Re: [PATCH v11 2/8] iov_iter: Add a function to extract a page list from an iterator

On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote:

> +/**
> + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
> + * @iter: The iterator
> + *
> + * Examine the iterator and indicate by returning true or false as to how, if
> + * at all, pages extracted from the iterator will be retained by the extraction
> + * function.
> + *
> + * %true indicates that the pages will have a pin placed in them that the
> + * caller must unpin. This must be done for DMA/async DIO to force fork()
> + * to forcibly copy a page for the child (the parent must retain the original
> + * page).
> + *
> + * %false indicates that no measures are taken and that it's up to the caller
> + * to retain the pages.
> + */
> +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
> +{
> + return user_backed_iter(iter);
> +}
> +

Wait a sec; why would we want a pin for pages we won't be modifying?
A reference - sure, but...

2023-01-26 22:36:56

by Al Viro

Subject: Re: [PATCH v11 2/8] iov_iter: Add a function to extract a page list from an iterator

On Thu, Jan 26, 2023 at 09:59:36PM +0000, Al Viro wrote:
> On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote:
>
> > +/**
> > + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
> > + * @iter: The iterator
> > + *
> > + * Examine the iterator and indicate by returning true or false as to how, if
> > + * at all, pages extracted from the iterator will be retained by the extraction
> > + * function.
> > + *
> > + * %true indicates that the pages will have a pin placed in them that the
> > + * caller must unpin. This must be done for DMA/async DIO to force fork()
> > + * to forcibly copy a page for the child (the parent must retain the original
> > + * page).
> > + *
> > + * %false indicates that no measures are taken and that it's up to the caller
> > + * to retain the pages.
> > + */
> > +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
> > +{
> > + return user_backed_iter(iter);
> > +}
> > +
>
> Wait a sec; why would we want a pin for pages we won't be modifying?
> A reference - sure, but...

After having looked through the earlier iterations of the patchset -
sorry, but that won't fly for (at least) vmsplice(). There we can't
pin those suckers; thankfully, we don't need to - they are used only
for fetches, so FOLL_GET is sufficient. With your "we'll just pin them,
source or destination" you won't be able to convert at least that
call of iov_iter_get_pages2(). And there might be other similar cases;
I won't swear there's more, but ISTR running into more than one of
the "pin won't be OK here, but fortunately it's a data source" places.

2023-01-26 22:49:43

by John Hubbard

Subject: Re: [PATCH v11 2/8] iov_iter: Add a function to extract a page list from an iterator

On 1/26/23 14:36, Al Viro wrote:
...
>>> +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
>>> +{
>>> + return user_backed_iter(iter);
>>> +}
>>> +
>>
>> Wait a sec; why would we want a pin for pages we won't be modifying?
>> A reference - sure, but...
>
> After having looked through the earlier iterations of the patchset -
> sorry, but that won't fly for (at least) vmsplice(). There we can't
> pin those suckers; thankfully, we don't need to - they are used only
> for fetches, so FOLL_GET is sufficient. With your "we'll just pin them,
> source or destination" you won't be able to convert at least that
> call of iov_iter_get_pages2(). And there might be other similar cases;
> I won't swear there's more, but ISTR running into more than one of
> the "pin won't be OK here, but fortunately it's a data source" places.

Assuming that "page is a data source" means that we are writing out from
the page to a block device (so, a WRITE operation, which of course
actually *reads* from the page), then...

...one thing I'm worried about now is whether Jan's original problem
report [1] can be fixed, because that involves page writeback. And it
seems like we need to mark the pages involved as "maybe dma-pinned" via
FOLL_PIN pins, in order to solve it.

Or am I missing a key point (I hope)?


[1] https://lore.kernel.org/linux-mm/[email protected]/T/#u

thanks,
--
John Hubbard
NVIDIA

2023-01-26 23:45:50

by David Hildenbrand

Subject: Re: [PATCH v11 2/8] iov_iter: Add a function to extract a page list from an iterator

On 26.01.23 23:36, Al Viro wrote:
> On Thu, Jan 26, 2023 at 09:59:36PM +0000, Al Viro wrote:
>> On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote:
>>
>>> +/**
>>> + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
>>> + * @iter: The iterator
>>> + *
>>> + * Examine the iterator and indicate by returning true or false as to how, if
>>> + * at all, pages extracted from the iterator will be retained by the extraction
>>> + * function.
>>> + *
>>> + * %true indicates that the pages will have a pin placed in them that the
>>> + * caller must unpin. This must be done for DMA/async DIO to force fork()
>>> + * to forcibly copy a page for the child (the parent must retain the original
>>> + * page).
>>> + *
>>> + * %false indicates that no measures are taken and that it's up to the caller
>>> + * to retain the pages.
>>> + */
>>> +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
>>> +{
>>> + return user_backed_iter(iter);
>>> +}
>>> +
>>
>> Wait a sec; why would we want a pin for pages we won't be modifying?
>> A reference - sure, but...
>
> After having looked through the earlier iterations of the patchset -
> sorry, but that won't fly for (at least) vmsplice(). There we can't
> pin those suckers;

We'll need a way to pass FOLL_LONGTERM to pin_user_pages_fast() to
handle such long-term pinning as vmsplice() needs. But the release path
(unpin) will be the same.

--
Thanks,

David / dhildenb


2023-01-26 23:46:39

by David Hildenbrand

Subject: Re: [PATCH v11 1/8] iov_iter: Define flags to qualify page extraction.

On 26.01.23 22:54, Al Viro wrote:
> On Thu, Jan 26, 2023 at 02:16:19PM +0000, David Howells wrote:
>
>> +typedef unsigned int iov_iter_extraction_t;
>
>> +/* Flags for iov_iter_get/extract_pages*() */
>> +/* Allow P2PDMA on the extracted pages */
>> +#define ITER_ALLOW_P2PDMA ((__force iov_iter_extraction_t)0x01)
>
> That __force makes sense only if you make it a bitwise type -
> otherwise it's meaningless. IOW, either turn the typedef into
>
> typedef unsigned int __bitwise iov_iter_extraction_t;
>
> or lose __force in the cast...
>

Well spotted, iov_iter_extraction_t is missing __bitwise.

--
Thanks,

David / dhildenb


2023-01-26 23:59:01

by David Howells

Subject: Re: [PATCH v11 2/8] iov_iter: Add a function to extract a page list from an iterator

Al says that pinning a page (ie. FOLL_PIN) could cause a deadlock if a page is
vmspliced into a pipe with the pipe holding a pin on it because pinned pages
are removed from all page tables. Is this actually the case? I can't see
offhand where in mm/gup.c it does this.

David


2023-01-27 00:11:37

by David Hildenbrand

Subject: Re: [PATCH v11 2/8] iov_iter: Add a function to extract a page list from an iterator

On 27.01.23 00:56, David Howells wrote:
> Al says that pinning a page (ie. FOLL_PIN) could cause a deadlock if a page is
> vmspliced into a pipe with the pipe holding a pin on it because pinned pages
> are removed from all page tables. Is this actually the case? I can't see
> offhand where in mm/gup.c it does this.

Pinning a page is mostly taking a "special" reference on the page,
indicating to the system that the page may be pinned. For an ordinary
order-0 page, this is increasing the refcount by 1024 instead of 1.
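
A simplified sketch of that order-0 accounting (the real logic lives
around GUP_PIN_COUNTING_BIAS in include/linux/mm.h and mm/gup.c, and
handles compound pages via a separate pincount):

	#define GUP_PIN_COUNTING_BIAS (1U << 10)	/* 1024 */

	static void pin_sketch(struct page *page)
	{
		/* FOLL_PIN biases the refcount instead of adding 1 ... */
		page_ref_add(page, GUP_PIN_COUNTING_BIAS);
	}

	static bool maybe_pinned_sketch(struct page *page)
	{
		/* ... so pins are detectable, with false positives
		 * possible on very heavily referenced pages. */
		return page_ref_count(page) >= GUP_PIN_COUNTING_BIAS;
	}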

In addition, we'll do some COW-unsharing magic depending on the page
type (e.g., anon vs. file-backed), and FOLL_LONGTERM. So if the page is
mapped R/O only and we want to pin it R/O (!FOLL_WRITE), we might
replace it in the page table by a different page via a fault
(FAULT_FLAG_UNSHARE).

Last but not least, with FOLL_LONGTERM we will make sure to migrate the
target page off of MIGRATE_MOVABLE/CMA memory where the unmovable page
(while pinned) could otherwise cause trouble (e.g., blocking memory
hotunplug). So again, we'd replace it in the page table by a different
page via a fault.

In all cases, the page won't be unmapped from the page table.

--
Thanks,

David / dhildenb


2023-01-27 00:53:01

by Al Viro

Subject: Re: [PATCH v11 2/8] iov_iter: Add a function to extract a page list from an iterator

On Thu, Jan 26, 2023 at 11:56:50PM +0000, David Howells wrote:
> Al says that pinning a page (ie. FOLL_PIN) could cause a deadlock if a page is
> vmspliced into a pipe with the pipe holding a pin on it because pinned pages
> are removed from all page tables. Is this actually the case? I can't see
> offhand where in mm/gup.c it does this.

It doesn't; sorry, really confused memories of what's going on, took a while
to sort them out (FWIW, writeback is where we unmap and check if page is
pinned, while pin_user_pages running into an unmapped page will end up
with handle_mm_fault() (->fault(), actually) try to get the sucker locked
and block on that until the writeback is over).

Said that, I still think that pinned pages (arbitrary pagecache ones,
at that) ending up in a pipe is a seriously bad idea. It's trivial to
arrange for them to stay that way indefinitely - no privileges needed,
very few limits, etc.

2023-01-27 01:21:43

by Al Viro

Subject: Re: [PATCH v11 2/8] iov_iter: Add a function to extract a page list from an iterator

On Fri, Jan 27, 2023 at 12:52:38AM +0000, Al Viro wrote:
> On Thu, Jan 26, 2023 at 11:56:50PM +0000, David Howells wrote:
> > Al says that pinning a page (ie. FOLL_PIN) could cause a deadlock if a page is
> > vmspliced into a pipe with the pipe holding a pin on it because pinned pages
> > are removed from all page tables. Is this actually the case? I can't see
> > offhand where in mm/gup.c it does this.
>
> It doesn't; sorry, really confused memories of what's going on, took a while
> to sort them out (FWIW, writeback is where we unmap and check if the page is
> pinned, while pin_user_pages() running into an unmapped page will end up
> with handle_mm_fault() (->fault(), actually) trying to get the sucker locked
> and blocking on that until the writeback is over).

Umm... OK, I really need to reread that area. Hopefully will be done with
that by tomorrow...

2023-01-27 02:05:58

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v11 2/8] iov_iter: Add a function to extract a page list from an iterator

On Fri, Jan 27, 2023 at 12:44:08AM +0100, David Hildenbrand wrote:
> On 26.01.23 23:36, Al Viro wrote:
> > On Thu, Jan 26, 2023 at 09:59:36PM +0000, Al Viro wrote:
> > > On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote:
> > >
> > > > +/**
> > > > + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
> > > > + * @iter: The iterator
> > > > + *
> > > > + * Examine the iterator and indicate by returning true or false as to how, if
> > > > + * at all, pages extracted from the iterator will be retained by the extraction
> > > > + * function.
> > > > + *
> > > > + * %true indicates that the pages will have a pin placed in them that the
> > > > + * caller must unpin. This must be done for DMA/async DIO to force fork()
> > > > + * to forcibly copy a page for the child (the parent must retain the original
> > > > + * page).
> > > > + *
> > > > + * %false indicates that no measures are taken and that it's up to the caller
> > > > + * to retain the pages.
> > > > + */
> > > > +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
> > > > +{
> > > > + return user_backed_iter(iter);
> > > > +}
> > > > +
> > >
> > > Wait a sec; why would we want a pin for pages we won't be modifying?
> > > A reference - sure, but...
> >
> > After having looked through the earlier iterations of the patchset -
> > sorry, but that won't fly for (at least) vmsplice(). There we can't
> > pin those suckers;
>
> We'll need a way to pass FOLL_LONGTERM to pin_user_pages_fast() to handle
> such long-term pinning as vmsplice() needs. But the release path (unpin)
> will be the same.

Umm... Are you saying that if the source area contains DAX mmaps, vmsplice()
from it will fail?
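
For reference, the cleanup contract the quoted comment describes, as a
minimal caller-side sketch (iov_iter_extract_pages() and
iov_iter_extract_will_pin() are the functions from this series; the
surrounding code is illustrative):

struct page *page_array[16], **pages = page_array;
size_t offset0;
ssize_t len;
int i;

/* Extract up to 16 pages: user-backed iterators get their pages
 * pinned (FOLL_PIN); BVEC/KVEC/XARRAY pages are merely listed. */
len = iov_iter_extract_pages(iter, &pages, 16 * PAGE_SIZE, 16,
			     0, &offset0);
if (len < 0)
	return len;

/* ... do the DMA / async DIO on the pages ... */

for (i = 0; i < DIV_ROUND_UP(offset0 + len, PAGE_SIZE); i++) {
	if (iov_iter_extract_will_pin(iter))
		unpin_user_page(pages[i]);
	/* else: nothing to release; the caller must keep the pages
	 * alive itself (e.g. the bvec holds its own references) */
}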

2023-01-27 12:30:52

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v11 2/8] iov_iter: Add a function to extract a page list from an iterator

On Fri 27-01-23 02:02:31, Al Viro wrote:
> On Fri, Jan 27, 2023 at 12:44:08AM +0100, David Hildenbrand wrote:
> > On 26.01.23 23:36, Al Viro wrote:
> > > On Thu, Jan 26, 2023 at 09:59:36PM +0000, Al Viro wrote:
> > > > On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote:
> > > >
> > > > > +/**
> > > > > + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
> > > > > + * @iter: The iterator
> > > > > + *
> > > > > + * Examine the iterator and indicate by returning true or false as to how, if
> > > > > + * at all, pages extracted from the iterator will be retained by the extraction
> > > > > + * function.
> > > > > + *
> > > > > + * %true indicates that the pages will have a pin placed in them that the
> > > > > + * caller must unpin. This must be done for DMA/async DIO to force fork()
> > > > > + * to forcibly copy a page for the child (the parent must retain the original
> > > > > + * page).
> > > > > + *
> > > > > + * %false indicates that no measures are taken and that it's up to the caller
> > > > > + * to retain the pages.
> > > > > + */
> > > > > +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
> > > > > +{
> > > > > + return user_backed_iter(iter);
> > > > > +}
> > > > > +
> > > >
> > > > Wait a sec; why would we want a pin for pages we won't be modifying?
> > > > A reference - sure, but...
> > >
> > > After having looked through the earlier iterations of the patchset -
> > > sorry, but that won't fly for (at least) vmsplice(). There we can't
> > > pin those suckers;
> >
> > We'll need a way to pass FOLL_LONGTERM to pin_user_pages_fast() to handle
> > such long-term pinning as vmsplice() needs. But the release path (unpin)
> > will be the same.
>
> Umm... Are you saying that if the source area contains DAX mmaps, vmsplice()
> from it will fail?

Yes, that's the plan. Because as you wrote elsewhere, it is otherwise too easy
to lock up operations such as truncate(2) on DAX filesystems.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
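
A minimal sketch of the long-term pin under discussion
(pin_user_pages_fast() is the real API and adds FOLL_PIN itself; the
vmsplice() hookup and the surrounding variables are hypothetical):

/* Hypothetical vmsplice()-side pinning: FOLL_LONGTERM makes GUP
 * migrate the pages off CMA/ZONE_MOVABLE first and refuse fs-dax
 * pages, which is how a DAX-backed source would come to fail. */
ret = pin_user_pages_fast(start, nr_pages,
			  FOLL_LONGTERM | (write ? FOLL_WRITE : 0),
			  pages);
if (ret < 0)
	return ret;	/* a DAX-backed range is rejected here */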

2023-01-27 12:36:00

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v11 2/8] iov_iter: Add a function to extract a page list from an iterator

On 27.01.23 13:30, Jan Kara wrote:
> On Fri 27-01-23 02:02:31, Al Viro wrote:
>> On Fri, Jan 27, 2023 at 12:44:08AM +0100, David Hildenbrand wrote:
>>> On 26.01.23 23:36, Al Viro wrote:
>>>> On Thu, Jan 26, 2023 at 09:59:36PM +0000, Al Viro wrote:
>>>>> On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote:
>>>>>
>>>>>> +/**
>>>>>> + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
>>>>>> + * @iter: The iterator
>>>>>> + *
>>>>>> + * Examine the iterator and indicate by returning true or false as to how, if
>>>>>> + * at all, pages extracted from the iterator will be retained by the extraction
>>>>>> + * function.
>>>>>> + *
>>>>>> + * %true indicates that the pages will have a pin placed in them that the
>>>>>> + * caller must unpin. This must be done for DMA/async DIO to force fork()
>>>>>> + * to forcibly copy a page for the child (the parent must retain the original
>>>>>> + * page).
>>>>>> + *
>>>>>> + * %false indicates that no measures are taken and that it's up to the caller
>>>>>> + * to retain the pages.
>>>>>> + */
>>>>>> +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
>>>>>> +{
>>>>>> + return user_backed_iter(iter);
>>>>>> +}
>>>>>> +
>>>>>
>>>>> Wait a sec; why would we want a pin for pages we won't be modifying?
>>>>> A reference - sure, but...
>>>>
>>>> After having looked through the earlier iterations of the patchset -
>>>> sorry, but that won't fly for (at least) vmsplice(). There we can't
>>>> pin those suckers;
>>>
>>> We'll need a way to pass FOLL_LONGTERM to pin_user_pages_fast() to handle
>>> such long-term pinning as vmsplice() needs. But the release path (unpin)
>>> will be the same.
>>
>> Umm... Are you saying that if the source area contains DAX mmaps, vmsplice()
>> from it will fail?
>
> Yes, that's the plan. Because as you wrote elsewhere, it is otherwise too easy
> to lock up operations such as truncate(2) on DAX filesystems.

Right, it's then the same behavior as we already have for other
FOLL_LONGTERM users, such as RDMA or io_uring.

... if we're afraid of breaking existing setups, we could add some kind
of fallback that copies into a buffer, as ordinary pipe writes do.

--
Thanks,

David / dhildenb


2023-01-27 12:39:39

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v11 2/8] iov_iter: Add a function to extract a page list from an iterator

On Fri 27-01-23 00:52:38, Al Viro wrote:
> On Thu, Jan 26, 2023 at 11:56:50PM +0000, David Howells wrote:
> > Al says that pinning a page (i.e. FOLL_PIN) could cause a deadlock if a page
> > is vmspliced into a pipe, with the pipe holding a pin on it, because pinned
> > pages are removed from all page tables. Is this actually the case? I can't
> > see offhand where in mm/gup.c it does this.
>
> It doesn't; sorry, really confused memories of what's going on, took a while
> to sort them out (FWIW, writeback is where we unmap and check if the page is
> pinned, while pin_user_pages() running into an unmapped page will end up
> with handle_mm_fault() (->fault(), actually) trying to get the sucker locked
> and blocking on that until the writeback is over).
>
> Said that, I still think that pinned pages (arbitrary pagecache ones,
> at that) ending up in a pipe is a seriously bad idea. It's trivial to
> arrange for them to stay that way indefinitely - no privileges needed,
> very few limits, etc.

I tend to agree, but is there a big difference compared to normal page
references? There's no difference in memory usage; pages can still be
truncated from the file and the disk space reclaimed (this is where DAX has
problems...), so standard file operations won't notice. The only difference
is that the pages could stay permanently dirty (we don't know whether the pin
owner copies data to or from the page), so it could cause trouble with dirty
throttling - and it is really only the throttling itself; page reclaim
will have the same troubles with both pins and ordinary page references...
Am I missing something?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-01-30 07:41:50

by Naresh Kamboju

[permalink] [raw]
Subject: [PATCH v11 0/8] iov_iter: Improve page extraction (pin or just list)

Build tests pass on arm, arm64, i386, mips, parisc, powerpc, riscv, s390, sh,
sparc and x86_64.
Boot and LTP smoke tests pass on qemu-arm64, qemu-armv7, qemu-i386 and
qemu-x86_64.

Tested-by: Linux Kernel Functional Testing <[email protected]>
Tested-by: Anders Roxell <[email protected]>

Please refer to the following link for details of the testing.
https://qa-reports.linaro.org/~anders.roxell/linux-mainline-patches/build/lore_kernel_org_linux-mm_20230126141626_2809643-1-dhowells_redhat_com/?results_layout=table&failures_only=false#!#test-results

--
Linaro LKFT
https://lkft.linaro.org