2023-01-20 18:00:29

by David Howells

Subject: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)

Hi Al, Christoph,

Here are patches to provide support for extracting pages from an iov_iter
and a patch to use the primary extraction function in the block layer bio
code.

The patches make the following changes:

(1) Add a function, iov_iter_extract_pages() to replace
iov_iter_get_pages*() that gets refs, pins or just lists the pages as
appropriate to the iterator type and the I/O direction.

Add a function, iov_iter_extract_mode() that will indicate from the
iterator type and the I/O direction how the cleanup is to be
performed, returning FOLL_GET, FOLL_PIN or 0.

(2) Add a function, folio_put_unpin(), and a wrapper, page_put_unpin(),
that take a page and the return from iov_iter_extract_mode() and do
the right thing to clean up the page.

(3) Make the bio struct carry a pair of flags to indicate the cleanup
mode. BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
FOLL_GET) and BIO_PAGE_PINNED (equivalent to FOLL_PIN) is
added.

(4) Add a function, bio_release_page(), to release a page appropriately to
the cleanup mode indicated by the BIO_PAGE_* flags.

(5) Make the iter-to-bio code use iov_iter_extract_pages() to retain the
pages appropriately and clean them up later.

(6) Fix bio_flagged() so that it doesn't prevent a gcc optimisation.

(7) Renumber FOLL_GET and FOLL_PIN down so that they're at bits 0 and 1
and coincident with BIO_PAGE_REFFED and BIO_PAGE_PINNED. The compiler
can then optimise on that. Also, it's probably going to be necessary
to embed these in the page pointer in sk_buff fragments. This patch
can go independently through the mm tree.

Changes:
========
ver #7)
- For now, drop the parts to pass the I/O direction to iov_iter_*pages*()
as it turned out to be a lot more complicated, with places not setting
IOCB_WRITE when they should, for example.
- Drop all the patches that changed things other than the block layer's
bio handling. The netfslib and cifs changes can go into a separate
patchset.
- Add support for extracting pages from KVEC-type iterators.
- When extracting from BVEC/KVEC, skip over empty vecs at the front.

ver #6)
- Fix write() syscall and co. not setting IOCB_WRITE.
- Added iocb_is_read() and iocb_is_write() to check IOCB_WRITE.
- Use op_is_write() in bio_copy_user_iov().
- Drop the iterator direction checks from smbd_recv().
- Define FOLL_SOURCE_BUF and FOLL_DEST_BUF and pass them in as part of
gup_flags to iov_iter_get/extract_pages*().
- Replace iov_iter_get_pages*2() with iov_iter_get_pages*() and remove.
- Add back the function to indicate the cleanup mode.
- Drop the cleanup_mode return arg to iov_iter_extract_pages().
- Provide a helper to clean up a page.
- Renumbered FOLL_GET and FOLL_PIN and made BIO_PAGE_REFFED/PINNED have
the same numerical values, enforced with an assertion.
- Converted AF_ALG, SCSI vhost, generic DIO, FUSE, splice to pipe, 9P and
NFS.
- Added in the patches to make CIFS do top-to-bottom iterators and use
various of the added extraction functions.
- Added a pair of work-in-progress patches to make sk_buff fragments store
FOLL_GET and FOLL_PIN.

ver #5)
- Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED and split into own patch.
- Transcribe FOLL_GET/PIN into BIO_PAGE_REFFED/PINNED flags.
- Add patch to allow bio_flagged() to be combined by gcc.

ver #4)
- Drop the patch to move the FOLL_* flags to linux/mm_types.h as they're
no longer referenced by linux/uio.h.
- Add ITER_SOURCE/DEST cleanup patches.
- Make iov_iter/netfslib iter extraction patches use ITER_SOURCE/DEST.
- Allow additional gup_flags to be passed into iov_iter_extract_pages().
- Add struct bio patch.

ver #3)
- Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
to get/pin_user_pages_fast()[1].

ver #2)
- Rolled the extraction cleanup mode query function into the extraction
function, returning the indication through the argument list.
- Fixed patch 4 (extract to scatterlist) to actually use the new
extraction API.

I've pushed the patches (excluding the two WIP networking patches) here
also:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract

David

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166722777223.2555743.162508599131141451.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869687556.3723671.10061142538708346995.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920902005.1461876.2786264600108839814.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997419665.9475.15014699817597102032.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344725490.2425628.13771289553670112965.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391047703.2311931.8115712773222260073.stgit@warthog.procyon.org.uk/ # v6

David Howells (8):
iov_iter: Define flags to qualify page extraction.
iov_iter: Add a function to extract a page list from an iterator
mm: Provide a helper to drop a pin/ref on a page
block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the
meaning
block: Add BIO_PAGE_PINNED
block: Make bio structs pin pages rather than ref'ing if appropriate
block: Fix bio_flagged() so that gcc can better optimise it
mm: Renumber FOLL_GET and FOLL_PIN down

block/bio.c | 43 ++--
block/blk-map.c | 26 +--
block/blk.h | 29 +++
fs/iomap/direct-io.c | 1 -
include/linux/bio.h | 5 +-
include/linux/blk_types.h | 3 +-
include/linux/mm.h | 17 +-
include/linux/uio.h | 35 ++-
lib/iov_iter.c | 438 +++++++++++++++++++++++++++++++++++++-
mm/gup.c | 22 ++
10 files changed, 571 insertions(+), 48 deletions(-)


2023-01-20 18:11:49

by David Howells

Subject: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

Add a function, iov_iter_extract_pages(), to extract a list of pages from
an iterator. The pages may be returned with a reference added or a pin
added or neither, depending on the type of iterator and the direction of
transfer. The caller must pass FOLL_READ_FROM_MEM or FOLL_WRITE_TO_MEM
as part of gup_flags to indicate how the iterator contents are to be used.

Add a second function, iov_iter_extract_mode(), to determine how the
cleanup should be done.

There are three cases:

(1) Transfer *into* an ITER_IOVEC or ITER_UBUF iterator.

Extracted pages will have pins obtained on them (but not references)
so that fork() doesn't CoW the pages incorrectly whilst the I/O is in
progress.

iov_iter_extract_mode() will return FOLL_PIN for this case. The
caller should use something like unpin_user_page() to dispose of the
page.

(2) Transfer *out of* an ITER_IOVEC or ITER_UBUF iterator.

Extracted pages will have references obtained on them, but not pins.

iov_iter_extract_mode() will return FOLL_GET. The caller should use
something like put_page() for page disposal.

(3) Any other sort of iterator.

No refs or pins are obtained on the page, the assumption is made that
the caller will manage page retention. ITER_ALLOW_P2PDMA is not
permitted.

iov_iter_extract_mode() will return 0. The pages don't need
additional disposal.

Signed-off-by: David Howells <[email protected]>
cc: Al Viro <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: John Hubbard <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
Link: https://lore.kernel.org/r/166920903885.1461876.692029808682876184.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997421646.9475.14837976344157464997.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305163883.1521586.10777155475378874823.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344728530.2425628.9613910866466387722.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391053207.2311931.16398133457201442907.stgit@warthog.procyon.org.uk/ # v6
---

Notes:
ver #7)
- Switch to passing in iter-specific flags rather than FOLL_* flags.
- Drop the direction flags for now.
- Use ITER_ALLOW_P2PDMA to request FOLL_PCI_P2PDMA.
- Disallow use of ITER_ALLOW_P2PDMA with non-user-backed iter.
- Add support for extraction from KVEC-type iters.
- Use iov_iter_advance() rather than open-coding it.
- Make BVEC- and KVEC-type skip over initial empty vectors.

ver #6)
- Add back the function to indicate the cleanup mode.
- Drop the cleanup_mode return arg to iov_iter_extract_pages().
- Pass FOLL_SOURCE/DEST_BUF in gup_flags. Check this against the iter
data_source.

ver #4)
- Use ITER_SOURCE/DEST instead of WRITE/READ.
- Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.

ver #3)
- Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
to get/pin_user_pages_fast()[1].

include/linux/uio.h | 28 +++
lib/iov_iter.c | 424 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 452 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 46d5080314c6..a4233049ab7a 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -363,4 +363,32 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
/* Flags for iov_iter_get/extract_pages*() */
#define ITER_ALLOW_P2PDMA 0x01 /* Allow P2PDMA on the extracted pages */

+ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
+ size_t maxsize, unsigned int maxpages,
+ unsigned int extract_flags, size_t *offset0);
+
+/**
+ * iov_iter_extract_mode - Indicate how pages from the iterator will be retained
+ * @iter: The iterator
+ * @extract_flags: How the iterator is to be used
+ *
+ * Examine the iterator and @extract_flags and indicate by returning FOLL_PIN,
+ * FOLL_GET or 0 as to how, if at all, pages extracted from the iterator will
+ * be retained by the extraction function.
+ *
+ * FOLL_GET indicates that the pages will have a reference taken on them that
+ * the caller must put. This can be done for DMA/async DIO write from a page.
+ *
+ * FOLL_PIN indicates that the pages will have a pin placed in them that the
+ * caller must unpin. This must be done for DMA/async DIO read to a page to
+ * avoid CoW problems in fork.
+ *
+ * 0 indicates that no measures are taken and that it's up to the caller to
+ * retain the pages.
+ */
+#define iov_iter_extract_mode(iter, extract_flags) \
+ (user_backed_iter(iter) ? \
+ ((iter)->data_source == ITER_SOURCE) ? \
+ FOLL_GET : FOLL_PIN : 0)
+
#endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index fb04abe7d746..843abe566efb 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1916,3 +1916,427 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
i->iov -= state->nr_segs - i->nr_segs;
i->nr_segs = state->nr_segs;
}
+
+/*
+ * Extract a list of contiguous pages from an ITER_PIPE iterator. This does
+ * not get references of its own on the pages, nor does it get a pin on them.
+ * If there's a partial page, it adds that first and will then allocate and add
+ * pages into the pipe to make up the buffer space to the amount required.
+ *
+ * The caller must hold the pipe locked and only transferring into a pipe is
+ * supported.
+ */
+static ssize_t iov_iter_extract_pipe_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ unsigned int extract_flags,
+ size_t *offset0)
+{
+ unsigned int nr, offset, chunk, j;
+ struct page **p;
+ size_t left;
+
+ if (!sanity(i))
+ return -EFAULT;
+
+ offset = pipe_npages(i, &nr);
+ if (!nr)
+ return -EFAULT;
+ *offset0 = offset;
+
+ maxpages = min_t(size_t, nr, maxpages);
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ p = *pages;
+
+ left = maxsize;
+ for (j = 0; j < maxpages; j++) {
+ struct page *page = append_pipe(i, left, &offset);
+ if (!page)
+ break;
+ chunk = min_t(size_t, left, PAGE_SIZE - offset);
+ left -= chunk;
+ *p++ = page;
+ }
+ if (!j)
+ return -EFAULT;
+ return maxsize - left;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_XARRAY iterator. This does not
+ * get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ unsigned int extract_flags,
+ size_t *offset0)
+{
+ struct page *page, **p;
+ unsigned int nr = 0, offset;
+ loff_t pos = i->xarray_start + i->iov_offset;
+ pgoff_t index = pos >> PAGE_SHIFT;
+ XA_STATE(xas, i->xarray, index);
+
+ offset = pos & ~PAGE_MASK;
+ *offset0 = offset;
+
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ p = *pages;
+
+ rcu_read_lock();
+ for (page = xas_load(&xas); page; page = xas_next(&xas)) {
+ if (xas_retry(&xas, page))
+ continue;
+
+ /* Has the page moved or been split? */
+ if (unlikely(page != xas_reload(&xas))) {
+ xas_reset(&xas);
+ continue;
+ }
+
+ p[nr++] = find_subpage(page, xas.xa_index);
+ if (nr == maxpages)
+ break;
+ }
+ rcu_read_unlock();
+
+ maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
+ iov_iter_advance(i, maxsize);
+ return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_BVEC iterator. This does
+ * not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ unsigned int extract_flags,
+ size_t *offset0)
+{
+ struct page **p, *page;
+ size_t skip = i->iov_offset, offset;
+ int k;
+
+ for (;;) {
+ if (i->nr_segs == 0)
+ return 0;
+ maxsize = min(maxsize, i->bvec->bv_len - skip);
+ if (maxsize)
+ break;
+ i->iov_offset = 0;
+ i->nr_segs--;
+ i->bvec++;
+ skip = 0;
+ }
+
+ skip += i->bvec->bv_offset;
+ page = i->bvec->bv_page + skip / PAGE_SIZE;
+ offset = skip % PAGE_SIZE;
+ *offset0 = offset;
+
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ p = *pages;
+ for (k = 0; k < maxpages; k++)
+ p[k] = page + k;
+
+ maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+ iov_iter_advance(i, maxsize);
+ return maxsize;
+}
+
+/*
+ * Extract a list of virtually contiguous pages from an ITER_KVEC iterator.
+ * This does not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_kvec_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ unsigned int extract_flags,
+ size_t *offset0)
+{
+ struct page **p, *page;
+ const void *kaddr;
+ size_t skip = i->iov_offset, offset, len;
+ int k;
+
+ for (;;) {
+ if (i->nr_segs == 0)
+ return 0;
+ maxsize = min(maxsize, i->kvec->iov_len - skip);
+ if (maxsize)
+ break;
+ i->iov_offset = 0;
+ i->nr_segs--;
+ i->kvec++;
+ skip = 0;
+ }
+
+ kaddr = i->kvec->iov_base + skip;
+ offset = (unsigned long)kaddr & ~PAGE_MASK;
+ *offset0 = offset;
+
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ p = *pages;
+
+ kaddr -= offset;
+ len = offset + maxsize;
+ for (k = 0; k < maxpages; k++) {
+ size_t seg = min_t(size_t, len, PAGE_SIZE);
+
+ if (is_vmalloc_or_module_addr(kaddr))
+ page = vmalloc_to_page(kaddr);
+ else
+ page = virt_to_page(kaddr);
+
+ p[k] = page;
+ len -= seg;
+ kaddr += PAGE_SIZE;
+ }
+
+ maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+ iov_iter_advance(i, maxsize);
+ return maxsize;
+}
+
+/*
+ * Get the first segment from an ITER_UBUF or ITER_IOVEC iterator. The
+ * iterator must not be empty.
+ */
+static unsigned long iov_iter_extract_first_user_segment(const struct iov_iter *i,
+ size_t *size)
+{
+ size_t skip;
+ long k;
+
+ if (iter_is_ubuf(i))
+ return (unsigned long)i->ubuf + i->iov_offset;
+
+ for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
+ size_t len = i->iov[k].iov_len - skip;
+
+ if (unlikely(!len))
+ continue;
+ if (*size > len)
+ *size = len;
+ return (unsigned long)i->iov[k].iov_base + skip;
+ }
+ BUG(); // if it had been empty, we wouldn't get called
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get references
+ * on them. This should only be used iff the iterator is user-backed
+ * (IOBUF/UBUF) and data is being transferred out of the buffer described by
+ * the iterator (ie. this is the source).
+ *
+ * The pages are returned with incremented refcounts that the caller must undo
+ * once the transfer is complete, but no additional pins are obtained.
+ *
+ * This is only safe to be used where background IO/DMA is not going to be
+ * modifying the buffer, and so won't cause a problem with CoW on fork.
+ */
+static ssize_t iov_iter_extract_user_pages_and_get(struct iov_iter *i,
+ struct page ***pages,
+ size_t maxsize,
+ unsigned int maxpages,
+ unsigned int extract_flags,
+ size_t *offset0)
+{
+ unsigned long addr;
+ unsigned int gup_flags = FOLL_GET;
+ size_t offset;
+ int res;
+
+ if (WARN_ON_ONCE(i->data_source != ITER_SOURCE))
+ return -EFAULT;
+
+ if (extract_flags & ITER_ALLOW_P2PDMA)
+ gup_flags |= FOLL_PCI_P2PDMA;
+ if (i->nofault)
+ gup_flags |= FOLL_NOFAULT;
+
+ addr = iov_iter_extract_first_user_segment(i, &maxsize);
+ *offset0 = offset = addr % PAGE_SIZE;
+ addr &= PAGE_MASK;
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ res = get_user_pages_fast(addr, maxpages, gup_flags, *pages);
+ if (unlikely(res <= 0))
+ return res;
+ maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+ iov_iter_advance(i, maxsize);
+ return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get a pin on
+ * each of them. This should only be used iff the iterator is user-backed
+ * (IOBUF/UBUF) and data is being transferred into the buffer described by the
+ * iterator (ie. this is the destination).
+ *
+ * It does not get refs on the pages, but the pages must be unpinned by the
+ * caller once the transfer is complete.
+ *
+ * This is safe to be used where background IO/DMA *is* going to be modifying
+ * the buffer; using a pin rather than a ref makes sure that CoW happens
+ * correctly in the parent during fork.
+ */
+static ssize_t iov_iter_extract_user_pages_and_pin(struct iov_iter *i,
+ struct page ***pages,
+ size_t maxsize,
+ unsigned int maxpages,
+ unsigned int extract_flags,
+ size_t *offset0)
+{
+ unsigned long addr;
+ unsigned int gup_flags = FOLL_PIN | FOLL_WRITE;
+ size_t offset;
+ int res;
+
+ if (WARN_ON_ONCE(i->data_source != ITER_DEST))
+ return -EFAULT;
+
+ if (extract_flags & ITER_ALLOW_P2PDMA)
+ gup_flags |= FOLL_PCI_P2PDMA;
+ if (i->nofault)
+ gup_flags |= FOLL_NOFAULT;
+
+ addr = iov_iter_extract_first_user_segment(i, &maxsize);
+ *offset0 = offset = addr % PAGE_SIZE;
+ addr &= PAGE_MASK;
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
+ if (unlikely(res <= 0))
+ return res;
+ maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+ iov_iter_advance(i, maxsize);
+ return maxsize;
+}
+
+static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ unsigned int extract_flags,
+ size_t *offset0)
+{
+ if (iov_iter_extract_mode(i, extract_flags) == FOLL_GET)
+ return iov_iter_extract_user_pages_and_get(i, pages, maxsize,
+ maxpages, extract_flags,
+ offset0);
+ else
+ return iov_iter_extract_user_pages_and_pin(i, pages, maxsize,
+ maxpages, extract_flags,
+ offset0);
+}
+
+/**
+ * iov_iter_extract_pages - Extract a list of contiguous pages from an iterator
+ * @i: The iterator to extract from
+ * @pages: Where to return the list of pages
+ * @maxsize: The maximum amount of iterator to extract
+ * @maxpages: The maximum size of the list of pages
+ * @extract_flags: Flags to qualify request
+ * @offset0: Where to return the starting offset into (*@pages)[0]
+ *
+ * Extract a list of contiguous pages from the current point of the iterator,
+ * advancing the iterator. The maximum number of pages and the maximum amount
+ * of page contents can be set.
+ *
+ * If *@pages is NULL, a page list will be allocated to the required size and
+ * *@pages will be set to its base. If *@pages is not NULL, it will be assumed
+ * that the caller allocated a page list at least @maxpages in size and this
+ * will be filled in.
+ *
+ * @extract_flags can have ITER_ALLOW_P2PDMA set to request peer-to-peer DMA be
+ * allowed on the pages extracted.
+ *
+ * The iov_iter_extract_mode() function can be used to query how cleanup should
+ * be performed.
+ *
+ * Extra refs or pins on the pages may be obtained as follows:
+ *
+ * (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF) and data is to be
+ * transferred /OUT OF/ the buffer (@i->data_source == ITER_SOURCE), refs
+ * will be taken on the pages, but pins will not be added. This can be
+ * used for DMA from a page; it cannot be used for DMA to a page, as it
+ * may cause page-COW problems in fork. iov_iter_extract_mode() will
+ * return FOLL_GET.
+ *
+ * (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF) and data is to be
+ * transferred /INTO/ the described buffer (@i->data_source == ITER_DEST),
+ * pins will be added to the pages, but refs will not be taken. This must
+ * be used for DMA to a page. iov_iter_extract_mode() will return
+ * FOLL_PIN.
+ *
+ * (*) If the iterator is ITER_PIPE, this must describe a destination for the
+ * data. Additional pages may be allocated and added to the pipe (which
+ * will hold the refs), but neither refs nor pins will be obtained for the
+ * caller. The caller must hold the pipe lock. iov_iter_extract_mode()
+ * will return 0.
+ *
+ * (*) If the iterator is ITER_KVEC, ITER_BVEC or ITER_XARRAY, the pages are
+ * merely listed; no extra refs or pins are obtained.
+ * iov_iter_extract_mode() will return 0.
+ *
+ * Note also:
+ *
+ * (*) Peer-to-peer DMA (ITER_ALLOW_P2PDMA) is only permitted with user-backed
+ * iterators.
+ *
+ * (*) Use with ITER_DISCARD is not supported as that has no content.
+ *
+ * On success, the function sets *@pages to the new pagelist, if allocated, and
+ * sets *offset0 to the offset into the first page.
+ *
+ * It may also return -ENOMEM or -EFAULT.
+ */
+ssize_t iov_iter_extract_pages(struct iov_iter *i,
+ struct page ***pages,
+ size_t maxsize,
+ unsigned int maxpages,
+ unsigned int extract_flags,
+ size_t *offset0)
+{
+ maxsize = min_t(size_t, min_t(size_t, maxsize, i->count), MAX_RW_COUNT);
+ if (!maxsize)
+ return 0;
+
+ if (likely(user_backed_iter(i)))
+ return iov_iter_extract_user_pages(i, pages, maxsize,
+ maxpages, extract_flags,
+ offset0);
+ if (WARN_ON_ONCE(extract_flags & ITER_ALLOW_P2PDMA))
+ return -EIO;
+ if (iov_iter_is_kvec(i))
+ return iov_iter_extract_kvec_pages(i, pages, maxsize,
+ maxpages, extract_flags,
+ offset0);
+ if (iov_iter_is_bvec(i))
+ return iov_iter_extract_bvec_pages(i, pages, maxsize,
+ maxpages, extract_flags,
+ offset0);
+ if (iov_iter_is_pipe(i))
+ return iov_iter_extract_pipe_pages(i, pages, maxsize,
+ maxpages, extract_flags,
+ offset0);
+ if (iov_iter_is_xarray(i))
+ return iov_iter_extract_xarray_pages(i, pages, maxsize,
+ maxpages, extract_flags,
+ offset0);
+ return -EFAULT;
+}
+EXPORT_SYMBOL_GPL(iov_iter_extract_pages);

2023-01-20 18:13:25

by David Howells

Subject: [PATCH v7 6/8] block: Make bio structs pin pages rather than ref'ing if appropriate

Convert the block layer's bio code to use iov_iter_extract_pages() instead
of iov_iter_get_pages(). This will pin pages or leave them unaltered
rather than getting a ref on them as appropriate to the source iterator.

The pages need to be pinned for DIO-read rather than having refs taken on
them to prevent VM copy-on-write from malfunctioning during a concurrent
fork() (the result of the I/O would otherwise end up only visible to the
child process and not the parent).

To implement this:

(1) If the BIO_PAGE_REFFED flag is set, this causes attached pages to be
passed to put_page() during cleanup.

(2) A BIO_PAGE_PINNED flag is provided. If set, this causes attached
pages to be passed to unpin_user_page() during cleanup.

(3) BIO_PAGE_REFFED is set by default and BIO_PAGE_PINNED is cleared by
default when the bio is (re-)initialised.

(4) If iov_iter_extract_pages() indicates FOLL_GET, this causes
BIO_PAGE_REFFED to be set and if FOLL_PIN is indicated, this causes
BIO_PAGE_PINNED to be set. If it returns neither FOLL_* flag, then
both BIO_PAGE_* flags will be cleared.

Mixing sets of pages with different cleanup modes is not supported.

(5) Cloned bio structs have both flags cleared.

(6) bio_release_pages() will do the release if either BIO_PAGE_* flag is
set.

[!] Note that this is tested a bit with ext4, but nothing else.

Signed-off-by: David Howells <[email protected]>
cc: Al Viro <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Jan Kara <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/167305166150.1521586.10220949115402059720.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344731521.2425628.5403113335062567245.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391056047.2311931.6772604381276147664.stgit@warthog.procyon.org.uk/ # v6
---

Notes:
ver #7)
- Don't treat BIO_PAGE_REFFED/PINNED as being the same as FOLL_GET/PIN.

ver #5)
- Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
BIO_* flags and got rid of bi_cleanup_mode.
- Replaced BIO_NO_PAGE_REF with BIO_PAGE_REFFED in the preceding patch.

block/bio.c | 34 +++++++++++++++++++---------------
block/blk-map.c | 22 +++++++++++-----------
block/blk.h | 29 +++++++++++++++++++++++++++++
include/linux/bio.h | 3 ++-
4 files changed, 61 insertions(+), 27 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index cfe11f4799d1..2a6568b58501 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -245,8 +245,9 @@ static void bio_free(struct bio *bio)
* when IO has completed, or when the bio is released.
*
* We set the initial assumption that pages attached to the bio will be
- * released with put_page() by setting BIO_PAGE_REFFED; if the pages
- * should not be put, this flag should be cleared.
+ * released with put_page() by setting BIO_PAGE_REFFED; BIO_PAGE_PINNED should
+ * be set instead if the pages are to be unpinned, and if the pages should be
+ * neither put nor unpinned, both flags should be cleared.
*/
void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
unsigned short max_vecs, blk_opf_t opf)
@@ -819,6 +820,7 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
{
bio_set_flag(bio, BIO_CLONED);
bio_clear_flag(bio, BIO_PAGE_REFFED);
+ bio_clear_flag(bio, BIO_PAGE_PINNED);
bio->bi_ioprio = bio_src->bi_ioprio;
bio->bi_iter = bio_src->bi_iter;

@@ -1183,7 +1185,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
bio_for_each_segment_all(bvec, bio, iter_all) {
if (mark_dirty && !PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
- put_page(bvec->bv_page);
+ bio_release_page(bio, bvec->bv_page);
}
}
EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1220,7 +1222,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
}

if (same_page)
- put_page(page);
+ bio_release_page(bio, page);
return 0;
}

@@ -1234,7 +1236,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
queue_max_zone_append_sectors(q), &same_page) != len)
return -EINVAL;
if (same_page)
- put_page(page);
+ bio_release_page(bio, page);
return 0;
}

@@ -1245,10 +1247,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
* @bio: bio to add pages to
* @iter: iov iterator describing the region to be mapped
*
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Extracts pages from *iter and appends them to @bio's bvec array. The pages
+ * will have to be cleaned up in the way indicated by the BIO_PAGE_REFFED and
+ * BIO_PAGE_PINNED flags. For a multi-segment *iter, this function only adds
+ * pages from the next non-empty segment of the iov iterator.
*/
static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
{
@@ -1280,9 +1282,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
* result to ensure the bio's total size is correct. The remainder of
* the iov data will be picked up in the next bio iteration.
*/
- size = iov_iter_get_pages(iter, pages,
- UINT_MAX - bio->bi_iter.bi_size,
- nr_pages, &offset, extract_flags);
+ size = iov_iter_extract_pages(iter, &pages,
+ UINT_MAX - bio->bi_iter.bi_size,
+ nr_pages, extract_flags, &offset);
if (unlikely(size <= 0))
return size ? size : -EFAULT;

@@ -1315,7 +1317,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
iov_iter_revert(iter, left);
out:
while (i < nr_pages)
- put_page(pages[i++]);
+ bio_release_page(bio, pages[i++]);

return ret;
}
@@ -1344,6 +1346,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
{
int ret = 0;

+ bio_set_cleanup_mode(bio, iter, 0);
+
if (iov_iter_is_bvec(iter)) {
bio_iov_bvec_set(bio, iter);
iov_iter_advance(iter, bio->bi_iter.bi_size);
@@ -1496,8 +1500,8 @@ void bio_set_pages_dirty(struct bio *bio)
* the BIO and re-dirty the pages in process context.
*
* It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on. It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on. It will run one put_page() or unpin_user_page() against each page
+ * and will run one bio_put() against the BIO.
*/

static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk-map.c b/block/blk-map.c
index bc111261fc82..7d1bc75b9cf2 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -282,20 +282,20 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
if (blk_queue_pci_p2pdma(rq->q))
extract_flags |= ITER_ALLOW_P2PDMA;

+ bio_set_cleanup_mode(bio, iter, extract_flags);
+
while (iov_iter_count(iter)) {
- struct page **pages, *stack_pages[UIO_FASTIOV];
+ struct page *stack_pages[UIO_FASTIOV];
+ struct page **pages = stack_pages;
ssize_t bytes;
size_t offs;
int npages;

- if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
- pages = stack_pages;
- bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
- nr_vecs, &offs, extract_flags);
- } else {
- bytes = iov_iter_get_pages_alloc(iter, &pages,
- LONG_MAX, &offs, extract_flags);
- }
+ if (nr_vecs > ARRAY_SIZE(stack_pages))
+ pages = NULL;
+
+ bytes = iov_iter_extract_pages(iter, &pages, LONG_MAX,
+ nr_vecs, extract_flags, &offs);
if (unlikely(bytes <= 0)) {
ret = bytes ? bytes : -EFAULT;
goto out_unmap;
@@ -317,7 +317,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
if (!bio_add_hw_page(rq->q, bio, page, n, offs,
max_sectors, &same_page)) {
if (same_page)
- put_page(page);
+ bio_release_page(bio, page);
break;
}

@@ -329,7 +329,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
* release the pages we didn't map into the bio, if any
*/
while (j < npages)
- put_page(pages[j++]);
+ bio_release_page(bio, pages[j++]);
if (pages != stack_pages)
kvfree(pages);
/* couldn't stuff something into bio? */
diff --git a/block/blk.h b/block/blk.h
index 4c3b3325219a..16c8a7a84a16 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -425,6 +425,35 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
struct page *page, unsigned int len, unsigned int offset,
unsigned int max_sectors, bool *same_page);

+/*
+ * Set the cleanup mode for a bio from an iterator and the extraction flags.
+ */
+static inline void bio_set_cleanup_mode(struct bio *bio, struct iov_iter *iter,
+ unsigned int extract_flags)
+{
+ unsigned int cleanup_mode;
+
+ bio_clear_flag(bio, BIO_PAGE_REFFED);
+ cleanup_mode = iov_iter_extract_mode(iter, extract_flags);
+ if (cleanup_mode & FOLL_GET)
+ bio_set_flag(bio, BIO_PAGE_REFFED);
+ if (cleanup_mode & FOLL_PIN)
+ bio_set_flag(bio, BIO_PAGE_PINNED);
+}
+
+/*
+ * Clean up a page appropriately, where the page may be pinned, may have a
+ * ref taken on it or neither.
+ */
+static inline void bio_release_page(struct bio *bio, struct page *page)
+{
+ unsigned int gup_flags = 0;
+
+ gup_flags |= bio_flagged(bio, BIO_PAGE_REFFED) ? FOLL_GET : 0;
+ gup_flags |= bio_flagged(bio, BIO_PAGE_PINNED) ? FOLL_PIN : 0;
+ page_put_unpin(page, gup_flags);
+}
+
struct request_queue *blk_alloc_queue(int node_id);

int disk_scan_partitions(struct gendisk *disk, fmode_t mode, void *owner);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 63bfd91793f9..1c6f051f6ff2 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -482,7 +482,8 @@ void zero_fill_bio(struct bio *bio);

static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
{
- if (bio_flagged(bio, BIO_PAGE_REFFED))
+ if (bio_flagged(bio, BIO_PAGE_REFFED) ||
+ bio_flagged(bio, BIO_PAGE_PINNED))
__bio_release_pages(bio, mark_dirty);
}


2023-01-20 18:14:08

by David Howells

Subject: [PATCH v7 7/8] block: Fix bio_flagged() so that gcc can better optimise it

Fix bio_flagged() so that multiple instances of it, such as:

if (bio_flagged(bio, BIO_PAGE_REFFED) ||
bio_flagged(bio, BIO_PAGE_PINNED))

can be combined by the gcc optimiser into a single test in assembly
(arguably, this is a compiler optimisation issue[1]).

The missed optimisation stems from bio_flagged() comparing the result of
the bitwise-AND to zero. This results in an out-of-line bio_release_pages()
being compiled to something like:

<+0>: mov 0x14(%rdi),%eax
<+3>: test $0x1,%al
<+5>: jne 0xffffffff816dac53 <bio_release_pages+11>
<+7>: test $0x2,%al
<+9>: je 0xffffffff816dac5c <bio_release_pages+20>
<+11>: movzbl %sil,%esi
<+15>: jmp 0xffffffff816daba1 <__bio_release_pages>
<+20>: jmp 0xffffffff81d0b800 <__x86_return_thunk>

However, the test is superfluous as the return type is bool. Removing it
results in:

<+0>: testb $0x3,0x14(%rdi)
<+4>: je 0xffffffff816e4af4 <bio_release_pages+15>
<+6>: movzbl %sil,%esi
<+10>: jmp 0xffffffff816dab7c <__bio_release_pages>
<+15>: jmp 0xffffffff81d0b7c0 <__x86_return_thunk>

instead.

Also, the MOVZBL instruction looks unnecessary[2] - I think it's just
're-booling' the mark_dirty parameter.

Fixes: b7c44ed9d2fc ("block: manipulate bio->bi_flags through helpers")
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
cc: Jens Axboe <[email protected]>
cc: [email protected]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108370 [1]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108371 [2]
Link: https://lore.kernel.org/r/167391056756.2311931.356007731815807265.stgit@warthog.procyon.org.uk/ # v6
---
include/linux/bio.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1c6f051f6ff2..2e6109b0fca8 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -227,7 +227,7 @@ static inline void bio_cnt_set(struct bio *bio, unsigned int count)

static inline bool bio_flagged(struct bio *bio, unsigned int bit)
{
- return (bio->bi_flags & (1U << bit)) != 0;
+ return bio->bi_flags & (1U << bit);
}

static inline void bio_set_flag(struct bio *bio, unsigned int bit)

2023-01-20 18:28:29

by David Howells

Subject: [PATCH v7 3/8] mm: Provide a helper to drop a pin/ref on a page

Provide a helper in the get_user_pages code to drop a pin or a ref on a
page, based on being given FOLL_GET or FOLL_PIN in its flags argument, or to
do nothing if neither is set.

Signed-off-by: David Howells <[email protected]>
cc: Al Viro <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
include/linux/mm.h | 3 +++
mm/gup.c | 22 ++++++++++++++++++++++
2 files changed, 25 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f196e4d66d..f1cf8f4eb946 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1367,6 +1367,9 @@ static inline bool is_cow_mapping(vm_flags_t flags)
#define SECTION_IN_PAGE_FLAGS
#endif

+void folio_put_unpin(struct folio *folio, unsigned int flags);
+void page_put_unpin(struct page *page, unsigned int flags);
+
/*
* The identification function is mainly used by the buddy allocator for
* determining if two pages could be buddies. We are not really identifying
diff --git a/mm/gup.c b/mm/gup.c
index f45a3a5be53a..3ee4b4c7e0cb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -191,6 +191,28 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
folio_put_refs(folio, refs);
}

+/**
+ * folio_put_unpin - Unpin/put a folio as appropriate
+ * @folio: The folio to release
+ * @flags: gup flags indicating the mode of release (FOLL_*)
+ *
+ * Release a folio according to the flags. If FOLL_GET is set, the folio has a
+ * ref dropped; if FOLL_PIN is set, it is unpinned; otherwise it is left
+ * unaltered.
+ */
+void folio_put_unpin(struct folio *folio, unsigned int flags)
+{
+ if (flags & (FOLL_GET | FOLL_PIN))
+ gup_put_folio(folio, 1, flags);
+}
+EXPORT_SYMBOL_GPL(folio_put_unpin);
+
+void page_put_unpin(struct page *page, unsigned int flags)
+{
+ folio_put_unpin(page_folio(page), flags);
+}
+EXPORT_SYMBOL_GPL(page_put_unpin);
+
/**
* try_grab_page() - elevate a page's refcount by a flag-dependent amount
* @page: pointer to page to be grabbed

2023-01-20 18:28:33

by David Howells

Subject: [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down

Renumber FOLL_GET and FOLL_PIN down to bit 0 and 1 respectively so that
they are coincidentally the same as BIO_PAGE_REFFED and BIO_PAGE_PINNED and
also so that they can be stored in the bottom two bits of a page pointer
(something I'm looking at for zerocopy socket fragments).

Signed-off-by: David Howells <[email protected]>
cc: Al Viro <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
include/linux/mm.h | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f1cf8f4eb946..33c9eacd9548 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3074,12 +3074,13 @@ static inline vm_fault_t vmf_error(int err)
struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
unsigned int foll_flags);

-#define FOLL_WRITE 0x01 /* check pte is writable */
-#define FOLL_TOUCH 0x02 /* mark page accessed */
-#define FOLL_GET 0x04 /* do get_page on page */
-#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
-#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
-#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
+#define FOLL_GET 0x01 /* do get_page on page (equivalent to BIO_PAGE_REFFED) */
+#define FOLL_PIN 0x02 /* pages must be released via unpin_user_page */
+#define FOLL_WRITE 0x04 /* check pte is writable */
+#define FOLL_TOUCH 0x08 /* mark page accessed */
+#define FOLL_DUMP 0x10 /* give error on hole if it would be zero */
+#define FOLL_FORCE 0x20 /* get_user_pages read/write w/o permission */
+#define FOLL_NOWAIT 0x40 /* if a disk transfer is needed, start the IO
* and return without waiting upon it */
#define FOLL_NOFAULT 0x80 /* do not fault in pages */
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
@@ -3088,7 +3089,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_ANON 0x8000 /* don't do file mappings */
#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
-#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
#define FOLL_PCI_P2PDMA 0x100000 /* allow returning PCI P2PDMA pages */
#define FOLL_INTERRUPTIBLE 0x200000 /* allow interrupts from generic signals */

2023-01-20 18:30:14

by David Howells

Subject: [PATCH v7 5/8] block: Add BIO_PAGE_PINNED

Add BIO_PAGE_PINNED to indicate that the pages in a bio were pinned
(FOLL_PIN) and that the pins will need removing.

Signed-off-by: David Howells <[email protected]>
cc: Al Viro <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Jan Kara <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: [email protected]
---
include/linux/blk_types.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 86711fb0534a..42b40156c517 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -319,6 +319,7 @@ struct bio {
*/
enum {
BIO_PAGE_REFFED, /* Pages need refs putting (equivalent to FOLL_GET) */
+ BIO_PAGE_PINNED, /* Pages need unpinning (equivalent to FOLL_PIN) */
BIO_CLONED, /* doesn't own data */
BIO_BOUNCED, /* bio is a bounce bio */
BIO_QUIET, /* Make BIO Quiet */

2023-01-20 18:30:43

by David Howells

Subject: [PATCH v7 1/8] iov_iter: Define flags to qualify page extraction.

Define flags to qualify page extraction to pass into iov_iter_*_pages*()
rather than passing in FOLL_* flags.

For now, only a flag to allow peer-to-peer DMA is defined; in future,
additional flags will be provided to indicate the I/O direction.

Signed-off-by: David Howells <[email protected]>
cc: Al Viro <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: [email protected]
cc: [email protected]
---

Notes:
ver #7)
- Don't use FOLL_* as a parameter, but rather define constants
specifically to use with iov_iter_*_pages*().
- Drop the I/O direction constants for now.

block/bio.c | 6 +++---
block/blk-map.c | 8 ++++----
include/linux/uio.h | 7 +++++--
lib/iov_iter.c | 14 ++++++++------
4 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5f96fcae3f75..6a6c73d14bfd 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1249,7 +1249,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
struct page **pages = (struct page **)bv;
- unsigned int gup_flags = 0;
+ unsigned int extract_flags = 0;
ssize_t size, left;
unsigned len, i = 0;
size_t offset, trim;
@@ -1264,7 +1264,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);

if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
- gup_flags |= FOLL_PCI_P2PDMA;
+ extract_flags |= ITER_ALLOW_P2PDMA;

/*
* Each segment in the iov is required to be a block size multiple.
@@ -1275,7 +1275,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
*/
size = iov_iter_get_pages(iter, pages,
UINT_MAX - bio->bi_iter.bi_size,
- nr_pages, &offset, gup_flags);
+ nr_pages, &offset, extract_flags);
if (unlikely(size <= 0))
return size ? size : -EFAULT;

diff --git a/block/blk-map.c b/block/blk-map.c
index 19940c978c73..bc111261fc82 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -267,7 +267,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
{
unsigned int max_sectors = queue_max_hw_sectors(rq->q);
unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
- unsigned int gup_flags = 0;
+ unsigned int extract_flags = 0;
struct bio *bio;
int ret;
int j;
@@ -280,7 +280,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
return -ENOMEM;

if (blk_queue_pci_p2pdma(rq->q))
- gup_flags |= FOLL_PCI_P2PDMA;
+ extract_flags |= ITER_ALLOW_P2PDMA;

while (iov_iter_count(iter)) {
struct page **pages, *stack_pages[UIO_FASTIOV];
@@ -291,10 +291,10 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
pages = stack_pages;
bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
- nr_vecs, &offs, gup_flags);
+ nr_vecs, &offs, extract_flags);
} else {
bytes = iov_iter_get_pages_alloc(iter, &pages,
- LONG_MAX, &offs, gup_flags);
+ LONG_MAX, &offs, extract_flags);
}
if (unlikely(bytes <= 0)) {
ret = bytes ? bytes : -EFAULT;
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 9f158238edba..46d5080314c6 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -252,12 +252,12 @@ void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *
loff_t start, size_t count);
ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start,
- unsigned gup_flags);
+ unsigned extract_flags);
ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start);
ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
struct page ***pages, size_t maxsize, size_t *start,
- unsigned gup_flags);
+ unsigned extract_flags);
ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
size_t maxsize, size_t *start);
int iov_iter_npages(const struct iov_iter *i, int maxpages);
@@ -360,4 +360,7 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
};
}

+/* Flags for iov_iter_get/extract_pages*() */
+#define ITER_ALLOW_P2PDMA 0x01 /* Allow P2PDMA on the extracted pages */
+
#endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f9a3ff37ecd1..fb04abe7d746 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1432,9 +1432,9 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
struct page ***pages, size_t maxsize,
unsigned int maxpages, size_t *start,
- unsigned int gup_flags)
+ unsigned int extract_flags)
{
- unsigned int n;
+ unsigned int n, gup_flags = 0;

if (maxsize > i->count)
maxsize = i->count;
@@ -1442,6 +1442,8 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
return 0;
if (maxsize > MAX_RW_COUNT)
maxsize = MAX_RW_COUNT;
+ if (extract_flags & ITER_ALLOW_P2PDMA)
+ gup_flags |= FOLL_PCI_P2PDMA;

if (likely(user_backed_iter(i))) {
unsigned long addr;
@@ -1495,14 +1497,14 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,

ssize_t iov_iter_get_pages(struct iov_iter *i,
struct page **pages, size_t maxsize, unsigned maxpages,
- size_t *start, unsigned gup_flags)
+ size_t *start, unsigned extract_flags)
{
if (!maxpages)
return 0;
BUG_ON(!pages);

return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages,
- start, gup_flags);
+ start, extract_flags);
}
EXPORT_SYMBOL_GPL(iov_iter_get_pages);

@@ -1515,14 +1517,14 @@ EXPORT_SYMBOL(iov_iter_get_pages2);

ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
struct page ***pages, size_t maxsize,
- size_t *start, unsigned gup_flags)
+ size_t *start, unsigned extract_flags)
{
ssize_t len;

*pages = NULL;

len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
- gup_flags);
+ extract_flags);
if (len <= 0) {
kvfree(*pages);
*pages = NULL;

2023-01-20 18:33:03

by David Howells

Subject: [PATCH v7 4/8] block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning

Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning. In a
following patch I intend to add a BIO_PAGE_PINNED flag to indicate that the
pages need unpinning, so that both flags have the same positive sense.

Signed-off-by: David Howells <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
cc: Al Viro <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Jan Kara <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/167305166150.1521586.10220949115402059720.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344730802.2425628.14034153595667416149.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391054631.2311931.7588488803802952158.stgit@warthog.procyon.org.uk/ # v6
---

Notes:
ver #5)
- Split from patch that uses iov_iter_extract_pages().

block/bio.c | 9 ++++++++-
fs/iomap/direct-io.c | 1 -
include/linux/bio.h | 2 +-
include/linux/blk_types.h | 2 +-
4 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 6a6c73d14bfd..cfe11f4799d1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -243,6 +243,10 @@ static void bio_free(struct bio *bio)
* Users of this function have their own bio allocation. Subsequently,
* they must remember to pair any call to bio_init() with bio_uninit()
* when IO has completed, or when the bio is released.
+ *
+ * We set the initial assumption that pages attached to the bio will be
+ * released with put_page() by setting BIO_PAGE_REFFED; if the pages
+ * should not be put, this flag should be cleared.
*/
void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
unsigned short max_vecs, blk_opf_t opf)
@@ -274,6 +278,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
#ifdef CONFIG_BLK_DEV_INTEGRITY
bio->bi_integrity = NULL;
#endif
+ bio_set_flag(bio, BIO_PAGE_REFFED);
bio->bi_vcnt = 0;

atomic_set(&bio->__bi_remaining, 1);
@@ -302,6 +307,7 @@ void bio_reset(struct bio *bio, struct block_device *bdev, blk_opf_t opf)
{
bio_uninit(bio);
memset(bio, 0, BIO_RESET_BYTES);
+ bio_set_flag(bio, BIO_PAGE_REFFED);
atomic_set(&bio->__bi_remaining, 1);
bio->bi_bdev = bdev;
if (bio->bi_bdev)
@@ -812,6 +818,7 @@ EXPORT_SYMBOL(bio_put);
static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
{
bio_set_flag(bio, BIO_CLONED);
+ bio_clear_flag(bio, BIO_PAGE_REFFED);
bio->bi_ioprio = bio_src->bi_ioprio;
bio->bi_iter = bio_src->bi_iter;

@@ -1198,7 +1205,7 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
bio->bi_io_vec = (struct bio_vec *)iter->bvec;
bio->bi_iter.bi_bvec_done = iter->iov_offset;
bio->bi_iter.bi_size = size;
- bio_set_flag(bio, BIO_NO_PAGE_REF);
+ bio_clear_flag(bio, BIO_PAGE_REFFED);
bio_set_flag(bio, BIO_CLONED);
}

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9804714b1751..c0e75900e754 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -202,7 +202,6 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;

- get_page(page);
__bio_add_page(bio, page, len, 0);
iomap_dio_submit_bio(iter, dio, bio, pos);
}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 22078a28d7cb..63bfd91793f9 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -482,7 +482,7 @@ void zero_fill_bio(struct bio *bio);

static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
{
- if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+ if (bio_flagged(bio, BIO_PAGE_REFFED))
__bio_release_pages(bio, mark_dirty);
}

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..86711fb0534a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -318,7 +318,7 @@ struct bio {
* bio flags
*/
enum {
- BIO_NO_PAGE_REF, /* don't put release vec pages */
+ BIO_PAGE_REFFED, /* Pages need refs putting (equivalent to FOLL_GET) */
BIO_CLONED, /* doesn't own data */
BIO_BOUNCED, /* bio is a bounce bio */
BIO_QUIET, /* Make BIO Quiet */

2023-01-20 19:05:29

by Matthew Wilcox

Subject: Re: [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down

On Fri, Jan 20, 2023 at 05:55:56PM +0000, David Howells wrote:
> Renumber FOLL_GET and FOLL_PIN down to bit 0 and 1 respectively so that
> they are coincidentally the same as BIO_PAGE_REFFED and BIO_PAGE_PINNED and
> also so that they can be stored in the bottom two bits of a page pointer
> (something I'm looking at for zerocopy socket fragments).
>
> Signed-off-by: David Howells <[email protected]>

Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>

Although I was hoping you'd renumber some of the others since we
currently have gaps at 0x200, 0x400, 0x1000, and 0x4000 as well
as the 0x40 you're using here.

2023-01-20 20:06:39

by David Howells

Subject: Re: [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down

Matthew Wilcox <[email protected]> wrote:

> Although I was hoping you'd renumber some of the others since we
> currently have gaps at 0x200, 0x400, 0x1000, and 0x4000 as well
> as the 0x40 you're using here.

Since the other patches don't require this one, I can renumber them all and
send it directly to Andrew.

David

2023-01-21 13:06:43

by Christoph Hellwig

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

> +#define iov_iter_extract_mode(iter, extract_flags) \
> + (user_backed_iter(iter) ? \
> + (iter->data_source == ITER_SOURCE) ? \
> + FOLL_GET : FOLL_PIN : 0)

The extract_flags argument is unused here.

2023-01-21 13:10:42

by Christoph Hellwig

Subject: Re: [PATCH v7 4/8] block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning

On Fri, Jan 20, 2023 at 05:55:52PM +0000, David Howells wrote:
> Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning. In a
> following patch I intend to add a BIO_PAGE_PINNED flag to indicate that the
> pages need unpinning, so that both flags have the same positive sense.
>
> Signed-off-by: David Howells <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>

As pointed out last time, we really should not set the flag by default.
When a bio is allocated there are no pages to be released, that only
happens when we add dio-like pages to the bio. Instead of explaining
why I mean again, I've put together a version of this series that
implements this and my other suggestions here:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dio-pin-pages

This also tests fine with xfs and btrfs and nvme passthrough I/O.

2023-01-21 13:31:20

by Christoph Hellwig

Subject: Re: [PATCH v7 6/8] block: Make bio structs pin pages rather than ref'ing if appropriate

This really does three things:

1) add the BIO_PAGE_PINNED infrastructure
2) convert bio_iov_iter_get_pages to use iov_iter_extract_pages
3) convert bio_map_user_iov to use iov_iter_extract_pages

so it should be three patches, with the first one also containing
what is in your previous patch.

> @@ -1183,7 +1185,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
> bio_for_each_segment_all(bvec, bio, iter_all) {
> if (mark_dirty && !PageCompound(bvec->bv_page))
> set_page_dirty_lock(bvec->bv_page);
> - put_page(bvec->bv_page);
> + bio_release_page(bio, bvec->bv_page);

This can be micro-optimized a bit by only doing the flags conversion
once.

2023-01-21 13:31:33

by Christoph Hellwig

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

On Fri, Jan 20, 2023 at 05:55:50PM +0000, David Howells wrote:
> (3) Any other sort of iterator.
>
> No refs or pins are obtained on the page, the assumption is made that
> the caller will manage page retention. ITER_ALLOW_P2PDMA is not
> permitted.

The ITER_ALLOW_P2PDMA check here is fundamentally wrong and will
break things like io_uring to fixed buffers on NVMe. ITER_ALLOW_P2PDMA
should strictly increase the group of acceptable iters; it must not
restrict anything.

2023-01-21 13:38:31

by Christoph Hellwig

Subject: Re: [PATCH v7 1/8] iov_iter: Define flags to qualify page extraction.

On Fri, Jan 20, 2023 at 05:55:49PM +0000, David Howells wrote:
> For now only a flag to allow peer-to-peer DMA is allowed; but in future,
> additional flags will be provided to indicate the I/O direction.

Al still doesn't seem to be sold on the latter part, so maybe just
drop this sentence for now.

2023-01-21 13:49:56

by Christoph Hellwig

Subject: Re: [PATCH v7 5/8] block: Add BIO_PAGE_PINNED

On Fri, Jan 20, 2023 at 05:55:53PM +0000, David Howells wrote:
> BIO_PAGE_PINNED to indicate that the pages in a bio were pinned (FOLL_PIN)
> and that the pin will need removing.

Just adding the flag without the infrastructure and users is a bit
silly. See the link to my branch to see what I think is better split.

2023-01-21 14:29:01

by Christoph Hellwig

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

On Sat, Jan 21, 2023 at 01:30:47PM +0000, David Howells wrote:
> So just drop the check? Or do I actually need to do something to the pages to
> make it work? If I understand the code correctly, it's not actually operable
> right now on any page that doesn't have an appropriately marked PTE.

Yes, just drop the check. The flag just signals that the caller can
deal with p2p pages. If it gets any or not is a totally different
story.

2023-01-21 14:55:19

by David Howells

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

Christoph Hellwig <[email protected]> wrote:

> On Fri, Jan 20, 2023 at 05:55:50PM +0000, David Howells wrote:
> > (3) Any other sort of iterator.
> >
> > No refs or pins are obtained on the page, the assumption is made that
> > the caller will manage page retention. ITER_ALLOW_P2PDMA is not
> > permitted.
>
> The ITER_ALLOW_P2PDMA check here is fundamentally wrong and will
> break things like io_uring to fixed buffers on NVMe. ITER_ALLOW_P2PDMA
> should strictly increase the group of acceptable iters; it must not
> restrict anything.

So just drop the check? Or do I actually need to do something to the pages to
make it work? If I understand the code correctly, it's not actually operable
right now on any page that doesn't have an appropriately marked PTE.

David

2023-01-23 09:39:42

by David Howells

Subject: Re: [PATCH v7 4/8] block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning

Christoph Hellwig <[email protected]> wrote:

> ... I've put together a version of this series that implements this and my
> other suggestions here:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dio-pin-pages
>
> This also tests fine with xfs and btrfs and nvme passthrough I/O.

That looks okay. Do you need to put a Co-developed-by tag on the "block:
invert BIO_NO_PAGE_REF" patch?

Should I take that set of patches and send it on to Linus when the window
opens? Or should it go through the block tree?

David


2023-01-23 09:57:13

by Christoph Hellwig

Subject: Re: [PATCH v7 4/8] block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning

On Mon, Jan 23, 2023 at 09:38:45AM +0000, David Howells wrote:
> That looks okay. Do you need to put a Co-developed-by tag on the "block:
> invert BIO_NO_PAGE_REF" patch?

I can do if you want, although there's really not much left of the
original.

> Should I take that set of patches and send it on to Linus when the window
> opens? Or should it go through the block tree?

I think the block tree would be better.

2023-01-23 11:29:46

by David Hildenbrand

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

On 20.01.23 18:55, David Howells wrote:
> Add a function, iov_iter_extract_pages(), to extract a list of pages from
> an iterator. The pages may be returned with a reference added or a pin
> added or neither, depending on the type of iterator and the direction of
> transfer. The caller must pass FOLL_READ_FROM_MEM or FOLL_WRITE_TO_MEM
> as part of gup_flags to indicate how the iterator contents are to be used.
>
> Add a second function, iov_iter_extract_mode(), to determine how the
> cleanup should be done.
>
> There are three cases:
>
> (1) Transfer *into* an ITER_IOVEC or ITER_UBUF iterator.
>
> Extracted pages will have pins obtained on them (but not references)
> so that fork() doesn't CoW the pages incorrectly whilst the I/O is in
> progress.
>
> iov_iter_extract_mode() will return FOLL_PIN for this case. The
> caller should use something like unpin_user_page() to dispose of the
> page.
>
> (2) Transfer is *out of* an ITER_IOVEC or ITER_UBUF iterator.
>
> Extracted pages will have references obtained on them, but not pins.
>
> iov_iter_extract_mode() will return FOLL_GET. The caller should use
> something like put_page() for page disposal.
>
> (3) Any other sort of iterator.
>
> No refs or pins are obtained on the page, the assumption is made that
> the caller will manage page retention. ITER_ALLOW_P2PDMA is not
> permitted.
>
> iov_iter_extract_mode() will return 0. The pages don't need
> additional disposal.
>
> Signed-off-by: David Howells <[email protected]>
> cc: Al Viro <[email protected]>
> cc: Christoph Hellwig <[email protected]>
> cc: John Hubbard <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]
> cc: [email protected]
> Link: https://lore.kernel.org/r/166920903885.1461876.692029808682876184.stgit@warthog.procyon.org.uk/ # v2
> Link: https://lore.kernel.org/r/166997421646.9475.14837976344157464997.stgit@warthog.procyon.org.uk/ # v3
> Link: https://lore.kernel.org/r/167305163883.1521586.10777155475378874823.stgit@warthog.procyon.org.uk/ # v4
> Link: https://lore.kernel.org/r/167344728530.2425628.9613910866466387722.stgit@warthog.procyon.org.uk/ # v5
> Link: https://lore.kernel.org/r/167391053207.2311931.16398133457201442907.stgit@warthog.procyon.org.uk/ # v6
> ---
>
> Notes:
> ver #7)
> - Switch to passing in iter-specific flags rather than FOLL_* flags.
> - Drop the direction flags for now.
> - Use ITER_ALLOW_P2PDMA to request FOLL_PCI_P2PDMA.
> - Disallow use of ITER_ALLOW_P2PDMA with non-user-backed iter.
> - Add support for extraction from KVEC-type iters.
> - Use iov_iter_advance() rather than open-coding it.
> - Make BVEC- and KVEC-type skip over initial empty vectors.
>
> ver #6)
> - Add back the function to indicate the cleanup mode.
> - Drop the cleanup_mode return arg to iov_iter_extract_pages().
> - Pass FOLL_SOURCE/DEST_BUF in gup_flags. Check this against the iter
> data_source.
>
> ver #4)
> - Use ITER_SOURCE/DEST instead of WRITE/READ.
> - Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.
>
> ver #3)
> - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
> to get/pin_user_pages_fast()[1].
>
> include/linux/uio.h | 28 +++
> lib/iov_iter.c | 424 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 452 insertions(+)
>
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 46d5080314c6..a4233049ab7a 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -363,4 +363,32 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
> /* Flags for iov_iter_get/extract_pages*() */
> #define ITER_ALLOW_P2PDMA 0x01 /* Allow P2PDMA on the extracted pages */
>
> +ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
> + size_t maxsize, unsigned int maxpages,
> + unsigned int extract_flags, size_t *offset0);
> +
> +/**
> + * iov_iter_extract_mode - Indicate how pages from the iterator will be retained
> + * @iter: The iterator
> + * @extract_flags: How the iterator is to be used
> + *
> + * Examine the iterator and @extract_flags and indicate by returning FOLL_PIN,
> + * FOLL_GET or 0 as to how, if at all, pages extracted from the iterator will
> + * be retained by the extraction function.
> + *
> + * FOLL_GET indicates that the pages will have a reference taken on them that
> + * the caller must put. This can be done for DMA/async DIO write from a page.
> + *
> + * FOLL_PIN indicates that the pages will have a pin placed in them that the
> + * caller must unpin. This must be done for DMA/async DIO read to a page to
> + * avoid CoW problems in fork.
> + *
> + * 0 indicates that no measures are taken and that it's up to the caller to
> + * retain the pages.
> + */
> +#define iov_iter_extract_mode(iter, extract_flags) \
> + (user_backed_iter(iter) ? \
> + (iter->data_source == ITER_SOURCE) ? \
> + FOLL_GET : FOLL_PIN : 0)
> +
>

How does this work align with the goal of no longer using FOLL_GET for
O_DIRECT? We should get rid of any FOLL_GET usage for accessing page
content.

@John, any comments?

--
Thanks,

David / dhildenb


2023-01-23 11:30:02

by David Howells

Subject: Re: [PATCH v7 6/8] block: Make bio structs pin pages rather than ref'ing if appropriate

In this function:

void __bio_release_pages(struct bio *bio, bool mark_dirty)
{
	unsigned int gup_flags = bio_to_gup_flags(bio);
	struct bvec_iter_all iter_all;
	struct bio_vec *bvec;

	bio_for_each_segment_all(bvec, bio, iter_all) {
		if (mark_dirty && !PageCompound(bvec->bv_page))
			set_page_dirty_lock(bvec->bv_page);
>>>>		page_put_unpin(bvec->bv_page, gup_flags);
	}
}

that ought to be a call to bio_release_page(), but the optimiser doesn't want
to inline it:-/

I found the only way I can get the compiler to properly inline it without it
repeating the calculations is to renumber the FOLL_* constants down and then
make bio_release_page() something like:

static inline __attribute__((always_inline))
void bio_release_page(struct bio *bio, struct page *page)
{
	page_put_unpin(page,
		       ((bio->bi_flags & (1 << BIO_PAGE_REFFED)) ? FOLL_GET : 0) |
		       ((bio->bi_flags & (1 << BIO_PAGE_PINNED)) ? FOLL_PIN : 0));
}

I guess the compiler optimiser isn't perfect yet:-)

David


2023-01-23 11:51:59

by David Howells

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

David Hildenbrand <[email protected]> wrote:

> How does this work align with the goal of no longer using FOLL_GET for
> O_DIRECT? We should get rid of any FOLL_GET usage for accessing page content.

Would that run the risk of changes being made by the child being visible to
a DIO write if the parent changes the buffer first?


PARENT                  CHILD
======                  =====
start-DIO-write
fork() = pid            fork() = 0
alter-buffer
CoW happens
page copied             original page retained
                        alter-buffer
                        <DMA-happens>

David


2023-01-23 12:01:46

by David Howells

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

David Howells <[email protected]> wrote:

> > How does this work align with the goal of no longer using FOLL_GET for
> > O_DIRECT? We should get rid of any FOLL_GET usage for accessing page content.
>
> Would that run the risk of changes being made by the child being visible to
> a DIO write if the parent changes the buffer first?
>
>
> PARENT                  CHILD
> ======                  =====
> start-DIO-write
> fork() = pid            fork() = 0
> alter-buffer
> CoW happens
> page copied             original page retained
>                         alter-buffer
>                         <DMA-happens>

Ah, I think I might have got the wrong end of the stick. A pinned page is
*always* copied on fork() if I understand copy_present_pte() correctly.

David


2023-01-23 13:12:53

by David Hildenbrand

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

On 23.01.23 12:51, David Howells wrote:
> David Hildenbrand <[email protected]> wrote:
>
>> How does this work align with the goal of no longer using FOLL_GET for
>> O_DIRECT? We should get rid of any FOLL_GET usage for accessing page content.
>
> Would that run the risk of changes being made by the child being visible to
> a DIO write if the parent changes the buffer first?
>
>
> PARENT                  CHILD
> ======                  =====
> start-DIO-write
> fork() = pid            fork() = 0
> alter-buffer
> CoW happens
> page copied             original page retained
>                         alter-buffer
>                         <DMA-happens>

FOLL_PIN users are fine in that regard, because we properly detect
"maybe pinned" during fork() and copy the page. See
tools/testing/selftests/mm/cow.c (still called
tools/testing/selftests/vm/cow.c upstream IIRC) for some test cases for
that handling.

FOLL_GET does not work as expected in that regard: pages can't be
detected as pinned and we won't be copying them during fork(). We'll end
up COW-sharing them, which can result in trouble later.

Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure
what the status is. Interestingly,
Documentation/core-api/pin_user_pages.rst already documents that "CASE
1: Direct IO (DIO)" uses FOLL_PIN ... which does, unfortunately, not
reflect reality yet.

--
Thanks,

David / dhildenb


2023-01-23 13:20:53

by David Howells

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

David Hildenbrand <[email protected]> wrote:

> Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
> the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
> already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
> unfortunately, not reflect reality yet.

Yeah - I just came across that.

Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
get_user_pages()? In which case my patches only need to keep track of
pinned/not-pinned and never "got".

David


2023-01-23 13:25:11

by David Hildenbrand

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

On 23.01.23 14:19, David Howells wrote:
> David Hildenbrand <[email protected]> wrote:
>
>> Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
>> the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
>> already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
> unfortunately, not reflect reality yet.
>
> Yeah - I just came across that.
>
> Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
> get_user_pages()? In which case my patches only need to keep track of
> pinned/not-pinned and never "got".

That would be the ideal case: whenever intending to access page content,
use FOLL_PIN instead of FOLL_GET.

The issue that John was trying to sort out was that there are plenty of
callsites that do a simple put_page() instead of calling
unpin_user_page(). IIRC, handling that correctly in existing code --
what was pinned must be released via unpin_user_page() -- was the
biggest workitem.

Not sure how that relates to your work here (that's why I was asking):
if you could avoid FOLL_GET, that would be great :)

--
Thanks,

David / dhildenb


2023-01-23 13:39:39

by David Howells

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

David Hildenbrand <[email protected]> wrote:

> That would be the ideal case: whenever intending to access page content, use
> FOLL_PIN instead of FOLL_GET.
>
> The issue that John was trying to sort out was that there are plenty of
> callsites that do a simple put_page() instead of calling
> unpin_user_page(). IIRC, handling that correctly in existing code -- what was
> pinned must be released via unpin_user_page() -- was the biggest workitem.
>
> Not sure how that relates to your work here (that's why I was asking): if you
> could avoid FOLL_GET, that would be great :)

Well, it simplifies things a bit.

I can make the new iov_iter_extract_pages() just do "pin" or "don't pin" and
do no ref-getting at all. Things can be converted over to "unpin the pages or
do nothing" as they're converted over to using iov_iter_extract_pages()
from iov_iter_get_pages*().

The block bio code then only needs a single bit of state: pinned or not
pinned.

For cifs RDMA, do I need to make it pass in FOLL_LONGTERM? And does that need
a special cleanup?

sk_buff fragment handling could still be tricky. I'm thinking that in that
code I'll need to store FOLL_GET/PIN in the bottom two bits of the frag page
pointer. Sometimes it allocates a new page and attaches it (have ref);
sometimes it does zerocopy to/from a page (have pin) and sometimes it may be
pointing to a kernel buffer (don't pin or ref).

David


2023-01-23 14:21:47

by David Hildenbrand

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

On 23.01.23 14:38, David Howells wrote:
> David Hildenbrand <[email protected]> wrote:
>
>> That would be the ideal case: whenever intending to access page content, use
>> FOLL_PIN instead of FOLL_GET.
>>
>> The issue that John was trying to sort out was that there are plenty of
>> callsites that do a simple put_page() instead of calling
>> unpin_user_page(). IIRC, handling that correctly in existing code -- what was
>> pinned must be released via unpin_user_page() -- was the biggest workitem.
>>
>> Not sure how that relates to your work here (that's why I was asking): if you
>> could avoid FOLL_GET, that would be great :)
>
> Well, it simplifies things a bit.
>
> I can make the new iov_iter_extract_pages() just do "pin" or "don't pin" and
> do no ref-getting at all. Things can be converted over to "unpin the pages or
> doing nothing" as they're converted over to using iov_iter_extract_pages()
> from iov_iter_get_pages*().
>
> The block bio code then only needs a single bit of state: pinned or not
> pinned.

Unfortunately, I'll have to let BIO experts comment on that :) I only
know the MM side of things here.

>
> For cifs RDMA, do I need to make it pass in FOLL_LONGTERM? And does that need
> a special cleanup?

Anything that holds pins "possibly forever" should use FOLL_LONGTERM.
vmsplice() is another example that should use it, once it's properly using
FOLL_PIN. [FOLL_GET | FOLL_LONGTERM isn't really used or defined with
meaningful semantics.]

>
> sk_buff fragment handling could still be tricky. I'm thinking that in that
> code I'll need to store FOLL_GET/PIN in the bottom two bits of the frag page
> pointer. Sometimes it allocates a new page and attaches it (have ref);
> sometimes it does zerocopy to/from a page (have pin) and sometimes it may be
> pointing to a kernel buffer (don't pin or ref).
>
> David
>

--
Thanks,

David / dhildenb


2023-01-23 14:49:00

by Christoph Hellwig

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

On Mon, Jan 23, 2023 at 03:20:41PM +0100, David Hildenbrand wrote:
> > The block bio code then only needs a single bit of state: pinned or not
> > pinned.
>
> Unfortunately, I'll have to let BIO experts comment on that :) I only know
> the MM side of things here.

It can work for bios.

2023-01-23 14:49:36

by Christoph Hellwig

Subject: Re: [PATCH v7 6/8] block: Make bio structs pin pages rather than ref'ing if appropriate

On Mon, Jan 23, 2023 at 11:28:40AM +0000, David Howells wrote:
> void __bio_release_pages(struct bio *bio, bool mark_dirty)
> {
> 	unsigned int gup_flags = bio_to_gup_flags(bio);
> 	struct bvec_iter_all iter_all;
> 	struct bio_vec *bvec;
>
> 	bio_for_each_segment_all(bvec, bio, iter_all) {
> 		if (mark_dirty && !PageCompound(bvec->bv_page))
> 			set_page_dirty_lock(bvec->bv_page);
> >>>>		page_put_unpin(bvec->bv_page, gup_flags);
> 	}
> }
>
> that ought to be a call to bio_release_page(), but the optimiser doesn't want
> to inline it:-/

Why? __bio_release_pages is the fast path, no need to force using
bio_release_page which is otherwise only used for error cleanup.

2023-01-23 16:11:25

by Jan Kara

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

On Mon 23-01-23 13:38:31, David Howells wrote:
> David Hildenbrand <[email protected]> wrote:
>
> > That would be the ideal case: whenever intending to access page content, use
> > FOLL_PIN instead of FOLL_GET.
> >
> > The issue that John was trying to sort out was that there are plenty of
> > callsites that do a simple put_page() instead of calling
> > unpin_user_page(). IIRC, handling that correctly in existing code -- what was
> > pinned must be released via unpin_user_page() -- was the biggest workitem.
> >
> > Not sure how that relates to your work here (that's why I was asking): if you
> > could avoid FOLL_GET, that would be great :)
>
> Well, it simplifies things a bit.
>
> I can make the new iov_iter_extract_pages() just do "pin" or "don't pin" and
> do no ref-getting at all. Things can be converted over to "unpin the pages or
> doing nothing" as they're converted over to using iov_iter_extract_pages()
> from iov_iter_get_pages*().
>
> The block bio code then only needs a single bit of state: pinned or not
> pinned.

I'm all for using only pin/unpin in the end. But you'd still need to handle
the 'put' for the intermediate time when there are still FOLL_GET users of
the bio infrastructure, wouldn't you?

> For cifs RDMA, do I need to make it pass in FOLL_LONGTERM? And does that need
> a special cleanup?

FOLL_LONGTERM doesn't need a special cleanup AFAIK. It should be used
whenever there isn't reasonably bound time after which the page is
unpinned. So in case CIFS sets up RDMA and then it is up to userspace how
long the RDMA is going to be running it should be using FOLL_LONGTERM. The
thing is that pins can block e.g. truncate for DAX inodes and so longterm
pins are not supported for DAX backed pages.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-01-23 16:17:50

by Christoph Hellwig

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

On Mon, Jan 23, 2023 at 05:11:14PM +0100, Jan Kara wrote:
> I'm all for using only pin/unpin in the end. But you'd still need to handle
> the 'put' for the intermediate time when there are still FOLL_GET users of
> the bio infrastructure, wouldn't you?

Yes, we need it for the transition. But Dave already has a patch to
convert the legacy direct-io code as well, at which point we can
then retire the get bit.


2023-01-23 16:31:57

by Matthew Wilcox

Subject: Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)

On Fri, Jan 20, 2023 at 05:55:48PM +0000, David Howells wrote:
> (3) Make the bio struct carry a pair of flags to indicate the cleanup
> mode. BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
> > FOLL_GET) and BIO_PAGE_PINNED (equivalent to FOLL_PIN) is
> added.

I think there's a simpler solution than all of this.

As I understand the fundamental problem here, the question is
when to copy a page on fork. We have the optimisation of COW, but
O_DIRECT/RDMA/... breaks it. So all this page pinning is to indicate
to the fork code "You can't do COW to this page".

Why do we want to track that information on a per-page basis? Wouldn't it
be easier to have a VM_NOCOW flag in vma->vm_flags? Set it the first
time somebody does an O_DIRECT read or RDMA pin. That's it. Pages in
that VMA will now never be COWed, regardless of their refcount/mapcount.
And the whole "did we pin or get this page" problem goes away. Along
with folio->pincount.


2023-01-23 16:41:25

by David Howells

Subject: Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)

Matthew Wilcox <[email protected]> wrote:

> Why do we want to track that information on a per-page basis? Wouldn't it
> be easier to have a VM_NOCOW flag in vma->vm_flags? Set it the first
> time somebody does an O_DIRECT read or RDMA pin. That's it. Pages in
> that VMA will now never be COWed, regardless of their refcount/mapcount.
> And the whole "did we pin or get this page" problem goes away. Along
> with folio->pincount.

Wouldn't that potentially make someone's entire malloc() heap entirely NOCOW
if they did a single DIO to/from it?

Also you only mention DIO read - but what about "start DIO write; fork(); touch
buffer" in the parent - now the write buffer belongs to the child and they can
affect the parent's write.

David


2023-01-23 16:43:26

by Jan Kara

Subject: Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)

On Mon 23-01-23 16:31:32, Matthew Wilcox wrote:
> On Fri, Jan 20, 2023 at 05:55:48PM +0000, David Howells wrote:
> > (3) Make the bio struct carry a pair of flags to indicate the cleanup
> > mode. BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
> > > FOLL_GET) and BIO_PAGE_PINNED (equivalent to FOLL_PIN) is
> > added.
>
> I think there's a simpler solution than all of this.
>
> As I understand the fundamental problem here, the question is
> when to copy a page on fork. We have the optimisation of COW, but
> O_DIRECT/RDMA/... breaks it. So all this page pinning is to indicate
> to the fork code "You can't do COW to this page".
>
> Why do we want to track that information on a per-page basis? Wouldn't it
> be easier to have a VM_NOCOW flag in vma->vm_flags? Set it the first
> time somebody does an O_DIRECT read or RDMA pin. That's it. Pages in
> that VMA will now never be COWed, regardless of their refcount/mapcount.
> And the whole "did we pin or get this page" problem goes away. Along
> with folio->pincount.

Well, but anon COW code is not the only (planned) consumer of the pincount.
Filesystems also need to know whether a (shared pagecache) page is pinned
and can thus be modified behind their backs. And for that VMA tracking
isn't really an option.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-01-23 16:43:40

by Matthew Wilcox

Subject: Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)

On Mon, Jan 23, 2023 at 04:38:47PM +0000, David Howells wrote:
> Matthew Wilcox <[email protected]> wrote:
>
> > Why do we want to track that information on a per-page basis? Wouldn't it
> > be easier to have a VM_NOCOW flag in vma->vm_flags? Set it the first
> > time somebody does an O_DIRECT read or RDMA pin. That's it. Pages in
> > that VMA will now never be COWed, regardless of their refcount/mapcount.
> > And the whole "did we pin or get this page" problem goes away. Along
> > with folio->pincount.
>
> Wouldn't that potentially make someone's entire malloc() heap entirely NOCOW
> if they did a single DIO to/from it.

Yes. Would that be an actual problem for any real application?

We could do this with a vm_pincount if it's essential to be able to
count how often it's happened and be able to fork() without COW if it's
something that happened in the past and is now not happening.

> Also you only mention DIO read - but what about "start DIO write; fork(); touch
> buffer" in the parent - now the write buffer belongs to the child and they can
> affect the parent's write.

I'm struggling to see the problem here. If the child hasn't exec'd, the
parent and child are still in the same security domain. The parent
could have modified the buffer before calling fork().

2023-01-23 17:20:57

by David Howells

Subject: Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)

Matthew Wilcox <[email protected]> wrote:

> > Wouldn't that potentially make someone's entire malloc() heap entirely NOCOW
> > if they did a single DIO to/from it.
>
> Yes. Would that be an actual problem for any real application?

Without auditing all applications that do direct I/O writes, it's hard to
say - but a big database engine, Oracle for example, forking off a process
could cause a massive slowdown as fork suddenly has to copy a huge
amount of malloc'd data unnecessarily[*].

[*] I'm making wild assumptions about how Oracle's DB engine works.

> > Also you only mention DIO read - but what about "start DIO write; fork();
> > touch buffer" in the parent - now the write buffer belongs to the child
> > and they can affect the parent's write.
>
> I'm struggling to see the problem here. If the child hasn't exec'd, the
> parent and child are still in the same security domain. The parent
> could have modified the buffer before calling fork().

It could still inadvertently change the data its parent set to write out. The
child *shouldn't* be able to change the parent's in-progress write. The most
obvious problem would be in something that does DIO from a stack buffer, I
think.

David


2023-01-23 17:25:38

by Jan Kara

Subject: Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)

On Mon 23-01-23 16:42:56, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 04:38:47PM +0000, David Howells wrote:
> > Matthew Wilcox <[email protected]> wrote:
> > Also you only mention DIO read - but what about "start DIO write; fork(); touch
> > buffer" in the parent - now the write buffer belongs to the child and they can
> > affect the parent's write.
>
> I'm struggling to see the problem here. If the child hasn't exec'd, the
> parent and child are still in the same security domain. The parent
> could have modified the buffer before calling fork().

Sadly they are not. Android in particular starts applications by forking
one big binary (zygote) that has multiple apps linked together and relies
on the fact that the child cannot influence the parent after the fork. We've
already had CVEs with GUP & COW & fork due to this. David Hildenbrand has a
lot of memories regarding this I believe ;)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-01-23 17:34:33

by Matthew Wilcox

Subject: Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)

On Mon, Jan 23, 2023 at 05:42:18PM +0100, Jan Kara wrote:
> On Mon 23-01-23 16:31:32, Matthew Wilcox wrote:
> > On Fri, Jan 20, 2023 at 05:55:48PM +0000, David Howells wrote:
> > > (3) Make the bio struct carry a pair of flags to indicate the cleanup
> > > mode. BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
> > > FOLL_GET) and BIO_PAGE_PINNED (equivalent to FOLL_PIN) is
> > > added.
> >
> > I think there's a simpler solution than all of this.
> >
> > As I understand the fundamental problem here, the question is
> > when to copy a page on fork. We have the optimisation of COW, but
> > O_DIRECT/RDMA/... breaks it. So all this page pinning is to indicate
> > to the fork code "You can't do COW to this page".
> >
> > Why do we want to track that information on a per-page basis? Wouldn't it
> > be easier to have a VM_NOCOW flag in vma->vm_flags? Set it the first
> > time somebody does an O_DIRECT read or RDMA pin. That's it. Pages in
> > that VMA will now never be COWed, regardless of their refcount/mapcount.
> > And the whole "did we pin or get this page" problem goes away. Along
> > with folio->pincount.
>
> Well, but anon COW code is not the only (planned) consumer of the pincount.
> Filesystems also need to know whether a (shared pagecache) page is pinned
> and can thus be modified behind their backs. And for that VMA tracking
> isn't really an option.

Bleh, I'd forgotten about that problem. We really do need to keep
track of which pages are under I/O for this case, because we need to
tell the filesystem that they are now available for writeback.

That said, I don't know that we need to keep track of it in the
pages themselves. Can't we have something similar to rmap which
keeps track of a range of pinned pages, and have it taken care of
at a higher level (ie unpin the pages in the dio_iodone_t rather
than in the BIO completion handler)?

I'm not even sure why pinned pagecache pages remain on the LRU.
They should probably go onto the unevictable list with the mlocked
pages, then on unpin get marked dirty and placed on the active list.
There's no point in writing back a pinned page since it can be
written to at any instant without any part of the kernel knowing.


2023-01-23 18:05:15

by Matthew Wilcox

Subject: Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)

On Mon, Jan 23, 2023 at 05:19:51PM +0000, David Howells wrote:
> Matthew Wilcox <[email protected]> wrote:
>
> > > Wouldn't that potentially make someone's entire malloc() heap entirely NOCOW
> > > if they did a single DIO to/from it.
> >
> > Yes. Would that be an actual problem for any real application?
>
> Without auditing all applications that do direct I/O writes, it's hard to
> say - but a big database engine, Oracle for example, forking off a process,
> say, could cause a massive slow down as fork suddenly has to copy a huge
> amount of malloc'd data unnecessarily[*].
>
> [*] I'm making wild assumptions about how Oracle's DB engine works.

Yes. The cache is shared between all Oracle processes, so it's not COWed.
Indeed (as the mshare patches show), what Oracle wants is _more_ sharing
between the processes, not _less_.

> > > Also you only mention DIO read - but what about "start DIO write; fork();
> > > touch buffer" in the parent - now the write buffer belongs to the child
> > > and they can affect the parent's write.
> >
> > I'm struggling to see the problem here. If the child hasn't exec'd, the
> > parent and child are still in the same security domain. The parent
> > could have modified the buffer before calling fork().
>
> It could still inadvertently change the data its parent set to write out. The
> child *shouldn't* be able to change the parent's in-progress write. The most
> obvious problem would be in something that does DIO from a stack buffer, I
> think.

If it's a problem then O_DIRECT writes can also set the NOCOW flag.

2023-01-23 19:56:52

by John Hubbard

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

On 1/23/23 05:24, David Hildenbrand wrote:
> On 23.01.23 14:19, David Howells wrote:
>> David Hildenbrand <[email protected]> wrote:
>>
>>> Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
>>> the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
>>> already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
>>> unfortunately, not reflect reality yet.
>>

Yes, that part of the documentation is...aspirational. :) But this
series is taking us there, so good. Let me go review v8 of the series in
actual detail to verify but it sounds very promising.

>> Yeah - I just came across that.
>>
>> Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
>> get_user_pages()?  In which case my patches only need to keep track of
>> pinned/not-pinned and never "got".
>
> That would be the ideal case: whenever intending to access page content, use FOLL_PIN instead of FOLL_GET.
>
> The issue that John was trying to sort out was that there are plenty of callsites that do a simple put_page() instead of calling unpin_user_page(). IIRC, handling that correctly in existing code -- what was pinned must be released via unpin_user_page() -- was the biggest workitem.
>
> Not sure how that relates to your work here (that's why I was asking): if you could avoid FOLL_GET, that would be great :)
>

The largest part of this problem has been: __iov_iter_get_pages_alloc()
calls get_user_pages_fast() (so, FOLL_GET), as part of the Direct IO
path. And that __iov_iter_get_pages_alloc() is also called by a wide
variety of things that are not Direct IO: networking, crypto, RDS.

So splitting out a variant that only Direct IO uses is a great move and
should allow conversion of Direct IO to FOLL_PIN. Again, let me go do an
actual review to check on that.

thanks,
--
John Hubbard
NVIDIA

2023-01-23 22:54:08

by John Hubbard

Subject: Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)

On 1/23/23 09:33, Matthew Wilcox wrote:
...
> Bleh, I'd forgotten about that problem. We really do need to keep
> track of which pages are under I/O for this case, because we need to
> tell the filesystem that they are now available for writeback.
>
> That said, I don't know that we need to keep track of it in the
> pages themselves. Can't we have something similar to rmap which
> keeps track of a range of pinned pages, and have it taken care of
> at a higher level (ie unpin the pages in the dio_iodone_t rather
> than in the BIO completion handler)?
>
> I'm not even sure why pinned pagecache pages remain on the LRU.
> They should probably go onto the unevictable list with the mlocked

This is an intriguing idea, but...

> pages, then on unpin get marked dirty and placed on the active list.
> There's no point in writing back a pinned page since it can be
> written to at any instant without any part of the kernel knowing.
>

There have been filesystem discussions about this: if a page goes
unwritten for "too long", it's not good. To address that, bounce buffers
were proposed for periodic writeback of pinned pages. The idea with
bounce buffers is: even though the page is never guaranteed up to date
(because DMA can instantly make it effectively dirty), it is at least
much less out of date after a periodic writeback than it was before.

And that is important for some file system use cases.


thanks,
--
John Hubbard
NVIDIA

2023-01-23 23:08:15

by John Hubbard

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

On 1/23/23 08:11, Jan Kara wrote:
>> For cifs RDMA, do I need to make it pass in FOLL_LONGTERM? And does that need
>> a special cleanup?
>
> FOLL_LONGTERM doesn't need a special cleanup AFAIK. It should be used
> whenever there isn't reasonably bound time after which the page is
> unpinned. So in case CIFS sets up RDMA and then it is up to userspace how
> long the RDMA is going to be running it should be using FOLL_LONGTERM. The

Yes, we have been pretty consistently deciding that RDMA generally
implies FOLL_LONGTERM. (And furthermore, FOLL_LONGTERM implies
FOLL_PIN--that one is actually enforced by the gup/pup APIs.)


thanks,
--
John Hubbard
NVIDIA

2023-01-24 05:58:35

by Christoph Hellwig

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

On Mon, Jan 23, 2023 at 03:07:48PM -0800, John Hubbard wrote:
> On 1/23/23 08:11, Jan Kara wrote:
> > > For cifs RDMA, do I need to make it pass in FOLL_LONGTERM? And does that need
> > > a special cleanup?
> >
> > FOLL_LONGTERM doesn't need a special cleanup AFAIK. It should be used
> > whenever there isn't reasonably bound time after which the page is
> > unpinned. So in case CIFS sets up RDMA and then it is up to userspace how
> > long the RDMA is going to be running it should be using FOLL_LONGTERM. The
>
> Yes, we have been pretty consistently deciding that RDMA generally
> implies FOLL_LONGTERM. (And furthermore, FOLL_LONGTERM implies
> FOLL_PIN--that one is actually enforced by the gup/pup APIs.)

That's weird. For storage or file systems, pages are pinned just as
long when using RDMA as when using local DMA, in fact if you do RDMA
to really fast remote media vs slow local media (e.g. SSD vs disk) you
might pin it shorter when using RDMA.

I think FOLL_LONGTERM makes sense for non-ODP user space memory
registrations for RDMA, which will last basically forever. It does
not really make much sense at all for in-kernel memory registration for
RDMA that are very short term.

2023-01-24 06:55:43

by John Hubbard

Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

On 1/23/23 21:57, Christoph Hellwig wrote:
> On Mon, Jan 23, 2023 at 03:07:48PM -0800, John Hubbard wrote:
>> On 1/23/23 08:11, Jan Kara wrote:
>>>> For cifs RDMA, do I need to make it pass in FOLL_LONGTERM? And does that need
>>>> a special cleanup?
>>>
>>> FOLL_LONGTERM doesn't need a special cleanup AFAIK. It should be used
>>> whenever there isn't reasonably bound time after which the page is
>>> unpinned. So in case CIFS sets up RDMA and then it is up to userspace how
>>> long the RDMA is going to be running it should be using FOLL_LONGTERM. The
>>
>> Yes, we have been pretty consistently deciding that RDMA generally
>> implies FOLL_LONGTERM. (And furthermore, FOLL_LONGTERM implies
>> FOLL_PIN--that one is actually enforced by the gup/pup APIs.)
>
> That's weird. For storage or file systems, pages are pinned just as
> long when using RDMA as when using local DMA; in fact, if you do RDMA
> to really fast remote media vs slow local media (e.g. SSD vs disk) you
> might pin them for a shorter time when using RDMA.

ah OK, it is possible that I've been thinking too much about the HPC
cases that set up RDMA and leave it there, and not enough about
storage.

>
> I think FOLL_LONGTERM makes sense for non-ODP user space memory
> registrations for RDMA, which will last basically forever. It does
> not really make much sense at all for in-kernel memory registrations
> for RDMA that are very short term.

Agreed.


thanks,
--
John Hubbard
NVIDIA


2023-01-24 10:25:06

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)

On 23.01.23 18:25, Jan Kara wrote:
> On Mon 23-01-23 16:42:56, Matthew Wilcox wrote:
>> On Mon, Jan 23, 2023 at 04:38:47PM +0000, David Howells wrote:
>>> Matthew Wilcox <[email protected]> wrote:
>>> Also you only mention DIO read - but what about "start DIO write; fork(); touch
>>> buffer" in the parent - now the write buffer belongs to the child and they can
>>> affect the parent's write.
>>
>> I'm struggling to see the problem here. If the child hasn't exec'd, the
>> parent and child are still in the same security domain. The parent
>> could have modified the buffer before calling fork().
>
> Sadly they are not. Android in particular starts applications by forking
> one big binary (zygote) that has multiple apps linked together and relies
> on the fact the child cannot influence the parent after the fork. We've
> already had CVEs with GUP & COW & fork due to this. David Hildenbrand has a
> lot of memories regarding this I believe ;)

:)

Once FOLL_PIN is used most of the issues go away and we don't have to
play any games with VM flags or similar ...

With FOLL_PIN, I consider anon a solved problem and not worth any new
fancy ideas.

--
Thanks,

David / dhildenb


2023-01-24 10:29:40

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)

On Mon 23-01-23 17:33:24, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 05:42:18PM +0100, Jan Kara wrote:
> > On Mon 23-01-23 16:31:32, Matthew Wilcox wrote:
> > > On Fri, Jan 20, 2023 at 05:55:48PM +0000, David Howells wrote:
> > > > (3) Make the bio struct carry a pair of flags to indicate the cleanup
> > > > mode. BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
> > > > FOLL_GET) and BIO_PAGE_PINNED (equivalent to FOLL_PIN) is
> > > > added.
> > >
> > > I think there's a simpler solution than all of this.
> > >
> > > As I understand the fundamental problem here, the question is
> > > when to copy a page on fork. We have the optimisation of COW, but
> > > O_DIRECT/RDMA/... breaks it. So all this page pinning is to indicate
> > > to the fork code "You can't do COW to this page".
> > >
> > > Why do we want to track that information on a per-page basis? Wouldn't it
> > > be easier to have a VM_NOCOW flag in vma->vm_flags? Set it the first
> > > time somebody does an O_DIRECT read or RDMA pin. That's it. Pages in
> > > that VMA will now never be COWed, regardless of their refcount/mapcount.
> > > And the whole "did we pin or get this page" problem goes away. Along
> > > with folio->pincount.
> >
> > Well, but anon COW code is not the only (planned) consumer of the pincount.
> > Filesystems also need to know whether a (shared pagecache) page is pinned
> > and can thus be modified behind their backs. And for that VMA tracking
> > isn't really an option.
>
> Bleh, I'd forgotten about that problem. We really do need to keep
> track of which pages are under I/O for this case, because we need to
> tell the filesystem that they are now available for writeback.
>
> That said, I don't know that we need to keep track of it in the
> pages themselves. Can't we have something similar to rmap which
> keeps track of a range of pinned pages, and have it taken care of
> at a higher level (ie unpin the pages in the dio_iodone_t rather
> than in the BIO completion handler)?

We could but bear in mind that there are more places than just (the two)
direct IO paths. There are many GUP users that can be operating on mmapped
pagecache and that can modify page data. Also note that e.g. direct IO is
rather performance sensitive so any CPU overhead that is added to that path
gets noticed. That was actually the reason why we keep pinned pages on the
LRU - John tried to remove them while pinning but the overhead was not
acceptable.

Finally, since we need to handle pinning for anon pages as well, just
(ab)using the page refcount looks like the least painful solution.
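The refcount (ab)use Jan mentions is the kernel's bias scheme: each FOLL_PIN adds GUP_PIN_COUNTING_BIAS (1024) to the page refcount instead of 1, so "is this page maybe pinned?" can be answered from the refcount alone, with no extra per-page storage. A user-space model of that idea (struct and helper names are illustrative, not the kernel's):

```c
/* The kernel's actual bias constant from mm.h. */
#define GUP_PIN_COUNTING_BIAS 1024

struct page_model { unsigned int refcount; };

static void model_get(struct page_model *p)   { p->refcount += 1; }
static void model_pin(struct page_model *p)   { p->refcount += GUP_PIN_COUNTING_BIAS; }
static void model_unpin(struct page_model *p) { p->refcount -= GUP_PIN_COUNTING_BIAS; }

/*
 * Mirrors the logic of page_maybe_dma_pinned() for a small page: false
 * negatives are impossible, but ~1024 ordinary references would give a
 * false positive -- hence "maybe".
 */
static int model_maybe_dma_pinned(const struct page_model *p)
{
	return p->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

This is why the pin check stays cheap on the hot direct-IO path: it is one comparison on a counter the page already carries.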

> I'm not even sure why pinned pagecache pages remain on the LRU.
> They should probably go onto the unevictable list with the mlocked
> pages, then on unpin get marked dirty and placed on the active list.
> There's no point in writing back a pinned page since it can be
> written to at any instant without any part of the kernel knowing.

True but as John said sometimes we need to write out even a pinned page - e.g.
on fsync(2). For some RDMA users which keep pages pinned for days or
months, this is actually crucial...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-01-24 13:21:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)

On Tue, Jan 24, 2023 at 11:29:31AM +0100, Jan Kara wrote:
> True but as John said sometimes we need to write out even a pinned page - e.g.
> on fsync(2). For some RDMA users which keep pages pinned for days or
> months, this is actually crucial...

I think we have to distinguish between short term (just FOLL_PIN)
and long term (FOLL_PIN | FOLL_LONGTERM) pins.

For short term ones the proper thing to do in data integrity writeback
is to simply wait for the unpin. For the long term pins that obviously
can't work. The right answer for that is complicated and I don't have
a good answer yet. The best one would probably be to prohibit them on
MAP_SHARED file-backed mappings - this might break some existing
software, but that software is already so broken that this might be
best.
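The proposed policy above reduces to a small decision table. A sketch in plain C (the enum and function names are made up for illustration; the "long-term" branch is deliberately left unresolved, as in the thread):

```c
enum wb_action {
	WB_WRITE_NOW,		/* page not pinned: normal writeback */
	WB_WAIT_FOR_UNPIN,	/* short-term pin: wait it out */
	WB_UNRESOLVED,		/* long-term pin: no good answer yet */
};

/*
 * Sketch of the data-integrity (fsync) writeback policy Christoph
 * outlines for a pinned pagecache page.
 */
static enum wb_action fsync_pinned_page_action(int pinned, int longterm)
{
	if (!pinned)
		return WB_WRITE_NOW;
	if (!longterm)
		return WB_WAIT_FOR_UNPIN;
	return WB_UNRESOLVED;
}
```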

2023-01-26 22:15:28

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

On Mon, Jan 23, 2023 at 02:24:13PM +0100, David Hildenbrand wrote:
> On 23.01.23 14:19, David Howells wrote:
> > David Hildenbrand <[email protected]> wrote:
> >
> > > Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
> > > the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
> > > already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
> > > unfortunately, not reflect reality yet.
> >
> > Yeah - I just came across that.
> >
> > Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
> > get_user_pages()? In which case my patches only need keep track of
> > pinned/not-pinned and never "got".
>
> That would be the ideal case: whenever intending to access page content, use
> FOLL_PIN instead of FOLL_GET.
>
> The issue that John was trying to sort out was that there are plenty of
> callsites that do a simple put_page() instead of calling unpin_user_page().
> IIRC, handling that correctly in existing code -- what was pinned must be
> released via unpin_user_page() -- was the biggest workitem.
>
> Not sure how that relates to your work here (that's why I was asking): if
> you could avoid FOLL_GET, that would be great :)

Take a good look at iter_to_pipe(). It does *not* need to pin anything
(we have an ITER_SOURCE there); with this approach it will. And it
will stuff those pinned references into a pipe, where they can sit
indefinitely.

IOW, I don't believe it's a usable approach.

2023-01-26 23:43:23

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

On 26.01.23 23:15, Al Viro wrote:
> On Mon, Jan 23, 2023 at 02:24:13PM +0100, David Hildenbrand wrote:
>> On 23.01.23 14:19, David Howells wrote:
>>> David Hildenbrand <[email protected]> wrote:
>>>
>>>> Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
>>>> the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
>>>> already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
>>>> unfortunately, not reflect reality yet.
>>>
>>> Yeah - I just came across that.
>>>
>>> Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
>>> get_user_pages()? In which case my patches only need keep track of
>>> pinned/not-pinned and never "got".
>>
>> That would be the ideal case: whenever intending to access page content, use
>> FOLL_PIN instead of FOLL_GET.
>>
>> The issue that John was trying to sort out was that there are plenty of
>> callsites that do a simple put_page() instead of calling unpin_user_page().
>> IIRC, handling that correctly in existing code -- what was pinned must be
>> released via unpin_user_page() -- was the biggest workitem.
>>
>> Not sure how that relates to your work here (that's why I was asking): if
>> you could avoid FOLL_GET, that would be great :)
>
> Take a good look at iter_to_pipe(). It does *not* need to pin anything
> (we have an ITER_SOURCE there); with this approach it will. And it
> will stuff those pinned references into a pipe, where they can sit
> indefinitely.
>
> IOW, I don't believe it's a usable approach.
>

Not sure what makes you believe that FOLL_GET is any better for this
long-term pinning, I'd like to learn about that.

As raised already somewhere in the whole discussion by me, the right way
to take such a long-term pin as vmsplice() does is to use
FOLL_PIN|FOLL_LONGTERM. As also raised, that will fix the last remaining
vmsplice()+hugetlb COW issue as tested by the cow.c vm selftest and make
sure to migrate that memory off of MIGRATE_MOVABLE/CMA memory where we
cannot tolerate to have long-term unmovable memory sitting around.

--
Thanks,

David / dhildenb


2023-01-27 00:08:38

by David Howells

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

David Hildenbrand <[email protected]> wrote:

> As raised already somewhere in the whole discussion by me, the right way to
> take such a long-term pin as vmsplice() does is to use
> FOLL_PIN|FOLL_LONGTERM.

So the pipe infrastructure would have to be able to pin pages instead of
carrying refs on them? What about pages just allocated and added to the pipe
ring in normal pipe use?

David


2023-01-27 00:21:41

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator

On 27.01.23 01:05, David Howells wrote:
> David Hildenbrand <[email protected]> wrote:
>
>> As raised already somewhere in the whole discussion by me, the right way to
>> take such a long-term pin as vmsplice() does is to use
>> FOLL_PIN|FOLL_LONGTERM.
>
> So the pipe infrastructure would have to be able to pin pages instead of
> carrying refs on them? What about pages just allocated and added to the pipe
> ring in normal pipe use?

Ordinary kernel allocations (alloc_page() ...) always have to be freed
via put_page() and friends. Such allocations are unmovable by default
and don't require any special care.

Pages mapped into user space are movable by default and might be placed
on ZONE_MOVABLE/CMA memory (well, and might be swapped out).
FOLL_LONGTERM makes sure to migrate these pages off of such problematic
physical memory regions, such that we can safely pin them until eternity.
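This put_page()-vs-unpin_user_page() pairing is exactly what the cover letter's page_put_unpin() helper has to dispatch on, driven by the cleanup mode from iov_iter_extract_mode() (FOLL_PIN, FOLL_GET, or 0). A user-space sketch of that dispatch; flag values are illustrative and the model_* names are made up, but the branch order reflects that pinned pages must go through the unpin path:

```c
/* Illustrative values only -- not the kernel's actual FOLL_* bits. */
#define FOLL_GET (1u << 0)
#define FOLL_PIN (1u << 1)

enum release_kind {
	RELEASED_NONE,	/* page was only listed; nothing to drop */
	RELEASED_PUT,	/* would call put_page() */
	RELEASED_UNPIN,	/* would call unpin_user_page() */
};

/* Sketch of page_put_unpin()-style cleanup dispatch. */
static enum release_kind model_page_put_unpin(unsigned int cleanup_mode)
{
	if (cleanup_mode & FOLL_PIN)
		return RELEASED_UNPIN;
	if (cleanup_mode & FOLL_GET)
		return RELEASED_PUT;
	return RELEASED_NONE;
}
```

The zero-mode case covers kernel-backed iterators whose pages are neither ref'd nor pinned by extraction, which is why getting the mode wrong (put_page() on a pinned page, or vice versa) corrupts the accounting.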

--
Thanks,

David / dhildenb