Hi Al, Christoph,
Here are patches to provide support for extracting pages from an iov_iter
and to use this in the extraction functions in the block layer bio code.
The patches make the following changes:
(1) Add a function, iov_iter_extract_pages(), to replace
iov_iter_get_pages*(); it pins the pages or merely lists them, as
appropriate to the iterator type.
Add a function, iov_iter_extract_mode(), that indicates from the
iterator type how the cleanup is to be performed, returning FOLL_PIN
or 0.
(2) Add a function, folio_put_unpin(), and a wrapper, page_put_unpin(),
that take a page and the return from iov_iter_extract_mode() and do
the right thing to clean up the page.
(3) Make the bio struct carry a pair of flags to indicate the cleanup
mode. BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
FOLL_GET) and BIO_PAGE_PINNED (equivalent to FOLL_PIN) is
added.
(4) Add a function, bio_release_page(), to release a page appropriately to
the cleanup mode indicated by the BIO_PAGE_* flags.
(5) Make the iter-to-bio code use iov_iter_extract_pages() to retain the
pages appropriately and clean them up later.
(6) Fix bio_flagged() so that it doesn't prevent a gcc optimisation.
(7) Renumber FOLL_PIN and FOLL_GET down so that they're at bits 0 and 1
and coincident with BIO_PAGE_PINNED and BIO_PAGE_REFFED. The compiler
can then optimise on that. Also, it's probably going to be necessary
to embed these in the page pointer in sk_buff fragments. This patch
can go independently through the mm tree.
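As a rough user-space sketch of why the renumbering helps (the values and helper name below are illustrative stand-ins, not the kernel definitions), making the FOLL_* bits coincide with the BIO_PAGE_* flag bits turns the cleanup-mode transcription into a plain mask, with no per-flag test-and-set:

```c
#include <assert.h>

/* Stand-in values illustrating the proposed renumbering: FOLL_PIN at
 * bit 0 and FOLL_GET at bit 1, coincident with the BIO_PAGE_PINNED and
 * BIO_PAGE_REFFED flag bit numbers.
 */
#define FOLL_PIN		0x01	/* bit 0 */
#define FOLL_GET		0x02	/* bit 1 */
#define BIO_PAGE_PINNED		0	/* flag bit number */
#define BIO_PAGE_REFFED		1	/* flag bit number */

/* With coincident bits, the compiler can reduce the transcription of
 * the GUP cleanup mode into bio flags to a single AND.
 */
static unsigned int cleanup_mode_to_bio_flags(unsigned int gup_flags)
{
	return gup_flags & (FOLL_PIN | FOLL_GET);
}
```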
I've pushed the patches here also:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract
David
Changes:
========
ver #8)
- Import Christoph Hellwig's changes.
- Split the conversion-to-extraction patch.
- Drop the extract_flags arg from iov_iter_extract_mode().
- Don't default bios to BIO_PAGE_REFFED, but set explicitly.
- Switch FOLL_PIN and FOLL_GET when renumbering so PIN is at bit 0.
- Switch BIO_PAGE_PINNED and BIO_PAGE_REFFED so PINNED is at bit 0.
- We should always be using FOLL_PIN (not FOLL_GET) for DIO, so adjust the
patches for that.
ver #7)
- For now, drop the parts to pass the I/O direction to iov_iter_*pages*()
as it turned out to be a lot more complicated, with places not setting
IOCB_WRITE when they should, for example.
- Drop all the patches that changed things other than the block layer's
bio handling. The netfslib and cifs changes can go into a separate
patchset.
- Add support for extracting pages from KVEC-type iterators.
- When extracting from BVEC/KVEC, skip over empty vecs at the front.
ver #6)
- Fix write() syscall and co. not setting IOCB_WRITE.
- Added iocb_is_read() and iocb_is_write() to check IOCB_WRITE.
- Use op_is_write() in bio_copy_user_iov().
- Drop the iterator direction checks from smbd_recv().
- Define FOLL_SOURCE_BUF and FOLL_DEST_BUF and pass them in as part of
gup_flags to iov_iter_get/extract_pages*().
- Rename iov_iter_get_pages*2() to iov_iter_get_pages*() and remove the
old versions.
- Add back the function to indicate the cleanup mode.
- Drop the cleanup_mode return arg to iov_iter_extract_pages().
- Provide a helper to clean up a page.
- Renumbered FOLL_GET and FOLL_PIN and made BIO_PAGE_REFFED/PINNED have
the same numerical values, enforced with an assertion.
- Converted AF_ALG, SCSI vhost, generic DIO, FUSE, splice to pipe, 9P and
NFS.
- Added in the patches to make CIFS do top-to-bottom iterators and use
various of the added extraction functions.
- Added a pair of work-in-progress patches to make sk_buff fragments store
FOLL_GET and FOLL_PIN.
ver #5)
- Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED and split into own patch.
- Transcribe FOLL_GET/PIN into BIO_PAGE_REFFED/PINNED flags.
- Add patch to allow bio_flagged() to be combined by gcc.
ver #4)
- Drop the patch to move the FOLL_* flags to linux/mm_types.h as they're
no longer referenced by linux/uio.h.
- Add ITER_SOURCE/DEST cleanup patches.
- Make iov_iter/netfslib iter extraction patches use ITER_SOURCE/DEST.
- Allow additional gup_flags to be passed into iov_iter_extract_pages().
- Add struct bio patch.
ver #3)
- Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
to get/pin_user_pages_fast()[1].
ver #2)
- Rolled the extraction cleanup mode query function into the extraction
function, returning the indication through the argument list.
- Fixed patch 4 (extract to scatterlist) to actually use the new
extraction API.
Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166722777223.2555743.162508599131141451.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869687556.3723671.10061142538708346995.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920902005.1461876.2786264600108839814.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997419665.9475.15014699817597102032.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344725490.2425628.13771289553670112965.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391047703.2311931.8115712773222260073.stgit@warthog.procyon.org.uk/ # v6
Link: https://lore.kernel.org/r/[email protected]/ # v7
Christoph Hellwig (2):
iomap: don't get a reference on ZERO_PAGE for direct I/O block
zeroing
block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the
meaning
David Howells (8):
iov_iter: Define flags to qualify page extraction.
iov_iter: Add a function to extract a page list from an iterator
mm: Provide a helper to drop a pin/ref on a page
block: Fix bio_flagged() so that gcc can better optimise it
block: Switch to pinning pages.
block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
block: convert bio_map_user_iov to use iov_iter_extract_pages
mm: Renumber FOLL_PIN and FOLL_GET down
block/bio.c | 33 ++--
block/blk-map.c | 25 ++-
block/blk.h | 28 ++++
fs/direct-io.c | 2 +
fs/iomap/direct-io.c | 1 -
include/linux/bio.h | 5 +-
include/linux/blk_types.h | 3 +-
include/linux/mm.h | 35 ++--
include/linux/uio.h | 29 +++-
lib/iov_iter.c | 334 +++++++++++++++++++++++++++++++++++++-
mm/gup.c | 22 +++
11 files changed, 461 insertions(+), 56 deletions(-)
Fix bio_flagged() so that multiple instances of it, such as:
if (bio_flagged(bio, BIO_PAGE_REFFED) ||
bio_flagged(bio, BIO_PAGE_PINNED))
can be combined by the gcc optimiser into a single test in assembly
(arguably, this is a compiler optimisation issue[1]).
The missed optimisation stems from bio_flagged() comparing the result of
the bitwise-AND to zero. This results in an out-of-line bio_release_pages()
being compiled to something like:
<+0>: mov 0x14(%rdi),%eax
<+3>: test $0x1,%al
<+5>: jne 0xffffffff816dac53 <bio_release_pages+11>
<+7>: test $0x2,%al
<+9>: je 0xffffffff816dac5c <bio_release_pages+20>
<+11>: movzbl %sil,%esi
<+15>: jmp 0xffffffff816daba1 <__bio_release_pages>
<+20>: jmp 0xffffffff81d0b800 <__x86_return_thunk>
However, the test is superfluous as the return type is bool. Removing it
results in:
<+0>: testb $0x3,0x14(%rdi)
<+4>: je 0xffffffff816e4af4 <bio_release_pages+15>
<+6>: movzbl %sil,%esi
<+10>: jmp 0xffffffff816dab7c <__bio_release_pages>
<+15>: jmp 0xffffffff81d0b7c0 <__x86_return_thunk>
instead.
Also, the MOVZBL instruction looks unnecessary[2] - I think it's just
're-booling' the mark_dirty parameter.
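To illustrate why dropping the comparison is safe, here's a minimal user-space analogue (flagged() and its flags word are stand-ins, not the kernel code): conversion to bool is itself defined as comparison against zero, so the explicit "!= 0" can't change the result; it only hides the bit pattern from the optimiser.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space analogue of bio_flagged(): returning the masked value
 * directly still yields a correct bool, because conversion to bool
 * compares against zero.  Without the explicit "!= 0", gcc is free to
 * merge adjacent tests of the same flags word into one TESTB.
 */
static bool flagged(unsigned int flags_word, unsigned int bit)
{
	return flags_word & (1U << bit);
}
```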
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
cc: Jens Axboe <[email protected]>
cc: [email protected]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108370 [1]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108371 [2]
Link: https://lore.kernel.org/r/167391056756.2311931.356007731815807265.stgit@warthog.procyon.org.uk/ # v6
---
include/linux/bio.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index c1da63f6c808..10366b8bdb13 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -227,7 +227,7 @@ static inline void bio_cnt_set(struct bio *bio, unsigned int count)
static inline bool bio_flagged(struct bio *bio, unsigned int bit)
{
- return (bio->bi_flags & (1U << bit)) != 0;
+ return bio->bi_flags & (1U << bit);
}
static inline void bio_set_flag(struct bio *bio, unsigned int bit)
Provide a helper in the get_user_pages code to drop a pin or a ref on a
page, based on being given FOLL_GET or FOLL_PIN in its flags argument, or
to do nothing if neither is set.
Signed-off-by: David Howells <[email protected]>
cc: Al Viro <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
include/linux/mm.h | 3 +++
mm/gup.c | 22 ++++++++++++++++++++++
2 files changed, 25 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8f857163ac89..3de9d88f8524 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1367,6 +1367,9 @@ static inline bool is_cow_mapping(vm_flags_t flags)
#define SECTION_IN_PAGE_FLAGS
#endif
+void folio_put_unpin(struct folio *folio, unsigned int flags);
+void page_put_unpin(struct page *page, unsigned int flags);
+
/*
* The identification function is mainly used by the buddy allocator for
* determining if two pages could be buddies. We are not really identifying
diff --git a/mm/gup.c b/mm/gup.c
index f45a3a5be53a..3ee4b4c7e0cb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -191,6 +191,28 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
folio_put_refs(folio, refs);
}
+/**
+ * folio_put_unpin - Unpin/put a folio as appropriate
+ * @folio: The folio to release
+ * @flags: gup flags indicating the mode of release (FOLL_*)
+ *
+ * Release a folio according to the flags. If FOLL_GET is set, the folio has a
+ * ref dropped; if FOLL_PIN is set, it is unpinned; otherwise it is left
+ * unaltered.
+ */
+void folio_put_unpin(struct folio *folio, unsigned int flags)
+{
+ if (flags & (FOLL_GET | FOLL_PIN))
+ gup_put_folio(folio, 1, flags);
+}
+EXPORT_SYMBOL_GPL(folio_put_unpin);
+
+void page_put_unpin(struct page *page, unsigned int flags)
+{
+ folio_put_unpin(page_folio(page), flags);
+}
+EXPORT_SYMBOL_GPL(page_put_unpin);
+
/**
* try_grab_page() - elevate a page's refcount by a flag-dependent amount
* @page: pointer to page to be grabbed
From: Christoph Hellwig <[email protected]>
ZERO_PAGE can't go away, no need to hold an extra reference.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/iomap/direct-io.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9804714b1751..47db4ead1e74 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -202,7 +202,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
- get_page(page);
+ bio_set_flag(bio, BIO_NO_PAGE_REF);
__bio_add_page(bio, page, len, 0);
iomap_dio_submit_bio(iter, dio, bio, pos);
}
Add a function, iov_iter_extract_pages(), to extract a list of pages from
an iterator. The pages may be returned with a pin added or left
unaltered, depending on the type of iterator.
Add a second function, iov_iter_extract_mode(), to determine how the
cleanup should be done.
There are two cases:
(1) ITER_IOVEC or ITER_UBUF iterator.
Extracted pages will have pins (FOLL_PIN) obtained on them so that a
concurrent fork() will forcibly copy the page so that DMA is done
to/from the parent's buffer and is unavailable to/unaffected by the
child process.
iov_iter_extract_mode() will return FOLL_PIN for this case. The
caller should use something like folio_put_unpin() to dispose of the
page.
(2) Any other sort of iterator.
No refs or pins are obtained on the pages; the assumption is made that
the caller will manage page retention.
iov_iter_extract_mode() will return 0. The pages don't need
additional disposal.
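The two cases above can be mocked in user space as follows (the names and the FOLL_PIN value are hypothetical stand-ins, not the kernel API): the caller queries the cleanup mode once per iterator and then uses a single release path for both cases.

```c
#include <assert.h>

/* Mock of the two cleanup cases.  FOLL_PIN's value and the helper names
 * are illustrative stand-ins for the kernel definitions.
 */
#define FOLL_PIN 0x01

struct mock_page {
	int pin_count;
};

/* Mirrors iov_iter_extract_mode(): FOLL_PIN for user-backed iterators
 * (case 1), 0 for everything else (case 2).
 */
static unsigned int mock_extract_mode(int user_backed)
{
	return user_backed ? FOLL_PIN : 0;
}

/* Mirrors page_put_unpin(): unpin if FOLL_PIN was used; otherwise do
 * nothing, as the caller manages retention.
 */
static void mock_put_unpin(struct mock_page *page, unsigned int mode)
{
	if (mode & FOLL_PIN)
		page->pin_count--;
}
```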
Signed-off-by: David Howells <[email protected]>
cc: Al Viro <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: John Hubbard <[email protected]>
cc: David Hildenbrand <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
Notes:
ver #8)
- It seems that all DIO is supposed to be done under FOLL_PIN now, and not
FOLL_GET, so switch to only using pin_user_pages() for user-backed
iters.
- Wrap an argument in brackets in the iov_iter_extract_mode() macro.
- Drop the extract_flags argument to iov_iter_extract_mode() for now
[hch].
ver #7)
- Switch to passing in iter-specific flags rather than FOLL_* flags.
- Drop the direction flags for now.
- Use ITER_ALLOW_P2PDMA to request FOLL_PCI_P2PDMA.
- Disallow use of ITER_ALLOW_P2PDMA with non-user-backed iter.
- Add support for extraction from KVEC-type iters.
- Use iov_iter_advance() rather than open-coding it.
- Make BVEC- and KVEC-type skip over initial empty vectors.
ver #6)
- Add back the function to indicate the cleanup mode.
- Drop the cleanup_mode return arg to iov_iter_extract_pages().
- Pass FOLL_SOURCE/DEST_BUF in gup_flags. Check this against the iter
data_source.
ver #4)
- Use ITER_SOURCE/DEST instead of WRITE/READ.
- Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.
ver #3)
- Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
to get/pin_user_pages_fast()[1].
include/linux/uio.h | 22 +++
lib/iov_iter.c | 320 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 342 insertions(+)
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 46d5080314c6..a8165335f8da 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -363,4 +363,26 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
/* Flags for iov_iter_get/extract_pages*() */
#define ITER_ALLOW_P2PDMA 0x01 /* Allow P2PDMA on the extracted pages */
+ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
+ size_t maxsize, unsigned int maxpages,
+ unsigned int extract_flags, size_t *offset0);
+
+/**
+ * iov_iter_extract_mode - Indicate how pages from the iterator will be retained
+ * @iter: The iterator
+ *
+ * Examine the iterator and indicate by returning FOLL_PIN or 0 as to how, if
+ * at all, pages extracted from the iterator will be retained by the extraction
+ * function.
+ *
+ * FOLL_PIN indicates that the pages will have a pin placed in them that the
+ * caller must unpin. This must be done for DMA/async DIO to force fork()
+ * to copy the page for the child (the parent must retain the original
+ * page).
+ *
+ * 0 indicates that no measures are taken and that it's up to the caller to
+ * retain the pages.
+ */
+#define iov_iter_extract_mode(iter) (user_backed_iter(iter) ? FOLL_PIN : 0)
+
#endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index fb04abe7d746..57f3e9404160 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1916,3 +1916,323 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
i->iov -= state->nr_segs - i->nr_segs;
i->nr_segs = state->nr_segs;
}
+
+/*
+ * Extract a list of contiguous pages from an ITER_PIPE iterator. This does
+ * not get references of its own on the pages, nor does it get a pin on them.
+ * If there's a partial page, it adds that first and will then allocate and add
+ * pages into the pipe to make up the buffer space to the amount required.
+ *
+ * The caller must hold the pipe locked and only transferring into a pipe is
+ * supported.
+ */
+static ssize_t iov_iter_extract_pipe_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ unsigned int extract_flags,
+ size_t *offset0)
+{
+ unsigned int nr, offset, chunk, j;
+ struct page **p;
+ size_t left;
+
+ if (!sanity(i))
+ return -EFAULT;
+
+ offset = pipe_npages(i, &nr);
+ if (!nr)
+ return -EFAULT;
+ *offset0 = offset;
+
+ maxpages = min_t(size_t, nr, maxpages);
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ p = *pages;
+
+ left = maxsize;
+ for (j = 0; j < maxpages; j++) {
+ struct page *page = append_pipe(i, left, &offset);
+ if (!page)
+ break;
+ chunk = min_t(size_t, left, PAGE_SIZE - offset);
+ left -= chunk;
+ *p++ = page;
+ }
+ if (!j)
+ return -EFAULT;
+ return maxsize - left;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_XARRAY iterator. This does not
+ * get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ unsigned int extract_flags,
+ size_t *offset0)
+{
+ struct page *page, **p;
+ unsigned int nr = 0, offset;
+ loff_t pos = i->xarray_start + i->iov_offset;
+ pgoff_t index = pos >> PAGE_SHIFT;
+ XA_STATE(xas, i->xarray, index);
+
+ offset = pos & ~PAGE_MASK;
+ *offset0 = offset;
+
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ p = *pages;
+
+ rcu_read_lock();
+ for (page = xas_load(&xas); page; page = xas_next(&xas)) {
+ if (xas_retry(&xas, page))
+ continue;
+
+ /* Has the page moved or been split? */
+ if (unlikely(page != xas_reload(&xas))) {
+ xas_reset(&xas);
+ continue;
+ }
+
+ p[nr++] = find_subpage(page, xas.xa_index);
+ if (nr == maxpages)
+ break;
+ }
+ rcu_read_unlock();
+
+ maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
+ iov_iter_advance(i, maxsize);
+ return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_BVEC iterator. This does
+ * not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ unsigned int extract_flags,
+ size_t *offset0)
+{
+ struct page **p, *page;
+ size_t skip = i->iov_offset, offset;
+ int k;
+
+ for (;;) {
+ if (i->nr_segs == 0)
+ return 0;
+ maxsize = min(maxsize, i->bvec->bv_len - skip);
+ if (maxsize)
+ break;
+ i->iov_offset = 0;
+ i->nr_segs--;
+ i->bvec++;
+ skip = 0;
+ }
+
+ skip += i->bvec->bv_offset;
+ page = i->bvec->bv_page + skip / PAGE_SIZE;
+ offset = skip % PAGE_SIZE;
+ *offset0 = offset;
+
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ p = *pages;
+ for (k = 0; k < maxpages; k++)
+ p[k] = page + k;
+
+ maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+ iov_iter_advance(i, maxsize);
+ return maxsize;
+}
+
+/*
+ * Extract a list of virtually contiguous pages from an ITER_KVEC iterator.
+ * This does not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_kvec_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ unsigned int extract_flags,
+ size_t *offset0)
+{
+ struct page **p, *page;
+ const void *kaddr;
+ size_t skip = i->iov_offset, offset, len;
+ int k;
+
+ for (;;) {
+ if (i->nr_segs == 0)
+ return 0;
+ maxsize = min(maxsize, i->kvec->iov_len - skip);
+ if (maxsize)
+ break;
+ i->iov_offset = 0;
+ i->nr_segs--;
+ i->kvec++;
+ skip = 0;
+ }
+
+ kaddr = i->kvec->iov_base + skip;
+ offset = (unsigned long)kaddr & ~PAGE_MASK;
+ *offset0 = offset;
+
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ p = *pages;
+
+ kaddr -= offset;
+ len = offset + maxsize;
+ for (k = 0; k < maxpages; k++) {
+ size_t seg = min_t(size_t, len, PAGE_SIZE);
+
+ if (is_vmalloc_or_module_addr(kaddr))
+ page = vmalloc_to_page(kaddr);
+ else
+ page = virt_to_page(kaddr);
+
+ p[k] = page;
+ len -= seg;
+ kaddr += PAGE_SIZE;
+ }
+
+ maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+ iov_iter_advance(i, maxsize);
+ return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get a pin on
+ * each of them. This should only be used if the iterator is user-backed
+ * (IOBUF/UBUF).
+ *
+ * It does not get refs on the pages, but the pages must be unpinned by the
+ * caller once the transfer is complete.
+ *
+ * This is safe to use where background IO/DMA *is* going to be modifying
+ * the buffer; using a pin rather than a ref forces fork() to give the
+ * child a copy of the page.
+ */
+static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
+ struct page ***pages,
+ size_t maxsize,
+ unsigned int maxpages,
+ unsigned int extract_flags,
+ size_t *offset0)
+{
+ unsigned long addr;
+ unsigned int gup_flags = FOLL_PIN;
+ size_t offset;
+ int res;
+
+ if (i->data_source == ITER_DEST)
+ gup_flags |= FOLL_WRITE;
+ if (extract_flags & ITER_ALLOW_P2PDMA)
+ gup_flags |= FOLL_PCI_P2PDMA;
+ if (i->nofault)
+ gup_flags |= FOLL_NOFAULT;
+
+ addr = first_iovec_segment(i, &maxsize);
+ *offset0 = offset = addr % PAGE_SIZE;
+ addr &= PAGE_MASK;
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
+ if (unlikely(res <= 0))
+ return res;
+ maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+ iov_iter_advance(i, maxsize);
+ return maxsize;
+}
+
+/**
+ * iov_iter_extract_pages - Extract a list of contiguous pages from an iterator
+ * @i: The iterator to extract from
+ * @pages: Where to return the list of pages
+ * @maxsize: The maximum amount of iterator to extract
+ * @maxpages: The maximum size of the list of pages
+ * @extract_flags: Flags to qualify request
+ * @offset0: Where to return the starting offset into (*@pages)[0]
+ *
+ * Extract a list of contiguous pages from the current point of the iterator,
+ * advancing the iterator. The maximum number of pages and the maximum amount
+ * of page contents can be set.
+ *
+ * If *@pages is NULL, a page list will be allocated to the required size and
+ * *@pages will be set to its base. If *@pages is not NULL, it will be assumed
+ * that the caller allocated a page list at least @maxpages in size and this
+ * will be filled in.
+ *
+ * @extract_flags can have ITER_ALLOW_P2PDMA set to request peer-to-peer DMA be
+ * allowed on the pages extracted.
+ *
+ * The iov_iter_extract_mode() function can be used to query how cleanup should
+ * be performed.
+ *
+ * Extra refs or pins on the pages may be obtained as follows:
+ *
+ * (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF), pins will be
+ * added to the pages, but refs will not be taken.
+ * iov_iter_extract_mode() will return FOLL_PIN.
+ *
+ * (*) If the iterator is ITER_PIPE, this must describe a destination for the
+ * data. Additional pages may be allocated and added to the pipe (which
+ * will hold the refs), but pins will not be obtained for the caller. The
+ * caller must hold the pipe lock. iov_iter_extract_mode() will return 0.
+ *
+ * (*) If the iterator is ITER_KVEC, ITER_BVEC or ITER_XARRAY, the pages are
+ * merely listed; no extra refs or pins are obtained.
+ * iov_iter_extract_mode() will return 0.
+ *
+ * Note also:
+ *
+ * (*) Use with ITER_DISCARD is not supported as that has no content.
+ *
+ * On success, the function sets *@pages to the new pagelist, if allocated, and
+ * sets *@offset0 to the offset into the first page.
+ *
+ * It may also return -ENOMEM and -EFAULT.
+ */
+ssize_t iov_iter_extract_pages(struct iov_iter *i,
+ struct page ***pages,
+ size_t maxsize,
+ unsigned int maxpages,
+ unsigned int extract_flags,
+ size_t *offset0)
+{
+ maxsize = min_t(size_t, min_t(size_t, maxsize, i->count), MAX_RW_COUNT);
+ if (!maxsize)
+ return 0;
+
+ if (likely(user_backed_iter(i)))
+ return iov_iter_extract_user_pages(i, pages, maxsize,
+ maxpages, extract_flags,
+ offset0);
+ if (iov_iter_is_kvec(i))
+ return iov_iter_extract_kvec_pages(i, pages, maxsize,
+ maxpages, extract_flags,
+ offset0);
+ if (iov_iter_is_bvec(i))
+ return iov_iter_extract_bvec_pages(i, pages, maxsize,
+ maxpages, extract_flags,
+ offset0);
+ if (iov_iter_is_pipe(i))
+ return iov_iter_extract_pipe_pages(i, pages, maxsize,
+ maxpages, extract_flags,
+ offset0);
+ if (iov_iter_is_xarray(i))
+ return iov_iter_extract_xarray_pages(i, pages, maxsize,
+ maxpages, extract_flags,
+ offset0);
+ return -EFAULT;
+}
+EXPORT_SYMBOL_GPL(iov_iter_extract_pages);
This will pin pages or leave them unaltered rather than getting a ref on
them as appropriate to the iterator.
The pages need to be pinned for DIO rather than having refs taken on them to
prevent VM copy-on-write from malfunctioning during a concurrent fork() (the
result of the I/O could otherwise end up being affected by/visible to the
child process).
Signed-off-by: David Howells <[email protected]>
cc: Al Viro <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Jan Kara <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: [email protected]
---
Notes:
ver #8)
- Split the patch up a bit [hch].
- We should only be using pinned/non-pinned pages and not ref'd pages,
so adjust the comments appropriately.
ver #7)
- Don't treat BIO_PAGE_REFFED/PINNED as being the same as FOLL_GET/PIN.
ver #5)
- Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
BIO_* flags and got rid of bi_cleanup_mode.
- Replaced BIO_NO_PAGE_REF with BIO_PAGE_REFFED in the preceding patch.
block/bio.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 6f98bcfc0c92..6dc54bf3ed27 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1213,7 +1213,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
}
if (same_page)
- put_page(page);
+ bio_release_page(bio, page);
return 0;
}
@@ -1227,7 +1227,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
queue_max_zone_append_sectors(q), &same_page) != len)
return -EINVAL;
if (same_page)
- put_page(page);
+ bio_release_page(bio, page);
return 0;
}
@@ -1238,10 +1238,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
* @bio: bio to add pages to
* @iter: iov iterator describing the region to be mapped
*
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Extracts pages from *iter and appends them to @bio's bvec array. The pages
+ * will have to be cleaned up in the way indicated by the BIO_PAGE_PINNED flag.
+ * For a multi-segment *iter, this function only adds pages from the next
+ * non-empty segment of the iov iterator.
*/
static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
{
@@ -1273,9 +1273,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
* result to ensure the bio's total size is correct. The remainder of
* the iov data will be picked up in the next bio iteration.
*/
- size = iov_iter_get_pages(iter, pages,
- UINT_MAX - bio->bi_iter.bi_size,
- nr_pages, &offset, extract_flags);
+ size = iov_iter_extract_pages(iter, &pages,
+ UINT_MAX - bio->bi_iter.bi_size,
+ nr_pages, extract_flags, &offset);
if (unlikely(size <= 0))
return size ? size : -EFAULT;
@@ -1308,7 +1308,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
iov_iter_revert(iter, left);
out:
while (i < nr_pages)
- put_page(pages[i++]);
+ bio_release_page(bio, pages[i++]);
return ret;
}
@@ -1343,7 +1343,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
return 0;
}
- bio_set_flag(bio, BIO_PAGE_REFFED);
+ bio_set_cleanup_mode(bio, iter);
do {
ret = __bio_iov_iter_get_pages(bio, iter);
} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
This will pin pages or leave them unaltered rather than getting a ref on
them as appropriate to the iterator.
The pages need to be pinned for DIO rather than having refs taken on them
to prevent VM copy-on-write from malfunctioning during a concurrent fork()
(the result of the I/O could otherwise end up being visible to/affected by
the child process).
Signed-off-by: David Howells <[email protected]>
cc: Al Viro <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Jan Kara <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: [email protected]
---
Notes:
ver #8)
- Split the patch up a bit [hch].
- We should only be using pinned/non-pinned pages and not ref'd pages,
so adjust the comments appropriately.
ver #7)
- Don't treat BIO_PAGE_REFFED/PINNED as being the same as FOLL_GET/PIN.
ver #5)
- Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
BIO_* flags and got rid of bi_cleanup_mode.
- Replaced BIO_NO_PAGE_REF with BIO_PAGE_REFFED in the preceding patch.
block/blk-map.c | 22 ++++++++++------------
1 file changed, 10 insertions(+), 12 deletions(-)
diff --git a/block/blk-map.c b/block/blk-map.c
index a4ada4389d5e..b9b36af3c3f7 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -282,21 +282,19 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
if (blk_queue_pci_p2pdma(rq->q))
extract_flags |= ITER_ALLOW_P2PDMA;
- bio_set_flag(bio, BIO_PAGE_REFFED);
+ bio_set_cleanup_mode(bio, iter);
while (iov_iter_count(iter)) {
- struct page **pages, *stack_pages[UIO_FASTIOV];
+ struct page *stack_pages[UIO_FASTIOV];
+ struct page **pages = stack_pages;
ssize_t bytes;
size_t offs;
int npages;
- if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
- pages = stack_pages;
- bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
- nr_vecs, &offs, extract_flags);
- } else {
- bytes = iov_iter_get_pages_alloc(iter, &pages,
- LONG_MAX, &offs, extract_flags);
- }
+ if (nr_vecs > ARRAY_SIZE(stack_pages))
+ pages = NULL;
+
+ bytes = iov_iter_extract_pages(iter, &pages, LONG_MAX,
+ nr_vecs, extract_flags, &offs);
if (unlikely(bytes <= 0)) {
ret = bytes ? bytes : -EFAULT;
goto out_unmap;
@@ -318,7 +316,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
if (!bio_add_hw_page(rq->q, bio, page, n, offs,
max_sectors, &same_page)) {
if (same_page)
- put_page(page);
+ bio_release_page(bio, page);
break;
}
@@ -330,7 +328,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
* release the pages we didn't map into the bio, if any
*/
while (j < npages)
- put_page(pages[j++]);
+ bio_release_page(bio, pages[j++]);
if (pages != stack_pages)
kvfree(pages);
/* couldn't stuff something into bio? */
From: Christoph Hellwig <[email protected]>
Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted
meaning and is only set when a page reference has been acquired that needs
to be released by bio_release_pages().
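A minimal sketch of the inversion (the flag values are user-space stand-ins, not the kernel's blk_types.h bit assignments): the release condition flips from "not opted out" to "explicitly opted in".

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits only; not the kernel's actual values. */
#define BIO_NO_PAGE_REF		(1U << 0)	/* old flag: set to skip release */
#define BIO_PAGE_REFFED		(1U << 0)	/* new flag: set to perform release */

/* Old semantics: release unless the bio opted out. */
static bool old_should_release(unsigned int bi_flags)
{
	return !(bi_flags & BIO_NO_PAGE_REF);
}

/* New semantics: release only when a reference was actually taken. */
static bool new_should_release(unsigned int bi_flags)
{
	return bi_flags & BIO_PAGE_REFFED;
}
```

Callers that never take refs (such as bio_iov_bvec_set()) now simply leave the flag clear instead of having to remember to set BIO_NO_PAGE_REF.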
Signed-off-by: David Howells <[email protected]>
cc: Al Viro <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Jan Kara <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: [email protected]
---
Notes:
ver #8)
- Don't default to BIO_PAGE_REFFED [hch].
ver #5)
- Split from patch that uses iov_iter_extract_pages().
block/bio.c | 2 +-
block/blk-map.c | 1 +
fs/direct-io.c | 2 ++
fs/iomap/direct-io.c | 1 -
include/linux/bio.h | 2 +-
include/linux/blk_types.h | 2 +-
6 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index a289bbff036f..40c2b01906da 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1198,7 +1198,6 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
bio->bi_io_vec = (struct bio_vec *)iter->bvec;
bio->bi_iter.bi_bvec_done = iter->iov_offset;
bio->bi_iter.bi_size = size;
- bio_set_flag(bio, BIO_NO_PAGE_REF);
bio_set_flag(bio, BIO_CLONED);
}
@@ -1343,6 +1342,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
return 0;
}
+ bio_set_flag(bio, BIO_PAGE_REFFED);
do {
ret = __bio_iov_iter_get_pages(bio, iter);
} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
diff --git a/block/blk-map.c b/block/blk-map.c
index bc111261fc82..a4ada4389d5e 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -282,6 +282,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
if (blk_queue_pci_p2pdma(rq->q))
extract_flags |= ITER_ALLOW_P2PDMA;
+ bio_set_flag(bio, BIO_PAGE_REFFED);
while (iov_iter_count(iter)) {
struct page **pages, *stack_pages[UIO_FASTIOV];
ssize_t bytes;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 03d381377ae1..07810465fc9d 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -403,6 +403,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
bio->bi_end_io = dio_bio_end_aio;
else
bio->bi_end_io = dio_bio_end_io;
+ /* for now require references for all pages */
+ bio_set_flag(bio, BIO_PAGE_REFFED);
sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
}
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 47db4ead1e74..c0e75900e754 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -202,7 +202,6 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
- bio_set_flag(bio, BIO_NO_PAGE_REF);
__bio_add_page(bio, page, len, 0);
iomap_dio_submit_bio(iter, dio, bio, pos);
}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 10366b8bdb13..805957c99147 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -484,7 +484,7 @@ void zero_fill_bio(struct bio *bio);
static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
{
- if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+ if (bio_flagged(bio, BIO_PAGE_REFFED))
__bio_release_pages(bio, mark_dirty);
}
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..7daa261f4f98 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -318,7 +318,7 @@ struct bio {
* bio flags
*/
enum {
- BIO_NO_PAGE_REF, /* don't put release vec pages */
+ BIO_PAGE_REFFED, /* put pages in bio_release_pages() */
BIO_CLONED, /* doesn't own data */
BIO_BOUNCED, /* bio is a bounce bio */
BIO_QUIET, /* Make BIO Quiet */
Add BIO_PAGE_PINNED to indicate that the pages in a bio are pinned
(FOLL_PIN) and that the pin will need removing.
Signed-off-by: David Howells <[email protected]>
cc: Al Viro <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Jan Kara <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: [email protected]
---
Notes:
ver #8)
- Move the infrastructure to clean up pinned pages to this patch [hch].
- Put BIO_PAGE_PINNED before BIO_PAGE_REFFED as the latter should
probably be removed at some point. FOLL_PIN can then be renumbered
first.
block/bio.c | 7 ++++---
block/blk.h | 28 ++++++++++++++++++++++++++++
include/linux/bio.h | 3 ++-
include/linux/blk_types.h | 1 +
4 files changed, 35 insertions(+), 4 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 40c2b01906da..6f98bcfc0c92 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1170,13 +1170,14 @@ bool bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
void __bio_release_pages(struct bio *bio, bool mark_dirty)
{
+ unsigned int gup_flags = bio_to_gup_flags(bio);
struct bvec_iter_all iter_all;
struct bio_vec *bvec;
bio_for_each_segment_all(bvec, bio, iter_all) {
if (mark_dirty && !PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
- put_page(bvec->bv_page);
+ page_put_unpin(bvec->bv_page, gup_flags);
}
}
EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1496,8 +1497,8 @@ void bio_set_pages_dirty(struct bio *bio)
* the BIO and re-dirty the pages in process context.
*
* It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on. It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on. It will run one page_put_unpin() against each page and will run
+ * one bio_put() against the BIO.
*/
static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk.h b/block/blk.h
index 4c3b3325219a..294044d696e0 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -425,6 +425,34 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
struct page *page, unsigned int len, unsigned int offset,
unsigned int max_sectors, bool *same_page);
+/*
+ * Set the cleanup mode for a bio from an iterator and the extraction flags.
+ */
+static inline void bio_set_cleanup_mode(struct bio *bio, struct iov_iter *iter)
+{
+ unsigned int cleanup_mode = iov_iter_extract_mode(iter);
+
+ if (cleanup_mode & FOLL_GET)
+ bio_set_flag(bio, BIO_PAGE_REFFED);
+ if (cleanup_mode & FOLL_PIN)
+ bio_set_flag(bio, BIO_PAGE_PINNED);
+}
+
+static inline unsigned int bio_to_gup_flags(struct bio *bio)
+{
+ return (bio_flagged(bio, BIO_PAGE_REFFED) ? FOLL_GET : 0) |
+ (bio_flagged(bio, BIO_PAGE_PINNED) ? FOLL_PIN : 0);
+}
+
+/*
+ * Clean up a page appropriately, where the page may be pinned, may have a
+ * ref taken on it or neither.
+ */
+static inline void bio_release_page(struct bio *bio, struct page *page)
+{
+ page_put_unpin(page, bio_to_gup_flags(bio));
+}
+
struct request_queue *blk_alloc_queue(int node_id);
int disk_scan_partitions(struct gendisk *disk, fmode_t mode, void *owner);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 805957c99147..b2c09997d79c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -484,7 +484,8 @@ void zero_fill_bio(struct bio *bio);
static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
{
- if (bio_flagged(bio, BIO_PAGE_REFFED))
+ if (bio_flagged(bio, BIO_PAGE_REFFED) ||
+ bio_flagged(bio, BIO_PAGE_PINNED))
__bio_release_pages(bio, mark_dirty);
}
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 7daa261f4f98..a0e339ff3d09 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -318,6 +318,7 @@ struct bio {
* bio flags
*/
enum {
+ BIO_PAGE_PINNED, /* Unpin pages in bio_release_pages() */
BIO_PAGE_REFFED, /* put pages in bio_release_pages() */
BIO_CLONED, /* doesn't own data */
BIO_BOUNCED, /* bio is a bounce bio */
Renumber FOLL_PIN and FOLL_GET down to bit 0 and 1 respectively so that
they are coincidentally the same as BIO_PAGE_PINNED and BIO_PAGE_REFFED and
also so that they can be stored in the bottom two bits of a page pointer
(something I'm looking at for zerocopy socket fragments).
(Note that BIO_PAGE_REFFED should probably be got rid of at some point,
hence why FOLL_PIN is at 0.)
Also renumber down the other FOLL_* flags to close the gaps.
Signed-off-by: David Howells <[email protected]>
cc: Al Viro <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
Notes:
ver #8)
- Put FOLL_PIN at bit 0 and FOLL_GET at bit 1 to match BIO_PAGE_*.
- Renumber the remaining flags down to fill in the gap.
include/linux/mm.h | 32 +++++++++++++++++---------------
1 file changed, 17 insertions(+), 15 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3de9d88f8524..c95bc4f77e8f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3074,26 +3074,28 @@ static inline vm_fault_t vmf_error(int err)
struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
unsigned int foll_flags);
-#define FOLL_WRITE 0x01 /* check pte is writable */
-#define FOLL_TOUCH 0x02 /* mark page accessed */
-#define FOLL_GET 0x04 /* do get_page on page */
-#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
-#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
-#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
+#define FOLL_PIN 0x01 /* pages must be released via unpin_user_page */
+#define FOLL_PIN 0x01 /* pages must be released via unpin_user_page */
+#define FOLL_WRITE 0x04 /* check pte is writable */
+#define FOLL_TOUCH 0x08 /* mark page accessed */
+#define FOLL_DUMP 0x10 /* give error on hole if it would be zero */
+#define FOLL_FORCE 0x20 /* get_user_pages read/write w/o permission */
+#define FOLL_NOWAIT 0x40 /* if a disk transfer is needed, start the IO
* and return without waiting upon it */
#define FOLL_NOFAULT 0x80 /* do not fault in pages */
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
-#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
-#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
-#define FOLL_ANON 0x8000 /* don't do file mappings */
-#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
-#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
-#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
-#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
-#define FOLL_PCI_P2PDMA 0x100000 /* allow returning PCI P2PDMA pages */
-#define FOLL_INTERRUPTIBLE 0x200000 /* allow interrupts from generic signals */
+#define FOLL_TRIED 0x200 /* a retry, previous pass started an IO */
+#define FOLL_REMOTE 0x400 /* we are working on non-current tsk/mm */
+#define FOLL_ANON 0x800 /* don't do file mappings */
+#define FOLL_LONGTERM 0x1000 /* mapping lifetime is indefinite: see below */
+#define FOLL_SPLIT_PMD 0x2000 /* split huge pmd before returning */
+#define FOLL_FAST_ONLY 0x4000 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_PCI_P2PDMA 0x8000 /* allow returning PCI P2PDMA pages */
+#define FOLL_INTERRUPTIBLE 0x10000 /* allow interrupts from generic signals */
/*
+ * Note that FOLL_PIN is sorted to bit 0 to be coincident with BIO_PAGE_PINNED.
+ *
* FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
* other. Here is what they mean, and how to use them:
*
Looks good:
Reviewed-by: Christoph Hellwig <[email protected]>
Looks good:
Reviewed-by: Christoph Hellwig <[email protected]>
On Mon, Jan 23, 2023 at 05:30:01PM +0000, David Howells wrote:
> From: Christoph Hellwig <[email protected]>
>
> ZERO_PAGE can't go away, no need to hold an extra reference.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> Signed-off-by: David Howells <[email protected]>
I did originally attribute this to you as it's just extracted as
an intermediate step from your patch. But I can live with it being
credited to me if you want.
I think the subject is incorrect. It's not really renamed but
inverted.
On Mon, Jan 23, 2023 at 05:30:04PM +0000, David Howells wrote:
> Add BIO_PAGE_PINNED to indicate that the pages in a bio are pinned
> (FOLL_PIN) and that the pin will need removing.
Looks good:
Reviewed-by: Christoph Hellwig <[email protected]>
Looks good:
Reviewed-by: Christoph Hellwig <[email protected]>
Looks good:
Reviewed-by: Christoph Hellwig <[email protected]>
On Mon, Jan 23, 2023 at 05:30:07PM +0000, David Howells wrote:
> Renumber FOLL_PIN and FOLL_GET down to bit 0 and 1 respectively so that
> they are coincidentally the same as BIO_PAGE_PINNED and BIO_PAGE_REFFED and
> also so that they can be stored in the bottom two bits of a page pointer
> (something I'm looking at for zerocopy socket fragments).
>
> (Note that BIO_PAGE_REFFED should probably be got rid of at some point,
> hence why FOLL_PIN is at 0.)
>
> Also renumber down the other FOLL_* flags to close the gaps.
Taking the risk of having a long bikeshed: I much prefer (1U << bitnr)
style definitions that make it obvious where there are holes, but
otherwise:
Reviewed-by: Christoph Hellwig <[email protected]>
On 1/23/23 09:29, David Howells wrote:
> Hi Al, Christoph,
>
> Here are patches to provide support for extracting pages from an iov_iter
> and to use this in the extraction functions in the block layer bio code.
>
Hi David,
It's great to see this series. I attempted this a few times but got
caught in a loop of "don't quite see all the pieces, but it almost makes
sense...Al Viro has spotted major problems (again)...squirrel!"; and
repeat. :)
I saw your earlier versions go by and expected that they would end up
being an iov_iter prerequisite to getting Direct IO converted over to
FOLL_PIN. But now it looks like you are actually doing the conversion as
well! That's really excellent.
I've made a first pass through the series and have some minor notes
that I'll send out shortly, but it looks nice overall.
thanks,
--
John Hubbard
NVIDIA
On 1/23/23 09:30, David Howells wrote:
> From: Christoph Hellwig <[email protected]>
>
> ZERO_PAGE can't go away, no need to hold an extra reference.
That statement is true, but...
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> Signed-off-by: David Howells <[email protected]>
> ---
> fs/iomap/direct-io.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 9804714b1751..47db4ead1e74 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -202,7 +202,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> bio->bi_private = dio;
> bio->bi_end_io = iomap_dio_bio_end_io;
>
> - get_page(page);
> + bio_set_flag(bio, BIO_NO_PAGE_REF);
...is it accurate to assume that the entire bio is pointing to the zero
page? I recall working through this area earlier last year, and ended up
just letting the zero page get pinned, and then unpinning upon release,
which is harmless.
I like your approach, if it is possible. I'm just not sure that it's
correct given that bio's can have more than one page.
thanks,
--
John Hubbard
NVIDIA
On 1/23/23 09:30, David Howells wrote:
> Provide a helper in the get_user_pages code to drop a pin or a ref on a
> page based on being given FOLL_GET or FOLL_PIN in its flags argument or do
> nothing if neither is set.
>
> Signed-off-by: David Howells <[email protected]>
> cc: Al Viro <[email protected]>
> cc: Christoph Hellwig <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]
> cc: [email protected]
> cc: [email protected]
> ---
> include/linux/mm.h | 3 +++
> mm/gup.c | 22 ++++++++++++++++++++++
> 2 files changed, 25 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8f857163ac89..3de9d88f8524 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1367,6 +1367,9 @@ static inline bool is_cow_mapping(vm_flags_t flags)
> #define SECTION_IN_PAGE_FLAGS
> #endif
>
> +void folio_put_unpin(struct folio *folio, unsigned int flags);
> +void page_put_unpin(struct page *page, unsigned int flags);
How about these names instead:
folio_put_or_unpin()
page_put_or_unpin()
?
Also, could we please change the name of the flags argument, to
gup_flags?
> +
> /*
> * The identification function is mainly used by the buddy allocator for
> * determining if two pages could be buddies. We are not really identifying
> diff --git a/mm/gup.c b/mm/gup.c
> index f45a3a5be53a..3ee4b4c7e0cb 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -191,6 +191,28 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
> folio_put_refs(folio, refs);
> }
>
> +/**
> + * folio_put_unpin - Unpin/put a folio as appropriate
> + * @folio: The folio to release
> + * @flags: gup flags indicating the mode of release (FOLL_*)
> + *
> + * Release a folio according to the flags. If FOLL_GET is set, the folio has a
> + * ref dropped; if FOLL_PIN is set, it is unpinned; otherwise it is left
> + * unaltered.
> + */
> +void folio_put_unpin(struct folio *folio, unsigned int flags)
> +{
> + if (flags & (FOLL_GET | FOLL_PIN))
Another minor complication is that FOLL_PIN is supposed to be an
internal-to-mm flag. But here (and in another part of the series), it
has leaked into the public API. One approach would be to give up and
just admit that, like FOLL_GET, FOLL_PIN has escaped into the wild.
Another approach would be to use a new set of flags, such as
USE_FOLL_GET and USE_FOLL_PIN.
But I'm starting to lean toward the first approach: just let FOLL_PIN be
used in this way, treat it as part of the external API at least for this
area (not for gup/pup calls, though).
So after all that thinking out loud, I think this is OK to use FOLL_PIN.
+Cc Jason Gunthorpe, because he is about to split up FOLL_* into public
and internal sets.
thanks,
--
John Hubbard
NVIDIA
On 1/23/23 09:30, David Howells wrote:
> Renumber FOLL_PIN and FOLL_GET down to bit 0 and 1 respectively so that
> they are coincidentally the same as BIO_PAGE_PINNED and BIO_PAGE_REFFED and
> also so that they can be stored in the bottom two bits of a page pointer
> (something I'm looking at for zerocopy socket fragments).
>
> (Note that BIO_PAGE_REFFED should probably be got rid of at some point,
> hence why FOLL_PIN is at 0.)
>
> Also renumber down the other FOLL_* flags to close the gaps.
Should we also get these sorted into internal-to-mm and public sets?
Because Jason (+Cc) again was about to split them apart into
mm/internal.h [1] and that might make that a little cleaner.
Also, I don't think that there is any large readability difference
either way between 0x and 1<<, so whatever you and Christoph settle on
there seems fine.
So either way with those points,
Reviewed-by: John Hubbard <[email protected]>
thanks,
--
John Hubbard
NVIDIA
>
> Signed-off-by: David Howells <[email protected]>
> cc: Al Viro <[email protected]>
> cc: Christoph Hellwig <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]
> cc: [email protected]
> ---
>
> Notes:
> ver #8)
> - Put FOLL_PIN at bit 0 and FOLL_GET at bit 1 to match BIO_PAGE_*.
> - Renumber the remaining flags down to fill in the gap.
>
> include/linux/mm.h | 32 +++++++++++++++++---------------
> 1 file changed, 17 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 3de9d88f8524..c95bc4f77e8f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3074,26 +3074,28 @@ static inline vm_fault_t vmf_error(int err)
> struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
> unsigned int foll_flags);
>
> -#define FOLL_WRITE 0x01 /* check pte is writable */
> -#define FOLL_TOUCH 0x02 /* mark page accessed */
> -#define FOLL_GET 0x04 /* do get_page on page */
> -#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
> -#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
> -#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
> +#define FOLL_PIN 0x01 /* pages must be released via unpin_user_page */
> +#define FOLL_PIN 0x01 /* pages must be released via unpin_user_page */
> +#define FOLL_WRITE 0x04 /* check pte is writable */
> +#define FOLL_TOUCH 0x08 /* mark page accessed */
> +#define FOLL_DUMP 0x10 /* give error on hole if it would be zero */
> +#define FOLL_FORCE 0x20 /* get_user_pages read/write w/o permission */
> +#define FOLL_NOWAIT 0x40 /* if a disk transfer is needed, start the IO
> * and return without waiting upon it */
> #define FOLL_NOFAULT 0x80 /* do not fault in pages */
> #define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
> -#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
> -#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
> -#define FOLL_ANON 0x8000 /* don't do file mappings */
> -#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
> -#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
> -#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
> -#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
> -#define FOLL_PCI_P2PDMA 0x100000 /* allow returning PCI P2PDMA pages */
> -#define FOLL_INTERRUPTIBLE 0x200000 /* allow interrupts from generic signals */
> +#define FOLL_TRIED 0x200 /* a retry, previous pass started an IO */
> +#define FOLL_REMOTE 0x400 /* we are working on non-current tsk/mm */
> +#define FOLL_ANON 0x800 /* don't do file mappings */
> +#define FOLL_LONGTERM 0x1000 /* mapping lifetime is indefinite: see below */
> +#define FOLL_SPLIT_PMD 0x2000 /* split huge pmd before returning */
> +#define FOLL_FAST_ONLY 0x4000 /* gup_fast: prevent fall-back to slow gup */
> +#define FOLL_PCI_P2PDMA 0x8000 /* allow returning PCI P2PDMA pages */
> +#define FOLL_INTERRUPTIBLE 0x10000 /* allow interrupts from generic signals */
>
> /*
> + * Note that FOLL_PIN is sorted to bit 0 to be coincident with BIO_PAGE_PINNED.
> + *
> * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
> * other. Here is what they mean, and how to use them:
> *
>
>
On 1/23/23 19:08, John Hubbard wrote:
> On 1/23/23 09:30, David Howells wrote:
>> Renumber FOLL_PIN and FOLL_GET down to bit 0 and 1 respectively so that
>> they are coincidentally the same as BIO_PAGE_PINNED and BIO_PAGE_REFFED and
>> also so that they can be stored in the bottom two bits of a page pointer
>> (something I'm looking at for zerocopy socket fragments).
>>
>> (Note that BIO_PAGE_REFFED should probably be got rid of at some point,
>> hence why FOLL_PIN is at 0.)
>>
>> Also renumber down the other FOLL_* flags to close the gaps.
>
> Should we also get these sorted into internal-to-mm and public sets?
> Because Jason (+Cc) again was about to split them apart into
> mm/internal.h [1] and that might make that a little cleaner.
I see that I omitted both Jason's Cc and the reference, so I'll resend.
Sorry for the extra noise.
[1] https://lore.kernel.org/all/[email protected]/
>
> Also, I don't think that there is any large readability difference
> either way between 0x and 1<<, so whatever you and Christoph settle on
> there seems fine.
>
> So either way with those points,
>
> Reviewed-by: John Hubbard <[email protected]>
>
> thanks,
thanks,
--
John Hubbard
NVIDIA
On Mon, Jan 23, 2023 at 06:42:28PM -0800, John Hubbard wrote:
> > @@ -202,7 +202,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> > bio->bi_private = dio;
> > bio->bi_end_io = iomap_dio_bio_end_io;
> > - get_page(page);
> > + bio_set_flag(bio, BIO_NO_PAGE_REF);
>
> ...is it accurate to assume that the entire bio is pointing to the zero
> page? I recall working through this area earlier last year, and ended up
> just letting the zero page get pinned, and then unpinning upon release,
> which is harmless.
Yes, the bio is built 4 lines above what is quoted here, and submitted
right after it. It only contains the ZERO_PAGE.
On 1/23/23 21:59, Christoph Hellwig wrote:
> On Mon, Jan 23, 2023 at 06:42:28PM -0800, John Hubbard wrote:
>>> @@ -202,7 +202,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>>> bio->bi_private = dio;
>>> bio->bi_end_io = iomap_dio_bio_end_io;
>>> - get_page(page);
>>> + bio_set_flag(bio, BIO_NO_PAGE_REF);
>>
>> ...is it accurate to assume that the entire bio is pointing to the zero
>> page? I recall working through this area earlier last year, and ended up
>> just letting the zero page get pinned, and then unpinning upon release,
>> which is harmless.
>
> Yes, the bio is built 4 lines above what is quoted here, and submitted
> right after it. It only contains the ZERO_PAGE.
OK, yes. All good, then.
thanks,
--
John Hubbard
NVIDIA
John Hubbard <[email protected]> wrote:
> > Renumber FOLL_PIN and FOLL_GET down to bit 0 and 1 respectively so that
> > they are coincidentally the same as BIO_PAGE_PINNED and BIO_PAGE_REFFED and
> > also so that they can be stored in the bottom two bits of a page pointer
> > (something I'm looking at for zerocopy socket fragments).
> > (Note that BIO_PAGE_REFFED should probably be got rid of at some point,
> > hence why FOLL_PIN is at 0.)
> > Also renumber down the other FOLL_* flags to close the gaps.
>
> Should we also get these sorted into internal-to-mm and public sets?
> Because Jason (+Cc) again was about to split them apart into
> mm/internal.h [1] and that might make that a little cleaner.
My plan was to push this patch by itself through akpm since it's only an
optimisation and not necessary to the rest of the patches here.
David
On 23.01.23 18:29, David Howells wrote:
> Hi Al, Christoph,
>
> Here are patches to provide support for extracting pages from an iov_iter
> and to use this in the extraction functions in the block layer bio code.
>
> The patches make the following changes:
>
> (1) Add a function, iov_iter_extract_pages() to replace
> iov_iter_get_pages*() that gets refs, pins or just lists the pages as
> appropriate to the iterator type.
>
> Add a function, iov_iter_extract_mode() that will indicate from the
> iterator type how the cleanup is to be performed, returning FOLL_PIN
> or 0.
>
> (2) Add a function, folio_put_unpin(), and a wrapper, page_put_unpin(),
> that take a page and the return from iov_iter_extract_mode() and do
> the right thing to clean up the page.
>
> (3) Make the bio struct carry a pair of flags to indicate the cleanup
> mode. BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
> FOLL_GET) and BIO_PAGE_PINNED (equivalent to BIO_PAGE_PINNED) is
> added.
>
> (4) Add a function, bio_release_page(), to release a page appropriately to
> the cleanup mode indicated by the BIO_PAGE_* flags.
>
> (5) Make the iter-to-bio code use iov_iter_extract_pages() to retain the
> pages appropriately and clean them up later.
>
> (6) Fix bio_flagged() so that it doesn't prevent a gcc optimisation.
>
> (7) Renumber FOLL_PIN and FOLL_GET down so that they're at bits 0 and 1
> and coincident with BIO_PAGE_PINNED and BIO_PAGE_REFFED. The compiler
> can then optimise on that. Also, it's probably going to be necessary
> to embed these in the page pointer in sk_buff fragments. This patch
> can go independently through the mm tree.
^ I feel like some of that information might be stale now that you're
only using FOLL_PIN.
>
> I've pushed the patches here also:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract
I gave this a quick test and it indeed fixes the last remaining test
case of my O_DIRECT+fork tests [1] that was still failing on upstream
(test3).
Once landed upstream, if we feel confident enough (I tend to), we could
adjust the open() man page to state that O_DIRECT can now be run
concurrently with fork(). Especially, the following documentation might
be adjusted:
"O_DIRECT I/Os should never be run concurrently with the fork(2)
system call, if the memory buffer is a private mapping (i.e., any
mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory
allocated on the heap and statically allocated buffers). Any such
I/Os, whether submitted via an asynchronous I/O interface or from
another thread in the process, should be completed before fork(2) is
called. Failure to do so can result in data corruption and undefined
behavior in parent and child processes."
This series does not yet fix vmsplice()+hugetlb ... simply because your
series does not mess with the vmsplice() implementation I assume ;) Once
vmsplice() uses FOLL_PIN, all cow tests should be passing as well. Easy
to test:
$ cd tools/testing/selftests/vm/
$ echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
$ echo 2 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
$ ./cow
...
Bail out! 8 out of 190 tests failed
# Totals: pass:181 fail:8 xfail:0 xpass:0 skip:1 error:0
[1] https://gitlab.com/davidhildenbrand/o_direct_fork_tests
--
Thanks,
David / dhildenb
On Mon, Jan 23, 2023 at 07:11:42PM -0800, John Hubbard wrote:
> On 1/23/23 19:08, John Hubbard wrote:
> > On 1/23/23 09:30, David Howells wrote:
> > > Renumber FOLL_PIN and FOLL_GET down to bit 0 and 1 respectively so that
> > > they are coincidentally the same as BIO_PAGE_PINNED and BIO_PAGE_REFFED and
> > > also so that they can be stored in the bottom two bits of a page pointer
> > > (something I'm looking at for zerocopy socket fragments).
> > >
> > > (Note that BIO_PAGE_REFFED should probably be got rid of at some point,
> > > hence why FOLL_PIN is at 0.)
> > >
> > > Also renumber down the other FOLL_* flags to close the gaps.
> >
> > Should we also get these sorted into internal-to-mm and public sets?
> > Because Jason (+Cc) again was about to split them apart into
> > mm/internal.h [1] and that might make that a little cleaner.
>
> I see that I omitted both Jason's Cc and the reference, so I'll resend.
> Sorry for the extra noise.
>
> [1] https://lore.kernel.org/all/[email protected]/
Yeah, I already wrote a similar patch, using the 1<< notation,
splitting the internal/external, and rebasing on the move to
mm_types.. I can certainly drop that patch if we'd rather do this.
Though, I'm not so keen on using FOLL_ internal flags inside the block
layer.. Can you stick with the BIO versions of these?
Jason
On Tue, Jan 24, 2023 at 01:44:21PM +0100, David Hildenbrand wrote:
> Once landed upstream, if we feel confident enough (I tend to), we could
> adjust the open() man page to state that O_DIRECT can now be run
> concurrently with fork(). Especially, the following documentation might be
> adjusted:
Note that while this series converts the two most commonly used
O_DIRECT implementations, there are various other ones that do not
pin the pages yet.
On Tue, Jan 24, 2023 at 09:13:30AM -0400, Jason Gunthorpe wrote:
> Yeah, I already wrote a similar patch, using the 1<< notation,
> splitting the internal/external, and rebasing on the move to
> mm_types.. I can certainly drop that patch if we'd rather do this.
Given that you are doing more work in that area it might be best
to drop this patch from this series.
> Though, I'm not so keen on using FOLL_ internal flags inside the block
> layer.. Can you stick with the BIO versions of these?
The block layer doesn't really use it - the new helper in iov_iter.c
returns it, and the block layer instantly turns it into an internal
flag. But maybe it should just return a bool pinned (by reference)
now?
On 24.01.23 14:16, Christoph Hellwig wrote:
> On Tue, Jan 24, 2023 at 01:44:21PM +0100, David Hildenbrand wrote:
>> Once landed upstream, if we feel confident enough (I tend to), we could
>> adjust the open() man page to state that O_DIRECT can now be run
>> concurrently with fork(). Especially, the following documentation might be
>> adjusted:
>
> Note that while this series converts the two most commonly used
> O_DIRECT implementations, there are various other ones that do not
> pin the pages yet.
Thanks for the info ... I assume these are then for other filesystems,
right? (such that we could adjust the tests to exercise these as well)
... do we have a list (or is it easy to make one)? :)
--
Thanks,
David / dhildenb
On Tue, Jan 24, 2023 at 02:22:36PM +0100, David Hildenbrand wrote:
> > Note that while this series converts the two most commonly used
> > O_DIRECT implementations, there are various other ones that do not
> > pin the pages yet.
>
> Thanks for the info ... I assume these are then for other filesystems,
> right? (such that we could adjust the tests to exercise these as well)
Yes. There's the fs/direct-io.c code still used by a few block-based
file systems, and then all the non-block-based file systems as well
(e.g. NFS, cifs).
> ... do we have a list (or is it easy to make one)? :)
fs/direct-io.c is easy, just grep for blockdev_direct_IO.
The others are more complicated to find, but a grep for
iov_iter_get_pages2 and iov_iter_get_pages_alloc2 in fs/ should be a
good approximation.
On 24.01.23 14:32, Christoph Hellwig wrote:
> On Tue, Jan 24, 2023 at 02:22:36PM +0100, David Hildenbrand wrote:
>>> Note that while this series converts the two most commonly used
>>> O_DIRECT implementations, there are various other ones that do not
>>> pin the pages yet.
>>
>> Thanks for the info ... I assume these are then for other filesystems,
>> right? (such that we could adjust the tests to exercise these as well)
>
> Yes. There's the fs/direct-io.c code still used by a few block-based
> file systems, and then all the non-block-based file systems as well
> (e.g. NFS, cifs).
>
>> ... do we have a list (or is it easy to make one)? :)
>
> fs/direct-io.c is easy, just grep for blockdev_direct_IO.
>
> The others are more complicated to find, but a grep for
> iov_iter_get_pages2 and iov_iter_get_pages_alloc2 in fs/ should be a
> good approximation.
Right, iov_iter_get_pages2() includes vmsplice() that still needs love.
Thanks!
--
Thanks,
David / dhildenb
Jason Gunthorpe <[email protected]> wrote:
> Though, I'm not so keen on using FOLL_ internal flags inside the block
> layer.. Can you stick with the BIO versions of these?
It's used in a new iov_iter function and will be used in a bunch of
filesystems, including network filesystems. I'm also looking at using it in
the sk_buff handling.
David
On Tue, Jan 24, 2023 at 05:18:07AM -0800, Christoph Hellwig wrote:
> On Tue, Jan 24, 2023 at 09:13:30AM -0400, Jason Gunthorpe wrote:
> > Yeah, I already wrote a similar patch, using the 1<< notation,
> > splitting the internal/external, and rebasing on the move to
> > mm_types.. I can certainly drop that patch if we'd rather do this.
>
> Given that you are doing more work in that area it might be best
> to drop this patch from this series.
>
> > Though, I'm not so keen on using FOLL_ internal flags inside the block
> > layer.. Can you stick with the BIO versions of these?
>
> The block layer doesn't really use it - the new helper in iov_iter.c
> returns it, and the block layer instantly turns it into an internal
> flag. But maybe it should just return a bool pinned (by reference)
> now?
Yes, a bool flag to the new mm helper instead of a FOLL_* seems good
to me
Thanks,
Jason
On Tue, Jan 24, 2023 at 01:46:23PM +0000, David Howells wrote:
> Jason Gunthorpe <[email protected]> wrote:
>
> > Yeah, I already wrote a similar patch, using the 1<< notation,
> > splitting the internal/external, and rebasing on the move to
> > mm_types.. I can certainly drop that patch if we'd rather do this.
>
> Note that I've already given Andrew a patch to move FOLL_* to mm_types.h.
>
> I'm happy to go with your patch on top of that if you can just renumber
> FOLL_PIN to 0 and FOLL_GET to 1.
I moved FOLL_PIN to internal.h because it is not supposed to be used
outside mm/*
I would prefer to keep it that way.
Jason
Jason Gunthorpe <[email protected]> wrote:
> Yeah, I already wrote a similar patch, using the 1<< notation,
> splitting the internal/external, and rebasing on the move to
> mm_types.. I can certainly drop that patch if we'd rather do this.
Note that I've already given Andrew a patch to move FOLL_* to mm_types.h.
I'm happy to go with your patch on top of that if you can just renumber
FOLL_PIN to 0 and FOLL_GET to 1.
David
Jason Gunthorpe <[email protected]> wrote:
> I moved FOLL_PIN to internal.h because it is not supposed to be used
> outside mm/*
>
> I would prefer to keep it that way.
I'll need to do something else, then, such as creating a new pair of 'cleanup'
flags:
[include/linux/mm_types.h]
#define PAGE_CLEANUP_UNPIN (1U << 0)
#define PAGE_CLEANUP_PUT (1U << 1)
[mm/gup.h]
void folio_put_unpin(struct folio *folio, unsigned int cleanup_flags)
{
unsigned int gup_flags = 0;
cleanup_flags &= PAGE_CLEANUP_UNPIN | PAGE_CLEANUP_PUT;
if (!cleanup_flags)
return;
gup_flags |= cleanup_flags & PAGE_CLEANUP_UNPIN ? FOLL_PIN : 0;
gup_flags |= cleanup_flags & PAGE_CLEANUP_PUT ? FOLL_GET : 0;
gup_put_folio(folio, 1, gup_flags);
}
EXPORT_SYMBOL_GPL(folio_put_unpin);
David
On Tue, Jan 24, 2023 at 01:57:08PM +0000, David Howells wrote:
> Jason Gunthorpe <[email protected]> wrote:
>
> > I moved FOLL_PIN to internal.h because it is not supposed to be used
> > outside mm/*
> >
> > I would prefer to keep it that way.
>
> I'll need to do something else, then, such as creating a new pair of 'cleanup'
> flags:
>
> [include/linux/mm_types.h]
> #define PAGE_CLEANUP_UNPIN (1U << 0)
> #define PAGE_CLEANUP_PUT (1U << 1)
>
> [mm/gup.h]
> void folio_put_unpin(struct folio *folio, unsigned int cleanup_flags)
> {
> unsigned int gup_flags = 0;
>
> cleanup_flags &= PAGE_CLEANUP_UNPIN | PAGE_CLEANUP_PUT;
> if (!cleanup_flags)
> return;
> gup_flags |= cleanup_flags & PAGE_CLEANUP_UNPIN ? FOLL_PIN : 0;
> gup_flags |= cleanup_flags & PAGE_CLEANUP_PUT ? FOLL_GET : 0;
> gup_put_folio(folio, 1, gup_flags);
> }
> EXPORT_SYMBOL_GPL(folio_put_unpin);
I suggest:
if (cleanup_flags & PAGE_CLEANUP_UNPIN)
gup_put_folio(folio, 1, true);
else if (cleanup_flags & PAGE_CLEANUP_PUT)
gup_put_folio(folio, 1, false);
or if you are really counting cycles
if (cleanup_flags & PAGE_CLEANUP_NEEDED)
gup_put_folio(folio, 1, cleanup_flags & PAGE_CLEANUP_UNPIN)
Jason
If you look here:
https://lore.kernel.org/r/167391047703.2311931.8115712773222260073.stgit@warthog.procyon.org.uk/
you can see additional patches fixing other users. Christoph asked if I could
pare down the patchset to the minimum to fix the bio case for the moment.
It will be easier to do the others once iov_iter_extract_pages() is upstream
as the individual bits can go via their respective maintainers.
However, I do also want to see about getting the cifs iteratorisation into
this merge window, which likewise depends on the iov_iter_extract_pages()
function being added.
David
On Tue, Jan 24, 2023 at 01:57:08PM +0000, David Howells wrote:
> [include/linux/mm_types.h]
> #define PAGE_CLEANUP_UNPIN (1U << 0)
> #define PAGE_CLEANUP_PUT (1U << 1)
With the latest series we don't need PAGE_CLEANUP_PUT at all.
Jason Gunthorpe <[email protected]> wrote:
> if (cleanup_flags & PAGE_CLEANUP_UNPIN)
> gup_put_folio(folio, 1, true);
> else if (cleanup_flags & PAGE_CLEANUP_PUT)
> gup_put_folio(folio, 1, false);
gup_put_folio() doesn't take a bool - it takes FOLL_PIN and FOLL_GET:
static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
though I could do:
if (cleanup_flags & PAGE_CLEANUP_UNPIN)
gup_put_folio(folio, 1, FOLL_PIN);
else if (cleanup_flags & PAGE_CLEANUP_PUT)
gup_put_folio(folio, 1, FOLL_GET);
But the reason I wrote it like this:
gup_flags |= cleanup_flags & PAGE_CLEANUP_UNPIN ? FOLL_PIN : 0;
gup_flags |= cleanup_flags & PAGE_CLEANUP_PUT ? FOLL_GET : 0;
is that if PAGE_CLEANUP_UNPIN == FOLL_PIN and PAGE_CLEANUP_PUT == FOLL_GET,
the C compiler optimiser should be able to turn all of that into a single AND
instruction.
David
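David's optimisation argument can be sketched in plain userspace C. The constant values below are hypothetical stand-ins, not the kernel's actual definitions: the point is that once PAGE_CLEANUP_UNPIN shares a bit with FOLL_PIN and PAGE_CLEANUP_PUT shares a bit with FOLL_GET, the branchy translation is value-for-value equivalent to a single mask, which is what the compiler can reduce it to.

```c
#include <assert.h>

/* Hypothetical values: this sketch ASSUMES the renumbering proposed in the
 * thread, where the cleanup bits coincide with the FOLL_* bits. */
#define FOLL_PIN           (1u << 0)
#define FOLL_GET           (1u << 1)
#define PAGE_CLEANUP_UNPIN (1u << 0)  /* same bit as FOLL_PIN */
#define PAGE_CLEANUP_PUT   (1u << 1)  /* same bit as FOLL_GET */

/* The branchy translation, as written in the mail. */
static unsigned int to_gup_flags_branchy(unsigned int cleanup_flags)
{
	unsigned int gup_flags = 0;

	gup_flags |= (cleanup_flags & PAGE_CLEANUP_UNPIN) ? FOLL_PIN : 0;
	gup_flags |= (cleanup_flags & PAGE_CLEANUP_PUT) ? FOLL_GET : 0;
	return gup_flags;
}

/* What the optimiser can collapse it to when the constants coincide:
 * a single AND instruction. */
static unsigned int to_gup_flags_masked(unsigned int cleanup_flags)
{
	return cleanup_flags & (FOLL_PIN | FOLL_GET);
}
```

With the constants renumbered this way, the two functions return identical values for every input, so the branches cost nothing.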
Christoph Hellwig <[email protected]> wrote:
> On Tue, Jan 24, 2023 at 01:57:08PM +0000, David Howells wrote:
> > [include/linux/mm_types.h]
> > #define PAGE_CLEANUP_UNPIN (1U << 0)
> > #define PAGE_CLEANUP_PUT (1U << 1)
>
> With the latest series we don't need PAGE_CLEANUP_PUT at all.
We will when it comes to skbuffs.
David
On Tue, Jan 24, 2023 at 02:12:25PM +0000, David Howells wrote:
> > With the latest series we don't need PAGE_CLEANUP_PUT at all.
>
> We will when it comes to skbuffs.
I'm a little doubtful of that. But IFF skbuffs need it, we can
just change the interface at that point.
On Tue, Jan 24, 2023 at 02:11:50PM +0000, David Howells wrote:
> Jason Gunthorpe <[email protected]> wrote:
>
> > if (cleanup_flags & PAGE_CLEANUP_UNPIN)
> > gup_put_folio(folio, 1, true);
> > else if (cleanup_flags & PAGE_CLEANUP_PUT)
> > gup_put_folio(folio, 1, false);
>
> gup_put_folio() doesn't take a bool - it takes FOLL_PIN and FOLL_GET:
My point was to change that to take a 'bool unpin' because FOLL_PIN is
not to be used outside gup.c
Jason
Christoph Hellwig <[email protected]> wrote:
> > > With the latest series we don't need PAGE_CLEANUP_PUT at all.
> >
> > We will when it comes to skbuffs.
>
> I'm a little doubtful of that.
skbuff fragments will be, and will have to be, a mixture of allocated pages
that need putting and zerocopy buffers that are either pinned or neither
ref'd nor pinned. I have posted a patch that works for the limited amount of
driverage that I use on my test machine.
Think network filesystem messages where you have a mixture of protocol bits
generated by the kernel and data provided by direct I/O being sent by zerocopy
(libceph, for example).
David
On 23.01.23 18:29, David Howells wrote:
> Add a function, iov_iter_extract_pages(), to extract a list of pages from
> an iterator. The pages may be returned with a pin added or nothing,
> depending on the type of iterator.
>
> Add a second function, iov_iter_extract_mode(), to determine how the
> cleanup should be done.
>
> There are two cases:
>
> (1) ITER_IOVEC or ITER_UBUF iterator.
>
> Extracted pages will have pins (FOLL_PIN) obtained on them so that a
> concurrent fork() will forcibly copy the page so that DMA is done
> to/from the parent's buffer and is unavailable to/unaffected by the
> child process.
>
> iov_iter_extract_mode() will return FOLL_PIN for this case. The
> caller should use something like folio_put_unpin() to dispose of the
> page.
>
> (2) Any other sort of iterator.
>
> No refs or pins are obtained on the page, the assumption is made that
> the caller will manage page retention.
>
> iov_iter_extract_mode() will return 0. The pages don't need
> additional disposal.
>
> Signed-off-by: David Howells <[email protected]>
> cc: Al Viro <[email protected]>
> cc: Christoph Hellwig <[email protected]>
> cc: John Hubbard <[email protected]>
> cc: David Hildenbrand <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]
> cc: [email protected]
> ---
>
> Notes:
> ver #8)
> - It seems that all DIO is supposed to be done under FOLL_PIN now, and not
> FOLL_GET, so switch to only using pin_user_pages() for user-backed
> iters.
> - Wrap an argument in brackets in the iov_iter_extract_mode() macro.
> - Drop the extract_flags argument to iov_iter_extract_mode() for now
> [hch].
>
> ver #7)
> - Switch to passing in iter-specific flags rather than FOLL_* flags.
> - Drop the direction flags for now.
> - Use ITER_ALLOW_P2PDMA to request FOLL_PCI_P2PDMA.
> - Disallow use of ITER_ALLOW_P2PDMA with non-user-backed iter.
> - Add support for extraction from KVEC-type iters.
> - Use iov_iter_advance() rather than open-coding it.
> - Make BVEC- and KVEC-type skip over initial empty vectors.
>
> ver #6)
> - Add back the function to indicate the cleanup mode.
> - Drop the cleanup_mode return arg to iov_iter_extract_pages().
> - Pass FOLL_SOURCE/DEST_BUF in gup_flags. Check this against the iter
> data_source.
>
> ver #4)
> - Use ITER_SOURCE/DEST instead of WRITE/READ.
> - Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.
>
> ver #3)
> - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
> to get/pin_user_pages_fast()[1].
>
> include/linux/uio.h | 22 +++
> lib/iov_iter.c | 320 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 342 insertions(+)
>
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 46d5080314c6..a8165335f8da 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -363,4 +363,26 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
> /* Flags for iov_iter_get/extract_pages*() */
> #define ITER_ALLOW_P2PDMA 0x01 /* Allow P2PDMA on the extracted pages */
>
> +ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
> + size_t maxsize, unsigned int maxpages,
> + unsigned int extract_flags, size_t *offset0);
> +
> +/**
> + * iov_iter_extract_mode - Indicate how pages from the iterator will be retained
> + * @iter: The iterator
> + *
> + * Examine the iterator and indicate by returning FOLL_PIN or 0 as to how, if
> + * at all, pages extracted from the iterator will be retained by the extraction
> + * function.
> + *
> + * FOLL_PIN indicates that the pages will have a pin placed in them that the
> * caller must unpin. This must be done for DMA/async DIO to force fork()
> + * to forcibly copy a page for the child (the parent must retain the original
> + * page).
> + *
> + * 0 indicates that no measures are taken and that it's up to the caller to
> + * retain the pages.
> + */
> +#define iov_iter_extract_mode(iter) (user_backed_iter(iter) ? FOLL_PIN : 0)
> +
Does it make sense to move that to the patch where it is needed? (do we
need it at all anymore?)
--
Thanks,
David / dhildenb
Jason Gunthorpe <[email protected]> wrote:
> My point was to change that to take a 'bool unpin' because FOLL_PIN is
> not to be used outside gup.c
I need a 3-state wrapper. But if I can't do it in gup.c, then I'll have to do
it elsewhere. As Christoph says, most of the places will only be pinned or
not-pinned. The whole point was to avoid generating new constants when
existing constants would do.
David.
On 23.01.23 18:30, David Howells wrote:
> Provide a helper in the get_user_pages code to drop a pin or a ref on a
> page based on being given FOLL_GET or FOLL_PIN in its flags argument or do
> nothing if neither is set.
Does the FOLL_GET part still apply to this patch set?
--
Thanks,
David / dhildenb
On 23.01.23 18:30, David Howells wrote:
> From: Christoph Hellwig <[email protected]>
>
> ZERO_PAGE can't go away, no need to hold an extra reference.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> Signed-off-by: David Howells <[email protected]>
> ---
> fs/iomap/direct-io.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 9804714b1751..47db4ead1e74 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -202,7 +202,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> bio->bi_private = dio;
> bio->bi_end_io = iomap_dio_bio_end_io;
>
> - get_page(page);
> + bio_set_flag(bio, BIO_NO_PAGE_REF);
> __bio_add_page(bio, page, len, 0);
> iomap_dio_submit_bio(iter, dio, bio, pos);
> }
>
Reviewed-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb
On Tue, Jan 24, 2023 at 02:27:53PM +0000, David Howells wrote:
> Jason Gunthorpe <[email protected]> wrote:
>
> > My point was to change that to take a 'bool unpin' because FOLL_PIN is
> > not to be used outside gup.c
>
> I need a 3-state wrapper. But if I can't do it in gup.c, then I'll have to do
> it elsewhere. As Christoph says, most of the places will only be pinned or
> not-pinned. The whole point was to avoid generating new constants when
> existing constants would do.
What is the 3rd state?
Jason
On 23.01.23 18:30, David Howells wrote:
> Add BIO_PAGE_PINNED to indicate that the pages in a bio are pinned
> (FOLL_PIN) and that the pin will need removing.
>
> Signed-off-by: David Howells <[email protected]>
> cc: Al Viro <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Jan Kara <[email protected]>
> cc: Christoph Hellwig <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: Logan Gunthorpe <[email protected]>
> cc: [email protected]
> ---
>
> Notes:
> ver #8)
> - Move the infrastructure to clean up pinned pages to this patch [hch].
> - Put BIO_PAGE_PINNED before BIO_PAGE_REFFED as the latter should
> probably be removed at some point. FOLL_PIN can then be renumbered
> first.
>
> block/bio.c | 7 ++++---
> block/blk.h | 28 ++++++++++++++++++++++++++++
> include/linux/bio.h | 3 ++-
> include/linux/blk_types.h | 1 +
> 4 files changed, 35 insertions(+), 4 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 40c2b01906da..6f98bcfc0c92 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1170,13 +1170,14 @@ bool bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
>
> void __bio_release_pages(struct bio *bio, bool mark_dirty)
> {
> + unsigned int gup_flags = bio_to_gup_flags(bio);
> struct bvec_iter_all iter_all;
> struct bio_vec *bvec;
>
> bio_for_each_segment_all(bvec, bio, iter_all) {
> if (mark_dirty && !PageCompound(bvec->bv_page))
> set_page_dirty_lock(bvec->bv_page);
> - put_page(bvec->bv_page);
> + page_put_unpin(bvec->bv_page, gup_flags);
> }
> }
> EXPORT_SYMBOL_GPL(__bio_release_pages);
> @@ -1496,8 +1497,8 @@ void bio_set_pages_dirty(struct bio *bio)
> * the BIO and re-dirty the pages in process context.
> *
> * It is expected that bio_check_pages_dirty() will wholly own the BIO from
> - * here on. It will run one put_page() against each page and will run one
> - * bio_put() against the BIO.
> + * here on. It will run one page_put_unpin() against each page and will run
> + * one bio_put() against the BIO.
> */
>
> static void bio_dirty_fn(struct work_struct *work);
> diff --git a/block/blk.h b/block/blk.h
> index 4c3b3325219a..294044d696e0 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -425,6 +425,34 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
> struct page *page, unsigned int len, unsigned int offset,
> unsigned int max_sectors, bool *same_page);
>
> +/*
> + * Set the cleanup mode for a bio from an iterator and the extraction flags.
> + */
> +static inline void bio_set_cleanup_mode(struct bio *bio, struct iov_iter *iter)
> +{
> + unsigned int cleanup_mode = iov_iter_extract_mode(iter);
> +
> + if (cleanup_mode & FOLL_GET)
> + bio_set_flag(bio, BIO_PAGE_REFFED);
> + if (cleanup_mode & FOLL_PIN)
> + bio_set_flag(bio, BIO_PAGE_PINNED);
Can FOLL_GET ever happen?
IOW, can't this even be
if (user_backed_iter(iter))
bio_set_flag(bio, BIO_PAGE_PINNED);
--
Thanks,
David / dhildenb
David Hildenbrand <[email protected]> wrote:
> > +#define iov_iter_extract_mode(iter) (user_backed_iter(iter) ? FOLL_PIN : 0)
> > +
>
> Does it make sense to move that to the patch where it is needed? (do we need
> it at all anymore?)
Yes, we need something. Granted, there are now only two alternatives, not
three: either the pages are going to be pinned or they're not going to be
pinned, but I would rather have a specific function that tells you than just
use, say, user_backed_iter() to make it easier to adjust it later if we need
to - and easier to find the places where we need to adjust.
But if it's preferred we could require people to use user_backed_iter()
instead.
David
On 24.01.23 15:35, David Howells wrote:
> David Hildenbrand <[email protected]> wrote:
>
>>> +#define iov_iter_extract_mode(iter) (user_backed_iter(iter) ? FOLL_PIN : 0)
>>> +
>>
>> Does it make sense to move that to the patch where it is needed? (do we need
>> it at all anymore?)
>
> Yes, we need something. Granted, there are now only two alternatives, not
> three: either the pages are going to be pinned or they're not going to be
> pinned, but I would rather have a specific function that tells you than just
> use, say, user_backed_iter() to make it easier to adjust it later if we need
> to - and easier to find the places where we need to adjust.
>
> But if it's preferred we could require people to use user_backed_iter()
> instead.
At least reduces the occurrences of FOLL_PIN :)
--
Thanks,
David / dhildenb
David Hildenbrand <[email protected]> wrote:
> > Provide a helper in the get_user_pages code to drop a pin or a ref on a
> > page based on being given FOLL_GET or FOLL_PIN in its flags argument or do
> > nothing if neither is set.
>
> Does the FOLL_GET part still apply to this patch set?
Yes. Christoph insisted that the bio conversion patch be split up. That
means there's an interval where you can get FOLL_GET from that. However,
since Jason wants to hide FOLL_PIN, this is going to have to be removed and
the switching done in the bio code for the bio case (until that reduces to
just pinning) and in the skbuff cleanup code (when that is eventually done - that
will have the possibility of skbuffs comprising a mix of ref'd, pinned and
unpinned data, albeit in separate fragments; I've posted patches that
illustrate this[1]).
David
https://lore.kernel.org/all/167391047703.2311931.8115712773222260073.stgit@warthog.procyon.org.uk/ [1] Patches 33-34
David Hildenbrand <[email protected]> wrote:
> At least reduces the occurrences of FOLL_PIN :)
I don't see where the problem is in letting people supply FOLL_PIN or
FOLL_GET. Why even have pin_user_pages() and get_user_pages() since they end
up at the same place. They should be inline wrappers, if separate functions
at all.
David
David Hildenbrand <[email protected]> wrote:
> > +static inline void bio_set_cleanup_mode(struct bio *bio, struct iov_iter *iter)
> > +{
> > + unsigned int cleanup_mode = iov_iter_extract_mode(iter);
> > +
> > + if (cleanup_mode & FOLL_GET)
> > + bio_set_flag(bio, BIO_PAGE_REFFED);
> > + if (cleanup_mode & FOLL_PIN)
> > + bio_set_flag(bio, BIO_PAGE_PINNED);
>
> Can FOLL_GET ever happen?
Yes - unless patches 8 and 9 are merged. I had them as one, but Christoph
split them up.
David
On Tue, Jan 24, 2023 at 02:41:33PM +0000, David Howells wrote:
> Yes. Christoph insisted that the bio conversion patch be split up. That
> means there's an interval where you can get FOLL_GET from that.
The only place where we have both is in the block layer. It never gets
set by bio_set_cleanup_mode.
Instead we can just keep using put_page directly for the BIO_PAGE_REFFED
case in the callers of bio_release_page and in bio_release_pages itself,
and then do away with bio_to_gup_flags and bio_release_page entirely.
On 24.01.23 15:45, David Howells wrote:
> David Hildenbrand <[email protected]> wrote:
>
>> At least reduces the occurrences of FOLL_PIN :)
>
> I don't see where the problem is in letting people supply FOLL_PIN or
> FOLL_GET. Why even have pin_user_pages() and get_user_pages() since they end
> up at the same place. They should be inline wrappers, if separate functions
> at all.
There once was a proposed goal of restricting FOLL_GET to core-mm and
allowing other drivers etc. to only use FOLL_PIN. Providing
pin_user_pages() and the corresponding unpin part seemed very clean to me.
To me that makes perfect sense and will prevent any misuse once we
converted everything relevant to FOLL_PIN.
To be precise, I think we could get away in this patch set by not using
FOLL_PIN and FOLL_GET and it wouldn't really reduce readability of the
code ...
--
Thanks,
David / dhildenb
On Tue, Jan 24, 2023 at 02:47:51PM +0000, David Howells wrote:
> > > +static inline void bio_set_cleanup_mode(struct bio *bio, struct iov_iter *iter)
> > > +{
> > > + unsigned int cleanup_mode = iov_iter_extract_mode(iter);
> > > +
> > > + if (cleanup_mode & FOLL_GET)
> > > + bio_set_flag(bio, BIO_PAGE_REFFED);
> > > + if (cleanup_mode & FOLL_PIN)
> > > + bio_set_flag(bio, BIO_PAGE_PINNED);
> >
> > Can FOLL_GET ever happen?
>
> Yes - unless patches 8 and 9 are merged. I had them as one, but Christoph
> split them up.
It can't. Per your latest branch:
#define iov_iter_extract_mode(iter) (user_backed_iter(iter) ? FOLL_PIN : 0)
On 24.01.23 15:52, Christoph Hellwig wrote:
> On Tue, Jan 24, 2023 at 02:41:33PM +0000, David Howells wrote:
>> Yes. Christoph insisted that the bio conversion patch be split up. That
>> means there's an interval where you can get FOLL_GET from that.
>
> The only place where we have both is in the block layer. It never gets
> set by bio_set_cleanup_mode.
>
> Instead we can just keep using put_page dirctly for the BIO_PAGE_REFFED
> case in the callers of bio_release_page and in bio_release_pages itself,
> and then do away with bio_to_gup_flags and bio_release_page entirely.
>
Thank you, just what I had in mind ...
--
Thanks,
David / dhildenb
Jason Gunthorpe <[email protected]> wrote:
> What is the 3rd state?
Consider a network filesystem message generated for a direct I/O that the
network filesystem does zerocopy on. You may have an sk_buff that has
fragments from one or more of three different sources:
(1) Fragments consisting of specifically allocated pages, such as the
IP/UDP/TCP headers that have refs taken on them.
(2) Fragments consisting of zerocopy kernel buffers that have neither refs
nor pins belonging to the sk_buff.
iov_iter_extract_pages() will not take pins when extracting from, say, an
XARRAY-type or KVEC-type iterator. iov_iter_extract_mode() will return
0.
(3) Fragments consisting of zerocopy user buffers that have pins taken on
them belonging to the sk_buff.
iov_iter_extract_pages() will take pins when extracting from, say, a
UBUF-type or IOVEC-type iterator. iov_iter_extract_mode() will return
FOLL_PIN (at the moment).
So you have three states: Ref'd, pinned and no-retention.
David
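The three-state cleanup David describes can be modelled in userspace C. Every name below is an illustrative stand-in, not kernel API: each fragment records how its pages were obtained, and release either drops a ref (put_page() in the kernel), drops a pin (unpin_user_page()), or does nothing because the caller retains the pages.

```c
/* Userspace model of per-fragment cleanup with three states; the counters
 * stand in for page refcounts and pin counts. */
enum frag_cleanup {
	FRAG_NONE = 0,	/* no-retention: caller manages the pages */
	FRAG_REFFED,	/* ref taken, e.g. allocated protocol headers */
	FRAG_PINNED,	/* pinned, e.g. zerocopy user buffers */
};

struct frag {
	enum frag_cleanup cleanup;
	int *refs;	/* stand-in for the page refcount */
	int *pins;	/* stand-in for the page pin count */
};

static void frag_release(struct frag *f)
{
	switch (f->cleanup) {
	case FRAG_REFFED:
		(*f->refs)--;	/* would be put_page() */
		break;
	case FRAG_PINNED:
		(*f->pins)--;	/* would be unpin_user_page() */
		break;
	case FRAG_NONE:
		break;		/* nothing to do */
	}
}
```

A single sk_buff could carry fragments in all three states at once, which is why a two-state (pinned/not-pinned) interface doesn't cover the skbuff case.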
Christoph Hellwig <[email protected]> wrote:
> It can't. Per your latest branch:
Yes it can. Patch 6:
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -282,6 +282,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
if (blk_queue_pci_p2pdma(rq->q))
extract_flags |= ITER_ALLOW_P2PDMA;
+ bio_set_flag(bio, BIO_PAGE_REFFED);
while (iov_iter_count(iter)) {
struct page **pages, *stack_pages[UIO_FASTIOV];
ssize_t bytes;
Patch 7:
+static inline unsigned int bio_to_gup_flags(struct bio *bio)
+{
+ return (bio_flagged(bio, BIO_PAGE_REFFED) ? FOLL_GET : 0) |
+ (bio_flagged(bio, BIO_PAGE_PINNED) ? FOLL_PIN : 0);
+}
+
+/*
+ * Clean up a page appropriately, where the page may be pinned, may have a
+ * ref taken on it or neither.
+ */
+static inline void bio_release_page(struct bio *bio, struct page *page)
+{
+ page_put_unpin(page, bio_to_gup_flags(bio));
+}
At patch 8, you can get either FOLL_GET, FOLL_PIN or 0, depending on the path
you go through.
David
Christoph Hellwig <[email protected]> wrote:
> The only place where we have both is in the block layer. It never gets
> set by bio_set_cleanup_mode.
It gets set directly by patch 6.
David
On Tue, Jan 24, 2023 at 02:59:25PM +0000, David Howells wrote:
> Jason Gunthorpe <[email protected]> wrote:
>
> > What is the 3rd state?
>
> Consider a network filesystem message generated for a direct I/O that the
> network filesystem does zerocopy on. You may have an sk_buff that has
> fragments from one or more of three different sources:
>
> (1) Fragments consisting of specifically allocated pages, such as the
> IP/UDP/TCP headers that have refs taken on them.
>
> (2) Fragments consisting of zerocopy kernel buffers that have neither refs
> nor pins belonging to the sk_buff.
>
> iov_iter_extract_pages() will not take pins when extracting from, say, an
> XARRAY-type or KVEC-type iterator. iov_iter_extract_mode() will return
> 0.
>
> (3) Fragments consisting of zerocopy user buffers that have pins taken on
> them belonging to the sk_buff.
>
> iov_iter_extract_pages() will take pins when extracting from, say, a
> UBUF-type or IOVEC-type iterator. iov_iter_extract_mode() will return
> FOLL_PIN (at the moment).
>
> So you have three states: Ref'd, pinned and no-retention.
Isn't that this:
if (cleanup_flags & PAGE_CLEANUP_NEEDED)
gup_put_folio(folio, 1, cleanup_flags & PAGE_CLEANUP_UNPIN)
?
Three states - decr get, decr pinned, do nothing?
Jason
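Jason's two-line shape does cover all three states with one branch: a combined "needed" mask decides whether anything is to be done, and the UNPIN bit picks how. A userspace sketch, with hypothetical constants and a stub standing in for gup_put_folio() taking a bool third argument:

```c
#include <stdbool.h>

/* Hypothetical values, assuming distinct bits as proposed upthread. */
#define PAGE_CLEANUP_UNPIN  (1u << 0)
#define PAGE_CLEANUP_PUT    (1u << 1)
#define PAGE_CLEANUP_NEEDED (PAGE_CLEANUP_UNPIN | PAGE_CLEANUP_PUT)

static int puts_done, unpins_done;

/* Stand-in for gup_put_folio(folio, 1, unpin) with a 'bool unpin' arg. */
static void put_folio_stub(bool unpin)
{
	if (unpin)
		unpins_done++;	/* would drop a pin */
	else
		puts_done++;	/* would drop a ref */
}

/* One test for "anything to do at all"; the UNPIN bit selects the action,
 * so ref'd, pinned and no-retention are all handled. */
static void cleanup(unsigned int cleanup_flags)
{
	if (cleanup_flags & PAGE_CLEANUP_NEEDED)
		put_folio_stub(cleanup_flags & PAGE_CLEANUP_UNPIN);
}
```

The no-retention case falls through the single branch untouched, so no third constant is needed.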
Jason Gunthorpe <[email protected]> wrote:
> Isn't that this:
>
> if (cleanup_flags & PAGE_CLEANUP_NEEDED)
> gup_put_folio(folio, 1, cleanup_flags & PAGE_CLEANUP_UNPIN)
>
>
> ?
Yes. As claimed, that would be three states.
> Three states - decr get, decr pinned, do nothing?
Yes. Don't worry about it. I'm not going to do it in gup.c.
David
On Tue, Jan 24, 2023 at 03:03:53PM +0000, David Howells wrote:
> Christoph Hellwig <[email protected]> wrote:
>
> > It can't. Per your latest branch:
>
> Yes it can. Patch 6:
This never involves the cleanup mode as input. And as I pointed out
in the other mail, there is no need for the FOLL_ flags on the
cleanup side. You can just check the bio flag in bio_release_pages
and call either put_page or unpin_user_page on that. The direct
callers of bio_release_page never mix the pin and get cases anyway.
Let me find some time to code this up if it's easier to understand
that way.
On 24.01.23 17:44, Christoph Hellwig wrote:
> On Tue, Jan 24, 2023 at 03:03:53PM +0000, David Howells wrote:
>> Christoph Hellwig <[email protected]> wrote:
>>
>>> It can't. Per your latest branch:
>>
>> Yes it can. Patch 6:
>
> This never involves the cleanup mode as input. And as I pointed out
> in the other mail, there is no need for the FOLL_ flags on the
> cleanup side. You can just check the bio flag in bio_release_pages
> and call either put_page or unpin_user_page on that. The direct
> callers of bio_release_page never mix the pin and get cases anyway.
> Let me find some time to code this up if it's easier to understand
> that way.
In case this series gets resend (which I assume), it would be great to
CC linux-mm on the whole thing.
--
Thanks,
David / dhildenb
This is the incremental patch. Doesn't do the FOLL_PIN to bool
conversion for the extra helper yet, and needs to be folded into the
original patches still.
diff --git a/block/bio.c b/block/bio.c
index 6dc54bf3ed27d4..bd8433f3644fd7 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1170,14 +1170,20 @@ bool bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
void __bio_release_pages(struct bio *bio, bool mark_dirty)
{
- unsigned int gup_flags = bio_to_gup_flags(bio);
+ bool pinned = bio_flagged(bio, BIO_PAGE_PINNED);
+ bool reffed = bio_flagged(bio, BIO_PAGE_REFFED);
struct bvec_iter_all iter_all;
struct bio_vec *bvec;
bio_for_each_segment_all(bvec, bio, iter_all) {
if (mark_dirty && !PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
- page_put_unpin(bvec->bv_page, gup_flags);
+
+ if (pinned)
+ unpin_user_page(bvec->bv_page);
+ /* this can go away once direct-io.c is converted: */
+ else if (reffed)
+ put_page(bvec->bv_page);
}
}
EXPORT_SYMBOL_GPL(__bio_release_pages);
diff --git a/block/blk.h b/block/blk.h
index 294044d696e09f..a16d4425d2751c 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -430,27 +430,19 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
*/
static inline void bio_set_cleanup_mode(struct bio *bio, struct iov_iter *iter)
{
- unsigned int cleanup_mode = iov_iter_extract_mode(iter);
-
- if (cleanup_mode & FOLL_GET)
- bio_set_flag(bio, BIO_PAGE_REFFED);
- if (cleanup_mode & FOLL_PIN)
+ if (iov_iter_extract_mode(iter) & FOLL_PIN)
bio_set_flag(bio, BIO_PAGE_PINNED);
}
-static inline unsigned int bio_to_gup_flags(struct bio *bio)
-{
- return (bio_flagged(bio, BIO_PAGE_REFFED) ? FOLL_GET : 0) |
- (bio_flagged(bio, BIO_PAGE_PINNED) ? FOLL_PIN : 0);
-}
-
/*
* Clean up a page appropriately, where the page may be pinned, may have a
* ref taken on it or neither.
*/
static inline void bio_release_page(struct bio *bio, struct page *page)
{
- page_put_unpin(page, bio_to_gup_flags(bio));
+ WARN_ON_ONCE(bio_flagged(bio, BIO_PAGE_REFFED));
+ if (bio_flagged(bio, BIO_PAGE_PINNED))
+ unpin_user_page(page);
}
struct request_queue *blk_alloc_queue(int node_id);
Christoph Hellwig <[email protected]> wrote:
> + WARN_ON_ONCE(bio_flagged(bio, BIO_PAGE_REFFED));
It's still set by fs/direct-io.c.
David
David Hildenbrand <[email protected]> wrote:
> In case this series gets resend (which I assume), it would be great to CC
> linux-mm on the whole thing.
v9 is already posted, but I hadn't added linux-mm to it. I dropped all the
bits that touched the mm side of things.
David
On Tue, Jan 24, 2023 at 06:37:17PM +0000, David Howells wrote:
> Christoph Hellwig <[email protected]> wrote:
>
> > + WARN_ON_ONCE(bio_flagged(bio, BIO_PAGE_REFFED));
>
> It's still set by fs/direct-io.c.
But that should never end up in bio_release_page, only
bio_release_pages.