Hi Christoph, David,
Since Christoph asked nicely[1] ;-), here are three patches that go on top
of the similar patches for bio structs now in the block tree. They convert
the old block direct-IO code to use iov_iter_extract_pages() and page
pinning. The three patches are:
(1) Make page pinning not add or remove a pin to/from a ZERO_PAGE, thereby
allowing the dio code to insert zero pages in the middle of dealing
with pinned pages.
A pair of functions is provided to wrap the testing of a page or
folio to see if it is a zero page (see the sketch after this list).
(2) Provide a function to allow an additional pin to be taken on a page we
already have pinned (and do nothing for a zero page).
(3) Switch direct-io.c over to using page pinning and to use
iov_iter_extract_pages() so that pages from non-user-backed iterators
aren't pinned.
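For reference, a minimal sketch of what those wrappers look like, assuming
they sit alongside is_zero_pfn() in include/linux/pgtable.h (the names are
assumptions here, though is_zero_folio() is the one referred to in the
review below):

static inline bool is_zero_page(const struct page *page)
{
	return is_zero_pfn(page_to_pfn(page));
}

static inline bool is_zero_folio(const struct folio *folio)
{
	return is_zero_page(&folio->page);
}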
I've pushed the patches here also:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-old-dio
David
Changes
=======
ver #2)
- Fix use of ZERO_PAGE().
- Add wrappers for testing if a page is a zero page.
- Return the zero page obtained, not ZERO_PAGE(0) unconditionally.
- Need to set BIO_PAGE_PINNED conditionally, and not BIO_PAGE_REFFED.
Link: https://lore.kernel.org/r/ZGxfrOLZ4aN9/[email protected]/ [1]
Link: https://lore.kernel.org/r/[email protected]/ # v1
David Howells (3):
mm: Don't pin ZERO_PAGE in pin_user_pages()
mm: Provide a function to get an additional pin on a page
block: Use iov_iter_extract_pages() and page pinning in direct-io.c
fs/direct-io.c | 72 ++++++++++++++++++++++++-----------------
include/linux/mm.h | 1 +
include/linux/pgtable.h | 10 ++++++
mm/gup.c | 54 ++++++++++++++++++++++++++++++-
4 files changed, 107 insertions(+), 30 deletions(-)
Provide a function to get an additional pin on a page that we already have
a pin on. This will be used in fs/direct-io.c when a page that we've
extracted from a user-backed iter needs to go into more than one bio,
rather than redoing the extraction for each bio.
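For context, a rough illustration of the pattern this enables (hypothetical
caller; every name here other than page_get_additional_pin() and
__bio_add_page() is made up for the example):

/*
 * The page was pinned once by iov_iter_extract_pages(); take one more
 * pin for each additional bio it is placed in, so that each bio's
 * completion can drop exactly one pin.
 */
static void add_page_to_extra_bio(struct bio *bio, struct page *page,
				  unsigned int len, unsigned int offset)
{
	page_get_additional_pin(page);
	__bio_add_page(bio, page, len, offset);
}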
Signed-off-by: David Howells <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: David Hildenbrand <[email protected]>
cc: Andrew Morton <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Al Viro <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Jan Kara <[email protected]>
cc: Jeff Layton <[email protected]>
cc: Jason Gunthorpe <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: Hillf Danton <[email protected]>
cc: Christian Brauner <[email protected]>
cc: Linus Torvalds <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
include/linux/mm.h | 1 +
mm/gup.c | 29 +++++++++++++++++++++++++++++
2 files changed, 30 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..931b75dae7ff 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2383,6 +2383,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages);
int pin_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages);
+void page_get_additional_pin(struct page *page);
int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
diff --git a/mm/gup.c b/mm/gup.c
index 69b002628f5d..4b4353a184ed 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -275,6 +275,35 @@ void unpin_user_page(struct page *page)
}
EXPORT_SYMBOL(unpin_user_page);
+/**
+ * page_get_additional_pin - Try to get an additional pin on a pinned page
+ * @page: The page to be pinned
+ *
+ * Get an additional pin on a page we already have a pin on. Makes no change
+ * if the page is the zero_page.
+ */
+void page_get_additional_pin(struct page *page)
+{
+ struct folio *folio = page_folio(page);
+
+ if (page == ZERO_PAGE(0))
+ return;
+
+ /*
+ * Similar to try_grab_folio(): be sure to *also* increment the normal
+ * page refcount field at least once, so that the page really is
+ * pinned.
+ */
+ if (folio_test_large(folio)) {
+ WARN_ON_ONCE(atomic_read(&folio->_pincount) < 1);
+ folio_ref_add(folio, 1);
+ atomic_add(1, &folio->_pincount);
+ } else {
+ WARN_ON_ONCE(folio_ref_count(folio) < GUP_PIN_COUNTING_BIAS);
+ folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+ }
+}
+
static inline struct folio *gup_folio_range_next(struct page *start,
unsigned long npages, unsigned long i, unsigned int *ntails)
{
Change the old block-based direct-I/O code to use iov_iter_extract_pages(),
so that user pages are pinned and kernel pages are left unpinned, rather
than taking refs when submitting bios.

This makes use of the preceding patches to avoid taking pins on the zero
page (thereby allowing zero pages to be inserted amongst pinned pages) and
to get additional pins on pages, allowing an extracted page to be used in
multiple bios without having to re-extract it.
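The heart of the conversion, condensed from the diff below: dio->need_unpin
is set from iov_iter_extract_will_pin(iter) at setup time, and the pin/unpin
helpers only touch pins when it is true:

static void dio_pin_page(struct dio *dio, struct page *page)
{
	/* Pages from kernel-backed iterators aren't pinned, so skip them. */
	if (dio->need_unpin)
		page_get_additional_pin(page);
}

static void dio_unpin_page(struct dio *dio, struct page *page)
{
	/* Only user-backed extractions took pins. */
	if (dio->need_unpin)
		unpin_user_page(page);
}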
Signed-off-by: David Howells <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: David Hildenbrand <[email protected]>
cc: Andrew Morton <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Al Viro <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Jan Kara <[email protected]>
cc: Jeff Layton <[email protected]>
cc: Jason Gunthorpe <[email protected]>
cc: Logan Gunthorpe <[email protected]>
cc: Hillf Danton <[email protected]>
cc: Christian Brauner <[email protected]>
cc: Linus Torvalds <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
Notes:
ver #2)
- Need to set BIO_PAGE_PINNED conditionally, not BIO_PAGE_REFFED.
fs/direct-io.c | 72 ++++++++++++++++++++++++++++++--------------------
1 file changed, 43 insertions(+), 29 deletions(-)
diff --git a/fs/direct-io.c b/fs/direct-io.c
index ad20f3428bab..5d4c5be0fb41 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -42,8 +42,8 @@
#include "internal.h"
/*
- * How many user pages to map in one call to get_user_pages(). This determines
- * the size of a structure in the slab cache
+ * How many user pages to map in one call to iov_iter_extract_pages(). This
+ * determines the size of a structure in the slab cache
*/
#define DIO_PAGES 64
@@ -121,12 +121,13 @@ struct dio {
struct inode *inode;
loff_t i_size; /* i_size when submitted */
dio_iodone_t *end_io; /* IO completion function */
+ bool need_unpin; /* T if we need to unpin the pages */
void *private; /* copy from map_bh.b_private */
/* BIO completion state */
spinlock_t bio_lock; /* protects BIO fields below */
- int page_errors; /* errno from get_user_pages() */
+ int page_errors; /* err from iov_iter_extract_pages() */
int is_async; /* is IO async ? */
bool defer_completion; /* defer AIO completion to workqueue? */
bool should_dirty; /* if pages should be dirtied */
@@ -165,14 +166,14 @@ static inline unsigned dio_pages_present(struct dio_submit *sdio)
*/
static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
{
+ struct page **pages = dio->pages;
const enum req_op dio_op = dio->opf & REQ_OP_MASK;
ssize_t ret;
- ret = iov_iter_get_pages2(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
- &sdio->from);
+ ret = iov_iter_extract_pages(sdio->iter, &pages, LONG_MAX,
+ DIO_PAGES, 0, &sdio->from);
if (ret < 0 && sdio->blocks_available && dio_op == REQ_OP_WRITE) {
- struct page *page = ZERO_PAGE(0);
/*
* A memory fault, but the filesystem has some outstanding
* mapped blocks. We need to use those blocks up to avoid
@@ -180,8 +181,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
*/
if (dio->page_errors == 0)
dio->page_errors = ret;
- get_page(page);
- dio->pages[0] = page;
+ dio->pages[0] = ZERO_PAGE(0);
sdio->head = 0;
sdio->tail = 1;
sdio->from = 0;
@@ -201,9 +201,9 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
/*
* Get another userspace page. Returns an ERR_PTR on error. Pages are
- * buffered inside the dio so that we can call get_user_pages() against a
- * decent number of pages, less frequently. To provide nicer use of the
- * L1 cache.
+ * buffered inside the dio so that we can call iov_iter_extract_pages()
+ * against a decent number of pages, less frequently. To provide nicer use of
+ * the L1 cache.
*/
static inline struct page *dio_get_page(struct dio *dio,
struct dio_submit *sdio)
@@ -219,6 +219,18 @@ static inline struct page *dio_get_page(struct dio *dio,
return dio->pages[sdio->head];
}
+static void dio_pin_page(struct dio *dio, struct page *page)
+{
+ if (dio->need_unpin)
+ page_get_additional_pin(page);
+}
+
+static void dio_unpin_page(struct dio *dio, struct page *page)
+{
+ if (dio->need_unpin)
+ unpin_user_page(page);
+}
+
/*
* dio_complete() - called when all DIO BIO I/O has been completed
*
@@ -402,8 +414,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
bio->bi_end_io = dio_bio_end_aio;
else
bio->bi_end_io = dio_bio_end_io;
- /* for now require references for all pages */
- bio_set_flag(bio, BIO_PAGE_REFFED);
+ if (dio->need_unpin)
+ bio_set_flag(bio, BIO_PAGE_PINNED);
sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
}
@@ -444,8 +456,9 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
*/
static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio)
{
- while (sdio->head < sdio->tail)
- put_page(dio->pages[sdio->head++]);
+ if (dio->need_unpin)
+ unpin_user_pages(dio->pages + sdio->head,
+ sdio->tail - sdio->head);
}
/*
@@ -676,7 +689,7 @@ static inline int dio_new_bio(struct dio *dio, struct dio_submit *sdio,
*
* Return zero on success. Non-zero means the caller needs to start a new BIO.
*/
-static inline int dio_bio_add_page(struct dio_submit *sdio)
+static inline int dio_bio_add_page(struct dio *dio, struct dio_submit *sdio)
{
int ret;
@@ -688,7 +701,7 @@ static inline int dio_bio_add_page(struct dio_submit *sdio)
*/
if ((sdio->cur_page_len + sdio->cur_page_offset) == PAGE_SIZE)
sdio->pages_in_io--;
- get_page(sdio->cur_page);
+ dio_pin_page(dio, sdio->cur_page);
sdio->final_block_in_bio = sdio->cur_page_block +
(sdio->cur_page_len >> sdio->blkbits);
ret = 0;
@@ -743,11 +756,11 @@ static inline int dio_send_cur_page(struct dio *dio, struct dio_submit *sdio,
goto out;
}
- if (dio_bio_add_page(sdio) != 0) {
+ if (dio_bio_add_page(dio, sdio) != 0) {
dio_bio_submit(dio, sdio);
ret = dio_new_bio(dio, sdio, sdio->cur_page_block, map_bh);
if (ret == 0) {
- ret = dio_bio_add_page(sdio);
+ ret = dio_bio_add_page(dio, sdio);
BUG_ON(ret != 0);
}
}
@@ -804,13 +817,13 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
*/
if (sdio->cur_page) {
ret = dio_send_cur_page(dio, sdio, map_bh);
- put_page(sdio->cur_page);
+ dio_unpin_page(dio, sdio->cur_page);
sdio->cur_page = NULL;
if (ret)
return ret;
}
- get_page(page); /* It is in dio */
+ dio_pin_page(dio, page); /* It is in dio */
sdio->cur_page = page;
sdio->cur_page_offset = offset;
sdio->cur_page_len = len;
@@ -825,7 +838,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
ret = dio_send_cur_page(dio, sdio, map_bh);
if (sdio->bio)
dio_bio_submit(dio, sdio);
- put_page(sdio->cur_page);
+ dio_unpin_page(dio, sdio->cur_page);
sdio->cur_page = NULL;
}
return ret;
@@ -926,7 +939,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
ret = get_more_blocks(dio, sdio, map_bh);
if (ret) {
- put_page(page);
+ dio_unpin_page(dio, page);
goto out;
}
if (!buffer_mapped(map_bh))
@@ -971,7 +984,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
/* AKPM: eargh, -ENOTBLK is a hack */
if (dio_op == REQ_OP_WRITE) {
- put_page(page);
+ dio_unpin_page(dio, page);
return -ENOTBLK;
}
@@ -984,7 +997,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
if (sdio->block_in_file >=
i_size_aligned >> blkbits) {
/* We hit eof */
- put_page(page);
+ dio_unpin_page(dio, page);
goto out;
}
zero_user(page, from, 1 << blkbits);
@@ -1024,7 +1037,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
sdio->next_block_for_io,
map_bh);
if (ret) {
- put_page(page);
+ dio_unpin_page(dio, page);
goto out;
}
sdio->next_block_for_io += this_chunk_blocks;
@@ -1039,8 +1052,8 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
break;
}
- /* Drop the ref which was taken in get_user_pages() */
- put_page(page);
+ /* Drop the pin which was taken in get_user_pages() */
+ dio_unpin_page(dio, page);
}
out:
return ret;
@@ -1135,6 +1148,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
/* will be released by direct_io_worker */
inode_lock(inode);
}
+ dio->need_unpin = iov_iter_extract_will_pin(iter);
/* Once we sampled i_size check for reads beyond EOF */
dio->i_size = i_size_read(inode);
@@ -1259,7 +1273,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
ret2 = dio_send_cur_page(dio, &sdio, &map_bh);
if (retval == 0)
retval = ret2;
- put_page(sdio.cur_page);
+ dio_unpin_page(dio, sdio.cur_page);
sdio.cur_page = NULL;
}
if (sdio.bio)
On Thu, May 25, 2023 at 3:40 PM David Howells <[email protected]> wrote:
>
> +void page_get_additional_pin(struct page *page)
> +{
> + struct folio *folio = page_folio(page);
> +
> + if (page == ZERO_PAGE(0))
> + return;
You added that nice "is_zero_folio()", and then you did the above anyway..
Linus
Linus Torvalds <[email protected]> wrote:
> > + if (page == ZERO_PAGE(0))
> > + return;
>
> You added that nice "is_zero_folio()", and then you did the above anyway..
Bah. Missed it because it was in a different patch.
David
On Thu, May 25, 2023 at 11:39:53PM +0100, David Howells wrote:
> Change the old block-based direct-I/O code to use iov_iter_extract_pages(),
> so that user pages are pinned and kernel pages are left unpinned, rather
> than taking refs when submitting bios.
>
> This makes use of the preceding patches to avoid taking pins on the zero
> page (thereby allowing zero pages to be inserted amongst pinned pages) and
> to get additional pins on pages, allowing an extracted page to be used in
> multiple bios without having to re-extract it.
I'm not seeing where we skip the unpin of the zero page, as commented
in patch 1 (but maybe I'm not reviewing carefully enough as I'm at a
conference right now).
Otherwise my only rather cosmetic comment right now is that I'd have called
the "need_unpin" member is_pinned.
On Thu, May 25, 2023 at 11:39:52PM +0100, David Howells wrote:
> +/**
> + * page_get_additional_pin - Try to get an additional pin on a pinned page
> + * @page: The page to be pinned
> + *
> + * Get an additional pin on a page we already have a pin on. Makes no change
> + * if the page is the zero_page.
> + */
> +void page_get_additional_pin(struct page *page)
page_get_additional_pin seems like an odd name, mixing the get and
pin terminologies. What about repin_page? Or move to a folio interface
from the start and call it folio_repin?
Christoph Hellwig <[email protected]> wrote:
> I'm not seeing where we skip the unpin of the zero page, as commented
> in patch 1 (but maybe I'm not reviewing carefully enough as I'm at a
> conference right now).
It's done by unpin_user_page{,s}(), hidden away in gup.c. See the commit
message for patch 1:
Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a
pointer to it from the page tables and make unpin_user_page*()
correspondingly ignore a ZERO_PAGE when unpinning.
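In code terms the shape is roughly this (illustrative sketch only; the
actual check in patch 1 may sit further down in the folio put path rather
than at the top of unpin_user_page()):

void unpin_user_page(struct page *page)
{
	/* A ZERO_PAGE never had a pin taken on it, so don't drop one. */
	if (is_zero_page(page))
		return;
	gup_put_folio(page_folio(page), 1, FOLL_PIN);
}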
David
Christoph Hellwig <[email protected]> wrote:
> > +void page_get_additional_pin(struct page *page)
>
> page_get_additional_pin seems like an odd name, mixing the get and
> pin terminologies. What about repin_page?
I considered that, though repin_page() suggests putting a pin back in after
one has been removed. I can go with that if no one objects.
> Or move to a folio interface from the start and call it folio_repin?
I also considered this, but the entire gup interface is page-based at the
moment. I can do that too :-/
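Something like this, presumably (hypothetical: just the posted helper
transcribed onto a folio, with the is_zero_folio() fix folded in; the name
is Christoph's suggestion):

void folio_repin(struct folio *folio)
{
	if (is_zero_folio(folio))
		return;

	if (folio_test_large(folio)) {
		WARN_ON_ONCE(atomic_read(&folio->_pincount) < 1);
		folio_ref_add(folio, 1);
		atomic_add(1, &folio->_pincount);
	} else {
		WARN_ON_ONCE(folio_ref_count(folio) < GUP_PIN_COUNTING_BIAS);
		folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
	}
}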
David