2022-02-25 11:24:51

by John Hubbard

[permalink] [raw]
Subject: [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN

Hi,

Summary:

This puts some prerequisites in place, including a CONFIG parameter,
making it possible to start converting and testing the Direct IO part of
each filesystem, from get_user_pages_fast(), to pin_user_pages_fast().

It will take "a few" kernel releases to get the whole thing done.

Details:

As part of fixing the "get_user_pages() + file-backed memory" problem
[1], and to support various COW-related fixes as well [2], we need to
convert the Direct IO code from get_user_pages_fast(), to
pin_user_pages_fast(). Because pin_user_pages*() calls require a
corresponding call to unpin_user_page(), the conversion is more
elaborate than just substitution.

Further complicating the conversion, the block/bio layers get their
Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(),
each of which has a large number of callers. All of those callers need
to be audited and changed so that they call unpin_user_page(), rather
than put_page().

After quite some time exploring and consulting with people as well, it
is clear that this cannot be done in just one patchset. That's because,
not only is this large and time-consuming (for example, Chaitanya
Kulkarni's first reaction, after looking into the details, was, "convert
the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."),
but it is also spread across many filesystems.

With that in mind, let's apply most of this patchset soon-ish, and then
work on the filesystem conversions, likely over the course of a few
kernel releases. Once complete, then apply the last patch, and then one
final name change to remove the dio_w_ prefixes, and get us back to the
original names.

In this patchset:

Patches 1, 2, 3: provide the prerequisites to start converting call
sites to call the new dio_w_*() wrapper functions.

Patch 4: convert the core allocation routines to
dio_w_pin_user_pages_fast().

Patches 5, 6: convert a couple of callers (NFS, fuse) to use FOLL_PIN.
This also is a placeholder to show that "filesystems need to be
converted at this point".

At this point, Ubuntu 20.04 boots up and is able to support running some
fio direct IO tests, while keeping the foll pin counts in /proc/vmstat
balanced. (Ubuntu uses fuse during startup, interestingly enough.)

Patch 7: Get rid of the CONFIG parameter, thus effectively switching the
default Direct IO mechanism over to pin_user_pages_fast().

(Not shown): Patch 8: trivial but large: rename everything to get rid of
the dio_w_ prefix, and delete the wrappers.

This is based on mmotm as of about an hour ago. I've also stashed it
here:

https://github.com/johnhubbard/linux bio_pup_mmotm_20220224

[1] https://lwn.net/Articles/753027/ "The trouble with get_user_pages()"

[2] https://lore.kernel.org/all/[email protected]/T/#u
(David Hildenbrand's mm/COW fixes)

John Hubbard (7):
mm/gup: introduce pin_user_page()
block: add dio_w_*() wrappers for pin, unpin user pages
block, fs: assert that key paths use iovecs, and nothing else
block, bio, fs: initial pin_user_pages_fast() changes
NFS: direct-io: convert to FOLL_PIN pages
fuse: convert direct IO paths to use FOLL_PIN
block, direct-io: flip the switch: use pin_user_pages_fast()

block/bio.c | 22 +++++++++++++---------
block/blk-map.c | 4 ++--
fs/direct-io.c | 26 ++++++++++++++------------
fs/fuse/dev.c | 5 ++++-
fs/fuse/file.c | 23 ++++++++---------------
fs/iomap/direct-io.c | 2 +-
fs/nfs/direct.c | 2 +-
include/linux/bvec.h | 4 ++++
include/linux/mm.h | 1 +
lib/iov_iter.c | 4 ++--
mm/gup.c | 34 ++++++++++++++++++++++++++++++++++
11 files changed, 84 insertions(+), 43 deletions(-)


base-commit: 218d3ca9c0ea1c35f1bc5099325b7df54b52bbdd
prerequisite-patch-id: 7d5a742e37171a15d83a9b3ac9ba0951b573eed8
--
2.35.1


2022-02-25 11:26:05

by John Hubbard

[permalink] [raw]
Subject: [RFC PATCH 2/7] block: add dio_w_*() wrappers for pin, unpin user pages

Add a new config parameter, CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO, and
dio_w_*() wrapper functions. Together, these allow the developer to
choose between these sets of routines, for Direct IO code paths:

a) pin_user_pages_fast()
pin_user_page()
unpin_user_page()

b) get_user_pages_fast()
get_page()
put_page()

CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO is a temporary setting, and will
be deleted once the conversion is complete. In the meantime, developers
can enable this in order to try out each filesystem.

More information: The Direct IO part of the block infrastructure is
being changed to use pin_user_page*() and unpin_user_page*() calls, in
place of a mix of get_user_pages_fast(), get_page(), and put_page().
These have to be changed over all at the same time, for block, bio, and
all filesystems.

While that changeover is in progress (but disabled via this new CONFIG
option), kernel developers need a way to test their changes. The steps
are:

a) Enable CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO

b) Monitor these /proc/vmstat items:

nr_foll_pin_acquired
nr_foll_pin_released

...to ensure that they remain equal, when "at rest".

Signed-off-by: John Hubbard <[email protected]>
---
block/Kconfig | 25 +++++++++++++++++++++++++
include/linux/bvec.h | 11 +++++++++++
2 files changed, 36 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index 168b873eb666..f6ca5e9597e4 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -50,6 +50,31 @@ config BLK_DEV_BSG_COMMON
config BLK_ICQ
bool

+config BLK_USE_PIN_USER_PAGES_FOR_DIO
+ bool "DEVELOPERS ONLY: Enable pin_user_pages() for Direct IO" if EXPERT
+ default n
+ help
+ For Direct IO code, retain the pages via calls to
+ pin_user_pages_fast(), instead of via get_user_pages_fast().
+ Likewise, use pin_user_page() instead of get_page(). And then
+ release such pages via unpin_user_page(), instead of
+ put_page().
+
+ This is a temporary setting, which will be deleted once the
+ conversion is completed, reviewed, and tested. In the meantime,
+ developers can enable this in order to try out each filesystem.
+ For that, it's best to monitor these /proc/vmstat items:
+
+ nr_foll_pin_acquired
+ nr_foll_pin_released
+
+ ...to ensure that they remain equal, when "at rest".
+
+ Say yes here ONLY if are actively developing or testing the
+ block layer or filesystems with pin_user_pages_fast().
+ Otherwise, this is just a way to throw off the refcounting of
+ pages in the system.
+
config BLK_DEV_BSGLIB
bool "Block layer SG support v4 helper lib"
select BLK_DEV_BSG_COMMON
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 35c25dff651a..a96a68c687f6 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -241,4 +241,15 @@ static inline void *bvec_virt(struct bio_vec *bvec)
return page_address(bvec->bv_page) + bvec->bv_offset;
}

+#ifdef CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO
+#define dio_w_pin_user_pages_fast(s, n, p, f) pin_user_pages_fast(s, n, p, f)
+#define dio_w_pin_user_page(p) pin_user_page(p)
+#define dio_w_unpin_user_page(p) unpin_user_page(p)
+
+#else
+#define dio_w_pin_user_pages_fast(s, n, p, f) get_user_pages_fast(s, n, p, f)
+#define dio_w_pin_user_page(p) get_page(p)
+#define dio_w_unpin_user_page(p) put_page(p)
+#endif
+
#endif /* __LINUX_BVEC_H */
--
2.35.1

2022-02-25 11:50:21

by John Hubbard

[permalink] [raw]
Subject: [RFC PATCH 6/7] fuse: convert direct IO paths to use FOLL_PIN

Convert the fuse filesystem to support the new iov_iter_get_pages()
behavior. That routine now invokes pin_user_pages_fast(), which means that
such pages must be released via unpin_user_page(), rather than via
put_page().

This commit also removes any possibility of kernel pages being handled, in
the fuse_get_user_pages() call. Although this may seem like a steep price
to pay, Christoph Hellwig actually recommended it a few years ago for
nearly the same situation [1].

[1] https://lore.kernel.org/kvm/[email protected]/

Signed-off-by: John Hubbard <[email protected]>
---
fs/fuse/dev.c | 5 ++++-
fs/fuse/file.c | 23 ++++++++---------------
2 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e1b4a846c90d..a93037c96b89 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -675,7 +675,10 @@ static void fuse_copy_finish(struct fuse_copy_state *cs)
flush_dcache_page(cs->pg);
set_page_dirty_lock(cs->pg);
}
- put_page(cs->pg);
+ if (cs->pipebufs)
+ put_page(cs->pg);
+ else
+ unpin_user_page(cs->pg);
}
cs->pg = NULL;
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 94747bac3489..395c2fb613fb 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -619,7 +619,7 @@ static void fuse_release_user_pages(struct fuse_args_pages *ap,
for (i = 0; i < ap->num_pages; i++) {
if (should_dirty)
set_page_dirty_lock(ap->pages[i]);
- put_page(ap->pages[i]);
+ unpin_user_page(ap->pages[i]);
}
}

@@ -1382,20 +1382,13 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
size_t nbytes = 0; /* # bytes already packed in req */
ssize_t ret = 0;

- /* Special case for kernel I/O: can copy directly into the buffer */
- if (iov_iter_is_kvec(ii)) {
- unsigned long user_addr = fuse_get_user_addr(ii);
- size_t frag_size = fuse_get_frag_size(ii, *nbytesp);
-
- if (write)
- ap->args.in_args[1].value = (void *) user_addr;
- else
- ap->args.out_args[0].value = (void *) user_addr;
-
- iov_iter_advance(ii, frag_size);
- *nbytesp = frag_size;
- return 0;
- }
+ /*
+ * TODO: this warning can eventually be removed. It is just to help
+ * developers, during a transitional period, while direct IO code is
+ * moving over to FOLL_PIN pages.
+ */
+ if (WARN_ON_ONCE(!iter_is_iovec(ii)))
+ return -EOPNOTSUPP;

while (nbytes < *nbytesp && ap->num_pages < max_pages) {
unsigned npages;
--
2.35.1

2022-02-25 12:18:04

by John Hubbard

[permalink] [raw]
Subject: [RFC PATCH 5/7] NFS: direct-io: convert to FOLL_PIN pages

Now that Direct IO's core allocators invoke pin_user_pages_fast(), those
pages must free released via unpin_user_page(), instead of put_page().

Signed-off-by: John Hubbard <[email protected]>
---
fs/nfs/direct.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index eabfdab543c8..2e0d399c5a5a 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -181,7 +181,7 @@ static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
{
unsigned int i;
for (i = 0; i < npages; i++)
- put_page(pages[i]);
+ dio_w_unpin_user_page(pages[i]);
}

void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
--
2.35.1

2022-02-25 12:37:59

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN

On Fri 25-02-22 00:50:18, John Hubbard wrote:
> Hi,
>
> Summary:
>
> This puts some prerequisites in place, including a CONFIG parameter,
> making it possible to start converting and testing the Direct IO part of
> each filesystem, from get_user_pages_fast(), to pin_user_pages_fast().
>
> It will take "a few" kernel releases to get the whole thing done.
>
> Details:
>
> As part of fixing the "get_user_pages() + file-backed memory" problem
> [1], and to support various COW-related fixes as well [2], we need to
> convert the Direct IO code from get_user_pages_fast(), to
> pin_user_pages_fast(). Because pin_user_pages*() calls require a
> corresponding call to unpin_user_page(), the conversion is more
> elaborate than just substitution.
>
> Further complicating the conversion, the block/bio layers get their
> Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(),
> each of which has a large number of callers. All of those callers need
> to be audited and changed so that they call unpin_user_page(), rather
> than put_page().
>
> After quite some time exploring and consulting with people as well, it
> is clear that this cannot be done in just one patchset. That's because,
> not only is this large and time-consuming (for example, Chaitanya
> Kulkarni's first reaction, after looking into the details, was, "convert
> the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."),
> but it is also spread across many filesystems.

With having modified fs/direct-io.c and fs/iomap/direct-io.c which
filesystems do you know are missing conversion? Or is it that you just want
to make sure with audit everything is fine? The only fs I could find
unconverted by your changes is ceph. Am I missing something?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-02-25 13:04:26

by John Hubbard

[permalink] [raw]
Subject: [RFC PATCH 7/7] block, direct-io: flip the switch: use pin_user_pages_fast()

Remove CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO, but leave the dio_w_*()
wrapper functions in place, with the pin_user_pages_fast() defines as
the only choice.

A subsequent patch is now possible, to rename all dio_w_*() functions so
as to remove the dio_w_ prefix.

Signed-off-by: John Hubbard <[email protected]>
---
block/Kconfig | 25 -------------------------
include/linux/bvec.h | 7 -------
2 files changed, 32 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index f6ca5e9597e4..168b873eb666 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -50,31 +50,6 @@ config BLK_DEV_BSG_COMMON
config BLK_ICQ
bool

-config BLK_USE_PIN_USER_PAGES_FOR_DIO
- bool "DEVELOPERS ONLY: Enable pin_user_pages() for Direct IO" if EXPERT
- default n
- help
- For Direct IO code, retain the pages via calls to
- pin_user_pages_fast(), instead of via get_user_pages_fast().
- Likewise, use pin_user_page() instead of get_page(). And then
- release such pages via unpin_user_page(), instead of
- put_page().
-
- This is a temporary setting, which will be deleted once the
- conversion is completed, reviewed, and tested. In the meantime,
- developers can enable this in order to try out each filesystem.
- For that, it's best to monitor these /proc/vmstat items:
-
- nr_foll_pin_acquired
- nr_foll_pin_released
-
- ...to ensure that they remain equal, when "at rest".
-
- Say yes here ONLY if are actively developing or testing the
- block layer or filesystems with pin_user_pages_fast().
- Otherwise, this is just a way to throw off the refcounting of
- pages in the system.
-
config BLK_DEV_BSGLIB
bool "Block layer SG support v4 helper lib"
select BLK_DEV_BSG_COMMON
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index a96a68c687f6..5bc98b334efe 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -241,15 +241,8 @@ static inline void *bvec_virt(struct bio_vec *bvec)
return page_address(bvec->bv_page) + bvec->bv_offset;
}

-#ifdef CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO
#define dio_w_pin_user_pages_fast(s, n, p, f) pin_user_pages_fast(s, n, p, f)
#define dio_w_pin_user_page(p) pin_user_page(p)
#define dio_w_unpin_user_page(p) unpin_user_page(p)

-#else
-#define dio_w_pin_user_pages_fast(s, n, p, f) get_user_pages_fast(s, n, p, f)
-#define dio_w_pin_user_page(p) get_page(p)
-#define dio_w_unpin_user_page(p) put_page(p)
-#endif
-
#endif /* __LINUX_BVEC_H */
--
2.35.1

2022-02-25 20:52:11

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN

On 25.02.22 09:50, John Hubbard wrote:
> Hi,
>
> Summary:
>
> This puts some prerequisites in place, including a CONFIG parameter,
> making it possible to start converting and testing the Direct IO part of
> each filesystem, from get_user_pages_fast(), to pin_user_pages_fast().
>
> It will take "a few" kernel releases to get the whole thing done.
>
> Details:
>
> As part of fixing the "get_user_pages() + file-backed memory" problem
> [1], and to support various COW-related fixes as well [2], we need to
> convert the Direct IO code from get_user_pages_fast(), to
> pin_user_pages_fast(). Because pin_user_pages*() calls require a
> corresponding call to unpin_user_page(), the conversion is more
> elaborate than just substitution.
>
> Further complicating the conversion, the block/bio layers get their
> Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(),
> each of which has a large number of callers. All of those callers need
> to be audited and changed so that they call unpin_user_page(), rather
> than put_page().

vmsplice is another candidate that uses iov_iter_get_pages() and should
be converted to FOLL_PIN. For that particular user, we have to also pass
FOLL_LONGTERM -- vmsplice as it stands can block memory hotunplug / CMA
/ ... for all eternity.

--
Thanks,

David / dhildenb

2022-02-26 00:17:01

by Chaitanya Kulkarni

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN

On 2/25/22 04:05, Jan Kara wrote:
> On Fri 25-02-22 00:50:18, John Hubbard wrote:
>> Hi,
>>
>> Summary:
>>
>> This puts some prerequisites in place, including a CONFIG parameter,
>> making it possible to start converting and testing the Direct IO part of
>> each filesystem, from get_user_pages_fast(), to pin_user_pages_fast().
>>
>> It will take "a few" kernel releases to get the whole thing done.
>>
>> Details:
>>
>> As part of fixing the "get_user_pages() + file-backed memory" problem
>> [1], and to support various COW-related fixes as well [2], we need to
>> convert the Direct IO code from get_user_pages_fast(), to
>> pin_user_pages_fast(). Because pin_user_pages*() calls require a
>> corresponding call to unpin_user_page(), the conversion is more
>> elaborate than just substitution.
>>
>> Further complicating the conversion, the block/bio layers get their
>> Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(),
>> each of which has a large number of callers. All of those callers need
>> to be audited and changed so that they call unpin_user_page(), rather
>> than put_page().
>>
>> After quite some time exploring and consulting with people as well, it
>> is clear that this cannot be done in just one patchset. That's because,
>> not only is this large and time-consuming (for example, Chaitanya
>> Kulkarni's first reaction, after looking into the details, was, "convert
>> the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."),
>> but it is also spread across many filesystems.
>
> With having modified fs/direct-io.c and fs/iomap/direct-io.c which
> filesystems do you know are missing conversion? Or is it that you just want
> to make sure with audit everything is fine? The only fs I could find
> unconverted by your changes is ceph. Am I missing something?

if I understand your comment correctly file systems which are listed in
the list see [1] (all the credit goes to John to have a complete list)
that are not using iomap but use XXX_XXX_direct_IO() should be fine,
since in the callchain going from :-

XXX_XXX_direct_io()
__blkdev_direct_io()
do_direct_io()

...

submit_page_selection()
get/put_page() <---

will take care of itself ?

>
> Honza
>

[1]

jfs_direct_IO()
nilfs_direct_IO()
ntfs_dirct_IO()
reiserfs_direct_IO()
udf_direct_IO()
ocfs2_dirct_IO()
affs_direct_IO()
exfat_direct_IO()
ext2_direct_IO()
fat_direct_IO()
hfs_direct_IO()
hfs_plus_direct_IO()

-ck


2022-02-26 01:33:01

by John Hubbard

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN

On 2/25/22 04:05, Jan Kara wrote:
...
>> After quite some time exploring and consulting with people as well, it
>> is clear that this cannot be done in just one patchset. That's because,
>> not only is this large and time-consuming (for example, Chaitanya
>> Kulkarni's first reaction, after looking into the details, was, "convert
>> the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."),
>> but it is also spread across many filesystems.
>
> With having modified fs/direct-io.c and fs/iomap/direct-io.c which
> filesystems do you know are missing conversion? Or is it that you just want
> to make sure with audit everything is fine? The only fs I could find
> unconverted by your changes is ceph. Am I missing something?
>
> Honza

There are a few more filesystems that call iov_iter_get_pages() or
iov_iter_get_pages_alloc(), plus networking things as well, plus some
others that are hard to categorize, such as vhost. So we have:

* ceph
* rds
* cifs
* p9
* net: __zerocopy_sg_from_iter(), tls_setup_from_iter(),
* crypto: af_alg_make_sg() (maybe N/A)
* vmsplice() (as David Hildenbrand mentioned)
* vhost: vhost_scsi_map_to_sgl()

In addition to that, I was also worried that maybe the
blockdev_direct_IO() or iomap filesystems might be breaking encapsulation
occasionally, by calling put_page() on the direct IO user page buffer.
Perhaps in error paths.

Are you pretty sure that that last concern is not valid? That would be a
welcome bit of news.



thanks,
--
John Hubbard
NVIDIA

2022-02-26 01:55:56

by John Hubbard

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN

On 2/25/22 11:36, John Hubbard wrote:
> On 2/25/22 04:05, Jan Kara wrote:
> ...
>> With having modified fs/direct-io.c and fs/iomap/direct-io.c which
>> filesystems do you know are missing conversion? Or is it that you just want
>> to make sure with audit everything is fine? The only fs I could find
>> unconverted by your changes is ceph. Am I missing something?
>>
>>                                 Honza
>
> There are a few more filesystems that call iov_iter_get_pages() or
> iov_iter_get_pages_alloc(), plus networking things as well, plus some
> others that are hard to categorize, such as vhost. So we have:
>
> * ceph
> * rds
> * cifs
> * p9
> * net: __zerocopy_sg_from_iter(), tls_setup_from_iter(),
> * crypto: af_alg_make_sg() (maybe N/A)
> * vmsplice() (as David Hildenbrand mentioned)
> * vhost: vhost_scsi_map_to_sgl()

...although...if each filesystem does *not* require custom attention,
then there is another, maybe better approach, which is: factor out an
iovec-only pair of allocators, and transition each subsystem when it's
ready. So:

dio_iov_iter_get_pages()
dio_iov_iter_get_pages_alloc()

That would allow doing this a bit at a time, and without the horrible
CONFIG parameter that is switched over all at once.

The bio_release_pages() still calls unpin_user_page(), though, so that
means that all Direct IO callers would still have to change over at
once.


thanks,
--
John Hubbard
NVIDIA

2022-02-26 02:06:57

by John Hubbard

[permalink] [raw]
Subject: [RFC PATCH 1/7] mm/gup: introduce pin_user_page()

pin_user_page() is an externally-usable version of try_grab_page(), but
with semantics that match get_page(), so that it can act as a drop-in
replacement for get_page(). Specifically, pin_user_page() has a void
return type.

pin_user_page() elevates a page's refcount is using FOLL_PIN rules. This
means that the caller must release the page via unpin_user_page().

Signed-off-by: John Hubbard <[email protected]>
---
include/linux/mm.h | 1 +
mm/gup.c | 34 ++++++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 929488a47181..bb51f5487aef 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1914,6 +1914,7 @@ long pin_user_pages_remote(struct mm_struct *mm,
long get_user_pages(unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas);
+void pin_user_page(struct page *page);
long pin_user_pages(unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas);
diff --git a/mm/gup.c b/mm/gup.c
index 5c3f6ede17eb..44446241c3a9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3034,6 +3034,40 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
}
EXPORT_SYMBOL(pin_user_pages);

+/**
+ * pin_user_page() - apply a FOLL_PIN reference to a page ()
+ *
+ * @page: the page to be pinned.
+ *
+ * Similar to get_user_pages(), in that the page's refcount is elevated using
+ * FOLL_PIN rules.
+ *
+ * IMPORTANT: That means that the caller must release the page via
+ * unpin_user_page().
+ *
+ */
+void pin_user_page(struct page *page)
+{
+ struct folio *folio = page_folio(page);
+
+ WARN_ON_ONCE(folio_ref_count(folio) <= 0);
+
+ /*
+ * Similar to try_grab_page(): be sure to *also*
+ * increment the normal page refcount field at least once,
+ * so that the page really is pinned.
+ */
+ if (folio_test_large(folio)) {
+ folio_ref_add(folio, 1);
+ atomic_add(1, folio_pincount_ptr(folio));
+ } else {
+ folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+ }
+
+ node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
+}
+EXPORT_SYMBOL(pin_user_page);
+
/*
* pin_user_pages_unlocked() is the FOLL_PIN variant of
* get_user_pages_unlocked(). Behavior is the same, except that this one sets
--
2.35.1

2022-02-26 02:25:48

by John Hubbard

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN

On 2/25/22 05:12, David Hildenbrand wrote:
> vmsplice is another candidate that uses iov_iter_get_pages() and should

Yes.

> be converted to FOLL_PIN. For that particular user, we have to also pass
> FOLL_LONGTERM -- vmsplice as it stands can block memory hotunplug / CMA
> / ... for all eternity.
>

Oh, thanks for pointing that out. I was thinking about Direct IO, and
would have overlooked that vmsplice need FOLL_LONGTERM.

I'm still quite vague about which parts of vmsplice are in pipe buffers
(and therefore will *not* get FOLL_PIN pages) and which are in user space
buffers. Just haven't spent enough time with vmsplice yet, hopefully it
will clear up with some study of the code.


thanks,
--
John Hubbard
NVIDIA

2022-02-26 02:39:33

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN

On Fri 25-02-22 16:14:14, Chaitanya Kulkarni wrote:
> On 2/25/22 04:05, Jan Kara wrote:
> > On Fri 25-02-22 00:50:18, John Hubbard wrote:
> >> Hi,
> >>
> >> Summary:
> >>
> >> This puts some prerequisites in place, including a CONFIG parameter,
> >> making it possible to start converting and testing the Direct IO part of
> >> each filesystem, from get_user_pages_fast(), to pin_user_pages_fast().
> >>
> >> It will take "a few" kernel releases to get the whole thing done.
> >>
> >> Details:
> >>
> >> As part of fixing the "get_user_pages() + file-backed memory" problem
> >> [1], and to support various COW-related fixes as well [2], we need to
> >> convert the Direct IO code from get_user_pages_fast(), to
> >> pin_user_pages_fast(). Because pin_user_pages*() calls require a
> >> corresponding call to unpin_user_page(), the conversion is more
> >> elaborate than just substitution.
> >>
> >> Further complicating the conversion, the block/bio layers get their
> >> Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(),
> >> each of which has a large number of callers. All of those callers need
> >> to be audited and changed so that they call unpin_user_page(), rather
> >> than put_page().
> >>
> >> After quite some time exploring and consulting with people as well, it
> >> is clear that this cannot be done in just one patchset. That's because,
> >> not only is this large and time-consuming (for example, Chaitanya
> >> Kulkarni's first reaction, after looking into the details, was, "convert
> >> the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."),
> >> but it is also spread across many filesystems.
> >
> > With having modified fs/direct-io.c and fs/iomap/direct-io.c which
> > filesystems do you know are missing conversion? Or is it that you just want
> > to make sure with audit everything is fine? The only fs I could find
> > unconverted by your changes is ceph. Am I missing something?
>
> if I understand your comment correctly file systems which are listed in
> the list see [1] (all the credit goes to John to have a complete list)
> that are not using iomap but use XXX_XXX_direct_IO() should be fine,
> since in the callchain going from :-
>
> XXX_XXX_direct_io()
> __blkdev_direct_io()
> do_direct_io()
>
> ...
>
> submit_page_selection()
> get/put_page() <---
>
> will take care of itself ?

Yes, John's changes to fs/direct-io.c should take care of these
filesystems using __blkdev_direct_io().

Honza

> [1]
>
> jfs_direct_IO()
> nilfs_direct_IO()
> ntfs_dirct_IO()
> reiserfs_direct_IO()
> udf_direct_IO()
> ocfs2_dirct_IO()
> affs_direct_IO()
> exfat_direct_IO()
> ext2_direct_IO()
> fat_direct_IO()
> hfs_direct_IO()
> hfs_plus_direct_IO()
>
> -ck
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-02-28 14:09:21

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7] mm/gup: introduce pin_user_page()

On 25.02.22 09:50, John Hubbard wrote:
> pin_user_page() is an externally-usable version of try_grab_page(), but
> with semantics that match get_page(), so that it can act as a drop-in
> replacement for get_page(). Specifically, pin_user_page() has a void
> return type.
>
> pin_user_page() elevates a page's refcount is using FOLL_PIN rules. This
> means that the caller must release the page via unpin_user_page().
>
> Signed-off-by: John Hubbard <[email protected]>
> ---
> include/linux/mm.h | 1 +
> mm/gup.c | 34 ++++++++++++++++++++++++++++++++++
> 2 files changed, 35 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 929488a47181..bb51f5487aef 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1914,6 +1914,7 @@ long pin_user_pages_remote(struct mm_struct *mm,
> long get_user_pages(unsigned long start, unsigned long nr_pages,
> unsigned int gup_flags, struct page **pages,
> struct vm_area_struct **vmas);
> +void pin_user_page(struct page *page);
> long pin_user_pages(unsigned long start, unsigned long nr_pages,
> unsigned int gup_flags, struct page **pages,
> struct vm_area_struct **vmas);
> diff --git a/mm/gup.c b/mm/gup.c
> index 5c3f6ede17eb..44446241c3a9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -3034,6 +3034,40 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
> }
> EXPORT_SYMBOL(pin_user_pages);
>
> +/**
> + * pin_user_page() - apply a FOLL_PIN reference to a page ()
> + *
> + * @page: the page to be pinned.
> + *
> + * Similar to get_user_pages(), in that the page's refcount is elevated using
> + * FOLL_PIN rules.
> + *
> + * IMPORTANT: That means that the caller must release the page via
> + * unpin_user_page().
> + *
> + */
> +void pin_user_page(struct page *page)
> +{
> + struct folio *folio = page_folio(page);
> +
> + WARN_ON_ONCE(folio_ref_count(folio) <= 0);
> +
> + /*
> + * Similar to try_grab_page(): be sure to *also*
> + * increment the normal page refcount field at least once,
> + * so that the page really is pinned.
> + */
> + if (folio_test_large(folio)) {
> + folio_ref_add(folio, 1);
> + atomic_add(1, folio_pincount_ptr(folio));
> + } else {
> + folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
> + }
> +
> + node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
> +}
> +EXPORT_SYMBOL(pin_user_page);
> +
> /*
> * pin_user_pages_unlocked() is the FOLL_PIN variant of
> * get_user_pages_unlocked(). Behavior is the same, except that this one sets

I assume that function will only get called on a page that has been
obtained by a previous pin_user_pages_fast(), correct?

--
Thanks,

David / dhildenb

2022-02-28 21:20:05

by John Hubbard

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7] mm/gup: introduce pin_user_page()

On 2/28/22 05:27, David Hildenbrand wrote:
...
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 5c3f6ede17eb..44446241c3a9 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -3034,6 +3034,40 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
>> }
>> EXPORT_SYMBOL(pin_user_pages);
>>
>> +/**
>> + * pin_user_page() - apply a FOLL_PIN reference to a page ()
>> + *
>> + * @page: the page to be pinned.
>> + *
>> + * Similar to get_user_pages(), in that the page's refcount is elevated using
>> + * FOLL_PIN rules.
>> + *
>> + * IMPORTANT: That means that the caller must release the page via
>> + * unpin_user_page().
>> + *
>> + */
>> +void pin_user_page(struct page *page)
>> +{
>> + struct folio *folio = page_folio(page);
>> +
>> + WARN_ON_ONCE(folio_ref_count(folio) <= 0);
>> +
>> + /*
>> + * Similar to try_grab_page(): be sure to *also*
>> + * increment the normal page refcount field at least once,
>> + * so that the page really is pinned.
>> + */
>> + if (folio_test_large(folio)) {
>> + folio_ref_add(folio, 1);
>> + atomic_add(1, folio_pincount_ptr(folio));
>> + } else {
>> + folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
>> + }
>> +
>> + node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
>> +}
>> +EXPORT_SYMBOL(pin_user_page);
>> +
>> /*
>> * pin_user_pages_unlocked() is the FOLL_PIN variant of
>> * get_user_pages_unlocked(). Behavior is the same, except that this one sets
>
> I assume that function will only get called on a page that has been
> obtained by a previous pin_user_pages_fast(), correct?
>

Well, no. This is meant to be used in place of get_page(), for code that
knows that the pages will be released via unpin_user_page(). So there is
no special prerequisite there.


thanks,
--
John Hubbard
NVIDIA

2022-03-01 08:29:32

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7] mm/gup: introduce pin_user_page()

On 28.02.22 22:14, John Hubbard wrote:
> On 2/28/22 05:27, David Hildenbrand wrote:
> ...
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 5c3f6ede17eb..44446241c3a9 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -3034,6 +3034,40 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
>>> }
>>> EXPORT_SYMBOL(pin_user_pages);
>>>
>>> +/**
>>> + * pin_user_page() - apply a FOLL_PIN reference to a page ()
>>> + *
>>> + * @page: the page to be pinned.
>>> + *
>>> + * Similar to get_user_pages(), in that the page's refcount is elevated using
>>> + * FOLL_PIN rules.
>>> + *
>>> + * IMPORTANT: That means that the caller must release the page via
>>> + * unpin_user_page().
>>> + *
>>> + */
>>> +void pin_user_page(struct page *page)
>>> +{
>>> + struct folio *folio = page_folio(page);
>>> +
>>> + WARN_ON_ONCE(folio_ref_count(folio) <= 0);
>>> +
>>> + /*
>>> + * Similar to try_grab_page(): be sure to *also*
>>> + * increment the normal page refcount field at least once,
>>> + * so that the page really is pinned.
>>> + */
>>> + if (folio_test_large(folio)) {
>>> + folio_ref_add(folio, 1);
>>> + atomic_add(1, folio_pincount_ptr(folio));
>>> + } else {
>>> + folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
>>> + }
>>> +
>>> + node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
>>> +}
>>> +EXPORT_SYMBOL(pin_user_page);
>>> +
>>> /*
>>> * pin_user_pages_unlocked() is the FOLL_PIN variant of
>>> * get_user_pages_unlocked(). Behavior is the same, except that this one sets
>>
>> I assume that function will only get called on a page that has been
>> obtained by a previous pin_user_pages_fast(), correct?
>>
>
> Well, no. This is meant to be used in place of get_page(), for code that
> knows that the pages will be released via unpin_user_page(). So there is
> no special prerequisite there.

That might be problematic and possibly the wrong approach, depending on
*what* we're actually pinning and what we're intending to do with that.

My assumption would have been that this interface is to duplicate a pin
on a page, which would be perfectly fine, because the page actually saw
a FOLL_PIN previously.

We're taking a pin on a page that we haven't obtained via FOLL_PIN if I
understand correctly. Which raises the questions, how do we end up with
the pages here, and what are we doing to do with them (use them like we
obtained them via FOLL_PIN?)?


If it's converting FOLL_GET -> FOLL_PIN manually, then we're bypassing
FOLL_PIN special handling in GUP code:

page = get_user_pages(FOLL_GET)
pin_user_page(page)
put_page(page)


For anonymous pages, we'll bail out for example once we have

https://lkml.kernel.org/r/[email protected]

Because the conditions for pinned anonymous pages might no longer hold.

If we won't call pin_user_page() on anonymous pages, it would be fine.
But then, I still wonder how we come up the "struct page" here.

--
Thanks,

David / dhildenb

2022-03-01 09:35:20

by John Hubbard

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7] mm/gup: introduce pin_user_page()

On 3/1/22 00:11, David Hildenbrand wrote:
>> ...
>>>> +EXPORT_SYMBOL(pin_user_page);
>>>> +
>>>> /*
>>>> * pin_user_pages_unlocked() is the FOLL_PIN variant of
>>>> * get_user_pages_unlocked(). Behavior is the same, except that this one sets
>>>
>>> I assume that function will only get called on a page that has been
>>> obtained by a previous pin_user_pages_fast(), correct?
>>>
>>
>> Well, no. This is meant to be used in place of get_page(), for code that
>> knows that the pages will be released via unpin_user_page(). So there is
>> no special prerequisite there.
>
> That might be problematic and possibly the wrong approach, depending on
> *what* we're actually pinning and what we're intending to do with that.
>
> My assumption would have been that this interface is to duplicate a pin

I see that I need to put more documentation here, so people don't have
to assume things... :)

> on a page, which would be perfectly fine, because the page actually saw
> a FOLL_PIN previously.
>
> We're taking a pin on a page that we haven't obtained via FOLL_PIN if I
> understand correctly. Which raises the questions, how do we end up with
> the pages here, and what are we doing to do with them (use them like we
> obtained them via FOLL_PIN?)?
>
>
> If it's converting FOLL_GET -> FOLL_PIN manually, then we're bypassing
> FOLL_PIN special handling in GUP code:
>
> page = get_user_pages(FOLL_GET)
> pin_user_page(page)
> put_page(page)

No, that's not where this is going at all. The idea, which I now see
needs better documentation, is to handle file-backed pages. Only.

We're not converting from one type to another, nor are we doubling up.
We're just keeping the pin type consistent so that the vast block-
processing machinery can take pages in and handle them, then release
them at the end with bio_release_pages(), which will call
unpin_user_pages().

>
>
> For anonymous pages, we'll bail out for example once we have
>
> https://lkml.kernel.org/r/[email protected]
>
> Because the conditions for pinned anonymous pages might no longer hold.
>
> If we won't call pin_user_page() on anonymous pages, it would be fine.

We won't, and in fact, I should add WARN_ON_ONCE(PageAnon(page)) to
this function.

> But then, I still wonder how we come up the "struct page" here.
>

From the file system. For example, the NFS-direct and fuse conversions
in the last patches show how that works.

Thanks for this feedback, this is very helpful.

thanks,
--
John Hubbard
NVIDIA

2022-03-01 13:48:38

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7] mm/gup: introduce pin_user_page()


>>>> That might be problematic and possibly the wrong approach, depending on
>> *what* we're actually pinning and what we're intending to do with that.
>>
>> My assumption would have been that this interface is to duplicate a pin
>
> I see that I need to put more documentation here, so people don't have
> to assume things... :)
>

Yes, please :)

>> on a page, which would be perfectly fine, because the page actually saw
>> a FOLL_PIN previously.
>>
>> We're taking a pin on a page that we haven't obtained via FOLL_PIN if I
>> understand correctly. Which raises the questions, how do we end up with
>> the pages here, and what are we doing to do with them (use them like we
>> obtained them via FOLL_PIN?)?
>>
>>
>> If it's converting FOLL_GET -> FOLL_PIN manually, then we're bypassing
>> FOLL_PIN special handling in GUP code:
>>
>> page = get_user_pages(FOLL_GET)
>> pin_user_page(page)
>> put_page(page)
>
> No, that's not where this is going at all. The idea, which I now see
> needs better documentation, is to handle file-backed pages. Only.
>
> We're not converting from one type to another, nor are we doubling up.
> We're just keeping the pin type consistent so that the vast block-
> processing machinery can take pages in and handle them, then release
> them at the end with bio_release_pages(), which will call
> unpin_user_pages().
>

Ah, okay, that makes sense. Glad to hear that we're intending to use
this with !anon pages only.

>>
>>
>> For anonymous pages, we'll bail out for example once we have
>>
>> https://lkml.kernel.org/r/[email protected]
>>
>> Because the conditions for pinned anonymous pages might no longer hold.
>>
>> If we won't call pin_user_page() on anonymous pages, it would be fine.
>
> We won't, and in fact, I should add WARN_ON_ONCE(PageAnon(page)) to
> this function.

Exactly what I would have suggested,

>
>> But then, I still wonder how we come up the "struct page" here.
>>
>
> From the file system. For example, the NFS-direct and fuse conversions
> in the last patches show how that works.
>
> Thanks for this feedback, this is very helpful.

Thanks for clarifying, John!

--
Thanks,

David / dhildenb