2022-08-27 08:38:08

by John Hubbard

[permalink] [raw]
Subject: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages

Convert the NFS Direct IO layer to use pin_user_pages_fast() and
unpin_user_page(), instead of get_user_pages_fast() and put_page().

Signed-off-by: John Hubbard <[email protected]>
---
fs/nfs/direct.c | 19 ++++++-------------
1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 1707f46b1335..f6e47329e092 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -142,13 +142,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
return 0;
}

-static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
-{
- unsigned int i;
- for (i = 0; i < npages; i++)
- put_page(pages[i]);
-}
-
void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
struct nfs_direct_req *dreq)
{
@@ -332,11 +325,11 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
size_t pgbase;
unsigned npages, i;

- result = iov_iter_get_pages_alloc2(iter, &pagevec,
+ result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
rsize, &pgbase);
if (result < 0)
break;
-
+
bytes = result;
npages = (result + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
for (i = 0; i < npages; i++) {
@@ -362,7 +355,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
pos += req_len;
dreq->bytes_left -= req_len;
}
- nfs_direct_release_pages(pagevec, npages);
+ dio_w_unpin_user_pages(pagevec, npages);
kvfree(pagevec);
if (result < 0)
break;
@@ -791,8 +784,8 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
size_t pgbase;
unsigned npages, i;

- result = iov_iter_get_pages_alloc2(iter, &pagevec,
- wsize, &pgbase);
+ result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
+ wsize, &pgbase);
if (result < 0)
break;

@@ -829,7 +822,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
pos += req_len;
dreq->bytes_left -= req_len;
}
- nfs_direct_release_pages(pagevec, npages);
+ dio_w_unpin_user_pages(pagevec, npages);
kvfree(pagevec);
if (result < 0)
break;
--
2.37.2


2022-08-27 22:52:32

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages

On Sat, Aug 27, 2022 at 01:36:06AM -0700, John Hubbard wrote:
> Convert the NFS Direct IO layer to use pin_user_pages_fast() and
> unpin_user_page(), instead of get_user_pages_fast() and put_page().

Again, this stuff can be hit with ITER_BVEC iterators

> - result = iov_iter_get_pages_alloc2(iter, &pagevec,
> + result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
> rsize, &pgbase);

and this will break on those.

2022-08-28 00:26:12

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages

On 8/27/22 15:48, Al Viro wrote:
> On Sat, Aug 27, 2022 at 01:36:06AM -0700, John Hubbard wrote:
>> Convert the NFS Direct IO layer to use pin_user_pages_fast() and
>> unpin_user_page(), instead of get_user_pages_fast() and put_page().
>
> Again, this stuff can be hit with ITER_BVEC iterators
>
>> - result = iov_iter_get_pages_alloc2(iter, &pagevec,
>> + result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
>> rsize, &pgbase);
>
> and this will break on those.

If anyone has an example handy, of a user space program that leads
to this situation (O_DIRECT with ITER_BVEC), it would really help
me reach enlightenment a lot quicker in this area. :)

thanks,

--
John Hubbard
NVIDIA

2022-08-28 00:43:13

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages

On Sat, Aug 27, 2022 at 04:55:18PM -0700, John Hubbard wrote:
> On 8/27/22 15:48, Al Viro wrote:
> > On Sat, Aug 27, 2022 at 01:36:06AM -0700, John Hubbard wrote:
> >> Convert the NFS Direct IO layer to use pin_user_pages_fast() and
> >> unpin_user_page(), instead of get_user_pages_fast() and put_page().
> >
> > Again, this stuff can be hit with ITER_BVEC iterators
> >
> >> - result = iov_iter_get_pages_alloc2(iter, &pagevec,
> >> + result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
> >> rsize, &pgbase);
> >
> > and this will break on those.
>
> If anyone has an example handy, of a user space program that leads
> to this situation (O_DIRECT with ITER_BVEC), it would really help
> me reach enlightenment a lot quicker in this area. :)

Er... splice(2) to O_DIRECT-opened file on e.g. ext4? Or
sendfile(2) to the same, for that matter...

2022-08-28 00:56:37

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages

On 8/27/22 17:39, Al Viro wrote:
> On Sun, Aug 28, 2022 at 01:38:57AM +0100, Al Viro wrote:
>> On Sat, Aug 27, 2022 at 04:55:18PM -0700, John Hubbard wrote:
>>> On 8/27/22 15:48, Al Viro wrote:
>>>> On Sat, Aug 27, 2022 at 01:36:06AM -0700, John Hubbard wrote:
>>>>> Convert the NFS Direct IO layer to use pin_user_pages_fast() and
>>>>> unpin_user_page(), instead of get_user_pages_fast() and put_page().
>>>>
>>>> Again, this stuff can be hit with ITER_BVEC iterators
>>>>
>>>>> - result = iov_iter_get_pages_alloc2(iter, &pagevec,
>>>>> + result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
>>>>> rsize, &pgbase);
>>>>
>>>> and this will break on those.
>>>
>>> If anyone has an example handy, of a user space program that leads
>>> to this situation (O_DIRECT with ITER_BVEC), it would really help
>>> me reach enlightenment a lot quicker in this area. :)
>>
>> Er... splice(2) to O_DIRECT-opened file on e.g. ext4? Or
>> sendfile(2) to the same, for that matter...
>
> s/ext4/nfs/ to hit this particular codepath, obviously.

aha, thanks. I do remember that you alerted me earlier to splice(2)
problems, and I thought I'd neatly sidestepped those this time, by
staying with user_backed_iter(i) cases.

But I see that it's going to take something more, after all.



thanks,

--
John Hubbard
NVIDIA

2022-08-28 00:56:45

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages

On Sun, Aug 28, 2022 at 01:38:57AM +0100, Al Viro wrote:
> On Sat, Aug 27, 2022 at 04:55:18PM -0700, John Hubbard wrote:
> > On 8/27/22 15:48, Al Viro wrote:
> > > On Sat, Aug 27, 2022 at 01:36:06AM -0700, John Hubbard wrote:
> > >> Convert the NFS Direct IO layer to use pin_user_pages_fast() and
> > >> unpin_user_page(), instead of get_user_pages_fast() and put_page().
> > >
> > > Again, this stuff can be hit with ITER_BVEC iterators
> > >
> > >> - result = iov_iter_get_pages_alloc2(iter, &pagevec,
> > >> + result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
> > >> rsize, &pgbase);
> > >
> > > and this will break on those.
> >
> > If anyone has an example handy, of a user space program that leads
> > to this situation (O_DIRECT with ITER_BVEC), it would really help
> > me reach enlightenment a lot quicker in this area. :)
>
> Er... splice(2) to O_DIRECT-opened file on e.g. ext4? Or
> sendfile(2) to the same, for that matter...

s/ext4/nfs/ to hit this particular codepath, obviously.

2022-08-29 05:03:02

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages

On 8/27/22 17:39, Al Viro wrote:
> On Sun, Aug 28, 2022 at 01:38:57AM +0100, Al Viro wrote:
>> On Sat, Aug 27, 2022 at 04:55:18PM -0700, John Hubbard wrote:
>>> On 8/27/22 15:48, Al Viro wrote:
>>>> On Sat, Aug 27, 2022 at 01:36:06AM -0700, John Hubbard wrote:
>>>>> Convert the NFS Direct IO layer to use pin_user_pages_fast() and
>>>>> unpin_user_page(), instead of get_user_pages_fast() and put_page().
>>>>
>>>> Again, this stuff can be hit with ITER_BVEC iterators
>>>>
>>>>> - result = iov_iter_get_pages_alloc2(iter, &pagevec,
>>>>> + result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
>>>>> rsize, &pgbase);
>>>>
>>>> and this will break on those.
>>>
>>> If anyone has an example handy, of a user space program that leads
>>> to this situation (O_DIRECT with ITER_BVEC), it would really help
>>> me reach enlightenment a lot quicker in this area. :)
>>
>> Er... splice(2) to O_DIRECT-opened file on e.g. ext4? Or
>> sendfile(2) to the same, for that matter...
>
> s/ext4/nfs/ to hit this particular codepath, obviously.

OK, I have a solution to this that's pretty easy:

1) Get rid of the user_backed_iter(i) check in
dio_w_iov_iter_pin_pages() and dio_w_iov_iter_pin_pages_alloc(), and

2) At the call sites, match up the unpin calls appropriately.

...and apply a similar fix for the fuse conversion patch.

However, the core block/bio conversion in patch 4 still does depend upon
a key assumption, which I got from a 2019 email discussion with
Christoph Hellwig and others here [1], which says:

"All pages released by bio_release_pages should come from
get_get_user_pages...".

I really hope that still holds true. Otherwise this whole thing is in
trouble.



[1] https://lore.kernel.org/kvm/[email protected]/

thanks,

--
John Hubbard
NVIDIA

2022-08-29 16:16:34

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages

On Sun 28-08-22 21:59:49, John Hubbard wrote:
> On 8/27/22 17:39, Al Viro wrote:
> > On Sun, Aug 28, 2022 at 01:38:57AM +0100, Al Viro wrote:
> >> On Sat, Aug 27, 2022 at 04:55:18PM -0700, John Hubbard wrote:
> >>> On 8/27/22 15:48, Al Viro wrote:
> >>>> On Sat, Aug 27, 2022 at 01:36:06AM -0700, John Hubbard wrote:
> >>>>> Convert the NFS Direct IO layer to use pin_user_pages_fast() and
> >>>>> unpin_user_page(), instead of get_user_pages_fast() and put_page().
> >>>>
> >>>> Again, this stuff can be hit with ITER_BVEC iterators
> >>>>
> >>>>> - result = iov_iter_get_pages_alloc2(iter, &pagevec,
> >>>>> + result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
> >>>>> rsize, &pgbase);
> >>>>
> >>>> and this will break on those.
> >>>
> >>> If anyone has an example handy, of a user space program that leads
> >>> to this situation (O_DIRECT with ITER_BVEC), it would really help
> >>> me reach enlightenment a lot quicker in this area. :)
> >>
> >> Er... splice(2) to O_DIRECT-opened file on e.g. ext4? Or
> >> sendfile(2) to the same, for that matter...
> >
> > s/ext4/nfs/ to hit this particular codepath, obviously.
>
> OK, I have a solution to this that's pretty easy:
>
> 1) Get rid of the user_backed_iter(i) check in
> dio_w_iov_iter_pin_pages() and dio_w_iov_iter_pin_pages_alloc(), and
>
> 2) At the call sites, match up the unpin calls appropriately.
>
> ...and apply a similar fix for the fuse conversion patch.
>
> However, the core block/bio conversion in patch 4 still does depend upon
> a key assumption, which I got from a 2019 email discussion with
> Christoph Hellwig and others here [1], which says:
>
> "All pages released by bio_release_pages should come from
> get_get_user_pages...".
>
> I really hope that still holds true. Otherwise this whole thing is in
> trouble.
>
> [1] https://lore.kernel.org/kvm/[email protected]/

Well as far as I've checked that discussion, Christoph was aware of pipe
pages etc. (i.e., bvecs) entering direct IO code. But he had some patches
[2] which enabled GUP to work for bvecs as well (using the kernel mapping
under the hood AFAICT from a quick glance at the series). I suppose we
could also handle this in __iov_iter_get_pages_alloc() by grabbing pin
reference instead of plain get_page() for the case of bvec iter. That way
we should have only pinned pages in bio_release_pages() even for the bvec
case.

[2] http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/gup-bvec

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-08-31 09:55:16

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages

On Mon 29-08-22 12:59:26, John Hubbard wrote:
> On 8/29/22 09:08, Jan Kara wrote:
> >> However, the core block/bio conversion in patch 4 still does depend upon
> >> a key assumption, which I got from a 2019 email discussion with
> >> Christoph Hellwig and others here [1], which says:
> >>
> >> "All pages released by bio_release_pages should come from
> >> get_get_user_pages...".
> >>
> >> I really hope that still holds true. Otherwise this whole thing is in
> >> trouble.
> >>
> >> [1] https://lore.kernel.org/kvm/[email protected]/
> >
> > Well as far as I've checked that discussion, Christoph was aware of pipe
> > pages etc. (i.e., bvecs) entering direct IO code. But he had some patches
> > [2] which enabled GUP to work for bvecs as well (using the kernel mapping
> > under the hood AFAICT from a quick glance at the series). I suppose we
> > could also handle this in __iov_iter_get_pages_alloc() by grabbing pin
> > reference instead of plain get_page() for the case of bvec iter. That way
> > we should have only pinned pages in bio_release_pages() even for the bvec
> > case.
>
> OK, thanks, that looks viable. So, that approach assumes that the
> remaining two cases in __iov_iter_get_pages_alloc() will never end up
> being released via bio_release_pages():
>
> iov_iter_is_pipe(i)
> iov_iter_is_xarray(i)
>
> I'm actually a little worried about ITER_XARRAY, which is a recent addition.
> It seems to be used in ways that are similar to ITER_BVEC, and cephfs is
> using it. It's probably OK for now, for this series, which doesn't yet
> convert cephfs.

So after looking into that a bit more, I think a clean approach would be to
provide iov_iter_pin_pages2() and iov_iter_pages_alloc2(), under the hood
in __iov_iter_get_pages_alloc() make sure we use pin_user_page() instead of
get_page() in all the cases (using this in pipe_get_pages() and
iter_xarray_get_pages() is easy) and then make all bio handling use the
pinning variants for iters. I think at least iov_iter_is_pipe() case needs
to be handled as well because as I wrote above, pipe pages can enter direct
IO code e.g. for splice(2).

Also I think that all iov_iter_get_pages2() (or the _alloc2 variant) users
actually do want the "pin page" semantics in the end (they are accessing
page contents) so eventually we should convert them all to
iov_iter_pin_pages2() and remove iov_iter_get_pages2() altogether. But this
will take some more conversion work with networking etc. so I'd start with
converting bios only.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-08-31 18:09:01

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages

On 8/31/22 02:43, Jan Kara wrote:
>> OK, thanks, that looks viable. So, that approach assumes that the
>> remaining two cases in __iov_iter_get_pages_alloc() will never end up
>> being released via bio_release_pages():
>>
>> iov_iter_is_pipe(i)
>> iov_iter_is_xarray(i)
>>
>> I'm actually a little worried about ITER_XARRAY, which is a recent addition.
>> It seems to be used in ways that are similar to ITER_BVEC, and cephfs is
>> using it. It's probably OK for now, for this series, which doesn't yet
>> convert cephfs.
>
> So after looking into that a bit more, I think a clean approach would be to
> provide iov_iter_pin_pages2() and iov_iter_pages_alloc2(), under the hood
> in __iov_iter_get_pages_alloc() make sure we use pin_user_page() instead of
> get_page() in all the cases (using this in pipe_get_pages() and
> iter_xarray_get_pages() is easy) and then make all bio handling use the
> pinning variants for iters. I think at least iov_iter_is_pipe() case needs
> to be handled as well because as I wrote above, pipe pages can enter direct
> IO code e.g. for splice(2).
>

OK, yes.

> Also I think that all iov_iter_get_pages2() (or the _alloc2 variant) users
> actually do want the "pin page" semantics in the end (they are accessing
> page contents) so eventually we should convert them all to
> iov_iter_pin_pages2() and remove iov_iter_get_pages2() altogether. But this
> will take some more conversion work with networking etc. so I'd start with
> converting bios only.
>
> Honza


I'll give this approach a spin, thanks very much for the guidance!


thanks,

--
John Hubbard
NVIDIA

2022-09-01 00:45:55

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages

On Wed, Aug 31, 2022 at 11:43:49AM +0200, Jan Kara wrote:

> So after looking into that a bit more, I think a clean approach would be to
> provide iov_iter_pin_pages2() and iov_iter_pages_alloc2(), under the hood
> in __iov_iter_get_pages_alloc() make sure we use pin_user_page() instead of
> get_page() in all the cases (using this in pipe_get_pages() and
> iter_xarray_get_pages() is easy) and then make all bio handling use the
> pinning variants for iters. I think at least iov_iter_is_pipe() case needs
> to be handled as well because as I wrote above, pipe pages can enter direct
> IO code e.g. for splice(2).
>
> Also I think that all iov_iter_get_pages2() (or the _alloc2 variant) users
> actually do want the "pin page" semantics in the end (they are accessing
> page contents) so eventually we should convert them all to
> iov_iter_pin_pages2() and remove iov_iter_get_pages2() altogether. But this
> will take some more conversion work with networking etc. so I'd start with
> converting bios only.

Not sure, TBH...

FWIW, quite a few of the callers of iov_iter_get_pages2() do *NOT* need to
grab any references for BVEC/XARRAY/PIPE cases. What's more, it would be
bloody useful to have a variant that doesn't grab references for
!iter->user_backed case - that could be usable for KVEC as well, simplifying
several callers.

Requirements:
* recepients of those struct page * should have a way to make
dropping the page refs conditional (obviously); bio machinery can be told
to do so.
* callers should *NOT* do something like
"set an ITER_BVEC iter, with page references grabbed and stashed in
bio_vec array, call async read_iter() and drop the references in array - the
refs we grab in dio will serve"
Note that for sync IO that pattern is fine whether we grab/drop anything
inside read_iter(); for async we could take depopulating the bio_vec
array to the IO completion or downstream of that.
* the code dealing with the references returned by iov_iter_..._pages
should *NOT* play silly buggers with refcounts - something like "I'll grab
a reference, start DMA and report success; page will stay around until I
get around to dropping the ref and callers don't need to wait for that" deep
in the bowels of infinibad stack (or something equally tasteful) is seriously
asking for trouble.

Future plans from the last cycle included iov_iter_find_pages{,_alloc}() that
would *not* grab references on anything other than IOVEC and UBUF (would advance
the iterator, same as iov_iter_get_pages2(), though). Then iov_iter_get_...()
would become a wrapper for that. After that - look into switching the users
of ..._get_... to ..._find_.... Hadn't done much in that direction yet,
though - need to redo the analysis first.

That primitive might very well do FOLL_PIN instead of FOLL_GET for IOVEC and
UBUF...

2022-09-01 09:37:24

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages

On Thu 01-09-22 01:38:21, Al Viro wrote:
> On Wed, Aug 31, 2022 at 11:43:49AM +0200, Jan Kara wrote:
>
> > So after looking into that a bit more, I think a clean approach would be to
> > provide iov_iter_pin_pages2() and iov_iter_pages_alloc2(), under the hood
> > in __iov_iter_get_pages_alloc() make sure we use pin_user_page() instead of
> > get_page() in all the cases (using this in pipe_get_pages() and
> > iter_xarray_get_pages() is easy) and then make all bio handling use the
> > pinning variants for iters. I think at least iov_iter_is_pipe() case needs
> > to be handled as well because as I wrote above, pipe pages can enter direct
> > IO code e.g. for splice(2).
> >
> > Also I think that all iov_iter_get_pages2() (or the _alloc2 variant) users
> > actually do want the "pin page" semantics in the end (they are accessing
> > page contents) so eventually we should convert them all to
> > iov_iter_pin_pages2() and remove iov_iter_get_pages2() altogether. But this
> > will take some more conversion work with networking etc. so I'd start with
> > converting bios only.
>
> Not sure, TBH...
>
> FWIW, quite a few of the callers of iov_iter_get_pages2() do *NOT* need to
> grab any references for BVEC/XARRAY/PIPE cases. What's more, it would be
> bloody useful to have a variant that doesn't grab references for
> !iter->user_backed case - that could be usable for KVEC as well, simplifying
> several callers.
>
> Requirements:
> * recepients of those struct page * should have a way to make
> dropping the page refs conditional (obviously); bio machinery can be told
> to do so.
> * callers should *NOT* do something like
> "set an ITER_BVEC iter, with page references grabbed and stashed in
> bio_vec array, call async read_iter() and drop the references in array - the
> refs we grab in dio will serve"
> Note that for sync IO that pattern is fine whether we grab/drop anything
> inside read_iter(); for async we could take depopulating the bio_vec
> array to the IO completion or downstream of that.
> * the code dealing with the references returned by iov_iter_..._pages
> should *NOT* play silly buggers with refcounts - something like "I'll grab
> a reference, start DMA and report success; page will stay around until I
> get around to dropping the ref and callers don't need to wait for that" deep
> in the bowels of infinibad stack (or something equally tasteful) is seriously
> asking for trouble.

I agree we could get away without grabbing references in some cases. But it
is a performance vs robustness tradeoff I'd say. E.g. with XARRAY case I
can see people feeding pages from struct address_space and it is unclear
what else than page reference would protect them from being freed by
reclaim. Furthermore if e.g. writeback can happen for such struct
address_space (or other filesystem operations requiring stable data), we
really need a full page pin and not just page reference to signal that the
page may be under DMA and may be changing under your hands. So I'm not
against having some special cases that avoid grabbing page reference / page
pin but I think it is justified only in performance sensitive cases and
when we can make sure filesystem backed page (because of above mentioned
data stability issues) or anon page (because of cow handling) cannot
possibly enter this path.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR