Hi,
This converts the Direct IO block/bio layer over to use FOLL_PIN pages
(those acquired via pin_user_pages*()). This effectively converts
several file systems (ext4, for example) that use the common Direct IO
routines. See "Remaining work" below for a bit more detail.
Quite a few approaches have been considered over the years. This one is
inspired by Christoph Hellwig's July, 2019 observation that there are
only 5 ITER_ types, and we can simplify handling of them for Direct IO
[1]. After working through how bio submission and completion works, I
became convinced that this is the simplest and cleanest approach to
conversion.
Not content to let well enough alone, I then continued on to the
unthinkable: adding a new flag to struct bio, whose "short int" flags
field was full, thus triggering an expansion of the field from 16 to
32 bits. This allows for a nice assertion in bio_release_pages(), that
the bio page release mechanism matches the page acquisition mechanism.
This is especially welcome for a change that affects a lot of callers
and could really make a mess if there is a bug somewhere.
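To illustrate what I mean, with the new flag in place the release path can do
something along these lines (a sketch only, not the actual patch; the real
assertion in the BIO_FOLL_PIN patch may take a different form):

static void bio_release_pages(struct bio *bio, bool mark_dirty)
{
	struct bvec_iter_all iter_all;
	struct bio_vec *bvec;

	if (bio_flagged(bio, BIO_NO_PAGE_REF))
		return;

	bio_for_each_segment_all(bvec, bio, iter_all) {
		if (mark_dirty && !PageCompound(bvec->bv_page))
			set_page_dirty_lock(bvec->bv_page);

		/*
		 * A bio never mixes FOLL_PIN and non-FOLL_PIN pages, so one
		 * flag check per bio tells us which release call matches the
		 * way the pages were acquired.
		 */
		if (bio_flagged(bio, BIO_FOLL_PIN))
			unpin_user_page(bvec->bv_page);
		else
			put_page(bvec->bv_page);
	}
}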
I'm unable to spot any performance implications, either theoretically or
via (rather light) performance testing, from enlarging bio.bi_flags, but
I suspect there may still be valid reasons for having such a
tiny bio.bi_flags field. I just have no idea what they are. (Hardware
that knows the size of a bio? No, because there would be obvious
build-time assertions, and comments about such a constraint.) Anyway, I
can drop that patch if it seems like too much cost for too little
benefit.
And finally, as long as we're all staring at the iov_iter code, I'm
including a nice easy ceph patch that removes one more caller of
iov_iter_get_pages().
Design notes
============
This whole approach depends on certain concepts:
1) Each struct bio instance must not mix different types of pages:
FOLL_PIN and non-FOLL_PIN pages. (By FOLL_PIN I'm referring to pages
that were acquired and pinned via pin_user_page*() routines.)
Fortunately, this is already an enforced constraint for bio's, as
evidenced by the existence and use of BIO_NO_PAGE_REF.
2) Christoph Hellwig's July, 2019 observation that there are
only 5 ITER_ types, and we can simplify handling of them for Direct IO
[1]. Accordingly, this series implements the following pseudocode:
Direct IO behavior:

    ITER_IOVEC:
        pin_user_pages_fast();
        break;

    ITER_KVEC:    // already elevated page refcount, leave alone
    ITER_BVEC:    // already elevated page refcount, leave alone
    ITER_PIPE:    // just, no :)
    ITER_DISCARD: // discard
        return -EFAULT or -EINVAL;
...which works for callers that have already sorted out which case they
are in, such as Direct IO in the block/bio layers.

Now, this does leave ITER_KVEC and ITER_BVEC unconverted, but on the
other hand, it's not clear that these are actually affected in the real
world by the get_user_pages()+filesystem interaction problems of [2].
If it turns out to matter, then those can be handled too, but it's just
more refactoring and surgery to do so.
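To make the dispatch above concrete, here is roughly what such a routine
looks like (a sketch only, simplified to a single iovec segment; the actual
iov_iter_pin_user_pages*() routines in patch 1 may differ in signature and
detail):

ssize_t iov_iter_pin_user_pages(struct iov_iter *i, struct page **pages,
				size_t maxsize, unsigned int maxpages,
				size_t *start)
{
	switch (iov_iter_type(i)) {
	case ITER_IOVEC: {
		/* Simplified: handle only the first iovec segment. */
		unsigned long addr = (unsigned long)i->iov->iov_base + i->iov_offset;
		size_t len = min(maxsize, i->iov->iov_len - i->iov_offset);
		int n = DIV_ROUND_UP((addr & ~PAGE_MASK) + len, PAGE_SIZE);
		int res;

		if (n > maxpages)
			n = maxpages;
		*start = addr & ~PAGE_MASK;
		/* The FOLL_PIN acquisition: paired with unpin_user_page() later. */
		res = pin_user_pages_fast(addr & PAGE_MASK, n,
					  iov_iter_rw(i) != WRITE ? FOLL_WRITE : 0,
					  pages);
		if (res < 0)
			return res;
		return min_t(size_t, len, res * PAGE_SIZE - *start);
	}
	case ITER_KVEC:
	case ITER_BVEC:
		/* Page refcounts are already elevated: keep the existing behavior. */
		return iov_iter_get_pages(i, pages, maxsize, maxpages, start);
	case ITER_PIPE:
	case ITER_DISCARD:
	default:
		/* Direct IO callers are expected to have ruled these out. */
		WARN_ON_ONCE(1);
		return -EINVAL;
	}
}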
Testing
=======
Performance: no obvious regressions from running fio (direct=1: Direct
IO) on both SSD and NVMe drives.
Functionality: selected non-destructive bare metal xfstests on xfs,
ext4, btrfs, orangefs filesystems, plus LTP tests.
Note that I have only a single x86 64-bit test machine, though.
Remaining work
==============
Non-converted call sites for iov_iter_get_pages*() at the
moment include: net, crypto, cifs, ceph, vhost, fuse, nfs/direct,
vhost/scsi.
About-to-be-converted sites (in a subsequent patch) are: Direct IO for
filesystems that use the generic read/write functions.
[1] https://lore.kernel.org/kvm/[email protected]/
[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/
John Hubbard (5):
iov_iter: introduce iov_iter_pin_user_pages*() routines
mm/gup: introduce pin_user_page()
bio: convert get_user_pages_fast() --> pin_user_pages_fast()
bio: introduce BIO_FOLL_PIN flag
fs/ceph: use pipe_get_pages_alloc() for pipe
block/bio.c | 29 +++++++------
block/blk-map.c | 7 +--
fs/ceph/file.c | 3 +-
fs/direct-io.c | 30 ++++++-------
fs/iomap/direct-io.c | 2 +-
include/linux/blk_types.h | 5 ++-
include/linux/mm.h | 2 +
include/linux/uio.h | 9 +++-
lib/iov_iter.c | 91 +++++++++++++++++++++++++++++++++++++--
mm/gup.c | 30 +++++++++++++
10 files changed, 169 insertions(+), 39 deletions(-)
--
2.28.0
On Fri, Aug 21, 2020 at 09:20:54PM -0700, John Hubbard wrote:
> Direct IO behavior:
>
> ITER_IOVEC:
> pin_user_pages_fast();
> break;
>
> ITER_KVEC: // already elevated page refcount, leave alone
> ITER_BVEC: // already elevated page refcount, leave alone
> ITER_PIPE: // just, no :)
Why? What's wrong with splice to O_DIRECT file?
On 8/24/20 6:54 PM, Al Viro wrote:
> On Fri, Aug 21, 2020 at 09:20:54PM -0700, John Hubbard wrote:
>
>> Direct IO behavior:
>>
>> ITER_IOVEC:
>> pin_user_pages_fast();
>> break;
>>
>> ITER_KVEC: // already elevated page refcount, leave alone
>> ITER_BVEC: // already elevated page refcount, leave alone
>> ITER_PIPE: // just, no :)
>
> Why? What's wrong with splice to O_DIRECT file?
>
Oh! I'll take a look. Is this the fs/splice.c stuff? I ruled this out early
mainly based on Christoph's comment in [1] ("ITER_PIPE is rejected in the
direct I/O path"), but if it's supportable then I'll hook it up.
(As you can see, I'm still very much coming up to speed on the various things
that invoke iov_iter_get_pages*().)
[1] https://lore.kernel.org/kvm/[email protected]/
thanks,
--
John Hubbard
NVIDIA
On Tue, Aug 25, 2020 at 02:54:28AM +0100, Al Viro wrote:
> On Fri, Aug 21, 2020 at 09:20:54PM -0700, John Hubbard wrote:
>
> > Direct IO behavior:
> >
> > ITER_IOVEC:
> > pin_user_pages_fast();
> > break;
> >
> > ITER_KVEC: // already elevated page refcount, leave alone
> > ITER_BVEC: // already elevated page refcount, leave alone
> > ITER_PIPE: // just, no :)
>
> Why? What's wrong with splice to O_DIRECT file?
Sorry - s/to/from/, obviously.
To spell it out: consider generic_file_splice_read() behaviour when
the source had been opened with O_DIRECT; you will get a call of
->read_iter() into ITER_PIPE destination. And it bloody well
will hit iov_iter_get_pages() on common filesystems, to pick the
pages we want to read into.
So... what's wrong with having that "pin" primitive making sure
the pages are there and referenced by the pipe?
On 8/24/20 7:07 PM, Al Viro wrote:
> On Tue, Aug 25, 2020 at 02:54:28AM +0100, Al Viro wrote:
>> On Fri, Aug 21, 2020 at 09:20:54PM -0700, John Hubbard wrote:
>>
>>> Direct IO behavior:
>>>
>>> ITER_IOVEC:
>>> pin_user_pages_fast();
>>> break;
>>>
>>> ITER_KVEC: // already elevated page refcount, leave alone
>>> ITER_BVEC: // already elevated page refcount, leave alone
>>> ITER_PIPE: // just, no :)
>>
>> Why? What's wrong with splice to O_DIRECT file?
>
> Sorry - s/to/from/, obviously.
>
> To spell it out: consider generic_file_splice_read() behaviour when
> the source had been opened with O_DIRECT; you will get a call of
> ->read_iter() into ITER_PIPE destination. And it bloody well
> will hit iov_iter_get_pages() on common filesystems, to pick the
> pages we want to read into.
>
> So... what's wrong with having that "pin" primitive making sure
> the pages are there and referenced by the pipe?
>
(our emails crossed) OK, yes, let me hook that up. I was just unaware
of that flow, I'll go off and figure it out.
Thanks for looking at this!
thanks,
--
John Hubbard
NVIDIA
On Mon, Aug 24, 2020 at 07:07:02PM -0700, John Hubbard wrote:
> On 8/24/20 6:54 PM, Al Viro wrote:
> > On Fri, Aug 21, 2020 at 09:20:54PM -0700, John Hubbard wrote:
> >
> > > Direct IO behavior:
> > >
> > > ITER_IOVEC:
> > > pin_user_pages_fast();
> > > break;
> > >
> > > ITER_KVEC: // already elevated page refcount, leave alone
> > > ITER_BVEC: // already elevated page refcount, leave alone
> > > ITER_PIPE: // just, no :)
> >
> > Why? What's wrong with splice to O_DIRECT file?
> >
>
> Oh! I'll take a look. Is this the fs/splice.c stuff? I ruled this out early
> mainly based on Christoph's comment in [1] ("ITER_PIPE is rejected in the
> direct I/O path"), but if it's supportable then I'll hook it up.
; cat >a.c <<'EOF'
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>

int main()
{
	int fd = open("./a.out", O_DIRECT);
	splice(fd, NULL, 1, NULL, 4096, 0);
	return 0;
}
EOF
; cc a.c
; ./a.out | wc -c
4096
and you just had ->read_iter() called with ITER_PIPE destination.
On 8/24/20 7:22 PM, Al Viro wrote:
> On Mon, Aug 24, 2020 at 07:07:02PM -0700, John Hubbard wrote:
>> On 8/24/20 6:54 PM, Al Viro wrote:
>>> On Fri, Aug 21, 2020 at 09:20:54PM -0700, John Hubbard wrote:
>>>
>>>> Direct IO behavior:
>>>>
>>>> ITER_IOVEC:
>>>> pin_user_pages_fast();
>>>> break;
>>>>
>>>> ITER_KVEC: // already elevated page refcount, leave alone
>>>> ITER_BVEC: // already elevated page refcount, leave alone
>>>> ITER_PIPE: // just, no :)
>>>
>>> Why? What's wrong with splice to O_DIRECT file?
>>>
>>
>> Oh! I'll take a look. Is this the fs/splice.c stuff? I ruled this out early
>> mainly based on Christoph's comment in [1] ("ITER_PIPE is rejected in the
>> direct I/O path"), but if it's supportable then I'll hook it up.
>
> ; cat >a.c <<'EOF'
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdlib.h>
>
> int main()
> {
> 	int fd = open("./a.out", O_DIRECT);
> 	splice(fd, NULL, 1, NULL, 4096, 0);
> 	return 0;
> }
> EOF
> ; cc a.c
> ; ./a.out | wc -c
> 4096
>
> and you just had ->read_iter() called with ITER_PIPE destination.
>
That example saves me a lot of time! Much appreciated.
thanks,
--
John Hubbard
NVIDIA