Hi Christoph, Al,
One of the aims I have for netfslib is to hide the involvement of pages/folios
entirely from the filesystem. That way the filesystem need not concern itself
with changes such as multipage folios appearing in the VM.
To this end, I'm trying to make it such that each netfs_io_subrequest contains
an iterator that describes the segment of buffer that a subrequest is dealing
with. The filesystem interprets the buffer appropriately, and can even pass
the iterator directly to kernel_sendmsg() or kernel_recvmsg() if this is
convenient.
In netfslib and in the network filesystems using it, however, there are a
number of situations where we need to "convert" an iterator:
(1) Async direct I/O.
In the async direct I/O case, we cannot hold on to the iterator when we
return, even if the operation is still in progress (ie. we return
EIOCBQUEUED), as it is likely to be on the caller's stack.
Also, simply copying the iterator isn't sufficient as virtual userspace
addresses cannot be trusted and we may have to pin the pages that
comprise the buffer.
(2) Crypto.
The crypto interface takes scatterlists, not iterators, so we need to be
able to convert an iterator into a scatterlist in order to do content
encryption within netfslib. Doing this in netfslib makes it easier to
store content-encrypted files encrypted in fscache.
(3) RDMA.
To perform RDMA, a buffer list needs to be presented as a QPE array.
Currently, cifs converts the iterator it is given to lists of pages, then
each list to a scatterlist and thence to a QPE array. I have code to
pass the iterator down to the bottom, using an intermediate BVEC iterator
instead of a page list if I can't pass down the original directly (eg. an
XARRAY iterator on the pagecache), but I still end up converting it to a
scatterlist, which is then converted to a QPE. I'm trying to go directly
from an iterator to a QPE array, thus avoiding the need to allocate an
sglist.
Constraints:
(A) Userspace gives us a list (IOVEC/UBUF) of buffers that may not be page
aligned and may not be contiguous; further, within a particular buffer
span, the pages may not be contiguous and may be part of multipage
folios.
Converting to a BVEC iterator allows a whole buffer to be described, and
extracting a subset of a BVEC iterator is straightforward.
(B) Kernel buffers may not be pinnable. If we get a KVEC iterator, say, we
can't assume that we can pin the pages (say the buffer is part of the
kernel rodata or belongs to a device - say a flash).
This may also apply to mmap'd devices in userspace iovecs.
(C) We don't want to pin pages if we can avoid it.
(D) PIPE iterators.
So, my first attempt at dealing with (1) involved creating a function that
extracted part of an iterator into another iterator[2]. Just copying and
shaping if possible (assuming, say, that an XARRAY iterator doesn't need to
pin the pages), but otherwise using repeated application of
iov_iter_get_pages() to build up a BVEC iterator (which is basically just a
list of {page,offset,len} tuples).
Al objected on the basis that it was pinning pages that it didn't need to (say
extracting BVEC->BVEC) and that it didn't deal correctly with PIPE (because
the underlying pipe would get advanced too early) or KVEC/BVEC (because it
might refer to a page that was un-get_pages-able).
Christoph objected that it shouldn't be available as a general purpose helper
and that it should be kept inside cifs - but I'm wanting to use it inside of
netfslib also.
My first attempt at dealing with (2) involved creating a function to scan an
iterator[2] and call a function on each segment of it. This could be used to
perform checksumming or to build up a scatterlist. However, as Al pointed
out, I didn't get the IOBUF or KVEC handling right. Mostly, though, I want to
convert to an sglist and work from that.
I then had a go at implementing a common framework[3] to extract an iterator
into another iterator, an sglist, a RDMA QPE array or any other type of list
we might envision. Al's not keen on that for a number of reasons (see his
reply) including that it loses type safety and that I should be using
iov_iter_get_pages2() - which he already objected to me doing in [1] :-/
So any thoughts on what the right way to do this is? What is the right API?
I have three things I need to make from a source iterator: a copy and/or a
subset iterator, a scatterlist and an RDMA QPE array, and several different
types of iterator to extract from. I shouldn't pin pages unless I need to,
sometimes pages cannot be pinned and sometimes I may have to add the physical
address to the entry.
If I can share part of the infrastructure, that would seem to be a good thing.
David
https://lore.kernel.org/r/165364824259.3334034.5837838050291740324.stgit@warthog.procyon.org.uk/ [1]
https://lore.kernel.org/r/165364824973.3334034.10715738699511650662.stgit@warthog.procyon.org.uk/ [2]
https://lore.kernel.org/r/[email protected]/ [3]
On Fri, Oct 14, 2022 at 04:26:57PM +0100, David Howells wrote:
> (1) Async direct I/O.
>
> In the async direct I/O case, we cannot hold on to the iterator when we
> return, even if the operation is still in progress (ie. we return
> EIOCBQUEUED), as it is likely to be on the caller's stack.
>
> Also, simply copying the iterator isn't sufficient as virtual userspace
> addresses cannot be trusted and we may have to pin the pages that
> comprise the buffer.
This is very related to the discussion we are having related to pinning
for O_DIRECT with Ira and Al. What block file systems do is to take
the pages from the iter and some flags on what is pinned. We can
generalize this to store all extra state in a flags word, or bite the
bullet and allow cloning of the iter in one form or another.
> (2) Crypto.
>
> The crypto interface takes scatterlists, not iterators, so we need to be
> able to convert an iterator into a scatterlist in order to do content
> encryption within netfslib. Doing this in netfslib makes it easier to
> store content-encrypted files encrypted in fscache.
Note that the scatterlist is generally a pretty bad interface. We've
been talking for a while to have an interface that takes a page array
as an input and return an array of { dma_addr, len } tuples. Thinking
about it taking in an iter might actually be an even better idea.
> (3) RDMA.
>
> To perform RDMA, a buffer list needs to be presented as a QPE array.
> Currently, cifs converts the iterator it is given to lists of pages, then
> each list to a scatterlist and thence to a QPE array. I have code to
> pass the iterator down to the bottom, using an intermediate BVEC iterator
> instead of a page list if I can't pass down the original directly (eg. an
> XARRAY iterator on the pagecache), but I still end up converting it to a
> scatterlist, which is then converted to a QPE. I'm trying to go directly
> from an iterator to a QPE array, thus avoiding the need to allocate an
> sglist.
I'm not sure what you mean with QPE. The fundamental low-level
interface in RDMA is the ib_sge. If you feed it to RDMA READ/WRITE
requests the interface for that is the RDMA R/W API in
drivers/infiniband/core/rw.c, which currently takes a scatterlist but
to which all of the above remarks on DMA interface apply. For RDMA
SEND that ULP has to do a dma_map_single/page to fill it, which is a
quite horrible layering violation and should move into the driver, but
that is going to be a massive change to the whole RDMA subsystem, so
unlikely to happen anytime soon.
Neither case has anything to do with what should be in common iov_iter
code, all this needs to live in the RDMA subsystem as a consumer.
Christoph Hellwig <[email protected]> wrote:
> > (1) Async direct I/O.
> >
> > In the async direct I/O case, we cannot hold on to the iterator when we
> > return, even if the operation is still in progress (ie. we return
> > EIOCBQUEUED), as it is likely to be on the caller's stack.
> >
> > Also, simply copying the iterator isn't sufficient as virtual userspace
> > addresses cannot be trusted and we may have to pin the pages that
> > comprise the buffer.
>
> This is very related to the discussion we are having related to pinning
> for O_DIRECT with Ira and Al.
Do you have a link to that discussion? I don't see anything obvious on
fsdevel including Ira.
I do see a discussion involving iov_iter_pin_pages, but I don't see Ira
involved in that.
> What block file systems do is to take the pages from the iter and some flags
> on what is pinned. We can generalize this to store all extra state in a
> flags word, or byte the bullet and allow cloning of the iter in one form or
> another.
Yeah, I know. A list of pages is not an ideal solution. It can only handle
contiguous runs of pages, possibly with a partial page at either end. A bvec
iterator would be of more use as it can handle a series of partial pages.
Note also that I would need to turn the pages *back* into an iterator in order
to commune with sendmsg() in the nether reaches of some network filesystems.
> > (2) Crypto.
> >
> > The crypto interface takes scatterlists, not iterators, so we need to
> > be able to convert an iterator into a scatterlist in order to do
> > content encryption within netfslib. Doing this in netfslib makes it
> > easier to store content-encrypted files encrypted in fscache.
>
> Note that the scatterlist is generally a pretty bad interface. We've
> been talking for a while to have an interface that takes a page array
> as an input and return an array of { dma_addr, len } tuples. Thinking
> about it taking in an iter might actually be an even better idea.
It would be nice to be able to pass an iterator to the crypto layer. I'm not
sure what the crypto people think of that.
> > (3) RDMA.
> >
> > To perform RDMA, a buffer list needs to be presented as a QPE array.
> > Currently, cifs converts the iterator it is given to lists of pages,
> > then each list to a scatterlist and thence to a QPE array. I have
> > code to pass the iterator down to the bottom, using an intermediate
> > BVEC iterator instead of a page list if I can't pass down the
> > original directly (eg. an XARRAY iterator on the pagecache), but I
> > still end up converting it to a scatterlist, which is then converted
> > to a QPE. I'm trying to go directly from an iterator to a QPE array,
> > thus avoiding the need to allocate an sglist.
>
> I'm not sure what you mean with QPE. The fundamental low-level
> interface in RDMA is the ib_sge.
Sorry, yes. ib_sge array. I think it appears as QPs on the wire.
> If you feed it to RDMA READ/WRITE requests the interface for that is the
> RDMA R/W API in drivers/infiniband/core/rw.c, which currently takes a
> scatterlist but to which all of the above remarks on DMA interface apply.
> For RDMA SEND that ULP has to do a dma_map_single/page to fill it, which is
> a quite horrible layering violation and should move into the driver, but
> that is going to be a massive change to the whole RDMA subsystem, so unlikely
> to happen anytime soon.
In cifs, as it is upstream, in RDMA transmission, the iterator is converted
into a clutch of pages in the top, which is converted back into iterators
(smbd_send()) and those into scatterlists (smbd_post_send_data()), thence into
sge lists (see smbd_post_send_sgl()).
I have patches that pass an iterator (which it decants to a bvec if async) all
the way down to the bottom layer. Snippets are then converted to scatterlists
and those to sge lists. I would like to skip the scatterlist intermediate and
convert directly to sge lists.
On the other hand, if you think the RDMA API should be taking scatterlists
rather than sge lists, that would be fine. Even better if I can just pass an
iterator in directly - though neither scatterlist nor iterator has a place to
put the RDMA local_dma_key - though I wonder if that's actually necessary for
each sge element, or whether it could be handed through as part of the request
as a whole.
> Neither case has anything to do with what should be in common iov_iter
> code, all this needs to live in the RDMA subsystem as a consumer.
That's fine in principle. However, I have some extraction code that can
convert an iterator to another iterator, an sglist or an rdma sge list, using
a common core of code to do all three.
I can split it up if that is preferable.
Do you have code that's ready to be used? I can make immediate use of it.
David
On Thu, Oct 20, 2022 at 03:03:56PM +0100, David Howells wrote:
> Christoph Hellwig <[email protected]> wrote:
>
> > > (1) Async direct I/O.
> > >
> > > In the async direct I/O case, we cannot hold on to the iterator when we
> > > return, even if the operation is still in progress (ie. we return
> > > EIOCBQUEUED), as it is likely to be on the caller's stack.
> > >
> > > Also, simply copying the iterator isn't sufficient as virtual userspace
> > > addresses cannot be trusted and we may have to pin the pages that
> > > comprise the buffer.
> >
> > This is very related to the discussion we are having related to pinning
> > for O_DIRECT with Ira and Al.
>
> Do you have a link to that discussion? I don't see anything obvious on
> fsdevel including Ira.
I think Christoph meant to say John Hubbard.
>
> I do see a discussion involving iov_iter_pin_pages, but I don't see Ira
> involved in that.
This one?
https://lore.kernel.org/all/[email protected]/
I've been casually reading it but not directly involved.
Ira
>
> > What block file systems do is to take the pages from the iter and some flags
> > on what is pinned. We can generalize this to store all extra state in a
> > flags word, or bite the bullet and allow cloning of the iter in one form or
> > another.
>
> Yeah, I know. A list of pages is not an ideal solution. It can only handle
> contiguous runs of pages, possibly with a partial page at either end. A bvec
> iterator would be of more use as it can handle a series of partial pages.
>
> Note also that I would need to turn the pages *back* into an iterator in order
> to commune with sendmsg() in the nether reaches of some network filesystems.
>
> > > (2) Crypto.
> > >
> > > The crypto interface takes scatterlists, not iterators, so we need to
> > > be able to convert an iterator into a scatterlist in order to do
> > > content encryption within netfslib. Doing this in netfslib makes it
> > > easier to store content-encrypted files encrypted in fscache.
> >
> > Note that the scatterlist is generally a pretty bad interface. We've
> > been talking for a while to have an interface that takes a page array
> > as an input and return an array of { dma_addr, len } tuples. Thinking
> > about it taking in an iter might actually be an even better idea.
>
> It would be nice to be able to pass an iterator to the crypto layer. I'm not
> sure what the crypto people think of that.
>
> > > (3) RDMA.
> > >
> > > To perform RDMA, a buffer list needs to be presented as a QPE array.
> > > Currently, cifs converts the iterator it is given to lists of pages,
> > > then each list to a scatterlist and thence to a QPE array. I have
> > > code to pass the iterator down to the bottom, using an intermediate
> > > BVEC iterator instead of a page list if I can't pass down the
> > > original directly (eg. an XARRAY iterator on the pagecache), but I
> > > still end up converting it to a scatterlist, which is then converted
> > > to a QPE. I'm trying to go directly from an iterator to a QPE array,
> > > thus avoiding the need to allocate an sglist.
> >
> > I'm not sure what you mean with QPE. The fundamental low-level
> > interface in RDMA is the ib_sge.
>
> Sorry, yes. ib_sge array. I think it appears as QPs on the wire.
>
> > If you feed it to RDMA READ/WRITE requests the interface for that is the
> > RDMA R/W API in drivers/infiniband/core/rw.c, which currently takes a
> > scatterlist but to which all of the above remarks on DMA interface apply.
> > For RDMA SEND that ULP has to do a dma_map_single/page to fill it, which is
> > a quite horrible layering violation and should move into the driver, but
> > that is going to be a massive change to the whole RDMA subsystem, so unlikely
> > to happen anytime soon.
>
> In cifs, as it is upstream, in RDMA transmission, the iterator is converted
> into a clutch of pages in the top, which is converted back into iterators
> (smbd_send()) and those into scatterlists (smbd_post_send_data()), thence into
> sge lists (see smbd_post_send_sgl()).
>
> I have patches that pass an iterator (which it decants to a bvec if async) all
> the way down to the bottom layer. Snippets are then converted to scatterlists
> and those to sge lists. I would like to skip the scatterlist intermediate and
> convert directly to sge lists.
>
> On the other hand, if you think the RDMA API should be taking scatterlists
> rather than sge lists, that would be fine. Even better if I can just pass an
> iterator in directly - though neither scatterlist nor iterator has a place to
> put the RDMA local_dma_key - though I wonder if that's actually necessary for
> each sge element, or whether it could be handed through as part of the request
> as a whole.
>
> > Neither case has anything to do with what should be in common iov_iter
> > code, all this needs to live in the RDMA subsystem as a consumer.
>
> That's fine in principle. However, I have some extraction code that can
> convert an iterator to another iterator, an sglist or an rdma sge list, using
> a common core of code to do all three.
>
> I can split it up if that is preferable.
>
> Do you have code that's ready to be used? I can make immediate use of it.
>
> David
>
On Thu, Oct 20, 2022 at 03:03:56PM +0100, David Howells wrote:
> > What block file systems do is to take the pages from the iter and some flags
> > on what is pinned. We can generalize this to store all extra state in a
> > flags word, or bite the bullet and allow cloning of the iter in one form or
> > another.
>
> Yeah, I know. A list of pages is not an ideal solution. It can only handle
> contiguous runs of pages, possibly with a partial page at either end. A bvec
> iterator would be of more use as it can handle a series of partial pages.
>
> Note also that I would need to turn the pages *back* into an iterator in order
> to commune with sendmsg() in the nether reaches of some network filesystems.
Yes. So I think the right thing here is to make sure we can send
the iter through the whole stack without a conversion.
> It would be nice to be able to pass an iterator to the crypto layer. I'm not
> sure what the crypto people think of that.
Let's ask them..
> On the other hand, if you think the RDMA API should be taking scatterlists
> rather than sge lists, that would be fine. Even better if I can just pass an
> iterator in directly - though neither scatterlist nor iterator has a place to
> put the RDMA local_dma_key - though I wonder if that's actually necessary for
> each sge element, or whether it could be handed through as part of the request
> as a whole.
Well, in the long run it should not take scatterlists either, as they
are a bad data structure. But what should happen in the long run is
that the DMA mapping is only done in the hardware drivers, not the ULPs,
where it is a really nasty layering violation. This requires the strange
ib_dma_* stubs to disable DMA mapping for the software drivers, and it
also does completely unneeded DMA mappings for sends that are inline in
the SQE as supported by some Mellanox / Nvidia hardware.
> That's fine in principle. However, I have some extraction code that can
> convert an iterator to another iterator, an sglist or an rdma sge list, using
> a common core of code to do all three.
So I think the iterator to iterator conversion is a really bad idea and
we should not have it at all. It just works around the issue of not
being able to easily keep state after an iter-based get_user_pages, but
that is being addressed at the moment. The iter to ib_sge/scatterlist
conversions are very much RDMA specific at the moment, so I guess that
might be a good place to keep them. In fact I suspect the scatterlist
conversion should not be a public API at all, but hidden in rw.c and
only used internally for the DMA mapping.
On Thu, Oct 20, 2022 at 08:30:34PM -0700, Ira Weiny wrote:
> > Do you have a link to that discussion? I don't see anything obvious on
> > fsdevel including Ira.
>
> I think Christoph meant to say John Hubbard.
Oops, sorry for my bad memory, both of you doing important tree-wide
MM rework at the same time got me really confused.
>
> >
> > I do see a discussion involving iov_iter_pin_pages, but I don't see Ira
> > involved in that.
>
> This one?
>
> https://lore.kernel.org/all/[email protected]/
>
> I've been casually reading it but not directly involved.
Yes!
On Mon, Oct 24, 2022 at 07:57:24AM -0700, Christoph Hellwig wrote:
> So I think the iterator to iterator conversion is a really bad idea and
> we should not have it at all. It just works around the issue of not
> being able to easily keep state after an iter-based get_user_pages, but
> that is being addressed at the moment. The iter to ib_sge/scatterlist
> conversions are very much RDMA specific at the moment, so I guess that
> might be a good place to keep them. In fact I suspect the scatterlist
> conversion should not be a public API at all, but hidden in rw.c and
> only used internally for the DMA mapping.
1) iter-to-scatterlist use is much wider than RDMA. Other places like that
include e.g. vhost_scsi_map_to_sgl(), p9_get_mapped_pages(),
rds_message_zcopy_from_user(), tls_setup_from_iter()...
2) there's a limit to how far we can propagate an arbitrary iov_iter -
ITER_IOVEC/ITER_UBUF ones are absolutely tied to mm_struct of the
originating process. We can't use them for anything async - not
without the horrors a-la use_mm().
3) sendmsg() and recvmsg() are not suited for the situations where
we have a bunch of pages + some kmalloc'ed object. Consider e.g.
NFS write; what goes on the wire is a combination of fixed-sized
request put together by NFS client code with pages containing the
data to be sent. Ideally we'd like to send the entire bunch at
once; AFAICS there are only 3 ways to do that -
* virt_to_page() for the fixed-sized part, build ITER_BVEC
iterator in ->msg_iter containing that page + the rest of submitted
pages, pass to ->sendmsg().
* kmap() each data page, build ITER_KVEC iterator, pass to
->sendmsg(). Forget about any kind of zero-copy. And that's
kmap(), not kmap_local_page().
* try to implement heterogeneous iov_iter, with mix of (at
least) kvec and bvec parts. Fucking nightmare, IMO, and anything
similar to iov_iter_get_pages() on those will have an insane
semantics.
We can do separate sendmsg() for kvec and bvec parts,
but that doesn't come for free either. *AND* bvec part is very
likely not the original iterator we got those pages from.
Unless I'm misunderstanding dhowells, that's not too dissimilar to
the reasons behind his proposed primitive...
My problem with all that stuff is that we ought to sort out the
lifetime and pin_user issues around the iov_iter. What I really
want to avoid is "no worries, we'd extracted stuff into ITER_BVEC, it's
stable and can be passed around in arbitrary way" kind of primitive.
Because *that* has no chance to work.
As far as I can see, we have the following constraints:
* page references put into ITER_BVEC (and ITER_XARRAY) must not
go away while the iov_iter is being used. That's on the creator of
iov_iter.
* pages found in iterator might be used past the lifetime of
iterator. We need the underlying pages to survive until the last
use. "Grab a page reference" is *NOT* a solution in general case.
* pages found in data-destination iterator may have their
contents modified, both during the iterator lifetime and asynchronously.
If it has a chance to be a user-mapped page, we must either
a) have it locked by caller and have no modifications after
it gets unlocked or
b) have it pinned (sensu pin_user_pages()) by the caller and
have no modifications until the unpin_user_page().
* data objects located in those pages might have the
lifetime *NOT* controlled by page refcount. In particular, if we
grab a page reference to something kmalloc'ed, holding onto that
reference is not enough to make the access to data safe in any sense
other than "it won't oops on you". kfree() won't care about the
elevated page refcount and kmalloc() after that kfree() might reuse
the same memory. That's the main reason why iov_iter_get_pages()
on ITER_KVEC is a non-starter - too dangerous. We can find the
underlying pages, but we shouldn't grab references to those;
the caller must make sure that object will not be freed until
after the async access ends (by arranging a suitable completion
callback of some sort, etc.)
* iov_iter_get_pages...() is the only place where we find
the underlying pages. All other primitives are synchronous -
they need pages to be alive and in a suitable state for access
at the moment they are called, but that's it.
* page references obtained from iov_iter_get_pages...() can
end up in various places. No, it's not just bio - not even close
to that. Any place where we might retain those references for
async work MUST have a way to tell whether the reference is counting
and whether we should do unpin_user_page when we are done. This
really needs to be audited. We need to understand where those
page references might end up and how can the caller tell when
async access is finished.
Note that one of those places is skb fragment list; MSG_ZEROCOPY
sendmsg() can and will stick page references in there. "managed" shite
tries to deal with that. I'm not fond of the way it's done, to put it mildly.
It _might_ cope with everything io-uring throws at it at the moment,
but the whole skb_zcopy_downgrade_managed() thing is asking for
trouble. Again, randomly deciding to go grab a reference to
a page we got from fuck knows where is a bad, bad idea.
BTW, for some taste of the fun involved in that audit,
try to track the call chains leading to osd_req_op_extent_osd_data_bvec_pos()
and see what pages might end up stuffed into ceph_osd_data by it; later
(possibly much later) those will be stuffed into ITER_BVEC msg->msg_iter...
You'll come to hate drivers/block/rbd.c long before you are done with
that ;-/
AFAICS, we need the following:
1) audit all places where we stuff something into ITER_BVEC/ITER_XARRAY.
I've some of that done (last cycle, so it might have been invalidated),
but some really scary ones remain (ceph and nfs transport, mostly).
2) audit all places where iov_iter_get_pages...() gets called, in order
to find out where page references go and when are they dropped by the
current mainline. Note that there's a non-trivial interplay with
ITER_BVEC audit - those pages can be used to populate an ITER_BVEC iterator
*and* ITER_BVEC iterators can end up being passed to iov_iter_get_pages...().
NOTE: in some cases we have logics for coalescing adjacent subranges of
the same page; that can get interesting if we might end up mixing references
of different sorts there (some pinning, some not). AFAICS that should
never happen for bio, but I'm not certain about e.g. nfs pagelists.
My preference for iov_iter_get_pages...() replacement would be to have
it do
pin_user_pages() if it's a data-destination user-backed iterator
get_user_pages() if it's a data-source user-backed iterator
just return the fucking struct page * if it's not user-backed.
Caller of iov_iter_get_pages...() replacement should be aware of the
kind of iterator it's dealing with, on the level of "is it user-backed"
and "is it data-destination". It needs that to decide what to do with
the page references when we are done with them. Blindly grabbing a
refcount on pages from an ITER_BVEC is a bad idea.
Another issue with iov_iter_get_pages...() is that compound page turns
into a bunch of references to individual subpages; io-uring folks have
noticed the problem, but their solution is... inelegant. I wonder if
we would be better off with a variant of the primitive that would give
out compound pages; it would need different calling conventions,
obviously (current ones assume that all pages except the first and
the last one have PAGE_SIZE worth of data in them).
Some questions from partial ITER_BVEC/ITER_XARRAY audit I'd done last
cycle:
Can we assume that all pages involved in ->issue_read() are supposed to be
locked by the caller? netfs question, so that's over to dhowells...
What protects pages involved in ITER_XARRAY iterator created by
afs_read_dir()? Note that we are not guaranteed inode_lock() on
the directory in question...
What is guaranteed for the pages involved in ceph transport? I have
not managed to get through the call graph for that stuff - too deep,
varied and nasty; besides, there's some work from jlayton in the
area, so...
io_import_fixed() sets ITER_BVEC over pinned pages; see io_pin_pages() for
the place where that's done. A scary question is what prevents an early
unpin of those...
vring: fuck knows. We have physical addresses stored and we work with
pfn_to_page() results. Insertion is up to users of those primitives and
so's the exclusion of use vs. removals. Hell knows what they store there
and what kind of exclusion (if any) are they using. It is *not* uniform.
Note that if we can get a userland page there, we have to deal with more
than just the primitive that calls copy_to_iter() - there's one right
next to it doing kmap_atomic() + modify + unmap, with no reference to
any iov_iter. And it has exact same needs as copy_to_iter()
in that respect... I don't know the vdpa stuff anywhere near well enough,
unfortunately.
FWIW, I've ported #work.iov_iter on top of 6.1-rc1; let's use that
as base point for any further work in those directions.
Signed-off-by: Al Viro <[email protected]>
---
lib/iov_iter.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index c3ca28ca68a6..e9a8fc9ee8ee 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -703,17 +703,16 @@ static inline bool page_copy_sane(struct page *page, size_t offset, size_t n)
head = compound_head(page);
v += (page - head) << PAGE_SHIFT;
- if (likely(n <= v && v <= (page_size(head))))
- return true;
- WARN_ON(1);
- return false;
+ if (WARN_ON(n > v || v > page_size(head)))
+ return false;
+ return true;
}
size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
size_t res = 0;
- if (unlikely(!page_copy_sane(page, offset, bytes)))
+ if (!page_copy_sane(page, offset, bytes))
return 0;
if (unlikely(iov_iter_is_pipe(i)))
return copy_page_to_iter_pipe(page, offset, bytes, i);
@@ -808,7 +807,7 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t byt
struct iov_iter *i)
{
char *kaddr = kmap_atomic(page), *p = kaddr + offset;
- if (unlikely(!page_copy_sane(page, offset, bytes))) {
+ if (!page_copy_sane(page, offset, bytes)) {
kunmap_atomic(kaddr);
return 0;
}
--
2.30.2
Signed-off-by: Al Viro <[email protected]>
---
drivers/infiniband/ulp/rtrs/rtrs-clt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/infiniband/ulp/rtrs/rtrs-clt.c b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
index 8546b8816524..88282b288abd 100644
--- a/drivers/infiniband/ulp/rtrs/rtrs-clt.c
+++ b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
@@ -966,7 +966,7 @@ static void rtrs_clt_init_req(struct rtrs_clt_io_req *req,
refcount_set(&req->ref, 1);
req->mp_policy = clt_path->clt->mp_policy;
- iov_iter_kvec(&iter, READ, vec, 1, usr_len);
+ iov_iter_kvec(&iter, WRITE, vec, 1, usr_len);
len = _copy_from_iter(req->iu->buf, usr_len, &iter);
WARN_ON(len != usr_len);
--
2.30.2
Signed-off-by: Al Viro <[email protected]>
---
arch/s390/kernel/crash_dump.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c
index dd74fe664ed1..7ad7f20320b9 100644
--- a/arch/s390/kernel/crash_dump.c
+++ b/arch/s390/kernel/crash_dump.c
@@ -153,7 +153,7 @@ int copy_oldmem_kernel(void *dst, unsigned long src, size_t count)
kvec.iov_base = dst;
kvec.iov_len = count;
- iov_iter_kvec(&iter, WRITE, &kvec, 1, count);
+ iov_iter_kvec(&iter, READ, &kvec, 1, count);
if (copy_oldmem_iter(&iter, src, count) < count)
return -EFAULT;
return 0;
--
2.30.2
Al Viro <[email protected]> wrote:
> * try to implement heterogeneous iov_iter, with mix of (at
> least) kvec and bvec parts. Fucking nightmare, IMO, and anything
> similar to iov_iter_get_pages() on those will have an insane
> semantics.
An "iterator of iterators" might be the easiest way to do that, where the
iterator has an array of other iterators of diverse types and advances through
them. Sounds a bit mad, though.
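To make the "iterator of iterators" idea concrete, here is a minimal userspace model of the advance logic only - an array of sub-iterators of differing types consumed in order. All names here are hypothetical; this is not the kernel iov_iter API, just a sketch of how stepping across heterogeneous parts could work.

```c
/* Userspace model of an "iterator of iterators": a compound iterator
 * that holds an array of sub-ranges of differing types and advances
 * through them in order.  Hypothetical names throughout.
 */
#include <assert.h>
#include <stddef.h>

struct sub_iter {
	int type;		/* e.g. the KVEC vs BVEC part in the kernel analogue */
	size_t remaining;	/* bytes left in this sub-iterator */
};

struct iter_of_iters {
	struct sub_iter *subs;
	size_t nr, cur;		/* number of sub-iterators, current index */
};

/* Consume up to @bytes, spilling over into later sub-iterators. */
static size_t ioi_advance(struct iter_of_iters *it, size_t bytes)
{
	size_t done = 0;

	while (bytes && it->cur < it->nr) {
		struct sub_iter *s = &it->subs[it->cur];
		size_t n = bytes < s->remaining ? bytes : s->remaining;

		s->remaining -= n;
		bytes -= n;
		done += n;
		if (!s->remaining)
			it->cur++;	/* step to the next sub-iterator */
	}
	return done;
}
```

The awkward part Al points at is not this advance loop but what iov_iter_get_pages() should mean when `cur` sits on a kvec part - which is why the semantics get "insane".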
> We can do separate sendmsg() for kvec and bvec parts,
> but that doesn't come for free either. *AND* bvec part is very
> likely not the original iterator we got those pages from.
Cifs, for example, does that. A cifs data packet consists of some kvec-type
things surrounding a data object, currently a list of pages, passed one at a
time to sendmsg/recvmsg. I'm trying to change the list of pages thing to use
iterators right down to the socket, but I then end up with {kvec,xarray,kvec}
type things in the most common case.
> Unless I'm misunderstanding dhowells, that's not too dissimilar to
> the reasons behind his proposed primitive...
Yes.
> My problem with all that stuff is that we ought to sort out the
> lifetime and pin_user issues around the iov_iter. What I really
> want to avoid is "no worries, we'd extracted stuff into ITER_BVEC, it's
> stable and can be passed around in arbitrary way" kind of primitive.
> Because *that* has no chance to work.
What I'm intending to do in netfslib is just use an ITER_BVEC as a list of
{page,off,len} tuples. The netfs_io_request struct is used to manage the
lifetime of the pages.
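In miniature, "an ITER_BVEC as a list of {page,off,len} tuples" looks like the following userspace sketch. `struct page` is faked as an opaque buffer and all names are made up; the point is that the tuples themselves carry no lifetime - in netfslib that would live in the owning netfs_io_request.

```c
/* Userspace sketch of the {page,off,len} tuple list an ITER_BVEC
 * describes.  The tuples only locate data; whoever built the list
 * (the request) is responsible for keeping the pages alive.
 */
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct fake_page { char data[4096]; };

struct fake_bvec {
	struct fake_page *page;
	size_t off, len;
};

/* Copy the segments described by the tuples into a flat buffer. */
static size_t bvec_gather(const struct fake_bvec *bv, size_t nr,
			  char *out, size_t outlen)
{
	size_t copied = 0;

	for (size_t i = 0; i < nr && copied < outlen; i++) {
		size_t n = bv[i].len;

		if (n > outlen - copied)
			n = outlen - copied;
		memcpy(out + copied, bv[i].page->data + bv[i].off, n);
		copied += n;
	}
	return copied;
}
```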
Having discussed this with you and Willy, I can make it pin/unpin the pages in
an IOVEC/UBUF if appropriate to the I/O environment rather than ref get/put -
but it means doing something other than iov_iter_get_pages2(). I could add an
iov_iter_pin_pages2() or pass FOLL_* flags into __iov_iter_get_pages_alloc()
and wrappers, say.
> * page references put into ITER_BVEC (and ITER_XARRAY) must not
> go away while the iov_iter is being used. That's on the creator of
> iov_iter.
Yep.
> * pages found in iterator might be used past the lifetime of
> iterator. We need the underlying pages to survive until the last
> use. "Grab a page reference" is *NOT* a solution in general case.
Yep, but I need to understand where I need to use pinning rather than ref'ing.
> * pages found in data-destination iterator may have their
> contents modified, both during the iterator lifetime and asynchronously.
> If it has a chance to be a user-mapped page, we must either
> a) have it locked by caller and have no modifications after
> it gets unlocked or
> b) have it pinned (sensu pin_user_pages()) by the caller and
> have no modifications until the unpin_user_page().
I can do the pinning, sure, if I have the API to do that.
I guess I'd need to trap page_mkwrite() to prevent modifications - though both
cifs and nfs seem to currently allow modifications of pinned pages to take
place during I/O under certain conditions.
> * page references obtained from iov_iter_get_pages...() can
> end up in various places. No, it's not just bio - not even close
> to that. Any place where we might retain those references for
> async work MUST have a way to tell whether the reference is counting
> and whether we should do unpin_user_page when we are done. This
> really needs to be audited. We need to understand where those
> page references might end up and how can the caller tell when
> async access is finished.
> Note that one of those places is skb fragment list; MSG_ZEROCOPY
> sendmsg() can and will stick page references in there. ...
Good point. I was considering adding zerocopy for afs/rxrpc - but I probably
need to think more on that.
> AFAICS, we need the following:
>
> 1) audit all places where we stuff something into ITER_BVEC/ITER_XARRAY.
> I've some of that done (last cycle, so it might have been invalidated),
> but some really scary ones remain (ceph and nfs transport, mostly).
We're trying to get the ceph bits up into netfslib - at least then it'll be
common between 9p, afs, ceph and cifs.
> 2) audit all places where iov_iter_get_pages...() gets called, in order
> to find out where page references go and when are they dropped by the
> current mainline. Note that there's a non-trivial interplay with
> ITER_BVEC audit - those pages can be used to populate an ITER_BVEC iterator
> *and* ITER_BVEC iterators can end up being passed to iov_iter_get_pages...().
> NOTE: in some cases we have logics for coalescing adjacent subranges of
> the same page; that can get interesting if we might end up mixing references
> of different sorts there (some pinning, some not). AFAICS that should
> never happen for bio, but I'm not certain about e.g. nfs pagelists.
>
> My preference for iov_iter_get_pages...() replacement would be to have
> it do
> pin_user_pages() if it's a data-destination user-backed iterator
> get_user_pages() if it's a data-source user-backed iterator
Okay - sounds like what I was expecting. I need to fix my cifs patches to do
this correctly.
> just return the fucking struct page * if it's not user-backed.
> Caller of iov_iter_get_pages...() replacement should be aware of the
> kind of iterator it's dealing with, on the level of "is it user-backed"
> and "is it data-destination". It needs that to decide what to do with
> the page references when we are done with them. Blind grabbing refcount
> on pages from ITER_BVEC is a bad idea.
Is it worth making iov_iter_get/pin_user_pages() only work with ITER_IOVEC and
ITER_UBUF and disallow the rest?
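Al's rule above amounts to a small decision table, which can be modelled in isolation. This is a hypothetical userspace encoding of "what must the caller undo when the I/O completes", not the real iov_iter API:

```c
/* Hypothetical model of the proposed extraction rule:
 *   user-backed + data-destination -> pin_user_pages(), undo with unpin
 *   user-backed + data-source      -> get_user_pages(), undo with put
 *   kernel-backed                  -> bare struct page *, nothing to undo
 */
#include <assert.h>
#include <stdbool.h>

enum cleanup { CLEANUP_NONE, CLEANUP_PUT, CLEANUP_UNPIN };

static enum cleanup page_cleanup_mode(bool user_backed, bool data_dest)
{
	if (!user_backed)
		return CLEANUP_NONE;		/* no reference was taken */
	return data_dest ? CLEANUP_UNPIN	/* pinned sensu pin_user_pages() */
			 : CLEANUP_PUT;		/* plain page reference */
}
```

The caller knowing "is it user-backed" and "is it data-destination" is exactly what lets it pick the right arm here instead of blindly grabbing refcounts.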
> Another issue with iov_iter_get_pages...() is that compound page turns
> into a bunch of references to individual subpages; io-uring folks have
> noticed the problem, but their solution is... inelegant. I wonder if
> we would be better off with a variant of the primitive that would give
> out compound pages; it would need different calling conventions,
> obviously (current ones assume that all pages except the first and
> the last one have PAGE_SIZE worth of data in them).
One of the problems there is that the kmap functions only handle individual
pages. Willy has a patch that allows you to vmap a whole folio on a highmem
machine (just a bit of maths on a non-highmem machine), but that might need to
do memory allocation...
> Some questions from partial ITER_BVEC/ITER_XARRAY audit I'd done last
> cycle:
>
> Can we assume that all pages involved ->issue_read() are supposed to be
> locked by the caller? netfs question, so that's over to dhowells...
If the pages come from the pagecache, then yes, they're locked; if they're in
a private bounce buffer created by netfslib, then no, they're not. However,
the network filesystem tells netfslib when it's done or partially done and
leaves the unlocking, unref'ing, unpinning or whatever to netfslib. netfslib
has somewhere to store the appropriate state.
> What protects pages involved in ITER_XARRAY iterator created by
> afs_read_dir()? Note that we are not guaranteed inode_lock() on
> the directory in question...
Yeah - that needs fixing. The size of the data can change, but I don't update
the iterator. There is an rwsem preventing the data from being reread,
though, whilst we're scanning it.
> What is guaranteed for the pages involved in ceph transport? I have
> not managed to get through the call graph for that stuff - too deep,
> varied and nasty; besides, there's some work from jlayton in the
> area, so...
We're trying to make it such that we can pass the iterator that netfslib
generates down to libceph.
David
This could explain why you are dropping the unlikely, 'cause just
from the patch this is non-obvious. Especially as the patch seems to
do a lot more than just remove an unlikely.
On Mon, Oct 24, 2022 at 08:53:28PM +0100, Al Viro wrote:
> 1) iter-to-scatterlist use is much wider than RDMA. Other places like that
> include e.g. vhost_scsi_map_to_sgl(), p9_get_mapped_pages(),
> rds_message_zcopy_from_user(), tls_setup_from_iter()...
RDS is RDMA. vhost_scsi_map_to_sgl and p9_get_mapped_pages do some
odd virtio thing. But point taken, it is spread further than it should
be at the moment. It is however a rather bad data structure that really
should not spread much further.
> 2) there's a limit to how far we can propagate an arbitrary iov_iter -
> ITER_IOVEC/ITER_UBUF ones are absolutely tied to mm_struct of the
> originating process. We can't use them for anything async - not
> without the horrors a-la use_mm().
But why would you pass them on? It is much better to just convert
them to a bio_vec and pass that on. We could still feed that to an
iter later, and in fact there are a bunch of good reasons to do so.
But in pretty much all those cases you really do not want to keep
the whole iov_iter state.
> We can do separate sendmsg() for kvec and bvec parts,
> but that doesn't come for free either. *AND* bvec part is very
> likely not the original iterator we got those pages from.
The sendmsg model seems to be very much built around that, with
MSG_MORE. But even with a 'converter', how do you plan to build
such a mixed iter anyway?
> My problem with all that stuff is that we ought to sort out the
> lifetime and pin_user issues around the iov_iter. What I really
> want to avoid is "no worries, we'd extracted stuff into ITER_BVEC, it's
> stable and can be passed around in arbitrary way" kind of primitive.
> Because *that* has no chance to work.
Yes. I think the first thing we need in this whole area is to sort
the pinning out. After that we can talk about all kinds of convenience
helpers.
> As far as I can see, we have the following constraints:
>
> * page references put into ITER_BVEC (and ITER_XARRAY) must not
> go away while the iov_iter is being used. That's on the creator of
> iov_iter.
*nod*
> * pages found in iterator might be used past the lifetime of
> iterator. We need the underlying pages to survive until the last
> use. "Grab a page reference" is *NOT* a solution in general case.
> * pages found in data-destination iterator may have their
> contents modified, both during the iterator lifetime and asynchronously.
This is where the trouble starts. If you want to be able to feed
kmalloced data in through ITER_KVEC (or ITER_BVEC for that matter),
you can't just grab any kind of hold on it. The only way to do that
is by telling the caller you're done with it, i.e. how aio/io_uring/etc
use ki_complete - the callee owns the data until it declares it is done
by calling ->ki_complete. But no 'borrowing' of references, as the
only sane way to do that would be page refcounts, and those do not
work for everything.
> If it has a chance to be a user-mapped page, we must either
> a) have it locked by caller and have no modifications after
> it gets unlocked or
> b) have it pinned (sensu pin_user_pages()) by the caller and
> have no modifications until the unpin_user_page().
Yes. And I think we need a good counterpart to iov_iter_pin_pages
that undoes any required pinning, so that users of iov_iter_pin_pages
and iov_iter_unpin_pages can use these helpers without even thinking
about the rules. That requires passing some amount of state to the
unpin side. It could just be an unsigned long with flags probably,
or we keep the iov_iter alive and look at that.
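A minimal sketch of the "pass state to the unpin side" idea, with the state as just an unsigned long of flags as suggested. Everything here is a made-up userspace stand-in (a counter instead of real page pins); it only shows the pairing, so the release helper never has to consult the possibly-dead iov_iter:

```c
/* Paired extract/release helpers where the extract side records, in an
 * unsigned long, how the pages were obtained, and the release side acts
 * on that record alone.  Hypothetical names and semantics.
 */
#include <assert.h>

#define EXTRACT_FLAG_PINNED	0x1UL	/* undo with unpin_user_page() */

static int pin_count;			/* stands in for the page pin count */

static void fake_extract_pages(unsigned long *state, int user_backed)
{
	if (user_backed) {
		pin_count++;		/* kernel analogue: pin_user_pages() */
		*state = EXTRACT_FLAG_PINNED;
	} else {
		*state = 0;		/* kernel pages: nothing to undo */
	}
}

static void fake_release_pages(unsigned long state)
{
	if (state & EXTRACT_FLAG_PINNED)
		pin_count--;		/* kernel analogue: unpin_user_page() */
}
```

Because the flag word travels with the async work item, callers can use the pair "without even thinking about the rules", which is the whole point of the suggestion.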
> Another issue with iov_iter_get_pages...() is that compound page turns
> into a bunch of references to individual subpages; io-uring folks have
> noticed the problem, but their solution is... inelegant. I wonder if
> we would be better off with a variant of the primitive that would give
> out compound pages; it would need different calling conventions,
> obviously (current ones assume that all pages except the first and
> the last one have PAGE_SIZE worth of data in them).
The new name for compound pages is folios, and yes the whole get/pin
user pages machinery needs to switch to that.
David Howells <[email protected]> wrote:
> > What protects pages involved in ITER_XARRAY iterator created by
> > afs_read_dir()? Note that we are not guaranteed inode_lock() on
> > the directory in question...
>
> Yeah - that needs fixing. The size of the data can change, but I don't update
> the iterator.
Actually, no. The iterator is the output buffer for afs_fetch_data(). If the
buffer turned out to be too small we drop the validate_lock and go round and
try again.
req->actual_len and req->file_size are updated by afs_fetch_data() from the
RPC reply. req->len tells the RPC delivery code how big the buffer is (which
we don't have to fill if there's less data available than we have buffer
space).
David