2023-02-09 13:56:13

by Stefan Metzmacher

[permalink] [raw]
Subject: copy on write for splice() from file to pipe?

Hi Linus and others,

as written in a private mail before, I'm currently trying to
make use of IORING_OP_SPLICE in order to get zero copy support
in Samba.

The most important use cases are 8 Mbytes reads and writes to
files. where "memcpy" (at the lowest end copy_user_enhanced_fast_string())
is the obvious performance killer.

I have a prototype that offers really great performance
avoiding "memcpy" by using splice() (in order to get async IORING_OP_SPLICE).

So we have two cases:

1. network -> socket -> splice -> pipe -> splice -> file -> storage

2. storage -> file -> splice -> pipe -> splice -> socket -> network

With 1. I guess everything can work reliable, once
the pages are created/filled in the socket receive buffer
they are used exclusively and they won't be shared on
the way to the file. Which means we can be sure that
data arrives unmodified in the file(system).

But with 2. there's a problem, as the pages from the file,
which are spliced into the pipe are still shared without
copy on write with the file(system). It means writes to the file
after the first splice modify the content of the spliced pages!
So the content may change uncontrolled before it reaches the network!
I created a little example that demonstrates the problem (see below),
it gives the following output:

> open(O_TMPFILE) => ffd[3]
> pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0x1f sret[4096]
> pipe() => ret[0]
> splice(count=PIPE_BUF*2,ofs=0) sret[8192]
> pwrite(count=PIPE_BUF,ofs=0) 0xf0 sret[4096]
> pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0xf0 sret[4096]
> read(from_pipe, count=PIPE_BUF) sret[4096]
> memcmp() at ofs=0, expecting 0x00 => ret[240]
> memcmp() at ofs=0, checking for 0xf0 => ret[0]
> read(from_pipe, count=PIPE_BUF) sret[4096]
> memcmp() at ofs=PIPE_BUF, expecting 0x1f => ret[209]
> memcmp() at ofs=PIPE_BUF, checking for 0xf0 => ret[0]

After reading from the pipe we get the values we have written to
the file instead of the values we had at the time of splice.

For things like web servers, which mostly serve static content, this
isn't a problem, but it is for Samba, when reads and writes may happen within
microseconds, before the content is pushed to the network.

I'm wondering if there's a possible way out of this, maybe triggered by a new
flag passed to splice.

I looked through the code and noticed the existence of IOMAP_F_SHARED.
Maybe the splice from the page cache to the pipe could set IOMAP_F_SHARED,
while incrementing the refcount and the network driver could remove it again
when the refcount reaches 1 again.

Is there any other way we could archive something like this?

In addition and/or as alternative I was thinking about a flag to
preadv2() (and IORING_OP_READV) to indicate the use of something
like async_memcpy(), instead of doing the copy via the cpu.
That in combination with IORING_OP_SENDMSG_ZC would avoid "memcpy"
on the cpu.

Any hints, remarks and prototype patches are highly welcome.

Thanks!
metze

#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <limits.h>

int main(void)
{
int ffd;
int pfds[2];
char buf [PIPE_BUF] = {0, };
char buf2 [PIPE_BUF] = {0, };
ssize_t sret;
int ret;
off_t ofs;

memset(buf, 0x1f, PIPE_BUF);

ffd = open("/tmp/", O_RDWR | O_TMPFILE, S_IRUSR | S_IWUSR);
printf("open(O_TMPFILE) => ffd[%d]\n", ffd);

sret = pwrite(ffd, buf, PIPE_BUF, PIPE_BUF);
printf("pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0x1f sret[%zd]\n", sret);

ret = pipe(pfds);
printf("pipe() => ret[%d]\n", ret);

ofs = 0;
sret = splice(ffd, &ofs, pfds[1], NULL, PIPE_BUF*2, 0);
printf("splice(count=PIPE_BUF*2,ofs=0) sret[%zd]\n", sret);

memset(buf, 0xf0, PIPE_BUF);

sret = pwrite(ffd, buf, PIPE_BUF, 0);
printf("pwrite(count=PIPE_BUF,ofs=0) 0xf0 sret[%zd]\n", sret);
sret = pwrite(ffd, buf, PIPE_BUF, PIPE_BUF);
printf("pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0xf0 sret[%zd]\n", sret);

sret = read(pfds[0], buf, PIPE_BUF);
printf("read(from_pipe, count=PIPE_BUF) sret[%zd]\n", sret);

memset(buf2, 0x00, PIPE_BUF);
ret = memcmp(buf, buf2, PIPE_BUF);
printf("memcmp() at ofs=0, expecting 0x00 => ret[%d]\n", ret);
memset(buf2, 0xf0, PIPE_BUF);
ret = memcmp(buf, buf2, PIPE_BUF);
printf("memcmp() at ofs=0, checking for 0xf0 => ret[%d]\n", ret);

sret = read(pfds[0], buf, PIPE_BUF);
printf("read(from_pipe, count=PIPE_BUF) sret[%zd]\n", sret);

memset(buf2, 0x1f, PIPE_BUF);
ret = memcmp(buf, buf2, PIPE_BUF);
printf("memcmp() at ofs=PIPE_BUF, expecting 0x1f => ret[%d]\n", ret);
memset(buf2, 0xf0, PIPE_BUF);
ret = memcmp(buf, buf2, PIPE_BUF);
printf("memcmp() at ofs=PIPE_BUF, checking for 0xf0 => ret[%d]\n", ret);
return 0;
}


2023-02-09 14:11:52

by Matthew Wilcox

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Thu, Feb 09, 2023 at 02:55:59PM +0100, Stefan Metzmacher wrote:
> Hi Linus and others,
>
> as written in a private mail before, I'm currently trying to
> make use of IORING_OP_SPLICE in order to get zero copy support
> in Samba.

I have to ask why. In a modern network, isn't all data encrypted?
So you have to encrypt into a different buffer, and then you checksum
that buffer. So it doesn't matter if writes can change the page cache
after you called splice(), you just need to have the data be consistent
so the checksum doesn't change.

2023-02-09 14:29:49

by Stefan Metzmacher

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

Am 09.02.23 um 15:11 schrieb Matthew Wilcox:
> On Thu, Feb 09, 2023 at 02:55:59PM +0100, Stefan Metzmacher wrote:
>> Hi Linus and others,
>>
>> as written in a private mail before, I'm currently trying to
>> make use of IORING_OP_SPLICE in order to get zero copy support
>> in Samba.
>
> I have to ask why. In a modern network, isn't all data encrypted?

No people use plain connections for performance sensitive
workloads and have client and server in isolated vlans.

> So you have to encrypt into a different buffer, and then you checksum
> that buffer. So it doesn't matter if writes can change the page cache
> after you called splice(), you just need to have the data be consistent
> so the checksum doesn't change.

SMB offers checksuming (signing) only as well as authenticated
encryption.

For signing only I experimented with splice() in combination with
tee(), so that I can checksum the data after reading from tee,
while I can still splice() into the socket.

For encryption the async_memcpy flag to preadv2 could be usefull
if we keep using userspace encryption using gnutls.
If using the kernel crypto socket, we could also use splice to
add the file data into the crypto functions and the same problem
can happen, because some algorithms may encrypt and sign the data
in separate steps and it doesn't expect the data to be changed.

metze


2023-02-09 16:41:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

Adding Jens, because he's one of the main splice people. You do seem
to be stepping on his work ;)

Jens, see

https://lore.kernel.org/lkml/[email protected]

On Thu, Feb 9, 2023 at 5:56 AM Stefan Metzmacher <[email protected]> wrote:
>
> So we have two cases:
>
> 1. network -> socket -> splice -> pipe -> splice -> file -> storage
>
> 2. storage -> file -> splice -> pipe -> splice -> socket -> network
>
> With 1. I guess everything can work reliable [..]
>
> But with 2. there's a problem, as the pages from the file,
> which are spliced into the pipe are still shared without
> copy on write with the file(system).

Well, honestly, that's really the whole point of splice. It was
designed to be a way to share the storage data without having to go
through a copy.

> I'm wondering if there's a possible way out of this, maybe triggered by a new
> flag passed to splice.

Not really.

So basically, you cannot do "copy on write" on a page cache page,
because that breaks sharing.

You *want* the sharing to break, but that's because you're violating
what splice() was for, but think about all the cases where somebody is
just using mmap() and expects to see the file changes.

You also aren't thinking of the case where the page is already mapped
writably, and user processes may be changing the data at any time.

> I looked through the code and noticed the existence of IOMAP_F_SHARED.

Yeah, no. That's a hacky filesystem thing. It's not even a flag in
anything core like 'struct page', it's just entirely internal to the
filesystem itself.

> Is there any other way we could archive something like this?

I suspect you simply want to copy it at splice time, rather than push
the page itself into the pipe as we do in copy_page_to_iter_pipe().

Because the whole point of zero-copy really is that zero copy. And the
whole point of splice() was to *not* complicate the rest of the system
over-much, while allowing special cases.

Linux is not the heap of bad ideas that is Hurd that does various
versioning etc, and that made copy-on-write a first-class citizen
because it uses the concept of "immutable mapped data" for reads and
writes.

Now, I do see a couple of possible alternatives to "just create a stable copy".

For example, we very much have the notion of "confirm buffer data
before copying". It's used for things like "I started the IO on the
page, but the IO failed with an error, so even though I gave you a
splice buffer, it turns out you can't use it".

And I do wonder if we could introduce a notion of "optimistic splice",
where the splice works exactly the way it does now (you get a page
reference), but the "confirm" phase could check whether something has
changed in that mapping (using the file versioning or whatever - I'm
hand-waving) and simply fail the confirm.

That would mean that the "splice to socket" part would fail in your
chain, and you'd have to re-try it. But then the onus would be on
*you* as a splicer, not on the rest of the system to fix up your
special case.

That idea sounds fairly far out there, and complicated and maybe not
usable. So I'm just throwing it out as a "let's try to think of
alternative solutions".

Anybody?

Linus

2023-02-09 19:17:18

by Stefan Metzmacher

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

Hi Linus,

> Adding Jens, because he's one of the main splice people. You do seem
> to be stepping on his work ;)
>
> Jens, see
>
> https://lore.kernel.org/lkml/[email protected]

Ok, thanks! Maybe Jens should apear in the output of:

scripts/get_maintainer.pl fs/splice.c

> On Thu, Feb 9, 2023 at 5:56 AM Stefan Metzmacher <[email protected]> wrote:
>>
>> So we have two cases:
>>
>> 1. network -> socket -> splice -> pipe -> splice -> file -> storage
>>
>> 2. storage -> file -> splice -> pipe -> splice -> socket -> network
>>
>> With 1. I guess everything can work reliable [..]
>>
>> But with 2. there's a problem, as the pages from the file,
>> which are spliced into the pipe are still shared without
>> copy on write with the file(system).
>
> Well, honestly, that's really the whole point of splice. It was
> designed to be a way to share the storage data without having to go
> through a copy.


>> I'm wondering if there's a possible way out of this, maybe triggered by a new
>> flag passed to splice.
>
> Not really.
>
> So basically, you cannot do "copy on write" on a page cache page,
> because that breaks sharing.
>
> You *want* the sharing to break, but that's because you're violating
> what splice() was for, but think about all the cases where somebody is
> just using mmap() and expects to see the file changes.
>
> You also aren't thinking of the case where the page is already mapped
> writably, and user processes may be changing the data at any time.

I do because we're using that in our tdb library, but I hoped there would be
a way out...

>> I looked through the code and noticed the existence of IOMAP_F_SHARED.
>
> Yeah, no. That's a hacky filesystem thing. It's not even a flag in
> anything core like 'struct page', it's just entirely internal to the
> filesystem itself.

Ok, I guess it's used for shared blocks in the filesystems,
in order to support things like cow support in order to allow
snapshots, correct?

>> Is there any other way we could archive something like this?
>
> I suspect you simply want to copy it at splice time, rather than push
> the page itself into the pipe as we do in copy_page_to_iter_pipe().
>
> Because the whole point of zero-copy really is that zero copy. And the
> whole point of splice() was to *not* complicate the rest of the system
> over-much, while allowing special cases.
>
> Linux is not the heap of bad ideas that is Hurd that does various
> versioning etc, and that made copy-on-write a first-class citizen
> because it uses the concept of "immutable mapped data" for reads and
> writes.

Ok, thanks very much for the detailed feedback!

> Now, I do see a couple of possible alternatives to "just create a stable copy".
>
> For example, we very much have the notion of "confirm buffer data
> before copying". It's used for things like "I started the IO on the
> page, but the IO failed with an error, so even though I gave you a
> splice buffer, it turns out you can't use it".
>
> And I do wonder if we could introduce a notion of "optimistic splice",
> where the splice works exactly the way it does now (you get a page
> reference), but the "confirm" phase could check whether something has
> changed in that mapping (using the file versioning or whatever - I'm
> hand-waving) and simply fail the confirm.
>
> That would mean that the "splice to socket" part would fail in your
> chain, and you'd have to re-try it. But then the onus would be on
> *you* as a splicer, not on the rest of the system to fix up your
> special case.
>
> That idea sounds fairly far out there, and complicated and maybe not
> usable. So I'm just throwing it out as a "let's try to think of
> alternative solutions".

That sounds complicated and still racy.

Any comment about the idea of having a preadv2() flag that
asks for a dma copy with something like async_memcpy() instead
of the default that ends up in copy_user_enhanced_fast_string()?
If that would be possible, a similar flag would also be possible
for splice() in order to dma copy the pages into the pipe.

metze

2023-02-09 19:36:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Thu, Feb 9, 2023 at 11:17 AM Stefan Metzmacher <[email protected]> wrote:
>
> Any comment about the idea of having a preadv2() flag that
> asks for a dma copy with something like async_memcpy() instead
> of the default that ends up in copy_user_enhanced_fast_string()?

I guarantee that you will only slow things down with some odd async_memcpy.

There are zero DMA engines that do memory copying better than memcpy -
the only way you can do better if is the destination isn't memory in
the first place but the final device, and then we just call it "dma".

Linus

2023-02-09 19:49:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Thu, Feb 9, 2023 at 11:36 AM Linus Torvalds
<[email protected]> wrote:
>
> I guarantee that you will only slow things down with some odd async_memcpy.

Extended note: even if the copies themselves would then be done
concurrently with other work (so "not faster, but more parallel"), the
synchronization required at the end would then end up being costly
enough to eat up any possible win. Plus you'd still end up with a
fundamental problem of "what if the data changes in the meantime".

And that's ignoring all the practical problems of actually starting
the async copy, which traditionally requires virtual to physical
translations (where "physical" is whatever the DMA address space is).

So I don't think there are any actual real cases of async memory copy
engines being even _remotely_ better than memcpy outside of
microcontrollers (and historical computers before caches - people may
still remember things like the Amiga blitter fondly).

Again, the exception ends up being if you can actually use real DMA to
not do a memory copy, but to transfer data directly to or from the
device. That's in some way what 'splice()' is designed to allow you to
do, but exactly by the pipe part ending up being the "conceptual
buffer" for the zero-copy pages.

So this is exactly *why* splicing from a file all the way to the
network will then show any file changes that have happened in between
that "splice started" and "network card got the data". You're supposed
to use splice only when you can guarantee the data stability (or,
alternatively, when you simply don't care about the data stability,
and getting the changed data is perfectly fine).

Linus

2023-02-09 20:33:15

by Jeremy Allison

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Thu, Feb 09, 2023 at 11:48:35AM -0800, Linus Torvalds via samba-technical wrote:
>
>So this is exactly *why* splicing from a file all the way to the
>network will then show any file changes that have happened in between
>that "splice started" and "network card got the data". You're supposed
>to use splice only when you can guarantee the data stability (or,
>alternatively, when you simply don't care about the data stability,
>and getting the changed data is perfectly fine).

Metze, correct me if I'm wrong but isn't this exactly the "file
is leased in SMB3" case ?

We already know if a file is leased, and so only use the splice calls
for I/O in that case, and fall back to the slower calls in the
non-leased case.

2023-02-10 02:16:16

by Dave Chinner

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Thu, Feb 09, 2023 at 08:41:02AM -0800, Linus Torvalds wrote:
> Adding Jens, because he's one of the main splice people. You do seem
> to be stepping on his work ;)
>
> Jens, see
>
> https://lore.kernel.org/lkml/[email protected]
>
> On Thu, Feb 9, 2023 at 5:56 AM Stefan Metzmacher <[email protected]> wrote:
> >
> > So we have two cases:
> >
> > 1. network -> socket -> splice -> pipe -> splice -> file -> storage
> >
> > 2. storage -> file -> splice -> pipe -> splice -> socket -> network
> >
> > With 1. I guess everything can work reliable [..]
> >
> > But with 2. there's a problem, as the pages from the file,
> > which are spliced into the pipe are still shared without
> > copy on write with the file(system).
>
> Well, honestly, that's really the whole point of splice. It was
> designed to be a way to share the storage data without having to go
> through a copy.
>
> > I'm wondering if there's a possible way out of this, maybe triggered by a new
> > flag passed to splice.
>
> Not really.
>
> So basically, you cannot do "copy on write" on a page cache page,
> because that breaks sharing.
>
> You *want* the sharing to break, but that's because you're violating
> what splice() was for, but think about all the cases where somebody is
> just using mmap() and expects to see the file changes.
>
> You also aren't thinking of the case where the page is already mapped
> writably, and user processes may be changing the data at any time.
>
> > I looked through the code and noticed the existence of IOMAP_F_SHARED.
>
> Yeah, no. That's a hacky filesystem thing. It's not even a flag in
> anything core like 'struct page', it's just entirely internal to the
> filesystem itself.

It's the mechanism that the filesystem uses to tell the generic
write IO path that the filesystem needs to allocate a new COW extent
in the backing store because it can't write to the original extent.
i.e. it's not allowed to overwrite in place.

It's no different to the VM_SHARED flag in the vma so the generic
page fault path knows if it has to allocate a new COW page to take
place on a write fault because it can't write to the original page.
i.e. it's not allowed to overwrite in place.

So by the same measure, VM_SHARED is a "hacky mm thing". It's not
even a flag in anything core like 'struct page', it's just entirely
internal to the mm subsystem itself.

COW is COW is COW no matter which layer implements. :/

> > Is there any other way we could archive something like this?
>
> I suspect you simply want to copy it at splice time, rather than push
> the page itself into the pipe as we do in copy_page_to_iter_pipe().
>
> Because the whole point of zero-copy really is that zero copy. And the
> whole point of splice() was to *not* complicate the rest of the system
> over-much, while allowing special cases.
>
> Linux is not the heap of bad ideas that is Hurd that does various
> versioning etc, and that made copy-on-write a first-class citizen
> because it uses the concept of "immutable mapped data" for reads and
> writes.
>
> Now, I do see a couple of possible alternatives to "just create a stable copy".
>
> For example, we very much have the notion of "confirm buffer data
> before copying". It's used for things like "I started the IO on the
> page, but the IO failed with an error, so even though I gave you a
> splice buffer, it turns out you can't use it".
>
> And I do wonder if we could introduce a notion of "optimistic splice",
> where the splice works exactly the way it does now (you get a page
> reference), but the "confirm" phase could check whether something has
> changed in that mapping (using the file versioning or whatever - I'm
> hand-waving) and simply fail the confirm.
>
> That would mean that the "splice to socket" part would fail in your
> chain, and you'd have to re-try it. But then the onus would be on
> *you* as a splicer, not on the rest of the system to fix up your
> special case.
>
> That idea sounds fairly far out there, and complicated and maybe not
> usable. So I'm just throwing it out as a "let's try to think of
> alternative solutions".

Oh, that's sounds like an exact analogy to the new IOMAP_F_STALE
flag and the validity cookie we have in the iomap write path code.
The iomap contains cached, unserialised information, and the
filesystem side mapping it is derived from can change asynchronously
(e.g. by IO completion doing unwritten extent conversion). Hence the
cached iomap can become stale, and that's a data corruption vector.

The validity cookie is created when the iomap is built, and it is
passed to a filesystem callback when a folio is locked for copy-in.
This allows the IO path to detect that the filesystem side extent
map has changed during the write() operations before we modify the
contents of the folio. It is done under the locked folio so that the
validation is atomic w.r.t. the modification to the folio contents
we are about to perform.

On detection of a cookie mismatch, the write operation then sets the
IOMAP_F_STALE flag, backs out of the write to that page and ends the
write to the iomap. The iomap infrastructure then remaps the file
range from the offset of the folio at which the iomap change was
detected. The write the proceeds with the new, up to date iomap....

We have had a similar "is the cached iomap still valid?" mechanism
on the writeback side of the page cache for years. The details are
slightly different, though I plan to move that code to use the same
IOMAP_F_STALE infrastructure in the near future because it
simplifies the writeback context wrapper shenanigans an awful lot.
And it helps make it explicit that iomaps are cached/shadowed
state, not the canonical source of reality.

Applying the same principle it to multiply referenced cached page
contents will be more complex. I suspect we might be able to
leverage inode->i_version or ctime as the "data changed" cookie as
they are both supposed to change on every explicit user data
modification made to an inode. However, I think most of the
complexity would be in requiring spliced pages to travel in some
kind of container that holds the necessary verification
information....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-02-10 04:06:36

by Dave Chinner

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 01:16:03PM +1100, Dave Chinner wrote:
> On Thu, Feb 09, 2023 at 08:41:02AM -0800, Linus Torvalds wrote:
> > Now, I do see a couple of possible alternatives to "just create a stable copy".
> >
> > For example, we very much have the notion of "confirm buffer data
> > before copying". It's used for things like "I started the IO on the
> > page, but the IO failed with an error, so even though I gave you a
> > splice buffer, it turns out you can't use it".
> >
> > And I do wonder if we could introduce a notion of "optimistic splice",
> > where the splice works exactly the way it does now (you get a page
> > reference), but the "confirm" phase could check whether something has
> > changed in that mapping (using the file versioning or whatever - I'm
> > hand-waving) and simply fail the confirm.
> >
> > That would mean that the "splice to socket" part would fail in your
> > chain, and you'd have to re-try it. But then the onus would be on
> > *you* as a splicer, not on the rest of the system to fix up your
> > special case.
> >
> > That idea sounds fairly far out there, and complicated and maybe not
> > usable. So I'm just throwing it out as a "let's try to think of
> > alternative solutions".
>
> Oh, that's sounds like an exact analogy to the new IOMAP_F_STALE
> flag and the validity cookie we have in the iomap write path code.
> The iomap contains cached, unserialised information, and the
> filesystem side mapping it is derived from can change asynchronously
> (e.g. by IO completion doing unwritten extent conversion). Hence the
> cached iomap can become stale, and that's a data corruption vector.
>
> The validity cookie is created when the iomap is built, and it is
> passed to a filesystem callback when a folio is locked for copy-in.
> This allows the IO path to detect that the filesystem side extent
> map has changed during the write() operations before we modify the
> contents of the folio. It is done under the locked folio so that the
> validation is atomic w.r.t. the modification to the folio contents
> we are about to perform.
>
> On detection of a cookie mismatch, the write operation then sets the
> IOMAP_F_STALE flag, backs out of the write to that page and ends the
> write to the iomap. The iomap infrastructure then remaps the file
> range from the offset of the folio at which the iomap change was
> detected. The write the proceeds with the new, up to date iomap....
>
> We have had a similar "is the cached iomap still valid?" mechanism
> on the writeback side of the page cache for years. The details are
> slightly different, though I plan to move that code to use the same
> IOMAP_F_STALE infrastructure in the near future because it
> simplifies the writeback context wrapper shenanigans an awful lot.
> And it helps make it explicit that iomaps are cached/shadowed
> state, not the canonical source of reality.
>
> Applying the same principle it to multiply referenced cached page
> contents will be more complex. I suspect we might be able to
> leverage inode->i_version or ctime as the "data changed" cookie as
> they are both supposed to change on every explicit user data
> modification made to an inode. However, I think most of the
> complexity would be in requiring spliced pages to travel in some
> kind of container that holds the necessary verification
> information....

So while I was pondering the complexity of this and watching a great
big shiny rocket create lots of heat, light and noise, it occurred
to me that we already have a mechanism for preventing page cache
data from being changed while the folios are under IO:
SB_I_STABLE_WRITES and folio_wait_stable().

That is, md/dm RAID5/6 devices require the folio contents to be
stable during writeback to calculate parity during IO submission,
and folio_wait_stable() is the mechanism that guarantees pages don't
get changed while they are under IO.

So instead of the complex dance described above to detect data
changes at the splice destination, we simply have splice add
temporary folio state that folio_wait_stable() blocks on to prevents
new user data modification on that folio from occurring until the
splice reference and stable state flag goes away.

Whilst this would work for folios that are currently clean, I
suspect it won't work properly with folios that are already mapped
and dirty because there won't be another page fault to trap the data
being changed. i.e. no folio_clear_dirty_for_io() call has been
made on the page to trigger new write faults. Perhaps a variant that
just propagates the pte dirty bit into the folio and clears it (i.e.
first part of folio_clear_dirty_for_io()) would be sufficient in
this case.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-02-10 04:45:06

by Matthew Wilcox

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 03:06:26PM +1100, Dave Chinner wrote:
> So while I was pondering the complexity of this and watching a great
> big shiny rocket create lots of heat, light and noise, it occurred

That was kind of fun

> to me that we already have a mechanism for preventing page cache
> data from being changed while the folios are under IO:
> SB_I_STABLE_WRITES and folio_wait_stable().

I thought about bringing that up, but it's not quite right. That works
great for writeback, but it only works for writeback. We'd need to track
another per-folio counter ... it'd be like the page pinning kerfuffle,
only worse. And for such a rare thing it seems like a poor use of 32
bits of per-page state. Not to mention that you can effectively block
all writes to a file for an indefinite time by splicing pages to a pipe
that you then never read from.


2023-02-10 04:47:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Thu, Feb 9, 2023 at 8:06 PM Dave Chinner <[email protected]> wrote:
>>
> So while I was pondering the complexity of this and watching a great
> big shiny rocket create lots of heat, light and noise, it occurred
> to me that we already have a mechanism for preventing page cache
> data from being changed while the folios are under IO:
> SB_I_STABLE_WRITES and folio_wait_stable().

No, Dave. Not at all.

Stop and think.

splice() is not some "while under IO" thing. It's *UNBOUNDED*.

Let me just give an example: random user A does

fd = open("/dev/passwd", O_RDONLY);
splice(fd, NULL, pipefd, NULL, ..);
sleep(10000);

and you now want to block all writes to the page in that file as long
as it's being held on to, do you?

So no.

The above is also why something like IOMAP_F_SHARED is not relevant.
The whole point of splice is to act as a way to communicate pages
between *DIFFERENT* subsystems. The only thing they have in common is
the buffer (typically a page reference, but it can be other things)
that is getting transferred.

A spliced page - by definition - is not in some controlled state where
one filesystem (or one subsystem like networking) owns it, because the
whole and only point of splice is to act as that "take data from one
random source and feed it in to another random destination", and avoid
the N*M complexity matrix of N sources and M destinations.

So no. We cannot block writes, because there is no bounded time for
them. And no, we cannot use some kind of "mark this IO as shared",
because there is no "this IO".

It is also worth noting that the shared behavior (ie "splice acts as a
temporary shared buffer") might even be something that people actually
expect and depend on for semantics. I hope not, but it's not entirely
impossible that people change the source (could be file data for the
above "splice from file" case, but could be your own virtual memory
image for "vmsplice()") _after_ splicing the source, and before
splicing it to the destination.

(It sounds very unlikely that people would do that for file contents,
but vmsplice() might intentionally use buffers that may be "live").

Now, to be honest, I hope nobody plays games like that. In fact, I'm a
bit sorry that I was pushing splice() in the first place. Playing
games with zero-copy tends to cause problems, and we've had some nasty
security issues in this area too.

Now, splice() is a *lot* cleaner conceptually than "sendfile()" ever
was, exactly because it allows that whole "many different sources,
many different destinations" model. But this is also very much an
example of how "generic" may be something that is revered in computer
science, but is often a *horrible* thing in reality.

Because if this was just "sendfile()", and would be one operation that
moves file contents from the page cache to the network buffers, then
your idea of "prevent data from being changed while in transit" would
actually be valid.

Special cases are often much simpler and easier, and sometimes the
special cases are all you actually want.

Splice() is not a special case. It's sadly a very interesting and very
generic model for sharing buffers, and that does cause very real
problems.

Linus

2023-02-10 06:20:01

by Dave Chinner

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Thu, Feb 09, 2023 at 08:47:07PM -0800, Linus Torvalds wrote:
> On Thu, Feb 9, 2023 at 8:06 PM Dave Chinner <[email protected]> wrote:
> >>
> > So while I was pondering the complexity of this and watching a great
> > big shiny rocket create lots of heat, light and noise, it occurred
> > to me that we already have a mechanism for preventing page cache
> > data from being changed while the folios are under IO:
> > SB_I_STABLE_WRITES and folio_wait_stable().
>
> No, Dave. Not at all.
>
> Stop and think.

I have.

> splice() is not some "while under IO" thing. It's *UNBOUNDED*.

Splice has two sides - a source where we splice to the transport
pipe, then a destination where we splice pages from the transport
pipe. For better or worse, time in the transport pipe is unbounded,
but that does not mean the srouce or destination have unbound
processing times.

However, transport times being unbound are largely irrelevant, and
miss the fact that the application does not require pages in transit
to be stable.

The application we are talking about here is file -> pipe -> network
stack for zero copy sending of static file data and the problem is
that the file pages are not stable whilst they are under IO in the
network stack.

IOWs, the application does not care if the data changes whilst they
are in transport attached to the pipe - it only cares that the
contents are stable once they have been delivered and are now wholly
owned by the network stack IO path so that the OTW encodings
(checksum, encryption, whatever) done within the network IO path
don't get compromised.

i.e. the file pages only need to be stable whilst the network stack
IO path checksums and DMAs the data to the network hardware.

That's exactly the same IO context that the block device stack
requires the page contents to be stable - across parity/checksum
calculations and the subsequent DMA transfers to the storage
hardware.

I'm suggesting that the page should only need to be held stable
whilst it is under IO, whether that IO is in the network stack via
skbs or in the block device stack via bios. Both network and block
IO are bounded by fixed time limits, both IO paths typically only
need pages held stable for a few milliseconds at a time, and both
have worst case IO times in error situations are typically bound at
a few minutes.

IOWs, splice is a complete misdirection here - it doesn't need to
know a thing about stable data requirements at all. It's the
destination processing that requires stable data, not the transport
mechanism.

Hence if we have a generic mechanism that the network stack can use
to detect a file backed page and mark it needing to be stable whilst
the network stack is doing IO on it, everything on the filesystem
side should just work like it does for pages under IO in the block
device stack...

Indeed, I suspect that a filesystem -> pipe -> filesystem zero copy
path via splice probably also needs stable source pages for some
filesystems, in which case we need exactly the same mechanism as
we need for stable pages in the network stack zero copy splice
destiantion path....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-02-10 06:58:19

by Dave Chinner

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 04:44:41AM +0000, Matthew Wilcox wrote:
> On Fri, Feb 10, 2023 at 03:06:26PM +1100, Dave Chinner wrote:
> > So while I was pondering the complexity of this and watching a great
> > big shiny rocket create lots of heat, light and noise, it occurred
>
> That was kind of fun

:)

> > to me that we already have a mechanism for preventing page cache
> > data from being changed while the folios are under IO:
> > SB_I_STABLE_WRITES and folio_wait_stable().
>
> I thought about bringing that up, but it's not quite right. That works
> great for writeback, but it only works for writeback. We'd need to track
> another per-folio counter ... it'd be like the page pinning kerfuffle,
> only worse.

Hmmm - I didn't think of that. It needs the counter because the
"stable request" is per folio reference state, not per folio state,
right? And the single flag works for writeback because we can only
have one writeback context in progress at a time?

Yeah, not sure how to deal with that easily.

> And for such a rare thing it seems like a poor use of 32
> bits of per-page state.

Maybe, but zero copy file data -> network send is a pretty common
workload. Web servers, file servers, remote backup programs, etc.
Having files being written whilst others are reading them is not as
common, but that does happen in a wide variety of shared file server
environments.

Regardless, I just had a couple of ideas - it they don't work so be
it.

> Not to mention that you can effectively block
> all writes to a file for an indefinite time by splicing pages to a pipe
> that you then never read from.

No, I wasn't suggesting that we make pages in transit stable - they
only need to be stable while the network stack requires them to be
stable....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-02-10 15:15:21

by Andy Lutomirski

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Thu, Feb 9, 2023 at 10:58 PM Dave Chinner <[email protected]> wrote:
>
> On Fri, Feb 10, 2023 at 04:44:41AM +0000, Matthew Wilcox wrote:
> > On Fri, Feb 10, 2023 at 03:06:26PM +1100, Dave Chinner wrote:
> > > So while I was pondering the complexity of this and watching a great
> > > big shiny rocket create lots of heat, light and noise, it occurred
> >
> > That was kind of fun
>
> :)
>
> > > to me that we already have a mechanism for preventing page cache
> > > data from being changed while the folios are under IO:
> > > SB_I_STABLE_WRITES and folio_wait_stable().
> >
> > I thought about bringing that up, but it's not quite right. That works
> > great for writeback, but it only works for writeback. We'd need to track
> > another per-folio counter ... it'd be like the page pinning kerfuffle,
> > only worse.
>
> Hmmm - I didn't think of that. It needs the counter because the
> "stable request" is per folio reference state, not per folio state,
> right? And the single flag works for writeback because we can only
> have one writeback context in progress at a time?
>
> Yeah, not sure how to deal with that easily.
>
> > And for such a rare thing it seems like a poor use of 32
> > bits of per-page state.
>
> Maybe, but zero copy file data -> network send is a pretty common
> workload. Web servers, file servers, remote backup programs, etc.
> Having files being written whilst others are reading them is not as
> common, but that does happen in a wide variety of shared file server
> environments.
>
> Regardless, I just had a couple of ideas - it they don't work so be
> it.
>
> > Not to mention that you can effectively block
> > all writes to a file for an indefinite time by splicing pages to a pipe
> > that you then never read from.
>
> No, I wasn't suggesting that we make pages in transit stable - they
> only need to be stable while the network stack requires them to be
> stable....

This is exactly where the existing splice API is problematic. You
can't splice from a file to a network socket right now. First you
splice to a pipe, and now that pipe contains some magical stuff. And
it stays there potentially forever. Then you splice it again to a
socket.

Would this be better if user code could splice straight to a socket?
At least in principle, there could be a _limited_ amount of time
during which anything needs to wait, and it's fundamentally entirely
reasonable if a concurrent write to a file affects data being
zero-copied to a socket _during the time after the zero-copy send is
requested and before it reports completion_.

Frankly, I really don't like having non-immutable data in a pipe. A
pipe is supposed to be a thing into which bytes are written and out
from which the *same* bytes emerge, at least to the extent that anyone
can observe it. Do we really want:

$ some_program | tee file21 > file2

to potentially write different data to file1 and file2?

--Andy

2023-02-10 16:34:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 7:15 AM Andy Lutomirski <[email protected]> wrote:
>
> Frankly, I really don't like having non-immutable data in a pipe.

That statement is completely nonsensical.

A pipe is a kernel buffer. If you want the buffer to be immutable, you
do "read()" and "write()" on it. End of story.

In contrast, "splice()" is literally the "map" operation. It's
"mmap()", without the "m", because it turns out that memory mapping
has a few problems:

(a) mmap fundamentally only works on a page granularity

(b) mmap has huge setup and teardown costs with page tables and TLB's

and so splice() is just basically "strange mmap with that kernel
buffer that is pipe"

Really. That is what spice *is*. There's absolutely no question about it.

It has some advantages over mmap, in that because it's a software
mapping, we can "map" partial pages, which means that you can also map
data that might not be in full pages (it might be an incoming network
buffer, for example).

It has a lot of disadvantages over mmap too, of course. Not using
hardware means that it doesn't really show up as a linear mapping, and
you don't get the nice hardware lookup accelerations - but it also
means that you don't have the downsides (ie TLB costs etc).

So I'm not saying that it is the same thing as "mmap", but I very much
_am_ saying that there's a very real and direct similarity. There are
three fundamental IO models in Unix: read, write, and mmap. And mmap()
is very much a "you get a direct window into something that can change
under you". And splice() is exactly that same thing.

It was literally designed to be "look, we want zero-copy networking,
and we could do 'sendfile()' by mmap'ing the file, but mmap - and
particularly munmap - is too expensive, so we map things into kernel
buffers instead".

So saying "I really don't like having non-immutable data in a pipe" is
complete nonsense. It's syntactically correct English, but it makes no
conceptual sense.

You can say "I don't like 'splice()'". That's fine. I used to think
splice was a really cool concept, but I kind of hate it these days.
Not liking splice() makes a ton of sense.

But given splice, saying "I don't like non-immutable data" really is
complete nonsense.

If you want a stable buffer, use read() and write(). It's that simple.
If you want to send data from a file to the network, and want a stable
buffer in between the two, then "read()" and "write()" is *exactly*
what you should do.

With read and write, there's no mmap()/munmap() overhead of the file,
and you already have the buffer (iwe call it "user address space").
The only downside is the extra copy.

So if you want to send stable, unchanging file contents to the
network, there is absolutely *no* reason to ever involve a pipe at
all, and you should entirely ignore splice().

The *only* reason to ever use splice() is because you don't want to
copy data, and just want a reference to it, and want to keep it all in
kernel space, because the kernel<->user boundary ends up either
requiring copies, or that page alignment, and is generally fairly
expensive.

But once you decide to go that way, you need to understand that you
don't have "immutable data". You asked for a reference, you got a
reference, and it *will* change.

That's not something specific to "splice()". It's fundamental to the
whole *concept* of zero-copy. If you don't want copies, and the source
file changes, then you see those changes.

So saying "I really don't like having non-immutable data in a pipe"
really is nonsense. Don't use splice. Or do, and realize that it's a
*mapping*.

Because that is literally the whole - and ONLY - reason for splice in
the first place.

Linus

2023-02-10 17:24:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Thu, Feb 9, 2023 at 10:19 PM Dave Chinner <[email protected]> wrote:
>
> Splice has two sides - a source where we splice to the transport
> pipe, then a destination where we splice pages from the transport
> pipe. For better or worse, time in the transport pipe is unbounded,
> but that does not mean the srouce or destination have unbound
> processing times.

Well, they are possibly fairly unbounded too - think things like
network packet re-send timeouts etc.

So the data lifetime - even just on just one side - can _easily_ be
"multiple seconds" even when things are normal, and if you have actual
network connectivity issues we are easily talking minutes.

So I don't think a scheme based on locking works even for just the
"one side" operations - at least in the general case.

That said, I wasn't really serious about my "retry" model either.
Maybe it could be made to work, but it sounds messy.

And when it comes to networking, in general things like TCP checksums
etc should be ok even with data that isn't stable. When doing things
by hand, networking should always use the "copy-and-checksum"
functions that do the checksum while copying (so even if the source
data changes, the checksum is going to be the checksum for the data
that was copied).

And in many (most?) smarter network cards, the card itself does the
checksum, again on the data as it is transferred from memory.

So it's not like "networking needs a stable source" is some really
_fundamental_ requirement for things like that to work.

But it may well be that we have situations where some network driver
does the checksumming separately from then copying the data.

Linus

2023-02-10 17:47:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 9:23 AM Linus Torvalds
<[email protected]> wrote:
>
> And when it comes to networking, in general things like TCP checksums
> etc should be ok even with data that isn't stable. When doing things
> by hand, networking should always use the "copy-and-checksum"
> functions that do the checksum while copying (so even if the source
> data changes, the checksum is going to be the checksum for the data
> that was copied).
>
> And in many (most?) smarter network cards, the card itself does the
> checksum, again on the data as it is transferred from memory.
>
> So it's not like "networking needs a stable source" is some really
> _fundamental_ requirement for things like that to work.
>
> But it may well be that we have situations where some network driver
> does the checksumming separately from then copying the data.

Ok, so I decided to try to take a look.

Somebody who actually does networking (and drivers in particular)
should probably check this, but it *looks* like the IPv4 TCP case
(just to pick the ony I looked at) gores through
tcp_sendpage_locked(), which does

if (!(sk->sk_route_caps & NETIF_F_SG))
return sock_no_sendpage_locked(sk, page, offset, size, flags);

which basically says "if you can't handle fragmented socket buffers,
do that 'no_sendpage' case".

So that will basically end up just falling back to a kernel
'sendmsg()', which does a copy and then it's stable.

But for the networks that *can* handle fragmented socket buffers, it
then calls do_tcp_sendpages() instead, which just creates a skb
fragment of the page (with tcp_build_frag()).

I wonder if that case should just require NETIF_F_HW_CSUM?

Linus

2023-02-10 17:57:38

by Andy Lutomirski

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 8:34 AM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, Feb 10, 2023 at 7:15 AM Andy Lutomirski <[email protected]> wrote:
> >
> > Frankly, I really don't like having non-immutable data in a pipe.
>
> That statement is completely nonsensical.

I know what splice() is. I'm trying to make the point that it may not
be the right API for most (all?) of its use cases, that we can maybe
do better, and that we should maybe even consider deprecating (and
simplifying and the cost of performance) splice in the moderately near
future. And I think I agree with you on most of what you're saying.

> It was literally designed to be "look, we want zero-copy networking,
> and we could do 'sendfile()' by mmap'ing the file, but mmap - and
> particularly munmap - is too expensive, so we map things into kernel
> buffers instead".

Indeed. mmap() + sendfile() + munmap() is extraordinarily expensive
and is not the right solution to zero-copy networking.

>
> So saying "I really don't like having non-immutable data in a pipe" is
> complete nonsense. It's syntactically correct English, but it makes no
> conceptual sense.
>
> You can say "I don't like 'splice()'". That's fine. I used to think
> splice was a really cool concept, but I kind of hate it these days.
> Not liking splice() makes a ton of sense.
>
> But given splice, saying "I don't like non-immutable data" really is
> complete nonsense.

I am saying exactly what I meant. Obviously mutable data exists. I'm
saying that *putting it in a pipe* *while it's still mutable* is not
good. Which implies that I don't think splice() is good. No offense.

I am *not* saying that the mere existence of mutable data is a problem.

> That's not something specific to "splice()". It's fundamental to the
> whole *concept* of zero-copy. If you don't want copies, and the source
> file changes, then you see those changes.

Of course! A user program copying data from a file to a network
fundamentally does this:

Step 1: start the process.
Step 2: data goes out to the actual wire or a buffer on the NIC or is
otherwise in a place other than page cache, and the kernel reports
completion.

There are many ways to make this happen. Step 1 could be starting
read() and step 2 could be send() returning. Step 1 could be be
sticking something in an io_uring queue and step 2 could be reporting
completion. Step 1 could be splice()ing to a pipe and step 2 could be
a splice from the pipe to a socket completing (and maybe even later
when the data actually goes out).

*Obviously* any change to the file between steps 1 and 2 may change
the data that goes out the wire.

> So the data lifetime - even just on just one side - can _easily_ be
> "multiple seconds" even when things are normal, and if you have actual
> network connectivity issues we are easily talking minutes.

True.

But splice is extra nasty: step 1 happens potentially arbitrarily long
before step 2, and the kernel doesn't even know which socket the data
is destined for in step 1. So step 1 can't usefully return
-EWOULDBLOCK, for example. And it's awkward for the kernel to report
errors, because steps 1 and 2 are so disconnected. And I'm not
convinced there's any corresponding benefit.


In any case, maybe io_uring gives an opportunity to do much better.
io_uring makes it *efficient* for largish numbers of long-running
operations to all be pending at once. Would an API like this work
better (very handwavy -- I make absolutely no promises that this is
compatible with existing users -- new opcodes might be needed):

Submit IORING_OP_SPLICE from a *file* to a socket: this tells the
kernel to kindly send data from the file in question to the network.
Writes to the file before submission will be reflected in the data
sent. Writes after submission may or may not be reflected. (This is
step 1 above.)

The operation completes (and is reported in the CQ) only after the
kernel knows that the data has been snapshotted (step 2 above). So
completion can be reported when the data is DMAed out or when it's
checksummed-and-copied or if the kernel decides to copy it for any
other reason *and* the kernel knows that it won't need to read the
data again for possible retransmission. As you said, this could
easily take minutes, but that seems maybe okay to me.

(And if Samba needs to make sure that future writes don't change the
outgoing data even two seconds later when the data has been sent but
not acked, then maybe a fancy API could be added to help, or maybe
Samba shouldn't be using zero copy IO in the first place!)

If the file is truncated or some other problem happens, the operation can fail.


I don't know how easy or hard this is to implement, but it seems like
it would be quite pleasant to *use* from user code, it ought to be
even faster than splice-to-pipe-then-splice-to-socket (simply because
there is less bookkeeping), and it doesn't seem like any file change
tracking would be needed in the kernel.


If this works and becomes popular enough, splice-from-file-to-pipe
could *maybe* be replaced in the kernel with a plain copy.

--Andy

2023-02-10 18:20:13

by Jeremy Allison

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 09:57:20AM -0800, Andy Lutomirski via samba-technical wrote:
>
>(And if Samba needs to make sure that future writes don't change the
>outgoing data even two seconds later when the data has been sent but
>not acked, then maybe a fancy API could be added to help, or maybe
>Samba shouldn't be using zero copy IO in the first place!)

Samba doesn't need any of this. The simplest thing to do is
to restrict splice-based zero-copy IO to files leased by
a single client, where exclusive access to changes is controled
by the client redirector.

2023-02-10 18:37:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 9:57 AM Andy Lutomirski <[email protected]> wrote:
>
> I am saying exactly what I meant. Obviously mutable data exists. I'm
> saying that *putting it in a pipe* *while it's still mutable* is not
> good. Which implies that I don't think splice() is good. No offense.

No offense at all. As mentioned, I have grown to detest splice over the years.

That said, in defense of splice(), it really does solve a lot of
conceptual problems.

And I still think that conceptually it's absolutely lovely in *theory*.

And part of it is very much the fact that pipes are useful and have
the infrastructure for other things. So you can mix regular read/write
calls with splice, and it actually makes sense. One of the design
goals was for things like static http, where you don't really send out
just file contents, there's a whole header to it as well.

So it's not just a specialized "send file contents to network", it's a
"you can do a write() call to start filling the pipe buffer with the
http header, then a splice() to start filling the file data".

And it was also designed to allow other sources, notably things like
video capture cards etc. And very much multiple destinations (again,
media accelerators).

So it all "makes sense" conceptually as a generic pipe (sic) between
different sources and sinks. And again, using a pipe as the mechanism
then also makes perfect sense in a historical Unix context of
"everything is a pipe".

But.

The above just tries to make sense of the design, and excuses for it.
I want to re-iterate that I think it's all lovely and coherent
conceptually. But in practice, it's just a huge pain.

The same way "everything is a pipeline of processes" is very much
historical Unix and very useful for shell scripting, but isn't
actually then normally very useful for larger problems, splice()
really never lived up to that conceptual issue, and it's just really
really nasty in practice.

But we're stuck with it.

I'm not convinced your suggestion of extending io_uring with new
primitives is any better in practice, though.

Linus

2023-02-10 19:02:08

by Andy Lutomirski

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 10:37 AM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, Feb 10, 2023 at 9:57 AM Andy Lutomirski <[email protected]> wrote:
>
> I'm not convinced your suggestion of extending io_uring with new
> primitives is any better in practice, though.


I don't know if I'm really suggesting new primitives. I think I'm
making two change suggestions that go together.

First, let splice() and IORING_OP_SPLICE copy (or zero-copy) data from
a file to a socket.

Second, either make splice more strict or add a new "strict splice"
variant. Strict splice only completes when it can promise that writes
to the source that start after strict splice's completion won't change
what gets written to the destination.


I think that strict splice fixes Stefan's use case. It's also easier
to reason about than regular splice.


The major caveat here is that zero-copy strict splice is fundamentally
a potentially long-running operation in a way that zero-copy splice()
isn't right now. So the combination of O_NONBLOCK and strict splice()
(the syscall, not necessarily the io_uring operation) to something
like a TCP socket requires complicated locking or change tracking to
make sense. This means that a splice() syscall providing strict
semantics to a TCP socket may just need to do a copy, at least in many
cases. But maybe that's fine -- very-high-performance networking is
moving pretty aggressively to io_uring anyway.


And my possibly-quite-out-there claim is that, if Linux implements
strict splice, maybe non-strict splice could get replaced in a user
ABI-compatible manner with a much simpler non-zero-copy
implementation. And strict splice from a file to a pipe could be
implemented as a copy -- high performance users can, if needed, start
strict-splicing from a file directly to a socket.

2023-02-10 19:18:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 11:02 AM Andy Lutomirski <[email protected]> wrote:
>
> Second, either make splice more strict or add a new "strict splice"
> variant. Strict splice only completes when it can promise that writes
> to the source that start after strict splice's completion won't change
> what gets written to the destination.

The thing ius, I think your "strict splice" is pointless and wrong.

It's pointless, because it simply means that it won't perform well.

And since the whole point of splice was performance, it's wrong.

I really think the whole "source needs to be stable" is barking up the
wrong tree.

You are pointing fingers at splice().

And I think that's wrong.

We should point the fingers at either the _user_ of splice - as Jeremy
Allison has done a couple of times - or we should point it at the sink
that cannot deal with unstable sources.

Because that whole "source is unstable" is what allows for that higher
performance. The moment you start requiring stability, you _will_ lose
it. You will have to lock the page, you'll have to umap it from any
shared mappings, etc etc. And even if there are no writers, or no
current mappers, all that effort to make sure that is the case is
actually fairly expensive.

So I would instead suggest a different approach entirely, with several
different steps:

- make sure people are *aware* of this all.

Maybe this thread raised some awareness of it for some people, but
more realistically - maybe we can really document this whole issue
somewhere much more clearly

- it sounds like the particular user in question (samba) already very
much has a reasonable model for "I have exclusive access to this" that
just wasn't used

- and finally, I do think it might make sense for the networking
people to look at how the networking side works with 'sendpage()'.

Because I really think that your "strict splice" model would just mean
that now the kernel would have to add not just a memcpy, but also a
new allocation for that new stable buffer for the memcpy, and that
would all just be very very pointless.

Alternatively, it would require some kind of nasty hard locking
together with other limitations on what can be done by non-splice
users.

Linus

2023-02-10 19:28:07

by Jeremy Allison

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 11:18:05AM -0800, Linus Torvalds via samba-technical wrote:
>
>We should point the fingers at either the _user_ of splice - as Jeremy
>Allison has done a couple of times - or we should point it at the sink
>that cannot deal with unstable sources.
> ....
> - it sounds like the particular user in question (samba) already very
>much has a reasonable model for "I have exclusive access to this" that
>just wasn't used

Having said that, I just had a phone discussion with Ralph Boehme
on the Samba Team, who has been following along with this in
read-only mode, and he did point out one case I had missed.

1). Client opens file with a lease. Hurrah, we think we can use splice() !
2). Client writes into file.
3). Client calls SMB_FLUSH to ensure data is on disk.
4). Client reads the data just wrtten to ensure it's good.
5). Client overwrites the previously written data.

Now when client issues (4), the read request, if we
zero-copy using splice() - I don't think theres a way
we get notified when the data has finally left the
system and the mapped splice memory in the buffer cache
is safe to overwrite by the write (5).

So the read in (4) could potentially return the data
written in (5), if the buffer cache mapped memory has
not yet been sent out over the network.

That is certainly unexpected behavior for the client,
even if the client leased the file.

If that's the case, then splice() is unusable for
Samba even in the leased file case.

> Maybe this thread raised some awareness of it for some people, but
>more realistically - maybe we can really document this whole issue
>somewhere much more clearly

Complete comprehensive documentation on this would
be extremely helpful (to say the least :-).

2023-02-10 19:29:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 11:18 AM Linus Torvalds
<[email protected]> wrote:
>
> - and finally, I do think it might make sense for the networking
> people to look at how the networking side works with 'sendpage()'.

Put another way: I do not believe that it is at all fundamental that
you can't send data from an changing source over the network. That's
likely _particularly_ true of the people who care the most, and who
already have network cards that do a lot of the heavy lifting for you.

So why spend a lot of effort to stabilize the data, if it's not
needed, when the primary users of it would likely not want that
performance hit and extra work in the first place?

Then making that "strict mode" be the only mode going forward and just
disallowing people from doing the simple thing sounds particularly
wrong.

For example, it may *literally* be that the IPV4 TCP case could be
fixed with something trivial like this

--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1134,7 +1134,8 @@ EXPORT_SYMBOL_GPL(do_tcp_sendpages);
int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
{
- if (!(sk->sk_route_caps & NETIF_F_SG))
+ if (!(sk->sk_route_caps & NETIF_F_SG) ||
+ !(sk->sk_route_caps & (NETIF_F_HW_CSUM | NETIF_F_IP_CSUM)))
return sock_no_sendpage_locked(sk, page, offset, size, flags);

tcp_rate_check_app_limited(sk); /* is sending application-limited? */

which would basically make hardware that can't deal with the data
changing under it just fall back to the "safe and slow" model on its
own.

But then hardware that doesn't care would "just work".

See what I'm saying? The above patch may be garbage because I don't
understand the network driver rules fully, so don't take the above as
some kind of "last word" on this AT ALL. But I'm just saying that
requiring stable sources doesn't necessarily make any sense at all.

Linus

2023-02-10 19:30:19

by Stefan Metzmacher

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

Am 10.02.23 um 19:19 schrieb Jeremy Allison:
> On Fri, Feb 10, 2023 at 09:57:20AM -0800, Andy Lutomirski via samba-technical wrote:
>>
>> (And if Samba needs to make sure that future writes don't change the
>> outgoing data even two seconds later when the data has been sent but
>> not acked, then maybe a fancy API could be added to help, or maybe
>> Samba shouldn't be using zero copy IO in the first place!)
>
> Samba doesn't need any of this. The simplest thing to do is
> to restrict splice-based zero-copy IO to files leased by
> a single client, where exclusive access to changes is controled
> by the client redirector.

Yes, I guess we can use it if the file is read-only (from it's acls),
or when the client has a read lease. And of course we can have an I don't care
option, maybe reusing 'use sendfile = yes' as that has the same problem in
the existing code already.

metze

2023-02-10 19:43:57

by Stefan Metzmacher

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

Am 10.02.23 um 20:27 schrieb Jeremy Allison:
> On Fri, Feb 10, 2023 at 11:18:05AM -0800, Linus Torvalds via samba-technical wrote:
>>
>> We should point the fingers at either the _user_ of splice - as Jeremy
>> Allison has done a couple of times - or we should point it at the sink
>> that cannot deal with unstable sources.
>> ....
>> - it sounds like the particular user in question (samba) already very
>> much has a reasonable model for "I have exclusive access to this" that
>> just wasn't used
>
> Having said that, I just had a phone discussion with Ralph Boehme
> on the Samba Team, who has been following along with this in
> read-only mode, and he did point out one case I had missed.
>
> 1). Client opens file with a lease. Hurrah, we think we can use splice() !
> 2). Client writes into file.
> 3). Client calls SMB_FLUSH to ensure data is on disk.
> 4). Client reads the data just wrtten to ensure it's good.
> 5). Client overwrites the previously written data.
>
> Now when client issues (4), the read request, if we
> zero-copy using splice() - I don't think theres a way
> we get notified when the data has finally left the
> system and the mapped splice memory in the buffer cache
> is safe to overwrite by the write (5).
>
> So the read in (4) could potentially return the data
> written in (5), if the buffer cache mapped memory has
> not yet been sent out over the network.
>
> That is certainly unexpected behavior for the client,
> even if the client leased the file.
>
> If that's the case, then splice() is unusable for
> Samba even in the leased file case.

I think we just need some coordination in userspace.

What might be helpful in addition would be some kind of
notification that all pages are no longer used by the network
layer, IORING_OP_SENDMSG_ZC already supports such a notification,
maybe we can build something similar.

>>   Maybe this thread raised some awareness of it for some people, but
>> more realistically - maybe we can really document this whole issue
>> somewhere much more clearly
>
> Complete comprehensive documentation on this would
> be extremely helpful (to say the least :-).

Yes, good documentation is always good :-)

2023-02-10 19:44:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 11:27 AM Jeremy Allison <[email protected]> wrote:
>
> 1). Client opens file with a lease. Hurrah, we think we can use splice() !
> 2). Client writes into file.
> 3). Client calls SMB_FLUSH to ensure data is on disk.
> 4). Client reads the data just wrtten to ensure it's good.
> 5). Client overwrites the previously written data.
>
> Now when client issues (4), the read request, if we
> zero-copy using splice() - I don't think theres a way
> we get notified when the data has finally left the
> system and the mapped splice memory in the buffer cache
> is safe to overwrite by the write (5).

Well, but we know that either:

(a) the client has already gotten the read reply, and does the write
afterwards. So (4) has already not just left the network stack, but
actually made it all the way to the client.

OR

(b) (4) and (5) clearly aren't ordered on the client side (ie your
"client" is not one single thread, and did an independent read and
overlapping write), and the client can't rely on one happening before
the other _anyway_.

So if it's (b), then you might as well do the write first, because
there's simply no ordering between the two. If you have a concurrent
read and a concurrent write to the same file, the read result is going
to be random anyway.

(And yes, you can find POSIX language specifies that writes are atomic
"all or nothing" operations, but Linux has actually never done that,
and it's literally a nonsensical requirement and not actually true in
any system: try doing a single gigabyte "write()" system call, and at
a minimum you'll see the file size grow when doing "stat()" calls in
another window. So it's purely "POSIX says that, but it bears no
relationship to the truth")

Or am I missing something?

Linus

2023-02-10 19:54:58

by Stefan Metzmacher

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

Am 10.02.23 um 20:42 schrieb Linus Torvalds:
> On Fri, Feb 10, 2023 at 11:27 AM Jeremy Allison <[email protected]> wrote:
>>
>> 1). Client opens file with a lease. Hurrah, we think we can use splice() !
>> 2). Client writes into file.
>> 3). Client calls SMB_FLUSH to ensure data is on disk.
>> 4). Client reads the data just wrtten to ensure it's good.
>> 5). Client overwrites the previously written data.
>>
>> Now when client issues (4), the read request, if we
>> zero-copy using splice() - I don't think theres a way
>> we get notified when the data has finally left the
>> system and the mapped splice memory in the buffer cache
>> is safe to overwrite by the write (5).
>
> Well, but we know that either:
>
> (a) the client has already gotten the read reply, and does the write
> afterwards. So (4) has already not just left the network stack, but
> actually made it all the way to the client.
>
> OR
>
> (b) (4) and (5) clearly aren't ordered on the client side (ie your
> "client" is not one single thread, and did an independent read and
> overlapping write), and the client can't rely on one happening before
> the other _anyway_.
>
> So if it's (b), then you might as well do the write first, because
> there's simply no ordering between the two. If you have a concurrent
> read and a concurrent write to the same file, the read result is going
> to be random anyway.

I guess that's true, most clients won't have a problem.

However in theory it's possible that client uses a feature
called compounding, which means two requests are batched on the
way to the server they are processed sequentially and the responses
are batched on the way back again.

But we already have detection for that and the existing code also avoids
sendfile() in that case.

metze

2023-02-10 19:56:07

by Andy Lutomirski

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 11:18 AM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, Feb 10, 2023 at 11:02 AM Andy Lutomirski <[email protected]> wrote:
> >
> > Second, either make splice more strict or add a new "strict splice"
> > variant. Strict splice only completes when it can promise that writes
> > to the source that start after strict splice's completion won't change
> > what gets written to the destination.
>
> The thing ius, I think your "strict splice" is pointless and wrong.
>
> It's pointless, because it simply means that it won't perform well.
>
> And since the whole point of splice was performance, it's wrong.
>
> I really think the whole "source needs to be stable" is barking up the
> wrong tree.
>
> You are pointing fingers at splice().
>
> And I think that's wrong.
>
> We should point the fingers at either the _user_ of splice - as Jeremy
> Allison has done a couple of times - or we should point it at the sink
> that cannot deal with unstable sources.
>
> Because that whole "source is unstable" is what allows for that higher
> performance. The moment you start requiring stability, you _will_ lose
> it. You will have to lock the page, you'll have to umap it from any
> shared mappings, etc etc. And even if there are no writers, or no
> current mappers, all that effort to make sure that is the case is
> actually fairly expensive.

...

> Because I really think that your "strict splice" model would just mean
> that now the kernel would have to add not just a memcpy, but also a
> new allocation for that new stable buffer for the memcpy, and that
> would all just be very very pointless.
>
> Alternatively, it would require some kind of nasty hard locking
> together with other limitations on what can be done by non-splice
> users.

I could be wrong, but I don't think any of this is necessary. My
strict splice isn't intended to be any more stable than current splice
-- it's intended to complete more slowly and more informatively. Now
maybe I'm wrong and the impleentation would be nasty, but I think that
the only bookkeeping needed is to arrange strict-splice to not
complete until the kernel is done with the source's page cache. The
use of the source is refcounted already, and a bit of extra work might
be needed to track which strict-splice the reference came from, but
unless I've missed something, it's not crazy.

Looking at the current splice implementaiton, a splice that isn't
"strictly completed" is sort of represented by a struct pipe_buffer (I
think). The actual implementation of strict-splice might consist of
separating pipe_buffer out from a pipe and adding an io_kiocb* and a
refcount to it. Or maybe even just adding an io_kiocb* and making the
existing refcouting keep also track the io_kiocb*, but that might be
complicated. This all boils down to tracking an actual splice all the
way through its lifecycle and not reporting it as done until it's all
the way done. Anything else is icing on the cake, no?

There is absolutely no need to lock files or make page-cache pages
immutable or anything like that.

i think this is almost exactly what Jeremy and Stefan are asking for
re: notification when the system is done with a zero-copy send:

> What might be helpful in addition would be some kind of
notification that all pages are no longer used by the network
layer, IORING_OP_SENDMSG_ZC already supports such a notification,
maybe we can build something similar.

2023-02-10 20:27:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 11:56 AM Andy Lutomirski <[email protected]> wrote:
>
> i think this is almost exactly what Jeremy and Stefan are asking for
> re: notification when the system is done with a zero-copy send:

Yeah, right now it's all just "incremented page counts", I think.

Even the pipe code itself doesn't know about writes that have already
been done, but that are pending elsewhere.

You'd have to ask the target file descriptor itself about "how much do
you have pending" or something.

Linus

2023-02-10 20:32:15

by Jens Axboe

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On 2/10/23 1:27 PM, Linus Torvalds wrote:
> On Fri, Feb 10, 2023 at 11:56 AM Andy Lutomirski <[email protected]> wrote:
>>
>> i think this is almost exactly what Jeremy and Stefan are asking for
>> re: notification when the system is done with a zero-copy send:
>
> Yeah, right now it's all just "incremented page counts", I think.
>
> Even the pipe code itself doesn't know about writes that have already
> been done, but that are pending elsewhere.
>
> You'd have to ask the target file descriptor itself about "how much do
> you have pending" or something.

No, we very much do have that for io_uring zerocopy sends, which was in
the bit below you snipped from the reply. It'll tell you when data has
been sent out, and when the data has been acked. Or what am I missing
here? Late to this thread, and there's a ton of stuff go to through.

--
Jens Axboe



2023-02-10 20:37:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 12:32 PM Jens Axboe <[email protected]> wrote:
>
> No, we very much do have that for io_uring zerocopy sends, which was in
> the bit below you snipped from the reply. It'll tell you when data has
> been sent out, and when the data has been acked.

Hmm. splice() itself definitely doesn't have that data - there's no
"io context" for it.

There is only the pipe buffers, and they are released when the data
has been accepted - which is not the same as used (eg the networking
layer just takes another ref to the page and says "I'm done").

Maybe adding some io context to the pipe buffer would be possible, but
it's certainly not obvious how, without changing splice() semantics
completely.

Linus

2023-02-10 20:39:50

by Jens Axboe

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On 2/10/23 1:36 PM, Linus Torvalds wrote:
> On Fri, Feb 10, 2023 at 12:32 PM Jens Axboe <[email protected]> wrote:
>>
>> No, we very much do have that for io_uring zerocopy sends, which was in
>> the bit below you snipped from the reply. It'll tell you when data has
>> been sent out, and when the data has been acked.
>
> Hmm. splice() itself definitely doesn't have that data - there's no
> "io context" for it.
>
> There is only the pipe buffers, and they are released when the data
> has been accepted - which is not the same as used (eg the networking
> layer just takes another ref to the page and says "I'm done").
>
> Maybe adding some io context to the pipe buffer would be possible, but
> it's certainly not obvious how, without changing splice() semantics
> completely.

Right, I'm referencing doing zerocopy data sends with io_uring, using
IORING_OP_SEND_ZC. This isn't from a file, it's from a memory location,
but the important bit here is the split notifications and how you
could wire up a OP_SENDFILE similarly to what Andy described.

--
Jens Axboe



2023-02-10 20:45:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 12:39 PM Jens Axboe <[email protected]> wrote:
>
> Right, I'm referencing doing zerocopy data sends with io_uring, using
> IORING_OP_SEND_ZC. This isn't from a file, it's from a memory location,
> but the important bit here is the split notifications and how you
> could wire up a OP_SENDFILE similarly to what Andy described.

Sure, I think it's much more reasonable with io_uring than with splice itself.

So I was mainly just reacting to the "strict-splice" thing where Andy
was talking about tracking the page refcounts. I don't think anything
like that can be done at a splice() level, but higher levels that
actually know about the whole IO might be able to do something like
that.

Maybe we're just talking past each other.

Linus

2023-02-10 20:45:43

by Stefan Metzmacher

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

Am 09.02.23 um 20:48 schrieb Linus Torvalds via samba-technical:
> On Thu, Feb 9, 2023 at 11:36 AM Linus Torvalds
> <[email protected]> wrote:
>>
>> I guarantee that you will only slow things down with some odd async_memcpy.
>
> Extended note: even if the copies themselves would then be done
> concurrently with other work (so "not faster, but more parallel"), the
> synchronization required at the end would then end up being costly
> enough to eat up any possible win. Plus you'd still end up with a
> fundamental problem of "what if the data changes in the meantime".
>
> And that's ignoring all the practical problems of actually starting
> the async copy, which traditionally requires virtual to physical
> translations (where "physical" is whatever the DMA address space is).
>
> So I don't think there are any actual real cases of async memory copy
> engines being even _remotely_ better than memcpy outside of
> microcontrollers (and historical computers before caches - people may
> still remember things like the Amiga blitter fondly).
>
> Again, the exception ends up being if you can actually use real DMA to
> not do a memory copy, but to transfer data directly to or from the
> device. That's in some way what 'splice()' is designed to allow you to
> do, but exactly by the pipe part ending up being the "conceptual
> buffer" for the zero-copy pages.
>
> So this is exactly *why* splicing from a file all the way to the
> network will then show any file changes that have happened in between
> that "splice started" and "network card got the data". You're supposed
> to use splice only when you can guarantee the data stability (or,
> alternatively, when you simply don't care about the data stability,
> and getting the changed data is perfectly fine).

Ok, thanks for the explanation!

Looking at this patch from David Howells :
https://lore.kernel.org/linux-fsdevel/[email protected]/
indicates that we don't have that problem with O_DIRECT as it operates
on dedicated pages (not part of the shared page cache). And these pages
might be filled via DMA (depending on the filesystem and block device).

Is my understanding correct?

Together with this patch:
https://lore.kernel.org/linux-fsdevel/[email protected]/

I guess it would be easy to pass a flag (maybe SPLICE_F_FORCE_COPY)
down to generic_file_splice_read() and let it create dedicated pages
and use memcpy() from the page cache to the dedicated pages.
This would mean one memcpy(), but it would allow the pipe to be used
for the splice() to the socket, tee(). This might be easier than
using pread() followed by vmsplice(SPLICE_F_GIFT).

The usage of SPLICE_F_GIFT is very confusing to me...
I found this example in libkcapi using the kernel crypto sockets:
https://github.com/smuellerDD/libkcapi/blob/master/lib/kcapi-kernel-if.c#L324
where it just passes SPLICE_F_GIFT together with an iovec passed from the
caller of the library.

To me it's not clear if the caller can still use it's buffers referenced by
the passed iovec...

metze



2023-02-10 20:50:21

by Jens Axboe

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On 2/10/23 1:44 PM, Linus Torvalds wrote:
> On Fri, Feb 10, 2023 at 12:39 PM Jens Axboe <[email protected]> wrote:
>>
>> Right, I'm referencing doing zerocopy data sends with io_uring, using
>> IORING_OP_SEND_ZC. This isn't from a file, it's from a memory location,
>> but the important bit here is the split notifications and how you
>> could wire up a OP_SENDFILE similarly to what Andy described.
>
> Sure, I think it's much more reasonable with io_uring than with splice itself.
>
> So I was mainly just reacting to the "strict-splice" thing where Andy
> was talking about tracking the page refcounts. I don't think anything
> like that can be done at a splice() level, but higher levels that
> actually know about the whole IO might be able to do something like
> that.
>
> Maybe we're just talking past each other.

Maybe slightly, as I was not really intending to comment on the strict
splice thing. But yeah I agree on splice, it would not be trivial to do
there. At least with io_uring we have the communication channel we need.
And tracking page refcounts seems iffy and fraught with potential
issues.

--
Jens Axboe



2023-02-10 20:51:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 12:45 PM Stefan Metzmacher <[email protected]> wrote:
>
> I guess it would be easy to pass a flag (maybe SPLICE_F_FORCE_COPY)
> down to generic_file_splice_read() and let it create dedicated pages
> and use memcpy() from the page cache to the dedicated pages.

I really think you'd be much better off passing it off to the
*destination*, not the source.

The destination knows whether it needs to copy or not, and in the case
of at least networking, already has the copy option.

The destination might also already know whether it can use the data
synchronously, so that there isn't even any issue with "data may
change later".

In contrast, the source has no clue. It just knows "I'm a page cache
page". It would have to always copy.

Linus

2023-02-10 21:15:13

by Andy Lutomirski

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 12:50 PM Jens Axboe <[email protected]> wrote:
>
> On 2/10/23 1:44 PM, Linus Torvalds wrote:
> > On Fri, Feb 10, 2023 at 12:39 PM Jens Axboe <[email protected]> wrote:
> >>
> >> Right, I'm referencing doing zerocopy data sends with io_uring, using
> >> IORING_OP_SEND_ZC. This isn't from a file, it's from a memory location,
> >> but the important bit here is the split notifications and how you
> >> could wire up a OP_SENDFILE similarly to what Andy described.
> >
> > Sure, I think it's much more reasonable with io_uring than with splice itself.
> >
> > So I was mainly just reacting to the "strict-splice" thing where Andy
> > was talking about tracking the page refcounts. I don't think anything
> > like that can be done at a splice() level, but higher levels that
> > actually know about the whole IO might be able to do something like
> > that.
> >
> > Maybe we're just talking past each other.
>
> Maybe slightly, as I was not really intending to comment on the strict
> splice thing. But yeah I agree on splice, it would not be trivial to do
> there. At least with io_uring we have the communication channel we need.
> And tracking page refcounts seems iffy and fraught with potential
> issues.
>

Hmm.

Are there any real-world use cases for zero-copy splice() that
actually depend on splicing from a file to a pipe and then later from
the pipe to a socket (or file or whatever)? Or would everything
important be covered by a potential new io_uring operation that copies
from one fd directly to another fd?

Maybe I'm getting far ahead of myself.

2023-02-10 21:27:25

by Jens Axboe

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On 2/10/23 2:14?PM, Andy Lutomirski wrote:
> On Fri, Feb 10, 2023 at 12:50 PM Jens Axboe <[email protected]> wrote:
>>
>> On 2/10/23 1:44?PM, Linus Torvalds wrote:
>>> On Fri, Feb 10, 2023 at 12:39 PM Jens Axboe <[email protected]> wrote:
>>>>
>>>> Right, I'm referencing doing zerocopy data sends with io_uring, using
>>>> IORING_OP_SEND_ZC. This isn't from a file, it's from a memory location,
>>>> but the important bit here is the split notifications and how you
>>>> could wire up a OP_SENDFILE similarly to what Andy described.
>>>
>>> Sure, I think it's much more reasonable with io_uring than with splice itself.
>>>
>>> So I was mainly just reacting to the "strict-splice" thing where Andy
>>> was talking about tracking the page refcounts. I don't think anything
>>> like that can be done at a splice() level, but higher levels that
>>> actually know about the whole IO might be able to do something like
>>> that.
>>>
>>> Maybe we're just talking past each other.
>>
>> Maybe slightly, as I was not really intending to comment on the strict
>> splice thing. But yeah I agree on splice, it would not be trivial to do
>> there. At least with io_uring we have the communication channel we need.
>> And tracking page refcounts seems iffy and fraught with potential
>> issues.
>>
>
> Hmm.
>
> Are there any real-world use cases for zero-copy splice() that
> actually depend on splicing from a file to a pipe and then later from
> the pipe to a socket (or file or whatever)? Or would everything
> important be covered by a potential new io_uring operation that copies
> from one fd directly to another fd?

I think it makes sense. As Linus has referenced, the sex appeal of
splice is the fact that it is dealing with pipes, and you can access
these internal buffers through other means. But that is probably largely
just something that is sexy design wise, nothing that _really_ matters
in practice. And the pipes do get in the way, for example I had to add
pipe resizing fcntl helpers to bump the size. If you're doing a plain
sendfile, the pipes just kind of get in the way too imho.

Another upside (from the io_uring) perspective is that splice isn't very
efficient through io_uring, as it requires offload to io-wq. This could
obviously be solved by some refactoring in terms of non-blocking, but it
hasn't really been that relevant (and nobody has complained about it). A
new sendfile op would nicely get around that too as it could be designed
with async in nature, rather than the classic sync syscall model that
splice follows.

--
Jens Axboe


2023-02-10 21:52:35

by Jens Axboe

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On 2/10/23 2:27 PM, Jens Axboe wrote:
> On 2/10/23 2:14?PM, Andy Lutomirski wrote:
>> On Fri, Feb 10, 2023 at 12:50 PM Jens Axboe <[email protected]> wrote:
>>>
>>> On 2/10/23 1:44?PM, Linus Torvalds wrote:
>>>> On Fri, Feb 10, 2023 at 12:39 PM Jens Axboe <[email protected]> wrote:
>>>>>
>>>>> Right, I'm referencing doing zerocopy data sends with io_uring, using
>>>>> IORING_OP_SEND_ZC. This isn't from a file, it's from a memory location,
>>>>> but the important bit here is the split notifications and how you
>>>>> could wire up a OP_SENDFILE similarly to what Andy described.
>>>>
>>>> Sure, I think it's much more reasonable with io_uring than with splice itself.
>>>>
>>>> So I was mainly just reacting to the "strict-splice" thing where Andy
>>>> was talking about tracking the page refcounts. I don't think anything
>>>> like that can be done at a splice() level, but higher levels that
>>>> actually know about the whole IO might be able to do something like
>>>> that.
>>>>
>>>> Maybe we're just talking past each other.
>>>
>>> Maybe slightly, as I was not really intending to comment on the strict
>>> splice thing. But yeah I agree on splice, it would not be trivial to do
>>> there. At least with io_uring we have the communication channel we need.
>>> And tracking page refcounts seems iffy and fraught with potential
>>> issues.
>>>
>>
>> Hmm.
>>
>> Are there any real-world use cases for zero-copy splice() that
>> actually depend on splicing from a file to a pipe and then later from
>> the pipe to a socket (or file or whatever)? Or would everything
>> important be covered by a potential new io_uring operation that copies
>> from one fd directly to another fd?
>
> I think it makes sense. As Linus has referenced, the sex appeal of
> splice is the fact that it is dealing with pipes, and you can access
> these internal buffers through other means. But that is probably largely
> just something that is sexy design wise, nothing that _really_ matters
> in practice. And the pipes do get in the way, for example I had to add
> pipe resizing fcntl helpers to bump the size. If you're doing a plain
> sendfile, the pipes just kind of get in the way too imho.
>
> Another upside (from the io_uring) perspective is that splice isn't very
> efficient through io_uring, as it requires offload to io-wq. This could
> obviously be solved by some refactoring in terms of non-blocking, but it
> hasn't really been that relevant (and nobody has complained about it). A
> new sendfile op would nicely get around that too as it could be designed
> with async in nature, rather than the classic sync syscall model that
> splice follows.

Speaking of splice/io_uring, Ming posted this today:

https://lore.kernel.org/io-uring/[email protected]/

--
Jens Axboe



2023-02-10 22:09:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 1:51 PM Jens Axboe <[email protected]> wrote:
>
> Speaking of splice/io_uring, Ming posted this today:
>
> https://lore.kernel.org/io-uring/[email protected]/

Ugh. Some of that is really ugly. Both 'ignore_sig' and
'ack_page_consuming' just look wrong. Pure random special cases.

And that 'ignore_sig' is particularly ugly, since the only thing that
sets it also sets SPLICE_F_NONBLOCK.

And the *only* thing that actually then checks that field is
'splice_from_pipe_next()', where there are exactly two
signal_pending() checks that it adds to, and

(a) the first one is to protect from endless loops

(b) the second one is irrelevant when SPLICE_F_NONBLOCK is set

So honestly, just NAK on that series.

I think that instead of 'ignore_sig' (which shouldn't exist), that
first 'signal_pending()' check in splice_from_pipe_next() should just
be changed into a 'fatal_signal_pending()'.

But that 'ack_page_consuming' thing looks even more disgusting, and
since I'm not sure why it even exists, I don't know what it's doing
wrong.

Let's agree not to make splice() worse, while people are talking about
how bad it already is, ok?

Linus

2023-02-10 22:16:57

by Jens Axboe

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On 2/10/23 3:08?PM, Linus Torvalds wrote:
> On Fri, Feb 10, 2023 at 1:51 PM Jens Axboe <[email protected]> wrote:
>>
>> Speaking of splice/io_uring, Ming posted this today:
>>
>> https://lore.kernel.org/io-uring/[email protected]/
>
> Ugh. Some of that is really ugly. Both 'ignore_sig' and
> 'ack_page_consuming' just look wrong. Pure random special cases.
>
> And that 'ignore_sig' is particularly ugly, since the only thing that
> sets it also sets SPLICE_F_NONBLOCK.
>
> And the *only* thing that actually then checks that field is
> 'splice_from_pipe_next()', where there are exactly two
> signal_pending() checks that it adds to, and
>
> (a) the first one is to protect from endless loops
>
> (b) the second one is irrelevant when SPLICE_F_NONBLOCK is set
>
> So honestly, just NAK on that series.
>
> I think that instead of 'ignore_sig' (which shouldn't exist), that
> first 'signal_pending()' check in splice_from_pipe_next() should just
> be changed into a 'fatal_signal_pending()'.
>
> But that 'ack_page_consuming' thing looks even more disgusting, and
> since I'm not sure why it even exists, I don't know what it's doing
> wrong.
>
> Let's agree not to make splice() worse, while people are talking about
> how bad it already is, ok?

I was in no way advocating for this series, but it seems relevant as we
are discussing splice :-). I have pointed Ming at this discussion too.

--
Jens Axboe


2023-02-10 22:18:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 2:08 PM Linus Torvalds
<[email protected]> wrote:
>
> (a) the first one is to protect from endless loops

Just to clarify: they're not "endless loops" per se, but we have
splice sources and destinations that always succeed, like /dev/zero
and /dev/null.

So things like "sendfile()" that are happy to just repeat until done
do need to have some kind of signal handling even for the case when
we're not actually waiting for data. That's what that whole

/*
* Check for signal early to make process killable when there are
* always buffers available
*/

this is all about. See commit c725bfce7968 ("vfs: Make sendfile(2)
killable even better") for a less obvious example than that
"zero->null" kind of thing.

(I actually suspect that /dev/zero no longer works as a splice source,
since we disabled the whole "fall back to regular IO" that Christoph
did in 36e2c7421f02 "fs: don't allow splice read/write without
explicit ops").

Linus

2023-02-10 22:26:06

by Jens Axboe

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On 2/10/23 3:17?PM, Linus Torvalds wrote:
> On Fri, Feb 10, 2023 at 2:08 PM Linus Torvalds
> <[email protected]> wrote:
>>
>> (a) the first one is to protect from endless loops
>
> Just to clarify: they're not "endless loops" per se, but we have
> splice sources and destinations that always succeed, like /dev/zero
> and /dev/null.
>
> So things like "sendfile()" that are happy to just repeat until done
> do need to have some kind of signal handling even for the case when
> we're not actually waiting for data. That's what that whole
>
> /*
> * Check for signal early to make process killable when there are
> * always buffers available
> */
>
> this is all about. See commit c725bfce7968 ("vfs: Make sendfile(2)
> killable even better") for a less obvious example than that
> "zero->null" kind of thing.
>
> (I actually suspect that /dev/zero no longer works as a splice source,
> since we disabled the whole "fall back to regular IO" that Christoph
> did in 36e2c7421f02 "fs: don't allow splice read/write without
> explicit ops").

Yet another one... Since it has a read_iter, should be fixable with just
adding the generic splice_read.

--
Jens Axboe


2023-02-10 22:35:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 2:26 PM Jens Axboe <[email protected]> wrote:
> >
> > (I actually suspect that /dev/zero no longer works as a splice source,
> > since we disabled the whole "fall back to regular IO" that Christoph
> > did in 36e2c7421f02 "fs: don't allow splice read/write without
> > explicit ops").
>
> Yet another one... Since it has a read_iter, should be fixable with just
> adding the generic splice_read.

I actually very consciously did *not* want to add cases of
generic_splice_read() "just because we can".

I've been on a "let's minimize the reach of splice" thing for a while.
I really loved Christoph's patches, even if I may not have been hugely
vocal about it. His getting rid of set/get_fs() got rid of a *lot* of
splice pain.

And rather than try to make everything work with splice that used to
work just because it fell back on read/write, I was waiting for actual
regression reports.

Even when splice fails, a lot of user space then falls back on
read/write, and unless there is some really fundamental reason not to,
I think that's always the right thing to do.

So we do have a number of "add splice_write/splice_read" commits, but
they are hopefully all the result of people actually noticing
breakage.

You can do

git log --grep=36e2c7421f02

to see at least some of them, and I really don't want to see them
without a "Reported-by" and an actual issue.

Exactly because I'm not all that enamoured with splice any more.

Linus

2023-02-10 22:41:58

by David Laight

[permalink] [raw]
Subject: RE: copy on write for splice() from file to pipe?

From: Linus Torvalds
> Sent: 10 February 2023 17:24
...
> And when it comes to networking, in general things like TCP checksums
> etc should be ok even with data that isn't stable. When doing things
> by hand, networking should always use the "copy-and-checksum"
> functions that do the checksum while copying (so even if the source
> data changes, the checksum is going to be the checksum for the data
> that was copied).
>
> And in many (most?) smarter network cards, the card itself does the
> checksum, again on the data as it is transferred from memory.
>
> So it's not like "networking needs a stable source" is some really
> _fundamental_ requirement for things like that to work.

It is also worth remembering that TCP needs to be able
to retransmit the data and a much later time.
So the application must not change the data until it has
been acked by the remote system.

Operating systems that do asynchronous IO directly from
application buffers have callbacks/events to tell the
application when it is allowed to modify the buffers.
For TCP this won't be indicated until after the ACK
is received.
I don't think io_uring has any way to indicate anything
other than 'the data has been accepted by the socket'.

If you have 'kernel pages containing data' (eg from writes
into a pipe, or data received from a network) then they have
a single 'owner' and can be passed about.
But user-pages (including mmapped files) have multiple owners
so you are never going to be able to pass them as 'immutable
data'.
If you mmap a very large (and maybe sparse) file and then
try to do a very large (multi-GB) send() (with or without
any kind of page loaning) there is always the possibility
that the data that is actually sent was written while the
send() call was in progress.
Any kind of asynchronous send() just makes it more obvious.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-02-10 22:51:17

by Jens Axboe

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On 2/10/23 3:35?PM, Linus Torvalds wrote:
> On Fri, Feb 10, 2023 at 2:26 PM Jens Axboe <[email protected]> wrote:
>>>
>>> (I actually suspect that /dev/zero no longer works as a splice source,
>>> since we disabled the whole "fall back to regular IO" that Christoph
>>> did in 36e2c7421f02 "fs: don't allow splice read/write without
>>> explicit ops").
>>
>> Yet another one... Since it has a read_iter, should be fixable with just
>> adding the generic splice_read.
>
> I actually very consciously did *not* want to add cases of
> generic_splice_read() "just because we can".
>
> I've been on a "let's minimize the reach of splice" thing for a while.
> I really loved Christoph's patches, even if I may not have been hugely
> vocal about it. His getting rid of set/get_fs() got rid of a *lot* of
> splice pain.
>
> And rather than try to make everything work with splice that used to
> work just because it fell back on read/write, I was waiting for actual
> regression reports.
>
> Even when splice fails, a lot of user space then falls back on
> read/write, and unless there is some really fundamental reason not to,
> I think that's always the right thing to do.
>
> So we do have a number of "add splice_write/splice_read" commits, but
> they are hopefully all the result of people actually noticing
> breakage.
>
> You can do
>
> git log --grep=36e2c7421f02
>
> to see at least some of them, and I really don't want to see them
> without a "Reported-by" and an actual issue.

Oh I already did that the last few times (and there's quite a bit). And
while I agree that getting rid of the ancient set/get_fs bits was great,
it is still annoying to knowingly have regressions. The problem with
this approach is that the time from when you start the "experiment" to
when the first report comes in, it'll take a while as most don't run
-git kernels at all. And the ones that do are generally just standard
distro setups on their workstation/laptop.

The time is my main concern, it takes many years before you're fully
covered. Maybe if that series had been pushed to stable as well we'd
have a better shot at weeding them out.

> Exactly because I'm not all that enamoured with splice any more.

I don't think anyone has been for years, I sure have not and haven't
worked on it in decades outside of exposing some for io_uring.The
latter was probably a mistake and we should've done something else, but
there is something to be said for the devil you know... Outside of that,
looks like the last real change was support for bigger pipes in 2010.
But:

1) Interesting bits do come up. Some of these, as this discussion has
highlighted, are probably better served somewhere else, especially if
they require changes/additions. Some may be valid and fine.

2) Knowingly breaking things isn't very nice, and if anyone else did
that, they'd have you screaming at them.

So while I do kind of agree with you on some points, I don't think it
was done very well from that perspective. And when we spot things like
zero not working with splice, we should probably add the patch to make
it work rather than wait for someone to complain. I just recently had to
fixup random/urandom for that because of a report, and I'd like to think
I have better things to do than deal with known fallout.

--
Jens Axboe


2023-02-10 22:51:58

by Jens Axboe

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On 2/10/23 3:41?PM, David Laight wrote:
> From: Linus Torvalds
>> Sent: 10 February 2023 17:24
> ...
>> And when it comes to networking, in general things like TCP checksums
>> etc should be ok even with data that isn't stable. When doing things
>> by hand, networking should always use the "copy-and-checksum"
>> functions that do the checksum while copying (so even if the source
>> data changes, the checksum is going to be the checksum for the data
>> that was copied).
>>
>> And in many (most?) smarter network cards, the card itself does the
>> checksum, again on the data as it is transferred from memory.
>>
>> So it's not like "networking needs a stable source" is some really
>> _fundamental_ requirement for things like that to work.
>
> It is also worth remembering that TCP needs to be able
> to retransmit the data and a much later time.
> So the application must not change the data until it has
> been acked by the remote system.

This has been covered, and:

> I don't think io_uring has any way to indicate anything
> other than 'the data has been accepted by the socket'.

This is wrong and has also been covered.

--
Jens Axboe


2023-02-11 03:20:37

by Ming Lei

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 10, 2023 at 02:08:35PM -0800, Linus Torvalds wrote:
> On Fri, Feb 10, 2023 at 1:51 PM Jens Axboe <[email protected]> wrote:
> >
> > Speaking of splice/io_uring, Ming posted this today:
> >
> > https://lore.kernel.org/io-uring/[email protected]/
>
> Ugh. Some of that is really ugly. Both 'ignore_sig' and
> 'ack_page_consuming' just look wrong. Pure random special cases.
>
> And that 'ignore_sig' is particularly ugly, since the only thing that
> sets it also sets SPLICE_F_NONBLOCK.
>
> And the *only* thing that actually then checks that field is
> 'splice_from_pipe_next()', where there are exactly two
> signal_pending() checks that it adds to, and
>
> (a) the first one is to protect from endless loops
>
> (b) the second one is irrelevant when SPLICE_F_NONBLOCK is set
>
> So honestly, just NAK on that series.
>
> I think that instead of 'ignore_sig' (which shouldn't exist), that
> first 'signal_pending()' check in splice_from_pipe_next() should just
> be changed into a 'fatal_signal_pending()'.

Good point, here the signal is often from task_work_add() called by
io_uring.

>
> But that 'ack_page_consuming' thing looks even more disgusting, and
> since I'm not sure why it even exists, I don't know what it's doing
> wrong.

The motivation is for confirming that if the produced buffer can be used
for READ or WRITE. Another way could be to add PIPE_BUF_FLAG_MAY_READ[WRITE].

thanks,
Ming


2023-02-11 06:18:23

by Ming Lei

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Sat, Feb 11, 2023 at 11:18:38AM +0800, Ming Lei wrote:
> On Fri, Feb 10, 2023 at 02:08:35PM -0800, Linus Torvalds wrote:
> > On Fri, Feb 10, 2023 at 1:51 PM Jens Axboe <[email protected]> wrote:
> > >
> > > Speaking of splice/io_uring, Ming posted this today:
> > >
> > > https://lore.kernel.org/io-uring/[email protected]/
> >
> > Ugh. Some of that is really ugly. Both 'ignore_sig' and
> > 'ack_page_consuming' just look wrong. Pure random special cases.
> >
> > And that 'ignore_sig' is particularly ugly, since the only thing that
> > sets it also sets SPLICE_F_NONBLOCK.
> >
> > And the *only* thing that actually then checks that field is
> > 'splice_from_pipe_next()', where there are exactly two
> > signal_pending() checks that it adds to, and
> >
> > (a) the first one is to protect from endless loops
> >
> > (b) the second one is irrelevant when SPLICE_F_NONBLOCK is set
> >
> > So honestly, just NAK on that series.
> >
> > I think that instead of 'ignore_sig' (which shouldn't exist), that
> > first 'signal_pending()' check in splice_from_pipe_next() should just
> > be changed into a 'fatal_signal_pending()'.
>
> Good point, here the signal is often from task_work_add() called by
> io_uring.
>
> >
> > But that 'ack_page_consuming' thing looks even more disgusting, and
> > since I'm not sure why it even exists, I don't know what it's doing
> > wrong.
>
> The motivation is for confirming that if the produced buffer can be used
> for READ or WRITE. Another way could be to add PIPE_BUF_FLAG_MAY_READ[WRITE].

BTW, I meant the added flags are source/sink private flags, which are
not used by generic pipe/splice code, just used by the actual source and
sink subsystem.

thanks,
Ming


2023-02-11 14:13:54

by Jens Axboe

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On 2/10/23 8:18?PM, Ming Lei wrote:
> On Fri, Feb 10, 2023 at 02:08:35PM -0800, Linus Torvalds wrote:
>> On Fri, Feb 10, 2023 at 1:51 PM Jens Axboe <[email protected]> wrote:
>>>
>>> Speaking of splice/io_uring, Ming posted this today:
>>>
>>> https://lore.kernel.org/io-uring/[email protected]/
>>
>> Ugh. Some of that is really ugly. Both 'ignore_sig' and
>> 'ack_page_consuming' just look wrong. Pure random special cases.
>>
>> And that 'ignore_sig' is particularly ugly, since the only thing that
>> sets it also sets SPLICE_F_NONBLOCK.
>>
>> And the *only* thing that actually then checks that field is
>> 'splice_from_pipe_next()', where there are exactly two
>> signal_pending() checks that it adds to, and
>>
>> (a) the first one is to protect from endless loops
>>
>> (b) the second one is irrelevant when SPLICE_F_NONBLOCK is set
>>
>> So honestly, just NAK on that series.
>>
>> I think that instead of 'ignore_sig' (which shouldn't exist), that
>> first 'signal_pending()' check in splice_from_pipe_next() should just
>> be changed into a 'fatal_signal_pending()'.
>
> Good point, here the signal is often from task_work_add() called by
> io_uring.

Usually you'd use task_sigpending() to distinguis the two, but
fatal_signal_pending() as Linus suggests would also work. The only
concern here is that since you'll be potentially blocking on waiting for
the pipe to be readable - if task does indeed have task_work pending and
that very task_work is the one that will ensure that the pipe is now
readable, then you're waiting condition will never be satisfied.

--
Jens Axboe


2023-02-11 15:06:41

by Ming Lei

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Sat, Feb 11, 2023 at 07:13:44AM -0700, Jens Axboe wrote:
> On 2/10/23 8:18?PM, Ming Lei wrote:
> > On Fri, Feb 10, 2023 at 02:08:35PM -0800, Linus Torvalds wrote:
> >> On Fri, Feb 10, 2023 at 1:51 PM Jens Axboe <[email protected]> wrote:
> >>>
> >>> Speaking of splice/io_uring, Ming posted this today:
> >>>
> >>> https://lore.kernel.org/io-uring/[email protected]/
> >>
> >> Ugh. Some of that is really ugly. Both 'ignore_sig' and
> >> 'ack_page_consuming' just look wrong. Pure random special cases.
> >>
> >> And that 'ignore_sig' is particularly ugly, since the only thing that
> >> sets it also sets SPLICE_F_NONBLOCK.
> >>
> >> And the *only* thing that actually then checks that field is
> >> 'splice_from_pipe_next()', where there are exactly two
> >> signal_pending() checks that it adds to, and
> >>
> >> (a) the first one is to protect from endless loops
> >>
> >> (b) the second one is irrelevant when SPLICE_F_NONBLOCK is set
> >>
> >> So honestly, just NAK on that series.
> >>
> >> I think that instead of 'ignore_sig' (which shouldn't exist), that
> >> first 'signal_pending()' check in splice_from_pipe_next() should just
> >> be changed into a 'fatal_signal_pending()'.
> >
> > Good point, here the signal is often from task_work_add() called by
> > io_uring.
>
> Usually you'd use task_sigpending() to distinguis the two, but
> fatal_signal_pending() as Linus suggests would also work. The only
> concern here is that since you'll be potentially blocking on waiting for
> the pipe to be readable - if task does indeed have task_work pending and
> that very task_work is the one that will ensure that the pipe is now
> readable, then you're waiting condition will never be satisfied.

The 2nd signal_pending() will break the loop to get task_work handled,
so it is safe to only change the 1st one to fatal_signal_pending().


Thanks,
Ming


2023-02-11 15:33:23

by Jens Axboe

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On 2/11/23 8:05 AM, Ming Lei wrote:
> On Sat, Feb 11, 2023 at 07:13:44AM -0700, Jens Axboe wrote:
>> On 2/10/23 8:18?PM, Ming Lei wrote:
>>> On Fri, Feb 10, 2023 at 02:08:35PM -0800, Linus Torvalds wrote:
>>>> On Fri, Feb 10, 2023 at 1:51 PM Jens Axboe <[email protected]> wrote:
>>>>>
>>>>> Speaking of splice/io_uring, Ming posted this today:
>>>>>
>>>>> https://lore.kernel.org/io-uring/[email protected]/
>>>>
>>>> Ugh. Some of that is really ugly. Both 'ignore_sig' and
>>>> 'ack_page_consuming' just look wrong. Pure random special cases.
>>>>
>>>> And that 'ignore_sig' is particularly ugly, since the only thing that
>>>> sets it also sets SPLICE_F_NONBLOCK.
>>>>
>>>> And the *only* thing that actually then checks that field is
>>>> 'splice_from_pipe_next()', where there are exactly two
>>>> signal_pending() checks that it adds to, and
>>>>
>>>> (a) the first one is to protect from endless loops
>>>>
>>>> (b) the second one is irrelevant when SPLICE_F_NONBLOCK is set
>>>>
>>>> So honestly, just NAK on that series.
>>>>
>>>> I think that instead of 'ignore_sig' (which shouldn't exist), that
>>>> first 'signal_pending()' check in splice_from_pipe_next() should just
>>>> be changed into a 'fatal_signal_pending()'.
>>>
>>> Good point, here the signal is often from task_work_add() called by
>>> io_uring.
>>
>> Usually you'd use task_sigpending() to distinguis the two, but
>> fatal_signal_pending() as Linus suggests would also work. The only
>> concern here is that since you'll be potentially blocking on waiting for
>> the pipe to be readable - if task does indeed have task_work pending and
>> that very task_work is the one that will ensure that the pipe is now
>> readable, then you're waiting condition will never be satisfied.
>
> The 2nd signal_pending() will break the loop to get task_work handled,
> so it is safe to only change the 1st one to fatal_signal_pending().

OK, but then the ignore_sig change should not be needed at all, just
changing that first bit to fatal_signal_pending() would do the trick?

--
Jens Axboe



2023-02-11 18:58:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Sat, Feb 11, 2023 at 7:33 AM Jens Axboe <[email protected]> wrote:
>
> OK, but then the ignore_sig change should not be needed at all, just
> changing that first bit to fatal_signal_pending() would do the trick?

Right. That was my point. The 'ignore_sig' flag just doesn't make
sense. It was a hack for a case that shouldn't exist.

Linus

2023-02-12 02:46:24

by Jens Axboe

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On 2/11/23 11:57 AM, Linus Torvalds wrote:
> On Sat, Feb 11, 2023 at 7:33 AM Jens Axboe <[email protected]> wrote:
>>
>> OK, but then the ignore_sig change should not be needed at all, just
>> changing that first bit to fatal_signal_pending() would do the trick?
>
> Right. That was my point. The 'ignore_sig' flag just doesn't make
> sense. It was a hack for a case that shouldn't exist.

Yep, just wanted to confirm that we'd _only_ do that first one and
not go to sleep later on ignoring a signal as that could lead to
issues. Your fatal signal pending suggestion is all we need.

--
Jens Axboe



2023-02-13 09:08:41

by Herbert Xu

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

Linus Torvalds <[email protected]> wrote:
>
> - if (!(sk->sk_route_caps & NETIF_F_SG))
> + if (!(sk->sk_route_caps & NETIF_F_SG) ||
> + !(sk->sk_route_caps & (NETIF_F_HW_CSUM | NETIF_F_IP_CSUM)))
> return sock_no_sendpage_locked(sk, page, offset, size, flags);

NETIF_F_SG depends on checksum offload so it should already be
calling sock_no_sendpage_locked if checksum offload is not
supported.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2023-02-13 09:44:23

by Herbert Xu

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

David Laight <[email protected]> wrote:
>
> It is also worth remembering that TCP needs to be able
> to retransmit the data and a much later time.
> So the application must not change the data until it has
> been acked by the remote system.

I don't believe changing the data on retransmit violates any
standards. Correct me if I'm wrong :)
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2023-02-13 09:44:32

by Herbert Xu

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

Linus Torvalds <[email protected]> wrote:
>
> Ok, so I decided to try to take a look.
>
> Somebody who actually does networking (and drivers in particular)
> should probably check this, but it *looks* like the IPv4 TCP case
> (just to pick the ony I looked at) gores through
> tcp_sendpage_locked(), which does
>
> if (!(sk->sk_route_caps & NETIF_F_SG))
> return sock_no_sendpage_locked(sk, page, offset, size, flags);
>
> which basically says "if you can't handle fragmented socket buffers,
> do that 'no_sendpage' case".
>
> So that will basically end up just falling back to a kernel
> 'sendmsg()', which does a copy and then it's stable.
>
> But for the networks that *can* handle fragmented socket buffers, it
> then calls do_tcp_sendpages() instead, which just creates a skb
> fragment of the page (with tcp_build_frag()).
>
> I wonder if that case should just require NETIF_F_HW_CSUM?

NETIF_F_SG already depends on checksum offload (either via
NETIF_F_HW_CSUM or something else that is equivalent).

So are you guys just imagining non-existant problems?

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2023-02-13 09:45:14

by Herbert Xu

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

Dave Chinner <[email protected]> wrote:
>
> IOWs, the application does not care if the data changes whilst they
> are in transport attached to the pipe - it only cares that the
> contents are stable once they have been delivered and are now wholly
> owned by the network stack IO path so that the OTW encodings
> (checksum, encryption, whatever) done within the network IO path
> don't get compromised.

Is this even a real problem? The network stack doesn't care at
all if you modify the pages while it's being processed. All the
things you've mentioned (checksum, encryption, etc.) will be
self-consistent on the wire.

Even when actual hardware offload is involved it's hard to see how
things could possibly go wrong unless the hardware was going out of
its way to do the wrong thing by fetching from memory twice.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2023-02-13 18:01:55

by Andy Lutomirski

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Mon, Feb 13, 2023 at 1:45 AM Herbert Xu <[email protected]> wrote:
>
> Dave Chinner <[email protected]> wrote:
> >
> > IOWs, the application does not care if the data changes whilst they
> > are in transport attached to the pipe - it only cares that the
> > contents are stable once they have been delivered and are now wholly
> > owned by the network stack IO path so that the OTW encodings
> > (checksum, encryption, whatever) done within the network IO path
> > don't get compromised.
>
> Is this even a real problem? The network stack doesn't care at
> all if you modify the pages while it's being processed. All the
> things you've mentioned (checksum, encryption, etc.) will be
> self-consistent on the wire.
>
> Even when actual hardware offload is involved it's hard to see how
> things could possibly go wrong unless the hardware was going out of
> its way to do the wrong thing by fetching from memory twice.
>

There's a difference between "kernel speaks TCP (or whatever)
correctly" and "kernel does what the application needs it to do".
When I write programs that send data on the network, I want the kernel
to send the data that I asked it to send. As a silly but obvious
example, suppose I have two threads, and all I/O is blocking
(O_NONBLOCK is not set, etc):

char buffer[1024] = "A";

Thread A:
send(fd, buffer, 1, 0);

Thread B:
mb();
buffer[0] = 'B';
mb();


Obviously, there are three possible valid outcomes: Thread A can go
first (send returns before B changes the buffer), and 'A' gets sent.
Thread B can go first (the buffer is changed before send() starts),
and 'B' gets sent. Or both can run concurrently, in which case the
data sent is indeterminate.

But it is not valid for send() to return, then the buffer to change,
and 'B' to get sent, just like:

char foo[] = "A";
send(fd, foo, 1, 0);
foo[0] = 'B';

must send 'A', not 'B'.

The trouble with splice() is that there is no clear point at which the
splice is complete and the data being sent is committed. I don't
think user applications need the data committed particularly quickly,
but I do think it needs to be committed "eventually* and there needs
to be a point at which the application knows it's been committed.
Right now, if a user program does:

Write 'A' to a file
splice that file to a pipe
splice that pipe to a socket
... wait until when? ...
Write 'B' to a file

There is nothing the user program can wait for to make sure that 'A'
gets sent, but saying that the kernel speaks TCP correctly without
solving this problem doesn't actually solve the problem.

2023-02-14 01:22:31

by Herbert Xu

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Mon, Feb 13, 2023 at 10:01:27AM -0800, Andy Lutomirski wrote:
>
> There's a difference between "kernel speaks TCP (or whatever)
> correctly" and "kernel does what the application needs it to do".

Sure I get where you are coming from. It's just that the other
participants in the discussion were thinking of stability for the
sake of TCP (or TLS or some other protocol the kernel implements)
and that simply is a non-issue.

Having a better way to communicate completion to the user would be
nice. The only way to do it right now seems to be polling with
SIOCOUTQ.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2023-02-17 23:15:20

by Andy Lutomirski

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

> On Feb 13, 2023, at 5:22 PM, Herbert Xu <[email protected]> wrote:
>
> On Mon, Feb 13, 2023 at 10:01:27AM -0800, Andy Lutomirski wrote:
>>
>> There's a difference between "kernel speaks TCP (or whatever)
>> correctly" and "kernel does what the application needs it to do".
>
> Sure I get where you are coming from. It's just that the other
> participants in the discussion were thinking of stability for the
> sake of TCP (or TLS or some other protocol the kernel implements)
> and that simply is a non-issue.

I can certainly imagine TLS or similar protocols breaking if data
changes if the implementation is too clever and retransmission
happens. Suppose 2000 bytes are sent via splice using in-kernel TLS,
and it goes out on the wire as two TCP segments. The first segment is
dropped but the second is received. The kernel resends the first
segment using different data. This really ought to cause an integrity
check at the far end to fail.

I don't know if any existing kTLS is clever enough to regenerate
outgoing data when it needs to retransmit a segment, but it would be
an interesting optimization for serving static content over TLS.



>
> Having a better way to communicate completion to the user would be
> nice. The only way to do it right now seems to be polling with
> SIOCOUTQ.
>
>

2023-02-20 04:55:20

by Herbert Xu

[permalink] [raw]
Subject: Re: copy on write for splice() from file to pipe?

On Fri, Feb 17, 2023 at 03:13:14PM -0800, Andy Lutomirski wrote:
>
> I can certainly imagine TLS or similar protocols breaking if data
> changes if the implementation is too clever and retransmission
> happens. Suppose 2000 bytes are sent via splice using in-kernel TLS,
> and it goes out on the wire as two TCP segments. The first segment is
> dropped but the second is received. The kernel resends the first
> segment using different data. This really ought to cause an integrity
> check at the far end to fail.

The TLS layer is completely separate from TCP so it's like any
normal TCP user from user-space. IOW the encrypted data will be
held by TCP until acknowledged so during retransmission it will
simply resend the previously encrypted data rather than encrypting
the same data twice.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt