2021-05-07 18:23:57

by Kees Cook

[permalink] [raw]
Subject: Re: splice() from /dev/zero to a pipe does not work (5.9+)

On Fri, May 07, 2021 at 07:05:51PM +0100, Colin Ian King wrote:
> Hi,
>
> While doing some micro benchmarking with stress-ng I discovered that
> since linux 5.9 the splicing from /dev/zero to a pipe now fails with
> -EINVAL.
>
> I bisected this down to the following commit:
>
> 36e2c7421f02a22f71c9283e55fdb672a9eb58e7 is the first bad commit
> commit 36e2c7421f02a22f71c9283e55fdb672a9eb58e7
> Author: Christoph Hellwig <[email protected]>
> Date: Thu Sep 3 16:22:34 2020 +0200
>
> fs: don't allow splice read/write without explicit ops
>
> I'm not sure if this has been reported before, or if it's intentional
> behavior or not. As it stands, it's a regression in the stress-ng splice
> test case.

The general loss of generic splice read/write is known. Here's some
early conversations I was involved in:

https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/202009181443.C2179FB@keescook/
https://lore.kernel.org/lkml/[email protected]/

And it's been getting re-implemented in individual places:

$ git log --oneline --no-merges --grep 36e2c742
42984af09afc jffs2: Hook up splice_write callback
a35d8f016e0b nilfs2: make splice write available again
f8ad8187c3b5 fs/pipe: allow sendfile() to pipe again
f2d6c2708bd8 kernfs: wire up ->splice_read and ->splice_write
9bb48c82aced tty: implement write_iter
dd78b0c483e3 tty: implement read_iter
14e3e989f6a5 proc mountinfo: make splice available again
c1048828c3db orangefs: add splice file operations
960f4f8a4e60 fs: 9p: add generic splice_write file operation
cf03f316ad20 fs: 9p: add generic splice_read file operations
06a17bbe1d47 afs: Fix copy_file_range()

So the question is likely, "do we want this for /dev/zero?"

--
Kees Cook


2021-05-07 19:08:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: splice() from /dev/zero to a pipe does not work (5.9+)

On Fri, May 7, 2021 at 11:21 AM Kees Cook <[email protected]> wrote:
>
> So the question is likely, "do we want this for /dev/zero?"

Well, /dev/zero should at least be safe, and I guess it's actually
interesting from a performance testing standpoint (ie useful for some
kind of "what is the overhead of the splice code with no data copy").

So I'll happily take a sane patch for /dev/zero, although I think it
probably only makes sense if it's made to use the zero page explicitly
(ie exactly for that "no data copy testing" case).

So very much *not* using generic_file_splice_read(), even if that
might be the one-liner.

/dev/zero should probably also use the (already existing)
splice_write_null() function for the .splice_write case.

Anybody willing to look into this? My gu feel is that it *should* be easy to do.

That said - looking at the current 'pipe_zero()', it uses
'push_pipe()' to actually allocation regular pages, and then clear
them.

Which is basically what a generic_file_splice_read() would do, and it
feels incredibly pointless and stupid to me.

I *think* we should be able to just do something like

len = size;
while (len > 0) {
struct pipe_buffer *buf;
unsigned int tail = pipe->tail;
unsigned int head = pipe->head;
unsigned int mask = pipe->ring_size - 1;

if (pipe_full(head, tail, pipe->max_usage))
break;
buf = &pipe->bufs[iter_head & p_mask];
buf->ops = &zero_pipe_buf_ops;
buf->page = ZERO_PAGE(0);
buf->offset = 0;
buf->len = min_t(ssize_t, len, PAGE_SIZE);
len -= buf->len;
pipe->head = head+1;
}
return size - len;

but honestly, I haven't thought a lot about it.

Al? This is another of those "right up your alley" things.

Maybe it's not worth it, and just using generic_file_splice_read() is
the way to go, but I do get the feeling that if we are splicing
/dev/null, the whole _point_ of it is about benchmarking, not "make it
work".

Linus

2021-05-07 19:18:49

by Al Viro

[permalink] [raw]
Subject: Re: splice() from /dev/zero to a pipe does not work (5.9+)

On Fri, May 07, 2021 at 12:06:31PM -0700, Linus Torvalds wrote:

> That said - looking at the current 'pipe_zero()', it uses
> 'push_pipe()' to actually allocation regular pages, and then clear
> them.
>
> Which is basically what a generic_file_splice_read() would do, and it
> feels incredibly pointless and stupid to me.
>
> I *think* we should be able to just do something like
>
> len = size;
> while (len > 0) {
> struct pipe_buffer *buf;
> unsigned int tail = pipe->tail;
> unsigned int head = pipe->head;
> unsigned int mask = pipe->ring_size - 1;
>
> if (pipe_full(head, tail, pipe->max_usage))
> break;
> buf = &pipe->bufs[iter_head & p_mask];
> buf->ops = &zero_pipe_buf_ops;
> buf->page = ZERO_PAGE(0);
> buf->offset = 0;
> buf->len = min_t(ssize_t, len, PAGE_SIZE);
> len -= buf->len;
> pipe->head = head+1;
> }
> return size - len;
>
> but honestly, I haven't thought a lot about it.
>
> Al? This is another of those "right up your alley" things.

Umm... That would do wonders to anything that used to do
copy_to_user()/clear_user()/copy_to_user() and got converted
to copy_to_iter()/iov_iter_zero()/copy_to_iter()...

Are you sure we can shove zero page into pipe, anyway?
IIRC, get_page()/put_page() on that is not allowed, and
I'm not at all sure that nothing in e.g. fuse splice-related
logics would go ahead an do just that. Or am I confused
about the page refcounting for those?

2021-05-07 19:31:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: splice() from /dev/zero to a pipe does not work (5.9+)

On Fri, May 7, 2021 at 12:17 PM Al Viro <[email protected]> wrote:
>
> Umm... That would do wonders to anything that used to do
> copy_to_user()/clear_user()/copy_to_user() and got converted
> to copy_to_iter()/iov_iter_zero()/copy_to_iter()...

I didn't mean for iov_iter_zero doing this - only splice_read_zero().

> Are you sure we can shove zero page into pipe, anyway?
> IIRC, get_page()/put_page() on that is not allowed,

That's what the

buf->ops = &zero_pipe_buf_ops;

is for. The zero_pipe_buf_ops would have empty get and release
functions, and a 'steal' function that always returns false.

That's how the pipe pages are supposed to work: there are people who
put non-page data (ie things like skbuff allocations etc) into a
splice pipe buffer. It's why we have those "ops" pointers.

Linus

2021-05-07 20:33:15

by Al Viro

[permalink] [raw]
Subject: Re: splice() from /dev/zero to a pipe does not work (5.9+)

On Fri, May 07, 2021 at 12:29:44PM -0700, Linus Torvalds wrote:
> On Fri, May 7, 2021 at 12:17 PM Al Viro <[email protected]> wrote:
> >
> > Umm... That would do wonders to anything that used to do
> > copy_to_user()/clear_user()/copy_to_user() and got converted
> > to copy_to_iter()/iov_iter_zero()/copy_to_iter()...
>
> I didn't mean for iov_iter_zero doing this - only splice_read_zero().
>
> > Are you sure we can shove zero page into pipe, anyway?
> > IIRC, get_page()/put_page() on that is not allowed,
>
> That's what the
>
> buf->ops = &zero_pipe_buf_ops;
>
> is for. The zero_pipe_buf_ops would have empty get and release
> functions, and a 'steal' function that always returns false.
>
> That's how the pipe pages are supposed to work: there are people who
> put non-page data (ie things like skbuff allocations etc) into a
> splice pipe buffer. It's why we have those "ops" pointers.

Supposed to - sure, but I'd like to verify that they actually do work
that way before we go there. Let me RTFS a bit...

2021-05-12 15:31:37

by Colin King

[permalink] [raw]
Subject: Re: splice() from /dev/zero to a pipe does not work (5.9+)

On 07/05/2021 19:21, Kees Cook wrote:
> On Fri, May 07, 2021 at 07:05:51PM +0100, Colin Ian King wrote:
>> Hi,
>>
>> While doing some micro benchmarking with stress-ng I discovered that
>> since linux 5.9 the splicing from /dev/zero to a pipe now fails with
>> -EINVAL.
>>
>> I bisected this down to the following commit:
>>
>> 36e2c7421f02a22f71c9283e55fdb672a9eb58e7 is the first bad commit
>> commit 36e2c7421f02a22f71c9283e55fdb672a9eb58e7
>> Author: Christoph Hellwig <[email protected]>
>> Date: Thu Sep 3 16:22:34 2020 +0200
>>
>> fs: don't allow splice read/write without explicit ops
>>
>> I'm not sure if this has been reported before, or if it's intentional
>> behavior or not. As it stands, it's a regression in the stress-ng splice
>> test case.
>
> The general loss of generic splice read/write is known. Here's some
> early conversations I was involved in:
>
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/202009181443.C2179FB@keescook/
> https://lore.kernel.org/lkml/[email protected]/
>
> And it's been getting re-implemented in individual places:
>
> $ git log --oneline --no-merges --grep 36e2c742
> 42984af09afc jffs2: Hook up splice_write callback
> a35d8f016e0b nilfs2: make splice write available again
> f8ad8187c3b5 fs/pipe: allow sendfile() to pipe again
> f2d6c2708bd8 kernfs: wire up ->splice_read and ->splice_write
> 9bb48c82aced tty: implement write_iter
> dd78b0c483e3 tty: implement read_iter
> 14e3e989f6a5 proc mountinfo: make splice available again
> c1048828c3db orangefs: add splice file operations
> 960f4f8a4e60 fs: 9p: add generic splice_write file operation
> cf03f316ad20 fs: 9p: add generic splice_read file operations
> 06a17bbe1d47 afs: Fix copy_file_range()

Ah..so this explains why copy_file_range() also returns -EINVAL now on
some file systems, such a minix since that uses splicing too via
do_splice_direct(). :-/

>
> So the question is likely, "do we want this for /dev/zero?"
>