On Mon, Jun 26, 2023 at 05:56:28PM +0200, Christian Brauner wrote:
> I mean, one workaround would probably be poll() even with O_NONBLOCK but
> I get why that's annoying and half of a solution.
poll() doesn't really change anything since it turns into a hotloop of
.revents=POLLHUP with a disconnected writer, cf.
https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#m747e2bbd0c5cffb6baaf1955f6f8b0d97e216839
The only apparently-supported way to poll pipes under linux is to
sleep()/read(O_NONBLOCK) like in the bad old days.
And even if that was a working work-around, the fundamental problem of
./spl>fifo excluding all other access to fifo is quite unfortunate too.
> tl;dr it by splicing from a regular file to a pipe where the regular
> file in splice isn't O_NONBLOCK
(Most noticeable when the "regular file" is a tty and thus the I/O never
completes by design.)
On Mon, 26 Jun 2023 at 09:14, Ahelenia Ziemiańska
<[email protected]> wrote:
>
> And even if that was a working work-around, the fundamental problem of
> ./spl>fifo excluding all other access to fifo is quite unfortunate too.
So I already sent an earlier broken version of this patch to Ahelenia
and Christian, but here's an actually slightly tested version with the
obvious bugs fixed.
The keyword here being "obvious". It's probably broken in many
non-obvious ways, but I'm sending it out in case anybody wants to play
around with it.
It boots for me. It even changes behavior of programs that splice()
and used to keep the pipe lock over the IO and no longer do.
But it might do unspeakable things to your pets, so caveat emptor. It
"feels" right to me. But it's a rather quick hack, and really needs
more eyes and more thought. I mention O_NDELAY in the (preliminary)
commit message, but there are probably other issues that need thinking
about.
Linus
On Thu, Jul 06, 2023 at 02:56:45PM -0700, Linus Torvalds wrote:
> On Mon, 26 Jun 2023 at 09:14, Ahelenia Ziemiańska
> <[email protected]> wrote:
> >
> > And even if that was a working work-around, the fundamental problem of
> > ./spl>fifo excluding all other access to fifo is quite unfortunate too.
>
> So I already sent an earlier broken version of this patch to Ahelenia
> and Christian, but here's an actually slightly tested version with the
> obvious bugs fixed.
>
> The keyword here being "obvious". It's probably broken in many
> non-obvious ways, but I'm sending it out in case anybody wants to play
> around with it.
>
> It boots for me. It even changes behavior of programs that splice()
> and used to keep the pipe lock over the IO and no longer do.
>
> But it might do unspeakable things to your pets, so caveat emptor. It
> "feels" right to me. But it's a rather quick hack, and really needs
> more eyes and more thought. I mention O_NDELAY in the (preliminary)
> commit message, but there are probably other issues that need thinking
> about.
Forgot to say, fwiw, I've been running this through the LTP splice,
pipe, and ipc tests without issues. A hanging reader can be signaled
away cleanly with this.
On Fri, 7 Jul 2023 at 10:21, Christian Brauner <[email protected]> wrote:
>
> Forgot to say, fwiw, I've been running this through the LTP splice,
> pipe, and ipc tests without issues. A hanging reader can be signaled
> away cleanly with this.
So that patch still has a couple of "wait for this" cases remaining.
In particular, when we do a read, and we do have pipe buffers, both
the read() system call and a number of internal splice functions will
go "Ahh, I have data", and then do pipe_buf_confirm() and read it.
Which then results in pipe_buf_confirm() blocking. It now blocks
interruptibly, which is much nicer, but several of these users *could*
just do a non-blocking confirmation instead, and wait for pipe
readability.
HOWEVER, that's slightly less trivial than you'd expect, because the
"wait for readability" needs to be done without the pipe lock held -
so you can't actually check the pipe buffer state at that point (since
you need the pipe lock to look up the buffer).
That's true even of "trivial" cases like actual user-space "read()
with O_NONBLOCK and poll()" situations.
Now, the solution to all this is *fairly* straightforward:
(a) don't use "!pipe_empty()" for a readability check.
We already have "pipe_readable()", but it's hidden in fs/pipe.c,
so all the splice() code ended up writing the "does this pipe have
data" using "!pipe_empty()" instead.
(b) make "pipe_buf_confirm()" take a "non-blocking" boolean argument,
and if it is non-blocking but hits one of those blocked pages, set
"pipe->not_ready", and return -EAGAIN.
This is ok, because "pipe_buf_confirm()" is always under the pipe
lock, and we'll just clear "pipe->not_ready" under the pipe lock after
finalizing all those pages (and before waking up readers)
(c) make "pipe_wait_readable()" and "poll()" know about this all, so
that we wait properly for a pipe that was not ready to become ready
This all makes *most* users deal properly with these blocking events.
In particular, things like splice_to_socket() can now do the whole
proper "wait without holding the pipe lock" sequence, even when the
pipe is not empty, just in this blocked state.
This *may* also make all the cases Jens had with io_uring and splicing
JustWork(tm).
NOTE! NOTE! NOTE! Once more, this "feels right to me", and I'd argue
that the basic approach is fairly straightfoward. The patch is also
not horrendous. It all makes a fair amount of sense. BUT! I haven't
tested this, and like the previous patch, I really would want people
to think about this a lot.
Comments? Jens?
Linus
On 7/7/23 1:10?PM, Linus Torvalds wrote:
> On Fri, 7 Jul 2023 at 10:21, Christian Brauner <[email protected]> wrote:
>>
>> Forgot to say, fwiw, I've been running this through the LTP splice,
>> pipe, and ipc tests without issues. A hanging reader can be signaled
>> away cleanly with this.
>
> So that patch still has a couple of "wait for this" cases remaining.
>
> In particular, when we do a read, and we do have pipe buffers, both
> the read() system call and a number of internal splice functions will
> go "Ahh, I have data", and then do pipe_buf_confirm() and read it.
>
> Which then results in pipe_buf_confirm() blocking. It now blocks
> interruptibly, which is much nicer, but several of these users *could*
> just do a non-blocking confirmation instead, and wait for pipe
> readability.
>
> HOWEVER, that's slightly less trivial than you'd expect, because the
> "wait for readability" needs to be done without the pipe lock held -
> so you can't actually check the pipe buffer state at that point (since
> you need the pipe lock to look up the buffer).
>
> That's true even of "trivial" cases like actual user-space "read()
> with O_NONBLOCK and poll()" situations.
>
> Now, the solution to all this is *fairly* straightforward:
>
> (a) don't use "!pipe_empty()" for a readability check.
>
> We already have "pipe_readable()", but it's hidden in fs/pipe.c,
> so all the splice() code ended up writing the "does this pipe have
> data" using "!pipe_empty()" instead.
>
> (b) make "pipe_buf_confirm()" take a "non-blocking" boolean argument,
> and if it is non-blocking but hits one of those blocked pages, set
> "pipe->not_ready", and return -EAGAIN.
>
> This is ok, because "pipe_buf_confirm()" is always under the pipe
> lock, and we'll just clear "pipe->not_ready" under the pipe lock after
> finalizing all those pages (and before waking up readers)
>
> (c) make "pipe_wait_readable()" and "poll()" know about this all, so
> that we wait properly for a pipe that was not ready to become ready
>
> This all makes *most* users deal properly with these blocking events.
> In particular, things like splice_to_socket() can now do the whole
> proper "wait without holding the pipe lock" sequence, even when the
> pipe is not empty, just in this blocked state.
>
> This *may* also make all the cases Jens had with io_uring and splicing
> JustWork(tm).
Exactly! I was reading this thread with excitement just now, would be
nice to get rid of that kludge.
> NOTE! NOTE! NOTE! Once more, this "feels right to me", and I'd argue
> that the basic approach is fairly straightfoward. The patch is also
> not horrendous. It all makes a fair amount of sense. BUT! I haven't
> tested this, and like the previous patch, I really would want people
> to think about this a lot.
>
> Comments? Jens?
I'll take a closer look at this, but won't be until Monday most likely.
But the approach seems sane, and going in a more idiomatic direction
than before. So seems promising.
--
Jens Axboe
On Fri, Jul 07, 2023 at 12:10:36PM -0700, Linus Torvalds wrote:
> On Fri, 7 Jul 2023 at 10:21, Christian Brauner <[email protected]> wrote:
> > Forgot to say, fwiw, I've been running this through the LTP splice,
> > pipe, and ipc tests without issues. A hanging reader can be signaled
> > away cleanly with this.
> NOTE! NOTE! NOTE! Once more, this "feels right to me", and I'd argue
> that the basic approach is fairly straightfoward. The patch is also
> not horrendous. It all makes a fair amount of sense. BUT! I haven't
> tested this, and like the previous patch, I really would want people
> to think about this a lot.
>
> Comments? Jens?
I applied the patch upthread + this diff to 4f6b6c2b2f86b7878a770736bf478d8a263ff0bc;
during test setup I got a null deref (building defconfig minus graphics).
Reproducible, full BUG dump attached; trace of
[ 149.878931] <TASK>
[ 149.879533] ? __die+0x1e/0x60
[ 149.880309] ? page_fault_oops+0x17c/0x470
[ 149.881313] ? search_module_extables+0x14/0x50
[ 149.882422] ? exc_page_fault+0x67/0x150
[ 149.883397] ? asm_exc_page_fault+0x26/0x30
[ 149.884426] ? __pfx_pipe_to_null+0x10/0x10
[ 149.885451] ? splice_from_pipe_next+0x129/0x150
[ 149.886580] __splice_from_pipe+0x39/0x1c0
[ 149.887594] ? __pfx_pipe_to_null+0x10/0x10
[ 149.888615] ? __pfx_pipe_to_null+0x10/0x10
[ 149.889635] splice_from_pipe+0x5c/0x90
[ 149.890579] do_splice+0x35c/0x840
[ 149.891407] __do_splice+0x1eb/0x210
[ 149.892176] __x64_sys_splice+0xad/0x120
[ 149.893019] do_syscall_64+0x3e/0x90
[ 149.893798] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
$ scripts/faddr2line vmlinux splice_from_pipe_next+0x129
splice_from_pipe_next+0x129/0x150:
pipe_buf_release at include/linux/pipe_fs_i.h:221
(inlined by) eat_empty_buffer at fs/splice.c:594
(inlined by) splice_from_pipe_next at fs/splice.c:640
I gamed this down to
echo c | grep c >/dev/null
where grep is
ii grep 3.8-5 amd64 GNU grep, egrep and fgrep
and strace of the same invocation (on the host) ends with
newfstatat(1, "", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x3), ...}, AT_EMPTY_PATH) = 0
newfstatat(AT_FDCWD, "/dev/null", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x3), ...}, 0) = 0
newfstatat(0, "", {st_mode=S_IFIFO|0600, st_size=0, ...}, AT_EMPTY_PATH) = 0
lseek(0, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
read(0, "c\n", 98304) = 2
splice(0, NULL, 1, NULL, 98304, SPLICE_F_MOVE) = 0
close(1) = 0
close(2) = 0
exit_group(0) = ?
+++ exited with 0 +++
And can also reproduce it with
echo | { read -r _; exec ./wr; } > /dev/null
(where ./wr is "while (splice(0, 0, 1, 0, 128 * 1024 * 1024, 0) > 0) {}").
However:
echo | ./wr > /dev/null
does NOT crash.
Besides that, this doesn't solve the original issue, inasmuch as
./v > fifo &
head fifo &
echo zupa > fifo
(where ./v splices from an empty pty to stdout; v.c attached)
echo still sleeps until ./v dies, though it also succumbs to ^C now.
"OTOH, on 4f6b6c2b2f86b7878a770736bf478d8a263ff0bc,
"timeout 10 ./v > fifo &" (then lines 2 and 3 as above) does
kill ./v -> unblock echo -> head copies "zupa",
i.e. life resumes as normal after the splicer went away.
With the patches, echo zupa is stuck forever (until you signal it)!
This is kinda worse.
On Fri, 7 Jul 2023 at 15:41, Ahelenia Ziemiańska
<[email protected]> wrote:
>
> (inlined by) eat_empty_buffer at fs/splice.c:594
Ahh, eat_empty_buffer() ends up releasing the buffer without waiting for it.
And the reason for that is actually somewhat interesting: we do have that
while (!pipe_readable(pipe)) {
..
above it, but the logic for this all is that pipes with pipe buffers
are by *default* considered readable until they try to actually
confirm the buffer, and at that point they might say "oh, I have to
return -EAGAIN and set 'not_ready'".
And that splice_from_pipe_next() doesn't do that.
End result: it will happily free that pipe buffer that is still in the
process of being filled.
The good news is that I think the fix is probably trivial. Something
like the attached?
Again - NOT TESTED.
> Besides that, this doesn't solve the original issue, inasmuch as
> ./v > fifo &
> head fifo &
> echo zupa > fifo
> (where ./v splices from an empty pty to stdout; v.c attached)
> echo still sleeps until ./v dies, though it also succumbs to ^C now.
Yeah, I concentrated on just making everything interruptible,
But the fact that the echo has to wait for the previous write to
finish is kind of fundamental. We can't just magically do writes out
of order. 'v' is busy writing to the fifo, we can't let some other
write just come in.
(We *could* make the splice in ./v not fill the whole pipe buffer, and
allow some other writes to fill in buffers afterwards, but at _some_
point you'll hit the "pipe buffers are full and busy, can't add any
more without waiting for them to empty").
One thing we could possibly do is to say that we just don't accept any
new writes if there are old busy splices in process. So we could make
new writes return -EBUSY or something, I guess.
Linus
On Thu, Jul 06, 2023 at 02:56:45PM -0700, Linus Torvalds wrote:
> +static int busy_pipe_buf_confirm(struct pipe_inode_info *pipe,
> + struct pipe_buffer *buf)
> +{
> + struct page *page = buf->page;
> +
> + if (folio_wait_bit_interruptible(page_folio(page), PG_locked))
> + return -EINTR;
Do we really want interruptible here rather than killable? That is,
do we want SIGWINCH or SIGALRM to result in a short read? I assume
it's OK to return a short read because userspace has explicitly asked
for O_NONBLOCK and can therefore be expected to actually check the
return value from read().
On Fri, 7 Jul 2023 at 17:00, Matthew Wilcox <[email protected]> wrote:
>
> Do we really want interruptible here rather than killable?
Yes, we actually do need to be just regular interruptible,
This is a bog-standard "IO to/from pipe" situation, which is interruptible.
> That is, do we want SIGWINCH or SIGALRM to result in a short read?
Now, that's a different issue, and is actually handled by the signal
layer: a signal that is ignored (where "ignored" includes the case of
"default handler") will be dropped early, exactly because we don't
want to interrupt things like tty or pipe reads when you resize the
window.
Of course, if you actually *catch* SIGWINCH, then you will get that
short read on a window change.
Linus
On Fri, 7 Jul 2023 at 17:07, Linus Torvalds
<[email protected]> wrote:
>
> > That is, do we want SIGWINCH or SIGALRM to result in a short read?
>
> Now, that's a different issue, and is actually handled by the signal
> layer: a signal that is ignored (where "ignored" includes the case of
> "default handler") will be dropped early, exactly because we don't
> want to interrupt things like tty or pipe reads when you resize the
> window.
In case you care, it's prepare_signal() -> sig_ignored() ->
sig_task_ignored() -> sig_handler_ignored() logic that does this
short-circuiting.
And while I don't think it's required by POSIX, this was definitely a
case where lots of programs that *don't* use any signal handlers at
all are very much not expecting to see -EINTR as a return value, and
used to break exactly on things like SIGWINCH when reading from a tty
or a pipe.
But that logic goes back to before linux-1.0.
In fact, I think it goes back to at least 0.99.10 (June '93).
Linus
On Fri, Jul 07, 2023 at 03:57:40PM -0700, Linus Torvalds wrote:
> On Fri, 7 Jul 2023 at 15:41, Ahelenia Ziemiańska
> <[email protected]> wrote:
> > (inlined by) eat_empty_buffer at fs/splice.c:594
> Ahh, eat_empty_buffer() ends up releasing the buffer without waiting for it.
>
> And the reason for that is actually somewhat interesting: we do have that
>
> while (!pipe_readable(pipe)) {
> ..
>
> above it, but the logic for this all is that pipes with pipe buffers
> are by *default* considered readable until they try to actually
> confirm the buffer, and at that point they might say "oh, I have to
> return -EAGAIN and set 'not_ready'".
>
> And that splice_from_pipe_next() doesn't do that.
>
> End result: it will happily free that pipe buffer that is still in the
> process of being filled.
>
> The good news is that I think the fix is probably trivial. Something
> like the attached?
>
> Again - NOT TESTED
Same reproducer, backtrace attached:
$ scripts/faddr2line vmlinux splice_from_pipe_next+0x6e
splice_from_pipe_next+0x6e/0x180:
pipe_buf_confirm at include/linux/pipe_fs_i.h:233
(inlined by) eat_empty_buffer at fs/splice.c:597
(inlined by) splice_from_pipe_next at fs/splice.c:647
Looks like the same place.
> > Besides that, this doesn't solve the original issue, inasmuch as
> > ./v > fifo &
> > head fifo &
> > echo zupa > fifo
> > (where ./v splices from an empty pty to stdout; v.c attached)
> > echo still sleeps until ./v dies, though it also succumbs to ^C now.
> Yeah, I concentrated on just making everything interruptible,
>
> But the fact that the echo has to wait for the previous write to
> finish is kind of fundamental. We can't just magically do writes out
> of order. 'v' is busy writing to the fifo, we can't let some other
> write just come in.
(It's no longer busy writing to it when it gets killed by timeout in my
second example, but echo still doesn't wake up.)
> (We *could* make the splice in ./v not fill the whole pipe buffer, and
> allow some other writes to fill in buffers afterwards, but at _some_
> point you'll hit the "pipe buffers are full and busy, can't add any
> more without waiting for them to empty").
You are, but, well, that's also the case when the pipe is full.
As it stands, the pipe is /empty/ and yet /no-one can write to it/.
This is the crux of the issue at hand.
(Coincidentally, you're describing what looks quite similar to pt 1.
from <naljsvzzemr6pjiwwcdpdsdh5vtycdr6fmi2fk2dlr4nn4kq6o@ycanbgxhslti>.)
I think we got away with it for so long because most files behave
like regular files/blockdevs and the read is always guaranteed to
complete ~instantly, but splice is, fundamentally, /not/ writing
the whole time because it's not /reading/ the whole time when it's
reading an empty socket/a chardev/whatever.
Or, rather: splice() from a non-seekable (non-mmap()pable?)
fundamentally doesn't really make much sense, because you're not
gluing a bit of the pro-verbial page cache (forgive me my term
abuse here, you see what I'm getting at tho) to the end of the pipe,
you're just telling the kernel to run a read()/write() loop for you.
sendfile() works around this by reading and then separately writing
to its special buffer (in the form of an anonymous process-local pipe).
splice() just raw-dogs the read with the write lock held,
but just doesn't check if it can do it.
That's how it's, honestly, shaking out to me: someone just forgot
a check the first time they wrote it.
Because the precondition for "does reading directly into the pipe
buffer make sense" is "is it directly equivalent to read(f)/write(p)",
and that holds only for seekables.
/Maybe/ for, like, sockets if there's already data, or as a special
case similar to pipe->pipe. But then for sockets you're already
using sendfile(), so?
To that end, I'm including a patch based on
4f6b6c2b2f86b7878a770736bf478d8a263ff0bc
that does just that: EINVAL.
Maybe if you're worried about breaking compatibility
(which idk if it's an issue since splice and sendfile
are temperamental w.r.t. what they take anyway across versions;
you need a read()/write() fallback anyway)
that case could become sendfile-like by first reading into a buffer
once then pipe->pipe splicing out of it separately.
Though, even the usual sendfile read/write loop only works on seekables.
> One thing we could possibly do is to say that we just don't accept any
> new writes if there are old busy splices in process. So we could make
> new writes return -EBUSY or something, I guess.
Same logic as above applies + no-one's really expecting EBUSY +
what /would/ you do about EBUSY in this case?
-- >8 --
From 552d93bee8e1b8333ce84ed0ca8490f6712c5b8b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ahelenia=20Ziemia=C5=84ska?=
<[email protected]>
Date: Sat, 8 Jul 2023 01:47:59 +0200
Subject: [PATCH] splice: file->pipe: -EINVAL if file non-seekable
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Both the logical semantic of "tie a page from the page cache to the pipe"
and the implemented semantic of "lock the pipe, then read into it"
(thus holding the write lock for as as long as the read is outstanding)
only make sense if the read is guaranteed to complete instantly.
This has been a long-standing omission and, when the read doesn't
immediately complete (because it's a tty, socket, &c.), the write lock
‒ which for pipes is a pipe-global mutex ‒ is held until,
thus excluding all mutual users of the pipe, until it does.
Refuse it. Use read()/write() in userspace instead of getting the kernel
to do it for you, badly, when there's no point to doing so.
Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wjEC_Rh8+-rtEi8C45upO-Ffw=M_i1211qS_3AvWZCbOg@mail.gmail.com/t/#u
Signed-off-by: Ahelenia Ziemiańska <[email protected]>
---
fs/splice.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/splice.c b/fs/splice.c
index 004eb1c4ce31..14cf3cea1091 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1309,6 +1309,8 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
if (opipe) {
if (off_out)
return -ESPIPE;
+ if (!(in->f_mode & FMODE_LSEEK))
+ return -EINVAL;
if (off_in) {
if (!(in->f_mode & FMODE_PREAD))
return -EINVAL;
--
2.39.2
On Fri, 7 Jul 2023 at 17:30, Ahelenia Ziemiańska
<[email protected]> wrote:
>
> Same reproducer, backtrace attached:
> $ scripts/faddr2line vmlinux splice_from_pipe_next+0x6e
> splice_from_pipe_next+0x6e/0x180:
> pipe_buf_confirm at include/linux/pipe_fs_i.h:233
Bah. I should have looked more closely at your case.
This is a buffer without an 'ops' pointer. So it looks like was
already released.
And the reason is that the pipe was readable because there were no
writers, and I had put the
if (!pipe->writers)
return 0;
check in splice_from_pipe_next() in the wrong place. It needs to be
*before* the eat_empty_buffer() call.
Duh.
Anyway, while I think that fixes your NULL pointer thing, having
looked at my original patch I realized that keeping the pointer to the
pipe buffer in copy_splice_read() across the dropping of the pipe lock
is completely broken.
I thought (because I didn't remember the code) that pipe resizing will
just copy the pipe buffer pointers around. That would have made it ok
to remember a pipe buffer pointer. But it is not so. It will actually
create new pipe buffers and copy the buffer contents around.
So while fixing your NULL pointer check should be trivial, I think
that first patch is actually fundamentally broken wrt pipe resizing,
and I see no really sane way to fix it. We could add a new lock just
for that, but I don't think it's worth it.
> You are, but, well, that's also the case when the pipe is full.
> As it stands, the pipe is /empty/ and yet /no-one can write to it/.
> This is the crux of the issue at hand.
No, I think you are mis-representing things. The pipe isn't empty.
It's full of things that just aren't finalized yet.
> Or, rather: splice() from a non-seekable (non-mmap()pable?)
Please stop with the non-seekable nonsense.
Any time I see a patch like this:
> + if (!(in->f_mode & FMODE_LSEEK))
> + return -EINVAL;
I will just go "that person is not competent".
This has absolutely nothing to do with seekability.
But it is possible that we need to just bite the bullet and say
"copy_splice_read() needs to use a non-blocking kiocb for the IO".
Of course, that then doesn't work, because while doing this is trivial:
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -364,6 +364,7 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos,
iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
init_sync_kiocb(&kiocb, in);
kiocb.ki_pos = *ppos;
+ kiocb.ki_flags |= IOCB_NOWAIT;
ret = call_read_iter(in, &kiocb, &to);
if (ret > 0) {
I suspectr you'll find that it makes no difference, because the tty
layer doesn't actually honor the IOCB_NOWAIT flag for various
historical reasons. In fact, the kiocb isn't even passed down to the
low-level routine, which only gets the 'struct file *', and instead it
looks at tty_io_nonblock(), which just does that legacy
file->f_flags & O_NONBLOCK
test.
I guess combined with something like
if (!(in->f_mode & FMODE_NOWAIT))
return -EINVAL;
it might all work.
Linus
On Sat, Jul 08, 2023 at 01:06:56PM -0700, Linus Torvalds wrote:
> On Fri, 7 Jul 2023 at 17:30, Ahelenia Ziemiańska
> <[email protected]> wrote:
> >
> > Same reproducer, backtrace attached:
> > $ scripts/faddr2line vmlinux splice_from_pipe_next+0x6e
> > splice_from_pipe_next+0x6e/0x180:
> > pipe_buf_confirm at include/linux/pipe_fs_i.h:233
> Bah. I should have looked more closely at your case.
>
> This is a buffer without an 'ops' pointer. So it looks like was
> already released.
>
> And the reason is that the pipe was readable because there were no
> writers, and I had put the
>
> if (!pipe->writers)
> return 0;
>
> check in splice_from_pipe_next() in the wrong place. It needs to be
> *before* the eat_empty_buffer() call.
>
> Anyway, while I think that fixes your NULL pointer thing,
It does.
> So while fixing your NULL pointer check should be trivial, I think
> that first patch is actually fundamentally broken wrt pipe resizing,
> and I see no really sane way to fix it. We could add a new lock just
> for that, but I don't think it's worth it.
>
> > You are, but, well, that's also the case when the pipe is full.
> > As it stands, the pipe is /empty/ and yet /no-one can write to it/.
> > This is the crux of the issue at hand.
> No, I think you are mis-representing things. The pipe isn't empty.
> It's full of things that just aren't finalized yet.
Being full of no data (as part of some hidden state)
doesn't make it any less empty, but meh; neither here not there.
> > Or, rather: splice() from a non-seekable (non-mmap()pable?)
> Please stop with the non-seekable nonsense.
>
> Any time I see a patch like this:
>
> > + if (!(in->f_mode & FMODE_LSEEK))
> > + return -EINVAL;
>
> I will just go "that person is not competent".
Accurate assessment.
> This has absolutely nothing to do with seekability.
Yes, and as noted, I was using it as a stand-in for "I/O won't
block" due to the above (and splice_direct_to_actor() already uses
it). Glad to see you've managed to synthesise my drivel into something
workable.
> But it is possible that we need to just bite the bullet and say
> "copy_splice_read() needs to use a non-blocking kiocb for the IO".
>
> Of course, that then doesn't work, because while doing this is trivial:
>
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -364,6 +364,7 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos,
> iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
> init_sync_kiocb(&kiocb, in);
> kiocb.ki_pos = *ppos;
> + kiocb.ki_flags |= IOCB_NOWAIT;
> ret = call_read_iter(in, &kiocb, &to);
>
> if (ret > 0) {
>
> I suspectr you'll find that it makes no difference, because the tty
> layer doesn't actually honor the IOCB_NOWAIT flag for various
> historical reasons.
Indeed, neither when splicing from a tty,
nor from a socket (same setup but socketpair(AF_UNIX, SOCK_STREAM, 0); w.c).
> In fact, the kiocb isn't even passed down to the
> low-level routine, which only gets the 'struct file *', and instead it
> looks at tty_io_nonblock(), which just does that legacy
>
> file->f_flags & O_NONBLOCK
>
> test.
>
> I guess combined with something like
>
> if (!(in->f_mode & FMODE_NOWAIT))
> return -EINVAL;
>
> it might all work.
Yes, that makes the splice instantly -EINVAL for ttys, but doesn't
affect the socketpair() case above, which still blocks forever.
This appears to be because vfs_splice_read() does
if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host))
return copy_splice_read(in, ppos, pipe, len, flags);
return in->f_op->splice_read(in, ppos, pipe, len, flags);
so the c_s_r() isn't even called for the socket, which uses
unix_stream_splice_read().
Thus,
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 123b35ddfd71..384d5a479b4a 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2880,15 +2880,12 @@ static ssize_t unix_stream_splice_read(struct socket *sock, loff_t *ppos,
.pipe = pipe,
.size = size,
.splice_flags = flags,
+ .flags = MSG_DONTWAIT,
};
if (unlikely(*ppos))
return -ESPIPE;
- if (sock->file->f_flags & O_NONBLOCK ||
- flags & SPLICE_F_NONBLOCK)
- state.flags = MSG_DONTWAIT;
-
return unix_stream_read_generic(&state, false);
}
makes the splice -EAGAIN w/o data and splice whatever's in the socket
when there is data.
git grep '\.splice_read.*=' | cut -d= -f2 | tr -s ',;[:space:]' '\n' | sort -u
gives me 27 distinct splice_read implementations and 1 cocci config.
These are simple delegations:
nfs_file_splice_read filemap_splice_read
afs_file_splice_read filemap_splice_read
ceph_splice_read filemap_splice_read
ecryptfs_splice_read_update_atime filemap_splice_read
ext4_file_splice_read filemap_splice_read
f2fs_file_splice_read filemap_splice_read
ntfs_file_splice_read filemap_splice_read
ocfs2_file_splice_read filemap_splice_read
orangefs_file_splice_read filemap_splice_read
v9fs_file_splice_read filemap_splice_read
xfs_file_splice_read filemap_splice_read
zonefs_file_splice_read filemap_splice_read
sock_splice_read copy_splice_read or a socket-specific one
coda_file_splice_read vfs_splice_read
ovl_splice_read vfs_splice_read
vfs_splice_read calls copy_splice_read (not used as a .splice_read).
And the rest are:
01. copy_splice_read you've fixed
02. filemap_splice_read is, as I understand it, only applicable to
files/blockdevs and already has the required semantics?
03. unix_stream_splice_read can be fixed as above
04. fuse_dev_splice_read allocates a buffer and fuse_dev_do_read()s into
it, fuse_dev_do_read does
if (file->f_flags & O_NONBLOCK)
return -EAGAIN;
so this is likely a similarly easy fix
05. tracing_buffers_splice_read, when it doesn't read anything does
ret = -EAGAIN;
if ((file->f_flags & O_NONBLOCK) || (flags & SPLICE_F_NONBLOCK))
goto out;
and waits for at least one thing to read;
can probably just goto out instantly
06. tracing_splice_read_pipe delegates to whatever it's tracing, and
errors if that errored, so it's fine(?)
07. shmem_file_splice_read is always nonblocking
08. relay_file_splice_read only checks SPLICE_F_NONBLOCK to turn 0 into
-EAGAIN so I think it also just doesn't block
09. smc_splice_read falls back to an embedded socket's splice_read,
or uses
if (flags & SPLICE_F_NONBLOCK)
flags = MSG_DONTWAIT;
else
flags = 0;
SMC_STAT_INC(smc, splice_cnt);
rc = smc_rx_recvmsg(smc, NULL, pipe, len, flags);
so also probably remove the condition
10. kcm_splice_read:
10a. passes flags (SPLICE_F_...) to skb_recv_datagram(), which does
timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
verbatim!? and forwards them to try_recv which also checks
for socket-style bits
10b. give it MSG_DONTWAIT, call it a day?
10c. it also passes flags to skb_splice_bits() to actually splice to the
pipe, but that discards flags, so no change needed
11. tls_sw_splice_read I don't really understand but it does
err = tls_rx_reader_lock(sk, ctx, flags & SPLICE_F_NONBLOCK);
and
err = tls_rx_rec_wait(sk, NULL, flags & SPLICE_F_NONBLOCK,
true);
(and skb_splice_bits()) so make them both true?
12. tcp_splice_read uses skb_splice_bits() and timeout governed by
timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK);
and re-tries on empty read if timeo!=0 (i.e. !(sock->file->f_flags & O_NONBLOCK));
so turning that into true (timeo = 0) and propagating that would
make it behave accd'g to the new semantic
Would that make sense?
On Sun, Jul 09, 2023 at 03:03:22AM +0200, Ahelenia Ziemiańska wrote:
> On Sat, Jul 08, 2023 at 01:06:56PM -0700, Linus Torvalds wrote:
> > I guess combined with something like
> >
> > if (!(in->f_mode & FMODE_NOWAIT))
> > return -EINVAL;
> >
> > it might all work.
> Yes, that makes the splice instantly -EINVAL for ttys, but doesn't
> affect the socketpair() case above, which still blocks forever.
This also triggers for regular file -> pipe splices,
which is probably a no-go.
Attaching a summary diff that does all I said in the previous mail.
filemap_get_pages() does use and inspect IOCB_NOWAIT if set in
filemap_splice_read(), but it appears to not really make much sense,
inasmuch it returns EAGAIN for the first splice() from a
blockdev and then instantly return with data on the next go-around.
This doesn't really make much sense ‒ and open(2) says blockdevs
don't have O_NONBLOCK semantics, so I'm assuming this is not supposed
to be exposed to userspace ‒ so I'm not setting it in the diff.
I've tested this for:
* tty: -EINVAL
* socketpair(AF_UNIX, SOCK_STREAM): works as expected
$ wc -c fifo &
[1] 261
$ ./af_unix ./s > fifo
5 Success
6454 Resource temporarily unavailable
5 Success
6445 Resource temporarily unavailable
0 Success
10 fifo
* socket(AF_INET, SOCK_STREAM, 0): works as expected
$ wc fifo &
[1] 249
$ ./tcp ./s > fifo
23099 Resource temporarily unavailable
7 Success
2099 Resource temporarily unavailable
4 Success
1751 Resource temporarily unavailable
3 Success
21655 Resource temporarily unavailable
95 Success
19589 Resource temporarily unavailable
0 Success
4 15 109 fifo
corresponding to
host$ nc 127.0.0.1 14640
żupan
asd
ad
asdda dasj aiojd askdl; jasiopdij as[pkdo[p askd9p ias90dk aso[pjd 890pasid90[ jaskl;dj il;asd
^C
* blockdev (/dev/vda): as expected (with filemap_splice_read() unchanged), copies it all
* regular file: -EINVAL(!)
Splicers still sleep if the pipe's full, of course,
unless SPLICE_F_NONBLOCK.
Test drivers attached as well.
On Mon, Jul 10, 2023 at 12:33:07AM +0200, Ahelenia Ziemiańska wrote:
> On Sun, Jul 09, 2023 at 03:03:22AM +0200, Ahelenia Ziemiańska wrote:
> > On Sat, Jul 08, 2023 at 01:06:56PM -0700, Linus Torvalds wrote:
> > > I guess combined with something like
> > >
> > > if (!(in->f_mode & FMODE_NOWAIT))
> > > return -EINVAL;
> > >
> > > it might all work.
> > Yes, that makes the splice instantly -EINVAL for ttys, but doesn't
> > affect the socketpair() case above, which still blocks forever.
> This also triggers for regular file -> pipe splices,
> which is probably a no-go.
Actually, that's only the case for regular files on some filesystems?
I originally tested on tmpfs, and now on vfat, ramfs, procfs, and sysfs,
and none have FMODE_NOWAIT set.
Conversely, ext4 and XFS files both have FMODE_NOWAIT set,
and behave like blockdevs incl. the filemap_splice_read() oddities below.
Indeed, it looks like Some filesystems
(btrfs/ext4/f2fs/ocfs2/xfs, blockdevs, /dev/{null,zero,random,urandom},
pipes, tun/tap)
set FMODE_NOWAIT, but that's by far not All of them, so maybe
/* File is capable of returning -EAGAIN if I/O will block */
is not the right check for regular files.
> filemap_get_pages() does use and inspect IOCB_NOWAIT if set in
> filemap_splice_read(), but it appears to not really make much sense,
> inasmuch it returns EAGAIN for the first splice() from a
> blockdev and then instantly return with data on the next go-around.
Indeed, this is inconsistent to both:
* readv2(off=-1, RWF_NOWAIT), which always returns EAGAIN, and
* fcntl(0, F_SETFL, O_NONBLOCK), read(), which always reads.
Thus,
> This doesn't really make much sense ‒ and open(2) says blockdevs
> don't have O_NONBLOCK semantics, so I'm assuming this is not supposed
> to be exposed to userspace ‒ so I'm not setting it in the diff.
not specifying IOCB_NOWAIT in filemap_splice_read() provides consistent
semantics to "file is read as-if it had O_NONBLOCK set".
I've tentatively updated the check to
if (!((in->f_mode & FMODE_NOWAIT) || S_ISREG(in->f_inode->i_mode)))
(with the reasoning, as previously, that regular files don't /have/ a
distinct O_NONBLOCK mode, because they always behave non-blockingly)
and with that
> I've tested this for:
* regular file: works as expected