2009-11-05 10:07:59

by Max Kellermann

[permalink] [raw]
Subject: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced

When splicing a large amount of bytes from a TCP socket to a pipe
(more than PIPE_BUFFERS), splice() can block, even though the pipe was
empty. The correct behavior would be to copy as much as possible, and
return without blocking. Block only if nothing can be transferred.
When the destination pipe is (initially) writable, splice() should do
the same with or without SPLICE_F_NONBLOCK.

The cause is the loop in tcp_splice_read(): it calls
__tcp_splice_read() (and thus skb_splice_bits() and splice_to_pipe())
again and again, until the requested number of bytes has been
transferred, or an error occurs. In the first iteration, up to 64 kB
is copied, and the second iteration will block, because
splice_to_pipe() is called again and sees the pipe is already full.

This patch adds SPLICE_F_NONBLOCK to the splice flags after the first
iteration has finished successfully. This prevents the second
splice_to_pipe() call from blocking. The resulting EAGAIN error is
handled gracefully, and tcp_splice_read() returns the number of bytes
successfully moved.
---

net/ipv4/tcp.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9114524..0f8b01f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -628,6 +628,11 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
(sk->sk_shutdown & RCV_SHUTDOWN) ||
signal_pending(current))
break;
+
+ /* the following splice_to_pipe() calls should not
+ block, because we have already successfully
+ transferred at least one buffer */
+ tss.flags |= SPLICE_F_NONBLOCK;
}

release_sock(sk);


2009-11-05 10:11:54

by Max Kellermann

[permalink] [raw]
Subject: Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced

Hi,

here is a small test program for the bug. The last splice() blocks.
Interestingly, if you close the client socket before splice(), it does
not block, and the number of bytes transferred is smaller.

In my patch submission, I forgot the Signed-off-by line - please use
the attached patch file instead.

Max


Attachments:
(No filename) (312.00 B)
splice-block.c (3.69 kB)
0001-tcp-set-SPLICE_F_NONBLOCK-after-first-buffer-has-be.patch (1.88 kB)
Download all attachments

2009-11-05 10:30:32

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced

Max Kellermann a écrit :
> When splicing a large amount of bytes from a TCP socket to a pipe
> (more than PIPE_BUFFERS), splice() can block, even though the pipe was
> empty. The correct behavior would be to copy as much as possible, and
> return without blocking. Block only if nothing can be transferred.
> When the destination pipe is (initially) writable, splice() should do
> the same with or without SPLICE_F_NONBLOCK.
>
> The cause is the loop in tcp_splice_read(): it calls
> __tcp_splice_read() (and thus skb_splice_bits() and splice_to_pipe())
> again and again, until the requested number of bytes has been
> transferred, or an error occurs. In the first iteration, up to 64 kB
> is copied, and the second iteration will block, because
> splice_to_pipe() is called again and sees the pipe is already full.
>
> This patch adds SPLICE_F_NONBLOCK to the splice flags after the first
> iteration has finished successfully. This prevents the second
> splice_to_pipe() call from blocking. The resulting EAGAIN error is
> handled gracefully, and tcp_splice_read() returns the number of bytes
> successfully moved.
> ---
>
> net/ipv4/tcp.c | 5 +++++
> 1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 9114524..0f8b01f 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -628,6 +628,11 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
> (sk->sk_shutdown & RCV_SHUTDOWN) ||
> signal_pending(current))
> break;
> +
> + /* the following splice_to_pipe() calls should not
> + block, because we have already successfully
> + transferred at least one buffer */
> + tss.flags |= SPLICE_F_NONBLOCK;
> }
>
> release_sock(sk);
>

CC netdev

I dont think this patch is correct. Could you describe your use case ?

If you dont want to block on output pipe, you should set this NONBLOCK
flag before calling splice(SPLICE_F_NONBLOCK) syscall.

ie : Use the socket in blocking mode, but output pipe in non-blocking mode.

Some application could have a thread working in full blocking mode, and have another
thread reading the pipe (and eventually unblocking first thread)

Thanks

2009-11-05 10:57:40

by Max Kellermann

[permalink] [raw]
Subject: Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced

On 2009/11/05 11:30, Eric Dumazet <[email protected]> wrote:
> I dont think this patch is correct. Could you describe your use case ?

See my second email, there's a demo source.

> If you dont want to block on output pipe, you should set this NONBLOCK
> flag before calling splice(SPLICE_F_NONBLOCK) syscall.
>
> ie : Use the socket in blocking mode, but output pipe in non-blocking mode.

Do you think that a splice() should block if the socket is readable
and the pipe is writable according to select()?

"The correct behavior would be to copy as much as possible, and return
without blocking. Block only if nothing can be transferred."

Do you disagree with that?

> Some application could have a thread working in full blocking mode,
> and have another thread reading the pipe (and eventually unblocking
> first thread)

I don't get this objection. Please explain.

2009-11-05 11:22:02

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced

Max Kellermann a ?crit :
> On 2009/11/05 11:30, Eric Dumazet <[email protected]> wrote:
>> I dont think this patch is correct. Could you describe your use case ?
>
> See my second email, there's a demo source.
>
>> If you dont want to block on output pipe, you should set this NONBLOCK
>> flag before calling splice(SPLICE_F_NONBLOCK) syscall.
>>
>> ie : Use the socket in blocking mode, but output pipe in non-blocking mode.
>
> Do you think that a splice() should block if the socket is readable
> and the pipe is writable according to select()?
>

Yes, this is perfectly legal

select() can return "OK to write on fd",
and still, write(fd, buffer, 10000000) is supposer/allowed to block if fd is not O_NDELAY

If you want to not block on fd, use O_NDELAY (if using write() syscall),
or SPLICE_F_NONBLOCK splice() flag ?

Please read recent commit on this area and why I think your patch conflicts with
this commit.

commit 42324c62704365d6a3e89138dea55909d2f26afe
Author: Eric Dumazet <[email protected]>
Date: Thu Oct 1 15:26:00 2009 -0700

net: splice() from tcp to pipe should take into account O_NONBLOCK

tcp_splice_read() doesnt take into account socket's O_NONBLOCK flag

Before this patch :

splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE);
causes a random endless block (if pipe is full) and
splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
will return 0 immediately if the TCP buffer is empty.

User application has no way to instruct splice() that socket should be in blocking mode
but pipe in nonblock more.

Many projects cannot use splice(tcp -> pipe) because of this flaw.

http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEAD
http://lkml.indiana.edu/hypermail/linux/kernel/0807.2/0687.html

Linus introduced SPLICE_F_NONBLOCK in commit 29e350944fdc2dfca102500790d8ad6d6ff4f69d
(splice: add SPLICE_F_NONBLOCK flag )

It doesn't make the splice itself necessarily nonblocking (because the
actual file descriptors that are spliced from/to may block unless they
have the O_NONBLOCK flag set), but it makes the splice pipe operations
nonblocking.

Linus intention was clear : let SPLICE_F_NONBLOCK control the splice pipe mode only

This patch instruct tcp_splice_read() to use the underlying file O_NONBLOCK
flag, as other socket operations do.
Users will then call :

splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK );

to block on data coming from socket (if file is in blocking mode),
and not block on pipe output (to avoid deadlock)

First version of this patch was submitted by Octavian Purdila

Reported-by: Volker Lendecke <[email protected]>
Reported-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Octavian Purdila <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Jens Axboe <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

2009-11-05 13:23:44

by Max Kellermann

[permalink] [raw]
Subject: Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced

On 2009/11/05 12:21, Eric Dumazet <[email protected]> wrote:
> Max Kellermann a ?crit :
> > Do you think that a splice() should block if the socket is readable
> > and the pipe is writable according to select()?
> >
>
> Yes, this is perfectly legal
>
> select() can return "OK to write on fd",
> and still, write(fd, buffer, 10000000) is supposer/allowed to block if fd is not O_NDELAY

>From the select() manpage: "those in writefds will be watched to see
if a write will not block"

>From the poll() manpage: "Writing now will not block."

This looks unambiguous to me, and contradicts with your thesis. Can
you provide sources?

What is your interpretation of the guarantees provided by select() and
poll()? Which byte count is "ok" to write after POLLOUT, and how much
is "too much"? How does the application know?

> Please read recent commit on this area and why I think your patch
> conflicts with this commit.

I understand your patch, but I don't understand the conflict with my
patch. Can you describe a breakage caused by my patch?

2009-11-05 14:11:55

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced

Max Kellermann a ?crit :
> On 2009/11/05 12:21, Eric Dumazet <[email protected]> wrote:
>> Max Kellermann a ?crit :
>>> Do you think that a splice() should block if the socket is readable
>>> and the pipe is writable according to select()?
>>>
>> Yes, this is perfectly legal
>>
>> select() can return "OK to write on fd",
>> and still, write(fd, buffer, 10000000) is supposer/allowed to block if fd is not O_NDELAY
>
>>From the select() manpage: "those in writefds will be watched to see
> if a write will not block"
>
>>From the poll() manpage: "Writing now will not block."
>
> This looks unambiguous to me, and contradicts with your thesis. Can
> you provide sources?
>
> What is your interpretation of the guarantees provided by select() and
> poll()? Which byte count is "ok" to write after POLLOUT, and how much
> is "too much"? How does the application know?

It cannot, therefore an application uses O_NDELAY to avoid blocking.

Try following program if you are not convinced

#include <unistd.h>
#include <sys/poll.h>
#include <stdio.h>

char buffer[1000000];

int main(int argc, char *argv[])
{
int fds[2];
struct pollfd pfd;
int res;

pipe(fds);
pfd.fd = fds[1];
pfd.events = POLLOUT;
res = poll(&pfd, 1, -1);
if (res > 0 && pfd.revents & POLLOUT)
printf("OK to write on pipe\n");
write(fds[1], buffer, sizeof(buffer)); // why it blocks, did poll() lied ???
return 0;
}



> I understand your patch, but I don't understand the conflict with my
> patch. Can you describe a breakage caused by my patch?

I only pointed out that using splice(tcp -> pipe) and blocking on pipe
_can_ block, even on _first_ frame received from tcp, as you discovered.


Your only choices to avoid a deadlock are :
1) to use SPLICE_F_NONBLOCK.
2) Using a second thread to read the pipe and empty it. First thread will
happily transfert 1000000 bytes in one syscall...
3) or limit your splice(... len, flags) length to 16 (16 buffers of one byte
in pathological cases)

Your patch basically makes SPLICE_F_NONBLOCK option always set (choice 1) above)

So users wanting option 3) are stuck. You force them to use a poll()/select()
thing while they dont want to poll : They have a producer thread(s), and a consumer
thread(s).

producer()
{
while (1)
splice(tcp, &offset, pfds[1], NULL, 10000000,
SPLICE_F_MORE | SPLICE_F_MOVE);
}

Why in the first place have an option if it is always set ?

2009-11-05 14:33:13

by Max Kellermann

[permalink] [raw]
Subject: Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced

On 2009/11/05 15:11, Eric Dumazet <[email protected]> wrote:
> It cannot, therefore an application uses O_NDELAY to avoid blocking.
>
> Try following program if you are not convinced

Indeed, I'm surprised by the result rendered by the Linux kernel.
That however still contradicts with the poll() documentation.

So this boils down to the question: kernel bug or documentation bug?

http://www.opengroup.org/onlinepubs/9699919799/functions/pselect.html

"A descriptor shall be considered ready for writing when a call to an
output function with O_NONBLOCK clear would not block, whether or not
the function would transfer data successfully."

There is no size limit mentioned here. Your program reveals that the
kernel violates this definition.

> Your patch basically makes SPLICE_F_NONBLOCK option always set
> (choice 1) above)
[...]
> Why in the first place have an option if it is always set ?

It is not, you misunderstood my patch. If there's no room in the pipe
buffer, then the first iteration of the "while" loop will block (as
usual). *After* the first iteration has finished (and at least one
buffer has been moved already), the flag is set, and further
iterations will not block.