2009-01-08 10:48:30

by Volker Lendecke

Subject: maximum buffer size for splice(2) tcp->pipe?

Hi!

While implementing splice support in Samba for better
performance, I found that splice() blocks when trying to pull
data off tcp into a pipe when the recvq is full. Attached is a
test program that shows this behaviour. On another host I
started

netcat 192.168.19.10 4711 < /dev/zero

vlendec@lenny:~$ uname -a
Linux lenny 2.6.28-06857-g5cbd04a #7 Wed Jan 7 10:10:42 CET 2009 x86_64 GNU/Linux
vlendec@lenny:~$ gcc -o splicetest /host/home/vlendec/splicetest.c -O3 -Wall
vlendec@lenny:~$ ./splicetest out 65536 &
[1] 697
vlendec@lenny:~$ strace -p 697
Process 697 attached - interrupt to quit
splice(0x3, 0, 0x5, 0, 0x56a0, 0x1) = 22176
splice(0x7, 0, 0x4, 0, 0x10000, 0x1^C <unfinished ...>
Process 697 detached
vlendec@lenny:~$ netstat -nt | grep 4711
tcp 69272 0 192.168.19.10:4711 192.168.19.1:33773 ESTABLISHED
vlendec@lenny:~$

Interestingly, whenever I start the strace, it gets another
chunk of data and then blocks in the next splice call.

If I start splicetest with a buffer size of 16384 instead of
65536, it does not block. I could not find a way to ask the
kernel for the tipping point below which it does not block.

What is a safe buffer size to use with splice?

BTW, this kernel is from Steve French's linux-cifs.git repo.

Thanks,

Volker Lendecke

Samba Team

P.S: I'm not subscribed to linux-kernel, so if possible
please CC me directly. If this is inappropriate behaviour,
please give me a quick hint :-)

--
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen



2009-01-13 20:37:41

by Andrew Morton

Subject: Re: maximum buffer size for splice(2) tcp->pipe?

(cc's added)

On Thu, 8 Jan 2009 11:13:51 +0100
Volker Lendecke <[email protected]> wrote:

> Hi!
>
> While implementing splice support in Samba for better
> performance I found it blocking when trying to pull data off
> tcp into a pipe when the recvq was full. Attached find a
> test program that shows this behaviour, on another host I
> started
>
> netcat 192.168.19.10 4711 < /dev/zero
>
> vlendec@lenny:~$ uname -a
> Linux lenny 2.6.28-06857-g5cbd04a #7 Wed Jan 7 10:10:42 CET 2009 x86_64 GNU/Linux
> vlendec@lenny:~$ gcc -o splicetest /host/home/vlendec/splicetest.c -O3 -Wall
> vlendec@lenny:~$ ./splicetest out 65536 &
> [1] 697
> vlendec@lenny:~$ strace -p 697
> Process 697 attached - interrupt to quit
> splice(0x3, 0, 0x5, 0, 0x56a0, 0x1) = 22176
> splice(0x7, 0, 0x4, 0, 0x10000, 0x1^C <unfinished ...>
> Process 697 detached
> vlendec@lenny:~$ netstat -nt | grep 4711
> tcp 69272 0 192.168.19.10:4711 192.168.19.1:33773 ESTABLISHED
> vlendec@lenny:~$
>
> Interestingly, whenever I start the strace, it gets another
> chunk of data and then blocks in the next splice call.
>
> If I start splicetest with a buffer size of 16384 instead of
> 65536, it does not block. I could not find a way to ask the
> kernel for the tipping point below which it does not block.
>
> What is a safe buffer size to use with splice?
>
> BTW, this kernel is from Steve French's linux-cifs.git repo.
>
> Thanks,
>
> Volker Lendecke
>
> Samba Team
>
> P.S: I'm not subscribed to linux-kernel, so if possible
> please CC me directly. If this is inappropriate behaviour,
> please give me a quick hint :-)
>

2009-01-13 23:16:23

by Eric Dumazet

Subject: Re: maximum buffer size for splice(2) tcp->pipe?

Andrew Morton wrote:
> (cc's added)
>
> On Thu, 8 Jan 2009 11:13:51 +0100
> Volker Lendecke <[email protected]> wrote:
>
>> Hi!
>>
>> While implementing splice support in Samba for better
>> performance I found it blocking when trying to pull data off
>> tcp into a pipe when the recvq was full. Attached find a
>> test program that shows this behaviour, on another host I
>> started
>>
>> netcat 192.168.19.10 4711 < /dev/zero
>>
>> vlendec@lenny:~$ uname -a
>> Linux lenny 2.6.28-06857-g5cbd04a #7 Wed Jan 7 10:10:42 CET 2009 x86_64 GNU/Linux
>> vlendec@lenny:~$ gcc -o splicetest /host/home/vlendec/splicetest.c -O3 -Wall
>> vlendec@lenny:~$ ./splicetest out 65536 &
>> [1] 697
>> vlendec@lenny:~$ strace -p 697
>> Process 697 attached - interrupt to quit
>> splice(0x3, 0, 0x5, 0, 0x56a0, 0x1) = 22176
>> splice(0x7, 0, 0x4, 0, 0x10000, 0x1^C <unfinished ...>

Volker, is your splice() a blocking one, from a tcp socket to a pipe?

If no other thread is reading the pipe, you might block forever
in splice_to_pipe() as soon as the pipe is full (16 pages).

As pages are not necessarily full (each skb uses at least one page, even if
its length is small), it is not really possible to use splice() like this.

In your case, the only safe way with the current kernel would be to call splice()
asking for no more than 16 bytes, which would be really insane for your needs.
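
To make the worst case concrete, here is a small arithmetic sketch. It assumes 16 pipe buffers and 4096-byte pages (the figures for this kernel on x86_64) and a simplified model in which every skb occupies exactly one pipe buffer. The helper names are hypothetical. The 1386-byte payload is only an inference: the 22176 bytes in the strace output happen to equal 16 x 1386, which would fit 16 buffers holding one segment each, but the thread does not state the segment size.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for a 2.6.28-era x86_64 kernel, not queried at run time. */
enum { PIPE_BUFFERS_ASSUMED = 16, PAGE_SIZE_ASSUMED = 4096 };

/* Pipe buffers consumed by `len` bytes arriving in skbs of `skb_payload`
 * bytes each, one buffer per skb (hypothetical helper, simplified model). */
static size_t pipe_bufs_needed(size_t len, size_t skb_payload)
{
    assert(skb_payload > 0 && skb_payload <= PAGE_SIZE_ASSUMED);
    return (len + skb_payload - 1) / skb_payload;   /* round up */
}

/* Bytes held by the pipe once every buffer is in use under that model. */
static size_t pipe_bytes_when_full(size_t skb_payload)
{
    assert(skb_payload > 0 && skb_payload <= PAGE_SIZE_ASSUMED);
    return (size_t)PIPE_BUFFERS_ASSUMED * skb_payload;
}
```

Under this model a 65536-byte request needs more than 16 buffers for any payload below a full page, while a 16384-byte request with 1386-byte segments needs only 12, which would be consistent with the smaller size not blocking. With 1-byte skbs the pipe is full after 16 bytes, hence the figure above.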

You may prefer a non-blocking mode, at least when calling splice_to_pipe().

Maybe the SPLICE_F_NONBLOCK splice() flag should only apply to the pipe side.
tcp_splice_read() should not use this flag to select a blocking/nonblocking
mode on the source socket, but the underlying file flag.

This way, your program could leave the socket in blocking mode, yet call splice()
with the SPLICE_F_NONBLOCK flag so as not to block on the pipe.

2009-01-13 23:39:16

by Eric Dumazet

Subject: Re: maximum buffer size for splice(2) tcp->pipe?

Eric Dumazet wrote:
> Andrew Morton wrote:
>> (cc's added)
>>
>> On Thu, 8 Jan 2009 11:13:51 +0100
>> Volker Lendecke <[email protected]> wrote:
>>
>>> Hi!
>>>
>>> While implementing splice support in Samba for better
>>> performance I found it blocking when trying to pull data off
>>> tcp into a pipe when the recvq was full. Attached find a
>>> test program that shows this behaviour, on another host I
>>> started
>>>
>>> netcat 192.168.19.10 4711 < /dev/zero
>>>
>>> vlendec@lenny:~$ uname -a
>>> Linux lenny 2.6.28-06857-g5cbd04a #7 Wed Jan 7 10:10:42 CET 2009 x86_64 GNU/Linux
>>> vlendec@lenny:~$ gcc -o splicetest /host/home/vlendec/splicetest.c -O3 -Wall
>>> vlendec@lenny:~$ ./splicetest out 65536 &
>>> [1] 697
>>> vlendec@lenny:~$ strace -p 697
>>> Process 697 attached - interrupt to quit
>>> splice(0x3, 0, 0x5, 0, 0x56a0, 0x1) = 22176
>>> splice(0x7, 0, 0x4, 0, 0x10000, 0x1^C <unfinished ...>
>
> Volker, is your splice() a blocking one, from a tcp socket to a pipe?
>
> If no other thread is reading the pipe, you might block forever
> in splice_to_pipe() as soon as the pipe is full (16 pages).
>
> As pages are not necessarily full (each skb uses at least one page, even if
> its length is small), it is not really possible to use splice() like this.
>
> In your case, the only safe way with the current kernel would be to call splice()
> asking for no more than 16 bytes, which would be really insane for your needs.
>
> You may prefer a non-blocking mode, at least when calling splice_to_pipe().
>
> Maybe the SPLICE_F_NONBLOCK splice() flag should only apply to the pipe side.
> tcp_splice_read() should not use this flag to select a blocking/nonblocking
> mode on the source socket, but the underlying file flag.
>
> This way, your program could leave the socket in blocking mode, yet call splice()
> with the SPLICE_F_NONBLOCK flag so as not to block on the pipe.
>

This patch, coupled with the previous one from Willy Tarreau
(tcp: splice as many packets as possible at once),
gives the expected result.

[PATCH] net: splice() from tcp to socket should take into account O_NONBLOCK

Instead of using SPLICE_F_NONBLOCK to select a non blocking mode both on
source tcp socket and pipe destination, we use the underlying file flag (O_NONBLOCK)
for selecting a non blocking socket.

Signed-off-by: Eric Dumazet <[email protected]>

diff --git a/include/linux/net.h b/include/linux/net.h
index 4515efa..10e38d1 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -185,7 +185,7 @@ struct proto_ops {
struct vm_area_struct * vma);
ssize_t (*sendpage) (struct socket *sock, struct page *page,
int offset, size_t size, int flags);
- ssize_t (*splice_read)(struct socket *sock, loff_t *ppos,
+ ssize_t (*splice_read)(struct file *file, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len, unsigned int flags);
};

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 218235d..e8e7f80 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -309,7 +309,7 @@ extern int tcp_twsk_unique(struct sock *sk,

extern void tcp_twsk_destructor(struct sock *sk);

-extern ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
+extern ssize_t tcp_splice_read(struct file *file, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len, unsigned int flags);

static inline void tcp_dec_quickack_mode(struct sock *sk,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ce572f9..c777d88 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -548,10 +548,11 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
* Will read pages from given socket and fill them into a pipe.
*
**/
-ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
+ssize_t tcp_splice_read(struct file *file, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags)
{
+ struct socket *sock = file->private_data;
struct sock *sk = sock->sk;
struct tcp_splice_state tss = {
.pipe = pipe,
@@ -572,7 +573,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,

lock_sock(sk);

- timeo = sock_rcvtimeo(sk, flags & SPLICE_F_NONBLOCK);
+ timeo = sock_rcvtimeo(sk, file->f_flags & O_NONBLOCK);
while (tss.len) {
ret = __tcp_splice_read(sk, &tss);
if (ret < 0)

2009-01-14 07:40:29

by Volker Lendecke

Subject: Re: maximum buffer size for splice(2) tcp->pipe?

On Wed, Jan 14, 2009 at 12:15:04AM +0100, Eric Dumazet wrote:
> Volker, is your splice() a blocking one, from a tcp socket to a pipe?

Yes, it is.

> If no other thread is reading the pipe, you might block forever
> in splice_to_pipe() as soon as the pipe is full (16 pages).

Why does it block when the pipe is full? Why doesn't it
return a short read, just like the read(2) call does? We
need to cope with that behaviour anyway.

> As pages are not necessarily full (each skb uses at least one page, even if
> its length is small), it is not really possible to use splice() like this.
>
> In your case, the only safe way with the current kernel would be to call splice()
> asking for no more than 16 bytes, which would be really insane for your needs.
>
> You may prefer a non-blocking mode, at least when calling splice_to_pipe().

Which fd do I have to set the nonblocking flag on? The TCP
socket I read from, or the pipe I write to?

Thanks for the hint anyway :-)

Volker



2009-01-14 09:14:14

by Eric Dumazet

Subject: Re: maximum buffer size for splice(2) tcp->pipe?

Volker Lendecke wrote:
> On Wed, Jan 14, 2009 at 12:15:04AM +0100, Eric Dumazet wrote:
>> Volker, is your splice() a blocking one, from a tcp socket to a pipe?
>
> Yes, it is.
>
>> If no other thread is reading the pipe, you might block forever
>> in splice_to_pipe() as soon as the pipe is full (16 pages).
>
> Why does it block when the pipe is full? Why doesn't it
> return a short read, just like the read(2) call does? We
> need to cope with that behaviour anyway.

Well, check code in fs/splice.c, function splice_to_pipe().

If SPLICE_F_NONBLOCK is not set, it is *expected* to block on pipe.

In this mode, only another thread is able to drain the pipe and wake up the blocked thread.

Code review :

When all pages are used "if (pipe->nrbufs == PIPE_BUFFERS)"

        if (spd->flags & SPLICE_F_NONBLOCK) {
                if (!ret)
                        ret = -EAGAIN;
                break;
        }

        if (signal_pending(current)) {
                if (!ret)
                        ret = -ERESTARTSYS;
                break;
        }

        if (do_wakeup) {
                smp_mb();
                if (waitqueue_active(&pipe->wait))
                        wake_up_interruptible_sync(&pipe->wait);
                kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
                do_wakeup = 0;
        }

        pipe->waiting_writers++;
HERE >> pipe_wait(pipe);
        pipe->waiting_writers--;
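
The full-pipe condition can be observed from user space without splice() at all. The following is a minimal sketch, not the kernel path quoted above: it uses plain write() on a pipe with O_NONBLOCK set, which fails with EAGAIN when no buffer is free, much as the SPLICE_F_NONBLOCK branch returns -EAGAIN. The helper name fill_pipe_nonblocking() and the 4096-byte chunk size are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write 4096-byte chunks into the pipe write end `wfd` until the kernel
 * reports that the pipe is full. Returns the number of bytes written, or
 * -1 on any error other than EAGAIN. Leaves O_NONBLOCK set on wfd.
 * (Hypothetical helper, for illustration only.) */
static long fill_pipe_nonblocking(int wfd)
{
    static const char chunk[4096];
    long total = 0;
    int fl;

    assert(wfd >= 0);
    fl = fcntl(wfd, F_GETFL);
    if (fl < 0 || fcntl(wfd, F_SETFL, fl | O_NONBLOCK) < 0)
        return -1;

    for (;;) {
        ssize_t n = write(wfd, chunk, sizeof(chunk));

        if (n > 0) {
            total += n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        /* EAGAIN means "pipe full"; anything else is a real error. */
        return (n < 0 && errno == EAGAIN) ? total : -1;
    }
}
```

Without O_NONBLOCK the same loop would sleep in the final write() until some other thread or process read from the pipe, which is the situation marked HERE in the excerpt.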


>
>> As pages are not necessarily full (each skb uses at least one page, even if
>> its length is small), it is not really possible to use splice() like this.
>>
>> In your case, the only safe way with the current kernel would be to call splice()
>> asking for no more than 16 bytes, which would be really insane for your needs.
>>
>> You may prefer a non-blocking mode, at least when calling splice_to_pipe().
>
> Which fd do I have to set the nonblocking flag on? The TCP
> socket I read from, or the pipe I write to?

I would say, use the SPLICE_F_NONBLOCK flag on the splice() system call,
but leave the tcp socket in blocking mode... But with the current kernel it
won't work. In order to avoid busy looping, you might add a poll()/select()
to call splice(SPLICE_F_NONBLOCK) only when the socket has data
in its receive queue.

for (;;) {
        struct pollfd pfd;
        pfd.fd = socket;
        pfd.events = POLLIN;
        if (poll(&pfd, 1, -1) != 1)
                continue;
        res = splice(socket, NULL, pipefds[1], NULL, 65536,
                     SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
        if (res > 0)
                nwritten = splice(pipefds[0], NULL, file_fd, NULL, res,
                                  SPLICE_F_MOVE|SPLICE_F_MORE);
}

Unfortunately, splice() from a tcp socket to a pipe does not work as is without
SPLICE_F_NONBLOCK when the same thread both writes and reads the pipe: it risks a deadlock.
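
A slightly fuller sketch of one iteration of the loop above, written as a function so that the error paths are explicit. It is an illustration under assumptions, not code from Samba or from the test program: the name relay_once() and its parameters are hypothetical. It also drains the pipe completely, because the second splice() may move fewer bytes than the first one queued.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Move up to `len` bytes from src_fd to dst_fd through the pipe
 * (pipe_rd, pipe_wr). Returns the number of bytes moved, 0 on EOF, or -1
 * with errno set (EAGAIN included). Hypothetical helper. */
static ssize_t relay_once(int src_fd, int pipe_rd, int pipe_wr,
                          int dst_fd, size_t len)
{
    struct pollfd pfd = { .fd = src_fd, .events = POLLIN };
    ssize_t in, left;

    assert(len > 0);

    /* Sleep in poll() instead of inside splice(). */
    for (;;) {
        int r = poll(&pfd, 1, -1);

        if (r == 1)
            break;
        if (r < 0 && errno != EINTR)
            return -1;
    }

    /* Never block on the pipe: this thread is also its only reader. */
    in = splice(src_fd, NULL, pipe_wr, NULL, len,
                SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
    if (in <= 0)
        return in;

    /* Drain everything just queued so the pipe is empty next time. */
    for (left = in; left > 0; ) {
        ssize_t out = splice(pipe_rd, NULL, dst_fd, NULL, (size_t)left,
                             SPLICE_F_MOVE | SPLICE_F_MORE);

        if (out < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (out == 0) {         /* destination accepts nothing more */
            errno = EIO;
            return -1;
        }
        left -= out;
    }
    return in;
}
```

A caller would invoke relay_once(sock_fd, pipefds[0], pipefds[1], file_fd, 65536) repeatedly until it returns 0 or an error other than EAGAIN.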

2009-01-14 09:58:43

by Volker Lendecke

Subject: Re: maximum buffer size for splice(2) tcp->pipe?

On Wed, Jan 14, 2009 at 10:13:34AM +0100, Eric Dumazet wrote:
> for (;;) {
> struct pollfd pfd;
> pfd.fd = socket;
> pfd.events = POLLIN;
> if (poll(&pfd, 1, -1) != 1)
> continue;
> res = splice(socket, NULL, pipefds[1], NULL, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
> if (res > 0)
> nwritten = splice(pipefds[0], NULL, file_fd, NULL, res, SPLICE_F_MOVE|SPLICE_F_MORE);
> }

Doesn't this reduce performance again? I thought the whole
point of splice() was to increase performance by avoiding
memory copies. If I have to do a poll syscall for each call
to splice, doesn't the context switch eat that performance
advantage again?

Or was splice designed only for multi-threaded applications
(which at least Samba is not)?

Volker



2009-01-14 10:19:09

by Eric Dumazet

Subject: Re: maximum buffer size for splice(2) tcp->pipe?

Volker Lendecke wrote:
> On Wed, Jan 14, 2009 at 10:13:34AM +0100, Eric Dumazet wrote:
>> for (;;) {
>> struct pollfd pfd;
>> pfd.fd = socket;
>> pfd.events = POLLIN;
>> if (poll(&pfd, 1, -1) != 1)
>> continue;
>> res = splice(socket, NULL, pipefds[1], NULL, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
>> if (res > 0)
>> nwritten = splice(pipefds[0], NULL, file_fd, NULL, res, SPLICE_F_MOVE|SPLICE_F_MORE);
>> }
>
> Doesn't this reduce performance again? I thought the whole
> point of splice() was to increase performance by avoiding
> memory copies. If I have to do a poll syscall for each call
> to splice, doesn't the context switch eat that performance
> advantage again?
>
> Or was splice designed only for multi-threaded applications
> (which at least Samba is not)?
>
> Volker

splice() avoids memory copies, yes, but on typical 1460-byte
frames it's a small gain.

But if no data is available on the socket,
you still have to wait (and have a context switch later).

Waiting in poll() or in splice() has the same context switch cost.

The only cost is the extra syscall, of course, but it is mandatory
if you want to avoid a possible deadlock in the current splice()
implementation.

2009-01-15 04:58:51

by David Miller

Subject: Re: maximum buffer size for splice(2) tcp->pipe?

From: Eric Dumazet <[email protected]>
Date: Wed, 14 Jan 2009 00:38:32 +0100

> [PATCH] net: splice() from tcp to socket should take into account O_NONBLOCK
>
> Instead of using SPLICE_F_NONBLOCK to select a non blocking mode both on
> source tcp socket and pipe destination, we use the underlying file flag (O_NONBLOCK)
> for selecting a non blocking socket.
>
> Signed-off-by: Eric Dumazet <[email protected]>

This needs at least some more thought.

It seems, for one thing, that this change will interfere with the
intentions of the code in splice_direct_to_actor which goes:

/*
* Don't block on output, we have to drain the direct pipe.
*/
sd->flags &= ~SPLICE_F_NONBLOCK;

2009-01-15 11:47:56

by Eric Dumazet

Subject: Re: maximum buffer size for splice(2) tcp->pipe?

David Miller wrote:
> From: Eric Dumazet <[email protected]>
> Date: Wed, 14 Jan 2009 00:38:32 +0100
>
>> [PATCH] net: splice() from tcp to socket should take into account O_NONBLOCK
>>
>> Instead of using SPLICE_F_NONBLOCK to select a non blocking mode both on
>> source tcp socket and pipe destination, we use the underlying file flag (O_NONBLOCK)
>> for selecting a non blocking socket.
>>
>> Signed-off-by: Eric Dumazet <[email protected]>
>
> This needs at least some more thought.
>
> It seems, for one thing, that this change will interfere with the
> intentions of the code in splice_direct_to_actor which goes:
>
> /*
> * Don't block on output, we have to drain the direct pipe.
> */
> sd->flags &= ~SPLICE_F_NONBLOCK;

Reading splice_direct_to_actor(), I see nothing wrong with the patch.

(The patch is about the splice from a socket to a pipe, while the sd->flags you mention
in splice_direct_to_actor() only applies to the splice from the internal pipe to
something else, as splice_direct_to_actor() allocates an internal pipe to perform
its work.)

Also, the meaning of SPLICE_F_NONBLOCK, as explained in include/linux/splice.h, is:

#define SPLICE_F_NONBLOCK (0x02) /* don't block on the pipe splicing (but */
/* we may still block on the fd we splice */
/* from/to, of course */

If the comment is still correct, SPLICE_F_NONBLOCK only applies to the pipe involved in
the splice() syscall.

For the other file, it is either:
- A regular file: nonblocking mode is not available, as with a normal read()/write() syscall.

- A socket: we should be able to specify whether it is blocking or not, independently of
  the SPLICE_F_NONBLOCK flag, which only applies to the pipe. The normal way
  is to use ioctl(FIONBIO) or an fcntl() call to change O_NONBLOCK in file->f_flags.
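
For reference, toggling O_NONBLOCK on a descriptor with fcntl() usually looks like the following minimal sketch. The helper name set_nonblocking() is hypothetical. With the proposed patch applied, this file flag (and not SPLICE_F_NONBLOCK) would decide whether tcp_splice_read() waits for data on the socket.

```c
#include <assert.h>
#include <fcntl.h>

/* Set (on != 0) or clear (on == 0) O_NONBLOCK on fd.
 * Returns 0 on success, -1 on error with errno set by fcntl().
 * (Hypothetical helper, for illustration only.) */
static int set_nonblocking(int fd, int on)
{
    int fl;

    assert(fd >= 0);
    fl = fcntl(fd, F_GETFL);
    if (fl < 0)
        return -1;
    fl = on ? (fl | O_NONBLOCK) : (fl & ~O_NONBLOCK);
    return fcntl(fd, F_SETFL, fl) < 0 ? -1 : 0;
}
```

A program following the scheme described here would call set_nonblocking(sock_fd, 0) once to keep the socket blocking, and pass SPLICE_F_NONBLOCK on each splice() into the pipe.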


In order to use splice() efficiently from a socket to a file, we need
a loop of:

{
        splice(from blocking tcp socket to non blocking pipe, SPLICE_F_NONBLOCK); /* nonblocking pipe or risk deadlock */
        splice(from pipe to file);
}