2008-07-17 13:36:17

by Octavian Purdila

[permalink] [raw]
Subject: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

commit 11134aa8499b6fd67569e8fd21bde6fc481898d1
Author: Octavian Purdila <[email protected]>
Date: Thu Jul 17 16:25:23 2008 +0300

tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

This patch changes tcp_splice_read to the behavior implied by man 2
splice:

SPLICE_F_NONBLOCK - Do not block on I/O. This makes the splice
pipe operations non-blocking, but splice() may nevertheless block
because the file descriptors that are spliced to/from may block
(unless they have the O_NONBLOCK flag set).

This approach also provides a simple solution to the splice
transfer size problem. Say we have the following common sequence:

splice(socket, pipe);
splice(pipe, file);

Unless we specify SPLICE_F_NONBLOCK, we can't use arbitrarily large
transfer sizes with the 1st splice since otherwise we will deadlock
due to pipe being full. But if we use SPLICE_F_NONBLOCK, the current
implementation will make the underlying socket non-blocking and thus
will force us use poll or other async I/O notification mechanism.

Choosing a splice transfer size that won't deadlock is not trivial: we
need to stay under PIPE_BUFFERS packets and since packets can have
arbitrary sizes we will need to be conservative and use a small
transfer size. That can degrade performance due to excessive system
calls.

Signed-off-by: Octavian Purdila <[email protected]>

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 56a133c..cc5082b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -570,7 +570,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,

lock_sock(sk);

- timeo = sock_rcvtimeo(sk, flags & SPLICE_F_NONBLOCK);
+ timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK);
while (tss.len) {
ret = __tcp_splice_read(sk, &tss);
if (ret < 0)
@@ -578,10 +578,6 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
else if (!ret) {
if (spliced)
break;
- if (flags & SPLICE_F_NONBLOCK) {
- ret = -EAGAIN;
- break;
- }
if (sock_flag(sk, SOCK_DONE))
break;
if (sk->sk_err) {


Attachments:
(No filename) (1.00 B)
x (2.13 kB)
Download all attachments

2008-07-17 14:21:54

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

Hi Octavian.

On Thu, Jul 17, 2008 at 04:33:49PM +0300, Octavian Purdila ([email protected]) wrote:
> This patch changes tcp_splice_read to the behavior implied by man 2
> splice:
>
> SPLICE_F_NONBLOCK - Do not block on I/O. This makes the splice
> pipe operations non-blocking, but splice() may nevertheless block
> because the file descriptors that are spliced to/from may block
> (unless they have the O_NONBLOCK flag set).
>
> This approach also provides a simple solution to the splice
> transfer size problem. Say we have the following common sequence:
>
> splice(socket, pipe);
> splice(pipe, file);
>
> Unless we specify SPLICE_F_NONBLOCK, we can't use arbitrarily large
> transfer sizes with the 1st splice since otherwise we will deadlock
> due to pipe being full. But if we use SPLICE_F_NONBLOCK, the current
> implementation will make the underlying socket non-blocking and thus
> will force us use poll or other async I/O notification mechanism.

Existing behaviour was selected to be able to have a progress if socket
does not have enough data to fill the pipe. With your change if socket
is not opened with non-blocking mode reading will block not matter if
SPLICE_F_NONBLOCK is set or not. This is a quite serious break of the
overall idea behind SPLICE_F_NONBLOCK.

Socket will not be marked as non-blocking if SPLICE_F_NONBLOCK is
specified, only splicing will used non-blocking reading, any read via
recv() will use existing socket flags.

--
Evgeniy Polyakov

2008-07-17 14:49:46

by Octavian Purdila

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

On Thursday 17 July 2008, Evgeniy Polyakov wrote:

>
> Existing behaviour was selected to be able to have a progress if socket
> does not have enough data to fill the pipe. With your change if socket
> is not opened with non-blocking mode reading will block not matter if
> SPLICE_F_NONBLOCK is set or not.

I am probably missing some usecases here, but usually if you want to use
non-blocking I/O you need to use special approach anyway (e.g. code the
poll/epoll/select bits) so then you could open the socket with O_NONBLOCK.

> This is a quite serious break of the
> overall idea behind SPLICE_F_NONBLOCK.
>

I don't know... the man page explicitly says that even when you use
SPLICE_F_NONBLOCK splice may block because of the underlying fd blocking.

But more importantly, how can we solve the deadlock issue described in the
patch? Do we need all of the complications of async I/O for such a simple and
common usecase?

Maybe we can solve both usecases by using two flags: one for splice and
another one for the underlying file descriptor?

Thanks,
tavi

2008-07-17 17:42:28

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

On Thu, Jul 17, 2008 at 05:47:27PM +0300, Octavian Purdila ([email protected]) wrote:
> > Existing behaviour was selected to be able to have a progress if socket
> > does not have enough data to fill the pipe. With your change if socket
> > is not opened with non-blocking mode reading will block not matter if
> > SPLICE_F_NONBLOCK is set or not.
>
> I am probably missing some usecases here, but usually if you want to use
> non-blocking I/O you need to use special approach anyway (e.g. code the
> poll/epoll/select bits) so then you could open the socket with O_NONBLOCK.

It depends. Splice clearly states that it tries to be nonblocking with
given flag being set, and its reading will be non-blocking indeed.

> > This is a quite serious break of the
> > overall idea behind SPLICE_F_NONBLOCK.
> >
>
> I don't know... the man page explicitly says that even when you use
> SPLICE_F_NONBLOCK splice may block because of the underlying fd blocking.

Yes, but reading from the network will not.

> But more importantly, how can we solve the deadlock issue described in the
> patch? Do we need all of the complications of async I/O for such a simple and
> common usecase?

I'm not sure I understand how it can deadlock, please explain it in more
details.

--
Evgeniy Polyakov

2008-07-17 21:55:03

by Octavian Purdila

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

On Thursday 17 July 2008, Evgeniy Polyakov wrote:
> >
> > I am probably missing some usecases here, but usually if you want to use
> > non-blocking I/O you need to use special approach anyway (e.g. code the
> > poll/epoll/select bits) so then you could open the socket with
> > O_NONBLOCK.
>
> It depends. Splice clearly states that it tries to be nonblocking with
> given flag being set, and its reading will be non-blocking indeed.
>
> > > This is a quite serious break of the
> > > overall idea behind SPLICE_F_NONBLOCK.
> >
> > I don't know... the man page explicitly says that even when you use
> > SPLICE_F_NONBLOCK splice may block because of the underlying fd blocking.
>
> Yes, but reading from the network will not.
>

You lost me here :)

The way I interpret the man page text is that it is ok for splice to block,
even if SPLICE_F_NONBLOCK is set. The comments near SPLICE_F_NONBLOCK says
the same thing:

#define SPLICE_F_NONBLOCK (0x02) /* don't block on the pipe splicing (but */
/* we may still block on the fd we splice */
/* from/to, of course */

Am I missing something?

> > But more importantly, how can we solve the deadlock issue described in
> > the patch? Do we need all of the complications of async I/O for such a
> > simple and common usecase?
>
> I'm not sure I understand how it can deadlock, please explain it in more
> details.

For this "program":

x=splice(socket, pipe, size, flags=0);
if (x > 0)
splice(pipe, file, x, flags=0);

it is hard to come up with a non tiny value for size that does not deadlock
the program, because the pipe size is measured in packets and not bytes and
we have no control over the packet sizes.

For example, if we set size=17 and we are unlucky and get 16 packets of 1 byte
in a row, at the right time, the first splice call will block - and the
program will deadlock since we can't reach the consumer.

Thanks,
tavi

2008-07-18 10:54:08

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

Hi.

On Fri, Jul 18, 2008 at 12:52:33AM +0300, Octavian Purdila ([email protected]) wrote:
> You lost me here :)
>
> The way I interpret the man page text is that it is ok for splice to block,
> even if SPLICE_F_NONBLOCK is set. The comments near SPLICE_F_NONBLOCK says
> the same thing:
>
> #define SPLICE_F_NONBLOCK (0x02) /* don't block on the pipe splicing (but */
> /* we may still block on the fd we splice */
> /* from/to, of course */
>
> Am I missing something?

tcp_splice_read:

timeo = sock_rcvtimeo(sk, flags & SPLICE_F_NONBLOCK);

So, if you set SPLICE_F_NONBLOCK, then reading from the network will not
block. Splice can block in reading from other descriptors though. It can
also block during writing.

> > > But more importantly, how can we solve the deadlock issue described in
> > > the patch? Do we need all of the complications of async I/O for such a
> > > simple and common usecase?
> >
> > I'm not sure I understand how it can deadlock, please explain it in more
> > details.
>
> For this "program":
>
> x=splice(socket, pipe, size, flags=0);
> if (x > 0)
> splice(pipe, file, x, flags=0);
>
> it is hard to come up with a non tiny value for size that does not deadlock
> the program, because the pipe size is measured in packets and not bytes and
> we have no control over the packet sizes.
>
> For example, if we set size=17 and we are unlucky and get 16 packets of 1 byte
> in a row, at the right time, the first splice call will block - and the
> program will deadlock since we can't reach the consumer.

It is not a deadlock. recv() on blocking socket with the same parameters
will behave exactly the same. Application designer should think about
how it is supposed to handle cases, when not enough data is available in
the receiving queue - either return or wait.

--
Evgeniy Polyakov

2008-07-18 11:22:11

by Octavian Purdila

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

On Friday 18 July 2008, Evgeniy Polyakov wrote:

> tcp_splice_read:
>
> timeo = sock_rcvtimeo(sk, flags & SPLICE_F_NONBLOCK);
>
> So, if you set SPLICE_F_NONBLOCK, then reading from the network will not
> block. Splice can block in reading from other descriptors though. It can
> also block during writing.
>

I know that. But I am arguing that splice API does not required not to block
even when the SPLICE_F_NONBLOCK is used. So changing this behavior the way I
suggested will still be conformant with the splice API requirements.

> > > > But more importantly, how can we solve the deadlock issue described
> > > > in the patch? Do we need all of the complications of async I/O for
> > > > such a simple and common usecase?
> > >
> > > I'm not sure I understand how it can deadlock, please explain it in
> > > more details.
> >
> > For this "program":
> >
> > x=splice(socket, pipe, size, flags=0);
> > if (x > 0)
> > splice(pipe, file, x, flags=0);
> >
> > it is hard to come up with a non tiny value for size that does not
> > deadlock the program, because the pipe size is measured in packets and
> > not bytes and we have no control over the packet sizes.
> >
> > For example, if we set size=17 and we are unlucky and get 16 packets of 1
> > byte in a row, at the right time, the first splice call will block - and
> > the program will deadlock since we can't reach the consumer.
>
> It is not a deadlock. recv() on blocking socket with the same parameters
> will behave exactly the same. Application designer should think about
> how it is supposed to handle cases, when not enough data is available in
> the receiving queue - either return or wait.

Sorry, it was an unfortunate example :) This is not about not enough data
being available. Lets change the number of packets in the example with 20
instead of 16 (and keep the size to 17) - the splice call will still block
because of the pipe being full. The pipe can only hold PIPE_BUFFERS packets
(which is 16 currently).

Thanks,
tavi

2008-07-18 12:24:30

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

On Fri, Jul 18, 2008 at 02:18:59PM +0300, Octavian Purdila ([email protected]) wrote:
> I know that. But I am arguing that splice API does not required not to block
> even when the SPLICE_F_NONBLOCK is used. So changing this behavior the way I
> suggested will still be conformant with the splice API requirements.

It will block in sending and/or other than network reading. With your
patch if receiving socket was opened in blocking mode, than there is no
way to finish splice-in until whole requested number of bytes are read.
SPLICE_F_NONBLOCK is an extension, consider it like recv() with temporal
non-blocking flag.

> Sorry, it was an unfortunate example :) This is not about not enough data
> being available. Lets change the number of packets in the example with 20
> instead of 16 (and keep the size to 17) - the splice call will still block
> because of the pipe being full. The pipe can only hold PIPE_BUFFERS packets
> (which is 16 currently).

Why? It will read its data from 16 packets, then send them into another end
of the pipe :)

You propose to change a very useful splice feature (actually you would just
remove it at all with the same results for reading network path, since
it is essentially what you did :) - not to block when it is possible.

This kind of non-blocking mode was added for performance issues too:
consider application which reads from the network and writes into the
file, while there is no data in the socket it can write what was already
read into any object attached to the given end of the pipe.

--
Evgeniy Polyakov

2008-07-18 14:06:36

by Octavian Purdila

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

On Friday 18 July 2008, Evgeniy Polyakov wrote:
>
> It will block in sending and/or other than network reading. With your
> patch if receiving socket was opened in blocking mode, than there is no
> way to finish splice-in until whole requested number of bytes are read.
> SPLICE_F_NONBLOCK is an extension, consider it like recv() with temporal
> non-blocking flag.
>

The way is see it, and API and documentation is written, SPLICE_F_NONBLOCK was
added only for choosing to block or not block on the _pipe_, not on the other
fd.

And, IMHO, it should be kept that way. If we need to make certain splice
operations non-blocking for the other file descriptor, then maybe we should
add a separate flag for that. But, again IMHO, overloading SPLICE_F_NONBLOCK
with responsabilities for both the pipe and the other file descriptor is
wrong is at is taking the freedom from the application of controlling things.

And when you do that, sooner or later you will run in a scenario which
requires workarounds in the applications to bypass the API assumptions.

> > Sorry, it was an unfortunate example :) This is not about not enough data
> > being available. Lets change the number of packets in the example with 20
> > instead of 16 (and keep the size to 17) - the splice call will still
> > block because of the pipe being full. The pipe can only hold PIPE_BUFFERS
> > packets (which is 16 currently).
>
> Why? It will read its data from 16 packets, then send them into another end
> of the pipe :)
>

splice will consume one packet at a time and will try to feed them in the
pipe. Since the pipe can only hold 16 descriptors, on the 17th it will block.

> You propose to change a very useful splice feature (actually you would just
> remove it at all with the same results for reading network path, since
> it is essentially what you did :) - not to block when it is possible.
>
> This kind of non-blocking mode was added for performance issues too:
> consider application which reads from the network and writes into the
> file, while there is no data in the socket it can write what was already
> read into any object attached to the given end of the pipe.

I have my doubts about the benefit of using the non-blocking operations only
for some splice calls :) But to solve both issues the solution would be to
add a separate non-blocking flag for the other file descriptor. Is that ok?

Thanks,
tavi

2008-07-18 14:32:49

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

On Fri, Jul 18, 2008 at 05:04:17PM +0300, Octavian Purdila ([email protected]) wrote:
> > It will block in sending and/or other than network reading. With your
> > patch if receiving socket was opened in blocking mode, than there is no
> > way to finish splice-in until whole requested number of bytes are read.
> > SPLICE_F_NONBLOCK is an extension, consider it like recv() with temporal
> > non-blocking flag.
>
> The way is see it, and API and documentation is written, SPLICE_F_NONBLOCK was
> added only for choosing to block or not block on the _pipe_, not on the other
> fd.
>
> And, IMHO, it should be kept that way. If we need to make certain splice
> operations non-blocking for the other file descriptor, then maybe we should
> add a separate flag for that. But, again IMHO, overloading SPLICE_F_NONBLOCK
> with responsabilities for both the pipe and the other file descriptor is
> wrong is at is taking the freedom from the application of controlling things.
>
> And when you do that, sooner or later you will run in a scenario which
> requires workarounds in the applications to bypass the API assumptions.

Why? There is clearly documented behaviour of the call, it works exactly
like it is supposed to work - it tries to be non-blocking everywhere
where it can, but not always, that's why there is a sentence which
states that even with given flag call may block.

> > > Sorry, it was an unfortunate example :) This is not about not enough data
> > > being available. Lets change the number of packets in the example with 20
> > > instead of 16 (and keep the size to 17) - the splice call will still
> > > block because of the pipe being full. The pipe can only hold PIPE_BUFFERS
> > > packets (which is 16 currently).
> >
> > Why? It will read its data from 16 packets, then send them into another end
> > of the pipe :)
> >
>
> splice will consume one packet at a time and will try to feed them in the
> pipe. Since the pipe can only hold 16 descriptors, on the 17th it will block.

If there are 20 packets in the queue it will get 16 and put them into
another end (in the next call in your example). Where will it block?

> > You propose to change a very useful splice feature (actually you would just
> > remove it at all with the same results for reading network path, since
> > it is essentially what you did :) - not to block when it is possible.
> >
> > This kind of non-blocking mode was added for performance issues too:
> > consider application which reads from the network and writes into the
> > file, while there is no data in the socket it can write what was already
> > read into any object attached to the given end of the pipe.
>
> I have my doubts about the benefit of using the non-blocking operations only
> for some splice calls :) But to solve both issues the solution would be to
> add a separate non-blocking flag for the other file descriptor. Is that ok?

I really do not think that there is any kind of problem with current
behaviour, and thus there is no need to introduce additional flags
and/or change existing behaviour, but I can understand you that existing
approach does not met your expectation, so you are trying to change it.
I've added Jens Axboe to copy list, who is responsible for splice
design.

Btw, you are also trying to change existing userspace API, which may be
very much forbidden at this stage.

--
Evgeniy Polyakov

2008-07-18 15:52:37

by Octavian Purdila

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK


> Why? There is clearly documented behaviour of the call, it works exactly
> like it is supposed to work - it tries to be non-blocking everywhere
> where it can, but not always, that's why there is a sentence which
> states that even with given flag call may block.

I think that it tries a bit too hard to be non-blocking in the TCP receive
implementation, and that is causing problems for some usecases.

And (sorry for saying this again - it will be the last time) to me it looks
like SPLICE_F_NONBLOCK is intended for the pipe only:

commit 29e350944fdc2dfca102500790d8ad6d6ff4f69d
Author: Linus Torvalds <[email protected]>
Date: Sun Apr 2 12:46:35 2006 -0700

splice: add SPLICE_F_NONBLOCK flag

It doesn't make the splice itself necessarily nonblocking (because the
actual file descriptors that are spliced from/to may block unless they
have the O_NONBLOCK flag set), but it makes the splice pipe operations
nonblocking.

>
> If there are 20 packets in the queue it will get 16 and put them into
> another end (in the next call in your example). Where will it block?
>

It will take 17 because this is what the user requested. And when trying to
push the 17th on the pipe, it will block. I base this both on experiments and
on my understanding of the tcp splice receive implementation.

>
> I really do not think that there is any kind of problem with current
> behaviour, and thus there is no need to introduce additional flags
> and/or change existing behaviour, but I can understand you that existing
> approach does not met your expectation, so you are trying to change it.
> I've added Jens Axboe to copy list, who is responsible for splice
> design.
>
> Btw, you are also trying to change existing userspace API, which may be
> very much forbidden at this stage.

If people here will be telling me that for splice you must always use
non-blocking I/O since you can't get the blocking case to work reliably, than
I will accept that. After all, they know better :)

Thanks,
tavi

2008-07-18 16:00:30

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

On Fri, Jul 18, 2008 at 06:50:07PM +0300, Octavian Purdila ([email protected]) wrote:
> > If there are 20 packets in the queue it will get 16 and put them into
> > another end (in the next call in your example). Where will it block?
>
> It will take 17 because this is what the user requested. And when trying to
> push the 17th on the pipe, it will block. I base this both on experiments and
> on my understanding of the tcp splice receive implementation.

Seems like we do not understand each other. How it can take 17 if there
are only 16 pages? By 'push' you mean splice-into-pipe or
splice-out-of-pipe-into-other-fd? Where exactly will it block?

> > Btw, you are also trying to change existing userspace API, which may be
> > very much forbidden at this stage.
>
> If people here will be telling me that for splice you must always use
> non-blocking I/O since you can't get the blocking case to work reliably, than
> I will accept that. After all, they know better :)

It may be not because of how is right or wrong, but that some userspace
can already rely on existing behaviour of the flag, and changing it
(even to more correct) may break some application in a really
non-trivial way.

--
Evgeniy Polyakov

2008-07-18 17:07:49

by Octavian Purdila

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

On Friday 18 July 2008, Evgeniy Polyakov wrote:

> > It will take 17 because this is what the user requested. And when trying
> > to push the 17th on the pipe, it will block. I base this both on
> > experiments and on my understanding of the tcp splice receive
> > implementation.
>
> Seems like we do not understand each other. How it can take 17 if there
> are only 16 pages? By 'push' you mean splice-into-pipe or
> splice-out-of-pipe-into-other-fd? Where exactly will it block?
>

Suppose we have 20 packets in the socket queue and the pipe is empty and the
application calls splice(sock, pipe, 17, flags=0).

Then, tcp_splice_read will be called, which in turn calls tcp_read_sock.

tcp_read_sock will loop until all the 17 bytes will be read from the socket.
tcp_read_sock calls skb_splice_bits which calls splice_to_pipe.

Now while skb_splice_bits is carefull to only put a maximum of PIPE_BUFFERS
during its iteration, due to the looping in tcp_read_sock, we will end up
with 17 calls to splice_to_pipe. Thus on the 17th call, splice_to_pipe will
block.

Does this make sense?

Thanks,
tavi

2008-07-18 17:53:59

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

Hi.

On Fri, Jul 18, 2008 at 08:04:44PM +0300, Octavian Purdila ([email protected]) wrote:
> Suppose we have 20 packets in the socket queue and the pipe is empty and the
> application calls splice(sock, pipe, 17, flags=0).
>
> Then, tcp_splice_read will be called, which in turn calls tcp_read_sock.
>
> tcp_read_sock will loop until all the 17 bytes will be read from the socket.
> tcp_read_sock calls skb_splice_bits which calls splice_to_pipe.

How come?
spd_fill_page() should fail when it will be called for the 17'th skb and
all reading from the socket will return, and thus can be sent to the
file.

> Now while skb_splice_bits is carefull to only put a maximum of PIPE_BUFFERS
> during its iteration, due to the looping in tcp_read_sock, we will end up
> with 17 calls to splice_to_pipe. Thus on the 17th call, splice_to_pipe will
> block.

Where exactly?
Why
tcp_splice_data_recv()->skb_splice_bits()->__skb_splice_bits()->spd_fill_page()
callchain does not return error and that pipe is full?

--
Evgeniy Polyakov

2008-07-18 18:19:00

by Octavian Purdila

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

On Friday 18 July 2008, Evgeniy Polyakov wrote:
> Hi.
>
> On Fri, Jul 18, 2008 at 08:04:44PM +0300, Octavian Purdila
([email protected]) wrote:
> > Suppose we have 20 packets in the socket queue and the pipe is empty and
> > the application calls splice(sock, pipe, 17, flags=0).
> >
> > Then, tcp_splice_read will be called, which in turn calls tcp_read_sock.
> >
> > tcp_read_sock will loop until all the 17 bytes will be read from the
> > socket. tcp_read_sock calls skb_splice_bits which calls splice_to_pipe.
>
> How come?
> spd_fill_page() should fail when it will be called for the 17'th skb and
> all reading from the socket will return, and thus can be sent to the
> file.
>

spd_fill_page work with the splice_pipe_descriptor declared in
skb_splice_bits, thus spd_fill_page does not have visibility across multiple
skb_splice_bits calls.

> > Now while skb_splice_bits is carefull to only put a maximum of
> > PIPE_BUFFERS during its iteration, due to the looping in tcp_read_sock,
> > we will end up with 17 calls to splice_to_pipe. Thus on the 17th call,
> > splice_to_pipe will block.
>
> Where exactly?
> Why
> tcp_splice_data_recv()->skb_splice_bits()->__skb_splice_bits()->spd_fill_pa
>ge() callchain does not return error and that pipe is full?


Ok, let me try to move through the function calls:

tcp_read_sock
... -> skb_splice_bits -> spd_fill_page;
on return (spd->nr_page is 1 and pipe->nrbufs is 1)
... -> skb_splice_bits -> spd_fill_page;
on return (spd->nr_page is 1 and pipe->nrbufs is 2)
... -> skb_splice_bits -> spd_fill_page;
on return (spd->nr_page is 1 and pipe->nrbufs is 3)
...

and so on until pipe->nrbufs is 16. At than point, we will block in pipe_wait,
inside splice_to_pipe.

Thanks,
tavi







2008-07-18 18:36:19

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

On Fri, Jul 18, 2008 at 09:16:43PM +0300, Octavian Purdila ([email protected]) wrote:
> tcp_read_sock
> ... -> skb_splice_bits -> spd_fill_page;
> on return (spd->nr_page is 1 and pipe->nrbufs is 1)
> ... -> skb_splice_bits -> spd_fill_page;
> on return (spd->nr_page is 1 and pipe->nrbufs is 2)
> ... -> skb_splice_bits -> spd_fill_page;
> on return (spd->nr_page is 1 and pipe->nrbufs is 3)
> ...
>
> and so on until pipe->nrbufs is 16. At than point, we will block in pipe_wait,
> inside splice_to_pipe.

Seems that SPLICE_F_NONBLOCK check should be propagated from
tcp_splice_read() into skb_splice_bits(), and this flag is actually
there already in tss.flags.

--
Evgeniy Polyakov

2008-07-18 18:45:56

by Octavian Purdila

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

On Friday 18 July 2008, Evgeniy Polyakov wrote:

> >
> > and so on until pipe->nrbufs is 16. At than point, we will block in
> > pipe_wait, inside splice_to_pipe.
>
> Seems that SPLICE_F_NONBLOCK check should be propagated from
> tcp_splice_read() into skb_splice_bits(), and this flag is actually
> there already in tss.flags.

I'm not sure I understand you...

The flag gets propagated to splice_to_pipe (so there is no need to propagate
the check in skb_splice_bits) but we don't have SPLICE_F_NONBLOCK set, we are
on the blocking usecase.

Thanks,
tavi

2008-07-19 08:51:30

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

Hi Octavian.

On Fri, Jul 18, 2008 at 09:43:38PM +0300, Octavian Purdila ([email protected]) wrote:
> The flag gets propagated to splice_to_pipe (so there is no need to propagate
> the check in skb_splice_bits) but we don't have SPLICE_F_NONBLOCK set, we are
> on the blocking usecase.

Hmm, than how does it concerns SPLICE_F_NONBLOCK feature change? Your
patch does not touch this behaviour.

Anyway, in case of not having SPLICE_F_NONBLOCK it does not deadlock,
but waits until someone reads from the other side of the pipe. It blocks
and expects reader to unblock it.

It looks like you have two unexpected independent issues with splice:
ability to perform non-blocking reading from the socket into the pipe
when SPLICE_F_NONBLOCK is set,
and blocking waiting for reader to get data from the pipe when
SPLICE_F_NONBLOCK is not set. Is it correct?

If so, the former is a feature, which allows to have some progress when
receiving queue is empty: one can start getting data from the pipe,
i.e. splice data out of the pipe to the different file descriptor.
So, this flag means both non-blocking pipe operations _and_ non-blocking
receiving from the socket.

For the blocking in pipe_wait() when pipe is full and SPLICE_F_NONBLOCK
is not set, then it is just a pipe behaviour, which is used as a
temporal storage for the requested data. It is not some buffer, which is
returned to the userspace when it is full (or indication of it), but a
pipe, where you put page pointers, so when pipe is full and opened in
blocking mode, writer sleeps waiting or reader to get some data out of it
and thus awake writer. It is not a deadlock, but very expected behaviour
of the pipe: to block when pipe is full waiting for reader to get data.

Hope this will clarify this discussion a bit, so it would not look like
talk of blind and deaf :)

--
Evgeniy Polyakov

2008-07-19 11:20:58

by Octavian Purdila

[permalink] [raw]
Subject: Re: [PATCH] tcp: do not promote SPLICE_F_NONBLOCK to socket O_NONBLOCK

On Saturday 19 July 2008, Evgeniy Polyakov wrote:

>
> Hmm, than how does it concerns SPLICE_F_NONBLOCK feature change? Your
> patch does not touch this behaviour.
>

My patch makes SPLICE_F_NONBLOCK work only on the pipe, it stops
SPLICE_F_NONBLOCK giving non-blocking semantics to the socket. It thus allows
the application to control the "blockiness" of pipe and socket operations
_independently_:
- you want blocking operations on the socket: you use O_NONBLOCK.
- you want blocking operations on the pipe: you use SPLICE_F_NONBLOCK

It has a limitation though: you are not able to control the "blockiness" of
the socket operation during the splice call. Thats why I propose using two
flags: one for the pipe and one for the socket.

> Anyway, in case of not having SPLICE_F_NONBLOCK it does not deadlock,
> but waits until someone reads from the other side of the pipe. It blocks
> and expects reader to unblock it.
>

Correct.

> It looks like you have two unexpected independent issues with splice:
> ability to perform non-blocking reading from the socket into the pipe
> when SPLICE_F_NONBLOCK is set,
> and blocking waiting for reader to get data from the pipe when
> SPLICE_F_NONBLOCK is not set. Is it correct?
>

What I would like is to have the ability to do blocking operations on the
socket and non-blocking operations on the pipe.

> If so, the former is a feature, which allows to have some progress when
> receiving queue is empty: one can start getting data from the pipe,
> i.e. splice data out of the pipe to the different file descriptor.

> So, this flag means both non-blocking pipe operations _and_ non-blocking
> receiving from the socket.
>

Correct.

> For the blocking in pipe_wait() when pipe is full and SPLICE_F_NONBLOCK
> is not set, then it is just a pipe behaviour, which is used as a
> temporal storage for the requested data. It is not some buffer, which is
> returned to the userspace when it is full (or indication of it), but a
> pipe, where you put page pointers, so when pipe is full and opened in
> blocking mode, writer sleeps waiting or reader to get some data out of it
> and thus awake writer.

Correct.


> It is not a deadlock, but very expected behaviour
> of the pipe: to block when pipe is full waiting for reader to get data.
>

But in the larger picture it means that:

(a) a simple single threaded program that copies data from a socket to a file
with splice, using blocking operations on the socket and file, will not work
as it can block in the 1st splice if we specify a non tiny buffer. So, we
will either:

(b) need to use threads to read from the pipe and unblock it, or

(c) need to use full non-blocking operations with the whole poll/epoll/select
bits implemented

If you agree with the above statement, and you think that it is ok to be
forced to use (b) or (c) instead of (a), than I'll be happy to stop the
discussion here since its getting into a matter of personal interpretation of
what a good API is :)

> Hope this will clarify this discussion a bit, so it would not look like
> talk of blind and deaf :)

Indeed :)

My bad, I probably should have explained this in more details from the
beginning. Thanks for taking the time to debate this, despite of my poor
explanations.

Thanks,
tavi