2008-12-24 15:29:08

by Willy Tarreau

[permalink] [raw]
Subject: Data corruption issue with splice() on 2.6.27.10

Hi Jens,

I'm facing a data corruption problem with splice() between two
non-blocking TCP sockets on 2.6.27.10. I could finally write a
simpler proof of concept, and capture a snapshot of the issue
with the associated strace result.

My program does the following :
- accept an incoming connection
- connect to a remote server
- forward all data from the server to the client using splice()

The data count is always correct, but some parts are corrupted and
contain data which seem to come from random memory locations (this
raises a security concern BTW). It *sometimes* happens that a few
megabytes can be transferred without any problem, but most of the
time, corruption happens for a few hundreds of bytes every few
hundreds of kilobytes.

I first noticed that sometimes the corrupted part sized a multiple of
the MSS. But now it's more precise. Each time I observe the problem,
it is when the splice from the pipe to the outgoing socket has sent
two chunks at once.

For my test, I run my PoC program on my laptop. I connect to it using
netcat from localhost, then it connects to another machine on which
netcat is waiting for an incoming connection. The source of gcc-4.1.2
is then feed on stdin, and transferred via my program to the client-
side netcat. Both NICs are tg3, but the problem was first observed
with ixgbe NICs, then reproduced on sky2, so I don't think there's a
relation.

Here's an example of what I observed. This trace is really nice
because the problem happened very early (after about 14kB). 20k1 is an
"od -vAx -tx1" output of the original file, 20k2 is an output of the
corrupted one :

--- 20k1.hex 2008-12-24 11:35:36 +0100
+++ 20k2.hex 2008-12-24 14:48:26 +0100
@@ -883,17 +883,17 @@
003720 05 06 78 1b e8 0f 3b 88 6f 96 e1 99 93 51 fd 07
003730 a7 e9 d6 f6 9f b4 b4 78 ff 7f ea 3b b9 79 cc 4c
003740 65 d8 9e de 49 e5 fa 45 ab b4 3f 95 d4 42 23 e1
-003750 a8 19 22 c4 fd 6a 8c 8e 08 24 01 29 1c f5 65 86
-003760 7d be de 81 d2 07 41 44 40 3a 31 1c b9 6e ce 19
-003770 e6 a0 5c 81 f6 40 6e 1b a1 9f 58 61 9d 37 c7 d5
-003780 e3 10 83 b7 3f 36 e3 26 fd 22 e3 03 50 8c 3d 54
-003790 f0 f3 d4 73 cc 7e 61 96 43 04 13 e6 d3 96 eb 2a
-0037a0 f5 37 9e 09 9f d2 25 e5 73 ec f3 88 08 47 ca 3f
-0037b0 b0 27 e9 ae 7d a0 04 db 3a c3 c4 2f f8 23 05 04
-0037c0 76 c7 7f 90 65 56 ed 4f 31 03 fb 3b a6 de b5 01
-0037d0 48 02 55 f2 d0 38 89 03 7f 90 cf 5a a6 01 04 18
-0037e0 5b ae 62 c6 84 4b f2 ff 82 04 12 18 4f 32 6e 4f
-0037f0 2e 01 15 14 51 23 81 de 53 9a 86 a6 71 50 47 ee
+003750 a8 19 22 c4 02 00 00 00 00 00 00 00 08 00 00 00
+003760 00 00 00 00 4d 69 63 00 00 00 00 00 00 00 00 00
+003770 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+003780 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+003790 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+0037a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+0037b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+0037c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+0037d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+0037e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+0037f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
003800 fe 0a 4b 11 82 07 55 32 e6 44 21 e6 19 07 d2 38
003810 77 8f 5a 8f 9c 7b 9f 1e 72 01 04 fc 6a 88 c9 cf
003820 f3 6c e0 f8 a0 fe 43 76 50 13 5e d1 b2 08 7c e9

As you can see, the first corrupted byte starts at offset 0x3754 =
14164 and the corruption remains for 172 bytes.

Now let's take a look at the strace output :

socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(3, {sa_family=AF_INET, sin_port=htons(4000), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(3, 10) = 0
accept(3, 0, NULL) = 4

socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
connect(5, {sa_family=AF_INET, sin_port=htons(4001), sin_addr=inet_addr("10.1.1.1")}, 16) = 0
fcntl64(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
fcntl64(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
pipe([6, 7]) = 0

select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1024
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
splice(0x6, 0, 0x4, 0, 0x9b4, 0x3) = 2484
select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
splice(0x6, 0, 0x4, 0, 0xb68, 0x3) = 2920
select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
splice(0x6, 0, 0x4, 0, 0xb68, 0x3) = 2920
select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
splice(0x6, 0, 0x4, 0, 0xb68, 0x3) = 2920
select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
splice(0x6, 0, 0x4, 0, 0xb68, 0x3) = 2920

==> at this point, exactly 14164 bytes have been transferred from the
server to the client, ie exactly the amount of data before the
corruption.

select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 172

==> we have a read of 172 bytes, exactly the number of corrupted bytes.

select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
splice(0x6, 0, 0x4, 0, 0x660, 0x3) = 1632

==> and here, we write the last two chunks at once and the reader
observes random data for the first chunk.

select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x5, 0, 0x7, 0, 0x10000, 0x3) = 1460
splice(0x6, 0, 0x4, 0, 0xb68, 0x3) = 2920
...


I tried to replace either socket with a file, and in both cases, the
problem disappears (which explains why we have not seen reports of
corruption on sendfile). So it's really between TCP and TCP the the
problem lies.

I tried to disable SPLICE_F_MOVE, it did not change anything.
Disabling SPLICE_F_NONBLOCK makes the code unusable as it blocks when
the pipe is full, while the same process is supposed to flush it.

It looks like the corruption happens between sockets and/or pipes,
because the first time it was noticed on a production system, the
system was running haproxy (reverse-proxy load-balancer), and we found
multiple copies of request and response headers in a data file. I
don't know however if those were from other active splices or from
earlier ones running on the same FDs.

Also, it's very hard to reproduce the problem under strace, so I feel
really lucky to have caught this trace. I think it's just a timing
problem.

Another method I found to quickly detect corruption is to feed
/dev/zero to netcat on the server side, and pipe the output to "od",
which will only show sequences of non-zeroes. Hmm just did it right
now again and I got a complete capture of the output of the "top"
utility that I just started in another window.

At first, I thought that blocking reads when there are data in the
pipe would make the problem go away, but it's not the case. I modified
my program to stop polling for reads when "bytes" is not null, and I
finally got on my socket a complete output of strace running in
another xterm.

I found an analysis [1] for a potential corruption problem between two
sockets, but I noticed there were no responses and I did not fully
understand the report anyway.

What can I do to help debug the problem ? I'm really willing to help
getting this fixed, and I also have at least one user who definitely
wants splice() to work because the recv/send model currently limits
haproxy to 3 Gbps on his machines, while I have no problem reaching
10 Gbps with splice().

Please find the attached program. The server's address is hard-coded
to 10.1.1.1:4001. It accepts connections on port 4000. Here's how you
can try to reproduce :

# server :
# nc -lp 4000 </dev/zero

# test machine :
# ./splice-net-to-net

# test machine (again) :
# nc 127.1 4000 | od -Ax -tx1

If you don't get the problem, start strace on splice-net-to-net, here
it seems to help (probably the timing race again).

Thanks in advance,
Willy

----
[1] http://lkml.org/lkml/2008/2/26/210


Attachments:
(No filename) (8.45 kB)
splice-net-to-net.c (3.50 kB)
Download all attachments

2009-01-06 08:55:09

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On 24-12-2008 16:28, Willy Tarreau wrote:
> Hi Jens,
>
> I'm facing a data corruption problem with splice() between two
> non-blocking TCP sockets on 2.6.27.10. I could finally write a
> simpler proof of concept, and capture a snapshot of the issue
> with the associated strace result.
...
> I found an analysis [1] for a potential corruption problem between two
> sockets, but I noticed there were no responses and I did not fully
> understand the report anyway.
>
> What can I do to help debug the problem ? I'm really willing to help
> getting this fixed, and I also have at least one user who definitely
> wants splice() to work because the recv/send model currently limits
> haproxy to 3 Gbps on his machines, while I have no problem reaching
> 10 Gbps with splice().
...
> ----
> [1] http://lkml.org/lkml/2008/2/26/210

Great story! Alas I don't understand this fully either, but it seems
Changli Gao was concerned with sendpage sending this "as pages", so
when NETIF_F_SG flag is available. Did you try this without SG btw?

Thanks,
Jarek P.

2009-01-06 09:42:01

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

Hi Jarek,

On Tue, Jan 06, 2009 at 08:54:42AM +0000, Jarek Poplawski wrote:
> On 24-12-2008 16:28, Willy Tarreau wrote:
> > Hi Jens,
> >
> > I'm facing a data corruption problem with splice() between two
> > non-blocking TCP sockets on 2.6.27.10. I could finally write a
> > simpler proof of concept, and capture a snapshot of the issue
> > with the associated strace result.
> ...
> > I found an analysis [1] for a potential corruption problem between two
> > sockets, but I noticed there were no responses and I did not fully
> > understand the report anyway.
> >
> > What can I do to help debug the problem ? I'm really willing to help
> > getting this fixed, and I also have at least one user who definitely
> > wants splice() to work because the recv/send model currently limits
> > haproxy to 3 Gbps on his machines, while I have no problem reaching
> > 10 Gbps with splice().
> ...
> > ----
> > [1] http://lkml.org/lkml/2008/2/26/210
>
> Great story! Alas I don't understand this fully either, but it seems
> Changli Gao was concerned with sendpage sending this "as pages", so
> when NETIF_F_SG flag is available. Did you try this without SG btw?

No I did not. I can try, it's not too hard. It would in part defeat the
purpose of the mechanism (especially at 10 Gbps) but at least it will
help narrow the problem down.

Thanks for the tip, I'll keep you informed !
Willy

2009-01-06 10:01:30

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Tue, Jan 06, 2009 at 10:41:38AM +0100, Willy Tarreau wrote:
...
> > Great story! Alas I don't understand this fully either, but it seems
> > Changli Gao was concerned with sendpage sending this "as pages", so
> > when NETIF_F_SG flag is available. Did you try this without SG btw?
>
> No I did not. I can try, it's not too hard. It would in part defeat the
> purpose of the mechanism (especially at 10 Gbps) but at least it will
> help narrow the problem down.

Yes, I meant it only as a proof of concept. BTW, delaying TCP acks a
bit for these sendpages should then make it more reproducible, I guess.

Jarek P.

2009-01-06 10:04:27

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Tue, Jan 06, 2009 at 10:01:13AM +0000, Jarek Poplawski wrote:
> On Tue, Jan 06, 2009 at 10:41:38AM +0100, Willy Tarreau wrote:
> ...
> > > Great story! Alas I don't understand this fully either, but it seems
> > > Changli Gao was concerned with sendpage sending this "as pages", so
> > > when NETIF_F_SG flag is available. Did you try this without SG btw?
> >
> > No I did not. I can try, it's not too hard. It would in part defeat the
> > purpose of the mechanism (especially at 10 Gbps) but at least it will
> > help narrow the problem down.
>
> Yes, I meant it only as a proof of concept. BTW, delaying TCP acks a
> bit for these sendpages should then make it more reproducible, I guess.

ah ? I can try that too, using iptables on the target machine to drop
a few outgoing acks.

Regards,
Willy

2009-01-06 15:58:03

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Tue, Jan 06, 2009 at 10:01:13AM +0000, Jarek Poplawski wrote:
> On Tue, Jan 06, 2009 at 10:41:38AM +0100, Willy Tarreau wrote:
> ...
> > > Great story! Alas I don't understand this fully either, but it seems
> > > Changli Gao was concerned with sendpage sending this "as pages", so
> > > when NETIF_F_SG flag is available. Did you try this without SG btw?
> >
> > No I did not. I can try, it's not too hard. It would in part defeat the
> > purpose of the mechanism (especially at 10 Gbps) but at least it will
> > help narrow the problem down.
>
> Yes, I meant it only as a proof of concept. BTW, delaying TCP acks a
> bit for these sendpages should then make it more reproducible, I guess.

OK here is an update. It does not change anything to turn off any acceleration
feature on the interface (tg3) :

root@wtap:~# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off

It still forwards corrupted data like mad. I noticed that the corruption rate
is 10-100 times higher when forwarding from eth0 to eth0 than from eth0 to lo.

Maybe this can help find the culprit ?

Willy

2009-01-06 17:56:53

by Ben Mansell

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

Hi,

> I'm facing a data corruption problem with splice() between two
> non-blocking TCP sockets on 2.6.27.10. I could finally write a
> simpler proof of concept, and capture a snapshot of the issue
> with the associated strace result.
>
> My program does the following :
> - accept an incoming connection
> - connect to a remote server
> - forward all data from the server to the client using splice()
>
> The data count is always correct, but some parts are corrupted and
> contain data which seem to come from random memory locations (this
> raises a security concern BTW). It *sometimes* happens that a few
> megabytes can be transferred without any problem, but most of the
> time, corruption happens for a few hundreds of bytes every few
> hundreds of kilobytes.

FWIW, I can easily reproduce this on a Linux 2.6.27-9 (Ubuntu kernel),
using both forcedeth and tg3 network drivers. It's reassuring to hear of
this network corruption as I have been puzzling over non-blocking
splice() code recently!

The corruption does seem to be confined to the user data on the
connections, as I have been able to run some benchmarks using my own
splice()-enabled HTTP proxy to transfer lots of data. All the TCP
connections 'work' fine (as in no broken TCP), the initial HTTP request
& response headers get through OK, but the body data of the responses
sometimes gets corrupted. The benchmark seems to work flawlessly until
you look at the web page data!

I'm happy to run any test code on systems here or provide any debug
information if it would help to track this down.


Ben

2009-01-06 18:16:23

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

Hi Ben,

On Tue, Jan 06, 2009 at 05:42:31PM +0000, Ben Mansell wrote:
> Hi,
>
> >I'm facing a data corruption problem with splice() between two
> >non-blocking TCP sockets on 2.6.27.10. I could finally write a
> >simpler proof of concept, and capture a snapshot of the issue
> >with the associated strace result.
> >
> >My program does the following :
> > - accept an incoming connection
> > - connect to a remote server
> > - forward all data from the server to the client using splice()
> >
> >The data count is always correct, but some parts are corrupted and
> >contain data which seem to come from random memory locations (this
> >raises a security concern BTW). It *sometimes* happens that a few
> >megabytes can be transferred without any problem, but most of the
> >time, corruption happens for a few hundreds of bytes every few
> >hundreds of kilobytes.
>
> FWIW, I can easily reproduce this on a Linux 2.6.27-9 (Ubuntu kernel),
> using both forcedeth and tg3 network drivers. It's reassuring to hear of
> this network corruption as I have been puzzling over non-blocking
> splice() code recently!

Ah, so you might also have discovered a few annoyances with the API, eg
the fact that splice() returns after the first read in non-blocking mode,
as well as the fact that it never returns zero on close, but -EAGAIN,
which requires an additional recv(MSG_PEEK) to distinguish between a
close and a lack of data. But I leave that for a later discussion, let's
address the corruption issue first.

> The corruption does seem to be confined to the user data on the
> connections, as I have been able to run some benchmarks using my own
> splice()-enabled HTTP proxy to transfer lots of data. All the TCP
> connections 'work' fine (as in no broken TCP), the initial HTTP request
> & response headers get through OK, but the body data of the responses
> sometimes gets corrupted. The benchmark seems to work flawlessly until
> you look at the web page data!

I confirm your observations. Benchmarks were OK, it's the first user of
my experimental code who reported wrong md5sums on their ISO images :-/
For this reason, I think that it's completely related to the way pages
are passed between sockets, but I'm too much a loser in this area. I
understood tcp_splice_read(), but can't manage to find what is called
on the other side nor what the data become :-(

> I'm happy to run any test code on systems here or provide any debug
> information if it would help to track this down.

That's nice, because I'd like to ensure that whatever fix is proposed
is properly validated, not only on my platforms!

Regards,
Willy

2009-01-06 18:32:40

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

Hi Willy.

Unfortunately I can not work on this problem right now, but will do if
things are not resolved after Jan 11 (long vacations will be finished in
Russia and I will return to my test machines :) But right now I have
one quesstion: I read several times your mail but still can not figure
out if receiving or sending side is broken?

I.e. can you splice from socket into the file, check the file, and then
splice to the another socket and check received data to find out which
side is broken? Or did I just missed that in the problem description?

Thanks a lot for the test application, it will greatly help to resolve
this issue.

--
Evgeniy Polyakov

2009-01-06 18:38:25

by Jens Axboe

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Tue, Jan 06 2009, Evgeniy Polyakov wrote:
> Hi Willy.
>
> Unfortunately I can not work on this problem right now, but will do if
> things are not resolved after Jan 11 (long vacations will be finished in
> Russia and I will return to my test machines :) But right now I have
> one quesstion: I read several times your mail but still can not figure
> out if receiving or sending side is broken?
>
> I.e. can you splice from socket into the file, check the file, and then
> splice to the another socket and check received data to find out which
> side is broken? Or did I just missed that in the problem description?
>
> Thanks a lot for the test application, it will greatly help to resolve
> this issue.

I'll give this a spin tomorrow as well. A hunch tells me that this is
likely a page reuse issue, that splice is getting the reference to the
buffer dropped before the data has really been transmitted. IOW, the
page is likely fine reaching the ->sendpage() bit, but will be reused
before the data has actually been transmitted. So once you get that far,
other random data from that page is going out.

Just a guess, I'll try and reproduce this tomorrow!

--
Jens Axboe

2009-01-06 18:50:52

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

Hi Evgeniy,

On Tue, Jan 06, 2009 at 09:32:23PM +0300, Evgeniy Polyakov wrote:
> Hi Willy.
>
> Unfortunately I can not work on this problem right now, but will do if
> things are not resolved after Jan 11 (long vacations will be finished in
> Russia and I will return to my test machines :) But right now I have
> one quesstion: I read several times your mail but still can not figure
> out if receiving or sending side is broken?
>
> I.e. can you splice from socket into the file, check the file, and then
> splice to the another socket and check received data to find out which
> side is broken? Or did I just missed that in the problem description?

Maybe I wasn't very clear. It's only when splicing from a socket to another
socket that it breaks. Splicing from a file to a socket is OK (sendfile is
OK too BTW). Splicing from a socket to a file is OK.

And BTW, the receiving side must be on a real interface, not loopback. Here
are my observations :
- splice data from lo to lo => no corruption at all
- splice data from lo to eth0 => no corruption at all
- splice data from eth0 to lo => occasional corruption
- splice data from eth0 to eth0 => massive corruption

I think this may ring a bell for someone.

> Thanks a lot for the test application, it will greatly help to resolve
> this issue.

I figured it was an absolute necessity. The original code in my proxy is in
an experimental state and far too hard to debug for these purposes. It was
enough to detect the problem, but I could run a lot more tests with this
small test app ! An who know, maybe it will serve as an example for
non-blocking splice ;-)

Cheers,
Willy

2009-01-06 18:56:00

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

Hi Jens,

On Tue, Jan 06, 2009 at 07:37:05PM +0100, Jens Axboe wrote:
> On Tue, Jan 06 2009, Evgeniy Polyakov wrote:
> > Hi Willy.
> >
> > Unfortunately I can not work on this problem right now, but will do if
> > things are not resolved after Jan 11 (long vacations will be finished in
> > Russia and I will return to my test machines :) But right now I have
> > one quesstion: I read several times your mail but still can not figure
> > out if receiving or sending side is broken?
> >
> > I.e. can you splice from socket into the file, check the file, and then
> > splice to the another socket and check received data to find out which
> > side is broken? Or did I just missed that in the problem description?
> >
> > Thanks a lot for the test application, it will greatly help to resolve
> > this issue.
>
> I'll give this a spin tomorrow as well. A hunch tells me that this is
> likely a page reuse issue, that splice is getting the reference to the
> buffer dropped before the data has really been transmitted. IOW, the
> page is likely fine reaching the ->sendpage() bit, but will be reused
> before the data has actually been transmitted. So once you get that far,
> other random data from that page is going out.

I like your explanation because eventhough I don't understand the code
(can't follow it past the actors in fact), I understand the problem you're
suggesting ;-)

> Just a guess, I'll try and reproduce this tomorrow!

OK. In order not to waste your time, run the test app from one interface to
the same one, with both the client and the server on the same machine, distinct
from the test app. It will trigger immediately. "nc|od -Ax -tx1" will save you
a lot of time on the client side too BTW.

Thanks,
Willy

2009-01-07 04:43:00

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Tue, Jan 06, 2009 at 06:37:05PM +0000, Jens Axboe wrote:
>
> I'll give this a spin tomorrow as well. A hunch tells me that this is
> likely a page reuse issue, that splice is getting the reference to the
> buffer dropped before the data has really been transmitted. IOW, the
> page is likely fine reaching the ->sendpage() bit, but will be reused
> before the data has actually been transmitted. So once you get that far,
> other random data from that page is going out.

I see the problem.

The socket pipes in net/core/skbuff.c use references on the skb
to hold down the memory in skb->head as well as the pages in the
skb.

Unfortunately, once the pipe is fed into sendpage we only use
page reference counting to pin down the memory. So as soon as
sendpage returns we drop the ref count on the skb, thus freeing
the memory in skb->head, which is yet to be transmitted.

Moral: Using page reference counts on skb->head is wrong.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-07 06:42:41

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 03:42:32PM +1100, Herbert Xu wrote:
> On Tue, Jan 06, 2009 at 06:37:05PM +0000, Jens Axboe wrote:
> >
> > I'll give this a spin tomorrow as well. A hunch tells me that this is
> > likely a page reuse issue, that splice is getting the reference to the
> > buffer dropped before the data has really been transmitted. IOW, the
> > page is likely fine reaching the ->sendpage() bit, but will be reused
> > before the data has actually been transmitted. So once you get that far,
> > other random data from that page is going out.
>
> I see the problem.
>
> The socket pipes in net/core/skbuff.c use references on the skb
> to hold down the memory in skb->head as well as the pages in the
> skb.
>
> Unfortunately, once the pipe is fed into sendpage we only use
> page reference counting to pin down the memory. So as soon as
> sendpage returns we drop the ref count on the skb, thus freeing
> the memory in skb->head, which is yet to be transmitted.

So this means that anything relying on sendpage() is at risk ? What
I find really strange is that I can only reproduce the issue if the
spliced data come from a real interface. If they come from the loopback
or from a file, there is no problem. Maybe the ref counting is different
depending on the origin of the data ?

> Moral: Using page reference counts on skb->head is wrong.

My question will sound stupid to some of you, but wouldn't increasing
the refcount on those skb solve the problem (and decreasing it once
the skb is effectively sent) ?

Regards,
Willy

2009-01-07 08:18:27

by Jens Axboe

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07 2009, Herbert Xu wrote:
> On Tue, Jan 06, 2009 at 06:37:05PM +0000, Jens Axboe wrote:
> >
> > I'll give this a spin tomorrow as well. A hunch tells me that this is
> > likely a page reuse issue, that splice is getting the reference to the
> > buffer dropped before the data has really been transmitted. IOW, the
> > page is likely fine reaching the ->sendpage() bit, but will be reused
> > before the data has actually been transmitted. So once you get that far,
> > other random data from that page is going out.
>
> I see the problem.
>
> The socket pipes in net/core/skbuff.c use references on the skb
> to hold down the memory in skb->head as well as the pages in the
> skb.
>
> Unfortunately, once the pipe is fed into sendpage we only use
> page reference counting to pin down the memory. So as soon as
> sendpage returns we drop the ref count on the skb, thus freeing
> the memory in skb->head, which is yet to be transmitted.
>
> Moral: Using page reference counts on skb->head is wrong.

So my hunch was pretty close. The fix would seem to involve NOT calling
ops->release(pipe, buf) until we actually have an ACK on that data gone
out.

--
Jens Axboe

2009-01-07 09:39:37

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Tue, Jan 06, 2009 at 04:57:15PM +0100, Willy Tarreau wrote:
> On Tue, Jan 06, 2009 at 10:01:13AM +0000, Jarek Poplawski wrote:
> > On Tue, Jan 06, 2009 at 10:41:38AM +0100, Willy Tarreau wrote:
> > ...
> > > > Great story! Alas I don't understand this fully either, but it seems
> > > > Changli Gao was concerned with sendpage sending this "as pages", so
> > > > when NETIF_F_SG flag is available. Did you try this without SG btw?
> > >
> > > No I did not. I can try, it's not too hard. It would in part defeat the
> > > purpose of the mechanism (especially at 10 Gbps) but at least it will
> > > help narrow the problem down.
> >
> > Yes, I meant it only as a proof of concept. BTW, delaying TCP acks a
> > bit for these sendpages should then make it more reproducible, I guess.
>
> OK here is an update. It does not change anything to turn off any acceleration
> feature on the interface (tg3) :
>
> root@wtap:~# ethtool -k eth0
> Offload parameters for eth0:
> rx-checksumming: off
> tx-checksumming: off
> scatter-gather: off
> tcp segmentation offload: off
>
> It still forwards corrupted data like mad. I noticed that the corruption rate
> is 10-100 times higher when forwarding from eth0 to eth0 than from eth0 to lo.
>
> Maybe this can help find the culprit ?

I hope it will, but I still don't get it. Anyway, here is an untested
patch, which I guess partly tries Changli Gao's recommendation to give
real pages to splice/pipe (but here it's always - not for sendpage
only).

BTW, I added Changli to Cc - great review!

Thanks,
Jarek P.

---

net/core/skbuff.c | 41 +++++++++++++++++++++++++++--------------
1 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..4c080cd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- kfree_skb(skb);
+ put_page(buf->page);
}

static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- skb_get(skb);
+ get_page(buf->page);
}

static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
*/
static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
{
- struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+ put_page(spd->pages[i]);
+}

- kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+ unsigned int offset)
+{
+ struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+ if (!p)
+ return NULL;
+ memcpy(p + offset, page + offset, len);
+
+ return p;
}

/*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
*/
static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
unsigned int len, unsigned int offset,
- struct sk_buff *skb)
+ struct sk_buff *skb, int linear)
{
if (unlikely(spd->nr_pages == PIPE_BUFFERS))
return 1;

+ if (linear) {
+ page = linear_to_page(page, len, offset);
+ if (!page)
+ return 1;
+ }
+
spd->pages[spd->nr_pages] = page;
spd->partial[spd->nr_pages].len = len;
spd->partial[spd->nr_pages].offset = offset;
- spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
spd->nr_pages++;
+ get_page(page);
+
return 0;
}

@@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
static inline int __splice_segment(struct page *page, unsigned int poff,
unsigned int plen, unsigned int *off,
unsigned int *len, struct sk_buff *skb,
- struct splice_pipe_desc *spd)
+ struct splice_pipe_desc *spd, int linear)
{
if (!*len)
return 1;
@@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
/* the linear region may spread across several pages */
flen = min_t(unsigned int, flen, PAGE_SIZE - poff);

- if (spd_fill_page(spd, page, flen, poff, skb))
+ if (spd_fill_page(spd, page, flen, poff, skb, linear))
return 1;

__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
if (__splice_segment(virt_to_page(skb->data),
(unsigned long) skb->data & (PAGE_SIZE - 1),
skb_headlen(skb),
- offset, len, skb, spd))
+ offset, len, skb, spd, 1))
return 1;

/*
@@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];

if (__splice_segment(f->page, f->page_offset, f->size,
- offset, len, skb, spd))
+ offset, len, skb, spd, 0))
return 1;
}

2009-01-07 09:52:52

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 07:38:59AM +0100, Willy Tarreau wrote:
>
> > Unfortunately, once the pipe is fed into sendpage we only use
> > page reference counting to pin down the memory. So as soon as
> > sendpage returns we drop the ref count on the skb, thus freeing
> > the memory in skb->head, which is yet to be transmitted.
>
> So this means that anything relying on sendpage() is at risk ? What

No the bug is in the splice code. sendfile() and other sendpage
users are not affected.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-07 09:55:00

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 08:52:18PM +1100, Herbert Xu wrote:
> On Wed, Jan 07, 2009 at 07:38:59AM +0100, Willy Tarreau wrote:
> >
> > > Unfortunately, once the pipe is fed into sendpage we only use
> > > page reference counting to pin down the memory. So as soon as
> > > sendpage returns we drop the ref count on the skb, thus freeing
> > > the memory in skb->head, which is yet to be transmitted.
> >
> > So this means that anything relying on sendpage() is at risk ? What
>
> No the bug is in the splice code. sendfile() and other sendpage
> users are not affected.

but sendfile() now uses splice().

Willy

2009-01-07 11:29:23

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

Hi Herbert.

On Wed, Jan 07, 2009 at 03:42:32PM +1100, Herbert Xu ([email protected]) wrote:
> I see the problem.
>
> The socket pipes in net/core/skbuff.c use references on the skb
> to hold down the memory in skb->head as well as the pages in the
> skb.
>
> Unfortunately, once the pipe is fed into sendpage we only use
> page reference counting to pin down the memory. So as soon as
> sendpage returns we drop the ref count on the skb, thus freeing
> the memory in skb->head, which is yet to be transmitted.
>
> Moral: Using page reference counts on skb->head is wrong.

That would not happen without scatter-gather support on the interface,
date would be plain copied, and after Jarek's requst Willy confirmed
that corruption happens with all acceleration being turned off.

--
Evgeniy Polyakov

2009-01-07 11:50:59

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 02:29:06PM +0300, Evgeniy Polyakov wrote:
>
> That would not happen without scatter-gather support on the interface,
> date would be plain copied, and after Jarek's requst Willy confirmed
> that corruption happens with all acceleration being turned off.

It will happen without scatter-gather support. The problem
is with skb->head, which is always there. In any case, SG can't
make any difference because the skb is an inbound one and most
drivers only produce linear packets.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-07 11:53:14

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 10:54:15AM +0100, Willy Tarreau wrote:
>
> but sendfile() now uses splice().

Good point. However, this bug only triggers when inbound splice
is reinjected into sendpage so sendfile() users shouldn't see it.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-07 11:56:21

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 10:50:32PM +1100, Herbert Xu ([email protected]) wrote:
> > That would not happen without scatter-gather support on the interface,
> > date would be plain copied, and after Jarek's requst Willy confirmed
> > that corruption happens with all acceleration being turned off.
>
> It will happen without scatter-gather support. The problem
> is with skb->head, which is always there. In any case, SG can't
> make any difference because the skb is an inbound one and most
> drivers only produce linear packets.

But it will not push pages into the stack, but copy them including copy
of the submitted head?

--
Evgeniy Polyakov

2009-01-07 11:59:45

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 02:56:05PM +0300, Evgeniy Polyakov wrote:
>
> But it will not push pages into the stack, but copy them including copy
> of the submitted head?

It will use a single page entry for skb->head with splice, see
skb_splice_bits.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-07 12:15:43

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 10:59:21PM +1100, Herbert Xu ([email protected]) wrote:
> On Wed, Jan 07, 2009 at 02:56:05PM +0300, Evgeniy Polyakov wrote:
> >
> > But it will not push pages into the stack, but copy them including copy
> > of the submitted head?
>
> It will use a single page entry for skb->head with splice, see
> skb_splice_bits.

Looks like we are talking about different directions of the dataflow.
I meant that set of pages submitted into the sending part will be copied
if sending interface does not support acceleration, and thus it will
copy part of the page corresponding to the linear part of the skb prior
the transmission, so even if skb will be freed right after that call
(prior data transmission by the hardware), it should not affect copied
data.

--
Evgeniy Polyakov

2009-01-07 12:23:00

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 03:15:30PM +0300, Evgeniy Polyakov wrote:
>
> Looks like we are talking about different directions of the dataflow.
> I meant that set of pages submitted into the sending part will be copied
> if sending interface does not support acceleration, and thus it will
> copy part of the page corresponding to the linear part of the skb prior
> the transmission, so even if skb will be freed right after that call
> (prior data transmission by the hardware), it should not affect copied
> data.

You must be looking at a different tcp.c than the one I've got
because mine clearly always uses skb frags in sendpage regardless
of SG support.

Yes we will linearize the packet in dev_queue_xmit but as soon
as the netdev stops the tx queue you'll get corruption.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-07 12:25:10

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 01:22:05PM +0100, Willy Tarreau wrote:
>
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 5110b35..4c080cd 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
> > static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
> > struct pipe_buffer *buf)
> > {
> > - struct sk_buff *skb = (struct sk_buff *) buf->private;
> > -
> > - kfree_skb(skb);
> > + put_page(buf->page);
> > }
> >
> > static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
> > struct pipe_buffer *buf)
> > {
> > - struct sk_buff *skb = (struct sk_buff *) buf->private;
> > -
> > - skb_get(skb);
> > + get_page(buf->page);
> > }

Well this patch can only make it worse because not only are you
still ref counting skb->head with get_page, but you've also
completely removed the skb ref count which means that the corruption
can only occur sooner.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-07 12:25:44

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

[ CCing Evgeniy and Herbert who also participate to the thread ]

On Wed, Jan 07, 2009 at 09:39:15AM +0000, Jarek Poplawski wrote:
> On Tue, Jan 06, 2009 at 04:57:15PM +0100, Willy Tarreau wrote:
> > On Tue, Jan 06, 2009 at 10:01:13AM +0000, Jarek Poplawski wrote:
> > > On Tue, Jan 06, 2009 at 10:41:38AM +0100, Willy Tarreau wrote:
> > > ...
> > > > > Great story! Alas I don't understand this fully either, but it seems
> > > > > Changli Gao was concerned with sendpage sending this "as pages", so
> > > > > when NETIF_F_SG flag is available. Did you try this without SG btw?
> > > >
> > > > No I did not. I can try, it's not too hard. It would in part defeat the
> > > > purpose of the mechanism (especially at 10 Gbps) but at least it will
> > > > help narrow the problem down.
> > >
> > > Yes, I meant it only as a proof of concept. BTW, delaying TCP acks a
> > > bit for these sendpages should then make it more reproducible, I guess.
> >
> > OK here is an update. It does not change anything to turn off any acceleration
> > feature on the interface (tg3) :
> >
> > root@wtap:~# ethtool -k eth0
> > Offload parameters for eth0:
> > rx-checksumming: off
> > tx-checksumming: off
> > scatter-gather: off
> > tcp segmentation offload: off
> >
> > It still forwards corrupted data like mad. I noticed that the corruption rate
> > is 10-100 times higher when forwarding from eth0 to eth0 than from eth0 to lo.
> >
> > Maybe this can help find the culprit ?
>
> I hope it will, but I still don't get it. Anyway, here is an untested
> patch, which I guess partly tries Changli Gao's recommendation to give
> real pages to splice/pipe (but here it's always - not for sendpage
> only).
>
> BTW, I added Changli to Cc - great review!
>
> Thanks,
> Jarek P.

Well, I've just tested it. It did not fix the problem but made it worse.
Sending a few bytes at a time, the corruption is still there, from the
beginning. Here's what I fed :

willy@pcw:~$ nc -lp4001
azerazerazerazerazerazer
eiguhaeihgaeighaeighaeirghiareg
aeroigjaeorgjaeorgjaoeigjaeoig
ioejrgoiaerjgoiaerjgoiaerjgaoiejgoaiejg

Here's what I got :

willy@pcw:~$ telnet 10.0.3.2 4000
Trying 10.0.3.2...
Connected to 10.0.3.2.
Escape character is '^]'.
_J0s9?uMG1S9?t2D$E?L$T$

However, when feeding /dev/zero as in previous tests, the kernel paniced
in skb_release_data().

Regards,
Willy

> ---
>
> net/core/skbuff.c | 41 +++++++++++++++++++++++++++--------------
> 1 files changed, 27 insertions(+), 14 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5110b35..4c080cd 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
> static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
> struct pipe_buffer *buf)
> {
> - struct sk_buff *skb = (struct sk_buff *) buf->private;
> -
> - kfree_skb(skb);
> + put_page(buf->page);
> }
>
> static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
> struct pipe_buffer *buf)
> {
> - struct sk_buff *skb = (struct sk_buff *) buf->private;
> -
> - skb_get(skb);
> + get_page(buf->page);
> }
>
> static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
> @@ -1334,9 +1330,19 @@ fault:
> */
> static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
> {
> - struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
> + put_page(spd->pages[i]);
> +}
>
> - kfree_skb(skb);
> +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> + unsigned int offset)
> +{
> + struct page *p = alloc_pages(GFP_KERNEL, 0);
> +
> + if (!p)
> + return NULL;
> + memcpy(p + offset, page + offset, len);
> +
> + return p;
> }
>
> /*
> @@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
> */
> static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
> unsigned int len, unsigned int offset,
> - struct sk_buff *skb)
> + struct sk_buff *skb, int linear)
> {
> if (unlikely(spd->nr_pages == PIPE_BUFFERS))
> return 1;
>
> + if (linear) {
> + page = linear_to_page(page, len, offset);
> + if (!page)
> + return 1;
> + }
> +
> spd->pages[spd->nr_pages] = page;
> spd->partial[spd->nr_pages].len = len;
> spd->partial[spd->nr_pages].offset = offset;
> - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
> spd->nr_pages++;
> + get_page(page);
> +
> return 0;
> }
>
> @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
> static inline int __splice_segment(struct page *page, unsigned int poff,
> unsigned int plen, unsigned int *off,
> unsigned int *len, struct sk_buff *skb,
> - struct splice_pipe_desc *spd)
> + struct splice_pipe_desc *spd, int linear)
> {
> if (!*len)
> return 1;
> @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
> /* the linear region may spread across several pages */
> flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
>
> - if (spd_fill_page(spd, page, flen, poff, skb))
> + if (spd_fill_page(spd, page, flen, poff, skb, linear))
> return 1;
>
> __segment_seek(&page, &poff, &plen, flen);
> @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
> if (__splice_segment(virt_to_page(skb->data),
> (unsigned long) skb->data & (PAGE_SIZE - 1),
> skb_headlen(skb),
> - offset, len, skb, spd))
> + offset, len, skb, spd, 1))
> return 1;
>
> /*
> @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
> const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
>
> if (__splice_segment(f->page, f->page_offset, f->size,
> - offset, len, skb, spd))
> + offset, len, skb, spd, 0))
> return 1;
> }
>

2009-01-07 12:27:58

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 11:22:38PM +1100, Herbert Xu wrote:
>
> Yes we will linearize the packet in dev_queue_xmit but as soon
> as the netdev stops the tx queue you'll get corruption.

OK that can't happen because the linearisation precedes that, but
ARP (or anything else that delays the skb prior to dev_queue_xmit)
can still cause corruption.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-07 12:31:01

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 11:27:31PM +1100, Herbert Xu wrote:
>
> OK that can't happen because the linearisation precedes that, but
> ARP (or anything else that delays the skb prior to dev_queue_xmit)
> can still cause corruption.

In fact it's probably TCP that's delaying the packet.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-07 12:32:21

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 01:22:05PM +0100, Willy Tarreau wrote:
> [ CCing Evgeniy and Herbert who also participate to the thread ]
...
> Well, I've just tested it. It did not fix the problem but made it worse.
...

Terrible mistake! Here is take 2.

Sorry!
Jarek P.

---

net/core/skbuff.c | 41 +++++++++++++++++++++++++++--------------
1 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..a598034 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- kfree_skb(skb);
+ put_page(buf->page);
}

static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- skb_get(skb);
+ get_page(buf->page);
}

static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
*/
static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
{
- struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+ put_page(spd->pages[i]);
+}

- kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+ unsigned int offset)
+{
+ struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+ if (!p)
+ return NULL;
+ memcpy((void *)p + offset, (void *)page + offset, len);
+
+ return p;
}

/*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
*/
static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
unsigned int len, unsigned int offset,
- struct sk_buff *skb)
+ struct sk_buff *skb, int linear)
{
if (unlikely(spd->nr_pages == PIPE_BUFFERS))
return 1;

+ if (linear) {
+ page = linear_to_page(page, len, offset);
+ if (!page)
+ return 1;
+ }
+
spd->pages[spd->nr_pages] = page;
spd->partial[spd->nr_pages].len = len;
spd->partial[spd->nr_pages].offset = offset;
- spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
spd->nr_pages++;
+ get_page(page);
+
return 0;
}

@@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
static inline int __splice_segment(struct page *page, unsigned int poff,
unsigned int plen, unsigned int *off,
unsigned int *len, struct sk_buff *skb,
- struct splice_pipe_desc *spd)
+ struct splice_pipe_desc *spd, int linear)
{
if (!*len)
return 1;
@@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
/* the linear region may spread across several pages */
flen = min_t(unsigned int, flen, PAGE_SIZE - poff);

- if (spd_fill_page(spd, page, flen, poff, skb))
+ if (spd_fill_page(spd, page, flen, poff, skb, linear))
return 1;

__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
if (__splice_segment(virt_to_page(skb->data),
(unsigned long) skb->data & (PAGE_SIZE - 1),
skb_headlen(skb),
- offset, len, skb, spd))
+ offset, len, skb, spd, 1))
return 1;

/*
@@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];

if (__splice_segment(f->page, f->page_offset, f->size,
- offset, len, skb, spd))
+ offset, len, skb, spd, 0))
return 1;
}

2009-01-07 12:36:25

by Jens Axboe

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07 2009, Jarek Poplawski wrote:
> On Wed, Jan 07, 2009 at 01:22:05PM +0100, Willy Tarreau wrote:
> > [ CCing Evgeniy and Herbert who also participate to the thread ]
> ...
> > Well, I've just tested it. It did not fix the problem but made it worse.
> ...
>
> Terrible mistake! Here is take 2.

Not sure what this:

> +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> + unsigned int offset)
> +{
> + struct page *p = alloc_pages(GFP_KERNEL, 0);
> +
> + if (!p)
> + return NULL;
> + memcpy((void *)p + offset, (void *)page + offset, len);

is trying to do. I'm assuming you want to copy the page contents? If so,
you'd want something like

memcpy(page_address(p) + offset, page_address(page) + offset, len);

with possible kmaps for 'page'.

Irregardless of that particular oddity, I don't think this is the right
path to take at all. We need to delay the pipe buffer consumption until
the appropriate time.

--
Jens Axboe

2009-01-07 12:37:54

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 11:22:38PM +1100, Herbert Xu ([email protected]) wrote:
> > Looks like we are talking about different directions of the dataflow.
> > I meant that set of pages submitted into the sending part will be copied
> > if sending interface does not support acceleration, and thus it will
> > copy part of the page corresponding to the linear part of the skb prior
> > the transmission, so even if skb will be freed right after that call
> > (prior data transmission by the hardware), it should not affect copied
> > data.
>
> You must be looking at a different tcp.c than the one I've got
> because mine clearly always uses skb frags in sendpage regardless
> of SG support.

Doesn't your tcp fallbacks to kernel_sendmsg() without sg in
tcp_sendpage()? And then just feeds data into the stack the same way it
happens with send() i.e. by copying it.

> Yes we will linearize the packet in dev_queue_xmit but as soon
> as the netdev stops the tx queue you'll get corruption.

That's perfectly valid when sendpage() returns and holds a reference to
the pages but not skb->head, so freed skb will free (and potentially
reuse) that area which has not been transmitted yet.
But without acceleration it will copy data and the whole original skb
may be freed without any problems.

--
Evgeniy Polyakov

2009-01-07 12:38:28

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 11:24:07PM +1100, Herbert Xu wrote:
> On Wed, Jan 07, 2009 at 01:22:05PM +0100, Willy Tarreau wrote:
> >
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 5110b35..4c080cd 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
> > > static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
> > > struct pipe_buffer *buf)
> > > {
> > > - struct sk_buff *skb = (struct sk_buff *) buf->private;
> > > -
> > > - kfree_skb(skb);
> > > + put_page(buf->page);
> > > }
> > >
> > > static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
> > > struct pipe_buffer *buf)
> > > {
> > > - struct sk_buff *skb = (struct sk_buff *) buf->private;
> > > -
> > > - skb_get(skb);
> > > + get_page(buf->page);
> > > }
>
> Well this patch can only make it worse because not only are you
> still ref counting skb->head with get_page, but you've also
> completely removed the skb ref count which means that the corruption
> can only occur sooner.

But we don't need this skb... except its ->frags[] pages, which are
get_paged?! (The rest is copied to new pages.)

Jarek P.

2009-01-07 12:40:48

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 01:35:04PM +0100, Jens Axboe ([email protected]) wrote:
> Irregardless of that particular oddity, I don't think this is the right
> path to take at all. We need to delay the pipe buffer consumption until
> the appropriate time.

As a proof of concept we can put a delayed work_struct into the buffer
and only release its content after some timeout big enough (like one
second or so) for the hardware to actually transmit its buffers.

--
Evgeniy Polyakov

2009-01-07 12:42:36

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 12:31:53PM +0000, Jarek Poplawski wrote:
> On Wed, Jan 07, 2009 at 01:22:05PM +0100, Willy Tarreau wrote:
> > [ CCing Evgeniy and Herbert who also participate to the thread ]
> ...
> > Well, I've just tested it. It did not fix the problem but made it worse.
> ...
>
> Terrible mistake! Here is take 2.

no problem. However, Herbert explained that this fix could not work for
multiple reasons (refcounting not fixed where the issue really is). BTW,
I've just tried it, and it crashed again, this time in tcp_send_ack().

Regards,
Willy

2009-01-07 12:43:22

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 03:37:41PM +0300, Evgeniy Polyakov wrote:
>
> Doesn't your tcp fallbacks to kernel_sendmsg() without sg in
> tcp_sendpage()? And then just feeds data into the stack the same way it
> happens with send() i.e. by copying it.

Good point. Did he check GSO though? GSO will always enable SG
on the socket regardless of the netdev's setting. And if the device
started out with SG enabled then recent kernels will enable GSO
by default.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-07 12:45:42

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

Willy Tarreau <[email protected]> wrote:
>
> root@wtap:~# ethtool -k eth0
> Offload parameters for eth0:
> rx-checksumming: off
> tx-checksumming: off
> scatter-gather: off
> tcp segmentation offload: off

Can you please upgrade your ethtool so that it knows about GSO?
Once you've done that please retest with everything including
GSO turned off.

Alternatively edit out the bit that enables GSO by default in
net/core/dev.c.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-07 12:46:43

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 11:42:46PM +1100, Herbert Xu ([email protected]) wrote:
> > Doesn't your tcp fallbacks to kernel_sendmsg() without sg in
> > tcp_sendpage()? And then just feeds data into the stack the same way it
> > happens with send() i.e. by copying it.
>
> Good point. Did he check GSO though? GSO will always enable SG
> on the socket regardless of the netdev's setting. And if the device
> started out with SG enabled then recent kernels will enable GSO
> by default.

Willy, what was the kernel you are tested no-accel behaviour and what
were the gso settings?
Can you add a simple single print into tcp_sendpage() to determine if
content was copied or fed into do_tcp_sendpages() otherwise?

--
Evgeniy Polyakov

2009-01-07 12:50:15

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 01:35:04PM +0100, Jens Axboe wrote:
> On Wed, Jan 07 2009, Jarek Poplawski wrote:
> > On Wed, Jan 07, 2009 at 01:22:05PM +0100, Willy Tarreau wrote:
> > > [ CCing Evgeniy and Herbert who also participate to the thread ]
> > ...
> > > Well, I've just tested it. It did not fix the problem but made it worse.
> > ...
> >
> > Terrible mistake! Here is take 2.
>
> Not sure what this:
>
> > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > + unsigned int offset)
> > +{
> > + struct page *p = alloc_pages(GFP_KERNEL, 0);
> > +
> > + if (!p)
> > + return NULL;
> > + memcpy((void *)p + offset, (void *)page + offset, len);
>
> is trying to do. I'm assuming you want to copy the page contents? If so,
> you'd want something like
>
> memcpy(page_address(p) + offset, page_address(page) + offset, len);
>
> with possible kmaps for 'page'.

My BAD!!! Thanks!

>
> Irregardless of that particular oddity, I don't think this is the right
> path to take at all. We need to delay the pipe buffer consumption until
> the appropriate time.

Hmm... in any case: take 3

Jarek P.

---

net/core/skbuff.c | 41 +++++++++++++++++++++++++++--------------
1 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..6e43d52 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- kfree_skb(skb);
+ put_page(buf->page);
}

static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- skb_get(skb);
+ get_page(buf->page);
}

static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
*/
static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
{
- struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+ put_page(spd->pages[i]);
+}

- kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+ unsigned int offset)
+{
+ struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+ if (!p)
+ return NULL;
+ memcpy(page_address(p) + offset, page_address(page) + offset, len);
+
+ return p;
}

/*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
*/
static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
unsigned int len, unsigned int offset,
- struct sk_buff *skb)
+ struct sk_buff *skb, int linear)
{
if (unlikely(spd->nr_pages == PIPE_BUFFERS))
return 1;

+ if (linear) {
+ page = linear_to_page(page, len, offset);
+ if (!page)
+ return 1;
+ }
+
spd->pages[spd->nr_pages] = page;
spd->partial[spd->nr_pages].len = len;
spd->partial[spd->nr_pages].offset = offset;
- spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
spd->nr_pages++;
+ get_page(page);
+
return 0;
}

@@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
static inline int __splice_segment(struct page *page, unsigned int poff,
unsigned int plen, unsigned int *off,
unsigned int *len, struct sk_buff *skb,
- struct splice_pipe_desc *spd)
+ struct splice_pipe_desc *spd, int linear)
{
if (!*len)
return 1;
@@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
/* the linear region may spread across several pages */
flen = min_t(unsigned int, flen, PAGE_SIZE - poff);

- if (spd_fill_page(spd, page, flen, poff, skb))
+ if (spd_fill_page(spd, page, flen, poff, skb, linear))
return 1;

__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
if (__splice_segment(virt_to_page(skb->data),
(unsigned long) skb->data & (PAGE_SIZE - 1),
skb_headlen(skb),
- offset, len, skb, spd))
+ offset, len, skb, spd, 1))
return 1;

/*
@@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];

if (__splice_segment(f->page, f->page_offset, f->size,
- offset, len, skb, spd))
+ offset, len, skb, spd, 0))
return 1;
}

2009-01-07 12:52:27

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 03:40:34PM +0300, Evgeniy Polyakov wrote:
> On Wed, Jan 07, 2009 at 01:35:04PM +0100, Jens Axboe ([email protected]) wrote:
> > Irregardless of that particular oddity, I don't think this is the right
> > path to take at all. We need to delay the pipe buffer consumption until
> > the appropriate time.
>
> As a proof of concept we can put a delayed work_struct into the buffer
> and only release its content after some timeout big enough (like one
> second or so) for the hardware to actually transmit its buffers.

Evgeniy, I'd like to understand something related to our apparent lack of
knowledge of when the data is effectively transmitted. If we're focusing
on the send part, I can't understand why I never reproduce the corruption
when the data source is a file or loopback, but I only see it when the source
is an ethernet interface. How is it possible that a problem affecting only
the send side is so much selective about the source ? And in fact, why can't
we apply the same workflow for outgoing data for both types of sources ? It
seems to me that the page is released at the right time when sending a file,
and I don't see why we cannot apply the same principle when splicing between
sockets.

Please excuse me for my blattant ignorance in this area, as I once said, I
could not completely follow the whole splice process between tcp_splice_read()
and the moment the data leaves the machine. Also, I failed to understand what
linear data means. It seems to me this is the parts that are memcpy'd, but I'm
not sure.

Thanks,
Willy

2009-01-07 12:53:06

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 12:49:46PM +0000, Jarek Poplawski wrote:
>
> Hmm... in any case: take 3

Yes this should fix the corruption but it kind of defeats the
purpose of splice by copying the data.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-07 12:54:25

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 01:52:01PM +0100, Willy Tarreau wrote:
>
> Evgeniy, I'd like to understand something related to our apparent lack of
> knowledge of when the data is effectively transmitted. If we're focusing
> on the send part, I can't understand why I never reproduce the corruption
> when the data source is a file or loopback, but I only see it when the source
> is an ethernet interface. How is it possible that a problem affecting only

It doesn't happen with a file because in that case you don't
start with an skb so there is no skb->head. It probably doesn't
happen with loopback because loopback does GSO so again skb->head
does not exist (so to speak).

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-07 12:55:59

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 03:46:29PM +0300, Evgeniy Polyakov wrote:
> On Wed, Jan 07, 2009 at 11:42:46PM +1100, Herbert Xu ([email protected]) wrote:
> > > Doesn't your tcp fallbacks to kernel_sendmsg() without sg in
> > > tcp_sendpage()? And then just feeds data into the stack the same way it
> > > happens with send() i.e. by copying it.
> >
> > Good point. Did he check GSO though? GSO will always enable SG
> > on the socket regardless of the netdev's setting. And if the device
> > started out with SG enabled then recent kernels will enable GSO
> > by default.
>
> Willy, what was the kernel you are tested no-accel behaviour and what
> were the gso settings?

kernel is 2.6.27.10 + a few patches (squashfs, etc..., nothing related to
this area). My ethtool is a bit old and does not report GSO. I'll download
and rebuild a more recent one and retest.

> Can you add a simple single print into tcp_sendpage() to determine if
> content was copied or fed into do_tcp_sendpages() otherwise?

Yes, will do that too.

Willy

2009-01-07 12:56:41

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 01:39:02PM +0100, Willy Tarreau wrote:
...
> no problem. However, Herbert explained that this fix could not work for
> multiple reasons (refcounting not fixed where the issue really is). BTW,
> I've just tried it, and it crashed again, this time in tcp_send_ack().

I'm extremely sorry! (I don't even expect you'll dare take 3...)

Jarek P.

2009-01-07 12:58:00

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 01:55:35PM +0100, Willy Tarreau wrote:
>
> kernel is 2.6.27.10 + a few patches (squashfs, etc..., nothing related to

2.6.27 will enable GSO by default if SG is enabled at registration
time.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-07 12:58:24

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 11:53:56PM +1100, Herbert Xu ([email protected]) wrote:
> On Wed, Jan 07, 2009 at 01:52:01PM +0100, Willy Tarreau wrote:
> >
> > Evgeniy, I'd like to understand something related to our apparent lack of
> > knowledge of when the data is effectively transmitted. If we're focusing
> > on the send part, I can't understand why I never reproduce the corruption
> > when the data source is a file or loopback, but I only see it when the source
> > is an ethernet interface. How is it possible that a problem affecting only
>
> It doesn't happen with a file because in that case you don't
> start with an skb so there is no skb->head. It probably doesn't
> happen with loopback because loopback does GSO so again skb->head
> does not exist (so to speak).

Yup, basically splice's transmit pipe buffer contains page references,
where the first one is actually not a real page but skb, while in the
case of sendfile() and/or splice() from the file first page is a real
page of the appropriate file.

--
Evgeniy Polyakov

2009-01-07 13:01:57

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 02:00:31PM +0100, Willy Tarreau wrote:
>
> that's what I was initially wondering about, but it looks like only linear
> data is copied. Will that cause too much a overhead ? (I don't like copying

For splicing from socket to socket or socket to file, all the
data will be linear with most drivers.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-07 13:03:20

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 11:52:17PM +1100, Herbert Xu wrote:
> On Wed, Jan 07, 2009 at 12:49:46PM +0000, Jarek Poplawski wrote:
> >
> > Hmm... in any case: take 3
>
> Yes this should fix the corruption but it kind of defeats the
> purpose of splice by copying the data.
>

Hmm.. Again - the main point of this patch was a proof of concept...
If the bug is really here it could be optimized (e.g. done only for
sendmesage) or fixed another way.

Jarek P.

2009-01-07 13:03:34

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 11:57:34PM +1100, Herbert Xu ([email protected]) wrote:
> On Wed, Jan 07, 2009 at 01:55:35PM +0100, Willy Tarreau wrote:
> >
> > kernel is 2.6.27.10 + a few patches (squashfs, etc..., nothing related to
>
> 2.6.27 will enable GSO by default if SG is enabled at registration
> time.

So even if later sg was turned off without gso being turned off, it
will try doing fair page transfer and not falling back to the copy.

The simplest case is to also turn gso off, but with older ethtools it is
not possible and you hit the bug.

If I understood correctly Jarek's patch, he wants to allocate a page and
copy linear content of the skb there, so this will end up being the same
case as with splicing from the file. And while generally this may be a
good idea, but with usual 1.5k mtu this will copy every skb, which is
rather bad idea. Will have to think :)

--
Evgeniy Polyakov

2009-01-07 13:04:16

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 11:52:17PM +1100, Herbert Xu wrote:
> On Wed, Jan 07, 2009 at 12:49:46PM +0000, Jarek Poplawski wrote:
> >
> > Hmm... in any case: take 3
>
> Yes this should fix the corruption but it kind of defeats the
> purpose of splice by copying the data.

that's what I was initially wondering about, but it looks like only linear
data is copied. Will that cause too much a overhead ? (I don't like copying
at all anyway, but if it can help us find the cause, I'll happily test it).

BTW, Jarek, don't be sorry, I *am* expecting to crash my laptop a number of
times before ensuring the bug is fixed. I just don't want to lose my data.

Cheers,
Willy

2009-01-07 13:08:57

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 03:57:56PM +0300, Evgeniy Polyakov wrote:
> On Wed, Jan 07, 2009 at 11:53:56PM +1100, Herbert Xu ([email protected]) wrote:
> > On Wed, Jan 07, 2009 at 01:52:01PM +0100, Willy Tarreau wrote:
> > >
> > > Evgeniy, I'd like to understand something related to our apparent lack of
> > > knowledge of when the data is effectively transmitted. If we're focusing
> > > on the send part, I can't understand why I never reproduce the corruption
> > > when the data source is a file or loopback, but I only see it when the source
> > > is an ethernet interface. How is it possible that a problem affecting only
> >
> > It doesn't happen with a file because in that case you don't
> > start with an skb so there is no skb->head. It probably doesn't
> > happen with loopback because loopback does GSO so again skb->head
> > does not exist (so to speak).
>
> Yup, basically splice's transmit pipe buffer contains page references,
> where the first one is actually not a real page but skb, while in the
> case of sendfile() and/or splice() from the file first page is a real
> page of the appropriate file.

OK thanks guys for the clarifications.

Evgeniy, my printk() in tcp_sendpage() fired several times indicating we were
going through do_tcp_sendpage. During the same test, I observed a lot of
corruption.

Also, I have a good news. As you suggested, disabling both SG and GSO indeed
fixes the issue. do_tcp_sendpage() is not called anymore from tcp_sendpage()
in this case (according to dmesg).

Cheers,
Willy

2009-01-07 13:11:22

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 04:02:59PM +0300, Evgeniy Polyakov wrote:
...
> If I understood correctly Jarek's patch, he wants to allocate a page and
> copy linear content of the skb there, so this will end up being the same
> case as with splicing from the file. And while generally this may be a
> good idea, but with usual 1.5k mtu this will copy every skb, which is
> rather bad idea. Will have to think :)
>

As I mentioned earlier, I (poorly) tried to realize Changli's idea
only, and it's more to verify the reason still (after it failed with
SG method...).

Thanks,
Jarek P.

2009-01-07 13:16:10

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 01:10:57PM +0000, Jarek Poplawski wrote:
> On Wed, Jan 07, 2009 at 04:02:59PM +0300, Evgeniy Polyakov wrote:
> ...
> > If I understood correctly Jarek's patch, he wants to allocate a page and
> > copy linear content of the skb there, so this will end up being the same
> > case as with splicing from the file. And while generally this may be a
> > good idea, but with usual 1.5k mtu this will copy every skb, which is
> > rather bad idea. Will have to think :)
> >
>
> As I mentioned earlier, I (poorly) tried to realize Changli's idea
> only, and it's more to verify the reason still (after it failed with
> SG method...).

Jarek, since it now works when disabling both SG and GSO, we know that the
bug is triggered when passing through do_tcp_sendpage(). Thus, I will not
try your patch #3 yet, but rather wait for fix candidates to try.

Thanks!
Willy

2009-01-07 13:23:24

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 02:15:39PM +0100, Willy Tarreau wrote:
> On Wed, Jan 07, 2009 at 01:10:57PM +0000, Jarek Poplawski wrote:
> > On Wed, Jan 07, 2009 at 04:02:59PM +0300, Evgeniy Polyakov wrote:
> > ...
> > > If I understood correctly Jarek's patch, he wants to allocate a page and
> > > copy linear content of the skb there, so this will end up being the same
> > > case as with splicing from the file. And while generally this may be a
> > > good idea, but with usual 1.5k mtu this will copy every skb, which is
> > > rather bad idea. Will have to think :)
> > >
> >
> > As I mentioned earlier, I (poorly) tried to realize Changli's idea
> > only, and it's more to verify the reason still (after it failed with
> > SG method...).
>
> Jarek, since it now works when disabling both SG and GSO, we know that the
> bug is triggered when passing through do_tcp_sendpage(). Thus, I will not
> try your patch #3 yet, but rather wait for fix candidates to try.

Sure! Congratulations for debugging to you and Evgeniy!

Jarek P.

2009-01-07 14:02:06

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 01:22:58PM +0000, Jarek Poplawski wrote:
...
> Sure! Congratulations for debugging to you and Evgeniy!

...and Herbert! (I missed the beginning of this GSO trail...)

Jarek P.

2009-01-08 07:17:18

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On 06-01-2009 19:15, Willy Tarreau wrote:
...
> Ah, so you might also have discovered a few annoyances with the API, eg
> the fact that splice() returns after the first read in non-blocking mode,
> as well as the fact that it never returns zero on close, but -EAGAIN,
> which requires an additional recv(MSG_PEEK) to distinguish between a
> close and a lack of data. But I leave that for a later discussion, let's
> address the corruption issue first.

FYI, this should be just fixed:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4f7d54f59bc470f0aaa932f747a95232d7ebf8b1

Regards,
Jarek P.

2009-01-08 08:05:41

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Thu, Jan 08, 2009 at 07:16:51AM +0000, Jarek Poplawski wrote:
> On 06-01-2009 19:15, Willy Tarreau wrote:
> ...
> > Ah, so you might also have discovered a few annoyances with the API, eg
> > the fact that splice() returns after the first read in non-blocking mode,
> > as well as the fact that it never returns zero on close, but -EAGAIN,
> > which requires an additional recv(MSG_PEEK) to distinguish between a
> > close and a lack of data. But I leave that for a later discussion, let's
> > address the corruption issue first.
>
> FYI, this should be just fixed:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4f7d54f59bc470f0aaa932f747a95232d7ebf8b1
>

Ah cool, thanks Jarek for notifying us. Indeed, it's the exact same patch
I had pending here ;-)

I'll ping Greg for a backport into -stable, as applications relying on this
will clearly not work without that fix.

The other one I had consists in removing "|| !timeo" at the end of the loop,
because otherwise splice() returns very small chunks (typically 1448 or
1460 bytes), leading to disastrous performance on high bandwidth links.
At 10 Gbps, this means about 800000 calls to splice() per second!

Regards,
Willy

2009-01-08 14:53:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10


* Willy Tarreau <[email protected]> wrote:

> On Thu, Jan 08, 2009 at 07:16:51AM +0000, Jarek Poplawski wrote:
> > On 06-01-2009 19:15, Willy Tarreau wrote:
> > ...
> > > Ah, so you might also have discovered a few annoyances with the API, eg
> > > the fact that splice() returns after the first read in non-blocking mode,
> > > as well as the fact that it never returns zero on close, but -EAGAIN,
> > > which requires an additional recv(MSG_PEEK) to distinguish between a
> > > close and a lack of data. But I leave that for a later discussion, let's
> > > address the corruption issue first.
> >
> > FYI, this should be just fixed:
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4f7d54f59bc470f0aaa932f747a95232d7ebf8b1
> >
>
> Ah cool, thanks Jarek for notifying us. Indeed, it's the exact same patch
> I had pending here ;-)
>
> I'll ping Greg for a backport into -stable, as applications relying on
> this will clearly not work without that fix.
>
> The other one I had consists in removing "|| !timeo" at the end of the
> loop, because otherwise splice() returns very small chunks (typically
> 1448 or 1460 bytes), leading to disastrous performance on high bandwidth
> links. At 10 Gbps, this means about 800000 calls to splice() per second!

looks interesting - would you mind to submit it?

Ingo

2009-01-08 15:16:30

by Ben Mansell

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

Ingo Molnar wrote:
> * Willy Tarreau <[email protected]> wrote:
>
>> On Thu, Jan 08, 2009 at 07:16:51AM +0000, Jarek Poplawski wrote:
>>> On 06-01-2009 19:15, Willy Tarreau wrote:
>>> ...
>>>> Ah, so you might also have discovered a few annoyances with the API, eg
>>>> the fact that splice() returns after the first read in non-blocking mode,
>>>> as well as the fact that it never returns zero on close, but -EAGAIN,
>>>> which requires an additional recv(MSG_PEEK) to distinguish between a
>>>> close and a lack of data. But I leave that for a later discussion, let's
>>>> address the corruption issue first.
>>> FYI, this should be just fixed:
>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4f7d54f59bc470f0aaa932f747a95232d7ebf8b1
>>>
>> Ah cool, thanks Jarek for notifying us. Indeed, it's the exact same patch
>> I had pending here ;-)
>>
>> I'll ping Greg for a backport into -stable, as applications relying on
>> this will clearly not work without that fix.
>>
>> The other one I had consists in removing "|| !timeo" at the end of the
>> loop, because otherwise splice() returns very small chunks (typically
>> 1448 or 1460 bytes), leading to disastrous performance on high bandwidth
>> links. At 10 Gbps, this means about 800000 calls to splice() per second!
>
> looks interesting - would you mind to submit it?

FWIW, I've also tested this change with some splice() benchmarks. I can
confirm that removing the "|| !timeo" works well and improves
performance significantly.


Ben

2009-01-08 17:17:51

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Thu, Jan 08, 2009 at 03:53:00PM +0100, Ingo Molnar wrote:
>
> * Willy Tarreau <[email protected]> wrote:
>
> > On Thu, Jan 08, 2009 at 07:16:51AM +0000, Jarek Poplawski wrote:
> > > On 06-01-2009 19:15, Willy Tarreau wrote:
> > > ...
> > > > Ah, so you might also have discovered a few annoyances with the API, eg
> > > > the fact that splice() returns after the first read in non-blocking mode,
> > > > as well as the fact that it never returns zero on close, but -EAGAIN,
> > > > which requires an additional recv(MSG_PEEK) to distinguish between a
> > > > close and a lack of data. But I leave that for a later discussion, let's
> > > > address the corruption issue first.
> > >
> > > FYI, this should be just fixed:
> > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4f7d54f59bc470f0aaa932f747a95232d7ebf8b1
> > >
> >
> > Ah cool, thanks Jarek for notifying us. Indeed, it's the exact same patch
> > I had pending here ;-)
> >
> > I'll ping Greg for a backport into -stable, as applications relying on
> > this will clearly not work without that fix.
> >
> > The other one I had consists in removing "|| !timeo" at the end of the
> > loop, because otherwise splice() returns very small chunks (typically
> > 1448 or 1460 bytes), leading to disastrous performance on high bandwidth
> > links. At 10 Gbps, this means about 800000 calls to splice() per second!
>
> looks interesting - would you mind to submit it?

OK I will. I preferred to focus on the bug first, which is clearly
more annoying since it basically makes splice() unusuable in a proxy.
I wanted to avoid changes before the bug is fixed, but now, with the
first investigations, it seems that the fix will be in a different
area anyway so that doesn't matter.

Regards,
Willy

2009-01-12 12:03:40

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Wed, Jan 07, 2009 at 11:52:17PM +1100, Herbert Xu wrote:
> On Wed, Jan 07, 2009 at 12:49:46PM +0000, Jarek Poplawski wrote:
> >
> > Hmm... in any case: take 3
>
> Yes this should fix the corruption but it kind of defeats the
> purpose of splice by copying the data.

However, as we don't have a better fix yet, we probably should
take Jarek's patch for now since data corruption is bad.

This is a very hard problem, so in the end the only viable solution
might be to get the drivers to switch to using page frags, like
the Intel page split method.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-12 12:47:40

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Mon, Jan 12, 2009 at 11:02:57PM +1100, Herbert Xu ([email protected]) wrote:
> > > Hmm... in any case: take 3
> >
> > Yes this should fix the corruption but it kind of defeats the
> > purpose of splice by copying the data.
>
> However, as we don't have a better fix yet, we probably should
> take Jarek's patch for now since data corruption is bad.

Iirc it copies data from skb to the new pipe page unconditionally while
it is needed only for skb->sendpage path, although it is not possible to
know what is the other side of the pipe (or not?).

What about storing a callback and private pointer in the shared info for
the skb and clone them during usual clone, and invoke the callback at
shared info freeing time, which in turn will call spd->spd_release()?

Given that we only need to protect linear part, it should be simple
enough and we will not need to mess with the pskb_expand* calls.

> This is a very hard problem, so in the end the only viable solution
> might be to get the drivers to switch to using page frags, like
> the Intel page split method.

As a long-term solution this sounds as the best case, but introduces
quite heavy overhead for the allocators. Right now we allocate
1500+shared_info rounded up to the nearest power of the two (2k), but
then we will either need to have own network allocator (I have one :) or
allocate PAGE_SIZE+shared_info rounded up to the pwoer of the two (i.e.
8k), which is unfeasible.

--
Evgeniy Polyakov

2009-01-12 12:56:37

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Mon, Jan 12, 2009 at 03:45:46PM +0300, Evgeniy Polyakov wrote:
>
> As a long-term solution this sounds as the best case, but introduces
> quite heavy overhead for the allocators. Right now we allocate
> 1500+shared_info rounded up to the nearest power of the two (2k), but
> then we will either need to have own network allocator (I have one :) or
> allocate PAGE_SIZE+shared_info rounded up to the pwoer of the two (i.e.
> 8k), which is unfeasible.

No that's not what I was suggesting. The page split model allocates
an skb with a very small head that accomodates only the headers.
All payload is stored in the frags structure.

For 1500-byte packets, we can manage the payload area efficiently
by dividing each allocated page into 2K chunks. The page will
then be automatically freed once all the 2K chunks on it have been
freed through the page ref count mechanism.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-12 12:59:28

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Mon, Jan 12, 2009 at 11:56:07PM +1100, Herbert Xu ([email protected]) wrote:
> > As a long-term solution this sounds as the best case, but introduces
> > quite heavy overhead for the allocators. Right now we allocate
> > 1500+shared_info rounded up to the nearest power of the two (2k), but
> > then we will either need to have own network allocator (I have one :) or
> > allocate PAGE_SIZE+shared_info rounded up to the pwoer of the two (i.e.
> > 8k), which is unfeasible.
>
> No that's not what I was suggesting. The page split model allocates
> an skb with a very small head that accomodates only the headers.
> All payload is stored in the frags structure.
>
> For 1500-byte packets, we can manage the payload area efficiently
> by dividing each allocated page into 2K chunks. The page will
> then be automatically freed once all the 2K chunks on it have been
> freed through the page ref count mechanism.

That's the part I referred to as a network own allocator. We can have
multiple kmem_caches though for the popular MTUs and round up the
requested size otherwise.

--
Evgeniy Polyakov

2009-01-12 13:15:37

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Mon, Jan 12, 2009 at 11:02:57PM +1100, Herbert Xu wrote:
> On Wed, Jan 07, 2009 at 11:52:17PM +1100, Herbert Xu wrote:
> > On Wed, Jan 07, 2009 at 12:49:46PM +0000, Jarek Poplawski wrote:
> > >
> > > Hmm... in any case: take 3
> >
> > Yes this should fix the corruption but it kind of defeats the
> > purpose of splice by copying the data.
>
> However, as we don't have a better fix yet, we probably should
> take Jarek's patch for now since data corruption is bad.
>

I've wondered if something like this could work as a temporary hack?

!!! not compiled/tested !!!
---

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ce572f9..99b0876 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -264,6 +264,7 @@
#include <linux/cache.h>
#include <linux/err.h>
#include <linux/crypto.h>
+#include <linux/page-flags.h>

#include <net/icmp.h>
#include <net/tcp.h>
@@ -776,7 +777,8 @@ ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
struct sock *sk = sock->sk;

if (!(sk->sk_route_caps & NETIF_F_SG) ||
- !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
+ !(sk->sk_route_caps & NETIF_F_ALL_CSUM) ||
+ PageSlab(page))
return sock_no_sendpage(sock, page, offset, size, flags);

lock_sock(sk);

2009-01-12 21:11:42

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Mon, Jan 12, 2009 at 03:59:14PM +0300, Evgeniy Polyakov wrote:
>
> That's the part I referred to as a network own allocator. We can have
> multiple kmem_caches though for the popular MTUs and round up the
> requested size otherwise.

No we can't use kmem_caches because we're relying on using the
page refereounce count to free the page. It has to be naked
pages.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-12 21:13:40

by Herbert Xu

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Mon, Jan 12, 2009 at 01:15:17PM +0000, Jarek Poplawski wrote:
>
> I've wondered if something like this could work as a temporary hack?

No we should fix it in skb_splice_bits because the corruption
can affect other terminations as well if a delay occurs.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-19 07:32:57

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Mon, Jan 12, 2009 at 01:15:17PM +0000, Jarek Poplawski wrote:
...
> I've wondered if something like this could work as a temporary hack?
...
> @@ -776,7 +777,8 @@ ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
> struct sock *sk = sock->sk;
>
> if (!(sk->sk_route_caps & NETIF_F_SG) ||
> - !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
> + !(sk->sk_route_caps & NETIF_F_ALL_CSUM) ||
> + PageSlab(page))
> return sock_no_sendpage(sock, page, offset, size, flags);

Just for the record: this wouldn't work yet:-( It should be probably
something like "PageCompound(compound_head(page))" test instead.

Jarek P.

2009-01-19 09:03:58

by Lennert Buytenhek

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Tue, Jan 06, 2009 at 07:50:12PM +0100, Willy Tarreau wrote:

> > Thanks a lot for the test application, it will greatly help to resolve
> > this issue.
>
> I figured it was an absolute necessity. The original code in my proxy is in
> an experimental state and far too hard to debug for these purposes. It was
> enough to detect the problem, but I could run a lot more tests with this
> small test app ! An who know, maybe it will serve as an example for
> non-blocking splice ;-)

:-)

Just to throw some more (hacky) example code into the pool, below is
an echo server that I hacked up to test nonblocking splice(). (You'll
need sf.net/projects/libivykis to use it.) I also have a splice()
discard server and a patch to my intercept-connection-via-iptables-and-
forward-it-to-a-remote-SOCKS5-server-to-deal-with-crappy-VPNs app to
use splice() somewhere.

My main annoyances with splice(2) are/were:

1. -EAGAIN return on splice from socket/pipe to socket/pipe doesn't
directly tell you whether the source ran out of data or the
destination can't accept more data, which means you can't e.g. use
epoll in edge triggered mode without jumping through some minor
number of extra hoops. (For a pipe you can keep track of how many
bytes are in it by hand, but for a socket->pipe splice -EAGAIN return
you'll have to assume that the pipe is full if there are >0 bytes in
it.)

2. Because of (1), and because when splicing from a socket to a pipe
it returns after the first bit of data (you mentioned this as well),
you don't know at that point whether your pipe is full or not.

3. Always returns -EAGAIN even if there was a FIN or error on the
source socket. (Now fixed.)



#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <iv.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <signal.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

struct connection {
struct iv_fd sock;
int pfd[2];
int pipe_bytes;
int saw_fin;
};


static void conn_kill(struct connection *conn)
{
iv_unregister_fd(&conn->sock);
close(conn->sock.fd);
close(conn->pfd[0]);
close(conn->pfd[1]);
free(conn);
}

static void conn_pollin(void *_conn);
static void conn_pollout(void *_conn);
static void conn_pollerr(void *_conn);

static void conn_pollin(void *_conn)
{
struct connection *conn = (struct connection *)_conn;
int ret;

while (1) {
ret = splice(conn->sock.fd, NULL, conn->pfd[1], NULL,
1048576, SPLICE_F_NONBLOCK);
if (ret <= 0)
break;

if (conn->pipe_bytes == 0)
iv_fd_set_handler_out(&conn->sock, conn_pollout);
conn->pipe_bytes += ret;
}

if (ret == 0) {
conn->saw_fin = 1;
iv_fd_set_handler_in(&conn->sock, NULL);
if (conn->pipe_bytes == 0) {
shutdown(conn->sock.fd, SHUT_WR);
conn_kill(conn);
}
return;
}

if (errno == EAGAIN && conn->pipe_bytes > 0) {
int bytes_sock = 1;

ioctl(conn->sock.fd, FIONREAD, &bytes_sock);
if (bytes_sock > 0)
iv_fd_set_handler_in(&conn->sock, NULL);
} else if (errno != EAGAIN) {
conn_kill(conn);
}
}

static void conn_pollout(void *_conn)
{
struct connection *conn = (struct connection *)_conn;
int ret;

if (!conn->pipe_bytes) {
fprintf(stderr, "conn_pollout: no pipe bytes!\n");
abort();
}

do {
ret = splice(conn->pfd[0], NULL, conn->sock.fd, NULL,
conn->pipe_bytes, 0);
if (ret <= 0)
break;

conn->pipe_bytes -= ret;
if (!conn->saw_fin)
iv_fd_set_handler_in(&conn->sock, conn_pollin);
} while (conn->pipe_bytes);

if (ret == 0) {
fprintf(stderr, "pollout: splice returned zero!\n");
abort();
}

if (errno == EAGAIN && conn->pipe_bytes == 0) {
iv_fd_set_handler_out(&conn->sock, NULL);
if (conn->saw_fin) {
shutdown(conn->sock.fd, SHUT_WR);
conn_kill(conn);
}
} else if (errno != EAGAIN) {
conn_kill(conn);
}
}

static void conn_pollerr(void *_conn)
{
struct connection *conn = (struct connection *)_conn;
socklen_t len;
int ret;

len = sizeof(ret);
if (getsockopt(conn->sock.fd, SOL_SOCKET, SO_ERROR, &ret, &len) < 0) {
fprintf(stderr, "pollerr: error %d while "
"getsockopt(SO_ERROR)\n", errno);
abort();
}

if (ret == 0) {
fprintf(stderr, "pollerr: no error?!\n");
abort();
}

conn_kill(conn);
}


static struct iv_fd listening_socket;

static void got_connection(void *_dummy)
{
struct sockaddr_in addr;
struct connection *conn;
socklen_t addrlen;
int ret;

addrlen = sizeof(addr);
ret = accept(listening_socket.fd, (struct sockaddr *)&addr, &addrlen);
if (ret < 0)
return;

conn = malloc(sizeof(*conn));
if (conn == NULL) {
fprintf(stderr, "memory squeeze\n");
abort();
}

if (pipe(conn->pfd) < 0) {
fprintf(stderr, "pipe squeeze\n");
abort();
}

INIT_IV_FD(&conn->sock);
conn->sock.fd = ret;
conn->sock.cookie = (void *)conn;
conn->sock.handler_in = conn_pollin;
conn->sock.handler_err = conn_pollerr;
iv_register_fd(&conn->sock);

conn->pipe_bytes = 0;
conn->saw_fin = 0;
}

static int open_listening_socket(void)
{
struct sockaddr_in addr;
int sock;

sock = socket(AF_INET, SOCK_STREAM, 0);
if (sock < 0) {
perror("socket");
return 1;
}

addr.sin_family = AF_INET;
addr.sin_addr.s_addr = htonl(INADDR_ANY);
addr.sin_port = htons(6969);
if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
perror("bind");
return 1;
}

listen(sock, 5);

INIT_IV_FD(&listening_socket);
listening_socket.fd = sock;
listening_socket.cookie = NULL;
listening_socket.handler_in = got_connection;
iv_register_fd(&listening_socket);

return 0;
}

int main()
{
iv_init();
open_listening_socket();
iv_main();

return 0;
}

2009-01-19 09:53:46

by Willy Tarreau

[permalink] [raw]
Subject: Re: Data corruption issue with splice() on 2.6.27.10

On Mon, Jan 19, 2009 at 09:39:21AM +0100, Lennert Buytenhek wrote:
> On Tue, Jan 06, 2009 at 07:50:12PM +0100, Willy Tarreau wrote:
>
> > > Thanks a lot for the test application, it will greatly help to resolve
> > > this issue.
> >
> > I figured it was an absolute necessity. The original code in my proxy is in
> > an experimental state and far too hard to debug for these purposes. It was
> > enough to detect the problem, but I could run a lot more tests with this
> > small test app ! An who know, maybe it will serve as an example for
> > non-blocking splice ;-)
>
> :-)
>
> Just to throw some more (hacky) example code into the pool, below is
> an echo server that I hacked up to test nonblocking splice(). (You'll
> need sf.net/projects/libivykis to use it.) I also have a splice()
> discard server and a patch to my intercept-connection-via-iptables-and-
> forward-it-to-a-remote-SOCKS5-server-to-deal-with-crappy-VPNs app to
> use splice() somewhere.
>
> My main annoyances with splice(2) are/were:
>
> 1. -EAGAIN return on splice from socket/pipe to socket/pipe doesn't
> directly tell you whether the source ran out of data or the
> destination can't accept more data, which means you can't e.g. use
> epoll in edge triggered mode without jumping through some minor
> number of extra hoops. (For a pipe you can keep track of how many
> bytes are in it by hand, but for a socket->pipe splice -EAGAIN return
> you'll have to assume that the pipe is full if there are >0 bytes in
> it.)

I proceeded the same way : if EAGAIN and data still in the pipe, then
stop polling.

> 2. Because of (1), and because when splicing from a socket to a pipe
> it returns after the first bit of data (you mentioned this as well),
> you don't know at that point whether your pipe is full or not.

In fact this is fixed now. tcp_splice_read() returns all available data,
which somewhat hides problem #1. I'm running with 23 kB in a push/pull
method all the time, so it remains optimal.

> 3. Always returns -EAGAIN even if there was a FIN or error on the
> source socket. (Now fixed.)

Yes I saw your fix, it was indeed very annoying because the only workaround
I found was to perform an recv(MSG_PEEK) on the socket after each EAGAIN
to check whether the connection was closed or not.

For these reasons, I'd really love to see the few recent fixes backported
to -stable ASAP. It will boost splice() adoption among products.

Regards,
Willy