2009-01-08 17:34:23

by Willy Tarreau

Subject: [PATCH] tcp: splice as many packets as possible at once

Jens,

here's the other patch I was talking about, for better behaviour of
non-blocking splice(). Ben Mansell also confirms similar improvements
in his tests, where non-blocking splice() initially showed half of
read()/write() performance. Ben, would you mind adding a Tested-By
line ?

Also, please note that this is unrelated to the corruption bug I reported
and does not fix it.

Regards,
Willy

From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Thu, 8 Jan 2009 17:10:13 +0100
Subject: [PATCH] tcp: splice as many packets as possible at once

Currently, in non-blocking mode, tcp_splice_read() returns after
splicing one segment regardless of the len argument. This results
in low performance and very high overhead due to syscall rate when
splicing from interfaces which do not support LRO.

The fix simply consists in not breaking out of the loop after the
first read. That way, we can read up to the size requested by the
caller and still return when there is no data left.

Performance has significantly improved with this fix, with the
number of calls to splice() divided by about 20, and CPU usage
dropped from 100% to 75%.

Signed-off-by: Willy Tarreau <[email protected]>
---
net/ipv4/tcp.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 35bcddf..80261b4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -615,7 +615,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 		lock_sock(sk);
 
 		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
-		    (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
+		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
 		    signal_pending(current))
 			break;
 	}
--
1.6.0.3


2009-01-08 19:46:00

by Jens Axboe

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Thu, Jan 08 2009, Willy Tarreau wrote:
> Jens,
>
> here's the other patch I was talking about, for better behaviour of
> non-blocking splice(). Ben Mansell also confirms similar improvements
> in his tests, where non-blocking splice() initially showed half of
> read()/write() performance. Ben, would you mind adding a Tested-By
> line ?

Looks very good to me, I don't see any valid reason to have that !timeo
break. So feel free to add my acked-by to that patch and send it to
Dave.

>
> Also, please note that this is unrelated to the corruption bug I reported
> and does not fix it.
>
> Regards,
> Willy
>
> From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <[email protected]>
> Date: Thu, 8 Jan 2009 17:10:13 +0100
> Subject: [PATCH] tcp: splice as many packets as possible at once
>
> Currently, in non-blocking mode, tcp_splice_read() returns after
> splicing one segment regardless of the len argument. This results
> in low performance and very high overhead due to syscall rate when
> splicing from interfaces which do not support LRO.
>
> The fix simply consists in not breaking out of the loop after the
> first read. That way, we can read up to the size requested by the
> caller and still return when there is no data left.
>
> Performance has significantly improved with this fix, with the
> number of calls to splice() divided by about 20, and CPU usage
> dropped from 100% to 75%.
>
> Signed-off-by: Willy Tarreau <[email protected]>
> ---
> net/ipv4/tcp.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 35bcddf..80261b4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -615,7 +615,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
> lock_sock(sk);
>
> if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
> - (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
> + (sk->sk_shutdown & RCV_SHUTDOWN) ||
> signal_pending(current))
> break;
> }
> --
> 1.6.0.3
>

--
Jens Axboe

2009-01-08 21:51:22

by Ben Mansell

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

> From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <[email protected]>
> Date: Thu, 8 Jan 2009 17:10:13 +0100
> Subject: [PATCH] tcp: splice as many packets as possible at once
>
> Currently, in non-blocking mode, tcp_splice_read() returns after
> splicing one segment regardless of the len argument. This results
> in low performance and very high overhead due to syscall rate when
> splicing from interfaces which do not support LRO.
>
> The fix simply consists in not breaking out of the loop after the
> first read. That way, we can read up to the size requested by the
> caller and still return when there is no data left.
>
> Performance has significantly improved with this fix, with the
> number of calls to splice() divided by about 20, and CPU usage
> dropped from 100% to 75%.
>

I get similar results with my testing here. Benchmarking an application
with this patch shows that more than one packet is being splice()d in at
once; as a result, I see a doubling in throughput.

Tested-by: Ben Mansell <[email protected]>

> Signed-off-by: Willy Tarreau <[email protected]>
> ---
> net/ipv4/tcp.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 35bcddf..80261b4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -615,7 +615,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
> lock_sock(sk);
>
> if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
> - (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
> + (sk->sk_shutdown & RCV_SHUTDOWN) ||
> signal_pending(current))
> break;
> }

2009-01-08 21:55:28

by David Miller

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Ben Mansell <[email protected]>
Date: Thu, 08 Jan 2009 21:50:44 +0000

> > From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
> > From: Willy Tarreau <[email protected]>
> > Date: Thu, 8 Jan 2009 17:10:13 +0100
> > Subject: [PATCH] tcp: splice as many packets as possible at once
> > Currently, in non-blocking mode, tcp_splice_read() returns after
> > splicing one segment regardless of the len argument. This results
> > in low performance and very high overhead due to syscall rate when
> > splicing from interfaces which do not support LRO.
> > The fix simply consists in not breaking out of the loop after the
> > first read. That way, we can read up to the size requested by the
> > caller and still return when there is no data left.
> > Performance has significantly improved with this fix, with the
> > number of calls to splice() divided by about 20, and CPU usage
> > dropped from 100% to 75%.
> >
>
> I get similar results with my testing here. Benchmarking an application with this patch shows that more than one packet is being splice()d in at once, as a result I see a doubling in throughput.
>
> Tested-by: Ben Mansell <[email protected]>

I'm not applying this until someone explains to me why
we should remove this test from the splice receive but
keep it in the tcp_recvmsg() code where it has been
essentially forever.

2009-01-08 22:04:18

by Willy Tarreau

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Thu, Jan 08, 2009 at 08:44:24PM +0100, Jens Axboe wrote:
> On Thu, Jan 08 2009, Willy Tarreau wrote:
> > Jens,
> >
> > here's the other patch I was talking about, for better behaviour of
> > non-blocking splice(). Ben Mansell also confirms similar improvements
> > in his tests, where non-blocking splice() initially showed half of
> > read()/write() performance. Ben, would you mind adding a Tested-By
> > line ?
>
> Looks very good to me, I don't see any valid reason to have that !timeo
> break. So feel free to add my acked-by to that patch and send it to
> Dave.

Done, thanks.

Also, I tried to follow the code path from the userspace splice() call
down into the kernel. While splicing from a socket to a pipe seems
obvious up to tcp_splice_read(), I have a hard time finding the correct
path from the pipe to the socket.

It *seems* to me that we proceed like this, I'd like someone to confirm:

sys_splice()
  do_splice()
    do_splice_from()
      generic_splice_sendpage()          [ I assumed this from socket_file_ops ]
        splice_from_pipe(,,pipe_to_sendpage)
          __splice_from_pipe(,,pipe_to_sendpage)
            pipe_to_sendpage()
              tcp_sendpage()             [ I assumed this from inet_stream_ops ]
                do_tcp_sendpages() (if NETIF_F_SG) or sock_no_sendpage()

I hope I guessed it right, because it does not appear obvious to me. It
will help me try to better understand the corruption problem.

Now for the read part, am I wrong in thinking that tcp_splice_read()
will return -EAGAIN if the pipe is full? It seems to me that this will
be brought up by splice_to_pipe(), via __tcp_splice_read, tcp_read_sock,
tcp_splice_data_recv and skb_splice_bits.

If so, this can cause serious trouble, because if there is data available
on a socket, poll() will return POLLIN. Then we call splice(sock, pipe),
but the pipe is full, so we get -EAGAIN, which in turn makes the
application think that we need to poll again, and since no data has
moved, we'll immediately call splice() again. In my early experiments,
I'm pretty sure I already came across such a situation. Shouldn't we
return -EBUSY or -ENOSPC in this case? I can work on patches if there is
a consensus around those issues, and this will also help me better
understand the splice internals. I just don't want to work in the wrong
direction for nothing.
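
To make the scenario concrete, here is a minimal sketch of such a
non-blocking poll()/splice() forwarding loop (a hypothetical illustration,
not code from this thread; descriptors, sizes and error handling are
simplified):

#define _GNU_SOURCE
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical forwarding loop: socket 'in' -> pipe -> socket 'out'.
 * Both sockets are assumed O_NONBLOCK; pfd[] comes from pipe().
 */
static int forward(int in, int out, int pfd[2])
{
	struct pollfd p = { .fd = in, .events = POLLIN };
	ssize_t n, m;

	for (;;) {
		if (poll(&p, 1, -1) <= 0)
			return -1;

		/* If the pipe is already full, this fails with EAGAIN even
		 * though the socket still has data queued, so the next
		 * poll() reports POLLIN again immediately: the busy loop
		 * described above.
		 */
		n = splice(in, NULL, pfd[1], NULL, 65536,
			   SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
		if (n == 0)
			return 0;	/* connection closed */
		if (n < 0)
			continue;	/* EAGAIN: no data... or a full pipe */

		/* Drain the pipe into the output socket. */
		while (n > 0) {
			m = splice(pfd[0], NULL, out, NULL, n,
				   SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
			if (m <= 0)
				break;	/* real code must handle EAGAIN here */
			n -= m;
		}
	}
}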

Thanks,
Willy

2009-01-08 22:21:24

by Willy Tarreau

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Thu, Jan 08, 2009 at 01:55:15PM -0800, David Miller wrote:
> From: Ben Mansell <[email protected]>
> Date: Thu, 08 Jan 2009 21:50:44 +0000
>
> > > From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
> > > From: Willy Tarreau <[email protected]>
> > > Date: Thu, 8 Jan 2009 17:10:13 +0100
> > > Subject: [PATCH] tcp: splice as many packets as possible at once
> > > Currently, in non-blocking mode, tcp_splice_read() returns after
> > > splicing one segment regardless of the len argument. This results
> > > in low performance and very high overhead due to syscall rate when
> > > splicing from interfaces which do not support LRO.
> > > The fix simply consists in not breaking out of the loop after the
> > > first read. That way, we can read up to the size requested by the
> > > caller and still return when there is no data left.
> > > Performance has significantly improved with this fix, with the
> > > number of calls to splice() divided by about 20, and CPU usage
> > > dropped from 100% to 75%.
> > >
> >
> > I get similar results with my testing here. Benchmarking an application with this patch shows that more than one packet is being splice()d in at once, as a result I see a doubling in throughput.
> >
> > Tested-by: Ben Mansell <[email protected]>
>
> I'm not applying this until someone explains to me why
> we should remove this test from the splice receive but
> keep it in the tcp_recvmsg() code where it has been
> essentially forever.

In my opinion, the code structure is different between the two
functions. In tcp_recvmsg(), we test for it under the if (copied > 0)
condition, where copied is the sum of all data processed since entering
the function. If we removed the test there, we could no longer break out
of the loop once we have copied something. In tcp_splice_read(), the
test is still present in the (!ret) code path, where ret is the number
of bytes processed by the last read, so the test is still performed
regardless of what has been previously transferred.

So in summary, in tcp_splice_read() without this test, we get back to
the top of the loop, and if __tcp_splice_read() returns 0, then we break
out of the loop.

I don't know if my explanation is clear or not; it's easier to follow
the loops with the code in front of you :-/
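
For readers without the source at hand, here is a schematic of the two
loop shapes being compared (an editorial sketch, heavily simplified, not
the actual kernel code):

/* tcp_recvmsg(), roughly: the break conditions are only evaluated once
 * something has been copied, so !timeo is what stops the loop after a
 * partial read in non-blocking mode.
 */
do {
	if (copied) {
		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
		    !timeo || signal_pending(current))
			break;
	}
	/* ... copy data from one skb, update 'copied' ... */
} while (len > 0);

/* tcp_splice_read() with the patch, roughly: even without !timeo in the
 * bottom condition, the loop still ends as soon as __tcp_splice_read()
 * returns 0, and the (!ret) path keeps its own checks, !timeo included.
 */
while (tss.len) {
	ret = __tcp_splice_read(sk, &tss);
	if (ret < 0)
		break;
	else if (!ret)
		break;	/* no data left (the real code re-checks the
			 * socket state and !timeo here) */
	tss.len -= ret;
	spliced += ret;
	/* ... re-check socket state, then loop ... */
}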

Willy

2009-01-09 06:48:22

by Eric Dumazet

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

David Miller wrote:
> From: Ben Mansell <[email protected]>
> Date: Thu, 08 Jan 2009 21:50:44 +0000
>
>>> From fafe76713523c8e9767805cfdc7b73323d7bf180 Mon Sep 17 00:00:00 2001
>>> From: Willy Tarreau <[email protected]>
>>> Date: Thu, 8 Jan 2009 17:10:13 +0100
>>> Subject: [PATCH] tcp: splice as many packets as possible at once
>>> Currently, in non-blocking mode, tcp_splice_read() returns after
>>> splicing one segment regardless of the len argument. This results
>>> in low performance and very high overhead due to syscall rate when
>>> splicing from interfaces which do not support LRO.
>>> The fix simply consists in not breaking out of the loop after the
>>> first read. That way, we can read up to the size requested by the
>>> caller and still return when there is no data left.
>>> Performance has significantly improved with this fix, with the
>>> number of calls to splice() divided by about 20, and CPU usage
>>> dropped from 100% to 75%.
>>>
>> I get similar results with my testing here. Benchmarking an application with this patch shows that more than one packet is being splice()d in at once, as a result I see a doubling in throughput.
>>
>> Tested-by: Ben Mansell <[email protected]>
>
> I'm not applying this until someone explains to me why
> we should remove this test from the splice receive but
> keep it in the tcp_recvmsg() code where it has been
> essentially forever.

I found this patch useful in my tests, but had a feeling something was
not complete. If the goal is to reduce the number of splice() calls, we
should also reduce the number of wakeups. If splice() is used in
non-blocking mode, there is nothing we can do here of course, since the
application will use a poll()/select()/epoll() event before calling
splice(). A good setting of SO_RCVLOWAT to (16*PAGE_SIZE)/2 might
improve things.
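
For reference, that setting boils down to the following sketch (a
hypothetical helper; the 32768 value assumes 4 kB pages):

#include <sys/socket.h>

/* Ask not to be marked readable before ~32 kB ((16 * PAGE_SIZE) / 2
 * with 4 kB pages) are queued on the socket. As the strace below shows,
 * this did not change the wakeup pattern of splice() at the time.
 */
static int set_rcvlowat(int sockfd)
{
	int lowat = 32768;

	return setsockopt(sockfd, SOL_SOCKET, SO_RCVLOWAT,
			  &lowat, sizeof(lowat));
}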

I tested this on the current tree and it is not working: we still get
one wakeup for each frame (the ethernet link is a 100 Mb/s one):

bind(6, {sa_family=AF_INET, sin_port=htons(4711), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(6, 5) = 0
accept(6, 0, NULL) = 7
setsockopt(7, SOL_SOCKET, SO_RCVLOWAT, [32768], 4) = 0
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1024
splice(0x3, 0, 0x5, 0, 0x400, 0x5) = 1024
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460


About tcp_recvmsg(), we might remove the "!timeo" test as well; more
testing is needed. But keep in mind that if an application provides a
large buffer to the tcp_recvmsg() call, removing the test will reduce
the number of syscalls but might use more D-cache. It could reduce
performance on old CPUs. With the splice() call, we expect not to copy
memory and trash the D-cache, and with pipe buffers being limited to 16,
we cope with a limited working set.


2009-01-09 07:04:55

by Willy Tarreau

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Fri, Jan 09, 2009 at 07:47:16AM +0100, Eric Dumazet wrote:
> > I'm not applying this until someone explains to me why
> > we should remove this test from the splice receive but
> > keep it in the tcp_recvmsg() code where it has been
> > essentially forever.
>
> I found this patch usefull in my testings, but had a feeling something
> was not complete. If the goal is to reduce number of splice() calls,
> we also should reduce number of wakeups. If splice() is used in non
> blocking mode, nothing we can do here of course, since the application
> will use a poll()/select()/epoll() event before calling splice(). A
> good setting of SO_RCVLOWAT to (16*PAGE_SIZE)/2 might improve things.
>
> I tested this on current tree and it is not working : we still have
> one wakeup for each frame (ethernet link is a 100 Mb/s one)

Well, it simply means that data is not coming in fast enough compared
to the tiny amount of work you have to perform to forward it; there's
nothing wrong with that. It is important in my opinion not to wait for
*enough* data to come in, otherwise it might become impossible to
forward small chunks. I mean, if there are only 300 bytes left to
forward, we must not wait indefinitely for more data to come, we must
forward those 300 bytes.

In your case below, it simply means that the performance improvement
brought by splice will be really minor, because you'll just avoid 2
memory copies, which are ridiculously cheap at 100 Mbps. If you changed
your program to use recv/send, you would observe the exact same pattern,
because as soon as poll() wakes you up, you still only have one frame in
the system buffers. On a small machine I have here (a 500 MHz Geode), I
easily have multiple frames queued at 100 Mbps, because when epoll()
wakes me up, I have traffic on something like 10-100 sockets, and by the
time I process the first ones, the later ones have had time to queue up
more data.

> bind(6, {sa_family=AF_INET, sin_port=htons(4711), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> listen(6, 5) = 0
> accept(6, 0, NULL) = 7
> setsockopt(7, SOL_SOCKET, SO_RCVLOWAT, [32768], 4) = 0
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1024
> splice(0x3, 0, 0x5, 0, 0x400, 0x5) = 1024
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460
> splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460
> splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460
> splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460
> splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460
> poll([{fd=7, events=POLLIN, revents=POLLIN|POLLERR|POLLHUP}], 1, -1) = 1
> splice(0x7, 0, 0x4, 0, 0x10000, 0x3) = 1460
> splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460

How much CPU is used for that ? Probably something like 1-2% ? If you could
retry with 100-1000 concurrent sessions, you would observe more traffic on
each socket, which is where the gain of splice becomes really important.

> About tcp_recvmsg(), we might also remove the "!timeo" test as well,
> more testings are needed.

No, right now we can't (or we would at least have to move it somewhere
else), because once at least one byte has been received (copied != 0),
no other check will break out of the loop (or at least I have not found
one).

> But remind that if an application provides
> a large buffer to tcp_recvmsg() call, removing the test will reduce
> the number of syscalls but might use more DCACHE. It could reduce
> performance on old cpus. With splice() call, we expect to not
> copy memory and trash DCACHE, and pipe buffers being limited to 16,
> we cope with a limited working set.

That's an interesting point indeed !

Regards,
Willy

2009-01-09 07:29:10

by Eric Dumazet

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Willy Tarreau wrote:
> On Fri, Jan 09, 2009 at 07:47:16AM +0100, Eric Dumazet wrote:
>>> I'm not applying this until someone explains to me why
>>> we should remove this test from the splice receive but
>>> keep it in the tcp_recvmsg() code where it has been
>>> essentially forever.
>> I found this patch usefull in my testings, but had a feeling something
>> was not complete. If the goal is to reduce number of splice() calls,
>> we also should reduce number of wakeups. If splice() is used in non
>> blocking mode, nothing we can do here of course, since the application
>> will use a poll()/select()/epoll() event before calling splice(). A
>> good setting of SO_RCVLOWAT to (16*PAGE_SIZE)/2 might improve things.
>>
>> I tested this on current tree and it is not working : we still have
>> one wakeup for each frame (ethernet link is a 100 Mb/s one)
>
> Well, it simply means that data are not coming in fast enough compared to
> the tiny amount of work you have to perform to forward them, there's nothing
> wrong with that. It is important in my opinion not to wait for *enough* data
> to come in, otherwise it might become impossible to forward small chunks.
> I mean, if there are only 300 bytes left to forward, we must not wait
> indefinitely for more data to come, we must forward those 300 bytes.
>
> In your case below, it simply means that the performance improvement brought
> by splice will be really minor because you'll just avoid 2 memory copies,
> which are ridiculously cheap at 100 Mbps. If you would change your program
> to use recv/send, you would observe the exact same pattern, because as soon
> as poll() wakes you up, you still only have one frame in the system buffers.
> On a small machine I have here (geode 500 MHz), I easily have multiple
> frames queued at 100 Mbps because when epoll() wakes me up, I have traffic
> on something like 10-100 sockets, and by the time I process the first ones,
> the later have time to queue up more data.

My point is to use Gigabit or 10 Gb links and hundreds or thousands of flows :)

But if it doesn't work on a single flow, it won't work on many :)

I tried my test program with a Gb link and one flow, and got splice() calls
returning 23000 bytes on average, using a little too much CPU: if poll()
could wait a little bit longer, CPU could be available for other tasks.

If the application uses setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT, [32768], 4),
it would be good if the kernel were smart enough to reduce the number of
wakeups.

(The next blocking point is the fixed limit of 16 pages per pipe, but
that's another story)

>
>> About tcp_recvmsg(), we might also remove the "!timeo" test as well,
>> more testings are needed.
>
> No right now we can't (we must move it somewhere else at least). Because
> once at least one byte has been received (copied != 0), no other check
> will break out of the loop (or at least I have not found it).
>

Of course we can't remove the test completely, but we can change the logic
so that several skbs might be used/consumed per tcp_recvmsg() call, like
your patch did for splice().

Let's focus on functional changes, not on implementation details :)

2009-01-09 07:42:55

by Willy Tarreau

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Fri, Jan 09, 2009 at 08:28:09AM +0100, Eric Dumazet wrote:
> My point is to use Gigabit links or 10Gb links and hundred or thousand of flows :)
>
> But if it doesnt work on a single flow, it wont work on many :)

Yes it will, precisely because during the time you spend processing
flow #1, you're still receiving data for flow #2. I really invite you to
try. That's what I've been observing over years of writing userland
proxies.

> I tried my test program with a Gb link, one flow, and got splice() calls returns 23000 bytes
> in average, using a litle too much of CPU : If poll() could wait a litle bit more, CPU
> could be available for other tasks.

I also observe 23000 bytes on average on gigabit, which is very good
(only about 5000 calls per second). And the CPU usage is lower than with
recv/send. I'd like to be able to run some profiling, because I observed
very different performance patterns depending on the network cards used.
Generally, almost all the time is spent in softirqs.

It's easy to make poll() wait a little bit more: call it later and do
your work before calling it. Also, epoll_wait() lets you ask it to
return only a small number of FDs, which really improves data gathering.
I generally observe the best performance between 30-200 FDs per call,
even with 10000 concurrent connections. During the time I process the
first 200 FDs, data is accumulating in the others' buffers.
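
As an illustration of that batching, a sketch of an epoll loop capping
the number of descriptors returned per call (hypothetical code; the cap
of 200 is taken from the range mentioned above):

#include <sys/epoll.h>

#define MAX_EVENTS 200	/* within the 30-200 FDs per call mentioned above */

/* Process at most MAX_EVENTS ready sockets per wakeup; while they are
 * being handled, the other sockets keep receiving data, so the next
 * splice()/recv() on them finds bigger chunks queued.
 */
static void event_loop(int epfd, void (*handle_fd)(int fd))
{
	struct epoll_event ev[MAX_EVENTS];
	int i, n;

	for (;;) {
		n = epoll_wait(epfd, ev, MAX_EVENTS, -1);
		for (i = 0; i < n; i++)
			handle_fd(ev[i].data.fd);
	}
}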

> If the application uses setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT, [32768], 4), it
> would be good if kernel was smart enough and could reduce number of wakeups.

Yes, I agree about that. But my comment was about not making this
behaviour mandatory for splice(). Letting the application choose is
the way to go, of course.

> (Next blocking point is the fixed limit of 16 pages per pipe, but
> thats another story)

Yes, but it's not always easy to guess how much data you can feed into
the pipe. It seems that, depending on how the segments are gathered, you
can store between 16 segments and 64 kB. I have observed some cases in
blocking mode where I could not push more than a few kilobytes with a
small MSS, indicating to me that those segments were each on a distinct
page. I don't know precisely how that's handled internally.

> >> About tcp_recvmsg(), we might also remove the "!timeo" test as well,
> >> more testings are needed.
> >
> > No right now we can't (we must move it somewhere else at least). Because
> > once at least one byte has been received (copied != 0), no other check
> > will break out of the loop (or at least I have not found it).
> >
>
> Of course we cant remove the test totally, but change the logic so that several skb
> might be used/consumed per tcp_recvmsg() call, like your patch did for splice()

OK, I initially understood that you suggested we could simply remove
it like I did for splice.

> Lets focus on functional changes, not on implementation details :)

Agreed :-)

Willy

2009-01-09 15:43:52

by Eric Dumazet

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Eric Dumazet wrote:
> David Miller wrote:
>> I'm not applying this until someone explains to me why
>> we should remove this test from the splice receive but
>> keep it in the tcp_recvmsg() code where it has been
>> essentially forever.

Reading tcp_recvmsg() again, I found it is already able to eat several
skbs even in non-blocking mode.

setsockopt(5, SOL_SOCKET, SO_RCVLOWAT, [61440], 4) = 0
ioctl(5, FIONBIO, [1]) = 0
poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536


David, if you were referring to the code at line 1374 of net/ipv4/tcp.c,
I believe there is no issue with it. We really want to break from this
loop if !timeo.

Willy's patch makes splice() behave like tcp_recvmsg(), but we might call
tcp_cleanup_rbuf() several times, with copied=1460 (once for each frame
processed).

I wonder if the right fix should be done in tcp_read_sock() instead: that
is the one that should eat several skbs IMHO, if we want optimal ACK
generation.

We break out of its loop at line 1246:

		if (!desc->count) /* this test is always true */
			break;

(__tcp_splice_read() sets count to 0 right before calling tcp_read_sock())

So the code at line 1246 (tcp_read_sock()) seems wrong, or at least
pessimistic.


2009-01-09 17:58:51

by Eric Dumazet

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Eric Dumazet wrote:
> Eric Dumazet wrote:
>> David Miller wrote:
>>> I'm not applying this until someone explains to me why
>>> we should remove this test from the splice receive but
>>> keep it in the tcp_recvmsg() code where it has been
>>> essentially forever.
>
> Reading again tcp_recvmsg(), I found it already is able to eat several skb
> even in nonblocking mode.
>
> setsockopt(5, SOL_SOCKET, SO_RCVLOWAT, [61440], 4) = 0
> ioctl(5, FIONBIO, [1]) = 0
> poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
> recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
> write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
> poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
> recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
> write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
> poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
> recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
> write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
> poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, -1) = 1
> recv(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, MSG_DONTWAIT) = 65536
> write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
>
>
> David, if you referred to code at line 1374 of net/ipv4/tcp.c, I believe there is
> no issue with it. We really want to break from this loop if !timeo .
>
> Willy patch makes splice() behaving like tcp_recvmsg(), but we might call
> tcp_cleanup_rbuf() several times, with copied=1460 (for each frame processed)
>
> I wonder if the right fix should be done in tcp_read_sock() : this is the
> one who should eat several skbs IMHO, if we want optimal ACK generation.
>
> We break out of its loop at line 1246
>
> if (!desc->count) /* this test is always true */
> break;
>
> (__tcp_splice_read() set count to 0, right before calling tcp_read_sock())
>
> So code at line 1246 (tcp_read_sock()) seems wrong, or pessimistic at least.

I tested the following patch and got the expected results:

- fewer ACKs, and correct rcvbuf adjustments.
- I can fill the Gb link with one flow only, using less than
  10% of the CPU, instead of 40% without the patch.

Setting desc->count to 1 seems to be the current practice.
(See for example iscsi_sw_tcp_data_ready() in drivers/scsi/iscsi_tcp.c.)

******************************************************************
* Note : this patch is wrong, because splice() can now *
* return more bytes than asked for (if SPLICE_F_NONBLOCK asked) *
******************************************************************

[PATCH] tcp: splice as many packets as possible at once

As spotted by Willy Tarreau, current splice() from tcp socket to pipe is not
optimal. It processes at most one segment per call.
This results in low performance and very high overhead due to syscall rate
when splicing from interfaces which do not support LRO.

Willy provided a patch inside tcp_splice_read(), but a better fix
is to let tcp_read_sock() process as many segments as possible, so
that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called once
per syscall.

With this change, splice() behaves like tcp_recvmsg(), being able
to consume many skbs in one system call.

Signed-off-by: Eric Dumazet <[email protected]>
---
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bd6ff90..15bd67b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -533,6 +533,9 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
 		.arg.data = tss,
 	};
 
+	if (tss->flags & SPLICE_F_NONBLOCK)
+		rd_desc.count = 1; /* we want as many segments as possible */
+
 	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
 }

2009-01-09 18:55:33

by Willy Tarreau

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Hi Eric,

On Fri, Jan 09, 2009 at 04:42:44PM +0100, Eric Dumazet wrote:
(...)
> Willy patch makes splice() behaving like tcp_recvmsg(), but we might call
> tcp_cleanup_rbuf() several times, with copied=1460 (for each frame processed)
>
> I wonder if the right fix should be done in tcp_read_sock() : this is the
> one who should eat several skbs IMHO, if we want optimal ACK generation.
>
> We break out of its loop at line 1246
>
> if (!desc->count) /* this test is always true */
> break;
>
> (__tcp_splice_read() set count to 0, right before calling tcp_read_sock())
>
> So code at line 1246 (tcp_read_sock()) seems wrong, or pessimistic at least.

That's a very interesting discovery you made here. I have made
measurements with this line commented out, just to get an idea. The
hardest part was to find a CPU-bound machine. Finally I slowed my
laptop down to 300 MHz (in fact 600 MHz with 50% throttle, but let's
call that 300). That way, I cannot saturate the PCI-based tg3 and
I can observe the effects of various changes on the data rate.

- original tcp_splice_read(), with "!timeo" : 24.1 MB/s
- modified tcp_splice_read(), without "!timeo" : 32.5 MB/s (+34%)
- original with line #1246 commented out : 34.5 MB/s (+43%)

So you're right, avoiding calling tcp_read_sock() all the time
gives a nice performance boost.

Also, I found that tcp_splice_read() behaves like this when breaking
out of the loop :

lock_sock();
while () {
	...
	__tcp_splice_read();
	...
	release_sock();
	lock_sock();
	if (break condition)
		break;
}
release_sock();

Which means that when breaking out of the loop on (!timeo)
with ret > 0, we do release_sock/lock_sock/release_sock.

So I tried a minor modification, consisting of moving the test before
release_sock() and leaving !timeo in it, still with line #1246 commented
out. That's a noticeable winner, as the data rate went up to 35.7 MB/s
(+48%).
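
Schematically, the modification looks like this (a sketch in the style
of the pseudo-code above, not a literal patch):

lock_sock(sk);
while (tss.len) {
	ret = __tcp_splice_read(sk, &tss);
	...
	tss.len -= ret;
	spliced += ret;

	/* break test moved up: evaluated before dropping the socket
	 * lock, with !timeo still part of it */
	if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
	    (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
	    signal_pending(current))
		break;

	release_sock(sk);
	lock_sock(sk);
}
release_sock(sk);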

Also, in your second mail, you're saying that your change might return
more data than requested by the user. I can't see why; could you please
explain it to me, as I'm still quite ignorant in this area?

Thanks,
Willy

2009-01-09 20:52:26

by Eric Dumazet

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Willy Tarreau wrote:
> Hi Eric,
>
> On Fri, Jan 09, 2009 at 04:42:44PM +0100, Eric Dumazet wrote:
> (...)
>> Willy patch makes splice() behaving like tcp_recvmsg(), but we might call
>> tcp_cleanup_rbuf() several times, with copied=1460 (for each frame processed)
>>
>> I wonder if the right fix should be done in tcp_read_sock() : this is the
>> one who should eat several skbs IMHO, if we want optimal ACK generation.
>>
>> We break out of its loop at line 1246
>>
>> if (!desc->count) /* this test is always true */
>> break;
>>
>> (__tcp_splice_read() set count to 0, right before calling tcp_read_sock())
>>
>> So code at line 1246 (tcp_read_sock()) seems wrong, or pessimistic at least.
>
> That's a very interesting discovery that you made here. I have made
> mesurements with this line commented out just to get an idea. The
> hardest part was to find a CPU-bound machine. Finally I slowed my
> laptop down to 300 MHz (in fact, 600 with throttle 50%, but let's
> call that 300). That way, I cannot saturate the PCI-based tg3 and
> I can observe the effects of various changes on the data rate.
>
> - original tcp_splice_read(), with "!timeo" : 24.1 MB/s
> - modified tcp_splice_read(), without "!timeo" : 32.5 MB/s (+34%)
> - original with line #1246 commented out : 34.5 MB/s (+43%)
>
> So you're right, avoiding calling tcp_read_sock() all the time
> gives a nice performance boost.
>
> Also, I found that tcp_splice_read() behaves like this when breaking
> out of the loop :
>
> lock_sock();
> while () {
> ...
> __tcp_splice_read();
> ...
> release_sock();
> lock_sock();
> if (break condition)
> break;
> }
> release_sock();
>
> Which means that when breaking out of the loop on (!timeo)
> with ret > 0, we do release_sock/lock_sock/release_sock.
>
> So I tried a minor modification, consisting in moving the
> test before release_sock(), and leaving !timeo there with
> line #1246 commented out. That's a noticeable winner, as
> the data rate went up to 35.7 MB/s (+48%).
>
> Also, in your second mail, you're saying that your change
> might return more data than requested by the user. I can't
> find why, could you please explain to me, as I'm still quite
> ignorant in this area ?

Well, I just tested various user programs and indeed got this strange
result:

Here I call splice() with len=1000 (0x3e8), and you can see that it
returns 1460 on the second call.

I suspect a bug in the splice code that my patch just exposed.

pipe([3, 4]) = 0
open("/dev/null", O_RDWR|O_CREAT|O_TRUNC, 0644) = 5
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
bind(6, {sa_family=AF_INET, sin_port=htons(4711), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(6, 5) = 0
accept(6, 0, NULL) = 7
setsockopt(7, SOL_SOCKET, SO_RCVLOWAT, [1000], 4) = 0
poll([{fd=7, events=POLLIN, revents=POLLIN}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x3e8, 0x3) = 1000 OK
splice(0x3, 0, 0x5, 0, 0x3e8, 0x5) = 1000
poll([{fd=7, events=POLLIN, revents=POLLIN}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x3e8, 0x3) = 1460 Oh well... 1460 > 1000 !!!
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460
poll([{fd=7, events=POLLIN, revents=POLLIN}], 1, -1) = 1
splice(0x7, 0, 0x4, 0, 0x3e8, 0x3) = 1460
splice(0x3, 0, 0x5, 0, 0x5b4, 0x5) = 1460


2009-01-09 21:25:24

by Willy Tarreau

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
(...)
> > Also, in your second mail, you're saying that your change
> > might return more data than requested by the user. I can't
> > find why, could you please explain to me, as I'm still quite
> > ignorant in this area ?
>
> Well, I just tested various user programs and indeed got this
> strange result :
>
> Here I call splice() with len=1000 (0x3e8), and you can see
> it gives a result of 1460 at the second call.

huh, not nice indeed!

While looking at the code to see how this could be possible, I
came across this minor thing (unrelated IMHO) :

	if (__skb_splice_bits(skb, &offset, &tlen, &spd))
		goto done;
>>>>>>	else if (!tlen)			<<<<<<
		goto done;

	/*
	 * now see if we have a frag_list to map
	 */
	if (skb_shinfo(skb)->frag_list) {
		struct sk_buff *list = skb_shinfo(skb)->frag_list;

		for (; list && tlen; list = list->next) {
			if (__skb_splice_bits(list, &offset, &tlen, &spd))
				break;
		}
	}

done:

On the highlighted line above, we'd better remove the else and leave a
plain "if (!tlen)". Otherwise, when the first call to __skb_splice_bits()
zeroes tlen, we still enter the if and evaluate the for condition for
nothing. But let's leave that for later.

> I suspect a bug in splice code, that my patch just exposed.

I've checked skb_splice_bits() and below, and can't see how we can move
more than the requested len.

However, with your change, I don't clearly see how we break out of the
loop in tcp_read_sock(). Maybe we first read 1000, then loop again and
read the remaining data? I suspect that we should at least exit when
((struct tcp_splice_state *)desc->arg.data)->len == 0.

At least that's something easy to add, just before or after
!desc->count, as a test.

Regards,
Willy

2009-01-09 22:03:40

by Eric Dumazet

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Willy Tarreau wrote:
> On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
> (...)
>>> Also, in your second mail, you're saying that your change
>>> might return more data than requested by the user. I can't
>>> find why, could you please explain to me, as I'm still quite
>>> ignorant in this area ?
>> Well, I just tested various user programs and indeed got this
>> strange result :
>>
>> Here I call splice() with len=1000 (0x3e8), and you can see
>> it gives a result of 1460 at the second call.
>
> huh, not nice indeed!
>
> While looking at the code to see how this could be possible, I
> came across this minor thing (unrelated IMHO) :
>
> if (__skb_splice_bits(skb, &offset, &tlen, &spd))
> goto done;
>>>>>>> else if (!tlen) <<<<<<
> goto done;
>
> /*
> * now see if we have a frag_list to map
> */
> if (skb_shinfo(skb)->frag_list) {
> struct sk_buff *list = skb_shinfo(skb)->frag_list;
>
> for (; list && tlen; list = list->next) {
> if (__skb_splice_bits(list, &offset, &tlen, &spd))
> break;
> }
> }
>
> done:
>
> Above on the enlighted line, we'd better remove the else and leave a plain
> "if (!tlen)". Otherwise, when the first call to __skb_splice_bits() zeroes
> tlen, we still enter the if and evaluate the for condition for nothing. But
> let's leave that for later.
>
>> I suspect a bug in splice code, that my patch just exposed.
>
> I've checked in skb_splice_bits() and below and can't see how we can move
> more than the requested len.
>
> However, with your change, I don't clearly see how we break out of
> the loop in tcp_read_sock(). Maybe we first read 1000 then loop again
> and read remaining data ? I suspect that we should at least exit when
> ((struct tcp_splice_state *)desc->arg.data)->len = 0.
>
> At least that's something easy to add just before or after !desc->count
> for a test.
>

I believe the bug is in tcp_splice_data_recv()

I am going to test a new patch, but here is the thing I found:

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bd6ff90..fbbddf4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -523,7 +523,7 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
 {
 	struct tcp_splice_state *tss = rd_desc->arg.data;
 
-	return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
+	return skb_splice_bits(skb, offset, tss->pipe, len, tss->flags);
 }
 
 static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)

2009-01-09 22:08:23

by Willy Tarreau

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Fri, Jan 09, 2009 at 10:24:00PM +0100, Willy Tarreau wrote:
> On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
> (...)
> > > Also, in your second mail, you're saying that your change
> > > might return more data than requested by the user. I can't
> > > find why, could you please explain to me, as I'm still quite
> > > ignorant in this area ?
> >
> > Well, I just tested various user programs and indeed got this
> > strange result :
> >
> > Here I call splice() with len=1000 (0x3e8), and you can see
> > it gives a result of 1460 at the second call.

OK, finally I could reproduce it and found why we get this. It is in
fact expected.

The problem when we loop in tcp_read_sock() is that tss->len is not
decremented by the amount of bytes read; that is only done in
tcp_splice_read(), which is the outer function.

The solution I found was to do just like the other callers, which means
using desc->count to keep the remaining number of bytes we want to read.
In fact, tcp_read_sock() is designed to use that as a stop condition,
which explains why you first had to hide it.

Now with the attached patch as a replacement for my previous one, both
issues are solved:
- I splice 1000 bytes if I ask to do so
- I splice as much as possible if available (typically 23 kB).

My observed performance is still at the top of the earlier results, and
IMHO that way of counting bytes makes sense for an actor called from
tcp_read_sock().

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 35bcddf..51ff3aa 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
 				unsigned int offset, size_t len)
 {
 	struct tcp_splice_state *tss = rd_desc->arg.data;
+	int ret;
 
-	return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
+	ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags);
+	if (ret > 0)
+		rd_desc->count -= ret;
+	return ret;
 }
 
 static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
@@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
 	/* Store TCP splice context information in read_descriptor_t. */
 	read_descriptor_t rd_desc = {
 		.arg.data = tss,
+		.count = tss->len,
 	};
 
 	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);

Regards,
Willy

2009-01-09 22:13:33

by Willy Tarreau

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Fri, Jan 09, 2009 at 11:02:06PM +0100, Eric Dumazet wrote:
> Willy Tarreau wrote:
> > On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
> > (...)
> >>> Also, in your second mail, you're saying that your change
> >>> might return more data than requested by the user. I can't
> >>> find why, could you please explain to me, as I'm still quite
> >>> ignorant in this area ?
> >> Well, I just tested various user programs and indeed got this
> >> strange result :
> >>
> >> Here I call splice() with len=1000 (0x3e8), and you can see
> >> it gives a result of 1460 at the second call.
> >
> > huh, not nice indeed!
> >
> > While looking at the code to see how this could be possible, I
> > came across this minor thing (unrelated IMHO) :
> >
> > if (__skb_splice_bits(skb, &offset, &tlen, &spd))
> > goto done;
> >>>>>>> else if (!tlen) <<<<<<
> > goto done;
> >
> > /*
> > * now see if we have a frag_list to map
> > */
> > if (skb_shinfo(skb)->frag_list) {
> > struct sk_buff *list = skb_shinfo(skb)->frag_list;
> >
> > for (; list && tlen; list = list->next) {
> > if (__skb_splice_bits(list, &offset, &tlen, &spd))
> > break;
> > }
> > }
> >
> > done:
> >
> > Above on the enlighted line, we'd better remove the else and leave a plain
> > "if (!tlen)". Otherwise, when the first call to __skb_splice_bits() zeroes
> > tlen, we still enter the if and evaluate the for condition for nothing. But
> > let's leave that for later.
> >
> >> I suspect a bug in splice code, that my patch just exposed.
> >
> > I've checked in skb_splice_bits() and below and can't see how we can move
> > more than the requested len.
> >
> > However, with your change, I don't clearly see how we break out of
> > the loop in tcp_read_sock(). Maybe we first read 1000 then loop again
> > and read remaining data ? I suspect that we should at least exit when
> > ((struct tcp_splice_state *)desc->arg.data)->len = 0.
> >
> > At least that's something easy to add just before or after !desc->count
> > for a test.
> >
>
> I believe the bug is in tcp_splice_data_recv()

yes, see my other mail.

> I am going to test a new patch, but here is the thing I found:
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index bd6ff90..fbbddf4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -523,7 +523,7 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
> {
> struct tcp_splice_state *tss = rd_desc->arg.data;
>
> - return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
> + return skb_splice_bits(skb, offset, tss->pipe, len, tss->flags);
> }

it will not work, len is always 1000 in your case, for every call,
as I had in my logs :

kernel :

tcp_splice_data_recv: skb=ed3d3480 offset=0 len=1460 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_sendpage: going through do_tcp_sendpages
tcp_splice_data_recv: skb=ed3d3480 offset=1000 len=460 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_splice_data_recv: skb=ed3d3540 offset=0 len=1460 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_sendpage: going through do_tcp_sendpages
tcp_sendpage: going through do_tcp_sendpages
tcp_splice_data_recv: skb=ed3d3540 offset=1000 len=460 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_splice_data_recv: skb=ed3d3600 offset=0 len=1176 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_sendpage: going through do_tcp_sendpages
tcp_sendpage: going through do_tcp_sendpages
tcp_splice_data_recv: skb=ed3d3600 offset=1000 len=176 tss->pipe=ed1cbc00 tss->len=1000 tss->flags=3
tcp_sendpage: going through do_tcp_sendpages

program :

accept(3, 0, NULL) = 4
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
connect(5, {sa_family=AF_INET, sin_port=htons(4001), sin_addr=inet_addr("10.0.3.1")}, 16) = 0
fcntl64(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
fcntl64(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
pipe([6, 7]) = 0
select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1000
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x3e8, 0x3) = 1000
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1460
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x5b4, 0x3) = 1460
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1460
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x5b4, 0x3) = 1460
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 176
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0xb0, 0x3) = 176
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = -1 EAGAIN (Resource temporarily unavailable)
close(5) = 0
close(4) = 0
exit_group(0) = ?


Now with the fix :

accept(3, 0, NULL) = 4
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
connect(5, {sa_family=AF_INET, sin_port=htons(4001), sin_addr=inet_addr("10.0.3.1")}, 16) = 0
fcntl64(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
fcntl64(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
pipe([6, 7]) = 0
select(6, [5], [], NULL, NULL) = 1 (in [5])
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1000
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x3e8, 0x3) = 1000
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1000
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x3e8, 0x3) = 1000
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1000
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x3e8, 0x3) = 1000
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 1000
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x3e8, 0x3) = 1000
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = 96
select(6, [5], [4], NULL, NULL) = 2 (in [5], out [4])
splice(0x6, 0, 0x4, 0, 0x60, 0x3) = 96
splice(0x5, 0, 0x7, 0, 0x3e8, 0x3) = -1 EAGAIN (Resource temporarily unavailable)
close(5) = 0
close(4) = 0
exit_group(0) = ?


Regards,
Willy

2009-01-09 22:16:30

by Eric Dumazet

Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Willy Tarreau wrote:
> On Fri, Jan 09, 2009 at 10:24:00PM +0100, Willy Tarreau wrote:
>> On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
>> (...)
>>>> Also, in your second mail, you're saying that your change
>>>> might return more data than requested by the user. I can't
>>>> find why, could you please explain to me, as I'm still quite
>>>> ignorant in this area ?
>>> Well, I just tested various user programs and indeed got this
>>> strange result :
>>>
>>> Here I call splice() with len=1000 (0x3e8), and you can see
>>> it gives a result of 1460 at the second call.
>
> OK finally I could reproduce it and found why we have this. It's
> expected in fact.
>
> The problem when we loop in tcp_read_sock() is that tss->len is
> not decremented by the amount of bytes read, this one is done
> only in tcp_splice_read() which is outer.
>
> The solution I found was to do just like other callers, which means
> use desc->count to keep the remaining number of bytes we want to
> read. In fact, tcp_read_sock() is designed to use that one as a stop
> condition, which explains why you first had to hide it.
>
> Now with the attached patch as a replacement for my previous one,
> both issues are solved :
> - I splice 1000 bytes if I ask to do so
> - I splice as much as possible if available (typically 23 kB).
>
> My observed performances are still at the top of earlier results
> and IMHO that way of counting bytes makes sense for an actor called
> from tcp_read_sock().
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 35bcddf..51ff3aa 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
> unsigned int offset, size_t len)
> {
> struct tcp_splice_state *tss = rd_desc->arg.data;
> + int ret;
>
> - return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
> + ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags);
> + if (ret > 0)
> + rd_desc->count -= ret;
> + return ret;
> }
>
> static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
> @@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
> /* Store TCP splice context information in read_descriptor_t. */
> read_descriptor_t rd_desc = {
> .arg.data = tss,
> + .count = tss->len,
> };
>
> return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
>

OK, I came up with a different patch. Please check the other tcp_read_sock() callers in the tree :)

diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 23808df..96b49e1 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -100,13 +100,11 @@ static void iscsi_sw_tcp_data_ready(struct sock *sk, int flag)
 
 	/*
 	 * Use rd_desc to pass 'conn' to iscsi_tcp_recv.
-	 * We set count to 1 because we want the network layer to
-	 * hand us all the skbs that are available. iscsi_tcp_recv
-	 * handled pdus that cross buffers or pdus that still need data.
+	 * iscsi_tcp_recv handled pdus that cross buffers or pdus that
+	 * still need data.
 	 */
 	rd_desc.arg.data = conn;
-	rd_desc.count = 1;
-	tcp_read_sock(sk, &rd_desc, iscsi_sw_tcp_recv);
+	tcp_read_sock(sk, &rd_desc, iscsi_sw_tcp_recv, 65536);
 
 	read_unlock(&sk->sk_callback_lock);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 218235d..b1facd1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -490,7 +490,7 @@ extern void tcp_get_info(struct sock *, struct tcp_info *);
 typedef int (*sk_read_actor_t)(read_descriptor_t *, struct sk_buff *,
 			       unsigned int, size_t);
 extern int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
-			 sk_read_actor_t recv_actor);
+			 sk_read_actor_t recv_actor, size_t tlen);
 
 extern void tcp_initialize_rcv_mss(struct sock *sk);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bd6ff90..fbbddf4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -523,7 +523,7 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
 {
 	struct tcp_splice_state *tss = rd_desc->arg.data;
 
-	return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
+	return skb_splice_bits(skb, offset, tss->pipe, len, tss->flags);
 }
 
 static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
@@ -533,7 +533,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
 		.arg.data = tss,
 	};
 
-	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
+	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv, tss->len);
 }
 
 /**
@@ -611,11 +611,13 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 		tss.len -= ret;
 		spliced += ret;
 
+		if (!timeo)
+			break;
 		release_sock(sk);
 		lock_sock(sk);
 
 		if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
-		    (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
+		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
 		    signal_pending(current))
 			break;
 	}
@@ -1193,7 +1195,7 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
  * (although both would be easy to implement).
  */
 int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
-		  sk_read_actor_t recv_actor)
+		  sk_read_actor_t recv_actor, size_t tlen)
 {
 	struct sk_buff *skb;
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -1209,6 +1211,8 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 			size_t len;
 
 			len = skb->len - offset;
+			if (len > tlen)
+				len = tlen;
 			/* Stop reading if we hit a patch of urgent data */
 			if (tp->urg_data) {
 				u32 urg_offset = tp->urg_seq - seq;
@@ -1226,6 +1230,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				seq += used;
 				copied += used;
 				offset += used;
+				tlen -= used;
 			}
 			/*
 			 * If recv_actor drops the lock (e.g. TCP splice
@@ -1243,7 +1248,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 			break;
 		}
 		sk_eat_skb(sk, skb, 0);
-		if (!desc->count)
+		if (!tlen)
 			break;
 	}
 	tp->copied_seq = seq;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 5cbb404..75f8e83 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1109,8 +1109,7 @@ static void xs_tcp_data_ready(struct sock *sk, int bytes)
 	/* We use rd_desc to pass struct xprt to xs_tcp_data_recv */
 	rd_desc.arg.data = xprt;
 	do {
-		rd_desc.count = 65536;
-		read = tcp_read_sock(sk, &rd_desc, xs_tcp_data_recv);
+		read = tcp_read_sock(sk, &rd_desc, xs_tcp_data_recv, 65536);
 	} while (read > 0);
 out:
 	read_unlock(&sk->sk_callback_lock);

2009-01-09 22:18:31

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Fri, Jan 09, 2009 at 11:12:09PM +0100, Eric Dumazet wrote:
> Willy Tarreau wrote:
> > On Fri, Jan 09, 2009 at 10:24:00PM +0100, Willy Tarreau wrote:
> >> On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
> >> (...)
> >>>> Also, in your second mail, you're saying that your change
> >>>> might return more data than requested by the user. I can't
> >>>> find why, could you please explain to me, as I'm still quite
> >>>> ignorant in this area ?
> >>> Well, I just tested various user programs and indeed got this
> >>> strange result :
> >>>
> >>> Here I call splice() with len=1000 (0x3e8), and you can see
> >>> it gives a result of 1460 at the second call.
> >
> > OK finally I could reproduce it and found why we have this. It's
> > expected in fact.
> >
> > The problem when we loop in tcp_read_sock() is that tss->len is
> > not decremented by the amount of bytes read, this one is done
> > only in tcp_splice_read() which is outer.
> >
> > The solution I found was to do just like other callers, which means
> > use desc->count to keep the remaining number of bytes we want to
> > read. In fact, tcp_read_sock() is designed to use that one as a stop
> > condition, which explains why you first had to hide it.
> >
> > Now with the attached patch as a replacement for my previous one,
> > both issues are solved :
> > - I splice 1000 bytes if I ask to do so
> > - I splice as much as possible if available (typically 23 kB).
> >
> > My observed performances are still at the top of earlier results
> > and IMHO that way of counting bytes makes sense for an actor called
> > from tcp_read_sock().
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 35bcddf..51ff3aa 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
> > unsigned int offset, size_t len)
> > {
> > struct tcp_splice_state *tss = rd_desc->arg.data;
> > + int ret;
> >
> > - return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
> > + ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags);
> > + if (ret > 0)
> > + rd_desc->count -= ret;
> > + return ret;
> > }
> >
> > static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
> > @@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
> > /* Store TCP splice context information in read_descriptor_t. */
> > read_descriptor_t rd_desc = {
> > .arg.data = tss,
> > + .count = tss->len,
> > };
> >
> > return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
> >
>
> OK, I came to a different patch. Please check other tcp_read_sock() callers in tree :)

I've seen the other callers, but they all use desc->count for their own
purpose. That's how I understood what it was used for :-)

I think it's better not to change the API here and to use tcp_read_sock()
the way it's supposed to be used. Also, the fewer parameters to the
function, the better.

However I'm OK for the !timeo before release_sock/lock_sock. I just
don't know if we can put the rest of the if above or not. I don't
know what changes we're supposed to collect by doing release_sock/
lock_sock before the if().

Regards,
Willy

2009-01-09 22:43:19

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Fri, Jan 09, 2009 at 11:17:44PM +0100, Willy Tarreau ([email protected]) wrote:
> However I'm OK for the !timeo before release_sock/lock_sock. I just
> don't know if we can put the rest of the if above or not. I don't
> know what changes we're supposed to collect by doing release_sock/
> lock_sock before the if().

Not to interrupt the discussion, but for clarification: that
release_sock/lock_sock is used to process the backlog accumulated while
the socket was locked. And while dropping the additional pair before the
final release is OK, moving this itself should be thought through twice.
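
For reference, release_sock() does roughly the following (simplified
sketch, not the exact source): skbs queued on sk->sk_backlog from softirq
context while the process owned the socket are replayed through the
normal receive path before the lock is given up.

void release_sock(struct sock *sk)
{
	spin_lock_bh(&sk->sk_lock.slock);
	if (sk->sk_backlog.tail)
		__release_sock(sk);	/* replay backlogged skbs via sk_backlog_rcv() */
	sk->sk_lock.owned = 0;
	if (waitqueue_active(&sk->sk_lock.wq))
		wake_up(&sk->sk_lock.wq);
	spin_unlock_bh(&sk->sk_lock.slock);
}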

--
Evgeniy Polyakov

2009-01-09 22:46:17

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Willy Tarreau wrote:
> On Fri, Jan 09, 2009 at 11:12:09PM +0100, Eric Dumazet wrote:
>> Willy Tarreau wrote:
>>> On Fri, Jan 09, 2009 at 10:24:00PM +0100, Willy Tarreau wrote:
>>>> On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
>>>> (...)
>>>>>> Also, in your second mail, you're saying that your change
>>>>>> might return more data than requested by the user. I can't
>>>>>> find why, could you please explain to me, as I'm still quite
>>>>>> ignorant in this area ?
>>>>> Well, I just tested various user programs and indeed got this
>>>>> strange result :
>>>>>
>>>>> Here I call splice() with len=1000 (0x3e8), and you can see
>>>>> it gives a result of 1460 at the second call.
>>> OK finally I could reproduce it and found why we have this. It's
>>> expected in fact.
>>>
>>> The problem when we loop in tcp_read_sock() is that tss->len is
>>> not decremented by the amount of bytes read, this one is done
>>> only in tcp_splice_read() which is outer.
>>>
>>> The solution I found was to do just like other callers, which means
>>> use desc->count to keep the remaining number of bytes we want to
>>> read. In fact, tcp_read_sock() is designed to use that one as a stop
>>> condition, which explains why you first had to hide it.
>>>
>>> Now with the attached patch as a replacement for my previous one,
>>> both issues are solved :
>>> - I splice 1000 bytes if I ask to do so
>>> - I splice as much as possible if available (typically 23 kB).
>>>
>>> My observed performances are still at the top of earlier results
>>> and IMHO that way of counting bytes makes sense for an actor called
>>> from tcp_read_sock().
>>>
>>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>>> index 35bcddf..51ff3aa 100644
>>> --- a/net/ipv4/tcp.c
>>> +++ b/net/ipv4/tcp.c
>>> @@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
>>> unsigned int offset, size_t len)
>>> {
>>> struct tcp_splice_state *tss = rd_desc->arg.data;
>>> + int ret;
>>>
>>> - return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
>>> + ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags);
>>> + if (ret > 0)
>>> + rd_desc->count -= ret;
>>> + return ret;
>>> }
>>>
>>> static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
>>> @@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
>>> /* Store TCP splice context information in read_descriptor_t. */
>>> read_descriptor_t rd_desc = {
>>> .arg.data = tss,
>>> + .count = tss->len,
>>> };
>>>
>>> return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
>>>
>> OK, I came to a different patch. Please check other tcp_read_sock() callers in tree :)
>
> I've seen the other callers, but they all use desc->count for their own
> purpose. That's how I understood what it was used for :-)

Ah yes, I reread your patch and you are right.

>
> I think it's better not to change the API here and use tcp_read_sock()
> how it's supposed to be used. Also, the less parameters to the function,
> the better.
>
> However I'm OK for the !timeo before release_sock/lock_sock. I just
> don't know if we can put the rest of the if above or not. I don't
> know what changes we're supposed to collect by doing release_sock/
> lock_sock before the if().

Only the (!timeo) can be above. Other conditions must be checked after
the release/lock.


Thank you

2009-01-09 22:50:53

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Sat, Jan 10, 2009 at 01:42:58AM +0300, Evgeniy Polyakov wrote:
> On Fri, Jan 09, 2009 at 11:17:44PM +0100, Willy Tarreau ([email protected]) wrote:
> > However I'm OK for the !timeo before release_sock/lock_sock. I just
> > don't know if we can put the rest of the if above or not. I don't
> > know what changes we're supposed to collect by doing release_sock/
> > lock_sock before the if().
>
> Not to interrupt the discussion, but for the clarification, that
> release_sock/lock_sock is used to process the backlog accumulated while
> socket was locked. And while dropping additional pair before the final
> release is ok, but moving this itself should be thought of twice.

Nice, thanks Evgeniy. So it makes sense to move only the !timeo test
above since it's not dependent on the socket, and leave the rest of
the test where it currently is. That's what Eric has proposed in his
latest patch.

Well, I'm now trying to educate myself on the send part. It's still
not very clear to me and I'd like to understand a little bit better
why we have this corruption problem and why there is a difference
between sending segments from memory and sending them from another
socket where they were already waiting.

I think I'll put printks everywhere and see what I can observe.
Knowing about the GSO/SG workaround already helps me enable/disable
the bug.

Regards,
Willy

2009-01-09 22:53:55

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Fri, Jan 09, 2009 at 11:45:02PM +0100, Eric Dumazet wrote:
> Willy Tarreau wrote:
> > On Fri, Jan 09, 2009 at 11:12:09PM +0100, Eric Dumazet wrote:
> >> Willy Tarreau wrote:
> >>> On Fri, Jan 09, 2009 at 10:24:00PM +0100, Willy Tarreau wrote:
> >>>> On Fri, Jan 09, 2009 at 09:51:17PM +0100, Eric Dumazet wrote:
> >>>> (...)
> >>>>>> Also, in your second mail, you're saying that your change
> >>>>>> might return more data than requested by the user. I can't
> >>>>>> find why, could you please explain to me, as I'm still quite
> >>>>>> ignorant in this area ?
> >>>>> Well, I just tested various user programs and indeed got this
> >>>>> strange result :
> >>>>>
> >>>>> Here I call splice() with len=1000 (0x3e8), and you can see
> >>>>> it gives a result of 1460 at the second call.
> >>> OK finally I could reproduce it and found why we have this. It's
> >>> expected in fact.
> >>>
> >>> The problem when we loop in tcp_read_sock() is that tss->len is
> >>> not decremented by the amount of bytes read, this one is done
> >>> only in tcp_splice_read() which is outer.
> >>>
> >>> The solution I found was to do just like other callers, which means
> >>> use desc->count to keep the remaining number of bytes we want to
> >>> read. In fact, tcp_read_sock() is designed to use that one as a stop
> >>> condition, which explains why you first had to hide it.
> >>>
> >>> Now with the attached patch as a replacement for my previous one,
> >>> both issues are solved :
> >>> - I splice 1000 bytes if I ask to do so
> >>> - I splice as much as possible if available (typically 23 kB).
> >>>
> >>> My observed performances are still at the top of earlier results
> >>> and IMHO that way of counting bytes makes sense for an actor called
> >>> from tcp_read_sock().
> >>>
> >>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> >>> index 35bcddf..51ff3aa 100644
> >>> --- a/net/ipv4/tcp.c
> >>> +++ b/net/ipv4/tcp.c
> >>> @@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
> >>> unsigned int offset, size_t len)
> >>> {
> >>> struct tcp_splice_state *tss = rd_desc->arg.data;
> >>> + int ret;
> >>>
> >>> - return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
> >>> + ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags);
> >>> + if (ret > 0)
> >>> + rd_desc->count -= ret;
> >>> + return ret;
> >>> }
> >>>
> >>> static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
> >>> @@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
> >>> /* Store TCP splice context information in read_descriptor_t. */
> >>> read_descriptor_t rd_desc = {
> >>> .arg.data = tss,
> >>> + .count = tss->len,
> >>> };
> >>>
> >>> return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
> >>>
> >> OK, I came to a different patch. Please check other tcp_read_sock() callers in tree :)
> >
> > I've seen the other callers, but they all use desc->count for their own
> > purpose. That's how I understood what it was used for :-)
>
> Ah yes, I reread your patch and you are right.
>
> >
> > I think it's better not to change the API here and use tcp_read_sock()
> > how it's supposed to be used. Also, the less parameters to the function,
> > the better.
> >
> > However I'm OK for the !timeo before release_sock/lock_sock. I just
> > don't know if we can put the rest of the if above or not. I don't
> > know what changes we're supposed to collect by doing release_sock/
> > lock_sock before the if().
>
> Only the (!timeo) can be above. Other conditions must be checked after
> the release/lock.

Yes that's what Evgeniy explained too. I smelled something like this
but did not know.

Care to redo the whole patch, since you already have all code parts
at hand as well as some fragments of commit messages ? You can even
add my Tested-by if you want. Finally it was nice that Dave asked
for this explanation because it drove our nose to the fishy parts ;-)

Thanks,
Willy

2009-01-09 23:01:38

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Fri, Jan 09, 2009 at 11:50:10PM +0100, Willy Tarreau ([email protected]) wrote:
> Well, I'm now trying to educate myself on the send part. It's still
> not very clear to me and I'd like to understand a little bit better
> why we have this corruption problem and why there is a difference
> between sending segments from memory and sending them from another
> socket where they were already waiting.

printks are the best choice, since you will get exactly what you are
looking for instead of deciphering what the developer or the code told you.

> I think I'll put printks everywhere and see what I can observe.
> Knowing about the GSO/SG workaround already helps me enable/disable
> the bug.

I wish I could also be capable to disable the bugs... :)

--
Evgeniy Polyakov

2009-01-09 23:07:31

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Sat, Jan 10, 2009 at 02:01:24AM +0300, Evgeniy Polyakov wrote:
> On Fri, Jan 09, 2009 at 11:50:10PM +0100, Willy Tarreau ([email protected]) wrote:
> > Well, I'm now trying to educate myself on the send part. It's still
> > not very clear to me and I'd like to understand a little bit better
> > why we have this corruption problem and why there is a difference
> > between sending segments from memory and sending them from another
> > socket where they were already waiting.
>
> printks are the best choice, since you will get exactly what you are
> looking for instead of deciphering what developer or code told you.

I know, but as I said in an earlier mail, it was not even clear to
me which functions were called. I have made guesses, but that's quite
hard when you don't know the entry point. I think I'm not too far
from having discovered the call chain. Understanding it will be
another story, of course ;-)

> > I think I'll put printks everywhere and see what I can observe.
> > Knowing about the GSO/SG workaround already helps me enable/disable
> > the bug.
>
> I wish I could also be capable to disable the bugs... :)

Me too :-)
Here we have an opportunity to turn it on/off, let's take it !

Willy

2009-01-09 23:38:50

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Willy Tarreau wrote:
> On Fri, Jan 09, 2009 at 11:45:02PM +0100, Eric Dumazet wrote:
>> Only the (!timeo) can be above. Other conditions must be checked after
>> the release/lock.
>
> Yes that's what Evgeniy explained too. I smelled something like this
> but did not know.
>
> Care to redo the whole patch, since you already have all code parts
> at hand as well as some fragments of commit messages ? You can even
> add my Tested-by if you want. Finally it was nice that Dave asked
> for this explanation because it drove our nose to the fishy parts ;-)

Sure, here it is :


David, do you think we still must call __tcp_splice_read() only once
in tcp_splice_read() if SPLICE_F_NONBLOCK is set ?

With the following patch, a splice() call is limited to 16 frames, typically
16*1460 = 23360 bytes. Removing the test as Willy did in his patch
could return the exact length requested by the user (limited to 16 pages),
giving nice blocks if feeding a file on disk...

Thank you

From: Willy Tarreau <[email protected]>

[PATCH] tcp: splice as many packets as possible at once

As spotted by Willy Tarreau, current splice() from tcp socket to pipe is not
optimal. It processes at most one segment per call.
This results in low performance and very high overhead due to syscall rate
when splicing from interfaces which do not support LRO.

Willy provided a patch inside tcp_splice_read(), but a better fix
is to let tcp_read_sock() process as many segments as possible, so
that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called less
often.

With this change, splice() behaves like tcp_recvmsg(), being able
to consume many skbs in one system call. With typical 1460 bytes
of payload per frame, that means splice(SPLICE_F_NONBLOCK) can return
16*1460 = 23360 bytes.

Signed-off-by: Willy Tarreau <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
---
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bd6ff90..1233835 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -522,8 +522,12 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
unsigned int offset, size_t len)
{
struct tcp_splice_state *tss = rd_desc->arg.data;
+ int ret;

- return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
+ ret = skb_splice_bits(skb, offset, tss->pipe, rd_desc->count, tss->flags);
+ if (ret > 0)
+ rd_desc->count -= ret;
+ return ret;
}

static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
@@ -531,6 +535,7 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
/* Store TCP splice context information in read_descriptor_t. */
read_descriptor_t rd_desc = {
.arg.data = tss,
+ .count = tss->len,
};

return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
@@ -611,11 +616,13 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
tss.len -= ret;
spliced += ret;

+ if (!timeo)
+ break;
release_sock(sk);
lock_sock(sk);

if (sk->sk_err || sk->sk_state == TCP_CLOSE ||
- (sk->sk_shutdown & RCV_SHUTDOWN) || !timeo ||
+ (sk->sk_shutdown & RCV_SHUTDOWN) ||
signal_pending(current))
break;
}

2009-01-10 07:41:18

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Evgeniy Polyakov wrote:
> On Fri, Jan 09, 2009 at 11:17:44PM +0100, Willy Tarreau ([email protected]) wrote:
>> However I'm OK for the !timeo before release_sock/lock_sock. I just
>> don't know if we can put the rest of the if above or not. I don't
>> know what changes we're supposed to collect by doing release_sock/
>> lock_sock before the if().
>
> Not to interrupt the discussion, but for the clarification, that
> release_sock/lock_sock is used to process the backlog accumulated while
> socket was locked. And while dropping additional pair before the final
> release is ok, but moving this itself should be thought of twice.
>

Hum... I just caught the release_sock(sk)/lock_sock(sk) done in skb_splice_bits()

So :

1) the release_sock/lock_sock done in tcp_splice_read() is not necessary
to process the backlog. It's already done in skb_splice_bits()

2) If we loop in tcp_read_sock() calling skb_splice_bits() several times
then we should perform the following tests inside this loop ?

if (sk->sk_err || sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN) ||
signal_pending(current)) break;

And remove them from tcp_splice_read() ?


2009-01-11 12:58:22

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Hi Eric.

On Sat, Jan 10, 2009 at 08:40:05AM +0100, Eric Dumazet ([email protected]) wrote:
> > Not to interrupt the discussion, but for the clarification, that
> > release_sock/lock_sock is used to process the backlog accumulated while
> > socket was locked. And while dropping additional pair before the final
> > release is ok, but moving this itself should be thought of twice.
> >
>
> Hum... I just caught the release_sock(sk)/lock_sock(sk) done in skb_splice_bits()
>
> So :
>
> 1) the release_sock/lock_sock done in tcp_splice_read() is not necessary
> to process backlog. Its already done in skb_splice_bits()

Yes, in the tcp_splice_read() they are added to remove a deadlock.

> 2) If we loop in tcp_read_sock() calling skb_splice_bits() several times
> then we should perform the following tests inside this loop ?
>
> if (sk->sk_err || sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN) ||
> signal_pending(current)) break;
>
> And removie them from tcp_splice_read() ?

It could be done, but for what reason? To detect a disconnected socket early?
Is it worth the changes?

--
Evgeniy Polyakov

2009-01-11 13:16:13

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Evgeniy Polyakov wrote:
> Hi Eric.
>
> On Sat, Jan 10, 2009 at 08:40:05AM +0100, Eric Dumazet ([email protected]) wrote:
>>> Not to interrupt the discussion, but for the clarification, that
>>> release_sock/lock_sock is used to process the backlog accumulated while
>>> socket was locked. And while dropping additional pair before the final
>>> release is ok, but moving this itself should be thought of twice.
>>>
>> Hum... I just caught the release_sock(sk)/lock_sock(sk) done in skb_splice_bits()
>>
>> So :
>>
>> 1) the release_sock/lock_sock done in tcp_splice_read() is not necessary
>> to process backlog. Its already done in skb_splice_bits()
>
> Yes, in the tcp_splice_read() they are added to remove a deadlock.

Could you elaborate ? A deadlock only if !SPLICE_F_NONBLOCK ?

>
>> 2) If we loop in tcp_read_sock() calling skb_splice_bits() several times
>> then we should perform the following tests inside this loop ?
>>
>> if (sk->sk_err || sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN) ||
>> signal_pending(current)) break;
>>
>> And removie them from tcp_splice_read() ?
>
> It could be done, but for what reason? To detect disconnected socket early?
> Does it worth the changes?
>

I was thinking about the case where one thread is doing a splice() from a tcp
socket to a pipe, while another thread is splicing from this pipe to something else.

Once patched, tcp_read_sock() could loop a long time...

2009-01-11 13:37:58

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Sun, Jan 11, 2009 at 02:14:57PM +0100, Eric Dumazet ([email protected]) wrote:
> >> 1) the release_sock/lock_sock done in tcp_splice_read() is not necessary
> >> to process backlog. Its already done in skb_splice_bits()
> >
> > Yes, in the tcp_splice_read() they are added to remove a deadlock.
>
> Could you elaborate ? A deadlock only if !SPLICE_F_NONBLOCK ?

Sorry, I meant that we drop the lock in skb_splice_bits() to prevent the
deadlock, and tcp_splice_read() needs it to process the backlog.

I think that even with non-blocking splice this release_sock/lock_sock
is needed, since we are able to do a parallel job: receive new data
(scheduled by the early release_sock backlog processing) in bh and
process already received data via the splice codepath.
Maybe in non-blocking splice mode this is not an issue though, but for
the blocking mode this allows us to grab more skbs at once in skb_splice_bits.

> >
> >> 2) If we loop in tcp_read_sock() calling skb_splice_bits() several times
> >> then we should perform the following tests inside this loop ?
> >>
> >> if (sk->sk_err || sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN) ||
> >> signal_pending(current)) break;
> >>
> >> And removie them from tcp_splice_read() ?
> >
> > It could be done, but for what reason? To detect disconnected socket early?
> > Does it worth the changes?
>
> I was thinking about the case your thread is doing a splice() from tcp socket to
> a pipe, while another thread is doing the splice from this pipe to something else.
>
> Once patched, tcp_read_sock() could loop a long time...

Well, it may be a good idea... Can't say anything against it :)

--
Evgeniy Polyakov

2009-01-11 16:06:06

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Sun, Jan 11, 2009 at 05:00:37PM +0100, Eric Dumazet ([email protected]) wrote:
> >>>> 1) the release_sock/lock_sock done in tcp_splice_read() is not necessary
> >>>> to process backlog. Its already done in skb_splice_bits()
> >>> Yes, in the tcp_splice_read() they are added to remove a deadlock.
> >> Could you elaborate ? A deadlock only if !SPLICE_F_NONBLOCK ?
> >
> > Sorry, I meant that we drop lock in skb_splice_bits() to prevent the deadlock,
> > and tcp_splice_read() needs it to process the backlog.
>
> While we drop lock in skb_splice_bits() to prevent the deadlock, we
> also process backlog at this stage. No need to process backlog
> again in the higher level function.

Yes, but having it earlier allows us to receive a new skb while processing
the ones already received.

> > I think that even with non-blocking splice that release_sock/lock_sock
> > is needed, since we are able to do a parallel job: to receive new data
> > (scheduled by early release_sock backlog processing) in bh and to
> > process already received data via splice codepath.
> > Maybe in non-blocking splice mode this is not an issue though, but for
> > the blocking mode this allows to grab more skbs at once in skb_splice_bits.
>
> skb_splice_bits() operates on one skb, you lost me :)

Exactly, and to get that we release the socket earlier so that the data can
be acked, and while we copy it or do anything else, the next one can be
received.

--
Evgeniy Polyakov

2009-01-11 17:10:53

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Evgeniy Polyakov wrote:
> On Sun, Jan 11, 2009 at 02:14:57PM +0100, Eric Dumazet ([email protected]) wrote:
>>>> 1) the release_sock/lock_sock done in tcp_splice_read() is not necessary
>>>> to process backlog. Its already done in skb_splice_bits()
>>> Yes, in the tcp_splice_read() they are added to remove a deadlock.
>> Could you elaborate ? A deadlock only if !SPLICE_F_NONBLOCK ?
>
> Sorry, I meant that we drop lock in skb_splice_bits() to prevent the deadlock,
> and tcp_splice_read() needs it to process the backlog.

While we drop lock in skb_splice_bits() to prevent the deadlock, we
also process backlog at this stage. No need to process backlog
again in the higher level function.

>
> I think that even with non-blocking splice that release_sock/lock_sock
> is needed, since we are able to do a parallel job: to receive new data
> (scheduled by early release_sock backlog processing) in bh and to
> process already received data via splice codepath.
> Maybe in non-blocking splice mode this is not an issue though, but for
> the blocking mode this allows to grab more skbs at once in skb_splice_bits.

skb_splice_bits() operates on one skb, you lost me :)


2009-01-13 05:46:06

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Eric Dumazet <[email protected]>
Date: Sat, 10 Jan 2009 00:34:40 +0100

> David, do you think we still must call __tcp_splice_read() only once
> in tcp_splice_read() if SPLICE_F_NONBLOCK is set ?

Eric, I'll get to this thread as soon as I can, perhaps tomorrow. I
want to get all of the build fallout and bug fixes for 2.6.29-rcX
sorted before everyone heads off to LCA in the next week or so :-)

2009-01-13 23:09:05

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Willy Tarreau <[email protected]>
Date: Thu, 8 Jan 2009 23:20:39 +0100

> On Thu, Jan 08, 2009 at 01:55:15PM -0800, David Miller wrote:
> > I'm not applying this until someone explains to me why
> > we should remove this test from the splice receive but
> > keep it in the tcp_recvmsg() code where it has been
> > essentially forever.
>
> In my opinion, the code structure is different between both functions. In
> tcp_recvmsg(), we test for it if (copied > 0), where copied is the sum of
> all data which have been processed since the entry in the function. If we
> removed the test here, we could not break out of the loop once we have
> copied something. In tcp_splice_read(), the test is still present in the
> (!ret) code path, where ret is the last number of bytes processed, so the
> test is still performed regardless of what has been previously transferred.
>
> So in summary, in tcp_splice_read without this test, we get back to the
> top of the loop, and if __tcp_splice_read() returns 0, then we break out
> of the loop.

OK, I see what you're saying: the !timeo check is only
necessary when the receive queue has been exhausted.
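
Put as a simplified sketch (with the DONE/error/shutdown/signal handling
left out), the loop shape being discussed is:

	while (tss.len) {
		ret = __tcp_splice_read(sk, &tss);
		if (ret < 0)
			break;
		if (!ret) {
			/* receive queue exhausted */
			if (spliced)
				break;
			if (!timeo) {			/* non-blocking: give up now */
				ret = -EAGAIN;
				break;
			}
			sk_wait_data(sk, &timeo);	/* blocking: wait, then retry */
			continue;
		}
		tss.len -= ret;
		spliced += ret;
	}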

2009-01-13 23:26:50

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Eric Dumazet <[email protected]>
Date: Fri, 09 Jan 2009 07:47:16 +0100

> I found this patch useful in my testing, but had a feeling something
> was not complete. If the goal is to reduce number of splice() calls,
> we also should reduce number of wakeups. If splice() is used in non
> blocking mode, nothing we can do here of course, since the application
> will use a poll()/select()/epoll() event before calling splice(). A
> good setting of SO_RCVLOWAT to (16*PAGE_SIZE)/2 might improve things.

Splice read does not handle SO_RCVLOWAT like tcp_recvmsg() does.

We should probably add a:

target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);

and check 'target' against 'spliced' in the main loop of
tcp_splice_read().
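
One way to read that suggestion, as an untested sketch on top of the
existing loop ('target' and its check are the additions, the rest keeps
its current shape):

	int target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
	...
	while (tss.len) {
		ret = __tcp_splice_read(sk, &tss);
		if (ret < 0)
			break;
		if (!ret) {
			/* queue empty: instead of stopping as soon as
			 * anything was spliced, only stop once the
			 * low-water mark has been reached */
			if (spliced >= target)
				break;
			...			/* existing wait/error handling */
			continue;
		}
		tss.len -= ret;
		spliced += ret;
	}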

> About tcp_recvmsg(), we might remove the "!timeo" test as well;
> more testing is needed. But bear in mind that if an application provides
> a large buffer to tcp_recvmsg() call, removing the test will reduce
> the number of syscalls but might use more DCACHE. It could reduce
> performance on old cpus. With splice() call, we expect to not
> copy memory and trash DCACHE, and pipe buffers being limited to 16,
> we cope with a limited working set.

I sometimes have a suspicion we can remove this test too, but it's
not really that clear.

If an application is doing non-blocking reads and they care about
latency, they shouldn't be providing huge buffers. This much I
agree with, but...

If you look at where this check is placed in the recvmsg() case, it is
done after we have verified that there is no socket backlog.

if (copied >= target && !sk->sk_backlog.tail)
break;

if (copied) {
if (sk->sk_err ||
sk->sk_state == TCP_CLOSE ||
(sk->sk_shutdown & RCV_SHUTDOWN) ||
!timeo ||
signal_pending(current))
break;
} else {

So either:

1) We haven't met the target. And note that target is one unless
the user makes an explicit receive low-water setting.

2) Or there is no backlog.

When we get to the 'if (copied)' check.

You can view this "!timeo" check as meaning "non-blocking". When
we get to it, we are guaranteed that we haven't met the target
and we have no backlog. So it is absolutely appropriate to break
out of recvmsg() processing here if non-blocking.

There is a lot of logic and feature handling in tcp_splice_read() and
that's why the semantics of "!timeo" cases are not being handled
properly here.

2009-01-13 23:28:25

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Eric Dumazet <[email protected]>
Date: Fri, 09 Jan 2009 08:28:09 +0100

> If the application uses setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT,
> [32768], 4), it would be good if kernel was smart enough and could
> reduce number of wakeups.

Right, and as I pointed out in previous replies the problem is
that splice() receive in TCP doesn't check the low water mark
at all.

2009-01-13 23:31:30

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Eric Dumazet <[email protected]>
Date: Fri, 09 Jan 2009 16:42:44 +0100

> David, if you referred to code at line 1374 of net/ipv4/tcp.c, I
> believe there is no issue with it. We really want to break from this
> loop if !timeo .

Correct, I agree, and I gave some detailed analysis of this in
another response :-)

> Willy patch makes splice() behaving like tcp_recvmsg(), but we might call
> tcp_cleanup_rbuf() several times, with copied=1460 (for each frame processed)

"Like", sure, but not the same since splice() lacks the low-water
and backlog checks.

> I wonder if the right fix should be done in tcp_read_sock() : this is the
> one who should eat several skbs IMHO, if we want optimal ACK generation.
>
> We break out of its loop at line 1246
>
> if (!desc->count) /* this test is always true */
> break;
>
> (__tcp_splice_read() set count to 0, right before calling tcp_read_sock())
>
> So code at line 1246 (tcp_read_sock()) seems wrong, or pessimistic at least.

Yes, that's very odd.

2009-01-13 23:38:23

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

David Miller wrote:
> From: Eric Dumazet <[email protected]>
> Date: Fri, 09 Jan 2009 08:28:09 +0100
>
>> If the application uses setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT,
>> [32768], 4), it would be good if kernel was smart enough and could
>> reduce number of wakeups.
>
> Right, and as I pointed out in previous replies the problem is
> that splice() receive in TCP doesn't check the low water mark
> at all.
>
>

Yes, I understand, but if splice() is running, a wakeup already occurred,
and there is no need to check whether that wakeup was good or not... just
proceed and consume some skbs, since we were already woken up.

Then, an application might set up a high SO_RCVLOWAT but want to splice()
only a few bytes, so RCVLOWAT is only a hint for the thing that performs
the wakeup, not for the consumer.
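
From user space the pattern looks roughly like this (sketch only, error
handling dropped; 'sock' is assumed to be an established non-blocking TCP
socket and 'pipe_w' the write end of a pipe):

#define _GNU_SOURCE
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>

static ssize_t splice_a_little(int sock, int pipe_w)
{
	int lowat = 32768;
	struct pollfd pfd = { .fd = sock, .events = POLLIN };

	/* hint for the wakeup side: don't wake us before ~32KB are queued */
	setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat));
	poll(&pfd, 1, -1);

	/* ...yet the consumer may still ask for far less than that */
	return splice(sock, NULL, pipe_w, NULL, 1000, SPLICE_F_NONBLOCK);
}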

2009-01-14 00:05:22

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Eric Dumazet <[email protected]>
Date: Sat, 10 Jan 2009 00:34:40 +0100

> David, do you think we still must call __tcp_splice_read() only once
> in tcp_splice_read() if SPLICE_F_NONBLOCK is set ?

You seem to be working that out in another thread :-)

> [PATCH] tcp: splice as many packets as possible at once
>
> As spotted by Willy Tarreau, current splice() from tcp socket to pipe is not
> optimal. It processes at most one segment per call.
> This results in low performance and very high overhead due to syscall rate
> when splicing from interfaces which do not support LRO.
>
> Willy provided a patch inside tcp_splice_read(), but a better fix
> is to let tcp_read_sock() process as many segments as possible, so
> that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called less
> often.
>
> With this change, splice() behaves like tcp_recvmsg(), being able
> to consume many skbs in one system call. With typical 1460 bytes
> of payload per frame, that means splice(SPLICE_F_NONBLOCK) can return
> 16*1460 = 23360 bytes.
>
> Signed-off-by: Willy Tarreau <[email protected]>
> Signed-off-by: Eric Dumazet <[email protected]>

I've applied this, thanks!

2009-01-14 00:07:35

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Evgeniy Polyakov <[email protected]>
Date: Sun, 11 Jan 2009 19:05:48 +0300

> On Sun, Jan 11, 2009 at 05:00:37PM +0100, Eric Dumazet ([email protected]) wrote:
> > While we drop lock in skb_splice_bits() to prevent the deadlock, we
> > also process backlog at this stage. No need to process backlog
> > again in the higher level function.
>
> Yes, but having it earlier allows to receive new skb while processing
> already received.
>
> > > I think that even with non-blocking splice that release_sock/lock_sock
> > > is needed, since we are able to do a parallel job: to receive new data
> > > (scheduled by early release_sock backlog processing) in bh and to
> > > process already received data via splice codepath.
> > > Maybe in non-blocking splice mode this is not an issue though, but for
> > > the blocking mode this allows to grab more skbs at once in skb_splice_bits.
> >
> > skb_splice_bits() operates on one skb, you lost me :)
>
> Exactly, and to have it we earlier release a socket so that it could be
> acked and while we copy it or doing anything else, the next one would
> received.

I think the socket release in skb_splice_bits() (although necessary)
just muddies the waters, and whether the extra one done in
tcp_splice_read() helps at all is open to debate.

That skb_clone() done by skb_splice_bits() pisses me off too,
we really ought to fix that. And we also have that data corruption
bug to cure too.

2009-01-14 00:14:06

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Tue, Jan 13, 2009 at 04:07:21PM -0800, David Miller ([email protected]) wrote:
> > Exactly, and to have it we earlier release a socket so that it could be
> > acked and while we copy it or doing anything else, the next one would
> > received.
>
> I think the socket release in skb_splice_bits() (although necessary)
> just muddies the waters, and whether the extra one done in
> tcp_splice_read() helps at all is open to debate.

Well, yes, probably a simple performance test with and without will
clarify things.

> That skb_clone() done by skb_splice_bits() pisses me off too,
> we really ought to fix that. And we also have that data corruption
> bug to cure too.

Clone is needed since tcp expects to own the skb and frees it
unconditionally via __kfree_skb().

What is the best solution for the data corruption bug? To copy the data
all the time, or to implement our own allocator to be used by alloc_skb
and friends to allocate the head? I think it can be done transparently
for the drivers. I can volunteer for this :)
The first one is more appropriate for the current bugfix-only stage,
but it will result in the whole release being too slow.

--
Evgeniy Polyakov

2009-01-14 00:16:39

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Evgeniy Polyakov <[email protected]>
Date: Wed, 14 Jan 2009 03:13:45 +0300

> What is the best solution for the data corruption bug? To copy the data
> all the time or implement own allocator to be used in alloc_skb and
> friends to allocate the head? I think it can be done transparently for
> the drivers. I can volunteer for this :)
> The first one is more appropriate for the current bugfix-only stage,
> but this will result in the whole release being too slow.

I am looking into this right now.

I wish there were some way we could make this code grab and release a
reference to the SKB data area (I mean skb_shinfo(skb)->dataref) to
accomplish its goals.

2009-01-14 00:23:15

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Tue, Jan 13, 2009 at 04:16:25PM -0800, David Miller ([email protected]) wrote:
> I wish there were some way we could make this code grab and release a
> reference to the SKB data area (I mean skb_shinfo(skb)->dataref) to
> accomplish it's goals.

Ugh... Clone without cloning, but by increasing the dataref. Given
that splice only needs that skb to track the head of the original, this
may really work.

--
Evgeniy Polyakov

2009-01-14 00:37:23

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Evgeniy Polyakov <[email protected]>
Date: Wed, 14 Jan 2009 03:22:52 +0300

> On Tue, Jan 13, 2009 at 04:16:25PM -0800, David Miller ([email protected]) wrote:
> > I wish there were some way we could make this code grab and release a
> > reference to the SKB data area (I mean skb_shinfo(skb)->dataref) to
> > accomplish it's goals.
>
> Ugh... Clone without cloninig, but by increasing the dataref. Getting
> that splice only needs that skb to track the head of the original, this
> may really work.

Here is something I scrambled together; it is largely based upon
Jarek's patch:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..05126da 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -70,12 +70,17 @@
static struct kmem_cache *skbuff_head_cache __read_mostly;
static struct kmem_cache *skbuff_fclone_cache __read_mostly;

+static void skb_release_data(struct sk_buff *skb);
+
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
struct sk_buff *skb = (struct sk_buff *) buf->private;

- kfree_skb(skb);
+ if (skb)
+ skb_release_data(skb);
+ else
+ put_page(buf->page);
}

static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
@@ -83,7 +88,10 @@ static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
{
struct sk_buff *skb = (struct sk_buff *) buf->private;

- skb_get(skb);
+ if (skb)
+ atomic_inc(&skb_shinfo(skb)->dataref);
+ else
+ get_page(buf->page);
}

static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1336,7 +1344,10 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
{
struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;

- kfree_skb(skb);
+ if (skb)
+ skb_release_data(skb);
+ else
+ put_page(spd->pages[i]);
}

/*
@@ -1344,7 +1355,7 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
*/
static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
unsigned int len, unsigned int offset,
- struct sk_buff *skb)
+ struct sk_buff *skb, int linear)
{
if (unlikely(spd->nr_pages == PIPE_BUFFERS))
return 1;
@@ -1352,8 +1363,15 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
spd->pages[spd->nr_pages] = page;
spd->partial[spd->nr_pages].len = len;
spd->partial[spd->nr_pages].offset = offset;
- spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
+ spd->partial[spd->nr_pages].private =
+ (unsigned long) (linear ? skb : NULL);
spd->nr_pages++;
+
+ if (linear)
+ atomic_inc(&skb_shinfo(skb)->dataref);
+ else
+ get_page(page);
+
return 0;
}

@@ -1369,7 +1387,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
static inline int __splice_segment(struct page *page, unsigned int poff,
unsigned int plen, unsigned int *off,
unsigned int *len, struct sk_buff *skb,
- struct splice_pipe_desc *spd)
+ struct splice_pipe_desc *spd, int linear)
{
if (!*len)
return 1;
@@ -1392,7 +1410,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
/* the linear region may spread across several pages */
flen = min_t(unsigned int, flen, PAGE_SIZE - poff);

- if (spd_fill_page(spd, page, flen, poff, skb))
+ if (spd_fill_page(spd, page, flen, poff, skb, linear))
return 1;

__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1437,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
if (__splice_segment(virt_to_page(skb->data),
(unsigned long) skb->data & (PAGE_SIZE - 1),
skb_headlen(skb),
- offset, len, skb, spd))
+ offset, len, skb, spd, 1))
return 1;

/*
@@ -1429,7 +1447,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];

if (__splice_segment(f->page, f->page_offset, f->size,
- offset, len, skb, spd))
+ offset, len, skb, spd, 0))
return 1;
}

2009-01-14 00:52:08

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

David Miller <[email protected]> wrote:
>
> I wish there were some way we could make this code grab and release a
> reference to the SKB data area (I mean skb_shinfo(skb)->dataref) to
> accomplish it's goals.

We can probably do that for spliced data that end up going to
the networking stack again. However, as splice is generic the
data may be headed to a destination other than the network stack.

In that case to make dataref work we'd need some mechanism
that allows non-network entities to get/drop this ref count.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-14 01:24:42

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Herbert Xu <[email protected]>
Date: Wed, 14 Jan 2009 11:51:10 +1100

> We can probably do that for spliced data that end up going to
> the networking stack again. However, as splice is generic the
> data may be headed to a destination other than the network stack.
>
> In that case to make dataref work we'd need some mechanism
> that allows non-network entities to get/drop this ref count.

I see.

I'll think about this some more.

2009-01-14 03:52:22

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

David Miller <[email protected]> wrote:
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5110b35..05126da 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -70,12 +70,17 @@
> static struct kmem_cache *skbuff_head_cache __read_mostly;
> static struct kmem_cache *skbuff_fclone_cache __read_mostly;
>
> +static void skb_release_data(struct sk_buff *skb);
> +
> static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
> struct pipe_buffer *buf)
> {
> struct sk_buff *skb = (struct sk_buff *) buf->private;
>
> - kfree_skb(skb);
> + if (skb)
> + skb_release_data(skb);
> + else
> + put_page(buf->page);
> }

Unfortunately this won't work, not even for network destinations.

The reason is that this gets called as soon as the destination's
splice hook returns, for networking that means when sendpage returns.

So by that time we'll still be left with just a page reference
on a page where the slab memory may already have been freed.

To make this work we need to get the destination's splice hooks
to acquire this reference.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-14 04:25:44

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Herbert Xu <[email protected]>
Date: Wed, 14 Jan 2009 14:51:24 +1100

> Unfortunately this won't work, not even for network destinations.
>
> The reason is that this gets called as soon as the destination's
> splice hook returns, for networking that means when sendpage returns.
>
> So by that time we'll still be left with just a page reference
> on a page where the slab memory may already have been freed.
>
> To make this work we need to get the destination's splice hooks
> to acquire this reference.

Yes I realized this after your earlier posting today.

2009-01-14 07:27:24

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Herbert Xu <[email protected]>
Date: Wed, 14 Jan 2009 14:51:24 +1100

> Unfortunately this won't work, not even for network destinations.
>
> The reason is that this gets called as soon as the destination's
> splice hook returns, for networking that means when sendpage returns.
>
> So by that time we'll still be left with just a page reference
> on a page where the slab memory may already have been freed.
>
> To make this work we need to get the destination's splice hooks
> to acquire this reference.

So while trying to figure out a sane way to fix this, I found
another bug:

/*
* map the linear part
*/
if (__splice_segment(virt_to_page(skb->data),
(unsigned long) skb->data & (PAGE_SIZE - 1),
skb_headlen(skb),
offset, len, skb, spd))
return 1;

This will explode if the SLAB cache for skb->head is using compound
(ie. order > 0) pages.

For example, if this is an order-1 page being used for the skb->head
data (which would be true on most systems for jumbo MTU frames being
received into a linear SKB), the offset will be wrong and depending
upon skb_headlen() we could reference past the end of that
non-compound page we will end up grabbing a reference to.

And then we'll end up with a compound page in an skb_shinfo() frag
array, which is illegal.

Well, at least, I can list several drivers that will barf when
trying to TX that (Acenic, atlx1, cassini, jme, sungem), since
they use pci_map_page(... virt_to_page(skb->data)) or similar.

The core KMAP'ing support for SKBs will also not be able to grok
such a beastly SKB.
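
For what it's worth, a compound-page-aware way to locate the linear data
would look something like the sketch below; it only fixes the page/offset
computation and does nothing about the "compound page in a frag array"
problem described above:

	struct page *p = virt_to_head_page(skb->data);
	unsigned int poff = (unsigned long) skb->data -
			    (unsigned long) page_address(p);

	/* poff may now be >= PAGE_SIZE, i.e. the data really lives in a
	 * tail page of the order > 0 allocation, which is exactly what
	 * the "& (PAGE_SIZE - 1)" computation above cannot express. */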

2009-01-14 08:27:24

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Tue, Jan 13, 2009 at 11:27:10PM -0800, David Miller wrote:
>
> So while trying to figure out a sane way to fix this, I found
> another bug:
>
> /*
> * map the linear part
> */
> if (__splice_segment(virt_to_page(skb->data),
> (unsigned long) skb->data & (PAGE_SIZE - 1),
> skb_headlen(skb),
> offset, len, skb, spd))
> return 1;
>
> This will explode if the SLAB cache for skb->head is using compound
> (ie. order > 0) pages.
>
> For example, if this is an order-1 page being used for the skb->head
> data (which would be true on most systems for jumbo MTU frames being
> received into a linear SKB), the offset will be wrong and depending
> upon skb_headlen() we could reference past the end of that
> non-compound page we will end up grabbing a reference to.

I'm actually not worried so much about these packets since these
drivers should be converted to skb frags as otherwise they'll
probably stop working after a while due to memory fragmentation.

But yeah for correctness we definitely should address this in
skb_splice_bits.

I still think Jarek's approach (the copying one) is probably the
easiest for now until we can find a better way.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-14 08:53:30

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Wed, Jan 14, 2009 at 07:26:30PM +1100, Herbert Xu wrote:
> On Tue, Jan 13, 2009 at 11:27:10PM -0800, David Miller wrote:
> >
> > So while trying to figure out a sane way to fix this, I found
> > another bug:
> >
> > /*
> > * map the linear part
> > */
> > if (__splice_segment(virt_to_page(skb->data),
> > (unsigned long) skb->data & (PAGE_SIZE - 1),
> > skb_headlen(skb),
> > offset, len, skb, spd))
> > return 1;
> >
> > This will explode if the SLAB cache for skb->head is using compound
> > (ie. order > 0) pages.
> >
> > For example, if this is an order-1 page being used for the skb->head
> > data (which would be true on most systems for jumbo MTU frames being
> > received into a linear SKB), the offset will be wrong and depending
> > upon skb_headlen() we could reference past the end of that
> > non-compound page we will end up grabbing a reference to.
>
> I'm actually not worried so much about these packets since these
> drivers should be converted to skb frags as otherwise they'll
> probably stop working after a while due to memory fragmentation.
>
> But yeah for correctness we definitely should address this in
> skb_splice_bits.
>
> I still think Jarek's approach (the copying one) is probably the
> easiest for now until we can find a better way.
>

Actually, I still think my second approach (the PageSlab one) is probably
(if tested) the easiest for now, because it should fix the reported
(Willy's) problem without any change or copy overhead for splice to
file (which could still be wrong, but not obviously wrong). Then we
could search for the only right way (which is most probably around
Herbert's new skb page allocator). IMHO "my" "copying approach" is too
risky, e.g. for stable etc., because of unknown memory requirements,
especially for some larger page size configs/systems.

Jarek P.

2009-01-14 09:29:42

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Jarek Poplawski <[email protected]>
Date: Wed, 14 Jan 2009 08:53:08 +0000

> Actually, I still think my second approach (the PageSlab) is probably
> (if tested) the easiest for now, because it should fix the reported
> (Willy's) problem, without any change or copy overhead for splice to
> file (which could be still wrong, but not obviously wrong).

It's a simple fix, but as Herbert stated it leaves other ->sendpage()
implementations exposed to data corruption when the from side of the
pipe buffer is a socket.

That, to me, is almost worse than a bad fix.

It's definitely worse than a slower but full fix, which the copy
patch is.

Therefore what I'll likely do is push Jarek's copy based cure,
and meanwhile we can brainstorm some more on how to fix this
properly in the long term.

So, I've put together a full commit message and Jarek's patch
below. One thing I notice is that the silly skb_clone() done
by SKB splicing is no longer necessary.

We could get rid of that to offset (some) of the cost we are
adding with this bug fix.

Comments?

net: Fix data corruption when splicing from sockets.

From: Jarek Poplawski <[email protected]>

The trick in socket splicing where we try to convert the skb->data
into a page based reference using virt_to_page() does not work so
well.

The idea is to pass the virt_to_page() reference via the pipe
buffer, and refcount the buffer using a SKB reference.

But if we are splicing from a socket to a socket (via sendpage)
this doesn't work.

The from side processing will grab the page (and SKB) references.
The sendpage() calls will grab page references only, return, and
then the from side processing completes and drops the SKB ref.

The page based reference to skb->data is not enough to keep the
kmalloc() buffer backing it from being reused. Yet, that is
all that the socket send side has at this point.

This leads to data corruption if the skb->data buffer is reused
by SLAB before the send side socket actually gets the TX packet
out to the device.

The fix employed here is to simply allocate a page and copy the
skb->data bytes into that page.

This will hurt performance, but there is no clear way to fix this
properly without a copy at the present time, and it is important
to get rid of the data corruption.

Signed-off-by: David S. Miller <[email protected]>

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..6e43d52 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- kfree_skb(skb);
+ put_page(buf->page);
}

static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- skb_get(skb);
+ get_page(buf->page);
}

static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
*/
static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
{
- struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+ put_page(spd->pages[i]);
+}

- kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+ unsigned int offset)
+{
+ struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+ if (!p)
+ return NULL;
+ memcpy(page_address(p) + offset, page_address(page) + offset, len);
+
+ return p;
}

/*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
*/
static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
unsigned int len, unsigned int offset,
- struct sk_buff *skb)
+ struct sk_buff *skb, int linear)
{
if (unlikely(spd->nr_pages == PIPE_BUFFERS))
return 1;

+ if (linear) {
+ page = linear_to_page(page, len, offset);
+ if (!page)
+ return 1;
+ }
+
spd->pages[spd->nr_pages] = page;
spd->partial[spd->nr_pages].len = len;
spd->partial[spd->nr_pages].offset = offset;
- spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
spd->nr_pages++;
+ get_page(page);
+
return 0;
}

@@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
static inline int __splice_segment(struct page *page, unsigned int poff,
unsigned int plen, unsigned int *off,
unsigned int *len, struct sk_buff *skb,
- struct splice_pipe_desc *spd)
+ struct splice_pipe_desc *spd, int linear)
{
if (!*len)
return 1;
@@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
/* the linear region may spread across several pages */
flen = min_t(unsigned int, flen, PAGE_SIZE - poff);

- if (spd_fill_page(spd, page, flen, poff, skb))
+ if (spd_fill_page(spd, page, flen, poff, skb, linear))
return 1;

__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
if (__splice_segment(virt_to_page(skb->data),
(unsigned long) skb->data & (PAGE_SIZE - 1),
skb_headlen(skb),
- offset, len, skb, spd))
+ offset, len, skb, spd, 1))
return 1;

/*
@@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];

if (__splice_segment(f->page, f->page_offset, f->size,
- offset, len, skb, spd))
+ offset, len, skb, spd, 0))
return 1;
}

2009-01-14 09:42:36

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
> From: Jarek Poplawski <[email protected]>
> Date: Wed, 14 Jan 2009 08:53:08 +0000
>
> > Actually, I still think my second approach (the PageSlab) is probably
> > (if tested) the easiest for now, because it should fix the reported
> > (Willy's) problem, without any change or copy overhead for splice to
> > file (which could be still wrong, but not obviously wrong).
>
> It's a simple fix, but as Herbert stated it leaves other ->sendpage()
> implementations exposed to data corruption when the from side of the
> pipe buffer is a socket.

I don't think Herbert meant other ->sendpage() implementations, but I
could miss something.

> That, to me, is almost worse than a bad fix.
>
> It's definitely worse than a slower but full fix, which the copy
> patch is.

Sorry, I can't see how this patch could make sendpage worse.

Jarek P.

2009-01-14 09:55:30

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
...
> Therefore what I'll likely do is push Jarek's copy based cure,
> and meanwhile we can brainstorm some more on how to fix this
> properly in the long term.
>
> So, I've put together a full commit message and Jarek's patch
> below. One thing I notice is that the silly skb_clone() done
> by SKB splicing is no longer necessary.
>
> We could get rid of that to offset (some) of the cost we are
> adding with this bug fix.
>
> Comments?

Yes, this should lessen the overhead a bit.

>
> net: Fix data corruption when splicing from sockets.
>
> From: Jarek Poplawski <[email protected]>
>
> The trick in socket splicing where we try to convert the skb->data
> into a page based reference using virt_to_page() does not work so
> well.
>
> The idea is to pass the virt_to_page() reference via the pipe
> buffer, and refcount the buffer using a SKB reference.
>
> But if we are splicing from a socket to a socket (via sendpage)
> this doesn't work.
>
> The from side processing will grab the page (and SKB) references.
> The sendpage() calls will grab page references only, return, and
> then the from side processing completes and drops the SKB ref.
>
> The page based reference to skb->data is not enough to keep the
> kmalloc() buffer backing it from being reused. Yet, that is
> all that the socket send side has at this point.
>
> This leads to data corruption if the skb->data buffer is reused
> by SLAB before the send side socket actually gets the TX packet
> out to the device.
>
> The fix employed here is to simply allocate a page and copy the
> skb->data bytes into that page.
>
> This will hurt performance, but there is no clear way to fix this
> properly without a copy at the present time, and it is important
> to get rid of the data corruption.
>
> Signed-off-by: David S. Miller <[email protected]>

You are very brave! I'd prefer to wait for at least minimal testing
by Willy...

Thanks,
Jarek P.

BTW, an skb parameter could be removed from spd_fill_page() to make it
even faster...

...
> static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
> unsigned int len, unsigned int offset,
> - struct sk_buff *skb)
> + struct sk_buff *skb, int linear)
> {
> if (unlikely(spd->nr_pages == PIPE_BUFFERS))
> return 1;
>
> + if (linear) {
> + page = linear_to_page(page, len, offset);
> + if (!page)
> + return 1;
> + }
> +
> spd->pages[spd->nr_pages] = page;
> spd->partial[spd->nr_pages].len = len;
> spd->partial[spd->nr_pages].offset = offset;
> - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
> spd->nr_pages++;
> + get_page(page);
> +
> return 0;
> }
>
> @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
> static inline int __splice_segment(struct page *page, unsigned int poff,
> unsigned int plen, unsigned int *off,
> unsigned int *len, struct sk_buff *skb,
> - struct splice_pipe_desc *spd)
> + struct splice_pipe_desc *spd, int linear)
> {
> if (!*len)
> return 1;
> @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
> /* the linear region may spread across several pages */
> flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
>
> - if (spd_fill_page(spd, page, flen, poff, skb))
> + if (spd_fill_page(spd, page, flen, poff, skb, linear))
> return 1;
>
> __segment_seek(&page, &poff, &plen, flen);
> @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
> if (__splice_segment(virt_to_page(skb->data),
> (unsigned long) skb->data & (PAGE_SIZE - 1),
> skb_headlen(skb),
> - offset, len, skb, spd))
> + offset, len, skb, spd, 1))
> return 1;
>
> /*
> @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
> const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
>
> if (__splice_segment(f->page, f->page_offset, f->size,
> - offset, len, skb, spd))
> + offset, len, skb, spd, 0))
> return 1;
> }
>

2009-01-14 10:02:21

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Wed, Jan 14, 2009 at 09:54:54AM +0000, Jarek Poplawski wrote:
> You are very brave! I'd prefer to wait for at least minimal testing
> by Willy...

Will test, probably this evening.

Thanks guys,
Willy

2009-01-14 10:06:50

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Jarek Poplawski <[email protected]>
Date: Wed, 14 Jan 2009 09:42:16 +0000

> On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
> > From: Jarek Poplawski <[email protected]>
> > Date: Wed, 14 Jan 2009 08:53:08 +0000
> >
> > > Actually, I still think my second approach (the PageSlab) is probably
> > > (if tested) the easiest for now, because it should fix the reported
> > > (Willy's) problem, without any change or copy overhead for splice to
> > > file (which could be still wrong, but not obviously wrong).
> >
> > It's a simple fix, but as Herbert stated it leaves other ->sendpage()
> > implementations exposed to data corruption when the from side of the
> > pipe buffer is a socket.
>
> I don't think Herbert meant other ->sendpage() implementations, but I
> could miss something.

I think he did :-)

Or, more generally, he could have been referring to splice pipe
outputs. All of these things grab references to pages and
expect that to keep the underlying data from being reallocated.

That doesn't work for this skb->data case.

> > That, to me, is almost worse than a bad fix.
> >
> > It's definitely worse than a slower but full fix, which the copy
> > patch is.
>
> Sorry, I can't see how this patch could make sendpage worse.

Because that patch only fixes TCP's ->sendpage() implementation.

There are others out there which could end up experiencing similar
data corruption.

2009-01-14 10:47:36

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Wed, Jan 14, 2009 at 02:06:37AM -0800, David Miller wrote:
> From: Jarek Poplawski <[email protected]>
> Date: Wed, 14 Jan 2009 09:42:16 +0000
>
> > On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
> > > From: Jarek Poplawski <[email protected]>
> > > Date: Wed, 14 Jan 2009 08:53:08 +0000
> > >
> > > > Actually, I still think my second approach (the PageSlab) is probably
> > > > (if tested) the easiest for now, because it should fix the reported
> > > > (Willy's) problem, without any change or copy overhead for splice to
> > > > file (which could be still wrong, but not obviously wrong).
> > >
> > > It's a simple fix, but as Herbert stated it leaves other ->sendpage()
> > > implementations exposed to data corruption when the from side of the
> > > pipe buffer is a socket.
> >
> > I don't think Herbert meant other ->sendpage() implementations, but I
> > could miss something.
>
> I think he did :-)

I hope Herbert will make it clear.

>
> Or, more generally, he could have been referring to splice pipe
> outputs. All of these things grab references to pages and
> expect that to keep the underlying data from being reallocated.
>
> That doesn't work for this skb->data case.
>
> > > That, to me, is almost worse than a bad fix.
> > >
> > > It's definitely worse than a slower but full fix, which the copy
> > > patch is.
> >
> > Sorry, I can't see how this patch could make sendpage worse.
>
> Because that patch only fixes TCP's ->sendpage() implementation.

Yes, it fixes (I hope) the only reported implementation.

>
> There are others out there which could end up experiencing similar
> data corruption.

Since this fix is very simple (and IMHO safe) it could probably be
used elsewhere too, but I doubt we should care at the moment.

Jarek P.

2009-01-14 11:29:04

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
>
> It's a simple fix, but as Herbert stated it leaves other ->sendpage()
> implementations exposed to data corruption when the from side of the
> pipe buffer is a socket.
>
> That, to me, is almost worse than a bad fix.

Yep, so far nobody has verified the disk path at all. So for all
we know, if there is a delay on the way to disk, the exact same
thing can occur.

Besides, the PageSlab thing is going to copy for network to network
anyway.
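
(For illustration only, a rough sketch of what that PageSlab-based
alternative would amount to -- the exact shape of the guard below is an
assumption, not Jarek's actual patch: TCP's ->sendpage() falls back to
the copying sock_no_sendpage() path whenever the page is slab-backed,
i.e. came from skb->data via virt_to_page().)

--------------------
ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
		     size_t size, int flags)
{
	ssize_t res;
	struct sock *sk = sock->sk;

	/* Existing fallback for !SG / !checksum hardware, plus a
	 * PageSlab() test: a slab page cannot safely be referenced
	 * beyond the life of the skb, so copy via sock_no_sendpage().
	 */
	if (!(sk->sk_route_caps & NETIF_F_SG) ||
	    !(sk->sk_route_caps & NETIF_F_ALL_CSUM) ||
	    PageSlab(page))		/* <-- the added test */
		return sock_no_sendpage(sock, page, offset, size, flags);

	lock_sock(sk);
	res = do_tcp_sendpages(sk, &page, offset, size, flags);
	release_sock(sk);

	return res;
}
--------------------

Which is also why it only protects TCP's ->sendpage(), and why the
network-to-network case ends up copying anyway.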

> So, I've put together a full commit message and Jarek's patch
> below. One thing I notice is that the silly skb_clone() done
> by SKB splicing is no longer necessary.
>
> We could get rid of that to offset (some) of the cost we are
> adding with this bug fix.
>
> Comments?

Yes that's probably a good idea.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-14 11:29:44

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Wed, Jan 14, 2009 at 10:47:16AM +0000, Jarek Poplawski wrote:
>
> > > I don't think Herbert meant other ->sendpage() implementations, but I
> > > could miss something.
> >
> > I think he did :-)
>
> I hope Herbert will make it clear.

Yes I did mean the other splice paths, and in particular, the path
to disk. I hope that's clear enough :)

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-14 11:41:25

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Wed, Jan 14, 2009 at 10:29:03PM +1100, Herbert Xu wrote:
> On Wed, Jan 14, 2009 at 10:47:16AM +0000, Jarek Poplawski wrote:
> >
> > > > I don't think Herbert meant other ->sendpage() implementations, but I
> > > > could miss something.
> > >
> > > I think he did :-)
> >
> > I hope Herbert will make it clear.
>
> Yes I did mean the other splice paths, and in particular, the path
> to disk. I hope that's clear enough :)

So, I think I got it right: otherwise it would be enough to do this
copy to a new page only before the ->sendpage() calls, e.g. in
generic_splice_sendpage().

Jarek P.

2009-01-14 11:45:39

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Wed, Jan 14, 2009 at 11:40:59AM +0000, Jarek Poplawski wrote:
> On Wed, Jan 14, 2009 at 10:29:03PM +1100, Herbert Xu wrote:
> > On Wed, Jan 14, 2009 at 10:47:16AM +0000, Jarek Poplawski wrote:
> > >
> > > > > I don't think Herbert meant other ->sendpage() implementations, but I
> > > > > could miss something.
> > > >
> > > > I think he did :-)
> > >
> > > I hope Herbert will make it clear.
> >
> > Yes I did mean the other splice paths, and in particular, the path
> > to disk. I hope that's clear enough :)
>
> So, I think I got it right: otherwise it would be enough to make this
> copy to a new page only before ->sendpage() calls e.g. in
> generic_splice_sendpage().

Hmm... Actually in pipe_to_sendpage().
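
(Just a sketch of that idea, with hypothetical error handling -- the
copy would sit right before the ->sendpage() call, so it would only
cover sendpage outputs and not the splice-to-file path:)

--------------------
static int pipe_to_sendpage(struct pipe_inode_info *pipe,
			    struct pipe_buffer *buf, struct splice_desc *sd)
{
	struct file *file = sd->u.file;
	struct page *page = buf->page;
	loff_t pos = sd->pos;
	int ret, more;

	ret = buf->ops->confirm(pipe, buf);
	if (ret)
		return ret;

	if (PageSlab(page)) {
		/* Buffer points into kmalloc()ed skb->data: copy it to
		 * a page of our own, so the skb may be freed and its
		 * memory reused before the TX side is done with it.
		 */
		struct page *copy = alloc_page(GFP_KERNEL);

		if (!copy)
			return -ENOMEM;
		memcpy(page_address(copy) + buf->offset,
		       page_address(page) + buf->offset, sd->len);
		page = copy;
	}

	more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len;
	ret = file->f_op->sendpage(file, page, buf->offset, sd->len,
				   &pos, more);

	if (page != buf->page)
		put_page(page);	/* drop our copy's ref; TCP takes its own */

	return ret;
}
--------------------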

Jarek P.

2009-01-14 12:07:22

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Wed, Jan 14, 2009 at 09:54:54AM +0000, Jarek Poplawski wrote:
> On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
...
> > net: Fix data corruption when splicing from sockets.
> >
> > From: Jarek Poplawski <[email protected]>
...
> >
> > Signed-off-by: David S. Miller <[email protected]>
>
> You are very brave! I'd prefer to wait for at least minimal testing
> by Willy...

On the other hand, I can't waste your trust like this (plus I'm not
suicidal), so a little update:

Based on a review by Changli Gao <[email protected]>:
http://lkml.org/lkml/2008/2/26/210

Foreseen-by: Changli Gao <[email protected]>
Diagnosed-by: Willy Tarreau <[email protected]>
Reported-by: Willy Tarreau <[email protected]>
Signed-off-by: Jarek Poplawski <[email protected]>

>
> Thanks,
> Jarek P.
>
> BTW, an skb parameter could be removed from spd_fill_page() to make it
> even faster...
>
> ...
> > static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
> > unsigned int len, unsigned int offset,
> > - struct sk_buff *skb)
> > + struct sk_buff *skb, int linear)
> > {
> > if (unlikely(spd->nr_pages == PIPE_BUFFERS))
> > return 1;
> >
> > + if (linear) {
> > + page = linear_to_page(page, len, offset);
> > + if (!page)
> > + return 1;
> > + }
> > +
> > spd->pages[spd->nr_pages] = page;
> > spd->partial[spd->nr_pages].len = len;
> > spd->partial[spd->nr_pages].offset = offset;
> > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
> > spd->nr_pages++;
> > + get_page(page);
> > +
> > return 0;
> > }
> >
> > @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
> > static inline int __splice_segment(struct page *page, unsigned int poff,
> > unsigned int plen, unsigned int *off,
> > unsigned int *len, struct sk_buff *skb,
> > - struct splice_pipe_desc *spd)
> > + struct splice_pipe_desc *spd, int linear)
> > {
> > if (!*len)
> > return 1;
> > @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
> > /* the linear region may spread across several pages */
> > flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
> >
> > - if (spd_fill_page(spd, page, flen, poff, skb))
> > + if (spd_fill_page(spd, page, flen, poff, skb, linear))
> > return 1;
> >
> > __segment_seek(&page, &poff, &plen, flen);
> > @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
> > if (__splice_segment(virt_to_page(skb->data),
> > (unsigned long) skb->data & (PAGE_SIZE - 1),
> > skb_headlen(skb),
> > - offset, len, skb, spd))
> > + offset, len, skb, spd, 1))
> > return 1;
> >
> > /*
> > @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
> > const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
> >
> > if (__splice_segment(f->page, f->page_offset, f->size,
> > - offset, len, skb, spd))
> > + offset, len, skb, spd, 0))
> > return 1;
> > }
> >

2009-01-14 12:15:41

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Sorry: take 2:

On Wed, Jan 14, 2009 at 09:54:54AM +0000, Jarek Poplawski wrote:
> On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
...
> > net: Fix data corruption when splicing from sockets.
> >
> > From: Jarek Poplawski <[email protected]>
...
> >
> > Signed-off-by: David S. Miller <[email protected]>
>
> You are very brave! I'd prefer to wait for at least minimal testing
> by Willy...

On the other hand, I can't waste your trust like this (plus I'm not
suicidal), so a little update:

Based on a review by Changli Gao <[email protected]>:
http://lkml.org/lkml/2008/2/26/210

Foreseen-by: Changli Gao <[email protected]>
Diagnosed-by: Willy Tarreau <[email protected]>
Reported-by: Willy Tarreau <[email protected]>
Signed-off-by: Jarek Poplawski <[email protected]>
Fixed-by: Jens Axboe <[email protected]>

>
> Thanks,
> Jarek P.
>
> BTW, an skb parameter could be removed from spd_fill_page() to make it
> even faster...
>
> ...
> > static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
> > unsigned int len, unsigned int offset,
> > - struct sk_buff *skb)
> > + struct sk_buff *skb, int linear)
> > {
> > if (unlikely(spd->nr_pages == PIPE_BUFFERS))
> > return 1;
> >
> > + if (linear) {
> > + page = linear_to_page(page, len, offset);
> > + if (!page)
> > + return 1;
> > + }
> > +
> > spd->pages[spd->nr_pages] = page;
> > spd->partial[spd->nr_pages].len = len;
> > spd->partial[spd->nr_pages].offset = offset;
> > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
> > spd->nr_pages++;
> > + get_page(page);
> > +
> > return 0;
> > }
> >
> > @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
> > static inline int __splice_segment(struct page *page, unsigned int poff,
> > unsigned int plen, unsigned int *off,
> > unsigned int *len, struct sk_buff *skb,
> > - struct splice_pipe_desc *spd)
> > + struct splice_pipe_desc *spd, int linear)
> > {
> > if (!*len)
> > return 1;
> > @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
> > /* the linear region may spread across several pages */
> > flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
> >
> > - if (spd_fill_page(spd, page, flen, poff, skb))
> > + if (spd_fill_page(spd, page, flen, poff, skb, linear))
> > return 1;
> >
> > __segment_seek(&page, &poff, &plen, flen);
> > @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
> > if (__splice_segment(virt_to_page(skb->data),
> > (unsigned long) skb->data & (PAGE_SIZE - 1),
> > skb_headlen(skb),
> > - offset, len, skb, spd))
> > + offset, len, skb, spd, 1))
> > return 1;
> >
> > /*
> > @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
> > const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
> >
> > if (__splice_segment(f->page, f->page_offset, f->size,
> > - offset, len, skb, spd))
> > + offset, len, skb, spd, 0))
> > return 1;
> > }
> >

2009-01-15 23:06:27

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote:
> Therefore what I'll likely do is push Jarek's copy based cure,
> and meanwhile we can brainstorm some more on how to fix this
> properly in the long term.
>
> So, I've put together a full commit message and Jarek's patch
> below. One thing I notice is that the silly skb_clone() done
> by SKB splicing is no longer necessary.
>
> We could get rid of that to offset (some) of the cost we are
> adding with this bug fix.
>
> Comments?

David,

please don't merge it as-is. I've just tried it and got an OOM
in __alloc_pages_internal after a few seconds of data transfer.

I'm leaving the patch below for comments, maybe someone will spot
something ? Don't we need at least one kfree() somewhere to match
alloc_pages() ?

Regards,
Willy

--

>
> net: Fix data corruption when splicing from sockets.
>
> From: Jarek Poplawski <[email protected]>
>
> The trick in socket splicing where we try to convert the skb->data
> into a page based reference using virt_to_page() does not work so
> well.
>
> The idea is to pass the virt_to_page() reference via the pipe
> buffer, and refcount the buffer using a SKB reference.
>
> But if we are splicing from a socket to a socket (via sendpage)
> this doesn't work.
>
> The from side processing will grab the page (and SKB) references.
> The sendpage() calls will grab page references only, return, and
> then the from side processing completes and drops the SKB ref.
>
> The page based reference to skb->data is not enough to keep the
> kmalloc() buffer backing it from being reused. Yet, that is
> all that the socket send side has at this point.
>
> This leads to data corruption if the skb->data buffer is reused
> by SLAB before the send side socket actually gets the TX packet
> out to the device.
>
> The fix employed here is to simply allocate a page and copy the
> skb->data bytes into that page.
>
> This will hurt performance, but there is no clear way to fix this
> properly without a copy at the present time, and it is important
> to get rid of the data corruption.
>
> Signed-off-by: David S. Miller <[email protected]>
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5110b35..6e43d52 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
> static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
> struct pipe_buffer *buf)
> {
> - struct sk_buff *skb = (struct sk_buff *) buf->private;
> -
> - kfree_skb(skb);
> + put_page(buf->page);
> }
>
> static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
> struct pipe_buffer *buf)
> {
> - struct sk_buff *skb = (struct sk_buff *) buf->private;
> -
> - skb_get(skb);
> + get_page(buf->page);
> }
>
> static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
> @@ -1334,9 +1330,19 @@ fault:
> */
> static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
> {
> - struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
> + put_page(spd->pages[i]);
> +}
>
> - kfree_skb(skb);
> +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> + unsigned int offset)
> +{
> + struct page *p = alloc_pages(GFP_KERNEL, 0);
> +
> + if (!p)
> + return NULL;
> + memcpy(page_address(p) + offset, page_address(page) + offset, len);
> +
> + return p;
> }
>
> /*
> @@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
> */
> static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
> unsigned int len, unsigned int offset,
> - struct sk_buff *skb)
> + struct sk_buff *skb, int linear)
> {
> if (unlikely(spd->nr_pages == PIPE_BUFFERS))
> return 1;
>
> + if (linear) {
> + page = linear_to_page(page, len, offset);
> + if (!page)
> + return 1;
> + }
> +
> spd->pages[spd->nr_pages] = page;
> spd->partial[spd->nr_pages].len = len;
> spd->partial[spd->nr_pages].offset = offset;
> - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
> spd->nr_pages++;
> + get_page(page);
> +
> return 0;
> }
>
> @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
> static inline int __splice_segment(struct page *page, unsigned int poff,
> unsigned int plen, unsigned int *off,
> unsigned int *len, struct sk_buff *skb,
> - struct splice_pipe_desc *spd)
> + struct splice_pipe_desc *spd, int linear)
> {
> if (!*len)
> return 1;
> @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
> /* the linear region may spread across several pages */
> flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
>
> - if (spd_fill_page(spd, page, flen, poff, skb))
> + if (spd_fill_page(spd, page, flen, poff, skb, linear))
> return 1;
>
> __segment_seek(&page, &poff, &plen, flen);
> @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
> if (__splice_segment(virt_to_page(skb->data),
> (unsigned long) skb->data & (PAGE_SIZE - 1),
> skb_headlen(skb),
> - offset, len, skb, spd))
> + offset, len, skb, spd, 1))
> return 1;
>
> /*
> @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
> const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
>
> if (__splice_segment(f->page, f->page_offset, f->size,
> - offset, len, skb, spd))
> + offset, len, skb, spd, 0))
> return 1;
> }
>

2009-01-15 23:19:37

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Willy Tarreau <[email protected]>
Date: Fri, 16 Jan 2009 00:03:31 +0100

> please don't merge it as-is. I've just tried it and got an OOM
> in __alloc_pages_internal after a few seconds of data transfer.
>
> I'm leaving the patch below for comments, maybe someone will spot
> something ? Don't we need at least one kfree() somewhere to match
> alloc_pages() ?

What is needed is a put_page().

But that should be happening as a result of the splice callback.

2009-01-15 23:20:46

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Fri, Jan 16, 2009 at 12:03:31AM +0100, Willy Tarreau wrote:
>
> I'm leaving the patch below for comments, maybe someone will spot
> something ? Don't we need at least one kfree() somewhere to match
> alloc_pages() ?

Indeed.

> > static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
> > unsigned int len, unsigned int offset,
> > - struct sk_buff *skb)
> > + struct sk_buff *skb, int linear)
> > {
> > if (unlikely(spd->nr_pages == PIPE_BUFFERS))
> > return 1;
> >
> > + if (linear) {
> > + page = linear_to_page(page, len, offset);
> > + if (!page)
> > + return 1;
> > + }
> > +
> > spd->pages[spd->nr_pages] = page;
> > spd->partial[spd->nr_pages].len = len;
> > spd->partial[spd->nr_pages].offset = offset;
> > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
> > spd->nr_pages++;
> > + get_page(page);

This get_page needs to be moved into an else clause of the previous
if block.
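
(A short walk-through of the imbalance, assuming one put_page() per
pipe buffer on release as in the quoted patch:

    linear_to_page():  alloc_pages()   -> page count == 1
    spd_fill_page():   get_page(page)  -> page count == 2
    release:           put_page()      -> page count == 1, never freed

Moving the get_page() into an else branch means the extra reference is
only taken for pages we did not just allocate ourselves:)

--------------------
	if (linear) {
		page = linear_to_page(page, len, offset);
		if (!page)
			return 1;
	} else
		get_page(page);	/* borrowed page: take our own reference */
--------------------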

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-15 23:26:27

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Herbert Xu <[email protected]>
Date: Fri, 16 Jan 2009 10:19:35 +1100

> On Fri, Jan 16, 2009 at 12:03:31AM +0100, Willy Tarreau wrote:
> > > + if (linear) {
> > > + page = linear_to_page(page, len, offset);
> > > + if (!page)
> > > + return 1;
> > > + }
> > > +
> > > spd->pages[spd->nr_pages] = page;
> > > spd->partial[spd->nr_pages].len = len;
> > > spd->partial[spd->nr_pages].offset = offset;
> > > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
> > > spd->nr_pages++;
> > > + get_page(page);
>
> This get_page needs to be moved into an else clause of the previous
> if block.

Yep, good catch Herbert.

New patch, this has the SKB clone removal as well:

net: Fix data corruption when splicing from sockets.

From: Jarek Poplawski <[email protected]>

The trick in socket splicing where we try to convert the skb->data
into a page based reference using virt_to_page() does not work so
well.

The idea is to pass the virt_to_page() reference via the pipe
buffer, and refcount the buffer using a SKB reference.

But if we are splicing from a socket to a socket (via sendpage)
this doesn't work.

The from side processing will grab the page (and SKB) references.
The sendpage() calls will grab page references only, return, and
then the from side processing completes and drops the SKB ref.

The page based reference to skb->data is not enough to keep the
kmalloc() buffer backing it from being reused. Yet, that is
all that the socket send side has at this point.

This leads to data corruption if the skb->data buffer is reused
by SLAB before the send side socket actually gets the TX packet
out to the device.

The fix employed here is to simply allocate a page and copy the
skb->data bytes into that page.

This will hurt performance, but there is no clear way to fix this
properly without a copy at the present time, and it is important
to get rid of the data corruption.

With fixes from Herbert Xu.

Signed-off-by: David S. Miller <[email protected]>

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 65eac77..56272ac 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- kfree_skb(skb);
+ put_page(buf->page);
}

static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- skb_get(skb);
+ get_page(buf->page);
}

static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
*/
static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
{
- struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+ put_page(spd->pages[i]);
+}

- kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+ unsigned int offset)
+{
+ struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+ if (!p)
+ return NULL;
+ memcpy(page_address(p) + offset, page_address(page) + offset, len);
+
+ return p;
}

/*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
*/
static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
unsigned int len, unsigned int offset,
- struct sk_buff *skb)
+ struct sk_buff *skb, int linear)
{
if (unlikely(spd->nr_pages == PIPE_BUFFERS))
return 1;

+ if (linear) {
+ page = linear_to_page(page, len, offset);
+ if (!page)
+ return 1;
+ } else
+ get_page(page);
+
spd->pages[spd->nr_pages] = page;
spd->partial[spd->nr_pages].len = len;
spd->partial[spd->nr_pages].offset = offset;
- spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
spd->nr_pages++;
+
return 0;
}

@@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
static inline int __splice_segment(struct page *page, unsigned int poff,
unsigned int plen, unsigned int *off,
unsigned int *len, struct sk_buff *skb,
- struct splice_pipe_desc *spd)
+ struct splice_pipe_desc *spd, int linear)
{
if (!*len)
return 1;
@@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
/* the linear region may spread across several pages */
flen = min_t(unsigned int, flen, PAGE_SIZE - poff);

- if (spd_fill_page(spd, page, flen, poff, skb))
+ if (spd_fill_page(spd, page, flen, poff, skb, linear))
return 1;

__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
if (__splice_segment(virt_to_page(skb->data),
(unsigned long) skb->data & (PAGE_SIZE - 1),
skb_headlen(skb),
- offset, len, skb, spd))
+ offset, len, skb, spd, 1))
return 1;

/*
@@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];

if (__splice_segment(f->page, f->page_offset, f->size,
- offset, len, skb, spd))
+ offset, len, skb, spd, 0))
return 1;
}

@@ -1442,7 +1455,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
* the frag list, if such a thing exists. We'd probably need to recurse to
* handle that cleanly.
*/
-int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
+int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
struct pipe_inode_info *pipe, unsigned int tlen,
unsigned int flags)
{
@@ -1455,16 +1468,6 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
.ops = &sock_pipe_buf_ops,
.spd_release = sock_spd_release,
};
- struct sk_buff *skb;
-
- /*
- * I'd love to avoid the clone here, but tcp_read_sock()
- * ignores reference counts and unconditonally kills the sk_buff
- * on return from the actor.
- */
- skb = skb_clone(__skb, GFP_KERNEL);
- if (unlikely(!skb))
- return -ENOMEM;

/*
* __skb_splice_bits() only fails if the output has no room left,
@@ -1488,15 +1491,9 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
}

done:
- /*
- * drop our reference to the clone, the pipe consumption will
- * drop the rest.
- */
- kfree_skb(skb);
-
if (spd.nr_pages) {
+ struct sock *sk = skb->sk;
int ret;
- struct sock *sk = __skb->sk;

/*
* Drop the socket lock, otherwise we have reverse

2009-01-15 23:32:57

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
>
> New patch, this has the SKB clone removal as well:

Thanks Dave!

Something else just came to mind though.

> +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> + unsigned int offset)
> +{
> + struct page *p = alloc_pages(GFP_KERNEL, 0);
> +
> + if (!p)
> + return NULL;
> + memcpy(page_address(p) + offset, page_address(page) + offset, len);

This won't work very well if skb->head is longer than a page.

We'll need to divide it up into individual pages.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-15 23:33:42

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Fri, Jan 16, 2009 at 10:19:35AM +1100, Herbert Xu wrote:
> On Fri, Jan 16, 2009 at 12:03:31AM +0100, Willy Tarreau wrote:
> >
> > I'm leaving the patch below for comments, maybe someone will spot
> > something ? Don't we need at least one kfree() somewhere to match
> > alloc_pages() ?
>
> Indeed.
>
> > > static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
> > > unsigned int len, unsigned int offset,
> > > - struct sk_buff *skb)
> > > + struct sk_buff *skb, int linear)
> > > {
> > > if (unlikely(spd->nr_pages == PIPE_BUFFERS))
> > > return 1;
> > >
> > > + if (linear) {
> > > + page = linear_to_page(page, len, offset);
> > > + if (!page)
> > > + return 1;
> > > + }
> > > +
> > > spd->pages[spd->nr_pages] = page;
> > > spd->partial[spd->nr_pages].len = len;
> > > spd->partial[spd->nr_pages].offset = offset;
> > > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
> > > spd->nr_pages++;
> > > + get_page(page);
>
> This get_page needs to be moved into an else clause of the previous
> if block.

Good catch Herbert, it's working fine now. The performance I get
(30.6 MB/s) is between the original splice version (24.1) and the
recently optimized one (35.7). That's not that bad considering
we're copying the data.

Cheers,
Willy

2009-01-15 23:35:22

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Herbert Xu <[email protected]>
Date: Fri, 16 Jan 2009 10:32:05 +1100

> On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > + unsigned int offset)
> > +{
> > + struct page *p = alloc_pages(GFP_KERNEL, 0);
> > +
> > + if (!p)
> > + return NULL;
> > + memcpy(page_address(p) + offset, page_address(page) + offset, len);
>
> This won't work very well if skb->head is longer than a page.
>
> We'll need to divide it up into individual pages.

Oh yes the same bug I pointed out the other day.

But Willy can test this patch as-is, since he is not using
jumbo frames in linear SKBs.

2009-01-15 23:35:59

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Willy Tarreau <[email protected]>
Date: Fri, 16 Jan 2009 00:32:20 +0100

> Good catch Herbert, it's working fine now. The performance I get
> (30.6 MB/s) is between the original splice version (24.1) and the
> recently optimized one (35.7). That's not that bad considering
> we're copying the data.

Willy, see how much my skb clone removal in the patch I just
posted helps, if at all.

2009-01-15 23:43:29

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Thu, Jan 15, 2009 at 03:34:49PM -0800, David Miller wrote:
> From: Herbert Xu <[email protected]>
> Date: Fri, 16 Jan 2009 10:32:05 +1100
>
> > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > > + unsigned int offset)
> > > +{
> > > + struct page *p = alloc_pages(GFP_KERNEL, 0);
> > > +
> > > + if (!p)
> > > + return NULL;
> > > + memcpy(page_address(p) + offset, page_address(page) + offset, len);
> >
> > This won't work very well if skb->head is longer than a page.
> >
> > We'll need to divide it up into individual pages.
>
> Oh yes the same bug I pointed out the other day.
>
> But Willy can test this patch as-is,

Hey, nice work Dave. +3% performance from your previous patch
(31.6 MB/s). It's going fine and stable here.

> since he is not using jumbo frames in linear SKBs.

If you're interested, this weekend I can do some tests on my
myri10ge NICs which support LRO. I frequently observe 23 kB
packets there, and they also support jumbo frames. Those should
cover the case above.

I'm afraid that's all for me for this evening; I have to get some
sleep before going to work. If you want to cook up more patches,
I'll be able to do a bit of testing in about 5 hours.

Cheers!
Willy

2009-01-15 23:44:47

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Fri, Jan 16, 2009 at 12:42:55AM +0100, Willy Tarreau wrote:
> On Thu, Jan 15, 2009 at 03:34:49PM -0800, David Miller wrote:
> > From: Herbert Xu <[email protected]>
> > Date: Fri, 16 Jan 2009 10:32:05 +1100
> >
> > > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > > > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > > > + unsigned int offset)
> > > > +{
> > > > + struct page *p = alloc_pages(GFP_KERNEL, 0);
> > > > +
> > > > + if (!p)
> > > > + return NULL;
> > > > + memcpy(page_address(p) + offset, page_address(page) + offset, len);
> > >
> > > This won't work very well if skb->head is longer than a page.
> > >
> > > We'll need to divide it up into individual pages.
> >
> > Oh yes the same bug I pointed out the other day.
> >
> > But Willy can test this patch as-is,
>
> Hey, nice work Dave. +3% performance from your previous patch
> (31.6 MB/s). It's going fine and stable here.

And BTW feel free to add my Tested-by if you want in case you merge
this fix.

Willy

2009-01-15 23:54:51

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Willy Tarreau <[email protected]>
Date: Fri, 16 Jan 2009 00:44:08 +0100

> And BTW feel free to add my Tested-by if you want in case you merge
> this fix.

Done, thanks Willy.

2009-01-16 06:51:36

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Fri, Jan 16, 2009 at 12:44:08AM +0100, Willy Tarreau wrote:
> On Fri, Jan 16, 2009 at 12:42:55AM +0100, Willy Tarreau wrote:
> > On Thu, Jan 15, 2009 at 03:34:49PM -0800, David Miller wrote:
> > > From: Herbert Xu <[email protected]>
> > > Date: Fri, 16 Jan 2009 10:32:05 +1100
> > >
> > > > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > > > > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > > > > + unsigned int offset)
> > > > > +{
> > > > > + struct page *p = alloc_pages(GFP_KERNEL, 0);
> > > > > +
> > > > > + if (!p)
> > > > > + return NULL;
> > > > > + memcpy(page_address(p) + offset, page_address(page) + offset, len);
> > > >
> > > > This won't work very well if skb->head is longer than a page.
> > > >
> > > > We'll need to divide it up into individual pages.
> > >
> > > Oh yes the same bug I pointed out the other day.
> > >
> > > But Willy can test this patch as-is,
> >
> > Hey, nice work Dave. +3% performance from your previous patch
> > (31.6 MB/s). It's going fine and stable here.
>
> And BTW feel free to add my Tested-by if you want in case you merge
> this fix.
>
> Willy
>

Herbert, good catch!

David, if it's not too late I think more credits are needed,
especially for Willy. He did "a bit" more than testing.

Alas, I can't see this problem with skb->head longer than a page. There
is even a comment on this in __splice_segment(), but I could be missing
something.

I'm more concerned about memory usage if these skbs are not acked for
some reason. Isn't there some DoS issue possible?

Thanks everybody,
Jarek P.
--------->
Based on a review by Changli Gao <[email protected]>:
http://lkml.org/lkml/2008/2/26/210

Foreseen-by: Changli Gao <[email protected]>
Diagnosed-by: Willy Tarreau <[email protected]>
Reported-by: Willy Tarreau <[email protected]>
Fixed-by: Jens Axboe <[email protected]>
Signed-off-by: Jarek Poplawski <[email protected]>

2009-01-19 00:42:49

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Hi guys,

On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
> From: Willy Tarreau <[email protected]>
> Date: Fri, 16 Jan 2009 00:44:08 +0100
>
> > And BTW feel free to add my Tested-by if you want in case you merge
> > this fix.
>
> Done, thanks Willy.

Just for the record, I've now re-integrated those changes in a test kernel
that I booted on my 10gig machines. I have updated my user-space code in
haproxy to run a new series of tests. Even though there is a memcpy(), the
results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs):

- 4.8 Gbps at 100% CPU using MTU=1500 without LRO
(3.2 Gbps at 100% CPU without splice)

- 9.2 Gbps at 50% CPU using MTU=1500 with LRO

- 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without
splice)

- 10 Gbps at 15% CPU using MTU=9000 with LRO

These last ones are really impressive. While I had already observed such
performance on the Myri10GE with Tux, it's the first time I can reach that
level with so little CPU usage in haproxy!

So I think that the memcpy() workaround might be a non-issue for some time.
I agree it's not beautiful but it works pretty well for now.

The 3 patches I used on top of 2.6.27.10 were the fix to return 0 instead of
-EAGAIN on end of read, the one to process multiple skbs at once, and Dave's
last patch based on Jarek's workaround for the corruption issue.

Cheers,
Willy

2009-01-19 03:09:34

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote:
>
> Just for the record, I've now re-integrated those changes in a test kernel
> that I booted on my 10gig machines. I have updated my user-space code in
> haproxy to run a new series of tests. Eventhough there is a memcpy(), the
> results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
>
> - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> (3.2 Gbps at 100% CPU without splice)

One thing to note is that Myricom's driver probably uses page
frags which means that you're not actually triggering the copy.
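
(In other words -- and this is only a restatement of the patched
__skb_splice_bits(), not new code -- the copy is confined to the linear
head:

    __splice_segment(virt_to_page(skb->data), ..., linear = 1)  /* copied */
    __splice_segment(f->page, f->page_offset, f->size, ..., 0)  /* by ref */

so a driver that receives into page frags only pays the memcpy() for
whatever it pulled into skb->data, not for the bulk of the payload.)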

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-19 03:27:32

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Herbert Xu <[email protected]>
Date: Mon, 19 Jan 2009 14:08:44 +1100

> On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote:
> >
> > Just for the record, I've now re-integrated those changes in a test kernel
> > that I booted on my 10gig machines. I have updated my user-space code in
> > haproxy to run a new series of tests. Eventhough there is a memcpy(), the
> > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
> >
> > - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> > (3.2 Gbps at 100% CPU without splice)
>
> One thing to note is that Myricom's driver probably uses page
> frags which means that you're not actually triggering the copy.

Right.

And this is also the only reason why jumbo MTU worked :-)

2009-01-19 03:28:26

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Willy Tarreau <[email protected]>
Date: Mon, 19 Jan 2009 01:42:06 +0100

> Hi guys,
>
> On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
> > From: Willy Tarreau <[email protected]>
> > Date: Fri, 16 Jan 2009 00:44:08 +0100
> >
> > > And BTW feel free to add my Tested-by if you want in case you merge
> > > this fix.
> >
> > Done, thanks Willy.
>
> Just for the record, I've now re-integrated those changes in a test kernel
> that I booted on my 10gig machines. I have updated my user-space code in
> haproxy to run a new series of tests. Eventhough there is a memcpy(), the
> results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
>
> - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> (3.2 Gbps at 100% CPU without splice)
>
> - 9.2 Gbps at 50% CPU using MTU=1500 with LRO
>
> - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without
> splice)
>
> - 10 Gbps at 15% CPU using MTU=9000 with LRO

Thanks for the numbers.

We can almost certainly do a lot better, so if you have the time and
can get some oprofile dumps for these various cases that would be
useful to us.

Thanks again.

2009-01-19 06:08:46

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Jarek Poplawski <[email protected]>
Date: Fri, 16 Jan 2009 06:51:08 +0000

> David, if it's not too late I think more credits are needed,
> especially for Willy. He did "a bit" more than testing.
...
> Foreseen-by: Changli Gao <[email protected]>
> Diagnosed-by: Willy Tarreau <[email protected]>
> Reported-by: Willy Tarreau <[email protected]>
> Fixed-by: Jens Axboe <[email protected]>
> Signed-off-by: Jarek Poplawski <[email protected]>

I will be sure to add these before committing, thanks Jarek.

2009-01-19 06:15:00

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote:
> From: Herbert Xu <[email protected]>
> Date: Mon, 19 Jan 2009 14:08:44 +1100
>
> > On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote:
> > >
> > > Just for the record, I've now re-integrated those changes in a test kernel
> > > that I booted on my 10gig machines. I have updated my user-space code in
> > > haproxy to run a new series of tests. Eventhough there is a memcpy(), the
> > > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
> > >
> > > - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> > > (3.2 Gbps at 100% CPU without splice)
> >
> > One thing to note is that Myricom's driver probably uses page
> > frags which means that you're not actually triggering the copy.

So does this mean that the corruption problem should still be there for
such a driver? I'm asking before testing, because at these speeds,
validity tests are not that easy ;-)

> Right.
>
> And this is also the only reason why jumbo MTU worked :-)

What should we expect from other drivers with jumbo frames? Hangs,
corruption, errors, packet loss?

Thanks,
Willy

2009-01-19 06:15:28

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Sun, Jan 18, 2009 at 07:28:15PM -0800, David Miller wrote:
> From: Willy Tarreau <[email protected]>
> Date: Mon, 19 Jan 2009 01:42:06 +0100
>
> > Hi guys,
> >
> > On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
> > > From: Willy Tarreau <[email protected]>
> > > Date: Fri, 16 Jan 2009 00:44:08 +0100
> > >
> > > > And BTW feel free to add my Tested-by if you want in case you merge
> > > > this fix.
> > >
> > > Done, thanks Willy.
> >
> > Just for the record, I've now re-integrated those changes in a test kernel
> > that I booted on my 10gig machines. I have updated my user-space code in
> > haproxy to run a new series of tests. Eventhough there is a memcpy(), the
> > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
> >
> > - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> > (3.2 Gbps at 100% CPU without splice)
> >
> > - 9.2 Gbps at 50% CPU using MTU=1500 with LRO
> >
> > - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without
> > splice)
> >
> > - 10 Gbps at 15% CPU using MTU=9000 with LRO
>
> Thanks for the numbers.
>
> We can almost certainly do a lot better, so if you have the time and
> can get some oprofile dumps for these various cases that would be
> useful to us.

No problem, of course. It's just a matter of time, but if we can push
the numbers further, let's try.

Regards,
Willy

2009-01-19 06:16:26

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: David Miller <[email protected]>
Date: Thu, 15 Jan 2009 15:34:49 -0800 (PST)

> From: Herbert Xu <[email protected]>
> Date: Fri, 16 Jan 2009 10:32:05 +1100
>
> > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > > + unsigned int offset)
> > > +{
> > > + struct page *p = alloc_pages(GFP_KERNEL, 0);
> > > +
> > > + if (!p)
> > > + return NULL;
> > > + memcpy(page_address(p) + offset, page_address(page) + offset, len);
> >
> > This won't work very well if skb->head is longer than a page.
> >
> > We'll need to divide it up into individual pages.
>
> Oh yes the same bug I pointed out the other day.
>
> But Willy can test this patch as-is, since he is not using
> jumbo frames in linear SKBs.

Actually, Herbert, it turns out this case should be OK.

Look a level or two higher, at __splice_segment(), it even has a
comment :-)

--------------------
do {
unsigned int flen = min(*len, plen);

/* the linear region may spread across several pages */
flen = min_t(unsigned int, flen, PAGE_SIZE - poff);

if (spd_fill_page(spd, page, flen, poff, skb, linear))
return 1;

__segment_seek(&page, &poff, &plen, flen);
*len -= flen;

} while (*len && plen);
--------------------

That code should work and do what we need.
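
(A worked example of that loop, assuming PAGE_SIZE == 4096, a
hypothetical 9000-byte linear head that starts 192 bytes into its first
page, and a requested *len large enough to cover it:

    pass 1: flen = min(9000, 4096 - 192) = 3904  -> page 0, offset 192
    pass 2: flen = min(5096, 4096 -   0) = 4096  -> page 1, offset   0
    pass 3: flen = min(1000, 4096 -   0) = 1000  -> page 2, offset   0

so the head is already handed to spd_fill_page()/linear_to_page() in
per-page chunks, each with offset + len <= PAGE_SIZE.)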

2009-01-19 06:19:40

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Willy Tarreau <[email protected]>
Date: Mon, 19 Jan 2009 07:14:20 +0100

> On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote:
> > From: Herbert Xu <[email protected]>
> > Date: Mon, 19 Jan 2009 14:08:44 +1100
> >
> > > One thing to note is that Myricom's driver probably uses page
> > > frags which means that you're not actually triggering the copy.
>
> So does this mean that the corruption problem should still there for
> such a driver ? I'm asking before testing, because at these speeds,
> validity tests are not that easy ;-)

It ought not to, but it seems that is the case where you
saw the original corruptions, so hmmm...

Actually, I see, the myri10ge driver does put up to
64 bytes of the initial packet into the linear area.
If the IPV4 + TCP headers are less than this, you will
hit the corruption case even with the myri10ge driver.

So everything checks out.

> > And this is also the only reason why jumbo MTU worked :-)
>
> What should we expect from other drivers with jumbo frames ? Hangs,
> corruption, errors, packet loss ?

Upon recent review I think jumbo frames in such drivers should
actually be fine.

2009-01-19 06:45:55

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote:
> From: Willy Tarreau <[email protected]>
> Date: Mon, 19 Jan 2009 07:14:20 +0100
>
> > On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote:
> > > From: Herbert Xu <[email protected]>
> > > Date: Mon, 19 Jan 2009 14:08:44 +1100
> > >
> > > > One thing to note is that Myricom's driver probably uses page
> > > > frags which means that you're not actually triggering the copy.
> >
> > So does this mean that the corruption problem should still there for
> > such a driver ? I'm asking before testing, because at these speeds,
> > validity tests are not that easy ;-)
>
> It ought not to, but it seems that is the case where you
> saw the original corruptions, so hmmm...
>
> Actually, I see, the myri10ge driver does put up to
> 64 bytes of the initial packet into the linear area.
> If the IPV4 + TCP headers are less than this, you will
> hit the corruption case even with the myri10ge driver.
>
> So everything checks out.

OK, so I will modify my tools in order to perform a few checks.

Thanks!
Willy

2009-01-19 08:41:19

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Mon, Jan 19, 2009 at 07:14:20AM +0100, Willy Tarreau wrote:
> On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote:
> > From: Herbert Xu <[email protected]>
> > Date: Mon, 19 Jan 2009 14:08:44 +1100
> >
> > > On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote:
> > > >
> > > > Just for the record, I've now re-integrated those changes in a test kernel
> > > > that I booted on my 10gig machines. I have updated my user-space code in
> > > > haproxy to run a new series of tests. Eventhough there is a memcpy(), the
> > > > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
> > > >
> > > > - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> > > > (3.2 Gbps at 100% CPU without splice)
> > >
> > > One thing to note is that Myricom's driver probably uses page
> > > frags which means that you're not actually triggering the copy.
>
> So does this mean that the corruption problem should still there for
> such a driver ? I'm asking before testing, because at these speeds,
> validity tests are not that easy ;-)

I guess David meant the performance: there is not much change because
only a small part gets copied. Jumbo frames in linear-only skbs should
be harmed the most. But no corruption is expected.

Jarek P.

2009-01-19 10:20:31

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote:
>
> Actually, I see, the myri10ge driver does put up to
> 64 bytes of the initial packet into the linear area.
> If the IPV4 + TCP headers are less than this, you will
> hit the corruption case even with the myri10ge driver.

I thought splice only mapped the payload areas, no?

So we should probably test with a non-paged driver to be totally
sure.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-19 10:20:57

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Sun, Jan 18, 2009 at 10:16:03PM -0800, David Miller wrote:
>
> Actually, Herbert, it turns out this case should be OK.
>
> Look a level or two higher, at __splice_segment(), it even has a
> comment :-)

Aha, that should take care of it.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-19 20:59:53

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

From: Herbert Xu <[email protected]>
Date: Mon, 19 Jan 2009 21:19:24 +1100

> On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote:
> >
> > Actually, I see, the myri10ge driver does put up to
> > 64 bytes of the initial packet into the linear area.
> > If the IPV4 + TCP headers are less than this, you will
> > hit the corruption case even with the myri10ge driver.
>
> I thought splice only mapped the payload areas, no?

And the difference between 64 and IPV4+TCP header len becomes the
payload, don't you see? :-)

myri10ge just pulls min(64, skb->len) bytes from the SKB frags into
the linear area, unconditionally. So a small number of payload bytes
can in fact end up there.

Otherwise Willy could never have triggered this bug.
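
As a back-of-the-envelope illustration (not myri10ge code; ip_hdrlen(),
tcp_hdrlen() and skb_headlen() are the stock helpers), the payload bytes
that end up in the linear area are simply:

	/* whatever part of the fixed-length pull extends past the IPv4
	 * and TCP headers is payload sitting in skb->data, i.e. exactly
	 * the region the old virt_to_page() splice path mishandled
	 */
	unsigned int hdr_len = ip_hdrlen(skb) + tcp_hdrlen(skb);
	unsigned int payload_in_linear = skb_headlen(skb) > hdr_len ?
					 skb_headlen(skb) - hdr_len : 0;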

2009-01-19 21:25:56

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Mon, Jan 19, 2009 at 12:59:41PM -0800, David Miller wrote:
> And the difference between 64 and IPV4+TCP header len becomes the
> payload, don't you see? :-)
>
> myri10ge just pulls min(64, skb->len) bytes from the SKB frags into
> the linear area, unconditionally. So a small number of payload bytes
> can in fact end up there.

Ah, I thought they were doing a precise division. Yes pulling
a constant length will trigger it if it's big enough.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-20 08:37:48

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
...
> net: Fix data corruption when splicing from sockets.
>
> From: Jarek Poplawski <[email protected]>
>
> The trick in socket splicing where we try to convert the skb->data
> into a page based reference using virt_to_page() does not work so
> well.
...
> Signed-off-by: David S. Miller <[email protected]>
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 65eac77..56272ac 100644
...

Here is a tiny upgrade to save some memory by reusing a page for more
chunks if possible, which I think could be considered after the
testing of the main patch is finished. (An additional freeing of this
cached page before socket destruction could also be added, maybe in
tcp_splice_read(), if somebody finds a good place for it.)

Thanks,
Jarek P.

---

include/net/sock.h | 4 ++++
net/core/skbuff.c | 32 ++++++++++++++++++++++++++------
net/core/sock.c | 2 ++
net/ipv4/tcp_ipv4.c | 8 ++++++++
4 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 5a3a151..4ded741 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -190,6 +190,8 @@ struct sock_common {
* @sk_user_data: RPC layer private data
* @sk_sndmsg_page: cached page for sendmsg
* @sk_sndmsg_off: cached offset for sendmsg
+ * @sk_splice_page: cached page for splice
+ * @sk_splice_off: cached offset for splice
* @sk_send_head: front of stuff to transmit
* @sk_security: used by security modules
* @sk_mark: generic packet mark
@@ -279,6 +281,8 @@ struct sock {
struct page *sk_sndmsg_page;
struct sk_buff *sk_send_head;
__u32 sk_sndmsg_off;
+ struct page *sk_splice_page;
+ __u32 sk_splice_off;
int sk_write_pending;
#ifdef CONFIG_SECURITY
void *sk_security;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 56272ac..74adc79 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1334,13 +1334,33 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
}

static inline struct page *linear_to_page(struct page *page, unsigned int len,
- unsigned int offset)
+ unsigned int *offset,
+ struct sk_buff *skb)
{
- struct page *p = alloc_pages(GFP_KERNEL, 0);
+ struct sock *sk = skb->sk;
+ struct page *p = sk->sk_splice_page;
+ unsigned int off;

- if (!p)
- return NULL;
- memcpy(page_address(p) + offset, page_address(page) + offset, len);
+ if (!p) {
+new_page:
+ p = sk->sk_splice_page = alloc_pages(sk->sk_allocation, 0);
+ if (!p)
+ return NULL;
+
+ off = sk->sk_splice_off = 0;
+ /* hold this page until it's full or unneeded */
+ get_page(p);
+ } else {
+ off = sk->sk_splice_off;
+ if (off + len > PAGE_SIZE) {
+ put_page(p);
+ goto new_page;
+ }
+ }
+
+ memcpy(page_address(p) + off, page_address(page) + *offset, len);
+ sk->sk_splice_off += len;
+ *offset = off;

return p;
}
@@ -1356,7 +1376,7 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
return 1;

if (linear) {
- page = linear_to_page(page, len, offset);
+ page = linear_to_page(page, len, &offset, skb);
if (!page)
return 1;
} else
diff --git a/net/core/sock.c b/net/core/sock.c
index f3a0d08..6b258a9 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1732,6 +1732,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)

sk->sk_sndmsg_page = NULL;
sk->sk_sndmsg_off = 0;
+ sk->sk_splice_page = NULL;
+ sk->sk_splice_off = 0;

sk->sk_peercred.pid = 0;
sk->sk_peercred.uid = -1;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 19d7b42..cf3d367 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1848,6 +1848,14 @@ void tcp_v4_destroy_sock(struct sock *sk)
sk->sk_sndmsg_page = NULL;
}

+ /*
+ * If splice cached page exists, toss it.
+ */
+ if (sk->sk_splice_page) {
+ __free_page(sk->sk_splice_page);
+ sk->sk_splice_page = NULL;
+ }
+
percpu_counter_dec(&tcp_sockets_allocated);
}

2009-01-20 09:34:21

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

On Tue, Jan 20, 2009 at 08:37:26AM +0000, Jarek Poplawski wrote:
...
> Here is a tiny upgrade to save some memory by reusing a page for more
> chunks if possible, which I think could be considered, after the
> testing of the main patch is finished. (There could be also added an
> additional freeing of this cached page before socket destruction,
> maybe in tcp_splice_read(), if somebody finds good place.)

OOPS! I did it again... Here is better refcounting.

Jarek P.

--- (take 2)

include/net/sock.h | 4 ++++
net/core/skbuff.c | 32 ++++++++++++++++++++++++++------
net/core/sock.c | 2 ++
net/ipv4/tcp_ipv4.c | 8 ++++++++
4 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 5a3a151..4ded741 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -190,6 +190,8 @@ struct sock_common {
* @sk_user_data: RPC layer private data
* @sk_sndmsg_page: cached page for sendmsg
* @sk_sndmsg_off: cached offset for sendmsg
+ * @sk_splice_page: cached page for splice
+ * @sk_splice_off: cached offset for splice
* @sk_send_head: front of stuff to transmit
* @sk_security: used by security modules
* @sk_mark: generic packet mark
@@ -279,6 +281,8 @@ struct sock {
struct page *sk_sndmsg_page;
struct sk_buff *sk_send_head;
__u32 sk_sndmsg_off;
+ struct page *sk_splice_page;
+ __u32 sk_splice_off;
int sk_write_pending;
#ifdef CONFIG_SECURITY
void *sk_security;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 56272ac..02a1a6c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1334,13 +1334,33 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
}

static inline struct page *linear_to_page(struct page *page, unsigned int len,
- unsigned int offset)
+ unsigned int *offset,
+ struct sk_buff *skb)
{
- struct page *p = alloc_pages(GFP_KERNEL, 0);
+ struct sock *sk = skb->sk;
+ struct page *p = sk->sk_splice_page;
+ unsigned int off;

- if (!p)
- return NULL;
- memcpy(page_address(p) + offset, page_address(page) + offset, len);
+ if (!p) {
+new_page:
+ p = sk->sk_splice_page = alloc_pages(sk->sk_allocation, 0);
+ if (!p)
+ return NULL;
+
+ off = sk->sk_splice_off = 0;
+ /* we hold one ref to this page until it's full or unneeded */
+ } else {
+ off = sk->sk_splice_off;
+ if (off + len > PAGE_SIZE) {
+ put_page(p);
+ goto new_page;
+ }
+ }
+
+ memcpy(page_address(p) + off, page_address(page) + *offset, len);
+ sk->sk_splice_off += len;
+ *offset = off;
+ get_page(p);

return p;
}
@@ -1356,7 +1376,7 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
return 1;

if (linear) {
- page = linear_to_page(page, len, offset);
+ page = linear_to_page(page, len, &offset, skb);
if (!page)
return 1;
} else
diff --git a/net/core/sock.c b/net/core/sock.c
index f3a0d08..6b258a9 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1732,6 +1732,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)

sk->sk_sndmsg_page = NULL;
sk->sk_sndmsg_off = 0;
+ sk->sk_splice_page = NULL;
+ sk->sk_splice_off = 0;

sk->sk_peercred.pid = 0;
sk->sk_peercred.uid = -1;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 19d7b42..cf3d367 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1848,6 +1848,14 @@ void tcp_v4_destroy_sock(struct sock *sk)
sk->sk_sndmsg_page = NULL;
}

+ /*
+ * If splice cached page exists, toss it.
+ */
+ if (sk->sk_splice_page) {
+ __free_page(sk->sk_splice_page);
+ sk->sk_splice_page = NULL;
+ }
+
percpu_counter_dec(&tcp_sockets_allocated);
}

2009-01-20 10:00:54

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

Hi Jarek.

On Tue, Jan 20, 2009 at 09:33:52AM +0000, Jarek Poplawski ([email protected]) wrote:
> > Here is a tiny upgrade to save some memory by reusing a page for more
> > chunks if possible, which I think could be considered, after the
> > testing of the main patch is finished. (There could be also added an
> > additional freeing of this cached page before socket destruction,
> > maybe in tcp_splice_read(), if somebody finds good place.)
>
> OOPS! I did it again... Here is better refcounting.
>
> Jarek P.
>
> --- (take 2)
>
> include/net/sock.h | 4 ++++
> net/core/skbuff.c | 32 ++++++++++++++++++++++++++------
> net/core/sock.c | 2 ++
> net/ipv4/tcp_ipv4.c | 8 ++++++++
> 4 files changed, 40 insertions(+), 6 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 5a3a151..4ded741 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -190,6 +190,8 @@ struct sock_common {
> * @sk_user_data: RPC layer private data
> * @sk_sndmsg_page: cached page for sendmsg
> * @sk_sndmsg_off: cached offset for sendmsg
> + * @sk_splice_page: cached page for splice
> + * @sk_splice_off: cached offset for splice

Ugh, increase every socket by 16 bytes... Does TCP one still fit the
page?

--
Evgeniy Polyakov

2009-01-20 10:21:20

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

On Tue, Jan 20, 2009 at 01:00:43PM +0300, Evgeniy Polyakov wrote:
> Hi Jarek.

Hi Evgeniy.
...
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 5a3a151..4ded741 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -190,6 +190,8 @@ struct sock_common {
> > * @sk_user_data: RPC layer private data
> > * @sk_sndmsg_page: cached page for sendmsg
> > * @sk_sndmsg_off: cached offset for sendmsg
> > + * @sk_splice_page: cached page for splice
> > + * @sk_splice_off: cached offset for splice
>
> Ugh, increase every socket by 16 bytes... Does TCP one still fit the
> page?

Good question! Alas I can't check this soon, but if it's really like
this, of course this needs some better idea and rework. (BTW, I'd like
to prevent here as much as possible some strange activities like 1
byte (payload) packets getting full pages without any accounting.)

Thanks,
Jarek P.

2009-01-20 10:31:42

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski ([email protected]) wrote:
> Good question! Alas I can't check this soon, but if it's really like
> this, of course this needs some better idea and rework. (BTW, I'd like
> to prevent here as much as possible some strange activities like 1
> byte (payload) packets getting full pages without any accounting.)

I believe approach to meet all our goals is to have own network memory
allocator, so that each skb could have its payload in the fragments, we
would not suffer from the heavy fragmentation and power-of-two overhead
for the larger MTUs, have a reserve for the OOM condition and generally
do not depend on the main system behaviour.

I will resurrect to some point my network allocator to check how things
go in the modern environment, if no one will beat this idea first :)

1. Network (tree) allocator
http://www.ioremap.net/projects/nta

--
Evgeniy Polyakov

2009-01-20 11:02:13

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

On Tue, Jan 20, 2009 at 01:31:22PM +0300, Evgeniy Polyakov wrote:
> On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski ([email protected]) wrote:
> > Good question! Alas I can't check this soon, but if it's really like
> > this, of course this needs some better idea and rework. (BTW, I'd like
> > to prevent here as much as possible some strange activities like 1
> > byte (payload) packets getting full pages without any accounting.)
>
> I believe approach to meet all our goals is to have own network memory
> allocator, so that each skb could have its payload in the fragments, we
> would not suffer from the heavy fragmentation and power-of-two overhead
> for the larger MTUs, have a reserve for the OOM condition and generally
> do not depend on the main system behaviour.

100% right! But I guess we need this current fix for -stable, and I'm
a bit worried about safety.

>
> I will resurrect to some point my network allocator to check how things
> go in the modern environment, if no one will beat this idea first :)

I can't see too much beating of ideas around this problem now... I wish
you luck!

>
> 1. Network (tree) allocator
> http://www.ioremap.net/projects/nta
>

Great, I'll try to learn a bit btw.,
Jarek P.

2009-01-20 12:01:32

by Ben Mansell

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Willy Tarreau wrote:
> Hi guys,
>
> On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
>> From: Willy Tarreau <[email protected]>
>> Date: Fri, 16 Jan 2009 00:44:08 +0100
>>
>>> And BTW feel free to add my Tested-by if you want in case you merge
>>> this fix.
>> Done, thanks Willy.
>
> Just for the record, I've now re-integrated those changes in a test kernel
> that I booted on my 10gig machines. I have updated my user-space code in
> haproxy to run a new series of tests. Eventhough there is a memcpy(), the
> results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
>
> - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> (3.2 Gbps at 100% CPU without splice)
>
> - 9.2 Gbps at 50% CPU using MTU=1500 with LRO
>
> - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without
> splice)
>
> - 10 Gbps at 15% CPU using MTU=9000 with LRO
>
> These last ones are really impressive. While I had already observed such
> performance on the Myri10GE with Tux, it's the first time I can reach that
> level with so little CPU usage in haproxy !
>
> So I think that the memcpy() workaround might be a non-issue for some time.
> I agree it's not beautiful but it works pretty well for now.
>
> The 3 patches I used on top of 2.6.27.10 were the fix to return 0 intead of
> -EAGAIN on end of read, the one to process multiple skbs at once, and Dave's
> last patch based on Jarek's workaround for the corruption issue.

I've also tested on the same three patches (against 2.6.27.2 here), and
the patches appear to work just fine. I'm running a similar proxy
benchmark test to Willy, on a machine with 4 gigabit NICs (2xtg3,
2xforcedeth). splice is working OK now, although I get identical results
when using splice() or read()/write(): 2.4 Gbps at 100% CPU (2% user,
98% system).

I may be hitting a h/w limitation which prevents any higher throughput,
but I'm a little surprised that splice() didn't use less CPU time.
Anyway, the splice code is working which is the important part!


Ben

2009-01-20 12:11:27

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Tue, Jan 20, 2009 at 12:01:13PM +0000, Ben Mansell ([email protected]) wrote:
> I've also tested on the same three patches (against 2.6.27.2 here), and
> the patches appear to work just fine. I'm running a similar proxy
> benchmark test to Willy, on a machine with 4 gigabit NICs (2xtg3,
> 2xforcedeth). splice is working OK now, although I get identical results
> when using splice() or read()/write(): 2.4 Gbps at 100% CPU (2% user,
> 98% system).

With a small MTU, or when the driver does not support fragmented
allocation (IIRC at least forcedeth does not), the skb will contain all
the data in the linear part and thus it will be copied in the kernel.
read()/write() effectively does the same, but in userspace.
This should only affect splice usage which involves socket->pipe data
transfer.
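
As a quick way to see how much of a given skb takes the copying path
versus the by-reference path in skb_splice_bits() (an illustrative
fragment only, not a proposed change):

	unsigned int copied   = skb_headlen(skb);   /* linear part: memcpy'd
						     * by linear_to_page() */
	unsigned int by_ref   = skb->len - copied;  /* page frags: spliced by
						     * reference via get_page() */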

> I may be hitting a h/w limitation which prevents any higher throughput,
> but I'm a little surprised that splice() didn't use less CPU time.
> Anyway, the splice code is working which is the important part!

Does splice without the patches (but with the performance improvement
for non-blocking splice) have the same performance? It does not copy
data, but you may hit the data corruption. If the performance is the
same, this may indeed be a HW limitation.

--
Evgeniy Polyakov

2009-01-20 13:44:13

by Ben Mansell

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Evgeniy Polyakov wrote:
> On Tue, Jan 20, 2009 at 12:01:13PM +0000, Ben Mansell ([email protected]) wrote:
>> I've also tested on the same three patches (against 2.6.27.2 here), and
>> the patches appear to work just fine. I'm running a similar proxy
>> benchmark test to Willy, on a machine with 4 gigabit NICs (2xtg3,
>> 2xforcedeth). splice is working OK now, although I get identical results
>> when using splice() or read()/write(): 2.4 Gbps at 100% CPU (2% user,
>> 98% system).
>
> With small MTU or when driver does not support fragmented allocation
> (iirc at least forcedeth does not) skb will contain all the data in the
> linear part and thus will be copied in the kernel. read()/write() does
> effectively the same, but in userspace.
> This should only affect splice usage which involves socket->pipe data
> transfer.

I'll try with some larger MTUs and see if that helps - it should also
give an improvement if I'm hitting a limit on the number of
packets/second that the cards can process, regardless of splice...

>> I may be hitting a h/w limitation which prevents any higher throughput,
>> but I'm a little surprised that splice() didn't use less CPU time.
>> Anyway, the splice code is working which is the important part!
>
> Does splice without patches (but with performance improvement for
> non-blocking splice) has the same performance? It does not copy data,
> but you may hit the data corruption? If performance is the same, this
> maybe indeed HW limitation.

With an unpatched kernel, the splice performance was worse (due to the
one packet per-splice issues). With the small patch to fix that, I was
getting around 2 Gbps performance, although oddly enough, I could only
get 2 Gbps with read()/write() then as well...

I'll try and do some tests on a machine that hopefully doesn't have the
bottlenecks (and one that uses different NICs)


Ben

2009-01-20 14:06:31

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Tue, Jan 20, 2009 at 01:43:52PM +0000, Ben Mansell wrote:
...
> With an unpatched kernel, the splice performance was worse (due to the
> one packet per-splice issues). With the small patch to fix that, I was
> getting around 2 Gbps performance, although oddly enough, I could only
> get 2 Gbps with read()/write() then as well...
>
> I'll try and do some tests on a machine that hopefully doesn't have the
> bottlenecks (and one that uses different NICs)

I guess you should especially check if SG and checksums are on, and it
could depend on a chip within those NICs as well.

Jarek P.

2009-01-20 17:16:27

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

From: Jarek Poplawski <[email protected]>
Date: Tue, 20 Jan 2009 11:01:44 +0000

> On Tue, Jan 20, 2009 at 01:31:22PM +0300, Evgeniy Polyakov wrote:
> > On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski ([email protected]) wrote:
> > > Good question! Alas I can't check this soon, but if it's really like
> > > this, of course this needs some better idea and rework. (BTW, I'd like
> > > to prevent here as much as possible some strange activities like 1
> > > byte (payload) packets getting full pages without any accounting.)
> >
> > I believe approach to meet all our goals is to have own network memory
> > allocator, so that each skb could have its payload in the fragments, we
> > would not suffer from the heavy fragmentation and power-of-two overhead
> > for the larger MTUs, have a reserve for the OOM condition and generally
> > do not depend on the main system behaviour.
>
> 100% right! But I guess we need this current fix for -stable, and I'm
> a bit worried about safety.

Jarek, we already have a page and offset you can use.

It's called sk_sndmsg_page but that is just the (current) name.
Nothing prevents you from reusing it for your purposes here.

2009-01-21 09:54:24

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

On Tue, Jan 20, 2009 at 09:16:16AM -0800, David Miller wrote:
...
> Jarek, we already have a page and offset you can use.
>
> It's called sk_sndmsg_page but that is just the (current) name.
> Nothing prevents you from reusing it for your purposes here.

I'm trying to get some know-how about this field.

Thanks,
Jarek P.

2009-01-22 09:05:28

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH v3] tcp: splice as many packets as possible at once

On Tue, Jan 20, 2009 at 09:16:16AM -0800, David Miller wrote:
> From: Jarek Poplawski <[email protected]>
> Date: Tue, 20 Jan 2009 11:01:44 +0000
>
> > On Tue, Jan 20, 2009 at 01:31:22PM +0300, Evgeniy Polyakov wrote:
> > > On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski ([email protected]) wrote:
> > > > Good question! Alas I can't check this soon, but if it's really like
> > > > this, of course this needs some better idea and rework. (BTW, I'd like
> > > > to prevent here as much as possible some strange activities like 1
> > > > byte (payload) packets getting full pages without any accounting.)
> > >
> > > I believe approach to meet all our goals is to have own network memory
> > > allocator, so that each skb could have its payload in the fragments, we
> > > would not suffer from the heavy fragmentation and power-of-two overhead
> > > for the larger MTUs, have a reserve for the OOM condition and generally
> > > do not depend on the main system behaviour.
> >
> > 100% right! But I guess we need this current fix for -stable, and I'm
> > a bit worried about safety.
>
> Jarek, we already have a page and offset you can use.
>
> It's called sk_sndmsg_page but that is just the (current) name.
> Nothing prevents you from reusing it for your purposes here.

It seems this sk_sndmsg_page usage (refcounting) isn't consistent.
I used here tcp_sndmsg() way, but I think I'll go back to this question
soon.

Thanks,
Jarek P.

------------> take 3

net: Optimize memory usage when splicing from sockets.

The recent fix of data corruption when splicing from sockets uses
memory very inefficiently, allocating a new page to copy each chunk of
the linear part of an skb. This patch reuses the same page until it's
(almost) full by caching it in the sk_sndmsg_page field.

With changes from David S. Miller <[email protected]>

Signed-off-by: Jarek Poplawski <[email protected]>

Tested-by: needed...
---

net/core/skbuff.c | 45 +++++++++++++++++++++++++++++++++++----------
1 files changed, 35 insertions(+), 10 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2e5f2ca..2e64c1b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1333,14 +1333,39 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
put_page(spd->pages[i]);
}

-static inline struct page *linear_to_page(struct page *page, unsigned int len,
- unsigned int offset)
-{
- struct page *p = alloc_pages(GFP_KERNEL, 0);
+static inline struct page *linear_to_page(struct page *page, unsigned int *len,
+ unsigned int *offset,
+ struct sk_buff *skb)
+{
+ struct sock *sk = skb->sk;
+ struct page *p = sk->sk_sndmsg_page;
+ unsigned int off;
+
+ if (!p) {
+new_page:
+ p = sk->sk_sndmsg_page = alloc_pages(sk->sk_allocation, 0);
+ if (!p)
+ return NULL;

- if (!p)
- return NULL;
- memcpy(page_address(p) + offset, page_address(page) + offset, len);
+ off = sk->sk_sndmsg_off = 0;
+ /* hold one ref to this page until it's full */
+ } else {
+ unsigned int mlen;
+
+ off = sk->sk_sndmsg_off;
+ mlen = PAGE_SIZE - off;
+ if (mlen < 64 && mlen < *len) {
+ put_page(p);
+ goto new_page;
+ }
+
+ *len = min_t(unsigned int, *len, mlen);
+ }
+
+ memcpy(page_address(p) + off, page_address(page) + *offset, *len);
+ sk->sk_sndmsg_off += *len;
+ *offset = off;
+ get_page(p);

return p;
}
@@ -1349,21 +1374,21 @@ static inline struct page *linear_to_page(struct page *page, unsigned int len,
* Fill page/offset/length into spd, if it can hold more pages.
*/
static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
- unsigned int len, unsigned int offset,
+ unsigned int *len, unsigned int offset,
struct sk_buff *skb, int linear)
{
if (unlikely(spd->nr_pages == PIPE_BUFFERS))
return 1;

if (linear) {
- page = linear_to_page(page, len, offset);
+ page = linear_to_page(page, len, &offset, skb);
if (!page)
return 1;
} else
get_page(page);

spd->pages[spd->nr_pages] = page;
- spd->partial[spd->nr_pages].len = len;
+ spd->partial[spd->nr_pages].len = *len;
spd->partial[spd->nr_pages].offset = offset;
spd->nr_pages++;

@@ -1405,7 +1430,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
/* the linear region may spread across several pages */
flen = min_t(unsigned int, flen, PAGE_SIZE - poff);

- if (spd_fill_page(spd, page, flen, poff, skb, linear))
+ if (spd_fill_page(spd, page, &flen, poff, skb, linear))
return 1;

__segment_seek(&page, &poff, &plen, flen);

2009-01-24 21:24:03

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Hi David,

On Sun, Jan 18, 2009 at 07:28:15PM -0800, David Miller wrote:
> From: Willy Tarreau <[email protected]>
> Date: Mon, 19 Jan 2009 01:42:06 +0100
>
> > Hi guys,
> >
> > On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
> > > From: Willy Tarreau <[email protected]>
> > > Date: Fri, 16 Jan 2009 00:44:08 +0100
> > >
> > > > And BTW feel free to add my Tested-by if you want in case you merge
> > > > this fix.
> > >
> > > Done, thanks Willy.
> >
> > Just for the record, I've now re-integrated those changes in a test kernel
> > that I booted on my 10gig machines. I have updated my user-space code in
> > haproxy to run a new series of tests. Eventhough there is a memcpy(), the
> > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
> >
> > - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> > (3.2 Gbps at 100% CPU without splice)
> >
> > - 9.2 Gbps at 50% CPU using MTU=1500 with LRO
> >
> > - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without
> > splice)
> >
> > - 10 Gbps at 15% CPU using MTU=9000 with LRO
>
> Thanks for the numbers.
>
> We can almost certainly do a lot better, so if you have the time and
> can get some oprofile dumps for these various cases that would be
> useful to us.

Well, I tried to get oprofile to work on those machines, but I'm surely
missing something, as "opreport" always complains :
opreport error: No sample file found: try running opcontrol --dump
or specify a session containing sample files

I've straced opcontrol --dump, opcontrol --stop, and I see nothing
creating any file in the samples directory. I thought it would be
opjitconv which would do it, but it's hard to debug, and I haven't
used oprofile for 2-3 years now. I followed exactly the instructions
in the kernel doc, as well as a howto found on the net, but I remain
out of luck. I just see a "complete_dump" file created with only two
bytes. It's not easy to debug on those machines as they're diskless
and PXE-booted from squashfs root images.

However, upon Ingo's suggestion I tried his perfcounters while
running a test at 5 Gbps with haproxy running alone on one core,
and IRQs on the other one. No LRO was used and MTU was 1500.

For instance, kerneltop tells how many CPU cycles are spent in each
function :

# kerneltop -e 0 -d 1 -c 1000000 -C 1

events RIP kernel function
______ ______ ________________ _______________

1864.00 - 00000000f87be000 : init_module [myri10ge]
1078.00 - 00000000784a6580 : tcp_read_sock
901.00 - 00000000784a7840 : tcp_sendpage
857.00 - 0000000078470be0 : __skb_splice_bits
617.00 - 00000000784b2260 : tcp_transmit_skb
604.00 - 000000007849fdf0 : __ip_local_out
581.00 - 0000000078470460 : __copy_skb_header
569.00 - 000000007850cac0 : _spin_lock [myri10ge]
472.00 - 0000000078185150 : __slab_free
443.00 - 000000007850cc10 : _spin_lock_bh [sky2]
434.00 - 00000000781852e0 : __slab_alloc
408.00 - 0000000078488620 : __qdisc_run
355.00 - 0000000078185b20 : kmem_cache_free
352.00 - 0000000078472950 : __alloc_skb
348.00 - 00000000784705f0 : __skb_clone
337.00 - 0000000078185870 : kmem_cache_alloc [myri10ge]
302.00 - 0000000078472150 : skb_release_data
297.00 - 000000007847bcf0 : dev_queue_xmit
285.00 - 00000000784a08f0 : ip_queue_xmit

You should ignore the init_module, _spin_lock, etc. lines, in fact all
lines indicating a module, as there's something wrong there: they
always report the name of the last module loaded, and the name changes
when the module is unloaded.

I also tried dumping the number of cache misses per function :

------------------------------------------------------------------------------
KernelTop: 1146 irqs/sec [NMI, 1000 cache-misses], (all, cpu: 1)
------------------------------------------------------------------------------

events RIP kernel function
______ ______ ________________ _______________

7512.00 - 00000000784a6580 : tcp_read_sock
2716.00 - 0000000078470be0 : __skb_splice_bits
2516.00 - 00000000784a08f0 : ip_queue_xmit
986.00 - 00000000784a7840 : tcp_sendpage
587.00 - 00000000781a40c0 : sys_splice
462.00 - 00000000781856b0 : kfree [myri10ge]
451.00 - 000000007849fdf0 : __ip_local_out
242.00 - 0000000078185b20 : kmem_cache_free
205.00 - 00000000784b1b90 : __tcp_select_window
153.00 - 000000007850cac0 : _spin_lock [myri10ge]
142.00 - 000000007849f6a0 : ip_fragment
119.00 - 0000000078188950 : rw_verify_area
117.00 - 00000000784a99e0 : tcp_rcv_space_adjust
107.00 - 000000007850cc10 : _spin_lock_bh [sky2]
104.00 - 00000000781852e0 : __slab_alloc

There are other options to combine events, but I don't understand
the output when I enable them.

I think that when properly used, these tools can report useful
information.

Regards,
Willy

2009-01-25 21:04:10

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

Hi David,

On Mon, Jan 19, 2009 at 12:59:41PM -0800, David Miller wrote:
> From: Herbert Xu <[email protected]>
> Date: Mon, 19 Jan 2009 21:19:24 +1100
>
> > On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote:
> > >
> > > Actually, I see, the myri10ge driver does put up to
> > > 64 bytes of the initial packet into the linear area.
> > > If the IPV4 + TCP headers are less than this, you will
> > > hit the corruption case even with the myri10ge driver.
> >
> > I thought splice only mapped the payload areas, no?
>
> And the difference between 64 and IPV4+TCP header len becomes the
> payload, don't you see? :-)
>
> myri10ge just pulls min(64, skb->len) bytes from the SKB frags into
> the linear area, unconditionally. So a small number of payload bytes
> can in fact end up there.
>
> Otherwise Willy could never have triggered this bug.

Just FWIW, I've updated my tools in order to perform content checks more
easily. I cannot reproduce the issue at all with the myri10ge NICs, either
with large frames or with tiny ones (8 bytes).

However, I have noticed that the load is now sensitive to the number of
concurrent sessions. I'm using 2.6.29-rc2 with the perfcounters patches,
and I'm not sure whether the difference in behaviour came with the data
corruption fixes or with the new kernel (which has some profiling options
turned on). Basically, below 800-1000 concurrent sessions, I have no
problem reaching 10 Gbps with LRO and MTU=1500, with about 60% CPU. Above
this number of sessions, the CPU suddenly jumps to 100% and the data rate
drops to about 6.7 Gbps.

I spent a long time trying to figure out what it was, but I think I
have found it. Kerneltop reports different figures above and below the
limit.

1) below the limit :

1429.00 - 00000000784a7840 : tcp_sendpage
561.00 - 00000000784a6580 : tcp_read_sock
485.00 - 00000000f87e13c0 : myri10ge_xmit [myri10ge]
433.00 - 00000000781a40c0 : sys_splice
411.00 - 00000000784a6eb0 : tcp_poll
344.00 - 000000007847bcf0 : dev_queue_xmit
342.00 - 0000000078470be0 : __skb_splice_bits
319.00 - 0000000078472950 : __alloc_skb
310.00 - 0000000078185870 : kmem_cache_alloc
285.00 - 00000000784b2260 : tcp_transmit_skb
285.00 - 000000007850cac0 : _spin_lock
250.00 - 00000000781afda0 : sys_epoll_ctl
238.00 - 000000007810334c : system_call
232.00 - 000000007850ac20 : schedule
230.00 - 000000007850cc10 : _spin_lock_bh
222.00 - 00000000784705f0 : __skb_clone
220.00 - 000000007850cbc0 : _spin_lock_irqsave
213.00 - 00000000784a08f0 : ip_queue_xmit
211.00 - 0000000078185ea0 : __kmalloc_track_caller

2) above the limit :

1778.00 - 00000000784a7840 : tcp_sendpage
1281.00 - 0000000078472950 : __alloc_skb
639.00 - 00000000784a6780 : sk_stream_alloc_skb
507.00 - 0000000078185ea0 : __kmalloc_track_caller
484.00 - 0000000078185870 : kmem_cache_alloc
476.00 - 00000000784a6580 : tcp_read_sock
451.00 - 00000000784a08f0 : ip_queue_xmit
421.00 - 00000000f87e13c0 : myri10ge_xmit [myri10ge]
374.00 - 00000000781852e0 : __slab_alloc
361.00 - 00000000781a40c0 : sys_splice
273.00 - 0000000078470be0 : __skb_splice_bits
231.00 - 000000007850cac0 : _spin_lock
206.00 - 0000000078168b30 : get_pageblock_flags_group
165.00 - 00000000784a0260 : ip_finish_output
165.00 - 00000000784b2260 : tcp_transmit_skb
161.00 - 0000000078470460 : __copy_skb_header
153.00 - 000000007816d6d0 : put_page
144.00 - 000000007850cbc0 : _spin_lock_irqsave
137.00 - 0000000078189be0 : fget_light

The memory allocation clearly is the culprit here. I'll try Jarek's
patch which reduces memory allocation to see if that changes something,
as I'm sure we can do fairly better, given how it behaves with limited
sessions.

Regards,
Willy

PS: this thread is long, if some of the people in CC want to get off
the thread, please complain.

2009-01-26 05:22:20

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v3] tcp: splice as many packets as possible at once

From: Jarek Poplawski <[email protected]>
Date: Thu, 22 Jan 2009 09:04:42 +0000

> It seems this sk_sndmsg_page usage (refcounting) isn't consistent.
> I used here tcp_sndmsg() way, but I think I'll go back to this question
> soon.

Indeed, it is something to look into, as well as locking.

I'll try to find some time for this, thanks Jarek.

2009-01-26 07:59:52

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Sun, Jan 25, 2009 at 10:03:25PM +0100, Willy Tarreau wrote:
...
> The memory allocation clearly is the culprit here. I'll try Jarek's
> patch which reduces memory allocation to see if that changes something,
> as I'm sure we can do fairly better, given how it behaves with limited
> sessions.

I think you are right, but I wonder if it's not better to wait with
more profiling until this splicing is really redone.

Regards,
Jarek P.

2009-01-26 08:13:21

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH] tcp: splice as many packets as possible at once

On Mon, Jan 26, 2009 at 07:59:33AM +0000, Jarek Poplawski wrote:
> On Sun, Jan 25, 2009 at 10:03:25PM +0100, Willy Tarreau wrote:
> ...
> > The memory allocation clearly is the culprit here. I'll try Jarek's
> > patch which reduces memory allocation to see if that changes something,
> > as I'm sure we can do fairly better, given how it behaves with limited
> > sessions.
>
> I think you are right, but I wonder if it's not better to wait with
> more profiling until this splicing is really redone.

Agreed. In fact I have run a few tests even with your patch and I could
see no obvious figure starting to appear. I'll go back to 2.6.27-stable
+ the fixes because I don't really know what I'm testing under 2.6.29-rc2.
Once I'm able to get reproducible reference numbers, I'll test again
with the latest kernel.

Regards,
Willy

2009-01-26 08:20:56

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

On 20-01-2009 11:31, Evgeniy Polyakov wrote:
...
> I believe approach to meet all our goals is to have own network memory
> allocator, so that each skb could have its payload in the fragments, we
> would not suffer from the heavy fragmentation and power-of-two overhead
> for the larger MTUs, have a reserve for the OOM condition and generally
> do not depend on the main system behaviour.
>
> I will resurrect to some point my network allocator to check how things
> go in the modern environment, if no one will beat this idea first :)
>
> 1. Network (tree) allocator
> http://www.ioremap.net/projects/nta

I looked at this a bit, but alas I didn't find much for this Herbert's
idea of payload in fragments/pages. Maybe some kind of API RFC is
needed before this resurrection?

Jarek P.

2009-01-26 21:21:56

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

Hi Jarek.

On Mon, Jan 26, 2009 at 08:20:36AM +0000, Jarek Poplawski ([email protected]) wrote:
> > 1. Network (tree) allocator
> > http://www.ioremap.net/projects/nta
>
> I looked at this a bit, but alas I didn't find much for this Herbert's
> idea of payload in fragments/pages. Maybe some kind of API RFC is
> needed before this resurrection?

Basic idea is to steal some (probably a lot) pages from the slab
allocator and put network buffers there without strict need for
power-of-two alignment and possible wraps when we add skb_shared_info at
the end, so that old e1000 driver required order-4 allocations for the
jumbo frames. We can do that in alloc_skb() and friends and put returned
buffers into skb's fraglist and updated reference counters for those
pages; and with additional copy of the network headers into skb->head.
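
To make that shape concrete, here is a very rough sketch (a hypothetical
helper, not NTA's real API; a single page is used, so the payload is
assumed to fit in PAGE_SIZE):

static struct sk_buff *frag_backed_skb(unsigned int hdr_room,
				       unsigned int payload, gfp_t gfp)
{
	/* payload assumed <= PAGE_SIZE here; a real allocator would
	 * chain several fragments for jumbo frames */
	struct sk_buff *skb = alloc_skb(hdr_room, gfp);
	struct page *page;

	if (!skb)
		return NULL;

	page = alloc_page(gfp);
	if (!page) {
		kfree_skb(skb);
		return NULL;
	}

	/* headers get copied/pulled into skb->head later on;
	 * the payload itself stays in the page fragment */
	skb_fill_page_desc(skb, 0, page, 0, payload);
	skb->len      += payload;
	skb->data_len += payload;
	skb->truesize += payload;

	return skb;
}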

--
Evgeniy Polyakov

2009-01-27 06:11:18

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

From: Evgeniy Polyakov <[email protected]>
Date: Tue, 27 Jan 2009 00:21:30 +0300

> Hi Jarek.
>
> On Mon, Jan 26, 2009 at 08:20:36AM +0000, Jarek Poplawski ([email protected]) wrote:
> > > 1. Network (tree) allocator
> > > http://www.ioremap.net/projects/nta
> >
> > I looked at this a bit, but alas I didn't find much for this Herbert's
> > idea of payload in fragments/pages. Maybe some kind of API RFC is
> > needed before this resurrection?
>
> Basic idea is to steal some (probably a lot) pages from the slab
> allocator and put network buffers there without strict need for
> power-of-two alignment and possible wraps when we add skb_shared_info at
> the end, so that old e1000 driver required order-4 allocations for the
> jumbo frames. We can do that in alloc_skb() and friends and put returned
> buffers into skb's fraglist and updated reference counters for those
> pages; and with additional copy of the network headers into skb->head.

We are going back and forth saying the same thing, I think :-)
(BTW, I think NTA is cool and we might do something like that
eventually)

The basic thing we have to do is make the drivers receive into
pages, and then slide the network headers (only) into the linear
SKB data area.

Even for drivers like NIU and myri10ge that do this, they only
use heuristics or some fixed minimum to decide how much to
move to the linear area.

Result? Some data payload bits end up there because it overshoots.

Since we have pskb_may_pull() calls everywhere necessary, which
means not in eth_type_trans(), we could just make these drivers
(and future drivers converted to operate in this way) only
put the ethernet headers there initially.

Then the rest of the stack will take care of moving the network
and transport payloads there, as necessary.

I bet it won't even hurt latency or routing/firewall performance.

I did test this with the NIU driver at one point, and it did not
change TCP latency nor throughput at all even at 10g speeds.
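
For illustration, the receive pattern being described might look roughly
like this (a hand-written sketch, not NIU's or myri10ge's actual code;
my_rx_one() and its arguments are made up, the 128-byte linear room is an
arbitrary choice, and the page is assumed to be a mapped lowmem page so
page_address() works):

static void my_rx_one(struct net_device *dev, struct page *page,
		      unsigned int len)
{
	struct sk_buff *skb;

	/* room for later pskb_may_pull()s so they don't reallocate,
	 * but only the Ethernet header is actually placed here */
	skb = netdev_alloc_skb(dev, 128 + NET_IP_ALIGN);
	if (!skb) {
		put_page(page);
		return;
	}
	skb_reserve(skb, NET_IP_ALIGN);

	/* copy only the Ethernet header into the linear area... */
	memcpy(skb_put(skb, ETH_HLEN), page_address(page), ETH_HLEN);

	/* ...and leave everything else, IP/TCP headers included, in the
	 * page fragment; pskb_may_pull() higher up pulls what it needs */
	skb_fill_page_desc(skb, 0, page, ETH_HLEN, len - ETH_HLEN);
	skb->len      += len - ETH_HLEN;
	skb->data_len += len - ETH_HLEN;
	skb->truesize += len - ETH_HLEN;

	skb->protocol = eth_type_trans(skb, dev);
	netif_receive_skb(skb);
}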

2009-01-27 07:14:11

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH v3] tcp: splice as many packets as possible at once

David Miller <[email protected]> wrote:
> From: Jarek Poplawski <[email protected]>
> Date: Thu, 22 Jan 2009 09:04:42 +0000
>
>> It seems this sk_sndmsg_page usage (refcounting) isn't consistent.
>> I used here tcp_sndmsg() way, but I think I'll go back to this question
>> soon.
>
> Indeed, it is something to look into, as well as locking.
>
> I'll try to find some time for this, thanks Jarek.

After a quick look it seems to be OK to me. The code in the patch
is called from tcp_splice_read, which holds the socket lock. So as
long as the patch uses the usual TCP convention it should work.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-27 07:41:16

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

On Mon, Jan 26, 2009 at 10:10:56PM -0800, David Miller wrote:
> From: Evgeniy Polyakov <[email protected]>
> Date: Tue, 27 Jan 2009 00:21:30 +0300
>
> > Hi Jarek.
> >
> > On Mon, Jan 26, 2009 at 08:20:36AM +0000, Jarek Poplawski ([email protected]) wrote:
> > > > 1. Network (tree) allocator
> > > > http://www.ioremap.net/projects/nta
> > >
> > > I looked at this a bit, but alas I didn't find much for this Herbert's
> > > idea of payload in fragments/pages. Maybe some kind of API RFC is
> > > needed before this resurrection?
> >
> > Basic idea is to steal some (probably a lot) pages from the slab
> > allocator and put network buffers there without strict need for
> > power-of-two alignment and possible wraps when we add skb_shared_info at
> > the end, so that old e1000 driver required order-4 allocations for the
> > jumbo frames. We can do that in alloc_skb() and friends and put returned
> > buffers into skb's fraglist and updated reference counters for those
> > pages; and with additional copy of the network headers into skb->head.

I think the main problem is to respect put_page() more, and maybe you
mean to add this to your allocator too, but using slab pages for this
looks a bit complex to me, but I can miss something.

> We are going back and forth saying the same thing, I think :-)
> (BTW, I think NTA is cool and we might do something like that
> eventually)
>
> The basic thing we have to do is make the drivers receive into
> pages, and then slide the network headers (only) into the linear
> SKB data area.

As a matter of fact, I wonder if these headers should always be
separated. Their "chunk" could be refcounted as well, I guess.

Jarek P.

2009-01-27 07:54:35

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH v3] tcp: splice as many packets as possible at once

On Tue, Jan 27, 2009 at 06:11:30PM +1100, Herbert Xu wrote:
> David Miller <[email protected]> wrote:
> > From: Jarek Poplawski <[email protected]>
> > Date: Thu, 22 Jan 2009 09:04:42 +0000
> >
> >> It seems this sk_sndmsg_page usage (refcounting) isn't consistent.
> >> I used here tcp_sndmsg() way, but I think I'll go back to this question
> >> soon.
> >
> > Indeed, it is something to look into, as well as locking.
> >
> > I'll try to find some time for this, thanks Jarek.
>
> After a quick look it seems to be OK to me. The code in the patch
> is called from tcp_splice_read, which holds the socket lock. So as
> long as the patch uses the usual TCP convention it should work.

Yes, but ip_append_data() (and skb_append_datato_frags() for
NETIF_F_UFO only, so currently not a problem), uses this differently,
and these pages in sk->sk_sndmsg_page could leak or be used after
kfree. (I didn't track locking in these other places).

Thanks,
Jarek P.

2009-01-27 10:10:56

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH v3] tcp: splice as many packets as possible at once

On Tue, Jan 27, 2009 at 07:54:18AM +0000, Jarek Poplawski wrote:
>
> Yes, but ip_append_data() (and skb_append_datato_frags() for
> NETIF_F_UFO only, so currently not a problem), uses this differently,
> and these pages in sk->sk_sndmsg_page could leak or be used after
> kfree. (I didn't track locking in these other places).

It'll be freed when the socket is freed so that should be fine.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-27 10:35:33

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH v3] tcp: splice as many packets as possible at once

On Tue, Jan 27, 2009 at 09:09:58PM +1100, Herbert Xu wrote:
> On Tue, Jan 27, 2009 at 07:54:18AM +0000, Jarek Poplawski wrote:
> >
> > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > and these pages in sk->sk_sndmsg_page could leak or be used after
> > kfree. (I didn't track locking in these other places).
>
> It'll be freed when the socket is freed so that should be fine.
>

I don't think so: these places can overwrite sk->sk_sndmsg_page left
after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
pointer without put_page() (they only reference copied chunks and
expect auto freeing). On the other hand, if tcp_sendmsg() reads after
them it could use a pointer after the page is freed, I guess.
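
In miniature, the tcp_sendmsg()-style convention (where the cached
pointer owns a reference of its own) looks like this; an illustrative
sketch only, with "newpage" being a made-up local:

	lock_sock(sk);				/* readers and writers of the
						 * cache hold the socket lock */
	if (sk->sk_sndmsg_page)
		put_page(sk->sk_sndmsg_page);	/* drop the old cached reference,
						 * otherwise that page leaks */
	sk->sk_sndmsg_page = newpage;		/* newpage already carries one
						 * ref owned by the cache */
	sk->sk_sndmsg_off = 0;
	release_sock(sk);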

Cheers,
Jarek P.

2009-01-27 10:58:03

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH v3] tcp: splice as many packets as possible at once

On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
> On Tue, Jan 27, 2009 at 09:09:58PM +1100, Herbert Xu wrote:
> > On Tue, Jan 27, 2009 at 07:54:18AM +0000, Jarek Poplawski wrote:
> > >
> > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > kfree. (I didn't track locking in these other places).
> >
> > It'll be freed when the socket is freed so that should be fine.
> >
>
> I don't think so: these places can overwrite sk->sk_sndmsg_page left
> after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> pointer without put_page() (they only reference copied chunks and
> expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> them it could use a pointer after the page is freed, I guess.

tcp_v4_destroy_sock() looks vulnerable too.

BTW, skb_append_datato_frags() currently doesn't need to use this
sk->sk_sndmsg_page at all - it doesn't use caching between calls.

Jarek P.

2009-01-27 11:48:53

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH v3] tcp: splice as many packets as possible at once

On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
>
> > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > kfree. (I didn't track locking in these other places).
> >
> > It'll be freed when the socket is freed so that should be fine.
>
> I don't think so: these places can overwrite sk->sk_sndmsg_page left
> after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> pointer without put_page() (they only reference copied chunks and
> expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> them it could use a pointer after the page is freed, I guess.

I wasn't referring to the first part of your sentence. That can't
happen because they're only used for UDP sockets, this is a TCP
socket.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-27 12:17:17

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH v3] tcp: splice as many packets as possible at once

On Tue, Jan 27, 2009 at 10:48:05PM +1100, Herbert Xu wrote:
> On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
> >
> > > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > > kfree. (I didn't track locking in these other places).
> > >
> > > It'll be freed when the socket is freed so that should be fine.
> >
> > I don't think so: these places can overwrite sk->sk_sndmsg_page left
> > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> > pointer without put_page() (they only reference copied chunks and
> > expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> > them it could use a pointer after the page is freed, I guess.
>
> I wasn't referring to the first part of your sentence. That can't
> happen because they're only used for UDP sockets, this is a TCP
> socket.

Do you mean this part from ip_append_data() isn't used for TCP?:

1007
1008 if (page && (left = PAGE_SIZE - off) > 0) {
1009 if (copy >= left)
1010 copy = left;
1011 if (page != frag->page) {
1012 if (i == MAX_SKB_FRAGS) {
1013 err = -EMSGSIZE;
1014 goto error;
1015 }
1016 get_page(page);
1017 skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0);
1018 frag = &skb_shinfo(skb)->frags[i];
1019 }
1020 } else if (i < MAX_SKB_FRAGS) {
1021 if (copy > PAGE_SIZE)
1022 copy = PAGE_SIZE;
1023 page = alloc_pages(sk->sk_allocation, 0);
1024 if (page == NULL) {
1025 err = -ENOMEM;
1026 goto error;
1027 }
1028 sk->sk_sndmsg_page = page;
1029 sk->sk_sndmsg_off = 0;
1030
1031 skb_fill_page_desc(skb, i, page, 0, 0);
1032 frag = &skb_shinfo(skb)->frags[i];
1033 } else {

Jarek P.

2009-01-27 12:31:39

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH v3] tcp: splice as many packets as possible at once

On Tue, Jan 27, 2009 at 12:16:42PM +0000, Jarek Poplawski wrote:
> On Tue, Jan 27, 2009 at 10:48:05PM +1100, Herbert Xu wrote:
> > On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
> > >
> > > > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > > > kfree. (I didn't track locking in these other places).
> > > >
> > > > It'll be freed when the socket is freed so that should be fine.
> > >
> > > I don't think so: these places can overwrite sk->sk_sndmsg_page left
> > > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> > > pointer without put_page() (they only reference copied chunks and
> > > expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> > > them it could use a pointer after the page is freed, I guess.
> >
> > I wasn't referring to the first part of your sentence. That can't
> > happen because they're only used for UDP sockets, this is a TCP
> > socket.
>
> Do you mean this part from ip_append_data() isn't used for TCP?:

Actually, the beginning part of ip_append_data() should be enough too.
So I guess I missed your point...

Jarek P.

2009-01-27 17:07:10

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v3] tcp: splice as many packets as possible at once

From: Jarek Poplawski <[email protected]>
Date: Tue, 27 Jan 2009 12:31:11 +0000

> On Tue, Jan 27, 2009 at 12:16:42PM +0000, Jarek Poplawski wrote:
> > On Tue, Jan 27, 2009 at 10:48:05PM +1100, Herbert Xu wrote:
> > > On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
> > > >
> > > > > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > > > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > > > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > > > > kfree. (I didn't track locking in these other places).
> > > > >
> > > > > It'll be freed when the socket is freed so that should be fine.
> > > >
> > > > I don't think so: these places can overwrite sk->sk_sndmsg_page left
> > > > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> > > > pointer without put_page() (they only reference copied chunks and
> > > > expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> > > > them it could use a pointer after the page is freed, I guess.
> > >
> > > I wasn't referring to the first part of your sentence. That can't
> > > happen because they're only used for UDP sockets, this is a TCP
> > > socket.
> >
> > Do you mean this part from ip_append_data() isn't used for TCP?:
>
> Actually, the beginning part of ip_append_data() should be enough too.
> So I guess I missed your point...

TCP doesn't use ip_append_data(), period.

2009-01-27 18:42:33

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

From: David Miller <[email protected]>
Date: Mon, 26 Jan 2009 22:10:56 -0800 (PST)

> Even for drivers like NIU and myri10ge that do this, they only
> use heuristics or some fixed minimum to decide how much to
> move to the linear area.
>
> Result? Some data payload bits end up there because it overshoots.
...
> I did test this with the NIU driver at one point, and it did not
> change TCP latency nor throughput at all even at 10g speeds.

As a followup, it turns out that NIU right now does this properly.

It only pulls a maximum of ETH_HLEN into the linear area before giving
the SKB to netif_receive_skb().

2009-01-28 08:11:21

by Jarek Poplawski

[permalink] [raw]
Subject: Re: [PATCH v3] tcp: splice as many packets as possible at once

On Tue, Jan 27, 2009 at 09:06:51AM -0800, David Miller wrote:
> From: Jarek Poplawski <[email protected]>
> Date: Tue, 27 Jan 2009 12:31:11 +0000
>
> > On Tue, Jan 27, 2009 at 12:16:42PM +0000, Jarek Poplawski wrote:
> > > On Tue, Jan 27, 2009 at 10:48:05PM +1100, Herbert Xu wrote:
> > > > On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
> > > > >
> > > > > > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > > > > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > > > > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > > > > > kfree. (I didn't track locking in these other places).
> > > > > >
> > > > > > It'll be freed when the socket is freed so that should be fine.
> > > > >
> > > > > I don't think so: these places can overwrite sk->sk_sndmsg_page left
> > > > > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> > > > > pointer without put_page() (they only reference copied chunks and
> > > > > expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> > > > > them it could use a pointer after the page is freed, I guess.
> > > >
> > > > I wasn't referring to the first part of your sentence. That can't
> > > > happen because they're only used for UDP sockets, this is a TCP
> > > > socket.
> > >
> > > Do you mean this part from ip_append_data() isn't used for TCP?:
> >
> > Actually, the beginning part of ip_append_data() should be enough too.
> > So I guess I missed your point...
>
> TCP doesn't use ip_append_data(), period.

Hmm... I see: TCP does use ip_send_reply(), so ip_append_data() too,
but with a special socket.

Thanks for the explanations,
Jarek P.

2009-01-30 21:42:42

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

From: Jarek Poplawski <[email protected]>
Date: Tue, 27 Jan 2009 07:40:48 +0000

> I think the main problem is to respect put_page() more, and maybe you
> mean to add this to your allocator too, but using slab pages for this
> looks a bit complex to me, but I may be missing something.

Hmmm, Jarek's comments here made me realize that we might be
able to do some hack in cooperation with SLAB.

Basically the idea is that if the page count of a SLAB page
is greater than one, SLAB will not use that page for new
allocations.

It's cheesy and the SLAB developers will likely barf at the
idea, but it would certainly work.

Back to real life, I think long term the thing to do is to just do the
cached page allocator thing we'll be doing after Jarek's socket page
patch is integrated, and for best performance the driver has to
receive its data into pages, only explicitly pulling the ethernet
header into the linear area, like NIU does.
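
As a rough sketch of that cached page allocator idea (the
refill_cached_page() helper below is hypothetical; page_count(),
alloc_page() and put_page() are the real primitives), the cached page
would only be reused once the socket is its sole owner again:

#include <linux/gfp.h>
#include <linux/mm.h>

/* Sketch: reuse the cached page only when nobody else (skb frags,
 * splice pipe buffers, ...) still holds a reference to it. */
static struct page *refill_cached_page(struct page **cache)
{
	struct page *page = *cache;

	if (page && page_count(page) == 1)
		return page;		/* sole owner again: safe to reuse */

	if (page)
		put_page(page);		/* still shared: drop only our reference */

	page = alloc_page(GFP_KERNEL);	/* may return NULL; caller must check */
	*cache = page;
	return page;
}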

2009-01-30 22:00:21

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
> From: Jarek Poplawski <[email protected]>
> Date: Tue, 27 Jan 2009 07:40:48 +0000
>
> > I think the main problem is to respect put_page() more, and maybe you
> > mean to add this to your allocator too, but using slab pages for this
> > looks a bit complex to me, but I may be missing something.
>
> Hmmm, Jarek's comments here made me realize that we might be
> able to do some hack in cooperation with SLAB.
>
> Basically the idea is that if the page count of a SLAB page
> is greater than one, SLAB will not use that page for new
> allocations.

I thought it was the standard behaviour. That may explain why I
did not understand much of the previous discussion then :-/

> It's cheesy and the SLAB developers will likely barf at the
> idea, but it would certainly work.

Maybe that would be enough as a definitive fix for a stable
release, so that we can go on with deeper changes in newer
versions ?

> Back to real life, I think long term the thing to do is to just do the
> cached page allocator thing we'll be doing after Jarek's socket page
> patch is integrated, and for best performance the driver has to
> receive its data into pages, only explicitly pulling the ethernet
> header into the linear area, like NIU does.

Are there NICs out there able to do that themselves or does the
driver need to rely on complex hacks in order to achieve this ?

Willy

2009-01-30 22:04:01

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

From: Willy Tarreau <[email protected]>
Date: Fri, 30 Jan 2009 22:59:20 +0100

> On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
> > It's cheesy and the SLAB developers will likely barf at the
> > idea, but it would certainly work.
>
> Maybe that would be enough as a definitive fix for a stable
> release, so that we can go on with deeper changes in newer
> versions ?

Such a check could have performance ramifications; I wouldn't
risk it, and I already intend to push Jarek's page allocator
splice fix back to -stable eventually.

> > Back to real life, I think long term the thing to do is to just do the
> > cached page allocator thing we'll be doing after Jarek's socket page
> > patch is integrated, and for best performance the driver has to
> > receive its data into pages, only explicitly pulling the ethernet
> > header into the linear area, like NIU does.
>
> Are there NICs out there able to do that themselves or does the
> driver need to rely on complex hacks in order to achieve this ?

Any NIC, even the dumbest ones, can be made to receive into pages.

2009-01-30 22:14:28

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

On Fri, Jan 30, 2009 at 02:03:46PM -0800, David Miller wrote:
> From: Willy Tarreau <[email protected]>
> Date: Fri, 30 Jan 2009 22:59:20 +0100
>
> > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
> > > It's cheesy and the SLAB developers will likely barf at the
> > > idea, but it would certainly work.
> >
> > Maybe that would be enough as a definitive fix for a stable
> > release, so that we can go on with deeper changes in newer
> > versions ?
>
> Such a check could have performance ramifications, I wouldn't
> risk it and already I intend to push Jarek's page allocator
> splice fix back to -stable eventually.

OK.

> > > Back to real life, I think long term the thing to do is to just do the
> > > cached page allocator thing we'll be doing after Jarek's socket page
> > > patch is integrated, and for best performance the driver has to
> > > receive its data into pages, only explicitly pulling the ethernet
> > > header into the linear area, like NIU does.
> >
> > Are there NICs out there able to do that themselves or does the
> > driver need to rely on complex hacks in order to achieve this ?
>
> Any NIC, even the dumbest ones, can be made to receive into pages.

OK, I thought that it was not always easy to split between headers
and payload. I know that myri10ge can be configured to receive into
either skbs or pages, but I was not sure about the real implications
behind that.

Thanks
Willy

2009-01-30 22:15:45

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

From: Willy Tarreau <[email protected]>
Date: Fri, 30 Jan 2009 23:13:46 +0100

> On Fri, Jan 30, 2009 at 02:03:46PM -0800, David Miller wrote:
> > Any NIC, even the dumbest ones, can be made to receive into pages.
>
> OK I thought that it was not always easy to split between headers
> and payload. I know that myri10ge can be configured to receive into
> either skbs or pages, but I was not sure about the real implications
> behind that.

For a dumb NIC you wouldn't split; you'd receive the entire
packet directly into part of a page.

Then right before you give it to the stack, you pull the ethernet
header from the page into the linear area.
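
A minimal sketch of that receive path, assuming the hardware has
already DMA'd the whole frame into (page, offset) in lowmem. The
mynic_rx_one() function and its arguments are hypothetical; the skb
helpers are the standard kernel ones.

#include <linux/etherdevice.h>
#include <linux/mm.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* The whole frame sits in the page; copy only the ethernet header
 * into the skb's linear area and leave the payload in the page. */
static int mynic_rx_one(struct net_device *dev, struct page *page,
			unsigned int offset, unsigned int len)
{
	struct sk_buff *skb;

	skb = netdev_alloc_skb(dev, ETH_HLEN + NET_IP_ALIGN);
	if (!skb) {
		put_page(page);
		return -ENOMEM;
	}
	skb_reserve(skb, NET_IP_ALIGN);

	/* Pull only the ethernet header into the linear area. */
	skb_copy_to_linear_data(skb, page_address(page) + offset, ETH_HLEN);
	skb_put(skb, ETH_HLEN);

	/* The payload stays in the page, referenced as a fragment. */
	skb_fill_page_desc(skb, 0, page, offset + ETH_HLEN, len - ETH_HLEN);
	skb->len += len - ETH_HLEN;
	skb->data_len = len - ETH_HLEN;
	skb->truesize += len - ETH_HLEN;

	skb->protocol = eth_type_trans(skb, dev);
	return netif_receive_skb(skb);
}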

2009-01-30 22:16:59

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH v2] tcp: splice as many packets as possible at once

On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
>
> Hmmm, Jarek's comments here made me realize that we might be
> able to do some hack in cooperation with SLAB.
>
> Basically the idea is that if the page count of a SLAB page
> is greater than one, SLAB will not use that page for new
> allocations.
>
> It's cheesy and the SLAB developers will likely barf at the
> idea, but it would certainly work.

I'm not going anywhere near that discussion :)

> Back to real life, I think long term the thing to do is to just do the
> cached page allocator thing we'll be doing after Jarek's socket page
> patch is integrated, and for best performance the driver has to
> receive its data into pages, only explicitly pulling the ethernet
> header into the linear area, like NIU does.

Yes that sounds like the way to go.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt