Hi,
I found something interesting (at least to me ;) in the Linux TCP stack.
I don't know if it should be regarded as a bug or not, or if it's known.
Anyway, this email is not meant to start a flame war of any kind (test results
are flammable material... ;)
So, it occurs in programs doing packet communication over TCP, when the
peer waits for a packet to send an answer. If they send data with two
write() calls (for example to write the packet header and packet data),
the performance dramatically decreases (down to exactly 100 (2.2.19)
or 25 (2.4.[57]) packet exchanges per second on x86, from several
thousand). 100 seems to be related to the HZ variable; see also the AXP
results, where HZ is 10 times bigger.
Maybe an example of code will tell more:
struct header {
        int cmd;
        int len;
};

void send_packet(int cmd, void *data, int len)
{
        struct header h = { cmd, len };

        write(fd, &h, sizeof(h));
        write(fd, data, len);
}
is, let's say, 300 times slower than:
void send_packet(int cmd, void *data, int len)
{
        struct header h = { cmd, len };
        char tmp[BUFSIZE];

        memcpy(tmp, &h, sizeof(h));
        memcpy(tmp + sizeof(h), data, len);
        write(fd, tmp, len + sizeof(h));
}
when running over loopback. Similar effects can be seen when running over
Ethernet (the condition is that the next packet is requested only after
the first one is received).
I, personally, would expect the first (two-write) version to be at most two
times slower (as there might be a need to send two IP packets instead of one).
Also note that while it is obvious that the version copying to a buffer
on the stack should be faster, it is not so obvious if there is a need to
malloc() the buffer before sending (for example if there is no upper limit
on len). However, even if we need to malloc() the buffer, the single-write
version is still faster by orders of magnitude.
I don't know how many user-space programs this impacts. Probably not many,
as they often use buffering of some kind.
This is true for both 2.2 and 2.4, IPv4 and IPv6. One vs. two writes doesn't
seem to make a big difference for Unix domain sockets, though.
I found it while working on a client/server program that worked horribly
slow just because of this. (Of course I fixed it, but that's not the point.)
I tried to find it in the kernel sources, but probably I didn't try hard enough ;)
I attach a test program and the results of tests on a few different machines.
Test results follow. Please don't read too much into the differences between
2.2 and 2.4, as they might be the result of kernel patches; also, machines
other than roke were under heavy load. The only important thing is that
the two-writes mode works at a *constant* speed, independent of machine speed.
(To make things a bit sweeter, I can tell you that on FreeBSD 4.4-STABLE
fragmented writes go at 10 packets/sec; unfortunately I don't have other
machines to check right now ;)
Linux roke 2.4.7 #3 Wed Oct 3 22:22:24 CEST 2001 i686 pld
cpu MHz : 840.426
model name : AMD Duron(tm) Processor
IPv4
25 packets/sec
31833 packets/sec
IPv6
25 packets/sec
31634 packets/sec
UNIX
66135 packets/sec
77457 packets/sec
Linux roke 2.2.19 #6 Sun Sep 30 20:25:08 CEST 2001 i686 pld
cpu MHz : 840.442
model name : AMD Duron(tm) Processor
IPv4
100 packets/sec
34562 packets/sec
IPv6
100 packets/sec
38555 packets/sec
UNIX
72355 packets/sec
90586 packets/sec
Linux boniek 2.2.19 #2 Tue Mar 27 17:19:45 CEST 2001 alpha pld
IPv4
1024 packets/sec
23351 packets/sec
UNIX
42219 packets/sec
50643 packets/sec
# and more recent 2.4:
Linux kenny 2.4.16 #1 SMP Thu Dec 20 16:16:22 CET 2001 i686 pld
cpu MHz : 699.331
cpu MHz : 699.331
model name : Pentium III (Cascades)
model name : Pentium III (Cascades)
IPv4
25 packets/sec
16965 packets/sec
IPv6
25 packets/sec
14928 packets/sec
UNIX
30111 packets/sec
32143 packets/sec
sparc64/2.2.19 behaves similarly to x86/2.2.19.
--
: Michal ``,/\/\, '' Moskal | | : GCS {C,UL}++++$
: | |alekith @ |)|(| . org . pl : {E--, W, w-,M}-
: Linux: We are dot in .ORG. | : {b,e>+}++ !tv h
: CurProj: ftp://ftp.pld.org.pl/people/malekith/ksi : PLD Team member
Michal Moskal wrote:
>
> So, it occurs in programs doing packet communication over TCP, when the
> peer waits for a packet to send an answer. If they send data with two
> write() calls (for example to write the packet header and packet data),
> the performance dramatically decreases (down to exactly 100 (2.2.19)
> or 25 (2.4.[57]) packet exchanges per second on x86, from several
> thousand). 100 seems to be related to the HZ variable; see also the AXP
> results, where HZ is 10 times bigger.
Try disabling the Nagle algorithm:
int i = 1;
if (setsockopt(fd, SOL_TCP, TCP_NODELAY, &i, sizeof(i)) == -1)
perror("TCP_NODELAY");
Ciao, ET.
On Wed, 2 Jan 2002, Michal Moskal wrote:
> void send_packet(int cmd, void *data, int len)
> {
> struct header h = { cmd, len };
>
> write(fd, &h, sizeof(h));
> write(fd, data, len);
> }
you should look into writev(2).
you might also want to look at this paper
<http://www.isi.edu/~johnh/PAPERS/Heidemann97a.html>, it's probably
similar to the problems you're seeing.
-dean
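A send_packet() along the lines dean suggests might look like the sketch below, which hands the header and the payload to the kernel in a single writev(2) call (the fd parameter and the struct header layout are assumed from the original post):

```c
#include <sys/uio.h>    /* writev, struct iovec */
#include <unistd.h>     /* ssize_t */

struct header {
        int cmd;
        int len;
};

/* Gather the header and payload into one writev() call, so the TCP
 * stack sees a single write and can send one segment immediately. */
ssize_t send_packet(int fd, int cmd, void *data, int len)
{
        struct header h = { cmd, len };
        struct iovec iov[2] = {
                { .iov_base = &h,   .iov_len = sizeof(h)   },
                { .iov_base = data, .iov_len = (size_t)len },
        };

        return writev(fd, iov, 2);
}
```

No copy, no malloc(), and the Nagle interaction from the original post goes away because the kernel receives the whole packet at once.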
On Wed, 2 Jan 2002, dean gaudet wrote:
> On Wed, 2 Jan 2002, Michal Moskal wrote:
>
> > void send_packet(int cmd, void *data, int len)
> > {
> > struct header h = { cmd, len };
> >
> > write(fd, &h, sizeof(h));
> > write(fd, data, len);
> > }
>
> you should look into writev(2).
[SNIPPED...]
First, this isn't "TCP stack behavior...". It's an apparent attempt
to write raw (network?) packets using some kernel primitives. I presume
that you have obtained the fd from either socket() or by opening some
device. Whatever. If you are generating a "packet", you need to
build the packet in a buffer and send the packet. You can't presume
that something will concatenate two separate writes into some
kind of "packet". If the hardware is Ethernet, even the hardware
will fight you, because it puts a destination hardware address,
source hardware address, packet length, data (your packet), and then a
32-bit CRC into the outgoing frame. FYI, that 'data' is where
the TCP/IP datagram lives.
That said, if you are trying to make some kind of "zero-copy" thing,
you need to leave space in the initial allocation for the header and
other overhead. That way, you do one write to the device.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.1 on an i686 machine (797.90 BogoMips).
I was going to compile a list of innovations that could be
attributed to Microsoft. Once I realized that Ctrl-Alt-Del
was handled in the BIOS, I found that there aren't any.
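The headroom approach Dick describes, allocating the payload buffer with room for the header in front so that one write() covers both, might be sketched like this (the function names and header layout here are made up for illustration):

```c
#include <stdlib.h>     /* malloc */
#include <string.h>     /* memcpy */
#include <unistd.h>     /* write, ssize_t */

struct header {
        int cmd;
        int len;
};

/* Allocate a payload buffer with sizeof(struct header) bytes of
 * headroom, so the header can be filled in later without copying
 * the payload again. Returns a pointer to the payload area. */
char *alloc_packet(int len)
{
        char *buf = malloc(sizeof(struct header) + len);

        return buf ? buf + sizeof(struct header) : NULL;
}

/* Fill the header into the headroom in front of the payload and
 * push the whole packet to the fd with a single write(). */
ssize_t send_packet(int fd, int cmd, char *payload, int len)
{
        char *buf = payload - sizeof(struct header);
        struct header h = { cmd, len };

        memcpy(buf, &h, sizeof(h));
        return write(fd, buf, sizeof(h) + len);
}
```

The caller builds its data directly in the buffer returned by alloc_packet(), so the "zero-copy" single write costs nothing extra.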
On Wed, 2 Jan 2002 17:28:06 +0100, Michal Moskal wrote:
>So, it occurs in programs doing packet communication over TCP, when the peer
>waits for a packet to send an answer. If they send data with two write()
>calls (for example to write the packet header and packet data), the
>performance dramatically decreases (down to exactly 100 (2.2.19)
>or 25 (2.4.[57]) packet exchanges per second on x86, from several thousand).
>100 seems to be related to the HZ variable; see also the AXP results, where
>HZ is 10 times bigger.
That's why you should never, ever do anything that stupid. What should the
kernel do? When it sees the first write, it has no idea there's going to be a
second write, so it sends a packet. It gives you the benefit of the doubt and
assumes that you know how to use TCP. When it sees the second write
immediately thereafter and they're both small, it no longer trusts you and it
has no idea there isn't going to be a third write a microsecond later, so it
doesn't send a packet.
>I, personally, would expect the first (two-write) version to be at most two
>times slower (as there might be a need to send two IP packets instead of
>one). Also note that while it is obvious that the version copying to a
>buffer on the stack should be faster, it is not so obvious if there is a
>need to malloc() the buffer before sending (for example if there is no
>upper limit on len). However, even if we need to malloc() the buffer, the
>single-write version is still faster by orders of magnitude.
If you can design an algorithm that makes that only two times slower, then
the world will be excited and interested and perhaps that algorithm will
replace TCP. But until that time, we're stuck with what we have.
>I found it while working on a client/server program that worked horribly
>slow just because of this. (Of course I fixed it, but that's not the point.)
THAT IS THE POINT. The problem wasn't in the kernel, it was in the program,
and you fixed it. If you do smart buffering, TCP can behave efficiently. If
you don't, it has to guess when to send packets, and it can't possibly
predict the future and behave in the way you think is optimum.
How does it know you care about latency rather than throughput? And what
should it do if it sees a steady stream of one byte writes, one every tenth
of a second?
DS
On Wed, Jan 02, 2002 at 01:49:56PM -0800, David Schwartz wrote:
>
> On Wed, 2 Jan 2002 17:28:06 +0100, Michal Moskal wrote:
> >I, personally, would expect the first (two-write) version to be at most
> >two times slower (as there might be a need to send two IP packets instead
> >of one). Also note that while it is obvious that the version copying to a
> >buffer on the stack should be faster, it is not so obvious if there is a
> >need to malloc() the buffer before sending (for example if there is no
> >upper limit on len). However, even if we need to malloc() the buffer, the
> >single-write version is still faster by orders of magnitude.
>
> If you can design an algorithm that makes that only two times slower, then
> the world will be excited and interested and perhaps that algorithm will
> replace TCP. But until that time, we're stuck with what we have.
With Nagle disabled it works 17/15 times slower, which is much less than
two. Similarly with Unix domain sockets.
> >I found it while working on a client/server program that worked horribly
> >slow just because of this. (Of course I fixed it, but that's not the point.)
>
> THAT IS THE POINT. The problem wasn't in the kernel, it was in the program,
> and you fixed it. If you do smart buffering, TCP can behave efficiently. If
> you don't, it has to guess when to send packets, and it can't possibly
> predict the future and behave in the way you think is optimum.
Ok, *now* I know that ;)
Thank you all for pointers.
On Thu, 3 Jan 2002 14:22:52 +0100, Michal Moskal wrote:
>> If you can design an algorithm that makes that only two times slower,
>>then
>> the world will be excited and interested and perhaps that algorithm will
>>replace TCP. But until that time, we're stuck with what we have.
>With Nagle disabled it works 17/15 times slower, which is much less than
>two. Similarly with Unix domain sockets.
However, with Nagle disabled, there is no bound to how poor network
efficiency can be. If you do a single byte write every tenth of a second, you
will send out a packet for each single byte.
You can only disable Nagle if you can assume that the application is smart
enough to do the coalescing. After all, someone has to. Since we're talking
about an app that can't coalesce, you cannot disable Nagle. (Unless you
consider it acceptable to send one byte of data in each packet.)
Again, an application *must* *not* disable Nagle unless it (the app) takes
responsibility for ensuring that data is sent in large enough chunks to
ensure network efficiency. So you can disable Nagle if you want to, but not
until *AFTER* you make sure your application coalesces writes.
DS
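The application-level coalescing DS insists on can be done with a small output buffer that batches writes and is flushed explicitly once a whole packet has been assembled. A minimal sketch (the buffer size and all names here are arbitrary, for illustration only):

```c
#include <string.h>     /* memcpy */
#include <unistd.h>     /* write, ssize_t */

#define OUTBUF_SIZE 4096

struct outbuf {
        int    fd;
        size_t used;
        char   data[OUTBUF_SIZE];
};

/* Flush whatever has been buffered so far with a single write(). */
ssize_t outbuf_flush(struct outbuf *b)
{
        ssize_t n = 0;

        if (b->used > 0) {
                n = write(b->fd, b->data, b->used);
                b->used = 0;
        }
        return n;
}

/* Append to the buffer, flushing first if the new data wouldn't fit.
 * Writes larger than the whole buffer go straight to the fd. */
ssize_t outbuf_write(struct outbuf *b, const void *data, size_t len)
{
        if (len > OUTBUF_SIZE - b->used)
                outbuf_flush(b);
        if (len > OUTBUF_SIZE)
                return write(b->fd, data, len);
        memcpy(b->data + b->used, data, len);
        b->used += len;
        return (ssize_t)len;
}
```

With something like this in place, the header and payload of each packet reach the kernel in one write(), and only then does it become reasonable to think about TCP_NODELAY.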