2000-11-30 18:14:57

by Ben Mansell

[permalink] [raw]
Subject: TCP push missing with writev()

(possibly treading on ground covered before:
http://www.uwsg.iu.edu/hypermail/linux/kernel/9904.1/0304.html )

To be brief and to the point: Should there be any difference between the
following two ways of writing data to a TCP socket?

1) write( fd, buffer, length )
2) writev( fd, {buffer, length}, {NULL,0} )

The problem is that if data happens to be written via method (2), then
the PUSH flag is never set on any packets generated. This is a bug,
surely?

(Occurs on 2.2.5 and 2.4.0-test10. Doesn't occur in 2.0.36 and lots of
other UNIX-alikes)


Ben


2000-11-30 18:45:04

by Andi Kleen

[permalink] [raw]
Subject: Re: TCP push missing with writev()

On Thu, Nov 30, 2000 at 05:35:41PM +0000, Ben Mansell wrote:
> (possibly treading on ground covered before:
> http://www.uwsg.iu.edu/hypermail/linux/kernel/9904.1/0304.html )
>
> To be brief and to the point: Should there be any difference between the
> following two ways of writing data to a TCP socket?
>
> 1) write( fd, buffer, length )
> 2) writev( fd, {buffer, length}, {NULL,0} )

No.

>
> The problem is that if data happens to be written via method (2), then
> the PUSH flag is never set on any packets generated. This is a bug,
> surely?

I just tried it on 2.2.17 and 2.4.0test11 and it sets PUSH for writev()
for both cases just fine. Maybe you could supply a test program and tcpdump
logs for what you think is wrong ?

-Andi

2000-11-30 20:06:09

by Ben Mansell

[permalink] [raw]
Subject: Re: TCP push missing with writev()

On Thu, 30 Nov 2000, Andi Kleen wrote:

> > The problem is that if data happens to be written via method (2), then
> > the PUSH flag is never set on any packets generated. This is a bug,
> > surely?
>
> I just tried it on 2.2.17 and 2.4.0test11 and it sets PUSH for writev()
> for both cases just fine. Maybe you could supply a test program and tcpdump
> logs for what you think is wrong ?

BTW, Nagle turned off for all connections.
I can't supply source, but here are some TCP dumps of whats happening.
They're of HTTP, with a Windows IE refreshing a web page. I'll include
just one of the connections.
electra: win98, artemis: 2.4.0-test10

First of all, using write():

18:41:08.714801 electra.1057 > artemis.www: S 984816:984816(0) win 8192 <mss 1460,nop,nop,sackOK> (DF)
18:41:08.714864 artemis.www > electra.1057: S 319022729:319022729(0) ack 984817 win 5840 <mss 1460,nop,nop,sackOK> (DF)
18:41:08.715228 electra.1057 > artemis.www: . ack 1 win 8760 (DF)
18:41:08.734268 electra.1057 > artemis.www: P 1:288(287) ack 1 win 8760 (DF)
18:41:08.734354 artemis.www > electra.1057: . ack 288 win 6432 (DF)
18:41:08.745542 artemis.www > electra.1057: P 1:85(84) ack 288 win 6432 (DF)
18:41:08.754096 electra.1057 > artemis.www: P 288:568(280) ack 85 win 8676 (DF)
18:41:08.754150 artemis.www > electra.1057: . ack 568 win 7504 (DF)
18:41:08.770517 artemis.www > electra.1057: P 85:169(84) ack 568 win 7504 (DF)
18:41:08.825662 electra.1057 > artemis.www: P 568:798(230) ack 169 win 8592 (DF)
18:41:08.825742 artemis.www > electra.1057: . ack 798 win 8576 (DF)
18:41:08.856386 artemis.www > electra.1057: P 169:1230(1061) ack 798 win 8576 (DF)
18:41:08.885806 electra.1057 > artemis.www: R 985614:985614(0) win 0 (DF)

Now using writev() (all of which use two buffers, the second one empty):

18:40:15.434759 electra.1054 > artemis.www: S 931532:931532(0) win 8192 <mss 1460,nop,nop,sackOK> (DF)
18:40:15.434820 artemis.www > electra.1054: S 272275362:272275362(0) ack 931533 win 5840 <mss 1460,nop,nop,sackOK> (DF)
18:40:15.435149 electra.1054 > artemis.www: . ack 1 win 8760 (DF)
18:40:15.468973 electra.1054 > artemis.www: P 1:288(287) ack 1 win 8760 (DF)
18:40:15.469037 artemis.www > electra.1054: . ack 288 win 6432 (DF)
18:40:15.485787 artemis.www > electra.1054: . 1:85(84) ack 288 win 6432 (DF)
18:40:15.592677 electra.1054 > artemis.www: . ack 85 win 8676 (DF)
18:40:15.897950 electra.1054 > artemis.www: P 288:568(280) ack 85 win 8676 (DF)
18:40:15.897977 artemis.www > electra.1054: . ack 568 win 7504 (DF)
18:40:15.900336 artemis.www > electra.1054: . 85:169(84) ack 568 win 7504 (DF)
18:40:16.092696 electra.1054 > artemis.www: . ack 169 win 8592 (DF)
18:40:16.396061 electra.1054 > artemis.www: P 568:798(230) ack 169 win 8592 (DF)
18:40:16.411279 artemis.www > electra.1054: . ack 798 win 8576 (DF)
18:40:16.428007 artemis.www > electra.1054: . 169:1230(1061) ack 798 win 8576 (DF)
18:40:16.592603 electra.1054 > artemis.www: . ack 1230 win 7531 (DF)
18:40:16.895021 electra.1054 > artemis.www: R 932330:932330(0) win 0 (DF)


Ben

2000-11-30 21:02:10

by David Miller

[permalink] [raw]
Subject: Re: TCP push missing with writev()

Date: Thu, 30 Nov 2000 19:14:14 +0100
From: Andi Kleen <[email protected]>

> The problem is that if data happens to be written via method (2),
> then the PUSH flag is never set on any packets generated. This is
> a bug, surely?

I just tried it on 2.2.17 and 2.4.0test11 and it sets PUSH for
writev() for both cases just fine. Maybe you could supply a test
program and tcpdump logs for what you think is wrong ?

One thing which could affect the behavior is if the Zeus
folks are futzing around with the TCP_CORK socket option.
Ben, are you guys playing with it?

Finally, are you doing these writes from a userspace program
or directly inside of a kernel module?

Later,
David S. Miller
[email protected]

2000-11-30 23:23:06

by Ben Mansell

[permalink] [raw]
Subject: Re: TCP push missing with writev()

On Thu, 30 Nov 2000, David S. Miller wrote:

> Date: Thu, 30 Nov 2000 19:14:14 +0100
> From: Andi Kleen <[email protected]>
>
> > The problem is that if data happens to be written via method (2),
> > then the PUSH flag is never set on any packets generated. This is
> > a bug, surely?
>
> I just tried it on 2.2.17 and 2.4.0test11 and it sets PUSH for
> writev() for both cases just fine. Maybe you could supply a test
> program and tcpdump logs for what you think is wrong ?
>
> One thing which could affect the behavior is if the Zeus
> folks are futzing around with the TCP_CORK socket option.
> Ben, are you guys playing with it?
>
> Finally, are you doing these writes from a userspace program
> or directly inside of a kernel module?

Nope, nothing cunning going on here. Its from userspace, TCP_CORK
untouched. TCP_NODELAY is set, and I was using non-blocking IO (though
that doesn't seem to affect it)

Anyway, I've just hacked together a quick & nasty piece of code that
demonstrates the issue. Run the program with no arguments to get write()
behaviour, or give an argument and you'll get writev()

write() trace: (bagpipe=server, 2.4.0-test10, baphomet=client, 2.2.17)
22:36:38.430774 baphomet.4126 > bagpipe.4000: S 2380162569:2380162569(0) win 32120 <mss 1460,sackOK,timestamp 337446754 0,nop,wscale 0> (DF) [tos 0x10]
22:36:38.432410 bagpipe.4000 > baphomet.4126: S 2375849323:2375849323(0) ack 2380162570 win 5792 <mss 1460,sackOK,timestamp 4813031 337446754,nop,wscale 0> (DF)
22:36:38.432581 baphomet.4126 > bagpipe.4000: . ack 1 win 32120 <nop,nop,timestamp 337446754 4813031> (DF) [tos 0x10]
22:36:38.433494 bagpipe.4000 > baphomet.4126: P 1:128(127) ack 1 win 5792 <nop,nop,timestamp 4813032 337446754> (DF)
22:36:38.433685 baphomet.4126 > bagpipe.4000: . ack 128 win 32120 <nop,nop,timestamp 337446754 4813032> (DF) [tos 0x10]
22:36:40.442471 bagpipe.4000 > baphomet.4126: P 128:255(127) ack 1 win 5792 <nop,nop,timestamp 4813233 337446754> (DF)
22:36:40.455945 baphomet.4126 > bagpipe.4000: . ack 255 win 32120 <nop,nop,timestamp 337446957 4813233> (DF) [tos 0x10]
22:36:42.452473 bagpipe.4000 > baphomet.4126: P 255:382(127) ack 1 win 5792 <nop,nop,timestamp 4813434 337446957> (DF)
22:36:42.465812 baphomet.4126 > bagpipe.4000: . ack 382 win 32120 <nop,nop,timestamp 337447158 4813434> (DF) [tos 0x10]
[...]

writev() trace:
22:37:57.214625 baphomet.4129 > bagpipe.4000: S 2458096423:2458096423(0) win 32120 <mss 1460,sackOK,timestamp 337454633 0,nop,wscale 0> (DF) [tos 0x10]
22:37:57.214682 bagpipe.4000 > baphomet.4129: S 2451832783:2451832783(0) ack 2458096424 win 5792 <mss 1460,sackOK,timestamp 4820910 337454633,nop,wscale 0> (DF)
22:37:57.214830 baphomet.4129 > bagpipe.4000: . ack 1 win 32120 <nop,nop,timestamp 337454633 4820910> (DF) [tos 0x10]
22:37:57.215017 bagpipe.4000 > baphomet.4129: . 1:128(127) ack 1 win 5792 <nop,nop,timestamp 4820910 337454633> (DF)
22:37:57.215204 baphomet.4129 > bagpipe.4000: . ack 128 win 32120 <nop,nop,timestamp 337454633 4820910> (DF) [tos 0x10]
22:37:59.222458 bagpipe.4000 > baphomet.4129: . 128:255(127) ack 1 win 5792 <nop,nop,timestamp 4821111 337454633> (DF)
22:37:59.720572 baphomet.4129 > bagpipe.4000: . ack 255 win 32120 <nop,nop,timestamp 337454884 4821111> (DF) [tos 0x10]
22:38:01.232460 bagpipe.4000 > baphomet.4129: . 255:382(127) ack 1 win 5792 <nop,nop,timestamp 4821312 337454884> (DF)
22:38:01.730440 baphomet.4129 > bagpipe.4000: . ack 382 win 32120 <nop,nop,timestamp 337455085 4821312> (DF) [tos 0x10]
[...]


If you vary the 'CHUNK_SIZE' value to change how much data gets sent out
between each sleep(), you can provoke the PUSH flag to be set.
In fact, I've just noticed something; setting it to 1700, and after
leaving it to merrily writev() for a while, I do eventually see the PUSH
flag:

22:51:56.302460 bagpipe.4000 > baphomet.4135: P 15301:16749(1448) ack 1 win 5792 <nop,nop,timestamp 4904819 337538396> (DF)
...
22:52:16.402458 bagpipe.4000 > baphomet.4135: P 32301:33749(1448) ack 1 win 5792 <nop,nop,timestamp 4906829 337540407> (DF)
...
22:52:36.502454 bagpipe.4000 > baphomet.4135: P 49301:50749(1448) ack 1 win 5792 <nop,nop,timestamp 4908839 337542417> (DF)
...
22:52:56.602449 bagpipe.4000 > baphomet.4135: P 66301:67749(1448) ack 1 win 5792 <nop,nop,timestamp 4910849 337544427> (DF)



Ben


==============




#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <sys/uio.h>

#define CHUNK_SIZE 127


char data[ CHUNK_SIZE ];

int main( int argc, char *argv[] )
{
int serverfd, clientfd, i, client_size;
struct sockaddr_in server_addr, client_addr;
struct iovec vector[ 2 ];

for( i = 0; i < CHUNK_SIZE; i++ ) data[ i ] = 'a';

serverfd = socket( PF_INET, SOCK_STREAM, 0 );
server_addr.sin_family = AF_INET;
server_addr.sin_addr.s_addr = htonl( INADDR_ANY );
server_addr.sin_port = htons( 4000 );

if( bind( serverfd, ( struct sockaddr * ) &server_addr,
sizeof( server_addr )) ) {
perror( "bind" );
exit( 1 );
}

listen( serverfd, 10 );
clientfd = accept( serverfd, ( struct sockaddr * ) &client_addr,
&client_size );
i = 1;
setsockopt( clientfd, SOL_TCP, TCP_NODELAY, &i, sizeof( i ));

for( ;; ) {
if( argc > 1 ) {
vector[ 0 ].iov_base = data;
vector[ 0 ].iov_len = CHUNK_SIZE;
vector[ 1 ].iov_base = NULL;
vector[ 1 ].iov_len = 0;
writev( clientfd, vector, 2 );
} else {
write( clientfd, &data, CHUNK_SIZE );
}
sleep( 2 );
}
}


2000-11-30 23:47:58

by David Miller

[permalink] [raw]
Subject: Re: TCP push missing with writev()

Date: Thu, 30 Nov 2000 22:57:09 +0000 (GMT)
From: Ben Mansell <[email protected]>

vector[ 1 ].iov_base = NULL;
vector[ 1 ].iov_len = 0;

"Don't do that" :-)

I'll doubt I will bother to fix it (it's not %100 trivial to fix and
it will put a cost into a hot path). Even if I did, you're going to
need to fix Zeus not to do writev's with empty iovec's like this so
that you get the PSH's in current kernels.

Later,
David S. Miller
[email protected]

2000-12-01 00:41:28

by Ben Mansell

[permalink] [raw]
Subject: Re: TCP push missing with writev()

On Thu, 30 Nov 2000, David S. Miller wrote:

> vector[ 1 ].iov_base = NULL;
> vector[ 1 ].iov_len = 0;
>
> "Don't do that" :-)
>
> I'll doubt I will bother to fix it (it's not %100 trivial to fix and
> it will put a cost into a hot path). Even if I did, you're going to
> need to fix Zeus not to do writev's with empty iovec's like this so
> that you get the PSH's in current kernels.

Its OK, our web server currently doesn't have this problem :)

I ran into the problem developing code that wrote data out of a circular
buffer - so the data being written may potentially be in two sections if
the buffer wraps around. A generic writev() call then emptied as much data
out of the buffer as possible, and so tripping the bug. I guess what I'm saying
is that you don't have to have plain dumb code (like the example code I
gave) in order to hit this problem.

Paranoia check: can this bug occur with writev() and other combinations of
sizes, or is it only when there is a 0-length buffer in there?


Ben

2000-12-01 00:54:42

by David Miller

[permalink] [raw]
Subject: Re: TCP push missing with writev()

Date: Fri, 1 Dec 2000 00:15:45 +0000 (GMT)
From: Ben Mansell <[email protected]>

Paranoia check: can this bug occur with writev() and other
combinations of sizes, or is it only when there is a 0-length
buffer in there?

Only the 0-length buffer at the end of the iovec can trigger this
problem. The PSH bit check during writes is basically:

if (we are emptying the current iovec entry &&
no more iovecs to process)
set PSH bit

Zero length iovec entries do not even make it to this test
since the zero length makes the loop this test is within
never execute.

Later,
David S. Miller
[email protected]