I wrote a small read-only SMB server, and wanted to see how fast it was.
So I used smbget to download a moderately large file from it via localhost.
smbget only got ~70 KB/sec.
This is what the view from strace -tt on the server is:
22:44:58.812467 read(8, "\0\0\0007\377SMB.\0\0\0\0\10\1\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\232\3"..., 8192) = 59
22:44:58.812619 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b46b8e5e000
22:44:58.812729 fcntl(9, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
22:44:58.812847 epoll_ctl(7, EPOLL_CTL_DEL, 8, {0, {u32=8, u64=13323248792850399240}}) = 0
22:44:58.812946 epoll_ctl(7, EPOLL_CTL_ADD, 8, {EPOLLOUT, {u32=8, u64=18251433459580936}}) = 0
22:44:58.813039 epoll_wait(7, {{EPOLLOUT, {u32=8, u64=18251433459580936}}}, 100, 442) = 1
22:44:58.813132 setsockopt(8, SOL_TCP, TCP_CORK, [1], 4) = 0
22:44:58.813215 write(8, "\0\0\372<\377SMB.\0\0\0\0\200A\300\0\0\0\0\0\0\0\0\0\0\0\0\0\0\232\3"..., 64) = 64
22:44:58.813323 sendfile(8, 9, [128000], 64000) = 64000
22:44:58.813430 setsockopt(8, SOL_TCP, TCP_CORK, [0], 4) = 0
22:44:58.813511 munmap(0x2b46b8e5e000, 8192) = 0
22:44:58.813600 epoll_wait(7, {{EPOLLOUT, {u32=8, u64=18251433459580936}}}, 100, 442) = 1
22:44:58.813693 epoll_ctl(7, EPOLL_CTL_DEL, 8, {0, {u32=8, u64=8}}) = 0
22:44:58.813778 epoll_ctl(7, EPOLL_CTL_ADD, 8, {EPOLLIN, {u32=8, u64=18252000395264008}}) = 0
22:44:58.813869 epoll_wait(7, {}, 100, 441) = 0
22:44:59.255789 epoll_wait(7, {{EPOLLIN, {u32=8, u64=18252000395264008}}}, 100, 999) = 1
22:44:59.688519 read(8, "\0\0\0007\377SMB.\0\0\0\0\10\1\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\232\3"..., 8192) = 59
As you can see, the time difference between reading the query and writing the
result is very small, but there is a big delay before receiving the next request.
This is the view from a sniffer on the lo interface:
1192653899.688385 127.0.0.1 -> 127.0.0.1 SMB Read AndX Request, FID: 0x0001, 64000 bytes at offset 192000
1192653899.688399 127.0.0.1 -> 127.0.0.1 TCP 445 > 42990 [ACK] Seq=192660 Ack=779 Win=33920 Len=0 TSV=359208 TSER=359208
1192653899.895725 127.0.0.1 -> 127.0.0.1 SMB [TCP Window Full] Read AndX Response, FID: 0x0001, 64000 bytes
1192653899.895793 127.0.0.1 -> 127.0.0.1 TCP 42990 > 445 [ACK] Seq=779 Ack=204948 Win=12288 Len=0 TSV=359260 TSER=359260
1192653899.895805 127.0.0.1 -> 127.0.0.1 NBSS NBSS Continuation Message
1192653899.935725 127.0.0.1 -> 127.0.0.1 TCP 42990 > 445 [ACK] Seq=779 Ack=209044 Win=12288 Len=0 TSV=359270 TSER=359260
1192653900.147739 127.0.0.1 -> 127.0.0.1 NBSS [TCP Window Full] NBSS Continuation Message
1192653900.147767 127.0.0.1 -> 127.0.0.1 TCP [TCP ZeroWindow] 42990 > 445 [ACK] Seq=779 Ack=221332 Win=0 Len=0 TSV=359323 TSER=359323
1192653900.147807 127.0.0.1 -> 127.0.0.1 TCP [TCP Window Update] 42990 > 445 [ACK] Seq=779 Ack=221332 Win=12288 Len=0 TSV=359323 TSER=359323
1192653900.147815 127.0.0.1 -> 127.0.0.1 NBSS NBSS Continuation Message
1192653900.147837 127.0.0.1 -> 127.0.0.1 TCP 42990 > 445 [ACK] Seq=779 Ack=225428 Win=12288 Len=0 TSV=359323 TSER=359323
1192653900.355754 127.0.0.1 -> 127.0.0.1 NBSS [TCP Window Full] NBSS Continuation Message
1192653900.355782 127.0.0.1 -> 127.0.0.1 TCP [TCP ZeroWindow] 42990 > 445 [ACK] Seq=779 Ack=237716 Win=0 Len=0 TSV=359375 TSER=359375
1192653900.355820 127.0.0.1 -> 127.0.0.1 TCP [TCP Window Update] 42990 > 445 [ACK] Seq=779 Ack=237716 Win=12288 Len=0 TSV=359375 TSER=359375
1192653900.355829 127.0.0.1 -> 127.0.0.1 NBSS NBSS Continuation Message
1192653900.355849 127.0.0.1 -> 127.0.0.1 TCP 42990 > 445 [ACK] Seq=779 Ack=241812 Win=12288 Len=0 TSV=359375 TSER=359375
1192653900.563766 127.0.0.1 -> 127.0.0.1 NBSS [TCP Window Full] NBSS Continuation Message
1192653900.563794 127.0.0.1 -> 127.0.0.1 TCP [TCP ZeroWindow] 42990 > 445 [ACK] Seq=779 Ack=254100 Win=0 Len=0 TSV=359427 TSER=359427
1192653900.563831 127.0.0.1 -> 127.0.0.1 TCP [TCP Window Update] 42990 > 445 [ACK] Seq=779 Ack=254100 Win=12288 Len=0 TSV=359427 TSER=359427
1192653900.563839 127.0.0.1 -> 127.0.0.1 NBSS NBSS Continuation Message
1192653900.563858 127.0.0.1 -> 127.0.0.1 TCP 42990 > 445 [ACK] Seq=779 Ack=256724 Win=12288 Len=0 TSV=359427 TSER=359427
1192653900.564444 127.0.0.1 -> 127.0.0.1 SMB Read AndX Request, FID: 0x0001, 64000 bytes at offset 256000
Note the delay between sending the response and getting the reply.
Also note that there is almost no delay between getting the reply and sending
the next request.
My understanding of TCP_CORK from the tcp(7) man page is that it should flush
out the data immediately, but the network trace seems to suggest that there is
a 200 ms delay between the request and the outgoing data. tcp(7) says there is
a 200 ms delay for sending out data when the socket is in corked mode, so
uncorking does not appear to work.
Now for the strange part: the same code works without a 200 ms delay
over ethernet. The delay only appears on the loopback interface.
Also, on my home desktop, which is also running 2.6.22, there is no
delay even on loopback.
I tried 2.6.23 on my notebook, but the effect is the same.
Any ideas what could cause this?
Thanks,
Felix
On 10/17/2007 04:51 PM, Felix von Leitner wrote:
> I wrote a small read-only SMB server, and wanted to see how fast it was.
> So I used smbget to download a moderately large file from it via localhost.
> smbget only got ~70 KB/sec.
>
> This is what the view from strace -tt on the server is:
>
> 22:44:58.812467 read(8, "\0\0\0007\377SMB.\0\0\0\0\10\1\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\232\3"..., 8192) = 59
> 22:44:58.812619 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b46b8e5e000
> 22:44:58.812729 fcntl(9, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
> 22:44:58.812847 epoll_ctl(7, EPOLL_CTL_DEL, 8, {0, {u32=8, u64=13323248792850399240}}) = 0
> 22:44:58.812946 epoll_ctl(7, EPOLL_CTL_ADD, 8, {EPOLLOUT, {u32=8, u64=18251433459580936}}) = 0
> 22:44:58.813039 epoll_wait(7, {{EPOLLOUT, {u32=8, u64=18251433459580936}}}, 100, 442) = 1
> 22:44:58.813132 setsockopt(8, SOL_TCP, TCP_CORK, [1], 4) = 0
> 22:44:58.813215 write(8, "\0\0\372<\377SMB.\0\0\0\0\200A\300\0\0\0\0\0\0\0\0\0\0\0\0\0\0\232\3"..., 64) = 64
> 22:44:58.813323 sendfile(8, 9, [128000], 64000) = 64000
> 22:44:58.813430 setsockopt(8, SOL_TCP, TCP_CORK, [0], 4) = 0
> 22:44:58.813511 munmap(0x2b46b8e5e000, 8192) = 0
> 22:44:58.813600 epoll_wait(7, {{EPOLLOUT, {u32=8, u64=18251433459580936}}}, 100, 442) = 1
> 22:44:58.813693 epoll_ctl(7, EPOLL_CTL_DEL, 8, {0, {u32=8, u64=8}}) = 0
> 22:44:58.813778 epoll_ctl(7, EPOLL_CTL_ADD, 8, {EPOLLIN, {u32=8, u64=18252000395264008}}) = 0
> 22:44:58.813869 epoll_wait(7, {}, 100, 441) = 0
> 22:44:59.255789 epoll_wait(7, {{EPOLLIN, {u32=8, u64=18252000395264008}}}, 100, 999) = 1
> 22:44:59.688519 read(8, "\0\0\0007\377SMB.\0\0\0\0\10\1\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\232\3"..., 8192) = 59
>
>
> As you can see, the time difference between reading the query and writing the
> result is very small, but there is a big delay before receiving the next request.
>
> This is the view from a sniffer on the lo interface:
>
> 1192653899.688385 127.0.0.1 -> 127.0.0.1 SMB Read AndX Request, FID: 0x0001, 64000 bytes at offset 192000
> 1192653899.688399 127.0.0.1 -> 127.0.0.1 TCP 445 > 42990 [ACK] Seq=192660 Ack=779 Win=33920 Len=0 TSV=359208 TSER=359208
> 1192653899.895725 127.0.0.1 -> 127.0.0.1 SMB [TCP Window Full] Read AndX Response, FID: 0x0001, 64000 bytes
> 1192653899.895793 127.0.0.1 -> 127.0.0.1 TCP 42990 > 445 [ACK] Seq=779 Ack=204948 Win=12288 Len=0 TSV=359260 TSER=359260
> 1192653899.895805 127.0.0.1 -> 127.0.0.1 NBSS NBSS Continuation Message
> 1192653899.935725 127.0.0.1 -> 127.0.0.1 TCP 42990 > 445 [ACK] Seq=779 Ack=209044 Win=12288 Len=0 TSV=359270 TSER=359260
> 1192653900.147739 127.0.0.1 -> 127.0.0.1 NBSS [TCP Window Full] NBSS Continuation Message
> 1192653900.147767 127.0.0.1 -> 127.0.0.1 TCP [TCP ZeroWindow] 42990 > 445 [ACK] Seq=779 Ack=221332 Win=0 Len=0 TSV=359323 TSER=359323
> 1192653900.147807 127.0.0.1 -> 127.0.0.1 TCP [TCP Window Update] 42990 > 445 [ACK] Seq=779 Ack=221332 Win=12288 Len=0 TSV=359323 TSER=359323
> 1192653900.147815 127.0.0.1 -> 127.0.0.1 NBSS NBSS Continuation Message
> 1192653900.147837 127.0.0.1 -> 127.0.0.1 TCP 42990 > 445 [ACK] Seq=779 Ack=225428 Win=12288 Len=0 TSV=359323 TSER=359323
> 1192653900.355754 127.0.0.1 -> 127.0.0.1 NBSS [TCP Window Full] NBSS Continuation Message
> 1192653900.355782 127.0.0.1 -> 127.0.0.1 TCP [TCP ZeroWindow] 42990 > 445 [ACK] Seq=779 Ack=237716 Win=0 Len=0 TSV=359375 TSER=359375
> 1192653900.355820 127.0.0.1 -> 127.0.0.1 TCP [TCP Window Update] 42990 > 445 [ACK] Seq=779 Ack=237716 Win=12288 Len=0 TSV=359375 TSER=359375
> 1192653900.355829 127.0.0.1 -> 127.0.0.1 NBSS NBSS Continuation Message
> 1192653900.355849 127.0.0.1 -> 127.0.0.1 TCP 42990 > 445 [ACK] Seq=779 Ack=241812 Win=12288 Len=0 TSV=359375 TSER=359375
> 1192653900.563766 127.0.0.1 -> 127.0.0.1 NBSS [TCP Window Full] NBSS Continuation Message
> 1192653900.563794 127.0.0.1 -> 127.0.0.1 TCP [TCP ZeroWindow] 42990 > 445 [ACK] Seq=779 Ack=254100 Win=0 Len=0 TSV=359427 TSER=359427
> 1192653900.563831 127.0.0.1 -> 127.0.0.1 TCP [TCP Window Update] 42990 > 445 [ACK] Seq=779 Ack=254100 Win=12288 Len=0 TSV=359427 TSER=359427
> 1192653900.563839 127.0.0.1 -> 127.0.0.1 NBSS NBSS Continuation Message
> 1192653900.563858 127.0.0.1 -> 127.0.0.1 TCP 42990 > 445 [ACK] Seq=779 Ack=256724 Win=12288 Len=0 TSV=359427 TSER=359427
> 1192653900.564444 127.0.0.1 -> 127.0.0.1 SMB Read AndX Request, FID: 0x0001, 64000 bytes at offset 256000
>
>
> Note the delay between sending the response and getting the reply.
> Also note that there is almost no delay between getting the reply and sending
> the next request.
>
> My understanding of TCP_CORK from the tcp(7) man page is that it should flush
> out the data immediately, but the network trace seems to suggest that there is
> a 200 ms delay between the request and the outgoing data. tcp(7) says there is
> a 200 ms delay for sending out data when the socket is in corked mode, so
> uncorking does not appear to work.
>
>
> Now for the strange part: the same code works without a 200 ms delay
> over ethernet. The delay only appears on the loopback interface.
>
> Also, on my home desktop, which is also running 2.6.22, there is no
> delay even on loopback.
>
> I tried 2.6.23 on my notebook, but the effect is the same.
>
> Any ideas what could cause this?
>
(cc: netdev)
Thus spake Chuck Ebbert ([email protected]):
> > Any ideas what could cause this?
> (cc: netdev)
Maybe I should mention this, too:
accept(5, {sa_family=AF_INET6, sin6_port=htons(59821), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [18446744069414584348]) = 8
setsockopt(8, SOL_TCP, TCP_NODELAY, [1], 4) = 0
And if it would be the Nagle algorithm, it should also impact the
ethernet case.
Felix
the packet trace was a bit too cooked perhaps, but there were indications that
at times the TCP window was going to zero - perhaps something with window
updates or persist timers?
rick jones
> the packet trace was a bit too cooked perhaps, but there were indications
> that at times the TCP window was going to zero - perhaps something with
> window updates or persist timers?
Does TCP use different window sizes on loopback? Why is this not
happening on ethernet?
How could I test this theory?
My initial idea was that it has something todo with the different MTU on
loopback. My initial block size was 16k, but the problem stayed when I
changed it to 64k.
Felix
Felix von Leitner wrote:
>>the packet trace was a bit too cooked perhaps, but there were indications
>>that at times the TCP window was going to zero - perhaps something with
>>window updates or persist timers?
>
>
> Does TCP use different window sizes on loopback? Why is this not
> happening on ethernet?
I don't think it uses different window sizes on loopback, but with the
autotuning it can be difficult to say a priori what the window size will be.
What one can say with confidence is that the MTU and thus the MSS will be
different between loopback and ethernet.
>
> How could I test this theory?
Can you take another trace that isn't so "cooked?" One that just sticks with
TCP-level and below stuff?
If SMB is a one-request-at-a-time protocol (I can never remember), you could
simulate it with a netperf TCP_RR test by passing suitable values to the
test-specific -r option:
netperf -H <remote> -t TCP_RR -- -r <req>,<rsp>
If that shows similar behaviour then you can ass-u-me it isn't your application.
One caveat though is that TCP_CORK mode in netperf is very primitive and may
not match what you are doing, however, that may be a good thing.
http://www.netperf.org/svn/netperf2/tags/netperf-2.4.4/ or
ftp://ftp.netperf.org/netperf/
to get the current netperf bits. It is also possible to get multiple
transactions in flight at one time if you configure netperf with --enable-burst,
which will then enable a test-specific -b option. With the latest netperf you
cna also switch the output of a TCP_RR test to bits or bytes per second a la the
_STREAM tests.
rick jones
>
> My initial idea was that it has something todo with the different MTU on
> loopback. My initial block size was 16k, but the problem stayed when I
> changed it to 64k.
>
> Felix
Thus spake Rick Jones ([email protected]):
> >How could I test this theory?
> Can you take another trace that isn't so "cooked?" One that just sticks
> with TCP-level and below stuff?
Sorry for taking so long. Here is a tcpdump. The side on port 445 is
the SMB server using TCP_CORK.
23:03:20.283772 IP 127.0.0.1.33230 > 127.0.0.1.445: S 1503927325:1503927325(0) win 32792 <mss 16396,sackOK,timestamp 9451736 0,nop,wscale 7>
23:03:20.283774 IP 127.0.0.1.445 > 127.0.0.1.33230: S 1513925692:1513925692(0) ack 1503927326 win 32768 <mss 16396,sackOK,timestamp 9451737 9451736,nop,wscale 7>
23:03:20.283797 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 1 win 257 <nop,nop,timestamp 9451737 9451737>
23:03:20.295851 IP 127.0.0.1.33230 > 127.0.0.1.445: P 1:195(194) ack 1 win 257 <nop,nop,timestamp 9451740 9451737>
23:03:20.295881 IP 127.0.0.1.445 > 127.0.0.1.33230: . ack 195 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.295959 IP 127.0.0.1.445 > 127.0.0.1.33230: P 1:87(86) ack 195 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.295998 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 87 win 256 <nop,nop,timestamp 9451740 9451740>
23:03:20.296063 IP 127.0.0.1.33230 > 127.0.0.1.445: P 195:287(92) ack 87 win 256 <nop,nop,timestamp 9451740 9451740>
23:03:20.296096 IP 127.0.0.1.445 > 127.0.0.1.33230: P 87:181(94) ack 287 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.296135 IP 127.0.0.1.33230 > 127.0.0.1.445: P 287:373(86) ack 181 win 255 <nop,nop,timestamp 9451740 9451740>
23:03:20.296163 IP 127.0.0.1.445 > 127.0.0.1.33230: P 181:239(58) ack 373 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.296201 IP 127.0.0.1.33230 > 127.0.0.1.445: P 373:459(86) ack 239 win 255 <nop,nop,timestamp 9451740 9451740>
23:03:20.296245 IP 127.0.0.1.445 > 127.0.0.1.33230: P 239:309(70) ack 459 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.296286 IP 127.0.0.1.33230 > 127.0.0.1.445: P 459:535(76) ack 309 win 254 <nop,nop,timestamp 9451740 9451740>
23:03:20.296314 IP 127.0.0.1.445 > 127.0.0.1.33230: P 309:461(152) ack 535 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.296361 IP 127.0.0.1.33230 > 127.0.0.1.445: P 535:594(59) ack 461 win 253 <nop,nop,timestamp 9451740 9451740>
23:03:20.296400 IP 127.0.0.1.445 > 127.0.0.1.33230: . 461:16845(16384) ack 594 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.335748 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 16845 win 125 <nop,nop,timestamp 9451750 9451740>
[note the .2 sec pause]
23:03:20.547763 IP 127.0.0.1.445 > 127.0.0.1.33230: P 16845:32845(16000) ack 594 win 265 <nop,nop,timestamp 9451803 9451750>
23:03:20.547797 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 32845 win 0 <nop,nop,timestamp 9451803 9451803>
23:03:20.547855 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 32845 win 96 <nop,nop,timestamp 9451803 9451803>
23:03:20.547863 IP 127.0.0.1.445 > 127.0.0.1.33230: P 32845:33229(384) ack 594 win 265 <nop,nop,timestamp 9451803 9451803>
23:03:20.547890 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 33229 win 96 <nop,nop,timestamp 9451803 9451803>
[note the .2 sec pause]
23:03:20.755775 IP 127.0.0.1.445 > 127.0.0.1.33230: P 33229:45517(12288) ack 594 win 265 <nop,nop,timestamp 9451855 9451803>
23:03:20.755855 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 45517 win 96 <nop,nop,timestamp 9451855 9451855>
23:03:20.755868 IP 127.0.0.1.445 > 127.0.0.1.33230: P 45517:49613(4096) ack 594 win 265 <nop,nop,timestamp 9451855 9451855>
23:03:20.755898 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 49613 win 96 <nop,nop,timestamp 9451855 9451855>
[another one]
23:03:20.963789 IP 127.0.0.1.445 > 127.0.0.1.33230: P 49613:61901(12288) ack 594 win 265 <nop,nop,timestamp 9451907 9451855>
23:03:20.963871 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 61901 win 96 <nop,nop,timestamp 9451907 9451907>
23:03:20.963885 IP 127.0.0.1.445 > 127.0.0.1.33230: P 61901:64525(2624) ack 594 win 265 <nop,nop,timestamp 9451907 9451907>
23:03:20.963909 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 64525 win 96 <nop,nop,timestamp 9451907 9451907>
23:03:20.964101 IP 127.0.0.1.33230 > 127.0.0.1.445: P 594:653(59) ack 64525 win 96 <nop,nop,timestamp 9451907 9451907>
23:03:21.003790 IP 127.0.0.1.445 > 127.0.0.1.33230: . ack 653 win 265 <nop,nop,timestamp 9451917 9451907>
23:03:21.171811 IP 127.0.0.1.445 > 127.0.0.1.33230: P 64525:76813(12288) ack 653 win 265 <nop,nop,timestamp 9451959 9451907>
You get the idea.
Anyway, now THIS is the interesting case, because we have two packets in
the answer, and you see the first half of the answer leaving immediately
(when I wanted the whole answer to be sent) but the second only leaving
after the .2 sec delay.
> If SMB is a one-request-at-a-time protocol (I can never remember),
It is.
> you
> could simulate it with a netperf TCP_RR test by passing suitable values to
> the test-specific -r option:
> netperf -H <remote> -t TCP_RR -- -r <req>,<rsp>
> If that shows similar behaviour then you can ass-u-me it isn't your
> application.
Oh I'm pretty sure it's not my application, because my application performs
well over ethernet, which is after all its purpose. Also I see the
write, the TCP uncork, then a pause, and then the packet leaving.
Felix
Felix von Leitner wrote:
> Thus spake Rick Jones ([email protected]):
>
>>>How could I test this theory?
>>
>>Can you take another trace that isn't so "cooked?" One that just sticks
>>with TCP-level and below stuff?
>
>
> Sorry for taking so long. Here is a tcpdump. The side on port 445 is
> the SMB server using TCP_CORK.
>
> 23:03:20.283772 IP 127.0.0.1.33230 > 127.0.0.1.445: S 1503927325:1503927325(0) win 32792 <mss 16396,sackOK,timestamp 9451736 0,nop,wscale 7>
> 23:03:20.283774 IP 127.0.0.1.445 > 127.0.0.1.33230: S 1513925692:1513925692(0) ack 1503927326 win 32768 <mss 16396,sackOK,timestamp 9451737 9451736,nop,wscale 7>
>
> 23:03:20.283797 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 1 win 257 <nop,nop,timestamp 9451737 9451737>
> 23:03:20.295851 IP 127.0.0.1.33230 > 127.0.0.1.445: P 1:195(194) ack 1 win 257 <nop,nop,timestamp 9451740 9451737>
> 23:03:20.295881 IP 127.0.0.1.445 > 127.0.0.1.33230: . ack 195 win 265 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.295959 IP 127.0.0.1.445 > 127.0.0.1.33230: P 1:87(86) ack 195 win 265 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.295998 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 87 win 256 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296063 IP 127.0.0.1.33230 > 127.0.0.1.445: P 195:287(92) ack 87 win 256 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296096 IP 127.0.0.1.445 > 127.0.0.1.33230: P 87:181(94) ack 287 win 265 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296135 IP 127.0.0.1.33230 > 127.0.0.1.445: P 287:373(86) ack 181 win 255 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296163 IP 127.0.0.1.445 > 127.0.0.1.33230: P 181:239(58) ack 373 win 265 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296201 IP 127.0.0.1.33230 > 127.0.0.1.445: P 373:459(86) ack 239 win 255 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296245 IP 127.0.0.1.445 > 127.0.0.1.33230: P 239:309(70) ack 459 win 265 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296286 IP 127.0.0.1.33230 > 127.0.0.1.445: P 459:535(76) ack 309 win 254 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296314 IP 127.0.0.1.445 > 127.0.0.1.33230: P 309:461(152) ack 535 win 265 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296361 IP 127.0.0.1.33230 > 127.0.0.1.445: P 535:594(59) ack 461 win 253 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.296400 IP 127.0.0.1.445 > 127.0.0.1.33230: . 461:16845(16384) ack 594 win 265 <nop,nop,timestamp 9451740 9451740>
> 23:03:20.335748 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 16845 win 125 <nop,nop,timestamp 9451750 9451740>
>
> [note the .2 sec pause]
I wonder if the ack 16384 win 125 not updating the window is part of it? With a
window scale of 7, the advertised window of 125 is only 16000 bytes, and it
looks based on what follows that TCP has another 16384 to send, so my guess is
that TCP was waiting to have enough window, the persist timer expired and TCP
then had to say "oh well, send what I can" Probably a coupling with this being
less than the MSS (16396) involved too.
> 23:03:20.547763 IP 127.0.0.1.445 > 127.0.0.1.33230: P 16845:32845(16000) ack 594 win 265 <nop,nop,timestamp 9451803 9451750>
> 23:03:20.547797 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 32845 win 0 <nop,nop,timestamp 9451803 9451803>
Notice that an ACK comes-back with a zero window in it - that means that by this
point the receiver still hasn't consumed the 16384+16000 bytes sent to id.
> 23:03:20.547855 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 32845 win 96 <nop,nop,timestamp 9451803 9451803>
Now the receiver has pulled some data, on the order of 96*128 bytes so TCP can
now go ahead and send the remaining 384 bytes.
> 23:03:20.547863 IP 127.0.0.1.445 > 127.0.0.1.33230: P 32845:33229(384) ack 594 win 265 <nop,nop,timestamp 9451803 9451803>
> 23:03:20.547890 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 33229 win 96 <nop,nop,timestamp 9451803 9451803>
>
> [note the .2 sec pause]
I'll bet that 96 * 128 is 12288 and we have another persist timer expiring.
I also wonder if the behaviour might be different if you were using send()
rather than sendfile() - just random musings...
> 23:03:20.755775 IP 127.0.0.1.445 > 127.0.0.1.33230: P 33229:45517(12288) ack 594 win 265 <nop,nop,timestamp 9451855 9451803>
> 23:03:20.755855 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 45517 win 96 <nop,nop,timestamp 9451855 9451855>
> 23:03:20.755868 IP 127.0.0.1.445 > 127.0.0.1.33230: P 45517:49613(4096) ack 594 win 265 <nop,nop,timestamp 9451855 9451855>
> 23:03:20.755898 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 49613 win 96 <nop,nop,timestamp 9451855 9451855>
>
> [another one]
>
> 23:03:20.963789 IP 127.0.0.1.445 > 127.0.0.1.33230: P 49613:61901(12288) ack 594 win 265 <nop,nop,timestamp 9451907 9451855>
> 23:03:20.963871 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 61901 win 96 <nop,nop,timestamp 9451907 9451907>
> 23:03:20.963885 IP 127.0.0.1.445 > 127.0.0.1.33230: P 61901:64525(2624) ack 594 win 265 <nop,nop,timestamp 9451907 9451907>
> 23:03:20.963909 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 64525 win 96 <nop,nop,timestamp 9451907 9451907>
> 23:03:20.964101 IP 127.0.0.1.33230 > 127.0.0.1.445: P 594:653(59) ack 64525 win 96 <nop,nop,timestamp 9451907 9451907>
> 23:03:21.003790 IP 127.0.0.1.445 > 127.0.0.1.33230: . ack 653 win 265 <nop,nop,timestamp 9451917 9451907>
> 23:03:21.171811 IP 127.0.0.1.445 > 127.0.0.1.33230: P 64525:76813(12288) ack 653 win 265 <nop,nop,timestamp 9451959 9451907>
>
> You get the idea.
>
> Anyway, now THIS is the interesting case, because we have two packets in
> the answer, and you see the first half of the answer leaving immediately
> (when I wanted the whole answer to be sent) but the second only leaving
> after the .2 sec delay.
And it wasn't waiting for an ACK/window-update. You could try:
ifconfig lo mtu 1500
and see what happens then.
>>If SMB is a one-request-at-a-time protocol (I can never remember),
> It is.
Joy.
>> you could simulate it with a netperf TCP_RR test by passing suitable values
>> to the test-specific -r option:
>
>>netperf -H <remote> -t TCP_RR -- -r <req>,<rsp>
>
>>If that shows similar behaviour then you can ass-u-me it isn't your
>>application.
>
>
> Oh I'm pretty sure it's not my application, because my application performs
> well over ethernet, which is after all its purpose. Also I see the
> write, the TCP uncork, then a pause, and then the packet leaving.
Well, a wise old engineer tried to teach me that the proper spelling is ass-u-me
:) so just for grins, you might try the TCP_RR test anyway :) And even if your
application is correct (although I wonder why the receiver isn't sucking
data-out very quickly...) if you can reproduce the problem with netperf it will
be easier for others to do so.
rick jones
Thus spake Rick Jones ([email protected]):
> >Oh I'm pretty sure it's not my application, because my application performs
> >well over ethernet, which is after all its purpose. Also I see the
> >write, the TCP uncork, then a pause, and then the packet leaving.
> Well, a wise old engineer tried to teach me that the proper spelling is
> ass-u-me :) so just for grins, you might try the TCP_RR test anyway :) And
> even if your application is correct (although I wonder why the receiver
> isn't sucking data-out very quickly...) if you can reproduce the problem
> with netperf it will be easier for others to do so.
My application is only the server, the receiver is smbget from Samba, so
I don't feel responsible for it :-)
Still, when run over Ethernet, it works fine without waiting for
timeouts to expire.
To reproduce this:
- smbget is from samba, you probably already have this
- gatling (my server) can be gotten from
cvs -d :pserver:[email protected]:/cvs -z9 co dietlibc libowfat gatling
dietlibc is not strictly needed, but it's my environment.
First built dietlibc, then libowfat, then gatling.
Felix
Felix von Leitner wrote:
> Thus spake Rick Jones ([email protected]):
>
>>>Oh I'm pretty sure it's not my application, because my application performs
>>>well over ethernet, which is after all its purpose. Also I see the
>>>write, the TCP uncork, then a pause, and then the packet leaving.
>>
>>Well, a wise old engineer tried to teach me that the proper spelling is
>>ass-u-me :) so just for grins, you might try the TCP_RR test anyway :) And
>>even if your application is correct (although I wonder why the receiver
>>isn't sucking data-out very quickly...) if you can reproduce the problem
>>with netperf it will be easier for others to do so.
>
>
> My application is only the server, the receiver is smbget from Samba, so
> I don't feel responsible for it :-)
Might want to strace it anyway... no good deed (such as reporting a potential
issue) goes unpunished :)
> Still, when run over Ethernet, it works fine without waiting for
> timeouts to expire.
Past performance is no guarantee of current correctness :) And over an
Ethernet, there will be a very different set of both timings and TCP segment
sizes compared to loopback.
My guess is that you will find setting the lo mtu to 1500 a very interesting
experiment.
>
> To reproduce this:
>
> - smbget is from samba, you probably already have this
> - gatling (my server) can be gotten from
> cvs -d :pserver:[email protected]:/cvs -z9 co dietlibc libowfat gatling
>
> dietlibc is not strictly needed, but it's my environment.
> First built dietlibc, then libowfat, then gatling.
>
> Felix
Thus spake Rick Jones ([email protected]):
> Past performance is no guarantee of current correctness :) And over an
> Ethernet, there will be a very different set of both timings and TCP
> segment sizes compared to loopback.
> My guess is that you will find setting the lo mtu to 1500 a very
> interesting experiment.
Setting the MTU on lo to 1500 eliminates the problem and gives me double
digit MB/sec throughput.
Felix
Felix von Leitner wrote:
> Thus spake Rick Jones ([email protected]):
>
>>Past performance is no guarantee of current correctness :) And over an
>>Ethernet, there will be a very different set of both timings and TCP
>>segment sizes compared to loopback.
>
>>My guess is that you will find setting the lo mtu to 1500 a very
>>interesting experiment.
>
> Setting the MTU on lo to 1500 eliminates the problem and gives me double
> digit MB/sec throughput.
I'm not in a position at the moment to test it as my IPoIB systems are offline,
and not sure you are either, but I will note that with IPoIB bits circa OFED1.2
the default MTU for IPoIB goes up to 65520 bytes. If indeed the problem you
were seeing was related to sub-mss sends and window probing and such, it might
appear on IPoIB in addition to loopback.
rick jones