2003-01-23 12:11:40

by Martin Devera

Subject: TCP probably broken in W2K

Hi All,

this is not exactly a Linux problem, but it is VERY interesting,
and as I'm a Linux developer I'm posting it here.

Situation:
A Win2k workstation (service pack unknown at the moment, as
I have no access to it) connecting via FTP to Linux 2.4.18.
We tried several programs on w2k, both clients and servers.
On Linux we tried proftpd, ncftp and wget.
We connect over the Internet: DSL + w2k in Italy, leased line +
Linux in the Czech Republic, both 256 kbit.

Symptoms:
Files transferred from w2k are corrupted inside. I've written
an analyzer which tries to locate each block of the corrupted file
in a good copy of it. We used a 96MB test file. Here are the details:

block 0:43032576(43032577) matched at 0 (off 0)
block 43032577:43159553(126977) matched at 43036672 (off 4095)
block 43159554:55193602(12034049) matched at 43167744 (off 8190)
block 55193603:57233410(2039808) matched at 55205888 (off 12285)
block 57233411:88076291(30842881) matched at 57233411 (off 0)
block 88076292:97873920(9797629) matched at 88080387 (off 4095)

The offset 0 at line 5 is there because we "resumed" the download
from that point.
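
The analyzer itself was not posted. What follows is only a minimal
Python sketch of the idea, assuming that data was dropped but never
inserted and that a 16-byte probe is enough to resynchronise. The
names locate_blocks, format_block and probe are made up for this
illustration and are not the author's code. It walks both copies in
step and, at each mismatch, searches the good copy for the next bytes
of the bad one (falling back to a search from the start, which covers
a resumed download).

```python
def locate_blocks(bad: bytes, good: bytes, probe: int = 16):
    """Split `bad` into runs that also occur verbatim in `good`.

    Returns (bad_start, bad_end, good_start) tuples, ends inclusive.
    """
    blocks = []
    b = g = 0
    while b < len(bad):
        needle = bad[b:b + probe]
        pos = good.find(needle, g)      # dropped data: look further ahead
        if pos < 0:
            pos = good.find(needle)     # resumed transfer: look anywhere
        if pos < 0:
            break                       # rest of `bad` is not in `good`
        start_b, g = b, pos
        # Extend the run for as long as both copies agree.
        while b < len(bad) and g < len(good) and bad[b] == good[g]:
            b += 1
            g += 1
        blocks.append((start_b, b - 1, pos))
    return blocks


def format_block(block):
    """Render one run in the style of the listing above."""
    bad_start, bad_end, good_start = block
    length = bad_end - bad_start + 1
    return (f"block {bad_start}:{bad_end}({length}) "
            f"matched at {good_start} (off {good_start - bad_start})")
```

Run on the contents of xxx.bad and xxx.ok, format_block would print
lines in the same shape as the listing above. The byte-by-byte loop is
slow in pure Python, so a real tool for a 96MB file would compare in
larger chunks.
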
From the TCP stream, 4095 bytes are missing at some places, see:

[devik@devix]$ hexdump -f /hfmt -s 43032570 -n 16 xxx.ok
2909ffa: b7 3b 00 76 80 f2 86 76 d6 db 59 e5 d4 41 fe 04
                              ^^^^^^^^^^^^^^^^^^^^^
[devik@devix]$ hexdump -f /hfmt -s 43032570 -n 16 xxx.bad
2909ffa: b7 3b 00 76 80 f2 86 fa 19 75 6f a4 fc 7f 35 cc
                     P1-^^ P2 vvvvvvvvvvvvvvvvvvvvvvvvvv
[devik@devix]$ hexdump -f /hfmt -s $[43032570+4095] -n 16 xxx.ok
290aff9: 59 d7 9f 6f 50 b4 c8 fa 19 75 6f a4 fc 7f 35 cc

You can see that xxx.bad agrees with the good copy up to 80 f2 86.
Then 76 d6 should follow, but fa 19 is there instead (which is
4095 bytes later in the good copy of the file).
Here is the relevant part of the tcpdump of the xxx.bad transfer:

0x04b0 ebe7 8f5c 6b37 f1ff 773d c567 ef07 6489 ...\k7..w=.g..d.
0x04c0 cfa3 c809 f3c4 84e8 2008 b73b 0076 80f2 ...........;.v..
                                            ^^-P1
x.x.x.x.1384 > y.y.y.y.12151: P 43032576:43034028(1452) ack 1 win 17424 (DF)
0x0000 4500 05d4 33c8 4000 6f06 3618 5074 24cc E...3.@.o.6.Pt$.
0x0010 d597 516c 0568 2f77 1325 dcdf 485a c2f7 ..Ql.h/w.%..HZ..
0x0020 5018 4410 56c7 0000 86fa 1975 6fa4 fc7f P.D.V......uo...
                           ^^-P2
0x0030 35cc 9649 6d79 de7c f93f 0faf f3f8 6fdb 5..Imy.|.?....o.

(Marks P1 and P2 are here so you can easily pair the two listings above)
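
As a cross-check of where the payload starts, the header bytes of the
second packet can be decoded from the dump above. This is a small
Python sketch using only the bytes shown there. It assumes plain IPv4
carrying TCP, and decode_headers and DUMP are names chosen for this
illustration.

```python
import struct

# First 48 bytes of the second packet, copied from the dump above.
DUMP = (
    "4500 05d4 33c8 4000 6f06 3618 5074 24cc "
    "d597 516c 0568 2f77 1325 dcdf 485a c2f7 "
    "5018 4410 56c7 0000 86fa 1975 6fa4 fc7f"
)


def decode_headers(dump: str) -> dict:
    pkt = bytes.fromhex(dump.replace(" ", ""))
    ip_len = (pkt[0] & 0x0F) * 4              # IPv4 header length (IHL)
    total = struct.unpack("!H", pkt[2:4])[0]  # IPv4 total length
    tcp = pkt[ip_len:]
    sport, dport = struct.unpack("!HH", tcp[0:4])
    tcp_len = (tcp[12] >> 4) * 4              # TCP header length (data offset)
    window = struct.unpack("!H", tcp[14:16])[0]
    return {
        "sport": sport,
        "dport": dport,
        "push": bool(tcp[13] & 0x08),         # PSH flag
        "window": window,
        "payload_len": total - ip_len - tcp_len,
        "payload": tcp[tcp_len:],             # only the bytes present in DUMP
    }
```

For these bytes the result is ports 1384 and 12151, window 17424, PUSH
set and a payload length of 1452, which matches the tcpdump summary
line, and the payload begins with 86 fa 19.
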
You can see that the break is NOT at a packet boundary but rather
inside the second packet (after its first byte). It means that there
is no problem with the TCP transfer itself (the TCP checksum and
sequence numbers are also OK). It seems as if the sending computer
(win2k + DSL modem) feeds the system calls badly. I'd blame the FTP
client (which runs on win2k), but the same problem occurred when win2k
was running as the FTP server.
So we probably have a bug in win2k, maybe paging related (4095 is too
similar to 4096, which is the page size).
There were more missing data blocks in the file, and ALL of them are
4095 bytes long. And the last byte before a missing block is ALWAYS
the first byte of a new TCP packet, as in the example above.
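
The sizes of the missing blocks can be re-checked from the analyzer
listing near the top of this message. The Python sketch below assumes
a 4096-byte page (the guess made above) and uses made-up names
(check_gaps, PAGE, LISTING). For every gap it reports the gap size
and where the good copy picks up again, counted from the point where
the transfer last (re)started, modulo the page size.

```python
import re

PAGE = 4096  # assumed page size, as guessed in the post

LISTING = """\
block 0:43032576(43032577) matched at 0 (off 0)
block 43032577:43159553(126977) matched at 43036672 (off 4095)
block 43159554:55193602(12034049) matched at 43167744 (off 8190)
block 55193603:57233410(2039808) matched at 55205888 (off 12285)
block 57233411:88076291(30842881) matched at 57233411 (off 0)
block 88076292:97873920(9797629) matched at 88080387 (off 4095)
"""

ROW = re.compile(r"block (\d+):(\d+)\((\d+)\) matched at (\d+) \(off (\d+)\)")


def check_gaps(listing: str):
    """Return (gap_size, pick_up_position % PAGE) for every gap."""
    results = []
    origin = prev_off = 0
    for row in ROW.findall(listing):
        matched, off = int(row[3]), int(row[4])
        if off == 0:
            origin = matched          # transfer (re)started here
        else:
            results.append((off - prev_off, (matched - origin) % PAGE))
        prev_off = off
    return results
```

For the six blocks listed, every gap comes out as 4095 bytes and every
pick-up point is a whole number of 4096-byte pages from the start of
its transfer. That fits the paging guess but does not prove it.
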
Also, the first packet of the pair is always SHORTER THAN 1452 bytes
and has the PUSH bit set - it seems that it was the last packet of
a "send" syscall block. See:
P 43023192:43024384(1192) ack 1 win 17424 (DF)
. 43024384:43025836(1452) ack 1 win 17424 (DF)
. 43025836:43027288(1452) ack 1 win 17424 (DF)
P 43027288:43028480(1192) ack 1 win 17424 (DF)
. 43028480:43029932(1452) ack 1 win 17424 (DF)
. 43029932:43031384(1452) ack 1 win 17424 (DF)
P 43031384:43032576(1192) ack 1 win 17424 (DF)
-- missing 4095 bytes here ---
P 43032576:43034028(1452) ack 1 win 17424 (DF)
. 43034028:43035480(1452) ack 1 win 17424 (DF)
P 43035480:43036673(1193) ack 1 win 17424 (DF)

You can see that a typical sequence contains 4096
bytes (1192+1452+1452). Then a PUSH packet is emitted.
At the place where data is missing in the listing above
you see two PUSHes in a row.
It seems as if some win32 routine missed one page
but noticed it somehow, because the last packet has
size 1193 instead of 1192.
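
The sizes in the trace above can also be added up mechanically. This
is a small Python sketch with the trace lines shortened to the fields
it uses. The names segments, contiguous and TRACE are made up for
this illustration.

```python
import re

TRACE = """\
P 43023192:43024384(1192)
. 43024384:43025836(1452)
. 43025836:43027288(1452)
P 43027288:43028480(1192)
. 43028480:43029932(1452)
. 43029932:43031384(1452)
P 43031384:43032576(1192)
P 43032576:43034028(1452)
. 43034028:43035480(1452)
P 43035480:43036673(1193)
"""


def segments(trace: str):
    """Parse lines into (flag, seq_start, seq_end, size) tuples."""
    pat = re.compile(r"([P.]) (\d+):(\d+)\((\d+)\)")
    return [(f, int(s), int(e), int(n)) for f, s, e, n in pat.findall(trace)]


def contiguous(segs) -> bool:
    """True if every segment starts where the previous one ended."""
    return all(a[2] == b[1] for a, b in zip(segs, segs[1:]))


segs = segments(TRACE)
before = sum(size for _, _, _, size in segs[4:7])   # last full group
after = sum(size for _, _, _, size in segs[7:10])   # group after the gap
```

With these ten lines, contiguous(segs) is true, so there is no hole in
the sequence numbers on the wire; before is 4096 and after is 4097,
i.e. one byte more than a page.
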

Does anyone know about this bug, or where I should ask?

-------------------------------
Martin Devera aka devik
Linux kernel QoS/HTB maintainer
http://luxik.cdi.cz/~devik/


2003-01-23 15:47:17

by Michael T. Babcock

Subject: Re: [LARTC] TCP probably broken in W2K

devik wrote:

>this is not exactly Linux problem but it is VERY interesting
>and as I'm linux developer I'm posting it here.
>
>

I have Win2k and XP machines and I've had many symptoms of a broken
TCP/IP stack in my TCP programming on those platforms. I communicate
with Linux 2.2.19 and 2.2.21 machines running tcpserver _a lot_ from
these, and I have to do a number of strange things to make it all work.
I don't know if my layer-4 problems would help, but if they would, I
could try and describe them for you.

--
Michael T. Babcock
C.T.O., FibreSpeed Ltd.
http://www.fibrespeed.net/~mbabcock