2002-08-11 12:15:50

by Meelis Roos

Subject: Linux TCP problem while talking to hostme.bkbits.net ?

Linux 2.4.19 (+i2c/lm_sensors), x86 (Duron 600). I'm trying to do
'bk pull' for the linux-2.4 tree from linux.bkbits.net. This has worked
before (even with the same kernel) but has failed for several days now.
The symptoms are that my Linux box sends 2 packets of data, hostme acks
them, hostme sends 2 packets of data (seen in the tcpdump output on the
Linux box), and my Linux box acks only the second, with SACK. After that,
hostme periodically retransmits its first packet but Linux ignores it.
Broken checksum or something worse?

tcpdump -s 1500 -w pkts.bk from my linux box is attached, here it is in
text form:

15:04:00.102445 pc170.trtcab10a.comtrade.ee.35279 > hostme.bkbits.net.www: S 3220370823:3220370823(0) win 5840 <mss 1460,sackOK,timestamp 26461830 0,nop,wscale 0> (DF)
15:04:00.517063 hostme.bkbits.net.www > pc170.trtcab10a.comtrade.ee.35279: S 3214011989:3214011989(0) ack 3220370824 win 5792 <mss 1460,sackOK,timestamp 454537726 26461830,nop,wscale 0> (DF)
15:04:00.517162 pc170.trtcab10a.comtrade.ee.35279 > hostme.bkbits.net.www: . ack 1 win 5840 <nop,nop,timestamp 26461872 454537726> (DF)
15:04:00.518138 pc170.trtcab10a.comtrade.ee.35279 > hostme.bkbits.net.www: P 1:184(183) ack 1 win 5840 <nop,nop,timestamp 26461872 454537726> (DF)
15:04:00.945486 hostme.bkbits.net.www > pc170.trtcab10a.comtrade.ee.35279: . ack 184 win 6432 <nop,nop,timestamp 454537769 26461872> (DF)
15:04:00.945589 pc170.trtcab10a.comtrade.ee.35279 > hostme.bkbits.net.www: P 184:500(316) ack 1 win 5840 <nop,nop,timestamp 26461915 454537769> (DF)
15:04:01.349910 hostme.bkbits.net.www > pc170.trtcab10a.comtrade.ee.35279: . ack 500 win 7504 <nop,nop,timestamp 454537810 26461915> (DF)
15:04:01.350952 hostme.bkbits.net.www > pc170.trtcab10a.comtrade.ee.35279: P 1:205(204) ack 500 win 7504 <nop,nop,timestamp 454537810 26461915> (DF)
15:04:01.353252 hostme.bkbits.net.www > pc170.trtcab10a.comtrade.ee.35279: P 205:219(14) ack 500 win 7504 <nop,nop,timestamp 454537810 26461915> (DF)
15:04:01.353296 pc170.trtcab10a.comtrade.ee.35279 > hostme.bkbits.net.www: . ack 1 win 5840 <nop,nop,timestamp 26461955 454537810,nop,nop,sack sack 1 {205:219} > (DF)
15:04:01.412649 hostme.bkbits.net.www > pc170.trtcab10a.comtrade.ee.35279: FP 219:1386(1167) ack 500 win 7504 <nop,nop,timestamp 454537816 26461915> (DF)
15:04:03.467400 hostme.bkbits.net.www > pc170.trtcab10a.comtrade.ee.35279: P 1:205(204) ack 500 win 7504 <nop,nop,timestamp 454538025 26461955> (DF)
15:04:07.796399 hostme.bkbits.net.www > pc170.trtcab10a.comtrade.ee.35279: P 1:205(204) ack 500 win 7504 <nop,nop,timestamp 454538455 26461955> (DF)
15:04:16.390267 hostme.bkbits.net.www > pc170.trtcab10a.comtrade.ee.35279: P 1:205(204) ack 500 win 7504 <nop,nop,timestamp 454539315 26461955> (DF)
15:04:33.597306 hostme.bkbits.net.www > pc170.trtcab10a.comtrade.ee.35279: P 1:205(204) ack 500 win 7504 <nop,nop,timestamp 454541035 26461955> (DF)
15:05:07.978241 hostme.bkbits.net.www > pc170.trtcab10a.comtrade.ee.35279: P 1:205(204) ack 500 win 7504 <nop,nop,timestamp 454544475 26461955> (DF)
15:06:16.805594 hostme.bkbits.net.www > pc170.trtcab10a.comtrade.ee.35279: P 1:205(204) ack 500 win 7504 <nop,nop,timestamp 454551355 26461955> (DF)

And so on.

--
Meelis Roos ([email protected])


Attachments:
pkts.bk (4.62 kB)
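
The retransmissions of segment 1:205 in the trace above can be checked for the usual exponential backoff pattern. The following is a minimal sketch, assuming only the timestamps copied from the tcpdump text; the names `SEGMENT_1_205_TIMES` and `retransmit_gaps` are illustrative and not part of any tool.

```python
from datetime import datetime

# Timestamps of hostme's segment 1:205, copied from the tcpdump text above:
# the first transmission followed by six retransmissions.
SEGMENT_1_205_TIMES = [
    "15:04:01.350952",
    "15:04:03.467400",
    "15:04:07.796399",
    "15:04:16.390267",
    "15:04:33.597306",
    "15:05:07.978241",
    "15:06:16.805594",
]


def retransmit_gaps(stamps):
    """Return the gaps in seconds between consecutive timestamps."""
    times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


gaps = retransmit_gaps(SEGMENT_1_205_TIMES)
# Ratio of each gap to the one before it.
ratios = [later / earlier for earlier, later in zip(gaps, gaps[1:])]

for gap in gaps:
    print(f"{gap:8.3f} s")
print("ratios:", [round(r, 2) for r in ratios])
```

Each gap is roughly twice the previous one, which is what a sender's retransmission timer looks like when it never receives an ACK covering the segment. That matches the trace: the only ACK from the Linux box still says "ack 1" and merely SACKs 205:219.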

2002-08-11 21:22:34

by Alexey Kuznetsov

Subject: Re: Linux TCP problem while talking to hostme.bkbits.net ?

Hello!

> before (even with the same kernel) but has failed for several days now.
...
> checksum or something worse?

Well, if this is so easy to reproduce, you could try to answer this. :-)
Look at statistics at least.

Alexey

2002-08-12 07:24:27

by Meelis Roos

Subject: Re: Linux TCP problem while talking to hostme.bkbits.net ?

> Well, if this is so easy to reproduce, you could try to answer this. :-)

Before and after the connection attempt - looks like the number of bad
segments is increasing so probably not a TCP bug.

Since I can connect to this server fine from different locations, it
looks like something (ISP caches/shapers? cable modem?) is corrupting
specific packets in a reproducible way.

--- enne 2002-08-12 10:22:47.000000000 +0300
+++ nyyd 2002-08-12 10:23:37.000000000 +0300
@@ -1,9 +1,9 @@
Ip:
- 490363 total packets received
+ 490378 total packets received
54146 forwarded
0 incoming packets discarded
- 467788 incoming packets delivered
- 391339 requests sent out
+ 467803 incoming packets delivered
+ 391348 requests sent out
Icmp:
32786 ICMP messages received
0 input ICMP message failed.
@@ -16,33 +16,33 @@
destination unreachable: 11
echo replies: 4
Tcp:
- 3012 active connections openings
+ 3013 active connections openings
31 passive connection openings
0 failed connection attempts
0 connection resets received
0 connections established
- 385065 segments received
- 355005 segments send out
+ 385077 segments received
+ 355011 segments send out
437 segments retransmited
- 801 bad segments received.
+ 808 bad segments received.
761 resets sent
Udp:
- 18080 packets received
+ 18083 packets received
11 packets to unknown port received.
0 packet receive errors
- 2873 packets sent
+ 2876 packets sent
TcpExt:
ArpFilter: 0
- 321 TCP sockets finished time wait in fast timer
+ 322 TCP sockets finished time wait in fast timer
3621 delayed acks sent
9 delayed acks further delayed because of locked socket
Quick ack mode was activated 141 times
- 767 packets directly queued to recvmsg prequeue.
+ 777 packets directly queued to recvmsg prequeue.
1241624 packets directly received from backlog
556184 packets directly received from prequeue
298158 packets header predicted
1271 packets header predicted and directly queued to user
- TCPPureAcks: 5053
+ TCPPureAcks: 5056
TCPHPAcks: 16768
TCPRenoRecovery: 0
TCPSackRecovery: 2

--
Meelis Roos ([email protected])
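
The counter changes in the diff above can be tallied mechanically. This is a minimal sketch, assuming a few of the changed TCP lines are pasted in as text; `DIFF_TEXT` holds values copied from the diff, while `counter_deltas` is a hypothetical helper name.

```python
import re

# Changed TCP counters copied from the before/after diff above (a subset).
DIFF_TEXT = """\
- 385065 segments received
+ 385077 segments received
- 355005 segments send out
+ 355011 segments send out
- 801 bad segments received.
+ 808 bad segments received.
"""

# A '-' or '+' marker, then a number, then the counter's label.
LINE_RE = re.compile(r"^([-+])\s*(\d+)\s+(.*)$")


def counter_deltas(diff_text):
    """Map counter label -> (after - before) for paired '-'/'+' lines."""
    before, after = {}, {}
    for line in diff_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        sign, value, label = match.groups()
        (before if sign == "-" else after)[label] = int(value)
    return {label: after[label] - before[label]
            for label in before if label in after}


deltas = counter_deltas(DIFF_TEXT)
for label, delta in deltas.items():
    print(f"{delta:+d}  {label}")
```

For this snapshot pair the tally is 12 more segments received, of which 7 were counted as bad, and 6 more segments sent.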

2002-08-12 10:05:09

by David Miller

Subject: Re: Linux TCP problem while talking to hostme.bkbits.net ?

   From: Meelis Roos <[email protected]>
   Date: Mon, 12 Aug 2002 10:28:10 +0300 (EEST)

   Before and after the connection attempt - looks like the number of bad
   segments is increasing so probably not a TCP bug.

   Since I can connect to this server fine from different locations, it
   looks like something (ISP caches/shapers? cable modem?) is corrupting
   specific packets in a reproducible way.

Apparently something is wrong with the checksums.
InErrs gets incremented in three cases:

1) Header too small, unlikely what you see

2) Bad SYN sequences, Abort On Syn in TcpExt would have been
incremented also if this were the case, it was not

3) Bad TCP checksum

Hmmm, do you have messages that say "hw tcp v4 csum failed"
in your kernel logs? Any other interesting kernel messages?
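
Case 3 refers to the standard TCP checksum: a 16-bit one's-complement sum over an IPv4 pseudo-header plus the TCP header and payload. The following is a minimal Python sketch of that check, not the kernel's code; the addresses, header fields, and payload are made-up illustrative values, and the function names are hypothetical.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def tcp_checksum_ok(src_ip: bytes, dst_ip: bytes, segment: bytes) -> bool:
    """Verify a TCP segment (header + payload) against the IPv4 pseudo-header."""
    # Pseudo-header: source, destination, zero byte, protocol 6 (TCP), length.
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 6, len(segment))
    return internet_checksum(pseudo + segment) == 0


# Illustrative values only (documentation addresses, arbitrary payload).
SRC = bytes([192, 0, 2, 1])
DST = bytes([192, 0, 2, 2])
payload = b"example payload"

# 20-byte TCP header with the checksum field (offset 16) set to zero.
header = struct.pack("!HHIIBBHHH", 80, 35279, 1, 500, 5 << 4, 0x18, 7504, 0, 0)
pseudo = SRC + DST + struct.pack("!BBH", 0, 6, len(header) + len(payload))
csum = internet_checksum(pseudo + header + payload)

# Insert the computed checksum to get a valid segment.
good = header[:16] + struct.pack("!H", csum) + header[18:] + payload

# Flip a single bit in the payload to simulate corruption.
corrupted = bytearray(good)
corrupted[-1] ^= 0x01
corrupted = bytes(corrupted)

print("good segment verifies:     ", tcp_checksum_ok(SRC, DST, good))
print("corrupted segment verifies:", tcp_checksum_ok(SRC, DST, corrupted))
```

A receiver that sums a valid segment, checksum field included, gets zero after complementing. A single flipped bit anywhere in the covered data breaks that, and the segment is dropped and counted as an error.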

2002-08-12 10:10:43

by Meelis Roos

Subject: Re: Linux TCP problem while talking to hostme.bkbits.net ?

> Apparently something is wrong with the checksums.
> InErrs gets incremented in three cases:
>
> 1) Header too small, unlikely what you see
>
> 2) Bad SYN sequences, Abort On Syn in TcpExt would have been
> incremented also if this were the case, it was not
>
> 3) Bad TCP checksum
>
> Hmmm, do you have messages that say "hw tcp v4 csum failed"
> in your kernel logs? Any other interesting kernel messages?

Nothing interesting at all in my kernel logs.

eth1: RealTek RTL8139 Fast Ethernet at 0xd88ad000, 00:c0:df:04:7f:9b, IRQ 10
eth1: Identified 8139 chip type 'RTL-8139B'

eth1 Link encap:Ethernet HWaddr 00:C0:DF:04:7F:9B
inet addr:62.128.97.170 Bcast:255.255.255.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:556705 errors:0 dropped:0 overruns:0 frame:0
TX packets:410116 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:581410743 (554.4 MiB) TX bytes:32798898 (31.2 MiB)
Interrupt:10 Base address:0xd000

--
Meelis Roos ([email protected])

2002-08-12 17:30:32

by Alexey Kuznetsov

Subject: Re: Linux TCP problem while talking to hostme.bkbits.net ?

Hello!

> Apparently something is wrong with the checksums.

The problem is that checksum in tcpdump is OK.
This smells really bad.

I feel you have to hunt where exactly the segment is dropped
and TCPInErrs is incremented.

> Hmmm, do you have messages that say "hw tcp v4 csum failed"

It is a dumb Realtek card, so the checksum is calculated in software.

Alexey

2002-08-13 08:36:25

by Meelis Roos

Subject: Re: Linux TCP problem while talking to hostme.bkbits.net ?

> The problem is that checksum in tcpdump is OK.
> This smells really bad.
>
> I feel you have to hunt where exactly the segment is dropped
> and TCPInErrs is incremented.

Things got stranger. The symptoms started to appear on other connections
too (slashdot web for example). TCP bad packet count increased and no
success was made. I did a reboot with the same kernel (2.4.19+bk of some
state, 4. Aug probably) and it just started to work with the same kernel
image.

Now I will test and see if the symptoms appear again after some days.

--
Meelis Roos e-mail: [email protected]
www: http://www.cs.ut.ee/~mroos/

2002-08-13 12:28:11

by Roger Gammans

Subject: Re: Linux TCP problem while talking to hostme.bkbits.net ?

On Tue, Aug 13, 2002 at 11:40:08AM +0300, Meelis Roos wrote:
> > The problem is that checksum in tcpdump is OK.
> > This smells really bad.
> >
> > I feel you have to hunt where exactly the segment is dropped
> > and TCPInErrs is incremented.
>
> Things got stranger. The symptoms started to appear on other connections
> [....]
> I did a reboot with the same kernel (2.4.19+bk of some
> state, 4. Aug probably) and it just started to work with the same kernel
> image.

I've seen these sort of symptoms before and it turned out
to be faulty memory.

Back in 2.2 I had a box which picked the behavior up if
you did an ifconfig down/ifconfig up after it had been running
for some time.

tcpdump on the local box in that case showed OK (outgoing) packets, but
another box on the same network segment showed the same packets as
corrupted.

Changing the RAM cured it completely.

TTFN
--
Roger.
Master of Peng Shui. (Ancient oriental art of Penguin Arranging)
GPG Key FPR: CFF1 F383 F854 4E6A 918D 5CFF A90D E73B 88DE 0B3E


Attachments:
(No filename) (1.00 kB)
(No filename) (232.00 B)

2002-08-13 13:36:35

by Alexey Kuznetsov

Subject: Re: Linux TCP problem while talking to hostme.bkbits.net ?

Hello!

> Things got stranger. The symptoms started to appear

:-) Hey, if we are not going to finish in lunatic asylum we need to do
something constructive yet. Please, find the place where the packet is dropped.

I do not understand what happens at all. It is surely not a plain
packet corruption because tcpdump shows correct packet. Hmm... of course,
provided you make tcpdump on receiving host.


Alexey

2002-08-13 13:44:53

by Larry McVoy

Subject: Re: Linux TCP problem while talking to hostme.bkbits.net ?

I just became aware of this thread. Dave, if you want a login so you can
poke around, let me know. The config on our end is

Linux hostme 2.4.5 #1 Mon May 28 10:54:32 PDT 2001 i686 unknown
6:45am up 54 days, 16:17, 3 users, load average: 1.39, 1.88, 2.06

I poked around in the logs and didn't see anything obvious.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-08-13 13:53:50

by Meelis Roos

Subject: Re: Linux TCP problem while talking to hostme.bkbits.net ?

> :-) Hey, if we are not going to finish in lunatic asylum we need to do
> something constructive yet. Please, find the place where the packet is dropped.

It has not occurred any more since the reboot. I will add some printk's,
recompile and wait for it to happen _if_ it ever happens again.

Since it's gone now I currently suspect bad RAM the most. But no other
strange things have happened so I don't really know.

> I do not understand what happens at all. It is surely not a plain
> packet corruption because tcpdump shows correct packet. Hmm... of course,
> provided you make tcpdump on receiving host.

Yes, they were from the receiving host.

--
Meelis Roos ([email protected])

2002-08-15 05:31:20

by Clint Byrum

Subject: Re: Linux TCP problem while talking to hostme.bkbits.net ?

> Things got stranger. The symptoms started to appear on other connections
> too (slashdot web for example). TCP bad packet count increased and no
> success was made. I did a reboot with the same kernel (2.4.19+bk of some
> state, 4. Aug probably) and it just started to work with the same kernel
> image.
>

I am seeing this exact same problem on 2.4.18+ipsec+grsecurity, though
mine is on an Intel NIC..

eth0: OEM i82557/i82558 10/100 Ethernet, 00:02:B3:1E:41:4D, IRQ 10.
eth0 Link encap:Ethernet HWaddr 00:02:B3:1E:41:4D
inet addr:10.30.3.2 Bcast:10.30.3.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:110268 errors:0 dropped:0 overruns:0 frame:0
TX packets:4613 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:26566206 (25.3 MiB) TX bytes:430171 (420.0 KiB)
Interrupt:10

This box was up for about 15 days without problems, and then started
exhibiting this behavior of not seeing SYN-ACKs, though in my case it
happened on all attempts at making an outgoing TCP connection. My
tcpdump was nearly identical to your earlier post. I rebooted it, and it
has not had any problems with outgoing connections since (a whole 6
hours).

> Now I will test and see if the symptoms appear again after some days.
>

That is where I am at too. If worst comes to worst, I will break out the
memtest86 disk and see if it is in fact bad RAM causing this stuff.


btw... I am not subscribed to lkml, so please CC: me on any replies to
this. Thanks. :)


2002-09-19 13:11:46

by Meelis Roos

Subject: Re: Linux TCP problem while talking to hostme.bkbits.net ?

> I've seen these sort of symptoms before and it turned out
> to be faulty memory.

The mystery has been solved - after the symptoms appeared again (this
time certain outgoing packets didn't get through), I ran memtest86
overnight and it showed bad memory.

--
Meelis Roos e-mail: [email protected]
www: http://www.cs.ut.ee/~mroos/

2002-09-19 19:27:21

by David Miller

Subject: Re: Linux TCP problem while talking to hostme.bkbits.net ?

   From: Meelis Roos <[email protected]>
   Date: Thu, 19 Sep 2002 16:15:52 +0300 (EEST)

   > I've seen these sort of symptoms before and it turned out
   > to be faulty memory.

   The mystery has been solved - after the symptoms appeared again (this
   time certain outgoing packets didn't get through), I ran memtest86
   overnight and it showed bad memory.

Thanks for following up on this.