2001-02-21 00:09:00

by Vibol Hou

Subject: netdev issues (3c905B)

Hi,

I have some problems on a heavily loaded web server. The first is that the
kernel is spitting out a bunch of "NETDEV WATCHDOG: eth0: transmit timed
out" errors. I do not recall this happening in 2.4.0 under the same
conditions.

Another problem, which clients have also reported, is that the server has
trouble talking to clients on modems. This didn't occur with the 2.2 series
kernel (all other things held constant). Each time a modem client tries to
load any site on the server, the connection just dies (or stalls).
High-bandwidth connections (DSL and up) seem fine, but I tried connecting
through my dial-up account with Earthlink, and the reports seem to be true.
Can those of you on a 56k modem try connecting to
http://khmerconnection.com and see if the page loads? Apache isn't the only
service affected. It seems *any* TCP communication runs like a turtle (even
SSH takes minutes to log in, then minutes to echo each letter; it doesn't do
this over a DSL connection from the same computer).

The card that is exhibiting this problem is a 3c905B (lspci below):

00:08.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)
	Subsystem: 3Com Corporation: Unknown device 9055
	Flags: bus master, medium devsel, latency 80, IRQ 17
	I/O ports at e400 [size=128]
	Memory at e8001000 (32-bit, non-prefetchable) [size=128]
	Expansion ROM at e4000000 [disabled] [size=128K]
	Capabilities: [dc] Power Management version 1

dmesg shows hordes of these at high peak usage (300KBps+):

NETDEV WATCHDOG: eth0: transmit timed out
eth0: transmit timed out, tx_status 00 status e601.
diagnostics: net 0cd8 media 8880 dma 0000003a.
eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
Flags; bus-master 1, full 0; dirty 9256291(3) current 9256291(3).
Transmit list 00000000 vs. f7de5230.
0: @f7de5200 length 80000042 status 00010042
1: @f7de5210 length 8000004a status 8001004a
2: @f7de5220 length 80000036 status 80010036
3: @f7de5230 length 80000036 status 00010036
4: @f7de5240 length 80000042 status 00010042
5: @f7de5250 length 80000036 status 00010036
6: @f7de5260 length 800005ea status 000105ea
7: @f7de5270 length 800005ea status 000105ea
8: @f7de5280 length 8000003a status 0001003a
9: @f7de5290 length 8000003e status 0001003e
10: @f7de52a0 length 8000003a status 0001003a
11: @f7de52b0 length 8000003e status 0001003e
12: @f7de52c0 length 8000003e status 0001003e
13: @f7de52d0 length 8000004a status 0001004a
14: @f7de52e0 length 8000004a status 0001004a
15: @f7de52f0 length 8000003e status 0001003e
eth0: Resetting the Tx ring pointer.

Any ideas?

Thanks,
--
Vibol Hou


2001-02-21 00:19:40

by Martin Moerman

Subject: Re: netdev issues (3c905B)



Vibol,

I see that the card is on IRQ 17?

Can you send us /proc/interrupts?

/Martin



2001-02-21 00:37:23

by Vibol Hou

Subject: RE: netdev issues (3c905B)

Hi Martin,

Here's /proc/interrupts:

           CPU0       CPU1
  0:    2748043    2754927    IO-APIC-edge  timer
  1:          2          0    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  4:       2737       2892    IO-APIC-edge  serial
 17:    9573612    9568840   IO-APIC-level  eth0
 18:     483436     482421   IO-APIC-level  aic7xxx
NMI:    5505505    5505399
LOC:    5502609    5502508
ERR:          0

-Vibol


2001-02-21 09:49:10

by Ookhoi

Subject: 2.4 tcp very slow under certain circumstances (Re: netdev issues (3c905B))

Hi!

> Another problem that I seem to have, of which I have had reports from
> clients, is that the server has problems talking to clients using modems
> This didn't occur before with the 2.2 series kernel (all other things held
> constant). It seems each time a client tries to load up any site on the
> server, the connection will just die (or stall). This does not apply to
> high-bandwidth connections (DSL and up) since everything seems fine on DSL
> and faster, but I tried connecting using my dial-up account with Earthlink,
> and the reports seem to be true. Can those of you on a 56k modem try
> connecting to http://khmerconnection.com and see if the page loads? Apache
> isn't the only service affected. It seems *any* TCP communication runs like
> a turtle (even SSH. takes minutes to login, then minutes to echo each
> letter. doesn't do this on a DSL connection from the same computer).
>
> The card that is exhibiting this problem is a 3c905B (lspci below):

[cut]

We have exactly the same problem, but in our case it depends on the
following three conditions: (1) kernel 2.4 (2.2 is fine), (2) Windows IP
header compression turned on, and (3) a free internet access provider in
Holland called 'Wish' (which seems to stand for 'I Wish I had a faster
connection').
If we remove any one of the three conditions, the connection is OK. Only
TCP is affected.
A packet on its way from the Linux server to the Windows client seems to
get dropped once and retransmitted. This makes the connection _very_ slow.

It seems that Simon has the same problem. Can I provide tcpdumps to help
find the cause of this problem? I'm not sure yet whether this only happens
with 3Com NICs. I will test that.

Ookhoi


Date: Fri, 16 Feb 2001 20:02:11 -0500
From: Simon Kirby <[email protected]>
To: [email protected], [email protected]
Cc: Alan Evetts <[email protected]>
Subject: Re: 2.4 TCP(?) timeouts

On Fri, Feb 16, 2001 at 07:08:05PM -0500, Simon Kirby wrote:

> Hello,
>
> Today we put 2.4.1 on our mail server after having seen it perform well on
> some other boxes. It seems now we are receiving a few calls every hour
> from customers reporting that the server tends to hang and eventually
> time out on them when downloading mail. All customers that have reported
> this problem so far are on a dialup connection. Apparently the server
> will stop transmitting data (or the client seems to think so), and then
> their mail client will time out.

We recorded a trace on the mail server end to one of the customers having
the problem. At first they closed the connection because their mail
client was set to a timeout of 1 minute, but then when they changed it to
5 seconds, it seemed to limp along further. It seems to me just like
there's a huge amount of packet loss, but pinging the machine just after
this shows 0% loss (just occasional jumps in response time).

During this trace, when long periods of nothing went by, "netstat -tan
|grep ip" showed nothing abnormal: a 0 byte receive queue and some
data in the send queue equal to what would be retransmitted and
eventually go through two minutes later.

nmap:
Remote operating system guess: Windows 2000 Professional, Build 2128

16:26:14.738836 < client.1104 > mail.pop3: S 1263956200:1263956200(0) win 8760 <mss 536,nop,nop,sackOK> (DF)
16:26:14.738888 > mail.pop3 > client.1104: S 26894293:26894293(0) ack 1263956201 win 5840 <mss 1460,nop,nop,sackOK> (DF)
16:26:15.014145 < client.1104 > mail.pop3: . 1:1(0) ack 1 win 9112 (DF)
16:26:15.014866 > mail.pop3 > client.1104: P 1:92(91) ack 1 win 5840 (DF)
16:26:15.291998 < client.1104 > mail.pop3: P 1:16(15) ack 92 win 9021 (DF)
16:26:15.292199 > mail.pop3 > client.1104: . 92:92(0) ack 16 win 5840 (DF)
16:26:15.292305 > mail.pop3 > client.1104: P 92:115(23) ack 16 win 5840 (DF)
16:26:16.686295 > mail.pop3 > client.1104: P 92:115(23) ack 16 win 5840 (DF)
16:26:16.954563 < client.1104 > mail.pop3: P 16:30(14) ack 115 win 8998 (DF)
16:26:16.976908 > mail.pop3 > client.1104: P 115:137(22) ack 30 win 5840 (DF)
16:26:19.776322 > mail.pop3 > client.1104: P 115:137(22) ack 30 win 5840 (DF)
16:26:20.033951 < client.1104 > mail.pop3: P 30:36(6) ack 137 win 8976 (DF)
16:26:20.034063 > mail.pop3 > client.1104: P 137:149(12) ack 36 win 5840 (DF)
16:26:25.626301 > mail.pop3 > client.1104: P 137:149(12) ack 36 win 5840 (DF)
16:26:25.922151 < client.1104 > mail.pop3: P 36:42(6) ack 149 win 8964 (DF)
16:26:25.922254 > mail.pop3 > client.1104: P 149:219(70) ack 42 win 5840 (DF)
16:26:36.949499 < client.1104 > mail.pop3: P 36:42(6) ack 149 win 8964 (DF)
16:26:36.949533 > mail.pop3 > client.1104: . 219:219(0) ack 42 win 5840 <nop,nop, sack 1 {36:42} > (DF)
16:26:37.116302 > mail.pop3 > client.1104: P 149:219(70) ack 42 win 5840 (DF)
16:26:37.380554 < client.1104 > mail.pop3: P 42:50(8) ack 219 win 8894 (DF)
16:26:37.380645 > mail.pop3 > client.1104: . 219:219(0) ack 50 win 5840 (DF)
16:26:37.380709 > mail.pop3 > client.1104: P 219:231(12) ack 50 win 5840 (DF)
16:26:59.567440 < client.1104 > mail.pop3: P 42:50(8) ack 219 win 8894 (DF)
16:26:59.567476 > mail.pop3 > client.1104: . 231:231(0) ack 50 win 5840 <nop,nop, sack 1 {42:50} > (DF)
16:26:59.776301 > mail.pop3 > client.1104: P 219:231(12) ack 50 win 5840 (DF)
16:27:00.043125 < client.1104 > mail.pop3: P 50:59(9) ack 231 win 8882 (DF)
16:27:00.043186 > mail.pop3 > client.1104: . 231:231(0) ack 59 win 5840 (DF)
16:27:00.043475 > mail.pop3 > client.1104: . 231:767(536) ack 59 win 5840 (DF)
16:27:00.043491 > mail.pop3 > client.1104: P 767:1220(453) ack 59 win 5840 (DF)
16:27:44.399831 < client.1104 > mail.pop3: P 50:59(9) ack 231 win 8882 (DF)
16:27:44.399869 > mail.pop3 > client.1104: . 1220:1220(0) ack 59 win 5840 <nop,nop, sack 1 {50:59} > (DF)
16:27:44.836304 > mail.pop3 > client.1104: . 231:767(536) ack 59 win 5840 (DF)
16:27:45.295946 < client.1104 > mail.pop3: . 59:59(0) ack 767 win 9112 (DF)
16:27:45.296003 > mail.pop3 > client.1104: P 767:1220(453) ack 59 win 5840 (DF)
16:29:14.886322 > mail.pop3 > client.1104: P 767:1220(453) ack 59 win 5840 (DF)
16:29:15.264417 < client.1104 > mail.pop3: P 59:67(8) ack 1220 win 8659 (DF)
16:29:15.264479 > mail.pop3 > client.1104: . 1220:1220(0) ack 67 win 5840 (DF)
16:29:15.265127 > mail.pop3 > client.1104: . 1220:1756(536) ack 67 win 5840 (DF)
16:29:15.265145 > mail.pop3 > client.1104: . 1756:2292(536) ack 67 win 5840 (DF)
16:30:45.187652 < client.1104 > mail.pop3: P 59:67(8) ack 1220 win 8659 (DF)
16:30:45.187727 > mail.pop3 > client.1104: . 2292:2292(0) ack 67 win 5840 <nop,nop, sack 1 {59:67} > (DF)
16:31:16.326378 > mail.pop3 > client.1104: . 1220:1756(536) ack 67 win 5840 (DF)
16:31:17.513053 < client.1104 > mail.pop3: . 67:67(0) ack 1756 win 9112 (DF)
16:31:17.513129 > mail.pop3 > client.1104: . 1756:2292(536) ack 67 win 5840 (DF)
16:31:17.513143 > mail.pop3 > client.1104: . 2292:2828(536) ack 67 win 5840 (DF)
16:33:17.506376 > mail.pop3 > client.1104: . 1756:2292(536) ack 67 win 5840 (DF)
16:33:17.919146 < client.1104 > mail.pop3: . 67:67(0) ack 2292 win 9112 (DF)
16:33:17.919198 > mail.pop3 > client.1104: . 2292:2828(536) ack 67 win 5840 (DF)
16:33:17.919211 > mail.pop3 > client.1104: . 2828:3364(536) ack 67 win 5840 (DF)
16:35:17.916383 > mail.pop3 > client.1104: . 2292:2828(536) ack 67 win 5840 (DF)
16:35:18.401250 < client.1104 > mail.pop3: . 67:67(0) ack 2828 win 9112 (DF)
16:35:18.401394 > mail.pop3 > client.1104: . 2828:3364(536) ack 67 win 5840 (DF)
16:35:18.401414 > mail.pop3 > client.1104: . 3364:3900(536) ack 67 win 5840 (DF)
16:37:18.396373 > mail.pop3 > client.1104: . 2828:3364(536) ack 67 win 5840 (DF)
16:37:21.763859 < client.1104 > mail.pop3: . 67:67(0) ack 3364 win 9112 (DF)
16:37:21.764049 > mail.pop3 > client.1104: . 3364:3900(536) ack 67 win 5840 (DF)
16:37:21.764062 > mail.pop3 > client.1104: . 3900:4436(536) ack 67 win 5840 (DF)
16:42:22.308578 < client.1104 > mail.pop3: F 67:67(0) ack 3364 win 9112 (DF)
16:42:22.308625 > mail.pop3 > client.1104: R 26897657:26897657(0) win 0 (DF)

I'm not sure how the last part happened, but my guess is that the server
had already closed the socket locally and was waiting for its next
transmission to say so. The RST would then be the response to the
customer's FIN arriving after the socket was already closed.

Would any of the networking changes in 2.4.1pre3 affect what is happening
here?

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]
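
A note for readers following the trace above: the gap between each data
segment and its retransmission roughly doubles (about 1.4 s, 2.8 s, 5.6 s,
11 s, 22 s, 45 s, 90 s) and then levels off near 120 s. That pattern is
consistent with exponential backoff of the TCP retransmission timeout. The
following is a minimal sketch in C, not kernel code: the 1400 ms starting
value is read off the trace, the 120 s cap is an assumption based on what the
trace shows, and `rto_backoff_ms` and `RTO_MAX_MS` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cap on the retransmission timeout, in milliseconds. */
#define RTO_MAX_MS 120000u

/*
 * Fill out[0..n-1] with successive retransmission timeouts in milliseconds.
 * Each timeout doubles the previous one until it would exceed RTO_MAX_MS,
 * after which it stays at the cap. Returns the sum of the timeouts written.
 */
unsigned long rto_backoff_ms(unsigned int initial_ms, unsigned int *out, size_t n)
{
    unsigned long total = 0;
    unsigned int rto = initial_ms;

    assert(initial_ms > 0 && initial_ms <= RTO_MAX_MS);

    for (size_t i = 0; i < n; i++) {
        out[i] = rto;
        total += rto;
        /* Double the timeout, but never go past the cap. */
        rto = (rto > RTO_MAX_MS / 2) ? RTO_MAX_MS : rto * 2;
    }
    return total;
}
```

Calling `rto_backoff_ms(1400, out, 9)` yields a sequence that doubles from
1400 ms and then stays at the cap, which is why a connection that loses one
packet per exchange needs minutes to move a few kilobytes.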

2001-02-21 10:59:24

by David Miller

Subject: Re: 2.4 tcp very slow under certain circumstances (Re: netdev issues (3c905B))


Ookhoi writes:
> We have exactly the same problem but in our case it depends on the
> following three conditions: 1, kernel 2.4 (2.2 is fine), 2, windows ip
> header compression turned on, 3, a free internet access provider in
> Holland called 'Wish' (which seemes to stand for 'I Wish I had a faster
> connection').
> If we remove one of the three conditions, the connection is oke. It is
> only tcp which is affected.
> A packet on its way from linux server to windows client seems to get
> dropped once and retransmitted. This makes the connection _very_ slow.

:-( I hate these buggy systems.

Does this patch below fix the performance problem and are the windows
clients win2000 or win95?

--- include/net/ip.h.~1~ Mon Feb 19 00:12:31 2001
+++ include/net/ip.h Wed Feb 21 02:56:15 2001
@@ -190,9 +190,11 @@

 static inline void ip_select_ident(struct iphdr *iph, struct dst_entry *dst)
 {
+#if 0
 	if (iph->frag_off&__constant_htons(IP_DF))
 		iph->id = 0;
 	else
+#endif
 		__ip_select_ident(iph, dst);
 }

2001-02-21 11:34:39

by Ookhoi

Subject: Re: 2.4 tcp very slow under certain circumstances (Re: netdev issues (3c905B))

Hi David,

> > We have exactly the same problem but in our case it depends on the
> > following three conditions: 1, kernel 2.4 (2.2 is fine), 2, windows ip
> > header compression turned on, 3, a free internet access provider in
> > Holland called 'Wish' (which seemes to stand for 'I Wish I had a faster
> > connection').
> > If we remove one of the three conditions, the connection is oke. It is
> > only tcp which is affected.
> > A packet on its way from linux server to windows client seems to get
> > dropped once and retransmitted. This makes the connection _very_ slow.
>
> :-( I hate these buggy systems.
>
> Does this patch below fix the performance problem and are the windows
> clients win2000 or win95?

It is 95 in our case. I'll test the patch today and report back to you.
Thanks a lot!

Ookhoi



2001-02-21 13:12:37

by Gregory Maxwell

Subject: Re: 2.4 tcp very slow under certain circumstances (Re: netdev issues (3c905B))

On Wed, Feb 21, 2001 at 10:47:24AM +0100, Ookhoi wrote:
[snip]
> We have exactly the same problem but in our case it depends on the
> following three conditions: 1, kernel 2.4 (2.2 is fine), 2, windows ip
> header compression turned on, 3, a free internet access provider in
> Holland called 'Wish' (which seemes to stand for 'I Wish I had a faster
> connection').
> If we remove one of the three conditions, the connection is oke. It is
> only tcp which is affected.
> A packet on its way from linux server to windows client seems to get
> dropped once and retransmitted. This makes the connection _very_ slow.
[snip]

It's been true for some time now that several firewalls, RAS, and NAT
devices break TCP connections in subtle but horrible ways when they
encounter SACK, timestamps, header compression, or other 'exotic'
features.

Has anyone compiled a list of such bugs so that a test application could be
created?

2001-02-21 17:18:02

by Ookhoi

Subject: Re: 2.4 tcp very slow under certain circumstances (Re: netdev issues (3c905B))

Hi David!

> > We have exactly the same problem but in our case it depends on the
> > following three conditions: 1, kernel 2.4 (2.2 is fine), 2, windows ip
> > header compression turned on, 3, a free internet access provider in
> > Holland called 'Wish' (which seemes to stand for 'I Wish I had a faster
> > connection').
> > If we remove one of the three conditions, the connection is oke. It is
> > only tcp which is affected.
> > A packet on its way from linux server to windows client seems to get
> > dropped once and retransmitted. This makes the connection _very_ slow.
>
> :-( I hate these buggy systems.
>
> Does this patch below fix the performance problem and are the windows
> clients win2000 or win95?

Yes, the problem is fixed! Thank you very much. :-) 'great' patch!

Ookhoi



2001-02-21 19:10:40

by Vibol Hou

Subject: RE: 2.4 tcp very slow under certain circumstances (Re: netdev issues (3c905B))

Win2K here, I'll apply the patch and let you know what happens.

-Vibol




2001-02-21 19:25:22

by Vibol Hou

Subject: RE: 2.4 tcp very slow under certain circumstances (Re: netdev issues (3c905B))

It looks like the patch fixed the problem. TCP communications over modem
seem fine now with the same settings that didn't work earlier.

-Vibol




2001-02-21 22:31:48

by Jordan Mendelson

Subject: Re: 2.4 tcp very slow under certain circumstances (Re: netdev issues (3c905B))

"David S. Miller" wrote:
>
> Ookhoi writes:
> > We have exactly the same problem but in our case it depends on the
> > following three conditions: 1, kernel 2.4 (2.2 is fine), 2, windows ip
> > header compression turned on, 3, a free internet access provider in
> > Holland called 'Wish' (which seemes to stand for 'I Wish I had a faster
> > connection').
> > If we remove one of the three conditions, the connection is oke. It is
> > only tcp which is affected.
> > A packet on its way from linux server to windows client seems to get
> > dropped once and retransmitted. This makes the connection _very_ slow.
>
> :-( I hate these buggy systems.
>
> Does this patch below fix the performance problem and are the windows
> clients win2000 or win95?

I wanted to see if this would fix the problem I was seeing with Win9x
users on PPP w/ compression dialing up to Earthlink in the bay area
(there are others, but it's the only one I can reproduce).

I compiled 2.4.1 with this change and for some odd reason, the kernel
started dropping packets and became unusable (couldn't ssh in) after
around 4050 connections were opened. I tested it also with 2.4.1-ac20
and had the same problem right around 4050 connections.

This is on a VA Linux box with dual eepro100's (one used) connected to a
Cisco 6509.




2001-02-21 23:50:21

by Jordan Mendelson

Subject: Re: 2.4 tcp very slow under certain circumstances (Re: netdev issues (3c905B))

"David S. Miller" wrote:
>
> Ookhoi writes:
> > We have exactly the same problem but in our case it depends on the
> > following three conditions: 1, kernel 2.4 (2.2 is fine), 2, windows ip
> > header compression turned on, 3, a free internet access provider in
> > Holland called 'Wish' (which seemes to stand for 'I Wish I had a faster
> > connection').
> > If we remove one of the three conditions, the connection is oke. It is
> > only tcp which is affected.
> > A packet on its way from linux server to windows client seems to get
> > dropped once and retransmitted. This makes the connection _very_ slow.
>
> :-( I hate these buggy systems.
>
> Does this patch below fix the performance problem and are the windows
> clients win2000 or win95?

Just a note however... this patch did fix the problem we were seeing
with retransmits and Win95 compressed PPP and dialup over earthlink in
the bay area.

Now, if it didn't have the side effect of dropping packets left and
right after ~4000 open connections (simultaneously), I could finally
move our production system to 2.4.x.



Jordan

2001-02-21 23:55:31

by David Miller

Subject: Re: 2.4 tcp very slow under certain circumstances (Re: netdev issues (3c905B))


Jordan Mendelson writes:
> Now, if it didn't have the side effect of dropping packets left and
> right after ~4000 open connections (simultaneously), I could finally
> move our production system to 2.4.x.

There is no reason my patch should have this effect.

All of this appears to be a bug in Windows TCP header compression: if
the ID field of the IPv4 header does not change, it drops every other
packet.

The change I posted is unacceptable as-is because it adds unnecessary
cost to a fast path. The final change I actually use will likely
involve using the TCP sequence numbers to calculate an "always
changing" ID number in the IPv4 headers to placate these broken
Windows machines.

Later,
David S. Miller
[email protected]
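
As an illustration of the idea described above, here is a minimal user-space
sketch in C of one way an "always changing" ID could be derived from TCP
sequence numbers: keep a 16-bit counter per connection, seed it from the
connection's initial sequence number, and bump it for every packet sent. This
is only one possible reading of the message, not the change that was merged.
`struct conn_ident`, `conn_ident_init` and `conn_ident_next` are hypothetical
names that exist nowhere in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-connection state; not a kernel structure. */
struct conn_ident {
    uint16_t next_id;
};

/*
 * Seed the counter by folding the 32-bit initial TCP sequence number
 * into 16 bits (high half XOR low half).
 */
void conn_ident_init(struct conn_ident *c, uint32_t initial_seq)
{
    assert(c != NULL);
    c->next_id = (uint16_t)(initial_seq ^ (initial_seq >> 16));
}

/*
 * Return the IPv4 ID for the next outgoing packet of this connection.
 * The counter wraps modulo 2^16, so consecutive packets never share an ID.
 */
uint16_t conn_ident_next(struct conn_ident *c)
{
    assert(c != NULL);
    return c->next_id++;
}
```

Because the state lives with the connection, this variant needs no lookup and
no shared counter on the transmit path, which is the cost concern raised in
the message.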

2001-02-22 00:11:42

by Jordan Mendelson

Subject: Re: 2.4 tcp very slow under certain circumstances (Re: netdev issues (3c905B))

"David S. Miller" wrote:
>
> Jordan Mendelson writes:
> > Now, if it didn't have the side effect of dropping packets left and
> > right after ~4000 open connections (simultaneously), I could finally
> > move our production system to 2.4.x.
>
> There is no reason my patch should have this effect.

My guess is that the fast path prevented the need for looking up the
destination in some structure which is limited to ~4K entries (route
table?).


Jordan

2001-02-22 00:50:59

by Jordan Mendelson

Subject: Re: 2.4 tcp very slow under certain circumstances (Re: netdev issues (3c905B))

"David S. Miller" wrote:
>
> Jordan Mendelson writes:
> > Now, if it didn't have the side effect of dropping packets left and
> > right after ~4000 open connections (simultaneously), I could finally
> > move our production system to 2.4.x.
>
> The change I posted as-is, is unacceptable because it adds unnecessary
> cost to a fast path. The final change I actually use will likely
> involve using the TCP sequence numbers to calculate an "always
> changing" ID number in the IPv4 headers to placate these broken
> windows machines.

Just for kicks I modified the fast path to use a globally incremented
count to see if it would fix both the Win9x problem and my 4K connection
problem, and it appears to be working just fine.

What probably happened was that, without the fast path, the sheer number
of packets at 4K connections slowed everything down to a crawl.


Thanks Dave,

Jordan
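
The modified fast path itself was not posted. A rough user-space sketch in C
of what "a globally incremented count" might look like follows, under loud
assumptions: `struct ip_hdr_sketch`, `SKETCH_IP_DF` and `select_ident_global`
are made-up names, the header is kept in host byte order for simplicity, and
C11 atomics stand in for whatever synchronization the real change used.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the two IPv4 header fields involved; host byte order. */
struct ip_hdr_sketch {
    uint16_t id;
    uint16_t frag_off;
};

/* Don't Fragment bit within frag_off (host byte order in this sketch). */
#define SKETCH_IP_DF 0x4000u

/* One counter shared by every connection; zero-initialized at startup. */
static atomic_uint global_id_counter;

/*
 * Fast path: when DF is set, take the next value of the shared counter
 * instead of writing a constant 0 into the ID field.
 * Returns 1 if the ID was assigned here, or 0 if DF is clear and the
 * caller must fall back to the slower per-destination selection.
 */
int select_ident_global(struct ip_hdr_sketch *iph)
{
    assert(iph != NULL);

    if (iph->frag_off & SKETCH_IP_DF) {
        iph->id = (uint16_t)atomic_fetch_add(&global_id_counter, 1u);
        return 1;
    }
    return 0;
}
```

The trade-off in this sketch is that a single shared counter is cheap to read
but becomes a point of contention between CPUs under load.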

2001-02-22 08:28:39

by Ookhoi

Subject: Re: 2.4 tcp very slow under certain circumstances (Re: netdev issues (3c905B))

Hi Jordan,

> > > We have exactly the same problem but in our case it depends on the
> > > following three conditions: 1, kernel 2.4 (2.2 is fine), 2, windows ip
> > > header compression turned on, 3, a free internet access provider in
> > > Holland called 'Wish' (which seemes to stand for 'I Wish I had a faster
> > > connection').
> > > If we remove one of the three conditions, the connection is oke. It is
> > > only tcp which is affected.
> > > A packet on its way from linux server to windows client seems to get
> > > dropped once and retransmitted. This makes the connection _very_ slow.
> >
> > :-( I hate these buggy systems.
> >
> > Does this patch below fix the performance problem and are the windows
> > clients win2000 or win95?
>
> I wanted to see if this would fix the problem I was seeing with Win9x
> users on PPP w/ compression dialing up to Earthlink in the bay area
> (there are others, but it's the only one I can reproduce).
>
> I compiled 2.4.1 with this change and for some odd reason, the kernel
> started dropping packets and became unusable (couldn't ssh in) after
> around 4050 connections were opened. I tested it also with 2.4.1-ac20
> and had the same problem right around 4050 connections.
>
> This is on a VA Linux box with dual eepro100's (one used) connected to a
> Cisco 6509.

I patched two computers, both running 2.4.1-ac20. One of them is a fairly
loaded webserver. They have uptimes of 15.15 and 16.30 hours, and are fine.
I didn't test with that many connections though.

Ookhoi

2001-02-27 00:21:44

by Simon Kirby

Subject: Re: 2.4 tcp very slow under certain circumstances (Re: netdev issues (3c905B))

On Wed, Feb 21, 2001 at 03:52:37PM -0800, David S. Miller wrote:

> There is no reason my patch should have this effect.
>
> All of this is what appears to be a bug in Windows TCP header
> compression, if the ID field of the IPv4 header does not change then
> it drops every other packet.
>
> The change I posted as-is, is unacceptable because it adds unnecessary
> cost to a fast path. The final change I actually use will likely
> involve using the TCP sequence numbers to calculate an "always
> changing" ID number in the IPv4 headers to placate these broken
> windows machines.

Has such a patch gone in to the kernel yet?

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]

2001-02-27 00:30:14

by David Miller

Subject: Re: 2.4 tcp very slow under certain circumstances (Re: netdev issues (3c905B))


Simon Kirby writes:
> Has such a patch gone in to the kernel yet?

Yep, it is in both the zerocopy and AC patches. (Linus is
away at the moment)

Later,
David S. Miller
[email protected]