2006-08-15 16:05:41

by Mark Reidenbach

Subject: How to find a sick router with 2.6.17+ and tcp_window_scaling enabled

I had made an earlier post concerning very poor network performance
after upgrading to 2.6.17 and later kernels. The explanation provided by
the e1000 developers was that it was in fact a change to the default tcp
window scaling settings and that there was a broken router somewhere
between my computer and its destination.

After scouring the net for many days trying to find out how to locate
the broken router, I've come up empty. There are many references
explaining why you don't want to disable window scaling completely,
which so far has been my only working solution. Can anyone give
instructions or references as to what the requirements are for a router
to work (specifically Cisco routers)? Is there a minimum required IOS
or certain commands that must be enabled such as any of the following?
ip tcp window-size 8388480
ip tcp selective-ack
ip tcp timestamp

Does anyone have a way to find the broken router if you are not running
the networks involved? I'm almost positive it's our T1 provider, but
after being on the phone with them for a couple hours they insist it's
not their problem and that their routers are configured properly (what
else would you expect them to say after all). There are only 5 hops in
the traceroute between us and a test file they have set up. Below is the
traceroute info:

1 192.168.13.1 (192.168.13.1) 0.319 ms 0.332 ms 0.245 ms
2 nsc69.38.0-110.newsouth.net (69.38.0.110) 2.484 ms 2.107 ms 1.985 ms
3 nsc69.38.3-17.newsouth.net (69.38.3.17) 6.612 ms 6.403 ms 5.986 ms
4 66.64.228.106.nw.nuvox.net (66.64.228.106) 15.357 ms 14.885 ms 15.353 ms
5 virt4.rhetoric.nuvox.net (66.83.21.33) 14.982 ms 14.880 ms 15.102 ms

The only information I have on the routers is:
192.168.13.1: This is our office router and is a Cisco 1811 running
12.3(8)YI1.
69.38.0.110: T1 provider's router installed at our office. This is a
Cisco 2600 series that I was told was running 12.2(10)R, even though I
can't find a 10R release on Cisco's website.


My computer has 2 GB of RAM, if that helps, since it seems the new defaults
are based on system RAM.
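
As a rough illustration of why the advertised scale factor tracks memory: a scale factor can be chosen as the smallest shift that lets the largest receive buffer fit in the 16-bit TCP window field, capped at 14 as RFC 1323 requires. The sketch below is only that arithmetic. The 4 MiB buffer size is an assumed example value, not something read from this machine, and the function name is made up for illustration.

```python
# Sketch: derive a TCP window scale factor from a maximum receive buffer size.
# The 4 MiB buffer used below is an assumed example, not measured on this host.

MAX_WSCALE = 14        # upper limit on the shift count from RFC 1323
MAX_UNSCALED = 65535   # largest value the 16-bit window field can hold


def window_scale_for(buffer_bytes: int) -> int:
    """Return the smallest shift that makes buffer_bytes fit in 16 bits."""
    wscale = 0
    space = buffer_bytes
    while space > MAX_UNSCALED and wscale < MAX_WSCALE:
        space >>= 1
        wscale += 1
    return wscale


if __name__ == "__main__":
    for size in (65535, 128 * 1024, 4 * 1024 * 1024):
        print(f"{size:>8} bytes -> wscale {window_scale_for(size)}")
```

With a 4 MiB buffer this yields 7, which is the wscale value the client offers in the SYN of the capture in this post. Whether this machine's buffer limit really is 4 MiB is an assumption.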

Below is the start of a tcpdump when trying to retrieve the test file
from my T1 provider's server with tcp_window_scaling = 1. I'm somewhat
confused why performance drops from more than 225kB/s with window
scaling disabled to less than 50kB/s with it enabled. It looks like
the test server acks with a wscale of 0, and I thought that would
behave the same as setting tcp_window_scaling to 0.

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
10:49:01.890275 IP backup.truckersexit.com.55805 >
virt4.rhetoric.nuvox.net.http: S 906926812:906926812(0) win 5840 <mss
1460,sackOK,timestamp 6583784 0,nop,wscale 7>
10:49:01.905118 IP virt4.rhetoric.nuvox.net.http >
backup.truckersexit.com.55805: S 4149240349:4149240349(0) ack 906926813
win 5792 <mss 1460,sackOK,timestamp 514424107 6583784,nop,wscale 0>
10:49:01.905128 IP backup.truckersexit.com.55805 >
virt4.rhetoric.nuvox.net.http: . ack 1 win 46 <nop,nop,timestamp 6583786
514424107>
10:49:01.905229 IP backup.truckersexit.com.55805 >
virt4.rhetoric.nuvox.net.http: P 1:115(114) ack 1 win 46
<nop,nop,timestamp 6583786 514424107>
10:49:01.920359 IP virt4.rhetoric.nuvox.net.http >
backup.truckersexit.com.55805: . ack 115 win 5792 <nop,nop,timestamp
514424109 6583786>
10:49:01.932477 IP virt4.rhetoric.nuvox.net.http >
backup.truckersexit.com.55805: . 1:1449(1448) ack 115 win 5792
<nop,nop,timestamp 514424109 6583786>
10:49:01.932484 IP backup.truckersexit.com.55805 >
virt4.rhetoric.nuvox.net.http: . ack 1449 win 69 <nop,nop,timestamp
6583789 514424109>
10:49:01.938473 IP virt4.rhetoric.nuvox.net.http >
backup.truckersexit.com.55805: . 1449:2897(1448) ack 115 win 5792
<nop,nop,timestamp 514424109 6583786>
10:49:01.938481 IP backup.truckersexit.com.55805 >
virt4.rhetoric.nuvox.net.http: . ack 2897 win 91 <nop,nop,timestamp
6583789 514424109>
10:49:01.956837 IP virt4.rhetoric.nuvox.net.http >
backup.truckersexit.com.55805: P 2897:4345(1448) ack 115 win 5792
<nop,nop,timestamp 514424111 6583789>
10:49:01.956843 IP backup.truckersexit.com.55805 >
virt4.rhetoric.nuvox.net.http: . ack 4345 win 114 <nop,nop,timestamp
6583791 514424111>
10:49:01.962834 IP virt4.rhetoric.nuvox.net.http >
backup.truckersexit.com.55805: . 4345:5793(1448) ack 115 win 5792
<nop,nop,timestamp 514424111 6583789>
10:49:01.962839 IP backup.truckersexit.com.55805 >
virt4.rhetoric.nuvox.net.http: . ack 5793 win 137 <nop,nop,timestamp
6583792 514424111>
10:49:01.968830 IP virt4.rhetoric.nuvox.net.http >
backup.truckersexit.com.55805: . 5793:7241(1448) ack 115 win 5792
<nop,nop,timestamp 514424112 6583789>
10:49:01.968835 IP backup.truckersexit.com.55805 >
virt4.rhetoric.nuvox.net.http: . ack 7241 win 159 <nop,nop,timestamp
6583792 514424112>
10:49:01.974952 IP virt4.rhetoric.nuvox.net.http >
backup.truckersexit.com.55805: . 7241:8689(1448) ack 115 win 5792
<nop,nop,timestamp 514424112 6583789>
10:49:01.974956 IP backup.truckersexit.com.55805 >
virt4.rhetoric.nuvox.net.http: . ack 8689 win 182 <nop,nop,timestamp
6583793 514424112>
10:49:01.981323 IP virt4.rhetoric.nuvox.net.http >
backup.truckersexit.com.55805: . 8689:10137(1448) ack 115 win 5792
<nop,nop,timestamp 514424114 6583791>
10:49:01.981327 IP backup.truckersexit.com.55805 >
virt4.rhetoric.nuvox.net.http: . ack 10137 win 204 <nop,nop,timestamp
6583793 514424114>
10:49:01.987319 IP virt4.rhetoric.nuvox.net.http >
backup.truckersexit.com.55805: P 10137:11585(1448) ack 115 win 5792
<nop,nop,timestamp 514424114 6583791>
10:49:01.987323 IP backup.truckersexit.com.55805 >
virt4.rhetoric.nuvox.net.http: . ack 11585 win 227 <nop,nop,timestamp
6583794 514424114>
10:49:01.993316 IP virt4.rhetoric.nuvox.net.http >
backup.truckersexit.com.55805: . 11585:13033(1448) ack 115 win 5792
<nop,nop,timestamp 514424114 6583792>
10:49:01.993320 IP backup.truckersexit.com.55805 >
virt4.rhetoric.nuvox.net.http: . ack 13033 win 250 <nop,nop,timestamp
6583795 514424114>
10:49:01.999437 IP virt4.rhetoric.nuvox.net.http >
backup.truckersexit.com.55805: P 13033:14481(1448) ack 115 win 5792
<nop,nop,timestamp 514424114 6583792>
10:49:01.999441 IP backup.truckersexit.com.55805 >
virt4.rhetoric.nuvox.net.http: . ack 14481 win 272 <nop,nop,timestamp
6583795 514424114>
10:49:02.005434 IP virt4.rhetoric.nuvox.net.http >
backup.truckersexit.com.55805: . 14481:15929(1448) ack 115 win 5792
<nop,nop,timestamp 514424115 6583792>
10:49:02.005438 IP backup.truckersexit.com.55805 >
virt4.rhetoric.nuvox.net.http: . ack 15929 win 295 <nop,nop,timestamp
6583796 514424115>
10:49:02.011306 IP virt4.rhetoric.nuvox.net.http >
backup.truckersexit.com.55805: . 15929:17377(1448) ack 115 win 5792
<nop,nop,timestamp 514424115 6583792>
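
One way to read the capture above: each side's scale factor applies only to the windows that side advertises, and only after the SYN. The server's "wscale 0" therefore means its own windows are unscaled, while the client's "win 46" still stands for 46 << 7 bytes. A device on the path that tracks windows but ignores the client's scale factor would take 46 literally. The sketch below does that arithmetic with values copied from the capture. The function names are illustrative, and the misreading middlebox is a hypothesis, not something the capture proves.

```python
# Sketch: what the client's advertised windows mean with and without scaling.
# Raw window values and the wscale of 7 are copied from the tcpdump output.

CLIENT_WSCALE = 7                    # offered by the client in its SYN
CLIENT_WINDOWS = [46, 69, 91, 114]   # "win" values in the client's first ACKs


def effective_window(raw_window: int, wscale: int) -> int:
    """Window in bytes as the two endpoints understand it."""
    return raw_window << wscale


def misread_window(raw_window: int) -> int:
    """Window in bytes as seen by a hypothetical device ignoring the scale."""
    return raw_window


if __name__ == "__main__":
    for raw in CLIENT_WINDOWS:
        real = effective_window(raw, CLIENT_WSCALE)
        wrong = misread_window(raw)
        print(f"win {raw:>3}: endpoints see {real:>5} bytes, "
              f"a scale-unaware device sees {wrong:>3} bytes")
```

The first line printed is 5888 bytes against 46 bytes. A device that believes the receiver only has room for 46 bytes could treat full 1448-byte segments as out of window, which would be consistent with the slowdown even though the server itself does not scale.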


Please CC me on any replies as I'm not subscribed to the list.
--
Mark Reidenbach
EveryTruckJob.com
[email protected]
Phone: (205)722-9112


2006-08-15 16:41:11

by Alan

Subject: Re: How to find a sick router with 2.6.17+ and tcp_window_scaling enabled

Ar Maw, 2006-08-15 am 11:05 -0500, ysgrifennodd Mark Reidenbach:
> Does anyone have a way to find the broken router if you are not running
> the networks involved?

You are almost certainly looking for a broken/crap NAT box, firewall or
similar product. Routers that are just being routers don't touch the TCP
layer so even if they are broken/crap/ancient they won't do any harm to
it.

The usual offenders are cheap NAT boxes and badly designed load
balancers. They may not even show up in a trace but you should expect
them to be at one end or the other, unless your ISP is providing you
with NATted addresses or some kind of managed security service.

Alan

2006-08-15 17:30:48

by Phil Oester

Subject: Re: How to find a sick router with 2.6.17+ and tcp_window_scaling enabled

On Tue, Aug 15, 2006 at 06:01:47PM +0100, Alan Cox wrote:
> Ar Maw, 2006-08-15 am 11:05 -0500, ysgrifennodd Mark Reidenbach:
> > Does anyone have a way to find the broken router if you are not running
> > the networks involved?
>
> You are almost certainly looking for a broken/crap NAT box, firewall or
> similar product. Routers that are just being routers don't touch the TCP
> layer so even if they are broken/crap/ancient they won't do any harm to
> it.
>
> The usual offenders are cheap NAT boxes and badly designed load
> balancers. They may not even show up in a trace but you should expect
> them to be at one end or the other, unless your ISP is providing you
> with NATted addresses or some kind of managed security service.

Certain versions of BSD ipfilter are also broken. Try some of Apple's
websites for examples.

Is the destination box BSD or behind a BSD firewall?

Phil

2006-08-15 17:47:28

by Mark Reidenbach

Subject: Re: How to find a sick router with 2.6.17+ and tcp_window_scaling enabled

Phil Oester wrote:
> On Tue, Aug 15, 2006 at 06:01:47PM +0100, Alan Cox wrote:
>
>> Ar Maw, 2006-08-15 am 11:05 -0500, ysgrifennodd Mark Reidenbach:
>>
>>> Does anyone have a way to find the broken router if you are not running
>>> the networks involved?
>>>
>> You are almost certainly looking for a broken/crap NAT box, firewall or
>> similar product. Routers that are just being routers don't touch the TCP
>> layer so even if they are broken/crap/ancient they won't do any harm to
>> it.
>>
>> The usual offenders are cheap NAT boxes and badly designed load
>> balancers. They may not even show up in a trace but you should expect
>> them to be at one end or the other, unless your ISP is providing you
>> with NATted addresses or some kind of managed security service.
>>
>
> Certain versions of BSD ipfilter are also broken. Try some of Apple's
> websites for examples.
>
> Is the destination box BSD or behind a BSD firewall?
>
> Phil
>
I'm not sure what OS the T1 provider's box is running. I experience the
same problems trying to access kernel.org or one of my servers hosted at
Verio in Sterling, VA.

Alan Cox says it's most likely a broken NAT box or firewall. I'm not
aware of any firewalls in between my office and my servers in Sterling
other than the Cisco 1811 here in the office, and it is performing NAT
and firewall services for our office. I'm going to try a few cheapo
home routers and see if the problem remains. I would think the Cisco
router would be better off than a home Linksys or Xincom one, but I
figure it's at least worth a try.

Thanks for your help.

Mark Reidenbach
EveryTruckJob.com
[email protected]
Phone: (205)722-9112

2006-08-15 18:07:30

by linux-os (Dick Johnson)

Subject: Re: How to find a sick router with 2.6.17+ and tcp_window_scaling enabled


On Tue, 15 Aug 2006, Mark Reidenbach wrote:

> Phil Oester wrote:
>> On Tue, Aug 15, 2006 at 06:01:47PM +0100, Alan Cox wrote:
>>
>>> Ar Maw, 2006-08-15 am 11:05 -0500, ysgrifennodd Mark Reidenbach:
>>>
>>>> Does anyone have a way to find the broken router if you are not running
>>>> the networks involved?
>>>>
>>> You are almost certainly looking for a broken/crap NAT box, firewall or
>>> similar product. Routers that are just being routers don't touch the TCP
>>> layer so even if they are broken/crap/ancient they won't do any harm to
>>> it.
>>>
>>> The usual offenders are cheap NAT boxes and badly designed load
>>> balancers. They may not even show up in a trace but you should expect
>>> them to be at one end or the other, unless your ISP is providing you
>>> with NATted addresses or some kind of managed security service.
>>>
>>
>> Certain versions of BSD ipfilter are also broken. Try some of Apple's
>> websites for examples.
>>
>> Is the destination box BSD or behind a BSD firewall?
>>
>> Phil
>>
> I'm not sure what OS the T1 provider's box is running. I experience the
> same problems trying to access kernel.org or one of my servers hosted at
> Verio in Sterling, VA.
>
> Alan Cox says it's most likely a broken NAT box or firewall. I'm not
> aware of any firewalls in between my office and my servers in Sterling
> other than the Cisco 1811 here in the office, and it is performing NAT
> and firewall services for our office. I'm going to try a few cheapo
> home routers and see if the problem remains. I would think the Cisco
> router would be better off than a home Linksys or Xincom one, but I
> figure it's at least worth a try.
>
> Thanks for your help.
>
> Mark Reidenbach
> EveryTruckJob.com
> [email protected]
> Phone: (205)722-9112

Some older Ciscos need to be upgraded to handle the ECN bit. If
your router is a couple years old and has not been upgraded, this
might be the problem. A few years ago, vger started running a
new kernel that used ECN. The result was chaos for a few weeks.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.24 on an i686 machine (5592.62 BogoMips).
New book: http://www.AbominableFirebug.com/

2006-08-15 18:15:36

by alex

Subject: Re: How to find a sick router with 2.6.17+ and tcp_window_scaling enabled

> After scouring the net for many days trying to find out how to locate
> the broken router, I've come up empty. There are many references
> explaining why you don't want to disable window scaling completely,
> which so far has been my only working solution. Can anyone give
> instructions or references as to what the requirements are for a router
> to work (specifically Cisco routers)? Is there a minimum required IOS
> or certain commands that must be enabled such as any of the following?
> ip tcp window-size 8388480
> ip tcp selective-ack
> ip tcp timestamp
>

This is absolutely not correct. Routers forward packets. They do not mangle
the data in them.

Alex

2006-08-15 18:24:09

by Willy Tarreau

Subject: Re: How to find a sick router with 2.6.17+ and tcp_window_scaling enabled

On Tue, Aug 15, 2006 at 02:06:34PM -0400, [email protected] wrote:
> > After scouring the net for many days trying to find out how to locate
> > the broken router, I've come up empty. There are many references
> > explaining why you don't want to disable window scaling completely,
> > which so far has been my only working solution. Can anyone give
> > instructions or references as to what the requirements are for a router
> > to work (specifically Cisco routers)? Is there a minimum required IOS
> > or certain commands that must be enabled such as any of the following?
> > ip tcp window-size 8388480
> > ip tcp selective-ack
> > ip tcp timestamp
> >
>
> This is absolutely not correct. Routers forward packets. They do not mangle
> the data in them.

Believe it or not, there are a lot of routers nowadays that can do NAT.
And even for very basic NAT, you have to recompute the TCP checksum, which
means that you mangle data within the packet. Even worse, some of them are
able to NAT complex protocols such as FTP and for this, they need to mangle
the application payload. OK, this should not be the router's job, but it's
often the best placed to do the job, and there is customer demand for this.
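
To make the checksum point concrete: the TCP checksum covers a pseudo-header containing the source and destination IP addresses, so rewriting the source address invalidates it. The sketch below recomputes the checksum from scratch and also patches it incrementally in the style of RFC 1624. The inside host 192.168.13.50 and the public address 203.0.113.10 are made-up example values, the segment is a bare 20-byte SYN header without options, and the function names are illustrative only.

```python
# Sketch: why even basic NAT must touch the TCP checksum.
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit ones' complement sum of data, padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over the IPv4 pseudo-header plus the TCP segment."""
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, socket.IPPROTO_TCP, len(segment)))
    return ~ones_complement_sum(pseudo + segment) & 0xFFFF


def nat_update(checksum: int, old_ip: str, new_ip: str) -> int:
    """Patch a checksum after an address rewrite: HC' = ~(~HC + ~m + m')."""
    old_sum = ones_complement_sum(socket.inet_aton(old_ip))
    new_sum = ones_complement_sum(socket.inet_aton(new_ip))
    total = (~checksum & 0xFFFF) + (~old_sum & 0xFFFF) + new_sum
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    inside, outside, server = "192.168.13.50", "203.0.113.10", "66.83.21.33"
    # Minimal SYN header: ports, seq, ack, data offset, flags, window,
    # checksum field set to zero, urgent pointer.
    syn = struct.pack("!HHIIBBHHH", 55805, 80, 906926812, 0,
                      5 << 4, 0x02, 5840, 0, 0)
    before = tcp_checksum(inside, server, syn)
    after = tcp_checksum(outside, server, syn)
    print(f"before NAT: {before:#06x}  after NAT: {after:#06x}")
    print("incremental update matches:",
          nat_update(before, inside, outside) == after)
```

The incremental form is cheaper because it only needs the old checksum and the changed address words, not the whole segment.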

> Alex

Willy

2006-08-15 18:42:06

by alex

Subject: Re: How to find a sick router with 2.6.17+ and tcp_window_scaling enabled

On Tue, Aug 15, 2006 at 08:19:39PM +0200, Willy Tarreau wrote:

> > This is absolutely not correct. Routers forward packets. They do not mangle
> > the data in them.
>
> Believe it or not, there are a lot of routers nowadays that can do NAT.
> And even for very basic NAT, you have to recompute the TCP checksum, which
> means that you mangle data within the packet. Even worse, some of them are
> able to NAT complex protocols such as FTP and for this, they need to mangle
> the application payload. OK, this should not be the router's job, but it's
> often the best placed to do the job, and there is customer demand for this.

Just because you are using a Linksys/Netgear or god knows what else to
mangle your packets and call that device a router does not mean that normal
service providers have NAT enabled on their GSRs and Junipers.

The issue is not in a router running IOS somewhere. The issue is in the
broken code/broken driver/broken something on the end-point.

Alex

2006-08-15 18:49:39

by Willy Tarreau

Subject: Re: How to find a sick router with 2.6.17+ and tcp_window_scaling enabled

On Tue, Aug 15, 2006 at 02:33:00PM -0400, [email protected] wrote:
> On Tue, Aug 15, 2006 at 08:19:39PM +0200, Willy Tarreau wrote:
>
> > > This is absolutely not correct. Routers forward packets. They do not mangle
> > > the data in them.
> >
> > Believe it or not, there are a lot of routers nowadays that can do NAT.
> > And even for very basic NAT, you have to recompute the TCP checksum, which
> > means that you mangle data within the packet. Even worse, some of them are
> > able to NAT complex protocols such as FTP and for this, they need to mangle
> > the application payload. OK, this should not be the router's job, but it's
> > often the best placed to do the job, and there is customer demand for this.
>
> Just because you are using a Linksys/Netgear or god knows what else to
> mangle your packets and call that device a router

That's not what I call a router !

> does not mean that normal service providers have NAT enabled on
> their GSRs and Junipers.

Not on the PE, but often on the CE.

> The issue is not in a router running IOS somewhere. The issue is in the
> broken code/broken driver/broken something on the end-point.

He may very well have an IOS based 1600 or equivalent doing a very dirty NAT.

> Alex

Willy

2006-08-15 19:53:39

by Mark Reidenbach

Subject: Re: How to find a sick router with 2.6.17+ and tcp_window_scaling enabled

Willy Tarreau wrote:
> He may very well have an IOS based 1600 or equivalent doing a very dirty NAT.
>
> Willy
>
>
Willy, I am in fact running an IOS based NAT/firewall on a 1811. It's
IOS version 12.3(8)YI1. Do you know if this version has a "very dirty
NAT" implementation? If you don't, I think I'll just try a few spare
home routers and see if their NAT implementation is cleaner than my Cisco's.

Mark Reidenbach
EveryTruckJob.com
[email protected]
Phone: (205)722-9112

2006-08-15 20:25:05

by Willy Tarreau

Subject: Re: How to find a sick router with 2.6.17+ and tcp_window_scaling enabled

On Tue, Aug 15, 2006 at 02:53:33PM -0500, Mark Reidenbach wrote:
> Willy Tarreau wrote:
> >He may very well have an IOS based 1600 or equivalent doing a very dirty
> >NAT.
> >
> >Willy
> >
> >
> Willy, I am in fact running an IOS based NAT/firewall on a 1811. It's
> IOS version 12.3(8)YI1. Do you know if this version has a "very dirty
> NAT" implementation? If you don't, I think I'll just try a few spare
> home routers and see if their NAT implementation is cleaner than my Cisco's.

I have absolutely no idea. If they borrowed the session tracking code from
the PIX, you might have window tracking inside it, which might cause what
you observe if it's buggy. But that's just supposition from me.

> Mark Reidenbach
> EveryTruckJob.com
> [email protected]
> Phone: (205)722-9112

Regards,
willy

2006-08-19 16:42:22

by Ralf Hildebrandt

Subject: Re: How to find a sick router with 2.6.17+ and tcp_window_scaling enabled

* Mark Reidenbach <[email protected]>:
> I had made an earlier post concerning very poor network performance
> after upgrading to 2.6.17 and later kernels. The explanation provided by
> the e1000 developers was that it was in fact a change to the default tcp
> window scaling settings and that there was a broken router somewhere
> between my computer and its destination.

Are you looking for a traceroute-/mtr-like tool that sends specially
crafted packets and finally comes up with a message like "the router
between x and y is broken"?

--
Ralf Hildebrandt (i.A. des IT-Zentrums) [email protected]
Charite - Universitätsmedizin Berlin Tel. +49 (0)30-450 570-155
Gemeinsame Einrichtung von FU- und HU-Berlin Fax. +49 (0)30-450 570-962
IT-Zentrum Standort CBF send no mail to [email protected]

2006-09-13 14:00:19

by Mark Reidenbach

Subject: Re: How to find a sick router with 2.6.17+ and tcp_window_scaling enabled

Willy Tarreau wrote:
> On Tue, Aug 15, 2006 at 02:53:33PM -0500, Mark Reidenbach wrote:
>
>> Willy Tarreau wrote:
>>
>>> He may very well have an IOS based 1600 or equivalent doing a very dirty
>>> NAT.
>>>
>>> Willy
>>>
>>>
>>>
>> Willy, I am in fact running an IOS based NAT/firewall on a 1811. It's
>> IOS version 12.3(8)YI1. Do you know if this version has a "very dirty
>> NAT" implementation? If you don't, I think I'll just try a few spare
>> home routers and see if their NAT implementation is cleaner than my Cisco's.
>>
>
> I have absolutely no idea. If they borrowed the session tracking code from
> the PIX, you might have window tracking inside it, which might cause what
> you observe if it's buggy. But that's just supposition from me.
>
>
>> Mark Reidenbach
>> EveryTruckJob.com
>> [email protected]
>> Phone: (205)722-9112
>>
>
> Regards,
> willy
>
>

I'm just posting this as a follow-up in case anyone runs across the same
problem I experienced and can't find a solution. My problem was the
old Cisco IOS [Version 12.3(8)YI1] that came installed on the Cisco 1811
router I purchased in March 2006. I believe this is still the default
shipping version for this router, so anyone who buys one really needs to
get a Cisco support contract for it from a Cisco reseller so they can
download a new/working IOS [Version 12.4(9)T1 was the most recent as of
this email and was the one I used].

Mark Reidenbach
EveryTruckJob.com
[email protected]
Phone: (205)722-9112