One of those very-hard-to-track-down, trivial-to-fix kind of problems:
without this patch, TCP roundtrip time measurements will corrupt the
routing cache's RTT estimates under heavy network load (the bug causes
RTAX_RTT to go negative, but since its type is u32, you end up with a
huge positive value...). From there on, later TCP connections quickly
will go south.
The typo was introduced 8 months ago in v1.29 of the file by the patch
entitled "Cleanup DST metrics and abstrct MSS/PMTU further".
--david
===== net/ipv4/tcp_input.c 1.36 vs edited =====
--- 1.36/net/ipv4/tcp_input.c Mon Apr 28 09:27:57 2003
+++ edited/net/ipv4/tcp_input.c Tue Jun 3 08:19:36 2003
@@ -556,8 +556,8 @@
if (m >= dst_metric(dst, RTAX_RTTVAR))
dst->metrics[RTAX_RTTVAR-1] = m;
else
- dst->metrics[RTAX_RTT-1] -=
- (dst->metrics[RTAX_RTT-1] - m)>>2;
+ dst->metrics[RTAX_RTTVAR-1] -=
+ (dst->metrics[RTAX_RTTVAR-1] - m)>>2;
}
if (tp->snd_ssthresh >= 0xFFFF) {
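(For reference, the corrected line is just the usual smoothing step,
now applied to the right slot: new rttvar = rttvar - (rttvar - m)/4 =
3/4 * rttvar + 1/4 * m, with >>2 acting as an integer divide by four.
The buggy version applied that same delta to the RTT slot instead,
which is how a large variance sample could push RTAX_RTT "negative"
and wrap it.)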
(trimmed CC line and added netdev)
On Tue, 2003-06-03 at 17:52, David Mosberger wrote:
> One of those very-hard-to-track-down, trivial-to-fix kind of problems:
> without this patch, TCP roundtrip time measurements will corrupt the
> routing cache's RTT estimates under heavy network load (the bug causes
> RTAX_RTT to go negative, but since its type is u32, you end up with a
> huge positive value...). From there on, later TCP connections quickly
> will go south.
>
> The typo was introduced 8 months ago in v1.29 of the file by the patch
> entitled "Cleanup DST metrics and abstrct MSS/PMTU further".
I tested this patch and it looks like it has cured my mysterious TCP
stalls.
Without the patch:
cache mtu 1500 rtt 479411ms rttvar 953813ms cwnd 46 advmss 1460
I see that before and during the stall when not using this patch.
(rtt is never above 20ms according to ping)
With the patch I see normal rtt and rttvar times.
Haven't seen a stall yet (~30 kernel compiles with distcc over a
sometimes congested link); will continue testing.
> ===== net/ipv4/tcp_input.c 1.36 vs edited =====
> --- 1.36/net/ipv4/tcp_input.c Mon Apr 28 09:27:57 2003
> +++ edited/net/ipv4/tcp_input.c Tue Jun 3 08:19:36 2003
> @@ -556,8 +556,8 @@
> if (m >= dst_metric(dst, RTAX_RTTVAR))
> dst->metrics[RTAX_RTTVAR-1] = m;
> else
> - dst->metrics[RTAX_RTT-1] -=
> - (dst->metrics[RTAX_RTT-1] - m)>>2;
> + dst->metrics[RTAX_RTTVAR-1] -=
> + (dst->metrics[RTAX_RTTVAR-1] - m)>>2;
> }
>
> if (tp->snd_ssthresh >= 0xFFFF) {
--
/Martin
>>>>> On 03 Jun 2003 19:41:11 +0200, Martin Josefsson <[email protected]> said:
Martin> (trimmed CC line and added netdev) On Tue, 2003-06-03 at
Martin> 17:52, David Mosberger wrote:
>> One of those very-hard-to-track-down, trivial-to-fix kind of
>> problems: without this patch, TCP roundtrip time measurements
>> will corrupt the routing cache's RTT estimates under heavy
>> network load (the bug causes RTAX_RTT to go negative, but since
>> its type is u32, you end up with a huge positive value...). From
>> there on, later TCP connections quickly will go south.
>> The typo was introduced 8 months ago in v1.29 of the file by the
>> patch entitled "Cleanup DST metrics and abstrct MSS/PMTU
>> further".
Martin> I tested this patch and it looks like it has cured my
Martin> mysterious TCP stalls.
Yes, this sounds reasonable. I wasn't very clear on this point, but
"by going south" I meant that TCP is starting to misbehave. In
particular, you'll likely end up with the kernel aborting ESTABLISHED
TCP connections with extreme prejudice (and in violation of the TCP
protocol), because it thought that it had been unable to communicate
with the remote end for a _very_ long time. The net effect typically
is that you end up with one end having a connection that's in the
ESTABLISHED state and the other end having no trace of that
connection.
--david
On Tue, 3 Jun 2003, David Mosberger wrote:
> Martin> I tested this patch and it looks like it has cured my
> Martin> mysterious TCP stalls.
>
> Yes, this sounds reasonable. I wasn't very clear on this point, but
> "by going south" I meant that TCP is starting to misbehave. In
> particular, you'll likely end up with the kernel aborting ESTABLISHED
> TCP connections with extreme prejudice (and in violation of the TCP
> protocol), because it thought that it had been unable to communicate
> with the remote end for a _very_ long time. The net effect typically
> is that you end up with one end having a connection that's in the
> ESTABLISHED state and the other end having no trace of that
> connection.
David,
This might be the solution to one of the networking 'must-fix' bugs,
which nobody so far has quite been able to track down.
- James
--
James Morris
<[email protected]>
Hello!
> This might be the solution to one of the networking 'must-fix' bugs,
> which nobody so far has quite been able to track down.
No doubts. All the symptoms are explained by this. I hope Andrew
will confirm that the problem has gone.
Alexey
[email protected] wrote:
> No doubts. All the symptoms are explained by this. I hope Andrew
> will confirm that the problem has gone.
Yep, great catch! But, FYI, DaveM and Alexey, we tried
reproducing the stalls we (Dave Hansen, Troy Wilson) had
seen during SpecWeb99 runs and couldn't reproduce them on
2.5.69 (same config, etc.). So it's possible our hang/stalls
were some other issue that got silently fixed (or more
likely, possibly the same thing but other changes minimized
us running into the problem).
thanks,
Nivedita
From: Nivedita Singhvi <[email protected]>
Date: Tue, 03 Jun 2003 19:01:25 -0700
But, FYI, DaveM and Alexey, we tried
reproducing the stalls we (Dave Hansen, Troy Wilson) had
seen during SpecWeb99 runs and couldn't reproduce them on
2.5.69 (same config, etc.). So it's possible our hang/stalls
were some other issue that got silently fixed (or more
likely, possibly the same thing but other changes minimized
us running into the problem).
I think this means nothing, and that you can infer nothing from such
results.
My understanding is that the problem case triggers only when a timeout
based retransmit occurs. On LAN this tends to be extremely rare.
Although under enough traffic load it can occur.
So if your old SpecWEB99 lab tended more to trigger timeout based
retransmits on LAN, and your new test network does not, then your new
test network will tend to not reproduce the bug regardless of whether
the bug is present in the kernel or not :-)
>>>>> On Tue, 03 Jun 2003 20:23:20 -0700 (PDT), "David S. Miller" <[email protected]> said:
DaveM> From: Nivedita Singhvi <[email protected]> Date: Tue, 03 Jun
DaveM> 2003 19:01:25 -0700
DaveM> But, FYI, DaveM and Alexey, we tried reproducing the
DaveM> stalls we (Dave Hansen, Troy Wilson) had seen during
DaveM> SpecWeb99 runs and couldn't reproduce them on 2.5.69 (same
DaveM> config, etc.). So it's possible our hang/stalls were some other
DaveM> issue that got silently fixed (or more likely, possibly the
DaveM> same thing but other changes minimized us running into the
DaveM> problem).
DaveM> I think this means nothing, and that you can infer nothing
DaveM> from such results.
DaveM> My understanding is that the problem case triggers only when
DaveM> a timeout based retransmit occurs. On LAN this tends to be
DaveM> extremely rare. Although under enough traffic load it can
DaveM> occur.
DaveM> So if your old SpecWEB99 lab tended more to trigger timeout
DaveM> based retransmits on LAN, and your new test network does not,
DaveM> then your new test network will tend to not reproduce the bug
DaveM> regardless of whether the bug is present in the kernel or not
DaveM> :-)
Is this where I get to plug httperf? It triggered the bug reliably in
less than 10 secs. ;-)
--david
David Mosberger wrote:
> DaveM> So if your old SpecWEB99 lab tended more to trigger timeout
> DaveM> based retransmits on LAN, and your new test network does not,
> DaveM> then your new test network will tend to not reproduce the bug
> DaveM> regardless of whether the bug is present in the kernel or not
> DaveM> :-)
>
> Is this where I get to plug httperf? It triggered the bug reliably in
> less than 10 secs. ;-)
Tarnation!! Ran httperf! Didn't hit it! :( What were your
settings?
I extracted an old debug patch that implements dropping of
packets - it has a sysctl that controls the rate at which I can
drop IP packets, so I can also generate any kind of packet
loss. So I thought I would bang away with netperf using
sendfile()/TCP_CORK, since I thought the problem was in that code path.
I'll be running tests tomorrow and the rest of this
week on 2.5.70 with and without the patch, and will see if I can
provoke any further hangs, stalls, or wackiness of any flavor...
thanks,
Nivedita
From: David Mosberger <[email protected]>
Date: Tue, 3 Jun 2003 21:35:55 -0700
Is this where I get to plug httperf? It triggered the bug reliably in
less than 10 secs. ;-)
distcc was a reliable test case too...
>>>>> On Tue, 03 Jun 2003 21:40:18 -0700, Nivedita Singhvi <[email protected]> said:
Nivedita> David Mosberger wrote:
DaveM> So if your old SpecWEB99 lab tended more to trigger timeout
DaveM> based retransmits on LAN, and your new test network does not,
DaveM> then your new test network will tend to not reproduce the bug
DaveM> regardless of whether the bug is present in the kernel or not
DaveM> :-)
>> Is this where I get to plug httperf? It triggered the bug
>> reliably in less than 10 secs. ;-)
Nivedita> Tarnation!! Ran httperf! Didn't hit it! :( What were your
Nivedita> settings?
I used:
$ httperf --rate 1000 --num-conns 1000000 --verbose --hog --server HOST \
--uri pathto30KBfile
on 3 clients (for a total of 3000 conns/sec). You can't go higher
than 1000 conn/sec per client (IP address) because otherwise you run
out of port space (due to TIME_WAIT).
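(Rough arithmetic behind that ceiling, assuming the usual 60-second
TIME_WAIT and that --hog lets httperf use roughly the whole
1024-65535 port range: 1000 conn/sec * 60 s = 60,000 sockets parked
in TIME_WAIT, i.e. essentially every port one source address has.)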
This load worked well for a machine with a single GigE card. All
network tunables were on the default setting (in particular, the tx
queue len was 300, which is where the losses came from).
With this load, I saw bad RTT values in the route cache within a
couple of seconds after starting the third httperf generator. It then
took a bit longer (on the order of 1-2 minutes) until the first
TCPAbortFailed errors started to pop up.
--david
From: David Mosberger <[email protected]>
Date: Tue, 3 Jun 2003 22:34:30 -0700
You can't go higher than 1000 conn/sec per client (IP address)
because otherwise you run out of port space (due to TIME_WAIT).
echo "1" >/proc/sys/net/ipv4/tcp_tw_recycle
It should eliminate this limit. Unfortunately we can't enable
this by default because of NAT :(
>>>>> On Tue, 03 Jun 2003 22:52:45 -0700 (PDT), "David S. Miller" <[email protected]> said:
David> From: David Mosberger <[email protected]> Date:
David> Tue, 3 Jun 2003 22:34:30 -0700
David> You can't go higher than 1000 conn/sec per client (IP
David> address) because otherwise you run out of port space (due to
David> TIME_WAIT).
DaveM> echo "1" >/proc/sys/net/ipv4/tcp_tw_recycle
DaveM> It should eliminate this limit. Unfortunately we can't
DaveM> enable this by default because of NAT :(
Ah, yes, provided PAWS is enabled, this would give you a time_wait
timeout of 3.5*RTO. Nice.
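(As a rough number, assuming the common 200 ms minimum RTO on a LAN,
3.5*RTO works out to about 0.7 s spent in TIME_WAIT instead of 60 s,
so the per-client port-space ceiling effectively disappears.)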
--david
David Mosberger wrote:
> $ httperf --rate 1000 --num-conns 1000000 --verbose --hog --server HOST \
> --uri pathto30KBfile
Hmm, ditto, except I was way down at --rate 300 (I was seeing client
errors of fd-unavail). I've raised the ulimit but am still seeing
them...
> on 3 clients (for a total of 3000 conns/sec). You can't go higher
> than 1000 conn/sec per client (IP address) because otherwise you run
> out of port space (due to TIME_WAIT).
You can enable /proc/sys/net/ipv4/tcp_tw_recycle for that.
> This load worked well for a machine with a single GigE card. All
> network tunables were on the default setting (in particular, the tx
> queue len was 300, which is where the losses came from).
>
> With this load, I saw bad RTT values in the route cache within a
> couple of seconds after starting the third httperf generator. It then
> took a bit longer (on the order of 1-2 minutes) until the first
> TCPAbortFailed errors started to pop up
I saw a few AbortOnTimeouts, but no AbortFailed counts.
Those should be TCPAbortOnTimeout counts, rather than TCPAbortFailed
errors, I would expect? Why AbortFailed? Coming from IP via
tcp_transmit_skb()?
thanks,
Nivedita
>>>>> On Tue, 03 Jun 2003 23:04:02 -0700, Nivedita Singhvi <[email protected]> said:
Nivedita> Those should be TCPAbortOnTimeout counts, rather than
Nivedita> TCPAbortFailed errors, I would expect? Why AbortFailed?
Nivedita> Coming from IP via tcp_transmit_skb()?
Yes, the "connection hangs/disappearances" where triggered by
TCPAbortOnTimeout; the TCPAbortFailed errors were indicating that
tcp_transmit_skb() had failed, i.e., the tx queue was overrun (that's
were the losses came from).
--david
From: David Mosberger <[email protected]>
Date: Tue, 3 Jun 2003 23:19:31 -0700
Yes, the "connection hangs/disappearances" where triggered by
TCPAbortOnTimeout;
This is correct.
And it is the reason the connection dies silently: such write
timeouts invoke tcp_done(), which closes the connection off silently.
This is correct behavior (sans the RTT bug David fixed, of course :))
because a host which hasn't responded at all to so many repeated
retransmission attempts isn't likely to get any reset we send
either :)
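To make that concrete, here is a tiny stand-alone sketch (not kernel
code; the constants and names are invented for illustration) of the
behavior being described: keep retransmitting with exponential
backoff, and once the retry budget is exhausted, tear the connection
down locally without sending the peer anything.

/* Toy model of a TCP write timeout; not kernel code. */
#include <stdio.h>

#define MAX_RETRIES 15		/* roughly in the spirit of tcp_retries2 */
#define RTO_MAX     120.0	/* seconds */

int main(void)
{
	double rto = 0.2;	/* pretend initial RTO, in seconds */
	double waited = 0.0;
	int tries;

	for (tries = 0; tries < MAX_RETRIES; tries++) {
		/* pretend every retransmission goes unacknowledged */
		waited += rto;
		rto *= 2;	/* exponential backoff */
		if (rto > RTO_MAX)
			rto = RTO_MAX;
	}

	/* This is the point where the real stack gives up: the socket is
	 * destroyed locally (the application sees ETIMEDOUT) and no RST
	 * is sent, so the peer can stay in ESTABLISHED indefinitely. */
	printf("gave up after %d retransmits, ~%.0f seconds elapsed\n",
	       tries, waited);
	return 0;
}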