2006-03-07 21:46:14

by Matt Leininger

[permalink] [raw]
Subject: Re: [openib-general] Re: TSO and IPoIB performance degradation

On Mon, 2006-03-06 at 19:13 -0800, Shirley Ma wrote:
>
> > More likely you are getting hit by the fact that TSO prevents the
> > congestion window from increasing properly. This was fixed in 2.6.15
> > (around mid-November 2005).
>
> Yep, I noticed the same problem. After updating to the new kernel, the
> performance is much better, but it's still lower than before.

Here is an updated set of OpenIB IPoIB performance numbers for various
kernels, with and without one of the TSO patches. The netperf results
for the latest kernels show that the TSO-related performance drop has
not recovered.

Any comments or suggestions would be appreciated.

- Matt

>
All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0
dual EM64T 3.2 GHz PCIe IB HCA (memfull)
patch 1 - remove changeset 314324121f9b94b2ca657a494cf2b9cb0e4a28cc

Kernel                 OpenIB      msi_x   netperf (MB/s)
2.6.16-rc5             in-kernel   1       367
2.6.15                 in-kernel   1       382
2.6.14-rc4 patch 1     in-kernel   1       434
2.6.14-rc4             in-kernel   1       385
2.6.14-rc3             in-kernel   1       374
2.6.13.2               svn3627     1       386
2.6.13.2 patch 1       svn3627     1       446
2.6.13.2               in-kernel   1       394
2.6.13-rc3 patch 12    in-kernel   1       442
2.6.13-rc3 patch 1     in-kernel   1       450
2.6.13-rc3             in-kernel   1       395
2.6.12.5-lustre        in-kernel   1       399
2.6.12.5 patch 1       in-kernel   1       464
2.6.12.5               in-kernel   1       402
2.6.12                 in-kernel   1       406
2.6.12-rc6 patch 1     in-kernel   1       470
2.6.12-rc6             in-kernel   1       407
2.6.12-rc5             in-kernel   1       405
2.6.12-rc5 patch 1     in-kernel   1       474
2.6.12-rc4             in-kernel   1       470
2.6.12-rc3             in-kernel   1       466
2.6.12-rc2             in-kernel   1       469
2.6.12-rc1             in-kernel   1       466
2.6.11                 in-kernel   1       464
2.6.11                 svn3687     1       464
2.6.9-11.ELsmp         svn3513     1       425  (Woody's results, 3.6 GHz EM64T)



2006-03-07 21:49:35

by Stephen Hemminger

[permalink] [raw]
Subject: Re: [openib-general] Re: TSO and IPoIB performance degradation

On Tue, 07 Mar 2006 13:44:51 -0800
Matt Leininger <[email protected]> wrote:

> On Mon, 2006-03-06 at 19:13 -0800, Shirley Ma wrote:
> >
> > > More likely you are getting hit by the fact that TSO prevents the
> > > congestion window from increasing properly. This was fixed in 2.6.15
> > > (around mid-November 2005).
> >
> > Yep, I noticed the same problem. After updating to the new kernel, the
> > performance is much better, but it's still lower than before.
>
> Here is an updated set of OpenIB IPoIB performance numbers for various
> kernels, with and without one of the TSO patches. The netperf results
> for the latest kernels show that the TSO-related performance drop has
> not recovered.
>
> Any comments or suggestions would be appreciated.
>
> - Matt

Configuration information? like did you increase the tcp_rmem, tcp_wmem?
Tcpdump traces of what is being sent and available window?
Is IB using NAPI or just doing netif_rx()?

2006-03-07 21:53:15

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [openib-general] Re: TSO and IPoIB performance degradation

Quoting r. Stephen Hemminger <[email protected]>:
> Is IB using NAPI or just doing netif_rx()?

No, IPoIB doesn't use NAPI.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

2006-03-08 00:13:00

by Matt Leininger

[permalink] [raw]
Subject: Re: [openib-general] Re: TSO and IPoIB performance degradation

On Tue, 2006-03-07 at 13:49 -0800, Stephen Hemminger wrote:
> On Tue, 07 Mar 2006 13:44:51 -0800
> Matt Leininger <[email protected]> wrote:
>
> > On Mon, 2006-03-06 at 19:13 -0800, Shirley Ma wrote:
> > >
> > > > More likely you are getting hit by the fact that TSO prevents the
> > > > congestion window from increasing properly. This was fixed in 2.6.15
> > > > (around mid-November 2005).
> > >
> > > Yep, I noticed the same problem. After updating to the new kernel, the
> > > performance is much better, but it's still lower than before.
> >
> > Here is an updated set of OpenIB IPoIB performance numbers for various
> > kernels, with and without one of the TSO patches. The netperf results
> > for the latest kernels show that the TSO-related performance drop has
> > not recovered.
> >
> > Any comments or suggestions would be appreciated.
> >
> > - Matt
>
> Configuration information? like did you increase the tcp_rmem, tcp_wmem?
> Tcpdump traces of what is being sent and available window?
> Is IB using NAPI or just doing netif_rx()?

I used the standard settings for tcp_rmem and tcp_wmem. Here are a
few other runs that change those variables. I was able to improve
performance by ~30 MB/s to 403 MB/s, but this is still a ways from the
474 MB/s before the TSO patches.
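
As a rough illustration (not something posted in the thread), the per-socket
analogue of tuning tcp_rmem/tcp_wmem is to request larger buffers with
SO_SNDBUF/SO_RCVBUF; netperf has options to do the same. A minimal sketch,
with a purely illustrative 16 MB value:

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int size = 16 * 1024 * 1024;    /* requested buffer size, in bytes */
    socklen_t len = sizeof(size);

    if (fd < 0) {
        perror("socket");
        return 1;
    }
    /* Ask for larger send/receive buffers on this one socket. */
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0)
        perror("SO_SNDBUF");
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
        perror("SO_RCVBUF");

    /* The kernel may clamp or adjust the request; read back the result. */
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len) == 0)
        printf("effective SO_RCVBUF: %d bytes\n", size);

    return 0;
}

(Note that setting the buffer sizes explicitly per socket also disables the
kernel's receive-buffer autotuning for that socket, which is exactly what the
sysctl runs above are exercising, so the two are not fully equivalent.)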

Thanks,

- Matt

All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0
dual EM64T 3.2 GHz PCIe IB HCA (memfull)
patch 1 - remove changeset 314324121f9b94b2ca657a494cf2b9cb0e4a28cc
msi_x=1 for all tests

Kernel       OpenIB      netperf (MB/s)
2.6.16-rc5   in-kernel   403
             tcp_wmem 4096  87380 16777216   tcp_rmem 4096  87380 16777216

2.6.16-rc5   in-kernel   395
             tcp_wmem 4096 102400 16777216   tcp_rmem 4096 102400 16777216

2.6.16-rc5   in-kernel   392
             tcp_wmem 4096  65536 16777216   tcp_rmem 4096  87380 16777216

2.6.16-rc5   in-kernel   394
             tcp_wmem 4096 131072 16777216   tcp_rmem 4096 102400 16777216

2.6.16-rc5   in-kernel   377
             tcp_wmem 4096 131072 16777216   tcp_rmem 4096 153600 16777216

2.6.16-rc5   in-kernel   377
             tcp_wmem 4096 131072 16777216   tcp_rmem 4096 131072 16777216

2.6.16-rc5   in-kernel   353
             tcp_wmem 4096 262144 16777216   tcp_rmem 4096 262144 16777216

2.6.16-rc5   in-kernel   305
             tcp_wmem 4096 262144 16777216   tcp_rmem 4096 524288 16777216

2.6.16-rc5   in-kernel   303
             tcp_wmem 4096 131072 16777216   tcp_rmem 4096 524288 16777216

2.6.16-rc5   in-kernel   290
             tcp_wmem 4096 524288 16777216   tcp_rmem 4096 524288 16777216

2.6.16-rc5   in-kernel   367   default tcp values

--------------------
All with standard tcp settings
Kernel                 OpenIB      netperf (MB/s)
2.6.16-rc5             in-kernel   367
2.6.15                 in-kernel   382
2.6.14-rc4 patch 12    in-kernel   436
2.6.14-rc4 patch 1     in-kernel   434
2.6.14-rc4             in-kernel   385
2.6.14-rc3             in-kernel   374
2.6.13.2               svn3627     386
2.6.13.2 patch 1       svn3627     446
2.6.13.2               in-kernel   394
2.6.13-rc3 patch 12    in-kernel   442
2.6.13-rc3 patch 1     in-kernel   450
2.6.13-rc3             in-kernel   395
2.6.12.5-lustre        in-kernel   399
2.6.12.5 patch 1       in-kernel   464
2.6.12.5               in-kernel   402
2.6.12                 in-kernel   406
2.6.12-rc6 patch 1     in-kernel   470
2.6.12-rc6             in-kernel   407
2.6.12-rc5             in-kernel   405
2.6.12-rc5 patch 1     in-kernel   474
2.6.12-rc4             in-kernel   470
2.6.12-rc3             in-kernel   466
2.6.12-rc2             in-kernel   469
2.6.12-rc1             in-kernel   466
2.6.11                 in-kernel   464
2.6.11                 svn3687     464
2.6.9-11.ELsmp         svn3513     425  (Woody's results, 3.6 GHz EM64T)


2006-03-08 00:19:39

by David Miller

[permalink] [raw]
Subject: Re: [openib-general] Re: TSO and IPoIB performance degradation

From: Matt Leininger <[email protected]>
Date: Tue, 07 Mar 2006 16:11:37 -0800

> I used the standard setting for tcp_rmem and tcp_wmem. Here are a
> few other runs that change those variables. I was able to improve
> performance by ~30MB/s to 403 MB/s, but this is still a ways from the
> 474 MB/s before the TSO patches.

How limited are the IPoIB devices, TX descriptor wise?

One side effect of the TSO changes is that one extra descriptor
will be used for outgoing packets. This is because we have to
put the headers, as well as the user data, into page-based
buffers now.
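
As a rough illustration of that accounting (not from any specific driver or
from the thread itself), a driver typically needs one descriptor for the
skb's linear area plus one per page fragment, so moving the payload into
page-based buffers costs one extra slot per packet:

#include <linux/skbuff.h>

/* Illustrative sketch only: count the TX slots a given skb would consume. */
static unsigned int tx_slots_needed(struct sk_buff *skb)
{
    /* One slot for the linear (header) area ... */
    unsigned int slots = 1;

    /* ... plus one per page fragment holding user data. */
    slots += skb_shinfo(skb)->nr_frags;

    return slots;
}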

Perhaps you can experiment with increasing the transmit descriptor
table size, if that's possible.

2006-03-08 01:17:36

by Roland Dreier

[permalink] [raw]
Subject: Re: [openib-general] Re: TSO and IPoIB performance degradation

David> How limited are the IPoIB devices, TX descriptor wise?

David> One side effect of the TSO changes is that one extra
David> descriptor will be used for outgoing packets. This is
David> because we have to put the headers as well as the user
David> data, into page based buffers now.

We have essentially no limit on TX descriptors. However, I think
there's some confusion about TSO: IPoIB does _not_ do TSO -- generic
InfiniBand hardware does not have any TSO capability. In the future
we might be able to implement TSO for certain hardware that does have
support, but even that requires some firmware help from the HCA
vendors, etc. So right now the IPoIB driver does not do TSO.

The reason TSO comes up is that reverting the patch described below
helps (or helped at some point at least) IPoIB throughput quite a bit.
Clearly this was a bug fix, so we can't revert it in general, but I
think what Michael Tsirkin was suggesting at the beginning of this
thread is to do what the last paragraph of the changelog says -- find
some way to re-enable the trick.

diff-tree 3143241... (from e16fa6b...)
Author: David S. Miller <[email protected]>
Date: Mon May 23 12:03:06 2005 -0700

[TCP]: Fix stretch ACK performance killer when doing ucopy.

When we are doing ucopy, we try to defer the ACK generation to
cleanup_rbuf(). This works most of the time very well, but if the
ucopy prequeue is large, this ACKing behavior kills performance.

With TSO, it is possible to fill the prequeue so large that by the
time the ACK is sent and gets back to the sender, most of the window
has emptied of data and performance suffers significantly.

This behavior does help in some cases, so we should think about
re-enabling this trick in the future, using some kind of limit in
order to avoid the bug case.

- R.

2006-03-08 01:24:47

by David Miller

[permalink] [raw]
Subject: Re: [openib-general] Re: TSO and IPoIB performance degradation

From: Roland Dreier <[email protected]>
Date: Tue, 07 Mar 2006 17:17:30 -0800

> The reason TSO comes up is that reverting the patch described below
> helps (or helped at some point at least) IPoIB throughput quite a bit.

I wish you had started the thread by mentioning this specific
patch; we wasted an enormous amount of precious developer time
speculating and asking for arbitrary tests to be run in order
to narrow down the problem, yet you knew the specific change
that introduced the performance regression already...

This is a good example of how not to report a bug.

2006-03-08 01:34:39

by Roland Dreier

[permalink] [raw]
Subject: Re: [openib-general] Re: TSO and IPoIB performance degradation

David> I wish you had started the thread by mentioning this
David> specific patch, we wasted an enormous amount of precious
David> developer time speculating and asking for arbitrary tests
David> to be run in order to narrow down the problem, yet you knew
David> the specific change that introduced the performance
David> regression already...

Sorry, you're right. I was a little confused because I had a memory of
Michael's original email (http://lkml.org/lkml/2006/3/6/150) quoting a
changelog entry, but looking back at the message, it was quoting
something completely different and misleading.

I think the most interesting email in the old thread is
http://openib.org/pipermail/openib-general/2005-October/012482.html
which shows that reverting 314324121 (the "stretch ACK performance
killer" fix) gives ~400 Mbit/sec in extra IPoIB performance.

- R.

2006-03-08 12:53:01

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Re: TSO and IPoIB performance degradation

Quoting r. David S. Miller <[email protected]>:
> Subject: Re: Re: TSO and IPoIB performance degradation
>
> From: Roland Dreier <[email protected]>
> Date: Tue, 07 Mar 2006 17:17:30 -0800
>
> > The reason TSO comes up is that reverting the patch described below
> > helps (or helped at some point at least) IPoIB throughput quite a bit.
>
> I wish you had started the thread by mentioning this specific patch

Er, since you mention it, the first message in the thread did include this link:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=314324121f9b94b2ca657a494cf2b9cb0e4a28cc
and I even pasted the patch description there, but oh well.

Now that Roland helped us clear it all up, and now that it has been clarified
that reverting this patch gives us back most of the performance, is the answer
to my question the same?

What I was trying to figure out was, how can we re-enable the trick without
hurting TSO? Could a solution be to simply look at the frame size, and call
tcp_send_delayed_ack if the frame size is small?
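
To make the proposal concrete, here is a rough sketch of that idea (not a
tested or submitted patch; the helper and its call site are made up, while
inet_csk(), tcp_send_ack() and tcp_send_delayed_ack() are the existing
machinery):

#include <net/tcp.h>

/* Sketch: decide between an immediate and a delayed ACK based on the size
 * of the segment that was just received. */
static void ack_decision_sketch(struct sock *sk, unsigned int frame_len)
{
    /* rcv_mss is the stack's running estimate of the peer's MSS. */
    unsigned int rcv_mss = inet_csk(sk)->icsk_ack.rcv_mss;

    if (frame_len < rcv_mss) {
        /* Small frame: defer, hoping to coalesce ACKs. */
        tcp_send_delayed_ack(sk);
    } else {
        /* Full-sized frame: ACK on the normal schedule. */
        tcp_send_ack(sk);
    }
}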

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

2006-03-08 20:53:46

by David Miller

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

From: "Michael S. Tsirkin" <[email protected]>
Date: Wed, 8 Mar 2006 14:53:11 +0200

> What I was trying to figure out was, how can we re-enable the trick without
> hurting TSO? Could a solution be to simply look at the frame size, and call
> tcp_send_delayed_ack if the frame size is small?

The problem is that this patch helps performance when the
receiver is CPU limited.

The old code would delay ACKs forever if the CPU of the
receiver was slow, because we'd wait for all received
packets to be copied into userspace before spitting out
the ACK. This would allow the pipe to empty, since the
sender is waiting for ACKs in order to send more into
the pipe, and once the ACK did go out it would cause the
sender to emit an enormous burst of data. Both of these
behaviors are highly frowned upon for a TCP stack.

I'll try to look at this some more later today.

2006-03-09 23:48:29

by David Miller

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

From: "Michael S. Tsirkin" <[email protected]>
Date: Wed, 8 Mar 2006 14:53:11 +0200

> What I was trying to figure out was, how can we re-enable the trick
> without hurting TSO? Could a solution be to simply look at the frame
> size, and call tcp_send_delayed_ack if the frame size is small?

The change is really not related to TSO.

By reverting it, you are reducing the number of ACKs on the wire, and
the number of context switches at the sender to push out new data.
That's why it can make things go faster, but it also leads to bursty
TCP sender behavior, which is bad for congestion on the internet.

When the receiver has a strong cpu and can keep up with the incoming
packet rate very well and we are in an environment with no congestion,
the old code helps a lot. But if the receiver is cpu limited or we
have congestion of any kind, it does exactly the wrong thing. It will
delay ACKs a very long time to the point where the pipe is depleted
and this kills performance in that case. For congested environments,
due to the decreased ACK feedback, packet loss recovery will be
extremely poor. This is the first reason behind my change.

The behavior is also specifically frowned upon in the TCP implementor
community. It is specifically mentioned in the Known TCP
Implementation Problems RFC2525, in section 2.13 "Stretch ACK
violation".

The entry, quoted below for reference, is very clear on the reasons
why stretch ACKs are bad (a small numeric illustration of the
slow-start point follows the quoted entry). And although stretching
ACKs may help performance in your case, in congested environments and
with CPU-limited receivers it will have a negative impact on
performance. So, this was the second reason why I made this change.

So reverting the change isn't really an option.

Name of Problem
Stretch ACK violation

Classification
Congestion Control/Performance

Description
To improve efficiency (both computer and network) a data receiver
may refrain from sending an ACK for each incoming segment,
according to [RFC1122]. However, an ACK should not be delayed an
inordinate amount of time. Specifically, ACKs SHOULD be sent for
every second full-sized segment that arrives. If a second full-
sized segment does not arrive within a given timeout (of no more
than 0.5 seconds), an ACK should be transmitted, according to
[RFC1122]. A TCP receiver which does not generate an ACK for
every second full-sized segment exhibits a "Stretch ACK
Violation".

Significance
TCP receivers exhibiting this behavior will cause TCP senders to
generate burstier traffic, which can degrade performance in
congested environments. In addition, generating fewer ACKs
increases the amount of time needed by the slow start algorithm to
open the congestion window to an appropriate point, which
diminishes performance in environments with large bandwidth-delay
products. Finally, generating fewer ACKs may cause needless
retransmission timeouts in lossy environments, as it increases the
possibility that an entire window of ACKs is lost, forcing a
retransmission timeout.

Implications
When not in loss recovery, every ACK received by a TCP sender
triggers the transmission of new data segments. The burst size is
determined by the number of previously unacknowledged segments
each ACK covers. Therefore, a TCP receiver ack'ing more than 2
segments at a time causes the sending TCP to generate a larger
burst of traffic upon receipt of the ACK. This large burst of
traffic can overwhelm an intervening gateway, leading to higher
drop rates for both the connection and other connections passing
through the congested gateway.

In addition, the TCP slow start algorithm increases the congestion
window by 1 segment for each ACK received. Therefore, increasing
the ACK interval (thus decreasing the rate at which ACKs are
transmitted) increases the amount of time it takes slow start to
increase the congestion window to an appropriate operating point,
and the connection consequently suffers from reduced performance.
This is especially true for connections using large windows.

Relevant RFCs
RFC 1122 outlines delayed ACKs as a recommended mechanism.

Trace file demonstrating it
Trace file taken using tcpdump at host B, the data receiver (and
ACK originator). The advertised window (which never changed) and
timestamp options have been omitted for clarity, except for the
first packet sent by A:

12:09:24.820187 A.1174 > B.3999: . 2049:3497(1448) ack 1
win 33580 <nop,nop,timestamp 2249877 2249914> [tos 0x8]
12:09:24.824147 A.1174 > B.3999: . 3497:4945(1448) ack 1
12:09:24.832034 A.1174 > B.3999: . 4945:6393(1448) ack 1
12:09:24.832222 B.3999 > A.1174: . ack 6393
12:09:24.934837 A.1174 > B.3999: . 6393:7841(1448) ack 1
12:09:24.942721 A.1174 > B.3999: . 7841:9289(1448) ack 1
12:09:24.950605 A.1174 > B.3999: . 9289:10737(1448) ack 1
12:09:24.950797 B.3999 > A.1174: . ack 10737
12:09:24.958488 A.1174 > B.3999: . 10737:12185(1448) ack 1
12:09:25.052330 A.1174 > B.3999: . 12185:13633(1448) ack 1
12:09:25.060216 A.1174 > B.3999: . 13633:15081(1448) ack 1
12:09:25.060405 B.3999 > A.1174: . ack 15081

This portion of the trace clearly shows that the receiver (host B)
sends an ACK for every third full sized packet received. Further
investigation of this implementation found that the cause of the
increased ACK interval was the TCP options being used. The
implementation sent an ACK after it was holding 2*MSS worth of
unacknowledged data. In the above case, the MSS is 1460 bytes so
the receiver transmits an ACK after it is holding at least 2920
bytes of unacknowledged data. However, the length of the TCP
options being used [RFC1323] took 12 bytes away from the data
portion of each packet. This produced packets containing 1448
bytes of data. But the additional bytes used by the options in
the header were not taken into account when determining when to
trigger an ACK. Therefore, it took 3 data segments before the
data receiver was holding enough unacknowledged data (>= 2*MSS, or
2920 bytes in the above example) to transmit an ACK.

Trace file demonstrating correct behavior
Trace file taken using tcpdump at host B, the data receiver (and
ACK originator), again with window and timestamp information
omitted except for the first packet:

12:06:53.627320 A.1172 > B.3999: . 1449:2897(1448) ack 1
win 33580 <nop,nop,timestamp 2249575 2249612> [tos 0x8]
12:06:53.634773 A.1172 > B.3999: . 2897:4345(1448) ack 1
12:06:53.634961 B.3999 > A.1172: . ack 4345
12:06:53.737326 A.1172 > B.3999: . 4345:5793(1448) ack 1
12:06:53.744401 A.1172 > B.3999: . 5793:7241(1448) ack 1
12:06:53.744592 B.3999 > A.1172: . ack 7241
12:06:53.752287 A.1172 > B.3999: . 7241:8689(1448) ack 1
12:06:53.847332 A.1172 > B.3999: . 8689:10137(1448) ack 1
12:06:53.847525 B.3999 > A.1172: . ack 10137

This trace shows the TCP receiver (host B) ack'ing every second
full-sized packet, according to [RFC1122]. This is the same
implementation shown above, with slight modifications that allow
the receiver to take the length of the options into account when
deciding when to transmit an ACK.

References
This problem is documented in [Allman97] and [Paxson97].

How to detect
Stretch ACK violations show up immediately in receiver-side packet
traces of bulk transfers, as shown above. However, packet traces
made on the sender side of the TCP connection may lead to
ambiguities when diagnosing this problem due to the possibility of
lost ACKs.
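
A small, self-contained illustration of the slow-start point made in the
"Significance" and "Implications" sections above (not from the thread; the
numbers and the at-least-one-ACK-per-RTT assumption are purely illustrative):

#include <stdio.h>

/* Classic slow start grows cwnd by one segment per ACK, so ACKing every
 * N segments instead of every 2 slows the ramp-up roughly from x1.5 to
 * x(1 + 1/N) per round trip. */
static int rtts_to_reach(int target_segments, int segments_per_ack)
{
    int cwnd = 2;   /* initial window, in segments */
    int rtts = 0;

    while (cwnd < target_segments) {
        int growth = cwnd / segments_per_ack;

        /* Assume the delayed-ACK timer still forces at least one ACK
         * per round trip. */
        if (growth < 1)
            growth = 1;
        cwnd += growth;
        rtts++;
    }
    return rtts;
}

int main(void)
{
    printf("ACK every 2 segments: %d RTTs to reach a cwnd of 1000\n",
           rtts_to_reach(1000, 2));
    printf("ACK every 8 segments: %d RTTs to reach a cwnd of 1000\n",
           rtts_to_reach(1000, 8));
    return 0;
}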

2006-03-10 00:10:19

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

Quoting David S. Miller <[email protected]>:
> Description
> To improve efficiency (both computer and network) a data receiver
> may refrain from sending an ACK for each incoming segment,
> according to [RFC1122]. However, an ACK should not be delayed an
> inordinate amount of time. Specifically, ACKs SHOULD be sent for
> every second full-sized segment that arrives. If a second full-
> sized segment does not arrive within a given timeout (of no more
> than 0.5 seconds), an ACK should be transmitted, according to
> [RFC1122]. A TCP receiver which does not generate an ACK for
> every second full-sized segment exhibits a "Stretch ACK
> Violation".

Thanks very much for the info!

So the longest we can delay, according to this spec, is until we have two
full-sized segments.

But with the change we are discussing, could an ACK now be sent even sooner
than when we have at least two full-sized segments? Or does __tcp_ack_snd_check
delay until we have at least two full-sized segments? David, could you explain,
please?

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

2006-03-10 00:21:08

by Rick Jones

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

David S. Miller wrote:
> From: "Michael S. Tsirkin" <[email protected]>
> Date: Wed, 8 Mar 2006 14:53:11 +0200
>
>
>>What I was trying to figure out was, how can we re-enable the trick
>>without hurting TSO? Could a solution be to simply look at the frame
>>size, and call tcp_send_delayed_ack if the frame size is small?
>
>
> The change is really not related to TSO.
>
> By reverting it, you are reducing the number of ACKs on the wire, and
> the number of context switches at the sender to push out new data.
> That's why it can make things go faster, but it also leads to bursty
> TCP sender behavior, which is bad for congestion on the internet.

naughty naughty Solaris and HP-UX TCP :)

>
> When the receiver has a strong cpu and can keep up with the incoming
> packet rate very well and we are in an environment with no congestion,
> the old code helps a lot. But if the receiver is cpu limited or we
> have congestion of any kind, it does exactly the wrong thing. It will
> delay ACKs a very long time to the point where the pipe is depleted
> and this kills performance in that case. For congested environments,
> due to the decreased ACK feedback, packet loss recovery will be
> extremely poor. This is the first reason behind my change.

well, there are stacks which do "stretch acks" (after a fashion) that
make sure when they see packet loss to "do the right thing" wrt sending
enough acks to allow cwnds to open again in a timely fashion.

That brings back all that stuff I posted ages ago about the performance
delta when using an HP-UX receiver and altering the number of segments
per ACK. It should be in the netdev archive somewhere.

It might have been around the time of the discussions about MacOS and its
ack avoidance - which wasn't done very well at the time.


>
> The behavior is also specifically frowned upon in the TCP implementor
> community. It is specifically mentioned in the Known TCP
> Implementation Problems RFC2525, in section 2.13 "Stretch ACK
> violation".
>
> The entry, quoted below for reference, is very clear on the reasons
> why stretch ACKs are bad. And although it may help performance for
> your case, in congested environments and also with cpu limited
> receivers it will have a negative impact on performance. So, this was
> the second reason why I made this change.

I would have thought that a receiver "stretching ACKs" would be helpful
when it was CPU limited, since it would be spending fewer CPU cycles
generating ACKs?

>
> So reverting the change isn't really an option.
>
> Name of Problem
> Stretch ACK violation
>
> Classification
> Congestion Control/Performance
>
> Description
> To improve efficiency (both computer and network) a data receiver
> may refrain from sending an ACK for each incoming segment,
> according to [RFC1122]. However, an ACK should not be delayed an
> inordinate amount of time. Specifically, ACKs SHOULD be sent for
> every second full-sized segment that arrives. If a second full-
> sized segment does not arrive within a given timeout (of no more
> than 0.5 seconds), an ACK should be transmitted, according to
> [RFC1122]. A TCP receiver which does not generate an ACK for
> every second full-sized segment exhibits a "Stretch ACK
> Violation".

How can it be a "violation" of a SHOULD?-)

>
> Significance
> TCP receivers exhibiting this behavior will cause TCP senders to
> generate burstier traffic, which can degrade performance in
> congested environments. In addition, generating fewer ACKs
> increases the amount of time needed by the slow start algorithm to
> open the congestion window to an appropriate point, which
> diminishes performance in environments with large bandwidth-delay
> products. Finally, generating fewer ACKs may cause needless
> retransmission timeouts in lossy environments, as it increases the
> possibility that an entire window of ACKs is lost, forcing a
> retransmission timeout.

Of those three, I think the most meaningful is the second, which can be
dealt with by smarts in the ACK-stretching receiver.

For the first, it will only degrade performance if it triggers packet loss.

I'm not sure I've ever seen the third item happen.

>
> Implications
> When not in loss recovery, every ACK received by a TCP sender
> triggers the transmission of new data segments. The burst size is
> determined by the number of previously unacknowledged segments
> each ACK covers. Therefore, a TCP receiver ack'ing more than 2
> segments at a time causes the sending TCP to generate a larger
> burst of traffic upon receipt of the ACK. This large burst of
> traffic can overwhelm an intervening gateway, leading to higher
> drop rates for both the connection and other connections passing
> through the congested gateway.

Doesn't RED mean that those other connections are rather less likely to
be affected?

>
> In addition, the TCP slow start algorithm increases the congestion
> window by 1 segment for each ACK received. Therefore, increasing
> the ACK interval (thus decreasing the rate at which ACKs are
> transmitted) increases the amount of time it takes slow start to
> increase the congestion window to an appropriate operating point,
> and the connection consequently suffers from reduced performance.
> This is especially true for connections using large windows.

This one is dealt with by ABC, isn't it? At least in part, since ABC
appears to cap the cwnd increase at 2*SMSS. (I only know this because I just
read the RFC mentioned in another thread - it seems a bit much to have made
that limit a MUST rather than a SHOULD :)

rick jones

2006-03-10 00:38:57

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

Quoting r. Michael S. Tsirkin <[email protected]>:
> Or does __tcp_ack_snd_check delay until we have at least two full sized
> segments?

What I'm trying to say is: since RFC 2525, section 2.13, talks about
"every second full-sized segment", then following the code in
__tcp_ack_snd_check, why does it do

/* More than one full frame received... */
if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss

rather than

/* At least two full frames received... */
if (((tp->rcv_nxt - tp->rcv_wup) >= 2 * inet_csk(sk)->icsk_ack.rcv_mss
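
For readers without the source at hand, the surrounding logic in
__tcp_ack_snd_check() looks roughly like this in 2.6.16-era tcp_input.c
(paraphrased, so details may differ slightly); the condition quoted above is
only the first arm of the test, and everything that does not force an
immediate ACK falls through to tcp_send_delayed_ack():

static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
{
    struct tcp_sock *tp = tcp_sk(sk);

    /* More than one full frame received... */
    if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss
         /* ... and right edge of window advances far enough.
          * (tcp_recvmsg() will send ACK otherwise). Or... */
         && __tcp_select_window(sk) >= tp->rcv_wnd) ||
        /* We ACK each frame or... */
        tcp_in_quickack_mode(sk) ||
        /* We have out of order data. */
        (ofo_possible && skb_peek(&tp->out_of_order_queue) != NULL)) {
        /* Then ack it now */
        tcp_send_ack(sk);
    } else {
        /* Else, send delayed ack. */
        tcp_send_delayed_ack(sk);
    }
}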

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

2006-03-10 07:18:09

by David Miller

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

From: "Michael S. Tsirkin" <[email protected]>
Date: Fri, 10 Mar 2006 02:10:31 +0200

> But with the change we are discussing, could an ack now be sent even
> sooner than we have at least two full sized segments? Or does
> __tcp_ack_snd_check delay until we have at least two full sized
> segments? David, could you explain please?

__tcp_ack_snd_check() delays until we have at least two full
sized segments.

2006-03-10 07:23:04

by David Miller

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

From: Rick Jones <[email protected]>
Date: Thu, 09 Mar 2006 16:21:05 -0800

> well, there are stacks which do "stretch acks" (after a fashion) that
> make sure when they see packet loss to "do the right thing" wrt sending
> enough acks to allow cwnds to open again in a timely fashion.

Once a loss happens, it's too late to stop doing the stretch ACKs; the
damage is done already. It is going to take you at least one
extra RTT to recover from the loss compared to if you were not doing
stretch ACKs.

You have to keep giving consistent, well-spaced ACKs back to the
sender in order to recover from loss optimally.

The ACK every 2 full sized frames behavior of TCP is absolutely
essential.

2006-03-10 17:45:04

by Rick Jones

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

David S. Miller wrote:
> From: Rick Jones <[email protected]>
> Date: Thu, 09 Mar 2006 16:21:05 -0800
>
>
>>well, there are stacks which do "stretch acks" (after a fashion) that
>>make sure when they see packet loss to "do the right thing" wrt sending
>>enough acks to allow cwnds to open again in a timely fashion.
>
>
> Once a loss happens, it's too late to stop doing the stretch ACKs, the
> damage is done already. It is going to take you at least one
> extra RTT to recover from the loss compared to if you were not doing
> stretch ACKs.

I must be dense (entirely possible), but how is that absolute?

If there is no more data in flight after the segment that was lost, the
"stretch ACK" stacks with which I'm familiar will generate the
standalone ACK within the deferred ACK interval (50 milliseconds). I
guess that can be the "one extra RTT". However, if there is data in
flight after the point of loss, the immediate ACK upon receipt of
out-of-order data kicks in.

> You have to keep giving consistent, well-spaced ACKs back to the
> sender in order to recover from loss optimally.

The key there is defining "consistent" and "well spaced". Certainly an ACK
only after a window's worth of data would not be well spaced, but I
believe that an ACK after more than two full-sized frames could indeed
be well spaced.

> The ACK every 2 full sized frames behavior of TCP is absolutely
> essential.

I don't think it is _quite_ that cut and dried; otherwise HP-UX and
Solaris, since before 1997, would have had big-time problems.
rick jones

2006-03-20 09:05:50

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

Quoting r. David S. Miller <[email protected]>:
> > well, there are stacks which do "stretch acks" (after a fashion) that
> > make sure when they see packet loss to "do the right thing" wrt sending
> > enough acks to allow cwnds to open again in a timely fashion.
>
> Once a loss happens, it's too late to stop doing the stretch ACKs, the
> damage is done already. It is going to take you at least one
> extra RTT to recover from the loss compared to if you were not doing
> stretch ACKs.
>
> You have to keep giving consistent, well-spaced ACKs back to the
> sender in order to recover from loss optimally.

Is it the case then that this requirement is less essential on
networks such as IP over InfiniBand, which are very low latency
and essentially lossless (with explicit congestion notifications
in hardware)?

> The ACK every 2 full sized frames behavior of TCP is absolutely
> essential.

Interestingly, I was pointed towards the following RFC draft
http://www.ietf.org/internet-drafts/draft-ietf-tcpm-rfc2581bis-00.txt

    The requirement that an ACK "SHOULD" be generated for at least every
    second full-sized segment is listed in [RFC1122] in one place as a
    SHOULD and another as a MUST. Here we unambiguously state it is a
    SHOULD. We also emphasize that this is a SHOULD, meaning that an
    implementor should indeed only deviate from this requirement after
    careful consideration of the implications.

And as Matt Leininger's research appears to show, stretch ACKs
are good for performance in the case of IP over InfiniBand.

Given all this, would it make sense to add a per-netdevice (or per-neighbour)
flag to re-enable the trick for these net devices (as was done before
314324121f9b94b2ca657a494cf2b9cb0e4a28cc)? The IP over InfiniBand driver
would then simply set this flag.
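
A hypothetical sketch of such a flag (none of these names exist in the
kernel; IFF_ALLOW_STRETCH_ACK and the helper are invented purely for
illustration, and the later replies explain why keying the decision off a
device is not considered feasible):

#include <net/sock.h>
#include <net/dst.h>
#include <linux/netdevice.h>

#define IFF_ALLOW_STRETCH_ACK  0x8000  /* invented priv_flags bit */

/* Sketch: would the cached route's output device tolerate stretch ACKs? */
static int sk_allows_stretch_acks(struct sock *sk)
{
    struct dst_entry *dst = __sk_dst_get(sk);

    /* No cached route: be conservative and keep normal ACK behaviour. */
    if (!dst || !dst->dev)
        return 0;

    return (dst->dev->priv_flags & IFF_ALLOW_STRETCH_ACK) != 0;
}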

David, would you accept such a patch? It would be nice to get 2.6.17
back to within at least 10% of 2.6.11.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

2006-03-20 09:55:11

by David Miller

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

From: "Michael S. Tsirkin" <[email protected]>
Date: Mon, 20 Mar 2006 11:06:29 +0200

> Is it the case then that this requirement is less essential on
> networks such as IP over InfiniBand, which are very low latency
> and essentially lossless (with explicit congestion notifications
> in hardware)?

You can never assume any attribute of the network whatsoever.
Even if initially the outgoing device is IPoIB, something in
the middle, like a traffic classification or netfilter rule,
could rewrite the packet and make it go somewhere else.

This even applies to loopback packets, because packets can
get rewritten and redirected even once they are passed in
via netif_receive_skb().

> And as Matt Leininger's research appears to show, stretch ACKs
> are good for performance in case of IP over InfiniBand.
>
> Given all this, would it make sense to add a per-netdevice (or per-neighbour)
> flag to re-enable the trick for these net devices (as was done before
> 314324121f9b94b2ca657a494cf2b9cb0e4a28cc)?
> IP over InfiniBand driver would then simply set this flag.

See above, this is not feasible.

The path an SKB can take is opaque and unknown until the very last
moment it is actually given to the device transmit function.

People need to get the "special case this topology" ideas out of their
heads. :-)

2006-03-20 10:21:55

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

Quoting r. David S. Miller <[email protected]>:
> The path an SKB can take is opaque and unknown until the very last
> moment it is actually given to the device transmit function.

Why, I was proposing looking at the dst cache. If that's NULL, well,
we won't stretch ACKs. Worst case, we apply the wrong optimization.
Right?

> People need to get the "special case this topology" ideas out of their
> heads. :-)

Okay, I get that.

What I'd like to clarify, however: RFC 2581 explicitly states that in some cases
it might be OK to generate ACKs less frequently than every second full-sized
segment. Given Matt's measurements, TCP on top of IP over InfiniBand on Linux
seems to hit one of these cases. Do you agree with that?

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

2006-03-20 10:37:44

by David Miller

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

From: "Michael S. Tsirkin" <[email protected]>
Date: Mon, 20 Mar 2006 12:22:34 +0200

> Quoting r. David S. Miller <[email protected]>:
> > The path an SKB can take is opaque and unknown until the very last
> > moment it is actually given to the device transmit function.
>
> Why, I was proposing looking at dst cache. If that's NULL, well,
> we won't stretch ACKs. Worst case we apply the wrong optimization.
> Right?

Where you receive a packet from isn't very useful for determining
even the full path on which that packet itself flowed.

More importantly, packets also do not necessarily go back out over the
same path on which packets are received for a connection. This is
actually quite common.

Maybe packets for this connection come in via IPoIB but go out via
gigabit ethernet and another route altogether.

> What I'd like to clarify, however: rfc2581 explicitly states that in
> some cases it might be OK to generate ACKs less frequently than
> every second full-sized segment. Given Matt's measurements, TCP on
> top of IP over InfiniBand on Linux seems to hit one of these cases.
> Do you agree to that?

I disagree with Linux changing its behavior. It would be great to
turn off congestion control completely over local gigabit networks,
but that isn't determinable in any way, so we don't do that.

The IPoIB situation is no different: you can set all the bits you want
in incoming packets, but the barrier to doing this remains the same.

It hurts performance if any packet drop occurs because it will require
an extra round trip for recovery to begin to be triggered at the
sender.

The network is a black box, routes to and from a destination are
arbitrary, and so is packet rewriting and reflection, so being able to
say "this all occurs on IPoIB" is simply infeasible.

I don't know how else to say this, we simply cannot special case IPoIB
or any other topology type.

2006-03-20 11:27:11

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

Quoting David S. Miller <[email protected]>:
> I disagree with Linux changing it's behavior. It would be great to
> turn off congestion control completely over local gigabit networks,
> but that isn't determinable in any way, so we don't do that.

Interesting. Would it make sense to make it another tunable knob in
/proc, sysfs or sysctl then?

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

2006-03-20 11:47:21

by Arjan van de Ven

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

On Mon, 2006-03-20 at 13:27 +0200, Michael S. Tsirkin wrote:
> Quoting David S. Miller <[email protected]>:
> > I disagree with Linux changing it's behavior. It would be great to
> > turn off congestion control completely over local gigabit networks,
> > but that isn't determinable in any way, so we don't do that.
>
> Interesting. Would it make sense to make it another tunable knob in
> /proc, sysfs or sysctl then?

That's not the right level, since that is per interface. And you only
know the actual interface way too late (as per earlier posts).
Per socket... maybe.
But then again, it's not impossible to have packets for one socket go out
to multiple interfaces
(think load-balancing bonding over 2 interfaces, one IB, another
ethernet).


2006-03-20 11:49:36

by Lennert Buytenhek

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

On Mon, Mar 20, 2006 at 12:47:03PM +0100, Arjan van de Ven wrote:

> > > I disagree with Linux changing it's behavior. It would be great to
> > > turn off congestion control completely over local gigabit networks,
> > > but that isn't determinable in any way, so we don't do that.
> >
> > Interesting. Would it make sense to make it another tunable knob in
> > /proc, sysfs or sysctl then?
>
> that's not the right level; since that is per interface. And you only
> know the actual interface waay too late (as per earlier posts).
> Per socket.. maybe
> But then again it's not impossible to have packets for one socket go out
> to multiple interfaces
> (think load balancing bonding over 2 interfaces, one IB another
> ethernet)

I read it as if he was proposing to have a sysctl knob to turn off
TCP congestion control completely (which has so many issues it's not
even funny.)

2006-03-20 11:53:57

by Arjan van de Ven

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

On Mon, 2006-03-20 at 12:49 +0100, Lennert Buytenhek wrote:
> On Mon, Mar 20, 2006 at 12:47:03PM +0100, Arjan van de Ven wrote:
>
> > > > I disagree with Linux changing it's behavior. It would be great to
> > > > turn off congestion control completely over local gigabit networks,
> > > > but that isn't determinable in any way, so we don't do that.
> > >
> > > Interesting. Would it make sense to make it another tunable knob in
> > > /proc, sysfs or sysctl then?
> >
> > that's not the right level; since that is per interface. And you only
> > know the actual interface waay too late (as per earlier posts).
> > Per socket.. maybe
> > But then again it's not impossible to have packets for one socket go out
> > to multiple interfaces
> > (think load balancing bonding over 2 interfaces, one IB another
> > ethernet)
>
> I read it as if he was proposing to have a sysctl knob to turn off
> TCP congestion control completely (which has so many issues it's not
> even funny.)

owww that's so bad I didn't even consider that

2006-03-20 12:03:25

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

Quoting r. Lennert Buytenhek <[email protected]>:
> > > > I disagree with Linux changing it's behavior. It would be great to
> > > > turn off congestion control completely over local gigabit networks,
> > > > but that isn't determinable in any way, so we don't do that.
> > >
> > > Interesting. Would it make sense to make it another tunable knob in
> > > /proc, sysfs or sysctl then?
> >
> > that's not the right level; since that is per interface. And you only
> > know the actual interface waay too late (as per earlier posts).
> > Per socket.. maybe
> > But then again it's not impossible to have packets for one socket go out
> > to multiple interfaces
> > (think load balancing bonding over 2 interfaces, one IB another
> > ethernet)
>
> I read it as if he was proposing to have a sysctl knob to turn off
> TCP congestion control completely (which has so many issues it's not
> even funny.)

Not really, that was David :)

What started this thread was the fact that since 2.6.11 Linux
does not stretch ACKs anymore. RFC 2581 does mention that it might be OK to
stretch ACKs "after careful consideration", and we are seeing that it helps
IP over InfiniBand, so recent Linux kernels perform worse in that respect.

And since there does not seem to be a way to figure out automagically when
doing this is a good idea, I proposed adding some kind of knob that will let the
user apply that consideration for us.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

2006-03-20 13:35:14

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

Quoting Arjan van de Ven <[email protected]>:
> > I read it as if he was proposing to have a sysctl knob to turn off
> > TCP congestion control completely (which has so many issues it's not
> > even funny.)
>
> owww that's so bad I didn't even consider that

No, I think that comment was taken out of thread context. We were talking about
stretching ACKs - while avoiding stretch ACKs is important for TCP congestion
control, it's not the only mechanism.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies

2006-03-20 16:06:27

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

On Mon, Mar 20, 2006 at 02:04:07PM +0200, Michael S. Tsirkin wrote:
> does not stretch ACKs anymore. RFC 2581 does mention that it might be OK to
> stretch ACKs "after careful consideration", and we are seeing that it helps
> IP over InfiniBand, so recent Linux kernels perform worse in that respect.
>
> And since there does not seem to be a way to figure it out automagically when
> doing this is a good idea, I proposed adding some kind of knob that will let the
> user apply the consideration for us.

Wouldn't it make sense to stretch the ACK when the previous ACK is still in
the TX queue of the device? I know that sort of behaviour was always an
issue on modem links where you don't want to send out redundant ACKs.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2006-03-20 18:58:30

by Rick Jones

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

> Wouldn't it make sense to strech the ACK when the previous ACK is still in
> the TX queue of the device? I know that sort of behaviour was always an
> issue on modem links where you don't want to send out redundant ACKs.

Perhaps, but it isn't clear that it would be worth the cycles to check.
I doubt that a simple reference count on the ACK skb would do it,
since if it were a bare ACK I doubt that TCP keeps a reference to the
skb in the first place.

Also, what would be the "trigger" to send the next ACK after the
previous one had left the building (Elvis-like)? Receipt of N in-order
segments? A timeout?

If you are going to go ahead and try to do stretch-ACKs, then I suspect
the way to go about doing it is to have it behave very much like HP-UX
or Solaris, both of which have arguably reasonable ACK-avoidance
heuristics in them.

But don't try to do it quick and dirty.

rick "likes ACK avoidance, just check the archives" jones
on netdev, no need to cc me directly

2006-03-20 23:04:03

by David Miller

[permalink] [raw]
Subject: Re: TSO and IPoIB performance degradation

From: Benjamin LaHaise <[email protected]>
Date: Mon, 20 Mar 2006 10:09:42 -0500

> Wouldn't it make sense to strech the ACK when the previous ACK is still in
> the TX queue of the device? I know that sort of behaviour was always an
> issue on modem links where you don't want to send out redundant ACKs.

I thought about doing a similar trick with TSO, wherein we would
not defer a TSO send if all the previously sent packets are out of the
device transmit queue. The idea was to prevent the pipe from ever
emptying, which is the danger of deferring too much for TSO.

This has several problems. It's hard to implement. You have to
decide if you want precise state, which means checking the TX
descriptors, or imprecise state that is easier to implement but
therefore not very useful, by just checking the SKB refcount or
similar. The latter means you find out the packet has left the TX
queue only after the TX purge interrupt, which can be a long time
after the event, and by then the pipe has emptied, which is what you
were trying to prevent.

Lastly, you don't want to touch remote CPU state, which is what such
a hack is going to end up doing much of the time.