2015-01-29 11:49:02

by Michal Kazior

Subject: Throughput regression with `tcp: refine TSO autosizing`

Hi,

I'm not subscribed to netdev list and I can't find the message-id so I
can't reply directly to the original thread `BW regression after "tcp:
refine TSO autosizing"`.

I've noticed a big TCP performance drop with ath10k
(drivers/net/wireless/ath/ath10k) on 3.19-rc5. Instead of 500mbps I
get 250mbps in my testbed.

After bisecting I ended up at `tcp: refine TSO autosizing`. Reverting
`tcp: refine TSO autosizing` and `tcp: Do not apply TSO segment limit
to non-TSO packets` (for conflict free reverts) fixes the problem.

My testing setup is as follows:

a) ath10k AP, github.com/kvalo/ath/tree/master 3.19-rc5, w/ reverts
b) ath10k STA connected to (a), github.com/kvalo/ath/tree/master
3.19-rc5, w/ reverts
c) (b) w/o reverts

Devices are 3x3 (AP) and 2x2 (Client) and are RF cabled. 11ac@80MHz
2x2 has 866mbps modulation rate. In practice this should deliver
~700mbps of real UDP traffic.

Here are some numbers:

UDP: (b) -> (a): 672mbps
UDP: (a) -> (b): 687mbps
TCP: (b) -> (a): 526mbps
TCP: (a) -> (b): 500mbps

UDP: (c) -> (a): 669mbps*
UDP: (a) -> (c): 689mbps*
TCP: (c) -> (a): 240mbps**
TCP: (a) -> (c): 490mbps*

* no changes/within error margin
** the performance drop

I'm using iperf:
UDP: iperf -i1 -s -u vs iperf -i1 -c XX -u -B 200M -P5 -t 20
TCP: iperf -i1 -s vs iperf -i1 -c XX -P5 -t 20

Result values were obtained at the receiver side.

Iperf reports a few frames lost and out-of-order at each UDP test
start (during first second) but later has no packet loss and no
out-of-order. This shouldn't have any effect on a TCP session, right?

The device delivers batched up tx/rx completions (no way to change
that). I suppose this could be an issue for timing sensitive
algorithms. Also keep in mind 802.11n and 802.11ac devices have frame
aggregation windows so there's an inherent extra (and non-uniform)
latency when compared to, e.g. ethernet devices.

The driver doesn't have GRO. I have an internal patch which implements
it. It improves overall TCP traffic (more stable, up to 600mbps TCP
which is ~100mbps more than without GRO) but the TCP: (c) -> (a)
performance drop remains unaffected regardless.

I've tried applying stretch ACK patchset (v2) on both machines and
re-run the above tests. I got no measurable difference in performance.

I've also run these tests with iwlwifi 7260 (also a 2x2) as (b) and
(c). It didn't seem to be affected by the TSO patch at all (it runs at
~360mbps of TCP regardless of the TSO patch).

Any hints/ideas?


Michał


2015-01-29 13:15:00

by Eric Dumazet

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Thu, 2015-01-29 at 12:48 +0100, Michal Kazior wrote:
> Hi,
>
> I'm not subscribed to netdev list and I can't find the message-id so I
> can't reply directly to the original thread `BW regression after "tcp:
> refine TSO autosizing"`.
>
> I've noticed a big TCP performance drop with ath10k
> (drivers/net/wireless/ath/ath10k) on 3.19-rc5. Instead of 500mbps I
> get 250mbps in my testbed.
>
> After bisecting I ended up at `tcp: refine TSO autosizing`. Reverting
> `tcp: refine TSO autosizing` and `tcp: Do not apply TSO segment limit
> to non-TSO packets` (for conflict free reverts) fixes the problem.
>
> My testing setup is as follows:
>
> a) ath10k AP, github.com/kvalo/ath/tree/master 3.19-rc5, w/ reverts
> b) ath10k STA connected to (a), github.com/kvalo/ath/tree/master
> 3.19-rc5, w/ reverts
> c) (b) w/o reverts
>
> Devices are 3x3 (AP) and 2x2 (Client) and are RF cabled. 11ac@80MHz
> 2x2 has 866mbps modulation rate. In practice this should deliver
> ~700mbps of real UDP traffic.
>
> Here are some numbers:
>
> UDP: (b) -> (a): 672mbps
> UDP: (a) -> (b): 687mbps
> TCP: (b) -> (a): 526mbps
> TCP: (a) -> (b): 500mbps
>
> UDP: (c) -> (a): 669mbps*
> UDP: (a) -> (c): 689mbps*
> TCP: (c) -> (a): 240mbps**
> TCP: (a) -> (c): 490mbps*
>
> * no changes/within error margin
> ** the performance drop
>
> I'm using iperf:
> UDP: iperf -i1 -s -u vs iperf -i1 -c XX -u -B 200M -P5 -t 20
> TCP: iperf -i1 -s vs iperf -i1 -c XX -P5 -t 20
>
> Result values were obtained at the receiver side.
>
> Iperf reports a few frames lost and out-of-order at each UDP test
> start (during first second) but later has no packet loss and no
> out-of-order. This shouldn't have any effect on a TCP session, right?
>
> The device delivers batched up tx/rx completions (no way to change
> that). I suppose this could be an issue for timing sensitive
> algorithms. Also keep in mind 802.11n and 802.11ac devices have frame
> aggregation windows so there's an inherent extra (and non-uniform)
> latency when compared to, e.g. ethernet devices.
>
> The driver doesn't have GRO. I have an internal patch which implements
> it. It improves overall TCP traffic (more stable, up to 600mbps TCP
> which is ~100mbps more than without GRO) but the TCP: (c) -> (a)
> performance drop remains unaffected regardless.
>
> I've tried applying stretch ACK patchset (v2) on both machines and
> re-run the above tests. I got no measurable difference in performance.
>
> I've also run these tests with iwlwifi 7260 (also a 2x2) as (b) and
> (c). It didn't seem to be affected by the TSO patch at all (it runs at
> ~360mbps of TCP regardless of the TSO patch).
>
> Any hints/ideas?
>

Hi Michal

This patch restored original TSQ behavior, because the 1ms worth of data
per flow had totally destroyed TSQ intent.

vi +630 Documentation/networking/ip-sysctl.txt

tcp_limit_output_bytes - INTEGER
Controls TCP Small Queue limit per tcp socket.
TCP bulk sender tends to increase packets in flight until it
gets losses notifications. With SNDBUF autotuning, this can
result in a large amount of packets queued in qdisc/device
on the local machine, hurting latency of other flows, for
typical pfifo_fast qdiscs.
tcp_limit_output_bytes limits the number of bytes on qdisc
or device to reduce artificial RTT/cwnd and reduce bufferbloat.
Default: 131072
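
For reference, the check that enforces this limit in the 3.19-era
tcp_write_xmit() looks roughly like the following (a simplified sketch,
not the exact upstream code):

	/* Allow about 1ms of data at the current pacing rate, capped by
	 * tcp_limit_output_bytes; above that, stop and wait for TX
	 * completions (tcp_wfree) to release more packets.
	 */
	limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
	limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

	if (atomic_read(&sk->sk_wmem_alloc) > limit) {
		set_bit(TSQ_THROTTLED, &tp->tsq_flags);
		break;
	}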

This is why I suggested to Eyal Perry to change the TX interrupt
mitigation parameters as in :

ethtool -C eth0 tx-frames 4 rx-frames 4

With this change and the stretch ack fixes, I got 37Gbps of throughput
on a single flow, on a 40Gbit NIC (mlx4)

If a driver needs to buffer more than tcp_limit_output_bytes=131072 to
get line rate, I suggest that you either :

1) tweak tcp_limit_output_bytes, but it's not practical from a driver.

2) change the driver, knowing what are its exact requirements, by
removing a fraction of skb->truesize at ndo_start_xmit() time as in :

	if ((skb->destructor == sock_wfree ||
	     skb->destructor == tcp_wfree) &&
	    skb->sk) {
		u32 fraction = skb->truesize / 2;

		skb->truesize -= fraction;
		atomic_sub(fraction, &skb->sk->sk_wmem_alloc);
	}
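
For illustration only, here is where such a trim could sit in a driver
(hypothetical driver, function names invented for the example; note that,
as pointed out later in the thread, tcp_wfree is not exported to modules):

static netdev_tx_t foo_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* Give back part of the accounted memory so TSQ lets the socket
	 * queue more packets toward this device. The fraction is a driver
	 * policy decision based on its buffering needs.
	 */
	if (skb->sk &&
	    (skb->destructor == sock_wfree || skb->destructor == tcp_wfree)) {
		u32 fraction = skb->truesize / 2;

		skb->truesize -= fraction;
		atomic_sub(fraction, &skb->sk->sk_wmem_alloc);
	}

	return foo_queue_to_hw(dev, skb);	/* hand the frame to the hardware queues */
}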

Thanks.



2015-01-30 13:47:39

by Arend van Spriel

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 01/30/15 14:19, Eric Dumazet wrote:
> On Fri, 2015-01-30 at 11:29 +0100, Arend van Spriel wrote:
>
>> Hi Eric,
>>
>> Your suggestions are still based on the assumption that wireless
>> networking is similar to ethernet, but as Michal indicated there are
>> some fundamental differences, starting with CSMA/CD versus CSMA/CA. Also
>> the medium conditions are far from comparable. There is no shielding, so
>> it needs to deal with interference, and it dynamically drops the link rate,
>> so transmission of packets can take several milliseconds. Then with 11n
>> they came up with aggregation, which sends up to 64 packets in a single
>> transmit over the air at a worst case of 6.5 Mbps (if I am not mistaken).
>> The parameter value for tcp_limit_output_bytes of 131072 means that it
>> allows queuing for about 1ms on a 1Gbps link, but I hope you can see
>> this is not realistic for dealing with all the variance of the wireless
>> medium/standard. I suggested this as a topic for the wireless workshop in
>> Ottawa [1], but I cannot attend. Still, I hope there will be
>> some discussions to raise awareness.
>
> Ever heard about bufferbloat ?

Sure. I am trying to get awareness about that in our wireless
driver/firmware development teams. So bear with me.

> Have you read my suggestions and tried them ?
>
> You can adjust the limit per flow to pretty much whatever you want. If you
> need 64 packets, just do the math. If in 2018 you need 128 packets, do the
> math again.
>
> I am very well aware that wireless wants aggregation, thank you.

Sorry if I offended you. I was just giving these as an example, combined
with the effective rate usable on the medium, to say that the bandwidth is
more dynamic in wireless and as such needs dynamic changes of queue depth.
Now this could be done by making the fraction size used in your
suggestion adaptive to these conditions.

> 131072 bytes of queue on 40Gbit is not 1ms, but 26 usec of queueing, and
> we get line rate nevertheless.

I was saying it was about 1ms on *1Gbit*, as wireless TCP rates are
moving in that direction with 11ac.

> We need this level of shallow queues (BQL, TSQ), to get very precise rtt
> estimations so that TCP has good entropy for its pacing, even in the 50
> usec rtt ranges.
>
> If we allowed 1ms of queueing, then a 40Gbit flow would queue 5 MBytes.
>
> This was terrible, because it increased cwnd and all sender queues to
> insane levels.

Indeed, and that is what we would like to address in our wireless
drivers. I will set up some experiments using the fraction sizing and
post my findings. Again, sorry if I offended you.

Regards,
Arend

2015-01-30 13:39:46

by Michal Kazior

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 29 January 2015 at 14:14, Eric Dumazet <[email protected]> wrote:
> On Thu, 2015-01-29 at 12:48 +0100, Michal Kazior wrote:
>> Hi,
>>
>> I'm not subscribed to netdev list and I can't find the message-id so I
>> can't reply directly to the original thread `BW regression after "tcp:
>> refine TSO autosizing"`.
>>
>> I've noticed a big TCP performance drop with ath10k
>> (drivers/net/wireless/ath/ath10k) on 3.19-rc5. Instead of 500mbps I
>> get 250mbps in my testbed.
>>
>> After bisecting I ended up at `tcp: refine TSO autosizing`. Reverting
>> `tcp: refine TSO autosizing` and `tcp: Do not apply TSO segment limit
>> to non-TSO packets` (for conflict free reverts) fixes the problem.
[...]
>> Any hints/ideas?
>>
>
> Hi Michal
>
> This patch restored original TSQ behavior, because the 1ms worth of data
> per flow had totally destroyed TSQ intent.
>
> vi +630 Documentation/networking/ip-sysctl.txt
[...]
> With this change and the stretch ack fixes, I got 37Gbps of throughput
> on a single flow, on a 40Gbit NIC (mlx4)
>
> If a driver needs to buffer more than tcp_limit_output_bytes=131072 to
> get line rate, I suggest that you either :
>
> 1) tweak tcp_limit_output_bytes, but it's not practical from a driver.

I've briefly tried playing with this knob to no avail unfortunately. I
tried 256K, 1M - it didn't improve TCP performance. When I tried to
make it smaller (e.g. 16K) the traffic dropped even more so it does
have an effect. It seems there's some other limiting factor in this
case.


> 2) change the driver, knowing what are its exact requirements, by
> removing a fraction of skb->truesize at ndo_start_xmit() time as in :
>
> 	if ((skb->destructor == sock_wfree ||
> 	     skb->destructor == tcp_wfree) &&
> 	    skb->sk) {
> 		u32 fraction = skb->truesize / 2;
>
> 		skb->truesize -= fraction;
> 		atomic_sub(fraction, &skb->sk->sk_wmem_alloc);
> 	}

tcp_wfree isn't exported so I went and just did a quick hack and used
`if (skb->sk)` as the condition. Didn't help with TCP.

I wonder if this TSO patch is the single thing to blame. Maybe it's
just the trigger and something else is broken? I've been seeing some
UDP performance inconsistencies in 3.19 (I don't recall seeing these
in 3.18) - but maybe my perception is failing on me.

Anyway, thanks for the tips so far.


Michał

2015-01-30 14:40:38

by Eric Dumazet

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Fri, 2015-01-30 at 14:39 +0100, Michal Kazior wrote:

> I've briefly tried playing with this knob to no avail unfortunately. I
> tried 256K, 1M - it didn't improve TCP performance. When I tried to
> make it smaller (e.g. 16K) the traffic dropped even more so it does
> have an effect. It seems there's some other limiting factor in this
> case.

Interesting.

Could you take some tcpdump/pcap with various tcp_limit_output_bytes
values ?

echo 131072 >/proc/sys/net/ipv4/tcp_limit_output_bytes
tcpdump -p -i wlanX -s 128 -c 20000 -w 128k.pcap

echo 262144 >/proc/sys/net/ipv4/tcp_limit_output_bytes
tcpdump -p -i wlanX -s 128 -c 20000 -w 256k.pcap

...

Thanks !



2015-01-30 14:37:31

by Eric Dumazet

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Fri, 2015-01-30 at 14:47 +0100, Arend van Spriel wrote:

> Indeed, and that is what we would like to address in our wireless
> drivers. I will set up some experiments using the fraction sizing and
> post my findings. Again, sorry if I offended you.

You did not, but I had no feedback about my suggestions.

Michal sent it now.

Thanks



2015-01-30 13:19:38

by Eric Dumazet

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Fri, 2015-01-30 at 11:29 +0100, Arend van Spriel wrote:

> Hi Eric,
>
> Your suggestions are still based on the assumption that wireless
> networking is similar to ethernet, but as Michal indicated there are
> some fundamental differences, starting with CSMA/CD versus CSMA/CA. Also
> the medium conditions are far from comparable. There is no shielding, so
> it needs to deal with interference, and it dynamically drops the link rate,
> so transmission of packets can take several milliseconds. Then with 11n
> they came up with aggregation, which sends up to 64 packets in a single
> transmit over the air at a worst case of 6.5 Mbps (if I am not mistaken).
> The parameter value for tcp_limit_output_bytes of 131072 means that it
> allows queuing for about 1ms on a 1Gbps link, but I hope you can see
> this is not realistic for dealing with all the variance of the wireless
> medium/standard. I suggested this as a topic for the wireless workshop in
> Ottawa [1], but I cannot attend. Still, I hope there will be
> some discussions to raise awareness.

Ever heard about bufferbloat ?

Have you read my suggestions and tried them ?

You can adjust the limit per flow to pretty much whatever you want. If you
need 64 packets, just do the math. If in 2018 you need 128 packets, do the
math again.
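
(For scale: with the default tcp_limit_output_bytes of 131072 and a typical
truesize of roughly 2 kB for a full-sized frame, 131072 / ~2048 ≈ 64 packets
in flight per socket; halving the accounted truesize in the driver, as in the
snippet earlier in the thread, roughly doubles that. These numbers are only
illustrative, since truesize depends on how the driver allocates its skbs.)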

I am very well aware that wireless wants aggregation, thank you.

131072 bytes of queue on 40Gbit is not 1ms, but 26 usec of queueing, and
we get line rate nevertheless.

We need this level of shallow queues (BQL, TSQ), to get very precise rtt
estimations so that TCP has good entropy for its pacing, even in the 50
usec rtt ranges.

If we allowed 1ms of queueing, then a 40Gbit flow would queue 5 MBytes.

This was terrible, because it increased cwnd and all sender queues to
insane levels.



2015-01-30 10:29:34

by Arend van Spriel

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 01/29/15 14:14, Eric Dumazet wrote:
> On Thu, 2015-01-29 at 12:48 +0100, Michal Kazior wrote:
>> Hi,
>>
>> I'm not subscribed to netdev list and I can't find the message-id so I
>> can't reply directly to the original thread `BW regression after "tcp:
>> refine TSO autosizing"`.
>>
>> I've noticed a big TCP performance drop with ath10k
>> (drivers/net/wireless/ath/ath10k) on 3.19-rc5. Instead of 500mbps I
>> get 250mbps in my testbed.
>>
>> After bisecting I ended up at `tcp: refine TSO autosizing`. Reverting
>> `tcp: refine TSO autosizing` and `tcp: Do not apply TSO segment limit
>> to non-TSO packets` (for conflict free reverts) fixes the problem.
>>
>> My testing setup is as follows:
>>
>> a) ath10k AP, github.com/kvalo/ath/tree/master 3.19-rc5, w/ reverts
>> b) ath10k STA connected to (a), github.com/kvalo/ath/tree/master
>> 3.19-rc5, w/ reverts
>> c) (b) w/o reverts
>>
>> Devices are 3x3 (AP) and 2x2 (Client) and are RF cabled. 11ac@80MHz
>> 2x2 has 866mbps modulation rate. In practice this should deliver
>> ~700mbps of real UDP traffic.
>>
>> Here are some numbers:
>>
>> UDP: (b) -> (a): 672mbps
>> UDP: (a) -> (b): 687mbps
>> TCP: (b) -> (a): 526mbps
>> TCP: (a) -> (b): 500mbps
>>
>> UDP: (c) -> (a): 669mbps*
>> UDP: (a) -> (c): 689mbps*
>> TCP: (c) -> (a): 240mbps**
>> TCP: (a) -> (c): 490mbps*
>>
>> * no changes/within error margin
>> ** the performance drop
>>
>> I'm using iperf:
>> UDP: iperf -i1 -s -u vs iperf -i1 -c XX -u -B 200M -P5 -t 20
>> TCP: iperf -i1 -s vs iperf -i1 -c XX -P5 -t 20
>>
>> Result values were obtained at the receiver side.
>>
>> Iperf reports a few frames lost and out-of-order at each UDP test
>> start (during first second) but later has no packet loss and no
>> out-of-order. This shouldn't have any effect on a TCP session, right?
>>
>> The device delivers batched up tx/rx completions (no way to change
>> that). I suppose this could be an issue for timing sensitive
>> algorithms. Also keep in mind 802.11n and 802.11ac devices have frame
>> aggregation windows so there's an inherent extra (and non-uniform)
>> latency when compared to, e.g. ethernet devices.
>>
>> The driver doesn't have GRO. I have an internal patch which implements
>> it. It improves overall TCP traffic (more stable, up to 600mbps TCP
>> which is ~100mbps more than without GRO) but the TCP: (c) -> (a)
>> performance drop remains unaffected regardless.
>>
>> I've tried applying stretch ACK patchset (v2) on both machines and
>> re-run the above tests. I got no measurable difference in performance.
>>
>> I've also run these tests with iwlwifi 7260 (also a 2x2) as (b) and
>> (c). It didn't seem to be affected by the TSO patch at all (it runs at
>> ~360mbps of TCP regardless of the TSO patch).
>>
>> Any hints/ideas?
>>
>
> Hi Michal
>
> This patch restored original TSQ behavior, because the 1ms worth of data
> per flow had totally destroyed TSQ intent.
>
> vi +630 Documentation/networking/ip-sysctl.txt
>
> tcp_limit_output_bytes - INTEGER
> Controls TCP Small Queue limit per tcp socket.
> TCP bulk sender tends to increase packets in flight until it
> gets losses notifications. With SNDBUF autotuning, this can
> result in a large amount of packets queued in qdisc/device
> on the local machine, hurting latency of other flows, for
> typical pfifo_fast qdiscs.
> tcp_limit_output_bytes limits the number of bytes on qdisc
> or device to reduce artificial RTT/cwnd and reduce bufferbloat.
> Default: 131072
>
> This is why I suggested to Eyal Perry to change the TX interrupt
> mitigation parameters as in :
>
> ethtool -C eth0 tx-frames 4 rx-frames 4
>
> With this change and the stretch ack fixes, I got 37Gbps of throughput
> on a single flow, on a 40Gbit NIC (mlx4)
>
> If a driver needs to buffer more than tcp_limit_output_bytes=131072 to
> get line rate, I suggest that you either :
>
> 1) tweak tcp_limit_output_bytes, but it's not practical from a driver.
>
> 2) change the driver, knowing what are its exact requirements, by
> removing a fraction of skb->truesize at ndo_start_xmit() time as in :
>
> 	if ((skb->destructor == sock_wfree ||
> 	     skb->destructor == tcp_wfree) &&
> 	    skb->sk) {
> 		u32 fraction = skb->truesize / 2;
>
> 		skb->truesize -= fraction;
> 		atomic_sub(fraction, &skb->sk->sk_wmem_alloc);
> 	}

Hi Eric,

Your suggestions are still based on the assumption that wireless
networking is similar to ethernet, but as Michal indicated there are
some fundamental differences, starting with CSMA/CD versus CSMA/CA. Also
the medium conditions are far from comparable. There is no shielding, so
it needs to deal with interference, and it dynamically drops the link rate,
so transmission of packets can take several milliseconds. Then with 11n
they came up with aggregation, which sends up to 64 packets in a single
transmit over the air at a worst case of 6.5 Mbps (if I am not mistaken).
The parameter value for tcp_limit_output_bytes of 131072 means that it
allows queuing for about 1ms on a 1Gbps link, but I hope you can see
this is not realistic for dealing with all the variance of the wireless
medium/standard. I suggested this as a topic for the wireless workshop in
Ottawa [1], but I cannot attend. Still, I hope there will be
some discussions to raise awareness.

Regards,
Arend

[1] http://mid.gmane.org/[email protected]

2015-02-05 13:33:32

by Eric Dumazet

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Thu, 2015-02-05 at 05:19 -0800, Eric Dumazet wrote:

>
> TCP could eventually dynamically adjust the tcp_limit_output_bytes,
> using a per flow dynamic value, but I would rather not add a kludge in
> TCP stack only to deal with a possible bug in ath10k driver.
>
> niu has a similar issue and simply had to call skb_orphan() :
>
> drivers/net/ethernet/sun/niu.c:6669: skb_orphan(skb);

In your case that might be the place :

diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
index 4bc51d8a14a3..cbda7a87d5a1 100644
--- a/drivers/net/wireless/ath/ath10k/htt_tx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
@@ -468,6 +468,7 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct sk_buff *msdu)
 	msdu_id = res;
 	htt->pending_tx[msdu_id] = msdu;
 	spin_unlock_bh(&htt->tx_lock);
+	skb_orphan(msdu);
 
 	prefetch_len = min(htt->prefetch_len, msdu->len);
 	prefetch_len = roundup(prefetch_len, 4);




2015-02-10 12:55:01

by Eric Dumazet

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:

> + if (msdu->sk) {
> + ewma_add(&ar->tx_delay_us,
> + ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
> + NSEC_PER_USEC);
> +
> + ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
> + (ewma_read(&ar->tx_delay_us) *
> + msdu->sk->sk_pacing_rate) >> 20;
> + }
> +

Hi Michal

This is almost it ;)

As I said, you must do this using u64 arithmetic; we still support 32-bit
kernels.

Also, >> 20 instead of / 1000000 introduces a 5% error, I would use a
plain divide, as the compiler will use a reciprocal divide (ie : a
multiply)

We use >> 10 instead of /1000 because a 2.4 % error is probably okay.

	ewma_add(&ar->tx_delay_us,
		 ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
		 NSEC_PER_USEC);

	u64 val = (u64)ewma_read(&ar->tx_delay_us) *
		  msdu->sk->sk_pacing_rate;

	do_div(val, USEC_PER_SEC);

	ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) = (u32)val;

(WRITE_ONCE() would be better for new kernels, but ACCESS_ONCE() is ok
since we probably want to backport to stable kernels)


Thanks



2015-02-02 23:06:05

by Eric Dumazet

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Mon, 2015-02-02 at 13:25 -0800, Ben Greear wrote:

> It is a big throughput win to have fewer TCP ack packets on
> wireless since it is a half-duplex environment. Is there anything
> we could improve so that we can have fewer acks and still get
> good tcp stack behaviour?

First apply TCP stretch ack fixes to the sender. There is no way to get
good performance if the sender does not handle stretch ack.

d6b1a8a92a14 tcp: fix timing issue in CUBIC slope calculation
9cd981dcf174 tcp: fix stretch ACK bugs in CUBIC
c22bdca94782 tcp: fix stretch ACK bugs in Reno
814d488c6126 tcp: fix the timid additive increase on stretch ACKs
e73ebb0881ea tcp: stretch ACK fixes prep

Then, make sure you do not throttle ACK too long, especially if you hope
to get Gbit line rate on a 4 ms RTT flow.

GRO does not mean : send one ACK every ms, or after 3ms delay...

It is literally :
aggregate X packets at receive, and send the ACK asap.

If the receiver expects to have 64 ACK packets in the TX ring buffer to
actually send them (wifi aggregation), then you certainly do not want to
compress ACK too much.




2015-02-06 13:40:08

by Eric Dumazet

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Fri, 2015-02-06 at 10:42 +0100, Michal Kazior wrote:

> The above brings back previous behaviour, i.e. I can get 600mbps TCP
> on 5 flows again. Single flow is still (as it was before TSO
> autosizing) limited to roughly ~280mbps.
>
> I never really bothered before to understand why I need to push a few
> flows through ath10k to max it out, i.e. if I run a single UDP flow I
> get ~300mbps while with, e.g. 5 I get 670mbps easily.
>

For single UDP flow, tweaking /proc/sys/net/core/wmem_default might be
enough : UDP has no callback from TX completion to feed following frames
(No write queue like TCP)

# cat /proc/sys/net/core/wmem_default
212992
# ethtool -C eth1 tx-usecs 1024 tx-frames 120
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1450   10.00       697705      0     809.27
212992           10.00       673412            781.09

# echo 800000 >/proc/sys/net/core/wmem_default
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

800000    1450   10.00      7329221      0    8501.84
212992           10.00      7284051           8449.44


> I guess it was the tx completion latency all along.
>
> I just put an extra debug to ath10k to see the latency between
> submission and completion. Here's a log
> (http://www.filedropper.com/complete-log) of 2s run of UDP iperf
> trying to push 1gbps but managing only 300mbps.
>
> I've made sure not to hold any locks nor introduce any delays internal
> to ath10k. Frames get completed within 2-4ms on average during load.


tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
of TX completion delay. But this would require yet another expensive
call to ktime_get() if HZ < 1000.

Then tcp_write_xmit() could use it to adjust :

limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);

to

amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate

limit = max(2 * skb->truesize, amount / 1000);
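
Spelled out with types, that adjustment might look roughly like this
(a sketch only; tp->tx_completion_delay_ms is the not-yet-existing field
proposed above):

	/* (2 + completion delay) ms worth of data at the current pacing
	 * rate; sk_pacing_rate is in bytes per second.
	 */
	u64 amount = (u64)(2 + tp->tx_completion_delay_ms) *
		     sk->sk_pacing_rate;

	limit = max_t(u32, 2 * skb->truesize, div_u64(amount, MSEC_PER_SEC));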

I'll cook a patch.

Thanks.



2015-02-05 13:03:06

by Eric Dumazet

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Thu, 2015-02-05 at 07:46 +0100, Michal Kazior wrote:
> On 4 February 2015 at 22:11, Eric Dumazet <[email protected]> wrote:

> > Most conservative patch would be :
> >
> > diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
> > index 9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4 100644
> > --- a/drivers/net/wireless/ath/ath10k/htt_rx.c
> > +++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
> > @@ -1642,6 +1642,7 @@ void ath10k_htt_t2h_msg_handler(struct ath10k *ar, struct sk_buff *skb)
> > break;
> > }
> > case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
> > + skb_orphan(skb);
> > spin_lock_bh(&htt->tx_lock);
> > __skb_queue_tail(&htt->tx_compl_q, skb);
> > spin_unlock_bh(&htt->tx_lock);
>
> I suppose you want to call skb_orphan() on actual data packets, right?
> This skb is just a host-firmware communication buffer.

Right. I have no idea how you find the actual data packet at this stage.



2015-02-04 21:11:45

by Eric Dumazet

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

I do not see how a TSO patch could hurt a flow not using TSO/GSO.

This makes no sense.

ath10k tx completions being batched/deferred to a tasklet might increase
the probability of hitting this condition in tcp_wfree():

	/* If this softirq is serviced by ksoftirqd, we are likely under stress.
	 * Wait until our queues (qdisc + devices) are drained.
	 * This gives :
	 * - less callbacks to tcp_write_xmit(), reducing stress (batches)
	 * - chance for incoming ACK (processed by another cpu maybe)
	 *   to migrate this flow (skb->ooo_okay will be eventually set)
	 */
	if (wmem >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
		goto out;

Meaning the TCP stack waits until all skbs have left the qdisc/NIC queues
before queuing additional packets.

I would try to call skb_orphan() in ath10k if you really want to keep
these batches.

I have a hard time understanding why tx completed packets go through
ath10k_htc_rx_completion_handler()... anyway...

Most conservative patch would be :

diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
index 9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4 100644
--- a/drivers/net/wireless/ath/ath10k/htt_rx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
@@ -1642,6 +1642,7 @@ void ath10k_htt_t2h_msg_handler(struct ath10k *ar, struct sk_buff *skb)
 		break;
 	}
 	case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
+		skb_orphan(skb);
 		spin_lock_bh(&htt->tx_lock);
 		__skb_queue_tail(&htt->tx_compl_q, skb);
 		spin_unlock_bh(&htt->tx_lock);



2015-02-06 13:53:07

by Eric Dumazet

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:

> tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
> of TX completion delay. But this would require yet another expensive
> call to ktime_get() if HZ < 1000.
>
> Then tcp_write_xmit() could use it to adjust :
>
> limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
>
> to
>
> amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate
>
> limit = max(2 * skb->truesize, amount / 1000);
>
> I'll cook a patch.

Hmm... doing this in all protocols would be too expensive,
and we do not want to include time spent in qdiscs.

wifi could eventually do that, providing in skb->tx_completion_delay_us
the time spent in wifi driver.

This way, we would have no penalty for network devices doing normal skb
orphaning (loopback interface, ethernet, ...)
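
A minimal sketch of what that could look like in a wifi driver, assuming
such a skb->tx_completion_delay_us field existed (it does not in mainline;
cb->submit_time stands for driver-private storage):

	/* At submission time: remember when the frame was handed to the
	 * hardware/firmware.
	 */
	cb->submit_time = ktime_get();

	/* At TX completion time: report how long the frame spent in the
	 * driver/firmware, so TCP could widen its per-socket limit.
	 */
	if (skb->sk)
		skb->tx_completion_delay_us =
			ktime_us_delta(ktime_get(), cb->submit_time);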



2015-02-06 15:21:18

by David Laight

Subject: RE: Throughput regression with `tcp: refine TSO autosizing`

From: Eric Dumazet
> On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:
>
>
> > wifi could eventually do that, providing in skb->tx_completion_delay_us
> > the time spent in wifi driver.
> >
> > This way, we would have no penalty for network devices doing normal skb
> > orphaning (loopback interface, ethernet, ...)
>
> Another way would be that wifi does an automatic orphaning after 1 or
> 2ms.

Couldn't you do byte counting?
So orphan enough packets to keep a few ms of tx traffic (at the current
tx rate) orphaned.
You might need to give the hardware both orphaned and non-orphaned (parented?)
packets and orphan some when you get a tx complete for an orphaned packet.

	David

2015-02-04 13:16:51

by Eric Dumazet

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

OK guys

Using a mlx4 testbed I can reproduce the problem by pushing coalescing
settings and disabling SG (thus disabling GSO)

ethtool -K eth0 sg off
Actual changes:
scatter-gather: off
tx-scatter-gather: off
generic-segmentation-offload: off [requested on]

ethtool -C eth0 tx-usecs 1024 tx-frames 64

Meaning that NIC waits one ms before sending the TX IRQ,
and can accumulate 64 frames before forcing the interrupt.

We probably have a bug in cwnd expansion logic :

lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H 10.246.7.152 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=230 rttvar=30 snd_ssthresh=41 cwnd=59 reordering=3 total_retrans=1 ca_state=0 pacing_rate=5943.1 Mbits
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00      530.39    0.40     0.32     2.965   2.398


-> final cwnd=59 which is not enough to avoid the 1ms delay between each
burst.

So sender sends ~60 packets, then has to wait 1ms (to get NIC TX IRQ)
before sending the following burst.

I am CCing Neal, he probably can help to root cause the problem.

Thanks



2015-02-12 08:33:54

by Dave Taht

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Wed, Feb 11, 2015 at 11:48 PM, Michal Kazior <[email protected]> wrote:
> On 11 February 2015 at 09:57, Michal Kazior <[email protected]> wrote:
>> On 10 February 2015 at 15:19, Johannes Berg <[email protected]> wrote:
>>> On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
>>>
>>>> + if (msdu->sk) {
>>>> + ewma_add(&ar->tx_delay_us,
>>>> + ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
>>>> + NSEC_PER_USEC);
>>>> +
>>>> + ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
>>>> + (ewma_read(&ar->tx_delay_us) *
>>>> + msdu->sk->sk_pacing_rate) >> 20;
>>>> + }
>>>
>>> To some extent, every wifi driver is going to have this problem. Perhaps
>>> we should do this in mac80211?
>>
>> Good point. I was actually thinking about it. I can try cooking a
>> patch unless you want to do it yourself :-)
>
> I've taken a look into this. The most obvious place to add the
> timestamp for each packet would be ieee80211_tx_info (i.e. the
> skb->cb[48]). The problem is it's very tight there. Even squeezing 2
> bytes (allowing up to 64ms of tx completion delay which I'm worried

I will argue strongly in favor of never allowing more than 4ms packets
to accumulate in the firmware.

> won't be enough) will be troublesome. Some drivers already use every
> last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
> 40 bytes of driver_data).
>
> I wonder if it's okay to bump skb->cb to 56 bytes to avoid the cascade
> of changes required to implement the tx completion delay accounting?
>
>
> Michał



--
Dave Täht

http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks

2015-02-10 14:19:27

by Johannes Berg

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:

> + if (msdu->sk) {
> + ewma_add(&ar->tx_delay_us,
> + ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
> + NSEC_PER_USEC);
> +
> + ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
> + (ewma_read(&ar->tx_delay_us) *
> + msdu->sk->sk_pacing_rate) >> 20;
> + }

To some extent, every wifi driver is going to have this problem. Perhaps
we should do this in mac80211?

johannes


2015-02-04 12:55:26

by Johannes Berg

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Wed, 2015-02-04 at 13:53 +0100, Michal Kazior wrote:

> The hardware itself seems to be capable. The firmware is a problem
> though. I'm also not sure if mac80211 can handle this as is. No 802.11
> driver seems to support SG except wil6210 which uses cfg80211 and
> netdevs directly.

mac80211 cannot deal with this right now. This would make a good topic
for the workshop since there's interest elsewhere in this as well. It's
probably not terribly hard to do as far as mac80211 is concerned.

How much offload do you really have though? Sometimes people just want
to build A-MSDUs.

johannes


2015-02-04 11:57:26

by Eric Dumazet

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Wed, 2015-02-04 at 12:35 +0100, Michal Kazior wrote:

> > (Or maybe wifi drivers should start to use skb->xmit_more as a signal to end aggregation)
>
> This could work if your firmware/device supports this kind of thing.
> To my understanding ath10k firmware doesn't.

This is a pure software signal. You do not need firmware support.

Idea is the following :

Your driver gets a train of messages, coming from upper layers (TCP, IP,
qdisc)

It can know that a packet is not the last one, by looking at
skb->xmit_more.

Basically, aggregation logic could use this signal as a very clear
indicator you got the end of a train -> force the xmit right now.
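
A rough sketch of that idea for a hypothetical netdev-style wireless driver
(all foo_* names invented for the example):

static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct foo_priv *priv = netdev_priv(dev);

	/* Queue the frame into the aggregate currently being built. */
	foo_add_to_ampdu(priv, skb);

	/* xmit_more set means the stack has more frames for us right away,
	 * so keep aggregating; once it is clear, the train has ended and
	 * the aggregate should be pushed to the hardware now.
	 */
	if (!skb->xmit_more)
		foo_flush_ampdu(priv);

	return NETDEV_TX_OK;
}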

To disable gso you would have to use :

ethtool -K wlan1 gso off




2015-02-06 09:58:01

by Michal Kazior

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 5 February 2015 at 20:50, Dave Taht <[email protected]> wrote:
[...]
> And I really, really, really wish, that just once during this thread,
> someone had bothered to try running a test
> at a real world MCS rate - say MCS1, or MCS4, and measured the latency
> under load of that...

Time between frame submission to firmware and tx-completion on one of
my ath10k machines:

Legacy 54mbps: ~18ms
Legacy 6mbps: ~37ms
11n MCS 3 (nss=0): ~13ms
11n MCS 8 (nss=1): ~6-8ms
11ac NSS=1 MCS=2: ~4-6ms
11ac NSS=2 MCS=0: ~5-8ms

Keep in mind this is a clean room environment so retransmissions are
kept at minimum. Obviously with a noisy environment you'll get retries
at different rates and higher latency.


Michał

2015-02-03 14:27:43

by Eric Dumazet

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Tue, 2015-02-03 at 12:50 +0100, Michal Kazior wrote:
> On 3 February 2015 at 02:18, Eric Dumazet <[email protected]> wrote:
> > On Mon, 2015-02-02 at 10:52 -0800, Eric Dumazet wrote:
> >
> >> It seems to break ACK clocking badly (linux stack has a somewhat buggy
> >> tcp_tso_should_defer(), which relies on ACK being received smoothly, as
> >> no timer is setup to split the TSO packet.)
> >
> > Following patch might help the TSO split defer logic.
> >
> > It would avoid setting the TSO defer 'pseudo timer' twice, if/when TCP
> > Small Queue logic prevented the xmit at the expiration of first 'timer'.
> >
> > This patch clears the tso_deferred variable only if we could really
> > send something.
> >
> > Please try it, thanks !
> [..patch..]
>
> I've done a second round of tests. I've added the A-MSDU count
> parameter I've mentioned in my other email into the mix.
>
> net - net/master (includes stretch ack patches)
> net-tso - net/master + your TSO defer patch
> net-gro - net/master + my ath10k GRO patch
> net-gro-tso - net/master + duh
>
> Here's the best of amsdu count 1 and 3:
>
> ; for (i in */output.txt) { echo $i; for (j in (1 3)) { cat $i | awk
> 'x && /Mbits/ {y=$0}; x && y && !/Mbits/ {print y; x=0; y=""}; /set
> amsdu cnt to '$j'/{x=1}' | awk '{ if (x < $(NF-1)) {x=$(NF-1)} }
> END{print "A-MSDU limit='$j', " x " Mbits/sec"}' } }
> net-gro-tso/output.txt
> A-MSDU limit=1, 436 Mbits/sec
> A-MSDU limit=3, 284 Mbits/sec
> net-gro/output.txt
> A-MSDU limit=1, 444 Mbits/sec
> A-MSDU limit=3, 283 Mbits/sec
> net-tso/output.txt
> A-MSDU limit=1, 376 Mbits/sec
> A-MSDU limit=3, 251 Mbits/sec
> net/output.txt
> A-MSDU limit=1, 387 Mbits/sec
> A-MSDU limit=3, 260 Mbits/sec
>
> IOW:
> - stretch acks / TSO defer don't seem to help much (when compared to
> throughput results from yesterday)
> - GRO helps
> - disabling A-MSDU on sender helps
> - net/master+GRO still doesn't reach the performance from before the
> regression (~600mbps w/ GRO)
>
> You can grab logs and dumps here: http://www.filedropper.com/test2tar
>

Thanks for these traces.

There is absolutely a problem at the sender, as we can see a big 2ms
delay between reception of ACK and send of following packets.
TCP stack should generate them immediately.
Are you using some kind of netem qdisc ?

These 2ms delays, in a flow with a 5ms RTT are terrible.

06:54:57.408391 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294899240, win 11268, options [nop,nop,TS val 1053302 ecr 1052250], length 0
06:54:57.408418 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294910824, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
06:54:57.408431 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294936888, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
06:54:57.408453 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294962952, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
06:54:57.408474 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 0, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
<this 2ms delay is not generated by TCP stack.>
06:54:57.410243 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 82536:83984, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410271 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 83984:85432, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410289 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 85432:86880, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410310 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 86880:88328, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410326 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 88328:89776, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410339 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 89776:91224, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410353 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 91224:92672, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410370 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 92672:94120, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448

...
06:54:57.411178 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 28960, win 11268, options [nop,nop,TS val 1053306 ecr 1052253], length 0
06:54:57.411190 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 52128, win 11268, options [nop,nop,TS val 1053306 ecr 1052254], length 0
06:54:57.411220 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 78192, win 11268, options [nop,nop,TS val 1053306 ecr 1052254], length 0
06:54:57.411243 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 82536, win 11268, options [nop,nop,TS val 1053306 ecr 1052254], length 0
< same 2ms unexplained gap here >
06:54:57.412912 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 165072:166520, ack 1, win 457, options [nop,nop,TS val 1052259 ecr 1053306], length 1448
06:54:57.412935 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 166520:167968, ack 1, win 457, options [nop,nop,TS val 1052259 ecr 1053306], length 1448
06:54:57.412948 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 167968:169416, ack 1, win 457, options [nop,nop,TS val 1052259 ecr 1053306], length 1448
...
06:54:57.413650 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 244712:246160, ack 1, win 457, options [nop,nop,TS val 1052260 ecr 1053306], length 1448
06:54:57.413662 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 246160:247608, ack 1, win 457, options [nop,nop,TS val 1052260 ecr 1053306], length 1448
06:54:57.413712 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 112944, win 11268, options [nop,nop,TS val 1053308 ecr 1052256], length 0
06:54:57.413730 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 130320, win 11268, options [nop,nop,TS val 1053308 ecr 1052257], length 0
06:54:57.413754 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 160728, win 11268, options [nop,nop,TS val 1053309 ecr 1052257], length 0
06:54:57.413779 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 165072, win 11268, options [nop,nop,TS val 1053309 ecr 1052257], length 0
< same 2ms delay >
06:54:57.415682 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 247608:249056, ack 1, win 457, options [nop,nop,TS val 1052262 ecr 1053309], length 1448
06:54:57.415709 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 249056:250504, ack 1, win 457, options [nop,nop,TS val 1052262 ecr 1053309], length 1448

Are packets TX completed after a timer or something ?

Some very heavy stuff might run from tasklet (or other softirq triggered) event.

BTW, traces tend to show that you 'receive' multiple ACKs in the same burst;
it's not clear whether they are delayed at one side or the other.

GRO should delay only GRO candidates. ACK packets are not GRO candidates.

Have you tried to disable GSO on sender ?

(Or maybe wifi drivers should start to use skb->xmit_more as a signal to end aggregation)



2015-02-05 13:44:03

by Michal Kazior

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 5 February 2015 at 14:19, Eric Dumazet <[email protected]> wrote:
> On Thu, 2015-02-05 at 04:57 -0800, Eric Dumazet wrote:
>
>> The intention is to control the queues to the following :
>>
>> 1 ms of buffering, but limited to a configurable value.
>>
>> On a 40Gbps flow, 1ms represents 5 MB, which is insane.
>>
>> We do not want to queue 5 MB of traffic, this would destroy latencies
>> for all concurrent flows. (Or would require having fq_codel or fq as
>> packet schedulers, instead of default pfifo_fast)
>>
>> This is why having 1.5 ms delay between the transmit and TX completion
>> is a problem in your case.

I do get your point. But 1.5ms is really tough on Wi-Fi.

Just look at this:

; ping 192.168.1.2 -c 3
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms

; ping 192.168.1.2 -c 3 -Q 224
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=0.939 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.906 ms
64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=0.946 ms

This was run with no load so batching code in the driver itself should
have no measurable effect. The channel was near-ideal: low noise
floor, cabled rf, no other traffic.

The lower latency ping is when 802.11 QoS Voice Access Category is
used. I also get 400mbps instead of 250mbps in this case with 5 flows
(net/master).

Dealing with black box firmware blobs is a pain.


> Note that TCP stack could detect when this happens, *if* ACK where
> delivered before the TX completions, or when TX completion happens,
> we could detect that the clone of the freed packet was freed.
>
> In my test, when I did "ethtool -C eth0 tx-usecs 1024 tx-frames 64", and
> disabling GSO, TCP stack sends a bunch of packets (a bit less than 64),
> blocks on tcp_limit_output_bytes.
>
> Then we receive 2 stretch ACKS after ~50 usec.
>
> TCP stack tries to push again some packets but blocks on
> tcp_limit_output_bytes again.
>
> 1ms later, TX completion happens, tcp_wfree() is called, and TCP stack
> push following ~60 packets.
>
>
> TCP could eventually dynamically adjust the tcp_limit_output_bytes,
> using a per flow dynamic value, but I would rather not add a kludge in
> TCP stack only to deal with a possible bug in ath10k driver.
>
> niu has a similar issue and simply had to call skb_orphan() :
>
> drivers/net/ethernet/sun/niu.c:6669: skb_orphan(skb);

Ok. I tried calling skb_orphan() right after I submit each Tx frame
(similar to niu which does this in start_xmit):

--- a/drivers/net/wireless/ath/ath10k/htt_tx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
@@ -564,6 +564,8 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct sk_buff *msdu)
 	if (res)
 		goto err_unmap_msdu;
 
+	skb_orphan(msdu);
+
 	return 0;
 
 err_unmap_msdu:


Now, with {net/master + ath10k GRO + the above} I get 620mbps on a
single flow (even better than before). Wow.

Does this look ok/safe as a solution to you?


Michał

2015-02-24 10:30:36

by Johannes Berg

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Tue, 2015-02-24 at 11:24 +0100, Johannes Berg wrote:
> On Thu, 2015-02-12 at 08:48 +0100, Michal Kazior wrote:
>
> > > Good point. I was actually thinking about it. I can try cooking a
> > > patch unless you want to do it yourself :-)
> >
> > I've taken a look into this. The most obvious place to add the
> > timestamp for each packet would be ieee80211_tx_info (i.e. the
> > skb->cb[48]). The problem is it's very tight there. Even squeezing 2
> > bytes (allowing up to 64ms of tx completion delay which I'm worried
> > won't be enough) will be troublesome. Some drivers already use every
> > last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
> > 40 bytes of driver_data).
>
> Couldn't we just repurpose the existing skb->tstamp field for this, as
> long as the skb is fully contained within the wireless layer?
>
> Actually, it looks like we can't, since I guess timestamping options can
> be turned on on any socket.

Actually, that creates a clone or a new skb? Hmm.

Anyway, I think we should just move the vif and hw_key pointers out of
the tx_info and into ieee80211_tx_control. That will give us plenty of
space, and they can't really be used long-term anyway? Although ... both
might be somewhat problematic for mac80211 itself? Hmm.

johannes


2015-02-10 13:14:44

by Eric Dumazet

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
> ath10k_core_napi_dummy_poll, 64);
> + ewma_init(&ar->tx_delay_us, 16384, 8);


1) 16384 factor might be too big.

2) a weight of 8 seems too low given aggregation values used in wifi.

On 32bit arches, the max range for ewma value would be 262144 usec,
a quarter of a second...

You could use a factor of 64 instead, and a weight of 16.
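
Applied to the snippet above, that would be (illustrative; the struct and
field come from Michal's patch):

	/* factor 64 keeps ~2^26 usec (about 67 s) of range on 32-bit arches;
	 * weight 16 averages over more completions, matching the large
	 * batches wifi hardware completes at once.
	 */
	ewma_init(&ar->tx_delay_us, 64, 16);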



2015-02-06 17:48:23

by Rick Jones

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

> If you increase ability to flood on one flow, then you need to make sure
> receiver has big rcvbuf as well.
>
> echo 2000000 >/proc/sys/net/core/rmem_default
>
> Otherwise it might drop bursts.
>
> This is the kind of things that TCP does automatically, not UDP.

An alternative, if the application involved can make explicit
setsockopt() calls to set SO_SNDBUF and/or SO_RCVBUF, is to tweak
rmem_max and wmem_max and then let the application make the setsockopt()
calls.

Which path one would take would depend on circumstances I suspect.

rick jones

2015-02-05 19:50:04

by Dave Taht

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Fri, Feb 6, 2015 at 2:44 AM, Michal Kazior <[email protected]> wrote:
>
> On 5 February 2015 at 14:19, Eric Dumazet <[email protected]> wrote:
> > On Thu, 2015-02-05 at 04:57 -0800, Eric Dumazet wrote:
> >
> >> The intention is to control the queues to the following :
> >>
> >> 1 ms of buffering, but limited to a configurable value.
> >>
> >> On a 40Gbps flow, 1ms represents 5 MB, which is insane.
> >>
> >> We do not want to queue 5 MB of traffic, this would destroy latencies
> >> for all concurrent flows. (Or would require having fq_codel or fq as
> >> packet schedulers, instead of default pfifo_fast)
> >>
> >> This is why having 1.5 ms delay between the transmit and TX completion
> >> is a problem in your case.
>
> I do get your point. But 1.5ms is really tough on Wi-Fi.
>
> Just look at this:
>
> ; ping 192.168.1.2 -c 3
> PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
> 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
> 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
> 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms
>
> ; ping 192.168.1.2 -c 3 -Q 224
> PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
> 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=0.939 ms
> 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.906 ms
> 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=0.946 ms
>
> This was run with no load so batching code in the driver itself should
> have no measurable effect. The channel was near-ideal: low noise
> floor, cabled rf, no other traffic.
>
> The lower latency ping is when 802.11 QoS Voice Access Category is
> used. I also get 400mbps instead of 250mbps in this case with 5 flows
> (net/master).
>

The VO queue is now nearly useless in a real world environment. While
it does grab the media mildly faster in some cases, on a good day with
no other competing APs, it cannot aggregate packets, and it wastes TXOPs.
It is far saner to aim for better aggregation (use the VI queue if you
must try to get better media acquisition).

It is disabled in multiple products I know of.

And I really, really, really wish, that just once during this thread,
someone had bothered to try running a test
at a real world MCS rate - say MCS1, or MCS4, and measured the latency
under load of that...

or tried talking to two or more stations at the same time.

Instead of trying for 1.5Gbits in a faraday cage.


>
> Dealing with black box firmware blobs is a pain.
>

+10

>
>
> > Note that TCP stack could detect when this happens, *if* ACK where
> > delivered before the TX completions, or when TX completion happens,
> > we could detect that the clone of the freed packet was freed.
> >
> > In my test, when I did "ethtool -C eth0 tx-usecs 1024 tx-frames 64", and
> > disabling GSO, TCP stack sends a bunch of packets (a bit less than 64),
> > blocks on tcp_limit_output_bytes.
> >
> > Then we receive 2 stretch ACKS after ~50 usec.
> >
> > TCP stack tries to push again some packets but blocks on
> > tcp_limit_output_bytes again.
> >
> > 1ms later, TX completion happens, tcp_wfree() is called, and TCP stack
> > push following ~60 packets.
> >
> >
> > TCP could eventually dynamically adjust the tcp_limit_output_bytes,
> > using a per flow dynamic value, but I would rather not add a kludge in
> > TCP stack only to deal with a possible bug in ath10k driver.
> >
> > niu has a similar issue and simply had to call skb_orphan() :
> >
> > drivers/net/ethernet/sun/niu.c:6669: skb_orphan(skb);
>
> Ok. I tried calling skb_orphan() right after I submit each Tx frame
> (similar to niu which does this in start_xmit):
>
> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> @@ -564,6 +564,8 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct
> sk_buff *msdu)
> if (res)
> goto err_unmap_msdu;
>
> + skb_orphan(msdu);
> +
> return 0;
>
> err_unmap_msdu:
>
>
> Now, with {net/master + ath10k GRO + the above} I get 620mbps on a
> single flow (even better than before). Wow.
>
> Does this look ok/safe as a solution to you?
>
>
> Michał




--
Dave Täht

http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks

2015-02-02 18:52:18

by Eric Dumazet

Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Mon, 2015-02-02 at 11:27 +0100, Michal Kazior wrote:

> While testing I've had my internal GRO patch for ath10k and no stretch
> ack patches.

Thanks for the data, I took a look at it.

I am afraid this GRO patch might be the problem.

It seems to break ACK clocking badly (linux stack has a somewhat buggy
tcp_tso_should_defer(), which relies on ACK being received smoothly, as
no timer is setup to split the TSO packet.)

I am seeing huge delays on ACK packets and bursts like that :

05:01:53.413038 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 76745, win 4435, options [nop,nop,TS val 4294758508 ecr 4294757300], length 0
05:01:53.413407 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 79641, win 4435, options [nop,nop,TS val 4294758508 ecr 4294757301], length 0
05:01:53.413969 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 92673, win 4435, options [nop,nop,TS val 4294758510 ecr 4294757302], length 0
05:01:53.413990 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 97017, win 4435, options [nop,nop,TS val 4294758510 ecr 4294757302], length 0
05:01:53.414011 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 110049, win 4435, options [nop,nop,TS val 4294758510 ecr 4294757302], length 0
...
05:01:53.422663 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 189689, win 4435, options [nop,nop,TS val 4294758519 ecr 4294757310], length 0
05:01:53.424354 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 198377, win 4435, options [nop,nop,TS val 4294758520 ecr 4294757311], length 0
05:01:53.424400 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 202721, win 4435, options [nop,nop,TS val 4294758520 ecr 4294757313], length 0
05:01:53.424409 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 205617, win 4435, options [nop,nop,TS val 4294758520 ecr 4294757313], length 0
...
05:01:53.450248 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 419921, win 4435, options [nop,nop,TS val 4294758547 ecr 4294757337], length 0
05:01:53.450266 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 427161, win 4435, options [nop,nop,TS val 4294758547 ecr 4294757340], length 0
05:01:53.450289 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 431505, win 4435, options [nop,nop,TS val 4294758547 ecr 4294757340], length 0

Could you make again your experiments using upstream kernel (David
Miller net tree) ?

You also could post the GRO patch so that we can comment on it.

Thanks



2015-02-06 14:35:48

by Eric Dumazet

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Fri, 2015-02-06 at 15:08 +0100, Michal Kazior wrote:

> Hmm.. I confirm it works. However the value at which I get full rate
> on a single flow is more than 2048K. Also using non-default
> wmem_default seems to introduce packet loss as per iperf reports at
> the receiver. I suppose this is kind of expected but on the other hand
> wmem_default=262992 and 5 flows of UDP max the device out with 0
> packet loss.

If you increase the ability to flood on one flow, then you need to make sure
the receiver has a big rcvbuf as well.

echo 2000000 >/proc/sys/net/core/rmem_default

Otherwise it might drop bursts.

This is the kind of thing that TCP does automatically, not UDP.



2015-02-03 09:00:34

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 3 February 2015 at 00:06, Eric Dumazet <[email protected]> wrote:
> On Mon, 2015-02-02 at 13:25 -0800, Ben Greear wrote:
>
>> It is a big throughput win to have fewer TCP ack packets on
>> wireless since it is a half-duplex environment. Is there anything
>> we could improve so that we can have fewer acks and still get
>> good tcp stack behaviour?
>
> First apply TCP stretch ack fixes to the sender. There is no way to get
> good performance if the sender does not handle stretch ack.
>
> d6b1a8a92a14 tcp: fix timing issue in CUBIC slope calculation
> 9cd981dcf174 tcp: fix stretch ACK bugs in CUBIC
> c22bdca94782 tcp: fix stretch ACK bugs in Reno
> 814d488c6126 tcp: fix the timid additive increase on stretch ACKs
> e73ebb0881ea tcp: stretch ACK fixes prep
>
> Then, make sure you do not throttle ACK too long, especially if you hope
> to get Gbit line rate on a 4 ms RTT flow.
>
> GRO does not mean : send one ACK every ms, or after 3ms delay...

I think it's worth pointing out that if you assume a 3-frame A-MSDU and
a 64-frame A-MPDU you get 192 frames (as far as TCP/IP is concerned) per
aggregation window. Assuming an effective 600mbps throughput:

python> 1.0/((((600/8)*1024*1024)/1500)/(3*64))
0.003663003663003663

This is probably the worst case, but still worth keeping in mind.
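
Spelled out a bit more explicitly (a rough sketch using the same
assumptions as the one-liner above: 3-frame A-MSDU, 64-frame A-MPDU,
1500-byte frames, 600mbps):

# Time to drain one full aggregation window, i.e. the rough gap you can
# expect between bursts of TCP ACKs.
frames_per_window = 3 * 64                  # A-MSDU * A-MPDU = 192 frames
throughput_Bps = (600 / 8) * 1024 * 1024    # ~600 mbps in bytes per second
frames_per_second = throughput_Bps / 1500
print(frames_per_window / frames_per_second)   # ~0.00366 s, i.e. ~3.7 ms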

ath10k has a knob to tune A-MSDU aggregation count. The default is "3"
and it's what I've been testing so far.

When I change it to "1" on the sender I get a 250->400mbps boost in TCP
-P5 but see no difference with -P1 (number of flows). Changing it to "1"
on the receiver yields no difference. I can try adding this configuration
permutation to my future tests if you're interested.

So that you have an idea: using "1" on the sender degrades UDP throughput
(even 690->500mbps in some cases).


Michał

2015-02-04 12:38:52

by Eric Dumazet

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Wed, 2015-02-04 at 13:22 +0100, Michal Kazior wrote:
> On 4 February 2015 at 12:57, Eric Dumazet <[email protected]> wrote:

>
> > To disable gso you would have to use :
> >
> > ethtool -K wlan1 gso off
>
> Oh, thanks! This works. However I can't turn it on:
>
> ; ethtool -K wlan1 gso on
> Could not change any device features
>
> ..so I guess it makes no sense to re-run tests because:
>
> ; ethtool -k wlan1 | grep generic
> tx-checksum-ip-generic: on [fixed]
> generic-segmentation-offload: off [requested on]
> generic-receive-offload: on
>
> And this seems to never change.

GSO requires SG (Scatter Gather)

Are you sure this hardware has no SG support ?



2015-02-02 21:25:28

by Ben Greear

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 02/02/2015 10:52 AM, Eric Dumazet wrote:
> On Mon, 2015-02-02 at 11:27 +0100, Michal Kazior wrote:
>
>> While testing I've had my internal GRO patch for ath10k and no stretch
>> ack patches.
>
> Thanks for the data, I took a look at it.
>
> I am afraid this GRO patch might be the problem.
>
> It seems to break ACK clocking badly (linux stack has a somewhat buggy
> tcp_tso_should_defer(), which relies on ACK being received smoothly, as
> no timer is setup to split the TSO packet.)

It is a big throughput win to have fewer TCP ack packets on
wireless since it is a half-duplex environment. Is there anything
we could improve so that we can have fewer acks and still get
good tcp stack behaviour?

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2015-02-11 08:57:34

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 10 February 2015 at 15:19, Johannes Berg <[email protected]> wrote:
> On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
>
>> + if (msdu->sk) {
>> + ewma_add(&ar->tx_delay_us,
>> + ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
>> + NSEC_PER_USEC);
>> +
>> + ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
>> + (ewma_read(&ar->tx_delay_us) *
>> + msdu->sk->sk_pacing_rate) >> 20;
>> + }
>
> To some extent, every wifi driver is going to have this problem. Perhaps
> we should do this in mac80211?

Good point. I was actually thinking about it. I can try cooking a
patch unless you want to do it yourself :-)


Michał

2015-02-06 09:42:33

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 5 February 2015 at 18:10, Eric Dumazet <[email protected]> wrote:
> On Thu, 2015-02-05 at 06:41 -0800, Eric Dumazet wrote:
>
>> Not at all. This basically removes backpressure.
>>
>> A single UDP socket can now blast packets regardless of SO_SNDBUF
>> limits.
>>
>> This basically remove years of work trying to fix bufferbloat.
>>
>> I still do not understand why increasing tcp_limit_output_bytes is not
>> working for you.
>
> Oh well, tcp_limit_output_bytes might be ok.
>
> In fact, the problem comes from GSO assumption. Maybe Herbert was right,
> when he suggested TCP would be simpler if we enforced GSO...
>
> When GSO is used, the thing works because 2*skb->truesize is roughly 2
> ms worth of traffic.
>
> Because you do not use GSO, and tx completions are slow, we need this :
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 65caf8b95e17..ac01b4cd0035 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2044,7 +2044,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> break;
>
> /* TCP Small Queues :
> - * Control number of packets in qdisc/devices to two packets / or ~1 ms.
> + * Control number of packets in qdisc/devices to two packets /
> + * or ~2 ms (sk->sk_pacing_rate >> 9) in case GSO is off.
> * This allows for :
> * - better RTT estimation and ACK scheduling
> * - faster recovery
> @@ -2053,7 +2054,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> * of queued bytes to ensure line rate.
> * One example is wifi aggregation (802.11 AMPDU)
> */
> - limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
> + limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
> limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>
> if (atomic_read(&sk->sk_wmem_alloc) > limit) {
>

The above brings back previous behaviour, i.e. I can get 600mbps TCP
on 5 flows again. Single flow is still (as it was before TSO
autosizing) limited to roughly ~280mbps.

I never really bothered before to understand why I need to push a few
flows through ath10k to max it out, i.e. if I run a single UDP flow I
get ~300mbps while with, e.g. 5 I get 670mbps easily.

I guess it was the tx completion latency all along.

I just added some extra debugging to ath10k to see the latency between
submission and completion. Here's a log
(http://www.filedropper.com/complete-log) of a 2s run of UDP iperf
trying to push 1gbps but managing only 300mbps.

I've made sure not to hold any locks nor introduce any ath10k-internal
delays. Frames get completed in 2-4ms on average under load.

When I tried using a different ath10k hw & fw combination I got 1-2ms of
tx completion latency, yielding ~430mbps while the max should be around
670mbps.
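
As a very rough model (assuming the stack effectively caps the bytes
outstanding below the socket at the default tcp_limit_output_bytes of
131072), the completion delay alone roughly predicts those numbers:

# Rough model, not exact TCP behaviour: with at most `limit` bytes allowed
# below the socket and `delay` seconds until their tx completions return,
# a single flow cannot sustain more than limit / delay bytes per second.
def max_throughput_mbps(limit_bytes, delay_s):
    return limit_bytes / delay_s * 8 / 1e6

print(max_throughput_mbps(131072, 0.003))    # ~350 mbps at 3 ms completion delay
print(max_throughput_mbps(131072, 0.0015))   # ~700 mbps at 1.5 ms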


Michał

2015-02-06 09:39:54

by Nicolas Cavallari

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 05/02/2015 15:48, Eric Dumazet wrote:
> On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:
>
>> I do get your point. But 1.5ms is really tough on Wi-Fi.
>>
>> Just look at this:
>>
>> ; ping 192.168.1.2 -c 3
>> PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
>> 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
>> 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
>> 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms
>
> Thats a different point.
>
> I dont care about rtt but TX completions. (usually much much lower than
> rtt)

On wired network perhaps, but definitely not on Wi-Fi.

With aggregation, you may send up to 4ms of data before the receiver
can acknowledge anything. But you have to gain access to the channel
first, so you may wait while others finish off their 4ms
transmissions. And this does not account for retransmissions.

And aggregation is not the only problem as far as bufferbloat is
concerned. I don't even want to think about powersave.
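
A crude way to put numbers on that (ignoring backoff, retries and
powersave entirely; 4 ms TXOPs assumed for everyone):

# Crude worst case: every other station finishes its own ~4 ms aggregate
# before we get to transmit ours and see the block-ack for it.
txop_ms = 4
for other_stations in (0, 1, 3, 7):
    total_ms = other_stations * txop_ms + txop_ms   # their airtime + ours
    print(other_stations, "other stations ->", total_ms, "ms")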

2015-02-02 10:27:16

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 30 January 2015 at 15:40, Eric Dumazet <[email protected]> wrote:
> On Fri, 2015-01-30 at 14:39 +0100, Michal Kazior wrote:
>
>> I've briefly tried playing with this knob to no avail unfortunately. I
>> tried 256K, 1M - it didn't improve TCP performance. When I tried to
>> make it smaller (e.g. 16K) the traffic dropped even more so it does
>> have an effect. It seems there's some other limiting factor in this
>> case.
>
> Interesting.
>
> Could you take some tcpdump/pcap with various tcp_limit_output_bytes
> values ?
>
> echo 131072 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> tcpdump -p -i wlanX -s 128 -c 20000 -w 128k.pcap
>
> echo 262144 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> tcpdump -p -i wlanX -s 128 -c 20000 -w 256k.pcap

I've run a couple of tests across different kernels. This got pretty
big so I decided to use an external file hosting:
http://www.filedropper.com/testtar

Let me know if you can't access it (and perhaps you could suggest how
you prefer the logs to be delivered in that case).

The layout of logs is: $kernel/$limit-P$threads.pcap. I've also
included the test script and output of each test run.

While testing I've had my internal GRO patch for ath10k and no stretch
ack patches.

When I was trying to come up with a testing methodology I noticed
something interesting:
1. set 16k limit
2. start iperf -P1
3. observe 200mbps
4. set 2048k limit (while iperf is running)
5. observe 600mbps
6. set the 16k limit back (while iperf is running)
7. observe 500-600mbps (i.e. no drop to 200mbps)

Due to that I've decided to re-start iperf for each limit test.

If you want me to gather some other logs/dumps/configuration
permutations let me know, please.


Michał

2015-02-09 15:11:33

by Eric Dumazet

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Mon, 2015-02-09 at 14:47 +0100, Michal Kazior wrote:

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 65caf8b..5e249bf 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1996,6 +1996,7 @@ static bool tcp_write_xmit(struct sock *sk,
> unsigned int mss_now, int nonagle,
> max_segs = tcp_tso_autosize(sk, mss_now);
> while ((skb = tcp_send_head(sk))) {
> unsigned int limit;
> + unsigned int amount;
>
> tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
> BUG_ON(!tso_segs);
> @@ -2053,7 +2054,9 @@ static bool tcp_write_xmit(struct sock *sk,
> unsigned int mss_now, int nonagle,
> * of queued bytes to ensure line rate.
> * One example is wifi aggregation (802.11 AMPDU)
> */
> - limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
> + amount = sk->sk_tx_completion_delay_us *
> + (sk->sk_pacing_rate >> 10);
> + limit = max(2 * skb->truesize, amount >> 10);
> limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>
> if (atomic_read(&sk->sk_wmem_alloc) > limit) {

This is not what I suggested.

If you test this on any other network device, you'll have
sk->sk_tx_completion_delay_us == 0

amount = 0 * (sk->sk_pacing_rate >> 10); --> 0
limit = max(2 * skb->truesize, amount >> 10); --> 2 * skb->truesize

So a non-TSO/GSO NIC will not be able to queue more than 2 MSS (one MSS
per skb).

Then, if you store only the last tx completion delay, the last packet of
a train (say a retransmit) could make the value very low.

Ideally the formula would be in TCP something very fast to compute :

amount = (sk->sk_pacing_rate >> 10) + sk->tx_completion_delay_cushion;
limit = max(2 * skb->truesize, amount);
limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

So a 'problematic' driver would have to do the math (64 bit maths) like
this :


sk->tx_completion_delay_cushion = ewma_tx_delay * sk->sk_pacing_rate;
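
As a plain arithmetic sketch of how those two pieces would fit together
(illustrative only; the example numbers below are made up):

# The driver keeps an EWMA of its tx completion delay and publishes
# delay * pacing_rate as a byte cushion; TCP adds it to its ~1 ms allowance.
def tsq_limit(truesize, pacing_rate, cushion_bytes, sysctl_limit=131072):
    amount = (pacing_rate >> 10) + cushion_bytes   # ~1 ms of data + cushion
    return min(max(2 * truesize, amount), sysctl_limit)

pacing_rate = 75 * 1000 * 1000           # ~600 mbps expressed in bytes/s
cushion = int(0.002 * pacing_rate)       # driver measured ~2 ms completion delay
print(tsq_limit(4608, pacing_rate, cushion))   # clamped to 131072 by the sysctl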






2015-02-04 11:35:24

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 3 February 2015 at 15:27, Eric Dumazet <[email protected]> wrote:
> On Tue, 2015-02-03 at 12:50 +0100, Michal Kazior wrote:
[...]
>> IOW:
>> - stretch acks / TSO defer don't seem to help much (when compared to
>> throughput results from yesterday)
>> - GRO helps
>> - disabling A-MSDU on sender helps
>> - net/master+GRO still doesn't reach the performance from before the
>> regression (~600mbps w/ GRO)
>>
>> You can grab logs and dumps here: http://www.filedropper.com/test2tar
>>
>
> Thanks for these traces.
>
> There is absolutely a problem at the sender, as we can see a big 2ms
> delay between reception of ACK and send of following packets.
> TCP stack should generate them immediately.
> Are you using some kind of netem qdisc ?

Both systems have identical setup:

; tc qdisc
qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1
2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap 1 2 2 2 1
2 0 0 1 1 1 1 1 1 1 1
qdisc mq 0: dev wlan1 root
qdisc pfifo_fast 0: dev wlan1 parent :1 bands 3 priomap 1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev wlan1 parent :2 bands 3 priomap 1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev wlan1 parent :3 bands 3 priomap 1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev wlan1 parent :4 bands 3 priomap 1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1


> These 2ms delays, in a flow with a 5ms RTT are terrible.
>
> 06:54:57.408391 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294899240, win 11268, options [nop,nop,TS val 1053302 ecr 1052250], length 0
> 06:54:57.408418 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294910824, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
> 06:54:57.408431 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294936888, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
> 06:54:57.408453 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294962952, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
> 06:54:57.408474 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 0, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
> <this 2ms delay is not generated by TCP stack.>
> 06:54:57.410243 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 82536:83984, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
[...]
>
> Are packets TX completed after a timer or something ?

As far as ath10k is concerned - no timers here. Not sure about
firmware itself though.


> Some very heavy stuff might run from tasklet (or other softirq triggered) event.
>
> BTW, traces tend to show that you 'receive' multiple ACK in the same burst,
> its not clear if they are delayed at one side or the other.
>
> GRO should delay only GRO candidates. ACK packets are not GRO candidates.
>
> Have you tried to disable GSO on sender ?

I assume I do that via ethtool? This is my current setup on both systems:

; ethtool -k wlan1
Features for wlan1:
rx-checksumming: off [fixed]
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on [fixed]
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: off
tx-scatter-gather: off [fixed]
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off [fixed]
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp6-segmentation: off [fixed]
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: off [requested on]
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: off [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: on [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
busy-poll: off [fixed]

; ethtool -K wlan1 generic-segmentation-offload off
ethtool: bad command line argument(s)
For more information run ethtool -h


> (Or maybe wifi drivers should start to use skb->xmit_more as a signal to end aggregation)

This could work if your firmware/device supports this kind of thing.
To my understanding ath10k firmware doesn't.


Michał

2015-02-03 15:04:03

by Eric Dumazet

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Tue, 2015-02-03 at 06:27 -0800, Eric Dumazet wrote:

> Are packets TX completed after a timer or something ?
>
> Some very heavy stuff might run from tasklet (or other softirq triggered) event.
>

Right, commit 6c5151a9ffa9f796f2d707617cecb6b6b241dff8
("ath10k: batch htt tx/rx completions")
is very suspicious.

Please revert it.

BTW, ath10k_htt_txrx_compl_task() runs from softirq context, so the
_bh() prefixes are not really needed.

It seems a lot of batching happens in wifi drivers, not necessarily at the
right places.




2015-02-12 07:48:25

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 11 February 2015 at 09:57, Michal Kazior <[email protected]> wrote:
> On 10 February 2015 at 15:19, Johannes Berg <[email protected]> wrote:
>> On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
>>
>>> + if (msdu->sk) {
>>> + ewma_add(&ar->tx_delay_us,
>>> + ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
>>> + NSEC_PER_USEC);
>>> +
>>> + ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
>>> + (ewma_read(&ar->tx_delay_us) *
>>> + msdu->sk->sk_pacing_rate) >> 20;
>>> + }
>>
>> To some extent, every wifi driver is going to have this problem. Perhaps
>> we should do this in mac80211?
>
> Good point. I was actually thinking about it. I can try cooking a
> patch unless you want to do it yourself :-)

I've taken a look into this. The most obvious place to add the
timestamp for each packet would be ieee80211_tx_info (i.e. the
skb->cb[48]). The problem is it's very tight there. Even squeezing 2
bytes (allowing up to 64ms of tx completion delay which I'm worried
won't be enough) will be troublesome. Some drivers already use every
last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
40 bytes of driver_data).

I wonder if it's okay to bump skb->cb to 56 bytes to avoid the cascade
of changes required to implement the tx completion delay accounting?


Michał

2015-02-24 10:59:51

by Johannes Berg

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Tue, 2015-02-24 at 11:30 +0100, Johannes Berg wrote:
> On Tue, 2015-02-24 at 11:24 +0100, Johannes Berg wrote:
> > On Thu, 2015-02-12 at 08:48 +0100, Michal Kazior wrote:
> >
> > > > Good point. I was actually thinking about it. I can try cooking a
> > > > patch unless you want to do it yourself :-)
> > >
> > > I've taken a look into this. The most obvious place to add the
> > > timestamp for each packet would be ieee80211_tx_info (i.e. the
> > > skb->cb[48]). The problem is it's very tight there. Even squeezing 2
> > > bytes (allowing up to 64ms of tx completion delay which I'm worried
> > > won't be enough) will be troublesome. Some drivers already use every
> > > last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
> > > 40 bytes of driver_data).
> >
> > Couldn't we just repurpose the existing skb->tstamp field for this, as
> > long as the skb is fully contained within the wireless layer?
> >
> > Actually, it looks like we can't, since I guess timestamping options can
> > be turned on on any socket.
>
> Actually, that creates a clone or a new skb? Hmm.

Ah and then it puts it on the error queue right away, so I think we can
reuse it.

johannes


2015-02-04 13:29:52

by Eric Dumazet

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Wed, 2015-02-04 at 05:16 -0800, Eric Dumazet wrote:
> OK guys
>
> Using a mlx4 testbed I can reproduce the problem by pushing coalescing
> settings and disabling SG (thus disabling GSO)
>
> ethtool -K eth0 sg off
> Actual changes:
> scatter-gather: off
> tx-scatter-gather: off
> generic-segmentation-offload: off [requested on]
>
> ethtool -C eth0 tx-usecs 1024 tx-frames 64
>
> Meaning that NIC waits one ms before sending the TX IRQ,
> and can accumulate 64 frames before forcing the interrupt.
>
> We probably have a bug in cwnd expansion logic :
>
> lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H 10.246.7.152 -Cc
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
> rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=230 rttvar=30 snd_ssthresh=41 cwnd=59 reordering=3 total_retrans=1 ca_state=0 pacing_rate=5943.1 Mbits
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 16384 10.00 530.39 0.40 0.32 2.965 2.398
>
>
> -> final cwnd=59 which is not enough to avoid the 1ms delay between each
> burst.
>
> So sender sends ~60 packets, then has to wait 1ms (to get NIC TX IRQ)
> before sending the following burst.
>
> I am CCing Neal, he probably can help to root cause the problem.

Arg, this was with net-next, i.e. not including our recent stretch ack
fixes.

Using David Miller 'net' tree, cwnd seems OK.

Speed is low because the 64 queued frames are exceeding
tcp_limit_output_bytes.

lpaa23:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes
131072
lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H 10.246.7.152 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=166 rttvar=16 snd_ssthresh=26 cwnd=59 reordering=3 total_retrans=0 ca_state=0 pacing_rate=8203.52 Mbits
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB

87380 16384 16384 10.00 569.96 0.52 0.38 3.588 2.625


lpaa23:~# echo 262144 >/proc/sys/net/ipv4/tcp_limit_output_bytes
lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H 10.246.7.152 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=98 rttvar=18 snd_ssthresh=312 cwnd=313 reordering=3 total_retrans=23 ca_state=0
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB

87380 16384 16384 10.00 8518.40 2.60 1.57 1.200 0.727





2015-02-05 08:38:25

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 4 February 2015 at 22:11, Eric Dumazet <[email protected]> wrote:
> I do not see how a TSO patch could hurt a flow not using TSO/GSO.
>
> This makes no sense.

Hmm..

@@ -2018,8 +2053,8 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
* of queued bytes to ensure line rate.
* One example is wifi aggregation (802.11 AMPDU)
*/
- limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
- sk->sk_pacing_rate >> 10);
+ limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+ limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

if (atomic_read(&sk->sk_wmem_alloc) > limit) {
set_bit(TSQ_THROTTLED, &tp->tsq_flags);

Doesn't this effectively invert how tcp_limit_output_bytes is used?
This would explain why raising the limit wasn't changing anything
anymore when you asked me to do so. Only decreasing it yielded any
change.

I've added a printk to show up the new and old values. Excerpt from logs:

[ 114.782740] (4608 39126 131072 = 39126) vs (131072 39126 = 131072)

(2*truesize, pacing_rate, tcp_limit = limit) vs (tcp_limit, pacing_rate = limit)

Reverting this patch hunk alone fixes my TCP problem. Not that I'm
saying the old logic was correct (it seems it wasn't, a limit should
be applied as min(value, max_value), right?).

Anyway the change doesn't seem to be TSO-only oriented so it would
explain the "makes no sense".


Michał

2015-02-03 01:18:54

by Eric Dumazet

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Mon, 2015-02-02 at 10:52 -0800, Eric Dumazet wrote:

> It seems to break ACK clocking badly (linux stack has a somewhat buggy
> tcp_tso_should_defer(), which relies on ACK being received smoothly, as
> no timer is setup to split the TSO packet.)

Following patch might help the TSO split defer logic.

It would avoid setting the TSO defer 'pseudo timer' twice, if/when TCP
Small Queue logic prevented the xmit at the expiration of first 'timer'.

This patch clears the tso_deferred variable only if we could really
send something.

Please try it, thanks !


diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b95e17..e735f38557db 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1821,7 +1821,6 @@ static bool tcp_tso_should_defer(struct sock *sk,
struct sk_buff *skb,
return true;

send_now:
- tp->tso_deferred = 0;
return false;
}

@@ -2070,6 +2069,7 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
break;

+ tp->tso_deferred = 0;
repair:
/* Advance the send_head. This one is sent out.
* This call will increment packets_out.



2015-02-10 15:09:40

by Eric Dumazet

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Tue, 2015-02-10 at 15:19 +0100, Johannes Berg wrote:
> On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
>
> > + if (msdu->sk) {
> > + ewma_add(&ar->tx_delay_us,
> > + ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
> > + NSEC_PER_USEC);
> > +
> > + ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
> > + (ewma_read(&ar->tx_delay_us) *
> > + msdu->sk->sk_pacing_rate) >> 20;
> > + }
>
> To some extent, every wifi driver is going to have this problem. Perhaps
> we should do this in mac80211?

I'll provide the TCP patch.

sk->sk_tx_completion_delay_cushion is probably the wrong name, as the
units here are bytes: it is really the number of bytes in the
network driver that accommodate for tx completion delays.

tx_completion_delay * pacing_rate

sk_tx_completion_cushion maybe.



2015-02-05 17:10:16

by Eric Dumazet

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Thu, 2015-02-05 at 06:41 -0800, Eric Dumazet wrote:

> Not at all. This basically removes backpressure.
>
> A single UDP socket can now blast packets regardless of SO_SNDBUF
> limits.
>
> This basically remove years of work trying to fix bufferbloat.
>
> I still do not understand why increasing tcp_limit_output_bytes is not
> working for you.

Oh well, tcp_limit_output_bytes might be ok.

In fact, the problem comes from GSO assumption. Maybe Herbert was right,
when he suggested TCP would be simpler if we enforced GSO...

When GSO is used, the thing works because 2*skb->truesize is roughly 2
ms worth of traffic.

Because you do not use GSO, and tx completions are slow, we need this :

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b95e17..ac01b4cd0035 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2044,7 +2044,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
break;

/* TCP Small Queues :
- * Control number of packets in qdisc/devices to two packets / or ~1 ms.
+ * Control number of packets in qdisc/devices to two packets /
+ * or ~2 ms (sk->sk_pacing_rate >> 9) in case GSO is off.
* This allows for :
* - better RTT estimation and ACK scheduling
* - faster recovery
@@ -2053,7 +2054,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
* of queued bytes to ensure line rate.
* One example is wifi aggregation (802.11 AMPDU)
*/
- limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+ limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

if (atomic_read(&sk->sk_wmem_alloc) > limit) {



2015-02-10 13:05:18

by Eric Dumazet

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Tue, 2015-02-10 at 04:54 -0800, Eric Dumazet wrote:

> Hi Michal
>
> This is almost it ;)
>
> As I said you must do this using u64 arithmetics, we still support 32bit
> kernels.
>
> Also, >> 20 instead of / 1000000 introduces a 5% error, I would use a
> plain divide, as the compiler will use a reciprocal divide (ie : a
> multiply)
>
> We use >> 10 instead of /1000 because a 2.4 % error is probably okay.
>
> ewma_add(&ar->tx_delay_us,
> ktime_to_ns(ktime_sub(ktime_get(),
> skb_cb->stamp)) /
> NSEC_PER_USEC);

btw I suspect this won't compile on a 32-bit kernel

You need to use do_div() as well :

u64 val = ktime_to_ns(ktime_sub(ktime_get(),
skb_cb->stamp));

do_div(val, NSEC_PER_USEC);

ewma_add(&ar->tx_delay_us, val);



2015-02-03 08:44:28

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 2 February 2015 at 19:52, Eric Dumazet <[email protected]> wrote:
> On Mon, 2015-02-02 at 11:27 +0100, Michal Kazior wrote:
>
>> While testing I've had my internal GRO patch for ath10k and no stretch
>> ack patches.
>
> Thanks for the data, I took a look at it.
>
> I am afraid this GRO patch might be the problem.

The entire performance drop happens without the GRO patch as well. I
tested with it included because I intended to upstream it later. I'll
run without it in future tests.


[...]
> Could you make again your experiments using upstream kernel (David
> Miller net tree) ?

Sure.


> You also could post the GRO patch so that we can comment on it.

(You probably want to see mac80211 patch as well:
06d181a8fd58031db9c114d920b40d8820380a6e "mac80211: add NAPI support
back")

diff --git a/drivers/net/wireless/ath/ath10k/core.c
b/drivers/net/wireless/ath/ath10k/core.c
index 36a8fcf..367e896 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -1147,6 +1147,12 @@ err:
}
EXPORT_SYMBOL(ath10k_core_start);

+static int ath10k_core_napi_dummy_poll(struct napi_struct *napi, int budget)
+{
+ WARN_ON(1);
+ return 0;
+}
+
int ath10k_wait_for_suspend(struct ath10k *ar, u32 suspend_opt)
{
int ret;
@@ -1414,6 +1420,10 @@ struct ath10k *ath10k_core_create(size_t
priv_size, struct device *dev,
INIT_WORK(&ar->register_work, ath10k_core_register_work);
INIT_WORK(&ar->restart_work, ath10k_core_restart);

+ init_dummy_netdev(&ar->napi_dev);
+ ieee80211_napi_add(ar->hw, &ar->napi, &ar->napi_dev,
+ ath10k_core_napi_dummy_poll, 64);
+
ret = ath10k_debug_create(ar);
if (ret)
goto err_free_wq;
@@ -1434,6 +1444,7 @@ void ath10k_core_destroy(struct ath10k *ar)
{
flush_workqueue(ar->workqueue);
destroy_workqueue(ar->workqueue);
+ netif_napi_del(&ar->napi);

ath10k_debug_destroy(ar);
ath10k_mac_destroy(ar);
diff --git a/drivers/net/wireless/ath/ath10k/core.h
b/drivers/net/wireless/ath/ath10k/core.h
index 2d9f871..b5a8847 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -623,6 +623,9 @@ struct ath10k {

struct dfs_pattern_detector *dfs_detector;

+ struct net_device napi_dev;
+ struct napi_struct napi;
+
#ifdef CONFIG_ATH10K_DEBUGFS
struct ath10k_debug debug;
#endif
diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c
b/drivers/net/wireless/ath/ath10k/htt_rx.c
index c1da44f..7e58b38 100644
--- a/drivers/net/wireless/ath/ath10k/htt_rx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
@@ -2061,5 +2061,7 @@ static void ath10k_htt_txrx_compl_task(unsigned long ptr)
ath10k_htt_rx_in_ord_ind(ar, skb);
dev_kfree_skb_any(skb);
}
+
+ napi_gro_flush(&htt->ar->napi, false);
spin_unlock_bh(&htt->rx_ring.lock);
}

So that you can quickly get an understanding of how ath10k Rx works: the
first tasklet (not visible in the patch) picks up smallish event
buffers from the firmware and puts them into an ath10k queue for later
processing by another tasklet (the last hunk). Each such event buffer
is just some metainfo but can "carry" tens of frames (both Rx and Tx
completions). The count is arbitrary and depends on the fw/hw combo and
air conditions. The GRO flush is called after all queued small event
buffers are processed (frames are delivered up to mac80211, which can in
turn perform aggregation reordering in case some frames were
re-transmitted in the meantime, before handing them to the net subsystem).


Michał

2015-02-06 15:02:16

by Eric Dumazet

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Fri, 2015-02-06 at 14:31 +0000, David Laight wrote:
> From: Eric Dumazet
> > On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:
> >
> >
> > > wifi could eventually do that, providing in skb->tx_completion_delay_us
> > > the time spent in wifi driver.
> > >
> > > This way, we would have no penalty for network devices doing normal skb
> > > orphaning (loopback interface, ethernet, ...)
> >
> > Another way would be that wifi does an automatic orphaning after 1 or
> > 2ms.
>
> Couldn't you do byte counting?
> So orphan enough packets to keep a few ms of tx traffic (at the current
> tx rate) orphaned.
> You might need to give the hardware both orphaned and non-orphaned (parented?)
> packets and orphan some when you get a tx complete for an orphaned packet.

We already have byte counting.

The thing is: a driver can keep an skb for itself, but call
skb_orphan() in time to allow the socket to send more packets.

For say a UDP server, it would be quite mandatory, as it usually uses a
single UDP socket to receive and send messages.






2015-02-05 14:41:29

by Eric Dumazet

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:

> Ok. I tried calling skb_orphan() right after I submit each Tx frame
> (similar to niu which does this in start_xmit):
>
> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> @@ -564,6 +564,8 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct
> sk_buff *msdu)
> if (res)
> goto err_unmap_msdu;
>
> + skb_orphan(msdu);
> +
> return 0;
>
> err_unmap_msdu:
>
>
> Now, with {net/master + ath10k GRO + the above} I get 620mbps on a
> single flow (even better than before). Wow.
>
> Does this look ok/safe as a solution to you?

Not at all. This basically removes backpressure.

A single UDP socket can now blast packets regardless of SO_SNDBUF
limits.

This basically remove years of work trying to fix bufferbloat.

I still do not understand why increasing tcp_limit_output_bytes is not
working for you.





2015-02-11 08:33:45

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 10 February 2015 at 14:14, Eric Dumazet <[email protected]> wrote:
> On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
>> ath10k_core_napi_dummy_poll, 64);
>> + ewma_init(&ar->tx_delay_us, 16384, 8);
>
>
> 1) 16384 factor might be too big.
>
> 2) a weight of 8 seems too low given aggregation values used in wifi.
>
> On 32bit arches, the max range for ewma value would be 262144 usec,
> a quarter of a second...
>
> You could use a factor of 64 instead, and a weight of 16.

64/16 seems to work fine as well.

On a related note: I still wonder how to get a single TCP flow to reach
line rate with ath10k (it still doesn't; I reach line rate with
multiple flows only). Isn't tcp_limit_output_bytes just too small
for devices like Wi-Fi where you can send aggregates as long as 64*3*1500
bytes in a single shot, and you can't expect even a single
tx-completion to come in before the whole aggregate is transmitted? You
effectively operate with bursts of traffic.
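
Just to put numbers on it (assuming the 131072 default for
tcp_limit_output_bytes):

# One full ath10k aggregate, as seen by TCP/IP, vs the default TSQ byte limit.
aggregate_bytes = 64 * 3 * 1500          # A-MPDU * A-MSDU * frame size = 288000
tcp_limit_output_bytes = 131072          # default value
print(aggregate_bytes / tcp_limit_output_bytes)   # ~2.2x the default limit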

Some numbers:
ath10k w/o cushion w/o aggregation 1 flow: UDP 65mbps, TCP 30mbps
ath10k w/ cushion w/o aggregation 1 flow: UDP 65mbps, TCP 59mbps
ath10k w/o cushion w/ aggregation 1 flow: UDP 650mbps, TCP 250mbps
ath10k w/ cushion w/ aggregation 1 flow: UDP 650mbps, TCP 250mbps
ath10k w/o cushion w/ aggregation 5 flows: UDP 650mbps, TCP 250mbps
ath10k w/ cushion w/ aggregation 5 flows: UDP 650mbps, TCP 600mbps

"w/o aggregation" means forcing ath10k to use 1 A-MSDU and 1 A-MPDU
per aggregate so latencies due to aggregation itself should be pretty
much nil.

If I set tcp_limit_output_bytes to 700K+ I can get ath10k w/ cushion
w/ aggregation to reach 600mbps on a single flow.


Michał

2015-02-04 12:22:25

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 4 February 2015 at 12:57, Eric Dumazet <[email protected]> wrote:
> On Wed, 2015-02-04 at 12:35 +0100, Michal Kazior wrote:
>
>> > (Or maybe wifi drivers should start to use skb->xmit_more as a signal to end aggregation)
>>
>> This could work if your firmware/device supports this kind of thing.
>> To my understanding ath10k firmware doesn't.
>
> This is a pure software signal. You do not need firmware support.
>
> Idea is the following :
>
> Your driver gets a train of messages, coming from upper layers (TCP, IP,
> qdisc)
>
> It can know that a packet is not the last one, by looking at
> skb->xmit_more.
>
> Basically, aggregation logic could use this signal as a very clear
> indicator you got the end of a train -> force the xmit right now.

There's no way to tell the ath10k firmware: "xmit right now". The firmware
does all the tx aggregation logic by itself. The host driver just submits a
frame and hopes it'll get out soon. It's not even a tx-ring like you'd
expect. Each frame has a host-assigned id which the firmware then uses in
the tx completion.


> To disable gso you would have to use :
>
> ethtool -K wlan1 gso off

Oh, thanks! This works. However I can't turn it on:

; ethtool -K wlan1 gso on
Could not change any device features

..so I guess it makes no sense to re-run tests because:

; ethtool -k wlan1 | grep generic
tx-checksum-ip-generic: on [fixed]
generic-segmentation-offload: off [requested on]
generic-receive-offload: on

And this seems to never change.


Michał

2015-02-12 07:16:22

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 11 February 2015 at 14:17, Eric Dumazet <[email protected]> wrote:
> On Wed, 2015-02-11 at 09:33 +0100, Michal Kazior wrote:
>
>> If I set tcp_limit_output_bytes to 700K+ I can get ath10k w/ cushion
>> w/ aggregation to reach 600mbps on a single flow.
>
> You know, there is a reason this sysctl exists in the first place ;)
>
> The first suggestion I made to you was to raise it.
>
> The default setting must stay as is as long as the default Qdisc is pfifo_fast.
>
> I believe I already mentioned skb->truesize tricks for drivers willing
> to adjust the TSQ given their constraints.

Right. truesize didn't help in my early tests, and once the cushion
thing came about I assumed it wasn't relevant anymore.

I just checked:

@@ -2620,6 +2621,12 @@ static void ath10k_tx(struct ieee80211_hw *hw,
if (info->flags & IEEE80211_TX_CTL_NO_CCK_RATE)
ath10k_dbg(ar, ATH10K_DBG_MAC,
"IEEE80211_TX_CTL_NO_CCK_RATE\n");

+ if (skb->sk) {
+ u32 trim = skb->truesize - (skb->truesize / 8);
+ skb->truesize -= trim;
+ atomic_sub(trim, &skb->sk->sk_wmem_alloc);
+ }

With this I get 600mbps on a single flow. The /2 wasn't enough (it
barely made a difference, 250->300mbps). The question is: how do I know
how much trimming is too much? Could the tx completion delay be
used to compute the trim factor, hmm..
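
A back-of-the-envelope attempt at answering that (assuming the trim just
scales how many real bytes fit under the same ~131072-byte budget, and
using the 2-4 ms completion delays measured earlier at ~600 mbps):

# Bytes needed in flight to ride out the completion delay at ~600 mbps,
# vs the effective budget once truesize is under-reported by `factor`.
budget = 131072
rate_Bps = 600 / 8 * 1e6
for delay_ms in (2, 4):
    print("need ~%d bytes for %d ms" % (rate_Bps * delay_ms / 1000, delay_ms))
for factor in (1, 2, 8):
    print("truesize/%d -> effective budget ~%d bytes" % (factor, budget * factor))
# need ~150000-300000 bytes; /2 (~262144) is borderline, /8 (~1048576) is plenty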

Maybe this should be done in mac80211 as well?


Michał

2015-02-03 11:51:01

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 3 February 2015 at 02:18, Eric Dumazet <[email protected]> wrote:
> On Mon, 2015-02-02 at 10:52 -0800, Eric Dumazet wrote:
>
>> It seems to break ACK clocking badly (linux stack has a somewhat buggy
>> tcp_tso_should_defer(), which relies on ACK being received smoothly, as
>> no timer is setup to split the TSO packet.)
>
> Following patch might help the TSO split defer logic.
>
> It would avoid setting the TSO defer 'pseudo timer' twice, if/when TCP
> Small Queue logic prevented the xmit at the expiration of first 'timer'.
>
> This patch clears the tso_deferred variable only if we could really
> send something.
>
> Please try it, thanks !
[..patch..]

I've done a second round of tests. I've added the A-MSDU count
parameter I mentioned in my other email to the mix.

net - net/master (includes stretch ack patches)
net-tso - net/master + your TSO defer patch
net-gro - net/master + my ath10k GRO patch
net-gro-tso - net/master + duh

Here's the best of amsdu count 1 and 3:

; for (i in */output.txt) { echo $i; for (j in (1 3)) { cat $i | awk
'x && /Mbits/ {y=$0}; x && y && !/Mbits/ {print y; x=0; y=""}; /set
amsdu cnt to '$j'/{x=1}' | awk '{ if (x < $(NF-1)) {x=$(NF-1)} }
END{print "A-MSDU limit='$j', " x " Mbits/sec"}' } }
net-gro-tso/output.txt
A-MSDU limit=1, 436 Mbits/sec
A-MSDU limit=3, 284 Mbits/sec
net-gro/output.txt
A-MSDU limit=1, 444 Mbits/sec
A-MSDU limit=3, 283 Mbits/sec
net-tso/output.txt
A-MSDU limit=1, 376 Mbits/sec
A-MSDU limit=3, 251 Mbits/sec
net/output.txt
A-MSDU limit=1, 387 Mbits/sec
A-MSDU limit=3, 260 Mbits/sec

IOW:
- stretch acks / TSO defer don't seem to help much (when compared to
throughput results from yesterday)
- GRO helps
- disabling A-MSDU on sender helps
- net/master+GRO still doesn't reach the performance from before the
regression (~600mbps w/ GRO)

You can grab logs and dumps here: http://www.filedropper.com/test2tar


Michał

2015-02-06 14:09:38

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 6 February 2015 at 14:53, Eric Dumazet <[email protected]> wrote:
> On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:
>
>> tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
>> of TX completion delay. But this would require yet another expensive
>> call to ktime_get() if HZ < 1000.
>>
>> Then tcp_write_xmit() could use it to adjust :
>>
>> limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
>>
>> to
>>
>> amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate
>>
>> limit = max(2 * skb->truesize, amount / 1000);
>>
>> I'll cook a patch.
>
> Hmm... doing this in all protocols would be too expensive,
> and we do not want to include time spent in qdiscs.
>
> wifi could eventually do that, providing in skb->tx_completion_delay_us
> the time spent in wifi driver.
>
> This way, we would have no penalty for network devices doing normal skb
> orphaning (loopback interface, ethernet, ...)

I'll play around with this idea and report back later.


Michał

2015-02-06 14:10:44

by Eric Dumazet

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:


> wifi could eventually do that, providing in skb->tx_completion_delay_us
> the time spent in wifi driver.
>
> This way, we would have no penalty for network devices doing normal skb
> orphaning (loopback interface, ethernet, ...)

Another way would be that wifi does an automatic orphaning after 1 or
2ms.



2015-02-05 14:48:44

by Eric Dumazet

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:

> I do get your point. But 1.5ms is really tough on Wi-Fi.
>
> Just look at this:
>
> ; ping 192.168.1.2 -c 3
> PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
> 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
> 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
> 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms

That's a different point.

I don't care about rtt but about TX completions (usually much, much lower
than rtt).

I can have a 4 usec delay from the moment a NIC submits a packet to the
wire to the moment I get the TX completion IRQ and free the packet.

Yet the pong reply can come 100 ms later.

It does not mean the 4 usec delay is a problem.




2015-02-05 06:46:30

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 4 February 2015 at 22:11, Eric Dumazet <[email protected]> wrote:
> I do not see how a TSO patch could hurt a flow not using TSO/GSO.
>
> This makes no sense.
>
> ath10k tx completions being batched/deferred to a tasklet might increase
> probability to hit this condition in tcp_wfree() :
>
> /* If this softirq is serviced by ksoftirqd, we are likely under stress.
> * Wait until our queues (qdisc + devices) are drained.
> * This gives :
> * - less callbacks to tcp_write_xmit(), reducing stress (batches)
> * - chance for incoming ACK (processed by another cpu maybe)
> * to migrate this flow (skb->ooo_okay will be eventually set)
> */
> if (wmem >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
> goto out;
>
> Meaning tcp stack waits all skbs left qdisc/NIC queues before queuing
> additional packets.
>
> I would try to call skb_orphan() in ath10k if you really want to keep
> these batches.
>
> I have a hard time understanding why tx completed packets go through
> ath10k_htc_rx_completion_handler()... anyway...

There's a couple of layers for host-firmware communication. The
transport layer (e.g. PCI) delivers HTC packets. These contain WMI
(configuration stuff) or HTT (traffic stuff). HTT can contain
different events (tx complete, rx complete, etc). HTT Tx completion
contains a list of ids which refer to frames that have been completed
(either sent or dropped).

I've tried reverting tx/rx tasklet batching. No change in throughput.
I can get tcpdump if you're interested.


> Most conservative patch would be :
>
> diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
> index 9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4 100644
> --- a/drivers/net/wireless/ath/ath10k/htt_rx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
> @@ -1642,6 +1642,7 @@ void ath10k_htt_t2h_msg_handler(struct ath10k *ar, struct sk_buff *skb)
> break;
> }
> case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
> + skb_orphan(skb);
> spin_lock_bh(&htt->tx_lock);
> __skb_queue_tail(&htt->tx_compl_q, skb);
> spin_unlock_bh(&htt->tx_lock);

I suppose you want to call skb_orphan() on actual data packets, right?
This skb is just a host-firmware communication buffer.


Michał

2015-02-06 14:08:20

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 6 February 2015 at 14:40, Eric Dumazet <[email protected]> wrote:
> On Fri, 2015-02-06 at 10:42 +0100, Michal Kazior wrote:
>
>> The above brings back previous behaviour, i.e. I can get 600mbps TCP
>> on 5 flows again. Single flow is still (as it was before TSO
>> autosizing) limited to roughly ~280mbps.
>>
>> I never really bothered before to understand why I need to push a few
>> flows through ath10k to max it out, i.e. if I run a single UDP flow I
>> get ~300mbps while with, e.g. 5 I get 670mbps easily.
>>
>
> For single UDP flow, tweaking /proc/sys/net/core/wmem_default might be
> enough : UDP has no callback from TX completion to feed following frames
> (No write queue like TCP)
>
> # cat /proc/sys/net/core/wmem_default
> 212992
> # ethtool -C eth1 tx-usecs 1024 tx-frames 120
> # ./netperf -H remote -t UDP_STREAM -- -m 1450
> Socket Message Elapsed Messages
> Size Size Time Okay Errors Throughput
> bytes bytes secs # # 10^6bits/sec
>
> 212992 1450 10.00 697705 0 809.27
> 212992 10.00 673412 781.09
>
> # echo 800000 >/proc/sys/net/core/wmem_default
> # ./netperf -H remote -t UDP_STREAM -- -m 1450
> Socket Message Elapsed Messages
> Size Size Time Okay Errors Throughput
> bytes bytes secs # # 10^6bits/sec
>
> 800000 1450 10.00 7329221 0 8501.84
> 212992 10.00 7284051 8449.44

Hmm.. I confirm it works. However the value at which I get full rate
on a single flow is more than 2048K. Also using non-default
wmem_default seems to introduce packet loss as per iperf reports at
the receiver. I suppose this is kind of expected but on the other hand
wmem_default=262992 and 5 flows of UDP max the device out with 0
packet loss.


Michał

2015-02-10 10:33:50

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 9 February 2015 at 16:11, Eric Dumazet <[email protected]> wrote:
> On Mon, 2015-02-09 at 14:47 +0100, Michal Kazior wrote:
[...]
> This is not what I suggested.
>
> If you test this on any other network device, you'll have
> sk->sk_tx_completion_delay_us == 0
>
> amount = 0 * (sk->sk_pacing_rate >> 10); --> 0
> limit = max(2 * skb->truesize, amount >> 10); --> 2 * skb->truesize

You're right. Sorry for mixing up.


> So non TSO/GSO NIC will not be able to queue more than 2 MSS (one MSS
> per skb)
>
> Then if you store only the last tx completion, you have the possibility
> of having a last packet of a train (say a retransmit) to make it very
> low.
>
> Ideally the formula would be in TCP something very fast to compute :
>
> amount = (sk->sk_pacing_rate >> 10) + sk->tx_completion_delay_cushion;
> limit = max(2 * skb->truesize, amount);
> limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>
> So a 'problematic' driver would have to do the math (64 bit maths) like
> this :
>
>
> sk->tx_completion_delay_cushion = ewma_tx_delay * sk->sk_pacing_rate;

Hmm. So I've done as you suggested (hopefully I didn't mix anything
up this time around).

I now get pre-regression performance: ~250mbps on 1 flow, 600mbps on 5
flows (vs 250mbps regardless of the number of flows).


Michał


diff --git a/drivers/net/wireless/ath/ath10k/core.c
b/drivers/net/wireless/ath/ath10k/core.c
index 367e896..a29111c 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -18,6 +18,7 @@
#include <linux/module.h>
#include <linux/firmware.h>
#include <linux/of.h>
+#include <linux/average.h>

#include "core.h"
#include "mac.h"
@@ -1423,6 +1424,7 @@ struct ath10k *ath10k_core_create(size_t
priv_size, struct device *dev,
init_dummy_netdev(&ar->napi_dev);
ieee80211_napi_add(ar->hw, &ar->napi, &ar->napi_dev,
ath10k_core_napi_dummy_poll, 64);
+ ewma_init(&ar->tx_delay_us, 16384, 8);

ret = ath10k_debug_create(ar);
if (ret)
diff --git a/drivers/net/wireless/ath/ath10k/core.h
b/drivers/net/wireless/ath/ath10k/core.h
index 3be3a59..34f6d78 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -24,6 +24,7 @@
#include <linux/pci.h>
#include <linux/uuid.h>
#include <linux/time.h>
+#include <linux/average.h>

#include "htt.h"
#include "htc.h"
@@ -82,6 +83,7 @@ struct ath10k_skb_cb {
dma_addr_t paddr;
u8 eid;
u8 vdev_id;
+ ktime_t stamp;

struct {
u8 tid;
@@ -625,6 +627,7 @@ struct ath10k {

struct net_device napi_dev;
struct napi_struct napi;
+ struct ewma tx_delay_us;

#ifdef CONFIG_ATH10K_DEBUGFS
struct ath10k_debug debug;
diff --git a/drivers/net/wireless/ath/ath10k/mac.c
b/drivers/net/wireless/ath/ath10k/mac.c
index 15e47f4..5efb2a7 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2620,6 +2620,7 @@ static void ath10k_tx(struct ieee80211_hw *hw,
if (info->flags & IEEE80211_TX_CTL_NO_CCK_RATE)
ath10k_dbg(ar, ATH10K_DBG_MAC,
"IEEE80211_TX_CTL_NO_CCK_RATE\n");

+ ATH10K_SKB_CB(skb)->stamp = ktime_get();
ATH10K_SKB_CB(skb)->htt.is_offchan = false;
ATH10K_SKB_CB(skb)->htt.tid = ath10k_tx_h_get_tid(hdr);
ATH10K_SKB_CB(skb)->vdev_id = ath10k_tx_h_get_vdev_id(ar, vif);
diff --git a/drivers/net/wireless/ath/ath10k/txrx.c
b/drivers/net/wireless/ath/ath10k/txrx.c
index 3f00cec..0f5f0f2 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -15,6 +15,8 @@
* OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
*/

+#include <net/sock.h>
+#include <linux/average.h>
#include "core.h"
#include "txrx.h"
#include "htt.h"
@@ -82,6 +84,16 @@ void ath10k_txrx_tx_unref(struct ath10k_htt *htt,

ath10k_report_offchan_tx(htt->ar, msdu);

+ if (msdu->sk) {
+ ewma_add(&ar->tx_delay_us,
+ ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
+ NSEC_PER_USEC);
+
+ ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
+ (ewma_read(&ar->tx_delay_us) *
+ msdu->sk->sk_pacing_rate) >> 20;
+ }
+
info = IEEE80211_SKB_CB(msdu);
memset(&info->status, 0, sizeof(info->status));
trace_ath10k_txrx_tx_unref(ar, tx_done->msdu_id);
diff --git a/include/net/sock.h b/include/net/sock.h
index 2210fec..6772543 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -391,6 +391,7 @@ struct sock {
gfp_t sk_allocation;
u32 sk_pacing_rate; /* bytes per second */
u32 sk_max_pacing_rate;
+ u32 sk_tx_completion_delay_cushion;
netdev_features_t sk_route_caps;
netdev_features_t sk_route_nocaps;
int sk_gso_type;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b..526a568 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1996,6 +1996,7 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
max_segs = tcp_tso_autosize(sk, mss_now);
while ((skb = tcp_send_head(sk))) {
unsigned int limit;
+ unsigned int amount;

tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
BUG_ON(!tso_segs);
@@ -2053,7 +2054,9 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
* of queued bytes to ensure line rate.
* One example is wifi aggregation (802.11 AMPDU)
*/
- limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+ amount = (sk->sk_pacing_rate >> 10) +
+ sk->sk_tx_completion_delay_cushion;
+ limit = max(2 * skb->truesize, amount);
limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

if (atomic_read(&sk->sk_wmem_alloc) > limit) {

2015-02-24 10:25:00

by Johannes Berg

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Thu, 2015-02-12 at 08:48 +0100, Michal Kazior wrote:

> > Good point. I was actually thinking about it. I can try cooking a
> > patch unless you want to do it yourself :-)
>
> I've taken a look into this. The most obvious place to add the
> timestamp for each packet would be ieee80211_tx_info (i.e. the
> skb->cb[48]). The problem is it's very tight there. Even squeezing 2
> bytes (allowing up to 64ms of tx completion delay which I'm worried
> won't be enough) will be troublesome. Some drivers already use every
> last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
> 40 bytes of driver_data).

Couldn't we just repurpose the existing skb->tstamp field for this, as
long as the skb is fully contained within the wireless layer?

Actually, it looks like we can't, since I guess timestamping options can
be turned on on any socket.

> I wonder if it's okay to bump skb->cb to 56 bytes to avoid the cascade
> of changes required to implement the tx completion delay accounting?

I have no doubt that would be rejected :)

johannes
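
For what it's worth, here is a rough userspace sketch of that 2-byte idea;
the helper is made up, nothing below is existing mac80211 API, and (as
noted above) u16 microsecond arithmetic only survives delays below ~65 ms:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: current monotonic time in microseconds,
 * deliberately truncated to 16 bits (wraps every ~65 ms). */
static uint16_t now_us16(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint16_t)(ts.tv_sec * 1000000ULL + ts.tv_nsec / 1000);
}

int main(void)
{
	uint16_t submit_stamp = now_us16();	/* would live in the 2 spare bytes */

	/* ... frame sits on the hardware queue here ... */

	/* u16 subtraction handles wraparound as long as the real delay
	 * stayed below ~65 ms, which is exactly the concern above. */
	uint16_t delay_us = now_us16() - submit_stamp;

	printf("tx completion delay: %u us\n", (unsigned)delay_us);
	return 0;
}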


2015-02-05 12:57:21

by Eric Dumazet

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Thu, 2015-02-05 at 09:38 +0100, Michal Kazior wrote:
> On 4 February 2015 at 22:11, Eric Dumazet <[email protected]> wrote:
> > I do not see how a TSO patch could hurt a flow not using TSO/GSO.
> >
> > This makes no sense.
>
> Hmm..
>
> @@ -2018,8 +2053,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> * of queued bytes to ensure line rate.
> * One example is wifi aggregation (802.11 AMPDU)
> */
> - limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
> - sk->sk_pacing_rate >> 10);
> + limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
> + limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>
> if (atomic_read(&sk->sk_wmem_alloc) > limit) {
> set_bit(TSQ_THROTTLED, &tp->tsq_flags);
>
> Doesn't this effectively invert how tcp_limit_output_bytes is used?
> This would explain why raising the limit wasn't changing anything
> anymore when you asked me do so. Only decreasing it yielded any
> change.
>
> I've added a printk to show up the new and old values. Excerpt from logs:
>
> [ 114.782740] (4608 39126 131072 = 39126) vs (131072 39126 = 131072)
>
> (2*truesize, pacing_rate, tcp_limit = limit) vs (tcp_limit, pacing_rate = limit)
>
> Reverting this patch hunk alone fixes my TCP problem. Not that I'm
> saying the old logic was correct (it seems it wasn't, a limit should
> be applied as min(value, max_value), right?).
>
> Anyway the change doesn't seem to be TSO-only oriented so it would
> explain the "makes no sense".


The intention is to control the queues to the following :

1 ms of buffering, but limited to a configurable value.

On a 40Gbps flow, 1ms represents 5 MB, which is insane.

We do not want to queue 5 MB of traffic, this would destroy latencies
for all concurrent flows. (Or would require having fq_codel or fq as
packet schedulers, instead of default pfifo_fast)

This is why having 1.5 ms delay between the transmit and TX completion
is a problem in your case.
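
For scale, here is a quick back-of-the-envelope check of those numbers in
plain userspace C; the constants are illustrative, and sk_pacing_rate >> 10
is just the kernel's cheap "bytes per ~1 ms" approximation:

#include <stdio.h>

int main(void)
{
	/* 40Gbps expressed in bytes per second */
	unsigned long long pacing_rate = 5000000000ULL;

	unsigned long long one_ms_exact = pacing_rate / 1000;	/* ~5 MB */
	unsigned long long one_ms_approx = pacing_rate >> 10;	/* /1024 */

	printf("1 ms at 40Gbps: %llu bytes\n", one_ms_exact);
	printf("sk_pacing_rate >> 10: %llu bytes\n", one_ms_approx);
	return 0;
}

The shift divides by 1024 instead of 1000, so the kernel's estimate sits
about 2% below the exact 5 MB figure.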






2015-02-11 13:17:55

by Eric Dumazet

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Wed, 2015-02-11 at 09:33 +0100, Michal Kazior wrote:

> If I set tcp_limit_output_bytes to 700K+ I can get ath10k w/ cushion
> w/ aggregation to reach 600mbps on a single flow.

You know, there is a reason this sysctl exists in the first place ;)

The first suggestion I made to you was to raise it.

The default setting must stay as is as long as the default Qdisc is pfifo_fast.

I believe I already mentioned skb->truesize tricks for drivers willing
to adjust the TSQ given their constraints.



2015-02-09 13:47:58

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 6 February 2015 at 15:09, Michal Kazior <[email protected]> wrote:
> On 6 February 2015 at 14:53, Eric Dumazet <[email protected]> wrote:
>> On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:
>>
>>> tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
>>> of TX completion delay. But this would require yet another expensive
>>> call to ktime_get() if HZ < 1000.
>>>
>>> Then tcp_write_xmit() could use it to adjust :
>>>
>>> limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
>>>
>>> to
>>>
>>> amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate
>>>
>>> limit = max(2 * skb->truesize, amount / 1000);
>>>
>>> I'll cook a patch.
>>
>> Hmm... doing this in all protocols would be too expensive,
>> and we do not want to include time spent in qdiscs.
>>
>> wifi could eventually do that, providing in skb->tx_completion_delay_us
>> the time spent in wifi driver.
>>
>> This way, we would have no penalty for network devices doing normal skb
>> orphaning (loopback interface, ethernet, ...)
>
> I'll play around with this idea and report back later.

I'm able to get 600mbps with 5 flows and 250mbps with 1 flow, i.e.
same as before the regression. I'm attaching the patch at the end of
my mail - is this approach viable?

I wonder if there's anything that can be done to allow 600mbps (line
rate) on 1 flow with ath10k without tweaking tcp_limit_output_bytes
(you can't expect end-users to tweak this).

Perhaps tcp_limit_output_bytes should also consider tx_completion_delay, e.g.:

amount = sk->sk_tx_completion_delay_us;
amount *= sk->sk_pacing_rate >> 10;
limit = max(2 * skb->truesize, amount >> 10);
max_limit = sysctl_tcp_limit_output_bytes;
max_limit *= 1 + (sk->sk_tx_completion_delay_us / USEC_PER_MSEC);
limit = min_t(u32, limit, max_limit);

With this I get ~400mbps on 1 flow. If I add the original 1ms extra
delay from your formula to the tx_completion_delay that ath10k fills in,
I get nearly line rate on 1 flow (almost 600mbps; it hops between
570-620). Decreasing tcp_limit_output_bytes decreases throughput (e.g.
64K gives 300mbps, 32K gives 180mbps, 16K gives 110mbps). Multiple flows
in iperf seem unbalanced with the 128K limit, but look okay with 32K.
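
For concreteness, here is a userspace toy version of the formula above.
The inputs (1.5 ms completion delay, ~600mbps pacing rate, a 2304-byte
truesize, the default 128K sysctl) are made up for illustration; this is
not the actual kernel change:

#include <stdio.h>

int main(void)
{
	unsigned int delay_us = 1500;			/* sk_tx_completion_delay_us */
	unsigned int pacing_rate = 75 * 1000 * 1000;	/* bytes/s, ~600mbps */
	unsigned int truesize = 2304;			/* per-skb, made up */
	unsigned int tcp_limit = 131072;		/* tcp_limit_output_bytes */

	unsigned long long amount = (unsigned long long)delay_us * (pacing_rate >> 10);
	unsigned long long limit = 2 * truesize;
	unsigned long long max_limit;

	if ((amount >> 10) > limit)
		limit = amount >> 10;

	max_limit = (unsigned long long)tcp_limit * (1 + delay_us / 1000);
	if (limit > max_limit)
		limit = max_limit;

	printf("limit = %llu bytes\n", limit);
	return 0;
}

With these inputs the limit comes out to roughly 105KB, comfortably below
the scaled 256KB cap.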


Michał


diff --git a/drivers/net/wireless/ath/ath10k/core.h b/drivers/net/wireless/ath/ath10k/core.h
index 3be3a59..4ff0ae8 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -82,6 +82,7 @@ struct ath10k_skb_cb {
dma_addr_t paddr;
u8 eid;
u8 vdev_id;
+ ktime_t stamp;

struct {
u8 tid;
diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index 15e47f4..5efb2a7 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2620,6 +2620,7 @@ static void ath10k_tx(struct ieee80211_hw *hw,
if (info->flags & IEEE80211_TX_CTL_NO_CCK_RATE)
ath10k_dbg(ar, ATH10K_DBG_MAC,
"IEEE80211_TX_CTL_NO_CCK_RATE\n");

+ ATH10K_SKB_CB(skb)->stamp = ktime_get();
ATH10K_SKB_CB(skb)->htt.is_offchan = false;
ATH10K_SKB_CB(skb)->htt.tid = ath10k_tx_h_get_tid(hdr);
ATH10K_SKB_CB(skb)->vdev_id = ath10k_tx_h_get_vdev_id(ar, vif);
diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
index 3f00cec..0d5539b 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -15,6 +15,7 @@
* OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
*/

+#include <net/sock.h>
#include "core.h"
#include "txrx.h"
#include "htt.h"
@@ -82,6 +83,13 @@ void ath10k_txrx_tx_unref(struct ath10k_htt *htt,

ath10k_report_offchan_tx(htt->ar, msdu);

+ if (msdu->sk) {
+ ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_us) =
+ ktime_to_ns(ktime_sub(ktime_get(),
+ skb_cb->stamp)) /
+ NSEC_PER_USEC;
+ }
+
info = IEEE80211_SKB_CB(msdu);
memset(&info->status, 0, sizeof(info->status));
trace_ath10k_txrx_tx_unref(ar, tx_done->msdu_id);
diff --git a/include/net/sock.h b/include/net/sock.h
index 2210fec..6b15d71 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -390,6 +390,7 @@ struct sock {
int sk_wmem_queued;
gfp_t sk_allocation;
u32 sk_pacing_rate; /* bytes per second */
+ u32 sk_tx_completion_delay_us;
u32 sk_max_pacing_rate;
netdev_features_t sk_route_caps;
netdev_features_t sk_route_nocaps;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b..5e249bf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1996,6 +1996,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
max_segs = tcp_tso_autosize(sk, mss_now);
while ((skb = tcp_send_head(sk))) {
unsigned int limit;
+ unsigned int amount;

tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
BUG_ON(!tso_segs);
@@ -2053,7 +2054,9 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
* of queued bytes to ensure line rate.
* One example is wifi aggregation (802.11 AMPDU)
*/
- limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+ amount = sk->sk_tx_completion_delay_us *
+ (sk->sk_pacing_rate >> 10);
+ limit = max(2 * skb->truesize, amount >> 10);
limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

if (atomic_read(&sk->sk_wmem_alloc) > limit) {

2015-02-04 12:53:15

by Michal Kazior

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 4 February 2015 at 13:38, Eric Dumazet <[email protected]> wrote:
> On Wed, 2015-02-04 at 13:22 +0100, Michal Kazior wrote:
>> On 4 February 2015 at 12:57, Eric Dumazet <[email protected]> wrote:
>
>>
>> > To disable gso you would have to use :
>> >
>> > ethtool -K wlan1 gso off
>>
>> Oh, thanks! This works. However I can't turn it on:
>>
>> ; ethtool -K wlan1 gso on
>> Could not change any device features
>>
>> ..so I guess it makes no sense to re-run tests because:
>>
>> ; ethtool -k wlan1 | grep generic
>> tx-checksum-ip-generic: on [fixed]
>> generic-segmentation-offload: off [requested on]
>> generic-receive-offload: on
>>
>> And this seems to never change.
>
> GSO requires SG (Scatter Gather)
>
> Are you sure this hardware has no SG support ?

The hardware itself seems capable; the firmware is the problem, though.
I'm also not sure whether mac80211 can handle this as-is. No 802.11
driver seems to support SG except wil6210, which uses cfg80211 and
netdevs directly.


Michał

2015-02-05 13:19:06

by Eric Dumazet

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Thu, 2015-02-05 at 04:57 -0800, Eric Dumazet wrote:

> The intention is to control the queues to the following :
>
> 1 ms of buffering, but limited to a configurable value.
>
> On a 40Gbps flow, 1ms represents 5 MB, which is insane.
>
> We do not want to queue 5 MB of traffic, this would destroy latencies
> for all concurrent flows. (Or would require having fq_codel or fq as
> packet schedulers, instead of default pfifo_fast)
>
> This is why having 1.5 ms delay between the transmit and TX completion
> is a problem in your case.

Note that the TCP stack could detect when this happens, *if* ACKs were
delivered before the TX completions: when the TX completion happens, we
could detect that the clone of an already-freed packet is being freed.

In my test, when I did "ethtool -C eth0 tx-usecs 1024 tx-frames 64" and
disabled GSO, the TCP stack sends a bunch of packets (a bit less than 64),
then blocks on tcp_limit_output_bytes.

Then we receive 2 stretch ACKs after ~50 usec.

The TCP stack tries to push some more packets but blocks on
tcp_limit_output_bytes again.

1 ms later, the TX completion happens, tcp_wfree() is called, and the TCP
stack pushes the following ~60 packets.


TCP could eventually adjust tcp_limit_output_bytes dynamically, using a
per-flow value, but I would rather not add a kludge to the TCP stack only
to deal with a possible bug in the ath10k driver.

niu has a similar issue and simply had to call skb_orphan():

drivers/net/ethernet/sun/niu.c:6669: skb_orphan(skb);
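
For reference, a minimal sketch (kernel context, illustration only; the
function name is made up) of what that niu-style workaround amounts to in
a driver's xmit path:

static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* Drop the socket reference before queueing to the hardware:
	 * skb_orphan() runs skb->destructor (tcp_wfree for TCP), so the
	 * skb's truesize leaves sk_wmem_alloc and TSQ stops throttling
	 * on bytes the driver is still holding.  The trade-off is that
	 * the socket loses all backpressure from the driver queue. */
	skb_orphan(skb);

	/* ... normal descriptor setup and ring submission follow ... */
	return NETDEV_TX_OK;
}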




2015-03-31 11:08:37

by Johannes Berg

[permalink] [raw]
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

To revive an old thread ...

On Mon, 2015-02-09 at 07:11 -0800, Eric Dumazet wrote:


> Ideally the formula would be in TCP something very fast to compute :
>
> amount = (sk->sk_pacing_rate >> 10) + sk->tx_completion_delay_cushion;
> limit = max(2 * skb->truesize, amount);
> limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>
> So a 'problematic' driver would have to do the math (64 bit maths) like
> this :
>
>
> sk->tx_completion_delay_cushion = ewma_tx_delay * sk->sk_pacing_rate;

The whole notion of "ewma_tx_delay" seems very wrong to me. Measuring
something while also trying to control it (or something closely related
to it) seems a bit strange, but perhaps you meant to measure something
different from what Michal implemented.

What he implemented measures the time from submission to the hardware
queues until TX completion, but that seems to create a bad feedback
cycle: allowing it as extra transmit "cushion" will, IIUC, cause TCP to
submit more data to the queues, which will in turn delay the next
packets even more (since there are still packets waiting ahead of them),
thus causing a longer delay to be measured, which in turn allows even
more data to be submitted, and so on.

IOW, while traffic is flowing this will likely cause feedback that
completely removes the point of this, no? Leaving only
sysctl_tcp_limit_output_bytes as the limit (*).
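
To make that feedback concrete, here is a toy userspace iteration with
made-up constants (a ~600mbps pacing rate and the default 128K sysctl);
it only assumes that queued bytes drain at the pacing rate and therefore
add a proportional measured delay:

#include <stdio.h>

int main(void)
{
	unsigned long long pacing_rate = 75 * 1000 * 1000;	/* bytes/s */
	unsigned long long sysctl_cap = 131072;			/* tcp_limit_output_bytes */
	unsigned long long limit = 2 * 2304;			/* start: 2 * truesize */
	int i;

	for (i = 0; i < 5; i++) {
		/* queued bytes drain at pacing_rate, so they add this delay */
		unsigned long long delay_us = limit * 1000000ULL / pacing_rate;
		/* cushion ~ measured delay * pacing rate, as in the formula */
		unsigned long long cushion = delay_us * (pacing_rate >> 10) >> 10;
		unsigned long long next = (pacing_rate >> 10) + cushion;

		if (next > sysctl_cap)
			next = sysctl_cap;
		printf("iteration %d: limit = %llu bytes\n", i, next);
		limit = next;
	}
	return 0;
}

With these numbers the limit hits the 128K cap on the second iteration and
stays there, which is the 'only the sysctl is left' outcome described above.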

It seems it'd be better to provide a calculated estimate, perhaps based
on current transmit rate and (if available) CCA/TXOP acquisition time.

johannes

(*) Which, btw, isn't all that big given that a maximally sized A-MPDU
is about 1MB and contains close to that much actual data! I don't think
that can actually be achieved at current transmit rates, though.