2016-07-12 10:09:31

by Felix Fietkau

Subject: TCP performance regression in mac80211 triggered by the fq code

Hi,

With Toke's ath9k txq patch I've noticed a pretty nasty performance
regression when running local iperf on an AP (running the txq stuff) to
a wireless client.

Here are some things that I found:
- when I use only one TCP stream I get around 90-110 Mbit/s
- when running multiple TCP streams, I get only 35-40 Mbit/s total
- fairness between TCP streams looks completely fine
- there's no big queue buildup, the code never actually drops any packets
- if I put a hack in the fq code to force the hash to a constant value
(effectively disabling fq without disabling codel), the problem
disappears and even multiple streams get proper performance.
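
Concretely, the hack amounts to something like this in the classifier (a
minimal sketch against the hash/bucket structure of net/fq_impl.h; the
constant is arbitrary):

	/* normal path: pick a bucket from the skb hash */
	hash = skb_get_hash_perturb(skb, fq->perturbation);
	hash = 0; /* HACK: constant hash; codel still runs, flow separation gone */
	idx = reciprocal_scale(hash, fq->flows_cnt);
	flow = &fq->flows[idx];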

Please let me know if you have any ideas.

- Felix


2016-07-19 14:32:14

by Felix Fietkau

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

On 2016-07-19 15:13, Michal Kazior wrote:
> On 12 July 2016 at 12:09, Felix Fietkau <[email protected]> wrote:
>> Hi,
>>
>> With Toke's ath9k txq patch I've noticed a pretty nasty performance
>> regression when running local iperf on an AP (running the txq stuff) to
>> a wireless client.
>>
>> Here are some things that I found:
>> - when I use only one TCP stream I get around 90-110 Mbit/s
>> - when running multiple TCP streams, I get only 35-40 Mbit/s total
>
> What is the baseline here (i.e. without fq/txq stuff)? Is it ~100 Mbit/s?
Yes.

> Did you try running multiple streams, each on separate tids (matching
> the same AC perhaps) or different clients?
Not yet.

- Felix

2016-07-12 13:22:55

by Felix Fietkau

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

On 2016-07-12 14:44, Dave Taht wrote:
> On Tue, Jul 12, 2016 at 2:28 PM, Toke Høiland-Jørgensen <[email protected]> wrote:
>> Felix Fietkau <[email protected]> writes:
>>
>>> Hi,
>>>
>>> With Toke's ath9k txq patch I've noticed a pretty nasty performance
>>> regression when running local iperf on an AP (running the txq stuff) to
>>> a wireless client.
>>>
>>> Here are some things that I found:
>>> - when I use only one TCP stream I get around 90-110 Mbit/s
>>> - when running multiple TCP streams, I get only 35-40 Mbit/s total
>>> - fairness between TCP streams looks completely fine
>>> - there's no big queue buildup, the code never actually drops any packets
>>> - if I put a hack in the fq code to force the hash to a constant value
>>> (effectively disabling fq without disabling codel), the problem
>>> disappears and even multiple streams get proper performance.
>>>
>>> Please let me know if you have any ideas.
>>
>> Hmm, I see two TCP streams get about the same aggregate throughput as
>> one, both when started from the AP and when started one hop away.
>> However, I do see TCP flows take a while to ramp up when started from the
>> AP - a short test gets ~70 Mbit/s when run from one hop away and ~50 Mbit/s
>> when run from the AP. How long are you running the tests for?
>>
>> (I seem to recall the ramp-up issue to be there pre-patch as well,
>> though).
>
> The original ath10k code had a "swag" at hooking in an estimator from
> rate control.
> With minstrel in play that can be done better in the ath9k.
>
>> As for why this would happen... There could be a bug in the dequeue code
>> somewhere, but since you get better performance from sticking everything
>> into one queue, my best guess would be that the client is choking on the
>> interleaved packets? I.e. expending more CPU when it can't stick
>> subsequent packets into the same TCP flow?
>
> I share this concern.
>
> The quantum is? I am not opposed to a larger quantum (2 full size
> packets = 3028 in this case?).
I also agree with increasing quantum, however that did not make any
difference in my tests.

- Felix


2016-07-20 14:45:33

by Toke Høiland-Jørgensen

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

Felix Fietkau <[email protected]> writes:

> - if I put a hack in the fq code to force the hash to a constant value
> (effectively disabling fq without disabling codel), the problem
> disappears and even multiple streams get proper performance.

There's definitely something iffy about the hashing. Here's the relevant
line from the aqm debug file after running a single TCP stream
for 60 seconds to that station:

ifname addr tid ac backlog-bytes backlog-packets flows drops marks overlimit collisions tx-bytes tx-packets
wlp2s0 04:f0:21:1e:74:20 0 2 0 0 146 16 0 0 0 717758966 467925

(there are two extra fields here; I added per-txq CoDel stats, will send
a patch later).

This shows that the txq has had 146 flows assigned to it from that one TCP flow.
Looking at this over time, it seems that each time the queue runs empty
(which happens way too often, which is what I was originally
investigating), another flow is assigned.

Michal, any idea why? :)

-Toke

2016-07-27 17:31:35

by Toke Høiland-Jørgensen

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

Michal Kazior <[email protected]> writes:

> On 20 July 2016 at 17:24, Toke Høiland-Jørgensen <[email protected]> wrote:
>> Toke Høiland-Jørgensen <[email protected]> writes:
>>
>>> Felix Fietkau <[email protected]> writes:
>>>
>>>> - if I put a hack in the fq code to force the hash to a constant value
>>>> (effectively disabling fq without disabling codel), the problem
>>>> disappears and even multiple streams get proper performance.
>>>
>>> There's definitely something iffy about the hashing. Here's the relevant
>>> line from the aqm debug file after running a single TCP stream
>>> for 60 seconds to that station:
>>>
>>> ifname addr tid ac backlog-bytes backlog-packets flows drops marks overlimit collisions
>>> tx-bytes tx-packets
>>> wlp2s0 04:f0:21:1e:74:20 0 2 0 0 146 16 0 0 0 717758966 467925
>>>
>>> (there are two extra fields here; I added per-txq CoDel stats, will send
>>> a patch later).
>>>
>>> This shows that the txq has had 146 flows assigned to it from that one TCP flow.
>>> Looking at this over time, it seems that each time the queue runs empty
>>> (which happens way too often, which is what I was originally
>>> investigating), another flow is assigned.
>>>
>>> Michal, any idea why? :)
>>
>> And to answer this: because the flow is being freed to be reassigned
>> when it runs empty, but the counter is not decremented. Is this
>> deliberate? I.e. is the 'flows' var supposed to be a total 'new_flows'
>> counter and not a measure of the current number of assigned flows?
>
> Yes, it is deliberate. fq_codel qdisc does the same thing and I just
> mimicked it.

Right. I think it was the name that sent me down the wrong track
('flows' instead of 'new_flows'). Especially since, the way you
structured things, having a counter for how many flows are currently
assigned to each tid might actually make sense...

-Toke

2016-07-13 08:53:17

by Felix Fietkau

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

On 2016-07-13 09:57, Dave Taht wrote:
> On Tue, Jul 12, 2016 at 4:02 PM, Dave Taht <[email protected]> wrote:
>> On Tue, Jul 12, 2016 at 3:21 PM, Felix Fietkau <[email protected]> wrote:
>>> On 2016-07-12 14:13, Dave Taht wrote:
>>>> On Tue, Jul 12, 2016 at 12:09 PM, Felix Fietkau <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> With Toke's ath9k txq patch I've noticed a pretty nasty performance
>>>>> regression when running local iperf on an AP (running the txq stuff) to
>>>>> a wireless client.
>>>>
>>>> Your kernel? cpu architecture?
>>> QCA9558, 720 MHz, running Linux 4.4.14
>
> So this is a single core at the near-bottom end of the range. I guess
> we should also find a MIPS 24c derivative that runs at 400 MHz or so.
>
> What HZ? (I no longer know how much difference higher HZ settings
> make, but I'm usually at NOHZ and 250, rather than 100.)
>
> And all the testing to date was on much higher end multi-cores.
>
>>>> What happens when going through the AP to a server from the wireless client?
>>> Will test that next.
>
> Anddddd?
Single stream: 130 Mbit/s, 70% idle
Two streams: 50-80 Mbit/s (wildly fluctuating), 73% idle.

>>>> Which direction?
>>> AP->STA, iperf running on the AP. Client is a regular MacBook Pro
>>> (Broadcom).
>>
>> There are always 2 wifi chips in play. Like the Sith.
>>
>>>>> Here are some things that I found:
>>>>> - when I use only one TCP stream I get around 90-110 Mbit/s
>>>>
>>>> with how much cpu left over?
>>> ~20%
>>>
>>>>> - when running multiple TCP streams, I get only 35-40 Mbit/s total
>>>> with how much cpu left over?
>>> ~30%
>
> To me this implies a lock-contention issue, too much work in the irq
> handler, or too-delayed work in the softirq handler....
>
> I thought you were very brave to try and backport this.
I don't think this has anything to do with contending locks, CPU
utilization, etc. The code does something to the packets that TCP really
doesn't like.

- Felix

2016-07-12 14:02:17

by Dave Taht

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

On Tue, Jul 12, 2016 at 3:21 PM, Felix Fietkau <[email protected]> wrote:
> On 2016-07-12 14:13, Dave Taht wrote:
>> On Tue, Jul 12, 2016 at 12:09 PM, Felix Fietkau <[email protected]> wrote:
>>> Hi,
>>>
>>> With Toke's ath9k txq patch I've noticed a pretty nasty performance
>>> regression when running local iperf on an AP (running the txq stuff) to
>>> a wireless client.
>>
>> Your kernel? cpu architecture?
> QCA9558, 720 MHz, running Linux 4.4.14
>
>> What happens when going through the AP to a server from the wireless client?
> Will test that next.
>
>> Which direction?
> AP->STA, iperf running on the AP. Client is a regular MacBook Pro
> (Broadcom).

There are always 2 wifi chips in play. Like the Sith.

>>> Here are some things that I found:
>>> - when I use only one TCP stream I get around 90-110 Mbit/s
>>
>> with how much cpu left over?
> ~20%
>
>>> - when running multiple TCP streams, I get only 35-40 Mbit/s total
>> with how much cpu left over?
> ~30%

Hmm.

Care to try netperf?

>
>> context switch difference between the two tests?
> What's the easiest way to track that?

if you have GNU "time": time -v the_process

or:

perf record -e context-switches -ag

or: check /proc/$PID/status for the ctxt_switches counters
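
For the /proc route, a sketch of a helper that just echoes the two
ctxt_switches lines (assumes procfs; error handling mostly elided):

	#include <stdio.h>
	#include <string.h>

	/* print voluntary/nonvoluntary context-switch counters for a pid */
	static void print_ctxt_switches(int pid)
	{
		char path[64], line[256];
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%d/status", pid);
		f = fopen(path, "r");
		if (!f)
			return;
		while (fgets(line, sizeof(line), f))
			if (strstr(line, "ctxt_switches"))
				fputs(line, stdout);
		fclose(f);
	}

Run it once before and once after a test and diff the numbers.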

>> tcp_limit_output_bytes is?
> 262144

I keep hoping to be able to reduce this to something saner like 4096
one day. It got bumped to 64k based on bad wifi performance once, and
then to its current size to make the Xen folk happier.

The other param I'd like to see fiddled with is tcp_notsent_lowat.

In both cases reductions will increase your context switches but
reduce memory pressure and lead to a more reactive tcp.

And in neither case do I think this is the real cause of the problem.


>> got perf?
> Need to make a new build for that.
>
>>> - fairness between TCP streams looks completely fine
>>
>> A codel will get to long term fairness pretty fast. Packet captures
>> from a fq will show much more regular interleaving of packets,
>> regardless.
>>
>>> - there's no big queue buildup, the code never actually drops any packets
>>
>> A "trick" I have been using to observe codel behavior has been to
>> enable ecn on server and client, then checking in wireshark for ect(3)
>> marked packets.
> I verified this with printk. The same issue already appears if I have
> just the fq patch (with the codel patch reverted).

OK. A four flow test "should" trigger codel....

Running out of cpu (or hitting some other bottleneck) without
loss/marking "should" show up in a tcptrace -G / xplot.org view of the
packet capture as the window continuing to increase....


>>> - if I put a hack in the fq code to force the hash to a constant value
>>
>> You could also set "flows" to 1 so the hash is still generated, but
>> not actually used.
>>
>>> (effectively disabling fq without disabling codel), the problem
>>> disappears and even multiple streams get proper performance.
>>
>> Meaning you get 90-110 Mbit/s?
> Right.
>
>> Do you have a "before toke" figure for this platform?
> It's quite similar.
>
>>> Please let me know if you have any ideas.
>>
>> I am in berlin, packing hardware...
> Nice!
>
> - Felix
>



--
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org

2016-07-25 05:15:05

by Michal Kazior

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

On 20 July 2016 at 17:24, Toke Høiland-Jørgensen <[email protected]> wrote:
> Toke Høiland-Jørgensen <[email protected]> writes:
>
>> Felix Fietkau <[email protected]> writes:
>>
>>> - if I put a hack in the fq code to force the hash to a constant value
>>> (effectively disabling fq without disabling codel), the problem
>>> disappears and even multiple streams get proper performance.
>>
>> There's definitely something iffy about the hashing. Here's the relevant
>> line from the aqm debug file after running a single TCP stream
>> for 60 seconds to that station:
>>
>> ifname addr tid ac backlog-bytes backlog-packets flows drops marks overlimit collisions
>> tx-bytes tx-packets
>> wlp2s0 04:f0:21:1e:74:20 0 2 0 0 146 16 0 0 0 717758966 467925
>>
>> (there are two extra fields here; I added per-txq CoDel stats, will send
>> a patch later).
>>
>> This shows that the txq has had 146 flows assigned to it from that one TCP flow.
>> Looking at this over time, it seems that each time the queue runs empty
>> (which happens way too often, which is what I was originally
>> investigating), another flow is assigned.
>>
>> Michal, any idea why? :)
>
> And to answer this: because the flow is being freed to be reassigned
> when it runs empty, but the counter is not decremented. Is this
> deliberate? I.e. is the 'flows' var supposed to be a total 'new_flows'
> counter and not a measure of the current number of assigned flows?

Yes, it is deliberate. fq_codel qdisc does the same thing and I just
mimicked it.
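
Roughly what the qdisc does on enqueue (paraphrased from
net/sched/sch_fq_codel.c; a sketch, not a verbatim copy):

	if (list_empty(&flow->flowchain)) {
		list_add_tail(&flow->flowchain, &q->new_flows);
		q->new_flow_count++;	/* only ever incremented */
		flow->deficit = q->quantum;
	}

So the counter tracks how many times a bucket went from empty to busy,
not how many flows are currently live.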


Michał

2016-07-19 13:14:01

by Michal Kazior

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

On 12 July 2016 at 12:09, Felix Fietkau <[email protected]> wrote:
> Hi,
>
> With Toke's ath9k txq patch I've noticed a pretty nasty performance
> regression when running local iperf on an AP (running the txq stuff) to
> a wireless client.
>
> Here are some things that I found:
> - when I use only one TCP stream I get around 90-110 Mbit/s
> - when running multiple TCP streams, I get only 35-40 Mbit/s total

What is the baseline here (i.e. without fq/txq stuff)? Is it ~100 Mbit/s?

Did you try running multiple streams, each on separate tids (matching
the same AC perhaps) or different clients?


Michał

2016-07-13 07:57:54

by Dave Taht

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

On Tue, Jul 12, 2016 at 4:02 PM, Dave Taht <[email protected]> wrote:
> On Tue, Jul 12, 2016 at 3:21 PM, Felix Fietkau <[email protected]> wrote:
>> On 2016-07-12 14:13, Dave Taht wrote:
>>> On Tue, Jul 12, 2016 at 12:09 PM, Felix Fietkau <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> With Toke's ath9k txq patch I've noticed a pretty nasty performance
>>>> regression when running local iperf on an AP (running the txq stuff) to
>>>> a wireless client.
>>>
>>> Your kernel? cpu architecture?
>> QCA9558, 720 MHz, running Linux 4.4.14

So this is a single core at the near-bottom end of the range. I guess
we should also find a MIPS 24c derivative that runs at 400 MHz or so.

What HZ? (I no longer know how much difference higher HZ settings
make, but I'm usually at NOHZ and 250, rather than 100.)

And all the testing to date was on much higher end multi-cores.

>>> What happens when going through the AP to a server from the wireless client?
>> Will test that next.

Anddddd?

>>
>>> Which direction?
>> AP->STA, iperf running on the AP. Client is a regular MacBook Pro
>> (Broadcom).
>
> There are always 2 wifi chips in play. Like the Sith.
>
>>>> Here are some things that I found:
>>>> - when I use only one TCP stream I get around 90-110 Mbit/s
>>>
>>> with how much cpu left over?
>> ~20%
>>
>>>> - when running multiple TCP streams, I get only 35-40 Mbit/s total
>>> with how much cpu left over?
>> ~30%

To me this implies a lock-contention issue, too much work in the irq
handler, or too-delayed work in the softirq handler....

I thought you were very brave to try and backport this.

>
> Hmm.
>
> Care to try netperf?
>
>>
>>> context switch difference between the two tests?
>> What's the easiest way to track that?
>
> if you have GNU "time": time -v the_process
>
> or:
>
> perf record -e context-switches -ag
>
> or: check /proc/$PID/status for the ctxt_switches counters
>
>>> tcp_limit_output_bytes is?
>> 262144
>
> I keep hoping to be able to reduce this to something saner like 4096
> one day. It got bumped to 64k based on bad wifi performance once, and
> then to its current size to make the Xen folk happier.
>
> The other param I'd like to see fiddled with is tcp_notsent_lowat.
>
> In both cases reductions will increase your context switches but
> reduce memory pressure and lead to a more reactive tcp.
>
> And in neither case do I think this is the real cause of the problem.
>
>
>>> got perf?
>> Need to make a new build for that.
>>
>>>> - fairness between TCP streams looks completely fine
>>>
>>> A codel will get to long term fairness pretty fast. Packet captures
>>> from a fq will show much more regular interleaving of packets,
>>> regardless.
>>>
>>>> - there's no big queue buildup, the code never actually drops any packets
>>>
>>> A "trick" I have been using to observe codel behavior has been to
>>> enable ecn on server and client, then checking in wireshark for ect(3)
>>> marked packets.
>> I verified this with printk. The same issue already appears if I have
>> just the fq patch (with the codel patch reverted).
>
> OK. A four flow test "should" trigger codel....
>
> Running out of cpu (or hitting some other bottleneck) without
> loss/marking "should" show up in a tcptrace -G / xplot.org view of the
> packet capture as the window continuing to increase....
>
>
>>>> - if I put a hack in the fq code to force the hash to a constant value
>>>
>>> You could also set "flows" to 1 so the hash is still generated, but
>>> not actually used.
>>>
>>>> (effectively disabling fq without disabling codel), the problem
>>>> disappears and even multiple streams get proper performance.
>>>
>>> Meaning you get 90-110 Mbit/s?
>> Right.
>>
>>> Do you have a "before toke" figure for this platform?
>> It's quite similar.
>>
>>>> Please let me know if you have any ideas.
>>>
>>> I am in berlin, packing hardware...
>> Nice!
>>
>> - Felix
>>
>
>
>
> --
> Dave Täht
> Let's go make home routers and wifi faster! With better software!
> http://blog.cerowrt.org



--
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org

2016-07-12 12:28:14

by Toke Høiland-Jørgensen

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

Felix Fietkau <[email protected]> writes:

> Hi,
>
> With Toke's ath9k txq patch I've noticed a pretty nasty performance
> regression when running local iperf on an AP (running the txq stuff) to
> a wireless client.
>
> Here are some things that I found:
> - when I use only one TCP stream I get around 90-110 Mbit/s
> - when running multiple TCP streams, I get only 35-40 Mbit/s total
> - fairness between TCP streams looks completely fine
> - there's no big queue buildup, the code never actually drops any packets
> - if I put a hack in the fq code to force the hash to a constant value
> (effectively disabling fq without disabling codel), the problem
> disappears and even multiple streams get proper performance.
>
> Please let me know if you have any ideas.

Hmm, I see two TCP streams get about the same aggregate throughput as
one, both when started from the AP and when started one hop away.
However, I do see TCP flows take a while to ramp up when started from the
AP - a short test gets ~70 Mbit/s when run from one hop away and ~50 Mbit/s
when run from the AP. How long are you running the tests for?

(I seem to recall the ramp-up issue to be there pre-patch as well,
though).

As for why this would happen... There could be a bug in the dequeue code
somewhere, but since you get better performance from sticking everything
into one queue, my best guess would be that the client is choking on the
interleaved packets? I.e. expending more CPU when it can't stick
subsequent packets into the same TCP flow?

-Toke

2016-07-12 13:23:38

by Felix Fietkau

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

On 2016-07-12 14:28, Toke Høiland-Jørgensen wrote:
> Felix Fietkau <[email protected]> writes:
>
>> Hi,
>>
>> With Toke's ath9k txq patch I've noticed a pretty nasty performance
>> regression when running local iperf on an AP (running the txq stuff) to
>> a wireless client.
>>
>> Here are some things that I found:
>> - when I use only one TCP stream I get around 90-110 Mbit/s
>> - when running multiple TCP streams, I get only 35-40 Mbit/s total
>> - fairness between TCP streams looks completely fine
>> - there's no big queue buildup, the code never actually drops any packets
>> - if I put a hack in the fq code to force the hash to a constant value
>> (effectively disabling fq without disabling codel), the problem
>> disappears and even multiple streams get proper performance.
>>
>> Please let me know if you have any ideas.
>
> Hmm, I see two TCP streams get about the same aggregate throughput as
> one, both when started from the AP and when started one hop away.
> However, I do see TCP flows take a while to ramp up when started from the
> AP - a short test gets ~70 Mbit/s when run from one hop away and ~50 Mbit/s
> when run from the AP. How long are you running the tests for?
Long enough to see that it's not ramping up.

> (I seem to recall the ramp-up issue to be there pre-patch as well,
> though).
>
> As for why this would happen... There could be a bug in the dequeue code
> somewhere, but since you get better performance from sticking everything
> into one queue, my best guess would be that the client is choking on the
> interleaved packets? I.e. expending more CPU when it can't stick
> subsequent packets into the same TCP flow?
Could be. I'll see what the tests show when I push traffic through the
AP instead of from the AP.

- Felix


2016-07-22 10:51:11

by Toke Høiland-Jørgensen

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

Felix Fietkau <[email protected]> writes:

> Please let me know if you have any ideas.

Two more things to try:

- Andrew McGregor mentioned that some versions of iperf on OS X have a
threading bug when running multiple streams against the same server
instance. So try running two separate instances of iperf on the server
side, or run netperf instead.

- It could be that HyStart is acting up. You could try disabling it
(echo 0 > /sys/module/tcp_cubic/parameters/hystart) and see if that
makes a difference.
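
(You can read the current value back with
cat /sys/module/tcp_cubic/parameters/hystart.)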

-Toke

2016-07-18 21:49:32

by Toke Høiland-Jørgensen

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

Toke Høiland-Jørgensen <[email protected]> writes:

> Felix Fietkau <[email protected]> writes:
>
>> Hi,
>>
>> With Toke's ath9k txq patch I've noticed a pretty nasty performance
>> regression when running local iperf on an AP (running the txq stuff) to
>> a wireless client.
>>
>> Here are some things that I found:
>> - when I use only one TCP stream I get around 90-110 Mbit/s
>> - when running multiple TCP streams, I get only 35-40 Mbit/s total
>> - fairness between TCP streams looks completely fine
>> - there's no big queue buildup, the code never actually drops any packets
>> - if I put a hack in the fq code to force the hash to a constant value
>> (effectively disabling fq without disabling codel), the problem
>> disappears and even multiple streams get proper performance.
>>
>> Please let me know if you have any ideas.
>
> Hmm, I see two TCP streams get about the same aggregate throughput as
> one, both when started from the AP and when started one hop away.

So while I have still not been able to reproduce the issue you
described, I have seen something else that is at least puzzling, and may
or may not be related:

When monitoring the output of /sys/kernel/debug/ieee80211/phy0/aqm I see
that all stations have their queues empty all the way to zero several
times per second. This is a bit puzzling; the queue should be kept under
control, but really shouldn't empty completely. I figure this might also
be the reason why you're seeing degraded performance...
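
(One way to watch this, assuming debugfs is mounted in the usual place:
watch -n 0.1 'cat /sys/kernel/debug/ieee80211/phy0/aqm'.)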

Since the stats output doesn't include a counter for drops, I haven't
gotten any further with figuring out if it's CoDel that's being too
aggressive, or what is happening. But will probably add that in and take
another look.

-Toke

2016-07-12 12:44:06

by Dave Taht

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

On Tue, Jul 12, 2016 at 2:28 PM, Toke Høiland-Jørgensen <[email protected]> wrote:
> Felix Fietkau <[email protected]> writes:
>
>> Hi,
>>
>> With Toke's ath9k txq patch I've noticed a pretty nasty performance
>> regression when running local iperf on an AP (running the txq stuff) to
>> a wireless client.
>>
>> Here are some things that I found:
>> - when I use only one TCP stream I get around 90-110 Mbit/s
>> - when running multiple TCP streams, I get only 35-40 Mbit/s total
>> - fairness between TCP streams looks completely fine
>> - there's no big queue buildup, the code never actually drops any packets
>> - if I put a hack in the fq code to force the hash to a constant value
>> (effectively disabling fq without disabling codel), the problem
>> disappears and even multiple streams get proper performance.
>>
>> Please let me know if you have any ideas.
>
> Hmm, I see two TCP streams get about the same aggregate throughput as
> one, both when started from the AP and when started one hop away.
> However, I do see TCP flows take a while to ramp up when started from the
> AP - a short test gets ~70 Mbit/s when run from one hop away and ~50 Mbit/s
> when run from the AP. How long are you running the tests for?
>
> (I seem to recall the ramp-up issue to be there pre-patch as well,
> though).

The original ath10k code had a "swag" at hooking in an estimator from
rate control.
With minstrel in play that can be done better in the ath9k.

> As for why this would happen... There could be a bug in the dequeue code
> somewhere, but since you get better performance from sticking everything
> into one queue, my best guess would be that the client is choking on the
> interleaved packets? I.e. expending more CPU when it can't stick
> subsequent packets into the same TCP flow?

I share this concern.

The quantum is? I am not opposed to a larger quantum (2 full size
packets = 3028 in this case?).

> -Toke



--
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org

2016-07-19 13:10:50

by Michal Kazior

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

On 12 July 2016 at 16:02, Dave Taht <[email protected]> wrote:
[...]
>>> tcp_limit_output_bytes is?
>> 262144
>
> I keep hoping to be able to reduce this to something saner like 4096
> one day. It got bumped to 64k based on bad wifi performance once, and
> then to its current size to make the Xen folk happier.

Not sure if it's possible. You do need this to be at least as big as a
single A-MPDU can get. In extreme 11ac cases it can be pretty big.

I recall a discussion from a long time ago: the tcp_limit_output_bytes
logic was/is coupled with the assumption that tx-completions always
arrive at most 1 ms after tx submission. This is rather tricky to
*guarantee* on wifi, especially with firmware blobs, big aggregates,
lots of stations and retries.
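
For scale, if I have the limits right: an 802.11n A-MPDU tops out at
64 KB, which lines up with the old 64k default, while 11ac allows up to
2^20 - 1 bytes (~1 MB) per A-MPDU, so even the current 262144 can be
smaller than a single aggregate.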


Michał

2016-07-12 12:57:34

by Toke Høiland-Jørgensen

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

Dave Taht <[email protected]> writes:

>> As for why this would happen... There could be a bug in the dequeue code
>> somewhere, but since you get better performance from sticking everything
>> into one queue, my best guess would be that the client is choking on the
>> interleaved packets? I.e. expending more CPU when it can't stick
>> subsequent packets into the same TCP flow?
>
> I share this concern.
>
> The quantum is? I am not opposed to a larger quantum (2 full size
> packets = 3028 in this case?).

The quantum is hard-coded to 300 bytes in the current implementation
(see net/fq_impl.h).
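
For context, the dequeue side is a standard DRR round, roughly like
this (patterned on net/fq_impl.h; a sketch, not a verbatim copy):

	flow = list_first_entry(head, struct fq_flow, flowchain);
	if (flow->deficit <= 0) {
		/* out of credit: refill and move to the back of the round */
		flow->deficit += fq->quantum;
		list_move_tail(&flow->flowchain, &tin->old_flows);
		goto begin;
	}
	skb = dequeue_func(fq, tin, flow);
	if (skb)
		flow->deficit -= skb->len;

With quantum = 300, a 1514-byte packet leaves the deficit at about
-1214, so the loop has to come back around and refill the flow five
times before its next packet can go out.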

-Toke

2016-07-12 13:03:43

by Dave Taht

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

On Tue, Jul 12, 2016 at 2:57 PM, Toke Høiland-Jørgensen <[email protected]> wrote:
> Dave Taht <[email protected]> writes:
>
>>> As for why this would happen... There could be a bug in the dequeue code
>>> somewhere, but since you get better performance from sticking everything
>>> into one queue, my best guess would be that the client is choking on the
>>> interleaved packets? I.e. expending more CPU when it can't stick
>>> subsequent packets into the same TCP flow?
>>
>> I share this concern.
>>
>> The quantum is? I am not opposed to a larger quantum (2 full size
>> packets = 3028 in this case?).
>
> The quantum is hard-coded to 300 bytes in the current implementation
> (see net/fq_impl.h).

don't do that. :)

A single full size packet is preferable, and saves going around the
main dequeue loop 5-6 times per flow on this workload.

My tests on the prior patch set were mostly at the larger quantum.


> -Toke



--
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org

2016-07-20 15:24:42

by Toke Høiland-Jørgensen

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

Toke Høiland-Jørgensen <[email protected]> writes:

> Felix Fietkau <[email protected]> writes:
>
>> - if I put a hack in the fq code to force the hash to a constant value
>> (effectively disabling fq without disabling codel), the problem
>> disappears and even multiple streams get proper performance.
>
> There's definitely something iffy about the hashing. Here's the relevant
> line from the aqm debug file after running a single TCP stream
> for 60 seconds to that station:
>
> ifname addr tid ac backlog-bytes backlog-packets flows drops marks overlimit collisions
> tx-bytes tx-packets
> wlp2s0 04:f0:21:1e:74:20 0 2 0 0 146 16 0 0 0 717758966 467925
>
> (there are two extra fields here; I added per-txq CoDel stats, will send
> a patch later).
>
> This shows that the txq has had 146 flows assigned to it from that one TCP flow.
> Looking at this over time, it seems that each time the queue runs empty
> (which happens way too often, which is what I was originally
> investigating), another flow is assigned.
>
> Michal, any idea why? :)

And to answer this: because the flow is being freed to be reassigned
when it runs empty, but the counter is not decremented. Is this
deliberate? I.e. is the 'flows' var supposed to be a total 'new_flows'
counter and not a measure of the current number of assigned flows?
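
The mechanism amounts to something like this in the classifier
(patterned on net/fq_impl.h; a sketch, not a verbatim copy):

	if (!flow->tin) {
		flow->tin = tin;
		tin->flows++;	/* counts (re)assignments; never decremented */
	}

Every time a bucket runs empty and is later reused, it counts again.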

-Toke

2016-07-12 13:21:28

by Felix Fietkau

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

On 2016-07-12 14:13, Dave Taht wrote:
> On Tue, Jul 12, 2016 at 12:09 PM, Felix Fietkau <[email protected]> wrote:
>> Hi,
>>
>> With Toke's ath9k txq patch I've noticed a pretty nasty performance
>> regression when running local iperf on an AP (running the txq stuff) to
>> a wireless client.
>
> Your kernel? cpu architecture?
QCA9558, 720 MHz, running Linux 4.4.14

> What happens when going through the AP to a server from the wireless client?
Will test that next.

> Which direction?
AP->STA, iperf running on the AP. Client is a regular MacBook Pro
(Broadcom).

>> Here are some things that I found:
>> - when I use only one TCP stream I get around 90-110 Mbit/s
>
> with how much cpu left over?
~20%

>> - when running multiple TCP streams, I get only 35-40 Mbit/s total
> with how much cpu left over?
~30%

> context switch difference between the two tests?
What's the easiest way to track that?

> tcp_limit_output_bytes is?
262144

> got perf?
Need to make a new build for that.

>> - fairness between TCP streams looks completely fine
>
> A codel will get to long term fairness pretty fast. Packet captures
> from a fq will show much more regular interleaving of packets,
> regardless.
>
>> - there's no big queue buildup, the code never actually drops any packets
>
> A "trick" I have been using to observe codel behavior has been to
> enable ecn on server and client, then checking in wireshark for ect(3)
> marked packets.
I verified this with printk. The same issue already appears if I have
just the fq patch (with the codel patch reverted).

>> - if I put a hack in the fq code to force the hash to a constant value
>
> You could also set "flows" to 1 so the hash is still generated, but
> not actually used.
>
>> (effectively disabling fq without disabling codel), the problem
>> disappears and even multiple streams get proper performance.
>
> Meaning you get 90-110 Mbit/s?
Right.

> Do you have a "before toke" figure for this platform?
It's quite similar.

>> Please let me know if you have any ideas.
>
> I am in berlin, packing hardware...
Nice!

- Felix


2016-07-13 09:14:11

by Dave Taht

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

On Wed, Jul 13, 2016 at 10:53 AM, Felix Fietkau <[email protected]> wrote:

>> To me this implies a lock-contention issue, too much work in the irq
>> handler, or too-delayed work in the softirq handler....
>>
>> I thought you were very brave to try and backport this.
> I don't think this has anything to do with contending locks, CPU
> utilization, etc. The code does something to the packets that TCP really
> doesn't like.

With your 70% idle figure, I am inclined to agree... could you get an
aircap of the two different tests, as well as a regular packet capture
taken at the client or server, and put them somewhere I can get at them?

What version of OSX are you running?

I will setup an ath9k box shortly...


--
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org

2016-07-18 22:02:40

by Dave Taht

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

Just to add another datapoint: the "RACK" optimization for TCP entered
the kernel recently. It has some "interesting" timing/batching-sensitive
behaviors. The TSO case is described; the packet-aggregation case seems
similar, but is not covered.

https://www.ietf.org/proceedings/96/slides/slides-96-tcpm-3.pdf


10 Jan 2016: https://kernelnewbies.org/Linux_4.4#head-2583c31a65e6592bef9af426a78940078df7f630

The draft was significantly updated this month.

https://tools.ietf.org/html/draft-cheng-tcpm-rack-01
-- Andrew Shewmaker

On Mon, Jul 18, 2016 at 2:49 PM, Toke Høiland-Jørgensen <[email protected]> wrote:
> Toke Høiland-Jørgensen <[email protected]> writes:
>
>> Felix Fietkau <[email protected]> writes:
>>
>>> Hi,
>>>
>>> With Toke's ath9k txq patch I've noticed a pretty nasty performance
>>> regression when running local iperf on an AP (running the txq stuff) to
>>> a wireless client.
>>>
>>> Here are some things that I found:
>>> - when I use only one TCP stream I get around 90-110 Mbit/s
>>> - when running multiple TCP streams, I get only 35-40 Mbit/s total
>>> - fairness between TCP streams looks completely fine
>>> - there's no big queue buildup, the code never actually drops any packets
>>> - if I put a hack in the fq code to force the hash to a constant value
>>> (effectively disabling fq without disabling codel), the problem
>>> disappears and even multiple streams get proper performance.
>>>
>>> Please let me know if you have any ideas.
>>
>> Hmm, I see two TCP streams get about the same aggregate throughput as
>> one, both when started from the AP and when started one hop away.
>
> So while I have still not been able to reproduce the issue you
> described, I have seen something else that is at least puzzling, and may
> or may not be related:
>
> When monitoring the output of /sys/kernel/debug/ieee80211/phy0/aqm I see
> that all stations have their queues empty all the way to zero several
> times per second. This is a bit puzzling; the queue should be kept under
> control, but really shouldn't empty completely. I figure this might also
> be the reason why you're seeing degraded performance...
>
> Since the stats output doesn't include a counter for drops, I haven't
> gotten any further with figuring out if it's CoDel that's being too
> aggressive, or what is happening. But will probably add that in and take
> another look.
>
> -Toke



--
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org

2016-07-12 12:13:50

by Dave Taht

Subject: Re: TCP performance regression in mac80211 triggered by the fq code

On Tue, Jul 12, 2016 at 12:09 PM, Felix Fietkau <[email protected]> wrote:
> Hi,
>
> With Toke's ath9k txq patch I've noticed a pretty nasty performance
> regression when running local iperf on an AP (running the txq stuff) to
> a wireless client.

Your kernel? cpu architecture?

What happens when going through the AP to a server from the wireless client?

Which direction?

> Here are some things that I found:
> - when I use only one TCP stream I get around 90-110 Mbit/s

with how much cpu left over?

> - when running multiple TCP streams, I get only 35-40 Mbit/s total

with how much cpu left over?
context switch difference between the two tests?
tcp_limit_output_bytes is?

got perf?

> - fairness between TCP streams looks completely fine

A codel will get to long term fairness pretty fast. Packet captures
from a fq will show much more regular interleaving of packets,
regardless.

> - there's no big queue buildup, the code never actually drops any packets

A "trick" I have been using to observe codel behavior has been to
enable ecn on server and client, then checking in wireshark for ect(3)
marked packets.
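
(On Linux endpoints, something like sysctl -w net.ipv4.tcp_ecn=1 on
both ends, then a wireshark display filter along the lines of
ip.dsfield.ecn == 3, should show the marks.)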

> - if I put a hack in the fq code to force the hash to a constant value

You could also set "flows" to 1 so the hash is still generated, but
not actually used.
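
Something like this, assuming "flows" maps to the fq hash-table size:

	/* a single bucket: reciprocal_scale(hash, 1) is 0 for any hash,
	 * so hashing still runs but every packet shares one flow */
	fq->flows_cnt = 1;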

> (effectively disabling fq without disabling codel), the problem
> disappears and even multiple streams get proper performance.

Meaning you get 90-110 Mbit/s?

Do you have a "before toke" figure for this platform?

> Please let me know if you have any ideas.

I am in berlin, packing hardware...

>
> - Felix



--
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org