2019-12-05 16:36:20

by Johannes Berg

Subject: debugging TXQs being empty

Hi Toke, all,

I'm debugging some throughput issues and wondered if you had a hint.
This is at HE rates 2x2 80 MHz, so you'd expect ~1Gbps or a bit more,
I'm getting ~900 Mbps. Just to set the stage.

What I think is (part of) the problem is that I see in the logs that our
hardware queues become empty every once in a while.

This seems to be when/because ieee80211_tx_dequeue() returns NULL, and
we hit the
        skb = ieee80211_tx_dequeue(hw, txq);

        if (!skb) {
                if (txq->sta)
                        IWL_DEBUG_TX(mvm,
                                     "TXQ of sta %pM tid %d is now empty\n",
                                     txq->sta->addr,
                                     txq->tid);

printout, e.g.
iwlwifi 0000:00:14.3: I iwl_mvm_mac_itxq_xmit TXQ of sta 0c:9d:92:03:12:44 tid 0 is now empty

This isn't always bad, but in most cases where I see it happen the hardware
queue is actually rather shallow at the time, say only 57 packets in
one instance. Then we can basically send all the packets in the queue
in one or two aggregations (I see here an example with 57 packets in the
queue, ieee80211_tx_dequeue() returns NULL, and we then send an A-MPDU
with 38 followed by one with 19 packets, making the HW queue empty.)

This is with 10 simultaneous TCP streams, so there *shouldn't* be any
issues with that. I did try lowering the pacing shift, and it had
no effect. I couldn't try with just one or two streams (actually one
stream is not enough because the AP has only GBit LAN ... so in the
ideal case wireless is faster than ethernet!!) - somehow the test hangs
then, but I'll get back to that later.


Anyhow, do you have any tips on debugging this? This is still without
AQL code. The AQM stats for the AP look fine, basically everything is 0
except for "new-flows", "tx-bytes" and "tx-packets".

One thing that does seem odd is that the new-flows counter is increasing
so rapidly - shouldn't we expect it to be like 10 new flows for 10 TCP
sessions? I see this counter increase by the thousands per second.

I don't see any calls to __ieee80211_stop_queue() either, as expected
(per trace-cmd).

CPU load is not an issue AFAICT, even with all the debugging being
written into the syslog (or journal or something) that's the only thing
that takes noticeable CPU time - ~50% for systemd-journal and ~20% for
rsyslogd, <10% for the throughput testing program and that's about it.
The system has 4 threads and seems mostly idle.

All this seems to mean that the TCP stack isn't feeding us fast enough,
but is that really possible?

Any other ideas?

Thanks,
johannes
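
As a rough sanity check on the "~1Gbps or a bit more" expectation, the
nominal HE PHY rate can be worked out from the symbol parameters. This is
only a sketch: the thread does not state the MCS or guard interval, so
MCS 11 (1024-QAM, rate 5/6) and a 0.8 us guard interval are assumptions,
and the function name is made up for illustration.

```python
# Sketch: nominal 802.11ax (HE) PHY data rate for one configuration.
# Assumptions (not stated in the thread): MCS 11 (1024-QAM, rate 5/6),
# 0.8 us guard interval, full-bandwidth 80 MHz transmission.

HE80_DATA_SUBCARRIERS = 980   # data tones in a 996-tone RU (80 MHz)
SYMBOL_US = 12.8              # HE OFDM symbol duration, excluding GI


def he_phy_rate_mbps(streams, bits_per_subcarrier, coding_rate,
                     gi_us=0.8, data_subcarriers=HE80_DATA_SUBCARRIERS):
    """Return the nominal PHY rate in Mbit/s (bits per microsecond)."""
    bits_per_symbol = (data_subcarriers * bits_per_subcarrier
                       * coding_rate * streams)
    return bits_per_symbol / (SYMBOL_US + gi_us)


phy = he_phy_rate_mbps(streams=2, bits_per_subcarrier=10, coding_rate=5 / 6)
print(f"nominal PHY rate: {phy:.1f} Mbps")
# The ~900 Mbps TCP figure from the message above, as a share of that rate.
print(f"900 Mbps is {100 * 900 / phy:.0f}% of the nominal PHY rate")
```

The PHY rate is an upper bound before preambles, block acks, contention
and TCP ACK traffic, so goodput always lands some way below it; how far
below depends heavily on how full the aggregates are.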


2019-12-05 16:39:05

by Ben Greear

Subject: Re: debugging TXQs being empty

On 12/5/19 8:34 AM, Johannes Berg wrote:
> Hi Toke, all,
>
> I'm debugging some throughput issues and wondered if you had a hint.
> This is at HE rates 2x2 80 MHz, so you'd expect ~1Gbps or a bit more,
> I'm getting ~900 Mbps. Just to set the stage.
>
> What I think is (part of) the problem is that I see in the logs that our
> hardware queues become empty every once in a while.
>
> This seems to be when/because ieee80211_tx_dequeue() returns NULL, and
> we hit the
> skb = ieee80211_tx_dequeue(hw, txq);
>
> if (!skb) {
> if (txq->sta)
> IWL_DEBUG_TX(mvm,
> "TXQ of sta %pM tid %d is now empty\n",
> txq->sta->addr,
> txq->tid);
>
> printout, e.g.
> iwlwifi 0000:00:14.3: I iwl_mvm_mac_itxq_xmit TXQ of sta 0c:9d:92:03:12:44 tid 0 is now empty
>
> This isn't always bad, but in most cases I see it happen the hardware
> queue actually is rather shallow at the time, say only 57 packets in
> some instance. Then we can basically send all the packets in the queue
> in one or two aggregations (I see here an example with 57 packets in the
> queue, ieee80211_tx_dequeue() returns NULL, and we then send an A-MPDU
> with 38 followed by one with 19 packets, making the HW queue empty.)
>
> This is with 10 simultaneous TCP streams, so there *shouldn't* be any
> issues with that, I did indeed try to lower the pacing shift and it had
> no effect. I couldn't try with just one or two streams (actually one
> stream is not enough because the AP has only GBit LAN ... so in the
> ideal case wireless is faster than ethernet!!) - somehow the test hangs
> then, but I'll get back to that later.
>
>
> Anyhow, do you have any tips on debugging this? This is still without
> AQL code. The AQM stats for the AP look fine, basically everything is 0
> except for "new-flows", "tx-bytes" and "tx-packets".
>
> One thing that does seem odd is that the new-flows counter is increasing
> this rapidly - shouldn't we expect it to be like 10 new flows for 10 TCP
> sessions? I see this counter increase by the thousands per second.
>
> I don't see any calls to __ieee80211_stop_queue() either, as expected
> (per trace-cmd).
>
> CPU load is not an issue AFAICT, even with all the debugging being
> written into the syslog (or journal or something) that's the only thing
> that takes noticeable CPU time - ~50% for systemd-journal and ~20% for
> rsyslogd, <10% for the throughput testing program and that's about it.
> The system has 4 threads and seems mostly idle.
>
> All this seems to mean that the TCP stack isn't feeding us fast enough,
> but is that really possible?

Does UDP work better?

or pktgen?

Thanks,
Ben

>
> Any other ideas?
>
> Thanks,
> johannes
>


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2019-12-05 16:51:35

by Johannes Berg

Subject: Re: debugging TXQs being empty

On Thu, 2019-12-05 at 08:37 -0800, Ben Greear wrote:

> > All this seems to mean that the TCP stack isn't feeding us fast enough,
> > but is that really possible?
>
> Does UDP work better?

Somewhat, I get about 1020-1030 Mbps. But still a TON of "TXQ of STA ...
is now empty" messages. Say this run got about 15 per second of those.

> or pktgen?

I haven't really tried, the setup is a bit complicated ... and it's
nowhere near me either :)

johannes

2019-12-05 16:52:15

by Johannes Berg

Subject: Re: debugging TXQs being empty

On Thu, 2019-12-05 at 17:49 +0100, Johannes Berg wrote:
> On Thu, 2019-12-05 at 08:37 -0800, Ben Greear wrote:
>
> > > All this seems to mean that the TCP stack isn't feeding us fast enough,
> > > but is that really possible?
> >
> > Does UDP work better?
>
> Somewhat, I get about 1020-1030 Mbps. But still a TON of "TXQ of STA ...
> is now empty" messages. Say this run got about 15 per second of those.

That actually pegged a CPU in the test tool, so not sure I'm doing that
correctly ...

johannes

2019-12-05 16:58:26

by Ben Greear

Subject: Re: debugging TXQs being empty

On 12/5/19 8:49 AM, Johannes Berg wrote:
> On Thu, 2019-12-05 at 08:37 -0800, Ben Greear wrote:
>
>>> All this seems to mean that the TCP stack isn't feeding us fast enough,
>>> but is that really possible?
>>
>> Does UDP work better?
>
> Somewhat, I get about 1020-1030 Mbps. But still a TON of "TXQ of STA ...
> is now empty" messages. Say this run got about 15 per second of those.

It would seem that it is not some issue with TCP stack then?

In general, UDP uses more CPU to send from user-space than TCP
because of TSO, etc. Sendmmsg can help a bit, but it is a bit painful
to code against so things like iperf do not use it, at least ones I've
looked at.

Can you provide some details on how you are generating this load?

For what it's worth, we've seen about 1.9Gbps download goodput
when using ax200 as a station receiving traffic from 160Mhz AP.
I don't have any reports of > 1Gbps of upload performance though,
not sure our user with the fast AP has done much upload testing...

>> or pktgen?
>
> I haven't really tried, the setup is a bit complicated ... and it's
> nowhere near me either :)

Yeah, it will likely crash your system unless you apply years-old patches I posted
too :)

But, at least with pktgen, you can be quite sure it is not some slowdown farther up
the stack that is causing the problem.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2019-12-05 18:14:58

by Johannes Berg

Subject: Re: debugging TXQs being empty

On Thu, 2019-12-05 at 08:57 -0800, Ben Greear wrote:
> On 12/5/19 8:49 AM, Johannes Berg wrote:
> > On Thu, 2019-12-05 at 08:37 -0800, Ben Greear wrote:
> >
> > > > All this seems to mean that the TCP stack isn't feeding us fast enough,
> > > > but is that really possible?
> > >
> > > Does UDP work better?
> >
> > Somewhat, I get about 1020-1030 Mbps. But still a TON of "TXQ of STA ...
> > is now empty" messages. Say this run got about 15 per second of those.
>
> It would seem that it is not some issue with TCP stack then?

Hmm, yeah, maybe not then. Something more general in the stack? I just
can't think of anything.

> In general, UDP uses more CPU to send from user-space than TCP
> because of TSO, etc. Sendmmsg can help a bit, but it is a bit painful
> to code against so things like iperf do not use it, at least ones I've
> looked at.

True.

> Can you provide some details on how you are generating this load?

Using chariot. I don't really know it well, just the testers use it.

> For what it's worth, we've seen about 1.9Gbps download goodput
> when using ax200 as a station receiving traffic from 160Mhz AP.
> I don't have any reports of > 1Gbps of upload performance though,
> not sure our user with the fast AP has done much upload testing...

:)

> > > or pktgen?
> >
> > I haven't really tried, the setup is a bit complicated ... and it's
> > nowhere near me either :)
>
> Yeah, it will likely crash your system unless you apply years-old patches I posted
> too :)
>
> But, at least with pktgen, you can be quite sure it is not some slowdown farther up
> the stack that is causing the problem.

True.

johannes

2019-12-05 18:15:16

by Johannes Berg

Subject: Re: debugging TXQs being empty

On Thu, 2019-12-05 at 18:12 +0100, Toke Høiland-Jørgensen wrote:
>
> I'm on mobile, so briefly:
>
> What you're describing sounds like it's TCP congestion control kicking
> in and throttling at the stack.

Agree, though need to think about the UDP scenario more. It's faster,
but that's expected since it doesn't have the ACK backchannel, and it's
not that *much* faster.

> The "new flows" increasing is consistent with the queue actually
> running empty (which you also see in the warnings).

Oh, right ok. That makes sense, it just loses info about the flow once
the queue is empty.

> My hand-wavy explanation is that the TCP stack gets throttled and
> doesn't get going again quickly enough to fill the pipe. Could be a
> timing thing with aggregation? As I said, hand-wavy ;)

So I should need deeper *hardware* queues to make up for the timing?

Thing is, sometimes I even see the queue empty, then there are *two*
MTU-sized frames, and then it's empty again - all the while there are
only ~40 frames on the hardware queue. You'd think the higher layer
would, once it starts feeding again, actually feed more?

Hmm. Actually, I wonder why I'm not seeing TSO, something to check
tomorrow.

johannes
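
The link between "queue runs empty" and a fast-growing new-flows counter
can be illustrated with a toy model. This is not mac80211's fq code, just
a single-queue sketch under the assumption described above: flow state is
forgotten once the queue drains, so the next packet counts as a new flow.
The function name and event encoding are made up for illustration.

```python
from collections import deque


def count_new_flows(events):
    """Count enqueues that hit an empty queue.

    events is an iterable of "enq" / "deq" strings for one flow queue.
    """
    queue = deque()
    new_flows = 0
    for ev in events:
        if ev == "enq":
            if not queue:
                # Flow state was dropped when the queue drained, so this
                # packet is accounted as the start of a new flow.
                new_flows += 1
            queue.append(object())
        elif ev == "deq" and queue:
            queue.popleft()
    return new_flows


# Sender keeps a small standing backlog: the queue never drains.
backlogged = ["enq"] * 5 + ["deq", "enq"] * 1000
# Sender is slower than the link: the queue drains after every packet.
starved = ["enq", "deq"] * 1000

print("backlogged:", count_new_flows(backlogged))
print("starved:   ", count_new_flows(starved))
```

In this model a counter that grows by thousands per second says the same
thing as the "is now empty" messages: the queue keeps draining between
bursts from the upper layers.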

2019-12-05 18:15:16

by Ben Greear

Subject: Re: debugging TXQs being empty

On 12/5/19 10:04 AM, Johannes Berg wrote:
> On Thu, 2019-12-05 at 08:57 -0800, Ben Greear wrote:
>> On 12/5/19 8:49 AM, Johannes Berg wrote:
>>> On Thu, 2019-12-05 at 08:37 -0800, Ben Greear wrote:
>>>
>>>>> All this seems to mean that the TCP stack isn't feeding us fast enough,
>>>>> but is that really possible?
>>>>
>>>> Does UDP work better?
>>>
>>> Somewhat, I get about 1020-1030 Mbps. But still a TON of "TXQ of STA ...
>>> is now empty" messages. Say this run got about 15 per second of those.
>>
>> It would seem that it is not some issue with TCP stack then?
>
> Hmm, yeah, maybe not then. Something more general in the stack? I just
> can't think of anything.

Test similar setup 10g wired to 10g wired to make sure traffic generator
can generate hoped for load?

>
>> In general, UDP uses more CPU to send from user-space than TCP
>> because of TSO, etc. Sendmmsg can help a bit, but it is a bit painful
>> to code against so things like iperf do not use it, at least ones I've
>> looked at.
>
> True.
>
>> Can you provide some details on how you are generating this load?
>
> Using chariot. I don't really know it well, just the testers use it.

So, you have some PC with AX200 in it, acting as station, connected to some AP,
and Chariot runs on that PC and something upstream of the AP and tries to
send traffic from PC to AP?

If you can share the AP model, just possibly we have one and could do a similar
test....

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2019-12-05 18:20:51

by Johannes Berg

Subject: Re: debugging TXQs being empty

On Thu, 2019-12-05 at 10:09 -0800, Ben Greear wrote:

> > Hmm, yeah, maybe not then. Something more general in the stack? I just
> > can't think of anything.
>
> Test similar setup 10g wired to 10g wired to make sure traffic generator
> can generate hoped for load?

Oh, I think it normally works better - I'd have to look up the numbers,
don't have them handy now. It's a specific issue to this specific PC
that I have, could be related to the (bastardized) kernel that it has, or
something else.

Hence my more general questions how I would understand/debug this, I
don't think I can say what the hardware or even kernel is (and even if
you knew it'd probably be useless, not sure that's public now.)

> > Using chariot. I don't really know it well, just the testers use it.
>
> So, you have some PC with AX200 in it, acting as station, connected to some AP,
> and Chariot runs on that PC and something upstream of the AP and tries to
> send traffic from PC to AP?

Traffic is going from the DUT to a wired station behind the AP, which
actually has two gigabit ethernet links to the AP and two IP addresses,
so that we can distribute the wireless load onto two gigabit links.

> If you can share the AP model, just possibly we have one and could do a similar
> test....

:)

I think it's a RT-AX88U. Not sure that really makes a difference.

Seems this AP has a bug btw, it's advertising packet extension of 16usec
for 20 and 40 MHz, but not for 80 and 160 MHz, which seems a bit odd,
and indeed we miss ACK there sometimes. To exclude that as a reason I
hacked the driver to always do 16us ignoring the AP information. But I
think the issues I outlined with the TXQs are the primary reason for
even sending single frames where this would matter ... rather than only
A-MPDUs.

So I don't *think* it's really related to that, but others are looking
at that part (or well, I hope they will be on Sunday, given they're in
Israel).

In the meantime, I'm stuck trying to figure out why we run the TXQs
empty :)

johannes

2019-12-05 18:24:01

by Johannes Berg

Subject: Re: debugging TXQs being empty

On Thu, 2019-12-05 at 19:08 +0100, Johannes Berg wrote:

> Hmm. Actually, I wonder why I'm not seeing TSO, something to check
> tomorrow.

No, I confused myself, of course we're seeing TSO, was looking at the
UDP log :)

Now though why don't we build A-MSDUs for UDP here? But I'm not really
looking at UDP right now :)

johannes

2019-12-05 18:35:23

by Ben Greear

Subject: Re: debugging TXQs being empty

On 12/5/19 10:20 AM, Johannes Berg wrote:
> On Thu, 2019-12-05 at 10:09 -0800, Ben Greear wrote:
>
>>> Hmm, yeah, maybe not then. Something more general in the stack? I just
>>> can't think of anything.
>>
>> Test similar setup 10g wired to 10g wired to make sure traffic generator
>> can generate hoped for load?
>
> Oh, I think it normally works better - I'd have to look up the numbers,
> don't have them handy now. It's a specific issue to this specific PC
> that I have, could be related to the (bastardized) kernel that has, or
> something else.
>
> Hence my more general questions how I would understand/debug this, I
> don't think I can say what the hardware or even kernel is (and even if
> you knew it'd probably be useless, not sure that's public now.)

Well, my questions were based around trying to verify that the problem is actually
down in the wifi stack/queues vs farther up the stack and/or user-space.

If you can't trust your traffic generator can actually generate the load,
then you could be chasing phantom problems.

If you can use something like pktgen, then you can bypass upper stack and
user-space, so area to test and debug is smaller.

If you use some more stable kernel and it works fine, then you can suspect
the kernel is the issue.

If you put ax200 in more standard PC and it works fine, then you can suspect
the hardware is the issue.

...

I'm debugging 160Mhz bugs in my /AC firmware, when I get that sorted, will try
ax200 as station and see what I can push through it in the upload direction.

Thanks,
Ben

>
>>> Using chariot. I don't really know it well, just the testers use it.
>>
>> So, you have some PC with AX200 in it, acting as station, connected to some AP,
>> and Chariot runs on that PC and something upstream of the AP and tries to
>> send traffic from PC to AP?
>
> Traffic is going from the DUT to a wired station behind the AP, which
> actually has two gigabit ethernet links to the AP and two IP addresses,
> so that we can distribute the wireless load onto two gigabit links.
>
>> If you can share the AP model, just possibly we have one and could do a similar
>> test....
>
> :)
>
> I think it's a RT-AX88U. Not sure that really makes a difference.
>
> Seems this AP has a bug btw, it's advertising packet extension of 16usec
> for 20 and 40 MHz, but not for 80 and 160 MHz, which seems a bit odd,
> and indeed we miss ACK there sometimes. To exclude that as a reason I
> hacked the driver to always do 16us ignoring the AP information. But I
> think the issues I outlined with the TXQs are the primary reason for
> even sending single frames where this would matter ... rather than only
> A-MPDUs.
>
> So I don't *think* it's really related to that, but others are looking
> at that part (or well, I hope they will be on Sunday, given they're in
> Israel).
>
> In the meantime, I'm stuck trying to figure out why we run the TXQs
> empty :)
>
> johannes
>


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2019-12-05 20:36:07

by Johannes Berg

Subject: Re: debugging TXQs being empty

On Thu, 2019-12-05 at 17:34 +0100, Johannes Berg wrote:

> What I think is (part of) the problem is that I see in the logs that our
> hardware queues become empty every once in a while.

I made a histogram of A-MPDU sizes (or singles), and out of the ~11k
frames in the log snapshot I took, I see only ~72% full size A-MPDUs (63
subframes, don't remember why now but we never use 64) but ~9.5% singles
or A-MPDUs with <10 subframes... The rest is pretty evenly distributed
with a small peak (2%) at 62.

So looks like I could get much better performance if I was able to keep
the queues full, not that I'm any closer to figuring out why they're not
...

johannes
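
A histogram like the one described above could be computed along these
lines. This is a sketch only: how the per-transmission subframe counts
are extracted from the driver log is not shown in the thread, so the
input is simply a list of integers, and the sample data is invented to
exercise the function rather than taken from the real log.

```python
from collections import Counter


def ampdu_summary(sizes, full=63, small_below=10):
    """Summarise A-MPDU sizes (subframes per transmission).

    Returns (histogram, share of full-size aggregates,
             share of aggregates with < small_below subframes, mean size).
    """
    total = len(sizes)
    hist = Counter(sizes)
    full_share = hist[full] / total
    small_share = sum(n for size, n in hist.items()
                      if size < small_below) / total
    mean = sum(sizes) / total
    return hist, full_share, small_share, mean


# Invented sample, NOT data from the thread.
sample = [63] * 72 + [1] * 6 + [5] * 4 + [19, 38] * 4 + [62] * 2 + [30] * 8
hist, full_share, small_share, mean = ampdu_summary(sample)
print(f"full-size: {full_share:.0%}, small: {small_share:.0%}, "
      f"mean subframes: {mean:.1f}")
```

The mean subframe count is a convenient single number to track between
test runs, since every short aggregate pays the same fixed per-PPDU
overhead as a full one.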

2019-12-06 08:42:59

by Johannes Berg

Subject: Re: debugging TXQs being empty

Hi,

Thanks!

On Thu, 2019-12-05 at 17:05 -0800, Kan Yan wrote:
> > Anyhow, do you have any tips on debugging this? This is still without
> > AQL code. The AQM stats for the AP look fine, basically everything is 0
> > except for "new-flows", "tx-bytes" and "tx-packets".
>
> If the "backlog" field is also 0, then it is a sign that the TCP stack
> is not feeding packets fast enough.

Mostly, not always. One thing I did notice there now is that I get like
64k packets into the driver.

Maybe somehow TSO is interacting badly with the TXQs and the tracking
here, since TSO makes the traffic *very* bursty? A 64k packet in the
driver will typically expand to 9 or 10 A-MSDUs I think?

> > One thing that does seem odd is that the new-flows counter is increasing
> > this rapidly - shouldn't we expect it to be like 10 new flows for 10 TCP
> > sessions? I see this counter increase by the thousands per second.
>
> This could be normal. When a flow queue is completely drained, it will
> be deleted. Next packet will be in the "new_flows". This is another
> sign of the bottleneck maybe at TCP stack.

Right, Toke pointed that out too.

> > CPU load is not an issue AFAICT, even with all the debugging being
> > written into the syslog (or journal or something) that's the only thing
> > that takes noticeable CPU time - ~50% for systemd-journal and ~20% for
> > rsyslogd, <10% for the throughput testing program and that's about it.
> > The system has 4 threads and seems mostly idle.
>
> What's CPU usage for software irq? Is CPU usage average of all cores?
> Maybe the core that handles packet processing or softirq is maxed out
> even the average is fine.

Agree, I didn't really say that well. I'm really just ballparking this
by using 'top', but even the *most* loaded of the 4 CPUs (threads?) is
at >80% idle, <6% softirq and <12% sys for the duration of the test.

> Are you using iperf or netperf? increase the TCP windows size may
> help. Adjust things like "net.core.wmem_max" and "net.ipv4.tcp_mem"
> maybe necessary to enable iperf to use larger windows.

Chariot :)

Anyway, this had no effect. I'll go play with the TSO next, for some
reason our driver won't let me disable it dynamically (I guess I should
just fix that).

johannes
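
For the per-core question above, 'top' is enough for ballparking, but the
same numbers can be taken from two samples of /proc/stat. The sketch
below assumes the standard Linux field order on the per-CPU lines (user,
nice, system, idle, iowait, irq, softirq, steal, ...); the function names
are made up for illustration.

```python
import time


def parse_proc_stat(text):
    """Return {cpu_name: [jiffies, ...]} for the per-CPU lines of /proc/stat."""
    cpus = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line; keep cpu0, cpu1, ...
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            cpus[fields[0]] = [int(v) for v in fields[1:]]
    return cpus


def per_cpu_percent(before, after):
    """Percent of time each CPU was idle / in softirq between two samples."""
    IDLE, SOFTIRQ = 3, 6  # indexes into: user nice system idle iowait irq softirq
    result = {}
    for cpu, new in after.items():
        delta = [n - o for n, o in zip(new, before[cpu])]
        # First eight fields only: guest time is already counted in user/nice.
        total = sum(delta[:8]) or 1
        result[cpu] = {"idle": 100 * delta[IDLE] / total,
                       "softirq": 100 * delta[SOFTIRQ] / total}
    return result


def sample_live(interval=1.0):
    """Take two live samples on a Linux system and return per-CPU percentages."""
    with open("/proc/stat") as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = parse_proc_stat(f.read())
    return per_cpu_percent(before, after)
```

Calling sample_live() during a test run and looking for any single CPU
with low idle or high softirq time answers the "is one core maxed out
while the average looks fine" question directly.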

2019-12-06 09:13:27

by Johannes Berg

Subject: Re: debugging TXQs being empty

On Fri, 2019-12-06 at 09:41 +0100, Johannes Berg wrote:
>
> Maybe somehow TSO is interacting badly with the TXQs and the tracking
> here, since TSO makes the traffic *very* bursty? A 64k packet in the
> driver will typically expand to 9 or 10 A-MSDUs I think?

No, that all seems well. Without TSO (with the trivial mac80211 patch to
let me turn it off with ethtool) I get about 890Mbps, so about 5% less.
That's not actually *that* bad, I guess due to software A-MSDU in
mac80211, but it's not really the right direction :)

Changing wmem_max/tcp_mem to outrageous values also didn't really make
any difference.

I guess it's time to see if I can poke into the TCP stack to figure out
what's going on...

johannes

2019-12-06 10:23:04

by Koen Vandeputte

Subject: Re: debugging TXQs being empty


On 06.12.19 10:12, Johannes Berg wrote:
> On Fri, 2019-12-06 at 09:41 +0100, Johannes Berg wrote:
>> Maybe somehow TSO is interacting badly with the TXQs and the tracking
>> here, since TSO makes the traffic *very* bursty? A 64k packet in the
>> driver will typically expand to 9 or 10 A-MSDUs I think?
> No, that all seems well. Without TSO (with the trivial mac80211 patch to
> let me turn it off with ethtool) I get about 890Mbps, so about 5% less.
> That's not actually *that* bad, I guess due to software A-MSDU in
> mac80211, but it's not really the right direction :)
If you try this test again while setting coverage class higher (20000m
or so), you *will* notice the difference a *lot* more (>50%) :-)
Even when the actual devices are only a few meters apart.
>
> Changing wmem_max/tcp_mem to outrageous values also didn't really make
> any difference.
>
> I guess it's time to see if I can poke into the TCP stack to figure out
> what's going on...
>
> johannes
>

2019-12-06 10:23:47

by Johannes Berg

Subject: Re: debugging TXQs being empty

On Fri, 2019-12-06 at 11:22 +0100, Koen Vandeputte wrote:
>
> > No, that all seems well. Without TSO (with the trivial mac80211 patch to
> > let me turn it off with ethtool) I get about 890Mbps, so about 5% less.
> > That's not actually *that* bad, I guess due to software A-MSDU in
> > mac80211, but it's not really the right direction :)
> If you try this test again while setting coverage class higher (20000m
> or so), you *will* notice the difference a *lot* more (>50%) :-)
> Even when the actual devices are only a few meters apart.

Heh, yeah, I guess that makes some sense. Our device doesn't let you set
that though, I think.

johannes

2019-12-06 11:50:05

by Johannes Berg

Subject: Re: debugging TXQs being empty

On Fri, 2019-12-06 at 10:12 +0100, Johannes Berg wrote:
> On Fri, 2019-12-06 at 09:41 +0100, Johannes Berg wrote:
> > Maybe somehow TSO is interacting badly with the TXQs and the tracking
> > here, since TSO makes the traffic *very* bursty? A 64k packet in the
> > driver will typically expand to 9 or 10 A-MSDUs I think?
>
> No, that all seems well. Without TSO (with the trivial mac80211 patch to
> let me turn it off with ethtool) I get about 890Mbps, so about 5% less.
> That's not actually *that* bad, I guess due to software A-MSDU in
> mac80211, but it's not really the right direction :)
>
> Changing wmem_max/tcp_mem to outrageous values also didn't really make
> any difference.
>
> I guess it's time to see if I can poke into the TCP stack to figure out
> what's going on...

Sadly no functioning kprobes on the system ... bpftrace -l lists them,
but can't actually use them.

If I also change net.ipv4.tcp_limit_output_bytes to an outrageous value
(10x) I can recover a bit more than half of the performance loss with
TSO disabled, but it makes no real difference with TSO enabled.

Either way, what bothers me somewhat is that the backlog fluctuates so
much. Sometimes I see a backlog of 2MB or more, while it *still*
manages to go completely empty.

Shouldn't I expect the steady state to have a somewhat even backlog? Why
does this vary so much?

johannes
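
When changing several of these tunables between runs it helps to record
what was actually in effect. The sketch below reads the three sysctls
mentioned in this thread straight from /proc/sys; the helper names and
the 'root' parameter (there only so the code can be pointed at a test
directory) are made up for illustration.

```python
import os

# Tunables mentioned in this thread.
TUNABLES = (
    "net.core.wmem_max",
    "net.ipv4.tcp_mem",
    "net.ipv4.tcp_limit_output_bytes",
)


def read_sysctl(name, root="/proc/sys"):
    """Return the whitespace-separated values of one sysctl as strings."""
    path = os.path.join(root, *name.split("."))
    with open(path) as f:
        return f.read().split()


def snapshot(names=TUNABLES, root="/proc/sys"):
    """Return {name: values or None} so a test run can log its settings."""
    values = {}
    for name in names:
        try:
            values[name] = read_sysctl(name, root)
        except OSError:
            values[name] = None  # not present (or not readable) on this system
    return values


print(snapshot())
```

Logging this snapshot next to each throughput result makes it easier to
tell later which combination of settings a given number came from.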

2019-12-06 23:48:15

by Ben Greear

Subject: Re: debugging TXQs being empty

On 12/6/19 3:49 AM, Johannes Berg wrote:
> On Fri, 2019-12-06 at 10:12 +0100, Johannes Berg wrote:
>> On Fri, 2019-12-06 at 09:41 +0100, Johannes Berg wrote:
>>> Maybe somehow TSO is interacting badly with the TXQs and the tracking
>>> here, since TSO makes the traffic *very* bursty? A 64k packet in the
>>> driver will typically expand to 9 or 10 A-MSDUs I think?
>>
>> No, that all seems well. Without TSO (with the trivial mac80211 patch to
>> let me turn it off with ethtool) I get about 890Mbps, so about 5% less.
>> That's not actually *that* bad, I guess due to software A-MSDU in
>> mac80211, but it's not really the right direction :)
>>
>> Changing wmem_max/tcp_mem to outrageous values also didn't really make
>> any difference.
>>
>> I guess it's time to see if I can poke into the TCP stack to figure out
>> what's going on...
>
> Sadly no functioning kprobes on the system ... bpftrace -l lists them,
> but can't actually use them.
>
> If I also change net.ipv4.tcp_limit_output_bytes to an outrageous value
> (10x) I can recover a bit more than half of the performance loss with
> TSO disabled, but it makes no real difference with TSO enabled.
>
> Either way, what bothers me somewhat is that the backlog fluctuates so
> much. Sometimes I see a backlog of 2MB or more, while it *still*
> manages to go completely empty.
>
> Shouldn't I expect the steady state to have a somewhat even backlog? Why
> does this vary so much?
>
> johannes
>


I did some tests today:

kernel is 5.2.21+, with the fix for ax200 upload corruption bug.
AP is QCA 9984 based PC (i5 processor) running ath10k-ct firmware/driver, configured for 2x2 160Mhz
STA is PC (i5 processor) with AX200
OTA, about 5 feet apart
AP reports STA is sending at MCS-9 160Mhz (AX200 STA does not report tx rate it seems)
Our LANforge tool is traffic generator, running directly on AP and STA machine.

Download UDP, I see about 697Mbps goodput
Upload UDP, I see about 120Mbps goodput

TCP download, about 660Mbps
TCP upload, about 99Mbps

Our hacked version of pktgen, bps includes down to Ethernet frame:
Download: 740Mbps
Upload: 129Mbps

I changed AP to 80Mhz mode, and re-ran the UDP tests:

Upload 137Mbps
Download 689Mbps

Though not confirmed today, one of us reports about 1.7Gbps download on AX200 against an enterprise /AX AP,
and only about 600Mbps upload in that same system. That is in isolation chamber and such.

So, for whatever reason(s), we see consistent poor upload performance on AX200.

For reference, we have previously seen about 1.1Gbps upload between QCA9984 station and 4x4 /AC APs
(and about 1.3Gbps download goodput), so in general, wifi upload can run faster.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2019-12-07 20:09:55

by Johannes Berg

Subject: Re: debugging TXQs being empty

On Fri, 2019-12-06 at 15:44 -0800, Ben Greear wrote:
> I did some tests today:
>
> kernel is 5.2.21+, with the fix for ax200 upload corruption bug.
> AP is QCA 9984 based PC (i5 processor) running ath10k-ct firmware/driver, configured for 2x2 160Mhz
> STA is PC (i5 processor) with AX200
> OTA, about 5 feet apart
> AP reports STA is sending at MCS-9 160Mhz (AX200 STA does not report tx rate it seems)

Yeah, that was an oversight for HE, it should work for HT/VHT. I have a
patch in the works to report the TX rate properly in iw.

> Our LANforge tool is traffic generator, running directly on AP and STA machine.
>
> Download UDP, I see about 697Mbps goodput
> Upload UDP, I see about 120Mbps goodput
>
> TCP download, about 660Mbps
> TCP upload, about 99Mbps
>
> Our hacked version of pktgen, bps includes down to Ethernet frame:
> Download: 740Mbps
> Upload: 129Mbps

Uh, wow, that's not good. I guess after I'm done with this bug, I should
look at upstream ...

> I changed AP to 80Mhz mode, and re-ran the UDP tests:
>
> Upload 137Mbps
> Download 689Mbps
>
> Though not confirmed today, one of us reports about 1.7Gbps download on AX200 against an enterprise /AX AP,
> and only about 600Mbps upload in that same system. That is in isolation chamber and such.
>
> So, for whatever reason(s), we see consistent poor upload performance on AX200.
>
> For reference, we have previously seen about 1.1Gbps upload between QCA9984 station and 4x4 /AC APs
> (and about 1.3Gbps download goodput), so in general, wifi upload can run faster.

Yes, for sure it can. Would be interesting to find out what the limiting
factor is for you.

Then again, I doubt we've released updated firmware recently - what
version are you using?

johannes

2019-12-08 17:27:15

by Ben Greear

Subject: Re: debugging TXQs being empty



On 12/07/2019 12:09 PM, Johannes Berg wrote:
> On Fri, 2019-12-06 at 15:44 -0800, Ben Greear wrote:
>> I did some tests today:
>>
>> kernel is 5.2.21+, with the fix for ax200 upload corruption bug.
>> AP is QCA 9984 based PC (i5 processor) running ath10k-ct firmware/driver, configured for 2x2 160Mhz
>> STA is PC (i5 processor) with AX200
>> OTA, about 5 feet apart
>> AP reports STA is sending at MCS-9 160Mhz (AX200 STA does not report tx rate it seems)
>
> Yeah, that was an oversight for HE, it should work for HT/VHT. I have a
> patch in the works to report the TX rate properly in iw.

I'm connecting to an /AC AP, so it should only be using VHT rates. I'm on 5.2-ish
kernel, so maybe it is already fixed in more recent ones?

>
>> Our LANforge tool is traffic generator, running directly on AP and STA machine.
>>
>> Download UDP, I see about 697Mbps goodput
>> Upload UDP, I see about 120Mbps goodput
>>
>> TCP download, about 660Mbps
>> TCP upload, about 99Mbps
>>
>> Our hacked version of pktgen, bps includes down to Ethernet frame:
>> Download: 740Mbps
>> Upload: 129Mbps
>
> Uh, wow, that's not good. I guess after I'm done with this bug, I should
> look at upstream ...
>
>> I changed AP to 80Mhz mode, and re-ran the UDP tests:
>>
>> Upload 137Mbps
>> Download 689Mbps
>>
>> Though not confirmed today, one of us reports about 1.7Gbps download on AX200 against an enterprise /AX AP,
>> and only about 600Mbps upload in that same system. That is in isolation chamber and such.
>>
>> So, for whatever reason(s), we see consistent poor upload performance on AX200.
>>
>> For reference, we have previously seen about 1.1Gbps upload between QCA9984 station and 4x4 /AC APs
>> (and about 1.3Gbps download goodput), so in general, wifi upload can run faster.
>
> Yes, for sure it can. Would be interesting to find out what the limiting
> factor is for you.
>
> Then again, I doubt we've released updated firmware recently - what
> version are you using?

It is 48-something, whatever comes with Fedora-30. We'd be happy to test more
recent firmware...maybe you could make them available somewhere?

And if you can provide release notes for the firmware, you get +1 damage vs QCA :P

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2019-12-09 08:09:33

by Johannes Berg

[permalink] [raw]
Subject: Re: debugging TXQs being empty

On Fri, 2019-12-06 at 15:44 -0800, Ben Greear wrote:
>
> kernel is 5.2.21+, with the fix for ax200 upload corruption bug.

Actually, thinking about this - there are *two* recent important fixes.

1) the A-MSDU frag fix
2) security corruption fix

Which patch exactly did you take?

(if you're not using encryption on the AP 2) won't matter)

johannes

2019-12-09 17:52:18

by Ben Greear

[permalink] [raw]
Subject: Re: debugging TXQs being empty

On 12/9/19 12:07 AM, Johannes Berg wrote:
> On Fri, 2019-12-06 at 15:44 -0800, Ben Greear wrote:
>>
>> kernel is 5.2.21+, with the fix for ax200 upload corruption bug.
>
> Actually, thinking about this - there are *two* recent important fixes.
>
> 1) the A-MSDU frag fix
> 2) security corruption fix
>
> Which patch exactly did you take?
>
> (if you're not using encryption on the AP 2) won't matter)
>
> johannes
>

I have only this one:

commit 1416758748a12963b7dc619a54fb9cef4354fa2e
Author: Johannes Berg <[email protected]>
Date: Wed Nov 20 12:26:39 2019 +0200

iwlwifi: pcie: fix support for transmitting SKBs with fraglist


Please point me to the other one.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2019-12-09 19:39:56

by Johannes Berg

[permalink] [raw]
Subject: Re: debugging TXQs being empty

On Mon, 2019-12-09 at 09:49 -0800, Ben Greear wrote:
>
> commit 1416758748a12963b7dc619a54fb9cef4354fa2e
> Author: Johannes Berg <[email protected]>
> Date: Wed Nov 20 12:26:39 2019 +0200
>
> iwlwifi: pcie: fix support for transmitting SKBs with fraglist

OK.

> Please point me to the other one.

This one:

commit cb1a4badf59275eb7221dcec621e8154917eabd1 (tag: wireless-drivers-2019-11-14)
Author: Mordechay Goodstein <[email protected]>
Date: Thu Nov 7 13:51:47 2019 +0200

iwlwifi: pcie: don't consider IV len in A-MSDU

but maybe it's included already?

I just tested (in conductive setup, open network) kernel 5.4, still see
TP significantly lower than what I'd expect... But even in RX?

johannes

2019-12-09 19:50:35

by Johannes Berg

[permalink] [raw]
Subject: Re: debugging TXQs being empty

On Sun, 2019-12-08 at 09:26 -0800, Ben Greear wrote:
>
> > Yeah, that was an oversight for HE, it should work for HT/VHT. I have a
> > patch in the works to report the TX rate properly in iw.
>
> I'm connecting to an /AC AP, so it should only be using VHT rates. I'm on 5.2-ish
> kernel, so maybe it is already fixed in more recent ones?

I think I just saw that VHT has a similar issue. I thought it didn't,
but maybe also in VHT we don't report properly to mac80211?

Anyway, the patch I had fixes it, except the Mbps calculation was
garbage ... We'll send it out soon. It's purely cosmetic anyway, in a
sense.
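
For anyone who wants to log the reported rate during a test run once it shows up in iw, here is a minimal sketch that pulls the fields out of a "tx bitrate:" line. The sample line is hand-written for illustration and is not output captured from this setup; the function name and the returned dictionary keys are made up for this example.

```python
import re

# Hand-written sample line in the style iw uses for station/link info.
# This is an illustration, not output captured from the AX200 discussed here.
SAMPLE = "\ttx bitrate: 866.7 MBit/s VHT-MCS 9 80MHz short GI VHT-NSS 2\n"


def parse_tx_bitrate(text):
    """Return rate, MCS, width and NSS from a 'tx bitrate:' line, or None."""
    m = re.search(r"tx bitrate:\s*([\d.]+) MBit/s(.*)", text)
    if m is None:
        return None
    rest = m.group(2)
    # Each of these fields is optional; legacy rates carry none of them.
    mcs = re.search(r"MCS (\d+)", rest)
    width = re.search(r"(\d+)MHz", rest)
    nss = re.search(r"NSS (\d+)", rest)
    return {
        "rate_mbps": float(m.group(1)),
        "mcs": int(mcs.group(1)) if mcs else None,
        "width_mhz": int(width.group(1)) if width else None,
        "nss": int(nss.group(1)) if nss else None,
        "short_gi": "short GI" in rest,
    }


if __name__ == "__main__":
    print(parse_tx_bitrate(SAMPLE))
```

Feeding it the text from "iw dev <ifname> link" or "iw dev <ifname> station dump" should work the same way, with the caveat that the rate is only as good as what the driver reports to mac80211.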

> > > For reference, we have previously seen about 1.1Gbps upload between QCA9984 station and 4x4 /AC APs
> > > (and about 1.3Gbps download goodput), so in general, wifi upload can run faster.
> >
> > Yes, for sure it can. Would be interesting to find out what the limiting
> > factor is for you.
> >
> > Then again, I doubt we've released updated firmware recently - what
> > version are you using?
>
> It is 48-something, whatever comes with Fedora-30. We'd be happy to test more
> recent firmware...maybe you could make them available somewhere?

I just tried with this one (48.4fa0041f.0) and it seems OK-ish, at least
for UDP. TCP is much lower than I'd expect though, but both TX and RX,
which is strange.

> And if you can provide release notes for the firmware, you get +1 damage vs QCA :P

Heh :)

johannes

2019-12-09 19:56:25

by Ben Greear

[permalink] [raw]
Subject: Re: debugging TXQs being empty

On 12/9/19 11:37 AM, Johannes Berg wrote:
> On Mon, 2019-12-09 at 09:49 -0800, Ben Greear wrote:
>>
>> commit 1416758748a12963b7dc619a54fb9cef4354fa2e
>> Author: Johannes Berg <[email protected]>
>> Date: Wed Nov 20 12:26:39 2019 +0200
>>
>> iwlwifi: pcie: fix support for transmitting SKBs with fraglist
>
> OK.
>
>> Please point me to the other one.
>
> This one:
>
> commit cb1a4badf59275eb7221dcec621e8154917eabd1 (tag: wireless-drivers-2019-11-14)
> Author: Mordechay Goodstein <[email protected]>
> Date: Thu Nov 7 13:51:47 2019 +0200
>
> iwlwifi: pcie: don't consider IV len in A-MSDU
>
> but maybe it's included already?

It is not in 5.2, I'll add it to my tree.

>
> I just tested (in conductive setup, open network) kernel 5.4, still see
> TP significantly lower than what I'd expect... But even in RX?

Please let us know what you actually see and expect, and what mode
you are using (VHT, HE, 160Mhz, 2x2 vs 1x1, etc).

Conductive test is not actually isolation from your environment (cables leak, especially
SMA pigtails, as do radio cards and so forth, and they leak both in and out), so
you will need a real shield chamber if you want to have repeatable
testing. And make sure you have 30 dB of attenuation or so inserted
if you are doing a cabled setup, otherwise the signal is too hot for the chips.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2019-12-09 20:00:07

by Johannes Berg

[permalink] [raw]
Subject: Re: debugging TXQs being empty

On Mon, 2019-12-09 at 11:55 -0800, Ben Greear wrote:
>
> Please let us know what you actually see and expect, and what mode
> you are using (VHT, HE, 160Mhz, 2x2 vs 1x1, etc).

Yeah, but I need to figure out first if I'm allowed to say the numbers,
sorry :)

Mostly interested in HE 160 MHz, 2x2, i.e. the max our NIC will do.

> Conductive test is not actually isolation from your environment (cables leak, especially
> SMA pigtails, as do radio cards and so forth, and they leak both in and out), so
> you will need a real shield chamber if you want to have repeatable
> testing. And, make sure you have 30db of attenuation or so inserted
> if you are doing cabled setup, otherwise signal is too hot for the chips.

I'm usually using 22dB attenuation, but yes :)

johannes

2019-12-10 20:48:19

by Ben Greear

[permalink] [raw]
Subject: Re: debugging TXQs being empty

On 12/9/19 11:37 AM, Johannes Berg wrote:
> On Mon, 2019-12-09 at 09:49 -0800, Ben Greear wrote:
>>
>> commit 1416758748a12963b7dc619a54fb9cef4354fa2e
>> Author: Johannes Berg <[email protected]>
>> Date: Wed Nov 20 12:26:39 2019 +0200
>>
>> iwlwifi: pcie: fix support for transmitting SKBs with fraglist
>
> OK.
>
>> Please point me to the other one.
>
> This one:
>
> commit cb1a4badf59275eb7221dcec621e8154917eabd1 (tag: wireless-drivers-2019-11-14)
> Author: Mordechay Goodstein <[email protected]>
> Date: Thu Nov 7 13:51:47 2019 +0200
>
> iwlwifi: pcie: don't consider IV len in A-MSDU
>
> but maybe it's included already?
>
> I just tested (in conductive setup, open network) kernel 5.4, still see
> TP significantly lower than what I'd expect... But even in RX?

We added this patch and tested. I don't think it changed much in our setup,
so maybe we were never hitting the bug for one reason or another.

We see about 675Mbps pktgen upload, and about 1Gbps download. AP is
/AX and configured for 160Mhz, but AP does not actually transmit at
more than 80Mhz it seems. I currently have no good way to see what MCS and BW
AX200 is transmitting at.
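
To put goodput figures like these in context, it can help to compare them against the nominal PHY rate for each width. The following is a rough sketch, assuming the top MCS, 2 spatial streams and the shortest guard interval (VHT MCS 9 with short GI, HE MCS 11 with 0.8 us GI); the function names are made up for this example, and the actual rate the AX200 picks may well be lower.

```python
# Data subcarriers per channel width in MHz (single-user, full bandwidth).
VHT_DATA_SUBCARRIERS = {20: 52, 40: 108, 80: 234, 160: 468}
HE_DATA_SUBCARRIERS = {20: 234, 40: 468, 80: 980, 160: 1960}


def phy_rate_mbps(data_subcarriers, bits_per_subcarrier, coding_rate, nss, symbol_us):
    """Nominal PHY rate: coded data bits per OFDM symbol over symbol time.

    Bits per microsecond is the same as Mbit/s.
    """
    bits_per_symbol = data_subcarriers * bits_per_subcarrier * coding_rate * nss
    return bits_per_symbol / symbol_us


def vht_mcs9_rate(width_mhz, nss):
    # 256-QAM (8 bits), rate 5/6, 3.2 us symbol + 0.4 us short GI.
    return phy_rate_mbps(VHT_DATA_SUBCARRIERS[width_mhz], 8, 5 / 6, nss, 3.6)


def he_mcs11_rate(width_mhz, nss):
    # 1024-QAM (10 bits), rate 5/6, 12.8 us symbol + 0.8 us GI.
    return phy_rate_mbps(HE_DATA_SUBCARRIERS[width_mhz], 10, 5 / 6, nss, 13.6)


if __name__ == "__main__":
    for width in (80, 160):
        print(f"VHT MCS9  2x2 {width:3d} MHz: {vht_mcs9_rate(width, 2):7.1f} Mbps")
        print(f"HE  MCS11 2x2 {width:3d} MHz: {he_mcs11_rate(width, 2):7.1f} Mbps")
```

Dividing a measured goodput by the matching PHY rate gives a crude efficiency number. It ignores preambles, block acks, contention and protocol headers, so it is only useful for spotting large gaps between upload and download.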

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2019-12-10 20:50:18

by Johannes Berg

[permalink] [raw]
Subject: Re: debugging TXQs being empty

On Tue, 2019-12-10 at 12:47 -0800, Ben Greear wrote:
>
> We see about 675Mbps pktgen upload, and about 1Gbps download. AP is
> /AX and configured for 160Mhz, but AP does not actually transmit at
> more than 80Mhz it seems. I currently have no good way to see what MCS and BW
> AX200 is transmitting at.
>

Try this

https://p.sipsolutions.net/d421d04b8aef81c4.txt

That's our internal patch to fix this, will be going upstream soon I
hope.

johannes

2019-12-12 18:07:41

by Ben Greear

[permalink] [raw]
Subject: Re: debugging TXQs being empty

On 12/10/19 12:49 PM, Johannes Berg wrote:
> On Tue, 2019-12-10 at 12:47 -0800, Ben Greear wrote:
>>
>> We see about 675Mbps pktgen upload, and about 1Gbps download. AP is
>> /AX and configured for 160Mhz, but AP does not actually transmit at
>> more than 80Mhz it seems. I currently have no good way to see what MCS and BW
>> AX200 is transmitting at.
>>
>
> Try this
>
> https://p.sipsolutions.net/d421d04b8aef81c4.txt
>
> That's our internal patch to fix this, will be going upstream soon I
> hope.

That appears to work fine, by the way.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com