2020-04-27 14:58:19

by Ben Greear

Subject: [PATCH] ath10k: Restart xmit queues below low-water mark.

From: Ben Greear <[email protected]>

While running tcp upload + download tests with ~200
concurrent TCP streams, 1-2 processes, and 30 station
vdevs, I noticed that the __ieee80211_stop_queue was taking
around 20% of the CPU according to perf-top, with other locking
taking an additional ~15%.

I believe the issue is that the ath10k driver would unlock the
txqueue when a single frame could be transmitted, instead of
waiting for a low water mark.

So, this patch adds a low-water mark that is 1/4 of the total
tx buffers allowed.

This appears to resolve the performance problem that I saw.

Tested with recent wave-1 ath10k-ct firmware.

Signed-off-by: Ben Greear <[email protected]>
---
drivers/net/wireless/ath/ath10k/htt.h | 1 +
drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
index 31c4ddbf45cb..b5634781c0dc 100644
--- a/drivers/net/wireless/ath/ath10k/htt.h
+++ b/drivers/net/wireless/ath/ath10k/htt.h
@@ -1941,6 +1941,7 @@ struct ath10k_htt {

u8 target_version_major;
u8 target_version_minor;
+ bool needs_unlock;
struct completion target_version_received;
u8 max_num_amsdu;
u8 max_num_ampdu;
diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
index 9b3c3b080e92..44795d9a7c0c 100644
--- a/drivers/net/wireless/ath/ath10k/htt_tx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
@@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
lockdep_assert_held(&htt->tx_lock);

htt->num_pending_tx--;
- if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
+ if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
+ htt->needs_unlock = false;
ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
+ }
}

int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
@@ -157,8 +159,10 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
return -EBUSY;

htt->num_pending_tx++;
- if (htt->num_pending_tx == htt->max_num_pending_tx)
+ if (htt->num_pending_tx == htt->max_num_pending_tx) {
+ htt->needs_unlock = true;
ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
+ }

return 0;
}
--
2.20.1
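For readers skimming the diff, the stop/restart hysteresis it introduces can be sketched as a standalone C model. Names mirror the driver, but this is an illustration, not ath10k code; the queue_stopped flag stands in for ath10k_mac_tx_lock()/ath10k_mac_tx_unlock():

```c
#include <assert.h>
#include <stdbool.h>

/* Standalone model of the patch's hysteresis; names mirror the driver
 * but this is an illustration, not ath10k code. */
struct htt_model {
	int num_pending_tx;
	int max_num_pending_tx;
	bool needs_unlock;	/* set when the queues were stopped */
	bool queue_stopped;	/* stands in for ath10k_mac_tx_lock()/unlock() */
};

/* Mirrors ath10k_htt_tx_inc_pending(): stop the queues only when full. */
static int inc_pending(struct htt_model *h)
{
	if (h->num_pending_tx >= h->max_num_pending_tx)
		return -1;	/* -EBUSY in the driver */
	h->num_pending_tx++;
	if (h->num_pending_tx == h->max_num_pending_tx) {
		h->needs_unlock = true;
		h->queue_stopped = true;
	}
	return 0;
}

/* Mirrors ath10k_htt_tx_dec_pending(): restart only at or below max/4. */
static void dec_pending(struct htt_model *h)
{
	h->num_pending_tx--;
	if (h->num_pending_tx <= h->max_num_pending_tx / 4 && h->needs_unlock) {
		h->needs_unlock = false;
		h->queue_stopped = false;
	}
}
```

With max_num_pending_tx = 8, the queues stop at 8 pending frames and restart only once 2 or fewer remain, instead of the old behaviour of restarting as soon as the count drops to 7.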


2020-04-28 14:59:47

by Steve deRosier

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.

On Mon, Apr 27, 2020 at 7:54 AM <[email protected]> wrote:
>
> From: Ben Greear <[email protected]>
>
> While running tcp upload + download tests with ~200
> concurrent TCP streams, 1-2 processes, and 30 station
> vdevs, I noticed that the __ieee80211_stop_queue was taking
> around 20% of the CPU according to perf-top, with other locking
> taking an additional ~15%.
>
> I believe the issue is that the ath10k driver would unlock the
> txqueue when a single frame could be transmitted, instead of
> waiting for a low water mark.
>
> So, this patch adds a low-water mark that is 1/4 of the total
> tx buffers allowed.
>
> This appears to resolve the performance problem that I saw.
>
> Tested with recent wave-1 ath10k-ct firmware.
>

Hey Ben,

Did you do any testing with this patch around latency? The nature of
the thing that you fixed makes me wonder if it was intentional with
respect to making WiFi fast - ie getting rid of buffers as much as
possible. Obviously the CPU impact is likely to be an unintended
consequence. In any case, I don't know anything for sure, it was just
a thought that went through my head when reading this.


> Signed-off-by: Ben Greear <[email protected]>
> ---
> drivers/net/wireless/ath/ath10k/htt.h | 1 +
> drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
> 2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
> index 31c4ddbf45cb..b5634781c0dc 100644
> --- a/drivers/net/wireless/ath/ath10k/htt.h
> +++ b/drivers/net/wireless/ath/ath10k/htt.h
> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>
> u8 target_version_major;
> u8 target_version_minor;
> + bool needs_unlock;
> struct completion target_version_received;
> u8 max_num_amsdu;
> u8 max_num_ampdu;
> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
> index 9b3c3b080e92..44795d9a7c0c 100644
> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
> lockdep_assert_held(&htt->tx_lock);
>
> htt->num_pending_tx--;
> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
> + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
> + htt->needs_unlock = false;
> ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
> + }
> }
>
> int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
> @@ -157,8 +159,10 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
> return -EBUSY;
>
> htt->num_pending_tx++;
> - if (htt->num_pending_tx == htt->max_num_pending_tx)
> + if (htt->num_pending_tx == htt->max_num_pending_tx) {
> + htt->needs_unlock = true;
> ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
> + }
>
> return 0;
> }
> --
> 2.20.1
>

2020-04-28 14:59:47

by Ben Greear

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.



On 04/28/2020 07:56 AM, Steve deRosier wrote:
> On Mon, Apr 27, 2020 at 7:54 AM <[email protected]> wrote:
>>
>> From: Ben Greear <[email protected]>
>>
>> While running tcp upload + download tests with ~200
>> concurrent TCP streams, 1-2 processes, and 30 station
>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>> around 20% of the CPU according to perf-top, with other locking
>> taking an additional ~15%.
>>
>> I believe the issue is that the ath10k driver would unlock the
>> txqueue when a single frame could be transmitted, instead of
>> waiting for a low water mark.
>>
>> So, this patch adds a low-water mark that is 1/4 of the total
>> tx buffers allowed.
>>
>> This appears to resolve the performance problem that I saw.
>>
>> Tested with recent wave-1 ath10k-ct firmware.
>>
>
> Hey Ben,
>
> Did you do any testing with this patch around latency? The nature of
> the thing that you fixed makes me wonder if it was intentional with
> respect to making WiFi fast - ie getting rid of buffers as much as
> possible. Obviously the CPU impact is likely to be an unintended
> consequence. In any case, I don't know anything for sure, it was just
> a thought that went through my head when reading this.

I did not, but on average my patch should make the queues be less full,
so I doubt it will hurt latency.

Thanks,
Ben

>
>
>> Signed-off-by: Ben Greear <[email protected]>
>> ---
>> drivers/net/wireless/ath/ath10k/htt.h | 1 +
>> drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>> 2 files changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>> index 31c4ddbf45cb..b5634781c0dc 100644
>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>
>> u8 target_version_major;
>> u8 target_version_minor;
>> + bool needs_unlock;
>> struct completion target_version_received;
>> u8 max_num_amsdu;
>> u8 max_num_ampdu;
>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>> index 9b3c3b080e92..44795d9a7c0c 100644
>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>> lockdep_assert_held(&htt->tx_lock);
>>
>> htt->num_pending_tx--;
>> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>> + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>> + htt->needs_unlock = false;
>> ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>> + }
>> }
>>
>> int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
>> @@ -157,8 +159,10 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
>> return -EBUSY;
>>
>> htt->num_pending_tx++;
>> - if (htt->num_pending_tx == htt->max_num_pending_tx)
>> + if (htt->num_pending_tx == htt->max_num_pending_tx) {
>> + htt->needs_unlock = true;
>> ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>> + }
>>
>> return 0;
>> }
>> --
>> 2.20.1
>>
>

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-04-28 16:28:21

by Dave Taht

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.

On Tue, Apr 28, 2020 at 7:56 AM Steve deRosier <[email protected]> wrote:
>
> On Mon, Apr 27, 2020 at 7:54 AM <[email protected]> wrote:
> >
> > From: Ben Greear <[email protected]>
> >
> > While running tcp upload + download tests with ~200
> > concurrent TCP streams, 1-2 processes, and 30 station
> > vdevs, I noticed that the __ieee80211_stop_queue was taking
> > around 20% of the CPU according to perf-top, with other locking
> > taking an additional ~15%.
> >
> > I believe the issue is that the ath10k driver would unlock the
> > txqueue when a single frame could be transmitted, instead of
> > waiting for a low water mark.
> >
> > So, this patch adds a low-water mark that is 1/4 of the total
> > tx buffers allowed.
> >
> > This appears to resolve the performance problem that I saw.
> >
> > Tested with recent wave-1 ath10k-ct firmware.
> >
>
> Hey Ben,
>
> Did you do any testing with this patch around latency? The nature of
> the thing that you fixed makes me wonder if it was intentional with
> respect to making WiFi fast - ie getting rid of buffers as much as
> possible. Obviously the CPU impact is likely to be an unintended
> consequence. In any case, I don't know anything for sure, it was just
> a thought that went through my head when reading this.

I note that I'd prefer a BQL-like high/low watermark approach in
general... bytes, not packets, or better, perceived
airtime in a revision of AQL...

... but we'll try this patch, thx!

>
> > Signed-off-by: Ben Greear <[email protected]>
> > ---
> > drivers/net/wireless/ath/ath10k/htt.h | 1 +
> > drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
> > 2 files changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
> > index 31c4ddbf45cb..b5634781c0dc 100644
> > --- a/drivers/net/wireless/ath/ath10k/htt.h
> > +++ b/drivers/net/wireless/ath/ath10k/htt.h
> > @@ -1941,6 +1941,7 @@ struct ath10k_htt {
> >
> > u8 target_version_major;
> > u8 target_version_minor;
> > + bool needs_unlock;
> > struct completion target_version_received;
> > u8 max_num_amsdu;
> > u8 max_num_ampdu;
> > diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
> > index 9b3c3b080e92..44795d9a7c0c 100644
> > --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> > +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> > @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
> > lockdep_assert_held(&htt->tx_lock);
> >
> > htt->num_pending_tx--;
> > - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
> > + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
> > + htt->needs_unlock = false;
> > ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
> > + }
> > }
> >
> > int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
> > @@ -157,8 +159,10 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
> > return -EBUSY;
> >
> > htt->num_pending_tx++;
> > - if (htt->num_pending_tx == htt->max_num_pending_tx)
> > + if (htt->num_pending_tx == htt->max_num_pending_tx) {
> > + htt->needs_unlock = true;
> > ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
> > + }
> >
> > return 0;
> > }
> > --
> > 2.20.1
> >



--
Make Music, Not War

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-435-0729

2020-04-28 16:36:39

by Ben Greear

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.



On 04/28/2020 09:27 AM, Dave Taht wrote:
> On Tue, Apr 28, 2020 at 7:56 AM Steve deRosier <[email protected]> wrote:
>>
>> On Mon, Apr 27, 2020 at 7:54 AM <[email protected]> wrote:
>>>
>>> From: Ben Greear <[email protected]>
>>>
>>> While running tcp upload + download tests with ~200
>>> concurrent TCP streams, 1-2 processes, and 30 station
>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>> around 20% of the CPU according to perf-top, with other locking
>>> taking an additional ~15%.
>>>
>>> I believe the issue is that the ath10k driver would unlock the
>>> txqueue when a single frame could be transmitted, instead of
>>> waiting for a low water mark.
>>>
>>> So, this patch adds a low-water mark that is 1/4 of the total
>>> tx buffers allowed.
>>>
>>> This appears to resolve the performance problem that I saw.
>>>
>>> Tested with recent wave-1 ath10k-ct firmware.
>>>
>>
>> Hey Ben,
>>
>> Did you do any testing with this patch around latency? The nature of
>> the thing that you fixed makes me wonder if it was intentional with
>> respect to making WiFi fast - ie getting rid of buffers as much as
>> possible. Obviously the CPU impact is likely to be an unintended
>> consequence. In any case, I don't know anything for sure, it was just
>> a thought that went through my head when reading this.
>
> I note that I'd prefer a BQL-like high/low watermark approach in
> general... bytes, not packets, or better, perceived
> airtime in a revision of AQL...
>
> ... but we'll try this patch, thx!

Is there a nice diagram somewhere that shows where the various
buffer-bloat technologies sit in the stack? I suspect such should
be above the txqueue start/stop logic in the driver that I mucked
with, and certainly the old behaviour has no obvious tie-in with
any higher-level algorithms.

I doubt my patch will change much except in pathological cases where
the system is transmitting frames fast enough to fully fill the tx buffers
and also loaded in such a way that it can process just a few tx frames
at a time to keep bouncing to full and not full over and over.
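That bouncing can be made concrete with a small userspace model. All numbers here are hypothetical, not measured from the driver: a producer refills the queue whenever it is running, and the completion path drains a fixed burst of frames per round. Counting restarts shows why the single-slot restart policy thrashes:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Count how often the tx queues get restarted when a producer refills
 * the queue whenever it is running and completions drain 'burst'
 * frames per round. 'low_water' is the restart threshold: max - 1
 * models the old single-slot behaviour, max / 4 the patched one.
 */
static int count_restarts(int max, int low_water, int burst, int rounds)
{
	int pending = 0, restarts = 0;
	bool stopped = false;

	for (int r = 0; r < rounds; r++) {
		while (!stopped && pending < max)	/* producer fills up */
			if (++pending == max)
				stopped = true;
		/* completion path drains a burst of frames */
		for (int i = 0; i < burst && pending > 0; i++) {
			pending--;
			if (stopped && pending <= low_water) {
				stopped = false;
				restarts++;
			}
		}
	}
	return restarts;
}
```

With max = 1424 and 32 completions per round, the old policy restarts the queues on every single round, while the 1/4 low-water mark restarts them only after roughly a thousand frames have completed.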

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-04-28 17:11:57

by Dave Taht

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.

On Tue, Apr 28, 2020 at 9:35 AM Ben Greear <[email protected]> wrote:
>
>
>
> On 04/28/2020 09:27 AM, Dave Taht wrote:
> > On Tue, Apr 28, 2020 at 7:56 AM Steve deRosier <[email protected]> wrote:
> >>
> >> On Mon, Apr 27, 2020 at 7:54 AM <[email protected]> wrote:
> >>>
> >>> From: Ben Greear <[email protected]>
> >>>
> >>> While running tcp upload + download tests with ~200
> >>> concurrent TCP streams, 1-2 processes, and 30 station
> >>> vdevs, I noticed that the __ieee80211_stop_queue was taking
> >>> around 20% of the CPU according to perf-top, with other locking
> >>> taking an additional ~15%.
> >>>
> >>> I believe the issue is that the ath10k driver would unlock the
> >>> txqueue when a single frame could be transmitted, instead of
> >>> waiting for a low water mark.
> >>>
> >>> So, this patch adds a low-water mark that is 1/4 of the total
> >>> tx buffers allowed.
> >>>
> >>> This appears to resolve the performance problem that I saw.
> >>>
> >>> Tested with recent wave-1 ath10k-ct firmware.
> >>>
> >>
> >> Hey Ben,
> >>
> >> Did you do any testing with this patch around latency? The nature of
> >> the thing that you fixed makes me wonder if it was intentional with
> >> respect to making WiFi fast - ie getting rid of buffers as much as
> >> possible. Obviously the CPU impact is likely to be an unintended
> >> consequence. In any case, I don't know anything for sure, it was just
> >> a thought that went through my head when reading this.
> >
> > I note that I'd prefer a BQL-like high/low watermark approach in
> > general... bytes, not packets, or better, perceived
> > airtime in a revision of AQL...
> >
> > ... but we'll try this patch, thx!
>
> Is there a nice diagram somewhere that shows where the various
> buffer-bloat technologies sit in the stack? I suspect such should
> be above the txqueue start/stop logic in the driver that I mucked
> with, and certainly the old behaviour has no obvious tie-in with
> any higher-level algorithms.

There are some good diagrams of the new queue stuff buried in toke's
book and online papers, notably "ending the anomaly"

https://bufferbloat-and-beyond.net/

Plug: They just did a print run, and it makes for good bathroom
reading. There's also a preso on it around here somewhere.

That said, let's see... there's some rants in this:
http://flent-fremont.bufferbloat.net/~d/broadcom_aug9.pdf and here
... https://conferences.sigcomm.org/sigcomm/2014/doc/slides/137.pdf
but that's mostly about what was wrong at the time.

Definitely! Revising this piece would be a good idea in light of
modern developments and increased knowledge.
https://www.linuxjournal.com/content/queueing-linux-network-stack

IMHO "how to use AQL" is underdocumented at the moment. I'd hoped to
produce some documentation after we successfully got the iwl drivers
to use it, but we haven't got around to it, and merely getting the
ath10k to use it (really really) well has eaten into my ax200 time...


>
> I doubt my patch will change much except in pathological cases where
> the system is transmitting frames fast enough to fully fill the tx buffers
> and also loaded in such a way that it can process just a few tx frames
> at a time to keep bouncing to full and not full over and over.
>
> Thanks,
> Ben
>
> --
> Ben Greear <[email protected]>
> Candela Technologies Inc http://www.candelatech.com



--
Make Music, Not War

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-435-0729

2020-04-28 19:44:13

by Toke Høiland-Jørgensen

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.

Ben Greear <[email protected]> writes:

> On 04/28/2020 09:27 AM, Dave Taht wrote:
>> On Tue, Apr 28, 2020 at 7:56 AM Steve deRosier <[email protected]> wrote:
>>>
>>> On Mon, Apr 27, 2020 at 7:54 AM <[email protected]> wrote:
>>>>
>>>> From: Ben Greear <[email protected]>
>>>>
>>>> While running tcp upload + download tests with ~200
>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>> around 20% of the CPU according to perf-top, with other locking
>>>> taking an additional ~15%.
>>>>
>>>> I believe the issue is that the ath10k driver would unlock the
>>>> txqueue when a single frame could be transmitted, instead of
>>>> waiting for a low water mark.
>>>>
>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>> tx buffers allowed.
>>>>
>>>> This appears to resolve the performance problem that I saw.
>>>>
>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>
>>>
>>> Hey Ben,
>>>
>>> Did you do any testing with this patch around latency? The nature of
>>> the thing that you fixed makes me wonder if it was intentional with
>>> respect to making WiFi fast - ie getting rid of buffers as much as
>>> possible. Obviously the CPU impact is likely to be an unintended
>>> consequence. In any case, I don't know anything for sure, it was just
>>> a thought that went through my head when reading this.
>>
>> I note that I'd prefer a BQL-like high/low watermark approach in
>> general... bytes, not packets, or better, perceived
>> airtime in a revision of AQL...
>>
>> ... but we'll try this patch, thx!
>
> Is there a nice diagram somewhere that shows where the various
> buffer-bloat technologies sit in the stack?

Not really. Best thing I know of is the one I drew myself: Figure 3 of this paper:
https://www.usenix.org/system/files/conference/atc17/atc17-hoiland-jorgensen.pdf

That is still a semi-accurate representation of the queueing structure
in mac80211. For ath10k, just imagine that the bit that says "ath9k
driver" is part of mac80211, and that the "HW queue" is everything the
driver and firmware does... AQL also activates in the circle labelled
"RR" there, but the figure predates AQL.

> I suspect such should be above the txqueue start/stop logic in the
> driver that I mucked with, and certainly the old behaviour has no
> obvious tie-in with any higher-level algorithms.
>
> I doubt my patch will change much except in pathological cases where
> the system is transmitting frames fast enough to fully fill the tx
> buffers and also loaded in such a way that it can process just a few
> tx frames at a time to keep bouncing to full and not full over and
> over.

The latter part of that ("can process just a few tx frames at a time")
mostly happens when many stations are active at the same time, right?

-Toke

2020-04-28 19:47:04

by Toke Høiland-Jørgensen

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.

[email protected] writes:

> From: Ben Greear <[email protected]>
>
> While running tcp upload + download tests with ~200
> concurrent TCP streams, 1-2 processes, and 30 station
> vdevs, I noticed that the __ieee80211_stop_queue was taking
> around 20% of the CPU according to perf-top, with other locking
> taking an additional ~15%.
>
> I believe the issue is that the ath10k driver would unlock the
> txqueue when a single frame could be transmitted, instead of
> waiting for a low water mark.
>
> So, this patch adds a low-water mark that is 1/4 of the total
> tx buffers allowed.
>
> This appears to resolve the performance problem that I saw.
>
> Tested with recent wave-1 ath10k-ct firmware.
>
> Signed-off-by: Ben Greear <[email protected]>
> ---
> drivers/net/wireless/ath/ath10k/htt.h | 1 +
> drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
> 2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
> index 31c4ddbf45cb..b5634781c0dc 100644
> --- a/drivers/net/wireless/ath/ath10k/htt.h
> +++ b/drivers/net/wireless/ath/ath10k/htt.h
> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>
> u8 target_version_major;
> u8 target_version_minor;
> + bool needs_unlock;
> struct completion target_version_received;
> u8 max_num_amsdu;
> u8 max_num_ampdu;
> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
> index 9b3c3b080e92..44795d9a7c0c 100644
> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
> lockdep_assert_held(&htt->tx_lock);
>
> htt->num_pending_tx--;
> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
> + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {

Why /4? Seems a bit arbitrary?

What's a typical value of max_num_pending_tx?

-Toke

2020-04-28 19:47:04

by Toke Høiland-Jørgensen

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.

Ben Greear <[email protected]> writes:

> On 04/28/2020 07:56 AM, Steve deRosier wrote:
>> On Mon, Apr 27, 2020 at 7:54 AM <[email protected]> wrote:
>>>
>>> From: Ben Greear <[email protected]>
>>>
>>> While running tcp upload + download tests with ~200
>>> concurrent TCP streams, 1-2 processes, and 30 station
>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>> around 20% of the CPU according to perf-top, with other locking
>>> taking an additional ~15%.
>>>
>>> I believe the issue is that the ath10k driver would unlock the
>>> txqueue when a single frame could be transmitted, instead of
>>> waiting for a low water mark.
>>>
>>> So, this patch adds a low-water mark that is 1/4 of the total
>>> tx buffers allowed.
>>>
>>> This appears to resolve the performance problem that I saw.
>>>
>>> Tested with recent wave-1 ath10k-ct firmware.
>>>
>>
>> Hey Ben,
>>
>> Did you do any testing with this patch around latency? The nature of
>> the thing that you fixed makes me wonder if it was intentional with
>> respect to making WiFi fast - ie getting rid of buffers as much as
>> possible. Obviously the CPU impact is likely to be an unintended
>> consequence. In any case, I don't know anything for sure, it was just
>> a thought that went through my head when reading this.
>
> I did not, but on average my patch should make the queues be less full,
> so I doubt it will hurt latency.

I would tend to agree with that.

-Toke

>>> Signed-off-by: Ben Greear <[email protected]>
>>> ---
>>> drivers/net/wireless/ath/ath10k/htt.h | 1 +
>>> drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>> 2 files changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>
>>> u8 target_version_major;
>>> u8 target_version_minor;
>>> + bool needs_unlock;
>>> struct completion target_version_received;
>>> u8 max_num_amsdu;
>>> u8 max_num_ampdu;
>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>> lockdep_assert_held(&htt->tx_lock);
>>>
>>> htt->num_pending_tx--;
>>> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>> + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>> + htt->needs_unlock = false;
>>> ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>>> + }
>>> }
>>>
>>> int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
>>> @@ -157,8 +159,10 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
>>> return -EBUSY;
>>>
>>> htt->num_pending_tx++;
>>> - if (htt->num_pending_tx == htt->max_num_pending_tx)
>>> + if (htt->num_pending_tx == htt->max_num_pending_tx) {
>>> + htt->needs_unlock = true;
>>> ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>>> + }
>>>
>>> return 0;
>>> }
>>> --
>>> 2.20.1
>>>
>>
>
> --
> Ben Greear <[email protected]>
> Candela Technologies Inc http://www.candelatech.com

2020-04-28 19:52:09

by Ben Greear

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.



On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
> [email protected] writes:
>
>> From: Ben Greear <[email protected]>
>>
>> While running tcp upload + download tests with ~200
>> concurrent TCP streams, 1-2 processes, and 30 station
>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>> around 20% of the CPU according to perf-top, with other locking
>> taking an additional ~15%.
>>
>> I believe the issue is that the ath10k driver would unlock the
>> txqueue when a single frame could be transmitted, instead of
>> waiting for a low water mark.
>>
>> So, this patch adds a low-water mark that is 1/4 of the total
>> tx buffers allowed.
>>
>> This appears to resolve the performance problem that I saw.
>>
>> Tested with recent wave-1 ath10k-ct firmware.
>>
>> Signed-off-by: Ben Greear <[email protected]>
>> ---
>> drivers/net/wireless/ath/ath10k/htt.h | 1 +
>> drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>> 2 files changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>> index 31c4ddbf45cb..b5634781c0dc 100644
>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>
>> u8 target_version_major;
>> u8 target_version_minor;
>> + bool needs_unlock;
>> struct completion target_version_received;
>> u8 max_num_amsdu;
>> u8 max_num_ampdu;
>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>> index 9b3c3b080e92..44795d9a7c0c 100644
>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>> lockdep_assert_held(&htt->tx_lock);
>>
>> htt->num_pending_tx--;
>> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>> + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>
> Why /4? Seems a bit arbitrary?

Yes, arbitrary for sure. I figured we should restart filling the queue when it
is 1/4 full, so that it is unlikely to run dry. Possibly it should restart
sooner to keep it more full on average?

Before my patch, the behaviour would be to try to keep it as full as possible, as in
restart the queues as soon as a single slot opens up in the tx queue.

>
> What's a typical value of max_num_pending_tx?

1424
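To make the thresholds concrete (simple arithmetic using the value quoted above; the helper name is illustrative, not from the driver):

```c
#include <assert.h>

/* The patch's restart threshold, as an illustrative helper. */
static int low_water_mark(int max_num_pending_tx)
{
	return max_num_pending_tx / 4;
}
```

With max_num_pending_tx = 1424 the queues stop at 1424 pending frames and restart at 356, so at least 1068 frames complete between a stop and the following restart.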

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-04-28 20:40:26

by Toke Høiland-Jørgensen

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.

Ben Greear <[email protected]> writes:

> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>> [email protected] writes:
>>
>>> From: Ben Greear <[email protected]>
>>>
>>> While running tcp upload + download tests with ~200
>>> concurrent TCP streams, 1-2 processes, and 30 station
>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>> around 20% of the CPU according to perf-top, with other locking
>>> taking an additional ~15%.
>>>
>>> I believe the issue is that the ath10k driver would unlock the
>>> txqueue when a single frame could be transmitted, instead of
>>> waiting for a low water mark.
>>>
>>> So, this patch adds a low-water mark that is 1/4 of the total
>>> tx buffers allowed.
>>>
>>> This appears to resolve the performance problem that I saw.
>>>
>>> Tested with recent wave-1 ath10k-ct firmware.
>>>
>>> Signed-off-by: Ben Greear <[email protected]>
>>> ---
>>> drivers/net/wireless/ath/ath10k/htt.h | 1 +
>>> drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>> 2 files changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>
>>> u8 target_version_major;
>>> u8 target_version_minor;
>>> + bool needs_unlock;
>>> struct completion target_version_received;
>>> u8 max_num_amsdu;
>>> u8 max_num_ampdu;
>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>> lockdep_assert_held(&htt->tx_lock);
>>>
>>> htt->num_pending_tx--;
>>> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>> + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>
>> Why /4? Seems a bit arbitrary?
>
> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
> full so that it is unlikely to run dry. Possibly it should restart
> sooner to keep it more full on average?

Theoretically, the "keep the queue at the lowest possible level that
keeps it from underflowing" is what BQL is supposed to do. The diff
below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
place of num_pending_tx. I've only compile tested it, and I'm a bit
skeptical that it will work right for this, but if anyone wants to give
it a shot, there it is.

BTW, while doing that, I noticed there's a similar arbitrary limit in
ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
to keep the arbitrary limit maybe use the same one? :)

> Before my patch, the behaviour would be to try to keep it as full as
> possible, as in restart the queues as soon as a single slot opens up
> in the tx queue.

Yeah, that seems somewhat misguided as well, from a latency perspective,
at least. But I guess that's what we're fixing with AQL. What does the
firmware do with the frames queued within? Do they just go on a FIFO
queue altogether, or something smarter?

-Toke




diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
index f26cc6989dad..72771ff38a94 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -2497,6 +2497,10 @@ static int ath10k_core_init_firmware_features(struct ath10k *ar)
return -EINVAL;
}

+ dql_init(&ar->htt.dql, HZ);
+ ar->htt.dql.max_limit = ar->htt.max_num_pending_tx;
+ ar->htt.dql.min_limit = 8;
+
if (ar->hw_params.num_peers)
ar->max_num_peers = ar->hw_params.num_peers;
else
diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
index 4a12564fc30e..19024d063896 100644
--- a/drivers/net/wireless/ath/ath10k/htt.h
+++ b/drivers/net/wireless/ath/ath10k/htt.h
@@ -13,6 +13,7 @@
#include <linux/dmapool.h>
#include <linux/hashtable.h>
#include <linux/kfifo.h>
+#include <linux/dynamic_queue_limits.h>
#include <net/mac80211.h>

#include "htc.h"
@@ -1965,8 +1966,8 @@ struct ath10k_htt {
/* Protects access to pending_tx, num_pending_tx */
spinlock_t tx_lock;
int max_num_pending_tx;
- int num_pending_tx;
int num_pending_mgmt_tx;
+ struct dql dql;
struct idr pending_tx;
wait_queue_head_t empty_tx_wq;

diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
index e9d12ea708b6..911a79470bdf 100644
--- a/drivers/net/wireless/ath/ath10k/htt_tx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
@@ -144,8 +144,8 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
{
lockdep_assert_held(&htt->tx_lock);

- htt->num_pending_tx--;
- if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
+ dql_completed(&htt->dql, 1);
+ if (dql_avail(&htt->dql) > 0)
ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
}

@@ -153,11 +153,11 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
{
lockdep_assert_held(&htt->tx_lock);

- if (htt->num_pending_tx >= htt->max_num_pending_tx)
+ if (dql_avail(&htt->dql) <= 0)
return -EBUSY;

- htt->num_pending_tx++;
- if (htt->num_pending_tx == htt->max_num_pending_tx)
+ dql_queued(&htt->dql, 1);
+ if (dql_avail(&htt->dql) <= 0)
ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);

return 0;
diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index 2d03b8dd3b8c..1fe251742b0a 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -3998,7 +3998,7 @@ static bool ath10k_mac_tx_can_push(struct ieee80211_hw *hw,
if (ar->htt.tx_q_state.mode == HTT_TX_MODE_SWITCH_PUSH)
return true;

- if (ar->htt.num_pending_tx < ar->htt.tx_q_state.num_push_allowed)
+ if (dql_avail(&ar->htt.dql) < ar->htt.tx_q_state.num_push_allowed)
return true;

if (artxq->num_fw_queued < artxq->num_push_allowed)
@@ -4159,7 +4159,7 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
if (ar->htt.tx_q_state.mode != HTT_TX_MODE_SWITCH_PUSH)
return;

- if (ar->htt.num_pending_tx >= (ar->htt.max_num_pending_tx / 2))
+ if (dql_avail(&ar->htt.dql) < (ar->htt.dql.limit / 2))
return;

rcu_read_lock();
@@ -7160,7 +7160,7 @@ void ath10k_mac_wait_tx_complete(struct ath10k *ar)
bool empty;

spin_lock_bh(&ar->htt.tx_lock);
- empty = (ar->htt.num_pending_tx == 0);
+ empty = (ar->htt.dql.num_completed == ar->htt.dql.num_queued);
spin_unlock_bh(&ar->htt.tx_lock);

skip = (ar->state == ATH10K_STATE_WEDGED) ||
diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
index 39abf8b12903..fe7cd53c2bf9 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -80,7 +80,7 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,

ath10k_htt_tx_free_msdu_id(htt, tx_done->msdu_id);
ath10k_htt_tx_dec_pending(htt);
- if (htt->num_pending_tx == 0)
+ if (htt->dql.num_completed == htt->dql.num_queued)
wake_up(&htt->empty_tx_wq);
spin_unlock_bh(&htt->tx_lock);

2020-04-28 21:21:49

by Ben Greear

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.



On 04/28/2020 01:39 PM, Toke Høiland-Jørgensen wrote:
> Ben Greear <[email protected]> writes:
>
>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>> [email protected] writes:
>>>
>>>> [... quoted patch trimmed ...]
>>>> htt->num_pending_tx--;
>>>> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>> + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>
>>> Why /4? Seems a bit arbitrary?
>>
>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>> full so that it is unlikely to run dry. Possibly it should restart
>> sooner to keep it more full on average?
>
> Theoretically, the "keep the queue at the lowest possible level that
> keeps it from underflowing" is what BQL is supposed to do. The diff
> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
> place of num_pending_tx. I've only compile tested it, and I'm a bit
> skeptical that it will work right for this, but if anyone wants to give
> it a shot, there it is.
>
> BTW, while doing that, I noticed there's a similar arbitrary limit in
> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
> to keep the arbitrary limit maybe use the same one? :)
>
>> Before my patch, the behaviour would be to try to keep it as full as
>> possible, as in restart the queues as soon as a single slot opens up
>> in the tx queue.
>
> Yeah, that seems somewhat misguided as well, from a latency perspective,
> at least. But I guess that's what we're fixing with AQL. What does the
> firmware do with the frames queued within? Do they just go on a FIFO
> queue altogether, or something smarter?

Sort of like a mini-mac80211 stack inside the firmware is used to
create ampdu/amsdu chains and schedule them with its own scheduler.

For optimal throughput with 200 users streaming video,
the ath10k driver should think that it has only a few active peers wanting
to send data at a time (and so firmware would think the same), and the driver should
be fed a large chunk of pkts for those peers. And then the next few peers.
That should let firmware send large ampdu/amsdu to each peer, increasing throughput
over all.

If you feed a few frames to each of the 200 peers, then even if firmware has 2000
tx buffers, that is only 10 frames per peer at best, leading to small ampdu/amsdu
and thus worse over-all throughput and utilization of airtime.
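The buffer-sharing arithmetic above can be made explicit (a sketch; the 64-frame A-MPDU cap is a typical hardware limit assumed here for illustration, not a number from the mail):

```c
/* Evenly spreading the firmware's tx buffers across all active peers
 * bounds the achievable A-MPDU size per peer. */
static unsigned int max_ampdu_per_peer(unsigned int tx_buffers,
				       unsigned int peers,
				       unsigned int hw_ampdu_cap)
{
	unsigned int share = peers ? tx_buffers / peers : tx_buffers;

	return share < hw_ampdu_cap ? share : hw_ampdu_cap;
}
```

So with 2000 buffers and 200 backlogged peers, each peer can aggregate at most 10 frames, regardless of the hardware cap.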

It would be nice to be able to set certain traffic flows to have the
throughput optimization and others to have the latency optimization.
For instance, high latency on a streaming download is a good trade-off
if it increases total throughput. The end device will have plenty of
buffers to handle the bursts of data.

And of course other traffic will benefit from lower latency.

Maybe some of the AI folks training their AI to categorize cat pictures could
instead start categorizing traffic flows and adjusting the stack realtime...

And now...back to the grind for me.

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-04-29 09:32:15

by Toke Høiland-Jørgensen

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.

Ben Greear <[email protected]> writes:

> On 04/28/2020 01:39 PM, Toke Høiland-Jørgensen wrote:
>> Ben Greear <[email protected]> writes:
>>
>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>>> [email protected] writes:
>>>>
>>>>> [... quoted patch trimmed ...]
>>>>> htt->num_pending_tx--;
>>>>> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>> + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>>
>>>> Why /4? Seems a bit arbitrary?
>>>
>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>>> full so that it is unlikely to run dry. Possibly it should restart
>>> sooner to keep it more full on average?
>>
>> Theoretically, the "keep the queue at the lowest possible level that
>> keeps it from underflowing" is what BQL is supposed to do. The diff
>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
>> place of num_pending_tx. I've only compile tested it, and I'm a bit
>> skeptical that it will work right for this, but if anyone wants to give
>> it a shot, there it is.
>>
>> BTW, while doing that, I noticed there's a similar arbitrary limit in
>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
>> to keep the arbitrary limit maybe use the same one? :)
>>
>>> Before my patch, the behaviour would be to try to keep it as full as
>>> possible, as in restart the queues as soon as a single slot opens up
>>> in the tx queue.
>>
>> Yeah, that seems somewhat misguided as well, from a latency perspective,
>> at least. But I guess that's what we're fixing with AQL. What does the
>> firmware do with the frames queued within? Do they just go on a FIFO
>> queue altogether, or something smarter?
>
> Sort of like a mini-mac80211 stack inside the firmware is used to
> create ampdu/amsdu chains and schedule them with its own scheduler.
>
> For optimal throughput with 200 users steaming video,
> the ath10k driver should think that it has only a few active peers wanting
> to send data at a time (and so firmware would think the same), and the driver should
> be fed a large chunk of pkts for those peers. And then the next few peers.
> That should let firmware send large ampdu/amsdu to each peer, increasing throughput
> over all.

Yes, but also increasing latency because all other stations have to wait
for a longer TXOP (see my other reply).

> If you feed a few frames to each of the 200 peers, then even if
> firmware has 2000 tx buffers, that is only 10 frames per peer at best,
> leading to small ampdu/amsdu and thus worse over-all throughput and
> utilization of airtime.

Do you have any data on exactly how long (in time) each txop becomes in
these highly congested scenarios?

> It would be nice to be able to set certain traffic flows to have the
> throughput optimization and others to have the latency optimization.
> For instance, high latency on a streaming download is a good trade-off
> if it increases total throughput.

For the individual flows to a peer, fq_codel actually does a pretty good
job at putting the latency-sensitive flows first. Which is why we want
the queueing to happen in mac80211 (where fq_codel is active) instead of
in the firmware.

> Maybe some of the AI folks training their AI to categorize cat
> pictures could instead start categorizing traffic flows and adjusting
> the stack realtime...

I know there are people trying to classify traffic with machine learning
(although usually for more nefarious purposes), but I'm not sure it's
feasible to do in a queue management algorithm (if at all). :)

-Toke

2020-04-29 13:55:13

by Ben Greear

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.



On 04/29/2020 02:28 AM, Toke Høiland-Jørgensen wrote:
> Ben Greear <[email protected]> writes:
>
>> On 04/28/2020 01:39 PM, Toke Høiland-Jørgensen wrote:
>>> Ben Greear <[email protected]> writes:
>>>
>>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>>>> [email protected] writes:
>>>>>
>>>>>> [... quoted patch trimmed ...]
>>>>>> htt->num_pending_tx--;
>>>>>> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>>> + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>>>
>>>>> Why /4? Seems a bit arbitrary?
>>>>
>>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>>>> full so that it is unlikely to run dry. Possibly it should restart
>>>> sooner to keep it more full on average?
>>>
>>> Theoretically, the "keep the queue at the lowest possible level that
>>> keeps it from underflowing" is what BQL is supposed to do. The diff
>>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
>>> place of num_pending_tx. I've only compile tested it, and I'm a bit
>>> skeptical that it will work right for this, but if anyone wants to give
>>> it a shot, there it is.
>>>
>>> BTW, while doing that, I noticed there's a similar arbitrary limit in
>>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
>>> to keep the arbitrary limit maybe use the same one? :)
>>>
>>>> Before my patch, the behaviour would be to try to keep it as full as
>>>> possible, as in restart the queues as soon as a single slot opens up
>>>> in the tx queue.
>>>
>>> Yeah, that seems somewhat misguided as well, from a latency perspective,
>>> at least. But I guess that's what we're fixing with AQL. What does the
>>> firmware do with the frames queued within? Do they just go on a FIFO
>>> queue altogether, or something smarter?
>>
>> Sort of like a mini-mac80211 stack inside the firmware is used to
>> create ampdu/amsdu chains and schedule them with its own scheduler.
>>
>> For optimal throughput with 200 users steaming video,
>> the ath10k driver should think that it has only a few active peers wanting
>> to send data at a time (and so firmware would think the same), and the driver should
>> be fed a large chunk of pkts for those peers. And then the next few peers.
>> That should let firmware send large ampdu/amsdu to each peer, increasing throughput
>> over all.
>
> Yes, but also increasing latency because all other stations have to wait
> for a longer TXOP (see my other reply).

If you sent at most 4 stations' worth of data to the firmware, and max is 4ms per
txop, then you have at most 16ms of latency. You could also send just two stations'
worth of data at a time; as long as you can quickly service the tx-queues again,
that should be enough to keep the firmware/radio productive.
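That bound is just the product of the two numbers, which is worth writing down since it recurs throughout the thread:

```c
/* Worst-case added latency when 'stations_fed' stations' worth of data
 * is in flight and each TXOP can last up to 'txop_ms' (4 ms here). */
static unsigned int worst_case_latency_ms(unsigned int stations_fed,
					  unsigned int txop_ms)
{
	return stations_fed * txop_ms;
}
```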

In the case where you have many users wanting lots of throughput, 8 or 16ms of extra
latency is a good tradeoff vs no one being able to reliably get the bandwidth they
need.

Higher priority TIDs will get precedence in ath10k firmware anyway, so even if at time 0
you sent 64 frames to a peer on back-ground TID, if you sent a VO frame at time 0+1,
it could be transmitted first.

>
>> If you feed a few frames to each of the 200 peers, then even if
>> firmware has 2000 tx buffers, that is only 10 frames per peer at best,
>> leading to small ampdu/amsdu and thus worse over-all throughput and
>> utilization of airtime.
>
> Do you have any data on exactly how long (in time) each txop becomes in
> these highly congested scenarios?

I didn't look at time, but avg packets-per-ampdu chain is 30+ in single station tests,
and with many stations it goes into the 4-8 range (from memory, so maybe I'm
off a bit).

Here is an open-source tool that can give you those metrics by processing a pcap:

https://github.com/greearb/lanforge-scripts/tree/master/wifi_diag

# Ignore the LANforge bits about creating capture files, here is an example of how to use it:
https://www.candelatech.com/cookbook/wifire/wifi+diagnostic


>
>> It would be nice to be able to set certain traffic flows to have the
>> throughput optimization and others to have the latency optimization.
>> For instance, high latency on a streaming download is a good trade-off
>> if it increases total throughput.
>
> For the individual flows to a peer, fq_codel actually does a pretty good
> job at putting the latency-sensitive flows first. Which is why we want
> the queueing to happen in mac80211 (where fq_codel is active) instead of
> in the firmware.

That sounds good to me. What is needed from the driver/firmware to make
this work well?

>> Maybe some of the AI folks training their AI to categorize cat
>> pictures could instead start categorizing traffic flows and adjusting
>> the stack realtime...
>
> I know there are people trying to classify traffic with machine learning
> (although usually for more nefarious purposes), but I'm not sure it's
> feasible to do in a queue management algorithm (if at all). :)

If you give an API (maybe using 'tc'?) that user-space can twiddle, then anyone
doing clever things can do it at higher levels. It could easily be that multiple
APs would gain benefit from being coordinated together for instance, and no local
queueing logic would be able to coordinate that by itself.

Thanks,
Ben

>
> -Toke
>

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-04-29 14:59:50

by Toke Høiland-Jørgensen

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.

Ben Greear <[email protected]> writes:

> On 04/29/2020 02:28 AM, Toke Høiland-Jørgensen wrote:
>> Ben Greear <[email protected]> writes:
>>
>>> On 04/28/2020 01:39 PM, Toke Høiland-Jørgensen wrote:
>>>> Ben Greear <[email protected]> writes:
>>>>
>>>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>>>>> [email protected] writes:
>>>>>>
>>>>>>> [... quoted patch trimmed ...]
>>>>>>> htt->num_pending_tx--;
>>>>>>> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>>>> + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>>>>
>>>>>> Why /4? Seems a bit arbitrary?
>>>>>
>>>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>>>>> full so that it is unlikely to run dry. Possibly it should restart
>>>>> sooner to keep it more full on average?
>>>>
>>>> Theoretically, the "keep the queue at the lowest possible level that
>>>> keeps it from underflowing" is what BQL is supposed to do. The diff
>>>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
>>>> place of num_pending_tx. I've only compile tested it, and I'm a bit
>>>> skeptical that it will work right for this, but if anyone wants to give
>>>> it a shot, there it is.
>>>>
>>>> BTW, while doing that, I noticed there's a similar arbitrary limit in
>>>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
>>>> to keep the arbitrary limit maybe use the same one? :)
>>>>
>>>>> Before my patch, the behaviour would be to try to keep it as full as
>>>>> possible, as in restart the queues as soon as a single slot opens up
>>>>> in the tx queue.
>>>>
>>>> Yeah, that seems somewhat misguided as well, from a latency perspective,
>>>> at least. But I guess that's what we're fixing with AQL. What does the
>>>> firmware do with the frames queued within? Do they just go on a FIFO
>>>> queue altogether, or something smarter?
>>>
>>> Sort of like a mini-mac80211 stack inside the firmware is used to
>>> create ampdu/amsdu chains and schedule them with its own scheduler.
>>>
>>> For optimal throughput with 200 users steaming video,
>>> the ath10k driver should think that it has only a few active peers wanting
>>> to send data at a time (and so firmware would think the same), and the driver should
>>> be fed a large chunk of pkts for those peers. And then the next few peers.
>>> That should let firmware send large ampdu/amsdu to each peer, increasing throughput
>>> over all.
>>
>> Yes, but also increasing latency because all other stations have to wait
>> for a longer TXOP (see my other reply).
>
> If you at most sent 4 station's worth of data to the firmware, and max
> is 4ms per txop, then you have at most 16ms of latency. You could also
> send just two station's worth of data at a time, as long as you can
> quickly service the tx-queues again that should be enough to keep the
> firmware/radio productive.

Sure, but if you have 100 stations that are backlogged, and they each
transmit for 4ms every time they get a chance, then on average each
station has to wait 400 ms between each TXOP. That is way too long; so
the maximum TXOP size should go down as the number of backlogged
stations goes up.
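The same arithmetic, sketched (the 1 ms floor in the second helper is an assumption for illustration, since a TXOP much shorter than that can barely aggregate):

```c
/* Average inter-TXOP wait for each of 'n' backlogged stations that each
 * hold the medium for 'txop_ms' per round-robin turn. */
static unsigned int avg_wait_ms(unsigned int n, unsigned int txop_ms)
{
	return n * txop_ms;
}

/* TXOP size needed to keep that wait under a latency target. */
static unsigned int txop_for_target_ms(unsigned int n, unsigned int target_ms)
{
	unsigned int t = n ? target_ms / n : target_ms;

	return t ? t : 1; /* assumed 1 ms floor to keep some aggregation */
}
```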

AQL already does some of the right thing, BTW: When the total outstanding
data queued in the firmware exceeds 24ms (by the AQL estimation), it'll
switch the per-station limit from 12ms to 5ms of queued data. Arguably
that could be lower (say, 2ms for the low per-station limit).
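The two-level limit described here can be sketched as follows (numbers are the ones given in the mail; the actual mac80211 implementation accounts airtime per AC and exposes these thresholds as tunables, which is not modeled here):

```c
/* AQL's per-station airtime budget as described above: when the total
 * outstanding airtime across all stations exceeds 24 ms, drop the
 * per-station budget from 12 ms to 5 ms. */
static unsigned int aql_station_budget_us(unsigned int total_pending_us)
{
	const unsigned int threshold_us = 24000;

	return total_pending_us > threshold_us ? 5000 : 12000;
}
```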

> In the case where you have many users wanting lots of throughput, 8 or
> 16ms of extra latency is a good tradeoff vs no one being able to
> reliably get the bandwidth they need.

Yes, 8-16ms of extra latency is likely worth it, but not much more than
that...

> Higher priority TIDs will get precedence in ath10k firmware anyway, so
> even if at time 0 you sent 64 frames to a peer on back-ground TID, if
> you sent a VO frame at time 0+1, it could be transmitted first.

Assuming anyone is actually using the priorities, that is; most
applications are not (and those that are often do it wrong). Also, using
the VO queue hurts efficiency for the same reason, since it
can't aggregate at all.

>>> If you feed a few frames to each of the 200 peers, then even if
>>> firmware has 2000 tx buffers, that is only 10 frames per peer at best,
>>> leading to small ampdu/amsdu and thus worse over-all throughput and
>>> utilization of airtime.
>>
>> Do you have any data on exactly how long (in time) each txop becomes in
>> these highly congested scenarios?
>
> I didn't look at time, but avg packets-per-ampdu chain is 30+ in
> single station tests, and with many stations it goes into the 4-8
> range (from memory, so maybe I'm off a bit).

Right; was just looking for rough numbers, so that is fine.

> Here is an open-source tool that can give you those metrics by processing a pcap:
>
> https://github.com/greearb/lanforge-scripts/tree/master/wifi_diag
>
> # Ignore the LANforge bits about creating capture files, here is an example of how to use it:
> https://www.candelatech.com/cookbook/wifire/wifi+diagnostic
>
>>
>>> It would be nice to be able to set certain traffic flows to have the
>>> throughput optimization and others to have the latency optimization.
>>> For instance, high latency on a streaming download is a good trade-off
>>> if it increases total throughput.
>>
>> For the individual flows to a peer, fq_codel actually does a pretty good
>> job at putting the latency-sensitive flows first. Which is why we want
>> the queueing to happen in mac80211 (where fq_codel is active) instead of
>> in the firmware.
>
> That sounds good to me. What is needed from the driver/firmware to
> make this work well?

To queue as little as possible :)

AQL is the mechanism we have to enforce this (for ath10k), by throttling
queueing into the firmware earlier. It only does this on a per-station
limit, though, and there's currently no mechanism to limit the total
number of stations with outstanding data, as you're requesting. Which I
guess means it won't help much in your case. One could imagine building
a mechanism on top of AQL to do this, though, although I think it may
not be quite trivial to get the interaction with the station scheduler
right. The basic building blocks are there, though...

-Toke

2020-04-29 15:28:03

by Ben Greear

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.



On 04/29/2020 07:56 AM, Toke Høiland-Jørgensen wrote:
> Ben Greear <[email protected]> writes:
>
>> On 04/29/2020 02:28 AM, Toke Høiland-Jørgensen wrote:
>>> Ben Greear <[email protected]> writes:
>>>
>>>> On 04/28/2020 01:39 PM, Toke Høiland-Jørgensen wrote:
>>>>> Ben Greear <[email protected]> writes:
>>>>>
>>>>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>>>>>> [email protected] writes:
>>>>>>>
>>>>>>>> [... quoted patch trimmed ...]
>>>>>>>> htt->num_pending_tx--;
>>>>>>>> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>>>>> + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>>>>>
>>>>>>> Why /4? Seems a bit arbitrary?
>>>>>>
>>>>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>>>>>> full so that it is unlikely to run dry. Possibly it should restart
>>>>>> sooner to keep it more full on average?
>>>>>
>>>>> Theoretically, the "keep the queue at the lowest possible level that
>>>>> keeps it from underflowing" is what BQL is supposed to do. The diff
>>>>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
>>>>> place of num_pending_tx. I've only compile tested it, and I'm a bit
>>>>> skeptical that it will work right for this, but if anyone wants to give
>>>>> it a shot, there it is.
>>>>>
>>>>> BTW, while doing that, I noticed there's a similar arbitrary limit in
>>>>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
>>>>> to keep the arbitrary limit maybe use the same one? :)
>>>>>
>>>>>> Before my patch, the behaviour would be to try to keep it as full as
>>>>>> possible, as in restart the queues as soon as a single slot opens up
>>>>>> in the tx queue.
>>>>>
>>>>> Yeah, that seems somewhat misguided as well, from a latency perspective,
>>>>> at least. But I guess that's what we're fixing with AQL. What does the
>>>>> firmware do with the frames queued within? Do they just go on a FIFO
>>>>> queue altogether, or something smarter?
>>>>
>>>> Sort of like a mini-mac80211 stack inside the firmware is used to
>>>> create ampdu/amsdu chains and schedule them with its own scheduler.
>>>>
>>>> For optimal throughput with 200 users streaming video,
>>>> the ath10k driver should think that it has only a few active peers wanting
>>>> to send data at a time (and so firmware would think the same), and the driver should
>>>> be fed a large chunk of pkts for those peers. And then the next few peers.
>>>> That should let firmware send large ampdu/amsdu to each peer, increasing throughput
>>>> over all.
>>>
>>> Yes, but also increasing latency because all other stations have to wait
>>> for a longer TXOP (see my other reply).
>>
>> If you at most sent 4 station's worth of data to the firmware, and max
>> is 4ms per txop, then you have at most 16ms of latency. You could also
>> send just two station's worth of data at a time, as long as you can
>> quickly service the tx-queues again that should be enough to keep the
>> firmware/radio productive.
>
> Sure, but if you have 100 stations that are backlogged, and they each
> transmit for 4ms every time they get a chance, then on average each
> station has to wait 400 ms between each TXOP. That is way too long; so
> the maximum TXOP size should go down as the number of backlogged
> stations go up.

400ms should be fine for streaming netflix, and there is no reason that
higher priority traffic cannot jump ahead in the queue for much
better latency. If the frames are queued up in mac80211, then you don't
have to depend on TID scheduling in the driver/firmware anyway.

> AQL already does some of the right thing, BTW: When the total outstanding
> data queued in the firmware exceeds 24ms (by the AQL estimation), it'll
> switch the per-station limit from 12ms to 5ms of queued data. Arguably
> that could be lower (say, 2ms for the low per-station limit).

That still should work OK for bulk transport, as 5ms is a full ampdu chain.

>> In the case where you have many users wanting lots of throughput, 8 or
>> 16ms of extra latency is a good tradeoff vs no one being able to
>> reliably get the bandwidth they need.
>
> Yes, 8-16ms of extra latency is likely worth it, but not much more than
> that...
>
>> Higher priority TIDs will get precedence in ath10k firmware anyway, so
>> even if at time 0 you sent 64 frames to a peer on back-ground TID, if
>> you sent a VO frame at time 0+1, it could be transmitted first.
>
> Assuming anyone is actually using the priorities, that is; most
> applications are not (and those that are often do it wrong). Also, using
> the VO queue hurts efficiency as well for the same reason, since it
> can't aggregate at all.

I think ath10k aggregates VO traffic just fine, but either way, you could
use VI instead.

The kernel can adjust the TID appropriately, so the mac80211 scheduler could
tweak the TID so that bulk transport always goes on BK, and things needing
lower latency go out on VI, perhaps?

>
>>>> If you feed a few frames to each of the 200 peers, then even if
>>>> firmware has 2000 tx buffers, that is only 10 frames per peer at best,
>>>> leading to small ampdu/amsdu and thus worse over-all throughput and
>>>> utilization of airtime.
>>>
>>> Do you have any data on exactly how long (in time) each txop becomes in
>>> these highly congested scenarios?
>>
>> I didn't look at time, but avg packets-per-ampdu chain is 30+ in
>> single station tests, and with many stations it goes into the 4-8
>> range (from memory, so maybe I'm off a bit).
>
> Right; was just looking for rough numbers, so that is fine.
>
>> Here is an open-source tool that can give you those metrics by processing a pcap:
>>
>> https://github.com/greearb/lanforge-scripts/tree/master/wifi_diag
>>
>> # Ignore the LANforge bits about creating capture files, here is an example of how to use it:
>> https://www.candelatech.com/cookbook/wifire/wifi+diagnostic
>>
>>>
>>>> It would be nice to be able to set certain traffic flows to have the
>>>> throughput optimization and others to have the latency optimization.
>>>> For instance, high latency on a streaming download is a good trade-off
>>>> if it increases total throughput.
>>>
>>> For the individual flows to a peer, fq_codel actually does a pretty good
>>> job at putting the latency-sensitive flows first. Which is why we want
>>> the queueing to happen in mac80211 (where fq_codel is active) instead of
>>> in the firmware.
>>
>> That sounds good to me. What is needed from the driver/firmware to
>> make this work well?
>
> To queue as little as possible :)
>
> AQL is the mechanism we have to enforce this (for ath10k), by throttling
> queueing into the firmware earlier. It only does this on a per-station
> limit, though, and there's currently no mechanism to limit the total
> number of stations with outstanding data, as you're requesting. Which I
> guess means it won't help much in your case. One could imagine building
> a mechanism on top of AQL to do this, though, although I think it may
> not be quite trivial to get the interaction with the station scheduler
> right. The basic building blocks are there, though...

(I think this below is right, but it is complicated as can be so maybe I
am wrong in places.)

The ath10k wave-2 firmware has its own scheduler. It asks the driver what
peers need to transmit, then it requests buffers for those peers (based on
its own idea of fairness and scheduling). If the driver tells it all want
to tx frames, it will queue lots and you are at the mercy of its scheduler.
But, if the driver tells firmware only a few peers want to transmit, then
the driver (and upper stacks) will have a lot more control over the firmware's
queueing and scheduling.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-04-29 15:46:23

by Toke Høiland-Jørgensen

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.

Ben Greear <[email protected]> writes:

> On 04/29/2020 07:56 AM, Toke Høiland-Jørgensen wrote:
>> Ben Greear <[email protected]> writes:
>>
>>> On 04/29/2020 02:28 AM, Toke Høiland-Jørgensen wrote:
>>>> Ben Greear <[email protected]> writes:
>>>>
>>>>> On 04/28/2020 01:39 PM, Toke Høiland-Jørgensen wrote:
>>>>>> Ben Greear <[email protected]> writes:
>>>>>>
>>>>>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>>>>>>> [email protected] writes:
>>>>>>>>
>>>>>>>>> From: Ben Greear <[email protected]>
>>>>>>>>>
>>>>>>>>> While running tcp upload + download tests with ~200
>>>>>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>>>>>>> around 20% of the CPU according to perf-top, which other locking
>>>>>>>>> taking an additional ~15%.
>>>>>>>>>
>>>>>>>>> I believe the issue is that the ath10k driver would unlock the
>>>>>>>>> txqueue when a single frame could be transmitted, instead of
>>>>>>>>> waiting for a low water mark.
>>>>>>>>>
>>>>>>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>>>>>>> tx buffers allowed.
>>>>>>>>>
>>>>>>>>> This appears to resolve the performance problem that I saw.
>>>>>>>>>
>>>>>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Ben Greear <[email protected]>
>>>>>>>>> ---
>>>>>>>>> drivers/net/wireless/ath/ath10k/htt.h | 1 +
>>>>>>>>> drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>>>>>>> 2 files changed, 7 insertions(+), 2 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>>>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>>>>>>>
>>>>>>>>> u8 target_version_major;
>>>>>>>>> u8 target_version_minor;
>>>>>>>>> + bool needs_unlock;
>>>>>>>>> struct completion target_version_received;
>>>>>>>>> u8 max_num_amsdu;
>>>>>>>>> u8 max_num_ampdu;
>>>>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>>>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>>>>>>> lockdep_assert_held(&htt->tx_lock);
>>>>>>>>>
>>>>>>>>> htt->num_pending_tx--;
>>>>>>>>> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>>>>>> + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>>>>>>
>>>>>>>> Why /4? Seems a bit arbitrary?
>>>>>>>
>>>>>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>>>>>>> full so that it is unlikely to run dry. Possibly it should restart
>>>>>>> sooner to keep it more full on average?
>>>>>>
>>>>>> Theoretically, the "keep the queue at the lowest possible level that
>>>>>> keeps it from underflowing" is what BQL is supposed to do. The diff
>>>>>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
>>>>>> place of num_pending_tx. I've only compile tested it, and I'm a bit
>>>>>> skeptical that it will work right for this, but if anyone wants to give
>>>>>> it a shot, there it is.
>>>>>>
>>>>>> BTW, while doing that, I noticed there's a similar arbitrary limit in
>>>>>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
>>>>>> to keep the arbitrary limit maybe use the same one? :)
>>>>>>
>>>>>>> Before my patch, the behaviour would be to try to keep it as full as
>>>>>>> possible, as in restart the queues as soon as a single slot opens up
>>>>>>> in the tx queue.
>>>>>>
>>>>>> Yeah, that seems somewhat misguided as well, from a latency perspective,
>>>>>> at least. But I guess that's what we're fixing with AQL. What does the
>>>>>> firmware do with the frames queued within? Do they just go on a FIFO
>>>>>> queue altogether, or something smarter?
>>>>>
>>>>> Sort of like a mini-mac80211 stack inside the firmware is used to
>>>>> create ampdu/amsdu chains and schedule them with its own scheduler.
>>>>>
>>>>> For optimal throughput with 200 users streaming video,
>>>>> the ath10k driver should think that it has only a few active peers wanting
>>>>> to send data at a time (and so firmware would think the same), and the driver should
>>>>> be fed a large chunk of pkts for those peers. And then the next few peers.
>>>>> That should let firmware send large ampdu/amsdu to each peer, increasing throughput
>>>>> over all.
>>>>
>>>> Yes, but also increasing latency because all other stations have to wait
>>>> for a longer TXOP (see my other reply).
>>>
>>> If you at most sent 4 station's worth of data to the firmware, and max
>>> is 4ms per txop, then you have at most 16ms of latency. You could also
>>> send just two station's worth of data at a time, as long as you can
>>> quickly service the tx-queues again that should be enough to keep the
>>> firmware/radio productive.
>>
>> Sure, but if you have 100 stations that are backlogged, and they each
>> transmit for 4ms every time they get a chance, then on average each
>> station has to wait 400 ms between each TXOP. That is way too long; so
>> the maximum TXOP size should go down as the number of backlogged
>> stations go up.
>
> 400ms should be fine for streaming netflix, and there is no reason
> that higher priority traffic cannot jump ahead in the queue for much
> better latency. If the frames are queued up in mac80211, then you
> don't have to depend on TID scheduling in the driver/firmware anyway.

Yeah, in an ideal world where you have total information about current
and future traffic patterns that is of course what you would do. The
trick is to design a general algorithm that works for all the weird
cases (and traffic mixes!) with only the information that's available to
it. We've gotten a fair way towards that, but as you point out there are
still improvements possible.

Really, I think you're the only one who's testing with this many
stations active at the same time, so I guess it's not surprising that
this case has not seen too much scrutiny thus far :)

>> AQL already does some of the right thing, BTW: When the total outstanding
>> data queued in the firmware exceeds 24ms (by the AQL estimation), it'll
>> switch the per-station limit from 12ms to 5ms of queued data. Arguably
>> that could be lower (say, 2ms for the low per-station limit).
>
> That still should work OK for bulk transport, as 5ms is a full ampdu chain.
>
>>> In the case where you have many users wanting lots of throughput, 8 or
>>> 16ms of extra latency is a good tradeoff vs no one being able to
>>> reliably get the bandwidth they need.
>>
>> Yes, 8-16ms of extra latency is likely worth it, but not much more than
>> that...
>>
>>> Higher priority TIDs will get precedence in ath10k firmware anyway, so
>>> even if at time 0 you sent 64 frames to a peer on back-ground TID, if
>>> you sent a VO frame at time 0+1, it could be transmitted first.
>>
>> Assuming anyone is actually using the priorities, that is; most
>> applications are not (and those that are often do it wrong). Also, using
>> the VO queue hurts efficiency as well for the same reason, since it
>> can't aggregate at all.
>
> I think ath10k aggregates VO traffic just fine, but either way, you could
> use VI instead.
>
> The kernel can adjust the TID appropriately, so the mac80211 scheduler
> could tweak the TID so that bulk transport always goes on BK, and things
> needing lower latency go out on VI, perhaps?

I thought about messing with priorities like this, actually. The reason
I didn't go any further with it was that we found that it was often not
a win at all; if a station has both latency-sensitive and bulk traffic
you might as well just stick the low-latency traffic at the front of the
aggregate and send the whole thing in one go. Saves a TXOP :)

>>>>> If you feed a few frames to each of the 200 peers, then even if
>>>>> firmware has 2000 tx buffers, that is only 10 frames per peer at best,
>>>>> leading to small ampdu/amsdu and thus worse over-all throughput and
>>>>> utilization of airtime.
>>>>
>>>> Do you have any data on exactly how long (in time) each txop becomes in
>>>> these highly congested scenarios?
>>>
>>> I didn't look at time, but avg packets-per-ampdu chain is 30+ in
>>> single station tests, and with many stations it goes into the 4-8
>>> range (from memory, so maybe I'm off a bit).
>>
>> Right; was just looking for rough numbers, so that is fine.
>>
>>> Here is an open-source tool that can give you those metrics by processing a pcap:
>>>
>>> https://github.com/greearb/lanforge-scripts/tree/master/wifi_diag
>>>
>>> # Ignore the LANforge bits about creating capture files, here is an example of how to use it:
>>> https://www.candelatech.com/cookbook/wifire/wifi+diagnostic
>>>
>>>>
>>>>> It would be nice to be able to set certain traffic flows to have the
>>>>> throughput optimization and others to have the latency optimization.
>>>>> For instance, high latency on a streaming download is a good trade-off
>>>>> if it increases total throughput.
>>>>
>>>> For the individual flows to a peer, fq_codel actually does a pretty good
>>>> job at putting the latency-sensitive flows first. Which is why we want
>>>> the queueing to happen in mac80211 (where fq_codel is active) instead of
>>>> in the firmware.
>>>
>>> That sounds good to me. What is needed from the driver/firmware to
>>> make this work well?
>>
>> To queue as little as possible :)
>>
>> AQL is the mechanism we have to enforce this (for ath10k), by throttling
>> queueing into the firmware earlier. It only does this on a per-station
>> limit, though, and there's currently no mechanism to limit the total
>> number of stations with outstanding data, as you're requesting. Which I
>> guess means it won't help much in your case. One could imagine building
>> a mechanism on top of AQL to do this, though, although I think it may
>> not be quite trivial to get the interaction with the station scheduler
>> right. The basic building blocks are there, though...
>
> (I think this below is right, but it is complicated as can be so maybe I
> am wrong in places.)
>
> The ath10k wave-2 firmware has its own scheduler. It asks the driver
> what peers need to transmit, then it requests buffers for those peers
> (based on its own idea of fairness and scheduling). If the driver
> tells it all want to tx frames, it will queue lots and you are at the
> mercy of its scheduler. But, if the driver tells firmware only a few
> peers want to transmit, then the driver (and upper stacks) will have a
> lot more control over the firmware's queueing and scheduling.

Yeah, that's basically what we're trying to do in the airtime fairness
scheduler (see ieee80211_txq_may_transmit()).

-Toke

2020-04-30 23:33:05

by John Deere

[permalink] [raw]
Subject: Re: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.

I've just tried Toke's DQL on an Archer C7 (QCA 988X) and it errors out
with:

ath10k_pci failed to increase tx pending count: -16, dropping

This is with ath10k non-ct firmware and driver on OpenWrt (w/backports
5.4.27 & AQL). ath10k starts up but is unable to let any stations connect.

I've also tried the standalone patch by Ben and it seems to have reduced
the latencies on top of AQL by another 5 ms.

On 4/29/20 4:39 AM, Toke Høiland-Jørgensen wrote:
> Ben Greear <[email protected]> writes:
>
>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>> [email protected] writes:
>>>
>>>> From: Ben Greear <[email protected]>
>>>>
>>>> While running tcp upload + download tests with ~200
>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>> around 20% of the CPU according to perf-top, which other locking
>>>> taking an additional ~15%.
>>>>
>>>> I believe the issue is that the ath10k driver would unlock the
>>>> txqueue when a single frame could be transmitted, instead of
>>>> waiting for a low water mark.
>>>>
>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>> tx buffers allowed.
>>>>
>>>> This appears to resolve the performance problem that I saw.
>>>>
>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>
>>>> Signed-off-by: Ben Greear <[email protected]>
>>>> ---
>>>> drivers/net/wireless/ath/ath10k/htt.h | 1 +
>>>> drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>> 2 files changed, 7 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>>
>>>> u8 target_version_major;
>>>> u8 target_version_minor;
>>>> + bool needs_unlock;
>>>> struct completion target_version_received;
>>>> u8 max_num_amsdu;
>>>> u8 max_num_ampdu;
>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>> lockdep_assert_held(&htt->tx_lock);
>>>>
>>>> htt->num_pending_tx--;
>>>> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>> + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>
>>> Why /4? Seems a bit arbitrary?
>>
>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>> full so that it is unlikely to run dry. Possibly it should restart
>> sooner to keep it more full on average?
>
> Theoretically, the "keep the queue at the lowest possible level that
> keeps it from underflowing" is what BQL is supposed to do. The diff
> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
> place of num_pending_tx. I've only compile tested it, and I'm a bit
> skeptical that it will work right for this, but if anyone wants to give
> it a shot, there it is.
>
> BTW, while doing that, I noticed there's a similar arbitrary limit in
> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
> to keep the arbitrary limit maybe use the same one? :)
>
>> Before my patch, the behaviour would be to try to keep it as full as
>> possible, as in restart the queues as soon as a single slot opens up
>> in the tx queue.
>
> Yeah, that seems somewhat misguided as well, from a latency perspective,
> at least. But I guess that's what we're fixing with AQL. What does the
> firmware do with the frames queued within? Do they just go on a FIFO
> queue altogether, or something smarter?
>
> -Toke
>
>
>
>
> diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
> index f26cc6989dad..72771ff38a94 100644
> --- a/drivers/net/wireless/ath/ath10k/core.c
> +++ b/drivers/net/wireless/ath/ath10k/core.c
> @@ -2497,6 +2497,10 @@ static int ath10k_core_init_firmware_features(struct ath10k *ar)
> return -EINVAL;
> }
>
> + dql_init(&ar->htt.dql, HZ);
> + ar->htt.dql.max_limit = ar->htt.max_num_pending_tx;
> + ar->htt.dql.min_limit = 8;
> +
> if (ar->hw_params.num_peers)
> ar->max_num_peers = ar->hw_params.num_peers;
> else
> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
> index 4a12564fc30e..19024d063896 100644
> --- a/drivers/net/wireless/ath/ath10k/htt.h
> +++ b/drivers/net/wireless/ath/ath10k/htt.h
> @@ -13,6 +13,7 @@
> #include <linux/dmapool.h>
> #include <linux/hashtable.h>
> #include <linux/kfifo.h>
> +#include <linux/dynamic_queue_limits.h>
> #include <net/mac80211.h>
>
> #include "htc.h"
> @@ -1965,8 +1966,8 @@ struct ath10k_htt {
> /* Protects access to pending_tx, num_pending_tx */
> spinlock_t tx_lock;
> int max_num_pending_tx;
> - int num_pending_tx;
> int num_pending_mgmt_tx;
> + struct dql dql;
> struct idr pending_tx;
> wait_queue_head_t empty_tx_wq;
>
> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
> index e9d12ea708b6..911a79470bdf 100644
> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> @@ -144,8 +144,8 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
> {
> lockdep_assert_held(&htt->tx_lock);
>
> - htt->num_pending_tx--;
> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
> + dql_completed(&htt->dql, 1);
> + if (dql_avail(&htt->dql) > 0)
> ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
> }
>
> @@ -153,11 +153,11 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
> {
> lockdep_assert_held(&htt->tx_lock);
>
> - if (htt->num_pending_tx >= htt->max_num_pending_tx)
> + if (dql_avail(&htt->dql) <= 0)
> return -EBUSY;
>
> - htt->num_pending_tx++;
> - if (htt->num_pending_tx == htt->max_num_pending_tx)
> + dql_queued(&htt->dql, 1);
> + if (dql_avail(&htt->dql) <= 0)
> ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>
> return 0;
> diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
> index 2d03b8dd3b8c..1fe251742b0a 100644
> --- a/drivers/net/wireless/ath/ath10k/mac.c
> +++ b/drivers/net/wireless/ath/ath10k/mac.c
> @@ -3998,7 +3998,7 @@ static bool ath10k_mac_tx_can_push(struct ieee80211_hw *hw,
> if (ar->htt.tx_q_state.mode == HTT_TX_MODE_SWITCH_PUSH)
> return true;
>
> - if (ar->htt.num_pending_tx < ar->htt.tx_q_state.num_push_allowed)
> + if (dql_avail(&ar->htt.dql) < ar->htt.tx_q_state.num_push_allowed)
> return true;
>
> if (artxq->num_fw_queued < artxq->num_push_allowed)
> @@ -4159,7 +4159,7 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
> if (ar->htt.tx_q_state.mode != HTT_TX_MODE_SWITCH_PUSH)
> return;
>
> - if (ar->htt.num_pending_tx >= (ar->htt.max_num_pending_tx / 2))
> + if (dql_avail(&ar->htt.dql) < (ar->htt.dql.limit / 2))
> return;
>
> rcu_read_lock();
> @@ -7160,7 +7160,7 @@ void ath10k_mac_wait_tx_complete(struct ath10k *ar)
> bool empty;
>
> spin_lock_bh(&ar->htt.tx_lock);
> - empty = (ar->htt.num_pending_tx == 0);
> + empty = (ar->htt.dql.num_completed == ar->htt.dql.num_queued);
> spin_unlock_bh(&ar->htt.tx_lock);
>
> skip = (ar->state == ATH10K_STATE_WEDGED) ||
> diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
> index 39abf8b12903..fe7cd53c2bf9 100644
> --- a/drivers/net/wireless/ath/ath10k/txrx.c
> +++ b/drivers/net/wireless/ath/ath10k/txrx.c
> @@ -80,7 +80,7 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
>
> ath10k_htt_tx_free_msdu_id(htt, tx_done->msdu_id);
> ath10k_htt_tx_dec_pending(htt);
> - if (htt->num_pending_tx == 0)
> + if (htt->dql.num_completed == htt->dql.num_queued)
> wake_up(&htt->empty_tx_wq);
> spin_unlock_bh(&htt->tx_lock);
>
>

2020-05-01 02:17:54

by Ben Greear

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.



On 04/30/2020 04:31 PM, John Deere wrote:
> I've just tried Toke's DQL on an Archer C7 (QCA 988X) and it errors out with:
>
> ath10k_pci failed to increase tx pending count: -16, dropping
>
> This is with ath10k non-ct firmware and driver on OpenWrt (w/backports 5.4.27 & AQL). ath10k starts up but is unable to let any stations connect.
>
> I've also tried the standalone patch by Ben and it seems to have reduced the latencies on top of AQL by another 5 ms.

Hello,

Did you notice any throughput changes or system load changes in the test that you did with my patch?

Thanks,
Ben

>
> On 4/29/20 4:39 AM, Toke Høiland-Jørgensen wrote:
>> Ben Greear <[email protected]> writes:
>>
>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>>> [email protected] writes:
>>>>
>>>>> From: Ben Greear <[email protected]>
>>>>>
>>>>> While running tcp upload + download tests with ~200
>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>>> around 20% of the CPU according to perf-top, which other locking
>>>>> taking an additional ~15%.
>>>>>
>>>>> I believe the issue is that the ath10k driver would unlock the
>>>>> txqueue when a single frame could be transmitted, instead of
>>>>> waiting for a low water mark.
>>>>>
>>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>>> tx buffers allowed.
>>>>>
>>>>> This appears to resolve the performance problem that I saw.
>>>>>
>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>
>>>>> Signed-off-by: Ben Greear <[email protected]>
>>>>> ---
>>>>> drivers/net/wireless/ath/ath10k/htt.h | 1 +
>>>>> drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>>> 2 files changed, 7 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>>>
>>>>> u8 target_version_major;
>>>>> u8 target_version_minor;
>>>>> + bool needs_unlock;
>>>>> struct completion target_version_received;
>>>>> u8 max_num_amsdu;
>>>>> u8 max_num_ampdu;
>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>>> lockdep_assert_held(&htt->tx_lock);
>>>>>
>>>>> htt->num_pending_tx--;
>>>>> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>> + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>>
>>>> Why /4? Seems a bit arbitrary?
>>>
>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>>> full so that it is unlikely to run dry. Possibly it should restart
>>> sooner to keep it more full on average?
>>
>> Theoretically, the "keep the queue at the lowest possible level that
>> keeps it from underflowing" is what BQL is supposed to do. The diff
>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
>> place of num_pending_tx. I've only compile tested it, and I'm a bit
>> skeptical that it will work right for this, but if anyone wants to give
>> it a shot, there it is.
>>
>> BTW, while doing that, I noticed there's a similar arbitrary limit in
>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
>> to keep the arbitrary limit maybe use the same one? :)
>>
>>> Before my patch, the behaviour would be to try to keep it as full as
>>> possible, as in restart the queues as soon as a single slot opens up
>>> in the tx queue.
>>
>> Yeah, that seems somewhat misguided as well, from a latency perspective,
>> at least. But I guess that's what we're fixing with AQL. What does the
>> firmware do with the frames queued within? Do they just go on a FIFO
>> queue altogether, or something smarter?
>>
>> -Toke
>>
>>
>>
>>
>> diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
>> index f26cc6989dad..72771ff38a94 100644
>> --- a/drivers/net/wireless/ath/ath10k/core.c
>> +++ b/drivers/net/wireless/ath/ath10k/core.c
>> @@ -2497,6 +2497,10 @@ static int ath10k_core_init_firmware_features(struct ath10k *ar)
>> return -EINVAL;
>> }
>> + dql_init(&ar->htt.dql, HZ);
>> + ar->htt.dql.max_limit = ar->htt.max_num_pending_tx;
>> + ar->htt.dql.min_limit = 8;
>> +
>> if (ar->hw_params.num_peers)
>> ar->max_num_peers = ar->hw_params.num_peers;
>> else
>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>> index 4a12564fc30e..19024d063896 100644
>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>> @@ -13,6 +13,7 @@
>> #include <linux/dmapool.h>
>> #include <linux/hashtable.h>
>> #include <linux/kfifo.h>
>> +#include <linux/dynamic_queue_limits.h>
>> #include <net/mac80211.h>
>> #include "htc.h"
>> @@ -1965,8 +1966,8 @@ struct ath10k_htt {
>> /* Protects access to pending_tx, num_pending_tx */
>> spinlock_t tx_lock;
>> int max_num_pending_tx;
>> - int num_pending_tx;
>> int num_pending_mgmt_tx;
>> + struct dql dql;
>> struct idr pending_tx;
>> wait_queue_head_t empty_tx_wq;
>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>> index e9d12ea708b6..911a79470bdf 100644
>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>> @@ -144,8 +144,8 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>> {
>> lockdep_assert_held(&htt->tx_lock);
>> - htt->num_pending_tx--;
>> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>> + dql_completed(&htt->dql, 1);
>> + if (dql_avail(&htt->dql) > 0)
>> ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>> }
>> @@ -153,11 +153,11 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
>> {
>> lockdep_assert_held(&htt->tx_lock);
>> - if (htt->num_pending_tx >= htt->max_num_pending_tx)
>> + if (dql_avail(&htt->dql) <= 0)
>> return -EBUSY;
>> - htt->num_pending_tx++;
>> - if (htt->num_pending_tx == htt->max_num_pending_tx)
>> + dql_queued(&htt->dql, 1);
>> + if (dql_avail(&htt->dql) <= 0)
>> ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>> return 0;
>> diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
>> index 2d03b8dd3b8c..1fe251742b0a 100644
>> --- a/drivers/net/wireless/ath/ath10k/mac.c
>> +++ b/drivers/net/wireless/ath/ath10k/mac.c
>> @@ -3998,7 +3998,7 @@ static bool ath10k_mac_tx_can_push(struct ieee80211_hw *hw,
>> if (ar->htt.tx_q_state.mode == HTT_TX_MODE_SWITCH_PUSH)
>> return true;
>> - if (ar->htt.num_pending_tx < ar->htt.tx_q_state.num_push_allowed)
>> + if (dql_avail(&ar->htt.dql) < ar->htt.tx_q_state.num_push_allowed)
>> return true;
>> if (artxq->num_fw_queued < artxq->num_push_allowed)
>> @@ -4159,7 +4159,7 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
>> if (ar->htt.tx_q_state.mode != HTT_TX_MODE_SWITCH_PUSH)
>> return;
>> - if (ar->htt.num_pending_tx >= (ar->htt.max_num_pending_tx / 2))
>> + if (dql_avail(&ar->htt.dql) < (ar->htt.dql.limit / 2))
>> return;
>> rcu_read_lock();
>> @@ -7160,7 +7160,7 @@ void ath10k_mac_wait_tx_complete(struct ath10k *ar)
>> bool empty;
>> spin_lock_bh(&ar->htt.tx_lock);
>> - empty = (ar->htt.num_pending_tx == 0);
>> + empty = (ar->htt.dql.num_completed == ar->htt.dql.num_queued);
>> spin_unlock_bh(&ar->htt.tx_lock);
>> skip = (ar->state == ATH10K_STATE_WEDGED) ||
>> diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
>> index 39abf8b12903..fe7cd53c2bf9 100644
>> --- a/drivers/net/wireless/ath/ath10k/txrx.c
>> +++ b/drivers/net/wireless/ath/ath10k/txrx.c
>> @@ -80,7 +80,7 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
>> ath10k_htt_tx_free_msdu_id(htt, tx_done->msdu_id);
>> ath10k_htt_tx_dec_pending(htt);
>> - if (htt->num_pending_tx == 0)
>> + if (htt->dql.num_completed == htt->dql.num_queued)
>> wake_up(&htt->empty_tx_wq);
>> spin_unlock_bh(&htt->tx_lock);
>>
>

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-05-01 15:51:57

by John Deere

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.



On 5/1/20 10:16 AM, Ben Greear wrote:
>
>
> On 04/30/2020 04:31 PM, John Deere wrote:
>> I've just tried Toke's DQL on an Archer C7 (QCA 988X) and it errors
>> out with:
>>
>> ath10k_pci  failed to increase tx pending count: -16, dropping
>>
>> This is with ath10k non-ct firmware and driver on OpenWrt (w/backports
>> 5.4.27 & AQL). ath10k starts up but is unable to let any stations
>> connect.
>>
>> I've also tried the standalone patch by Ben and it seems to have
>> reduced the latencies on top of AQL by another 5 ms.
>
> Hello,
>
> Did you notice any throughput changes or system load changes in the test
> that you did with my patch?
>
> Thanks,
> Ben
>

I have noticed a reduction in system load and memory use. Previously, with 11
clients on one Archer C7 acting as an AP only, my free memory would sit at
51%; it now shows 55-56%, a 4 to 5 percentage point reduction in memory use.
Note that these results were obtained alongside reverting the following
commit:
https://github.com/openwrt/openwrt/commit/1e27befe63ff4c69f110c7310316f4af75ee63e9.
I believe this same set of patches is also currently in use for the
ath10k-ct smallbuffers variant on OpenWrt.

I have not had the time to conduct any meaningful throughput measurements.

>> [quoted patch and earlier discussion trimmed]

2020-05-01 17:55:13

by Mark Baker

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.

I have been using Toke’s patch successfully on my Netgear R7800 (QCA9984) running an OpenWrt snapshot build with ath10k-ct firmware and the ath10k-ct-smallbuffers driver. I am using the R7800 as an AP only, so I am running a very stripped-down build of OpenWrt.

I provided several Flent tests to Dave Taht and Toke showing some pretty significant improvements in latency. I did not capture any CPU/sirq stats on my R7800 during these tests, though. It does appear my maximum download bandwidth has decreased somewhat, but I am more than happy to take that hit for the benefit of significantly lower latency.

> On May 1, 2020, at 11:50 AM, John Deere <[email protected]> wrote:
>
>
>
> On 5/1/20 10:16 AM, Ben Greear wrote:
>> On 04/30/2020 04:31 PM, John Deere wrote:
>>> I've just tried Toke's DQL on an Archer C7 (QCA 988X) and it errors out with:
>>>
>>> ath10k_pci failed to increase tx pending count: -16, dropping
>>>
>>> This is with ath10k non-ct firmware and driver on OpenWrt (w/backports 5.4.27 & AQL). ath10k starts up but is unable to let any stations connect.
>>>
>>> I've also tried the standalone patch by Ben and it seems to have reduced the latencies on top of AQL by another 5 ms.
>> Hello,
>> Did you notice any throughput changes or system load changes in the test that you did with my patch?
>> Thanks,
>> Ben
>
> I have noticed that there has been a reduction in system load and memory use. Whereas previously with 11 clients on one Archer C7 acting as an AP only, my free memory available would be 51%, it now shows 55-56% - a 4% to 5% reduction. Note, these results were obtained alongside with reverting the following comit https://github.com/openwrt/openwrt/commit/1e27befe63ff4c69f110c7310316f4af75ee63e9. I believe that this same set of patches also are currently in use for the ath10k-ct smallbuffers variant on OpenWrt.
>
> I have not had the time to conduct any meaningful throughput measurements.
>
>>> [quoted patch and earlier discussion trimmed]

2020-05-02 05:52:22

by John Deere

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.



On 5/2/20 1:51 AM, Mark Baker wrote:
> I have been using Toke’s patch successfully with my Netgear R7800 (QCA9984) running OpenWRT snapshot build with ath10k-ct firmware and ath10k-ct-smallbuffers driver. I am using my R7800 as an AP only, so I am running a very stripped down build of OpenWRT.

Interesting to hear that it works for you. To narrow down whether the
difference is due to the CT firmware & driver versus the device being
wave 2, could you test a build with the non-CT ath10k driver and firmware?

>
> I provided several Flent tests to Dave Taht and Toke showing some pretty significant improvements in latency. I did not capture any CPU/sirq stats on my R7800 during these tests, though. It does appear my maximum download bandwidth has decreased somewhat, but I am more than happy to take that hit for the benefit of significantly lower latency.
>
>> On May 1, 2020, at 11:50 AM, John Deere <[email protected]> wrote:
>>
>>
>>
>> On 5/1/20 10:16 AM, Ben Greear wrote:
>>> On 04/30/2020 04:31 PM, John Deere wrote:
>>>> I've just tried Toke's DQL on an Archer C7 (QCA 988X) and it errors out with:
>>>>
>>>> ath10k_pci failed to increase tx pending count: -16, dropping
>>>>
>>>> This is with ath10k non-ct firmware and driver on OpenWrt (w/backports 5.4.27 & AQL). ath10k starts up but is unable to let any stations connect.
>>>>
>>>> I've also tried the standalone patch by Ben and it seems to have reduced the latencies on top of AQL by another 5 ms.
>>> Hello,
>>> Did you notice any throughput changes or system load changes in the test that you did with my patch?

Hi Ben,

I've run some basic throughput tests with iperf, and it seems there was
also a slight hit in throughput, but only by 5-20 Mbps or so.


>>> Thanks,
>>> Ben
>>
>> I have noticed that there has been a reduction in system load and memory use. Whereas previously with 11 clients on one Archer C7 acting as an AP only, my free memory available would be 51%, it now shows 55-56% - a 4% to 5% reduction. Note, these results were obtained alongside with reverting the following comit https://github.com/openwrt/openwrt/commit/1e27befe63ff4c69f110c7310316f4af75ee63e9. I believe that this same set of patches also are currently in use for the ath10k-ct smallbuffers variant on OpenWrt.
>>
>> I have not had the time to conduct any meaningful throughput measurements.
>>
>>>> [quoted patch and earlier discussion trimmed]
>>>>> --- a/drivers/net/wireless/ath/ath10k/txrx.c
>>>>> +++ b/drivers/net/wireless/ath/ath10k/txrx.c
>>>>> @@ -80,7 +80,7 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
>>>>> ath10k_htt_tx_free_msdu_id(htt, tx_done->msdu_id);
>>>>> ath10k_htt_tx_dec_pending(htt);
>>>>> - if (htt->num_pending_tx == 0)
>>>>> + if (htt->dql.num_completed == htt->dql.num_queued)
>>>>> wake_up(&htt->empty_tx_wq);
>>>>> spin_unlock_bh(&htt->tx_lock);
>>>>>
>>>>
>
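For readers unfamiliar with the dynamic-queue-limits API used in the quoted patch, here is a minimal userspace model of the queued/completed accounting. This is a sketch only: the real `<linux/dynamic_queue_limits.h>` implementation also adapts `limit` at runtime based on completion timing, while this model keeps it fixed to show just the relationship between `dql_queued()`, `dql_completed()`, and `dql_avail()` that the patch relies on. The `model_*` names are invented for this sketch.

```c
#include <assert.h>

/*
 * Simplified stand-in for struct dql: only the three fields the
 * quoted patch reads. The kernel version carries several more
 * fields used by its limit-adaptation algorithm.
 */
struct dql_model {
	unsigned int num_queued;    /* bumped by model_queued() */
	unsigned int num_completed; /* bumped by model_completed() */
	unsigned int limit;         /* max objects allowed in flight */
};

/* Analogue of dql_queued(): record n objects handed to the hardware. */
static void model_queued(struct dql_model *d, unsigned int n)
{
	d->num_queued += n;
}

/* Analogue of dql_completed(): record n objects the hardware finished. */
static void model_completed(struct dql_model *d, unsigned int n)
{
	d->num_completed += n;
}

/*
 * Analogue of dql_avail(): > 0 means there is room to queue more,
 * <= 0 means the queue is full and tx should be paused -- exactly
 * the checks made in ath10k_htt_tx_inc_pending()/dec_pending() above.
 */
static int model_avail(const struct dql_model *d)
{
	return (int)(d->limit - (d->num_queued - d->num_completed));
}
```

With `limit = 4`, queueing 4 frames drives `model_avail()` to 0 (queue pauses, as in `ath10k_htt_tx_inc_pending()`), and each completion raises it back above 0 (queue unpauses, as in `ath10k_htt_tx_dec_pending()`).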

2020-05-02 15:57:00

by Ben Greear

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.



On 05/01/2020 10:49 PM, John Deere wrote:
>

>>>>> I've also tried the standalone patch by Ben and it seems to have reduced the latencies on top of AQL by another 5 ms.
>>>> Hello,
>>>> Did you notice any throughput changes or system load changes in the test that you did with my patch?
>
> Hi Ben,
>
> I've run some basic throughput tests with iperf and it seems there was also a slight hit in throughput, but only by 5-20 Mbps or so.

If you are using smaller buffers anyway, maybe the 1/4 restart limit is too low. I'm running closer to 2000 tx buffers,
so my restart level would be at around 500 buffers. I did not see any drop in throughput in my test case, but rather saw
an improvement.
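The hysteresis Ben describes can be sketched as below. The field names follow his standalone patch at the top of the thread (the `inc` side setting `needs_unlock` is inferred, since the posted hunk only shows the `dec` side), and `queues_stopped` stands in for the real `ath10k_mac_tx_lock()`/`ath10k_mac_tx_unlock()` calls. Queues stop when the buffer pool is exhausted and restart only once in-flight frames fall to the low-water mark of `max / 4`.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the pending-tx bookkeeping from the standalone patch. */
struct htt_model {
	int num_pending_tx;
	int max_num_pending_tx;
	bool needs_unlock;   /* set when the queues were stopped */
	bool queues_stopped; /* stand-in for ath10k_mac_tx_lock/unlock */
};

/* Called per frame handed to the firmware. */
static void tx_inc_pending(struct htt_model *h)
{
	h->num_pending_tx++;
	if (h->num_pending_tx == h->max_num_pending_tx) {
		h->needs_unlock = true;
		h->queues_stopped = true; /* pool exhausted: stop queues */
	}
}

/* Called per tx completion; restarts only below the low-water mark. */
static void tx_dec_pending(struct htt_model *h)
{
	h->num_pending_tx--;
	if (h->num_pending_tx <= h->max_num_pending_tx / 4 &&
	    h->needs_unlock) {
		h->needs_unlock = false;
		h->queues_stopped = false; /* 1/4 low-water mark reached */
	}
}
```

With ~2000 tx buffers this restarts the queues at roughly 500 frames in flight, matching the numbers above; the point of the hysteresis is that a single completion no longer toggles the queues on and off at the full-pool boundary.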

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com