2020-04-28 19:44:13

by Dave Taht

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.

On Tue, Apr 28, 2020 at 12:37 PM Toke Høiland-Jørgensen <[email protected]> wrote:
>
> Ben Greear <[email protected]> writes:
>
> > On 04/28/2020 07:56 AM, Steve deRosier wrote:
> >> On Mon, Apr 27, 2020 at 7:54 AM <[email protected]> wrote:
> >>>
> >>> From: Ben Greear <[email protected]>
> >>>
> >>> While running tcp upload + download tests with ~200
> >>> concurrent TCP streams, 1-2 processes, and 30 station
> >>> vdevs, I noticed that the __ieee80211_stop_queue was taking
> >>> around 20% of the CPU according to perf-top, with other locking
> >>> taking an additional ~15%.
> >>>
> >>> I believe the issue is that the ath10k driver would unlock the
> >>> txqueue when a single frame could be transmitted, instead of
> >>> waiting for a low water mark.
> >>>
> >>> So, this patch adds a low-water mark that is 1/4 of the total
> >>> tx buffers allowed.
> >>>
> >>> This appears to resolve the performance problem that I saw.
> >>>
> >>> Tested with recent wave-1 ath10k-ct firmware.
> >>>
> >>
> >> Hey Ben,
> >>
> >> Did you do any testing with this patch around latency? The nature of
> >> the thing that you fixed makes me wonder if it was intentional with
> >> respect to making WiFi fast - ie getting rid of buffers as much as
> >> possible. Obviously the CPU impact is likely to be an unintended
> >> consequence. In any case, I don't know anything for sure, it was just
> >> a thought that went through my head when reading this.
> >
> > I did not, but on average my patch should make the queues be less full,
> > so I doubt it will hurt latency.
>
> I would tend to agree with that.

Well, I don't, as it's dependent on right-sizing the ring in the first place.

But testing is in order! :)

> -Toke
>
> >>> Signed-off-by: Ben Greear <[email protected]>
> >>> ---
> >>> drivers/net/wireless/ath/ath10k/htt.h | 1 +
> >>> drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
> >>> 2 files changed, 7 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
> >>> index 31c4ddbf45cb..b5634781c0dc 100644
> >>> --- a/drivers/net/wireless/ath/ath10k/htt.h
> >>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
> >>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
> >>>
> >>> u8 target_version_major;
> >>> u8 target_version_minor;
> >>> + bool needs_unlock;
> >>> struct completion target_version_received;
> >>> u8 max_num_amsdu;
> >>> u8 max_num_ampdu;
> >>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
> >>> index 9b3c3b080e92..44795d9a7c0c 100644
> >>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> >>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> >>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
> >>> lockdep_assert_held(&htt->tx_lock);
> >>>
> >>> htt->num_pending_tx--;
> >>> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
> >>> + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
> >>> + htt->needs_unlock = false;
> >>> ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
> >>> + }
> >>> }
> >>>
> >>> int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
> >>> @@ -157,8 +159,10 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
> >>> return -EBUSY;
> >>>
> >>> htt->num_pending_tx++;
> >>> - if (htt->num_pending_tx == htt->max_num_pending_tx)
> >>> + if (htt->num_pending_tx == htt->max_num_pending_tx) {
> >>> + htt->needs_unlock = true;
> >>> ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
> >>> + }
> >>>
> >>> return 0;
> >>> }
> >>> --
> >>> 2.20.1
> >>>
> >>
> >
> > --
> > Ben Greear <[email protected]>
> > Candela Technologies Inc http://www.candelatech.com



--
Make Music, Not War

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-435-0729


2020-04-28 20:02:18

by Ben Greear

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.



On 04/28/2020 12:39 PM, Dave Taht wrote:
> On Tue, Apr 28, 2020 at 12:37 PM Toke Høiland-Jørgensen <[email protected]> wrote:
>>
>> Ben Greear <[email protected]> writes:
>>
>>> On 04/28/2020 07:56 AM, Steve deRosier wrote:
>>>> On Mon, Apr 27, 2020 at 7:54 AM <[email protected]> wrote:
>>>>>
>>>>> From: Ben Greear <[email protected]>
>>>>>
>>>>> While running tcp upload + download tests with ~200
>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>>> around 20% of the CPU according to perf-top, with other locking
>>>>> taking an additional ~15%.
>>>>>
>>>>> I believe the issue is that the ath10k driver would unlock the
>>>>> txqueue when a single frame could be transmitted, instead of
>>>>> waiting for a low water mark.
>>>>>
>>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>>> tx buffers allowed.
>>>>>
>>>>> This appears to resolve the performance problem that I saw.
>>>>>
>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>
>>>>
>>>> Hey Ben,
>>>>
>>>> Did you do any testing with this patch around latency? The nature of
>>>> the thing that you fixed makes me wonder if it was intentional with
>>>> respect to making WiFi fast - ie getting rid of buffers as much as
>>>> possible. Obviously the CPU impact is likely to be an unintended
>>>> consequence. In any case, I don't know anything for sure, it was just
>>>> a thought that went through my head when reading this.
>>>
>>> I did not, but on average my patch should make the queues be less full,
>>> so I doubt it will hurt latency.
>>
>> I would tend to agree with that.
>
> Well, I don't, as it's dependent on right sizing the ring in the first place.

My patch, barring strange issues elsewhere, can only make the firmware tx queues less full on
average.

If you want to test with different ring sizes, you can play with the tx_desc
setting in the ath10k-ct driver 'fwcfg' options.

http://www.candelatech.com/ath10k-10.4.php#config

My testing shows that overall throughput goes down when using lots of peers
if you have smaller numbers of txbuffers. This is because the firmware
will typically spread its buffers over lots of peers and have smaller ampdu
chains per transmit. An upper stack that more intelligently fed frames
to the firmware could mitigate this, and it is not all bad anyway since
giving everyone 64-frame ampdu chains will increase burstiness at least somewhat.

I've always envisioned that the stuff you and Toke and others have been
working on would help in this area, but I don't understand your stuff well
enough to know if that is true or not.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-04-28 20:59:16

by Toke Høiland-Jørgensen

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.

Ben Greear <[email protected]> writes:

> On 04/28/2020 12:39 PM, Dave Taht wrote:
>> On Tue, Apr 28, 2020 at 12:37 PM Toke Høiland-Jørgensen <[email protected]> wrote:
>>>
>>> Ben Greear <[email protected]> writes:
>>>
>>>> On 04/28/2020 07:56 AM, Steve deRosier wrote:
>>>>> On Mon, Apr 27, 2020 at 7:54 AM <[email protected]> wrote:
>>>>>>
>>>>>> From: Ben Greear <[email protected]>
>>>>>>
>>>>>> While running tcp upload + download tests with ~200
>>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
> >>>>>> around 20% of the CPU according to perf-top, with other locking
>>>>>> taking an additional ~15%.
>>>>>>
>>>>>> I believe the issue is that the ath10k driver would unlock the
>>>>>> txqueue when a single frame could be transmitted, instead of
>>>>>> waiting for a low water mark.
>>>>>>
>>>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>>>> tx buffers allowed.
>>>>>>
>>>>>> This appears to resolve the performance problem that I saw.
>>>>>>
>>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>>
>>>>>
>>>>> Hey Ben,
>>>>>
>>>>> Did you do any testing with this patch around latency? The nature of
>>>>> the thing that you fixed makes me wonder if it was intentional with
>>>>> respect to making WiFi fast - ie getting rid of buffers as much as
>>>>> possible. Obviously the CPU impact is likely to be an unintended
>>>>> consequence. In any case, I don't know anything for sure, it was just
>>>>> a thought that went through my head when reading this.
>>>>
>>>> I did not, but on average my patch should make the queues be less full,
>>>> so I doubt it will hurt latency.
>>>
>>> I would tend to agree with that.
>>
>> Well, I don't, as it's dependent on right sizing the ring in the first place.
>
> My patch, barring strange issues elsewhere, can only make the firmware tx queues less full on
> average.
>
> If you want to test with different ring sizes, you can play with the tx_desc
> setting in the ath10k-ct driver 'fwcfg' options.
>
> http://www.candelatech.com/ath10k-10.4.php#config
>
> My testing shows that overall throughput goes down when using lots of peers
> if you have smaller numbers of txbuffers. This is because the firmware
> will typically spread its buffers over lots of peers and have smaller ampdu
> chains per transmit. An upper stack that more intelligently fed frames
> to the firmware could mitigate this, and it is not all bad anyway since
> giving everyone 64-frame ampdu chains will increase burstiness at least
> somewhat.

Making each transmission shorter is arguably the right thing to do in
the "extremely congested" scenario, though. If you have to wait your
turn behind 100 other stations for your next TXOP you'd generally want
each of those other stations to only transmit (say) 1ms instead of their
full 4ms. Yes, this will hurt aggregate throughput somewhat, but I'd
argue that in most cases the overall application performance would be
better. You're right, though, that ideally this would be managed a bit
smarter than by just running out of buffers :)

> I've always envisioned that the stuff you and Toke and others have been
> working on would help in this area, but I don't understand your stuff well
> enough to know if that is true or not.

It might, although as per above I'm not quite sure what "helps" really
means in this context. What would you expect a good behaviour to be
here? I think what you're alluding to is to limit the total number of
stations that will be allowed to have outstanding data in the firmware
at the same time, right? Here it would help a bit to know some more
details of how the firmware manages its internal queueing, and how it
schedules stations (if at all)?

BTW, are you running any of these tests with AQL enabled?

-Toke

2020-04-28 21:27:11

by Steve deRosier

Subject: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.

On Tue, Apr 28, 2020 at 1:58 PM Toke Høiland-Jørgensen <[email protected]> wrote:
>
> Ben Greear <[email protected]> writes:
>
> > On 04/28/2020 12:39 PM, Dave Taht wrote:
> >> On Tue, Apr 28, 2020 at 12:37 PM Toke Høiland-Jørgensen <[email protected]> wrote:
> >>>
> >>> Ben Greear <[email protected]> writes:
> >>>
> >>>> On 04/28/2020 07:56 AM, Steve deRosier wrote:
> >>>>> On Mon, Apr 27, 2020 at 7:54 AM <[email protected]> wrote:
> >>>>>>
> >>>>>> From: Ben Greear <[email protected]>
> >>>>>>
> >>>>>> While running tcp upload + download tests with ~200
> >>>>>> concurrent TCP streams, 1-2 processes, and 30 station
> >>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
> >>>>>> around 20% of the CPU according to perf-top, with other locking
> >>>>>> taking an additional ~15%.
> >>>>>>
> >>>>>> I believe the issue is that the ath10k driver would unlock the
> >>>>>> txqueue when a single frame could be transmitted, instead of
> >>>>>> waiting for a low water mark.
> >>>>>>
> >>>>>> So, this patch adds a low-water mark that is 1/4 of the total
> >>>>>> tx buffers allowed.
> >>>>>>
> >>>>>> This appears to resolve the performance problem that I saw.
> >>>>>>
> >>>>>> Tested with recent wave-1 ath10k-ct firmware.
> >>>>>>
> >>>>>
> >>>>> Hey Ben,
> >>>>>
> >>>>> Did you do any testing with this patch around latency? The nature of
> >>>>> the thing that you fixed makes me wonder if it was intentional with
> >>>>> respect to making WiFi fast - ie getting rid of buffers as much as
> >>>>> possible. Obviously the CPU impact is likely to be an unintended
> >>>>> consequence. In any case, I don't know anything for sure, it was just
> >>>>> a thought that went through my head when reading this.
> >>>>
> >>>> I did not, but on average my patch should make the queues be less full,
> >>>> so I doubt it will hurt latency.
> >>>
> >>> I would tend to agree with that.
> >>
> >> Well, I don't, as it's dependent on right sizing the ring in the first place.
> >
> > My patch, barring strange issues elsewhere, can only make the firmware tx queues less full on
> > average.
> >
> > If you want to test with different ring sizes, you can play with the tx_desc
> > setting in the ath10k-ct driver 'fwcfg' options.
> >
> > http://www.candelatech.com/ath10k-10.4.php#config
> >
> > My testing shows that overall throughput goes down when using lots of peers
> > if you have smaller numbers of txbuffers. This is because the firmware
> > will typically spread its buffers over lots of peers and have smaller ampdu
> > chains per transmit. An upper stack that more intelligently fed frames
> > to the firmware could mitigate this, and it is not all bad anyway since
> > giving everyone 64-frame ampdu chains will increase burstiness at least
> > somewhat.
>
> Making each transmission shorter is arguably the right thing to do in
> the "extremely congested" scenario, though. If you have to wait your
> turn behind 100 other stations for your next TXOP you'd generally want
> each of those other stations to only transmit (say) 1ms instead of their
> full 4ms. Yes, this will hurt aggregate throughput somewhat, but I'd
> argue that in most cases the overall application performance would be
> better. You're right, though, that ideally this would be managed a bit
> smarter than by just running out of buffers :)
>
> > I've always envisioned that the stuff you and Toke and others have been
> > working on would help in this area, but I don't understand your stuff well
> > enough to know if that is true or not.
>
> It might, although as per above I'm not quite sure what "helps" really
> means in this context. What would you expect a good behaviour to be
> here? I think what you're alluding to is to limit the total number of
> stations that will be allowed to have outstanding data in the firmware
> at the same time, right? Here it would help a bit to know some more
> details of how the firmware manages its internal queueing, and how it
> schedules stations (if at all)?
>
> BTW, are you running any of these tests with AQL enabled?
>

I don't know if Ben is doing so, but I will be doing so here very
soon. I've got some unrelated things in the way to clear up first, but
within a couple of weeks we hope to be testing AQL with ath10k-ct
firmware and driver.

And thanks, everyone, for the discussion; it's been very interesting and
useful to me. Hopefully we can improve ath10k even more with this
information.

- Steve