LinuxLists.cc - Regarding .wake_tx

2020-05-04 19:48:57

Subject: Regarding .wake_tx_queue() model

Hello,

Currently switching a driver to .wake_tx_queue() model, and I would
appreciate some guidance over a couple of issues.

My hardware exposes 1 FIFO per ac, so the current driver basically
queue skb in the correct fifo from drv_tx(), and once a FIFO is big
"enough" (packet count or total duration), I use
ieee80211_stop_queue(), and the corresponding ieee80211_wait_queue()
in tx completion.

Current driver TX flow is:
- drv_tx() => push into FIFO
- drv_tx() => push into FIFO
- drv_tx() => push into FIFO => FIFO full => ieee80211_stop_queue()
- [drv_tx won't be called]
- tx completion event => ieee80211_wake_queue()
- drv_tx()
[...]

1) drv_tx() & drv_wake_tx_queue() concurrency

With the .wake_tx_queue model, there are now two entry points in the
driver, how does the stack ensure that drv_tx() is not blocked forever
if there is concurrent traffic on the same AC ?

for example:

- .wake_tx_queue() => ieee80211_next_txq() => ieee80211_tx_dequeue() => FIFO now full => ieee80211_stop_queue()
- tx completion event => ieee80211_wake_queue()
- .wake_tx_queue() => ieee80211_next_txq() => ieee80211_tx_dequeue() => FIFO now full => ieee80211_stop_queue()
- tx completion event => ieee80211_wake_queue()
- [...]

ieee80211_wake_queue() will schedule both tx_pending_tasklet and
wake_txqs_tasklet, but I don't think there is any guarantee in the
call ordering.

Is it the driver's duty to leave a bit of room during
drv_wake_tx_queue() scheduling and not stop the queues from there ?

2) ieee80211_stop_queue() vs drv_wake_tx_queue()

Commit 21a5d4c3a45ca608477a083096cfbce76e449a0c made it so that
ieee80211_tx_dequeue() will return nothing if hardware queue is
stopped, but drv_wake_tx_queue() is still called for ac whose queue is
stopped.

so should I do this ?

- .wake_tx_queue() => ieee80211_next_txq() => ieee80211_tx_dequeue() => FIFO now full => ieee80211_stop_queue()
- .wake_tx_queue() => ieee80211_next_txq() => ieee80211_tx_dequeue() => NULL => return
- tx completion event => ieee80211_wake_queue()
- .wake_tx_queue() => ieee80211_next_txq() => ieee80211_tx_dequeue() => FIFO now full => ieee80211_stop_queue()
- [...]

or this ?

- .wake_tx_queue() => ieee80211_queue_stopped() => ieee80211_next_txq() => ieee80211_tx_dequeue() => FIFO now full => ieee80211_stop_queue()
- .wake_tx_queue() => ieee80211_queue_stopped() => return

associated commit log only mentions edge cases (channel switch, DFS),
so I'm not sure if ieee80211_stop_queue() for txqs was intended for
"fast path", also see 3)

3) _ieee80211_wake_txqs() looks buggy:

If the cab_queue is not stopped, this loop will unconditionally wake
up all txqs, which I guess is not what was intended:

for (i = 0; i < local->hw.queues; i++) {
if (local->queue_stop_reasons[i])
continue;

for (ac = 0; ac < n_acs; ac++) {
int ac_queue = sdata->vif.hw_queue[ac];

if (ac_queue == i ||
sdata->vif.cab_queue == i)
__ieee80211_wake_txqs(sdata, ac);
}

4) building aggregation in the driver:

I'm doing aggregation on the host side rather than in the firmware,
which will ends up with more or less the same code as ath9k, is there
any on-going effort to move that code from the driver into the stack ?

Thanks,

--
Maxime

2020-05-05 12:07:24

by Toke Høiland-Jørgensen

[permalink] [raw]

Subject: Re: Regarding .wake_tx_queue() model

Maxime Bizon <[email protected]> writes:

> Hello,
>
> Currently switching a driver to .wake_tx_queue() model

Yay :)

> and I would appreciate some guidance over a couple of issues.
>
> My hardware exposes 1 FIFO per ac, so the current driver basically
> queue skb in the correct fifo from drv_tx(), and once a FIFO is big
> "enough" (packet count or total duration), I use
> ieee80211_stop_queue(), and the corresponding ieee80211_wait_queue()
> in tx completion.
>
> Current driver TX flow is:
> - drv_tx() => push into FIFO
> - drv_tx() => push into FIFO
> - drv_tx() => push into FIFO => FIFO full => ieee80211_stop_queue()
> - [drv_tx won't be called]
> - tx completion event => ieee80211_wake_queue()
> - drv_tx()
> [...]
>
>
> 1) drv_tx() & drv_wake_tx_queue() concurrency
>
> With the .wake_tx_queue model, there are now two entry points in the
> driver, how does the stack ensure that drv_tx() is not blocked forever
> if there is concurrent traffic on the same AC ?
>
>
> for example:
>
> - .wake_tx_queue() => ieee80211_next_txq() => ieee80211_tx_dequeue() => FIFO now full => ieee80211_stop_queue()
> - tx completion event => ieee80211_wake_queue()
> - .wake_tx_queue() => ieee80211_next_txq() => ieee80211_tx_dequeue() => FIFO now full => ieee80211_stop_queue()
> - tx completion event => ieee80211_wake_queue()
> - [...]
>
> ieee80211_wake_queue() will schedule both tx_pending_tasklet and
> wake_txqs_tasklet, but I don't think there is any guarantee in the
> call ordering.
>
> Is it the driver's duty to leave a bit of room during
> drv_wake_tx_queue() scheduling and not stop the queues from there ?

Yeah, this is basically up to the driver. I'm mostly familiar with
ath9k, and I think basically what that does is that it doesn't fill the
HW FIFO in normal operation: For data packets being pulled off
ieee80211_tx_dequeue() it'll only queue two aggregates in the hardware
at a time. This is a good thing! We want the packets to be queued on the
mac80211 TXQs not in a dumb HW FIFO causing bufferbloat!

Given that you're building aggregates in the driver, you could just do
the same thing as ath9k and likely get pretty good results, I think :)

> 2) ieee80211_stop_queue() vs drv_wake_tx_queue()
>
> Commit 21a5d4c3a45ca608477a083096cfbce76e449a0c made it so that
> ieee80211_tx_dequeue() will return nothing if hardware queue is
> stopped, but drv_wake_tx_queue() is still called for ac whose queue is
> stopped.
>
>
> so should I do this ?
>
> - .wake_tx_queue() => ieee80211_next_txq() => ieee80211_tx_dequeue() => FIFO now full => ieee80211_stop_queue()
> - .wake_tx_queue() => ieee80211_next_txq() => ieee80211_tx_dequeue() => NULL => return
> - tx completion event => ieee80211_wake_queue()
> - .wake_tx_queue() => ieee80211_next_txq() => ieee80211_tx_dequeue() => FIFO now full => ieee80211_stop_queue()
> - [...]
>
> or this ?
>
> - .wake_tx_queue() => ieee80211_queue_stopped() => ieee80211_next_txq() => ieee80211_tx_dequeue() => FIFO now full => ieee80211_stop_queue()
> - .wake_tx_queue() => ieee80211_queue_stopped() => return
>
> associated commit log only mentions edge cases (channel switch, DFS),
> so I'm not sure if ieee80211_stop_queue() for txqs was intended for
> "fast path", also see 3)

I don't think ieee80211_stop_queue() is meant to be used this way at all
in the wake_tx_queue case. Rather, when you get a wake_tx_queue()
callback, you just queue as many frames as you feel like (see '2
aggregate' limit above), and then return. Then, on a TX completion you
just call your internal driver function that tries to pull more frames
from the mac80211 TXQs.

You'll keep getting wake_tx_queue callbacks from mac80211, but there's
nothing saying you have to pull any frames on each one.

See ath_txq_schedule() for how ath9k does this :)

> 3) _ieee80211_wake_txqs() looks buggy:
>
> If the cab_queue is not stopped, this loop will unconditionally wake
> up all txqs, which I guess is not what was intended:
>
> for (i = 0; i < local->hw.queues; i++) {
> if (local->queue_stop_reasons[i])
> continue;
>
> for (ac = 0; ac < n_acs; ac++) {
> int ac_queue = sdata->vif.hw_queue[ac];
>
> if (ac_queue == i ||
> sdata->vif.cab_queue == i)
> __ieee80211_wake_txqs(sdata, ac);
> }

(not sure about this none)

> 4) building aggregation in the driver:
>
> I'm doing aggregation on the host side rather than in the firmware,
> which will ends up with more or less the same code as ath9k, is there
> any on-going effort to move that code from the driver into the stack ?

Not aware of any on-going efforts, no. Something like this usually
happens because someone feels it would make their life easier. Say, if
they're writing a new driver and wants to re-use code :)

Looking at the ath9k code, ath_tx_form_aggr() is tied into the internal
driver buffer representations, so I'm not sure how much work it would be
to generalise and split out parts of it. It need not be a complete
"build me an aggregate" function that you move into mac80211, though,
even some utility functions to calculate padding etc might be shareable?
I guess that if you're copying code from there I guess you'll find out :)

-Toke

2020-05-05 13:28:57

by Maxime Bizon

[permalink] [raw]

Subject: Re: Regarding .wake_tx_queue() model

On Tuesday 05 May 2020 ? 14:06:18 (+0200), Toke H?iland-J?rgensen wrote:

Hello Toke,

thanks for your comments

> I don't think ieee80211_stop_queue() is meant to be used this way at all
> in the wake_tx_queue case. Rather, when you get a wake_tx_queue()
> callback, you just queue as many frames as you feel like (see '2
> aggregate' limit above), and then return. Then, on a TX completion you
> just call your internal driver function that tries to pull more frames
> from the mac80211 TXQs.

alright, indeed .wake_tx_queue() does not have to result in any actual
frame sent.

but what about .release_buffered_frames then ?

.release_buffered_frames() and .drv_tx() are both "push" oriented
interface. When they are called, you have to push a frame to the
hardware, so if they are called when hardware FIFO is full, you need
to drop the frame (or add yet another intermediate level of queuing)

from what I can see, .release_buffered_frames() will happily be called
even if queue is stopped, and you have to at least send one frame.

I'm just trying to avoid any additionnal intermediate queing in the
driver between what mac80211 gives me and the HW fifo which has a
fixed size (well actually you can choose the size, but only at init
time).

my current workaround is to pre-size the hw fifo queue to handle what
I think is the worst case

Also .release_buffered_frames() codepath is difficult to test, how do
you trigger that reliably ? assuming VIF is an AP, then you need the
remote STA to go to sleep even though you have traffic waiting for it
in the txqi. For now I patch the stack to introduce artificial delay.

Since my hardware has a sticky per-STA PS filter with good status
reporting, I'm considering using ieee80211_sta_block_awake() and only
unblock STA when all its txqi are empty to get rid of
.release_buffered_frames() complexity.

--
Maxime

2020-05-05 13:56:18

by Toke Høiland-Jørgensen

[permalink] [raw]

Subject: Re: Regarding .wake_tx_queue() model

Maxime Bizon <[email protected]> writes:

> On Tuesday 05 May 2020 à 14:06:18 (+0200), Toke Høiland-Jørgensen wrote:
>
> Hello Toke,
>
> thanks for your comments
>
>> I don't think ieee80211_stop_queue() is meant to be used this way at all
>> in the wake_tx_queue case. Rather, when you get a wake_tx_queue()
>> callback, you just queue as many frames as you feel like (see '2
>> aggregate' limit above), and then return. Then, on a TX completion you
>> just call your internal driver function that tries to pull more frames
>> from the mac80211 TXQs.
>
> alright, indeed .wake_tx_queue() does not have to result in any actual
> frame sent.
>
> but what about .release_buffered_frames then ?
>
> .release_buffered_frames() and .drv_tx() are both "push" oriented
> interface. When they are called, you have to push a frame to the
> hardware, so if they are called when hardware FIFO is full, you need
> to drop the frame (or add yet another intermediate level of queuing)
>
> from what I can see, .release_buffered_frames() will happily be called
> even if queue is stopped, and you have to at least send one frame.
>
> I'm just trying to avoid any additionnal intermediate queing in the
> driver between what mac80211 gives me and the HW fifo which has a
> fixed size (well actually you can choose the size, but only at init
> time).
>
> my current workaround is to pre-size the hw fifo queue to handle what
> I think is the worst case

Well, I think that should be fine? Having a longer HW queue is fine, as
long as you have some other logic to not fill it all the time (like the
"max two aggregates" logic I mentioned before). I think this is what
ath9k does too. At least it looks like both drv_tx() and
release_buffered_frames() will just abort (and drop in the former case)
if there is no HW buffer space left.

> Also .release_buffered_frames() codepath is difficult to test, how do
> you trigger that reliably ? assuming VIF is an AP, then you need the
> remote STA to go to sleep even though you have traffic waiting for it
> in the txqi. For now I patch the stack to introduce artificial delay.
>
> Since my hardware has a sticky per-STA PS filter with good status
> reporting, I'm considering using ieee80211_sta_block_awake() and only
> unblock STA when all its txqi are empty to get rid of
> .release_buffered_frames() complexity.

I'm probably not the right person to answer that; never did have a good
grip on the details of PS support.

What hardware is it you're writing a driver for, BTW, and are you
planning to upstream it? :)

-Toke

2020-05-05 15:21:04

by Maxime Bizon

[permalink] [raw]

Subject: Re: Regarding .wake_tx_queue() model

On Tuesday 05 May 2020 ? 15:53:08 (+0200), Toke H?iland-J?rgensen wrote:

> Well, I think that should be fine? Having a longer HW queue is fine, as
> long as you have some other logic to not fill it all the time (like the
> "max two aggregates" logic I mentioned before). I think this is what
> ath9k does too. At least it looks like both drv_tx() and
> release_buffered_frames() will just abort (and drop in the former case)
> if there is no HW buffer space left.

Ok

BTW, the "max two aggregates" rule, why is it based on number of
frames and not duration ? if you are queing 1500 bytes @1Mbit/s, even
one frame is enough, but not so for faster rates.

It would be even better if minstrel could limit the total duration
when computing number of hardware retries, and then mac80211 could
handle software retries for those really slow packets, no hardware
FIFO "pollution"

> > Also .release_buffered_frames() codepath is difficult to test, how do
> > you trigger that reliably ? assuming VIF is an AP, then you need the
> > remote STA to go to sleep even though you have traffic waiting for it
> > in the txqi. For now I patch the stack to introduce artificial delay.
> >
> > Since my hardware has a sticky per-STA PS filter with good status
> > reporting, I'm considering using ieee80211_sta_block_awake() and only
> > unblock STA when all its txqi are empty to get rid of
> > .release_buffered_frames() complexity.
>
> I'm probably not the right person to answer that; never did have a good
> grip on the details of PS support.

Hopefully Felix or Johannes will see this.

Just wondering if there is anything I'm missing, this alternative way
of doing looks easier to me because it removes "knowledge" of frame
delivery under service period from the driver:

1) first get rid of buffered txq traffic when entering PS:

--- a/net/mac80211/rx.c
+++ b/net/mac80211/rx.c
@@ -1593,6 +1593,15 @@ static void sta_ps_start(struct sta_info *sta)
list_del_init(&txqi->schedule_order);
spin_unlock(&local->active_txq_lock[txq->ac]);

- if (txq_has_queue(txq))
- set_bit(tid, &sta->txq_buffered_tids);
- else
- clear_bit(tid, &sta->txq_buffered_tids);
+ /* transfer txq into tx_filtered frames */
+ spin_lock(&fq->lock);
+ while ((skb = skb_dequeue(&txq->frags)))
+ skb_queue_tail(&sta->tx_filtered[txq->ac], skb);
+ /* use something more efficient like fq_tin_reset */
+ while ((skb = fq_tin_dequeue(fq, tin, fq_tin_dequeue_func)))
+ skb_queue_tail(&sta->tx_filtered[txq->ac], skb);
+ spin_unlock_bh(&fq->lock);
+

2) driver register for STA_NOTIFY_SLEEP

3) driver count tx frames pending in the hardware per STA and sets
ieee80211_sta_block_awake(sta, 1) when > 0

4) on tx completion, if STA is sleeping and number of pending tx frames in hardware for a
given STA reaches 0:
- if driver buffers other frames for this STA, release them with TX_FILTERED in reverse order
- calls ieee80211_sta_block_awake(false)

what do you think ?

> What hardware is it you're writing a driver for, BTW, and are you
> planning to upstream it? :)

that's a rewrite of the mwl8k driver targeting the same hardware, but
with a different firmware interface.

if I can bring it on par with the existing set of supported hardware
and features, I could try to upstream it yes.

--
Maxime

2020-05-05 16:54:33

by Toke Høiland-Jørgensen

[permalink] [raw]

Subject: Re: Regarding .wake_tx_queue() model

Maxime Bizon <[email protected]> writes:

> On Tuesday 05 May 2020 à 15:53:08 (+0200), Toke Høiland-Jørgensen wrote:
>
>> Well, I think that should be fine? Having a longer HW queue is fine, as
>> long as you have some other logic to not fill it all the time (like the
>> "max two aggregates" logic I mentioned before). I think this is what
>> ath9k does too. At least it looks like both drv_tx() and
>> release_buffered_frames() will just abort (and drop in the former case)
>> if there is no HW buffer space left.
>
> Ok
>
> BTW, the "max two aggregates" rule, why is it based on number of
> frames and not duration ? if you are queing 1500 bytes @1Mbit/s, even
> one frame is enough, but not so for faster rates.

It's the minimum amount that works - assuming you get a TX completion
when one is done, the CPU has time to build the next one before the
second one is done, and you avoid starvation. Note this only works well
for aggregates, since their size tend to vary with rate; if you're
queueing individual packets to the HWQ you need something that takes
rate into account, which is what AQL (Airtime Queue Limits) does for
ath10k.

> It would be even better if minstrel could limit the total duration
> when computing number of hardware retries, and then mac80211 could
> handle software retries for those really slow packets, no hardware
> FIFO "pollution"

Minstrel will compute the max aggregate size based on the rate, which is
why the "two aggregates" scheme works. It likely could be smarter about
limiting the number of retries, as you say, but we never did get around
to doing anything about that :)

>> > Also .release_buffered_frames() codepath is difficult to test, how do
>> > you trigger that reliably ? assuming VIF is an AP, then you need the
>> > remote STA to go to sleep even though you have traffic waiting for it
>> > in the txqi. For now I patch the stack to introduce artificial delay.
>> >
>> > Since my hardware has a sticky per-STA PS filter with good status
>> > reporting, I'm considering using ieee80211_sta_block_awake() and only
>> > unblock STA when all its txqi are empty to get rid of
>> > .release_buffered_frames() complexity.
>>
>> I'm probably not the right person to answer that; never did have a good
>> grip on the details of PS support.
>
> Hopefully Felix or Johannes will see this.
>
> Just wondering if there is anything I'm missing, this alternative way
> of doing looks easier to me because it removes "knowledge" of frame
> delivery under service period from the driver:
>
>
> 1) first get rid of buffered txq traffic when entering PS:
>
> --- a/net/mac80211/rx.c
> +++ b/net/mac80211/rx.c
> @@ -1593,6 +1593,15 @@ static void sta_ps_start(struct sta_info *sta)
> list_del_init(&txqi->schedule_order);
> spin_unlock(&local->active_txq_lock[txq->ac]);
>
> - if (txq_has_queue(txq))
> - set_bit(tid, &sta->txq_buffered_tids);
> - else
> - clear_bit(tid, &sta->txq_buffered_tids);
> + /* transfer txq into tx_filtered frames */
> + spin_lock(&fq->lock);
> + while ((skb = skb_dequeue(&txq->frags)))
> + skb_queue_tail(&sta->tx_filtered[txq->ac], skb);
> + /* use something more efficient like fq_tin_reset */
> + while ((skb = fq_tin_dequeue(fq, tin, fq_tin_dequeue_func)))
> + skb_queue_tail(&sta->tx_filtered[txq->ac], skb);
> + spin_unlock_bh(&fq->lock);

This seems like a bad idea; we want the TXQ mechanism to decide which
frame to send on wakeup.

> 2) driver register for STA_NOTIFY_SLEEP
>
> 3) driver count tx frames pending in the hardware per STA and sets
> ieee80211_sta_block_awake(sta, 1) when > 0
>
> 4) on tx completion, if STA is sleeping and number of pending tx frames in hardware for a
> given STA reaches 0:
> - if driver buffers other frames for this STA, release them with TX_FILTERED in reverse order
> - calls ieee80211_sta_block_awake(false)
>
> what do you think ?

As I said, I'm not an expert on the PS code, so I may be missing
something. But it seems to me that in a model where the driver pulls the
frames from mac80211 (i.e., for drivers using wake_tx_queue), there
really is no way around having a way to instruct the driver "please use
these flags for the next N frames you send" - which is what
release_buffered_frames() does. What you're suggesting is basically
turning off this 'pull mode' for the frames buffered during PS and have
mac80211 revert to push mode for those, right? But then you lose the
benefits of pull mode (the TXQs) for those frames.

I remember Johannes talking about a 'shim layer' between the mac80211
TXQs and the 'drv_tx()' hook as a way to bring the benefits of the TXQs
to the 'long tail' of simple drivers that don't do any internal
buffering anyway, without having to change the drivers to use 'pull
mode'. Am I wrong in thinking that mwl8k may be a good candidate for
such a layer? From glancing through the existing driver it looks like
it's mostly just taking each frame, wrapping it in a HW descriptor, and
sticking it on a TX ring?

>> What hardware is it you're writing a driver for, BTW, and are you
>> planning to upstream it? :)
>
> that's a rewrite of the mwl8k driver targeting the same hardware, but
> with a different firmware interface.
>
> if I can bring it on par with the existing set of supported hardware
> and features, I could try to upstream it yes.

That would be awesome! :)

-Toke

2020-05-05 17:52:46

by Maxime Bizon

[permalink] [raw]

Subject: Re: Regarding .wake_tx_queue() model

On Tuesday 05 May 2020 ? 18:50:45 (+0200), Toke H?iland-J?rgensen wrote:

> This seems like a bad idea; we want the TXQ mechanism to decide which
> frame to send on wakeup.

.release_buffered_frames() is only needed/used if STA went into
powersave while packets were already sitting inside txqi, that's an
edge case.

In the other much more common case (STA went into sleep without any
traffic pending in txqi), then the "classic" ps delivery code is used:
frames gets pulled from ps_tx_buf queue (1 by 1 for ps poll, more for
uapsd), and those frames ends up being sent through drv_tx(), since
they have the flag IEEE80211_TX_CTRL_PS_RESPONSE so they bypass txqi.

so I was just looking at removing that edge case, sending those frames
back to ps_tx_buf() from the driver.

> really is no way around having a way to instruct the driver "please use
> these flags for the next N frames you send" - which is what
> release_buffered_frames() does. What you're suggesting is basically
> turning off this 'pull mode' for the frames buffered during PS and have
> mac80211 revert to push mode for those, right? But then you lose the
> benefits of pull mode (the TXQs) for those frames.

I just want to give those back to mac80211, those frames were already
in push mode anyway.

> I remember Johannes talking about a 'shim layer' between the mac80211
> TXQs and the 'drv_tx()' hook as a way to bring the benefits of the TXQs
> to the 'long tail' of simple drivers that don't do any internal
> buffering anyway, without having to change the drivers to use 'pull
> mode'. Am I wrong in thinking that mwl8k may be a good candidate for
> such a layer? From glancing through the existing driver it looks like
> it's mostly just taking each frame, wrapping it in a HW descriptor, and
> sticking it on a TX ring?

maybe with the current firmware interface, but with the new one
aggregation is done on host side, so tx path is no more that simple.

--
Maxime

2020-05-05 20:50:12

by Toke Høiland-Jørgensen

[permalink] [raw]

Subject: Re: Regarding .wake_tx_queue() model

Maxime Bizon <[email protected]> writes:

> On Tuesday 05 May 2020 à 18:50:45 (+0200), Toke Høiland-Jørgensen wrote:
>
>> This seems like a bad idea; we want the TXQ mechanism to decide which
>> frame to send on wakeup.
>
> .release_buffered_frames() is only needed/used if STA went into
> powersave while packets were already sitting inside txqi, that's an
> edge case.
>
> In the other much more common case (STA went into sleep without any
> traffic pending in txqi), then the "classic" ps delivery code is used:
> frames gets pulled from ps_tx_buf queue (1 by 1 for ps poll, more for
> uapsd), and those frames ends up being sent through drv_tx(), since
> they have the flag IEEE80211_TX_CTRL_PS_RESPONSE so they bypass txqi.

Ah, I see, and if there are a lot of outstanding frames the client is
supposed to wake up and resume regular operation? As I said, I really
don't know much about how PS works; but I'm enjoying learning about it,
so thanks! :)

> so I was just looking at removing that edge case, sending those frames
> back to ps_tx_buf() from the driver.

Right, that makes sense. But I guess this is only something you can do
if you never buffer frames in the driver, no? E.g., ath9k has its own
internal retry queue, so it needs the callback to train that; and once
the callback is there, extending it to pull from the TXQs is quite
straight forward... So it's not necessarily a generally-applicable
optimisation, is what I mean.

>> really is no way around having a way to instruct the driver "please use
>> these flags for the next N frames you send" - which is what
>> release_buffered_frames() does. What you're suggesting is basically
>> turning off this 'pull mode' for the frames buffered during PS and have
>> mac80211 revert to push mode for those, right? But then you lose the
>> benefits of pull mode (the TXQs) for those frames.
>
> I just want to give those back to mac80211, those frames were already
> in push mode anyway.

Gotcha.

>> I remember Johannes talking about a 'shim layer' between the mac80211
>> TXQs and the 'drv_tx()' hook as a way to bring the benefits of the TXQs
>> to the 'long tail' of simple drivers that don't do any internal
>> buffering anyway, without having to change the drivers to use 'pull
>> mode'. Am I wrong in thinking that mwl8k may be a good candidate for
>> such a layer? From glancing through the existing driver it looks like
>> it's mostly just taking each frame, wrapping it in a HW descriptor, and
>> sticking it on a TX ring?
>
> maybe with the current firmware interface, but with the new one
> aggregation is done on host side, so tx path is no more that simple.

Right, OK. Is this just a different firmware that's generally available,
or is it a new thing? I am generally a fan of moving logic out of the
firmware like this...

-Toke

2020-05-25 09:47:51

by Johannes Berg

[permalink] [raw]

Subject: Re: Regarding .wake_tx_queue() model

On Tue, 2020-05-05 at 17:20 +0200, Maxime Bizon wrote:

> > > Also .release_buffered_frames() codepath is difficult to test, how do
> > > you trigger that reliably ? assuming VIF is an AP, then you need the
> > > remote STA to go to sleep even though you have traffic waiting for it
> > > in the txqi. For now I patch the stack to introduce artificial delay.
> > >
> > > Since my hardware has a sticky per-STA PS filter with good status
> > > reporting, I'm considering using ieee80211_sta_block_awake() and only
> > > unblock STA when all its txqi are empty to get rid of
> > > .release_buffered_frames() complexity.
> >
> > I'm probably not the right person to answer that; never did have a good
> > grip on the details of PS support.
>
> Hopefully Felix or Johannes will see this.
>
> Just wondering if there is anything I'm missing, this alternative way
> of doing looks easier to me because it removes "knowledge" of frame
> delivery under service period from the driver:

This stuff is a mess. I had a plan once to just rip it all out and
combine it all with the TXQs, but not only is it hard to test, we've
also offloaded this stuff to the firmware for our devices, so motivation
is pretty low.

> 1) first get rid of buffered txq traffic when entering PS:
>
> --- a/net/mac80211/rx.c
> +++ b/net/mac80211/rx.c
> @@ -1593,6 +1593,15 @@ static void sta_ps_start(struct sta_info *sta)
> list_del_init(&txqi->schedule_order);
> spin_unlock(&local->active_txq_lock[txq->ac]);
>
> - if (txq_has_queue(txq))
> - set_bit(tid, &sta->txq_buffered_tids);
> - else
> - clear_bit(tid, &sta->txq_buffered_tids);
> + /* transfer txq into tx_filtered frames */
> + spin_lock(&fq->lock);
> + while ((skb = skb_dequeue(&txq->frags)))
> + skb_queue_tail(&sta->tx_filtered[txq->ac], skb);
> + /* use something more efficient like fq_tin_reset */
> + while ((skb = fq_tin_dequeue(fq, tin, fq_tin_dequeue_func)))
> + skb_queue_tail(&sta->tx_filtered[txq->ac], skb);
> + spin_unlock_bh(&fq->lock);
> +
>
> 2) driver register for STA_NOTIFY_SLEEP
>
> 3) driver count tx frames pending in the hardware per STA and sets
> ieee80211_sta_block_awake(sta, 1) when > 0
>
> 4) on tx completion, if STA is sleeping and number of pending tx frames in hardware for a
> given STA reaches 0:
> - if driver buffers other frames for this STA, release them with TX_FILTERED in reverse order
> - calls ieee80211_sta_block_awake(false)
>
> what do you think ?

I really don't know, off the top of my head, sorry.

johannes

2020-05-26 10:40:44

by Maxime Bizon

[permalink] [raw]

Subject: Re: Regarding .wake_tx_queue() model

On Monday 25 May 2020 ? 11:47:01 (+0200), Johannes Berg wrote:

> This stuff is a mess. I had a plan once to just rip it all out and
> combine it all with the TXQs, but not only is it hard to test, we've
> also offloaded this stuff to the firmware for our devices, so motivation
> is pretty low.

I understand.

I have softmac type of devices (ath9k type, host side aggregation)
that I'd like to keep updating for awhile, so I'd rather have a
helping stack.

If anyone here has the experience, time and testing capabilities to do
this big rework/cleanup, I may be able to get this funded, drop me an
email.

--
Maxime