2022-09-15 13:24:09

by Alexander Wetzel

[permalink] [raw]
Subject: [PATCH] mac80211: Ensure vif queues are operational after start

Make sure local->queue_stop_reasons and vif.txqs_stopped stay in sync.

When a new vif is created the queues may end up in an inconsistent state
and be inoperable:
Communication not using iTXQ will work, allowing to e.g. complete the
association. But the 4-way handshake will time out. The sta will not
send out any skbs queued in iTXQs.

All normal attempts to start the queues will fail when reaching this
state.
local->queue_stop_reasons will have marked all queues as operational but
vif.txqs_stopped will still be set, creating an inconsistent internal
state.

In reality this seems to be race between the mac80211 function
ieee80211_do_open() setting SDATA_STATE_RUNNING and the wake_txqs_tasklet:
Depending on the driver and the timing the queues may end up to be
operational or not.

Cc: [email protected]
Fixes: f856373e2f31 ("wifi: mac80211: do not wake queues on a vif that is being stopped")
Signed-off-by: Alexander Wetzel <[email protected]>
---
net/mac80211/util.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/mac80211/util.c b/net/mac80211/util.c
index 0ea5d50091dc..bf7461c41bef 100644
--- a/net/mac80211/util.c
+++ b/net/mac80211/util.c
@@ -301,14 +301,14 @@ static void __ieee80211_wake_txqs(struct ieee80211_sub_if_data *sdata, int ac)
local_bh_disable();
spin_lock(&fq->lock);

+ sdata->vif.txqs_stopped[ac] = false;
+
if (!test_bit(SDATA_STATE_RUNNING, &sdata->state))
goto out;

if (sdata->vif.type == NL80211_IFTYPE_AP)
ps = &sdata->bss->ps;

- sdata->vif.txqs_stopped[ac] = false;
-
list_for_each_entry_rcu(sta, &local->sta_list, list) {
if (sdata != sta->sdata)
continue;
--
2.37.3


2022-09-15 16:25:42

by Felix Fietkau

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Ensure vif queues are operational after start


On 15.09.22 15:09, Alexander Wetzel wrote:
> Make sure local->queue_stop_reasons and vif.txqs_stopped stay in sync.
>
> When a new vif is created the queues may end up in an inconsistent state
> and be inoperable:
> Communication not using iTXQ will work, allowing to e.g. complete the
> association. But the 4-way handshake will time out. The sta will not
> send out any skbs queued in iTXQs.
>
> All normal attempts to start the queues will fail when reaching this
> state.
> local->queue_stop_reasons will have marked all queues as operational but
> vif.txqs_stopped will still be set, creating an inconsistent internal
> state.
>
> In reality this seems to be race between the mac80211 function
> ieee80211_do_open() setting SDATA_STATE_RUNNING and the wake_txqs_tasklet:
> Depending on the driver and the timing the queues may end up to be
> operational or not.
>
> Cc: [email protected]
> Fixes: f856373e2f31 ("wifi: mac80211: do not wake queues on a vif that is being stopped")
> Signed-off-by: Alexander Wetzel <[email protected]>

Acked-by: Felix Fietkau <[email protected]>

Thanks for figuring this one out,

- Felix

2022-09-15 20:23:52

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Ensure vif queues are operational after start

On 15.09.22 18:18, Felix Fietkau wrote:
>
> On 15.09.22 15:09, Alexander Wetzel wrote:
>> Make sure local->queue_stop_reasons and vif.txqs_stopped stay in sync.
>>
>> When a new vif is created the queues may end up in an inconsistent state
>> and be inoperable:
>> Communication not using iTXQ will work, allowing to e.g. complete the
>> association. But the 4-way handshake will time out. The sta will not
>> send out any skbs queued in iTXQs.
>>
>> All normal attempts to start the queues will fail when reaching this
>> state.
>> local->queue_stop_reasons will have marked all queues as operational but
>> vif.txqs_stopped will still be set, creating an inconsistent internal
>> state.
>>
>> In reality this seems to be race between the mac80211 function
>> ieee80211_do_open() setting SDATA_STATE_RUNNING and the
>> wake_txqs_tasklet:
>> Depending on the driver and the timing the queues may end up to be
>> operational or not.
>>
>> Cc: [email protected]
>> Fixes: f856373e2f31 ("wifi: mac80211: do not wake queues on a vif that
>> is being stopped")
>> Signed-off-by: Alexander Wetzel <[email protected]>
>
> Acked-by: Felix Fietkau <[email protected]>
>

I've got some doubts that my fix is correct...
While it fixes the problem in my tests it looks like we'll need another
queue restart to get the queues working again.

After all IEEE80211_TXQ_STOP_NETIF_TX will not be cleared when it has
been set by __ieee80211_stop_queue().

I'll update the patch and skip setting vif.txqs_stopped when
SDATA_STATE_RUNNING is not set. Not having IEEE80211_TXQ_STOP_NETIF_TX
set looks harmless, having it set when it should less problematic...

Alexander


2022-09-15 20:25:29

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Ensure vif queues are operational after start

>
> I've got some doubts that my fix is correct...
> While it fixes the problem in my tests it looks like we'll need another
> queue restart to get the queues working again.
>
> After all IEEE80211_TXQ_STOP_NETIF_TX will not be cleared when it has
> been set by __ieee80211_stop_queue().
>
> I'll update the patch and skip setting vif.txqs_stopped when
> SDATA_STATE_RUNNING is not set. Not having IEEE80211_TXQ_STOP_NETIF_TX
> set looks harmless, having it set when it should less problematic...

Scratch that: The patch should be ok as it is:
IEEE80211_TXQ_STOP_NETIF_TX is not set on stop, the patch should be ok
as it is.

Sorry for the noise.

Alexander

2022-09-15 20:57:04

by Ben Greear

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Ensure vif queues are operational after start

On 9/15/22 1:06 PM, Alexander Wetzel wrote:
>>
>> I've got some doubts that my fix is correct...
>> While it fixes the problem in my tests it looks like we'll need another queue restart to get the queues working again.
>>
>> After all IEEE80211_TXQ_STOP_NETIF_TX will not be cleared when it has been set by __ieee80211_stop_queue().
>>
>> I'll update the patch and skip setting vif.txqs_stopped when SDATA_STATE_RUNNING is not set. Not having IEEE80211_TXQ_STOP_NETIF_TX set looks harmless, having
>> it set when it should less problematic...
>
> Scratch that: The patch should be ok as it is: IEEE80211_TXQ_STOP_NETIF_TX is not set on stop, the patch should be ok as it is.
>
> Sorry for the noise.
>
> Alexander
>

To add to the noise...

From reading the original patch description, it was to stop an NPE when AP was stopped. I have been testing
this patch below and it fixes the problems I saw with multiple vdevs. I was worried that
the code in the 'list_for_each_entry_rcu(sta, &local->sta_list, list) {' might still need to run
to keep everything in sync (and my patch allows that to happen), but I do not know if that is true or not.

diff --git a/net/mac80211/util.c b/net/mac80211/util.c
index c768e583aad4..2b5dafe9f4cc 100644
--- a/net/mac80211/util.c
+++ b/net/mac80211/util.c
@@ -370,12 +370,6 @@ static void __ieee80211_wake_txqs(struct ieee80211_sub_if_data *sdata, int ac)
local_bh_disable();
spin_lock(&fq->lock);

- if (!test_bit(SDATA_STATE_RUNNING, &sdata->state))
- goto out;
-
- if (sdata->vif.type == NL80211_IFTYPE_AP)
- ps = &sdata->bss->ps;
-
sdata->vif.txqs_stopped[ac] = false;

list_for_each_entry_rcu(sta, &local->sta_list, list) {
@@ -408,6 +402,10 @@ static void __ieee80211_wake_txqs(struct ieee80211_sub_if_data *sdata, int ac)

txqi = to_txq_info(vif->txq);

+ if (test_bit(SDATA_STATE_RUNNING, &sdata->state))
+ if (sdata->vif.type == NL80211_IFTYPE_AP)
+ ps = &sdata->bss->ps;
+
if (!test_and_clear_bit(IEEE80211_TXQ_STOP_NETIF_TX, &txqi->flags) ||
(ps && atomic_read(&ps->num_sta_ps)) || ac != vif->txq->ac)
goto out;


Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2022-09-16 05:52:48

by Kalle Valo

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Ensure vif queues are operational after start

Alexander Wetzel <[email protected]> writes:

> Make sure local->queue_stop_reasons and vif.txqs_stopped stay in sync.
>
> When a new vif is created the queues may end up in an inconsistent state
> and be inoperable:
> Communication not using iTXQ will work, allowing to e.g. complete the
> association. But the 4-way handshake will time out. The sta will not
> send out any skbs queued in iTXQs.
>
> All normal attempts to start the queues will fail when reaching this
> state.
> local->queue_stop_reasons will have marked all queues as operational but
> vif.txqs_stopped will still be set, creating an inconsistent internal
> state.
>
> In reality this seems to be race between the mac80211 function
> ieee80211_do_open() setting SDATA_STATE_RUNNING and the wake_txqs_tasklet:
> Depending on the driver and the timing the queues may end up to be
> operational or not.
>
> Cc: [email protected]
> Fixes: f856373e2f31 ("wifi: mac80211: do not wake queues on a vif that is being stopped")
> Signed-off-by: Alexander Wetzel <[email protected]>
> ---
> net/mac80211/util.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)

The title is missing "wifi:", but no need to resend because of this. I
assume Johannes will fix it during commit.

--
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

2022-09-16 16:46:05

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Ensure vif queues are operational after start

On 15.09.22 22:39, Ben Greear wrote:

> From reading the original patch description, it was to stop an NPE when
> AP was stopped.  I have been testing
> this patch below and it fixes the problems I saw with multiple vdevs.  I
> was worried that
> the code in the 'list_for_each_entry_rcu(sta, &local->sta_list, list) {'
> might still need to run
> to keep everything in sync (and my patch allows that to happen), but I
> do not know if that is true or not.
>

I'm not sure if it's still save to wake the queues for the vif we are
tearing down and assumed the intend was to skip those, too.

But it looks like all stations for the vif are deleted prior to setting
sdata->bss = NULL, so the outcome should be the same.

Your solution removes potentially set IEEE80211_TXQ_STOP_NETIF_TX flags,
reducing the risk that a queue restart during vif setup ends up with
inoperable queues.
But the only way to set IEEE80211_TXQ_STOP_NETIF_TX seems to be during
ieee80211_tx_dequeue(). Which should not be possible to be called as
long as SDATA_STATE_RUNNING was never set for the vif.

But I'm on thin ice here:-)

Alexander

2022-09-16 17:12:43

by Ben Greear

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Ensure vif queues are operational after start

On 9/16/22 9:38 AM, Alexander Wetzel wrote:
> On 15.09.22 22:39, Ben Greear wrote:
>
>>  From reading the original patch description, it was to stop an NPE when AP was stopped.  I have been testing
>> this patch below and it fixes the problems I saw with multiple vdevs.  I was worried that
>> the code in the 'list_for_each_entry_rcu(sta, &local->sta_list, list) {' might still need to run
>> to keep everything in sync (and my patch allows that to happen), but I do not know if that is true or not.
>>
>
> I'm not sure if it's still save to wake the queues for the vif we are tearing down and assumed the intend was to skip those, too.
>
> But it looks like all stations for the vif are deleted prior to setting sdata->bss = NULL, so the outcome should be the same.
>
> Your solution removes potentially set IEEE80211_TXQ_STOP_NETIF_TX flags, reducing the risk that a queue restart during vif setup ends up with inoperable queues.
> But the only way to set IEEE80211_TXQ_STOP_NETIF_TX seems to be during ieee80211_tx_dequeue(). Which should not be possible to be called as long as
> SDATA_STATE_RUNNING was never set for the vif.
>
> But I'm on thin ice here:-)

I spent two days debugging this and have only barest understanding of how all of the atf/fq/txq logic
is supposed to work.

But, I think my test case was a bit different from yours, and in my test case, before my patch,
it failed 100% of the time:

create two station vdevs on same (mtk7916) radio
admin one up (this works)
admin second one up (this fails, tx path is hung, because sdata->vif.txqs_stopped[ac] was true).

In general, I'd like to keep start/stop state as in-sync everywhere as possible, and I think my patch
might be better at that than yours since it goes through the sta list (and, maybe too, I'm completely
wrong about that).

Potentially, static void __ieee80211_wake_txqs(struct ieee80211_sub_if_data *sdata, int ac)
should just be called each time a vdev goes admin up.
But since my original test case failed so reliably, the __ieee80211_wake_txqs method must not actually be
called when secondary vdevs go admin up. I am not sure if that is per design or some other
bug.

Hoping the 2-3 people who understand this logic well will chime in :)

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com