2019-10-25 19:40:29

by Krzysztof Hałasa

[permalink] [raw]
Subject: [PATCH v2] 802.11n IBSS: wlan0 stops receiving packets due to aggregation after sender reboot

Fix a bug where the mac80211 RX aggregation code sets a new aggregation
"session" at the remote station's request, but the head_seq_num
(the sequence number the receiver expects to receive) isn't reset.

Spotted on a pair of AR9580 in IBSS mode.

Signed-off-by: Krzysztof Halasa <[email protected]>

diff --git a/net/mac80211/agg-rx.c b/net/mac80211/agg-rx.c
index 4d1c335e06e5..67733bd61297 100644
--- a/net/mac80211/agg-rx.c
+++ b/net/mac80211/agg-rx.c
@@ -354,10 +354,13 @@ void ___ieee80211_start_rx_ba_session(struct sta_info *sta,
*/
rcu_read_lock();
tid_rx = rcu_dereference(sta->ampdu_mlme.tid_rx[tid]);
- if (tid_rx && tid_rx->timeout == timeout)
+ if (tid_rx && tid_rx->timeout == timeout) {
+ tid_rx->ssn = start_seq_num;
+ tid_rx->head_seq_num = start_seq_num;
status = WLAN_STATUS_SUCCESS;
- else
+ } else {
status = WLAN_STATUS_REQUEST_DECLINED;
+ }
rcu_read_unlock();
goto end;
}

--
Krzysztof Hałasa

ŁUKASIEWICZ Research Network
Industrial Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland


2019-10-29 10:13:50

by Krzysztof Hałasa

[permalink] [raw]
Subject: Re: [PATCH v2] 802.11n IBSS: wlan0 stops receiving packets due to aggregation after sender reboot

Johannes Berg <[email protected]> writes:

> I think you just got very lucky (or unlucky) to have the same dialog
> token, because we start from 0

Right, it seems to be the case.

> - maybe we should initialize it to a
> random value to flush out such issues.

The problem I can see is that the dialog_tokens are 8-bit, way too small
to eliminate conflicts.

> Really what I think probably happened is that one of your stations lost
> the connection to the other, and didn't tell it about it in any way - so
> the other kept all the status alive.

You must have missed my previous mail - I simply rebooted that station,
and alternatively rmmoded/modprobed ath9k. But the problem originated in
a station going out of and back in range, in fact.

> I suspect to make all this work well we need to not only have the fixes
> I made recently to actually send and parse deauth frames, but also to
> even send an auth and reset the state when we receive that, so if we
> move out of range and even the deauth frame is lost, we can still reset
> properly.

That's one thing. The other is a station trying ADDBA for the first time
after boot (while the local station has seen it before that reboot).

> In any case, this is not the right approach - we need to handle the
> "lost connection" case better I suspect, but since you don't say what
> really happened I don't really know that that's what you're seeing.

I guess we need to identify "new connection" reliably. Otherwise,
the new connections are treated as old ones and it doesn't work.

Now how can it be fixed?
--
Krzysztof Halasa

ŁUKASIEWICZ Research Network
Industrial Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland

2019-10-29 10:15:59

by Sebastian Gottschall

[permalink] [raw]
Subject: Re: [PATCH v2] 802.11n IBSS: wlan0 stops receiving packets due to aggregation after sender reboot

35 km? for 802.11n with ht40 this is out of the ack timing range the
chipset supports. so this should be considered at any troubles with
connections

Am 29.10.2019 um 09:41 schrieb Koen Vandeputte:
>
> On 28.10.19 13:21, Johannes Berg wrote:
>> On Fri, 2019-10-25 at 12:21 +0200, Krzysztof Hałasa wrote:
>>> Fix a bug where the mac80211 RX aggregation code sets a new aggregation
>>> "session" at the remote station's request, but the head_seq_num
>>> (the sequence number the receiver expects to receive) isn't reset.
>>>
>>> Spotted on a pair of AR9580 in IBSS mode.
>>>
>>> Signed-off-by: Krzysztof Halasa <[email protected]>
>>>
>>> diff --git a/net/mac80211/agg-rx.c b/net/mac80211/agg-rx.c
>>> index 4d1c335e06e5..67733bd61297 100644
>>> --- a/net/mac80211/agg-rx.c
>>> +++ b/net/mac80211/agg-rx.c
>>> @@ -354,10 +354,13 @@ void ___ieee80211_start_rx_ba_session(struct
>>> sta_info *sta,
>>>                */
>>>               rcu_read_lock();
>>>               tid_rx = rcu_dereference(sta->ampdu_mlme.tid_rx[tid]);
>>> -            if (tid_rx && tid_rx->timeout == timeout)
>>> +            if (tid_rx && tid_rx->timeout == timeout) {
>>> +                tid_rx->ssn = start_seq_num;
>>> +                tid_rx->head_seq_num = start_seq_num;
>>>                   status = WLAN_STATUS_SUCCESS;
>> This is wrong, this is the case of *updating an existing session*, we
>> must not reset the head SN then.
>>
>> I think you just got very lucky (or unlucky) to have the same dialog
>> token, because we start from 0 - maybe we should initialize it to a
>> random value to flush out such issues.
>>
>> Really what I think probably happened is that one of your stations lost
>> the connection to the other, and didn't tell it about it in any way - so
>> the other kept all the status alive.
>>
>> I suspect to make all this work well we need to not only have the fixes
>> I made recently to actually send and parse deauth frames, but also to
>> even send an auth and reset the state when we receive that, so if we
>> move out of range and even the deauth frame is lost, we can still reset
>> properly.
>>
>> In any case, this is not the right approach - we need to handle the
>> "lost connection" case better I suspect, but since you don't say what
>> really happened I don't really know that that's what you're seeing.
>>
>> johannes
>
> Hi all,
>
> I can confirm the issue as I'm also seeing this sometimes in the field
> here.
>
> Sometimes when a devices goes out of range and then re-enters,
> the link refuses to "come up", as in rx looks to be "stuck" without
> any reports in system log or locking issues (lockdep enabled)
>
> I have dozens of devices installed offshore (802.11n based), both on
> static and moving assets,
> which cover from short (250m) up to very long distances (~35km)
>
> So .. while there is some momentum for this issue,
> I'm more than happy to provide extensive testing should fixes be
> posted regarding IBSS in general.
>
> Regards,
>
> Koen
>
>

2019-10-29 10:16:35

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2] 802.11n IBSS: wlan0 stops receiving packets due to aggregation after sender reboot

On Tue, 2019-10-29 at 09:41 +0100, Koen Vandeputte wrote:

> I can confirm the issue as I'm also seeing this sometimes in the field here.
>
> Sometimes when a devices goes out of range and then re-enters,
> the link refuses to "come up", as in rx looks to be "stuck" without any
> reports in system log or locking issues (lockdep enabled)

Right. I've recently debugged this due to issues in distributed
beaconing (rather than moving in/out of range), but I guess it would be
relatively simple to reproduce this with wmediumd, if that can be
controlled dynamically?

What kernel are you running? You could check if you have

95697f9907bf ("mac80211: accept deauth frames in IBSS mode")
4b08d1b6a994 ("mac80211: IBSS: send deauth when expiring inactive STAs")

which might help somewhat, but don't fully cover the case of moving out
of range.

johannes

2019-10-29 10:18:14

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2] 802.11n IBSS: wlan0 stops receiving packets due to aggregation after sender reboot

On Tue, 2019-10-29 at 09:54 +0100, Krzysztof Hałasa wrote:

> The problem I can see is that the dialog_tokens are 8-bit, way too small
> to eliminate conflicts.

Well, they're also per station, we could just randomize the start and
then we'd delete the old session and start a new one, on the receiver.

So that would improve robustness somewhat (down to a 1/256 chance to hit
this problem).

> > Really what I think probably happened is that one of your stations lost
> > the connection to the other, and didn't tell it about it in any way - so
> > the other kept all the status alive.
>
> You must have missed my previous mail - I simply rebooted that station,
> and alternatively rmmoded/modprobed ath9k. But the problem originated in
> a station going out of and back in range, in fact.

I was on vacation, so yeah, quite possible I missed it.

Sounds like we need not just
4b08d1b6a994 ("mac80211: IBSS: send deauth when expiring inactive STAs")

but also send a deauth when we disconnect. Surprising we don't do that,
actually.

> > I suspect to make all this work well we need to not only have the fixes
> > I made recently to actually send and parse deauth frames, but also to
> > even send an auth and reset the state when we receive that, so if we
> > move out of range and even the deauth frame is lost, we can still reset
> > properly.
>
> That's one thing. The other is a station trying ADDBA for the first time
> after boot (while the local station has seen it before that reboot).

That's the situation though - the local station needs to know that it
has in fact *not* seen the same instance of the station, but that the
station has reset and needs to be removed & re-added.

> I guess we need to identify "new connection" reliably. Otherwise,
> the new connections are treated as old ones and it doesn't work.

Right. But we can implement the (optional) authentication (which you
actually already get when you implement [encrypted] IBSS with wpa_s),
and reset the station state when we get an authentication frame.

johannes

2019-10-29 10:41:08

by Koen Vandeputte

[permalink] [raw]
Subject: Re: [PATCH v2] 802.11n IBSS: wlan0 stops receiving packets due to aggregation after sender reboot


On 29.10.19 09:58, Sebastian Gottschall wrote:
> 35 km? for 802.11n with ht40 this is out of the ack timing range the
> chipset supports. so this should be considered at any troubles with
> connections
>
(Please don't top-post)

When we know a link can exceed ~ 21km, it's set to HT20 for this reason.

Koen

> Am 29.10.2019 um 09:41 schrieb Koen Vandeputte:
>>
>> On 28.10.19 13:21, Johannes Berg wrote:
>>> On Fri, 2019-10-25 at 12:21 +0200, Krzysztof Hałasa wrote:
>>>> Fix a bug where the mac80211 RX aggregation code sets a new
>>>> aggregation
>>>> "session" at the remote station's request, but the head_seq_num
>>>> (the sequence number the receiver expects to receive) isn't reset.
>>>>
>>>> Spotted on a pair of AR9580 in IBSS mode.
>>>>
>>>> Signed-off-by: Krzysztof Halasa <[email protected]>
>>>>
>>>> diff --git a/net/mac80211/agg-rx.c b/net/mac80211/agg-rx.c
>>>> index 4d1c335e06e5..67733bd61297 100644
>>>> --- a/net/mac80211/agg-rx.c
>>>> +++ b/net/mac80211/agg-rx.c
>>>> @@ -354,10 +354,13 @@ void ___ieee80211_start_rx_ba_session(struct
>>>> sta_info *sta,
>>>>                */
>>>>               rcu_read_lock();
>>>>               tid_rx = rcu_dereference(sta->ampdu_mlme.tid_rx[tid]);
>>>> -            if (tid_rx && tid_rx->timeout == timeout)
>>>> +            if (tid_rx && tid_rx->timeout == timeout) {
>>>> +                tid_rx->ssn = start_seq_num;
>>>> +                tid_rx->head_seq_num = start_seq_num;
>>>>                   status = WLAN_STATUS_SUCCESS;
>>> This is wrong, this is the case of *updating an existing session*, we
>>> must not reset the head SN then.
>>>
>>> I think you just got very lucky (or unlucky) to have the same dialog
>>> token, because we start from 0 - maybe we should initialize it to a
>>> random value to flush out such issues.
>>>
>>> Really what I think probably happened is that one of your stations lost
>>> the connection to the other, and didn't tell it about it in any way
>>> - so
>>> the other kept all the status alive.
>>>
>>> I suspect to make all this work well we need to not only have the fixes
>>> I made recently to actually send and parse deauth frames, but also to
>>> even send an auth and reset the state when we receive that, so if we
>>> move out of range and even the deauth frame is lost, we can still reset
>>> properly.
>>>
>>> In any case, this is not the right approach - we need to handle the
>>> "lost connection" case better I suspect, but since you don't say what
>>> really happened I don't really know that that's what you're seeing.
>>>
>>> johannes
>>
>> Hi all,
>>
>> I can confirm the issue as I'm also seeing this sometimes in the
>> field here.
>>
>> Sometimes when a devices goes out of range and then re-enters,
>> the link refuses to "come up", as in rx looks to be "stuck" without
>> any reports in system log or locking issues (lockdep enabled)
>>
>> I have dozens of devices installed offshore (802.11n based), both on
>> static and moving assets,
>> which cover from short (250m) up to very long distances (~35km)
>>
>> So .. while there is some momentum for this issue,
>> I'm more than happy to provide extensive testing should fixes be
>> posted regarding IBSS in general.
>>
>> Regards,
>>
>> Koen
>>
>>

2019-10-29 10:43:33

by Koen Vandeputte

[permalink] [raw]
Subject: Re: [PATCH v2] 802.11n IBSS: wlan0 stops receiving packets due to aggregation after sender reboot


On 29.10.19 10:03, Johannes Berg wrote:
> On Tue, 2019-10-29 at 09:41 +0100, Koen Vandeputte wrote:
>
>> I can confirm the issue as I'm also seeing this sometimes in the field here.
>>
>> Sometimes when a devices goes out of range and then re-enters,
>> the link refuses to "come up", as in rx looks to be "stuck" without any
>> reports in system log or locking issues (lockdep enabled)
> Right. I've recently debugged this due to issues in distributed
> beaconing (rather than moving in/out of range), but I guess it would be
> relatively simple to reproduce this with wmediumd, if that can be
> controlled dynamically?
>
> What kernel are you running? You could check if you have
>
> 95697f9907bf ("mac80211: accept deauth frames in IBSS mode")
> 4b08d1b6a994 ("mac80211: IBSS: send deauth when expiring inactive STAs")
>
> which might help somewhat, but don't fully cover the case of moving out
> of range.
>
> johannes
>
I'm running OpenWrt (kernel 4.14.150 with 4.19.79 mac80211)
I noticed these fixes last week and made a build 2 days ago with them
backported to it.
Running in the field on roughly 4 devices since a day.

Koen

2019-10-29 11:18:58

by Krzysztof Hałasa

[permalink] [raw]
Subject: Re: [PATCH v2] 802.11n IBSS: wlan0 stops receiving packets due to aggregation after sender reboot

Johannes Berg <[email protected]> writes:

>> The problem I can see is that the dialog_tokens are 8-bit, way too small
>> to eliminate conflicts.
>
> Well, they're also per station, we could just randomize the start and
> then we'd delete the old session and start a new one, on the receiver.
>
> So that would improve robustness somewhat (down to a 1/256 chance to hit
> this problem).

That was what I meant. Still, 1/256 seems hardly acceptable to me -
unless there is some work around (a short timeout or something similar).
Remember that when it doesn't work, it doesn't work - it won't recover
until the sequence catches up, which may mean basically forever.

Or, maybe the remote station can request de-aggregation first, so the
subsequent aggregation request is always treated as new?

Alternatively, perhaps the remote can signal that it's a new request and
not merely an existing session?

> That's the situation though - the local station needs to know that it
> has in fact *not* seen the same instance of the station, but that the
> station has reset and needs to be removed & re-added.

Precisely. And it seems to me that the first time the local station
learns of this is when a new, regular, non-aggregated packet arrives.
Or, when a new aggregation request arrives.
--
Krzysztof Halasa

ŁUKASIEWICZ Research Network
Industrial Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland