2016-09-01 16:29:51

by Ben Greear

[permalink] [raw]
Subject: Ath10k probe response error related to mac80211 commit.

I have a variant of the ath10k 10.1 firmware that puts all frames, including management,
over the normal htt packet transport. This has worked fine for many kernels, but
for reasons that currently escape me, the patch below breaks this firmware.

On air, I see corrupted probe responses, seems the end of the frames are corrupted.

This only breaks AP mode (station mode works fine).

There may be some other similar breakage somewhere, since git bisect could not find this
error and I had to bisect it by hand.

If someone has any idea of why this patch might trigger it, please let me know.
I'll keep digging in the meantime...

Author: Johannes Berg <[email protected]> 2015-11-26 07:26:14
Committer: Johannes Berg <[email protected]> 2015-12-04 05:43:32
Tags: ben-bad-athia-6
Parent: bda95eb1d1581cfd79e9717ebda4b7ccd2265351 (cfg80211: handle add_station auth/assoc flag quirks)
Child: 86c7ec9eb154020797c39e1cc7dafa92da02f603 (mac80211: properly free skb when r-o-c for TX fails)
Branches: master, remotes/origin/master
Follows: ben-good-athia-7
Precedes: ben-bad-5-athai

Revert "mac80211: don't advertise NL80211_FEATURE_FULL_AP_CLIENT_STATE"

This reverts commit 45bb780a2147b9995f3d288c44ecb87ca8a330e2,
the previous two patches fixed the functionality.

Signed-off-by: Johannes Berg <[email protected]>

----------------------------- net/mac80211/main.c -----------------------------
index 175ffcf..858f6b1 100644
@@ -541,7 +541,8 @@ struct ieee80211_hw *ieee80211_alloc_hw_nm(size_t priv_data_len,
NL80211_FEATURE_HT_IBSS |
NL80211_FEATURE_VIF_TXPOWER |
NL80211_FEATURE_MAC_ON_CREATE |
- NL80211_FEATURE_USERSPACE_MPM;
+ NL80211_FEATURE_USERSPACE_MPM |
+ NL80211_FEATURE_FULL_AP_CLIENT_STATE;

if (!ops->hw_scan)
wiphy->features |= NL80211_FEATURE_LOW_PRIORITY_SCAN |

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2016-09-01 21:01:53

by Johannes Berg

[permalink] [raw]
Subject: Re: Ath10k probe response error related to mac80211 commit.

On Thu, 2016-09-01 at 11:23 -0700, Ben Greear wrote:

> Could easily be that others are corrupted too, but since probe resp
> is bad, the association will not proceed.

makes sense.

> Heh, I spent 4 days tracking this down, so I wanted to be precise in
> my bug report :)

Ahrg, ouch. Sorry about that. I really didn't think the flag would
cause any issues for anyone.

> The result I see is that there is an extra 10 bytes at the end of the
> frame on air.  But, it looks like the exact same pkt is sent to the
> firmware both with and without this patch.  Maybe the firmware is
> using the wrong tid or something like that due to how the station is
> created differently with this patch.

That makes no sense though, unless this only happens on say the second
station that connects? Until the first station sends an authentication
frame, that patch really should have no impact whatsoever.

johannes

2016-09-01 21:31:28

by Ben Greear

[permalink] [raw]
Subject: Re: Ath10k probe response error related to mac80211 commit.

On 09/01/2016 11:53 AM, Johannes Berg wrote:
> On Thu, 2016-09-01 at 11:23 -0700, Ben Greear wrote:
>>
>> Could easily be that others are corrupted too, but since probe resp
>> is bad, the association will not proceed.
>
> makes sense.
>
>> Heh, I spent 4 days tracking this down, so I wanted to be precise in
>> my bug report :)
>
> Ahrg, ouch. Sorry about that. I really didn't think the flag would
> cause any issues for anyone.

It took so much time because for whatever reason git bisect couldn't
find the issue, and I spent a day thinking it was in ath10k driver and
poking hard at that.

Anyway, no worries...it happens, and could easily just be a bug in
my firmware modifications that are somehow triggered by this.

>> The result I see is that there is an extra 10 bytes at the end of the
>> frame on air. But, it looks like the exact same pkt is sent to the
>> firmware both with and without this patch. Maybe the firmware is
>> using the wrong tid or something like that due to how the station is
>> created differently with this patch.
>
> That makes no sense though, unless this only happens on say the second
> station that connects? Until the first station sends an authentication
> frame, that patch really should have no impact whatsoever.

'makes no sense' is the ath10k motto.

I have verified that it is not an obvious difference in the ath10k driver
though...copying 4.5 ath10k driver onto 4.4 works fine, and copying 4.4
ath10k driver into 4.5 is broken.

I don't think any stations connect, but I didn't look precisely for that
yet. I know the stations I'm trying to connect will not.

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2016-09-01 22:01:50

by Johannes Berg

[permalink] [raw]
Subject: Re: Ath10k probe response error related to mac80211 commit.


> If someone has any idea of why this patch might trigger it, please
> let me know.
> I'll keep digging in the meantime...

>      Revert "mac80211: don't advertise NL80211_FEATURE_FULL_AP_CLIENT_STATE"
>

With a sufficiently recent hostapd/wpa_supplicant, the patch will cause
a station entry to be added to the firmware before sending the
authentication frame.

Why, of all frames, probe response frames should be corrupted I don't
know - I could imagine auth/assoc replies being treated differently
since they are now with a station entry rather than without.

> This only breaks AP mode (station mode works fine).

It also has no impact on anything but AP mode, as even indicated by the
name of the flag :)


Anyway, I was pretty sure this was safe and it does help other drivers
to have the full state, but I guess you can make the driver opt out of
the flag again (just unset it before register_hw).

johannes

2016-09-02 13:30:18

by Ben Greear

[permalink] [raw]
Subject: Re: Ath10k probe response error related to mac80211 commit.



On 09/02/2016 05:09 AM, Michal Kazior wrote:
> On 1 September 2016 at 22:52, Ben Greear <[email protected]> wrote:
>> On 09/01/2016 11:53 AM, Johannes Berg wrote:
>>> On Thu, 2016-09-01 at 11:23 -0700, Ben Greear wrote:
>>>>
>>>> Could easily be that others are corrupted too, but since probe resp
>>>> is bad, the association will not proceed.
>>>
>>> makes sense.
>>>
>>>> Heh, I spent 4 days tracking this down, so I wanted to be precise in
>>>> my bug report :)
>>>
>>> Ahrg, ouch. Sorry about that. I really didn't think the flag would
>>> cause any issues for anyone.
>>>
>>>> The result I see is that there is an extra 10 bytes at the end of the
>>>> frame on air. But, it looks like the exact same pkt is sent to the
>>>> firmware both with and without this patch. Maybe the firmware is
>>>> using the wrong tid or something like that due to how the station is
>>>> created differently with this patch.
>>>
>>> That makes no sense though, unless this only happens on say the second
>>> station that connects? Until the first station sends an authentication
>>> frame, that patch really should have no impact whatsoever.
>>
>> Ok, I found the problem.
>>
>> In the 10.1 firmware (at least), it will force the TID to be NON-QOS-TID
>> if the peer object does not have the qos_enabled flag set. This is probably
>> a work-around for some other thing lost in antiquity.
>>
>> When using my firmware that puts mgt frames over HTT, the TID for mgt
>> frames was being over-ridden and set to non-qos TID. Due to other
>> hackery and work-arounds, mgt frames cannot be sent on the non-qos
>> TID because they will be put on-air 10 bytes too long. They must be
>> sent on the mgt-tid or non-pause tid.
>
> Sounds like 802.11 header vs 802.3 header length problem (24 - 14 =
> 10). You can hit this if you start playing around tx encap mode as
> well.
>
> You could try fooling firmware into thinking the frame has a different
> length either in htt TX_FRM or tx fragment list (or both) but since
> this seems to be related to RA/DA peer state at xmit time it's
> probably not going to be reliable unless you introduce extra tx
> flushing barriers in the driver.

The problem is that the OS sent the packet on the mgt-tid, but the firmware
over-rode this and put it on non-qos TID instead. mgt and non-pause TIDs
are 'raw', and do not need that -10 adjustment, and my logic to handle mgt frames on
the normal htt path depends on that distinction.

Probably I could fix up htt tx path to do that -10 stuff depending on eventual
TID instead of making assumptions, but if I do this, it will probably be in 10.4
as I am hoping to keep 10.1 mostly just stable fixes and the htt tx path is one
tricky beast.

Search firmware for "The tidno that DE finds needs to be overridden for non-QOS"
and you can see where this happens. I just fixed that firmware code to not override
TID if it were already >= non-qos-tid.

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2016-09-01 20:55:57

by Ben Greear

[permalink] [raw]
Subject: Re: Ath10k probe response error related to mac80211 commit.

On 09/01/2016 11:01 AM, Johannes Berg wrote:
>
>> If someone has any idea of why this patch might trigger it, please
>> let me know.
>> I'll keep digging in the meantime...
>>
>> Revert "mac80211: don't advertise NL80211_FEATURE_FULL_AP_CLIENT_STATE"
>>
>
> With a sufficiently recent hostapd/wpa_supplicant, the patch will cause
> a station entry to be added to the firmware before sending the
> authentication frame.
>
> Why, of all frames, probe response frames should be corrupted I don't
> know - I could imagine auth/assoc replies being treated differently
> since they are now with a station entry rather than without.

Could easily be that others are corrupted too, but since probe resp is bad,
the association will not proceed.

>
>> This only breaks AP mode (station mode works fine).
>
> It also has no impact on anything but AP mode, as even indicated by the
> name of the flag :)

Heh, I spent 4 days tracking this down, so I wanted to be precise in
my bug report :)

> Anyway, I was pretty sure this was safe and it does help other drivers
> to have the full state, but I guess you can make the driver opt out of
> the flag again (just unset it before register_hw).

The result I see is that there is an extra 10 bytes at the end of the frame on
air. But, it looks like the exact same pkt is sent to the firmware both with
and without this patch. Maybe the firmware is using the wrong tid or something
like that due to how the station is created differently with this patch.

Since this only happens (as far as I know) with my modified firmware, then
I will try to fix it there. Or, possibly I can change ath10k driver to flip this
mac80211 flag when loading my firmware variant if firmware cannot be easily fixed.

Thanks for the info,
--Ben

>
> johannes
>


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2016-09-02 12:09:35

by Michal Kazior

[permalink] [raw]
Subject: Re: Ath10k probe response error related to mac80211 commit.

On 1 September 2016 at 22:52, Ben Greear <[email protected]> wrote:
> On 09/01/2016 11:53 AM, Johannes Berg wrote:
>> On Thu, 2016-09-01 at 11:23 -0700, Ben Greear wrote:
>>>
>>> Could easily be that others are corrupted too, but since probe resp
>>> is bad, the association will not proceed.
>>
>> makes sense.
>>
>>> Heh, I spent 4 days tracking this down, so I wanted to be precise in
>>> my bug report :)
>>
>> Ahrg, ouch. Sorry about that. I really didn't think the flag would
>> cause any issues for anyone.
>>
>>> The result I see is that there is an extra 10 bytes at the end of the
>>> frame on air. But, it looks like the exact same pkt is sent to the
>>> firmware both with and without this patch. Maybe the firmware is
>>> using the wrong tid or something like that due to how the station is
>>> created differently with this patch.
>>
>> That makes no sense though, unless this only happens on say the second
>> station that connects? Until the first station sends an authentication
>> frame, that patch really should have no impact whatsoever.
>
> Ok, I found the problem.
>
> In the 10.1 firmware (at least), it will force the TID to be NON-QOS-TID
> if the peer object does not have the qos_enabled flag set. This is proba=
bly
> a work-around for some other thing lost in antiquity.
>
> When using my firmware that puts mgt frames over HTT, the TID for mgt
> frames was being over-ridden and set to non-qos TID. Due to other
> hackery and work-arounds, mgt frames cannot be sent on the non-qos
> TID because they will be put on-air 10 bytes too long. They must be
> sent on the mgt-tid or non-pause tid.

Sounds like 802.11 header vs 802.3 header length problem (24 - 14 =3D
10). You can hit this if you start playing around tx encap mode as
well.

You could try fooling firmware into thinking the frame has a different
length either in htt TX_FRM or tx fragment list (or both) but since
this seems to be related to RA/DA peer state at xmit time it's
probably not going to be reliable unless you introduce extra tx
flushing barriers in the driver.


Micha=C5=82

2016-09-01 20:58:03

by Ben Greear

[permalink] [raw]
Subject: Re: Ath10k probe response error related to mac80211 commit.

On 09/01/2016 11:53 AM, Johannes Berg wrote:
> On Thu, 2016-09-01 at 11:23 -0700, Ben Greear wrote:
>>
>> Could easily be that others are corrupted too, but since probe resp
>> is bad, the association will not proceed.
>
> makes sense.
>
>> Heh, I spent 4 days tracking this down, so I wanted to be precise in
>> my bug report :)
>
> Ahrg, ouch. Sorry about that. I really didn't think the flag would
> cause any issues for anyone.
>
>> The result I see is that there is an extra 10 bytes at the end of the
>> frame on air. But, it looks like the exact same pkt is sent to the
>> firmware both with and without this patch. Maybe the firmware is
>> using the wrong tid or something like that due to how the station is
>> created differently with this patch.
>
> That makes no sense though, unless this only happens on say the second
> station that connects? Until the first station sends an authentication
> frame, that patch really should have no impact whatsoever.

Ok, I found the problem.

In the 10.1 firmware (at least), it will force the TID to be NON-QOS-TID
if the peer object does not have the qos_enabled flag set. This is probably
a work-around for some other thing lost in antiquity.

When using my firmware that puts mgt frames over HTT, the TID for mgt
frames was being over-ridden and set to non-qos TID. Due to other
hackery and work-arounds, mgt frames cannot be sent on the non-qos
TID because they will be put on-air 10 bytes too long. They must be
sent on the mgt-tid or non-pause tid.

I am guessing that somehow this mac80211 change creates a peer early
that does not have the qos logic enabled, and so that is why I suddenly
started hitting this bug.

Probably stock firmware is OK since it uses a second tx path for management,
and I guess that by whatever time it starts sending non-mgt frames the
qos-enabled logic is set properly.

I can fix this in my firmware by making it not over-ride the TID in
this case.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com