2018-03-24 11:07:03

by Alexander Wetzel

[permalink] [raw]
Subject: [PATCH] mac80211: Fix wlan freezes under load at rekey

Rekeying a pairwise key with encryption offload and only keyid 0 has two
potential races which can freeze the wlan conection till rekeyed again:

1) For incomming packets:
If the local STA installs the key prior to the remote STA we still
have the old key active in the hardware for a short time after
mac80211 switched to the new key.
The card can still hand over packets decoded with the old key to
mac80211, bumping the new PN (IV) value to an incorrect high number and
tricking the local replay detection to drop all packets really sent
with the new key.

2) For outgoing packets:
If mac80211 is providing the PN (IV) and hands over the cleartext
packets for encryption to the hardware immediately prior to a key
change the driver/card may process the queued packets after
switching to the new key.
This will immediatelly bump the PN (IV) value on the remote STA to
an incorrect high number, also freezing the connection.

Both issues can be prevented by deleting the key from the hardware prior
to switching to the new key in mac80211, falling back to software
encryption/decryption till the switch to the new key is completed.

Signed-off-by: Alexander Wetzel <[email protected]>
---
net/mac80211/key.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/mac80211/key.c b/net/mac80211/key.c
index aee05ec3f7ea..266ea0b507e7 100644
--- a/net/mac80211/key.c
+++ b/net/mac80211/key.c
@@ -332,10 +332,15 @@ static void ieee80211_key_replace(struct ieee80211_sub_if_data *sdata,

WARN_ON(new && old && new->conf.keyidx != old->conf.keyidx);

- if (old)
+ if (old) {
idx = old->conf.keyidx;
- else
+ /* Make sure the card can't encrypt/decrypt packets with
+ * the old key prior to switching to new key in mac80211.
+ */
+ ieee80211_key_disable_hw_accel(old);
+ } else {
idx = new->conf.keyidx;
+ }

if (sta) {
if (pairwise) {
--
2.16.2


2018-03-25 21:59:16

by Ben Greear

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Fix wlan freezes under load at rekey



On 03/25/2018 12:45 PM, Alexander Wetzel wrote:
>
>> What will happen to drivers like ath10k that cannot do software
> encrypt/decrypt?
>>
>> ath10k can support multiple key-ids as far as I can tell,
>> so maybe it would just never hit this code?
>
> Still learning how that all fits together, but I'm sure any card using
> mac80211 will also use ieee80211_key_replace, including ath10k.
>
> We are in a race with the remote station there is no chance that we can
> switch over exactly at the same time. If we can't fall pack to software
> encryption we'll just have to drop some more packets.
>
> I'm pretty sure mac80211 will just encrypt a frame in software and
> send it to ath10 for processing once we have removed the key from the hw
> in the same way as for any other card.

I don't think ath10k can handle sending already-encrypted data packets,
but possibly it works with newer upstream firmware/driver.

Either way, as long as it does not fundamentally break something (like
a non-recoverable data stall), then maybe your patch is fine anyway
and ath10k may just drop a few extra frames.

> My expectation here would be, that the driver detects and drops the
> pre-encrypted frames it no longer has a hw key for.
>
> Unfortunately this is just an assumption, since I haven't found the code
> handling this case in ath10k. And even if true this could well cause
> some undesired warning messages.
>
> I guess we should therefore make sure we do not send out any packets in
> the critical time window.
>
> Now stopping and flushing the queues seems to be bad idea which could
> cause a real performance impact for on a busy AP with many stations and
> rekeys enabled...
> Luckily it looks like we can instead just set KEY_FLAG_TAINTED for the
> old key to make sure we stop sending packets till the rekey is done.
>
> That should cause ieee80211_tx_h_select_key to drop all packets without
> a new per-packet check and also should cover potential undesired side
> effects, isn't it?

I get lost in the weeds when trying to understand all of this, and some
previous attempts of mine to fix some of this evidently wasn't correct
enough to accept upstream:

https://www.spinics.net/lists/hostap/msg03677.html

So I really don't know enough to properly review
your patch. Just be aware that ath10k is weird about sw-crypt, maybe make
sure your patch is tested on it to make sure it doesn't out-right break something.

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2018-03-25 20:01:59

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Fix wlan freezes under load at rekey


> What will happen to drivers like ath10k that cannot do software
encrypt/decrypt?
>
> ath10k can support multiple key-ids as far as I can tell,
> so maybe it would just never hit this code?

Still learning how that all fits together, but I'm sure any card using
mac80211 will also use ieee80211_key_replace, including ath10k.

We are in a race with the remote station there is no chance that we can
switch over exactly at the same time. If we can't fall pack to software
encryption we'll just have to drop some more packets.

I'm pretty sure mac80211 will just encrypt a frame in software and
send it to ath10 for processing once we have removed the key from the hw
in the same way as for any other card.
My expectation here would be, that the driver detects and drops the
pre-encrypted frames it no longer has a hw key for.

Unfortunately this is just an assumption, since I haven't found the code
handling this case in ath10k. And even if true this could well cause
some undesired warning messages.

I guess we should therefore make sure we do not send out any packets in
the critical time window.

Now stopping and flushing the queues seems to be bad idea which could
cause a real performance impact for on a busy AP with many stations and
rekeys enabled...
Luckily it looks like we can instead just set KEY_FLAG_TAINTED for the
old key to make sure we stop sending packets till the rekey is done.

That should cause ieee80211_tx_h_select_key to drop all packets without
a new per-packet check and also should cover potential undesired side
effects, isn't it?


Regards,

Alexander

2018-03-26 12:22:32

by Sebastian Gottschall

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Fix wlan freezes under load at rekey

so far i see no regressions with 9984 with that patch

except that 9984 has a rekeying problem at all. with wds ap -> wds sta
mode rekeying will fail and it will reauthenticate at each interval. (it
disconnects and reconnects)
but this is a long term issue qca never fixed for years. 988x doesnt
suffer from that issue

Am 25.03.2018 um 23:59 schrieb Ben Greear:
>
>
> On 03/25/2018 12:45 PM, Alexander Wetzel wrote:
>>
>>> What will happen to drivers like ath10k that cannot do software
>> encrypt/decrypt?
>>>
>>> ath10k can support multiple key-ids as far as I can tell,
>>> so maybe it would just never hit this code?
>>
>> Still learning how that all fits together, but I'm sure any card using
>> mac80211 will also use ieee80211_key_replace, including ath10k.
>>
>> We are in a race with the remote station there is no chance that we can
>> switch over exactly at the same time. If we can't fall pack to software
>> encryption we'll just have to drop some more packets.
>>
>> I'm pretty sure mac80211 will just encrypt a frame in software and
>> send it to ath10 for processing once we have removed the key from the hw
>> in the same way as for any other card.
>
> I don't think ath10k can handle sending already-encrypted data packets,
> but possibly it works with newer upstream firmware/driver.
>
> Either way, as long as it does not fundamentally break something (like
> a non-recoverable data stall), then maybe your patch is fine anyway
> and ath10k may just drop a few extra frames.
>
>> My expectation here would be, that the driver detects and drops the
>> pre-encrypted frames it no longer has a hw key for.
>>
>> Unfortunately this is just an assumption, since I haven't found the code
>> handling this case in ath10k. And even if true this could well cause
>> some undesired warning messages.
>>
>> I guess we should therefore make sure we do not send out any packets in
>> the critical time window.
>>
>> Now stopping and flushing the queues seems to be bad idea which could
>> cause a real performance impact for on a busy AP with many stations and
>> rekeys enabled...
>> Luckily it looks like we can instead just set KEY_FLAG_TAINTED for the
>> old key to make sure we stop sending packets till the rekey is done.
>>
>> That should cause ieee80211_tx_h_select_key to drop all packets without
>> a new per-packet check and also should cover potential undesired side
>> effects, isn't it?
>
> I get lost in the weeds when trying to understand all of this, and some
> previous attempts of mine to fix some of this evidently wasn't correct
> enough to accept upstream:
>
> https://www.spinics.net/lists/hostap/msg03677.html
>
> So I really don't know enough to properly review
> your patch.? Just be aware that ath10k is weird about sw-crypt, maybe
> make
> sure your patch is tested on it to make sure it doesn't out-right
> break something.
>
> Thanks,
> Ben
>
>

--
Mit freundlichen Gr?ssen / Regards

Sebastian Gottschall / CTO

NewMedia-NET GmbH - DD-WRT
Firmensitz: Stubenwaldallee 21a, 64625 Bensheim
Registergericht: Amtsgericht Darmstadt, HRB 25473
Gesch?ftsf?hrer: Peter Steinh?user, Christian Scheele
http://www.dd-wrt.com
email: [email protected]
Tel.: +496251-582650 / Fax: +496251-5826565

2018-03-24 15:29:23

by Ben Greear

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Fix wlan freezes under load at rekey



On 03/24/2018 03:29 AM, Alexander Wetzel wrote:
> Rekeying a pairwise key with encryption offload and only keyid 0 has two
> potential races which can freeze the wlan conection till rekeyed again:
>
> 1) For incomming packets:
> If the local STA installs the key prior to the remote STA we still
> have the old key active in the hardware for a short time after
> mac80211 switched to the new key.
> The card can still hand over packets decoded with the old key to
> mac80211, bumping the new PN (IV) value to an incorrect high number and
> tricking the local replay detection to drop all packets really sent
> with the new key.
>
> 2) For outgoing packets:
> If mac80211 is providing the PN (IV) and hands over the cleartext
> packets for encryption to the hardware immediately prior to a key
> change the driver/card may process the queued packets after
> switching to the new key.
> This will immediatelly bump the PN (IV) value on the remote STA to
> an incorrect high number, also freezing the connection.
>
> Both issues can be prevented by deleting the key from the hardware prior
> to switching to the new key in mac80211, falling back to software
> encryption/decryption till the switch to the new key is completed.

What will happen to drivers like ath10k that cannot do software encrypt/decrypt?

ath10k can support multiple key-ids as far as I can tell,
so maybe it would just never hit this code?

Thanks,
Ben

>
> Signed-off-by: Alexander Wetzel <[email protected]>
> ---
> net/mac80211/key.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/net/mac80211/key.c b/net/mac80211/key.c
> index aee05ec3f7ea..266ea0b507e7 100644
> --- a/net/mac80211/key.c
> +++ b/net/mac80211/key.c
> @@ -332,10 +332,15 @@ static void ieee80211_key_replace(struct ieee80211_sub_if_data *sdata,
>
> WARN_ON(new && old && new->conf.keyidx != old->conf.keyidx);
>
> - if (old)
> + if (old) {
> idx = old->conf.keyidx;
> - else
> + /* Make sure the card can't encrypt/decrypt packets with
> + * the old key prior to switching to new key in mac80211.
> + */
> + ieee80211_key_disable_hw_accel(old);
> + } else {
> idx = new->conf.keyidx;
> + }
>
> if (sta) {
> if (pairwise) {
>

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2018-03-27 00:01:47

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Fix wlan freezes under load at rekey

> so far i see no regressions with 9984 with that patch
>
> except that 9984 has a rekeying problem at all. with wds ap -> wds sta
> mode rekeying will fail and it will reauthenticate at each interval. (it
> disconnects and reconnects)
> but this is a long term issue qca never fixed for years. 988x doesnt
> suffer from that issue

Thanks for testing, sounds promising.

If anyone is interested how it looks in my test environment I've
uploaded two sample captures to
https://www.awhome.eu/index.php/s/abxgp9pfi2ssCNy, showing how the
unpatched and patched mac80211 are reacting to the rekey. The WPA
Password is Induction and the AP rekeys all 30s.

The AP is running lede 17.01.4, so it's way off from the current
kernel/mac80211.
The client is a HTC 10 phone running Lineageos. (The phone also has a
WLAN card which has problems when rekeying.)

There are quite many interesting things visible here, not the least one
that ath9k leaks unencrypted frames for both patched and unpatched
mac80211 which at least for my patched variant probably allow to
calculate the TK key and encrypt all frames.

I'm now experimenting now with KEY_FLAG_TAINTED, but it's not as
straight forward as I expected.

2018-03-27 13:13:35

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Fix wlan freezes under load at rekey

On Sat, 2018-03-24 at 11:29 +0100, Alexander Wetzel wrote:
> Rekeying a pairwise key with encryption offload and only keyid 0 has two
> potential races which can freeze the wlan conection till rekeyed again:
>
> 1) For incomming packets:
> If the local STA installs the key prior to the remote STA we still
> have the old key active in the hardware for a short time after
> mac80211 switched to the new key.
> The card can still hand over packets decoded with the old key to
> mac80211, bumping the new PN (IV) value to an incorrect high number and
> tricking the local replay detection to drop all packets really sent
> with the new key.
>
> 2) For outgoing packets:
> If mac80211 is providing the PN (IV) and hands over the cleartext
> packets for encryption to the hardware immediately prior to a key
> change the driver/card may process the queued packets after
> switching to the new key.
> This will immediatelly bump the PN (IV) value on the remote STA to
> an incorrect high number, also freezing the connection.

Correct for both, yes, this is a known issue.

> Both issues can be prevented by deleting the key from the hardware prior
> to switching to the new key in mac80211, falling back to software
> encryption/decryption till the switch to the new key is completed.

This, however, is not true in general. It only works if the queues are
flushed quickly enough between removing the key and setting a new one.
Otherwise, the same is still possible, e.g. on TX:

* many packets are in the (HW) queue
* enqueue packet with PN=0x10000
* turn off HW crypto
[here I can actually see how you might now leak packets as
unencrypted that are already in the queue]
* install a new key
* turn on HW crypto for the new key
* process packet with PN=0x10000

Fundamentally, nothing stops this from happening, as we (still) don't
have any synchronization between transmission and key configuration.

Also, in this case it seems pretty obvious how you can leak packets
unencrypted, since the HW now no longer has a key. This might not even
be fixable if the NIC cannot reject packets that are supposed to be
encrypted to a key that's no longer valid in the HW.

I don't really see how the unencrypted leak happens with the current
code though, unless the driver somehow first invalidates the key and
then installs a new one, and there's a race with this?

Ultimately, I don't think this patch is good enough. We clearly have a
problem here with leaking unencrypted frames, if the device can't reject
them (and I can't really fault it for that), so in order to really fix
it we'd have to completely flush all software and hardware queues, and
then start again with a new key - and for that we don't even need to
disable HW crypto.


(FWIW, iwlwifi mostly avoids this problem on TX - at least for keys
other than GCMP-256 - by embedding the key material into the frame
itself, so we simply don't have such a race condition there)

johannes

2018-03-26 07:43:31

by Sebastian Gottschall

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Fix wlan freezes under load at rekey


>
> So I really don't know enough to properly review
> your patch.? Just be aware that ath10k is weird about sw-crypt, maybe
> make
> sure your patch is tested on it to make sure it doesn't out-right
> break something.
i will test it today in sta and ap mode. lets see whats the result after
some hours

Sebastian


--
Mit freundlichen Gr?ssen / Regards

Sebastian Gottschall / CTO

NewMedia-NET GmbH - DD-WRT
Firmensitz: Stubenwaldallee 21a, 64625 Bensheim
Registergericht: Amtsgericht Darmstadt, HRB 25473
Gesch?ftsf?hrer: Peter Steinh?user, Christian Scheele
http://www.dd-wrt.com
email: [email protected]
Tel.: +496251-582650 / Fax: +496251-5826565

2018-03-27 13:03:30

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Fix wlan freezes under load at rekey

On Mon, 2018-03-26 at 22:24 +0200, Alexander Wetzel wrote:

> There are quite many interesting things visible here, not the least one
> that ath9k leaks unencrypted frames for both patched and unpatched
> mac80211 which at least for my patched variant probably allow to
> calculate the TK key and encrypt all frames.

That's odd - any thoughts on how that might be happening?


johannes

2018-04-09 07:25:59

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Fix wlan freezes under load at rekey

On Sun, 2018-04-08 at 22:31 +0200, Alexander Wetzel wrote:

> Exactly. I'm planing to avoid that issue by just dropping (and flushing)
> all packets while mac80211 replaces the keys.

That would again need driver support though, and I'm not sure all
drivers implement the flush method (correctly) today.

> Queuing them in mac80211
> should also be possible, but I abandoned that for now - after figuring
> out that the PS code currently using those queues allows only an AP (or
> a mesh) to queue. Still looks doable, but too invasive for now.

Yeah, the whole queueing is messy for sure.

> > I don't really see how the unencrypted leak happens with the current
> > code though, unless the driver somehow first invalidates the key and
> > then installs a new one, and there's a race with this?
>
> Well, current mac80211 code always handling a key replace in that order:
> - set the new key in mac80211
> - remove the key from hw
> - delete the old key
> - enable hw accel for new key
>
> Problem here is, that when we remove the key from the hw we first drop
> the key from the card and after clear KEY_FLAG_UPLOADED_TO_HARDWARE.
> So yes, we have a short window where mac80211 incorrectly assumes the hw
> is encrypting packets when it's not.
>
> That is trivial to fix, we just have to remove the flag prior to calling
> drv_set_key in ieee80211_key_disable_hw_accel.

Oh, ok. That's indeed not such a big deal then. This code was just never
really meant to work properly while operating on this key - typically it
either gets invoked when the connection is already no longer authorized,
or when swapping GTK where we only RX anyway, or on AP side when
swapping GTK the second time, but then that key has (usually) long been
unused.

> This will of course hand over software encrypted packets to the driver
> again, but this also happens when we enable hw encryption and therefore
> should be pretty well tested with all drivers.

No, it doesn't generally happen with all drivers, because we usually
install keys before we have a functioning datapath up. In the case of
ath10k, it's even not possible to send those frames, as has been pointed
out in this thread before.

> > Ultimately, I don't think this patch is good enough. We clearly have a
> > problem here with leaking unencrypted frames, if the device can't reject
> > them (and I can't really fault it for that), so in order to really fix
> > it we'd have to completely flush all software and hardware queues, and
> > then start again with a new key - and for that we don't even need to
> > disable HW crypto.
>
> I agree, but we still have to disable the hw encryption in the final
> solution, haven't we?

I don't know. Honestly, I'm not even sure this problem is worth solving
right now. AFAICT it's a pretty special and fringe configuration to
enable PTK rekeying to start with, and the latest 802.11 allows using
non-zero key ID for PTKs, so we could implement that on both AP/client
sides instead, and thus really solve the problem without any races.

> If not the remote STA may still send us some frames with the old IV and
> key and our RX will decode and hand them over to mac80211. And mac80211
> will bump the IV for the new key to the value the old had.

Yes, this does happen.

> Or is there a way mac80211 can see which key was used to decode the
> frame? I did not see anything, but did not dug deeper.

Not right now. I guess you could add one, but it may be difficult to
even know. I think iwlwifi might have a "key color" bit it reports up
and that you could use (just swap it every time you overwrite the key),
but I guess not all hardware would have that.

johannes

2018-04-08 20:49:41

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Fix wlan freezes under load at rekey


> Fundamentally, nothing stops this from happening, as we (still) don't
> have any synchronization between transmission and key configuration.
I'm currently working on that and nearly have a rfc version ready to
share. Unfortunately I still continue to find issues to iron out.
May be some more days to sort the latest and hopefully the last issue.

The current version is already fixing the issue with my ath9k AP but it
looks like it's now racing with eapol #4 and needs more tweaks.

> Also, in this case it seems pretty obvious how you can leak packets
> unencrypted, since the HW now no longer has a key. This might not even
> be fixable if the NIC cannot reject packets that are supposed to be
> encrypted to a key that's no longer valid in the HW.
>
Exactly. I'm planing to avoid that issue by just dropping (and flushing)
all packets while mac80211 replaces the keys. Queuing them in mac80211
should also be possible, but I abandoned that for now - after figuring
out that the PS code currently using those queues allows only an AP (or
a mesh) to queue. Still looks doable, but too invasive for now.

> I don't really see how the unencrypted leak happens with the current
> code though, unless the driver somehow first invalidates the key and
> then installs a new one, and there's a race with this?

Well, current mac80211 code always handling a key replace in that order:
- set the new key in mac80211
- remove the key from hw
- delete the old key
- enable hw accel for new key

Problem here is, that when we remove the key from the hw we first drop
the key from the card and after clear KEY_FLAG_UPLOADED_TO_HARDWARE.
So yes, we have a short window where mac80211 incorrectly assumes the hw
is encrypting packets when it's not.

That is trivial to fix, we just have to remove the flag prior to calling
drv_set_key in ieee80211_key_disable_hw_accel.

This will of course hand over software encrypted packets to the driver
again, but this also happens when we enable hw encryption and therefore
should be pretty well tested with all drivers.

>
> Ultimately, I don't think this patch is good enough. We clearly have a
> problem here with leaking unencrypted frames, if the device can't reject
> them (and I can't really fault it for that), so in order to really fix
> it we'd have to completely flush all software and hardware queues, and
> then start again with a new key - and for that we don't even need to
> disable HW crypto.

I agree, but we still have to disable the hw encryption in the final
solution, haven't we?
If not the remote STA may still send us some frames with the old IV and
key and our RX will decode and hand them over to mac80211. And mac80211
will bump the IV for the new key to the value the old had.

Or is there a way mac80211 can see which key was used to decode the
frame? I did not see anything, but did not dug deeper.

Alexander

2018-05-16 01:07:27

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

Hello,

> On Tue, 2018-05-15 at 12:22 +0200, Alexander Wetzel wrote:
>>
>> Both issues can be prevented by first replacing the key in the HW and
>> makeing sure no aggregation sessions are running during the rekey.
>
> I don't think you can do this - just tear down all aggregation sessions
> - there are APs out there that will not re-establish them if you tear
> them down, or only attempt a given number of times, etc. This will cause
> interoperability problems.

I'm on very thin ice here, but my impression was that this should work
without too many problems for all (most?) systems:
- An aggregation session is only started when needed
- ADDBA can't be expected to succeed
- It's normal to tear down an aggregation session once your queue is empty.

The only unusual thing here is, that the originator can get a DELBA from
the recipient while transmitting data and not after some inactivity
timeout. But reading IEEE802.11-2016 chapter 11.5.4 seems to indicate
that you have to expect and handle DELBA frames any time.

So far I've found only one device which is handling a PSK rekey
correctly (Windows Surface Pro 3 running Win 10) and that one was
working fine with my patched AP for three rekeys while downloading at
full speed. The fourth rekey failed and caused an re-associated, but
according to the OTA capture the AP did not respond to at least 5 EAPOL
#2 frames and we therefore never got to the code stopping the
aggregation for rekey.

That said I think I can get the code working without stopping RX
aggregation and a spoofed "idle" tear down of the TX aggregation.

Problem here is the reorder buffer can have already decoded packets
queued from both the old and the new key. And once the session is
complete will releases those when we are on the new key, poisoning our PN.
First plan would be to mark any running RX aggregation queue as tainted
and once the aggregation is complete discard all packets in it.

>
> OTOH, arguably we have worse interoperability problems today, and anyone
> who configures PTK rekeying is deluded that it'll work properly, so
> maybe that's not _that_ bad. Hmm.

Assuming the other STA is not totally broken this should only degrade
the speed, but keep the connection operable.
If you prefer to not stop the RX aggregation I'll try my hand on that
next. (I assume stopping TX is fine?)

The tests I've run so far are showing that we have at least two group of
"broken" devices out in the wild:
1) The first group is handling rekeys pretty much like mac80211. Some
are better on TX like my HTC 10 (seems to be fullmac) but are failing to
separate RX frames properly based on the key used to decode it.

2) The second group is even worse implemented, but in a nice twist are
seeming to work quite fine for the users. Those are simply encoding
eapol #4 with the new key, preventing any rekey to ever succeed and
triggering a re-associate.

Statistically my data is less than insufficient, but I suspect that
there are quite some APs in the wild running rekeys but the combination
of hour long rekey intervals, the fact hat you must have traffic during
the rekey and that at least some common network cards handle eapol #4
wrong keeps the heat down. And of course this issue is next to
impossible to track down if you are not some kind of expert.

Nevertheless I can you find many "magic" solutions to fix linux wlan
issues by switching over to software encryption and disabling 802.11n,
which are exactly the actions which drastically reduce the chance to
freeze a wlan during a PSK rekey. (I'm sure many of those are other
issues, but I'm equally sure a sizeable fraction is not.)

One of the "better" reports is here:
https://bugzilla.kernel.org/show_bug.cgi?id=42877

Alexander

2018-05-15 18:39:55

by Alexander Wetzel

[permalink] [raw]
Subject: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

Rekeying a pairwise key with encryption offload and only keyid 0 has two
potential races which can freeze the wlan conection till rekeyed again:

1) For incomming packets:
If the local STA installs the key prior to the remote STA we still
have the old key active in the hardware for a short time after
mac80211 switched to the new key.
The card can still hand over packets decoded with the old key to
mac80211, bumping the new PN (IV) value to an incorrect high number and
tricking the local replay detection to drop all packets really sent
with the new key.

2) For outgoing packets:
If mac80211 is providing the PN (IV) and hands over the cleartext
packets for encryption to the hardware immediately prior to a key
change the driver/card may process the queued packets after
switching to the new key.
This will immediatelly bump the PN (IV) value on the remote STA to
an incorrect high number, also freezing the connection.

Both issues can be prevented by first replacing the key in the HW and
makeing sure no aggregation sessions are running during the rekey.

Signed-off-by: Alexander Wetzel <[email protected]>
---
net/mac80211/key.c | 54 ++++++++++++++++++++++++++++++++++------------
1 file changed, 40 insertions(+), 14 deletions(-)

diff --git a/net/mac80211/key.c b/net/mac80211/key.c
index ee0d0cc8dc3b..bb498f2b6c44 100644
--- a/net/mac80211/key.c
+++ b/net/mac80211/key.c
@@ -248,6 +248,7 @@ static void ieee80211_key_disable_hw_accel(struct ieee80211_key *key)
(key->conf.flags & IEEE80211_KEY_FLAG_RESERVE_TAILROOM)))
increment_tailroom_need_count(sdata);

+ key->flags &= ~KEY_FLAG_UPLOADED_TO_HARDWARE;
ret = drv_set_key(key->local, DISABLE_KEY, sdata,
sta ? &sta->sta : NULL, &key->conf);

@@ -257,7 +258,26 @@ static void ieee80211_key_disable_hw_accel(struct ieee80211_key *key)
key->conf.keyidx,
sta ? sta->sta.addr : bcast_addr, ret);

- key->flags &= ~KEY_FLAG_UPLOADED_TO_HARDWARE;
+}
+
+static void ieee80211_key_retire(struct ieee80211_key *key)
+{
+ struct ieee80211_local *local = key->local;
+ struct sta_info *sta = key->sta;
+
+ assert_key_lock(key->local);
+
+ /* Stop TX till we are on the new key */
+ key->flags |= KEY_FLAG_TAINTED;
+ ieee80211_clear_fast_xmit(sta);
+
+ if (ieee80211_hw_check(&local->hw, AMPDU_AGGREGATION)) {
+ set_sta_flag(sta, WLAN_STA_BLOCK_BA);
+ ieee80211_sta_tear_down_BA_sessions(
+ sta, AGG_STOP_LOCAL_REQUEST);
+ }
+ ieee80211_flush_queues(key->local, key->sdata, false);
+ ieee80211_key_disable_hw_accel(key);
}

static void __ieee80211_set_default_key(struct ieee80211_sub_if_data *sdata,
@@ -316,34 +336,45 @@ void ieee80211_set_default_mgmt_key(struct ieee80211_sub_if_data *sdata,
}


-static void ieee80211_key_replace(struct ieee80211_sub_if_data *sdata,
+static int ieee80211_key_replace(struct ieee80211_sub_if_data *sdata,
struct sta_info *sta,
bool pairwise,
struct ieee80211_key *old,
struct ieee80211_key *new)
{
int idx;
+ int ret;
bool defunikey, defmultikey, defmgmtkey;

/* caller must provide at least one old/new */
if (WARN_ON(!new && !old))
- return;
+ return 0;

if (new)
list_add_tail_rcu(&new->list, &sdata->key_list);

WARN_ON(new && old && new->conf.keyidx != old->conf.keyidx);

- if (old)
+ if (old) {
idx = old->conf.keyidx;
- else
+ if(new && sta && pairwise)
+ ieee80211_key_retire(old);
+ } else {
idx = new->conf.keyidx;
+ }
+
+ if (new && !new->local->wowlan) {
+ ret = ieee80211_key_enable_hw_accel(new);
+ if (ret)
+ return ret;
+ }

if (sta) {
if (pairwise) {
rcu_assign_pointer(sta->ptk[idx], new);
sta->ptk_idx = idx;
ieee80211_check_fast_xmit(sta);
+ clear_sta_flag(sta, WLAN_STA_BLOCK_BA);
} else {
rcu_assign_pointer(sta->gtk[idx], new);
}
@@ -380,6 +411,7 @@ static void ieee80211_key_replace(struct ieee80211_sub_if_data *sdata,

if (old)
list_del_rcu(&old->list);
+ return 0;
}

struct ieee80211_key *
@@ -654,7 +686,6 @@ int ieee80211_key_link(struct ieee80211_key *key,
struct ieee80211_sub_if_data *sdata,
struct sta_info *sta)
{
- struct ieee80211_local *local = sdata->local;
struct ieee80211_key *old_key;
int idx, ret;
bool pairwise;
@@ -687,18 +718,13 @@ int ieee80211_key_link(struct ieee80211_key *key,

increment_tailroom_need_count(sdata);

- ieee80211_key_replace(sdata, sta, pairwise, old_key, key);
+ ret = ieee80211_key_replace(sdata, sta, pairwise, old_key, key);
ieee80211_key_destroy(old_key, true);

ieee80211_debugfs_key_add(key);

- if (!local->wowlan) {
- ret = ieee80211_key_enable_hw_accel(key);
- if (ret)
- ieee80211_key_free(key, true);
- } else {
- ret = 0;
- }
+ if (ret)
+ ieee80211_key_free(key, true);

out:
mutex_unlock(&sdata->local->key_mtx);
--
2.17.0

2018-05-16 06:57:02

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

On Wed, 2018-05-16 at 00:41 +0200, Alexander Wetzel wrote:
>
> I'm on very thin ice here,

I think we all are, since we're talking about interoperability :-)

> but my impression was that this should work
> without too many problems for all (most?) systems:
> - An aggregation session is only started when needed
> - ADDBA can't be expected to succeed
> - It's normal to tear down an aggregation session once your queue is empty.
>
> The only unusual thing here is, that the originator can get a DELBA from
> the recipient while transmitting data and not after some inactivity
> timeout. But reading IEEE802.11-2016 chapter 11.5.4 seems to indicate
> that you have to expect and handle DELBA frames any time.

There's not necessarily any inactivity timeout. Many STAs (AP STAs
included) will set that to 0 if the aggregation session doesn't cost
them an appreciable amount of resources.

You're reading the spec correctly, but the spec also omits entirely when
you would (want to) start a session, and in my experience the logic in
many STAs will not easily reopen sessions that were torn down (multiple
times) by a recipient DelBA, since they don't know that it won't be torn
down immediately again.

I think the issue is more pronounced with *rejecting* the AddBA, rather
than sending a recipient DelBA, but I'd still be cautious about it.

> So far I've found only one device which is handling a PSK rekey
> correctly (Windows Surface Pro 3 running Win 10) and that one was
> working fine with my patched AP for three rekeys while downloading at
> full speed. The fourth rekey failed and caused an re-associated, but
> according to the OTA capture the AP did not respond to at least 5 EAPOL
> #2 frames and we therefore never got to the code stopping the
> aggregation for rekey.

:-)

That's kinda my point though - you will not really find devices that
work correctly ;-)

> Problem here is the reorder buffer can have already decoded packets
> queued from both the old and the new key. And once the session is
> complete will releases those when we are on the new key, poisoning our PN.
> First plan would be to mark any running RX aggregation queue as tainted
> and once the aggregation is complete discard all packets in it.

Yes, good point, the reorder buffer definitely is a concern here. But
it's also completely managed in software in this case, so I guess we
could tag the key that was used into the frames somehow?

OTOH, if we allow that then we also open ourselves up to replay attacks
during this scenario since we can no longer check the frames properly,
so we'd better drop them.

> Assuming the other STA is not totally broken this should only degrade
> the speed, but keep the connection operable.

Yes.

> If you prefer to not stop the RX aggregation I'll try my hand on that
> next. (I assume stopping TX is fine?)

Stopping TX is fine, that's a local problem. We should make sure we
reopen the sessions but that's at least something we control.

Stopping RX - well, maybe I'm thinking now that it's not so bad. Given
that it's an infrequent DelBA rather than AddBA rejection, it should be
more or less OK.

> The tests I've run so far are showing that we have at least two group of
> "broken" devices out in the wild:
> 1) The first group is handling rekeys pretty much like mac80211. Some
> are better on TX like my HTC 10 (seems to be fullmac) but are failing to
> separate RX frames properly based on the key used to decode it.

No surprise here. It's a very common trade-off - do the PN checks in
software so that you don't waste precious device memory on all the PNs
for all the TIDs, but for speed/CPU reasons do the encryption in
hardware. Basically any time you have this, you run into the problem.
Even if you have full-MAC that may still happen since often this is
implemented in two different "CPUs" (like e.g. Broadcom) or different
hardware blocks perhaps.

> 2) The second group is even worse implemented, but in a nice twist are
> seeming to work quite fine for the users. Those are simply encoding
> eapol #4 with the new key, preventing any rekey to ever succeed and
> triggering a re-associate.

:-)

> Statistically my data is less than insufficient, but I suspect that
> there are quite some APs in the wild running rekeys but the combination
> of hour long rekey intervals, the fact hat you must have traffic during
> the rekey and that at least some common network cards handle eapol #4
> wrong keeps the heat down. And of course this issue is next to
> impossible to track down if you are not some kind of expert.

Indeed.

> Nevertheless I can you find many "magic" solutions to fix linux wlan
> issues by switching over to software encryption and disabling 802.11n,
> which are exactly the actions which drastically reduce the chance to
> freeze a wlan during a PSK rekey. (I'm sure many of those are other
> issues, but I'm equally sure a sizeable fraction is not.)
>
> One of the "better" reports is here:
> https://bugzilla.kernel.org/show_bug.cgi?id=42877

Right.

I think the part that I misinterpreted in a way is that I thought these
issues mostly happened to people who were explicitly configuring their
(Open|*)Wrt routers to do PTK rekeying. I hadn't really seen any vendors
enabling it in their stock configuration. But perhaps I'm mistaken here,
or perhaps people are actually running into PN wraps after 2**48
packets, which requires a rekey?

Anyway, next step is I think for me to take a closer look at this patch
- I'm starting to think that the aggregation issue isn't so bad.

Thanks for all your work on this!

johannes

2018-05-15 15:50:51

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

On Tue, 2018-05-15 at 12:22 +0200, Alexander Wetzel wrote:
>
> Both issues can be prevented by first replacing the key in the HW and
> makeing sure no aggregation sessions are running during the rekey.

I don't think you can do this - just tear down all aggregation sessions
- there are APs out there that will not re-establish them if you tear
them down, or only attempt a given number of times, etc. This will cause
interoperability problems.

OTOH, arguably we have worse interoperability problems today, and anyone
who configures PTK rekeying is deluded that it'll work properly, so
maybe that's not _that_ bad. Hmm.

johannes

2018-06-29 10:12:30

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

Hi,

On Tue, 2018-06-19 at 22:12 +0200, Alexander Wetzel wrote:
> > Am I understanding correctly that in the latter (outgoing) case this
> > might cause unencrypted packets to be transmitted?
>
> Yes, if we have a driver handling the keys similar to ath9k which is not
> implementing drv_flush (correctly) this is a possibility.

Sadly, I think many drivers fall into this category, so I'm not sure we
should "fix the bug" for them at the expense of a potential security
risk? So maybe we need to have an opt-in flag for drivers.

> This ties somewhat into the discussion what to handle in mac80211 and
> what delegate to the drivers. mac80211 itself should drop any packets
> trying to use the old outgoing key till the new key is rolled out
> correctly with the patch.

Yeah, but it can't do it for packets already queued to the
driver/hardware, and it can't know how the hardware behaves wrt.
encryption: old-iwlwifi style (key material in descriptor) vs. ath9k
style (key material in on-chip table).

> > > We can now decrypt and get incoming packets while mac80211 is still on
> > > the old key, but the worst (and most likely) case is now, that the
> > > replay detection will drop those packets till mac80211 also switch over.
> >
> > This actually is somewhat problematic, at least for TKIP it could cause
> > countermeasures. Should we exclude TKIP here somehow?
> >
> > I don't think we can disable countermeasures because then we could be
> > attacked in this time window without starting them, and we have a really
> > hard time distinguishing attacks and "fake" replays.
> >
>
> Yes, that is a definitely a possibility. But excluding TKIP seems to be
> counterproductive for my understanding: TKIP is due to the weaker cipher
> more likely to have unicast rekeying enabled.
> So far I was hesitant to add new per packet check for that, but I guess
> we have to... I'll add that in the next version, so we can discuss that
> with some code.

Ok.

> > Btw - perhaps that means we should avoid the complicated mechanisms like
> > TX aggregation shutdown for keys that the driver marks as being safe?
> > Clearly for iwlwifi (at least CCMP and before, not with the longer keys
> > in GCMP-256) the race can't possibly happen.
>
> Sounds like that should work and I think I'll just try it out. We'll
> lose the side benefit that shutting down TX aggregation will reduce the
> risk that a unpatched remote sta freezes the connection, though.

Fair point.

> And since I started out to first patch ath9k to drop packets for an
> outgoing key: That looks to become either be ugly (delayed work to
> complete the key clean up after a given time) or need some API change.

Ok.

> Remember we'll still have to shut down RX aggregation or drop all frames
> of a session running during the rekey. We are not able to tell which key
> was used to encrypt the frames and which have "old" PNs we can't allow
> mac80211 to process after switching to the new key. So I'm not sure that
> keeping TX up and running is worth it.

Yeah, good point, and TX is less interesting because we control when/how
it's set up.

> That said I have another working patch which is delaying mac80211 key
> deletion from my first misguided attempts to at this issue.
> I could repurpose that and keep RX aggregation also up and running,
> allowing us to first check if the packet would have been accepted by the
> old key and only switch to the new PN counter once it's not. But that
> seems to be kind of invasive and overkill.

Yeah, let's not go there for now. Also if HW crypto is involved it
usually won't work unless you (a) get the packet and (b) re-encrypt the
packet with the wrong key, and then decrypt it with the correct key ...

> Yes, that is part of the reason I had to add the call to drv_flush for
> ath9k. I've experimented with an additional "retire_key" command on top
> of the existing to add and delete keys but came to the conclusion that
> "replace_key" would make more sense for ath9k at least.

Ok.

> But in order of my preference I see this options:
> 1) removing a key in HW must also grantee that any packets queued for
> this key has either been send or will be dropped after the call returns
> to mac80211. (Simply sleeping the max queue and retransmit time would be
> an ugly but simple way to implement that)

This works but we have to consider a lot of drivers. I guess we're back
to some form of "opt-in" scheme then?

> 2) we add a new function/call like "replace_key" for this special case.
> But in that case we have to sort out how to handle drivers not
> implementing it. Thy only secure way would be to disconnect if someone
> tries to rekey...

Maybe that's not so bad.

> 3) Is what I implemented. We try what we can with the existing API and
> any driver not working with that has to be considered buggy.

I don't think we can really do this though. We break - in a potentially
security-relevant way - the older drivers. We can't just say that's
driver bugs, IMHO.

> > > I tried to minimize the changes, but if we can consider an API change I
> > > would ask the drivers to provide a new function to switch directly from
> > > one key to another, without the need to delete it first. And to
> > > guarantee there, that after the function returns no packets with
> > > prepared for the old key can be sent out. Including retransmissions.
> >
> > This would be pretty tricky for most drivers though.
>
> I do not see any real alternative. Sleeping some reasonable time should
> be acceptable here, to allow the hw to flush out queued packets. (100ms
> should probably acceptable here, especially compared to complete
> connection loss..

You have no guarantee that even 100ms is enough to get all packets out.
They could be sitting on a BK queue and waiting for all BE/VI traffic in
the vicinity ... basically forever.

> Only other idea I have is to lock down TX for all but EAPOL packets
> sooner, e.g. during the key handshake. But that's more or less only a
> variant of the sleep above.

Yeah, doesn't really fully address this issue anyway, since EAPOL are on
VO and others might get way more delayed on HW queues due to contention.

I think our best bet is some form of opt-in scheme. Perhaps (2), which
drivers can implement also to get like the "best" behaviour. They could
also implement there the flushing if they can, for example. Others would
get a disconnect, but they probably effectively do already?

Obviously if SW crypto is used none of this is a concern, so that's
another factor to take into account in this decision logic of whether to
disconnect or not?

IOW - I'd rather get bugs that we now force a disconnect (if anyone even
notices), rather than potentially having a bug that causes unencrypted
packets to be sent.

johannes

2018-06-18 21:19:25

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

> On Tue, 2018-05-15 at 12:22 +0200, Alexander Wetzel wrote:
>> Rekeying a pairwise key with encryption offload and only keyid 0 has two
>> potential races which can freeze the wlan conection till rekeyed again:
>>
>> 1) For incomming packets:
>> If the local STA installs the key prior to the remote STA we still
>> have the old key active in the hardware for a short time after
>> mac80211 switched to the new key.
>> The card can still hand over packets decoded with the old key to
>> mac80211, bumping the new PN (IV) value to an incorrect high number and
>> tricking the local replay detection to drop all packets really sent
>> with the new key.
>>
>> 2) For outgoing packets:
>> If mac80211 is providing the PN (IV) and hands over the cleartext
>> packets for encryption to the hardware immediately prior to a key
>> change the driver/card may process the queued packets after
>> switching to the new key.
>> This will immediatelly bump the PN (IV) value on the remote STA to
>> an incorrect high number, also freezing the connection.
>>
>> Both issues can be prevented by first replacing the key in the HW and
>> makeing sure no aggregation sessions are running during the rekey.
>
> Getting back to this, am I understanding correctly that in the latter
> (outgoing) case this would cause

I don't get the point here...Shall I flesh out the description of the
exiting races in the code a bit more?

>
> Also, I think you should probably describe better why the aggregation
> session stuff is needed. I'm already thinking there times about it again
> ...
I'll rework the description and will go into more details.
Here a first quick draft what I plan amend to the commit message. But
this got kind of long:
...
Both issues can be prevented by first replacing the key in the HW and
making sure no aggregation sessions are running during the rekey.

The "core" of the fix is to change the old unicast key install sequence
- atomic switch over to the new key in mac80211
- remove the old key in the HW and mac80211
- add new key in the hw
to
- mark the old key as tainted to drop TX packets trying to use it
- remove the old key in the HW and mac80211
- add new key to the HW
- atomic switch over to the new key in mac80211

Since the new key is not marked as tainted the last step will also
enable TX again.

With the new procedure the HW will be unable to decode packets still
encrypted with the old key prior to switching to the new key in
mac80211. Locking down TX during the rekey makes sure that there are no
outgoing packets mac80211 prepared for the old key the hw can encrypt
with the new key.
We can now decrypt and get incoming packets while mac80211 is still on
the old key, but the worst (and most likely) case is now, that the
replay detection will drop those packets till mac80211 also switch over.

Instead of checking PN from old key packets against the new key (and
bump the PN for the new key, killing our connection) we are now checking
the PN from new key packets against the old key (and drop a few packets
during rekey only).

When using TX aggregation we get an additional race which can poison the
remote station:
When a TX aggregation session is running during the key replacement it's
possible that one of the frames send out prior of the TX lock down (old
key) got lost and will be repeated after the TX lock down is lifted,
using the old skb prepared for the old key with the new installed key in
the hardware. This would bump the PN counter of our peer to an incorrect
value and breaking the RX for the remote station.
Stopping all running aggregation sessions and declining any new ones
till we have switched over to the new key side steps that problem.

/end commit draft

It would be possible to keep the aggregation sessions running, but that
seems to need some quite invasive changes and checks.

>
>> + ieee80211_sta_tear_down_BA_sessions(
>> + sta, AGG_STOP_LOCAL_REQUEST);
>
> minor indentation issue here
I'll go for a 81 char long single line here.

>
>> + ieee80211_flush_queues(key->local, key->sdata, false);
>> + ieee80211_key_disable_hw_accel(key);
>
> I'm not sure all drivers implement drv_flush() [correctly], what happens
> if they don't? I guess some packets end up being transmitted in clear
> text or a dummy key, unless the hardware/firmware knows about this and
> drops them?
Flushing the queues is not needed for to the logic and only a workaround
for drivers like ath9k which can still send out packets for a key which
was already deleted. When the driver guarantees that after the key
deletion function returns to mac80211 that there will be no more packets
send out prepared for the old key it can be removed.
The driver could simply wait for whatever the may queue time may be or
detect and drop these packets. (In fact it can even send out the packet
again, it just must make sure to still use the old and official deleted
key. May be handy for drivers like iwlwifi.)

>
> Perhaps that means we should make this whole thing opt-in, and leave it
> up to driver authors to first validate that they handle the flushing
> correctly?

I tried to minimize the changes, but if we can consider an API change I
would ask the drivers to provide a new function to switch directly from
one key to another, without the need to delete it first. And to
guarantee there, that after the function returns no packets with
prepared for the old key can be sent out. Including retransmissions.

That way drivers like ath9k should be able to directly program the new
key and drivers for cards where this is not saving time can simply call
delete and add internally.

As fallback I would still want to use the current code with a flush,
well knowing that not all drivers implement it and only remove it once
all drivers have switched to the new API for rekeys.


>
>
> Regarding the code: the whole dance you do with ieee80211_key_link() and
> ieee80211_key_replace() seems to be a little pointless because you still
> add the key to debugfs and then free it, on errors that is?

I hope the more verbose description above explains that part.
The intend of the dance is to finalize the switch to the new key in the
hardware prior to doing it in mac80211. It's probably possible to bypass
more of the code, but I did not touch the logic besides what I needed,
understand or verified to be harmless.



Alexander

2018-06-19 20:51:38

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey


> Am I understanding correctly that in the latter (outgoing) case this
> might cause unencrypted packets to be transmitted?

Yes, if we have a driver handling the keys similar to ath9k which is not
implementing drv_flush (correctly) this is a possibility.

This ties somewhat into the discussion what to handle in mac80211 and
what delegate to the drivers. mac80211 itself should drop any packets
trying to use the old outgoing key till the new key is rolled out
correctly with the patch.


>> We can now decrypt and get incoming packets while mac80211 is still on
>> the old key, but the worst (and most likely) case is now, that the
>> replay detection will drop those packets till mac80211 also switch over.
>
> This actually is somewhat problematic, at least for TKIP it could cause
> countermeasures. Should we exclude TKIP here somehow?
>
> I don't think we can disable countermeasures because then we could be
> attacked in this time window without starting them, and we have a really
> hard time distinguishing attacks and "fake" replays.
>
Yes, that is a definitely a possibility. But excluding TKIP seems to be
counterproductive for my understanding: TKIP is due to the weaker cipher
more likely to have unicast rekeying enabled.
So far I was hesitant to add new per packet check for that, but I guess
we have to... I'll add that in the next version, so we can discuss that
with some code.


>> Flushing the queues is not needed for to the logic and only a workaround
>> for drivers like ath9k which can still send out packets for a key which
>> was already deleted. When the driver guarantees that after the key
>> deletion function returns to mac80211 that there will be no more packets
>> send out prepared for the old key it can be removed.
>
> What will happen for other kinds of packets though?
>
> For iwlwifi, for example, the key material is added into the key. It
> thus doesn't have this race in the first place, but it will definitely
> send out packets after

That case is fine. A non-broken remote STA will not be able to decode
the packet if it has already completed the switch to the new key and not
hand over the packet to mac80211 or the equivalent.

>
> Btw - perhaps that means we should avoid the complicated mechanisms like
> TX aggregation shutdown for keys that the driver marks as being safe?
> Clearly for iwlwifi (at least CCMP and before, not with the longer keys
> in GCMP-256) the race can't possibly happen.

Sounds like that should work and I think I'll just try it out. We'll
lose the side benefit that shutting down TX aggregation will reduce the
risk that a unpatched remote sta freezes the connection, though.
And since I started out to first patch ath9k to drop packets for an
outgoing key: That looks to become either be ugly (delayed work to
complete the key clean up after a given time) or need some API change.

Remember we'll still have to shut down RX aggregation or drop all frames
of a session running during the rekey. We are not able to tell which key
was used to encrypt the frames and which have "old" PNs we can't allow
mac80211 to process after switching to the new key. So I'm not sure that
keeping TX up and running is worth it.

That said I have another working patch which is delaying mac80211 key
deletion from my first misguided attempts to at this issue.
I could repurpose that and keep RX aggregation also up and running,
allowing us to first check if the packet would have been accepted by the
old key and only switch to the new PN counter once it's not. But that
seems to be kind of invasive and overkill.

>
> In other drivers though, I worry that removing the key will *not* mean
> that there are no more packets transmitted with it. If you have a key
> cache in hardware then it might just be that marking the cache entry as
> invalid means the encryption will be skipped when encountering a packet
> to be transmitted with the key from a given (in the packet) key cache
> index, and then the frame goes out unprotected?
>
Yes, that is part of the reason I had to add the call to drv_flush for
ath9k. I've experimented with an additional "retire_key" command on top
of the existing to add and delete keys but came to the conclusion that
"replace_key" would make more sense for ath9k at least.

But in order of my preference I see this options:
1) removing a key in HW must also grantee that any packets queued for
this key has either been send or will be dropped after the call returns
to mac80211. (Simply sleeping the max queue and retransmit time would be
an ugly but simple way to implement that)

2) we add a new function/call like "replace_key" for this special case.
But in that case we have to sort out how to handle drivers not
implementing it. Thy only secure way would be to disconnect if someone
tries to rekey...

3) Is what I implemented. We try what we can with the existing API and
any driver not working with that has to be considered buggy.

>> I tried to minimize the changes, but if we can consider an API change I
>> would ask the drivers to provide a new function to switch directly from
>> one key to another, without the need to delete it first. And to
>> guarantee there, that after the function returns no packets with
>> prepared for the old key can be sent out. Including retransmissions.
>
> This would be pretty tricky for most drivers though.

I do not see any real alternative. Sleeping some reasonable time should
be acceptable here, to allow the hw to flush out queued packets. (100ms
should probably acceptable here, especially compared to complete
connection loss..
Only other idea I have is to lock down TX for all but EAPOL packets
sooner, e.g. during the key handshake. But that's more or less only a
variant of the sleep above.

>
>> That way drivers like ath9k should be able to directly program the new
>> key and drivers for cards where this is not saving time can simply call
>> delete and add internally.
>
> I don't think that's really all that much savings, time-wise? But it
> might address the "unprotected TX" issue I worry about above.
>
Fully agree.

>
> I'm just saying that
>
> err = add_to_hardware();
>
> add_to_debugfs();
>
> if (err)
> remove_from_debugfs();
>
> is pointless :-)

Agree, that is backwards compatibility a step too far:-)
I'll bypass that in the next patch version.
Planning to work on v3 on the weekend.

Alexander

2018-06-15 11:33:15

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

On Tue, 2018-05-15 at 12:22 +0200, Alexander Wetzel wrote:
> Rekeying a pairwise key with encryption offload and only keyid 0 has two
> potential races which can freeze the wlan conection till rekeyed again:
>
> 1) For incomming packets:
> If the local STA installs the key prior to the remote STA we still
> have the old key active in the hardware for a short time after
> mac80211 switched to the new key.
> The card can still hand over packets decoded with the old key to
> mac80211, bumping the new PN (IV) value to an incorrect high number and
> tricking the local replay detection to drop all packets really sent
> with the new key.
>
> 2) For outgoing packets:
> If mac80211 is providing the PN (IV) and hands over the cleartext
> packets for encryption to the hardware immediately prior to a key
> change the driver/card may process the queued packets after
> switching to the new key.
> This will immediatelly bump the PN (IV) value on the remote STA to
> an incorrect high number, also freezing the connection.
>
> Both issues can be prevented by first replacing the key in the HW and
> makeing sure no aggregation sessions are running during the rekey.

Getting back to this, am I understanding correctly that in the latter
(outgoing) case this would cause

Also, I think you should probably describe better why the aggregation
session stuff is needed. I'm already thinking there times about it again
...


> + ieee80211_sta_tear_down_BA_sessions(
> + sta, AGG_STOP_LOCAL_REQUEST);

minor indentation issue here

> + ieee80211_flush_queues(key->local, key->sdata, false);
> + ieee80211_key_disable_hw_accel(key);

I'm not sure all drivers implement drv_flush() [correctly], what happens
if they don't? I guess some packets end up being transmitted in clear
text or a dummy key, unless the hardware/firmware knows about this and
drops them?

Perhaps that means we should make this whole thing opt-in, and leave it
up to driver authors to first validate that they handle the flushing
correctly?


Regarding the code: the whole dance you do with ieee80211_key_link() and
ieee80211_key_replace() seems to be a little pointless because you still
add the key to debugfs and then free it, on errors that is?

johannes

2018-06-29 21:31:43

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

Hello,

>>> Am I understanding correctly that in the latter (outgoing) case this
>>> might cause unencrypted packets to be transmitted?
>>
>> Yes, if we have a driver handling the keys similar to ath9k which is not
>> implementing drv_flush (correctly) this is a possibility.
>
> Sadly, I think many drivers fall into this category, so I'm not sure we
> should "fix the bug" for them at the expense of a potential security
> risk? So maybe we need to have an opt-in flag for drivers.

Makes sense.

>
>> This ties somewhat into the discussion what to handle in mac80211 and
>> what delegate to the drivers. mac80211 itself should drop any packets
>> trying to use the old outgoing key till the new key is rolled out
>> correctly with the patch.
>
> Yeah, but it can't do it for packets already queued to the
> driver/hardware, and it can't know how the hardware behaves wrt.
> encryption: old-iwlwifi style (key material in descriptor) vs. ath9k
> style (key material in on-chip table).
>
>>>> We can now decrypt and get incoming packets while mac80211 is still on
>>>> the old key, but the worst (and most likely) case is now, that the
>>>> replay detection will drop those packets till mac80211 also switch over.
>>>
>>> This actually is somewhat problematic, at least for TKIP it could cause
>>> countermeasures. Should we exclude TKIP here somehow?
>>>
>>> I don't think we can disable countermeasures because then we could be
>>> attacked in this time window without starting them, and we have a really
>>> hard time distinguishing attacks and "fake" replays.
>>>
>>
>> Yes, that is a definitely a possibility. But excluding TKIP seems to be
>> counterproductive for my understanding: TKIP is due to the weaker cipher
>> more likely to have unicast rekeying enabled.
>> So far I was hesitant to add new per packet check for that, but I guess
>> we have to... I'll add that in the next version, so we can discuss that
>> with some code.
>
> Ok.

I was wrong here, this is not an issue. Tkip is simply dropping frames
when the IV is too small and never verifies the MIC. And since only MIC
errors can trigger counter measures we are fine as it is...

Nevertheless I also gave TKIP a quick test. While it's really slow
(2.6MB/s instead of 6.5MB/s due to no aggregation) it worked fine with
the patch and my sole client downloading. (Again ath9k as AP and iwlwifi
for STA)

Quick reference to the code:
ieee80211_rx_h_decrypt will return "RX_DROP_UNUSABLE" when IV is too
small and we bypass ieee80211_rx_h_michael_mic_verify in
ieee80211_rx_handlers_result due to the packet drop.

>
>>> Btw - perhaps that means we should avoid the complicated mechanisms like
>>> TX aggregation shutdown for keys that the driver marks as being safe?
>>> Clearly for iwlwifi (at least CCMP and before, not with the longer keys
>>> in GCMP-256) the race can't possibly happen.
>>
>> Sounds like that should work and I think I'll just try it out. We'll
>> lose the side benefit that shutting down TX aggregation will reduce the
>> risk that a unpatched remote sta freezes the connection, though.
>
> Fair point.
>
>> And since I started out to first patch ath9k to drop packets for an
>> outgoing key: That looks to become either be ugly (delayed work to
>> complete the key clean up after a given time) or need some API change.
>
> Ok.
>
>> Remember we'll still have to shut down RX aggregation or drop all frames
>> of a session running during the rekey. We are not able to tell which key
>> was used to encrypt the frames and which have "old" PNs we can't allow
>> mac80211 to process after switching to the new key. So I'm not sure that
>> keeping TX up and running is worth it.
>
> Yeah, good point, and TX is less interesting because we control when/how
> it's set up.
>

I'll keep the current aggregation tear down in the patch for now, then.
The idea with testing keeping TX aggregation up and running would
require an AP not stopping TX aggregation.


>> That said I have another working patch which is delaying mac80211 key
>> deletion from my first misguided attempts to at this issue.
>> I could repurpose that and keep RX aggregation also up and running,
>> allowing us to first check if the packet would have been accepted by the
>> old key and only switch to the new PN counter once it's not. But that
>> seems to be kind of invasive and overkill.
>
> Yeah, let's not go there for now. Also if HW crypto is involved it
> usually won't work unless you (a) get the packet and (b) re-encrypt the
> packet with the wrong key, and then decrypt it with the correct key ...

I don't get which case this is... But let's not go into that till we
have to:-)

>
>> Yes, that is part of the reason I had to add the call to drv_flush for
>> ath9k. I've experimented with an additional "retire_key" command on top
>> of the existing to add and delete keys but came to the conclusion that
>> "replace_key" would make more sense for ath9k at least.
>
> Ok.
>
>> But in order of my preference I see this options:
>> 1) removing a key in HW must also grantee that any packets queued for
>> this key has either been send or will be dropped after the call returns
>> to mac80211. (Simply sleeping the max queue and retransmit time would be
>> an ugly but simple way to implement that)
>
> This works but we have to consider a lot of drivers. I guess we're back
> to some form of "opt-in" scheme then?
>
>> 2) we add a new function/call like "replace_key" for this special case.
>> But in that case we have to sort out how to handle drivers not
>> implementing it. Thy only secure way would be to disconnect if someone
>> tries to rekey...
>
> Maybe that's not so bad.

Agree. After all this is what most of my windows drivers were doing
accidentally.


>> 3) Is what I implemented. We try what we can with the existing API and
>> any driver not working with that has to be considered buggy.
>
> I don't think we can really do this though. We break - in a potentially
> security-relevant way - the older drivers. We can't just say that's
> driver bugs, IMHO.

We would not break it, only not fix it for all drivers. The current code
is already leaking cleartext packets for at least ath9k and most likely
many others when the PTK is rekeyed. The patch would improve that, but
due to more working rekeys it could leak more packets in specific
scenarios, I assume...


>>>> I tried to minimize the changes, but if we can consider an API change I
>>>> would ask the drivers to provide a new function to switch directly from
>>>> one key to another, without the need to delete it first. And to
>>>> guarantee there, that after the function returns no packets with
>>>> prepared for the old key can be sent out. Including retransmissions.
>>>
>>> This would be pretty tricky for most drivers though.
>>
>> I do not see any real alternative. Sleeping some reasonable time should
>> be acceptable here, to allow the hw to flush out queued packets. (100ms
>> should probably acceptable here, especially compared to complete
>> connection loss..
>
> You have no guarantee that even 100ms is enough to get all packets out.
> They could be sitting on a BK queue and waiting for all BE/VI traffic in
> the vicinity ... basically forever.
>
>> Only other idea I have is to lock down TX for all but EAPOL packets
>> sooner, e.g. during the key handshake. But that's more or less only a
>> variant of the sleep above.
>
> Yeah, doesn't really fully address this issue anyway, since EAPOL are on
> VO and others might get way more delayed on HW queues due to contention.
>
> I think our best bet is some form of opt-in scheme. Perhaps (2), which
> drivers can implement also to get like the "best" behaviour. They could
> also implement there the flushing if they can, for example. Others would
> get a disconnect, but they probably effectively do already?
>

I'll give (2) a shot, then.

> Obviously if SW crypto is used none of this is a concern, so that's
> another factor to take into account in this decision logic of whether to
> disconnect or not?

I did not check the SW crypto code but I'm pretty sure that indeed works
with the current code.
I assume the packets will only be decrypted after an RX aggregation is
completed. That would filter out all packets send with the old key,
since they simply can't be decrypted any more.

Shall we bypass stopping aggregation sessions if we are on SW crypto?
We'll again lose the benefit that we prevent a broken remote STA to try
a TX Agg. (I tend to still stop them, but do not have a real opinion
here...)

>
> IOW - I'd rather get bugs that we now force a disconnect (if anyone even
> notices), rather than potentially having a bug that causes unencrypted
> packets to be sent.

Any suggestions how to trick hostap/wpa_supplicant into dropping the
connection? For me it looks like we can just report an error on key
install and expect the wpa_supplicant/hostapd to handle that.

Will try that for the next version of the patch, with the other
discussed improvements: If driver is not signalling "PTK0 rekey support"
we'll simply not accept key installs when we already have a old PTK key
for the connection.

Alexander

2018-06-18 21:27:17

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

Hi,

> > > 2) For outgoing packets:
> > > If mac80211 is providing the PN (IV) and hands over the cleartext
> > > packets for encryption to the hardware immediately prior to a key
> > > change the driver/card may process the queued packets after
> > > switching to the new key.
> > > This will immediatelly bump the PN (IV) value on the remote STA to
> > > an incorrect high number, also freezing the connection.
> > >
> > > Both issues can be prevented by first replacing the key in the HW and
> > > makeing sure no aggregation sessions are running during the rekey.
> >
> > Getting back to this, am I understanding correctly that in the latter
> > (outgoing) case this would cause
>
> I don't get the point here...Shall I flesh out the description of the
> exiting races in the code a bit more?

You couldn't, I didn't finish this sentence for some reason ... or wrote
and then deleted it by accident?

I meant to say:

Am I understanding correctly that in the latter (outgoing) case this
might cause unencrypted packets to be transmitted?

I talked about that more below.

> > Also, I think you should probably describe better why the aggregation
> > session stuff is needed. I'm already thinking there times about it again
> > ...
>
> I'll rework the description and will go into more details.
> Here a first quick draft what I plan amend to the commit message. But
> this got kind of long:
> ...
> Both issues can be prevented by first replacing the key in the HW and
> making sure no aggregation sessions are running during the rekey.
>
> The "core" of the fix is to change the old unicast key install sequence
> - atomic switch over to the new key in mac80211
> - remove the old key in the HW and mac80211
> - add new key in the hw
> to
> - mark the old key as tainted to drop TX packets trying to use it
> - remove the old key in the HW and mac80211
> - add new key to the HW
> - atomic switch over to the new key in mac80211
>
> Since the new key is not marked as tainted the last step will also
> enable TX again.
>
> With the new procedure the HW will be unable to decode packets still
> encrypted with the old key prior to switching to the new key in
> mac80211. Locking down TX during the rekey makes sure that there are no
> outgoing packets mac80211 prepared for the old key the hw can encrypt
> with the new key.

> We can now decrypt and get incoming packets while mac80211 is still on
> the old key, but the worst (and most likely) case is now, that the
> replay detection will drop those packets till mac80211 also switch over.

This actually is somewhat problematic, at least for TKIP it could cause
countermeasures. Should we exclude TKIP here somehow?

I don't think we can disable countermeasures because then we could be
attacked in this time window without starting them, and we have a really
hard time distinguishing attacks and "fake" replays.

> Instead of checking PN from old key packets against the new key (and
> bump the PN for the new key, killing our connection) we are now checking
> the PN from new key packets against the old key (and drop a few packets
> during rekey only).
>
> When using TX aggregation we get an additional race which can poison the
> remote station:
> When a TX aggregation session is running during the key replacement it's
> possible that one of the frames send out prior of the TX lock down (old
> key) got lost and will be repeated after the TX lock down is lifted,
> using the old skb prepared for the old key with the new installed key in
> the hardware. This would bump the PN counter of our peer to an incorrect
> value and breaking the RX for the remote station.

Right, this is what I had forgotten about already.

> > > + ieee80211_flush_queues(key->local, key->sdata, false);
> > > + ieee80211_key_disable_hw_accel(key);
> >
> > I'm not sure all drivers implement drv_flush() [correctly], what happens
> > if they don't? I guess some packets end up being transmitted in clear
> > text or a dummy key, unless the hardware/firmware knows about this and
> > drops them?
>
> Flushing the queues is not needed for to the logic and only a workaround
> for drivers like ath9k which can still send out packets for a key which
> was already deleted. When the driver guarantees that after the key
> deletion function returns to mac80211 that there will be no more packets
> send out prepared for the old key it can be removed.

What will happen for other kinds of packets though?

For iwlwifi, for example, the key material is added into the key. It
thus doesn't have this race in the first place, but it will definitely
send out packets after

Btw - perhaps that means we should avoid the complicated mechanisms like
TX aggregation shutdown for keys that the driver marks as being safe?
Clearly for iwlwifi (at least CCMP and before, not with the longer keys
in GCMP-256) the race can't possibly happen.

In other drivers though, I worry that removing the key will *not* mean
that there are no more packets transmitted with it. If you have a key
cache in hardware then it might just be that marking the cache entry as
invalid means the encryption will be skipped when encountering a packet
to be transmitted with the key from a given (in the packet) key cache
index, and then the frame goes out unprotected?

> I tried to minimize the changes, but if we can consider an API change I
> would ask the drivers to provide a new function to switch directly from
> one key to another, without the need to delete it first. And to
> guarantee there, that after the function returns no packets with
> prepared for the old key can be sent out. Including retransmissions.

This would be pretty tricky for most drivers though.

> That way drivers like ath9k should be able to directly program the new
> key and drivers for cards where this is not saving time can simply call
> delete and add internally.

I don't think that's really all that much savings, time-wise? But it
might address the "unprotected TX" issue I worry about above.

> > Regarding the code: the whole dance you do with ieee80211_key_link() and
> > ieee80211_key_replace() seems to be a little pointless because you still
> > add the key to debugfs and then free it, on errors that is?
>
> I hope the more verbose description above explains that part.
> The intend of the dance is to finalize the switch to the new key in the
> hardware prior to doing it in mac80211. It's probably possible to bypass
> more of the code, but I did not touch the logic besides what I needed,
> understand or verified to be harmless.

I'm just saying that

err = add_to_hardware();

add_to_debugfs();

if (err)
remove_from_debugfs();

is pointless :-)

johannes

2018-07-04 00:06:30

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

On Tue, 2018-07-03 at 21:54 +0200, Alexander Wetzel wrote:

> > I think easier would be to just disconnect ourselves? At least if we're
> > in managed mode...
> >
>
> I still have much to learn about 802.11, but so far I did not see way to
> directly disconnect a STA. (Maybe spoofing a "signal lost" event or
> something like that, but I fear complications by losing the sync with
> the remote STA.) Is there any call/signal you have in mind I could test?

ieee80211_set_disassoc(), this can also send a frame out to indicate to
the AP that we're disconnecting instead.

> hostapd or wpa_supplicant are "ordering" mac80211 to install a new key
> and are implementing the state machine and are in a good position to
> handle the fallout... at least theoretically.

Ideally it would even know beforehand that we don't want to handle the
PTK rekeying, and then could reconnect instead of going through the
handshake.

> Instead I get a pop up asking for the PSK. Entering it reconnects
> normally. Cancel the prompt disconnect till a manual reconnect.
> I suspect NetworkManager is handling the rekey like the initial key
> install and then assumes the PSK is wrong. Hardly surprising but also
> highly visible to the users.

That's pretty awkward.

> But then only to those using the now broken rekey...

Yeah, but you don't necessarily control that - i.e. client device and AP
might have different owners :-)

> Using wpa_supplicant directly reconnects after ~15s.
> It also assumes the key is wrong and seems to rate limit the connection
> attempts. Here a log with wpa_supplicat running in the console and dmesg
> -wT output on top of that:

So I think we're probably better off accepting the set_key but not
actually using it, and instead disconnecting... even if that's awkward
and should come with a big comment :-)

johannes

2018-07-11 17:21:57

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

Hi Denis,

> Hi Alexander,
>
>>>> hostapd or wpa_supplicant are "ordering" mac80211 to install a new key
>>>> and are implementing the state machine and are in a good position to
>>>> handle the fallout... at least theoretically.
>>>
>>> Ideally it would even know beforehand that we don't want to handle the
>>> PTK rekeying, and then could reconnect instead of going through the
>>> handshake.
>>>
>>
>> Don't see how we could do that in the kernel, all the relevant
>> information is handled in the state machine. I guess an API extension
>> telling hostap/supplicant if we can handle rekeys or not would tbe he
>> only way to avoid that.
>>
>
> Can the kernel / driver provide some sort of hint to user space that PTK
> rekey isn’t supported?  We could then have user space deauthenticate
> with a big warning about what/why this is happening and try to
> re-connect to the last used BSS.
>

Sure. In fact the latest patch is already doing that by returning an
error when set_key is called for PTK and it's not an initial call.
Tests with wpa_supplicant shows that this is is then handled like the
initial key set is failing. Networkmanager prompts for the password and
wpa_supplicant running without seems to blacklist a reconnect for 15s.

I kind of liked that solution, but with existing implematations out this
is indeed awkward to find a "correct" solution.

The main problem for me currently is to find a correct and still
acceptable solution. This turned from "let's fix this nasty wlan
connection freezes" to a projet spanning the complete wlan stack: From
hardware up to and including the userspace...

It's fun to learn how that interacts (if not very fast), I'm stuggling
finding the best way forward here. Whatever we do has undesired
consequences.
Maybe I'm missing something, but here the high level options we have in
my opinion:

1) Keep it as it is and solve that in a indefinite future when we and
the world implement the ieee802.11 2012 addition, to use key 0+1 for PTK
and 2+3 for GTK
- rekey has a extrem high probability of freezing connections and
leaking a few clear text packets for years (decades?) to come
+ The issue is fixed at the core

2) Make it worse, like some (most) Windows systems/cards seem to handle
it by encrypting EAPOL #4 with the NEW key, breaking the handshake and
forcing a reconnect.
- break something more to fix a problem sounds like a insane approach
+ This seems to be quite common and therefore well "tested" (based on my
very limited data on that)

3) Fix what we can in mac80211 but keep the API stable
- Without driver actions still many drivers will be "undefined" and even
if they are not freezing leak packets
+ This will reduce the problems to a fraction of what is is today with
only a mac80211 update

4) Redesign the mac80211 rekey handling and interaction with drivers to
only rekey if it is save and decline when not.
+ We only have to touch the kernel
- any supplicant (whatever runs the wpa state machine) may get errors
where the programmes did not expect them, leading to unexpected side
effects.

5) The full-stack solution: Update the API for the userpace
+ We do not have to "trick" the wpa state machine to disconnect, the
programmers of it have to code it.
- Well, it must be suppurted from the wpa state machine. If not we still
have to handle the rekey somehow or we accept freezes/cleartext leaks...

The last two solutions will also need some "fallback" when a secure
rekey is not possible and/or the user is runing an old state machine not
knowing about the new way...


>>> So I think we're probably better off accepting the set_key but not
>>> actually using it, and instead disconnecting... even if that's awkward
>>> and should come with a big comment :-)
>>
>> Instead of returning an error I'll change the code to accept the rekey
>> but do nothing with it. (Basically delete the new key and keep the old
>> active).
>> And of course calling ieee80211_set_disassoc() prior to return "success".
>>
>> Let's see how the supplicant will react on a disassoc while doing a
>> rekey...
>>
>
> This sounds pretty awful actually.  Now that wpa_s is not the only game
> in town, can we stop resorting to these tactics?

Nothing of it is wpa_supplicat/hostap specifiy. Only my current "test"
environment is using it, simply due to the fact that I tracked the issue
down in that environment.
Everything besides ath9k as an AP running hostapd and a iwldvm card
running wpa_supplicant is mostly untested. And even there I have some
areas marked for follow up after we find a solution acceptable for the
kernel...

Do you have any other software you think I should add to my prelimitary
tests? If possible I'll happy to extend the test of the patches with those.

Alexander


Attachments:
pEpkey.asc (1.73 kB)

2018-07-10 21:13:52

by Denis Kenzior

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

Hi Alexander,

>>> hostapd or wpa_supplicant are "ordering" mac80211 to install a new key
>>> and are implementing the state machine and are in a good position to
>>> handle the fallout... at least theoretically.
>>
>> Ideally it would even know beforehand that we don't want to handle the
>> PTK rekeying, and then could reconnect instead of going through the
>> handshake.
>>
>
> Don't see how we could do that in the kernel, all the relevant
> information is handled in the state machine. I guess an API extension
> telling hostap/supplicant if we can handle rekeys or not would tbe he
> only way to avoid that.
>

Can the kernel / driver provide some sort of hint to user space that PTK
rekey isn’t supported? We could then have user space deauthenticate
with a big warning about what/why this is happening and try to
re-connect to the last used BSS.

>> So I think we're probably better off accepting the set_key but not
>> actually using it, and instead disconnecting... even if that's awkward
>> and should come with a big comment :-)
>
> Instead of returning an error I'll change the code to accept the rekey
> but do nothing with it. (Basically delete the new key and keep the old
> active).
> And of course calling ieee80211_set_disassoc() prior to return "success".
>
> Let's see how the supplicant will react on a disassoc while doing a rekey...
>

This sounds pretty awful actually. Now that wpa_s is not the only game
in town, can we stop resorting to these tactics?

Regards,
-Denis

2018-07-08 08:19:35

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

>>> I think easier would be to just disconnect ourselves? At least if we're
>>> in managed mode...
>>>
>>
>> I still have much to learn about 802.11, but so far I did not see way to
>> directly disconnect a STA. (Maybe spoofing a "signal lost" event or
>> something like that, but I fear complications by losing the sync with
>> the remote STA.) Is there any call/signal you have in mind I could test?
>
> ieee80211_set_disassoc(), this can also send a frame out to indicate to
> the AP that we're disconnecting instead.
>

I'll try that, but will probably take another week. My main main work
station got severe file system corruption, forcing me to reinstall it
from scratch. I suspect it was something in the wireless testing kernel
4.18-rc1 (944ae08d4e71) related to either btrfs or the ssd disk but
since I needed the system just started over and avoid to run 4.18 if I
do not have a full backup...

>> hostapd or wpa_supplicant are "ordering" mac80211 to install a new key
>> and are implementing the state machine and are in a good position to
>> handle the fallout... at least theoretically.
>
> Ideally it would even know beforehand that we don't want to handle the
> PTK rekeying, and then could reconnect instead of going through the
> handshake.
>

Don't see how we could do that in the kernel, all the relevant
information is handled in the state machine. I guess an API extension
telling hostap/supplicant if we can handle rekeys or not would tbe he
only way to avoid that.

>> Instead I get a pop up asking for the PSK. Entering it reconnects
>> normally. Cancel the prompt disconnect till a manual reconnect.
>> I suspect NetworkManager is handling the rekey like the initial key
>> install and then assumes the PSK is wrong. Hardly surprising but also
>> highly visible to the users.
>
> That's pretty awkward.
>
>> But then only to those using the now broken rekey...
>
> Yeah, but you don't necessarily control that - i.e. client device and AP
> might have different owners :-)
>
>> Using wpa_supplicant directly reconnects after ~15s.
>> It also assumes the key is wrong and seems to rate limit the connection
>> attempts. Here a log with wpa_supplicat running in the console and dmesg
>> -wT output on top of that:
>
> So I think we're probably better off accepting the set_key but not
> actually using it, and instead disconnecting... even if that's awkward
> and should come with a big comment :-)

Instead of returning an error I'll change the code to accept the rekey
but do nothing with it. (Basically delete the new key and keep the old
active).
And of course calling ieee80211_set_disassoc() prior to return "success".

Let's see how the supplicant will react on a disassoc while doing a rekey...

Alexander


Attachments:
pEpkey.asc (1.73 kB)

2018-07-15 10:51:01

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes/clear text packet leaks at rekey

Hi,


>>> So I think we're probably better off accepting the set_key but not
>>> actually using it, and instead disconnecting... even if that's awkward
>>> and should come with a big comment :-)
>>
>> Instead of returning an error I'll change the code to accept the rekey
>> but do nothing with it. (Basically delete the new key and keep the old
>> active).
>> And of course calling ieee80211_set_disassoc() prior to return "success".
>
> Right. Did you handle/consider modes other than BSS/P2P client btw?

I've tested that on a client only and it's already not working. Problem
is, that with ieee80211_set_disassoc() we just dump the association in
the kernel and are not informing the userspace, so the state machine is
stuck in "connected" but the STA is unable to communicate.

I also tried ieee80211_mgd_deauth():
While better this is basically the same behaviour as returning an error
on key replace. By setting the error code to inactivity at least
wpa_supplicant was actually starting to reconnect, unfortunatelly
set_key then failes since we purged the assosication in the kernel. (Or
that's my best interpretation from the logs). Networkmanager then again
prompted for the password...

I started experimenting with just signals to the userspace, but then
paused... Trying to control the state machine with spoofed errors trying
to trigger an "desireable" action can't be the way forward, can it?

Even if we get that working changes or different implementations in
userspace may well break it.

I now think we only have two reasonable ways forward:

1) The user friendly one: If the userspace does not know about the new
API just print a error message and do the insecure key replace. With
potential clear text packet leaks and connection freezes.

2) The secure way: Report an error on key install and accept that users
will encounter new problems when they are using rekeys with the old API
with "old" userspace software.

Since we have this issue in the kernel at least as long as we have
mac80211 I would vote for 1) here. Fix mac80211 and add a new way for
the userspace to to securely replace the keys and detect when this is
not possible. Then get the userspace software updated to act
accordingly, finally preventing the clear text packet leaks.

After some time we can then decide to reject insecure rekeys, burning
only those who use kernels much newer than the userspace.

Does this sound like an reasonable approch or should I go back figuring
out how to trick the userspace to reconnect without user
notification/intervention?

Alexander


Attachments:
pEpkey.asc (1.73 kB)

2018-07-04 01:54:00

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

Hello,


>
>> I was wrong here, this is not an issue. Tkip is simply dropping frames
>> when the IV is too small and never verifies the MIC. And since only MIC
>> errors can trigger counter measures we are fine as it is...
>
> Err, yes, of course. My bad.
>
>>>> 3) Is what I implemented. We try what we can with the existing API and
>>>> any driver not working with that has to be considered buggy.
>>>
>>> I don't think we can really do this though. We break - in a potentially
>>> security-relevant way - the older drivers. We can't just say that's
>>> driver bugs, IMHO.
>>
>> We would not break it, only not fix it for all drivers. The current code
>> is already leaking cleartext packets for at least ath9k and most likely
>> many others when the PTK is rekeyed. The patch would improve that, but
>> due to more working rekeys it could leak more packets in specific
>> scenarios, I assume...
>
> Yeah, ok, fair point. I don't really know.
>
>>> Obviously if SW crypto is used none of this is a concern, so that's
>>> another factor to take into account in this decision logic of whether to
>>> disconnect or not?
>>
>> I did not check the SW crypto code but I'm pretty sure that indeed works
>> with the current code.
>
> Me too.
>
>> I assume the packets will only be decrypted after an RX aggregation is
>> completed. That would filter out all packets send with the old key,
>> since they simply can't be decrypted any more.
>
> Oh, good point, but that's true - reordering happens before software
> decryption.
>
>> Shall we bypass stopping aggregation sessions if we are on SW crypto?
>> We'll again lose the benefit that we prevent a broken remote STA to try
>> a TX Agg. (I tend to still stop them, but do not have a real opinion
>> here...)
>
> I don't really know.
>
>>> IOW - I'd rather get bugs that we now force a disconnect (if anyone even
>>> notices), rather than potentially having a bug that causes unencrypted
>>> packets to be sent.
>>
>> Any suggestions how to trick hostap/wpa_supplicant into dropping the
>> connection? For me it looks like we can just report an error on key
>> install and expect the wpa_supplicant/hostapd to handle that.
>
> I think easier would be to just disconnect ourselves? At least if we're
> in managed mode...
>
I still have much to learn about 802.11, but so far I did not see way to
directly disconnect a STA. (Maybe spoofing a "signal lost" event or
something like that, but I fear complications by losing the sync with
the remote STA.) Is there any call/signal you have in mind I could test?

hostapd or wpa_supplicant are "ordering" mac80211 to install a new key
and are implementing the state machine and are in a good position to
handle the fallout... at least theoretically.


>> Will try that for the next version of the patch, with the other
>> discussed improvements: If driver is not signalling "PTK0 rekey support"
>> we'll simply not accept key installs when we already have a old PTK key
>> for the connection.
>
> Since I see you have a new patch - how did that work out? :)

So far I've only tested it with iwlwifi client (dvm, so no AP) on my
normal "desktop".

When implementing replace_key it seems to work fine. (No OTA captures
done, just connection tests with the AP running still using the previous
patch.)

Using a unpatched iwlwifi does not reconnect automatically as expected.

Instead I get a pop up asking for the PSK. Entering it reconnects
normally. Cancel the prompt disconnect till a manual reconnect.
I suspect NetworkManager is handling the rekey like the initial key
install and then assumes the PSK is wrong. Hardly surprising but also
highly visible to the users.
But then only to those using the now broken rekey...


Using wpa_supplicant directly reconnects after ~15s.
It also assumes the key is wrong and seems to rate limit the connection
attempts. Here a log with wpa_supplicat running in the console and dmesg
-wT output on top of that:

wlp3s0: WPA: Failed to set PTK to the driver (alg=3 keylen=16
bssid=12:6f:3f:0e:33:3c)
[Tue Jul 3 21:13:17 2018] wlp3s0: Driver is not supporting save PTK key
replacement. Insecure rekey attempt for STA 12:6f:3f:0e:33:3c denied.
[Tue Jul 3 21:13:17 2018] wlp3s0: deauthenticating from
12:6f:3f:0e:33:3c by local choice (Reason: 1=UNSPECIFIED)
wlp3s0: CTRL-EVENT-DISCONNECTED bssid=12:6f:3f:0e:33:3c reason=1
locally_generated=1
wlp3s0: WPA: 4-Way Handshake failed - pre-shared key may be incorrect
wlp3s0: CTRL-EVENT-SSID-TEMP-DISABLED id=0 ssid="mordor-g"
auth_failures=1 duration=10 reason=WRONG_KEY
wlp3s0: CTRL-EVENT-REGDOM-CHANGE init=CORE type=WORLD
[Tue Jul 3 21:13:17 2018] wlp3s0: delba from 12:6f:3f:0e:33:3c
(initiator) tid 0 reason code 37
[Tue Jul 3 21:13:17 2018] wlp3s0: Rx BA session stop requested for
12:6f:3f:0e:33:3c tid 0 initiator reason: 0
[Tue Jul 3 21:13:17 2018] wlp3s0: moving STA 12:6f:3f:0e:33:3c to state 3
[Tue Jul 3 21:13:17 2018] wlp3s0: moving STA 12:6f:3f:0e:33:3c to state 2
[Tue Jul 3 21:13:17 2018] wlp3s0: moving STA 12:6f:3f:0e:33:3c to state 1
[Tue Jul 3 21:13:17 2018] wlp3s0: Removed STA 12:6f:3f:0e:33:3c
[Tue Jul 3 21:13:17 2018] wlp3s0: Destroyed STA 12:6f:3f:0e:33:3c


wlp3s0: CTRL-EVENT-SSID-REENABLED id=0 ssid="mordor-g"
wlp3s0: SME: Trying to authenticate with 12:6f:3f:0e:33:3c
(SSID='mordor-g' freq=2432 MHz)
wlp3s0: Trying to associate with 12:6f:3f:0e:33:3c (SSID='mordor-g'
freq=2432 MHz)
[Tue Jul 3 21:13:31 2018] wlp3s0: authenticate with 12:6f:3f:0e:33:3c
[Tue Jul 3 21:13:31 2018] wlp3s0: Allocated STA 12:6f:3f:0e:33:3c
[Tue Jul 3 21:13:31 2018] wlp3s0: Inserted STA 12:6f:3f:0e:33:3c
[Tue Jul 3 21:13:31 2018] wlp3s0: send auth to 12:6f:3f:0e:33:3c (try 1/3)
[Tue Jul 3 21:13:31 2018] wlp3s0: authenticated
[Tue Jul 3 21:13:31 2018] wlp3s0: moving STA 12:6f:3f:0e:33:3c to state 2
[Tue Jul 3 21:13:31 2018] wlp3s0: associate with 12:6f:3f:0e:33:3c (try
1/3)
wlp3s0: Associated with 12:6f:3f:0e:33:3c
wlp3s0: CTRL-EVENT-SUBNET-STATUS-UPDATE status=0
wlp3s0: CTRL-EVENT-REGDOM-CHANGE init=COUNTRY_IE type=COUNTRY alpha2=DE
[Tue Jul 3 21:13:31 2018] wlp3s0: RX AssocResp from 12:6f:3f:0e:33:3c
(capab=0x431 status=0 aid=1)
[Tue Jul 3 21:13:31 2018] wlp3s0: moving STA 12:6f:3f:0e:33:3c to state 3
[Tue Jul 3 21:13:31 2018] wlp3s0: WMM AC=0 acm=0 aifs=2 cWmin=3 cWmax=7
txop=47 uapsd=0, downgraded=0
[Tue Jul 3 21:13:31 2018] wlp3s0: WMM AC=1 acm=0 aifs=2 cWmin=7
cWmax=15 txop=94 uapsd=0, downgraded=0
[Tue Jul 3 21:13:31 2018] wlp3s0: WMM AC=2 acm=0 aifs=3 cWmin=15
cWmax=1023 txop=0 uapsd=0, downgraded=0
[Tue Jul 3 21:13:31 2018] wlp3s0: WMM AC=3 acm=0 aifs=7 cWmin=15
cWmax=1023 txop=0 uapsd=0, downgraded=0
[Tue Jul 3 21:13:31 2018] wlp3s0: associated
wlp3s0: WPA: Key negotiation completed with 12:6f:3f:0e:33:3c [PTK=CCMP
GTK=CCMP]
wlp3s0: CTRL-EVENT-CONNECTED - Connection to 12:6f:3f:0e:33:3c completed
[id=0 id_str=]
[Tue Jul 3 21:13:31 2018] wlp3s0: moving STA 12:6f:3f:0e:33:3c to state 4
[Tue Jul 3 21:13:31 2018] wlp3s0: AddBA Req buf_size=64 for
12:6f:3f:0e:33:3c
[Tue Jul 3 21:13:31 2018] wlp3s0: Rx A-MPDU request on
12:6f:3f:0e:33:3c tid 0 result 0

A test with the code on a AP is still pending. (I'll probably try that
on the weekend.)

Alexander

2018-07-03 09:51:27

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

On Fri, 2018-06-29 at 23:14 +0200, Alexander Wetzel wrote:

> I was wrong here, this is not an issue. Tkip is simply dropping frames
> when the IV is too small and never verifies the MIC. And since only MIC
> errors can trigger counter measures we are fine as it is...

Err, yes, of course. My bad.

> > > 3) Is what I implemented. We try what we can with the existing API and
> > > any driver not working with that has to be considered buggy.
> >
> > I don't think we can really do this though. We break - in a potentially
> > security-relevant way - the older drivers. We can't just say that's
> > driver bugs, IMHO.
>
> We would not break it, only not fix it for all drivers. The current code
> is already leaking cleartext packets for at least ath9k and most likely
> many others when the PTK is rekeyed. The patch would improve that, but
> due to more working rekeys it could leak more packets in specific
> scenarios, I assume...

Yeah, ok, fair point. I don't really know.

> > Obviously if SW crypto is used none of this is a concern, so that's
> > another factor to take into account in this decision logic of whether to
> > disconnect or not?
>
> I did not check the SW crypto code but I'm pretty sure that indeed works
> with the current code.

Me too.

> I assume the packets will only be decrypted after an RX aggregation is
> completed. That would filter out all packets send with the old key,
> since they simply can't be decrypted any more.

Oh, good point, but that's true - reordering happens before software
decryption.

> Shall we bypass stopping aggregation sessions if we are on SW crypto?
> We'll again lose the benefit that we prevent a broken remote STA to try
> a TX Agg. (I tend to still stop them, but do not have a real opinion
> here...)

I don't really know.

> > IOW - I'd rather get bugs that we now force a disconnect (if anyone even
> > notices), rather than potentially having a bug that causes unencrypted
> > packets to be sent.
>
> Any suggestions how to trick hostap/wpa_supplicant into dropping the
> connection? For me it looks like we can just report an error on key
> install and expect the wpa_supplicant/hostapd to handle that.

I think easier would be to just disconnect ourselves? At least if we're
in managed mode...

> Will try that for the next version of the patch, with the other
> discussed improvements: If driver is not signalling "PTK0 rekey support"
> we'll simply not accept key installs when we already have a old PTK key
> for the connection.

Since I see you have a new patch - how did that work out? :)

johannes

2018-07-09 07:13:50

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

Hi,

> I'll try that, but will probably take another week. My main main work
> station got severe file system corruption, forcing me to reinstall it
> from scratch. I suspect it was something in the wireless testing kernel
> 4.18-rc1 (944ae08d4e71) related to either btrfs or the ssd disk but
> since I needed the system just started over and avoid to run 4.18 if I
> do not have a full backup...

Ouch. FWIW, it's possible to run inside a VM with PCI(e) devices
outside, at least on some machines.

> > > hostapd or wpa_supplicant are "ordering" mac80211 to install a new key
> > > and are implementing the state machine and are in a good position to
> > > handle the fallout... at least theoretically.
> >
> > Ideally it would even know beforehand that we don't want to handle the
> > PTK rekeying, and then could reconnect instead of going through the
> > handshake.
> >
>
> Don't see how we could do that in the kernel, all the relevant
> information is handled in the state machine. I guess an API extension
> telling hostap/supplicant if we can handle rekeys or not would tbe he
> only way to avoid that.

Right. Not really much point for now I guess.

> > So I think we're probably better off accepting the set_key but not
> > actually using it, and instead disconnecting... even if that's awkward
> > and should come with a big comment :-)
>
> Instead of returning an error I'll change the code to accept the rekey
> but do nothing with it. (Basically delete the new key and keep the old
> active).
> And of course calling ieee80211_set_disassoc() prior to return "success".

Right. Did you handle/consider modes other than BSS/P2P client btw?

> Let's see how the supplicant will react on a disassoc while doing a rekey...

Shouldn't matter to it, I'd think.

johannes

2018-07-11 19:49:18

by Denis Kenzior

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

Hi Alexander,

On 07/11/2018 12:08 PM, Alexander Wetzel wrote:
> Hi Denis,
>
>> Hi Alexander,
>>
>>>>> hostapd or wpa_supplicant are "ordering" mac80211 to install a new key
>>>>> and are implementing the state machine and are in a good position to
>>>>> handle the fallout... at least theoretically.
>>>>
>>>> Ideally it would even know beforehand that we don't want to handle the
>>>> PTK rekeying, and then could reconnect instead of going through the
>>>> handshake.
>>>>
>>>
>>> Don't see how we could do that in the kernel, all the relevant
>>> information is handled in the state machine. I guess an API extension
>>> telling hostap/supplicant if we can handle rekeys or not would tbe he
>>> only way to avoid that.
>>>
>>
>> Can the kernel / driver provide some sort of hint to user space that PTK
>> rekey isn’t supported?  We could then have user space deauthenticate
>> with a big warning about what/why this is happening and try to
>> re-connect to the last used BSS.
>>
>
> Sure. In fact the latest patch is already doing that by returning an
> error when set_key is called for PTK and it's not an initial call.
> Tests with wpa_supplicant shows that this is is then handled like the
> initial key set is failing. Networkmanager prompts for the password and
> wpa_supplicant running without seems to blacklist a reconnect for 15s.

Ideally we shouldn't even get this far. We really need some kind of
capability bit on the phy telling userspace whether PTK rekey is
supported or not. Then userspace can take proper action based on this
information. E.g. if PTK rekey isn't safe, then we can simply issue a
CMD_DISCONNECT and re-connect to the last BSS. The kernel doesn't need
to play any 'tricks'.

The fact that current userspace implementations are broken is
regrettable and needs to be fixed.

>
> I kind of liked that solution, but with existing implematations out this
> is indeed awkward to find a "correct" solution.
>
> The main problem for me currently is to find a correct and still
> acceptable solution. This turned from "let's fix this nasty wlan
> connection freezes" to a projet spanning the complete wlan stack: From
> hardware up to and including the userspace...

Right. The problem is that this PTK rekey likely 'works' (for some
definition thereof) in a vast majority of cases, e.g. the link isn't
broken, so the user doesn't notice. So, if the kernel starts to
unilaterally issue disconnects, you will have a lot of grumbling users.

Just to clarify, I'm not arguing against this necessarily. I can see
why issuing a disconnect is a good idea for many reasons (e.g. security,
etc.) But, I would expect a lot of user backlash if this is done, and
given that this has been an issue for many years, I wonder if its the
right way of handling this?

>
> It's fun to learn how that interacts (if not very fast), I'm stuggling
> finding the best way forward here. Whatever we do has undesired
> consequences.
> Maybe I'm missing something, but here the high level options we have in
> my opinion:
>
> 1) Keep it as it is and solve that in a indefinite future when we and
> the world implement the ieee802.11 2012 addition, to use key 0+1 for PTK
> and 2+3 for GTK
> - rekey has a extrem high probability of freezing connections and
> leaking a few clear text packets for years (decades?) to come
> + The issue is fixed at the core

It would seem to me that 0+1 rekey is a separate issue that needs to be
supported in both kernel and userspace anyway.

>
> 2) Make it worse, like some (most) Windows systems/cards seem to handle
> it by encrypting EAPOL #4 with the NEW key, breaking the handshake and
> forcing a reconnect.
> - break something more to fix a problem sounds like a insane approach
> + This seems to be quite common and therefore well "tested" (based on my
> very limited data on that)

This seems awful. And then if you're unlucky someone will come in and
tell you that the kernel has to maintain this 'legacy' behavior forever.
So things can't ever be fixed.

Plus, as I already mentioned above, some users 'think' that PTK rekey
already works just fine.

>
> 3) Fix what we can in mac80211 but keep the API stable
> - Without driver actions still many drivers will be "undefined" and even
> if they are not freezing leak packets
> + This will reduce the problems to a fraction of what is is today with
> only a mac80211 update
>
> 4) Redesign the mac80211 rekey handling and interaction with drivers to
> only rekey if it is save and decline when not.
> + We only have to touch the kernel
> - any supplicant (whatever runs the wpa state machine) may get errors
> where the programmes did not expect them, leading to unexpected side
> effects.
>
> 5) The full-stack solution: Update the API for the userpace
> + We do not have to "trick" the wpa state machine to disconnect, the
> programmers of it have to code it.
> - Well, it must be suppurted from the wpa state machine. If not we still
> have to handle the rekey somehow or we accept freezes/cleartext leaks...
>
> The last two solutions will also need some "fallback" when a secure
> rekey is not possible and/or the user is runing an old state machine not
> knowing about the new way...
>

My vote is for something along the lines of 5. I realize there are no
good solutions here, but this really should be fixed by a combination of
userspace and kernel working properly together. If this isn't possible,
then perhaps the whole rekey should be done in the kernel and userspace
should be given no chance to make a bad decision.

>
>>>> So I think we're probably better off accepting the set_key but not
>>>> actually using it, and instead disconnecting... even if that's awkward
>>>> and should come with a big comment :-)
>>>
>>> Instead of returning an error I'll change the code to accept the rekey
>>> but do nothing with it. (Basically delete the new key and keep the old
>>> active).
>>> And of course calling ieee80211_set_disassoc() prior to return "success".
>>>
>>> Let's see how the supplicant will react on a disassoc while doing a
>>> rekey...
>>>
>>
>> This sounds pretty awful actually.  Now that wpa_s is not the only game
>> in town, can we stop resorting to these tactics?
>
> Nothing of it is wpa_supplicat/hostap specifiy. Only my current "test"
> environment is using it, simply due to the fact that I tracked the issue
> down in that environment.
> Everything besides ath9k as an AP running hostapd and a iwldvm card
> running wpa_supplicant is mostly untested. And even there I have some
> areas marked for follow up after we find a solution acceptable for the
> kernel...
>
> Do you have any other software you think I should add to my prelimitary
> tests? If possible I'll happy to extend the test of the patches with those.
>

You can try:
https://git.kernel.org/pub/scm/network/wireless/iwd.git/

We would be happy to implement whatever 'proper' userspace behavior /
kernel api that this discussion leads to.

Regards,
-Denis

2018-06-30 22:11:12

by Alexander Wetzel

[permalink] [raw]
Subject: [PATCH v3] mac80211: Fix PTK rekey freezes and cleartext leaks

Rekeying a pairwise key with encryption offload and only keyid 0 has
multiple races. Two of them can can freeze the wlan connection till
rekeyed again and the third can send out packets in clear which should
have been encrypted.

1) Freeze caused by incoming packets:
If the local STA installs the key prior to the remote STA we still
have the old key active in the hardware for a short time after
mac80211 switched to the new key.
The card can still hand over packets decoded with the old key to
mac80211, bumping the new PN (IV) value to an incorrect high number
and tricking the local replay detection to drop all packets really
sent with the new key.

2) Freeze caused by outgoing packets:
If mac80211 is providing the PN (IV) and hands over the cleartext
packets for encryption to the hardware immediately prior to a key
change the driver/card may process the queued packets after
switching to the new key.
This will immediately bump the PN (IV) value on the remote STA to
an incorrect high number, also freezing the connection.

3) Clear text leak:
Removing encryption offload from the card clears the encryption
offload flag only after the card has removed the key.

All three issues can be prevented by making sure we do not send out any
packets while replacing the key, replace the key first in the HW and
only after that in mac80211 and stop running/block new aggregation
sessions during the rekey.

This changes the old unicast key install sequence from:
- atomic switch over to the new key in mac80211 (TX still active!)
- remove the old key in the HW (stop encryption offloading)
- delete the now inactive old key in mac80211 (staring software
encryption after a potential clear text packet leak)
- add new key in the hw for encryption offloading (ending software
encryption)

to:
- mark the old key as tainted to drop TX packets with the outgoing key
- replace the key in HW with the new one using the new driver callback
"replace_key"
- atomic switch over to the new key in mac80211 (allow TX again)
- delete the now inactive old key in mac80211

For drivers not implementing the new callback "replace_key" it's
unknown if the driver can replace the key without leaking cleartext
packets. Mac80211 will therefore log an error message when trying to
update the PTK key and refuse cooperation. This will disconnect the STA.

With the new sequence the HW will be already unable to decode packets
still encrypted with the old key prior to switching to the new key in
mac80211. Locking down TX during the rekey makes sure that there are no
outgoing packets while the driver and card are switching to the new key.

The driver is allowed to hand over packets decrypted with either the new
or the old key till "replace_key" returns. But all packets queued prior
to calling the callback must be either still be send out encrypted with
the old key or be dropped.

A RX aggregation session started prior to the rekey and which completes
after can still dump frames received with the old key at mac80211 after
we switched over to the new key. This is currently avoided by stopping
all RX and TX aggregation sessions when we replace a PTK key and are
using key offloading.

Signed-off-by: Alexander Wetzel <[email protected]>
---
include/net/mac80211.h | 12 +++++
net/mac80211/driver-ops.h | 20 ++++++++
net/mac80211/key.c | 102 +++++++++++++++++++++++++++++++-------
net/mac80211/trace.h | 39 +++++++++++++++
net/mac80211/tx.c | 4 ++
5 files changed, 159 insertions(+), 18 deletions(-)

diff --git a/include/net/mac80211.h b/include/net/mac80211.h
index 5790f55c241d..a8673d653e7d 100644
--- a/include/net/mac80211.h
+++ b/include/net/mac80211.h
@@ -3137,6 +3137,14 @@ enum ieee80211_reconfig_type {
* Returns a negative error code if the key can't be added.
* The callback can sleep.
*
+ * @replace_key: Replace a PTK key in the HW for a running association with a
+ * new one.
+ * Implementing this callback confirms that the driver/card supports
+ * replacing the key without leaking cleartext packets, will no longer hand
+ * over packets decrypted with the old key and not send out any packet
+ * queued prior to this call with the new key after the callback has
+ * returned.
+ *
* @update_tkip_key: See the section "Hardware crypto acceleration"
* This callback will be called in the context of Rx. Called for drivers
* which set IEEE80211_KEY_FLAG_TKIP_REQ_RX_P1_KEY.
@@ -3585,6 +3593,10 @@ struct ieee80211_ops {
int (*set_key)(struct ieee80211_hw *hw, enum set_key_cmd cmd,
struct ieee80211_vif *vif, struct ieee80211_sta *sta,
struct ieee80211_key_conf *key);
+ int (*replace_key)(struct ieee80211_hw *hw,
+ struct ieee80211_vif *vif, struct ieee80211_sta *sta,
+ struct ieee80211_key_conf *old,
+ struct ieee80211_key_conf *new);
void (*update_tkip_key)(struct ieee80211_hw *hw,
struct ieee80211_vif *vif,
struct ieee80211_key_conf *conf,
diff --git a/net/mac80211/driver-ops.h b/net/mac80211/driver-ops.h
index 8f6998091d26..ebd7f1463336 100644
--- a/net/mac80211/driver-ops.h
+++ b/net/mac80211/driver-ops.h
@@ -255,6 +255,26 @@ static inline int drv_set_key(struct ieee80211_local *local,
return ret;
}

+static inline int drv_replace_key(struct ieee80211_local *local,
+ struct ieee80211_sub_if_data *sdata,
+ struct ieee80211_sta *sta,
+ struct ieee80211_key_conf *old_key,
+ struct ieee80211_key_conf *new_key)
+{
+ int ret;
+
+ might_sleep();
+
+ sdata = get_bss_sdata(sdata);
+ if (!check_sdata_in_driver(sdata))
+ return -EIO;
+
+ trace_drv_replace_key(local, sdata, sta, old_key, new_key);
+ ret = local->ops->replace_key(&local->hw, &sdata->vif, sta, old_key, new_key);
+ trace_drv_return_int(local, ret);
+ return ret;
+}
+
static inline void drv_update_tkip_key(struct ieee80211_local *local,
struct ieee80211_sub_if_data *sdata,
struct ieee80211_key_conf *conf,
diff --git a/net/mac80211/key.c b/net/mac80211/key.c
index ee0d0cc8dc3b..b148f90b209c 100644
--- a/net/mac80211/key.c
+++ b/net/mac80211/key.c
@@ -248,6 +248,7 @@ static void ieee80211_key_disable_hw_accel(struct ieee80211_key *key)
(key->conf.flags & IEEE80211_KEY_FLAG_RESERVE_TAILROOM)))
increment_tailroom_need_count(sdata);

+ key->flags &= ~KEY_FLAG_UPLOADED_TO_HARDWARE;
ret = drv_set_key(key->local, DISABLE_KEY, sdata,
sta ? &sta->sta : NULL, &key->conf);

@@ -257,7 +258,60 @@ static void ieee80211_key_disable_hw_accel(struct ieee80211_key *key)
key->conf.keyidx,
sta ? sta->sta.addr : bcast_addr, ret);

- key->flags &= ~KEY_FLAG_UPLOADED_TO_HARDWARE;
+}
+
+static int ieee80211_hw_ptk_replace(struct ieee80211_key *old_key,
+ struct ieee80211_key *new_key)
+{
+ struct ieee80211_sub_if_data *sdata;
+ struct ieee80211_local *local;
+ struct sta_info *sta;
+ int ret;
+
+ if (!old_key || !new_key || !old_key->sta)
+ return -EINVAL;
+
+ /* Running on software encryption */
+ if (!(old_key->flags & KEY_FLAG_UPLOADED_TO_HARDWARE))
+ return 0;
+
+ assert_key_lock(old_key->local);
+
+ sta = old_key->sta;
+ local = old_key->local;
+ sdata = old_key->sdata;
+
+ if (!old_key->local->ops->replace_key) {
+ sdata_err(sdata,
+ "Driver is not supporting save PTK key replacement. "
+ "Insecure rekey attempt for STA %pM denied.\n",
+ sta->sta.addr);
+ return -EINVAL;
+ }
+
+ /* Stop TX till we are on the new key */
+ old_key->flags |= KEY_FLAG_TAINTED;
+ ieee80211_clear_fast_xmit(sta);
+
+ if (ieee80211_hw_check(&local->hw, AMPDU_AGGREGATION)) {
+ set_sta_flag(sta, WLAN_STA_BLOCK_BA);
+ ieee80211_sta_tear_down_BA_sessions(sta, AGG_STOP_LOCAL_REQUEST);
+ }
+ ret = drv_replace_key(old_key->local, sdata,
+ &sta->sta, &old_key->conf, &new_key->conf);
+
+ if (!ret)
+ new_key->flags |= KEY_FLAG_UPLOADED_TO_HARDWARE;
+ else
+ sdata_err(sdata,
+ "failed to replace key (%d) with key " \
+ "(%d) for STA, %pM in hardware (%d)\n",
+ old_key->conf.keyidx,
+ new_key->conf.keyidx,
+ sta ? sta->sta.addr : bcast_addr, ret);
+
+ old_key->flags &= ~KEY_FLAG_UPLOADED_TO_HARDWARE;
+ return (ret);
}

static void __ieee80211_set_default_key(struct ieee80211_sub_if_data *sdata,
@@ -316,38 +370,55 @@ void ieee80211_set_default_mgmt_key(struct ieee80211_sub_if_data *sdata,
}


-static void ieee80211_key_replace(struct ieee80211_sub_if_data *sdata,
+static int ieee80211_key_replace(struct ieee80211_sub_if_data *sdata,
struct sta_info *sta,
bool pairwise,
struct ieee80211_key *old,
struct ieee80211_key *new)
{
int idx;
+ int ret;
bool defunikey, defmultikey, defmgmtkey;

/* caller must provide at least one old/new */
if (WARN_ON(!new && !old))
- return;
+ return 0;

if (new)
list_add_tail_rcu(&new->list, &sdata->key_list);

WARN_ON(new && old && new->conf.keyidx != old->conf.keyidx);

- if (old)
+ if (old) {
idx = old->conf.keyidx;
- else
+ if(new && sta && pairwise) {
+ ret = ieee80211_hw_ptk_replace(old, new);
+ if (ret)
+ return ret;
+ }
+ } else {
idx = new->conf.keyidx;
+ }
+
+ if (new && !new->local->wowlan && (!sta || !old )) {
+ ret = ieee80211_key_enable_hw_accel(new);
+ if (ret)
+ return ret;
+ }

if (sta) {
if (pairwise) {
rcu_assign_pointer(sta->ptk[idx], new);
sta->ptk_idx = idx;
- ieee80211_check_fast_xmit(sta);
+ if (new) {
+ clear_sta_flag(sta, WLAN_STA_BLOCK_BA);
+ ieee80211_check_fast_xmit(sta);
+ }
} else {
rcu_assign_pointer(sta->gtk[idx], new);
}
- ieee80211_check_fast_rx(sta);
+ if (new)
+ ieee80211_check_fast_rx(sta);
} else {
defunikey = old &&
old == key_mtx_dereference(sdata->local,
@@ -380,6 +451,7 @@ static void ieee80211_key_replace(struct ieee80211_sub_if_data *sdata,

if (old)
list_del_rcu(&old->list);
+ return 0;
}

struct ieee80211_key *
@@ -654,7 +726,6 @@ int ieee80211_key_link(struct ieee80211_key *key,
struct ieee80211_sub_if_data *sdata,
struct sta_info *sta)
{
- struct ieee80211_local *local = sdata->local;
struct ieee80211_key *old_key;
int idx, ret;
bool pairwise;
@@ -687,19 +758,14 @@ int ieee80211_key_link(struct ieee80211_key *key,

increment_tailroom_need_count(sdata);

- ieee80211_key_replace(sdata, sta, pairwise, old_key, key);
- ieee80211_key_destroy(old_key, true);
+ ret = ieee80211_key_replace(sdata, sta, pairwise, old_key, key);

- ieee80211_debugfs_key_add(key);
-
- if (!local->wowlan) {
- ret = ieee80211_key_enable_hw_accel(key);
- if (ret)
- ieee80211_key_free(key, true);
+ if (!ret) {
+ ieee80211_debugfs_key_add(key);
+ ieee80211_key_destroy(old_key, true);
} else {
- ret = 0;
+ ieee80211_key_free(key, true);
}
-
out:
mutex_unlock(&sdata->local->key_mtx);

diff --git a/net/mac80211/trace.h b/net/mac80211/trace.h
index 0ab69a1964f8..f93e00f1ae4d 100644
--- a/net/mac80211/trace.h
+++ b/net/mac80211/trace.h
@@ -603,6 +603,45 @@ TRACE_EVENT(drv_set_key,
)
);

+TRACE_EVENT(drv_replace_key,
+ TP_PROTO(struct ieee80211_local *local,
+ struct ieee80211_sub_if_data *sdata,
+ struct ieee80211_sta *sta,
+ struct ieee80211_key_conf *old_key,
+ struct ieee80211_key_conf *new_key),
+
+ TP_ARGS(local, sdata, sta, old_key, new_key),
+
+ TP_STRUCT__entry(
+ LOCAL_ENTRY
+ VIF_ENTRY
+ STA_ENTRY
+ KEY_ENTRY
+ __field(u32, cipher2)
+ __field(u8, hw_key_idx2)
+ __field(u8, flags2)
+ __field(s8, keyidx2)
+ ),
+
+ TP_fast_assign(
+ LOCAL_ASSIGN;
+ VIF_ASSIGN;
+ STA_ASSIGN;
+ KEY_ASSIGN(old_key);
+ __entry->cipher2 = new_key->cipher;
+ __entry->flags2 = new_key->flags;
+ __entry->keyidx2 = new_key->keyidx;
+ __entry->hw_key_idx2 = new_key->hw_key_idx;
+ ),
+
+ TP_printk(
+ LOCAL_PR_FMT VIF_PR_FMT STA_PR_FMT KEY_PR_FMT
+ " cipher2:0x%x, flags2=%#x, keyidx2=%d, hw_key_idx2=%d",
+ LOCAL_PR_ARG, VIF_PR_ARG, STA_PR_ARG, KEY_PR_ARG,
+ __entry->cipher2, __entry->flags2, __entry->keyidx2, __entry->hw_key_idx2
+ )
+);
+
TRACE_EVENT(drv_update_tkip_key,
TP_PROTO(struct ieee80211_local *local,
struct ieee80211_sub_if_data *sdata,
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 5b93bde248fd..f897de5dcf83 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -2951,6 +2951,10 @@ void ieee80211_check_fast_xmit(struct sta_info *sta)
if (!(build.key->flags & KEY_FLAG_UPLOADED_TO_HARDWARE))
goto out;

+ /* Key is beeing removed */
+ if (build.key->flags & KEY_FLAG_TAINTED)
+ goto out;
+
switch (build.key->conf.cipher) {
case WLAN_CIPHER_SUITE_CCMP:
case WLAN_CIPHER_SUITE_CCMP_256:
--
2.17.1

2018-07-11 19:35:11

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey

Hi,


>> I'll try that, but will probably take another week. My main main work
>> station got severe file system corruption, forcing me to reinstall it
>> from scratch. I suspect it was something in the wireless testing kernel
>> 4.18-rc1 (944ae08d4e71) related to either btrfs or the ssd disk but
>> since I needed the system just started over and avoid to run 4.18 if I
>> do not have a full backup...
>
> Ouch. FWIW, it's possible to run inside a VM with PCI(e) devices
> outside, at least on some machines.
>
>>>> hostapd or wpa_supplicant are "ordering" mac80211 to install a new key
>>>> and are implementing the state machine and are in a good position to
>>>> handle the fallout... at least theoretically.
>>>
>>> Ideally it would even know beforehand that we don't want to handle the
>>> PTK rekeying, and then could reconnect instead of going through the
>>> handshake.
>>>
>>
>> Don't see how we could do that in the kernel, all the relevant
>> information is handled in the state machine. I guess an API extension
>> telling hostap/supplicant if we can handle rekeys or not would tbe he
>> only way to avoid that.
>
> Right. Not really much point for now I guess.
>
>>> So I think we're probably better off accepting the set_key but not
>>> actually using it, and instead disconnecting... even if that's awkward
>>> and should come with a big comment :-)
>>
>> Instead of returning an error I'll change the code to accept the rekey
>> but do nothing with it. (Basically delete the new key and keep the old
>> active).
>> And of course calling ieee80211_set_disassoc() prior to return "success".
>
> Right. Did you handle/consider modes other than BSS/P2P client btw?

I still have huge gaps in my wlan knowledge, I only drilled into the
rekey issue so far and never used anything besides AP/STA...

That said we only should have mesh and IBSS left to consider, right?
(I think I remember some new mode allowing STAs to diretly talk to each
other while beeing associated to an AP, but can't find it again and
don't see how this could work with PTK keys without somehow making a
mesh out of it.)

So far I have avoided non-generic changes. The current patch should work
with all modes, shouldn't it?. My initial impression of
ieee80211_set_disassoc() was much the same, but you seem to imply it's
not usable in some modes? That would again be awkward...

As for IBSS: I have no idea how a mesh would handle rekeys. If it can
but won't work with ieee80211_set_disassoc() it will be quite some time
till I've can propose a new patch.
I normally only have a few hours per week for the forseeable future, and
some weeks not even those...

On the plus side i got my hands on a AP using ath10k and can look at
that from yet another angle. I'll devinitelly continue here, but I
suspect I'll be slow with patches for a while...

>
>> Let's see how the supplicant will react on a disassoc while doing a rekey...
>
> Shouldn't matter to it, I'd think.

Alexander


Attachments:
pEpkey.asc (1.73 kB)