Return-path: Received: from s3.sipsolutions.net ([144.76.63.242]:36618 "EHLO sipsolutions.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932266AbeF2KMa (ORCPT ); Fri, 29 Jun 2018 06:12:30 -0400 Message-ID: <1530267147.3481.78.camel@sipsolutions.net> (sfid-20180629_121235_477634_29C88F76) Subject: Re: [PATCH v2] mac80211: Fix wlan freezes under load at rekey From: Johannes Berg To: Alexander Wetzel Cc: linux-wireless@vger.kernel.org, greearb@candelatech.com, s.gottschall@dd-wrt.com Date: Fri, 29 Jun 2018 12:12:27 +0200 In-Reply-To: <2de1493a-5439-b9ef-6d06-3895333f01c3@web.de> References: <1523258757.3076.5.camel@sipsolutions.net> <20180515102202.2021-1-alexander.wetzel@web.de> <1529062392.10037.7.camel@sipsolutions.net> <1529357234.3092.57.camel@sipsolutions.net> <2de1493a-5439-b9ef-6d06-3895333f01c3@web.de> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Sender: linux-wireless-owner@vger.kernel.org List-ID: Hi, On Tue, 2018-06-19 at 22:12 +0200, Alexander Wetzel wrote: > > Am I understanding correctly that in the latter (outgoing) case this > > might cause unencrypted packets to be transmitted? > > Yes, if we have a driver handling the keys similar to ath9k which is not > implementing drv_flush (correctly) this is a possibility. Sadly, I think many drivers fall into this category, so I'm not sure we should "fix the bug" for them at the expense of a potential security risk? So maybe we need to have an opt-in flag for drivers. > This ties somewhat into the discussion what to handle in mac80211 and > what delegate to the drivers. mac80211 itself should drop any packets > trying to use the old outgoing key till the new key is rolled out > correctly with the patch. Yeah, but it can't do it for packets already queued to the driver/hardware, and it can't know how the hardware behaves wrt. encryption: old-iwlwifi style (key material in descriptor) vs. ath9k style (key material in on-chip table). > > > We can now decrypt and get incoming packets while mac80211 is still on > > > the old key, but the worst (and most likely) case is now, that the > > > replay detection will drop those packets till mac80211 also switch over. > > > > This actually is somewhat problematic, at least for TKIP it could cause > > countermeasures. Should we exclude TKIP here somehow? > > > > I don't think we can disable countermeasures because then we could be > > attacked in this time window without starting them, and we have a really > > hard time distinguishing attacks and "fake" replays. > > > > Yes, that is a definitely a possibility. But excluding TKIP seems to be > counterproductive for my understanding: TKIP is due to the weaker cipher > more likely to have unicast rekeying enabled. > So far I was hesitant to add new per packet check for that, but I guess > we have to... I'll add that in the next version, so we can discuss that > with some code. Ok. > > Btw - perhaps that means we should avoid the complicated mechanisms like > > TX aggregation shutdown for keys that the driver marks as being safe? > > Clearly for iwlwifi (at least CCMP and before, not with the longer keys > > in GCMP-256) the race can't possibly happen. > > Sounds like that should work and I think I'll just try it out. We'll > lose the side benefit that shutting down TX aggregation will reduce the > risk that a unpatched remote sta freezes the connection, though. Fair point. > And since I started out to first patch ath9k to drop packets for an > outgoing key: That looks to become either be ugly (delayed work to > complete the key clean up after a given time) or need some API change. Ok. > Remember we'll still have to shut down RX aggregation or drop all frames > of a session running during the rekey. We are not able to tell which key > was used to encrypt the frames and which have "old" PNs we can't allow > mac80211 to process after switching to the new key. So I'm not sure that > keeping TX up and running is worth it. Yeah, good point, and TX is less interesting because we control when/how it's set up. > That said I have another working patch which is delaying mac80211 key > deletion from my first misguided attempts to at this issue. > I could repurpose that and keep RX aggregation also up and running, > allowing us to first check if the packet would have been accepted by the > old key and only switch to the new PN counter once it's not. But that > seems to be kind of invasive and overkill. Yeah, let's not go there for now. Also if HW crypto is involved it usually won't work unless you (a) get the packet and (b) re-encrypt the packet with the wrong key, and then decrypt it with the correct key ... > Yes, that is part of the reason I had to add the call to drv_flush for > ath9k. I've experimented with an additional "retire_key" command on top > of the existing to add and delete keys but came to the conclusion that > "replace_key" would make more sense for ath9k at least. Ok. > But in order of my preference I see this options: > 1) removing a key in HW must also grantee that any packets queued for > this key has either been send or will be dropped after the call returns > to mac80211. (Simply sleeping the max queue and retransmit time would be > an ugly but simple way to implement that) This works but we have to consider a lot of drivers. I guess we're back to some form of "opt-in" scheme then? > 2) we add a new function/call like "replace_key" for this special case. > But in that case we have to sort out how to handle drivers not > implementing it. Thy only secure way would be to disconnect if someone > tries to rekey... Maybe that's not so bad. > 3) Is what I implemented. We try what we can with the existing API and > any driver not working with that has to be considered buggy. I don't think we can really do this though. We break - in a potentially security-relevant way - the older drivers. We can't just say that's driver bugs, IMHO. > > > I tried to minimize the changes, but if we can consider an API change I > > > would ask the drivers to provide a new function to switch directly from > > > one key to another, without the need to delete it first. And to > > > guarantee there, that after the function returns no packets with > > > prepared for the old key can be sent out. Including retransmissions. > > > > This would be pretty tricky for most drivers though. > > I do not see any real alternative. Sleeping some reasonable time should > be acceptable here, to allow the hw to flush out queued packets. (100ms > should probably acceptable here, especially compared to complete > connection loss.. You have no guarantee that even 100ms is enough to get all packets out. They could be sitting on a BK queue and waiting for all BE/VI traffic in the vicinity ... basically forever. > Only other idea I have is to lock down TX for all but EAPOL packets > sooner, e.g. during the key handshake. But that's more or less only a > variant of the sleep above. Yeah, doesn't really fully address this issue anyway, since EAPOL are on VO and others might get way more delayed on HW queues due to contention. I think our best bet is some form of opt-in scheme. Perhaps (2), which drivers can implement also to get like the "best" behaviour. They could also implement there the flushing if they can, for example. Others would get a disconnect, but they probably effectively do already? Obviously if SW crypto is used none of this is a concern, so that's another factor to take into account in this decision logic of whether to disconnect or not? IOW - I'd rather get bugs that we now force a disconnect (if anyone even notices), rather than potentially having a bug that causes unencrypted packets to be sent. johannes