2015-02-21 23:34:41

by Linus Torvalds

[permalink] [raw]
Subject: AR9462 problems connecting again..

So I've had problems connecting to some networks before on my
Chromebook Pixel, but now I'm testing a new Ubiquiti network at home,
and can see this issue at home too.

I know the wireless works, because other devices work fine on that
network. Also, I know the AR9462 works, because I still have my old
network up and it connects to that.

And it *occasionally* connects to the new one. But it's rare, and it
clearly has problems.

It looks something like this:

[ 73.757869] wlp1s0: authenticate with 20:9f:db:e7:80:80
[ 73.771471] wlp1s0: send auth to 20:9f:db:e7:80:80 (try 1/3)
[ 73.773706] wlp1s0: authenticated
[ 73.775122] wlp1s0: associate with 20:9f:db:e7:80:80 (try 1/3)
[ 73.787434] wlp1s0: RX AssocResp from 20:9f:db:e7:80:80
(capab=0x431 status=0 aid=9)
[ 73.787573] wlp1s0: associated
[ 77.784931] wlp1s0: deauthenticated from 20:9f:db:e7:80:80 (Reason:
2=PREV_AUTH_NOT_VALID)

and the password I used definitely is right, and sometimes works.
Despite that PREV_AUTH_NOT_VALID thing.

Any suggestions for what I should do to give you guys any sane and
useful debug output?

Linus


2015-02-23 17:17:04

by Jouni Malinen

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Sun, Feb 22, 2015 at 05:41:25PM -0800, Linus Torvalds wrote:
> Nope, everything else I have seems to be intel wireless. I think one
> of the kids machines is a Mac Mini with an ath5k thing, but I'm hoping
> the wpa_supplicant.log is sufficient to give somebody an idea.

It looks like there are two issues here: 1) EAPOL-Key message 4/4 (i.e.,
the second Data frame sent by the station during association) is somehow
not seen or accepted by the AP, 2) recovery from that msg 4/4 getting
lost does not work in the intended way.

For (1), one would likely need to see a wireless capture from a separate
WLAN radio to say something certain about what exactly happened.
ath5k-compatible radios would not be sufficient since this would need to
be able to see HT frames which ath9k is mostly like using here. I
haven't used iwlwifi as a sniffer, so I do not know whether that would
be a workable option for this.

In my tests, I can see the rate control algorithm (minstrel_ht) using a
pretty high rate (even MCS14 with 2-stream device, which is one short of
maximum) which is quite a bit higher than I would myself have selected
for an EAPOL frame (especially for EAPOL-Key 4/4 which has these lovely
issues with retransmissions) more or less immediately after association.
Anyway, that frame is supposed to get additional fall-back TX rates for
link layer retransmissions and those should make it much more likely for
this to be received by the AP. Sniffer trace would confirm that.


For (2), wpa_supplicant debug log gives a pretty clear idea on what is
happening and based on that, I can easily reproduce this part. In fact,
I now have a fully automated test script for verifying this with
mac80211_hwsim.

Some alternative means of improving this was discussed in this thread.
I'm not completely happy with this, but the following mac80211 changes
do fix this retransmission case and will likely make the issue you are
seeing disappear since it allows any of the four EAPOL-Key msg 4/4
transmissions to be received by the AP to avoid the disconnection. This
doesn't fix the initial trigger behind the issue, but with those EAPOL
retransmissions working, the likelihood of all four attempts failing
(with each getting multiple link-layer retransmissions) is quite small.


mac80211: Do not encrypt EAPOL frames before peer has used the key

The 4-way handshake design is a bit inconvenient for the case where msg
3/4 needs to be transmitted (e.g., AP not receiving first transmission
of 4/4). The supplicant side has already configured the pairwise key at
that point in time and while we allow unencrypted msg 3/4 to be
received, we were sending out 4/4 encrypted which will result in it
getting dropped. User space would be aware of when the EAPOL frame
should not be encrypted, but we do not have convenient means of telling
mac80211 that. For now, use a mac80211-specific hack to avoid EAPOL
frame encryption to allow retransmission of 4-way handshake messages 3/4
and 4/4 to work in a way that the authenticator side can process 4/4.

---
net/mac80211/key.h | 2 ++
net/mac80211/rx.c | 11 +++++++++++
net/mac80211/tx.c | 13 +++++++++++++
3 files changed, 26 insertions(+)

diff --git a/net/mac80211/key.h b/net/mac80211/key.h
index d57a9915..3e23276 100644
--- a/net/mac80211/key.h
+++ b/net/mac80211/key.h
@@ -120,6 +120,8 @@ struct ieee80211_key {

/* number of times this key has been used */
int tx_rx_count;
+ /* whether a valid RX decryption has happened with this key */
+ bool valid_rx_seen;

#ifdef CONFIG_MAC80211_DEBUGFS
struct {
diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c
index 1101563..8f3f86c 100644
--- a/net/mac80211/rx.c
+++ b/net/mac80211/rx.c
@@ -1691,6 +1691,16 @@ ieee80211_rx_h_decrypt(struct ieee80211_rx_data *rx)
return result;
}

+static ieee80211_rx_result debug_noinline
+ieee80211_rx_h_check_key_use(struct ieee80211_rx_data *rx)
+{
+ struct ieee80211_rx_status *status = IEEE80211_SKB_RXCB(rx->skb);
+
+ if ((status->flag & RX_FLAG_DECRYPTED) && rx->key)
+ rx->key->valid_rx_seen = true;
+ return RX_CONTINUE;
+}
+
static inline struct ieee80211_fragment_entry *
ieee80211_reassemble_add(struct ieee80211_sub_if_data *sdata,
unsigned int frag, unsigned int seq, int rx_queue,
@@ -3139,6 +3149,7 @@ static void ieee80211_rx_handlers(struct ieee80211_rx_data *rx,
CALL_RXH(ieee80211_rx_h_uapsd_and_pspoll)
CALL_RXH(ieee80211_rx_h_sta_process)
CALL_RXH(ieee80211_rx_h_decrypt)
+ CALL_RXH(ieee80211_rx_h_check_key_use)
CALL_RXH(ieee80211_rx_h_defragment)
CALL_RXH(ieee80211_rx_h_michael_mic_verify)
/* must be after MMIC verify so header is counted in MPDU mic */
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 88a18ff..c314c59 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -612,6 +612,19 @@ ieee80211_tx_h_select_key(struct ieee80211_tx_data *tx)
return TX_DROP;
}

+ if (tx->key &&
+ unlikely(info->control.flags & IEEE80211_TX_CTRL_PORT_CTRL_PROTO) &&
+ !tx->key->valid_rx_seen) {
+ /* Do not encrypt EAPOL frames before peer has used the key */
+ /* FIX: This is not really complete.. It would be at least
+ * theoretically possible for the peer to never send a Data
+ * frame and if we were to initiate reauthentication or
+ * rekeying, we might need to encrypt the initiating EAPOL
+ * frame.
+ */
+ tx->key = NULL;
+ }
+
if (tx->key) {
bool skip_hw = false;

--
Jouni Malinen PGP id EFC895FA

2015-02-23 01:59:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Sun, Feb 22, 2015 at 5:55 PM, Adrian Chadd <[email protected]> wrote:
>
> Do you have a 5GHz SSID setup on this access point? Is this kind of
> messed up diassociation-to-steer-you-to-another-band thing?

Nope. That's the older single-band UniFi UAP - 2.4GHz only.

Linus

2015-02-22 21:45:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [ath9k-devel] AR9462 problems connecting again..

On Sun, Feb 22, 2015 at 10:58 AM, Dave Taht <[email protected]> wrote:
>
> Hint: Several unifi (and most ubnt) products are well supported by
> openwrt directly,

I want Linux to "just work". None of this "oh, you can change
something else and it probably works". I want to fix the problem in
*linux*.

There's clearly something wrong with the AR9462 driver and/or how it
uses the wireless infrastructure, and it should be fixed. Not worked
around with "use some other AP software".

Especially since this has happened before.

Besides, the reason I'm trying to use UniFi is because I want to have
seamless roaming ("zero-handoff"). And I do *not* want to play the
endless openwrt configuration games in the hopes I can get something
like that working. I've tried openwrt, and I don't like tinkering with
my AP's. I just want things to work out of the box.

Linus

2015-02-23 20:11:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Mon, Feb 23, 2015 at 12:06 PM, Linus Torvalds
<[email protected]> wrote:
>
> This machine has a fairly minimal kernel config. Does that "type
> monitor" interface perhaps need some debug infrastructure that I
> haven't added?

Nope. Same behavior with a F21 kernel (which means that the touchpad
doesn't work, but that's a separate and known issue with this
platform).

Linus

2015-02-23 00:54:59

by Adrian Chadd

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On 22 February 2015 at 15:00, Linus Torvalds
<[email protected]> wrote:
> On Sun, Feb 22, 2015 at 1:50 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> Ugh. When I add "-dd" to the command line, it has now worked three
>> times in a row, when before it worked once out of ten tries.
>>
>> So my guess is that it's something timing-dependent.
>
> So it stays working with -dd, but I do get *occasional* failures that
> then seem to clear up on retry. So it ends up working in the end, but
> I think I have a few example failures in the logs.
>
> So for example, from my dmesg, I get this:
>
> [14:07:15] wlp1s0: authenticate with 20:9f:db:e7:80:80
> [14:07:15] wlp1s0: send auth to 20:9f:db:e7:80:80 (try 1/3)
> [14:07:15] wlp1s0: send auth to 20:9f:db:e7:80:80 (try 2/3)
> [14:07:15] wlp1s0: send auth to 20:9f:db:e7:80:80 (try 3/3)
> [14:07:15] wlp1s0: authentication with 20:9f:db:e7:80:80 timed out
> [14:07:18] wlp1s0: authenticate with 20:9f:db:e7:80:80
> [14:07:18] wlp1s0: send auth to 20:9f:db:e7:80:80 (try 1/3)
> [14:07:18] wlp1s0: authenticated
> [14:07:18] wlp1s0: associate with 20:9f:db:e7:80:80 (try 1/3)
> [14:07:18] wlp1s0: RX AssocResp from 20:9f:db:e7:80:80 (capab=0x431
> status=0 aid=16)
> [14:07:18] wlp1s0: associated
> [14:07:22] wlp1s0: deauthenticated from 20:9f:db:e7:80:80 (Reason:
> 2=PREV_AUTH_NOT_VALID)
>
> with another failure at 14:07:22, but then it ends up working a bit
> later at 14:07:33.
>
> I've got the wpa supplicant log for this timeframe, but I'd rather not
> send it out in public. I see "[REMOVED] for what looks like the key
> data, but there's a lot of other hex data. Is any of it sensitive?

I /think/ it's okay? The removed stuff is the pre-shared key pieces.

So if I were debugging this (on any OS); I'd start by looking at
what's going on over the air. I'd be worried that there's something
failing to transmit right / be heard by the recipient. In the past
it's been things like EAPOL frames not going out at basic rate (and
listening to the rate control code for what rate it should be
transmitted at) and that makes it transmit at rates wholly unsuited
for the initial handshake.

Do you have another laptop with an atheros NIC in it that you could
use in monitor mode to capture all the frames?

(Also, you can email me the log file privately if you'd like.)



-adrian

2015-02-22 18:30:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Sun, Feb 22, 2015 at 10:24 AM, Adrian Chadd <[email protected]> wrote:
>
> Just a wild shot - try disabling fast authentication and see if that
> makes a difference?
>
> wpa_supplicant.conf:
>
> fast_reauth=0
>
> I recall having issues with fast_reauth once, but I never stuck around
> that location long enough to debug it.

Nope. Did that, killed wpa_supplicant (which restarts it), tried
connecting, still failed.

Linus

2015-02-23 23:00:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Mon, Feb 23, 2015 at 2:43 PM, Jouni Malinen <[email protected]> wrote:
>
> This did not get exactly supportive response when this was proposed last
> time (Sep 2013). Anyway, for a quick test, this can be done with the
> following one-liner:

fwiw, that one-liner seems to work fine for me.

Which I guess is not a huge surprise.

Side note: I've done the "turn off wifi and turn it back on" several
times to test that patch, and it has worked every time. BUT I also see
this odd behavior where the logs show that it tries to authenticate
twice: the first time it does that "send auth to 20:9f .." thing three
times (looks like 100ms apart), and nothing happens so it does
"authentication with 20:9f .. timed out". Then it waits three seconds
and tries again, and now it succeeds on the first try.

The only downside of that seems to be that it takes an extra 3s to
connect to the network - but it does now seem to *reliably* connect -
so it's not a big problem, but I wonder why it should be that
repeatable. Is there some difference between the first and the second
time it tries to authenticate?

Anyway, even if people don't like that particular patch, it does seem
like *something* like that should be done.

Linus

2015-02-23 18:00:34

by Emmanuel Grumbach

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Mon, Feb 23, 2015 at 7:17 PM, Jouni Malinen <[email protected]> wrote:
> On Sun, Feb 22, 2015 at 05:41:25PM -0800, Linus Torvalds wrote:
>> Nope, everything else I have seems to be intel wireless. I think one
>> of the kids machines is a Mac Mini with an ath5k thing, but I'm hoping
>> the wpa_supplicant.log is sufficient to give somebody an idea.
>
> It looks like there are two issues here: 1) EAPOL-Key message 4/4 (i.e.,
> the second Data frame sent by the station during association) is somehow
> not seen or accepted by the AP, 2) recovery from that msg 4/4 getting
> lost does not work in the intended way.
>
> For (1), one would likely need to see a wireless capture from a separate
> WLAN radio to say something certain about what exactly happened.
> ath5k-compatible radios would not be sufficient since this would need to
> be able to see HT frames which ath9k is mostly like using here. I
> haven't used iwlwifi as a sniffer, so I do not know whether that would
> be a workable option for this.

iwlwifi as a sniffer is good enough for that.

>
> In my tests, I can see the rate control algorithm (minstrel_ht) using a
> pretty high rate (even MCS14 with 2-stream device, which is one short of
> maximum) which is quite a bit higher than I would myself have selected
> for an EAPOL frame (especially for EAPOL-Key 4/4 which has these lovely
> issues with retransmissions) more or less immediately after association.
> Anyway, that frame is supposed to get additional fall-back TX rates for
> link layer retransmissions and those should make it much more likely for
> this to be received by the AP. Sniffer trace would confirm that.
>

IIRC we had lots of issues with that and iwlwifi's rate control doesn't jump to
high rates immediately.

>
> For (2), wpa_supplicant debug log gives a pretty clear idea on what is
> happening and based on that, I can easily reproduce this part. In fact,
> I now have a fully automated test script for verifying this with
> mac80211_hwsim.
>
> Some alternative means of improving this was discussed in this thread.
> I'm not completely happy with this, but the following mac80211 changes
> do fix this retransmission case and will likely make the issue you are
> seeing disappear since it allows any of the four EAPOL-Key msg 4/4
> transmissions to be received by the AP to avoid the disconnection. This
> doesn't fix the initial trigger behind the issue, but with those EAPOL
> retransmissions working, the likelihood of all four attempts failing
> (with each getting multiple link-layer retransmissions) is quite small.
>
>
> mac80211: Do not encrypt EAPOL frames before peer has used the key
>
> The 4-way handshake design is a bit inconvenient for the case where msg
> 3/4 needs to be transmitted (e.g., AP not receiving first transmission

needs to *re*transmitted I guess.

> of 4/4). The supplicant side has already configured the pairwise key at
> that point in time and while we allow unencrypted msg 3/4 to be
> received, we were sending out 4/4 encrypted which will result in it
> getting dropped. User space would be aware of when the EAPOL frame
> should not be encrypted, but we do not have convenient means of telling
> mac80211 that. For now, use a mac80211-specific hack to avoid EAPOL
> frame encryption to allow retransmission of 4-way handshake messages 3/4
> and 4/4 to work in a way that the authenticator side can process 4/4.
>
> ---
> net/mac80211/key.h | 2 ++
> net/mac80211/rx.c | 11 +++++++++++
> net/mac80211/tx.c | 13 +++++++++++++
> 3 files changed, 26 insertions(+)
>
> diff --git a/net/mac80211/key.h b/net/mac80211/key.h
> index d57a9915..3e23276 100644
> --- a/net/mac80211/key.h
> +++ b/net/mac80211/key.h
> @@ -120,6 +120,8 @@ struct ieee80211_key {
>
> /* number of times this key has been used */
> int tx_rx_count;
> + /* whether a valid RX decryption has happened with this key */
> + bool valid_rx_seen;
>
> #ifdef CONFIG_MAC80211_DEBUGFS
> struct {
> diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c
> index 1101563..8f3f86c 100644
> --- a/net/mac80211/rx.c
> +++ b/net/mac80211/rx.c
> @@ -1691,6 +1691,16 @@ ieee80211_rx_h_decrypt(struct ieee80211_rx_data *rx)
> return result;
> }
>
> +static ieee80211_rx_result debug_noinline
> +ieee80211_rx_h_check_key_use(struct ieee80211_rx_data *rx)
> +{
> + struct ieee80211_rx_status *status = IEEE80211_SKB_RXCB(rx->skb);
> +
> + if ((status->flag & RX_FLAG_DECRYPTED) && rx->key)
> + rx->key->valid_rx_seen = true;
> + return RX_CONTINUE;
> +}
> +
> static inline struct ieee80211_fragment_entry *
> ieee80211_reassemble_add(struct ieee80211_sub_if_data *sdata,
> unsigned int frag, unsigned int seq, int rx_queue,
> @@ -3139,6 +3149,7 @@ static void ieee80211_rx_handlers(struct ieee80211_rx_data *rx,
> CALL_RXH(ieee80211_rx_h_uapsd_and_pspoll)
> CALL_RXH(ieee80211_rx_h_sta_process)
> CALL_RXH(ieee80211_rx_h_decrypt)
> + CALL_RXH(ieee80211_rx_h_check_key_use)
> CALL_RXH(ieee80211_rx_h_defragment)
> CALL_RXH(ieee80211_rx_h_michael_mic_verify)
> /* must be after MMIC verify so header is counted in MPDU mic */
> diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
> index 88a18ff..c314c59 100644
> --- a/net/mac80211/tx.c
> +++ b/net/mac80211/tx.c
> @@ -612,6 +612,19 @@ ieee80211_tx_h_select_key(struct ieee80211_tx_data *tx)
> return TX_DROP;
> }
>
> + if (tx->key &&
> + unlikely(info->control.flags & IEEE80211_TX_CTRL_PORT_CTRL_PROTO) &&
> + !tx->key->valid_rx_seen) {
> + /* Do not encrypt EAPOL frames before peer has used the key */
> + /* FIX: This is not really complete.. It would be at least
> + * theoretically possible for the peer to never send a Data
> + * frame and if we were to initiate reauthentication or
> + * rekeying, we might need to encrypt the initiating EAPOL
> + * frame.
> + */
> + tx->key = NULL;
> + }
> +
> if (tx->key) {
> bool skip_hw = false;
>
> --
> Jouni Malinen PGP id EFC895FA
> --
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2015-02-23 21:53:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Mon, Feb 23, 2015 at 1:30 PM, Jouni Malinen <[email protected]> wrote:
>
> How far is the station from the AP? Would it be possible to see whether
> the behavior changes if you were within, say, five meters or so?

Well, it was pretty much within five meters already, but there was a
thin wall in between (and the old AP was right next to the laptop,
which might add some noise even if they are on different channels).
Going closer does seem to help, but again, it's not like this is 100%
reproducible to begin with.

So the theory that the driver starts at too high a transmit rate, and
then does not handle failures well, might be true. Of course, "not
handle failures well" is something of an understatement.

> It would be useful if you can capture the 802.11 frame exchange from a
> failed connection case with an external wireless sniffer.

I will try with my (much more reliable) iwlwifi laptop. At least the
merge window is over, so I should have some time. Knock wood.

Linus

2015-02-23 21:30:54

by Jouni Malinen

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Mon, Feb 23, 2015 at 12:06:09PM -0800, Linus Torvalds wrote:
> On Mon, Feb 23, 2015 at 9:17 AM, Jouni Malinen <[email protected]> wrote:
> > mac80211: Do not encrypt EAPOL frames before peer has used the key
>
> Hmm. This patch does not seem to make a difference. I thought it did
> at first, but then removed the wpa_supplicant debugging, and got the
> same failures.

Interesting. That would seem to imply that this AP does not like
something about the EAPOL-Key msg 4/4 from the station during the faster
timing (no wpa_supplicant debug) and would even be unable to accept the
responses during retransmissions.. I'm not sure what could cause that.
Is there anything that the AP could provide as far as logging is
concerned?

How far is the station from the AP? Would it be possible to see whether
the behavior changes if you were within, say, five meters or so? This
should allow the higher transmit rate proposed by minstrel_ht to work
pretty reliably and that could confirm if this is indeed related to too
high a rate being used here and then for some reason being unable to
fall back to a sufficiently low rate to work at higher distance.

It would be useful if you can capture the 802.11 frame exchange from a
failed connection case with an external wireless sniffer. It sounds
like this should be doable with iwlwifi and while I haven't tested this
myself, I'd assume something like this would do the trick (this is
assuming that the interface is not being used at the time, e.g., with
wpa_supplicant):

sudo iw dev wlan0 set type monitor
sudo ip link set wlan0 up
sudo iw dev wlan0 set freq 2462 HT20
sudo dumpcap -i wlan0 -w /tmp/wlan0.pcap

# test connection and stop dumpcap after the failure
# and "sudo ip link set wlan0 down; sudo iw dev wlan0 set type station"
# to restore normal station mode

--
Jouni Malinen PGP id EFC895FA

2015-02-23 01:55:06

by Adrian Chadd

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On 22 February 2015 at 17:41, Linus Torvalds
<[email protected]> wrote:
> On Sun, Feb 22, 2015 at 4:54 PM, Adrian Chadd <[email protected]> wrote:
>>
>> I /think/ it's okay? The removed stuff is the pre-shared key pieces.
>
> Ok. Attached is what seems to be the relevant part of the
> wpa_supplicant.log file.
>
> The datestamp has been changed so that it can be matched up with the
> dmesg, and I added empty lines for pauses when I was trying to figure
> out what the heck it was doing, but other than that it's the raw log.
>
>> Do you have another laptop with an atheros NIC in it that you could
>> use in monitor mode to capture all the frames?
>
> Nope, everything else I have seems to be intel wireless. I think one
> of the kids machines is a Mac Mini with an ath5k thing, but I'm hoping
> the wpa_supplicant.log is sufficient to give somebody an idea.

Do you have a 5GHz SSID setup on this access point? Is this kind of
messed up diassociation-to-steer-you-to-another-band thing?




-adrian

2015-02-22 06:48:07

by Sujith Manoharan

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

Linus Torvalds wrote:
> So I've had problems connecting to some networks before on my
> Chromebook Pixel, but now I'm testing a new Ubiquiti network at home,
> and can see this issue at home too.

Can you please post the output of 'iw dev wlp1s0 scan' ?

Sujith

2015-02-22 21:50:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Sun, Feb 22, 2015 at 11:39 AM, Adrian Chadd <[email protected]> wrote:
>
> Hm, can you enable wpa debugging to log everything whilst it's
> associating / reassociating?

Ugh. When I add "-dd" to the command line, it has now worked three
times in a row, when before it worked once out of ten tries.

So my guess is that it's something timing-dependent.

Or it's something where once it starts working, it stays working until
I reboot. I'll try that.

Linus

2015-02-22 18:58:39

by Dave Taht

[permalink] [raw]
Subject: Re: [ath9k-devel] AR9462 problems connecting again..

On Sun, Feb 22, 2015 at 10:30 AM, Linus Torvalds
<[email protected]> wrote:
> On Sun, Feb 22, 2015 at 10:24 AM, Adrian Chadd <[email protected]> wrote:
>>
>> Just a wild shot - try disabling fast authentication and see if that
>> makes a difference?
>>
>> wpa_supplicant.conf:
>>
>> fast_reauth=0
>>
>> I recall having issues with fast_reauth once, but I never stuck around
>> that location long enough to debug it.
>
> Nope. Did that, killed wpa_supplicant (which restarts it), tried
> connecting, still failed.

Hint: Several unifi (and most ubnt) products are well supported by
openwrt directly, which by reflashing your device to it via their web
interface would

A) probably fix the problem, and
B) give you more insight into fixing it, if it persists, by giving you
full access on both sides of the connection.

https://downloads.openwrt.org/snapshots/trunk/ar71xx/generic/

I have been replacing ubnt´s default firmware on first boot for 6+
years now. It is good hardware, after you do that.

--
Dave Täht

http://www.bufferbloat.net/projects/cerowrt/wiki

2015-02-23 01:21:30

by Sujith Manoharan

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

Linus Torvalds wrote:
> I've got the wpa supplicant log for this timeframe, but I'd rather not
> send it out in public. I see "[REMOVED] for what looks like the key
> data, but there's a lot of other hex data. Is any of it sensitive?

Unless the '-K' option is used with wpa_s, the keys are not displayed
in the debug output when '-dd' is used, so it would be helpful to
have the wpa_s log.

Sujith

2015-02-22 23:00:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Sun, Feb 22, 2015 at 1:50 PM, Linus Torvalds
<[email protected]> wrote:
>
> Ugh. When I add "-dd" to the command line, it has now worked three
> times in a row, when before it worked once out of ten tries.
>
> So my guess is that it's something timing-dependent.

So it stays working with -dd, but I do get *occasional* failures that
then seem to clear up on retry. So it ends up working in the end, but
I think I have a few example failures in the logs.

So for example, from my dmesg, I get this:

[14:07:15] wlp1s0: authenticate with 20:9f:db:e7:80:80
[14:07:15] wlp1s0: send auth to 20:9f:db:e7:80:80 (try 1/3)
[14:07:15] wlp1s0: send auth to 20:9f:db:e7:80:80 (try 2/3)
[14:07:15] wlp1s0: send auth to 20:9f:db:e7:80:80 (try 3/3)
[14:07:15] wlp1s0: authentication with 20:9f:db:e7:80:80 timed out
[14:07:18] wlp1s0: authenticate with 20:9f:db:e7:80:80
[14:07:18] wlp1s0: send auth to 20:9f:db:e7:80:80 (try 1/3)
[14:07:18] wlp1s0: authenticated
[14:07:18] wlp1s0: associate with 20:9f:db:e7:80:80 (try 1/3)
[14:07:18] wlp1s0: RX AssocResp from 20:9f:db:e7:80:80 (capab=0x431
status=0 aid=16)
[14:07:18] wlp1s0: associated
[14:07:22] wlp1s0: deauthenticated from 20:9f:db:e7:80:80 (Reason:
2=PREV_AUTH_NOT_VALID)

with another failure at 14:07:22, but then it ends up working a bit
later at 14:07:33.

I've got the wpa supplicant log for this timeframe, but I'd rather not
send it out in public. I see "[REMOVED] for what looks like the key
data, but there's a lot of other hex data. Is any of it sensitive?

Linus

2015-02-23 20:06:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Mon, Feb 23, 2015 at 9:17 AM, Jouni Malinen <[email protected]> wrote:
>
> mac80211: Do not encrypt EAPOL frames before peer has used the key

Hmm. This patch does not seem to make a difference. I thought it did
at first, but then removed the wpa_supplicant debugging, and got the
same failures.

On Sun, Feb 22, 2015 at 10:01 PM, Sujith Manoharan <[email protected]> wrote:
>
> Or 'iw dev wlp1s0 set bitrates ht-mcs-2.4 0' to choose the lowest
> HT rate.

This *might* have worked. But it's a bit hard to really make sure,
since it sometimes does succeed even without debugging when I do
nothing at all, but it did work twice in a row after doing that.

> Sporadic association problems could be a problem with the chosen rates.
> This would show the rates for the EAPOL frames:
>
> iw dev wlp1s0 interface add mon0 type monitor
> ifconfig mon0 up

Hmm. I don't actually see a "mon0" interface after the "iw dev
interface add" thing. Yes, "iw" sees it when I do "iw dev", but
"ifconfig" does not.

This machine has a fairly minimal kernel config. Does that "type
monitor" interface perhaps need some debug infrastructure that I
haven't added?

> tshark -i mon0 -Y eapol -T fields -e radiotap.datarate -e wlan -e eapol -e wlan.sa -e wlan.da

.. and then this fails, presumably for similar reasons.

Linus

2015-02-23 10:37:45

by Jouni Malinen

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Sun, Feb 22, 2015 at 05:41:25PM -0800, Linus Torvalds wrote:
> Ok. Attached is what seems to be the relevant part of the
> wpa_supplicant.log file.
>
> The datestamp has been changed so that it can be matched up with the
> dmesg, and I added empty lines for pauses when I was trying to figure
> out what the heck it was doing, but other than that it's the raw log.

> 14:07:16.059389: nl80211: Authenticate (ifindex=3)
> 14:07:16.390659: nl80211: MLME event 37; timeout with 20:9f:db:e7:80:80

I'm assuming this is unrelated to the issue of getting disconnected
after 4-way handshake, but anyway, something here prevent the simple
Authentication frame exchange from completing (i.e., either the AP did
not reply to these these frames or the response was lost for some
reason). The following attempt did you go through without issues.

> 14:07:19.478850: nl80211: Authenticate (ifindex=3)
> 14:07:19.493911: nl80211: Authenticate event

> 14:07:19.494223: nl80211: Associate (ifindex=3)
> 14:07:19.503237: nl80211: Associate event

> 14:07:19.504682: wlp1s0: WPA: RX message 1 of 4-Way Handshake from 20:9f:db:e7:80:80 (ver=2)
> 14:07:19.505850: wlp1s0: WPA: Sending EAPOL-Key 2/4
> 14:07:19.510108: wlp1s0: WPA: RX message 3 of 4-Way Handshake from 20:9f:db:e7:80:80 (ver=2)
> 14:07:19.510233: wlp1s0: WPA: Sending EAPOL-Key 4/4

So it looks like both the AP and the station are able to send and
receive EAPOL frames. EAPOL-Key message 1/4..3/4 worked fine. However,
I'm assuming msg 4/4 did not make it through for some reason.

> 14:07:19.510338: wlp1s0: WPA: Installing PTK to the driver
> 14:07:19.510405: wpa_driver_nl80211_set_key: ifindex=3 alg=3 addr=0x7f0ab41cad68 key_idx=0 set_tx=1 seq_len=6 key_len=16
> 14:07:19.510435: addr=20:9f:db:e7:80:80

And this is where IEEE 802.11 RSN gets a bit messy.. The station
installs a key for encrypting all unicast frames to the AP now that it
believes 4-way handshake has been completed successfully.

> 14:07:19.511009: wlp1s0: WPA: Installing GTK to the driver (keyidx=1 tx=0 len=16)

And same for a key to decrypt received broadcast/multicast frames.

> 14:07:19.511205: wlp1s0: WPA: Key negotiation completed with 20:9f:db:e7:80:80 [PTK=CCMP GTK=CCMP]
> 14:07:19.511223: wlp1s0: Cancelling authentication timeout
> 14:07:19.511238: wlp1s0: State: GROUP_HANDSHAKE -> COMPLETED
> 14:07:19.511253: wlp1s0: CTRL-EVENT-CONNECTED - Connection to 20:9f:db:e7:80:80 completed [id=0 id_str=]

As far as the station is concerned, everything looked fine until here
and connection was established successfully.

> 14:07:20.512476: wlp1s0: WPA: RX message 3 of 4-Way Handshake from 20:9f:db:e7:80:80 (ver=2)

However, the AP disagrees.. It looks like this specific AP uses a one
second timeout on EAPOL-Key timeouts. This EAPOL-Key frame was likely
unencrypted since the AP did not receive (or accept, but more likely not
receive) msg 4/4. We are currently allowing such unencrypted EAPOL
frames (but not other ethertypes) to get processed even when the
pairwise key has been configured.

> 14:07:20.512674: wlp1s0: WPA: Sending EAPOL-Key 4/4

And here's an attempt to reply.. Alas, this will likely go out encrypted
since we have the pairwise key configured and the AP will be dropping it
most likely since it would configure the pairwise key for this station
only after having receive valid 4/4.

> 14:07:20.513170: wpa_driver_nl80211_set_key: ifindex=3 alg=3 addr=0x7f0ab41cad68 key_idx=0 set_tx=1 seq_len=6 key_len=16

And station reconfigured the key..

> 14:07:21.513331: wlp1s0: WPA: RX message 3 of 4-Way Handshake from 20:9f:db:e7:80:80 (ver=2)
> 14:07:22.514320: wlp1s0: WPA: RX message 3 of 4-Way Handshake from 20:9f:db:e7:80:80 (ver=2)

These are the following two retries from AP, one second apart. Trying
EAPOL-Key frames four times in normal behavior, so based on timing here,
the issue is in AP not receiving (or accepting) msg 4/4.

> 14:07:23.542903: nl80211: Deauthenticate event

And this is the expected deauthentication by the AP after the last
EAPOL-Key frame retransmission failed to get a valid response.


So yes, this is a race condition of sorts. It is interesting if verbose
debugging in wpa_supplicant is enough to make this less likely to fail.
That will give some more time between the EAPOL-Key msg 4/4 transmission
(packet socket send() on the netdev) and pairwise key configuration
(netlink send). In theory, it could be possible to get even the first
EAPOL-Key msg 4/4 encrypted if the kernel operations for Data frame
transmission and nl80211 key configuration get reordered. Though, if
that were to be happening commonly, we would see significantly more
issues with this, so I'd assume this is not really the main issue here.

I'm not sure why the AP would not accept EAPOL-Key msg 4/4, so it would
be interesting to see a wireless capture log to confirm that the first
attempt of sending that out did indeed result in the frame going out and
doing so without encryption.


As far as making this issue less likely to cause connectivity issues, I
guess we could make mac80211 do a similar hack that some drivers use
here and force the EAPOL-Key msg 4/4 retransmissions to go out without
encryption even though the pairwise key was configured. I think it would
be nicer to control this from user space (i.e., wpa_supplicant could
indicate with the send command that the frame needs to go out without
encryption), but that is unfortunately not something that the current
mechanism used for sending out EAPOL frames in station mode supports.
mac80211 should be able to track this state if it were to look at which
EAPOL-Key frames has been exchanged and start encrypting them only after
the AP has transmitted an encrypted frame with the new pairwise key.
This is somewhat ugly, but doable. (Though, this would be yet another
justification for finally trying to come up with an alternative
mechanism for handling EAPOL frame TX/RX that is really a special case
for IEEE 802.1X.)

--
Jouni Malinen PGP id EFC895FA

2015-02-23 11:05:54

by Jouni Malinen

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Mon, Feb 23, 2015 at 11:55:30AM +0100, wim torfs wrote:
> If it is the case that the 4-way handshake fails, then I have seen
> this issue before. The problem is that messages 1 to 4 are sent with
> the previous key. However, right after sending message 4/4, does
> wpa_supplicant set the new key. In some cases, such as in a high
> throughput environment, this could result that the key is set even
> before the last packet is sent out. As a result, the AP receives a
> packet which is encrypted with the wrong key, since it still listens
> with the old key.

The specific case here is not PTK rekeying, i.e., it is for the initial
4-way handshake, so there is no previous pairwise key in place. That
said, I guess it would be possible for some drivers to hit this even
with the initial PTK configuration if hardware acceleration is used for
encryption and the frame gets delayed sufficiently long while waiting
for medium access.

> A solution would be to adjust wpa_supplicant to wait for the
> transmission ack before it sets its key, however, this causes issues
> with the key reception and transmission count, which could be
> circumvented by copying the old counter values upon setting the new
> key, but this is a dirty hack. Another solution would require some
> more invasive operations, that is, the new key should somehow be
> linked to the message 4/4 and should only be set by the driver upon
> transmission of the message 4/4.

I've been thinking of adding EAPOL TX status reporting into
wpa_supplicant at least to make the debug log more helpful. However,
this does indeed cause issues for RX and I'm not really sure I'd like to
delay pairwise key configuration. The proper way of handling this would
be by doing what the standard really implies, i.e., configure the key
for RX only at first and then enable it for TX after EAPOL-Key msg 4/4
has been transmitted. Though, that should really be interpreted with
something like "enable for TX after the AP has used the key" due to this
possibility for retransmissions of EAPOL-Key 3/4. That said, there are
corner cases even with this design, e.g., due to the STA wanting to
transmit Data frames to the AP immediately after EAPOL-Key 4/4; those
should really be encrypted, so the behavior here would likely need to be
conditional on ethertype (RX-only for EAPOL, RX+TX for everything else).

Things get even messier with PTK rekeying when there would actually need
to be support for two keys (even with the same key index..) so that the
retransmitted EAPOL-Key 3/4 case could be detected and replied to in a
way that the AP understands the response. This gets unfortunately ugly
and I'd assume almost no station implements this in a way that would
handle all cases cleanly (i.e., just hope for msg 4/4 to go through and
reassociate as a backup plan if things fail..).

--
Jouni Malinen PGP id EFC895FA

2015-02-23 01:41:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Sun, Feb 22, 2015 at 4:54 PM, Adrian Chadd <[email protected]> wrote:
>
> I /think/ it's okay? The removed stuff is the pre-shared key pieces.

Ok. Attached is what seems to be the relevant part of the
wpa_supplicant.log file.

The datestamp has been changed so that it can be matched up with the
dmesg, and I added empty lines for pauses when I was trying to figure
out what the heck it was doing, but other than that it's the raw log.

> Do you have another laptop with an atheros NIC in it that you could
> use in monitor mode to capture all the frames?

Nope, everything else I have seems to be intel wireless. I think one
of the kids machines is a Mac Mini with an ath5k thing, but I'm hoping
the wpa_supplicant.log is sufficient to give somebody an idea.

Linus


Attachments:
wpa_supplicant.log (40.37 kB)

2015-02-22 19:39:59

by Adrian Chadd

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On 22 February 2015 at 10:30, Linus Torvalds
<[email protected]> wrote:
> On Sun, Feb 22, 2015 at 10:24 AM, Adrian Chadd <[email protected]> wrote:
>>
>> Just a wild shot - try disabling fast authentication and see if that
>> makes a difference?
>>
>> wpa_supplicant.conf:
>>
>> fast_reauth=0
>>
>> I recall having issues with fast_reauth once, but I never stuck around
>> that location long enough to debug it.
>
> Nope. Did that, killed wpa_supplicant (which restarts it), tried
> connecting, still failed.

Hm, can you enable wpa debugging to log everything whilst it's
associating / reassociating?

(maybe log_devel debug? I tend to run wpa_supplicant at the command
line with debugging enabled, rather than doing it after-the-fact.)




-adrian

2015-02-22 18:24:59

by Adrian Chadd

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On 21 February 2015 at 15:34, Linus Torvalds
<[email protected]> wrote:
> So I've had problems connecting to some networks before on my
> Chromebook Pixel, but now I'm testing a new Ubiquiti network at home,
> and can see this issue at home too.
>
> I know the wireless works, because other devices work fine on that
> network. Also, I know the AR9462 works, because I still have my old
> network up and it connects to that.
>
> And it *occasionally* connects to the new one. But it's rare, and it
> clearly has problems.
>
> It looks something like this:
>
> [ 73.757869] wlp1s0: authenticate with 20:9f:db:e7:80:80
> [ 73.771471] wlp1s0: send auth to 20:9f:db:e7:80:80 (try 1/3)
> [ 73.773706] wlp1s0: authenticated
> [ 73.775122] wlp1s0: associate with 20:9f:db:e7:80:80 (try 1/3)
> [ 73.787434] wlp1s0: RX AssocResp from 20:9f:db:e7:80:80
> (capab=0x431 status=0 aid=9)
> [ 73.787573] wlp1s0: associated
> [ 77.784931] wlp1s0: deauthenticated from 20:9f:db:e7:80:80 (Reason:
> 2=PREV_AUTH_NOT_VALID)
>
> and the password I used definitely is right, and sometimes works.
> Despite that PREV_AUTH_NOT_VALID thing.
>
> Any suggestions for what I should do to give you guys any sane and
> useful debug output?

Just a wild shot - try disabling fast authentication and see if that
makes a difference?

wpa_supplicant.conf:

fast_reauth=0

I recall having issues with fast_reauth once, but I never stuck around
that location long enough to debug it.




-adrian

2015-02-23 23:13:27

by Jouni Malinen

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Mon, Feb 23, 2015 at 03:00:39PM -0800, Linus Torvalds wrote:
> On Mon, Feb 23, 2015 at 2:43 PM, Jouni Malinen <[email protected]> wrote:
> >
> > This did not get exactly supportive response when this was proposed last
> > time (Sep 2013). Anyway, for a quick test, this can be done with the
> > following one-liner:
>
> fwiw, that one-liner seems to work fine for me.

Good to know. That seems to confirm that the issue is most likely in
some of the higher HT rates not working with that AP for some reason at
least for EAPOL frames immediately following the association. This
should really have worked since I'm pretty sure minstrel will add a
lower MCS index as a fallback rate, but maybe there are some interop
issues or just close enough to being below the required signal strength
(though, I'd find that somewhat unlikely unless there is something
messed up with the antennas etc. taken into account your description on
how close the devices are and what's between them).

> Side note: I've done the "turn off wifi and turn it back on" several
> times to test that patch, and it has worked every time. BUT I also see
> this odd behavior where the logs show that it tries to authenticate
> twice: the first time it does that "send auth to 20:9f .." thing three
> times (looks like 100ms apart), and nothing happens so it does
> "authentication with 20:9f .. timed out". Then it waits three seconds
> and tries again, and now it succeeds on the first try.
>
> The only downside of that seems to be that it takes an extra 3s to
> connect to the network - but it does now seem to *reliably* connect -
> so it's not a big problem, but I wonder why it should be that
> repeatable. Is there some difference between the first and the second
> time it tries to authenticate?

There have been some issues in this area in the past, but it is a bit
difficult to say what could have caused it here without a wireless
capture from the operating channel of the AP. It is possible that those
Authentication frames were not actually transmitted due to some issue or
they could have even been sent out on incorrect channel, etc. That
shouldn't really happen that frequently (i.e., others should be seeing
and reporting this as well..), but it's possible something gets messed
up in channel configuration.

> Anyway, even if people don't like that particular patch, it does seem
> like *something* like that should be done.

Agreed. Just need to figure out a suitable compromise somewhere in the
middle of the minimum and close-to-maximum rates. As far as ath9k is
concerned, it would actually support four different rates for retry
series where minstrel uses only three, so it would be easy even for that
driver on its own to add a lower rate at the end if we cannot find
consensus on something more generic in mac80211. That said, I'd rather
see minstrel getting more conservative for EAPOL frames.

--
Jouni Malinen PGP id EFC895FA

2015-02-23 05:43:04

by Sujith Manoharan

[permalink] [raw]
Subject: Re: [ath9k-devel] AR9462 problems connecting again..

Linus Torvalds wrote:
> 14:07:23.542927: wlp1s0: Event DEAUTH (12) received
> 14:07:23.542946: wlp1s0: Deauthentication notification
> 14:07:23.542964: wlp1s0: * reason 2
> 14:07:23.542982: wlp1s0: * address 20:9f:db:e7:80:80
> 14:07:23.542997: Deauthentication frame IE(s) - hexdump(len=0): [NULL]
> 14:07:23.543055: wlp1s0: CTRL-EVENT-DISCONNECTED bssid=20:9f:db:e7:80:80 reason=2

Sporadic association problems could be a problem with the chosen rates.
This would show the rates for the EAPOL frames:

iw dev wlp1s0 interface add mon0 type monitor
ifconfig mon0 up
tshark -i mon0 -Y eapol -T fields -e radiotap.datarate -e wlan -e eapol -e wlan.sa -e wlan.da
Run wpa_s.

Does setting a lower rate improve connectivity ?

'iw dev wlp1s0 set bitrates ht-mcs-2.4 1' would choose a lower
rate for the key-exchange frames.

Sujith

2015-02-22 17:55:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Sat, Feb 21, 2015 at 10:50 PM, Sujith Manoharan <[email protected]> wrote:
>
> Can you please post the output of 'iw dev wlp1s0 scan' ?

Attached.

It's the "UniFi-home" SSID that doesn't work. The 1gnoraNT one is the
old working one that I'm obviously associated with, and that has
multiple AP's.

(The UniFi-home also has two AP's, but they should all show up as a
single network)

Linus


Attachments:
out (4.26 kB)

2015-02-23 05:58:05

by Sujith Manoharan

[permalink] [raw]
Subject: Re: [ath9k-devel] AR9462 problems connecting again..

Sujith Manoharan wrote:
> 'iw dev wlp1s0 set bitrates ht-mcs-2.4 1' would choose a lower
> rate for the key-exchange frames.

Or 'iw dev wlp1s0 set bitrates ht-mcs-2.4 0' to choose the lowest
HT rate.

Sujith

2015-02-23 02:05:06

by Adrian Chadd

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

So the interesting part thus far:

14:07:23.513500: RTM_NEWLINK: operstate=1 ifi_flags=0x1003 ([UP])
14:07:23.513662: RTM_NEWLINK, IFLA_IFNAME: Interface 'wlp1s0' added
14:07:23.513842: nl80211: if_removed already cleared - ignore event
14:07:23.536309: nl80211: Event message available
14:07:23.536454: nl80211: Delete station 20:9f:db:e7:80:80
14:07:23.542786: nl80211: Event message available
14:07:23.542882: nl80211: MLME event 39
14:07:23.542903: nl80211: Deauthenticate event
14:07:23.542927: wlp1s0: Event DEAUTH (12) received
14:07:23.542946: wlp1s0: Deauthentication notification
14:07:23.542964: wlp1s0: * reason 2
14:07:23.542982: wlp1s0: * address 20:9f:db:e7:80:80
14:07:23.542997: Deauthentication frame IE(s) - hexdump(len=0): [NULL]
14:07:23.543055: wlp1s0: CTRL-EVENT-DISCONNECTED
bssid=20:9f:db:e7:80:80 reason=2
14:07:23.543096: wlp1s0: Auto connect enabled: try to reconnect (wps=0
wpa_state=9)
14:07:23.543115: wlp1s0: Setting scan request: 0 sec 100000 usec
14:07:23.543133: Added BSSID 20:9f:db:e7:80:80 into blacklist

You've associated, you do the 4-way handshake, that works fine; then
one second later you do it again; then one second later it fails and
it sends a disconnect.

Whenever I've seen this in the past, it's been because some part of
the EAPOL exchange is being transmitted at the wrong rate or something
about the sequence number/etc is incorrect. So, hm, I'd want to verify
that the whole exchange is working correctly and peek into the header
values for each of those.





-adrian

2015-02-23 10:55:34

by Wim Torfs

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

If it is the case that the 4-way handshake fails, then I have seen this
issue before. The problem is that messages 1 to 4 are sent with the
previous key. However, right after sending message 4/4, does
wpa_supplicant set the new key. In some cases, such as in a high
throughput environment, this could result that the key is set even
before the last packet is sent out. As a result, the AP receives a
packet which is encrypted with the wrong key, since it still listens
with the old key.

I did not notice this issue while authenticating for the first time,
only after the first rekey operation.

A solution would be to adjust wpa_supplicant to wait for the
transmission ack before it sets its key, however, this causes issues
with the key reception and transmission count, which could be
circumvented by copying the old counter values upon setting the new key,
but this is a dirty hack. Another solution would require some more
invasive operations, that is, the new key should somehow be linked to
the message 4/4 and should only be set by the driver upon transmission
of the message 4/4.

Best regards,
Wim.


On 02/23/2015 11:37 AM, Jouni Malinen wrote:
> On Sun, Feb 22, 2015 at 05:41:25PM -0800, Linus Torvalds wrote:
>> Ok. Attached is what seems to be the relevant part of the
>> wpa_supplicant.log file.
>>
>> The datestamp has been changed so that it can be matched up with the
>> dmesg, and I added empty lines for pauses when I was trying to figure
>> out what the heck it was doing, but other than that it's the raw log.
>
>> 14:07:16.059389: nl80211: Authenticate (ifindex=3)
>> 14:07:16.390659: nl80211: MLME event 37; timeout with 20:9f:db:e7:80:80
>
> I'm assuming this is unrelated to the issue of getting disconnected
> after 4-way handshake, but anyway, something here prevent the simple
> Authentication frame exchange from completing (i.e., either the AP did
> not reply to these these frames or the response was lost for some
> reason). The following attempt did you go through without issues.
>
>> 14:07:19.478850: nl80211: Authenticate (ifindex=3)
>> 14:07:19.493911: nl80211: Authenticate event
>
>> 14:07:19.494223: nl80211: Associate (ifindex=3)
>> 14:07:19.503237: nl80211: Associate event
>
>> 14:07:19.504682: wlp1s0: WPA: RX message 1 of 4-Way Handshake from 20:9f:db:e7:80:80 (ver=2)
>> 14:07:19.505850: wlp1s0: WPA: Sending EAPOL-Key 2/4
>> 14:07:19.510108: wlp1s0: WPA: RX message 3 of 4-Way Handshake from 20:9f:db:e7:80:80 (ver=2)
>> 14:07:19.510233: wlp1s0: WPA: Sending EAPOL-Key 4/4
>
> So it looks like both the AP and the station are able to send and
> receive EAPOL frames. EAPOL-Key message 1/4..3/4 worked fine. However,
> I'm assuming msg 4/4 did not make it through for some reason.
>
>> 14:07:19.510338: wlp1s0: WPA: Installing PTK to the driver
>> 14:07:19.510405: wpa_driver_nl80211_set_key: ifindex=3 alg=3 addr=0x7f0ab41cad68 key_idx=0 set_tx=1 seq_len=6 key_len=16
>> 14:07:19.510435: addr=20:9f:db:e7:80:80
>
> And this is where IEEE 802.11 RSN gets a bit messy.. The station
> installs a key for encrypting all unicast frames to the AP now that it
> believes 4-way handshake has been completed successfully.
>
>> 14:07:19.511009: wlp1s0: WPA: Installing GTK to the driver (keyidx=1 tx=0 len=16)
>
> And same for a key to decrypt received broadcast/multicast frames.
>
>> 14:07:19.511205: wlp1s0: WPA: Key negotiation completed with 20:9f:db:e7:80:80 [PTK=CCMP GTK=CCMP]
>> 14:07:19.511223: wlp1s0: Cancelling authentication timeout
>> 14:07:19.511238: wlp1s0: State: GROUP_HANDSHAKE -> COMPLETED
>> 14:07:19.511253: wlp1s0: CTRL-EVENT-CONNECTED - Connection to 20:9f:db:e7:80:80 completed [id=0 id_str=]
>
> As far as the station is concerned, everything looked fine until here
> and connection was established successfully.
>
>> 14:07:20.512476: wlp1s0: WPA: RX message 3 of 4-Way Handshake from 20:9f:db:e7:80:80 (ver=2)
>
> However, the AP disagrees.. It looks like this specific AP uses a one
> second timeout on EAPOL-Key timeouts. This EAPOL-Key frame was likely
> unencrypted since the AP did not receive (or accept, but more likely not
> receive) msg 4/4. We are currently allowing such unencrypted EAPOL
> frames (but not other ethertypes) to get processed even when the
> pairwise key has been configured.
>
>> 14:07:20.512674: wlp1s0: WPA: Sending EAPOL-Key 4/4
>
> And here's an attempt to reply.. Alas, this will likely go out encrypted
> since we have the pairwise key configured and the AP will be dropping it
> most likely since it would configure the pairwise key for this station
> only after having receive valid 4/4.
>
>> 14:07:20.513170: wpa_driver_nl80211_set_key: ifindex=3 alg=3 addr=0x7f0ab41cad68 key_idx=0 set_tx=1 seq_len=6 key_len=16
>
> And station reconfigured the key..
>
>> 14:07:21.513331: wlp1s0: WPA: RX message 3 of 4-Way Handshake from 20:9f:db:e7:80:80 (ver=2)
>> 14:07:22.514320: wlp1s0: WPA: RX message 3 of 4-Way Handshake from 20:9f:db:e7:80:80 (ver=2)
>
> These are the following two retries from AP, one second apart. Trying
> EAPOL-Key frames four times in normal behavior, so based on timing here,
> the issue is in AP not receiving (or accepting) msg 4/4.
>
>> 14:07:23.542903: nl80211: Deauthenticate event
>
> And this is the expected deauthentication by the AP after the last
> EAPOL-Key frame retransmission failed to get a valid response.
>
>
> So yes, this is a race condition of sorts. It is interesting if verbose
> debugging in wpa_supplicant is enough to make this less likely to fail.
> That will give some more time between the EAPOL-Key msg 4/4 transmission
> (packet socket send() on the netdev) and pairwise key configuration
> (netlink send). In theory, it could be possible to get even the first
> EAPOL-Key msg 4/4 encrypted if the kernel operations for Data frame
> transmission and nl80211 key configuration get reordered. Though, if
> that were to be happening commonly, we would see significantly more
> issues with this, so I'd assume this is not really the main issue here.
>
> I'm not sure why the AP would not accept EAPOL-Key msg 4/4, so it would
> be interesting to see a wireless capture log to confirm that the first
> attempt of sending that out did indeed result in the frame going out and
> doing so without encryption.
>
>
> As far as making this issue less likely to cause connectivity issues, I
> guess we could make mac80211 do a similar hack that some drivers use
> here and force the EAPOL-Key msg 4/4 retransmissions to go out without
> encryption even though the pairwise key was configured. I think it would
> be nicer to control this from user space (i.e., wpa_supplicant could
> indicate with the send command that the frame needs to go out without
> encryption), but that is unfortunately not something that the current
> mechanism used for sending out EAPOL frames in station mode supports.
> mac80211 should be able to track this state if it were to look at which
> EAPOL-Key frames has been exchanged and start encrypting them only after
> the AP has transmitted an encrypted frame with the new pairwise key.
> This is somewhat ugly, but doable. (Though, this would be yet another
> justification for finally trying to come up with an alternative
> mechanism for handling EAPOL frame TX/RX that is really a special case
> for IEEE 802.1X.)
>

2015-02-24 00:26:08

by Sujith Manoharan

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

Jouni Malinen wrote:
> Even I think that this goes a bit too far especially on 2.4 GHz band,
> but I would actually consider limiting EAPOL frames to using non-HT/VHT
> (e.g., 6 Mbps OFDM at least for EAPOL-Key frames) to avoid some
> interoperability issues. I would say that the current minstrel_ht
> behavior is somewhat excessive for EAPOL-Key frames in the other
> direction. Using MCS 14 with fallback to something like MCS 5 for the
> second Data frame after an association can certainly fail.
>
> Number of Linux drivers do already limit EAPOL frame TX rate, so this
> specific item is mainly applicable only to driver that use minstrel from
> mac80211 (e.g., ath9k). Though, that IEEE80211_TX_CTL_USE_MINRATE would
> likely affect most mac80211 drivers.

We had this in the ath9k RC, where EAPOL frames were sent out at min-rate
on the VO queue. I think instead of re-introducing it in the driver,
having it in minstrel is cleaner.

Sujith

2015-02-23 22:22:33

by Adrian Chadd

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On 23 February 2015 at 13:53, Linus Torvalds
<[email protected]> wrote:
> On Mon, Feb 23, 2015 at 1:30 PM, Jouni Malinen <[email protected]> wrote:
>>
>> How far is the station from the AP? Would it be possible to see whether
>> the behavior changes if you were within, say, five meters or so?
>
> Well, it was pretty much within five meters already, but there was a
> thin wall in between (and the old AP was right next to the laptop,
> which might add some noise even if they are on different channels).
> Going closer does seem to help, but again, it's not like this is 100%
> reproducible to begin with.
>
> So the theory that the driver starts at too high a transmit rate, and
> then does not handle failures well, might be true. Of course, "not
> handle failures well" is something of an understatement.
>
>> It would be useful if you can capture the 802.11 frame exchange from a
>> failed connection case with an external wireless sniffer.
>
> I will try with my (much more reliable) iwlwifi laptop. At least the
> merge window is over, so I should have some time. Knock wood.

Hm, can we just hack mac80211/ath9k to set /all/ EAPOL frames to the
lowest negotiated basic rate and test? That way we don't have to worry
about things resetting fixed rates or whatnot.

I've done that with FreeBSD's atheros/intel drivers and net80211 stack
to fix exactly these, although I'm thinking about adding a patch based
on Jouni's note about EAPOL encrypting the final response.

Does ath9k have an easy interface for dumping out the TX descriptors
and response? That way we can see (a) what it's transmitting as, and
(b) whether the hardware indicated it got an ACK for the particular
frame.


-adrian

2015-02-23 22:43:08

by Jouni Malinen

[permalink] [raw]
Subject: Re: AR9462 problems connecting again..

On Mon, Feb 23, 2015 at 02:22:32PM -0800, Adrian Chadd wrote:
> On 23 February 2015 at 13:53, Linus Torvalds
> <[email protected]> wrote:
> > So the theory that the driver starts at too high a transmit rate, and
> > then does not handle failures well, might be true. Of course, "not
> > handle failures well" is something of an understatement.

> Hm, can we just hack mac80211/ath9k to set /all/ EAPOL frames to the
> lowest negotiated basic rate and test? That way we don't have to worry
> about things resetting fixed rates or whatnot.

This did not get exactly supportive response when this was proposed last
time (Sep 2013). Anyway, for a quick test, this can be done with the
following one-liner:

diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index c314c59..b45038f 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -566,6 +566,7 @@ ieee80211_tx_h_check_control_port_protocol(struct ieee80211_tx_data *tx)
if (tx->sdata->control_port_no_encrypt)
info->flags |= IEEE80211_TX_INTFL_DONT_ENCRYPT;
info->control.flags |= IEEE80211_TX_CTRL_PORT_CTRL_PROTO;
+ info->flags |= IEEE80211_TX_CTL_USE_MINRATE;
}

return TX_CONTINUE;


Even I think that this goes a bit too far especially on 2.4 GHz band,
but I would actually consider limiting EAPOL frames to using non-HT/VHT
(e.g., 6 Mbps OFDM at least for EAPOL-Key frames) to avoid some
interoperability issues. I would say that the current minstrel_ht
behavior is somewhat excessive for EAPOL-Key frames in the other
direction. Using MCS 14 with fallback to something like MCS 5 for the
second Data frame after an association can certainly fail.

Number of Linux drivers do already limit EAPOL frame TX rate, so this
specific item is mainly applicable only to driver that use minstrel from
mac80211 (e.g., ath9k). Though, that IEEE80211_TX_CTL_USE_MINRATE would
likely affect most mac80211 drivers.

--
Jouni Malinen PGP id EFC895FA