Hi!
I'm working at the IT department of the University of Vienna. We have a
large installation of CISCO APs providing WLAN access to students and
employees. All of these APs provide both 2.4GHz and 5GHz channels. CISCO
provides two features called "load balancing" and "band select".
At least "band select" causes a lot of trouble for Linux clients. It
takes a good deal of luck to connect successfully.
I'm using my HP Elitebook 2540p with Intel 6200 abgn
pci id: 8086:4239 (rev 35)
Starting with Fedora 13, and now on Fedora 14, I have been trying to get
into all the wireless stuff. Currently I'm running compat-wireless-20101115 and
wpa_supplicant 0.7.3. Additionally I patched NetworkManager to use a
timeout of 180 seconds instead of the default 25 and "-D nl80211" as
driver for wpa_supplicant. Firmware used is iwlwifi-6000-4.ucode.
AFAIK "band select" tries to "convince" a client to prefer 5GHz channels
by not answering the first 2.4GHz probes from that client (the number of
ignored probes is configurable, 2 by default). But the AP still appears in
scans since beacons are received as usual.
In my case I see 10 BSSIDs for this SSID: 2 strong 2.4GHz APs, and the
first 5GHz AP appears in third position by signal strength. wpa_supplicant
starts authentication with the strongest. Then I see a probe request for
the SSID in wireshark, but no response from the selected BSSID. No
authentication packet shows up in wireshark.
Authentication times out. And then the worst case scenario takes
place... wpa_supplicant retries the same AP over and over, with timeouts
and scans in between. Sometimes even 180 seconds is not enough for it to
try another AP.
I can provide sample wpa_supplicant.log and wireshark traces if of interest.
I just built compat-wireless 20101119 with DEBUG_VERBOSE and can get
details if needed.
Last but not least I tried with Windows. Windows is able to connect
even on the 2.4GHz channels. I monitored the channel with my Linux
machine while Windows connected to the 2.4GHz AP. All I see are
unanswered probes there as well, but Windows seems to simply send an
authentication request afterwards and then gets an answer.
I can't figure out how CISCO expects a client to behave in order to
cooperate well with this feature.
I'm sorry that I'm not very proficient with all that wireless stuff yet,
but I'll try to improve and help as well as I can if that's appreciated.
With kind regards,
Wolfgang Breyha
University of Vienna
--
Wolfgang Breyha <[email protected]> | http://www.blafasel.at/
Vienna University Computer Center | Austria
On 2010-11-20 13:04, Helmut Schaa wrote:
> If the Cisco APs would reply to direct probes we could (as a workaround) just
> send an additional direct probe here. I agree with Jouni that the AP behavior
> is just stupid but the users will blame Linux for not being able to connect
> and not the AP vendor.
Suddenly the term "<whatever working protocol> fixup" comes to mind when
reading your and Jouni's answers;-) I agree with you, too. But I'm not the one
administrating the APs here at the university. Maybe I can convince my
colleague, but as long as you want to find a solution for Linux I'll
try to give you remote hands;-)
We already tried deactivating "band select", and my laptop was able to
connect instantly to the nearest 2.4GHz AP. But that's the point where the
second "feature" kicks in. "load balancing" is then used by the APs to kick
stations in order to push them to another AP. At this point Linux clients
again have trouble getting a stable connection.
> Wolfgang, could you please try the (untested) patch below if it makes any
> difference?
Sure, I'll try as soon as I'm back at the office on Monday. And I'll try to
get the logs and packet traces for Jouni, too.
Greetings,
Wolfgang
--
Wolfgang Breyha <[email protected]> | http://www.blafasel.at/
Vienna University Computer Center | Austria
On Thu, Nov 25, 2010 at 06:47:24PM +0200, Jouni Malinen wrote:
> Interestingly enough, that seems to be exactly what the current
> wireless-testing.git snapshot is doing.. I was trying to reproduce this
> issue by modifying my AP not to reply to Probe Request frames on 2.4 GHz
> band and did not see any problems in getting connected. The station saw
> both the 2.4 and 5 GHz BSSes from the AP and the 2.4 GHz BSS was
> selected based on signal strength. mac80211 went through the
> authentication and association frame exchanges without any problems (and
> without sending out a directed Probe Request frame). Whether this change
> was done by design is another question, but at least this seems to be
> the current behavior.
Well, maybe not. I could not reproduce that after adding more debug code
to mac80211. In other words, the Probe Request is still there before
authentication is attempted. Anyway, with this, I was able to reproduce
a problem where wpa_supplicant blacklisting does not work properly in
this case, at least with -Dnl80211.
--
Jouni Malinen PGP id EFC895FA
On Fri, Nov 19, 2010 at 10:22:32PM +0100, Wolfgang Breyha wrote:
> AFAIK "band select" tries to "convince" a client to prefer 5GHz channels
> by not answering to 2.4GHz probes at least two times (configurable with
> 2 as default) the same client asks. But the AP appears in scans since
> beacons are received as usual.
Huh.. This makes the AP completely non-compliant with IEEE Std
802.11-2007 and such madness should not really be encouraged in any way
or form. Please just disable it and request Cisco to provide a sane
solution that allows the stations to opt in to whatever games the AP
wants to play and not some non-standard hacks. If the AP wants to suggest
that the station move to another band, there had better be a documented,
publicly available specification describing a clear message that
stations can use as a clear input to BSS selection. Arbitrarily breaking
required standard functionality is not such a mechanism.
> In my case I see 10 BSSIDs for this SSID. 2 strong 2.4GHz APs and the
> first 5GHz AP appears on third position reception wise. wpa_supplicant
> starts authentication at the strongest. Then I see a probe request for
> the SSID in wireshark, but no response from the selected BSSID. No
> authentication packet is seen from wireshark.
OK, this is all expected thanks to the silly AP design.
> Authentication times out. And then the worst case scenario takes
> place... wpa_supplicant retries and retries the same AP with time outs
> and scans in between. Sometimes even 180 seconds is not enough to try an
> other AP.
This is not.. wpa_supplicant should use the blacklist to block the BSSID
temporarily and try to find another BSS at that point.
> I can provide sample wpa_supplicant.log and wireshark traces if of interest.
Could you please send me those? Ideally, I would like to see
wpa_supplicant debug log with -ddt on the command line (i.e., timestamps
and verbose debugging) with -Dnl80211 and preferably, without using
NetworkManager to control it to avoid any extra timeouts etc. making the
log more confusing to interpret.
> Last but not least I tried with Windows. Windows is able to connected
> even to the 2.4GHz channels. I've monitored the channel with my linux
> machine while windows connected to the 2.4GHz AP. All I see are
> unanswered probes also, but Windows seems to simply send an
> authentication request afterwards and gets an answer then.
This is all specific to the driver/802.11 stack and it is not really the
same for all Windows drivers or all Linux drivers. The AP is behaving
incorrectly and the station behavior in such a case is undefined..
> I can't figure out how CISCO hopes that a client behaves to cooperate
> well with this feature.
Neither can I.. Unfortunately, some enterprise AP vendors seem to be
coming up with load balancing designs that are based on some proprietary
hacks and hope that all stations behave in a specific way. There is no
sane way of implementing these things properly without depending on some
common standard that both the APs and stations can use to exchange
information about preferred BSS candidates. IEEE 802.11 was designed to
keep the station, not the AP, in control of roaming; that cannot be
changed unilaterally at the AP without breaking things badly.
--
Jouni Malinen PGP id EFC895FA
Hi!
On 2010-11-20 13:04, Helmut Schaa wrote:
> If the Cisco APs would reply to direct probes we could (as a workaround) just
> send an additional direct probe here. I agree with Jouni that the AP behavior
> is just stupid but the users will blame Linux for not being able to connect
> and not the AP vendor.
>
> Wolfgang, could you please try the (untested) patch below if it makes any
> difference?
Sorry, it took me a day longer than promised because I had to stay at home
yesterday.
The patch from Helmut didn't change anything. I even tried to send both
broadcast and direct probes in triples to check if that's the threshold
which is configured in band select as retries. It's not;-)
After that I tried a dirty hack on wpa_supplicant 0.7.3:
-------
--- wpa_supplicant-0.7.3.orig/wpa_supplicant/sme.c  2010-09-07 17:43:39.000000000 +0200
+++ wpa_supplicant-0.7.3/wpa_supplicant/sme.c  2010-11-23 15:21:23.866829986 +0100
@@ -456,8 +456,23 @@
 void sme_event_auth_timed_out(struct wpa_supplicant *wpa_s,
                               union wpa_event_data *data)
 {
+        int timeout = 5000;
         wpa_printf(MSG_DEBUG, "SME: Authentication timed out");
-        wpa_supplicant_req_scan(wpa_s, 5, 0);
+        if (wpa_blacklist_add(wpa_s, wpa_s->pending_bssid) == 0) {
+                struct wpa_blacklist *b;
+                wpa_blacklist_add(wpa_s, wpa_s->pending_bssid);
+                b = wpa_blacklist_get(wpa_s, wpa_s->pending_bssid);
+                if (b && b->count < 3) {
+                        /*
+                         * Speed up next attempt if there could be other APs
+                         * that could accept association.
+                         */
+                        timeout = 100;
+                }
+        }
+        wpa_supplicant_req_scan(wpa_s, timeout / 1000,
+                                1000 * (timeout % 1000));
+//      wpa_supplicant_req_scan(wpa_s, 5, 0);
 }
--------
In other words I reused the code found in sme_event_assoc_reject() to
add the BSSID to the blacklist. To speed things up further I add it
twice;-) I don't know why wpa_supplicant needs a blacklist count of 2 to
finally try another BSSID.
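The effect of the double add can be illustrated with a small standalone sketch. This is not wpa_supplicant code: bl_add(), bl_get(), on_auth_timeout() and split_timeout() are hypothetical stand-ins that only mimic the counting and the msec to sec/usec split used in the hack above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6
#define MAX_ENTRIES 16

/* Hypothetical, simplified blacklist: a BSSID plus a failure count. */
struct bl_entry {
    uint8_t bssid[ETH_ALEN];
    int count;
};

struct blacklist {
    struct bl_entry entries[MAX_ENTRIES];
    size_t used;
};

/* Return the entry for bssid, or NULL if it is not blacklisted. */
struct bl_entry *bl_get(struct blacklist *bl, const uint8_t *bssid)
{
    assert(bl != NULL && bssid != NULL);
    for (size_t i = 0; i < bl->used; i++) {
        if (memcmp(bl->entries[i].bssid, bssid, ETH_ALEN) == 0)
            return &bl->entries[i];
    }
    return NULL;
}

/* Add bssid or bump its count. Returns 0 on success, -1 if the table is full. */
int bl_add(struct blacklist *bl, const uint8_t *bssid)
{
    struct bl_entry *e = bl_get(bl, bssid);

    if (e) {
        e->count++;
        return 0;
    }
    if (bl->used == MAX_ENTRIES)
        return -1;
    e = &bl->entries[bl->used++];
    memcpy(e->bssid, bssid, ETH_ALEN);
    e->count = 1;
    return 0;
}

/*
 * Mimics the hack: count the failure twice, then return the delay in
 * milliseconds before the next scan (short while the count is below 3).
 */
int on_auth_timeout(struct blacklist *bl, const uint8_t *bssid)
{
    int timeout_ms = 5000;

    if (bl_add(bl, bssid) == 0) {
        struct bl_entry *e;

        bl_add(bl, bssid);
        e = bl_get(bl, bssid);
        if (e && e->count < 3)
            timeout_ms = 100;
    }
    return timeout_ms;
}

/* Split milliseconds into the (sec, usec) pair passed to the scan request. */
void split_timeout(int timeout_ms, int *sec, int *usec)
{
    *sec = timeout_ms / 1000;
    *usec = 1000 * (timeout_ms % 1000);
}
```

In this model the first timeout for a BSSID leaves its count at 2 and asks for a rescan after 100 ms; a second timeout for the same BSSID pushes the count past the limit and falls back to the 5 second delay.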
And it helps a lot. With this change wpa_supplicant stops retrying the
same BSSID all the time and tries a 5GHz one pretty fast. And I think
that's exactly what CISCO tries to achieve.
Finally there is another timeout in the EAP stage (SUPP_BE) I can't
pinpoint. I attached the wpa_supplicant.log.
Once authenticated, reauthentication works very fast if needed.
Now that I know where to search and how to hack mac80211 and wpa_supplicant,
I'll try to find out which probes CISCO actually responds to once the
threshold is reached.
I can still provide packet traces if you need/want them. In case of the
load balancing feature it may take some time because I've not found a
trick to provoke it. But I think a blacklist that is trained well and
quickly will help in this case, too.
Greetings,
Wolfgang
On Fri, 2010-11-19 at 22:22 +0100, Wolfgang Breyha wrote:
> Hi!
>
> I'm working at the IT department at the University of Vienna. We've a
> large installation of CISCO APs providing WLAN access to students and
> employees. All of these APs provide both 2,4GHz and 5GHz channels. CISCO
> provides two features called "load balancing" and "band select".
>
> At least "band select" causes lots of troubles using a Linux client. It
> needs a big portion of luck to successfully connect.
>
> I'm using my HP Elitebook 2540p with Intel 6200 abgn
> pci id: 8086:4239 (rev 35)
>
> Starting with Fedora 13, now Fedora 14 I tried to get into all the
> wireless stuff. Currently I'm running compat-wireless-20101115 and
> wpa_supplicant 0.7.3. Additionally I patched NetworkManager to use a
> timeout of 180 seconds instead of the default 25 and "-D nl80211" as
Eww, 180 seconds indicates something is clearly wrong with the network
setup or the driver. Based on your description below, we do need to
figure out something in the supplicant or driver to better handle this
behavior. We will be adding settings to NM to lock/prefer specific
bands too.
(NM 0.9 will default to nl80211 supplicant driver)
> driver for wpa_supplicant. Firmware used is iwlwifi-6000-4.ucode.
>
> AFAIK "band select" tries to "convince" a client to prefer 5GHz channels
> by not answering to 2.4GHz probes at least two times (configurable with
> 2 as default) the same client asks. But the AP appears in scans since
> beacons are received as usual.
>
> In my case I see 10 BSSIDs for this SSID. 2 strong 2.4GHz APs and the
> first 5GHz AP appears on third position reception wise. wpa_supplicant
> starts authentication at the strongest. Then I see a probe request for
> the SSID in wireshark, but no response from the selected BSSID. No
> authentication packet is seen from wireshark.
>
> Authentication times out. And then the worst case scenario takes
> place... wpa_supplicant retries and retries the same AP with time outs
> and scans in between. Sometimes even 180 seconds is not enough to try an
> other AP.
Yeah, there's gotta be some better way to handle this. The cisco
behavior seems like a huge hack that tries to work around Windows
specific 802.11 stack behavior, but unfortunately we've got to handle it
as well. Not sure how that should happen though.
Dan
> I can provide sample wpa_supplicant.log and wireshark traces if of interest.
>
> I just built wireless-compat 20101119 with DEBUG_VERBOSE and can get
> details if needed.
>
> Last but not least I tried with Windows. Windows is able to connected
> even to the 2.4GHz channels. I've monitored the channel with my linux
> machine while windows connected to the 2.4GHz AP. All I see are
> unanswered probes also, but Windows seems to simply send an
> authentication request afterwards and gets an answer then.
>
> I can't figure out how CISCO hopes that a client behaves to cooperate
> well with this feature.
>
> I'm sorry that I'm not very proficient with all that wireless stuff yet,
> but I'll try to improve and help as good as possible if that's appreciated.
>
> With kind regards,
> Wolfgang Breyha
> University of Vienna
On Fri, Nov 26, 2010 at 12:24:42AM +0100, Wolfgang Breyha wrote:
> I did some more tests meanwhile. After hacking mac80211 to not send the
> direct probe I was able to connect to 2.4GHz again as I already noted. What
> I didn't recognize initially was that the AP responded to probes afterwards
> for some time. But after a short time (0-5 Minutes) it stopped responding
> again. I think that's the way load balancing works with CISCO APs.
Interesting.. I'm not sure whether that would get many stations leaving
the current AP, so there may be other more aggressive options that the
AP ends up using in the end, though. I think that mac80211 was just
modified to use another AP probing mechanism (data nullfunc instead of
Probe Request), so mac80211-based drivers may not react to the probe
response changes while associated anymore.
> wpa_supplicant deauthenticates then with "due to inactivity" and blacklists
> the AP. And this is another case in which the same AP is tried again at the
> next reconnect attempt because the blacklist count reaches only 1 in
> events.c:wpa_supplicant_event_disassoc():1298
> I tried to change events.c:472ff
>         e = wpa_blacklist_get(wpa_s, bss->bssid);
>         if (e && e->count > 1) {
>                 wpa_printf(MSG_DEBUG, " skip - blacklisted");
>                 return 0;
>         }
> to "e->count >= 1" and had better results since a BSSID is never tried
> again in the following retry. But your commitdiffs let me guess that it is
> wanted in some other cases I'm not aware of.
Yes, but that only applies for the case where more than a single network
configuration block is enabled. I changed wpa_supplicant to change
between 0 and 1 in this check based on the number of enabled networks.
In addition, I extended the optimized scan after auth/assoc failure
mechanism to apply for the disconnection event, too. That should speed
up recovery from this type of situation quite a bit.
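As a rough illustration of the limit described above, here is a minimal sketch. The function names blacklist_limit() and skip_bss() are made up for this example and are not the wpa_supplicant API; the only thing taken from the discussion is that the limit is 0 with a single enabled network and 1 otherwise.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: with exactly one enabled network block a BSS can
 * be skipped after its first failure (limit 0). With several network
 * blocks the older limit of 1 is kept.
 */
int blacklist_limit(int enabled_networks)
{
    assert(enabled_networks >= 0);
    return enabled_networks == 1 ? 0 : 1;
}

/* True if a BSS with this blacklist count should be skipped in selection. */
bool skip_bss(int blacklist_count, int enabled_networks)
{
    return blacklist_count > blacklist_limit(enabled_networks);
}
```

With one enabled network, a single failure is enough to skip the BSS on the next attempt; with two or more, the BSS is only skipped once its count exceeds 1.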
> Last but not least I talked to my colleague some days ago and he told me that
> "band select" is not a feature he needs desperately. But "load balancing"
> is indeed needed for our large audiences with up to 750 people. If the
> decision is left to the clients alone some APs are pretty overcrowded very
> fast.
Could you please send me a wireless capture log with some of the Beacon
and Probe Response frames from those APs? I would like to see what kind
of information they advertise and whether there would be anything worth
using in BSS selection to avoid being kicked off from the network based
on more aggressive load balancing mechanisms.
--
Jouni Malinen PGP id EFC895FA
On Tue, Nov 23, 2010 at 05:13:40PM +0100, Wolfgang Breyha wrote:
> The patch from Helmut didn't change anything. I even tried to send both
> broadcast and direct probes in triples to check if that's the threshold
> which is configured in band select as retries. It's not;-)
>
> After that I tried a dirty hack on wpa_supplicant 0.7.3:
...
> In other words I reused the code found in sme_event_assoc_reject() to
> add the BSSID to the blacklist. To speed up things further I add it
> twice;-) I don't know why wpa_supplicant needs a blacklist count of 2 to
> finally try an other BSSID.
Thanks! This was indeed one of the problems (but not the only one). The
1 vs. 2 part comes from five years ago (needed to go through the commit
log messages to remember that one..). It avoids getting stuck with worse
networks when multiple network blocks are configured. So yes,
incrementing the blacklist count by two is indeed the way to go here in
some cases.
I simulated the five most likely ways current APs could attempt to
implement load balancing and fixed/optimized those in sme.c. Please take
a look at the following commits if you want to see more details:
http://w1.fi/gitweb/gitweb.cgi?p=hostap.git;a=commitdiff;h=7e6646c794ccd1df8d38b9927d11e101c0d45517
http://w1.fi/gitweb/gitweb.cgi?p=hostap.git;a=commitdiff;h=f47d639d495b32f0348c09a0fd0ff5b5791720d4
With those in place, it should now be possible to recover from the
authentication failure (with mac80211, the unanswered Probe Request looks
like an auth timeout) or association failure (e.g., AP rejecting association
with status code 17) in about 0.5 seconds or so (or a bit more if there
are APs on multiple channels). Though, please note that this is only the
case with nl80211 as the driver interface (-Dnl80211). WEXT will still
go through three full scans in this type of case (i.e., two full scans
to recover vs. one scan with just the known channels when using
nl80211).
> And it helps a lot. With this change wpa_supplicant stops retrying the
> same BSSID all the time and tries a 5GHz one pretty fast. And I think
> that's exactly what CISCO tries to achieve.
Yes, I would assume so.
> Finally there is another timeout in the EAP stage (SUPP_BE) I can't
> pinpoint. I attached the wpa_supplicant.log.
That looks like a lost EAPOL packet to me based on that log.. Would
likely need to use a wireless sniffer to take a closer look at where the
packet is dropped.
> Knowing where to search and how to hack mac80211 and wpa_supplicant I'll
> try to find some details which probes CISCO responds to reaching the
> threshold.
I don't think that that would be very critical to figure out anymore
with the current wpa_supplicant (-Dnl80211). Sure, we could consider
removing the need-a-probe-response-before-auth case from mac80211, but
actually, in this particular case, it would result in not following the
not-so-gentle hint from the AP.
> I can still provide packet traces if you need/want them. In case of the
> load balancing feature it may take some time because I've not found a
> trick to provoke it. But I think a well and fast trained blacklist will
> help in this case, too.
For band enforcement, I think the behavior is clear enough and no
additional information is needed. I can easily simulate this type of
behavior by modifying hostapd. For load balancing while being
associated, it would be interesting to hear if it behaves badly, i.e.,
if there is a long gap in connectivity or other user-visible badness. I
would assume I can easily simulate those for testing, but to do that, I
would need to first see how the particular AP/network is behaving.
--
Jouni Malinen PGP id EFC895FA
On Wed, Nov 24, 2010 at 03:56:32PM +0100, Wolfgang Breyha wrote:
> I have proof now, that the APs respond to authentication requests regardless
> of a successful probe before. Simply skipping the direct probe is sufficient
> to connect successfully.
Interestingly enough, that seems to be exactly what the current
wireless-testing.git snapshot is doing.. I was trying to reproduce this
issue by modifying my AP not to reply to Probe Request frames on 2.4 GHz
band and did not see any problems in getting connected. The station saw
both the 2.4 and 5 GHz BSSes from the AP and the 2.4 GHz BSS was
selected based on signal strength. mac80211 went through the
authentication and association frame exchanges without any problems (and
without sending out a directed Probe Request frame). Whether this change
was done by design is another question, but at least this seems to be
the current behavior.
There may still be some issues in wpa_supplicant blacklist handling
which I will try to reproduce in some other way since the station should
have actually managed to follow the not so polite hint from the AP and
try to use the 5 GHz band here.
--
Jouni Malinen PGP id EFC895FA
Hi again;-)
I have proof now, that the APs respond to authentication requests regardless
of a successful probe before. Simply skipping the direct probe is sufficient
to connect successfully.
All my efforts to find a way to get a response to the probe requests were
unsuccessful. Maybe that's something only aironet devices handle correctly?
Greetings,
Wolfgang
--
Wolfgang Breyha <[email protected]> | http://www.blafasel.at/
Vienna University Computer Center | Austria
On 2010-11-25 22:18, Jouni Malinen wrote:
> I simulated the five most likely ways current APs could attempt to
> implement load balancing and fixed/optimized those in sme.c. Please take
> a look at following commits if you want to see more details:
Sure! I definitely will try a git checkout tomorrow and report back!
I did some more tests meanwhile. After hacking mac80211 to not send the
direct probe I was able to connect on 2.4GHz again, as I already noted. What
I didn't notice initially was that the AP responded to probes afterwards
for some time. But after a short time (0-5 minutes) it stopped responding
again. I think that's the way load balancing works with CISCO APs.
wpa_supplicant then deauthenticates with "due to inactivity" and blacklists
the AP. And this is another case in which the same AP is tried again at the
next reconnect attempt because the blacklist count reaches only 1 in
events.c:wpa_supplicant_event_disassoc():1298
I tried to change events.c:472ff
        e = wpa_blacklist_get(wpa_s, bss->bssid);
        if (e && e->count > 1) {
                wpa_printf(MSG_DEBUG, " skip - blacklisted");
                return 0;
        }
to "e->count >= 1" and had better results since a BSSID is never tried
again in the following retry. But your commitdiffs make me suspect that the
old behavior is wanted in some other cases I'm not aware of.
Last but not least I talked to my colleague some days ago and he told me that
"band select" is not a feature he needs desperately. But "load balancing"
is indeed needed for our large audiences with up to 750 people. If the
decision is left to the clients alone some APs are pretty overcrowded very
fast.
We decided to keep both features active as long as I can help to get the
issues fixed for Linux. Afterwards we most likely will deactivate "band
select" until the fixes find their way into common Ubuntus, Fedoras & Co.
Greetings,
Wolfgang
--
Wolfgang Breyha <[email protected]> | http://www.blafasel.at/
Vienna University Computer Center | Austria
On Saturday 20 November 2010, Jouni Malinen wrote:
> On Fri, Nov 19, 2010 at 10:22:32PM +0100, Wolfgang Breyha wrote:
> > In my case I see 10 BSSIDs for this SSID. 2 strong 2.4GHz APs and the
> > first 5GHz AP appears on third position reception wise. wpa_supplicant
> > starts authentication at the strongest. Then I see a probe request for
> > the SSID in wireshark, but no response from the selected BSSID. No
> > authentication packet is seen from wireshark.
>
> OK, this is all expected thanks to the silly AP design.
I'm wondering if the Cisco APs would reply to direct probe requests (with
the bssid being set instead of the broadcast address). At least we're sending
broadcast probes before authentication (in case we did not receive a probe
response from this AP yet during a previous scan):
        /*
         * Direct probe is sent to broadcast address as some APs
         * will not answer to direct packet in unassociated state.
         */
        ieee80211_send_probe_req(sdata, NULL, wk->probe_auth.ssid,
                                 wk->probe_auth.ssid_len, NULL, 0);
I guess this was introduced to work around another strange AP behavior.
If the Cisco APs would reply to direct probes we could (as a workaround) just
send an additional direct probe here. I agree with Jouni that the AP behavior
is just stupid but the users will blame Linux for not being able to connect
and not the AP vendor.
Wolfgang, could you please try the (untested) patch below if it makes any
difference?
Helmut
---
diff --git a/net/mac80211/work.c b/net/mac80211/work.c
index ae344d1..57ae8d5 100644
--- a/net/mac80211/work.c
+++ b/net/mac80211/work.c
@@ -467,6 +467,9 @@ ieee80211_direct_probe(struct ieee80211_work *wk)
          */
         ieee80211_send_probe_req(sdata, NULL, wk->probe_auth.ssid,
                                  wk->probe_auth.ssid_len, NULL, 0);
+        ieee80211_send_probe_req(sdata, wk->filter_ta, wk->probe_auth.ssid,
+                                 wk->probe_auth.ssid_len, NULL, 0);
+
         wk->timeout = jiffies + IEEE80211_AUTH_TIMEOUT;
         run_again(local, wk->timeout);
On Sat, Nov 20, 2010 at 05:49:30PM +0100, Wolfgang Breyha wrote:
> We tried to deactivate "band select" already and my laptop was able to
> connect instantly to the nearest 2.4GHz AP. But that's the point where the
> second "feature" kicks in. "load balancing" is then used by the APs to kick
> stations trying to push them to an other AP. At this point Linux clients
> have troubles again to get a stable connection.
Could you please send a wpa_supplicant debug log for this one, too? If
the AP is just kicking out the station, we should be able to blacklist
the AP and try to find someone else..
I've seen a number of issues with load balancing designs in the past, but
I think I fixed many of them the last time I was testing this.. It's a
bit of a pity that I don't have this type of "enterprise AP" test setup at
home to remind me that things are not working. I usually end up
debugging this only when traveling and getting hit by connectivity
issues myself.
--
Jouni Malinen PGP id EFC895FA