2011-06-03 20:16:48

by Ignacy Gawedzki

[permalink] [raw]
Subject: sta_find_ibss (active_ibss=1) in a loop?

Hi,

I'm still investigating on that problem but have such a hard time reproducing
it that I thought maybe someone here could help.

I have a testbed of three nodes in IBSS (using carl9170 from compat-wireless'
latest snapshot), that happen to go into suspend-to-ram quite often.
Sometimes it appears one of the nodes just can't "see" some other node.
Then tcpdump doesn't see it either, but another vif in monitor mode sees
everyone just fine.

I activated as much debug options as I could and I saw something quite strange
in the kernel messages (just after the node goes back from suspend-to-ram):

[16226.464987] ieee80211 phy14: device no longer idle - in use
[16226.465043] wlan1: sta_find_ibss (active_ibss=0)
[16226.465055] sta_find_ibss: selected 83:12:97:63:54:f2 current
00:00:00:00:00:00
[16226.465063] wlan1: Selected IBSS BSSID 83:12:97:63:54:f2 based on
configured SSID
[16226.632608] ieee80211 phy14: Adding new IBSS station 00:26:f2:ed:60:f9
(dev=wlan1)
[16226.632648] ieee80211 phy14: Allocated STA 00:26:f2:ed:60:f9
[16226.633549] ieee80211 phy14: Added IBSS STA 00:26:f2:ed:60:f9
[16226.633598] wlan1: No active IBSS STAs - trying to scan for other IBSS
networks with same SSID (merge)
[16226.633626] ieee80211 phy14: Finished adding IBSS STA 00:26:f2:ed:60:f9
[16226.948957] ieee80211 phy14: Adding new IBSS station 00:24:b2:5c:bb:b5
(dev=wlan1)
[16226.949004] ieee80211 phy14: Allocated STA 00:24:b2:5c:bb:b5
[16226.950495] ieee80211 phy14: Added IBSS STA 00:24:b2:5c:bb:b5
[16226.950593] ieee80211 phy14: Finished adding IBSS STA 00:24:b2:5c:bb:b5
[16228.050250] wlan1: updated supp_rates set for 00:24:b2:5c:bb:b5 based on
beacon/probe_resp (0x1 -> 0xfff)
[16229.180184] ieee80211 phy14: Removed STA 00:24:b2:5c:bb:b5
[16229.180339] ieee80211 phy14: Destroyed STA 00:24:b2:5c:bb:b5
[16229.196176] ieee80211 phy14: Removed STA 00:26:f2:ed:60:f9
[16229.196322] ieee80211 phy14: Destroyed STA 00:26:f2:ed:60:f9
[16229.199085] ieee80211 phy14: Adding new IBSS station 00:26:f2:ed:60:f9
(dev=wlan1)
[16229.199113] ieee80211 phy14: Allocated STA 00:26:f2:ed:60:f9
[16229.199845] ieee80211 phy14: Added IBSS STA 00:26:f2:ed:60:f9
[16229.212204] ieee80211 phy14: device now idle
[16229.212222] ieee80211 phy14: Finished adding IBSS STA 00:26:f2:ed:60:f9
[16229.212909] ieee80211 phy14: device no longer idle - in use
[16229.212956] wlan1: sta_find_ibss (active_ibss=1)
[16229.277675] wlan1: sta_find_ibss (active_ibss=1)
[16229.311661] wlan1: sta_find_ibss (active_ibss=1)
[16229.379920] wlan1: sta_find_ibss (active_ibss=1)
[16229.414232] wlan1: sta_find_ibss (active_ibss=1)
...

the following is just a long stream of sta_find_ibss (active_ibss=1).

>From that point on, the interface sees 00:26:f2:ed:60:f9 but not
00:24:b2:5c:bb:b5. BTW, why are both neighbors removed? When this problem is
not appearing, both neighbors are still removed, but they're both added
immediately after that (and there's no stream of sta_find_ibss). Here only
00:26:f2:ed:60:f9 is added back.

As I read the code in net/mac80211/ibss.c to try to understand how things
work, I noticed the following in ieee80211_sta_find_ibss(). Apparently the
function does nothing if the call to ieee80211_sta_active_ibss() returns
non-zero, i.e. the last received frame from a known station was too recent.

So I wonder, what happens if state somehow becomes IEEE80211_IBSS_MLME_SEARCH
and a neighbor station is sending plenty of frames. Apparently we don't come
out of this state, since we don't get a chance to set it back to
IEEE80211_IBSS_MLME_JOINED.

Am I obviously missing something here?

I can't confirm right now that this is indeed what's happening, but I'll post
more information as soon as possible. Besides, if anyone has any suggestion
regarding what additional debug info could be useful, please go ahead.

Thanks,

Ignacy

--
NO CARRIER


2011-06-06 16:11:35

by Johannes Berg

[permalink] [raw]
Subject: Re: sta_find_ibss (active_ibss=1) in a loop?

On Mon, 2011-06-06 at 18:01 +0200, Ignacy Gawedzki wrote:

> > @@ -999,7 +1002,6 @@ int ieee80211_ibss_leave(struct ieee8021
> > kfree_skb(skb);
> >
> > skb_queue_purge(&sdata->skb_queue);
> > - memset(sdata->u.ibss.bssid, 0, ETH_ALEN);
> > sdata->u.ibss.ssid_len = 0;
>
> Shouldn't that second line also move up with the memset?

Not really necessary -- that's not used in the RX path. But yeah we can
do that for consistency. Also maybe the skb queue purge.

> > del_timer_sync(&sdata->u.ibss.timer);
>
> So this fix relies on the fact that reading sdata->i.ibss.state is an atomic
> operation, right?
>
> Thanks for the patch, I'll give it a try, though it may be some time until I
> report further on this, since the problem is difficult to reproduce.

Oh ok. Well, I think we should put these changes in anyway since I see
nothing preventing the race now.

johannes


2011-06-08 11:26:10

by Johannes Berg

[permalink] [raw]
Subject: Re: sta_find_ibss (active_ibss=1) in a loop?

On Wed, 2011-06-08 at 13:13 +0200, Ignacy Gawedzki wrote:
> On Mon, Jun 06, 2011 at 06:11:32PM +0200, thus spake Johannes Berg:
> > Oh ok. Well, I think we should put these changes in anyway since I see
> > nothing preventing the race now.
>
> After having applied this patch, I haven't seen the problem appear for two
> days (and two nights) of testing. I suppose this sort of confirms the fix.

Thanks. I just sent out the patch officially.

johannes


2011-06-06 16:01:20

by Ignacy Gawedzki

[permalink] [raw]
Subject: Re: sta_find_ibss (active_ibss=1) in a loop?

On Mon, Jun 06, 2011 at 04:31:56PM +0200, thus spake Johannes Berg:
> On Mon, 2011-06-06 at 13:04 +0200, Ignacy Gawedzki wrote:
> > Is flushing the local->sta_list in ieee80211_ibss_join() a good solution?
>
> I think we should see how this situation happens. cfg80211 shouldn't be
> allowing a join while an IBSS is already joined, so the above situation
> you describe shouldn't be possible. Also, I don't see that it is
> possible that a station is added while in search state or not joined
> (BSSID should be zeroes).
>
> Then again, it looks like it could be a race.

Given how difficult it is to reproduce, it smells like a race to me.

> A station could be added
> while _leave() is in the middle of the flush. This is because
> ieee80211_ibss_add_sta() can't hold the mutex.

I see.

> Thus, the proper fix might be this:
>
> --- wireless-testing.orig/net/mac80211/ibss.c 2011-06-06 12:25:06.000000000 +0200
> +++ wireless-testing/net/mac80211/ibss.c 2011-06-06 16:31:26.000000000 +0200
> @@ -965,6 +965,9 @@ int ieee80211_ibss_leave(struct ieee8021
>
> mutex_lock(&sdata->u.ibss.mtx);
>
> + memset(sdata->u.ibss.bssid, 0, ETH_ALEN);
> + sdata->u.ibss.state = IEEE80211_IBSS_MLME_SEARCH;
> +
> active_ibss = ieee80211_sta_active_ibss(sdata);
>
> if (!active_ibss && !is_zero_ether_addr(ifibss->bssid)) {
> @@ -999,7 +1002,6 @@ int ieee80211_ibss_leave(struct ieee8021
> kfree_skb(skb);
>
> skb_queue_purge(&sdata->skb_queue);
> - memset(sdata->u.ibss.bssid, 0, ETH_ALEN);
> sdata->u.ibss.ssid_len = 0;

Shouldn't that second line also move up with the memset?

>
> del_timer_sync(&sdata->u.ibss.timer);

So this fix relies on the fact that reading sdata->i.ibss.state is an atomic
operation, right?

Thanks for the patch, I'll give it a try, though it may be some time until I
report further on this, since the problem is difficult to reproduce.

--
The groove will take you through times without money
much better than money will take you through times without groove.

2011-06-06 11:04:32

by Ignacy Gawedzki

[permalink] [raw]
Subject: Re: sta_find_ibss (active_ibss=1) in a loop?

On Fri, Jun 03, 2011 at 10:49:39PM +0200, thus spake Ignacy Gawedzki:
> On Fri, Jun 03, 2011 at 10:16:44PM +0200, thus spake Ignacy Gawedzki:
> > I can't confirm right now that this is indeed what's happening, but I'll post
> > more information as soon as possible.
>
> I can now confirm that the state is MLME_SEARCH and that the ignored station
> can't be added because of the following in ieee80211_ibss_add_sta():
>
> if (ifibss->state == IEEE80211_IBSS_MLME_SEARCH)
> return NULL;

After looking more in details at the code, I think I'm getting closer to a
proper diagnostic.

The state is changed to MLME_SEARCH when ieee80211_ibss_join() is called from
ieee80211_join_ibss() in cfg.c. I suppose this is what's called when an
application asks the kernel to make the interface join an IBSS. Since in that
state no new STA can be added to local->sta_list, if there were no STA at all,
there's no last_rx to update and consequently ieee80211_sta_active_ibss()
returns 0 and ieee80211_sta_find_ibss() may proceed.

But apparently, if there are still some remaining STAs in local->sta_list,
their last_rx is updated upon reception of a frame from them, so if they are
transmitting frames often enough, ieee80211_sta_active_ibss() won't ever
return 0 and ieee80211_sta_find_ibss() won't be able to proceed at all, which
will keep the interface in the MLME_SEARCH state and hence the loop.

So how comes there are remaining STAs in local->sta_list? When
ieee80211_ibss_leave() is called from ieee80211_leave_ibss() in cfg.c,
sta_info_flush() is called and there are no remaining STAs. But what if
ieee80211_ibss_join() is called without prior call to ieee80211_ibss_leave()?
I suppose this is when it goes wrong.

Is flushing the local->sta_list in ieee80211_ibss_join() a good solution?

Please just tell me and I'll happily post a patch.

Thanks,

Ignacy

--
The groove will take you through times without money
much better than money will take you through times without groove.

2011-06-08 11:13:49

by Ignacy Gawedzki

[permalink] [raw]
Subject: Re: sta_find_ibss (active_ibss=1) in a loop?

On Mon, Jun 06, 2011 at 06:11:32PM +0200, thus spake Johannes Berg:
> Oh ok. Well, I think we should put these changes in anyway since I see
> nothing preventing the race now.

After having applied this patch, I haven't seen the problem appear for two
days (and two nights) of testing. I suppose this sort of confirms the fix.

Thanks again.

Ignacy

--
To err is human, to moo bovine.

2011-06-03 20:49:42

by Ignacy Gawedzki

[permalink] [raw]
Subject: Re: sta_find_ibss (active_ibss=1) in a loop?

On Fri, Jun 03, 2011 at 10:16:44PM +0200, thus spake Ignacy Gawedzki:
> I can't confirm right now that this is indeed what's happening, but I'll post
> more information as soon as possible.

I can now confirm that the state is MLME_SEARCH and that the ignored station
can't be added because of the following in ieee80211_ibss_add_sta():

if (ifibss->state == IEEE80211_IBSS_MLME_SEARCH)
return NULL;

--
I have not lost my mind, it's backed up on disk somewhere.

2011-06-06 14:32:00

by Johannes Berg

[permalink] [raw]
Subject: Re: sta_find_ibss (active_ibss=1) in a loop?

On Mon, 2011-06-06 at 13:04 +0200, Ignacy Gawedzki wrote:

> After looking more in details at the code, I think I'm getting closer to a
> proper diagnostic.

Thanks for digging into this.

> The state is changed to MLME_SEARCH when ieee80211_ibss_join() is called from
> ieee80211_join_ibss() in cfg.c. I suppose this is what's called when an
> application asks the kernel to make the interface join an IBSS. Since in that
> state no new STA can be added to local->sta_list, if there were no STA at all,
> there's no last_rx to update and consequently ieee80211_sta_active_ibss()
> returns 0 and ieee80211_sta_find_ibss() may proceed.
>
> But apparently, if there are still some remaining STAs in local->sta_list,
> their last_rx is updated upon reception of a frame from them, so if they are
> transmitting frames often enough, ieee80211_sta_active_ibss() won't ever
> return 0 and ieee80211_sta_find_ibss() won't be able to proceed at all, which
> will keep the interface in the MLME_SEARCH state and hence the loop.
>
> So how comes there are remaining STAs in local->sta_list? When
> ieee80211_ibss_leave() is called from ieee80211_leave_ibss() in cfg.c,
> sta_info_flush() is called and there are no remaining STAs. But what if
> ieee80211_ibss_join() is called without prior call to ieee80211_ibss_leave()?
> I suppose this is when it goes wrong.

I don't think this can happen.

> Is flushing the local->sta_list in ieee80211_ibss_join() a good solution?

I think we should see how this situation happens. cfg80211 shouldn't be
allowing a join while an IBSS is already joined, so the above situation
you describe shouldn't be possible. Also, I don't see that it is
possible that a station is added while in search state or not joined
(BSSID should be zeroes).

Then again, it looks like it could be a race. A station could be added
while _leave() is in the middle of the flush. This is because
ieee80211_ibss_add_sta() can't hold the mutex.

Thus, the proper fix might be this:

--- wireless-testing.orig/net/mac80211/ibss.c 2011-06-06 12:25:06.000000000 +0200
+++ wireless-testing/net/mac80211/ibss.c 2011-06-06 16:31:26.000000000 +0200
@@ -965,6 +965,9 @@ int ieee80211_ibss_leave(struct ieee8021

mutex_lock(&sdata->u.ibss.mtx);

+ memset(sdata->u.ibss.bssid, 0, ETH_ALEN);
+ sdata->u.ibss.state = IEEE80211_IBSS_MLME_SEARCH;
+
active_ibss = ieee80211_sta_active_ibss(sdata);

if (!active_ibss && !is_zero_ether_addr(ifibss->bssid)) {
@@ -999,7 +1002,6 @@ int ieee80211_ibss_leave(struct ieee8021
kfree_skb(skb);

skb_queue_purge(&sdata->skb_queue);
- memset(sdata->u.ibss.bssid, 0, ETH_ALEN);
sdata->u.ibss.ssid_len = 0;

del_timer_sync(&sdata->u.ibss.timer);


johannes