2013-06-17 18:49:56

by Ben Greear

Subject: Lots of confusion on bss refcounting.

More on looking for bss and ies leaks...

I am trying to understand the bss refcounting, but everywhere I
look it seems like the code is weird at best.

For instance:

We create an assoc_data, assign a bss pointer in ieee80211_mgd_assoc,
but do not claim a reference.

Later, when deleting the assoc_data, the ref is not freed either,
except in one error path where it is explicitly freed:

	if (!ieee80211_assoc_success(sdata, *bss, mgmt, len)) {
		/* oops -- internal error -- send timeout for now */
		ieee80211_destroy_assoc_data(sdata, false);
		cfg80211_put_bss(sdata->local->hw.wiphy, *bss);
		return RX_MGMT_CFG80211_ASSOC_TIMEOUT;
	}

This seems ripe for bugs, if not already buggy.

Maybe we should be more explicit about always grabbing a ref when
we take a reference to the pointer, and always put it when we
destroy the pointer?

I'll be happy to cook up some patches if this seems like the right
path to take.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com
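
A minimal user-space sketch of the convention proposed in the message above: whoever stores the pointer takes a reference, and destroying the holder always puts it. The names toy_bss, toy_assoc_data and the helper functions are hypothetical stand-ins, not the real cfg80211/mac80211 structures or API, and a plain int replaces the kernel's refcounting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted BSS entry (not struct cfg80211_bss). */
struct toy_bss {
	int refcount;
};

/* Hypothetical stand-in for mac80211's assoc_data. */
struct toy_assoc_data {
	struct toy_bss *bss;
};

static void toy_bss_get(struct toy_bss *bss)
{
	bss->refcount++;
}

static void toy_bss_put(struct toy_bss *bss)
{
	assert(bss->refcount > 0);	/* catch an erroneous extra put */
	bss->refcount--;
}

/* Storing the pointer takes a reference of its own. */
static void toy_assoc_data_init(struct toy_assoc_data *ad, struct toy_bss *bss)
{
	toy_bss_get(bss);
	ad->bss = bss;
}

/* Destroying the holder drops that reference on every path. */
static void toy_assoc_data_destroy(struct toy_assoc_data *ad)
{
	if (ad->bss) {
		toy_bss_put(ad->bss);
		ad->bss = NULL;
	}
}
```

With this shape, every path that destroys the holder balances the count, so no individual error path needs its own explicit put.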



2013-06-18 15:59:36

by Ben Greear

Subject: Re: Lots of confusion on bss refcounting.

On 06/18/2013 08:52 AM, Johannes Berg wrote:
>
>>> You mean ->current_bss? That should be handled in all the callbacks in
>>> sme.c or so
>>
>> Looks like much of the action happens on work-queues. I'm wondering if
>> we managed to delete wdev objects before we have completely cleaned up
>> in some cases...
>
> Don't we flush work structs appropriately?

Looks like it, from core.c in the netdev event handler:

		/*
		 * Ensure that all events have been processed and
		 * freed.
		 */
		cfg80211_process_wdev_events(wdev);

		/* I just added this to see if it helps... */
		if (WARN_ON(wdev->current_bss)) {
			cfg80211_unhold_bss(wdev->current_bss);
			cfg80211_put_bss(wdev->wiphy, &wdev->current_bss->pub);
			SET_BSS(wdev, NULL);
		}
		break;

Some of the unregister and similar sme.c calls that should be cleaning up
the current_bss have early returns if the state does not match the expected
value. If the warning above hits, then we are probably hitting those
somehow.

If not, then I'll keep looking :)

Thanks,
Ben

>
> johannes
>


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com
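
A user-space sketch of the teardown check described in the message above, assuming a hypothetical toy_wdev/toy_bss pair with plain integer counters in place of the real cfg80211 hold and reference counts. It warns when a BSS is still attached at final cleanup and then releases it, so a hit would point at an earlier path that returned without cleaning up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; the real structures live in cfg80211. */
struct toy_bss {
	int refcount;	/* lifetime references */
	int hold;	/* "in use as current BSS" count */
};

struct toy_wdev {
	struct toy_bss *current_bss;
};

/*
 * Final cleanup for a toy wdev. Returns 1 if a leftover current_bss had
 * to be released here (the case the warning is meant to catch), else 0.
 */
static int toy_wdev_teardown(struct toy_wdev *wdev)
{
	struct toy_bss *bss = wdev->current_bss;

	if (!bss)
		return 0;

	fprintf(stderr, "warning: current_bss still set at teardown\n");
	assert(bss->hold > 0 && bss->refcount > 0);
	bss->hold--;		/* undo the hold taken when it became current */
	bss->refcount--;	/* drop the reference the wdev was keeping */
	wdev->current_bss = NULL;
	return 1;
}
```

The return value only exists so the sketch can be checked; in the kernel snippet above, WARN_ON plays that role.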


2013-06-17 19:09:13

by Ben Greear

Subject: Re: Lots of confusion on bss refcounting.

On 06/17/2013 12:02 PM, Johannes Berg wrote:
> On Mon, 2013-06-17 at 11:49 -0700, Ben Greear wrote:
>
>> We create an assoc_data, assign a bss pointer in ieee80211_mgd_assoc,
>> but do not claim a reference.
>
> Heh, yeah, I was actually looking at this code too but didn't have time
> today to finish my thoughts ...
>
>> Later, when deleting the assoc_data, the ref is not freed either,
>> except in one error path where it is explicitly freed:
>>
>> 	if (!ieee80211_assoc_success(sdata, *bss, mgmt, len)) {
>> 		/* oops -- internal error -- send timeout for now */
>> 		ieee80211_destroy_assoc_data(sdata, false);
>> 		cfg80211_put_bss(sdata->local->hw.wiphy, *bss);
>> 		return RX_MGMT_CFG80211_ASSOC_TIMEOUT;
>> 	}
>>
>> This seems ripe for bugs, if not already buggy.
>>
>> Maybe we should be more explicit about always grabbing a ref when
>> we take a reference to the pointer, and always put it when we
>> destroy the pointer?
>
> I think the reference is actually given to mac80211 by cfg80211 in
> cfg80211_mlme_assoc(), so we shouldn't need to grab a reference?
> Although I'm certainly willing to change this and make cfg80211 always
> put the reference after calling rdev_assoc() so that the driver/mac80211
> would be responsible for obtaining its own if it needs to hang on to the
> struct.
>
> This does seem broken though.

The bss reference is passed back, and through luck or careful programming,
it *seems* that all paths related to calling ieee80211_rx_mgmt_assoc_resp
manage to consume the bss.

I haven't figured out yet why this is not an erroneous put since I didn't
find the reference taken in the first place.

I'm going to work on some changes to the ref counting scheme.
I'd rather have the code take and put a few refs it might
otherwise skip, to keep the ownership cleaner and make
the code easier to debug and understand...

I'll post some patches as an RFC once I make some progress.

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2013-06-18 21:36:30

by Ben Greear

Subject: Re: Lots of confusion on bss refcounting.

On 06/18/2013 08:52 AM, Johannes Berg wrote:
>
>>> You mean ->current_bss? That should be handled in all the callbacks in
>>> sme.c or so
>>
>> Looks like much of the action happens on work-queues. I'm wondering if
>> we managed to delete wdev objects before we have completely cleaned up
>> in some cases...
>
> Don't we flush work structs appropriately?

I'm still seeing leaks after all those patches I posted...so I'm still
looking. The leaked bss objects have quite a large refcount, in the
hundreds after an hour or so of running, so this is probably more than
a strange race somewhere.


This code looks questionable in wireless/mlme.c

int __cfg80211_mlme_assoc()
.....

It grabs a reference using cfg80211_get_bss, but it only
does a put on the reference if there was an error code.

__cfg80211_mlme_auth, a bit above it, always does a put on
the reference.

I'm thinking mlme_assoc should also always do put. Any
reason you can think of otherwise?

Thanks,
Ben

>
> johannes
>


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com
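
To illustrate why the put-only-on-error shape questioned above matters, here is a toy model (hypothetical names, plain integer counter) in which the callee takes and later drops its own reference. Under that assumption, a caller that puts only on error leaks one reference per successful call, while a caller that always puts stays balanced. Whether the real __cfg80211_mlme_assoc() leaks depends on whether the callee consumes the caller's reference, which this sketch does not establish.

```c
#include <assert.h>

/* Hypothetical stand-in for a refcounted BSS entry. */
struct toy_bss {
	int refcount;
};

/*
 * Hypothetical callee: holds its own reference for the duration of the
 * attempt and drops it again when done. Always succeeds in this model.
 */
static int toy_assoc(struct toy_bss *bss)
{
	bss->refcount++;
	/* ... association work would happen here ... */
	assert(bss->refcount > 0);
	bss->refcount--;
	return 0;
}

/* Caller shape 1: put the lookup reference only when the call fails. */
static int assoc_put_on_error(struct toy_bss *bss)
{
	int err;

	bss->refcount++;	/* lookup reference */
	err = toy_assoc(bss);
	if (err)
		bss->refcount--;
	return err;
}

/* Caller shape 2: put the lookup reference unconditionally. */
static int assoc_always_put(struct toy_bss *bss)
{
	int err;

	bss->refcount++;	/* lookup reference */
	err = toy_assoc(bss);
	bss->refcount--;
	return err;
}

/* Run n successful attempts and report the resulting count. */
static int toy_run(int always_put, int n)
{
	struct toy_bss bss = { .refcount = 1 };
	int i;

	for (i = 0; i < n; i++) {
		if (always_put)
			assoc_always_put(&bss);
		else
			assoc_put_on_error(&bss);
	}
	return bss.refcount;
}
```

In this model the count grows linearly with the number of successful attempts only when the two sides disagree about who owns the lookup reference.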


2013-06-18 15:47:52

by Ben Greear

Subject: Re: Lots of confusion on bss refcounting.

On 06/18/2013 05:49 AM, Johannes Berg wrote:
> On Mon, 2013-06-17 at 17:30 -0700, Ben Greear wrote:
>> On 06/17/2013 02:31 PM, Ben Greear wrote:
>>> On 06/17/2013 12:09 PM, Ben Greear wrote:
>>>> On 06/17/2013 12:02 PM, Johannes Berg wrote:
>>>
>>>> The bss reference is passed back, and through luck or careful programming,
>>>> it *seems* that all paths related to calling ieee80211_rx_mgmt_assoc_resp
>>>> managed to consume the bss.
>>>>
>>>> I haven't figured out yet why this is not an erroneous put since I didn't
>>>> find the reference taken in the first place.
>>>>
>>>> I'm going to work on making some changes to the ref counting scheme
>>>> a bit. I'd rather have the code perhaps take and put a few refs
>>>> it might otherwise skip to keep the ownership cleaner and make
>>>> the code easier to debug and understand...
>>>>
>>>> I'll post some for RFC when I make some progress.
>>>
>>> I think I found at least some of the leaks.
>>>
>>> In places like ieee80211_mgd_stop, we were calling ieee80211_destroy_assoc_data,
>>> but it was not putting the bss reference.
>>>
>>> I'll post some RFC patches in a minute or two...first is debugging
>>> logic, second attempts to fix bss ref counting. This needs more
>>> testing before it is applied...we will continue testing it....
>>
>> It seems that the wdev objects (struct wireless_dev) can also
>> hold a reference to the bss.
>>
>> Do you happen to know what code is responsible for destructing
>> those objects? I want to check to make sure it properly puts
>> its reference.
>
> You mean ->current_bss? That should be handled in all the callbacks in
> sme.c or so

Looks like much of the action happens on work-queues. I'm wondering if
we manage to delete wdev objects before we have completely cleaned up
in some cases...

I was planning to add code to explicitly clean up current_bss if it
is not NULL in whatever code actually does the final cleanup for
wdev objects.

I didn't see an obvious cleanup method in sme.c, but will go look around
some more...

Thanks,
Ben

>
> johannes
>


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2013-06-17 19:03:04

by Johannes Berg

Subject: Re: Lots of confusion on bss refcounting.

On Mon, 2013-06-17 at 11:49 -0700, Ben Greear wrote:

> We create an assoc_data, assign a bss pointer in ieee80211_mgd_assoc,
> but do not claim a reference.

Heh, yeah, I was actually looking at this code too but didn't have time
today to finish my thoughts ...

> Later, when deleting the assoc_data, the ref is not freed either,
> except in one error path where it is explicitly freed:
>
> 	if (!ieee80211_assoc_success(sdata, *bss, mgmt, len)) {
> 		/* oops -- internal error -- send timeout for now */
> 		ieee80211_destroy_assoc_data(sdata, false);
> 		cfg80211_put_bss(sdata->local->hw.wiphy, *bss);
> 		return RX_MGMT_CFG80211_ASSOC_TIMEOUT;
> 	}
>
> This seems ripe for bugs, if not already buggy.
>
> Maybe we should be more explicit about always grabbing a ref when
> we take a reference to the pointer, and always put it when we
> destroy the pointer?

I think the reference is actually given to mac80211 by cfg80211 in
cfg80211_mlme_assoc(), so we shouldn't need to grab a reference?
Although I'm certainly willing to change this and make cfg80211 always
put the reference after calling rdev_assoc() so that the driver/mac80211
would be responsible for obtaining its own if it needs to hang on to the
struct.

This does seem broken though.

johannes
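
The two ownership contracts discussed in the message above can be sketched side by side. This is a toy model with hypothetical names (toy_bss, toy_dev, and the caller_*/callee_* helpers); it only illustrates who is responsible for each put, and does not reproduce cfg80211_mlme_assoc() or rdev_assoc().

```c
#include <assert.h>
#include <stddef.h>

struct toy_bss {
	int refcount;
};

/* Hypothetical driver/mac80211 side that may keep the pointer. */
struct toy_dev {
	struct toy_bss *bss;
};

static void toy_get(struct toy_bss *b)
{
	b->refcount++;
}

static void toy_put(struct toy_bss *b)
{
	assert(b->refcount > 0);
	b->refcount--;
}

/* Contract 1: on success the callee inherits the caller's reference. */
static int callee_inherits(struct toy_dev *dev, struct toy_bss *bss, int fail)
{
	if (fail)
		return -1;
	dev->bss = bss;		/* no get: the caller's ref is now ours */
	return 0;
}

static int caller_hands_off(struct toy_dev *dev, struct toy_bss *bss, int fail)
{
	int err;

	toy_get(bss);
	err = callee_inherits(dev, bss, fail);
	if (err)
		toy_put(bss);	/* only the error path puts */
	return err;
}

/* Contract 2: the callee takes its own ref; the caller always puts. */
static int callee_takes_own(struct toy_dev *dev, struct toy_bss *bss, int fail)
{
	if (fail)
		return -1;
	toy_get(bss);
	dev->bss = bss;
	return 0;
}

static int caller_always_puts(struct toy_dev *dev, struct toy_bss *bss, int fail)
{
	int err;

	toy_get(bss);
	err = callee_takes_own(dev, bss, fail);
	toy_put(bss);		/* unconditional put */
	return err;
}

/* Under either contract the callee drops the one reference it holds. */
static void toy_dev_release(struct toy_dev *dev)
{
	if (dev->bss) {
		toy_put(dev->bss);
		dev->bss = NULL;
	}
}
```

Both contracts balance on success and on failure; the second costs one extra get/put pair but keeps each side's responsibility local to its own function.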


2013-06-18 12:49:35

by Johannes Berg

Subject: Re: Lots of confusion on bss refcounting.

On Mon, 2013-06-17 at 17:30 -0700, Ben Greear wrote:
> On 06/17/2013 02:31 PM, Ben Greear wrote:
> > On 06/17/2013 12:09 PM, Ben Greear wrote:
> >> On 06/17/2013 12:02 PM, Johannes Berg wrote:
> >
> >> The bss reference is passed back, and through luck or careful programming,
> >> it *seems* that all paths related to calling ieee80211_rx_mgmt_assoc_resp
> >> managed to consume the bss.
> >>
> >> I haven't figured out yet why this is not an erroneous put since I didn't
> >> find the reference taken in the first place.
> >>
> >> I'm going to work on making some changes to the ref counting scheme
> >> a bit. I'd rather have the code perhaps take and put a few refs
> >> it might otherwise skip to keep the ownership cleaner and make
> >> the code easier to debug and understand...
> >>
> >> I'll post some for RFC when I make some progress.
> >
> > I think I found at least some of the leaks.
> >
> > In places like ieee80211_mgd_stop, we were calling ieee80211_destroy_assoc_data,
> > but it was not putting the bss reference.
> >
> > I'll post some RFC patches in a minute or two...first is debugging
> > logic, second attempts to fix bss ref counting. This needs more
> > testing before it is applied...we will continue testing it....
>
> It seems that the wdev objects (struct wireless_dev) can also
> hold a reference to the bss.
>
> Do you happen to know what code is responsible for destructing
> those objects? I want to check to make sure it properly puts
> its reference.

You mean ->current_bss? That should be handled in all the callbacks in
sme.c or so

johannes


2013-06-17 21:31:50

by Ben Greear

Subject: Re: Lots of confusion on bss refcounting.

On 06/17/2013 12:09 PM, Ben Greear wrote:
> On 06/17/2013 12:02 PM, Johannes Berg wrote:

> The bss reference is passed back, and through luck or careful programming,
> it *seems* that all paths related to calling ieee80211_rx_mgmt_assoc_resp
> managed to consume the bss.
>
> I haven't figured out yet why this is not an erroneous put since I didn't
> find the reference taken in the first place.
>
> I'm going to work on making some changes to the ref counting scheme
> a bit. I'd rather have the code perhaps take and put a few refs
> it might otherwise skip to keep the ownership cleaner and make
> the code easier to debug and understand...
>
> I'll post some for RFC when I make some progress.

I think I found at least some of the leaks.

In places like ieee80211_mgd_stop, we were calling ieee80211_destroy_assoc_data,
but it was not putting the bss reference.

I'll post some RFC patches in a minute or two...first is debugging
logic, second attempts to fix bss ref counting. This needs more
testing before it is applied...we will continue testing it....

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2013-06-18 15:52:13

by Johannes Berg

Subject: Re: Lots of confusion on bss refcounting.


> > You mean ->current_bss? That should be handled in all the callbacks in
> > sme.c or so
>
> Looks like much of the action happens on work-queues. I'm wondering if
> we managed to delete wdev objects before we have completely cleaned up
> in some cases...

Don't we flush work structs appropriately?

johannes


2013-06-18 00:30:48

by Ben Greear

Subject: Re: Lots of confusion on bss refcounting.

On 06/17/2013 02:31 PM, Ben Greear wrote:
> On 06/17/2013 12:09 PM, Ben Greear wrote:
>> On 06/17/2013 12:02 PM, Johannes Berg wrote:
>
>> The bss reference is passed back, and through luck or careful programming,
>> it *seems* that all paths related to calling ieee80211_rx_mgmt_assoc_resp
>> managed to consume the bss.
>>
>> I haven't figured out yet why this is not an erroneous put since I didn't
>> find the reference taken in the first place.
>>
>> I'm going to work on making some changes to the ref counting scheme
>> a bit. I'd rather have the code perhaps take and put a few refs
>> it might otherwise skip to keep the ownership cleaner and make
>> the code easier to debug and understand...
>>
>> I'll post some for RFC when I make some progress.
>
> I think I found at least some of the leaks.
>
> In places like ieee80211_mgd_stop, we were calling ieee80211_destroy_assoc_data,
> but it was not putting the bss reference.
>
> I'll post some RFC patches in a minute or two...first is debugging
> logic, second attempts to fix bss ref counting. This needs more
> testing before it is applied...we will continue testing it....

It seems that the wdev objects (struct wireless_dev) can also
hold a reference to the bss.

Do you happen to know what code is responsible for destructing
those objects? I want to check to make sure it properly puts
its reference.

Even with the patches I posted, I still see a few leaked bss objects...

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com