2012-07-31 17:36:30

by Daniel Drake

[permalink] [raw]
Subject: cfg80211_disconnected memory leak

Hi,

After doing insmod/rmmod of libertas, kmemleak found:

unreferenced object 0xe90f1398 (size 64):
comm "rmmod", pid 836, jiffies 4294944467 (age 34.620s)
hex dump (first 32 bytes):
58 cd 24 e9 58 cd 24 e9 02 00 00 00 c0 13 0f e9 X.$.X.$.........
00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<b0747a00>] kmemleak_alloc+0x26/0x44
[<b049eff1>] __kmalloc+0xf3/0x176
[<b0740aae>] cfg80211_disconnected+0x3e/0xc8
[<b06008c9>] lbs_disconnect+0x73/0x86
[<b0600955>] lbs_cfg_disconnect+0x79/0x88
[<b0741c20>] __cfg80211_disconnect+0xf5/0x148
[<b072cd8f>] cfg80211_netdev_notifier_call+0x253/0x452
[<b075781b>] notifier_call_chain+0x2a/0x4b
[<b04344f6>] __raw_notifier_call_chain+0x13/0x15
[<b0434509>] raw_notifier_call_chain+0x11/0x13
[<b06a51ff>] call_netdevice_notifiers+0x41/0x48
[<b06a5247>] __dev_close_many+0x41/0x8b
[<b06a5323>] dev_close_many+0x58/0xa8
[<b06a5401>] rollback_registered_many+0x8e/0x1f8
[<b06a5593>] rollback_registered+0x28/0x34
[<b06a66b0>] unregister_netdevice_queue+0x51/0x6e

By adding some printks I have found that cfg80211_disconnected() does
indeed queue an event to be processed in cfg80211_wq on the eth0
device, but by the time cfg80211_process_rdev_events() is called, eth0
is no longer present in the rdev's netdev_list, so the event never
gets processed (or freed).

Daniel


2012-08-01 17:39:35

by Johannes Berg

[permalink] [raw]
Subject: Re: cfg80211_disconnected memory leak

Hi,

> unreferenced object 0xe90f1398 (size 64):

> backtrace:
> [<b0747a00>] kmemleak_alloc+0x26/0x44
> [<b049eff1>] __kmalloc+0xf3/0x176
> [<b0740aae>] cfg80211_disconnected+0x3e/0xc8
> [<b06008c9>] lbs_disconnect+0x73/0x86
> [<b0600955>] lbs_cfg_disconnect+0x79/0x88
> [<b0741c20>] __cfg80211_disconnect+0xf5/0x148
> [<b072cd8f>] cfg80211_netdev_notifier_call+0x253/0x452

> By adding some printks I have found that cfg80211_disconnected() does
> indeed queue an event to be processed in cfg80211_wq on the eth0
> device, but by the time cfg80211_process_rdev_events() is called, eth0
> is no longer present in the rdev's netdev_list, so the event never
> gets processed (or freed).

This is very odd. What version of the kernel is this?

The strange thing is that we call __cfg80211_disconnect() from the
netdev notifier with NETDEV_GOING_DOWN. This will allocate and queue the
work item as you found. The next thing that happens should be
NETDEV_DOWN, which will cause us to dev_hold() the device and then queue
the cleanup work. The cleanup work must run for us to dev_put() the
device, so that it can only be unregistered after that runs. Then,
finally, we get NETDEV_UNREGISTER which removes it from the list.

Now note that the work item we queue in __cfg80211_disconnect() is
queued *before* the cleanup work, therefore it should also run before
the cleanup work since the workqueue is singlethreaded.

Hence I have no idea how this comes about.

johannes


2012-08-02 16:26:33

by Johannes Berg

[permalink] [raw]
Subject: Re: cfg80211_disconnected memory leak

On Thu, 2012-08-02 at 10:11 -0600, Daniel Drake wrote:
> On Thu, Aug 2, 2012 at 2:02 AM, Johannes Berg <[email protected]> wrote:
> > Oh, hm. I didn't think it could unregister before we give up our
> > reference, but I guess that makes sense after all.
> >
> > I'm not sure there's an easy way to fix it other than making the driver
> > not call cfg80211_disconnected() in case the disconnect was requested by
> > cfg80211 -- that call isn't needed and will not do anything at all, but
> > I'm not sure how easy that would be in the driver?
>
> I guess you've considered clearing all the pending work before
> removing a netdev from the rdev's list?

Yes, but we can't just try to flush the workqueue because of locking
concerns in this area.

Hmm. Then again, I think we can call cfg80211_process_wdev_events() from
case NETDEV_UNREGISTER though, probably after removing from the list.
Maybe you could try that?

> I think a driver modification would be easy, if it is the right solution.
>
> lbs_disconnect() is the function that calls cfg80211_disconnected().
> We only ever call this in 2 contexts:
>
> 1. From our cfg80211_ops.disconnect handler - you say this isn't needed
> 2. From the netdev ndo_stop handler - I guess it is also not necessary
> to inform cfg80211 that we have disconnected at this point, it is kind
> of obvious..?

In fact, you won't get to ndo_stop without cfg80211 calling
disconnect(), because it does that from NETDEV_GOING_DOWN.

> So completely removing the call to cfg80211_disconnected() may be the
> right option here, is that what you recommend?

I'm not 100% sure about the API in this area right now though, it's been
a while and I never worked much with this API (rather than the mac80211
one with auth/assoc/disassoc/deauth.)

johannes


2012-08-02 19:57:18

by Johannes Berg

[permalink] [raw]
Subject: Re: cfg80211_disconnected memory leak

On Thu, 2012-08-02 at 11:25 -0600, Daniel Drake wrote:
> On Thu, Aug 2, 2012 at 10:26 AM, Johannes Berg
> <[email protected]> wrote:
> > Hmm. Then again, I think we can call cfg80211_process_wdev_events() from
> > case NETDEV_UNREGISTER though, probably after removing from the list.
> > Maybe you could try that?
>
> That solves the issue - confirmed that kmemleak now shuts up, and with
> some added printks to confirm event creation and freeing.
> Patch coming up, titled: cfg80211: process pending events when
> unregistering net device

Nice, thanks!

> Even if libertas isn't quite doing the right thing here, I think this
> is the right thing to do. I guess there are other situations, perhaps
> more legitimate, where we can reach this point with events in the
> queue.

Yes, I agree. I just wasn't completely sure this would be OK when I
looked first, and then started wondering why it didn't happen with
mac80211, but that doesn't send the event.

> > I'm not 100% sure about the API in this area right now though, it's been
> > a while and I never worked much with this API (rather than the mac80211
> > one with auth/assoc/disassoc/deauth.)
>
> I think we both feel that removing it is correct. I'll test this when
> I find some free time, and if things seem OK i'll post a libertas
> patch in addition to the cfg80211 fix.

It shouldn't hurt either way, since if the event is there but we're
already disconnected we'll just ignore it, afaict.

johannes


2012-08-02 16:11:10

by Daniel Drake

[permalink] [raw]
Subject: Re: cfg80211_disconnected memory leak

On Thu, Aug 2, 2012 at 2:02 AM, Johannes Berg <[email protected]> wrote:
> Oh, hm. I didn't think it could unregister before we give up our
> reference, but I guess that makes sense after all.
>
> I'm not sure there's an easy way to fix it other than making the driver
> not call cfg80211_disconnected() in case the disconnect was requested by
> cfg80211 -- that call isn't needed and will not do anything at all, but
> I'm not sure how easy that would be in the driver?

I guess you've considered clearing all the pending work before
removing a netdev from the rdev's list?


I think a driver modification would be easy, if it is the right solution.

lbs_disconnect() is the function that calls cfg80211_disconnected().
We only ever call this in 2 contexts:

1. From our cfg80211_ops.disconnect handler - you say this isn't needed
2. From the netdev ndo_stop handler - I guess it is also not necessary
to inform cfg80211 that we have disconnected at this point, it is kind
of obvious..?

So completely removing the call to cfg80211_disconnected() may be the
right option here, is that what you recommend?

Thanks
Daniel

2012-08-01 23:22:44

by Daniel Drake

[permalink] [raw]
Subject: Re: cfg80211_disconnected memory leak

On Wed, Aug 1, 2012 at 11:39 AM, Johannes Berg
<[email protected]> wrote:
> The strange thing is that we call __cfg80211_disconnect() from the
> netdev notifier with NETDEV_GOING_DOWN. This will allocate and queue the
> work item as you found. The next thing that happens should be
> NETDEV_DOWN, which will cause us to dev_hold() the device and then queue
> the cleanup work. The cleanup work must run for us to dev_put() the
> device, so that it can only be unregistered after that runs. Then,
> finally, we get NETDEV_UNREGISTER which removes it from the list.
>
> Now note that the work item we queue in __cfg80211_disconnect() is
> queued *before* the cleanup work, therefore it should also run before
> the cleanup work since the workqueue is singlethreaded.

Here is what happens:

NETDEV_GOING_DOWN
cfg80211_disconnected() called, disconnect event work queued
NETDEV_DOWN
cleanup work queued
NETDEV_UNREGISTER
*** cfg80211_netdev_notifier_call now calls: list_del_rcu(&wdev->list);
disconnect even work runs, calls cfg80211_process_rdev_events() but
the wdev is already removed from rdev->netdev_list as above
cleanup work runs

The bit I marked with *** is what is causing the difficulties - it
runs before the work items do.

Daniel

2012-08-01 21:04:40

by Daniel Drake

[permalink] [raw]
Subject: Re: cfg80211_disconnected memory leak

On Wed, Aug 1, 2012 at 11:39 AM, Johannes Berg
<[email protected]> wrote:
> This is very odd. What version of the kernel is this?

3.3, and just reproduced on 3.5.

I'll have a closer look later in the week.

Thanks
Daniel

2012-08-02 08:02:40

by Johannes Berg

[permalink] [raw]
Subject: Re: cfg80211_disconnected memory leak

On Wed, 2012-08-01 at 17:22 -0600, Daniel Drake wrote:

> Here is what happens:
>
> NETDEV_GOING_DOWN
> cfg80211_disconnected() called, disconnect event work queued
> NETDEV_DOWN
> cleanup work queued
> NETDEV_UNREGISTER
> *** cfg80211_netdev_notifier_call now calls: list_del_rcu(&wdev->list);
> disconnect even work runs, calls cfg80211_process_rdev_events() but
> the wdev is already removed from rdev->netdev_list as above
> cleanup work runs
>
> The bit I marked with *** is what is causing the difficulties - it
> runs before the work items do.

Oh, hm. I didn't think it could unregister before we give up our
reference, but I guess that makes sense after all.

I'm not sure there's an easy way to fix it other than making the driver
not call cfg80211_disconnected() in case the disconnect was requested by
cfg80211 -- that call isn't needed and will not do anything at all, but
I'm not sure how easy that would be in the driver?

johannes


2012-08-02 17:25:54

by Daniel Drake

[permalink] [raw]
Subject: Re: cfg80211_disconnected memory leak

On Thu, Aug 2, 2012 at 10:26 AM, Johannes Berg
<[email protected]> wrote:
> Hmm. Then again, I think we can call cfg80211_process_wdev_events() from
> case NETDEV_UNREGISTER though, probably after removing from the list.
> Maybe you could try that?

That solves the issue - confirmed that kmemleak now shuts up, and with
some added printks to confirm event creation and freeing.
Patch coming up, titled: cfg80211: process pending events when
unregistering net device

Even if libertas isn't quite doing the right thing here, I think this
is the right thing to do. I guess there are other situations, perhaps
more legitimate, where we can reach this point with events in the
queue.

> I'm not 100% sure about the API in this area right now though, it's been
> a while and I never worked much with this API (rather than the mac80211
> one with auth/assoc/disassoc/deauth.)

I think we both feel that removing it is correct. I'll test this when
I find some free time, and if things seem OK i'll post a libertas
patch in addition to the cfg80211 fix.

Thanks
Daniel