Hi,
After doing insmod/rmmod of libertas, kmemleak found:
unreferenced object 0xe90f1398 (size 64):
comm "rmmod", pid 836, jiffies 4294944467 (age 34.620s)
hex dump (first 32 bytes):
58 cd 24 e9 58 cd 24 e9 02 00 00 00 c0 13 0f e9 X.$.X.$.........
00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<b0747a00>] kmemleak_alloc+0x26/0x44
[<b049eff1>] __kmalloc+0xf3/0x176
[<b0740aae>] cfg80211_disconnected+0x3e/0xc8
[<b06008c9>] lbs_disconnect+0x73/0x86
[<b0600955>] lbs_cfg_disconnect+0x79/0x88
[<b0741c20>] __cfg80211_disconnect+0xf5/0x148
[<b072cd8f>] cfg80211_netdev_notifier_call+0x253/0x452
[<b075781b>] notifier_call_chain+0x2a/0x4b
[<b04344f6>] __raw_notifier_call_chain+0x13/0x15
[<b0434509>] raw_notifier_call_chain+0x11/0x13
[<b06a51ff>] call_netdevice_notifiers+0x41/0x48
[<b06a5247>] __dev_close_many+0x41/0x8b
[<b06a5323>] dev_close_many+0x58/0xa8
[<b06a5401>] rollback_registered_many+0x8e/0x1f8
[<b06a5593>] rollback_registered+0x28/0x34
[<b06a66b0>] unregister_netdevice_queue+0x51/0x6e
By adding some printks I have found that cfg80211_disconnected() does
indeed queue an event to be processed in cfg80211_wq on the eth0
device, but by the time cfg80211_process_rdev_events() is called, eth0
is no longer present in the rdev's netdev_list, so the event never
gets processed (or freed).
Daniel
Hi,
> unreferenced object 0xe90f1398 (size 64):
> backtrace:
> [<b0747a00>] kmemleak_alloc+0x26/0x44
> [<b049eff1>] __kmalloc+0xf3/0x176
> [<b0740aae>] cfg80211_disconnected+0x3e/0xc8
> [<b06008c9>] lbs_disconnect+0x73/0x86
> [<b0600955>] lbs_cfg_disconnect+0x79/0x88
> [<b0741c20>] __cfg80211_disconnect+0xf5/0x148
> [<b072cd8f>] cfg80211_netdev_notifier_call+0x253/0x452
> By adding some printks I have found that cfg80211_disconnected() does
> indeed queue an event to be processed in cfg80211_wq on the eth0
> device, but by the time cfg80211_process_rdev_events() is called, eth0
> is no longer present in the rdev's netdev_list, so the event never
> gets processed (or freed).
This is very odd. What version of the kernel is this?
The strange thing is that we call __cfg80211_disconnect() from the
netdev notifier with NETDEV_GOING_DOWN. This will allocate and queue the
work item as you found. The next thing that happens should be
NETDEV_DOWN, which will cause us to dev_hold() the device and then queue
the cleanup work. The cleanup work must run for us to dev_put() the
device, so that it can only be unregistered after that runs. Then,
finally, we get NETDEV_UNREGISTER which removes it from the list.
Now note that the work item we queue in __cfg80211_disconnect() is
queued *before* the cleanup work, therefore it should also run before
the cleanup work since the workqueue is singlethreaded.
Hence I have no idea how this comes about.
johannes
On Thu, 2012-08-02 at 10:11 -0600, Daniel Drake wrote:
> On Thu, Aug 2, 2012 at 2:02 AM, Johannes Berg <[email protected]> wrote:
> > Oh, hm. I didn't think it could unregister before we give up our
> > reference, but I guess that makes sense after all.
> >
> > I'm not sure there's an easy way to fix it other than making the driver
> > not call cfg80211_disconnected() in case the disconnect was requested by
> > cfg80211 -- that call isn't needed and will not do anything at all, but
> > I'm not sure how easy that would be in the driver?
>
> I guess you've considered clearing all the pending work before
> removing a netdev from the rdev's list?
Yes, but we can't just try to flush the workqueue because of locking
concerns in this area.
Hmm. Then again, I think we can call cfg80211_process_wdev_events() from
case NETDEV_UNREGISTER though, probably after removing from the list.
Maybe you could try that?
> I think a driver modification would be easy, if it is the right solution.
>
> lbs_disconnect() is the function that calls cfg80211_disconnected().
> We only ever call this in 2 contexts:
>
> 1. From our cfg80211_ops.disconnect handler - you say this isn't needed
> 2. From the netdev ndo_stop handler - I guess it is also not necessary
> to inform cfg80211 that we have disconnected at this point, it is kind
> of obvious..?
In fact, you won't get to ndo_stop without cfg80211 calling
disconnect(), because it does that from NETDEV_GOING_DOWN.
> So completely removing the call to cfg80211_disconnected() may be the
> right option here, is that what you recommend?
I'm not 100% sure about the API in this area right now though, it's been
a while and I never worked much with this API (rather than the mac80211
one with auth/assoc/disassoc/deauth.)
johannes
On Thu, 2012-08-02 at 11:25 -0600, Daniel Drake wrote:
> On Thu, Aug 2, 2012 at 10:26 AM, Johannes Berg
> <[email protected]> wrote:
> > Hmm. Then again, I think we can call cfg80211_process_wdev_events() from
> > case NETDEV_UNREGISTER though, probably after removing from the list.
> > Maybe you could try that?
>
> That solves the issue - confirmed that kmemleak now shuts up, and with
> some added printks to confirm event creation and freeing.
> Patch coming up, titled: cfg80211: process pending events when
> unregistering net device
Nice, thanks!
> Even if libertas isn't quite doing the right thing here, I think this
> is the right thing to do. I guess there are other situations, perhaps
> more legitimate, where we can reach this point with events in the
> queue.
Yes, I agree. I just wasn't completely sure this would be OK when I
looked first, and then started wondering why it didn't happen with
mac80211, but that doesn't send the event.
> > I'm not 100% sure about the API in this area right now though, it's been
> > a while and I never worked much with this API (rather than the mac80211
> > one with auth/assoc/disassoc/deauth.)
>
> I think we both feel that removing it is correct. I'll test this when
> I find some free time, and if things seem OK i'll post a libertas
> patch in addition to the cfg80211 fix.
It shouldn't hurt either way, since if the event is there but we're
already disconnected we'll just ignore it, afaict.
johannes
On Thu, Aug 2, 2012 at 2:02 AM, Johannes Berg <[email protected]> wrote:
> Oh, hm. I didn't think it could unregister before we give up our
> reference, but I guess that makes sense after all.
>
> I'm not sure there's an easy way to fix it other than making the driver
> not call cfg80211_disconnected() in case the disconnect was requested by
> cfg80211 -- that call isn't needed and will not do anything at all, but
> I'm not sure how easy that would be in the driver?
I guess you've considered clearing all the pending work before
removing a netdev from the rdev's list?
I think a driver modification would be easy, if it is the right solution.
lbs_disconnect() is the function that calls cfg80211_disconnected().
We only ever call this in 2 contexts:
1. From our cfg80211_ops.disconnect handler - you say this isn't needed
2. From the netdev ndo_stop handler - I guess it is also not necessary
to inform cfg80211 that we have disconnected at this point, it is kind
of obvious..?
So completely removing the call to cfg80211_disconnected() may be the
right option here, is that what you recommend?
Thanks
Daniel
On Wed, Aug 1, 2012 at 11:39 AM, Johannes Berg
<[email protected]> wrote:
> The strange thing is that we call __cfg80211_disconnect() from the
> netdev notifier with NETDEV_GOING_DOWN. This will allocate and queue the
> work item as you found. The next thing that happens should be
> NETDEV_DOWN, which will cause us to dev_hold() the device and then queue
> the cleanup work. The cleanup work must run for us to dev_put() the
> device, so that it can only be unregistered after that runs. Then,
> finally, we get NETDEV_UNREGISTER which removes it from the list.
>
> Now note that the work item we queue in __cfg80211_disconnect() is
> queued *before* the cleanup work, therefore it should also run before
> the cleanup work since the workqueue is singlethreaded.
Here is what happens:
NETDEV_GOING_DOWN
cfg80211_disconnected() called, disconnect event work queued
NETDEV_DOWN
cleanup work queued
NETDEV_UNREGISTER
*** cfg80211_netdev_notifier_call now calls: list_del_rcu(&wdev->list);
disconnect even work runs, calls cfg80211_process_rdev_events() but
the wdev is already removed from rdev->netdev_list as above
cleanup work runs
The bit I marked with *** is what is causing the difficulties - it
runs before the work items do.
Daniel
On Wed, Aug 1, 2012 at 11:39 AM, Johannes Berg
<[email protected]> wrote:
> This is very odd. What version of the kernel is this?
3.3, and just reproduced on 3.5.
I'll have a closer look later in the week.
Thanks
Daniel
On Wed, 2012-08-01 at 17:22 -0600, Daniel Drake wrote:
> Here is what happens:
>
> NETDEV_GOING_DOWN
> cfg80211_disconnected() called, disconnect event work queued
> NETDEV_DOWN
> cleanup work queued
> NETDEV_UNREGISTER
> *** cfg80211_netdev_notifier_call now calls: list_del_rcu(&wdev->list);
> disconnect even work runs, calls cfg80211_process_rdev_events() but
> the wdev is already removed from rdev->netdev_list as above
> cleanup work runs
>
> The bit I marked with *** is what is causing the difficulties - it
> runs before the work items do.
Oh, hm. I didn't think it could unregister before we give up our
reference, but I guess that makes sense after all.
I'm not sure there's an easy way to fix it other than making the driver
not call cfg80211_disconnected() in case the disconnect was requested by
cfg80211 -- that call isn't needed and will not do anything at all, but
I'm not sure how easy that would be in the driver?
johannes
On Thu, Aug 2, 2012 at 10:26 AM, Johannes Berg
<[email protected]> wrote:
> Hmm. Then again, I think we can call cfg80211_process_wdev_events() from
> case NETDEV_UNREGISTER though, probably after removing from the list.
> Maybe you could try that?
That solves the issue - confirmed that kmemleak now shuts up, and with
some added printks to confirm event creation and freeing.
Patch coming up, titled: cfg80211: process pending events when
unregistering net device
Even if libertas isn't quite doing the right thing here, I think this
is the right thing to do. I guess there are other situations, perhaps
more legitimate, where we can reach this point with events in the
queue.
> I'm not 100% sure about the API in this area right now though, it's been
> a while and I never worked much with this API (rather than the mac80211
> one with auth/assoc/disassoc/deauth.)
I think we both feel that removing it is correct. I'll test this when
I find some free time, and if things seem OK i'll post a libertas
patch in addition to the cfg80211 fix.
Thanks
Daniel