2009-10-07 15:08:03

by Larry Finger

[permalink] [raw]
Subject: [PATCH] b43: Fix locking problem when stopping rfkill polling

In commit 26e5ab35b4c7b1d4cb487a11084520aed9a8d05e entitled "b43: Fix PPC
crash in rfkill polling on unload", the call to stop polling should not have
been placed inside the wl->mutex. The result was incorrect locking messages.

Signed-off-by: Larry Finger <[email protected]>
---

John,

I had not intended for the previous patch to be applied as I was waiting for
the Bugzilla OP to test. He promised to do that today. In any case, that patch
introduced a locking problem that needs to be fixed.

Why do the one-liners cause so many problems?

Larry
---

Index: wireless-testing/drivers/net/wireless/b43/main.c
===================================================================
--- wireless-testing.orig/drivers/net/wireless/b43/main.c
+++ wireless-testing/drivers/net/wireless/b43/main.c
@@ -4501,8 +4501,8 @@ static void b43_op_stop(struct ieee80211

cancel_work_sync(&(wl->beacon_update_trigger));

- mutex_lock(&wl->mutex);
wiphy_rfkill_stop_polling(hw->wiphy);
+ mutex_lock(&wl->mutex);
if (b43_status(dev) >= B43_STAT_STARTED) {
dev = b43_wireless_core_stop(dev);
if (!dev)


2009-10-07 22:44:17

by Larry Finger

[permalink] [raw]
Subject: Re: [PATCH] b43: Fix locking problem when stopping rfkill polling

Johannes Berg wrote:
> This is because when stopping polling we need to cancel the work and
> sync on it, but when starting it's completely async so starting can be
> in any context.

Does that mean that start could be moved outside the mutex_lock so
that the start and stop would look symmetrical in b43?

Larry

2009-10-07 19:29:22

by Larry Finger

[permalink] [raw]
Subject: Re: [PATCH] b43: Fix locking problem when stopping rfkill polling

John W. Linville wrote:
> On Wed, Oct 07, 2009 at 10:06:05AM -0500, Larry Finger wrote:
>> In commit 26e5ab35b4c7b1d4cb487a11084520aed9a8d05e entitled "b43: Fix PPC
>> crash in rfkill polling on unload", the call to stop polling should not have
>> been placed inside the wl->mutex. The result was incorrect locking messages.
>>
>> Signed-off-by: Larry Finger <[email protected]>
>> ---
>>
>> John,
>>
>> I had not intended for the previous patch to be applied as I was waiting for
>> the Bugzilla OP to test. He promised to do that today. In any case, that patch
>> introduced a locking problem that needs to be fixed.
>>
>> Why do the one-liners cause so many problems?
>>
>> Larry
>> ---
>>
>> Index: wireless-testing/drivers/net/wireless/b43/main.c
>> ===================================================================
>> --- wireless-testing.orig/drivers/net/wireless/b43/main.c
>> +++ wireless-testing/drivers/net/wireless/b43/main.c
>> @@ -4501,8 +4501,8 @@ static void b43_op_stop(struct ieee80211
>>
>> cancel_work_sync(&(wl->beacon_update_trigger));
>>
>> - mutex_lock(&wl->mutex);
>> wiphy_rfkill_stop_polling(hw->wiphy);
>> + mutex_lock(&wl->mutex);
>> if (b43_status(dev) >= B43_STAT_STARTED) {
>> dev = b43_wireless_core_stop(dev);
>> if (!dev)
>
> OK, but why do we start polling under the lock but stop polling without
> the lock? Should we start polling without holding the lock too?

I'll test that, but I suspect it doesn't matter. Of course, the reason
I put the stop under the lock was for symmetry, but then I got the
following when shutting down:

b43-phy0 debug: Removing Interface type 2

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.32-rc3-wl #225
-------------------------------------------------------
modprobe/25391 is trying to acquire lock:
(&(&rfkill->poll_work)->work){+.+...}, at: [<ffffffff81054a7f>]
__cancel_work_timer+0xd9/0x224

but task is already holding lock:
(&wl->mutex){+.+.+.}, at: [<ffffffffa02ff3d0>] b43_op_stop+0x30/0x7f
[b43]

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&wl->mutex){+.+.+.}:
[<ffffffff81069790>] __lock_acquire+0x140e/0x174d
[<ffffffff81069b8b>] lock_acquire+0xbc/0xd9
[<ffffffff8128d420>] mutex_lock_nested+0x58/0x29c
[<ffffffffa03150ea>] b43_rfkill_poll+0x3a/0xfc [b43]
[<ffffffffa02c2f33>] ieee80211_rfkill_poll+0x26/0x28 [mac80211]
[<ffffffffa027c028>] cfg80211_rfkill_poll+0x14/0x16 [cfg80211]
[<ffffffffa0271081>] rfkill_poll+0x23/0x3d [rfkill]
[<ffffffff81054224>] worker_thread+0x22c/0x332
[<ffffffff81057fd8>] kthread+0x7d/0x85
[<ffffffff8100caba>] child_rip+0xa/0x20

Moving the stop ooutside the lock cured the problem.

Larry


2009-10-07 22:38:05

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] b43: Fix locking problem when stopping rfkill polling

On Wed, 2009-10-07 at 14:28 -0500, Larry Finger wrote:

> > OK, but why do we start polling under the lock but stop polling without
> > the lock? Should we start polling without holding the lock too?
>
> I'll test that, but I suspect it doesn't matter. Of course, the reason
> I put the stop under the lock was for symmetry, but then I got the
> following when shutting down:
>
> b43-phy0 debug: Removing Interface type 2
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.32-rc3-wl #225
> -------------------------------------------------------
> modprobe/25391 is trying to acquire lock:
> (&(&rfkill->poll_work)->work){+.+...}, at: [<ffffffff81054a7f>]
> __cancel_work_timer+0xd9/0x224

This is because when stopping polling we need to cancel the work and
sync on it, but when starting it's completely async so starting can be
in any context.

johannes


Attachments:
signature.asc (801.00 B)
This is a digitally signed message part

2009-10-07 19:16:45

by John W. Linville

[permalink] [raw]
Subject: Re: [PATCH] b43: Fix locking problem when stopping rfkill polling

On Wed, Oct 07, 2009 at 10:06:05AM -0500, Larry Finger wrote:
> In commit 26e5ab35b4c7b1d4cb487a11084520aed9a8d05e entitled "b43: Fix PPC
> crash in rfkill polling on unload", the call to stop polling should not have
> been placed inside the wl->mutex. The result was incorrect locking messages.
>
> Signed-off-by: Larry Finger <[email protected]>
> ---
>
> John,
>
> I had not intended for the previous patch to be applied as I was waiting for
> the Bugzilla OP to test. He promised to do that today. In any case, that patch
> introduced a locking problem that needs to be fixed.
>
> Why do the one-liners cause so many problems?
>
> Larry
> ---
>
> Index: wireless-testing/drivers/net/wireless/b43/main.c
> ===================================================================
> --- wireless-testing.orig/drivers/net/wireless/b43/main.c
> +++ wireless-testing/drivers/net/wireless/b43/main.c
> @@ -4501,8 +4501,8 @@ static void b43_op_stop(struct ieee80211
>
> cancel_work_sync(&(wl->beacon_update_trigger));
>
> - mutex_lock(&wl->mutex);
> wiphy_rfkill_stop_polling(hw->wiphy);
> + mutex_lock(&wl->mutex);
> if (b43_status(dev) >= B43_STAT_STARTED) {
> dev = b43_wireless_core_stop(dev);
> if (!dev)

OK, but why do we start polling under the lock but stop polling without
the lock? Should we start polling without holding the lock too?

John
--
John W. Linville Someday the world will need a hero, and you
[email protected] might be all we have. Be ready.

2009-10-07 23:09:17

by Michael Büsch

[permalink] [raw]
Subject: Re: [PATCH] b43: Fix locking problem when stopping rfkill polling

On Thursday 08 October 2009 00:43:36 Larry Finger wrote:
> Johannes Berg wrote:
> > This is because when stopping polling we need to cancel the work and
> > sync on it, but when starting it's completely async so starting can be
> > in any context.
>
> Does that mean that start could be moved outside the mutex_lock so
> that the start and stop would look symmetrical in b43?

I think it's already complicated enough what's safe in b43 to do
outside of the locks. Let's not complicate it further. The simple rule is:
Do it locked. The only exception really are syncing functions which sync
for code holding the same lock.

--
Greetings, Michael.

2009-10-08 20:24:18

by Larry Finger

[permalink] [raw]
Subject: Re: [PATCH] b43: Fix locking problem when stopping rfkill polling

On 10/07/2009 02:01 PM, John W. Linville wrote:
> OK, but why do we start polling under the lock but stop polling without
> the lock? Should we start polling without holding the lock too?

I see that this patch was committed. I hope that you will be sending
it upstream for 2.6.32. The correct version has been sent to stable
for inclusion in 2.6.31, but as usual, a mainline commit ID will be
needed.

Larry

2009-10-08 22:46:43

by John W. Linville

[permalink] [raw]
Subject: Re: [PATCH] b43: Fix locking problem when stopping rfkill polling

On Thu, Oct 08, 2009 at 03:23:36PM -0500, Larry Finger wrote:
> On 10/07/2009 02:01 PM, John W. Linville wrote:
>> OK, but why do we start polling under the lock but stop polling without
>> the lock? Should we start polling without holding the lock too?
>
> I see that this patch was committed. I hope that you will be sending it
> upstream for 2.6.32. The correct version has been sent to stable for
> inclusion in 2.6.31, but as usual, a mainline commit ID will be needed.

I've folded this into "b43: Fix PPC crash in rfkill polling on unload"
for sending upstream.

Thanks!

John
--
John W. Linville Someday the world will need a hero, and you
[email protected] might be all we have. Be ready.

2009-10-07 19:47:01

by Michael Büsch

[permalink] [raw]
Subject: Re: [PATCH] b43: Fix locking problem when stopping rfkill polling

On Wednesday 07 October 2009 21:28:40 Larry Finger wrote:
> John W. Linville wrote:
> > On Wed, Oct 07, 2009 at 10:06:05AM -0500, Larry Finger wrote:
> >> In commit 26e5ab35b4c7b1d4cb487a11084520aed9a8d05e entitled "b43: Fix PPC
> >> crash in rfkill polling on unload", the call to stop polling should not have
> >> been placed inside the wl->mutex. The result was incorrect locking messages.
> >>
> >> Signed-off-by: Larry Finger <[email protected]>
> >> ---
> >>
> >> John,
> >>
> >> I had not intended for the previous patch to be applied as I was waiting for
> >> the Bugzilla OP to test. He promised to do that today. In any case, that patch
> >> introduced a locking problem that needs to be fixed.
> >>
> >> Why do the one-liners cause so many problems?
> >>
> >> Larry
> >> ---
> >>
> >> Index: wireless-testing/drivers/net/wireless/b43/main.c
> >> ===================================================================
> >> --- wireless-testing.orig/drivers/net/wireless/b43/main.c
> >> +++ wireless-testing/drivers/net/wireless/b43/main.c
> >> @@ -4501,8 +4501,8 @@ static void b43_op_stop(struct ieee80211
> >>
> >> cancel_work_sync(&(wl->beacon_update_trigger));
> >>
> >> - mutex_lock(&wl->mutex);
> >> wiphy_rfkill_stop_polling(hw->wiphy);
> >> + mutex_lock(&wl->mutex);
> >> if (b43_status(dev) >= B43_STAT_STARTED) {
> >> dev = b43_wireless_core_stop(dev);
> >> if (!dev)
> >
> > OK, but why do we start polling under the lock but stop polling without
> > the lock? Should we start polling without holding the lock too?
>
> I'll test that, but I suspect it doesn't matter. Of course, the reason
> I put the stop under the lock was for symmetry, but then I got the
> following when shutting down:
>
> b43-phy0 debug: Removing Interface type 2
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.32-rc3-wl #225
> -------------------------------------------------------
> modprobe/25391 is trying to acquire lock:
> (&(&rfkill->poll_work)->work){+.+...}, at: [<ffffffff81054a7f>]
> __cancel_work_timer+0xd9/0x224
>
> but task is already holding lock:
> (&wl->mutex){+.+.+.}, at: [<ffffffffa02ff3d0>] b43_op_stop+0x30/0x7f
> [b43]
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (&wl->mutex){+.+.+.}:
> [<ffffffff81069790>] __lock_acquire+0x140e/0x174d
> [<ffffffff81069b8b>] lock_acquire+0xbc/0xd9
> [<ffffffff8128d420>] mutex_lock_nested+0x58/0x29c
> [<ffffffffa03150ea>] b43_rfkill_poll+0x3a/0xfc [b43]
> [<ffffffffa02c2f33>] ieee80211_rfkill_poll+0x26/0x28 [mac80211]
> [<ffffffffa027c028>] cfg80211_rfkill_poll+0x14/0x16 [cfg80211]
> [<ffffffffa0271081>] rfkill_poll+0x23/0x3d [rfkill]
> [<ffffffff81054224>] worker_thread+0x22c/0x332
> [<ffffffff81057fd8>] kthread+0x7d/0x85
> [<ffffffff8100caba>] child_rip+0xa/0x20
>
> Moving the stop ooutside the lock cured the problem.
>

Just move it right after the existing cancel_work_sync() call



--
Greetings, Michael.