Hey,
I just managed to deadlock bcm43xx when I was trying to do something
with iwconfig at the same time as telling network manager to shut off
the interface. It stayed up and things were locked up.
I haven't been able to analyse it yet, I dumped process stacks a few
times and put the resulting log at
http://johannes.sipsolutions.net/files/mac80211-bcm43xx-deadlock
Note that many processes are in rtnl_lock() but wpa_supplicant is stuck
in wait_for_completion.
johannes
On Wed, 2007-06-27 at 16:21 +0200, Johannes Berg wrote:
> I shall be trying to add lockdep support for this sort of thing, but
> until then how can we fix this? What exactly does the rtnl protect in
> ieee80211_sta_config_auth?
Hah. My hacked-up lockdep tells me that we're not allowed to flush the
workqueue under rtnl:
[ 164.126015] =======================================================
[ 164.126026] [ INFO: possible circular locking dependency detected ]
[ 164.126037] 2.6.22-rc6 #163
[ 164.126045] -------------------------------------------------------
[ 164.126054] wpa_supplicant/4418 is trying to acquire lock:
[ 164.126062] (khelper){--..}, at: [<c000000000060c8c>] .flush_workqueue+0x48/0xf4
[ 164.126099]
[ 164.126102] but task is already holding lock:
[ 164.126111] (rtnl_mutex){--..}, at: [<c0000000003a4784>] .mutex_lock+0x3c/0x58
[ 164.126146]
[ 164.126148] which lock already depends on the new lock.
[ 164.126152]
[ 164.126161]
[ 164.126163] the existing dependency chain (in reverse order) is:
[ 164.126174]
[ 164.126177] -> #1 (rtnl_mutex){--..}:
[ 164.126202] [<c000000000072434>] .__lock_acquire+0xb8c/0xd68
[ 164.126265] [<c0000000000726b0>] .lock_acquire+0xa0/0xe8
[ 164.126319] [<c0000000003a44b0>] .__mutex_lock_slowpath+0x138/0x3d0
[ 164.126376] [<c0000000003a4784>] .mutex_lock+0x3c/0x58
[ 164.126439] [<c000000000327208>] .rtnl_lock+0x24/0x40
[ 164.126494] [<c000000000328fb4>] .linkwatch_event+0x20/0x70
[ 164.126554] [<c00000000005fe38>] .run_workqueue+0x114/0x22c
[ 164.126608] [<c0000000000612f8>] .worker_thread+0x11c/0x140
[ 164.126657] [<c000000000066230>] .kthread+0x84/0xd4
[ 164.126708] [<c0000000000235ec>] .kernel_thread+0x4c/0x68
[ 164.126767]
[ 164.126770] -> #0 (khelper){--..}:
[ 164.126797] [<c000000000072324>] .__lock_acquire+0xa7c/0xd68
[ 164.126853] [<c0000000000726b0>] .lock_acquire+0xa0/0xe8
[ 164.126903] [<c000000000060cbc>] .flush_workqueue+0x78/0xf4
[ 164.126958] [<d00000000048d374>] .ieee80211_stop+0x1fc/0x3a4 [mac80211]
[ 164.127032] [<c00000000031b8bc>] .dev_close+0xb8/0xfc
[ 164.127086] [<c00000000031a5b0>] .dev_change_flags+0xa4/0x1b4
[ 164.127135] [<c00000000036da9c>] .devinet_ioctl+0x2bc/0x768
[ 164.127186] [<c00000000036e9dc>] .inet_ioctl+0xe4/0x138
[ 164.127247] [<c00000000030d214>] .sock_ioctl+0x2cc/0x324
[ 164.127305] [<c0000000000d7e84>] .do_ioctl+0x68/0x14c
[ 164.127359] [<c0000000000d83b8>] .vfs_ioctl+0x450/0x494
[ 164.127411] [<c0000000000d8450>] .sys_ioctl+0x54/0x94
[ 164.127462] [<c0000000000086d0>] syscall_exit+0x0/0x40
[ 164.127513]
[ 164.127516] other info that might help us debug this:
[ 164.127519]
[ 164.127530] 1 lock held by wpa_supplicant/4418:
[ 164.127538] #0: (rtnl_mutex){--..}, at: [<c0000000003a4784>] .mutex_lock+0x3c/0x58
[ 164.127577]
[ 164.127580] stack backtrace:
[ 164.127589] Call Trace:
[ 164.127600] [c000000111f933a0] [c00000000000f624] .show_stack+0x70/0x1bc (unreliable)
[ 164.127631] [c000000111f93450] [c00000000000f790] .dump_stack+0x20/0x34
[ 164.127655] [c000000111f934d0] [c000000000070310] .print_circular_bug_tail+0x84/0xa8
[ 164.127680] [c000000111f935a0] [c000000000072324] .__lock_acquire+0xa7c/0xd68
[ 164.127705] [c000000111f93690] [c0000000000726b0] .lock_acquire+0xa0/0xe8
[ 164.127729] [c000000111f93750] [c000000000060cbc] .flush_workqueue+0x78/0xf4
[ 164.127753] [c000000111f937f0] [d00000000048d374] .ieee80211_stop+0x1fc/0x3a4 [mac80211]
[ 164.127801] [c000000111f938b0] [c00000000031b8bc] .dev_close+0xb8/0xfc
[ 164.127828] [c000000111f93940] [c00000000031a5b0] .dev_change_flags+0xa4/0x1b4
[ 164.127853] [c000000111f939f0] [c00000000036da9c] .devinet_ioctl+0x2bc/0x768
[ 164.127878] [c000000111f93b00] [c00000000036e9dc] .inet_ioctl+0xe4/0x138
[ 164.127914] [c000000111f93b80] [c00000000030d214] .sock_ioctl+0x2cc/0x324
[ 164.127943] [c000000111f93c20] [c0000000000d7e84] .do_ioctl+0x68/0x14c
[ 164.127972] [c000000111f93cc0] [c0000000000d83b8] .vfs_ioctl+0x450/0x494
[ 164.127999] [c000000111f93d80] [c0000000000d8450] .sys_ioctl+0x54/0x94
[ 164.128024] [c000000111f93e30] [c0000000000086d0] syscall_exit+0x0/0x40
On Wed, 2007-06-27 at 14:48 +0200, Michael Buesch wrote:
> On Wednesday 27 June 2007 11:38:26 Johannes Berg wrote:
> > On Tue, 2007-06-26 at 16:17 +0200, Michael Buesch wrote:
> >
> > > That's a known bug.
> > > See the FIXME at stop_core() where we stop the mac80211-queues.
> > > mac80211 needs to be fixed to fix this. ;)
> > > We need a function to stop and flush the TX queues from
> > > _outside_ of the TX handlers.
> >
> > Are you sure it's the same bug? I don't think we get into
>
> Pretty, yeah. I'm probably going to solve this bug elsewhere.
> Need to think about it.
>
> > ieee80211_if_shutdown from bcm43xx itself.
>
> bcm43xx causes it only imlicitely. Calling ieee80211_stop_queues()
> (from outside of the TX path) will result in a deadlock in the
> qdisc handler, which will result in a complete system freeze on UP.
> So bcm43xx won't show up in the call-trace.
Hmm.
The thing here is:
Jun 26 09:05:56 johannes kernel: [22894.397724] wpa_supplican D 0FCE04A8 0 9032 1 (NOTLB)
Jun 26 09:05:56 johannes kernel: [22894.397733] Call Trace:
Jun 26 09:05:56 johannes kernel: [22894.397738] [dd7b7c00] [c0300210] __mutex_lock_slowpath+0x128/0x1d4 (unreliable)
Jun 26 09:05:56 johannes kernel: [22894.397750] [dd7b7cc0] [c0009434] __switch_to+0x44/0x98
Jun 26 09:05:56 johannes kernel: [22894.397761] [dd7b7ce0] [c02fe940] schedule+0x340/0x6cc
Jun 26 09:05:56 johannes kernel: [22894.397771] [dd7b7d40] [c02ff2e8] wait_for_completion+0xf4/0x138
Jun 26 09:05:56 johannes kernel: [22894.397781] [dd7b7d80] [c004073c] flush_cpu_workqueue+0xb8/0xe4
Jun 26 09:05:56 johannes kernel: [22894.397792] [dd7b7db0] [f24876c4] ieee80211_stop+0x1b4/0x2a8 [mac80211]
Jun 26 09:05:56 johannes kernel: [22894.397825] [dd7b7de0] [c025b158] dev_close+0xa4/0xd8
Jun 26 09:05:56 johannes kernel: [22894.397835] [dd7b7df0] [c025a01c] dev_change_flags+0x64/0x168
Jun 26 09:05:56 johannes kernel: [22894.397844] [dd7b7e10] [c02ab1cc] devinet_ioctl+0x5d0/0x760
Jun 26 09:05:56 johannes kernel: [22894.397861] [dd7b7e80] [c02ab94c] inet_ioctl+0x98/0xbc
Jun 26 09:05:56 johannes kernel: [22894.397871] [dd7b7e90] [c024c6bc] sock_ioctl+0xd8/0x290
Jun 26 09:05:56 johannes kernel: [22894.397881] [dd7b7eb0] [c00a083c] do_ioctl+0x40/0x100
Jun 26 09:05:56 johannes kernel: [22894.397891] [dd7b7ed0] [c00a0c14] vfs_ioctl+0x318/0x4f0
Jun 26 09:05:56 johannes kernel: [22894.397901] [dd7b7f10] [c00a0e2c] sys_ioctl+0x40/0x74
Jun 26 09:05:56 johannes kernel: [22894.397911] [dd7b7f40] [c00116f8] ret_from_syscall+0x0/0x38
Jun 26 09:05:56 johannes kernel: [22894.397921] --- Exception: c01 at 0xfce04a8
Jun 26 09:05:56 johannes kernel: [22894.397931] LR = 0xfce0440
and
Jun 26 09:06:50 johannes kernel: [22948.985461] bcm43xx_mac80 D 00000000 0 806 2 (L-TLB)
Jun 26 09:06:50 johannes kernel: [22948.985470] Call Trace:
Jun 26 09:06:50 johannes kernel: [22948.985474] [eac87c70] [00009032] 0x9032 (unreliable)
Jun 26 09:06:50 johannes kernel: [22948.985484] [eac87d30] [c0009434] __switch_to+0x44/0x98
Jun 26 09:06:50 johannes kernel: [22948.985494] [eac87d50] [c02fe940] schedule+0x340/0x6cc
Jun 26 09:06:50 johannes kernel: [22948.985504] [eac87db0] [c0300184] __mutex_lock_slowpath+0x9c/0x1d4
Jun 26 09:06:50 johannes kernel: [22948.985515] [eac87e00] [c02660f4] rtnl_lock+0x18/0x28
Jun 26 09:06:50 johannes kernel: [22948.985529] [eac87e10] [f2496714] ieee80211_sta_config_auth+0x44/0x2c4 [mac80211]
Jun 26 09:06:50 johannes kernel: [22948.985589] [eac87e60] [f2497e30] ieee80211_sta_work+0xb90/0x1354 [mac80211]
Jun 26 09:06:50 johannes kernel: [22948.985614] [eac87f60] [c0040444] run_workqueue+0xf4/0x1c8
Jun 26 09:06:50 johannes kernel: [22948.985625] [eac87f90] [c0040b14] worker_thread+0x9c/0x120
Jun 26 09:06:50 johannes kernel: [22948.985635] [eac87fd0] [c0044f78] kthread+0x48/0x84
Jun 26 09:06:50 johannes kernel: [22948.985646] [eac87ff0] [c00125f4] kernel_thread+0x44/0x60
I'm not convinced this is the same thing, the deadlock here is simply
that something running on the workqueue is trying to rtnl_lock() while
flushing the workqueue is done under rtnl.
johannes
On Wed, 2007-06-27 at 16:06 +0200, Johannes Berg wrote:
> I'm not convinced this is the same thing, the deadlock here is simply
> that something running on the workqueue is trying to rtnl_lock() while
> flushing the workqueue is done under rtnl.
Looking again, Michael agrees that this is not the same deadlock he's
talking about, so back to the drawing board.
I shall be trying to add lockdep support for this sort of thing, but
until then how can we fix this? What exactly does the rtnl protect in
ieee80211_sta_config_auth?
johannes
On Tuesday 26 June 2007 03:14, Johannes Berg wrote:
> Got it. ieee80211_sta_config_auth is running off the workqueue and doing
> rtnl_lock while ieee80211_if_shutdown is trying to flush the workqueue
> with the rtnl lock held. Nice one, anybody have ideas how to fix? Why
> does ieee80211_sta_config_auth need to take the rtnl anyway?
>
There is potential for racing with userspace without that lock, but it's not
very likely or dangerous, and deadlocking is a whole lot worse. Patch
attached to remove that locking for now until a better solution is found.
Thanks,
-Michael Wu
On Tue, 2007-06-26 at 10:59 +0200, Johannes Berg wrote:
> I haven't been able to analyse it yet, I dumped process stacks a few
> times and put the resulting log at
> http://johannes.sipsolutions.net/files/mac80211-bcm43xx-deadlock
Got it. ieee80211_sta_config_auth is running off the workqueue and doing
rtnl_lock while ieee80211_if_shutdown is trying to flush the workqueue
with the rtnl lock held. Nice one, anybody have ideas how to fix? Why
does ieee80211_sta_config_auth need to take the rtnl anyway?
johannes
On Tue, 2007-06-26 at 16:17 +0200, Michael Buesch wrote:
> That's a known bug.
> See the FIXME at stop_core() where we stop the mac80211-queues.
> mac80211 needs to be fixed to fix this. ;)
> We need a function to stop and flush the TX queues from
> _outside_ of the TX handlers.
Are you sure it's the same bug? I don't think we get into
ieee80211_if_shutdown from bcm43xx itself.
johannes
On Tuesday 26 June 2007 10:59:41 Johannes Berg wrote:
> Hey,
>
> I just managed to deadlock bcm43xx when I was trying to do something
> with iwconfig at the same time as telling network manager to shut off
> the interface. It stayed up and things were locked up.
>
> I haven't been able to analyse it yet, I dumped process stacks a few
> times and put the resulting log at
> http://johannes.sipsolutions.net/files/mac80211-bcm43xx-deadlock
>
> Note that many processes are in rtnl_lock() but wpa_supplicant is stuck
> in wait_for_completion.
>
> johannes
>
That's a known bug.
See the FIXME at stop_core() where we stop the mac80211-queues.
mac80211 needs to be fixed to fix this. ;)
We need a function to stop and flush the TX queues from
_outside_ of the TX handlers.
--
Greetings Michael.
On Wednesday 27 June 2007 11:38:26 Johannes Berg wrote:
> On Tue, 2007-06-26 at 16:17 +0200, Michael Buesch wrote:
>
> > That's a known bug.
> > See the FIXME at stop_core() where we stop the mac80211-queues.
> > mac80211 needs to be fixed to fix this. ;)
> > We need a function to stop and flush the TX queues from
> > _outside_ of the TX handlers.
>
> Are you sure it's the same bug? I don't think we get into
Pretty, yeah. I'm probably going to solve this bug elsewhere.
Need to think about it.
> ieee80211_if_shutdown from bcm43xx itself.
bcm43xx causes it only imlicitely. Calling ieee80211_stop_queues()
(from outside of the TX path) will result in a deadlock in the
qdisc handler, which will result in a complete system freeze on UP.
So bcm43xx won't show up in the call-trace.
--
Greetings Michael.
On Wed, Jun 27, 2007 at 04:21:30PM +0200, Johannes Berg wrote:
> On Wed, 2007-06-27 at 16:06 +0200, Johannes Berg wrote:
>
> > I'm not convinced this is the same thing, the deadlock here is simply
> > that something running on the workqueue is trying to rtnl_lock() while
> > flushing the workqueue is done under rtnl.
>
> Looking again, Michael agrees that this is not the same deadlock he's
> talking about, so back to the drawing board.
>
> I shall be trying to add lockdep support for this sort of thing, but
> until then how can we fix this? What exactly does the rtnl protect in
> ieee80211_sta_config_auth?
I didn't see an answer here. Before we remove it, I'd prefer to know
what we thought we were protecting in the first place. :-)
John
P.S. Don't get me wrong -- less "big kernel lock" usage is probably
better...
--
John W. Linville
[email protected]
On Wednesday 18 July 2007 11:13, John W. Linville wrote:
> I didn't see an answer here. Before we remove it, I'd prefer to know
> what we thought we were protecting in the first place. :-)
>
It prevents BSSID/SSID/channel configuration from racing against userspace
configuration of the same thing. It's unlikely, however, and this deadlock is
much more serious than that race. I answered this in the patch for removing
the rtnl (01-kill-rtnl-locking.diff), though I wasn't as specific. This patch
should probably also go to stable.
-Michael Wu