2018-02-27 20:49:38

by Ben Greear

[permalink] [raw]
Subject: Deadlock debugging help.

I notice I can reliably lock up the kernel if I rmmod ath10k while it is under
heavy tx/rx traffic. First, this causes the firmware to crash, and then right
after (or possibly during?) the related kernel threads deadlock.

This is with my hacked driver and hacked firmware. In particular, the
ath10k_debug_nop_dwork is something I added, though it is pretty trivial,
it does take the ar->conf_mutex. It appears blocked trying to get it.

It appears something is holding the ar->conf_mutex, but it is not clear to
me from the lockdep output what process actually holds it.
Anyone see a clue they could share?

Thanks,
Ben


[ 1618.914057] sysrq: SysRq : Show Regs
[ 1618.916617] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G W O 4.13.16+ #2
[ 1618.916618] Hardware name: _ _/ , BIOS 5.11 08/26/2016
[ 1618.916620] task: ffff88016d3ea740 task.stack: ffffc900006c8000
[ 1618.916624] RIP: 0010:cpuidle_enter_state+0x130/0x3b0
[ 1618.916625] RSP: 0018:ffffc900006cbe78 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff3e
[ 1618.916628] RAX: ffff88016d3ea740 RBX: 0000000000000008 RCX: 000000000000001f
[ 1618.916629] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88016d3ea740
[ 1618.916630] RBP: ffffc900006cbeb0 R08: 00000000000052ca R09: 0000000000000000
[ 1618.916631] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000008
[ 1618.916633] R13: ffff880172d26c00 R14: ffffffff81ec66b8 R15: 00000178ee494b98
[ 1618.916634] FS: 0000000000000000(0000) GS:ffff880172d00000(0000) knlGS:0000000000000000
[ 1618.916636] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1618.916637] CR2: 00007ffdda2194b8 CR3: 0000000001e0f000 CR4: 00000000003406e0
[ 1618.916638] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1618.916639] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1618.916640] Call Trace:
[ 1618.916645] cpuidle_enter+0x12/0x20
[ 1618.916647] call_cpuidle+0x1e/0x40
[ 1618.916649] do_idle+0x184/0x1e0
[ 1618.916651] cpu_startup_entry+0x5f/0x70
[ 1618.916654] start_secondary+0x150/0x180
[ 1618.916657] secondary_startup_64+0x9f/0x9f
[ 1618.916662] Code: b7 df 95 ff 48 89 45 d0 0f 1f 44 00 00 31 ff e8 97 6c 97 ff 8b 45 c8 85 c0 0f 85 83 01 00 00 e8 57 5b 98 ff fb 66 0f 1f 44 00 00 <48> 8b 5d
d0 48 ba cf f7 53 e3 a5 9b c4 20 4c 29 fb 48 89 d8 48

[ 1618.916724] CPU#2: ctrl: 000000070000000f
[ 1618.916726] CPU#2: status: 0000000000000000
[ 1618.916727] CPU#2: overflow: 0000000000000000
[ 1618.916728] CPU#2: fixed: 00000000000000b0
[ 1618.916729] CPU#2: pebs: 0000000000000000
[ 1618.916730] CPU#2: debugctl: 0000000000000000
[ 1618.916731] CPU#2: active: 0000000200000000
[ 1618.916732] CPU#2: gen-PMC0 ctrl: 0000000000000000
[ 1618.916733] CPU#2: gen-PMC0 count: 0000000000000000
[ 1618.916734] CPU#2: gen-PMC0 left: 0000000000000000
[ 1618.916735] CPU#2: gen-PMC1 ctrl: 0000000000000000
[ 1618.916736] CPU#2: gen-PMC1 count: 0000000000000000
[ 1618.916737] CPU#2: gen-PMC1 left: 0000000000000000
[ 1618.916738] CPU#2: gen-PMC2 ctrl: 0000000000000000
[ 1618.916739] CPU#2: gen-PMC2 count: 0000000000000000
[ 1618.916740] CPU#2: gen-PMC2 left: 0000000000000000
[ 1618.916742] CPU#2: gen-PMC3 ctrl: 0000000000000000
[ 1618.916743] CPU#2: gen-PMC3 count: 0000000000000000
[ 1618.916744] CPU#2: gen-PMC3 left: 0000000000000000
[ 1618.916745] CPU#2: fixed-PMC0 count: 0000000000000000
[ 1618.916746] CPU#2: fixed-PMC1 count: 0000fffed3945191
[ 1618.916747] CPU#2: fixed-PMC2 count: 0000000000000000
[ 1623.035449] INFO: task kworker/u8:1:273 blocked for more than 180 seconds.
[ 1623.041155] Tainted: G W O 4.13.16+ #2
[ 1623.045173] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1623.051966] kworker/u8:1 D 0 273 2 0x00000000
[ 1623.052010] Workqueue: cfg80211 cfg80211_dfs_channels_update_work [cfg80211]
[ 1623.052026] Call Trace:
[ 1623.052032] __schedule+0x407/0xb70
[ 1623.052037] schedule+0x38/0x90
[ 1623.052039] schedule_preempt_disabled+0x10/0x20
[ 1623.052041] __mutex_lock+0x3d3/0x930
[ 1623.052044] ? rtnl_lock+0x12/0x20
[ 1623.052050] mutex_lock_nested+0x16/0x20
[ 1623.052052] ? mutex_lock_nested+0x16/0x20
[ 1623.052054] rtnl_lock+0x12/0x20
[ 1623.052066] cfg80211_dfs_channels_update_work+0x3c/0x1c0 [cfg80211]
[ 1623.052069] ? process_one_work+0x14f/0x6a0
[ 1623.052074] process_one_work+0x1ce/0x6a0
[ 1623.052078] worker_thread+0x46/0x440
[ 1623.052083] kthread+0x10f/0x150
[ 1623.052084] ? process_one_work+0x6a0/0x6a0
[ 1623.052086] ? kthread_create_on_node+0x40/0x40
[ 1623.052088] ? kthread_create_on_node+0x40/0x40
[ 1623.052091] ret_from_fork+0x2a/0x40
[ 1623.052098] INFO: task kworker/u8:2:281 blocked for more than 180 seconds.
[ 1623.057700] Tainted: G W O 4.13.16+ #2
[ 1623.061659] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1623.068270] kworker/u8:2 D 0 281 2 0x00000000
[ 1623.068282] Workqueue: ath10k_wq ath10k_debug_nop_dwork [ath10k_core]
[ 1623.068284] Call Trace:
[ 1623.068288] __schedule+0x407/0xb70
[ 1623.068292] schedule+0x38/0x90
[ 1623.068308] schedule_preempt_disabled+0x10/0x20
[ 1623.068310] __mutex_lock+0x3d3/0x930
[ 1623.068315] ? ath10k_debug_nop_dwork+0x24/0x80 [ath10k_core]
[ 1623.068322] ? lock_acquire+0xac/0x200
[ 1623.068325] mutex_lock_nested+0x16/0x20
[ 1623.068326] ? mutex_lock_nested+0x16/0x20
[ 1623.068331] ath10k_debug_nop_dwork+0x24/0x80 [ath10k_core]
[ 1623.068340] process_one_work+0x1ce/0x6a0
[ 1623.068345] worker_thread+0x46/0x440
[ 1623.068349] kthread+0x10f/0x150
[ 1623.068350] ? process_one_work+0x6a0/0x6a0
[ 1623.068352] ? kthread_create_on_node+0x40/0x40
[ 1623.068354] ? kthread_create_on_node+0x40/0x40
[ 1623.068356] ret_from_fork+0x2a/0x40
[ 1623.068363] INFO: task kworker/0:3:462 blocked for more than 180 seconds.
[ 1623.073888] Tainted: G W O 4.13.16+ #2
[ 1623.077882] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1623.084475] kworker/0:3 D 0 462 2 0x00000080
[ 1623.084480] Workqueue: events linkwatch_event
[ 1623.084482] Call Trace:
[ 1623.084485] __schedule+0x407/0xb70
[ 1623.084489] schedule+0x38/0x90
[ 1623.084491] schedule_preempt_disabled+0x10/0x20
[ 1623.084493] __mutex_lock+0x3d3/0x930
[ 1623.084495] ? rtnl_lock+0x12/0x20
[ 1623.084501] ? lock_acquire+0xac/0x200
[ 1623.084504] mutex_lock_nested+0x16/0x20
[ 1623.084507] ? mutex_lock_nested+0x16/0x20
[ 1623.084509] rtnl_lock+0x12/0x20
[ 1623.084511] linkwatch_event+0x9/0x30
[ 1623.084513] process_one_work+0x1ce/0x6a0
[ 1623.084517] worker_thread+0x46/0x440
[ 1623.084521] kthread+0x10f/0x150
[ 1623.084523] ? process_one_work+0x6a0/0x6a0
[ 1623.084525] ? kthread_create_on_node+0x40/0x40
[ 1623.084528] ? do_syscall_64+0x64/0x140
[ 1623.084531] ret_from_fork+0x2a/0x40
[ 1623.084537] INFO: task kworker/1:2:562 blocked for more than 180 seconds.
[ 1623.090051] Tainted: G W O 4.13.16+ #2
[ 1623.094000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1623.100574] kworker/1:2 D 0 562 2 0x00000080
[ 1623.100615] Workqueue: events_freezable ieee80211_restart_work [mac80211]
[ 1623.100617] Call Trace:
[ 1623.100620] __schedule+0x407/0xb70
[ 1623.100625] schedule+0x38/0x90
[ 1623.100627] schedule_preempt_disabled+0x10/0x20
[ 1623.100629] __mutex_lock+0x3d3/0x930
[ 1623.100631] ? rtnl_lock+0x12/0x20
[ 1623.100648] ? flush_work+0x298/0x2d0
[ 1623.100654] mutex_lock_nested+0x16/0x20
[ 1623.100655] ? mutex_lock_nested+0x16/0x20
[ 1623.100658] rtnl_lock+0x12/0x20
[ 1623.100667] ieee80211_restart_work+0x50/0xc0 [mac80211]
[ 1623.100671] process_one_work+0x1ce/0x6a0
[ 1623.100675] worker_thread+0x46/0x440
[ 1623.100679] kthread+0x10f/0x150
[ 1623.100681] ? process_one_work+0x6a0/0x6a0
[ 1623.100683] ? kthread_create_on_node+0x40/0x40
[ 1623.100685] ? do_syscall_64+0x64/0x140
[ 1623.100688] ret_from_fork+0x2a/0x40
[ 1623.100714] INFO: task btserver:3871 blocked for more than 180 seconds.
[ 1623.106094] Tainted: G W O 4.13.16+ #2
[ 1623.110042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1623.116600] btserver D 0 3871 1958 0x00000080
[ 1623.116604] Call Trace:
[ 1623.116627] __schedule+0x407/0xb70
[ 1623.116633] schedule+0x38/0x90
[ 1623.116636] schedule_preempt_disabled+0x10/0x20
[ 1623.116638] __mutex_lock+0x3d3/0x930
[ 1623.116641] ? rtnetlink_rcv+0x16/0x30
[ 1623.116648] mutex_lock_nested+0x16/0x20
[ 1623.116649] ? mutex_lock_nested+0x16/0x20
[ 1623.116651] rtnetlink_rcv+0x16/0x30
[ 1623.116653] netlink_unicast+0x1cb/0x2e0
[ 1623.116656] netlink_sendmsg+0x2c6/0x3a0
[ 1623.116661] sock_sendmsg+0x33/0x40
[ 1623.116664] SYSC_sendto+0x101/0x190
[ 1623.116670] ? __getnstimeofday64+0x9a/0x150
[ 1623.116673] ? getnstimeofday64+0x9/0x20
[ 1623.116675] ? __audit_syscall_entry+0xb2/0x100
[ 1623.116678] ? syscall_trace_enter+0x1bc/0x3b0
[ 1623.116681] ? __do_page_fault+0x28b/0x520
[ 1623.116684] ? trace_hardirqs_on_caller+0x11f/0x190
[ 1623.116688] SyS_sendto+0x9/0x10
[ 1623.116690] do_syscall_64+0x64/0x140
[ 1623.116693] entry_SYSCALL64_slow_path+0x25/0x25
[ 1623.116695] RIP: 0033:0x7f7951ec71b3
[ 1623.116696] RSP: 002b:00007fffcf807788 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 1623.116699] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f7951ec71b3
[ 1623.116700] RDX: 0000000000000020 RSI: 00007fffcf8077c0 RDI: 000000000000000d
[ 1623.116716] RBP: 00007fffcf807a20 R08: 00007fffcf8077b0 R09: 000000000000000c
[ 1623.116717] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 1623.116718] R13: 000000000084b5a5 R14: 000000000084b5ad R15: 000000000084ba53
[ 1623.116728]
Showing all locks held in the system:
[ 1623.116733] 1 lock held by khungtaskd/38:
[ 1623.116738] #0: (tasklist_lock){.+.+..}, at: [<ffffffff8117ba8d>] debug_show_all_locks+0x3d/0x1a0
[ 1623.116747] 3 locks held by kworker/u8:1/273:
[ 1623.116749] #0: ("cfg80211"){.+.+.+}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
[ 1623.116755] #1: ((&(&rdev->dfs_update_channels_wk)->work)){+.+...}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
[ 1623.116761] #2: (rtnl_mutex){+.+.+.}, at: [<ffffffff818a2b72>] rtnl_lock+0x12/0x20
[ 1623.116768] 3 locks held by kworker/u8:2/281:
[ 1623.116769] #0: ("%s""ath10k_wq"){.+.+.+}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
[ 1623.116775] #1: ((&(&ar->debug.nop_dwork)->work)){+.+...}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
[ 1623.116781] #2: (&ar->conf_mutex){+.+.+.}, at: [<ffffffffa066d554>] ath10k_debug_nop_dwork+0x24/0x80 [ath10k_core]
[ 1623.116792] 3 locks held by kworker/0:3/462:
[ 1623.116793] #0: ("events"){.+.+.+}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
[ 1623.116799] #1: ((linkwatch_work).work){+.+.+.}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
[ 1623.116805] #2: (rtnl_mutex){+.+.+.}, at: [<ffffffff818a2b72>] rtnl_lock+0x12/0x20
[ 1623.116812] 3 locks held by kworker/1:2/562:
[ 1623.116813] #0: ("events_freezable"){.+.+..}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
[ 1623.116819] #1: ((&local->restart_work)){+.+...}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
[ 1623.116825] #2: (rtnl_mutex){+.+.+.}, at: [<ffffffff818a2b72>] rtnl_lock+0x12/0x20
[ 1623.116845] 2 locks held by bash/1896:
[ 1623.116846] #0: (&tty->ldisc_sem){++++.+}, at: [<ffffffff819b67ad>] ldsem_down_read+0x2d/0x40
[ 1623.116852] #1: (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff815f5ed9>] n_tty_read+0xa9/0x8d0
[ 1623.116860] 1 lock held by btserver/3871:
[ 1623.116861] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff818a40f6>] rtnetlink_rcv+0x16/0x30
[ 1623.116867] 3 locks held by kworker/2:0/4308:
[ 1623.116868] #0: ("%s"("ipv6_addrconf")){.+.+..}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
[ 1623.116874] #1: ((addr_chk_work).work){+.+...}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
[ 1623.116880] #2: (rtnl_mutex){+.+.+.}, at: [<ffffffff818a2b72>] rtnl_lock+0x12/0x20
[ 1623.116887] 4 locks held by rmmod/7551:
[ 1623.116888] #0: (&dev->mutex){......}, at: [<ffffffff81656b76>] device_release_driver_internal+0x26/0x210
[ 1623.116894] #1: (&dev->mutex){......}, at: [<ffffffff81656b84>] device_release_driver_internal+0x34/0x210
[ 1623.116899] #2: (rtnl_mutex){+.+.+.}, at: [<ffffffff818a2b72>] rtnl_lock+0x12/0x20
[ 1623.116905] #3: (&ar->conf_mutex){+.+.+.}, at: [<ffffffffa0668eb4>] ath10k_stop+0x24/0x80 [ath10k_core]
[ 1623.116915] 2 locks held by bash/7559:
[ 1623.116916] #0: (&tty->ldisc_sem){++++.+}, at: [<ffffffff819b67ad>] ldsem_down_read+0x2d/0x40
[ 1623.116922] #1: (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff815f5ed9>] n_tty_read+0xa9/0x8d0

[ 1623.116929] =============================================


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2018-02-27 21:42:34

by Ben Greear

[permalink] [raw]
Subject: Re: Deadlock debugging help.

On 02/27/2018 12:49 PM, Ben Greear wrote:
> I notice I can reliably lock up the kernel if I rmmod ath10k while it is under
> heavy tx/rx traffic. First, this causes the firmware to crash, and then right
> after (or possibly during?) the related kernel threads deadlock.
>
> This is with my hacked driver and hacked firmware. In particular, the
> ath10k_debug_nop_dwork is something I added, though it is pretty trivial,
> it does take the ar->conf_mutex. It appears blocked trying to get it.
>
> It appears something is holding the ar->conf_mutex, but it is not clear to
> me from the lockdep output what process actually holds it.
> Anyone see a clue they could share?

Changing how I start/stop the nop_dwork stuff seems to have made the
problem go away, so I guess maybe that was the issue.

Thanks,
Ben

>
> Thanks,
> Ben
>
>
> [ 1618.914057] sysrq: SysRq : Show Regs
> [ 1618.916617] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G W O 4.13.16+ #2
> [ 1618.916618] Hardware name: _ _/ , BIOS 5.11 08/26/2016
> [ 1618.916620] task: ffff88016d3ea740 task.stack: ffffc900006c8000
> [ 1618.916624] RIP: 0010:cpuidle_enter_state+0x130/0x3b0
> [ 1618.916625] RSP: 0018:ffffc900006cbe78 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff3e
> [ 1618.916628] RAX: ffff88016d3ea740 RBX: 0000000000000008 RCX: 000000000000001f
> [ 1618.916629] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88016d3ea740
> [ 1618.916630] RBP: ffffc900006cbeb0 R08: 00000000000052ca R09: 0000000000000000
> [ 1618.916631] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000008
> [ 1618.916633] R13: ffff880172d26c00 R14: ffffffff81ec66b8 R15: 00000178ee494b98
> [ 1618.916634] FS: 0000000000000000(0000) GS:ffff880172d00000(0000) knlGS:0000000000000000
> [ 1618.916636] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1618.916637] CR2: 00007ffdda2194b8 CR3: 0000000001e0f000 CR4: 00000000003406e0
> [ 1618.916638] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1618.916639] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1618.916640] Call Trace:
> [ 1618.916645] cpuidle_enter+0x12/0x20
> [ 1618.916647] call_cpuidle+0x1e/0x40
> [ 1618.916649] do_idle+0x184/0x1e0
> [ 1618.916651] cpu_startup_entry+0x5f/0x70
> [ 1618.916654] start_secondary+0x150/0x180
> [ 1618.916657] secondary_startup_64+0x9f/0x9f
> [ 1618.916662] Code: b7 df 95 ff 48 89 45 d0 0f 1f 44 00 00 31 ff e8 97 6c 97 ff 8b 45 c8 85 c0 0f 85 83 01 00 00 e8 57 5b 98 ff fb 66 0f 1f 44 00 00 <48> 8b 5d
> d0 48 ba cf f7 53 e3 a5 9b c4 20 4c 29 fb 48 89 d8 48
>
> [ 1618.916724] CPU#2: ctrl: 000000070000000f
> [ 1618.916726] CPU#2: status: 0000000000000000
> [ 1618.916727] CPU#2: overflow: 0000000000000000
> [ 1618.916728] CPU#2: fixed: 00000000000000b0
> [ 1618.916729] CPU#2: pebs: 0000000000000000
> [ 1618.916730] CPU#2: debugctl: 0000000000000000
> [ 1618.916731] CPU#2: active: 0000000200000000
> [ 1618.916732] CPU#2: gen-PMC0 ctrl: 0000000000000000
> [ 1618.916733] CPU#2: gen-PMC0 count: 0000000000000000
> [ 1618.916734] CPU#2: gen-PMC0 left: 0000000000000000
> [ 1618.916735] CPU#2: gen-PMC1 ctrl: 0000000000000000
> [ 1618.916736] CPU#2: gen-PMC1 count: 0000000000000000
> [ 1618.916737] CPU#2: gen-PMC1 left: 0000000000000000
> [ 1618.916738] CPU#2: gen-PMC2 ctrl: 0000000000000000
> [ 1618.916739] CPU#2: gen-PMC2 count: 0000000000000000
> [ 1618.916740] CPU#2: gen-PMC2 left: 0000000000000000
> [ 1618.916742] CPU#2: gen-PMC3 ctrl: 0000000000000000
> [ 1618.916743] CPU#2: gen-PMC3 count: 0000000000000000
> [ 1618.916744] CPU#2: gen-PMC3 left: 0000000000000000
> [ 1618.916745] CPU#2: fixed-PMC0 count: 0000000000000000
> [ 1618.916746] CPU#2: fixed-PMC1 count: 0000fffed3945191
> [ 1618.916747] CPU#2: fixed-PMC2 count: 0000000000000000
> [ 1623.035449] INFO: task kworker/u8:1:273 blocked for more than 180 seconds.
> [ 1623.041155] Tainted: G W O 4.13.16+ #2
> [ 1623.045173] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1623.051966] kworker/u8:1 D 0 273 2 0x00000000
> [ 1623.052010] Workqueue: cfg80211 cfg80211_dfs_channels_update_work [cfg80211]
> [ 1623.052026] Call Trace:
> [ 1623.052032] __schedule+0x407/0xb70
> [ 1623.052037] schedule+0x38/0x90
> [ 1623.052039] schedule_preempt_disabled+0x10/0x20
> [ 1623.052041] __mutex_lock+0x3d3/0x930
> [ 1623.052044] ? rtnl_lock+0x12/0x20
> [ 1623.052050] mutex_lock_nested+0x16/0x20
> [ 1623.052052] ? mutex_lock_nested+0x16/0x20
> [ 1623.052054] rtnl_lock+0x12/0x20
> [ 1623.052066] cfg80211_dfs_channels_update_work+0x3c/0x1c0 [cfg80211]
> [ 1623.052069] ? process_one_work+0x14f/0x6a0
> [ 1623.052074] process_one_work+0x1ce/0x6a0
> [ 1623.052078] worker_thread+0x46/0x440
> [ 1623.052083] kthread+0x10f/0x150
> [ 1623.052084] ? process_one_work+0x6a0/0x6a0
> [ 1623.052086] ? kthread_create_on_node+0x40/0x40
> [ 1623.052088] ? kthread_create_on_node+0x40/0x40
> [ 1623.052091] ret_from_fork+0x2a/0x40
> [ 1623.052098] INFO: task kworker/u8:2:281 blocked for more than 180 seconds.
> [ 1623.057700] Tainted: G W O 4.13.16+ #2
> [ 1623.061659] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1623.068270] kworker/u8:2 D 0 281 2 0x00000000
> [ 1623.068282] Workqueue: ath10k_wq ath10k_debug_nop_dwork [ath10k_core]
> [ 1623.068284] Call Trace:
> [ 1623.068288] __schedule+0x407/0xb70
> [ 1623.068292] schedule+0x38/0x90
> [ 1623.068308] schedule_preempt_disabled+0x10/0x20
> [ 1623.068310] __mutex_lock+0x3d3/0x930
> [ 1623.068315] ? ath10k_debug_nop_dwork+0x24/0x80 [ath10k_core]
> [ 1623.068322] ? lock_acquire+0xac/0x200
> [ 1623.068325] mutex_lock_nested+0x16/0x20
> [ 1623.068326] ? mutex_lock_nested+0x16/0x20
> [ 1623.068331] ath10k_debug_nop_dwork+0x24/0x80 [ath10k_core]
> [ 1623.068340] process_one_work+0x1ce/0x6a0
> [ 1623.068345] worker_thread+0x46/0x440
> [ 1623.068349] kthread+0x10f/0x150
> [ 1623.068350] ? process_one_work+0x6a0/0x6a0
> [ 1623.068352] ? kthread_create_on_node+0x40/0x40
> [ 1623.068354] ? kthread_create_on_node+0x40/0x40
> [ 1623.068356] ret_from_fork+0x2a/0x40
> [ 1623.068363] INFO: task kworker/0:3:462 blocked for more than 180 seconds.
> [ 1623.073888] Tainted: G W O 4.13.16+ #2
> [ 1623.077882] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1623.084475] kworker/0:3 D 0 462 2 0x00000080
> [ 1623.084480] Workqueue: events linkwatch_event
> [ 1623.084482] Call Trace:
> [ 1623.084485] __schedule+0x407/0xb70
> [ 1623.084489] schedule+0x38/0x90
> [ 1623.084491] schedule_preempt_disabled+0x10/0x20
> [ 1623.084493] __mutex_lock+0x3d3/0x930
> [ 1623.084495] ? rtnl_lock+0x12/0x20
> [ 1623.084501] ? lock_acquire+0xac/0x200
> [ 1623.084504] mutex_lock_nested+0x16/0x20
> [ 1623.084507] ? mutex_lock_nested+0x16/0x20
> [ 1623.084509] rtnl_lock+0x12/0x20
> [ 1623.084511] linkwatch_event+0x9/0x30
> [ 1623.084513] process_one_work+0x1ce/0x6a0
> [ 1623.084517] worker_thread+0x46/0x440
> [ 1623.084521] kthread+0x10f/0x150
> [ 1623.084523] ? process_one_work+0x6a0/0x6a0
> [ 1623.084525] ? kthread_create_on_node+0x40/0x40
> [ 1623.084528] ? do_syscall_64+0x64/0x140
> [ 1623.084531] ret_from_fork+0x2a/0x40
> [ 1623.084537] INFO: task kworker/1:2:562 blocked for more than 180 seconds.
> [ 1623.090051] Tainted: G W O 4.13.16+ #2
> [ 1623.094000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1623.100574] kworker/1:2 D 0 562 2 0x00000080
> [ 1623.100615] Workqueue: events_freezable ieee80211_restart_work [mac80211]
> [ 1623.100617] Call Trace:
> [ 1623.100620] __schedule+0x407/0xb70
> [ 1623.100625] schedule+0x38/0x90
> [ 1623.100627] schedule_preempt_disabled+0x10/0x20
> [ 1623.100629] __mutex_lock+0x3d3/0x930
> [ 1623.100631] ? rtnl_lock+0x12/0x20
> [ 1623.100648] ? flush_work+0x298/0x2d0
> [ 1623.100654] mutex_lock_nested+0x16/0x20
> [ 1623.100655] ? mutex_lock_nested+0x16/0x20
> [ 1623.100658] rtnl_lock+0x12/0x20
> [ 1623.100667] ieee80211_restart_work+0x50/0xc0 [mac80211]
> [ 1623.100671] process_one_work+0x1ce/0x6a0
> [ 1623.100675] worker_thread+0x46/0x440
> [ 1623.100679] kthread+0x10f/0x150
> [ 1623.100681] ? process_one_work+0x6a0/0x6a0
> [ 1623.100683] ? kthread_create_on_node+0x40/0x40
> [ 1623.100685] ? do_syscall_64+0x64/0x140
> [ 1623.100688] ret_from_fork+0x2a/0x40
> [ 1623.100714] INFO: task btserver:3871 blocked for more than 180 seconds.
> [ 1623.106094] Tainted: G W O 4.13.16+ #2
> [ 1623.110042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1623.116600] btserver D 0 3871 1958 0x00000080
> [ 1623.116604] Call Trace:
> [ 1623.116627] __schedule+0x407/0xb70
> [ 1623.116633] schedule+0x38/0x90
> [ 1623.116636] schedule_preempt_disabled+0x10/0x20
> [ 1623.116638] __mutex_lock+0x3d3/0x930
> [ 1623.116641] ? rtnetlink_rcv+0x16/0x30
> [ 1623.116648] mutex_lock_nested+0x16/0x20
> [ 1623.116649] ? mutex_lock_nested+0x16/0x20
> [ 1623.116651] rtnetlink_rcv+0x16/0x30
> [ 1623.116653] netlink_unicast+0x1cb/0x2e0
> [ 1623.116656] netlink_sendmsg+0x2c6/0x3a0
> [ 1623.116661] sock_sendmsg+0x33/0x40
> [ 1623.116664] SYSC_sendto+0x101/0x190
> [ 1623.116670] ? __getnstimeofday64+0x9a/0x150
> [ 1623.116673] ? getnstimeofday64+0x9/0x20
> [ 1623.116675] ? __audit_syscall_entry+0xb2/0x100
> [ 1623.116678] ? syscall_trace_enter+0x1bc/0x3b0
> [ 1623.116681] ? __do_page_fault+0x28b/0x520
> [ 1623.116684] ? trace_hardirqs_on_caller+0x11f/0x190
> [ 1623.116688] SyS_sendto+0x9/0x10
> [ 1623.116690] do_syscall_64+0x64/0x140
> [ 1623.116693] entry_SYSCALL64_slow_path+0x25/0x25
> [ 1623.116695] RIP: 0033:0x7f7951ec71b3
> [ 1623.116696] RSP: 002b:00007fffcf807788 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
> [ 1623.116699] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f7951ec71b3
> [ 1623.116700] RDX: 0000000000000020 RSI: 00007fffcf8077c0 RDI: 000000000000000d
> [ 1623.116716] RBP: 00007fffcf807a20 R08: 00007fffcf8077b0 R09: 000000000000000c
> [ 1623.116717] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [ 1623.116718] R13: 000000000084b5a5 R14: 000000000084b5ad R15: 000000000084ba53
> [ 1623.116728]
> Showing all locks held in the system:
> [ 1623.116733] 1 lock held by khungtaskd/38:
> [ 1623.116738] #0: (tasklist_lock){.+.+..}, at: [<ffffffff8117ba8d>] debug_show_all_locks+0x3d/0x1a0
> [ 1623.116747] 3 locks held by kworker/u8:1/273:
> [ 1623.116749] #0: ("cfg80211"){.+.+.+}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
> [ 1623.116755] #1: ((&(&rdev->dfs_update_channels_wk)->work)){+.+...}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
> [ 1623.116761] #2: (rtnl_mutex){+.+.+.}, at: [<ffffffff818a2b72>] rtnl_lock+0x12/0x20
> [ 1623.116768] 3 locks held by kworker/u8:2/281:
> [ 1623.116769] #0: ("%s""ath10k_wq"){.+.+.+}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
> [ 1623.116775] #1: ((&(&ar->debug.nop_dwork)->work)){+.+...}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
> [ 1623.116781] #2: (&ar->conf_mutex){+.+.+.}, at: [<ffffffffa066d554>] ath10k_debug_nop_dwork+0x24/0x80 [ath10k_core]
> [ 1623.116792] 3 locks held by kworker/0:3/462:
> [ 1623.116793] #0: ("events"){.+.+.+}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
> [ 1623.116799] #1: ((linkwatch_work).work){+.+.+.}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
> [ 1623.116805] #2: (rtnl_mutex){+.+.+.}, at: [<ffffffff818a2b72>] rtnl_lock+0x12/0x20
> [ 1623.116812] 3 locks held by kworker/1:2/562:
> [ 1623.116813] #0: ("events_freezable"){.+.+..}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
> [ 1623.116819] #1: ((&local->restart_work)){+.+...}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
> [ 1623.116825] #2: (rtnl_mutex){+.+.+.}, at: [<ffffffff818a2b72>] rtnl_lock+0x12/0x20
> [ 1623.116845] 2 locks held by bash/1896:
> [ 1623.116846] #0: (&tty->ldisc_sem){++++.+}, at: [<ffffffff819b67ad>] ldsem_down_read+0x2d/0x40
> [ 1623.116852] #1: (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff815f5ed9>] n_tty_read+0xa9/0x8d0
> [ 1623.116860] 1 lock held by btserver/3871:
> [ 1623.116861] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff818a40f6>] rtnetlink_rcv+0x16/0x30
> [ 1623.116867] 3 locks held by kworker/2:0/4308:
> [ 1623.116868] #0: ("%s"("ipv6_addrconf")){.+.+..}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
> [ 1623.116874] #1: ((addr_chk_work).work){+.+...}, at: [<ffffffff8113e14f>] process_one_work+0x14f/0x6a0
> [ 1623.116880] #2: (rtnl_mutex){+.+.+.}, at: [<ffffffff818a2b72>] rtnl_lock+0x12/0x20
> [ 1623.116887] 4 locks held by rmmod/7551:
> [ 1623.116888] #0: (&dev->mutex){......}, at: [<ffffffff81656b76>] device_release_driver_internal+0x26/0x210
> [ 1623.116894] #1: (&dev->mutex){......}, at: [<ffffffff81656b84>] device_release_driver_internal+0x34/0x210
> [ 1623.116899] #2: (rtnl_mutex){+.+.+.}, at: [<ffffffff818a2b72>] rtnl_lock+0x12/0x20
> [ 1623.116905] #3: (&ar->conf_mutex){+.+.+.}, at: [<ffffffffa0668eb4>] ath10k_stop+0x24/0x80 [ath10k_core]
> [ 1623.116915] 2 locks held by bash/7559:
> [ 1623.116916] #0: (&tty->ldisc_sem){++++.+}, at: [<ffffffff819b67ad>] ldsem_down_read+0x2d/0x40
> [ 1623.116922] #1: (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff815f5ed9>] n_tty_read+0xa9/0x8d0
>
> [ 1623.116929] =============================================
>
>


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2018-02-27 23:32:02

by Ben Greear

[permalink] [raw]
Subject: Re: Deadlock debugging help.

On 02/27/2018 01:42 PM, Ben Greear wrote:
> On 02/27/2018 12:49 PM, Ben Greear wrote:
>> I notice I can reliably lock up the kernel if I rmmod ath10k while it is under
>> heavy tx/rx traffic. First, this causes the firmware to crash, and then right
>> after (or possibly during?) the related kernel threads deadlock.
>>
>> This is with my hacked driver and hacked firmware. In particular, the
>> ath10k_debug_nop_dwork is something I added, though it is pretty trivial,
>> it does take the ar->conf_mutex. It appears blocked trying to get it.
>>
>> It appears something is holding the ar->conf_mutex, but it is not clear to
>> me from the lockdep output what process actually holds it.
>> Anyone see a clue they could share?
>
> Changing how I start/stop the nop_dwork stuff seems to have made the
> problem go away, so I guess maybe that was the issue.

Ok, so problem still remains. The 'rmmod' process appears to be the
one that is really not making progress. Unfortunately, decoding
ath10k_pci_hif_stop+0x6f leads to some bitops.h inline, which doesn't
let me know where it is actually stuck... Off to do more debugging....


[ 4037.220992] rmmod D 0 20267 3050 0x00000080
[ 4037.220995] Call Trace:
[ 4037.220997] __schedule+0x407/0xb70
[ 4037.220999] ? _raw_spin_unlock_irqrestore+0x4e/0x70
[ 4037.221003] schedule+0x38/0x90
[ 4037.221005] schedule_timeout+0x224/0x580
[ 4037.221007] ? retint_kernel+0x2d/0x2d
[ 4037.221010] ? call_timer_fn+0x370/0x370
[ 4037.221015] msleep+0x34/0x40
[ 4037.221017] ? msleep+0x34/0x40
[ 4037.221021] ath10k_pci_hif_stop+0x6f/0xd0 [ath10k_pci]
[ 4037.221032] ath10k_core_stop+0x4d/0x90 [ath10k_core]
[ 4037.221038] ath10k_halt+0x14b/0x1f0 [ath10k_core]
[ 4037.221044] ath10k_stop+0x36/0x80 [ath10k_core]
[ 4037.221059] drv_stop+0x58/0x2d0 [mac80211]
[ 4037.221075] ieee80211_stop_device+0x3e/0x50 [mac80211]
[ 4037.221088] ieee80211_do_stop+0x501/0x880 [mac80211]
[ 4037.221092] ? dev_deactivate_many+0x2b2/0x2f0
[ 4037.221105] ieee80211_stop+0x15/0x20 [mac80211]
[ 4037.221107] __dev_close_many+0x93/0xe0
[ 4037.221110] dev_close_many+0x7d/0x120
[ 4037.221114] dev_close.part.85+0x36/0x50
[ 4037.221116] dev_close+0x15/0x20
[ 4037.221155] cfg80211_shutdown_all_interfaces+0x44/0xc0 [cfg80211]
[ 4037.221168] ieee80211_remove_interfaces+0x42/0x1c0 [mac80211]
[ 4037.221180] ieee80211_unregister_hw+0x45/0x130 [mac80211]
[ 4037.221187] ath10k_mac_unregister+0x14/0x60 [ath10k_core]
[ 4037.221193] ath10k_core_unregister+0x3a/0xa0 [ath10k_core]
[ 4037.221197] ath10k_pci_remove+0x2d/0x70 [ath10k_pci]
[ 4037.221200] pci_device_remove+0x34/0xb0
[ 4037.221203] device_release_driver_internal+0x158/0x210
[ 4037.221206] driver_detach+0x3b/0x80
[ 4037.221208] bus_remove_driver+0x53/0xd0
[ 4037.221210] driver_unregister+0x27/0x40
[ 4037.221213] pci_unregister_driver+0x24/0x90
[ 4037.221216] ath10k_pci_exit+0x10/0x6ee [ath10k_pci]
[ 4037.221218] SyS_delete_module+0x1e1/0x2a0
[ 4037.221222] do_syscall_64+0x64/0x140
[ 4037.221225] entry_SYSCALL64_slow_path+0x25/0x25

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com