2015-12-04 05:53:10

by Andy Lutomirski

[permalink] [raw]
Subject: Hang triggered by udev coldplug, looks like a race

Sometimes udevadm trigger --action=add hangs the system, and the splat
below happens. This seems to be timing dependent, and I haven't been
able to trigger it yet with lockdep enabled, sadly.

Any ideas? I not, I'll try to instrument it better tomorrow.

This is 4.4-rc3 plus some patches that I don't think are related. The
rootfs is 9p.

--Andy

[ 60.048359] INFO: task kworker/u2:1:980 blocked for more than 30 seconds.
[ 60.051673] Not tainted 4.3.0-rc4+ #3
[ 60.052149] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 60.053013] kworker/u2:1 D ffff880006823bc8 14152 980 14 0x00000000
[ 60.053786] ffff880006823bc8 ffff8800072e4740 ffff8800072e63c0
ffff880006824000
[ 60.054637] ffff88000613cfd0 ffff8800072e63c0 00000000ffffffff
0000000000000246
[ 60.055447] ffff880006823be0 ffffffff818d2bfe ffff88000613cfc8
ffff880006823bf0
[ 60.056250] Call Trace:
[ 60.056513] [<ffffffff818d2bfe>] schedule+0x2e/0x70
[ 60.057017] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
[ 60.057632] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
[ 60.058242] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
[ 60.058790] [<ffffffff818d5655>] mutex_lock+0x15/0x30
[ 60.059272] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
[ 60.059805] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
[ 60.060352] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
[ 60.060966] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
[ 60.061805] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
[ 60.062316] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
[ 60.062916] [<ffffffff811c6297>] do_execve+0x27/0x30
[ 60.063355] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
[ 60.063951] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
[ 60.064416] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
[ 60.064889] INFO: task kworker/u2:1:981 blocked for more than 30 seconds.
[ 60.065473] Not tainted 4.3.0-rc4+ #3
[ 60.065834] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 60.066507] kworker/u2:1 D ffff880006827bc8 14152 981 14 0x00000000
[ 60.067135] ffff880006827bc8 ffff880006965580 ffff8800072e0e40
ffff880006828000
[ 60.067810] ffff88000613cfd0 ffff8800072e0e40 00000000ffffffff
0000000000000246
[ 60.068491] ffff880006827be0 ffffffff818d2bfe ffff88000613cfc8
ffff880006827bf0
[ 60.069169] Call Trace:
[ 60.069386] [<ffffffff818d2bfe>] schedule+0x2e/0x70
[ 60.069814] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
[ 60.070361] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
[ 60.070898] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
[ 60.071381] [<ffffffff818d5655>] mutex_lock+0x15/0x30
[ 60.071825] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
[ 60.072299] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
[ 60.072756] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
[ 60.073289] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
[ 60.073832] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
[ 60.074316] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
[ 60.075083] [<ffffffff811c6297>] do_execve+0x27/0x30
[ 60.075736] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
[ 60.076648] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
[ 60.077235] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
[ 60.077805] INFO: task kworker/u2:1:982 blocked for more than 30 seconds.
[ 60.078529] Not tainted 4.3.0-rc4+ #3
[ 60.078969] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 60.079891] kworker/u2:1 D ffff88000682bbc8 14584 982 14 0x00000000
[ 60.080658] ffff88000682bbc8 ffff8800064cb900 ffff880006bc0e40
ffff88000682c000
[ 60.081474] ffff88000613cfd0 ffff880006bc0e40 00000000ffffffff
0000000000000246
[ 60.082282] ffff88000682bbe0 ffffffff818d2bfe ffff88000613cfc8
ffff88000682bbf0
[ 60.083095] Call Trace:
[ 60.083357] [<ffffffff818d2bfe>] schedule+0x2e/0x70
[ 60.083871] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
[ 60.084533] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
[ 60.085181] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
[ 60.085761] [<ffffffff818d5655>] mutex_lock+0x15/0x30
[ 60.086300] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
[ 60.086869] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
[ 60.087422] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
[ 60.088067] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
[ 60.088722] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
[ 60.089301] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
[ 60.089991] [<ffffffff811c6297>] do_execve+0x27/0x30
[ 60.090552] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
[ 60.091274] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
[ 60.091832] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
[ 60.092401] INFO: task kworker/u2:1:983 blocked for more than 30 seconds.
[ 60.093138] Not tainted 4.3.0-rc4+ #3
[ 60.093574] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 60.094379] kworker/u2:1 D ffff88000682fbc8 14544 983 14 0x00000000
[ 60.095139] ffff88000682fbc8 ffff880006bc0e40 ffff88000644c740
ffff880006830000
[ 60.095973] ffff88000613cfd0 ffff88000644c740 00000000ffffffff
0000000000000246
[ 60.096871] ffff88000682fbe0 ffffffff818d2bfe ffff88000613cfc8
ffff88000682fbf0
[ 60.097686] Call Trace:
[ 60.097945] [<ffffffff818d2bfe>] schedule+0x2e/0x70
[ 60.098472] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
[ 60.099136] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
[ 60.099786] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
[ 60.100368] [<ffffffff818d5655>] mutex_lock+0x15/0x30
[ 60.100902] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
[ 60.101478] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
[ 60.102060] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
[ 60.102706] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
[ 60.103366] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
[ 60.103947] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
[ 60.104647] [<ffffffff811c6297>] do_execve+0x27/0x30
[ 60.105205] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
[ 60.105923] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
[ 60.106494] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
[ 60.107065] INFO: task kworker/u2:1:984 blocked for more than 30 seconds.
[ 60.107795] Not tainted 4.3.0-rc4+ #3
[ 60.108248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 60.109115] kworker/u2:1 D ffff8800068e3bc8 14152 984 14 0x00000000
[ 60.109881] ffff8800068e3bc8 ffff880006a25580 ffff880006b65580
ffff8800068e4000
[ 60.110737] ffff88000613cfd0 ffff880006b65580 00000000ffffffff
0000000000000246
[ 60.111577] ffff8800068e3be0 ffffffff818d2bfe ffff88000613cfc8
ffff8800068e3bf0
[ 60.112533] Call Trace:
[ 60.112798] [<ffffffff818d2bfe>] schedule+0x2e/0x70
[ 60.113325] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
[ 60.113990] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
[ 60.114652] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
[ 60.115240] [<ffffffff818d5655>] mutex_lock+0x15/0x30
[ 60.115807] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
[ 60.116385] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
[ 60.116942] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
[ 60.117595] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
[ 60.118261] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
[ 60.118845] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
[ 60.119548] [<ffffffff811c6297>] do_execve+0x27/0x30
[ 60.120110] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
[ 60.120836] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
[ 60.121406] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
[ 60.121981] INFO: task kworker/u2:1:985 blocked for more than 30 seconds.
[ 60.122732] Not tainted 4.3.0-rc4+ #3
[ 60.123177] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 60.123992] kworker/u2:1 D ffff8800068e7bc8 14152 985 14 0x00000000
[ 60.124759] ffff8800068e7bc8 ffff8800073c0000 ffff880006b61c80
ffff8800068e8000
[ 60.125579] ffff88000613cfd0 ffff880006b61c80 00000000ffffffff
0000000000000246
[ 60.126389] ffff8800068e7be0 ffffffff818d2bfe ffff88000613cfc8
ffff8800068e7bf0
[ 60.127258] Call Trace:
[ 60.127525] [<ffffffff818d2bfe>] schedule+0x2e/0x70
[ 60.128052] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
[ 60.128720] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
[ 60.129380] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
[ 60.129995] [<ffffffff818d5655>] mutex_lock+0x15/0x30
[ 60.130570] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
[ 60.131150] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
[ 60.131733] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
[ 60.132410] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
[ 60.133106] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
[ 60.133692] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
[ 60.134393] [<ffffffff811c6297>] do_execve+0x27/0x30
[ 60.134951] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
[ 60.135711] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
[ 60.136280] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
[ 60.136856] INFO: task kworker/u2:1:986 blocked for more than 30 seconds.
[ 60.137600] Not tainted 4.3.0-rc4+ #3
[ 60.138073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 60.138881] kworker/u2:1 D ffff8800068ebbc8 14152 986 14 0x00000000
[ 60.139672] ffff8800068ebbc8 ffff8800066bc740 ffff880006b663c0
ffff8800068ec000
[ 60.140545] ffff88000613cfd0 ffff880006b663c0 00000000ffffffff
0000000000000246
[ 60.141356] ffff8800068ebbe0 ffffffff818d2bfe ffff88000613cfc8
ffff8800068ebbf0
[ 60.142209] Call Trace:
[ 60.142487] [<ffffffff818d2bfe>] schedule+0x2e/0x70
[ 60.142998] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
[ 60.143657] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
[ 60.144307] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
[ 60.144895] [<ffffffff818d5655>] mutex_lock+0x15/0x30
[ 60.145455] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
[ 60.146117] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
[ 60.146715] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
[ 60.147391] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
[ 60.148081] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
[ 60.148660] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
[ 60.149380] [<ffffffff811c6297>] do_execve+0x27/0x30
[ 60.149905] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
[ 60.150629] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
[ 60.151189] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
[ 60.151787] INFO: task kworker/u2:1:987 blocked for more than 30 seconds.
[ 60.152497] Not tainted 4.3.0-rc4+ #3
[ 60.152927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 60.153735] kworker/u2:1 D ffff8800068efbc8 14152 987 14 0x00000000
[ 60.154524] ffff8800068efbc8 ffff880007234740 ffff880006bf9c80
ffff8800068f0000
[ 60.155389] ffff88000613cfd0 ffff880006bf9c80 00000000ffffffff
0000000000000246
[ 60.156227] ffff8800068efbe0 ffffffff818d2bfe ffff88000613cfc8
ffff8800068efbf0
[ 60.157040] Call Trace:
[ 60.157299] [<ffffffff818d2bfe>] schedule+0x2e/0x70
[ 60.157813] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
[ 60.158478] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
[ 60.159125] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
[ 60.159731] [<ffffffff818d5655>] mutex_lock+0x15/0x30
[ 60.160294] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
[ 60.160864] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
[ 60.161414] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
[ 60.162058] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
[ 60.162711] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
[ 60.163315] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
[ 60.164038] [<ffffffff811c6297>] do_execve+0x27/0x30
[ 60.164586] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
[ 60.165305] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
[ 60.165890] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
[ 60.166490] INFO: task kworker/u2:1:988 blocked for more than 30 seconds.
[ 60.167220] Not tainted 4.3.0-rc4+ #3
[ 60.167682] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 60.168515] kworker/u2:1 D ffff880006913bc8 14152 988 14 0x00000000
[ 60.169270] ffff880006913bc8 ffff8800001663c0 ffff880006bfc740
ffff880006914000
[ 60.170081] ffff88000613cfd0 ffff880006bfc740 00000000ffffffff
0000000000000246
[ 60.170884] ffff880006913be0 ffffffff818d2bfe ffff88000613cfc8
ffff880006913bf0
[ 60.171742] Call Trace:
[ 60.172037] [<ffffffff818d2bfe>] schedule+0x2e/0x70
[ 60.172587] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
[ 60.173263] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
[ 60.173928] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
[ 60.174550] [<ffffffff818d5655>] mutex_lock+0x15/0x30
[ 60.175097] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
[ 60.175703] [<ffffffff811afa1c>] ? alloc_pages_current+0x8c/0x100
[ 60.176354] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
[ 60.176955] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
[ 60.177630] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
[ 60.178303] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
[ 60.178984] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
[ 60.179701] [<ffffffff811c6297>] do_execve+0x27/0x30
[ 60.180268] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
[ 60.181057] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
[ 60.181625] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
[ 60.182233] INFO: task kworker/u2:1:989 blocked for more than 30 seconds.
[ 60.182972] Not tainted 4.3.0-rc4+ #3
[ 60.183420] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 60.184268] kworker/u2:1 D ffff880006917bc8 14152 989 14 0x00000000
[ 60.185063] ffff880006917bc8 ffff880000149c80 ffff880006bf8e40
ffff880006918000
[ 60.185954] ffff88000613cfd0 ffff880006bf8e40 00000000ffffffff
0000000000000246
[ 60.186806] ffff880006917be0 ffffffff818d2bfe ffff88000613cfc8
ffff880006917bf0
[ 60.187672] Call Trace:
[ 60.187937] [<ffffffff818d2bfe>] schedule+0x2e/0x70
[ 60.188495] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
[ 60.189193] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
[ 60.189851] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
[ 60.190472] [<ffffffff818d5655>] mutex_lock+0x15/0x30
[ 60.191043] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
[ 60.191649] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
[ 60.192112] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
[ 60.192678] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
[ 60.193248] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
[ 60.193722] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
[ 60.194291] [<ffffffff811c6297>] do_execve+0x27/0x30
[ 60.194724] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
[ 60.195318] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
[ 60.195781] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40


2015-12-07 19:19:14

by Andy Lutomirski

[permalink] [raw]
Subject: Re: Hang triggered by udev coldplug, looks like a race

On Thu, Dec 3, 2015 at 9:52 PM, Andy Lutomirski <[email protected]> wrote:
> Sometimes udevadm trigger --action=add hangs the system, and the splat
> below happens. This seems to be timing dependent, and I haven't been
> able to trigger it yet with lockdep enabled, sadly.
>
> Any ideas? I not, I'll try to instrument it better tomorrow.

More details: this is caused by a storm of /sbin/hotplug UMH calls
(yes, misconfigured kernel, but still). /sbin is a symlink to
/usr/sbin, /usr/sbin/hotplug doesn't exist, and all of the above is on
rootfs, which is 9p over virtio.

Pointing uevent_helper at /usr/sbin/hotplug (which still doesn't
exist) seems to work around it.

>
> This is 4.4-rc3 plus some patches that I don't think are related. The
> rootfs is 9p.
>
> --Andy
>
> [ 60.048359] INFO: task kworker/u2:1:980 blocked for more than 30 seconds.
> [ 60.051673] Not tainted 4.3.0-rc4+ #3
> [ 60.052149] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 60.053013] kworker/u2:1 D ffff880006823bc8 14152 980 14 0x00000000
> [ 60.053786] ffff880006823bc8 ffff8800072e4740 ffff8800072e63c0
> ffff880006824000
> [ 60.054637] ffff88000613cfd0 ffff8800072e63c0 00000000ffffffff
> 0000000000000246
> [ 60.055447] ffff880006823be0 ffffffff818d2bfe ffff88000613cfc8
> ffff880006823bf0
> [ 60.056250] Call Trace:
> [ 60.056513] [<ffffffff818d2bfe>] schedule+0x2e/0x70
> [ 60.057017] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
> [ 60.057632] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
> [ 60.058242] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
> [ 60.058790] [<ffffffff818d5655>] mutex_lock+0x15/0x30
> [ 60.059272] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
> [ 60.059805] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
> [ 60.060352] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
> [ 60.060966] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
> [ 60.061805] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
> [ 60.062316] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
> [ 60.062916] [<ffffffff811c6297>] do_execve+0x27/0x30
> [ 60.063355] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
> [ 60.063951] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
> [ 60.064416] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
> [ 60.064889] INFO: task kworker/u2:1:981 blocked for more than 30 seconds.
> [ 60.065473] Not tainted 4.3.0-rc4+ #3
> [ 60.065834] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 60.066507] kworker/u2:1 D ffff880006827bc8 14152 981 14 0x00000000
> [ 60.067135] ffff880006827bc8 ffff880006965580 ffff8800072e0e40
> ffff880006828000
> [ 60.067810] ffff88000613cfd0 ffff8800072e0e40 00000000ffffffff
> 0000000000000246
> [ 60.068491] ffff880006827be0 ffffffff818d2bfe ffff88000613cfc8
> ffff880006827bf0
> [ 60.069169] Call Trace:
> [ 60.069386] [<ffffffff818d2bfe>] schedule+0x2e/0x70
> [ 60.069814] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
> [ 60.070361] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
> [ 60.070898] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
> [ 60.071381] [<ffffffff818d5655>] mutex_lock+0x15/0x30
> [ 60.071825] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
> [ 60.072299] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
> [ 60.072756] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
> [ 60.073289] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
> [ 60.073832] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
> [ 60.074316] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
> [ 60.075083] [<ffffffff811c6297>] do_execve+0x27/0x30
> [ 60.075736] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
> [ 60.076648] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
> [ 60.077235] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
> [ 60.077805] INFO: task kworker/u2:1:982 blocked for more than 30 seconds.
> [ 60.078529] Not tainted 4.3.0-rc4+ #3
> [ 60.078969] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 60.079891] kworker/u2:1 D ffff88000682bbc8 14584 982 14 0x00000000
> [ 60.080658] ffff88000682bbc8 ffff8800064cb900 ffff880006bc0e40
> ffff88000682c000
> [ 60.081474] ffff88000613cfd0 ffff880006bc0e40 00000000ffffffff
> 0000000000000246
> [ 60.082282] ffff88000682bbe0 ffffffff818d2bfe ffff88000613cfc8
> ffff88000682bbf0
> [ 60.083095] Call Trace:
> [ 60.083357] [<ffffffff818d2bfe>] schedule+0x2e/0x70
> [ 60.083871] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
> [ 60.084533] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
> [ 60.085181] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
> [ 60.085761] [<ffffffff818d5655>] mutex_lock+0x15/0x30
> [ 60.086300] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
> [ 60.086869] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
> [ 60.087422] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
> [ 60.088067] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
> [ 60.088722] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
> [ 60.089301] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
> [ 60.089991] [<ffffffff811c6297>] do_execve+0x27/0x30
> [ 60.090552] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
> [ 60.091274] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
> [ 60.091832] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
> [ 60.092401] INFO: task kworker/u2:1:983 blocked for more than 30 seconds.
> [ 60.093138] Not tainted 4.3.0-rc4+ #3
> [ 60.093574] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 60.094379] kworker/u2:1 D ffff88000682fbc8 14544 983 14 0x00000000
> [ 60.095139] ffff88000682fbc8 ffff880006bc0e40 ffff88000644c740
> ffff880006830000
> [ 60.095973] ffff88000613cfd0 ffff88000644c740 00000000ffffffff
> 0000000000000246
> [ 60.096871] ffff88000682fbe0 ffffffff818d2bfe ffff88000613cfc8
> ffff88000682fbf0
> [ 60.097686] Call Trace:
> [ 60.097945] [<ffffffff818d2bfe>] schedule+0x2e/0x70
> [ 60.098472] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
> [ 60.099136] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
> [ 60.099786] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
> [ 60.100368] [<ffffffff818d5655>] mutex_lock+0x15/0x30
> [ 60.100902] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
> [ 60.101478] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
> [ 60.102060] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
> [ 60.102706] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
> [ 60.103366] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
> [ 60.103947] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
> [ 60.104647] [<ffffffff811c6297>] do_execve+0x27/0x30
> [ 60.105205] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
> [ 60.105923] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
> [ 60.106494] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
> [ 60.107065] INFO: task kworker/u2:1:984 blocked for more than 30 seconds.
> [ 60.107795] Not tainted 4.3.0-rc4+ #3
> [ 60.108248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 60.109115] kworker/u2:1 D ffff8800068e3bc8 14152 984 14 0x00000000
> [ 60.109881] ffff8800068e3bc8 ffff880006a25580 ffff880006b65580
> ffff8800068e4000
> [ 60.110737] ffff88000613cfd0 ffff880006b65580 00000000ffffffff
> 0000000000000246
> [ 60.111577] ffff8800068e3be0 ffffffff818d2bfe ffff88000613cfc8
> ffff8800068e3bf0
> [ 60.112533] Call Trace:
> [ 60.112798] [<ffffffff818d2bfe>] schedule+0x2e/0x70
> [ 60.113325] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
> [ 60.113990] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
> [ 60.114652] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
> [ 60.115240] [<ffffffff818d5655>] mutex_lock+0x15/0x30
> [ 60.115807] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
> [ 60.116385] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
> [ 60.116942] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
> [ 60.117595] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
> [ 60.118261] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
> [ 60.118845] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
> [ 60.119548] [<ffffffff811c6297>] do_execve+0x27/0x30
> [ 60.120110] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
> [ 60.120836] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
> [ 60.121406] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
> [ 60.121981] INFO: task kworker/u2:1:985 blocked for more than 30 seconds.
> [ 60.122732] Not tainted 4.3.0-rc4+ #3
> [ 60.123177] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 60.123992] kworker/u2:1 D ffff8800068e7bc8 14152 985 14 0x00000000
> [ 60.124759] ffff8800068e7bc8 ffff8800073c0000 ffff880006b61c80
> ffff8800068e8000
> [ 60.125579] ffff88000613cfd0 ffff880006b61c80 00000000ffffffff
> 0000000000000246
> [ 60.126389] ffff8800068e7be0 ffffffff818d2bfe ffff88000613cfc8
> ffff8800068e7bf0
> [ 60.127258] Call Trace:
> [ 60.127525] [<ffffffff818d2bfe>] schedule+0x2e/0x70
> [ 60.128052] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
> [ 60.128720] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
> [ 60.129380] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
> [ 60.129995] [<ffffffff818d5655>] mutex_lock+0x15/0x30
> [ 60.130570] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
> [ 60.131150] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
> [ 60.131733] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
> [ 60.132410] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
> [ 60.133106] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
> [ 60.133692] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
> [ 60.134393] [<ffffffff811c6297>] do_execve+0x27/0x30
> [ 60.134951] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
> [ 60.135711] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
> [ 60.136280] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
> [ 60.136856] INFO: task kworker/u2:1:986 blocked for more than 30 seconds.
> [ 60.137600] Not tainted 4.3.0-rc4+ #3
> [ 60.138073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 60.138881] kworker/u2:1 D ffff8800068ebbc8 14152 986 14 0x00000000
> [ 60.139672] ffff8800068ebbc8 ffff8800066bc740 ffff880006b663c0
> ffff8800068ec000
> [ 60.140545] ffff88000613cfd0 ffff880006b663c0 00000000ffffffff
> 0000000000000246
> [ 60.141356] ffff8800068ebbe0 ffffffff818d2bfe ffff88000613cfc8
> ffff8800068ebbf0
> [ 60.142209] Call Trace:
> [ 60.142487] [<ffffffff818d2bfe>] schedule+0x2e/0x70
> [ 60.142998] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
> [ 60.143657] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
> [ 60.144307] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
> [ 60.144895] [<ffffffff818d5655>] mutex_lock+0x15/0x30
> [ 60.145455] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
> [ 60.146117] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
> [ 60.146715] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
> [ 60.147391] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
> [ 60.148081] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
> [ 60.148660] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
> [ 60.149380] [<ffffffff811c6297>] do_execve+0x27/0x30
> [ 60.149905] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
> [ 60.150629] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
> [ 60.151189] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
> [ 60.151787] INFO: task kworker/u2:1:987 blocked for more than 30 seconds.
> [ 60.152497] Not tainted 4.3.0-rc4+ #3
> [ 60.152927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 60.153735] kworker/u2:1 D ffff8800068efbc8 14152 987 14 0x00000000
> [ 60.154524] ffff8800068efbc8 ffff880007234740 ffff880006bf9c80
> ffff8800068f0000
> [ 60.155389] ffff88000613cfd0 ffff880006bf9c80 00000000ffffffff
> 0000000000000246
> [ 60.156227] ffff8800068efbe0 ffffffff818d2bfe ffff88000613cfc8
> ffff8800068efbf0
> [ 60.157040] Call Trace:
> [ 60.157299] [<ffffffff818d2bfe>] schedule+0x2e/0x70
> [ 60.157813] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
> [ 60.158478] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
> [ 60.159125] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
> [ 60.159731] [<ffffffff818d5655>] mutex_lock+0x15/0x30
> [ 60.160294] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
> [ 60.160864] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
> [ 60.161414] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
> [ 60.162058] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
> [ 60.162711] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
> [ 60.163315] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
> [ 60.164038] [<ffffffff811c6297>] do_execve+0x27/0x30
> [ 60.164586] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
> [ 60.165305] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
> [ 60.165890] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
> [ 60.166490] INFO: task kworker/u2:1:988 blocked for more than 30 seconds.
> [ 60.167220] Not tainted 4.3.0-rc4+ #3
> [ 60.167682] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 60.168515] kworker/u2:1 D ffff880006913bc8 14152 988 14 0x00000000
> [ 60.169270] ffff880006913bc8 ffff8800001663c0 ffff880006bfc740
> ffff880006914000
> [ 60.170081] ffff88000613cfd0 ffff880006bfc740 00000000ffffffff
> 0000000000000246
> [ 60.170884] ffff880006913be0 ffffffff818d2bfe ffff88000613cfc8
> ffff880006913bf0
> [ 60.171742] Call Trace:
> [ 60.172037] [<ffffffff818d2bfe>] schedule+0x2e/0x70
> [ 60.172587] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
> [ 60.173263] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
> [ 60.173928] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
> [ 60.174550] [<ffffffff818d5655>] mutex_lock+0x15/0x30
> [ 60.175097] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
> [ 60.175703] [<ffffffff811afa1c>] ? alloc_pages_current+0x8c/0x100
> [ 60.176354] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
> [ 60.176955] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
> [ 60.177630] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
> [ 60.178303] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
> [ 60.178984] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
> [ 60.179701] [<ffffffff811c6297>] do_execve+0x27/0x30
> [ 60.180268] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
> [ 60.181057] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
> [ 60.181625] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40
> [ 60.182233] INFO: task kworker/u2:1:989 blocked for more than 30 seconds.
> [ 60.182972] Not tainted 4.3.0-rc4+ #3
> [ 60.183420] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 60.184268] kworker/u2:1 D ffff880006917bc8 14152 989 14 0x00000000
> [ 60.185063] ffff880006917bc8 ffff880000149c80 ffff880006bf8e40
> ffff880006918000
> [ 60.185954] ffff88000613cfd0 ffff880006bf8e40 00000000ffffffff
> 0000000000000246
> [ 60.186806] ffff880006917be0 ffffffff818d2bfe ffff88000613cfc8
> ffff880006917bf0
> [ 60.187672] Call Trace:
> [ 60.187937] [<ffffffff818d2bfe>] schedule+0x2e/0x70
> [ 60.188495] [<ffffffff818d2f29>] schedule_preempt_disabled+0x9/0x10
> [ 60.189193] [<ffffffff818d5457>] __mutex_lock_slowpath+0x107/0x2f0
> [ 60.189851] [<ffffffff811c9123>] ? lookup_fast+0x133/0x2e0
> [ 60.190472] [<ffffffff818d5655>] mutex_lock+0x15/0x30
> [ 60.191043] [<ffffffff811cc76f>] path_openat+0x63f/0x1220
> [ 60.191649] [<ffffffff811ce339>] do_filp_open+0x79/0xd0
> [ 60.192112] [<ffffffff8130a136>] ? selinux_cred_prepare+0x16/0x30
> [ 60.192678] [<ffffffff8130449e>] ? security_prepare_creds+0x3e/0x60
> [ 60.193248] [<ffffffff811c403f>] do_open_execat+0x5f/0x150
> [ 60.193722] [<ffffffff811c5d53>] do_execveat_common.isra.33+0x1a3/0x6c0
> [ 60.194291] [<ffffffff811c6297>] do_execve+0x27/0x30
> [ 60.194724] [<ffffffff810b8d2b>] call_usermodehelper_exec_async+0xeb/0x140
> [ 60.195318] [<ffffffff818d7712>] ret_from_fork+0x22/0x40
> [ 60.195781] [<ffffffff810b8c40>] ? umh_complete+0x40/0x40



--
Andy Lutomirski
AMA Capital Management, LLC

2015-12-07 22:53:51

by Dominique Martinet

[permalink] [raw]
Subject: Re: [V9fs-developer] Hang triggered by udev coldplug, looks like a race

Andy Lutomirski wrote on Mon, Dec 07, 2015:
> On Thu, Dec 3, 2015 at 9:52 PM, Andy Lutomirski <[email protected]> wrote:
> > Sometimes udevadm trigger --action=add hangs the system, and the splat
> > below happens. This seems to be timing dependent, and I haven't been
> > able to trigger it yet with lockdep enabled, sadly.
> >
> > Any ideas? I not, I'll try to instrument it better tomorrow.
>
> More details: this is caused by a storm of /sbin/hotplug UMH calls
> (yes, misconfigured kernel, but still). /sbin is a symlink to
> /usr/sbin, /usr/sbin/hotplug doesn't exist, and all of the above is on
> rootfs, which is 9p over virtio.
>
> Pointing uevent_helper at /usr/sbin/hotplug (which still doesn't
> exist) seems to work around it.

Can you reproduce it on a booted system with something like
`seq 1 1000000 | xargs -P 1024 -I{} cat /sbin/foo >&/dev/null` ?

(trying execs might be closer to your workload, not sure how much this
or using umh might change)


Also, what qemu version please just to try to match your environment ?

--
Dominique Martinet

2015-12-08 02:00:05

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [V9fs-developer] Hang triggered by udev coldplug, looks like a race

On Mon, Dec 7, 2015 at 2:46 PM, Dominique Martinet
<[email protected]> wrote:
> Andy Lutomirski wrote on Mon, Dec 07, 2015:
>> On Thu, Dec 3, 2015 at 9:52 PM, Andy Lutomirski <[email protected]> wrote:
>> > Sometimes udevadm trigger --action=add hangs the system, and the splat
>> > below happens. This seems to be timing dependent, and I haven't been
>> > able to trigger it yet with lockdep enabled, sadly.
>> >
>> > Any ideas? I not, I'll try to instrument it better tomorrow.
>>
>> More details: this is caused by a storm of /sbin/hotplug UMH calls
>> (yes, misconfigured kernel, but still). /sbin is a symlink to
>> /usr/sbin, /usr/sbin/hotplug doesn't exist, and all of the above is on
>> rootfs, which is 9p over virtio.
>>
>> Pointing uevent_helper at /usr/sbin/hotplug (which still doesn't
>> exist) seems to work around it.
>
> Can you reproduce it on a booted system with something like
> `seq 1 1000000 | xargs -P 1024 -I{} cat /sbin/foo >&/dev/null` ?

This doesn't reproduce it.

This doesn't either:

seq 1 1000000 | xargs -P 1024 -I{} bash -c 'exec /sbin/foo' &>/dev/null

>
> (trying execs might be closer to your workload, not sure how much this
> or using umh might change)
>
>
> Also, what qemu version please just to try to match your environment ?

$ qemu-system-x86_64 --version
QEMU emulator version 2.4.1 (qemu-2.4.1-1.fc23), Copyright (c)
2003-2008 Fabrice Bellard

My reproducer is:

$ virtme-run --kdir . --pwd

using this virtme version:

https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git/commit/?id=17363c2900e8b796c80c920c6fcdcc6747784ef7

Bad kernel config attached. This config with v4.4-rc3 (and no
additional patches) reproduces it reliably for me.

With the latest virtme, I don't reproduce it -- the latest virtme
turns off uevent_helper early in boot, which suppresses the bug for
me, at least most of the time.


If I dump all task states (see attached typescript), I see a bunch of
things blocked in 9p rpc. This makes me think it could be a QEMU bug,
not a kernel bug.

--Andy



--
Andy Lutomirski
AMA Capital Management, LLC


Attachments:
bad-config.xz (20.85 kB)
typescript.xz (18.15 kB)
Download all attachments

2015-12-08 02:33:35

by Al Viro

[permalink] [raw]
Subject: Re: [V9fs-developer] Hang triggered by udev coldplug, looks like a race

On Mon, Dec 07, 2015 at 05:59:41PM -0800, Andy Lutomirski wrote:

> If I dump all task states (see attached typescript), I see a bunch of
> things blocked in 9p rpc. This makes me think it could be a QEMU bug,
> not a kernel bug.

Maybe, maybe not - I'd suggest dumping the 9p traffic and checking the
request and response IDs; at least one bug in that area got fixed
(see commit a84b69 for gory details).

Really, check if something gets screwed with IDs. They should be unique
among outstanding requests and never reused until either a response
with the matching ID arrives, *or* TFLUSH on that ID is sent and reply
has arrived. If those rules get buggered, it's very easy to get hangs
like that.

2015-12-08 22:49:42

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [V9fs-developer] Hang triggered by udev coldplug, looks like a race

On Mon, Dec 7, 2015 at 6:33 PM, Al Viro <[email protected]> wrote:
> On Mon, Dec 07, 2015 at 05:59:41PM -0800, Andy Lutomirski wrote:
>
>> If I dump all task states (see attached typescript), I see a bunch of
>> things blocked in 9p rpc. This makes me think it could be a QEMU bug,
>> not a kernel bug.
>
> Maybe, maybe not - I'd suggest dumping the 9p traffic and checking the
> request and response IDs; at least one bug in that area got fixed
> (see commit a84b69 for gory details).
>
> Really, check if something gets screwed with IDs. They should be unique
> among outstanding requests and never reused until either a response
> with the matching ID arrives, *or* TFLUSH on that ID is sent and reply
> has arrived. If those rules get buggered, it's very easy to get hangs
> like that.

Trace attached. I don't see anything wrong, but I also don't know
what I'm looking for.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC


Attachments:
9ptrace.txt.xz (19.45 kB)

2015-12-09 06:23:34

by Dominique Martinet

[permalink] [raw]
Subject: Re: [V9fs-developer] Hang triggered by udev coldplug, looks like a race

Andy Lutomirski wrote on Tue, Dec 08, 2015:
> Trace attached. I don't see anything wrong, but I also don't know
> what I'm looking for.

Actually doesn't look good, not sure if trace could be missing messages
but it looks like tags get reused...

Quick and dirty parse script (attached output, it doesn't handle TFLUSH
but there are none in his message):

awk '{print NR " " $0}' 9ptrace.txt | \
awk '/^[0-9]* T/ {
if ($5 in tag) {
print "DOUBLETAG " $0 " || had " tag[$5];
} else {
tag[$5] = $0;
}
}
/^[0-9]* R/ {
if (! $5 in tag ) {
print "MISSTAG " $0;
} else {
delete tag[$5];
}
}'


--
Dominique Martinet


Attachments:
(No filename) (629.00 B)
9pdoubles.xz (5.39 kB)
Download all attachments

2015-12-09 06:41:50

by Dominique Martinet

[permalink] [raw]
Subject: Re: [V9fs-developer] Hang triggered by udev coldplug, looks like a race

Dominique Martinet wrote on Wed, Dec 09, 2015:
> Andy Lutomirski wrote on Tue, Dec 08, 2015:
> > Trace attached. I don't see anything wrong, but I also don't know
> > what I'm looking for.
>
> Actually doesn't look good, not sure if trace could be missing messages
> but it looks like tags get reused...

talked too fast as usual, it looks like traces are just missing RCLUNKS
so *from what I can see* there's no overlap

--
Dominique

2015-12-09 06:45:47

by Al Viro

[permalink] [raw]
Subject: Re: [V9fs-developer] Hang triggered by udev coldplug, looks like a race

On Wed, Dec 09, 2015 at 07:23:16AM +0100, Dominique Martinet wrote:
> Andy Lutomirski wrote on Tue, Dec 08, 2015:
> > Trace attached. I don't see anything wrong, but I also don't know
> > what I'm looking for.
>
> Actually doesn't look good, not sure if trace could be missing messages
> but it looks like tags get reused...
>
> Quick and dirty parse script (attached output, it doesn't handle TFLUSH
> but there are none in his message):

There's also not a single RCLUNK in there, which looks like missing packets.
AFAICS, all (apparent) tag reuses caught by your awk script are of that nature.
And all RERROR are plain "no such file" in response to TWALK; all those
TCLUNK in there get no reply whatsoever - neither success nor failure...

2015-12-17 21:07:18

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [V9fs-developer] Hang triggered by udev coldplug, looks like a race

On Tue, Dec 8, 2015 at 10:45 PM, Al Viro <[email protected]> wrote:
> On Wed, Dec 09, 2015 at 07:23:16AM +0100, Dominique Martinet wrote:
>> Andy Lutomirski wrote on Tue, Dec 08, 2015:
>> > Trace attached. I don't see anything wrong, but I also don't know
>> > what I'm looking for.
>>
>> Actually doesn't look good, not sure if trace could be missing messages
>> but it looks like tags get reused...
>>
>> Quick and dirty parse script (attached output, it doesn't handle TFLUSH
>> but there are none in his message):
>
> There's also not a single RCLUNK in there, which looks like missing packets.
> AFAICS, all (apparent) tag reuses caught by your awk script are of that nature.
> And all RERROR are plain "no such file" in response to TWALK; all those
> TCLUNK in there get no reply whatsoever - neither success nor failure...

This could be QEMU's analysis script screwing up. Is there a good way
for me to gather more info?

--Andy

2015-12-24 10:52:10

by Dominique Martinet

[permalink] [raw]
Subject: Re: [V9fs-developer] Hang triggered by udev coldplug, looks like a race

Andy Lutomirski wrote on Thu, Dec 17, 2015:
> This could be QEMU's analysis script screwing up. Is there a good way
> for me to gather more info?

I finally took some time to reproduce it (sorry for the delay)

Using your config, virtme commit (17363c2) and kernel tag v4.4-rc3 I was
able to reproduce it just fine with my qemu (2.4.90)

Now for the fun bit... I ran it with a gdb server, attaching gdb and
running cont always 'unblocks' it
Using the kernel gdb scripts (lx-ps) I see about 250 kworker threads
running, the backtraces all look the same:

[ 20.273945] [<ffffffff818c3910>] schedule+0x30/0x80
[ 20.274644] [<ffffffff818c3c39>] schedule_preempt_disabled+0x9/0x10
[ 20.275539] [<ffffffff818c6147>] __mutex_lock_slowpath+0x107/0x2f0
[ 20.276421] [<ffffffff811cf02e>] ? lookup_fast+0xbe/0x320
[ 20.277195] [<ffffffff818c6345>] mutex_lock+0x15/0x30
[ 20.277916] [<ffffffff811d0df7>] walk_component+0x1a7/0x270


so given it unblocks after hooking gdb + cont I'm actually thinking this
might be a pure scheduling issue? (e.g. thread is never re-scheduled or
something like that?)
I can't see any task not in schedule() in your sysrq dump task
transcript either.


Not sure how to go around debugging that, to be honest.
I've tried both default one virtual cpu and -smp 3 or 4 and both can
reproduce it; cpu usage on the host is always low so it doesn't look
like there's any busy-polling involved.. This is a pretty subtle bug we
have there..

--
Dominique Martinet

2015-12-30 06:43:53

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [V9fs-developer] Hang triggered by udev coldplug, looks like a race

[add cc's]

Hi scheduler people:

This is relatively easy for me to reproduce. Any hints for debugging
it? Could we really have a bug in which processes that are
schedulable as a result of mutex unlock aren't always reliably
scheduled?

--Andy

On Thu, Dec 24, 2015 at 2:51 AM, Dominique Martinet
<[email protected]> wrote:
> Andy Lutomirski wrote on Thu, Dec 17, 2015:
>> This could be QEMU's analysis script screwing up. Is there a good way
>> for me to gather more info?
>
> I finally took some time to reproduce it (sorry for the delay)
>
> Using your config, virtme commit (17363c2) and kernel tag v4.4-rc3 I was
> able to reproduce it just fine with my qemu (2.4.90)
>
> Now for the fun bit... I ran it with a gdb server, attaching gdb and
> running cont always 'unblocks' it
> Using the kernel gdb scripts (lx-ps) I see about 250 kworker threads
> running, the backtraces all look the same:
>
> [ 20.273945] [<ffffffff818c3910>] schedule+0x30/0x80
> [ 20.274644] [<ffffffff818c3c39>] schedule_preempt_disabled+0x9/0x10
> [ 20.275539] [<ffffffff818c6147>] __mutex_lock_slowpath+0x107/0x2f0
> [ 20.276421] [<ffffffff811cf02e>] ? lookup_fast+0xbe/0x320
> [ 20.277195] [<ffffffff818c6345>] mutex_lock+0x15/0x30
> [ 20.277916] [<ffffffff811d0df7>] walk_component+0x1a7/0x270
>
>
> so given it unblocks after hooking gdb + cont I'm actually thinking this
> might be a pure scheduling issue? (e.g. thread is never re-scheduled or
> something like that?)
> I can't see any task not in schedule() in your sysrq dump task
> transcript either.
>
>
> Not sure how to go around debugging that, to be honest.
> I've tried both default one virtual cpu and -smp 3 or 4 and both can
> reproduce it; cpu usage on the host is always low so it doesn't look
> like there's any busy-polling involved.. This is a pretty subtle bug we
> have there..
>
> --
> Dominique Martinet



--
Andy Lutomirski
AMA Capital Management, LLC