2020-04-30 03:04:07

by Qian Cai

[permalink] [raw]
Subject: rcu_barrier() + membarrier() + scheduler deadlock?

Doing some fuzzers on many CPUs triggers a deadlock really quickly. I can see that there were several tasks had been stuck for a while,

CPUA:
slab_caches_to_rcu_destroy_workfn()
rcu_barrier()
wait_for_completion()

CPUB:
sched_setaffinity()
__set_cpus_allowed_ptr()
stop_one_cpu()

CPUC:
__x64_sys_membarrier
synchronize_rcu()
__wait_rcu_gp()
wait_for_completion()

Lockdep did not flag anything useful at all. Any clue?

[52885.040113][T150406] futex_wake_op: trinity-c9 tries to shift op by -1189; fix this program
[52890.366916][T150196] futex_wake_op: trinity-c3 tries to shift op by -1; fix this program
[52896.369779][T151054] futex_wake_op: trinity-c23 tries to shift op by 710; fix this program
[52901.770688][ T310] INFO: task kworker/36:2:142207 blocked for more than 122 seconds.
[52901.811471][ T310] Tainted: G L 5.7.0-rc3-next-20200429 #2
[52901.849148][ T310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[52901.889803][ T310] kworker/36:2 D29104 142207 2 0x90004000
[52901.919461][ T310] Workqueue: events slab_caches_to_rcu_destroy_workfn
[52901.950935][ T310] Call Trace:
[52901.965369][ T310] __schedule+0x77e/0x1070
[52901.986211][ T310] ? __sched_text_start+0x8/0x8
[52902.008267][ T310] ? trace_hardirqs_on+0x3a/0x160
[52902.031764][ T310] schedule+0x95/0x160
[52902.050729][ T310] schedule_timeout+0x47c/0x6a0
[52902.072897][ T310] ? print_irqtrace_events+0x110/0x110
[52902.098316][ T310] ? usleep_range+0x100/0x100
[52902.119664][ T310] ? mark_held_locks+0x86/0xb0
[52902.141397][ T310] ? _raw_spin_unlock_irq+0x27/0x40
[52902.165326][ T310] ? _raw_spin_unlock_irq+0x27/0x40
[52902.189622][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
[52902.213641][ T310] ? wait_for_completion+0x116/0x1a0
[52902.238757][ T310] ? trace_hardirqs_on+0x3a/0x160
[52902.261648][ T310] wait_for_completion+0x11e/0x1a0
[52902.285516][ T310] ? wait_for_completion_interruptible+0x220/0x220
[52902.316342][ T310] ? rcu_read_lock_held+0xc0/0xc0
[52902.340298][ T310] rcu_barrier+0x2c5/0x440
[52902.361736][ T310] slab_caches_to_rcu_destroy_workfn+0x95/0xe0
[52902.391109][ T310] process_one_work+0x57e/0xb90
[52902.413244][ T310] ? pwq_dec_nr_in_flight+0x170/0x170
[52902.437957][ T310] ? worker_thread+0x14b/0x5b0
[52902.460357][ T310] worker_thread+0x63/0x5b0
[52902.481267][ T310] ? process_one_work+0xb90/0xb90
[52902.504818][ T310] kthread+0x1f7/0x220
[52902.523542][ T310] ? kthread_create_worker_on_cpu+0xc0/0xc0
[52902.550801][ T310] ret_from_fork+0x3a/0x50
[52902.560427][T151328] futex_wake_op: trinity-c10 tries to shift op by -1; fix this program
[52902.571250][ T310] INFO: task trinity-c18:145344 can't die for more than 123 seconds.
[52902.648554][ T310] trinity-c18 R running task 26944 145344 74932 0x10004004
[52902.700179][ T310] Call Trace:
[52902.715039][ T310] __schedule+0x77e/0x1070
[52902.735531][ T310] ? __sched_text_start+0x8/0x8
[52902.757922][ T310] ? do_syscall_64+0x28d/0xaf0
[52902.779731][ T310] ? do_syscall_64+0x28d/0xaf0
[52902.802215][ T310] schedule+0x95/0x160
[52902.821375][ T310] do_syscall_64+0x23e/0xaf0
[52902.843517][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
[52902.871003][ T310] ? ftrace_syscall_enter+0x4b0/0x4b0
[52902.896034][ T310] ? syscall_return_slowpath+0x580/0x580
[52902.922153][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
[52902.950409][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
[52902.976721][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
[52903.002409][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[52903.030018][ T310] RIP: 0033:0x7fa3d9cd670d
[52903.050021][ T310] Code: Bad RIP value.
[52903.068733][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
[52903.107680][ T310] RAX: 0000000000000000 RBX: 00000000000000cb RCX: 00007fa3d9cd670d
[52903.144861][ T310] RDX: 00007fa3d7c38000 RSI: 0000000000000004 RDI: 0000000000000000
[52903.182189][ T310] RBP: 00000000000000cb R08: 000000000000ca23 R09: 00000000000000b3
[52903.219290][ T310] R10: 000000002ab3d706 R11: 0000000000000246 R12: 0000000000000002
[52903.256653][ T310] R13: 00007fa3d8608058 R14: 00007fa3d9b956c0 R15: 00007fa3d8608000
[52903.294988][ T310] INFO: task trinity-c24:146199 can't die for more than 124 seconds.
[52903.333911][ T310] trinity-c24 R running task 26944 146199 74932 0x10000004
[52903.377051][ T310] Call Trace:
[52903.393139][ T310] __schedule+0x77e/0x1070
[52903.413685][ T310] ? __sched_text_start+0x8/0x8
[52903.435602][ T310] ? check_flags.part.28+0x86/0x220
[52903.459452][ T310] preempt_schedule_common+0x16/0x50
[52903.483985][ T310] _cond_resched+0x22/0x30
[52903.504130][ T310] stop_one_cpu+0xd7/0x140
[52903.524695][ T310] ? stop_cpus.constprop.3+0x130/0x130
[52903.550266][ T310] ? task_rq_unlock+0x6e/0x80
[52903.572005][ T310] ? sched_ttwu_pending+0x1b0/0x1b0
[52903.595973][ T310] ? lock_downgrade+0x3e0/0x3e0
[52903.618468][ T310] ? mark_held_locks+0x34/0xb0
[52903.640239][ T310] ? task_rq_unlock+0x6e/0x80
[52903.662057][ T310] __set_cpus_allowed_ptr+0x382/0x3a0
[52903.686744][ T310] ? move_queued_task+0x3c0/0x3c0
[52903.709784][ T310] ? cpuset_cpus_allowed+0x10b/0x200
[52903.734212][ T310] ? trace_hardirqs_on+0x3a/0x160
[52903.757482][ T310] ? __bitmap_and+0xd8/0x100
[52903.779002][ T310] sched_setaffinity+0x476/0x660
[52903.802046][ T310] ? __x64_sys_sched_getattr+0x3e0/0x3e0
[52903.827730][ T310] ? __kasan_check_write+0x14/0x20
[52903.851859][ T310] __x64_sys_sched_setaffinity+0x118/0x180
[52903.881315][ T310] ? sched_setaffinity+0x660/0x660
[52903.906012][ T310] ? do_syscall_64+0x79/0xaf0
[52903.927530][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
[52903.951683][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
[52903.976607][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
[52904.004024][ T310] do_syscall_64+0xcc/0xaf0
[52904.023909][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
[52904.047409][ T310] ? syscall_return_slowpath+0x580/0x580
[52904.072656][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
[52904.099856][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
[52904.125615][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
[52904.167434][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[52904.193732][ T310] RIP: 0033:0x7fa3d9cd670d
[52904.213368][ T310] Code: Bad RIP value.
[52904.231318][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
[52904.269002][ T310] RAX: ffffffffffffffda RBX: 00000000000000cb RCX: 00007fa3d9cd670d
[52904.344529][ T310] RDX: 00007fa3d803c000 RSI: 0000000000000200 RDI: 0000000000000000
[52904.382761][ T310] RBP: 00000000000000cb R08: 0001802261008040 R09: 00000000000000ba
[52904.422223][ T310] R10: 00000000000000a8 R11: 0000000000000246 R12: 0000000000000002
[52904.458152][ T310] R13: 00007fa3d85de058 R14: 00007fa3d9b956c0 R15: 00007fa3d85de000
[52904.494711][ T310] INFO: task trinity-c6:146436 can't die for more than 125 seconds.
[52904.530683][ T310] trinity-c6 R running task 27328 146436 74932 0x10000004
[52904.567167][ T310] Call Trace:
[52904.582145][ T310] __schedule+0x77e/0x1070
[52904.601969][ T310] ? __sched_text_start+0x8/0x8
[52904.623515][ T310] ? check_flags.part.28+0x86/0x220
[52904.647020][ T310] preempt_schedule_common+0x16/0x50
[52904.671405][ T310] _cond_resched+0x22/0x30
[52904.690871][ T310] stop_one_cpu+0xd7/0x140
[52904.710359][ T310] ? stop_cpus.constprop.3+0x130/0x130
[52904.734818][ T310] ? task_rq_unlock+0x6e/0x80
[52904.755537][ T310] ? sched_ttwu_pending+0x1b0/0x1b0
[52904.778632][ T310] ? lock_downgrade+0x3e0/0x3e0
[52904.800232][ T310] ? mark_held_locks+0x34/0xb0
[52904.821408][ T310] ? task_rq_unlock+0x6e/0x80
[52904.842299][ T310] __set_cpus_allowed_ptr+0x382/0x3a0
[52904.866634][ T310] ? move_queued_task+0x3c0/0x3c0
[52904.890341][ T310] ? cpuset_cpus_allowed+0x10b/0x200
[52904.916217][ T310] ? trace_hardirqs_on+0x3a/0x160
[52904.939059][ T310] ? __bitmap_and+0xd8/0x100
[52904.959333][ T310] sched_setaffinity+0x476/0x660
[52904.981422][ T310] ? __x64_sys_sched_getattr+0x3e0/0x3e0
[52905.006183][ T310] ? copy_user_generic_unrolled+0x9e/0xc0
[52905.031792][ T310] __x64_sys_sched_setaffinity+0x118/0x180
[52905.057598][ T310] ? sched_setaffinity+0x660/0x660
[52905.080302][ T310] ? do_syscall_64+0x79/0xaf0
[52905.101464][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
[52905.126828][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
[52905.150334][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
[52905.177570][ T310] do_syscall_64+0xcc/0xaf0
[52905.197688][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
[52905.221392][ T310] ? syscall_return_slowpath+0x580/0x580
[52905.246586][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
[52905.273733][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
[52905.300142][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
[52905.324980][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[52905.351394][ T310] RIP: 0033:0x7fa3d9cd670d
[52905.371850][ T310] Code: Bad RIP value.
[52905.390680][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
[52905.432007][ T310] RAX: ffffffffffffffda RBX: 00000000000000cb RCX: 00007fa3d9cd670d
[52905.470212][ T310] RDX: 00007fa3d7e3c000 RSI: 0000000000000004 RDI: 0000000000000000
[52905.507836][ T310] RBP: 00000000000000cb R08: 00000000fffffff9 R09: ffffffffffff896f
[52905.545501][ T310] R10: 000000000000fff7 R11: 0000000000000246 R12: 0000000000000002
[52905.581885][ T310] R13: 00007fa3d865c058 R14: 00007fa3d9b956c0 R15: 00007fa3d865c000
[52905.620176][ T310] INFO: task trinity-c38:147003 can't die for more than 126 seconds.
[52905.656622][ T310] trinity-c38 R running task 27880 147003 74932 0x10000004
[52905.692374][ T310] Call Trace:
[52905.706772][ T310] __schedule+0x77e/0x1070
[52905.726246][ T310] ? __sched_text_start+0x8/0x8
[52905.747783][ T310] ? check_flags.part.28+0x86/0x220
[52905.770984][ T310] preempt_schedule_common+0x16/0x50
[52905.794507][ T310] _cond_resched+0x22/0x30
[52905.814101][ T310] stop_one_cpu+0xd7/0x140
[52905.833704][ T310] ? stop_cpus.constprop.3+0x130/0x130
[52905.858045][ T310] ? task_rq_unlock+0x6e/0x80
[52905.879452][ T310] ? sched_ttwu_pending+0x1b0/0x1b0
[52905.903104][ T310] ? lock_downgrade+0x3e0/0x3e0
[52905.927387][ T310] ? mark_held_locks+0x34/0xb0
[52905.949981][ T310] ? task_rq_unlock+0x6e/0x80
[52905.970691][ T310] __set_cpus_allowed_ptr+0x382/0x3a0
[52905.994868][ T310] ? move_queued_task+0x3c0/0x3c0
[52906.017315][ T310] ? cpuset_cpus_allowed+0x10b/0x200
[52906.040870][ T310] ? trace_hardirqs_on+0x3a/0x160
[52906.063266][ T310] ? __bitmap_and+0xd8/0x100
[52906.083575][ T310] sched_setaffinity+0x476/0x660
[52906.105559][ T310] ? __x64_sys_sched_getattr+0x3e0/0x3e0
[52906.130620][ T310] ? __kasan_check_write+0x14/0x20
[52906.153322][ T310] __x64_sys_sched_setaffinity+0x118/0x180
[52906.179289][ T310] ? sched_setaffinity+0x660/0x660
[52906.202438][ T310] ? do_syscall_64+0x79/0xaf0
[52906.223078][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
[52906.246181][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
[52906.269667][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
[52906.296537][ T310] do_syscall_64+0xcc/0xaf0
[52906.316130][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
[52906.339641][ T310] ? syscall_return_slowpath+0x580/0x580
[52906.364738][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
[52906.391937][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
[52906.419260][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
[52906.446669][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[52906.473222][ T310] RIP: 0033:0x7fa3d9cd670d
[52906.493314][ T310] Code: Bad RIP value.
[52906.511339][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
[52906.549312][ T310] RAX: ffffffffffffffda RBX: 00000000000000cb RCX: 00007fa3d9cd670d
[52906.585138][ T310] RDX: 00007fa3d7c3c000 RSI: 0000000000000008 RDI: 0000000000000000
[52906.620955][ T310] RBP: 00000000000000cb R08: fffffffffffffff8 R09: fffffffffffffffc
[52906.657616][ T310] R10: 0000000000000200 R11: 0000000000000246 R12: 0000000000000002
[52906.694196][ T310] R13: 00007fa3d857c058 R14: 00007fa3d9b956c0 R15: 00007fa3d857c000
[52906.730152][ T310] INFO: task trinity-c2:147285 can't die for more than 127 seconds.
[52906.766086][ T310] trinity-c2 R running task 27896 147285 74932 0x10000004
[52906.801993][ T310] Call Trace:
[52906.816338][ T310] __schedule+0x77e/0x1070
[52906.836024][ T310] ? __sched_text_start+0x8/0x8
[52906.857547][ T310] ? up_write+0xe9/0x290
[52906.876797][ T310] schedule+0x95/0x160
[52906.915571][ T310] schedule_timeout+0x47c/0x6a0
[52906.939198][ T310] ? usleep_range+0x100/0x100
[52906.961983][ T310] ? lock_downgrade+0x3e0/0x3e0
[52906.983597][ T310] ? mark_held_locks+0x34/0xb0
[52907.005053][ T310] ? _raw_spin_unlock_irq+0x27/0x40
[52907.028158][ T310] ? _raw_spin_unlock_irq+0x27/0x40
[52907.051323][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
[52907.074918][ T310] ? wait_for_completion+0x116/0x1a0
[52907.098366][ T310] ? trace_hardirqs_on+0x3a/0x160
[52907.120794][ T310] wait_for_completion+0x11e/0x1a0
[52907.143566][ T310] ? wait_for_completion_interruptible+0x220/0x220
[52907.172861][ T310] __wait_rcu_gp+0x1ad/0x1e0
[52907.193176][ T310] synchronize_rcu+0x128/0x1a0
[52907.214557][ T310] ? synchronize_rcu_expedited+0x5e0/0x5e0
[52907.240532][ T310] ? match_held_lock+0x20/0x270
[52907.262106][ T310] ? invoke_rcu_core+0xf0/0xf0
[52907.282779][ T310] ? check_chain_key+0x1df/0x2e0
[52907.304629][ T310] ? find_held_lock+0xca/0xf0
[52907.325381][ T310] ? __kasan_check_read+0x11/0x20
[52907.347666][ T310] __x64_sys_membarrier+0x205/0x62e
[52907.370924][ T310] ? membarrier_exec_mmap+0x50/0x50
[52907.394034][ T310] ? do_syscall_64+0x79/0xaf0
[52907.414823][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
[52907.440213][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
[52907.466322][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
[52907.494190][ T310] do_syscall_64+0xcc/0xaf0
[52907.514834][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
[52907.539609][ T310] ? syscall_return_slowpath+0x580/0x580
[52907.564670][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
[52907.593100][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
[52907.619749][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
[52907.645538][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[52907.673066][ T310] RIP: 0033:0x7fa3d9cd670d
[52907.693218][ T310] Code: Bad RIP value.
[52907.711803][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000144
[52907.750495][ T310] RAX: ffffffffffffffda RBX: 0000000000000144 RCX: 00007fa3d9cd670d
[52907.786490][ T310] RDX: 0000000000000080 RSI: 0000000000000000 RDI: 0000000000000001
[52907.822363][ T310] RBP: 0000000000000144 R08: 00000000eeeeeeee R09: 0000000000003418
[52907.858106][ T310] R10: 0000000000004ff8 R11: 0000000000000246 R12: 0000000000000002
[52907.894476][ T310] R13: 00007fa3d8678058 R14: 00007fa3d9b956c0 R15: 00007fa3d8678000
[52907.930457][ T310] INFO: task trinity-c27:149010 can't die for more than 129 seconds.
[52907.971719][ T310] trinity-c27 R running task 26944 149010 74932 0x10000004
[52908.008456][ T310] Call Trace:
[52908.022892][ T310] __schedule+0x77e/0x1070
[52908.042479][ T310] ? __sched_text_start+0x8/0x8
[52908.063993][ T310] ? check_flags.part.28+0x86/0x220
[52908.087103][ T310] preempt_schedule_common+0x16/0x50
[52908.110659][ T310] _cond_resched+0x22/0x30
[52908.130167][ T310] stop_one_cpu+0xd7/0x140
[52908.149705][ T310] ? stop_cpus.constprop.3+0x130/0x130
[52908.174087][ T310] ? task_rq_unlock+0x6e/0x80
[52908.194356][ T310] ? sched_ttwu_pending+0x1b0/0x1b0
[52908.217715][ T310] ? lock_downgrade+0x3e0/0x3e0
[52908.239236][ T310] ? mark_held_locks+0x34/0xb0
[52908.260471][ T310] ? task_rq_unlock+0x6e/0x80
[52908.281298][ T310] __set_cpus_allowed_ptr+0x382/0x3a0
[52908.305375][ T310] ? move_queued_task+0x3c0/0x3c0
[52908.327364][ T310] ? cpuset_cpus_allowed+0x10b/0x200
[52908.350892][ T310] ? trace_hardirqs_on+0x3a/0x160
[52908.373276][ T310] ? __bitmap_and+0xd8/0x100
[52908.393650][ T310] sched_setaffinity+0x476/0x660
[52908.397305][T151560] futex_wake_op: trinity-c39 tries to shift op by -852; fix this program
[52908.415599][ T310] ? __x64_sys_sched_getattr+0x3e0/0x3e0
[52908.455822][T151560] futex_wake_op: trinity-c39 tries to shift op by -852; fix this program
[52908.483899][ T310] ? __kasan_check_write+0x14/0x20
[52908.483908][ T310] __x64_sys_sched_setaffinity+0x118/0x180
[52908.483916][ T310] ? sched_setaffinity+0x660/0x660
[52908.483925][ T310] ? do_syscall_64+0x79/0xaf0
[52908.483934][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
[52908.640521][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
[52908.664068][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
[52908.692710][ T310] do_syscall_64+0xcc/0xaf0
[52908.712603][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
[52908.736117][ T310] ? syscall_return_slowpath+0x580/0x580
[52908.762313][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
[52908.789929][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
[52908.815487][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
[52908.840248][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[52908.867098][ T310] RIP: 0033:0x7fa3d9cd670d
[52908.886542][ T310] Code: Bad RIP value.
[52908.904432][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
[52908.942172][ T310] RAX: ffffffffffffffda RBX: 00000000000000cb RCX: 00007fa3d9cd670d
[52908.981963][ T310] RDX: 00007fa3d843c000 RSI: 0000000000000004 RDI: 0000000000000000
[52909.020646][ T310] RBP: 00000000000000cb R08: 00000000ffffffff R09: fffffffffffffffa
[52909.057043][ T310] R10: 000000000000001b R11: 0000000000000246 R12: 0000000000000002
[52909.093664][ T310] R13: 00007fa3d85c9058 R14: 00007fa3d9b956c0 R15: 00007fa3d85c9000
[52909.129758][ T310] INFO: task trinity-c20:149065 can't die for more than 130 seconds.
[52909.166089][ T310] trinity-c20 R running task 27712 149065 74932 0x10000004
[52909.202031][ T310] Call Trace:
[52909.216658][ T310] __schedule+0x77e/0x1070
[52909.236187][ T310] ? __sched_text_start+0x8/0x8
[52909.257916][ T310] ? check_flags.part.28+0x86/0x220
[52909.281061][ T310] preempt_schedule_common+0x16/0x50
[52909.304664][ T310] _cond_resched+0x22/0x30
[52909.324257][ T310] stop_one_cpu+0xd7/0x140
[52909.343842][ T310] ? stop_cpus.constprop.3+0x130/0x130
[52909.369442][ T310] ? task_rq_unlock+0x6e/0x80
[52909.390672][ T310] ? sched_ttwu_pending+0x1b0/0x1b0
[52909.413672][ T310] ? lock_downgrade+0x3e0/0x3e0
[52909.435178][ T310] ? mark_held_locks+0x34/0xb0
[52909.456323][ T310] ? task_rq_unlock+0x6e/0x80
[52909.478561][ T310] __set_cpus_allowed_ptr+0x382/0x3a0
[52909.505164][ T310] ? move_queued_task+0x3c0/0x3c0
[52909.529450][ T310] ? cpuset_cpus_allowed+0x10b/0x200
[52909.553226][ T310] ? trace_hardirqs_on+0x3a/0x160
[52909.575565][ T310] ? __bitmap_and+0xd8/0x100
[52909.596065][ T310] sched_setaffinity+0x476/0x660
[52909.617927][ T310] ? __x64_sys_sched_getattr+0x3e0/0x3e0
[52909.643034][ T310] ? copy_user_generic_unrolled+0x86/0xc0
[52909.669320][ T310] __x64_sys_sched_setaffinity+0x118/0x180
[52909.696241][ T310] ? sched_setaffinity+0x660/0x660
[52909.719315][ T310] ? do_syscall_64+0x79/0xaf0
[52909.740952][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
[52909.765524][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
[52909.790368][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
[52909.818929][ T310] do_syscall_64+0xcc/0xaf0
[52909.839558][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
[52909.863846][ T310] ? syscall_return_slowpath+0x580/0x580
[52909.889545][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
[52909.916624][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
[52909.942269][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
[52909.966872][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[52909.995175][ T310] RIP: 0033:0x7fa3d9cd670d
[52910.017390][ T310] Code: Bad RIP value.
[52910.035569][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
[52910.073407][ T310] RAX: ffffffffffffffda RBX: 00000000000000cb RCX: 00007fa3d9cd670d
[52910.109426][ T310] RDX: 00007fa3d7c37000 RSI: 00000000000000a2 RDI: 0000000000000000
[52910.145197][ T310] RBP: 00000000000000cb R08: ffffffdffdefffff R09: 000000008084b740
[52910.181004][ T310] R10: 0000000000000080 R11: 0000000000000246 R12: 0000000000000002
[52910.217004][ T310] R13: 00007fa3d85fa058 R14: 00007fa3d9b956c0 R15: 00007fa3d85fa000
[52910.254432][ T310]
[52910.254432][ T310] Showing all locks held in the system:
[52910.289717][ T310] 1 lock held by ksoftirqd/16/94:
[52910.312041][ T310] #0: ffff88985fa453d8 (&rq->lock){-.-.}-{2:2}, at: newidle_balance+0x88d/0x9c0
[52910.353309][ T310] 1 lock held by khungtaskd/310:
[52910.375374][ T310] #0: ffffffffb3bcacc0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.30+0x0/0x30
[52910.422536][ T310] 3 locks held by kworker/u96:1/60424:
[52910.446875][ T310] #0: ffff88984fef1538 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
[52910.493637][ T310] #1: ffffc900240ffe00 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
[52910.542269][ T310] #2: ffffffffb409cf30 (pernet_ops_rwsem){++++}-{3:3}, at: cleanup_net+0xa9/0x5c0
[52910.583972][ T310] 2 locks held by runc/74923:
[52910.604801][ T310] #0: ffff88842f510898 (&tty->ldisc_sem){++++}-{0:0}, at: ldsem_down_read+0x32/0x40
[52910.647284][ T310] #1: ffffc900064bb2e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x187/0xe20
[52910.692274][ T310] 8 locks held by runc/74924:
[52910.713003][ T310] #0: ffff888842e453d8 (&rq->lock){-.-.}-{2:2}, at: __schedule+0x227/0x1070
[52910.752393][ T310] #1: ffffffffb3bcacc0 (rcu_read_lock){....}-{1:2}, at: perf_iterate_sb+0x0/0x500
[52910.794272][ T310] #2: ffff88842f510ae8 (&o_tty->termios_rwsem/1){++++}-{3:3}, at: n_tty_write+0x126/0x7c0
[52910.839386][ T310] #3: ffff88842f510d30 (&tty->write_wait){-.-.}-{2:2}, at: remove_wait_queue+0x29/0xc0
[52910.885384][ T310] #4: ffff888429bb0558 (&port->lock#2){-.-.}-{2:2}, at: pty_write+0x82/0xf0
[52910.924906][ T310] #5: ffffffffb3bcacc0 (rcu_read_lock){....}-{1:2}, at: __queue_work+0x8b/0xa40
[52910.965808][ T310] #6: ffff88820f19c418 (&pool->lock/1){-.-.}-{2:2}, at: __queue_work+0x154/0xa40
[52911.008506][ T310] #7: ffff888349990828 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0xb0/0xfb0
[52911.054482][ T310] 2 locks held by kworker/26:2/87166:
[52911.078400][ T310] #0: ffff888207838338 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
[52911.124769][ T310] #1: ffffc900203ffe00 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
[52911.175647][ T310] 2 locks held by kworker/13:2/110105:
[52911.199910][ T310] #0: ffff888207838338 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
[52911.246366][ T310] #1: ffffc90031bffe00 (free_ipc_work){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
[52911.289451][ T310] 2 locks held by kworker/24:0/140617:
[52911.313772][ T310] #0: ffff888207018938 ((wq_completion)rcu_gp){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
[52911.360933][ T310] #1: ffffc90021a6fe00 ((work_completion)(&rew.rew_work)){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
[52911.411991][ T310] 3 locks held by kworker/36:2/142207:
[52911.437085][ T310] #0: ffff888207838338 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
[52911.483296][ T310] #1: ffffc9002420fe00 (slab_caches_to_rcu_destroy_work){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
[52911.537354][ T310] #2: ffffffffb3bd9128 (rcu_state.barrier_mutex){+.+.}-{3:3}, at: rcu_barrier+0x4c/0x440
[52911.585581][ T310] 1 lock held by trinity-c11/151262:
[52911.609175][ T310] #0: ffff889660160248 (&sb->s_type->i_mutex_key#8){+.+.}-{3:3}, at: __sock_release+0x54/0x120
[52911.655737][ T310] 1 lock held by trinity-c35/151367:
[52911.680063][ T310] #0: ffffffffb3bd9220 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x4cc/0x5e0
[52911.768193][ T310] 2 locks held by trinity-c30/151624:
[52911.792141][ T310] 1 lock held by trinity-c33/151643:
[52911.815251][ T310] #0: ffffffffb3bd9220 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x4cc/0x5e0
[52911.863988][ T310] 1 lock held by trinity-c19/151731:
[52911.889078][ T310]
[52911.899680][ T310] =============================================


2020-04-30 03:25:28

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_barrier() + membarrier() + scheduler deadlock?

On Wed, Apr 29, 2020 at 11:00:58PM -0400, Qian Cai wrote:
> Doing some fuzzers on many CPUs triggers a deadlock really quickly. I can see that there were several tasks had been stuck for a while,
>
> CPUA:
> slab_caches_to_rcu_destroy_workfn()
> rcu_barrier()
> wait_for_completion()
>
> CPUB:
> sched_setaffinity()
> __set_cpus_allowed_ptr()
> stop_one_cpu()
>
> CPUC:
> __x64_sys_membarrier
> synchronize_rcu()
> __wait_rcu_gp()
> wait_for_completion()
>
> Lockdep did not flag anything useful at all. Any clue?

CPUA and CPUC are most likely both waiting for a grace period to complete.

CPUA will be blocking CPU hotplug (get_online_cpus()) in case that
matters. I am not seeing any obvious wait for an RCU grace period
in CPUB.

But which task's affinity is being set? That of the grace-period kthread?
If not, are there rcuo kthreads present? Either way, preventing any of
RCU's kthreads from running could block potentially both CPUA and CPUC.
Though in the case of the grace-period kthread, I would expect to see
an RCU CPU stall warning.

Could you please share your .config?

Thanx, Paul

> [52885.040113][T150406] futex_wake_op: trinity-c9 tries to shift op by -1189; fix this program
> [52890.366916][T150196] futex_wake_op: trinity-c3 tries to shift op by -1; fix this program
> [52896.369779][T151054] futex_wake_op: trinity-c23 tries to shift op by 710; fix this program
> [52901.770688][ T310] INFO: task kworker/36:2:142207 blocked for more than 122 seconds.
> [52901.811471][ T310] Tainted: G L 5.7.0-rc3-next-20200429 #2
> [52901.849148][ T310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [52901.889803][ T310] kworker/36:2 D29104 142207 2 0x90004000
> [52901.919461][ T310] Workqueue: events slab_caches_to_rcu_destroy_workfn
> [52901.950935][ T310] Call Trace:
> [52901.965369][ T310] __schedule+0x77e/0x1070
> [52901.986211][ T310] ? __sched_text_start+0x8/0x8
> [52902.008267][ T310] ? trace_hardirqs_on+0x3a/0x160
> [52902.031764][ T310] schedule+0x95/0x160
> [52902.050729][ T310] schedule_timeout+0x47c/0x6a0
> [52902.072897][ T310] ? print_irqtrace_events+0x110/0x110
> [52902.098316][ T310] ? usleep_range+0x100/0x100
> [52902.119664][ T310] ? mark_held_locks+0x86/0xb0
> [52902.141397][ T310] ? _raw_spin_unlock_irq+0x27/0x40
> [52902.165326][ T310] ? _raw_spin_unlock_irq+0x27/0x40
> [52902.189622][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
> [52902.213641][ T310] ? wait_for_completion+0x116/0x1a0
> [52902.238757][ T310] ? trace_hardirqs_on+0x3a/0x160
> [52902.261648][ T310] wait_for_completion+0x11e/0x1a0
> [52902.285516][ T310] ? wait_for_completion_interruptible+0x220/0x220
> [52902.316342][ T310] ? rcu_read_lock_held+0xc0/0xc0
> [52902.340298][ T310] rcu_barrier+0x2c5/0x440
> [52902.361736][ T310] slab_caches_to_rcu_destroy_workfn+0x95/0xe0
> [52902.391109][ T310] process_one_work+0x57e/0xb90
> [52902.413244][ T310] ? pwq_dec_nr_in_flight+0x170/0x170
> [52902.437957][ T310] ? worker_thread+0x14b/0x5b0
> [52902.460357][ T310] worker_thread+0x63/0x5b0
> [52902.481267][ T310] ? process_one_work+0xb90/0xb90
> [52902.504818][ T310] kthread+0x1f7/0x220
> [52902.523542][ T310] ? kthread_create_worker_on_cpu+0xc0/0xc0
> [52902.550801][ T310] ret_from_fork+0x3a/0x50
> [52902.560427][T151328] futex_wake_op: trinity-c10 tries to shift op by -1; fix this program
> [52902.571250][ T310] INFO: task trinity-c18:145344 can't die for more than 123 seconds.
> [52902.648554][ T310] trinity-c18 R running task 26944 145344 74932 0x10004004
> [52902.700179][ T310] Call Trace:
> [52902.715039][ T310] __schedule+0x77e/0x1070
> [52902.735531][ T310] ? __sched_text_start+0x8/0x8
> [52902.757922][ T310] ? do_syscall_64+0x28d/0xaf0
> [52902.779731][ T310] ? do_syscall_64+0x28d/0xaf0
> [52902.802215][ T310] schedule+0x95/0x160
> [52902.821375][ T310] do_syscall_64+0x23e/0xaf0
> [52902.843517][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
> [52902.871003][ T310] ? ftrace_syscall_enter+0x4b0/0x4b0
> [52902.896034][ T310] ? syscall_return_slowpath+0x580/0x580
> [52902.922153][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
> [52902.950409][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
> [52902.976721][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
> [52903.002409][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [52903.030018][ T310] RIP: 0033:0x7fa3d9cd670d
> [52903.050021][ T310] Code: Bad RIP value.
> [52903.068733][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
> [52903.107680][ T310] RAX: 0000000000000000 RBX: 00000000000000cb RCX: 00007fa3d9cd670d
> [52903.144861][ T310] RDX: 00007fa3d7c38000 RSI: 0000000000000004 RDI: 0000000000000000
> [52903.182189][ T310] RBP: 00000000000000cb R08: 000000000000ca23 R09: 00000000000000b3
> [52903.219290][ T310] R10: 000000002ab3d706 R11: 0000000000000246 R12: 0000000000000002
> [52903.256653][ T310] R13: 00007fa3d8608058 R14: 00007fa3d9b956c0 R15: 00007fa3d8608000
> [52903.294988][ T310] INFO: task trinity-c24:146199 can't die for more than 124 seconds.
> [52903.333911][ T310] trinity-c24 R running task 26944 146199 74932 0x10000004
> [52903.377051][ T310] Call Trace:
> [52903.393139][ T310] __schedule+0x77e/0x1070
> [52903.413685][ T310] ? __sched_text_start+0x8/0x8
> [52903.435602][ T310] ? check_flags.part.28+0x86/0x220
> [52903.459452][ T310] preempt_schedule_common+0x16/0x50
> [52903.483985][ T310] _cond_resched+0x22/0x30
> [52903.504130][ T310] stop_one_cpu+0xd7/0x140
> [52903.524695][ T310] ? stop_cpus.constprop.3+0x130/0x130
> [52903.550266][ T310] ? task_rq_unlock+0x6e/0x80
> [52903.572005][ T310] ? sched_ttwu_pending+0x1b0/0x1b0
> [52903.595973][ T310] ? lock_downgrade+0x3e0/0x3e0
> [52903.618468][ T310] ? mark_held_locks+0x34/0xb0
> [52903.640239][ T310] ? task_rq_unlock+0x6e/0x80
> [52903.662057][ T310] __set_cpus_allowed_ptr+0x382/0x3a0
> [52903.686744][ T310] ? move_queued_task+0x3c0/0x3c0
> [52903.709784][ T310] ? cpuset_cpus_allowed+0x10b/0x200
> [52903.734212][ T310] ? trace_hardirqs_on+0x3a/0x160
> [52903.757482][ T310] ? __bitmap_and+0xd8/0x100
> [52903.779002][ T310] sched_setaffinity+0x476/0x660
> [52903.802046][ T310] ? __x64_sys_sched_getattr+0x3e0/0x3e0
> [52903.827730][ T310] ? __kasan_check_write+0x14/0x20
> [52903.851859][ T310] __x64_sys_sched_setaffinity+0x118/0x180
> [52903.881315][ T310] ? sched_setaffinity+0x660/0x660
> [52903.906012][ T310] ? do_syscall_64+0x79/0xaf0
> [52903.927530][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
> [52903.951683][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
> [52903.976607][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [52904.004024][ T310] do_syscall_64+0xcc/0xaf0
> [52904.023909][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
> [52904.047409][ T310] ? syscall_return_slowpath+0x580/0x580
> [52904.072656][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
> [52904.099856][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
> [52904.125615][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
> [52904.167434][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [52904.193732][ T310] RIP: 0033:0x7fa3d9cd670d
> [52904.213368][ T310] Code: Bad RIP value.
> [52904.231318][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
> [52904.269002][ T310] RAX: ffffffffffffffda RBX: 00000000000000cb RCX: 00007fa3d9cd670d
> [52904.344529][ T310] RDX: 00007fa3d803c000 RSI: 0000000000000200 RDI: 0000000000000000
> [52904.382761][ T310] RBP: 00000000000000cb R08: 0001802261008040 R09: 00000000000000ba
> [52904.422223][ T310] R10: 00000000000000a8 R11: 0000000000000246 R12: 0000000000000002
> [52904.458152][ T310] R13: 00007fa3d85de058 R14: 00007fa3d9b956c0 R15: 00007fa3d85de000
> [52904.494711][ T310] INFO: task trinity-c6:146436 can't die for more than 125 seconds.
> [52904.530683][ T310] trinity-c6 R running task 27328 146436 74932 0x10000004
> [52904.567167][ T310] Call Trace:
> [52904.582145][ T310] __schedule+0x77e/0x1070
> [52904.601969][ T310] ? __sched_text_start+0x8/0x8
> [52904.623515][ T310] ? check_flags.part.28+0x86/0x220
> [52904.647020][ T310] preempt_schedule_common+0x16/0x50
> [52904.671405][ T310] _cond_resched+0x22/0x30
> [52904.690871][ T310] stop_one_cpu+0xd7/0x140
> [52904.710359][ T310] ? stop_cpus.constprop.3+0x130/0x130
> [52904.734818][ T310] ? task_rq_unlock+0x6e/0x80
> [52904.755537][ T310] ? sched_ttwu_pending+0x1b0/0x1b0
> [52904.778632][ T310] ? lock_downgrade+0x3e0/0x3e0
> [52904.800232][ T310] ? mark_held_locks+0x34/0xb0
> [52904.821408][ T310] ? task_rq_unlock+0x6e/0x80
> [52904.842299][ T310] __set_cpus_allowed_ptr+0x382/0x3a0
> [52904.866634][ T310] ? move_queued_task+0x3c0/0x3c0
> [52904.890341][ T310] ? cpuset_cpus_allowed+0x10b/0x200
> [52904.916217][ T310] ? trace_hardirqs_on+0x3a/0x160
> [52904.939059][ T310] ? __bitmap_and+0xd8/0x100
> [52904.959333][ T310] sched_setaffinity+0x476/0x660
> [52904.981422][ T310] ? __x64_sys_sched_getattr+0x3e0/0x3e0
> [52905.006183][ T310] ? copy_user_generic_unrolled+0x9e/0xc0
> [52905.031792][ T310] __x64_sys_sched_setaffinity+0x118/0x180
> [52905.057598][ T310] ? sched_setaffinity+0x660/0x660
> [52905.080302][ T310] ? do_syscall_64+0x79/0xaf0
> [52905.101464][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
> [52905.126828][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
> [52905.150334][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [52905.177570][ T310] do_syscall_64+0xcc/0xaf0
> [52905.197688][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
> [52905.221392][ T310] ? syscall_return_slowpath+0x580/0x580
> [52905.246586][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
> [52905.273733][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
> [52905.300142][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
> [52905.324980][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [52905.351394][ T310] RIP: 0033:0x7fa3d9cd670d
> [52905.371850][ T310] Code: Bad RIP value.
> [52905.390680][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
> [52905.432007][ T310] RAX: ffffffffffffffda RBX: 00000000000000cb RCX: 00007fa3d9cd670d
> [52905.470212][ T310] RDX: 00007fa3d7e3c000 RSI: 0000000000000004 RDI: 0000000000000000
> [52905.507836][ T310] RBP: 00000000000000cb R08: 00000000fffffff9 R09: ffffffffffff896f
> [52905.545501][ T310] R10: 000000000000fff7 R11: 0000000000000246 R12: 0000000000000002
> [52905.581885][ T310] R13: 00007fa3d865c058 R14: 00007fa3d9b956c0 R15: 00007fa3d865c000
> [52905.620176][ T310] INFO: task trinity-c38:147003 can't die for more than 126 seconds.
> [52905.656622][ T310] trinity-c38 R running task 27880 147003 74932 0x10000004
> [52905.692374][ T310] Call Trace:
> [52905.706772][ T310] __schedule+0x77e/0x1070
> [52905.726246][ T310] ? __sched_text_start+0x8/0x8
> [52905.747783][ T310] ? check_flags.part.28+0x86/0x220
> [52905.770984][ T310] preempt_schedule_common+0x16/0x50
> [52905.794507][ T310] _cond_resched+0x22/0x30
> [52905.814101][ T310] stop_one_cpu+0xd7/0x140
> [52905.833704][ T310] ? stop_cpus.constprop.3+0x130/0x130
> [52905.858045][ T310] ? task_rq_unlock+0x6e/0x80
> [52905.879452][ T310] ? sched_ttwu_pending+0x1b0/0x1b0
> [52905.903104][ T310] ? lock_downgrade+0x3e0/0x3e0
> [52905.927387][ T310] ? mark_held_locks+0x34/0xb0
> [52905.949981][ T310] ? task_rq_unlock+0x6e/0x80
> [52905.970691][ T310] __set_cpus_allowed_ptr+0x382/0x3a0
> [52905.994868][ T310] ? move_queued_task+0x3c0/0x3c0
> [52906.017315][ T310] ? cpuset_cpus_allowed+0x10b/0x200
> [52906.040870][ T310] ? trace_hardirqs_on+0x3a/0x160
> [52906.063266][ T310] ? __bitmap_and+0xd8/0x100
> [52906.083575][ T310] sched_setaffinity+0x476/0x660
> [52906.105559][ T310] ? __x64_sys_sched_getattr+0x3e0/0x3e0
> [52906.130620][ T310] ? __kasan_check_write+0x14/0x20
> [52906.153322][ T310] __x64_sys_sched_setaffinity+0x118/0x180
> [52906.179289][ T310] ? sched_setaffinity+0x660/0x660
> [52906.202438][ T310] ? do_syscall_64+0x79/0xaf0
> [52906.223078][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
> [52906.246181][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
> [52906.269667][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [52906.296537][ T310] do_syscall_64+0xcc/0xaf0
> [52906.316130][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
> [52906.339641][ T310] ? syscall_return_slowpath+0x580/0x580
> [52906.364738][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
> [52906.391937][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
> [52906.419260][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
> [52906.446669][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [52906.473222][ T310] RIP: 0033:0x7fa3d9cd670d
> [52906.493314][ T310] Code: Bad RIP value.
> [52906.511339][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
> [52906.549312][ T310] RAX: ffffffffffffffda RBX: 00000000000000cb RCX: 00007fa3d9cd670d
> [52906.585138][ T310] RDX: 00007fa3d7c3c000 RSI: 0000000000000008 RDI: 0000000000000000
> [52906.620955][ T310] RBP: 00000000000000cb R08: fffffffffffffff8 R09: fffffffffffffffc
> [52906.657616][ T310] R10: 0000000000000200 R11: 0000000000000246 R12: 0000000000000002
> [52906.694196][ T310] R13: 00007fa3d857c058 R14: 00007fa3d9b956c0 R15: 00007fa3d857c000
> [52906.730152][ T310] INFO: task trinity-c2:147285 can't die for more than 127 seconds.
> [52906.766086][ T310] trinity-c2 R running task 27896 147285 74932 0x10000004
> [52906.801993][ T310] Call Trace:
> [52906.816338][ T310] __schedule+0x77e/0x1070
> [52906.836024][ T310] ? __sched_text_start+0x8/0x8
> [52906.857547][ T310] ? up_write+0xe9/0x290
> [52906.876797][ T310] schedule+0x95/0x160
> [52906.915571][ T310] schedule_timeout+0x47c/0x6a0
> [52906.939198][ T310] ? usleep_range+0x100/0x100
> [52906.961983][ T310] ? lock_downgrade+0x3e0/0x3e0
> [52906.983597][ T310] ? mark_held_locks+0x34/0xb0
> [52907.005053][ T310] ? _raw_spin_unlock_irq+0x27/0x40
> [52907.028158][ T310] ? _raw_spin_unlock_irq+0x27/0x40
> [52907.051323][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
> [52907.074918][ T310] ? wait_for_completion+0x116/0x1a0
> [52907.098366][ T310] ? trace_hardirqs_on+0x3a/0x160
> [52907.120794][ T310] wait_for_completion+0x11e/0x1a0
> [52907.143566][ T310] ? wait_for_completion_interruptible+0x220/0x220
> [52907.172861][ T310] __wait_rcu_gp+0x1ad/0x1e0
> [52907.193176][ T310] synchronize_rcu+0x128/0x1a0
> [52907.214557][ T310] ? synchronize_rcu_expedited+0x5e0/0x5e0
> [52907.240532][ T310] ? match_held_lock+0x20/0x270
> [52907.262106][ T310] ? invoke_rcu_core+0xf0/0xf0
> [52907.282779][ T310] ? check_chain_key+0x1df/0x2e0
> [52907.304629][ T310] ? find_held_lock+0xca/0xf0
> [52907.325381][ T310] ? __kasan_check_read+0x11/0x20
> [52907.347666][ T310] __x64_sys_membarrier+0x205/0x62e
> [52907.370924][ T310] ? membarrier_exec_mmap+0x50/0x50
> [52907.394034][ T310] ? do_syscall_64+0x79/0xaf0
> [52907.414823][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
> [52907.440213][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
> [52907.466322][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [52907.494190][ T310] do_syscall_64+0xcc/0xaf0
> [52907.514834][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
> [52907.539609][ T310] ? syscall_return_slowpath+0x580/0x580
> [52907.564670][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
> [52907.593100][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
> [52907.619749][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
> [52907.645538][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [52907.673066][ T310] RIP: 0033:0x7fa3d9cd670d
> [52907.693218][ T310] Code: Bad RIP value.
> [52907.711803][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000144
> [52907.750495][ T310] RAX: ffffffffffffffda RBX: 0000000000000144 RCX: 00007fa3d9cd670d
> [52907.786490][ T310] RDX: 0000000000000080 RSI: 0000000000000000 RDI: 0000000000000001
> [52907.822363][ T310] RBP: 0000000000000144 R08: 00000000eeeeeeee R09: 0000000000003418
> [52907.858106][ T310] R10: 0000000000004ff8 R11: 0000000000000246 R12: 0000000000000002
> [52907.894476][ T310] R13: 00007fa3d8678058 R14: 00007fa3d9b956c0 R15: 00007fa3d8678000
> [52907.930457][ T310] INFO: task trinity-c27:149010 can't die for more than 129 seconds.
> [52907.971719][ T310] trinity-c27 R running task 26944 149010 74932 0x10000004
> [52908.008456][ T310] Call Trace:
> [52908.022892][ T310] __schedule+0x77e/0x1070
> [52908.042479][ T310] ? __sched_text_start+0x8/0x8
> [52908.063993][ T310] ? check_flags.part.28+0x86/0x220
> [52908.087103][ T310] preempt_schedule_common+0x16/0x50
> [52908.110659][ T310] _cond_resched+0x22/0x30
> [52908.130167][ T310] stop_one_cpu+0xd7/0x140
> [52908.149705][ T310] ? stop_cpus.constprop.3+0x130/0x130
> [52908.174087][ T310] ? task_rq_unlock+0x6e/0x80
> [52908.194356][ T310] ? sched_ttwu_pending+0x1b0/0x1b0
> [52908.217715][ T310] ? lock_downgrade+0x3e0/0x3e0
> [52908.239236][ T310] ? mark_held_locks+0x34/0xb0
> [52908.260471][ T310] ? task_rq_unlock+0x6e/0x80
> [52908.281298][ T310] __set_cpus_allowed_ptr+0x382/0x3a0
> [52908.305375][ T310] ? move_queued_task+0x3c0/0x3c0
> [52908.327364][ T310] ? cpuset_cpus_allowed+0x10b/0x200
> [52908.350892][ T310] ? trace_hardirqs_on+0x3a/0x160
> [52908.373276][ T310] ? __bitmap_and+0xd8/0x100
> [52908.393650][ T310] sched_setaffinity+0x476/0x660
> [52908.397305][T151560] futex_wake_op: trinity-c39 tries to shift op by -852; fix this program
> [52908.415599][ T310] ? __x64_sys_sched_getattr+0x3e0/0x3e0
> [52908.455822][T151560] futex_wake_op: trinity-c39 tries to shift op by -852; fix this program
> [52908.483899][ T310] ? __kasan_check_write+0x14/0x20
> [52908.483908][ T310] __x64_sys_sched_setaffinity+0x118/0x180
> [52908.483916][ T310] ? sched_setaffinity+0x660/0x660
> [52908.483925][ T310] ? do_syscall_64+0x79/0xaf0
> [52908.483934][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
> [52908.640521][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
> [52908.664068][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [52908.692710][ T310] do_syscall_64+0xcc/0xaf0
> [52908.712603][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
> [52908.736117][ T310] ? syscall_return_slowpath+0x580/0x580
> [52908.762313][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
> [52908.789929][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
> [52908.815487][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
> [52908.840248][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [52908.867098][ T310] RIP: 0033:0x7fa3d9cd670d
> [52908.886542][ T310] Code: Bad RIP value.
> [52908.904432][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
> [52908.942172][ T310] RAX: ffffffffffffffda RBX: 00000000000000cb RCX: 00007fa3d9cd670d
> [52908.981963][ T310] RDX: 00007fa3d843c000 RSI: 0000000000000004 RDI: 0000000000000000
> [52909.020646][ T310] RBP: 00000000000000cb R08: 00000000ffffffff R09: fffffffffffffffa
> [52909.057043][ T310] R10: 000000000000001b R11: 0000000000000246 R12: 0000000000000002
> [52909.093664][ T310] R13: 00007fa3d85c9058 R14: 00007fa3d9b956c0 R15: 00007fa3d85c9000
> [52909.129758][ T310] INFO: task trinity-c20:149065 can't die for more than 130 seconds.
> [52909.166089][ T310] trinity-c20 R running task 27712 149065 74932 0x10000004
> [52909.202031][ T310] Call Trace:
> [52909.216658][ T310] __schedule+0x77e/0x1070
> [52909.236187][ T310] ? __sched_text_start+0x8/0x8
> [52909.257916][ T310] ? check_flags.part.28+0x86/0x220
> [52909.281061][ T310] preempt_schedule_common+0x16/0x50
> [52909.304664][ T310] _cond_resched+0x22/0x30
> [52909.324257][ T310] stop_one_cpu+0xd7/0x140
> [52909.343842][ T310] ? stop_cpus.constprop.3+0x130/0x130
> [52909.369442][ T310] ? task_rq_unlock+0x6e/0x80
> [52909.390672][ T310] ? sched_ttwu_pending+0x1b0/0x1b0
> [52909.413672][ T310] ? lock_downgrade+0x3e0/0x3e0
> [52909.435178][ T310] ? mark_held_locks+0x34/0xb0
> [52909.456323][ T310] ? task_rq_unlock+0x6e/0x80
> [52909.478561][ T310] __set_cpus_allowed_ptr+0x382/0x3a0
> [52909.505164][ T310] ? move_queued_task+0x3c0/0x3c0
> [52909.529450][ T310] ? cpuset_cpus_allowed+0x10b/0x200
> [52909.553226][ T310] ? trace_hardirqs_on+0x3a/0x160
> [52909.575565][ T310] ? __bitmap_and+0xd8/0x100
> [52909.596065][ T310] sched_setaffinity+0x476/0x660
> [52909.617927][ T310] ? __x64_sys_sched_getattr+0x3e0/0x3e0
> [52909.643034][ T310] ? copy_user_generic_unrolled+0x86/0xc0
> [52909.669320][ T310] __x64_sys_sched_setaffinity+0x118/0x180
> [52909.696241][ T310] ? sched_setaffinity+0x660/0x660
> [52909.719315][ T310] ? do_syscall_64+0x79/0xaf0
> [52909.740952][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
> [52909.765524][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
> [52909.790368][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [52909.818929][ T310] do_syscall_64+0xcc/0xaf0
> [52909.839558][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
> [52909.863846][ T310] ? syscall_return_slowpath+0x580/0x580
> [52909.889545][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
> [52909.916624][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
> [52909.942269][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
> [52909.966872][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [52909.995175][ T310] RIP: 0033:0x7fa3d9cd670d
> [52910.017390][ T310] Code: Bad RIP value.
> [52910.035569][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
> [52910.073407][ T310] RAX: ffffffffffffffda RBX: 00000000000000cb RCX: 00007fa3d9cd670d
> [52910.109426][ T310] RDX: 00007fa3d7c37000 RSI: 00000000000000a2 RDI: 0000000000000000
> [52910.145197][ T310] RBP: 00000000000000cb R08: ffffffdffdefffff R09: 000000008084b740
> [52910.181004][ T310] R10: 0000000000000080 R11: 0000000000000246 R12: 0000000000000002
> [52910.217004][ T310] R13: 00007fa3d85fa058 R14: 00007fa3d9b956c0 R15: 00007fa3d85fa000
> [52910.254432][ T310]
> [52910.254432][ T310] Showing all locks held in the system:
> [52910.289717][ T310] 1 lock held by ksoftirqd/16/94:
> [52910.312041][ T310] #0: ffff88985fa453d8 (&rq->lock){-.-.}-{2:2}, at: newidle_balance+0x88d/0x9c0
> [52910.353309][ T310] 1 lock held by khungtaskd/310:
> [52910.375374][ T310] #0: ffffffffb3bcacc0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.30+0x0/0x30
> [52910.422536][ T310] 3 locks held by kworker/u96:1/60424:
> [52910.446875][ T310] #0: ffff88984fef1538 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> [52910.493637][ T310] #1: ffffc900240ffe00 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> [52910.542269][ T310] #2: ffffffffb409cf30 (pernet_ops_rwsem){++++}-{3:3}, at: cleanup_net+0xa9/0x5c0
> [52910.583972][ T310] 2 locks held by runc/74923:
> [52910.604801][ T310] #0: ffff88842f510898 (&tty->ldisc_sem){++++}-{0:0}, at: ldsem_down_read+0x32/0x40
> [52910.647284][ T310] #1: ffffc900064bb2e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x187/0xe20
> [52910.692274][ T310] 8 locks held by runc/74924:
> [52910.713003][ T310] #0: ffff888842e453d8 (&rq->lock){-.-.}-{2:2}, at: __schedule+0x227/0x1070
> [52910.752393][ T310] #1: ffffffffb3bcacc0 (rcu_read_lock){....}-{1:2}, at: perf_iterate_sb+0x0/0x500
> [52910.794272][ T310] #2: ffff88842f510ae8 (&o_tty->termios_rwsem/1){++++}-{3:3}, at: n_tty_write+0x126/0x7c0
> [52910.839386][ T310] #3: ffff88842f510d30 (&tty->write_wait){-.-.}-{2:2}, at: remove_wait_queue+0x29/0xc0
> [52910.885384][ T310] #4: ffff888429bb0558 (&port->lock#2){-.-.}-{2:2}, at: pty_write+0x82/0xf0
> [52910.924906][ T310] #5: ffffffffb3bcacc0 (rcu_read_lock){....}-{1:2}, at: __queue_work+0x8b/0xa40
> [52910.965808][ T310] #6: ffff88820f19c418 (&pool->lock/1){-.-.}-{2:2}, at: __queue_work+0x154/0xa40
> [52911.008506][ T310] #7: ffff888349990828 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0xb0/0xfb0
> [52911.054482][ T310] 2 locks held by kworker/26:2/87166:
> [52911.078400][ T310] #0: ffff888207838338 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> [52911.124769][ T310] #1: ffffc900203ffe00 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> [52911.175647][ T310] 2 locks held by kworker/13:2/110105:
> [52911.199910][ T310] #0: ffff888207838338 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> [52911.246366][ T310] #1: ffffc90031bffe00 (free_ipc_work){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> [52911.289451][ T310] 2 locks held by kworker/24:0/140617:
> [52911.313772][ T310] #0: ffff888207018938 ((wq_completion)rcu_gp){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> [52911.360933][ T310] #1: ffffc90021a6fe00 ((work_completion)(&rew.rew_work)){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> [52911.411991][ T310] 3 locks held by kworker/36:2/142207:
> [52911.437085][ T310] #0: ffff888207838338 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> [52911.483296][ T310] #1: ffffc9002420fe00 (slab_caches_to_rcu_destroy_work){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> [52911.537354][ T310] #2: ffffffffb3bd9128 (rcu_state.barrier_mutex){+.+.}-{3:3}, at: rcu_barrier+0x4c/0x440
> [52911.585581][ T310] 1 lock held by trinity-c11/151262:
> [52911.609175][ T310] #0: ffff889660160248 (&sb->s_type->i_mutex_key#8){+.+.}-{3:3}, at: __sock_release+0x54/0x120
> [52911.655737][ T310] 1 lock held by trinity-c35/151367:
> [52911.680063][ T310] #0: ffffffffb3bd9220 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x4cc/0x5e0
> [52911.768193][ T310] 2 locks held by trinity-c30/151624:
> [52911.792141][ T310] 1 lock held by trinity-c33/151643:
> [52911.815251][ T310] #0: ffffffffb3bd9220 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x4cc/0x5e0
> [52911.863988][ T310] 1 lock held by trinity-c19/151731:
> [52911.889078][ T310]
> [52911.899680][ T310] =============================================

2020-04-30 03:33:20

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_barrier() + membarrier() + scheduler deadlock?

On Wed, Apr 29, 2020 at 08:22:38PM -0700, Paul E. McKenney wrote:
> On Wed, Apr 29, 2020 at 11:00:58PM -0400, Qian Cai wrote:
> > Doing some fuzzers on many CPUs triggers a deadlock really quickly. I can see that there were several tasks had been stuck for a while,
> >
> > CPUA:
> > slab_caches_to_rcu_destroy_workfn()
> > rcu_barrier()
> > wait_for_completion()
> >
> > CPUB:
> > sched_setaffinity()
> > __set_cpus_allowed_ptr()
> > stop_one_cpu()
> >
> > CPUC:
> > __x64_sys_membarrier
> > synchronize_rcu()
> > __wait_rcu_gp()
> > wait_for_completion()
> >
> > Lockdep did not flag anything useful at all. Any clue?
>
> CPUA and CPUC are most likely both waiting for a grace period to complete.
>
> CPUA will be blocking CPU hotplug (get_online_cpus()) in case that
> matters. I am not seeing any obvious wait for an RCU grace period
> in CPUB.
>
> But which task's affinity is being set? That of the grace-period kthread?
> If not, are there rcuo kthreads present? Either way, preventing any of
> RCU's kthreads from running could block potentially both CPUA and CPUC.
> Though in the case of the grace-period kthread, I would expect to see
> an RCU CPU stall warning.
>
> Could you please share your .config?

And there are a number of tasks below marked "can't die". So maybe
something more fundamental is stuck?

Thanx, Paul

> > [52885.040113][T150406] futex_wake_op: trinity-c9 tries to shift op by -1189; fix this program
> > [52890.366916][T150196] futex_wake_op: trinity-c3 tries to shift op by -1; fix this program
> > [52896.369779][T151054] futex_wake_op: trinity-c23 tries to shift op by 710; fix this program
> > [52901.770688][ T310] INFO: task kworker/36:2:142207 blocked for more than 122 seconds.
> > [52901.811471][ T310] Tainted: G L 5.7.0-rc3-next-20200429 #2
> > [52901.849148][ T310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [52901.889803][ T310] kworker/36:2 D29104 142207 2 0x90004000
> > [52901.919461][ T310] Workqueue: events slab_caches_to_rcu_destroy_workfn
> > [52901.950935][ T310] Call Trace:
> > [52901.965369][ T310] __schedule+0x77e/0x1070
> > [52901.986211][ T310] ? __sched_text_start+0x8/0x8
> > [52902.008267][ T310] ? trace_hardirqs_on+0x3a/0x160
> > [52902.031764][ T310] schedule+0x95/0x160
> > [52902.050729][ T310] schedule_timeout+0x47c/0x6a0
> > [52902.072897][ T310] ? print_irqtrace_events+0x110/0x110
> > [52902.098316][ T310] ? usleep_range+0x100/0x100
> > [52902.119664][ T310] ? mark_held_locks+0x86/0xb0
> > [52902.141397][ T310] ? _raw_spin_unlock_irq+0x27/0x40
> > [52902.165326][ T310] ? _raw_spin_unlock_irq+0x27/0x40
> > [52902.189622][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
> > [52902.213641][ T310] ? wait_for_completion+0x116/0x1a0
> > [52902.238757][ T310] ? trace_hardirqs_on+0x3a/0x160
> > [52902.261648][ T310] wait_for_completion+0x11e/0x1a0
> > [52902.285516][ T310] ? wait_for_completion_interruptible+0x220/0x220
> > [52902.316342][ T310] ? rcu_read_lock_held+0xc0/0xc0
> > [52902.340298][ T310] rcu_barrier+0x2c5/0x440
> > [52902.361736][ T310] slab_caches_to_rcu_destroy_workfn+0x95/0xe0
> > [52902.391109][ T310] process_one_work+0x57e/0xb90
> > [52902.413244][ T310] ? pwq_dec_nr_in_flight+0x170/0x170
> > [52902.437957][ T310] ? worker_thread+0x14b/0x5b0
> > [52902.460357][ T310] worker_thread+0x63/0x5b0
> > [52902.481267][ T310] ? process_one_work+0xb90/0xb90
> > [52902.504818][ T310] kthread+0x1f7/0x220
> > [52902.523542][ T310] ? kthread_create_worker_on_cpu+0xc0/0xc0
> > [52902.550801][ T310] ret_from_fork+0x3a/0x50
> > [52902.560427][T151328] futex_wake_op: trinity-c10 tries to shift op by -1; fix this program
> > [52902.571250][ T310] INFO: task trinity-c18:145344 can't die for more than 123 seconds.
> > [52902.648554][ T310] trinity-c18 R running task 26944 145344 74932 0x10004004
> > [52902.700179][ T310] Call Trace:
> > [52902.715039][ T310] __schedule+0x77e/0x1070
> > [52902.735531][ T310] ? __sched_text_start+0x8/0x8
> > [52902.757922][ T310] ? do_syscall_64+0x28d/0xaf0
> > [52902.779731][ T310] ? do_syscall_64+0x28d/0xaf0
> > [52902.802215][ T310] schedule+0x95/0x160
> > [52902.821375][ T310] do_syscall_64+0x23e/0xaf0
> > [52902.843517][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
> > [52902.871003][ T310] ? ftrace_syscall_enter+0x4b0/0x4b0
> > [52902.896034][ T310] ? syscall_return_slowpath+0x580/0x580
> > [52902.922153][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
> > [52902.950409][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
> > [52902.976721][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
> > [52903.002409][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [52903.030018][ T310] RIP: 0033:0x7fa3d9cd670d
> > [52903.050021][ T310] Code: Bad RIP value.
> > [52903.068733][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
> > [52903.107680][ T310] RAX: 0000000000000000 RBX: 00000000000000cb RCX: 00007fa3d9cd670d
> > [52903.144861][ T310] RDX: 00007fa3d7c38000 RSI: 0000000000000004 RDI: 0000000000000000
> > [52903.182189][ T310] RBP: 00000000000000cb R08: 000000000000ca23 R09: 00000000000000b3
> > [52903.219290][ T310] R10: 000000002ab3d706 R11: 0000000000000246 R12: 0000000000000002
> > [52903.256653][ T310] R13: 00007fa3d8608058 R14: 00007fa3d9b956c0 R15: 00007fa3d8608000
> > [52903.294988][ T310] INFO: task trinity-c24:146199 can't die for more than 124 seconds.
> > [52903.333911][ T310] trinity-c24 R running task 26944 146199 74932 0x10000004
> > [52903.377051][ T310] Call Trace:
> > [52903.393139][ T310] __schedule+0x77e/0x1070
> > [52903.413685][ T310] ? __sched_text_start+0x8/0x8
> > [52903.435602][ T310] ? check_flags.part.28+0x86/0x220
> > [52903.459452][ T310] preempt_schedule_common+0x16/0x50
> > [52903.483985][ T310] _cond_resched+0x22/0x30
> > [52903.504130][ T310] stop_one_cpu+0xd7/0x140
> > [52903.524695][ T310] ? stop_cpus.constprop.3+0x130/0x130
> > [52903.550266][ T310] ? task_rq_unlock+0x6e/0x80
> > [52903.572005][ T310] ? sched_ttwu_pending+0x1b0/0x1b0
> > [52903.595973][ T310] ? lock_downgrade+0x3e0/0x3e0
> > [52903.618468][ T310] ? mark_held_locks+0x34/0xb0
> > [52903.640239][ T310] ? task_rq_unlock+0x6e/0x80
> > [52903.662057][ T310] __set_cpus_allowed_ptr+0x382/0x3a0
> > [52903.686744][ T310] ? move_queued_task+0x3c0/0x3c0
> > [52903.709784][ T310] ? cpuset_cpus_allowed+0x10b/0x200
> > [52903.734212][ T310] ? trace_hardirqs_on+0x3a/0x160
> > [52903.757482][ T310] ? __bitmap_and+0xd8/0x100
> > [52903.779002][ T310] sched_setaffinity+0x476/0x660
> > [52903.802046][ T310] ? __x64_sys_sched_getattr+0x3e0/0x3e0
> > [52903.827730][ T310] ? __kasan_check_write+0x14/0x20
> > [52903.851859][ T310] __x64_sys_sched_setaffinity+0x118/0x180
> > [52903.881315][ T310] ? sched_setaffinity+0x660/0x660
> > [52903.906012][ T310] ? do_syscall_64+0x79/0xaf0
> > [52903.927530][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
> > [52903.951683][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
> > [52903.976607][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [52904.004024][ T310] do_syscall_64+0xcc/0xaf0
> > [52904.023909][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
> > [52904.047409][ T310] ? syscall_return_slowpath+0x580/0x580
> > [52904.072656][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
> > [52904.099856][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
> > [52904.125615][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
> > [52904.167434][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [52904.193732][ T310] RIP: 0033:0x7fa3d9cd670d
> > [52904.213368][ T310] Code: Bad RIP value.
> > [52904.231318][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
> > [52904.269002][ T310] RAX: ffffffffffffffda RBX: 00000000000000cb RCX: 00007fa3d9cd670d
> > [52904.344529][ T310] RDX: 00007fa3d803c000 RSI: 0000000000000200 RDI: 0000000000000000
> > [52904.382761][ T310] RBP: 00000000000000cb R08: 0001802261008040 R09: 00000000000000ba
> > [52904.422223][ T310] R10: 00000000000000a8 R11: 0000000000000246 R12: 0000000000000002
> > [52904.458152][ T310] R13: 00007fa3d85de058 R14: 00007fa3d9b956c0 R15: 00007fa3d85de000
> > [52904.494711][ T310] INFO: task trinity-c6:146436 can't die for more than 125 seconds.
> > [52904.530683][ T310] trinity-c6 R running task 27328 146436 74932 0x10000004
> > [52904.567167][ T310] Call Trace:
> > [52904.582145][ T310] __schedule+0x77e/0x1070
> > [52904.601969][ T310] ? __sched_text_start+0x8/0x8
> > [52904.623515][ T310] ? check_flags.part.28+0x86/0x220
> > [52904.647020][ T310] preempt_schedule_common+0x16/0x50
> > [52904.671405][ T310] _cond_resched+0x22/0x30
> > [52904.690871][ T310] stop_one_cpu+0xd7/0x140
> > [52904.710359][ T310] ? stop_cpus.constprop.3+0x130/0x130
> > [52904.734818][ T310] ? task_rq_unlock+0x6e/0x80
> > [52904.755537][ T310] ? sched_ttwu_pending+0x1b0/0x1b0
> > [52904.778632][ T310] ? lock_downgrade+0x3e0/0x3e0
> > [52904.800232][ T310] ? mark_held_locks+0x34/0xb0
> > [52904.821408][ T310] ? task_rq_unlock+0x6e/0x80
> > [52904.842299][ T310] __set_cpus_allowed_ptr+0x382/0x3a0
> > [52904.866634][ T310] ? move_queued_task+0x3c0/0x3c0
> > [52904.890341][ T310] ? cpuset_cpus_allowed+0x10b/0x200
> > [52904.916217][ T310] ? trace_hardirqs_on+0x3a/0x160
> > [52904.939059][ T310] ? __bitmap_and+0xd8/0x100
> > [52904.959333][ T310] sched_setaffinity+0x476/0x660
> > [52904.981422][ T310] ? __x64_sys_sched_getattr+0x3e0/0x3e0
> > [52905.006183][ T310] ? copy_user_generic_unrolled+0x9e/0xc0
> > [52905.031792][ T310] __x64_sys_sched_setaffinity+0x118/0x180
> > [52905.057598][ T310] ? sched_setaffinity+0x660/0x660
> > [52905.080302][ T310] ? do_syscall_64+0x79/0xaf0
> > [52905.101464][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
> > [52905.126828][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
> > [52905.150334][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [52905.177570][ T310] do_syscall_64+0xcc/0xaf0
> > [52905.197688][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
> > [52905.221392][ T310] ? syscall_return_slowpath+0x580/0x580
> > [52905.246586][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
> > [52905.273733][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
> > [52905.300142][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
> > [52905.324980][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [52905.351394][ T310] RIP: 0033:0x7fa3d9cd670d
> > [52905.371850][ T310] Code: Bad RIP value.
> > [52905.390680][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
> > [52905.432007][ T310] RAX: ffffffffffffffda RBX: 00000000000000cb RCX: 00007fa3d9cd670d
> > [52905.470212][ T310] RDX: 00007fa3d7e3c000 RSI: 0000000000000004 RDI: 0000000000000000
> > [52905.507836][ T310] RBP: 00000000000000cb R08: 00000000fffffff9 R09: ffffffffffff896f
> > [52905.545501][ T310] R10: 000000000000fff7 R11: 0000000000000246 R12: 0000000000000002
> > [52905.581885][ T310] R13: 00007fa3d865c058 R14: 00007fa3d9b956c0 R15: 00007fa3d865c000
> > [52905.620176][ T310] INFO: task trinity-c38:147003 can't die for more than 126 seconds.
> > [52905.656622][ T310] trinity-c38 R running task 27880 147003 74932 0x10000004
> > [52905.692374][ T310] Call Trace:
> > [52905.706772][ T310] __schedule+0x77e/0x1070
> > [52905.726246][ T310] ? __sched_text_start+0x8/0x8
> > [52905.747783][ T310] ? check_flags.part.28+0x86/0x220
> > [52905.770984][ T310] preempt_schedule_common+0x16/0x50
> > [52905.794507][ T310] _cond_resched+0x22/0x30
> > [52905.814101][ T310] stop_one_cpu+0xd7/0x140
> > [52905.833704][ T310] ? stop_cpus.constprop.3+0x130/0x130
> > [52905.858045][ T310] ? task_rq_unlock+0x6e/0x80
> > [52905.879452][ T310] ? sched_ttwu_pending+0x1b0/0x1b0
> > [52905.903104][ T310] ? lock_downgrade+0x3e0/0x3e0
> > [52905.927387][ T310] ? mark_held_locks+0x34/0xb0
> > [52905.949981][ T310] ? task_rq_unlock+0x6e/0x80
> > [52905.970691][ T310] __set_cpus_allowed_ptr+0x382/0x3a0
> > [52905.994868][ T310] ? move_queued_task+0x3c0/0x3c0
> > [52906.017315][ T310] ? cpuset_cpus_allowed+0x10b/0x200
> > [52906.040870][ T310] ? trace_hardirqs_on+0x3a/0x160
> > [52906.063266][ T310] ? __bitmap_and+0xd8/0x100
> > [52906.083575][ T310] sched_setaffinity+0x476/0x660
> > [52906.105559][ T310] ? __x64_sys_sched_getattr+0x3e0/0x3e0
> > [52906.130620][ T310] ? __kasan_check_write+0x14/0x20
> > [52906.153322][ T310] __x64_sys_sched_setaffinity+0x118/0x180
> > [52906.179289][ T310] ? sched_setaffinity+0x660/0x660
> > [52906.202438][ T310] ? do_syscall_64+0x79/0xaf0
> > [52906.223078][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
> > [52906.246181][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
> > [52906.269667][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [52906.296537][ T310] do_syscall_64+0xcc/0xaf0
> > [52906.316130][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
> > [52906.339641][ T310] ? syscall_return_slowpath+0x580/0x580
> > [52906.364738][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
> > [52906.391937][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
> > [52906.419260][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
> > [52906.446669][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [52906.473222][ T310] RIP: 0033:0x7fa3d9cd670d
> > [52906.493314][ T310] Code: Bad RIP value.
> > [52906.511339][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
> > [52906.549312][ T310] RAX: ffffffffffffffda RBX: 00000000000000cb RCX: 00007fa3d9cd670d
> > [52906.585138][ T310] RDX: 00007fa3d7c3c000 RSI: 0000000000000008 RDI: 0000000000000000
> > [52906.620955][ T310] RBP: 00000000000000cb R08: fffffffffffffff8 R09: fffffffffffffffc
> > [52906.657616][ T310] R10: 0000000000000200 R11: 0000000000000246 R12: 0000000000000002
> > [52906.694196][ T310] R13: 00007fa3d857c058 R14: 00007fa3d9b956c0 R15: 00007fa3d857c000
> > [52906.730152][ T310] INFO: task trinity-c2:147285 can't die for more than 127 seconds.
> > [52906.766086][ T310] trinity-c2 R running task 27896 147285 74932 0x10000004
> > [52906.801993][ T310] Call Trace:
> > [52906.816338][ T310] __schedule+0x77e/0x1070
> > [52906.836024][ T310] ? __sched_text_start+0x8/0x8
> > [52906.857547][ T310] ? up_write+0xe9/0x290
> > [52906.876797][ T310] schedule+0x95/0x160
> > [52906.915571][ T310] schedule_timeout+0x47c/0x6a0
> > [52906.939198][ T310] ? usleep_range+0x100/0x100
> > [52906.961983][ T310] ? lock_downgrade+0x3e0/0x3e0
> > [52906.983597][ T310] ? mark_held_locks+0x34/0xb0
> > [52907.005053][ T310] ? _raw_spin_unlock_irq+0x27/0x40
> > [52907.028158][ T310] ? _raw_spin_unlock_irq+0x27/0x40
> > [52907.051323][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
> > [52907.074918][ T310] ? wait_for_completion+0x116/0x1a0
> > [52907.098366][ T310] ? trace_hardirqs_on+0x3a/0x160
> > [52907.120794][ T310] wait_for_completion+0x11e/0x1a0
> > [52907.143566][ T310] ? wait_for_completion_interruptible+0x220/0x220
> > [52907.172861][ T310] __wait_rcu_gp+0x1ad/0x1e0
> > [52907.193176][ T310] synchronize_rcu+0x128/0x1a0
> > [52907.214557][ T310] ? synchronize_rcu_expedited+0x5e0/0x5e0
> > [52907.240532][ T310] ? match_held_lock+0x20/0x270
> > [52907.262106][ T310] ? invoke_rcu_core+0xf0/0xf0
> > [52907.282779][ T310] ? check_chain_key+0x1df/0x2e0
> > [52907.304629][ T310] ? find_held_lock+0xca/0xf0
> > [52907.325381][ T310] ? __kasan_check_read+0x11/0x20
> > [52907.347666][ T310] __x64_sys_membarrier+0x205/0x62e
> > [52907.370924][ T310] ? membarrier_exec_mmap+0x50/0x50
> > [52907.394034][ T310] ? do_syscall_64+0x79/0xaf0
> > [52907.414823][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
> > [52907.440213][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
> > [52907.466322][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [52907.494190][ T310] do_syscall_64+0xcc/0xaf0
> > [52907.514834][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
> > [52907.539609][ T310] ? syscall_return_slowpath+0x580/0x580
> > [52907.564670][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
> > [52907.593100][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
> > [52907.619749][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
> > [52907.645538][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [52907.673066][ T310] RIP: 0033:0x7fa3d9cd670d
> > [52907.693218][ T310] Code: Bad RIP value.
> > [52907.711803][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000144
> > [52907.750495][ T310] RAX: ffffffffffffffda RBX: 0000000000000144 RCX: 00007fa3d9cd670d
> > [52907.786490][ T310] RDX: 0000000000000080 RSI: 0000000000000000 RDI: 0000000000000001
> > [52907.822363][ T310] RBP: 0000000000000144 R08: 00000000eeeeeeee R09: 0000000000003418
> > [52907.858106][ T310] R10: 0000000000004ff8 R11: 0000000000000246 R12: 0000000000000002
> > [52907.894476][ T310] R13: 00007fa3d8678058 R14: 00007fa3d9b956c0 R15: 00007fa3d8678000
> > [52907.930457][ T310] INFO: task trinity-c27:149010 can't die for more than 129 seconds.
> > [52907.971719][ T310] trinity-c27 R running task 26944 149010 74932 0x10000004
> > [52908.008456][ T310] Call Trace:
> > [52908.022892][ T310] __schedule+0x77e/0x1070
> > [52908.042479][ T310] ? __sched_text_start+0x8/0x8
> > [52908.063993][ T310] ? check_flags.part.28+0x86/0x220
> > [52908.087103][ T310] preempt_schedule_common+0x16/0x50
> > [52908.110659][ T310] _cond_resched+0x22/0x30
> > [52908.130167][ T310] stop_one_cpu+0xd7/0x140
> > [52908.149705][ T310] ? stop_cpus.constprop.3+0x130/0x130
> > [52908.174087][ T310] ? task_rq_unlock+0x6e/0x80
> > [52908.194356][ T310] ? sched_ttwu_pending+0x1b0/0x1b0
> > [52908.217715][ T310] ? lock_downgrade+0x3e0/0x3e0
> > [52908.239236][ T310] ? mark_held_locks+0x34/0xb0
> > [52908.260471][ T310] ? task_rq_unlock+0x6e/0x80
> > [52908.281298][ T310] __set_cpus_allowed_ptr+0x382/0x3a0
> > [52908.305375][ T310] ? move_queued_task+0x3c0/0x3c0
> > [52908.327364][ T310] ? cpuset_cpus_allowed+0x10b/0x200
> > [52908.350892][ T310] ? trace_hardirqs_on+0x3a/0x160
> > [52908.373276][ T310] ? __bitmap_and+0xd8/0x100
> > [52908.393650][ T310] sched_setaffinity+0x476/0x660
> > [52908.397305][T151560] futex_wake_op: trinity-c39 tries to shift op by -852; fix this program
> > [52908.415599][ T310] ? __x64_sys_sched_getattr+0x3e0/0x3e0
> > [52908.455822][T151560] futex_wake_op: trinity-c39 tries to shift op by -852; fix this program
> > [52908.483899][ T310] ? __kasan_check_write+0x14/0x20
> > [52908.483908][ T310] __x64_sys_sched_setaffinity+0x118/0x180
> > [52908.483916][ T310] ? sched_setaffinity+0x660/0x660
> > [52908.483925][ T310] ? do_syscall_64+0x79/0xaf0
> > [52908.483934][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
> > [52908.640521][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
> > [52908.664068][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [52908.692710][ T310] do_syscall_64+0xcc/0xaf0
> > [52908.712603][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
> > [52908.736117][ T310] ? syscall_return_slowpath+0x580/0x580
> > [52908.762313][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
> > [52908.789929][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
> > [52908.815487][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
> > [52908.840248][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [52908.867098][ T310] RIP: 0033:0x7fa3d9cd670d
> > [52908.886542][ T310] Code: Bad RIP value.
> > [52908.904432][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
> > [52908.942172][ T310] RAX: ffffffffffffffda RBX: 00000000000000cb RCX: 00007fa3d9cd670d
> > [52908.981963][ T310] RDX: 00007fa3d843c000 RSI: 0000000000000004 RDI: 0000000000000000
> > [52909.020646][ T310] RBP: 00000000000000cb R08: 00000000ffffffff R09: fffffffffffffffa
> > [52909.057043][ T310] R10: 000000000000001b R11: 0000000000000246 R12: 0000000000000002
> > [52909.093664][ T310] R13: 00007fa3d85c9058 R14: 00007fa3d9b956c0 R15: 00007fa3d85c9000
> > [52909.129758][ T310] INFO: task trinity-c20:149065 can't die for more than 130 seconds.
> > [52909.166089][ T310] trinity-c20 R running task 27712 149065 74932 0x10000004
> > [52909.202031][ T310] Call Trace:
> > [52909.216658][ T310] __schedule+0x77e/0x1070
> > [52909.236187][ T310] ? __sched_text_start+0x8/0x8
> > [52909.257916][ T310] ? check_flags.part.28+0x86/0x220
> > [52909.281061][ T310] preempt_schedule_common+0x16/0x50
> > [52909.304664][ T310] _cond_resched+0x22/0x30
> > [52909.324257][ T310] stop_one_cpu+0xd7/0x140
> > [52909.343842][ T310] ? stop_cpus.constprop.3+0x130/0x130
> > [52909.369442][ T310] ? task_rq_unlock+0x6e/0x80
> > [52909.390672][ T310] ? sched_ttwu_pending+0x1b0/0x1b0
> > [52909.413672][ T310] ? lock_downgrade+0x3e0/0x3e0
> > [52909.435178][ T310] ? mark_held_locks+0x34/0xb0
> > [52909.456323][ T310] ? task_rq_unlock+0x6e/0x80
> > [52909.478561][ T310] __set_cpus_allowed_ptr+0x382/0x3a0
> > [52909.505164][ T310] ? move_queued_task+0x3c0/0x3c0
> > [52909.529450][ T310] ? cpuset_cpus_allowed+0x10b/0x200
> > [52909.553226][ T310] ? trace_hardirqs_on+0x3a/0x160
> > [52909.575565][ T310] ? __bitmap_and+0xd8/0x100
> > [52909.596065][ T310] sched_setaffinity+0x476/0x660
> > [52909.617927][ T310] ? __x64_sys_sched_getattr+0x3e0/0x3e0
> > [52909.643034][ T310] ? copy_user_generic_unrolled+0x86/0xc0
> > [52909.669320][ T310] __x64_sys_sched_setaffinity+0x118/0x180
> > [52909.696241][ T310] ? sched_setaffinity+0x660/0x660
> > [52909.719315][ T310] ? do_syscall_64+0x79/0xaf0
> > [52909.740952][ T310] ? rcu_read_lock_bh_held+0xc0/0xc0
> > [52909.765524][ T310] ? lockdep_hardirqs_on+0x1b0/0x2c0
> > [52909.790368][ T310] ? entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [52909.818929][ T310] do_syscall_64+0xcc/0xaf0
> > [52909.839558][ T310] ? perf_call_bpf_enter+0x1a0/0x1a0
> > [52909.863846][ T310] ? syscall_return_slowpath+0x580/0x580
> > [52909.889545][ T310] ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
> > [52909.916624][ T310] ? trace_hardirqs_off_caller+0x3a/0x150
> > [52909.942269][ T310] ? trace_hardirqs_off_thunk+0x1a/0x1c
> > [52909.966872][ T310] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [52909.995175][ T310] RIP: 0033:0x7fa3d9cd670d
> > [52910.017390][ T310] Code: Bad RIP value.
> > [52910.035569][ T310] RSP: 002b:00007fff1d6588c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
> > [52910.073407][ T310] RAX: ffffffffffffffda RBX: 00000000000000cb RCX: 00007fa3d9cd670d
> > [52910.109426][ T310] RDX: 00007fa3d7c37000 RSI: 00000000000000a2 RDI: 0000000000000000
> > [52910.145197][ T310] RBP: 00000000000000cb R08: ffffffdffdefffff R09: 000000008084b740
> > [52910.181004][ T310] R10: 0000000000000080 R11: 0000000000000246 R12: 0000000000000002
> > [52910.217004][ T310] R13: 00007fa3d85fa058 R14: 00007fa3d9b956c0 R15: 00007fa3d85fa000
> > [52910.254432][ T310]
> > [52910.254432][ T310] Showing all locks held in the system:
> > [52910.289717][ T310] 1 lock held by ksoftirqd/16/94:
> > [52910.312041][ T310] #0: ffff88985fa453d8 (&rq->lock){-.-.}-{2:2}, at: newidle_balance+0x88d/0x9c0
> > [52910.353309][ T310] 1 lock held by khungtaskd/310:
> > [52910.375374][ T310] #0: ffffffffb3bcacc0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.30+0x0/0x30
> > [52910.422536][ T310] 3 locks held by kworker/u96:1/60424:
> > [52910.446875][ T310] #0: ffff88984fef1538 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> > [52910.493637][ T310] #1: ffffc900240ffe00 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> > [52910.542269][ T310] #2: ffffffffb409cf30 (pernet_ops_rwsem){++++}-{3:3}, at: cleanup_net+0xa9/0x5c0
> > [52910.583972][ T310] 2 locks held by runc/74923:
> > [52910.604801][ T310] #0: ffff88842f510898 (&tty->ldisc_sem){++++}-{0:0}, at: ldsem_down_read+0x32/0x40
> > [52910.647284][ T310] #1: ffffc900064bb2e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x187/0xe20
> > [52910.692274][ T310] 8 locks held by runc/74924:
> > [52910.713003][ T310] #0: ffff888842e453d8 (&rq->lock){-.-.}-{2:2}, at: __schedule+0x227/0x1070
> > [52910.752393][ T310] #1: ffffffffb3bcacc0 (rcu_read_lock){....}-{1:2}, at: perf_iterate_sb+0x0/0x500
> > [52910.794272][ T310] #2: ffff88842f510ae8 (&o_tty->termios_rwsem/1){++++}-{3:3}, at: n_tty_write+0x126/0x7c0
> > [52910.839386][ T310] #3: ffff88842f510d30 (&tty->write_wait){-.-.}-{2:2}, at: remove_wait_queue+0x29/0xc0
> > [52910.885384][ T310] #4: ffff888429bb0558 (&port->lock#2){-.-.}-{2:2}, at: pty_write+0x82/0xf0
> > [52910.924906][ T310] #5: ffffffffb3bcacc0 (rcu_read_lock){....}-{1:2}, at: __queue_work+0x8b/0xa40
> > [52910.965808][ T310] #6: ffff88820f19c418 (&pool->lock/1){-.-.}-{2:2}, at: __queue_work+0x154/0xa40
> > [52911.008506][ T310] #7: ffff888349990828 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0xb0/0xfb0
> > [52911.054482][ T310] 2 locks held by kworker/26:2/87166:
> > [52911.078400][ T310] #0: ffff888207838338 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> > [52911.124769][ T310] #1: ffffc900203ffe00 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> > [52911.175647][ T310] 2 locks held by kworker/13:2/110105:
> > [52911.199910][ T310] #0: ffff888207838338 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> > [52911.246366][ T310] #1: ffffc90031bffe00 (free_ipc_work){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> > [52911.289451][ T310] 2 locks held by kworker/24:0/140617:
> > [52911.313772][ T310] #0: ffff888207018938 ((wq_completion)rcu_gp){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> > [52911.360933][ T310] #1: ffffc90021a6fe00 ((work_completion)(&rew.rew_work)){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> > [52911.411991][ T310] 3 locks held by kworker/36:2/142207:
> > [52911.437085][ T310] #0: ffff888207838338 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> > [52911.483296][ T310] #1: ffffc9002420fe00 (slab_caches_to_rcu_destroy_work){+.+.}-{0:0}, at: process_one_work+0x45c/0xb90
> > [52911.537354][ T310] #2: ffffffffb3bd9128 (rcu_state.barrier_mutex){+.+.}-{3:3}, at: rcu_barrier+0x4c/0x440
> > [52911.585581][ T310] 1 lock held by trinity-c11/151262:
> > [52911.609175][ T310] #0: ffff889660160248 (&sb->s_type->i_mutex_key#8){+.+.}-{3:3}, at: __sock_release+0x54/0x120
> > [52911.655737][ T310] 1 lock held by trinity-c35/151367:
> > [52911.680063][ T310] #0: ffffffffb3bd9220 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x4cc/0x5e0
> > [52911.768193][ T310] 2 locks held by trinity-c30/151624:
> > [52911.792141][ T310] 1 lock held by trinity-c33/151643:
> > [52911.815251][ T310] #0: ffffffffb3bd9220 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x4cc/0x5e0
> > [52911.863988][ T310] 1 lock held by trinity-c19/151731:
> > [52911.889078][ T310]
> > [52911.899680][ T310] =============================================

2020-04-30 03:41:52

by Qian Cai

[permalink] [raw]
Subject: Re: rcu_barrier() + membarrier() + scheduler deadlock?



> On Apr 29, 2020, at 11:22 PM, Paul E. McKenney <[email protected]> wrote:
>
> On Wed, Apr 29, 2020 at 11:00:58PM -0400, Qian Cai wrote:
>> Doing some fuzzers on many CPUs triggers a deadlock really quickly. I can see that there were several tasks had been stuck for a while,
>>
>> CPUA:
>> slab_caches_to_rcu_destroy_workfn()
>> rcu_barrier()
>> wait_for_completion()
>>
>> CPUB:
>> sched_setaffinity()
>> __set_cpus_allowed_ptr()
>> stop_one_cpu()
>>
>> CPUC:
>> __x64_sys_membarrier
>> synchronize_rcu()
>> __wait_rcu_gp()
>> wait_for_completion()
>>
>> Lockdep did not flag anything useful at all. Any clue?
>
> CPUA and CPUC are most likely both waiting for a grace period to complete.
>
> CPUA will be blocking CPU hotplug (get_online_cpus()) in case that
> matters. I am not seeing any obvious wait for an RCU grace period
> in CPUB.
>
> But which task's affinity is being set? That of the grace-period kthread?

I am not sure. One of task's sched_setaffinity() can’t die, so here is the log from that task.

# grep setaffinity rootfs/tmp/trinity-child20.log
[child20:131] [313] sched_setaffinity(pid=0, len=0x13939393, user_mask_ptr=0x7fa3d7c37000) = -1 (Bad address)
[child20:131] [574] sched_setaffinity(pid=196, len=0xf000, user_mask_ptr=0x8) = -1 (Bad address)
[child20:131] [589] sched_setaffinity(pid=0, len=100, user_mask_ptr=0x1) = -1 (Bad address)
[child20:131] [615] sched_setaffinity(pid=0, len=1, user_mask_ptr=0x8) = -1 (Bad address)
[child20:589] [17] sched_setaffinity(pid=0, len=8, user_mask_ptr=0x7fa3d7c39000) = -1 (Invalid argument)
[child20:589] [346] sched_setaffinity(pid=0, len=167, user_mask_ptr=0x1) = -1 (Bad address)
[child20:926] [44] sched_setaffinity(pid=0, len=4096, user_mask_ptr=0x7fa3d7c3c000) = -1 (Invalid argument)
[child20:926] [124] sched_setaffinity(pid=0, len=1, user_mask_ptr=0x7fa3d7c38000) = 0
[child20:1007] [217] sched_setaffinity(pid=0, len=4096, user_mask_ptr=0x7fa3d7c38000) = 0
[child20:1007] [235] sched_setaffinity(pid=0, len=8, user_mask_ptr=0x0) = -1 (Bad address)
[child20:1122] [63] sched_setaffinity(pid=0, len=777, user_mask_ptr=0x4) = -1 (Bad address)
[child20:1122] [509] sched_setaffinity(pid=0, len=4096, user_mask_ptr=0x7fa3d803c000) = 0
[child20:1122] [902] sched_setaffinity(pid=1750, len=1, user_mask_ptr=0x8) = -1 (Bad address)
[child20:1824] [57] sched_setaffinity(pid=1723, len=0x1fa56, user_mask_ptr=0x1) = -1 (Bad address)
[child20:1853] [92] sched_setaffinity(pid=1741, len=4096, user_mask_ptr=0x7fa3d843c000) = 0
[child20:2058] [114] sched_setaffinity(pid=2019, len=23, user_mask_ptr=0x8) = -1 (Bad address)

> If not, are there rcuo kthreads present? Either way, preventing any of

I did not catch that. I might just be that workqueue was unable to handle fuzzers loads on many CPUs. I had another 32-CPU server stuck in flush_work(). Not sure if it is related.

> RCU's kthreads from running could block potentially both CPUA and CPUC.
> Though in the case of the grace-period kthread, I would expect to see
> an RCU CPU stall warning.
>
> Could you please share your .config?

https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config

[53294.651754][T149877] futex_wake_op: trinity-c25 tries to shift op by -17; fix this program
[53323.947396][T150988] futex_wake_op: trinity-c6 tries to shift op by -5; fix this program
[53458.295837][ T215] INFO: task kworker/u64:0:8 blocked for more than 122 seconds.
[53458.304063][ T215] Tainted: G O L 5.7.0-rc3-next-20200429 #1
[53458.311568][ T215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[53458.320190][ T215] kworker/u64:0 D10584 8 2 0x90004000
[53458.326668][ T215] Workqueue: netns cleanup_net
[53458.331330][ T215] Call Trace:
[53458.334510][ T215] __schedule+0x47b/0xa50
[53458.338765][ T215] ? wait_for_completion+0x80/0x120
[53458.343920][ T215] schedule+0x59/0xd0
[53458.348013][ T215] schedule_timeout+0x10a/0x150
[53458.352762][ T215] ? wait_for_completion+0x80/0x120
[53458.357881][ T215] ? _raw_spin_unlock_irq+0x30/0x40
[53458.362997][ T215] ? trace_hardirqs_on+0x22/0x100
[53458.367948][ T215] ? wait_for_completion+0x80/0x120
[53458.373195][ T215] wait_for_completion+0xb4/0x120
[53458.378149][ T215] __flush_work+0x3ff/0x6e0
[53458.382586][ T215] ? init_pwq+0x210/0x210
[53458.386840][ T215] flush_work+0x20/0x30
[53458.390891][ T215] rollback_registered_many+0x3d6/0x950
[53458.396438][ T215] ? mark_held_locks+0x4e/0x80
[53458.401339][ T215] unregister_netdevice_many+0x5d/0x200
[53458.406816][ T215] default_device_exit_batch+0x213/0x240
[53458.412348][ T215] ? do_wait_intr_irq+0xe0/0xe0
[53458.417225][ T215] ? dev_change_net_namespace+0x6d0/0x6d0
[53458.423000][ T215] ops_exit_list+0xa2/0xc0
[53458.427367][ T215] cleanup_net+0x3d0/0x600
[53458.431778][ T215] process_one_work+0x560/0xba0
[53458.436629][ T215] worker_thread+0x80/0x5f0
[53458.441078][ T215] ? process_scheduled_works+0x90/0x90
[53458.446485][ T215] kthread+0x1de/0x200
[53458.450600][ T215] ? kthread_unpark+0xd0/0xd0
[53458.455227][ T215] ret_from_fork+0x27/0x50
[53458.460332][ T215] INFO: task trinity-c17:150651 blocked for more than 123 seconds.
[53458.468180][ T215] Tainted: G O L 5.7.0-rc3-next-20200429 #1
[53458.475924][ T215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[53458.484511][ T215] trinity-c17 D12312 150651 93301 0x10000004
[53458.490862][ T215] Call Trace:
[53458.494039][ T215] __schedule+0x47b/0xa50
[53458.498452][ T215] schedule+0x59/0xd0
[53458.502376][ T215] schedule_preempt_disabled+0x15/0x20
[53458.507809][ T215] __mutex_lock+0x6f2/0xbf0
[53458.512296][ T215] ? rtnl_lock+0x20/0x30
[53458.516484][ T215] mutex_lock_nested+0x31/0x40
[53458.521157][ T215] ? mutex_lock_nested+0x31/0x40
[53458.526195][ T215] rtnl_lock+0x20/0x30
[53458.530251][ T215] do_ip_setsockopt.isra.12+0xec/0x1b90
[53458.535875][ T215] ? find_held_lock+0x35/0xa0
[53458.540603][ T215] ? rb_insert_color+0x10f/0x390
[53458.545436][ T215] ? lock_acquire+0xcd/0x450
[53458.550126][ T215] ? find_held_lock+0x35/0xa0
[53458.554717][ T215] ? __cgroup_bpf_prog_array_is_empty+0x121/0x230
[53458.561127][ T215] ip_setsockopt+0x3e/0x90
[53458.565511][ T215] udp_setsockopt+0x49/0x80
[53458.570059][ T215] sock_common_setsockopt+0x6d/0x90
[53458.575303][ T215] __sys_setsockopt+0x194/0x2e0
[53458.580128][ T215] __x64_sys_setsockopt+0x70/0x90
[53458.585066][ T215] do_syscall_64+0x91/0xb10
[53458.589504][ T215] ? perf_call_bpf_enter+0x120/0x120
[53458.594741][ T215] ? trace_hardirqs_off_thunk+0x1a/0x1c
[53458.600467][ T215] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[53458.606296][ T215] RIP: 0033:0x7f2cbe1b270d
[53458.610611][ T215] Code: Bad RIP value.
[53458.614570][ T215] RSP: 002b:00007ffe6a4b18d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
[53458.623094][ T215] RAX: ffffffffffffffda RBX: 0000000000000036 RCX: 00007f2cbe1b270d
[53458.631024][ T215] RDX: 000000000000002a RSI: 0000000000000000 RDI: 0000000000000060
[53458.638954][ T215] RBP: 0000000000000036 R08: 0000000000000088 R09: 00000000000000f0
[53458.647044][ T215] R10: 00000000025557f0 R11: 0000000000000246 R12: 0000000000000002
[53458.654960][ T215] R13: 00007f2cbcaeb058 R14: 00007f2cbe0716c0 R15: 00007f2cbcaeb000
[53458.662899][ T215] INFO: task trinity-c10:150896 blocked for more than 123 seconds.
[53458.670757][ T215] Tainted: G O L 5.7.0-rc3-next-20200429 #1
[53458.678380][ T215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[53458.687007][ T215] trinity-c10 D11512 150896 93301 0x10000004
[53458.693334][ T215] Call Trace:
[53458.696545][ T215] __schedule+0x47b/0xa50
[53458.700944][ T215] schedule+0x59/0xd0
[53458.704817][ T215] schedule_preempt_disabled+0x15/0x20
[53458.710246][ T215] __mutex_lock+0x6f2/0xbf0
[53458.714686][ T215] ? rtnl_lock+0x20/0x30
[53458.718851][ T215] mutex_lock_nested+0x31/0x40
[53458.723649][ T215] ? mutex_lock_nested+0x31/0x40
[53458.728556][ T215] rtnl_lock+0x20/0x30
[53458.732520][ T215] do_ip_setsockopt.isra.12+0xec/0x1b90
[53458.737998][ T215] ? find_held_lock+0x35/0xa0
[53458.742646][ T215] ? rb_insert_color+0x10f/0x390
[53458.747672][ T215] ? lock_acquire+0xcd/0x450
[53458.752236][ T215] ? find_held_lock+0x35/0xa0
[53458.756847][ T215] ? __cgroup_bpf_prog_array_is_empty+0x121/0x230
[53458.763169][ T215] ip_setsockopt+0x3e/0x90
[53458.767509][ T215] udp_setsockopt+0x49/0x80
[53458.772061][ T215] sock_common_setsockopt+0x6d/0x90
[53458.777192][ T215] __sys_setsockopt+0x194/0x2e0
[53458.781976][ T215] __x64_sys_setsockopt+0x70/0x90
[53458.786994][ T215] do_syscall_64+0x91/0xb10
[53458.791460][ T215] ? perf_call_bpf_enter+0x120/0x120
[53458.797008][ T215] ? trace_hardirqs_off_thunk+0x1a/0x1c
[53458.802523][ T215] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[53458.808368][ T215] RIP: 0033:0x7f2cbe1b270d
[53458.812678][ T215] Code: Bad RIP value.
[53458.816669][ T215] RSP: 002b:00007ffe6a4b18d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
[53458.825143][ T215] RAX: ffffffffffffffda RBX: 0000000000000036 RCX: 00007f2cbe1b270d
[53458.833070][ T215] RDX: 000000000000002c RSI: 0000000000000000 RDI: 000000000000005c
[53458.840997][ T215] RBP: 0000000000000036 R08: 0000000000000108 R09: 00000000b5b5b5b5
[53458.849155][ T215] R10: 000000000255b5a0 R11: 0000000000000246 R12: 0000000000000002
[53458.857072][ T215] R13: 00007f2cbcb1c058 R14: 00007f2cbe0716c0 R15: 00007f2cbcb1c000
[53458.865076][ T215]
[53458.865076][ T215] Showing all locks held in the system:
[53458.872936][ T215] 5 locks held by kworker/u64:0/8:
[53458.878066][ T215] #0: ffff95b4734e4538 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work+0x454/0xba0
[53458.888191][ T215] #1: ffffa459431abe18 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work+0x454/0xba0
[53458.898177][ T215] #2: ffffffff955917b0 (pernet_ops_rwsem){++++}-{3:3}, at: cleanup_net+0x6d/0x600
[53458.907469][ T215] #3: ffffffff955975a8 (rtnl_mutexl_mutex){+.+.}-{3:3}, at: rtnl_lock_killable+0x21/0x30
[53459.304598][ T215] 1 lock held by trinity-c17/150651:
[53459.309827][ T215] #0: ffffffff955975a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock+0x20/0x30
[53459.318343][ T215] 2 locks held by trinity-c11/150712:
[53459.323744][ T215] #0: ffffffff955917b0 (pernet_ops_rwsem){++++}-{3:3}, at: copy_net_ns+0x15c/0x26e
[53459.333092][ T215] #1: ffffffff955975a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock_killable+0x21/0x30
[53459.342364][ T215] 2 locks held by trinity-c7/150739:
[53459.347768][ T215] #0: ffffffff955917b0 (pernet_ops_rwsem){++++}-{3:3}, at: copy_net_ns+0x15c/0x26e
[53459.357179][ T215] #1: ffffffff955975a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock_killable+0x21/0x30
[53459.366479][ T215] 2 locks held by trinity-c15/150758:
[53459.372006][ T215] #0: ffffffff955917b0 (pernet_ops_rwsem){++++}-{3:3}, at: copy_net_ns+0x15c/0x26e
[53459.381398][ T215] #1: ffffffff955975a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock_killable+0x21/0x30

2020-04-30 13:17:23

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_barrier() + membarrier() + scheduler deadlock?

On Wed, Apr 29, 2020 at 11:39:48PM -0400, Qian Cai wrote:
>
>
> > On Apr 29, 2020, at 11:22 PM, Paul E. McKenney <[email protected]> wrote:
> >
> > On Wed, Apr 29, 2020 at 11:00:58PM -0400, Qian Cai wrote:
> >> Doing some fuzzers on many CPUs triggers a deadlock really quickly. I can see that there were several tasks had been stuck for a while,
> >>
> >> CPUA:
> >> slab_caches_to_rcu_destroy_workfn()
> >> rcu_barrier()
> >> wait_for_completion()
> >>
> >> CPUB:
> >> sched_setaffinity()
> >> __set_cpus_allowed_ptr()
> >> stop_one_cpu()
> >>
> >> CPUC:
> >> __x64_sys_membarrier
> >> synchronize_rcu()
> >> __wait_rcu_gp()
> >> wait_for_completion()
> >>
> >> Lockdep did not flag anything useful at all. Any clue?
> >
> > CPUA and CPUC are most likely both waiting for a grace period to complete.
> >
> > CPUA will be blocking CPU hotplug (get_online_cpus()) in case that
> > matters. I am not seeing any obvious wait for an RCU grace period
> > in CPUB.
> >
> > But which task's affinity is being set? That of the grace-period kthread?
>
> I am not sure. One of task's sched_setaffinity() can’t die, so here is the log from that task.
>
> # grep setaffinity rootfs/tmp/trinity-child20.log
> [child20:131] [313] sched_setaffinity(pid=0, len=0x13939393, user_mask_ptr=0x7fa3d7c37000) = -1 (Bad address)
> [child20:131] [574] sched_setaffinity(pid=196, len=0xf000, user_mask_ptr=0x8) = -1 (Bad address)
> [child20:131] [589] sched_setaffinity(pid=0, len=100, user_mask_ptr=0x1) = -1 (Bad address)
> [child20:131] [615] sched_setaffinity(pid=0, len=1, user_mask_ptr=0x8) = -1 (Bad address)
> [child20:589] [17] sched_setaffinity(pid=0, len=8, user_mask_ptr=0x7fa3d7c39000) = -1 (Invalid argument)
> [child20:589] [346] sched_setaffinity(pid=0, len=167, user_mask_ptr=0x1) = -1 (Bad address)
> [child20:926] [44] sched_setaffinity(pid=0, len=4096, user_mask_ptr=0x7fa3d7c3c000) = -1 (Invalid argument)
> [child20:926] [124] sched_setaffinity(pid=0, len=1, user_mask_ptr=0x7fa3d7c38000) = 0
> [child20:1007] [217] sched_setaffinity(pid=0, len=4096, user_mask_ptr=0x7fa3d7c38000) = 0
> [child20:1007] [235] sched_setaffinity(pid=0, len=8, user_mask_ptr=0x0) = -1 (Bad address)
> [child20:1122] [63] sched_setaffinity(pid=0, len=777, user_mask_ptr=0x4) = -1 (Bad address)
> [child20:1122] [509] sched_setaffinity(pid=0, len=4096, user_mask_ptr=0x7fa3d803c000) = 0
> [child20:1122] [902] sched_setaffinity(pid=1750, len=1, user_mask_ptr=0x8) = -1 (Bad address)
> [child20:1824] [57] sched_setaffinity(pid=1723, len=0x1fa56, user_mask_ptr=0x1) = -1 (Bad address)
> [child20:1853] [92] sched_setaffinity(pid=1741, len=4096, user_mask_ptr=0x7fa3d843c000) = 0
> [child20:2058] [114] sched_setaffinity(pid=2019, len=23, user_mask_ptr=0x8) = -1 (Bad address)

Not sure what to make of this. Some succeeded, others had various
error conditions. Though I would expect the PID of RCU's kthreads to
be smaller than shown, but it does depend on the hardware and software
configuration.

> > If not, are there rcuo kthreads present? Either way, preventing any of
>
> I did not catch that. I might just be that workqueue was unable to handle fuzzers loads on many CPUs. I had another 32-CPU server stuck in flush_work(). Not sure if it is related.

Your .config hss CONFIG_PREEMPT_NONE=y and CONFIG_NO_HZ_IDLE=y, and
there is no RCU callback offloading, so there should not have been any
rcuo kthreads.

But yes, too much overload might break things.

> > RCU's kthreads from running could block potentially both CPUA and CPUC.
> > Though in the case of the grace-period kthread, I would expect to see
> > an RCU CPU stall warning.
> >
> > Could you please share your .config?
>
> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
>
> [53294.651754][T149877] futex_wake_op: trinity-c25 tries to shift op by -17; fix this program
> [53323.947396][T150988] futex_wake_op: trinity-c6 tries to shift op by -5; fix this program
> [53458.295837][ T215] INFO: task kworker/u64:0:8 blocked for more than 122 seconds.
> [53458.304063][ T215] Tainted: G O L 5.7.0-rc3-next-20200429 #1
> [53458.311568][ T215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [53458.320190][ T215] kworker/u64:0 D10584 8 2 0x90004000
> [53458.326668][ T215] Workqueue: netns cleanup_net
> [53458.331330][ T215] Call Trace:
> [53458.334510][ T215] __schedule+0x47b/0xa50
> [53458.338765][ T215] ? wait_for_completion+0x80/0x120
> [53458.343920][ T215] schedule+0x59/0xd0
> [53458.348013][ T215] schedule_timeout+0x10a/0x150
> [53458.352762][ T215] ? wait_for_completion+0x80/0x120
> [53458.357881][ T215] ? _raw_spin_unlock_irq+0x30/0x40
> [53458.362997][ T215] ? trace_hardirqs_on+0x22/0x100
> [53458.367948][ T215] ? wait_for_completion+0x80/0x120
> [53458.373195][ T215] wait_for_completion+0xb4/0x120
> [53458.378149][ T215] __flush_work+0x3ff/0x6e0
> [53458.382586][ T215] ? init_pwq+0x210/0x210
> [53458.386840][ T215] flush_work+0x20/0x30
> [53458.390891][ T215] rollback_registered_many+0x3d6/0x950
> [53458.396438][ T215] ? mark_held_locks+0x4e/0x80
> [53458.401339][ T215] unregister_netdevice_many+0x5d/0x200
> [53458.406816][ T215] default_device_exit_batch+0x213/0x240
> [53458.412348][ T215] ? do_wait_intr_irq+0xe0/0xe0
> [53458.417225][ T215] ? dev_change_net_namespace+0x6d0/0x6d0
> [53458.423000][ T215] ops_exit_list+0xa2/0xc0
> [53458.427367][ T215] cleanup_net+0x3d0/0x600
> [53458.431778][ T215] process_one_work+0x560/0xba0
> [53458.436629][ T215] worker_thread+0x80/0x5f0
> [53458.441078][ T215] ? process_scheduled_works+0x90/0x90
> [53458.446485][ T215] kthread+0x1de/0x200
> [53458.450600][ T215] ? kthread_unpark+0xd0/0xd0
> [53458.455227][ T215] ret_from_fork+0x27/0x50
> [53458.460332][ T215] INFO: task trinity-c17:150651 blocked for more than 123 seconds.
> [53458.468180][ T215] Tainted: G O L 5.7.0-rc3-next-20200429 #1
> [53458.475924][ T215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [53458.484511][ T215] trinity-c17 D12312 150651 93301 0x10000004
> [53458.490862][ T215] Call Trace:
> [53458.494039][ T215] __schedule+0x47b/0xa50
> [53458.498452][ T215] schedule+0x59/0xd0
> [53458.502376][ T215] schedule_preempt_disabled+0x15/0x20
> [53458.507809][ T215] __mutex_lock+0x6f2/0xbf0
> [53458.512296][ T215] ? rtnl_lock+0x20/0x30
> [53458.516484][ T215] mutex_lock_nested+0x31/0x40
> [53458.521157][ T215] ? mutex_lock_nested+0x31/0x40
> [53458.526195][ T215] rtnl_lock+0x20/0x30
> [53458.530251][ T215] do_ip_setsockopt.isra.12+0xec/0x1b90
> [53458.535875][ T215] ? find_held_lock+0x35/0xa0
> [53458.540603][ T215] ? rb_insert_color+0x10f/0x390
> [53458.545436][ T215] ? lock_acquire+0xcd/0x450
> [53458.550126][ T215] ? find_held_lock+0x35/0xa0
> [53458.554717][ T215] ? __cgroup_bpf_prog_array_is_empty+0x121/0x230
> [53458.561127][ T215] ip_setsockopt+0x3e/0x90
> [53458.565511][ T215] udp_setsockopt+0x49/0x80
> [53458.570059][ T215] sock_common_setsockopt+0x6d/0x90
> [53458.575303][ T215] __sys_setsockopt+0x194/0x2e0
> [53458.580128][ T215] __x64_sys_setsockopt+0x70/0x90
> [53458.585066][ T215] do_syscall_64+0x91/0xb10
> [53458.589504][ T215] ? perf_call_bpf_enter+0x120/0x120
> [53458.594741][ T215] ? trace_hardirqs_off_thunk+0x1a/0x1c
> [53458.600467][ T215] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [53458.606296][ T215] RIP: 0033:0x7f2cbe1b270d
> [53458.610611][ T215] Code: Bad RIP value.
> [53458.614570][ T215] RSP: 002b:00007ffe6a4b18d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
> [53458.623094][ T215] RAX: ffffffffffffffda RBX: 0000000000000036 RCX: 00007f2cbe1b270d
> [53458.631024][ T215] RDX: 000000000000002a RSI: 0000000000000000 RDI: 0000000000000060
> [53458.638954][ T215] RBP: 0000000000000036 R08: 0000000000000088 R09: 00000000000000f0
> [53458.647044][ T215] R10: 00000000025557f0 R11: 0000000000000246 R12: 0000000000000002
> [53458.654960][ T215] R13: 00007f2cbcaeb058 R14: 00007f2cbe0716c0 R15: 00007f2cbcaeb000
> [53458.662899][ T215] INFO: task trinity-c10:150896 blocked for more than 123 seconds.
> [53458.670757][ T215] Tainted: G O L 5.7.0-rc3-next-20200429 #1
> [53458.678380][ T215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [53458.687007][ T215] trinity-c10 D11512 150896 93301 0x10000004
> [53458.693334][ T215] Call Trace:
> [53458.696545][ T215] __schedule+0x47b/0xa50
> [53458.700944][ T215] schedule+0x59/0xd0
> [53458.704817][ T215] schedule_preempt_disabled+0x15/0x20
> [53458.710246][ T215] __mutex_lock+0x6f2/0xbf0
> [53458.714686][ T215] ? rtnl_lock+0x20/0x30
> [53458.718851][ T215] mutex_lock_nested+0x31/0x40
> [53458.723649][ T215] ? mutex_lock_nested+0x31/0x40
> [53458.728556][ T215] rtnl_lock+0x20/0x30
> [53458.732520][ T215] do_ip_setsockopt.isra.12+0xec/0x1b90
> [53458.737998][ T215] ? find_held_lock+0x35/0xa0
> [53458.742646][ T215] ? rb_insert_color+0x10f/0x390
> [53458.747672][ T215] ? lock_acquire+0xcd/0x450
> [53458.752236][ T215] ? find_held_lock+0x35/0xa0
> [53458.756847][ T215] ? __cgroup_bpf_prog_array_is_empty+0x121/0x230
> [53458.763169][ T215] ip_setsockopt+0x3e/0x90
> [53458.767509][ T215] udp_setsockopt+0x49/0x80
> [53458.772061][ T215] sock_common_setsockopt+0x6d/0x90
> [53458.777192][ T215] __sys_setsockopt+0x194/0x2e0
> [53458.781976][ T215] __x64_sys_setsockopt+0x70/0x90
> [53458.786994][ T215] do_syscall_64+0x91/0xb10
> [53458.791460][ T215] ? perf_call_bpf_enter+0x120/0x120
> [53458.797008][ T215] ? trace_hardirqs_off_thunk+0x1a/0x1c
> [53458.802523][ T215] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [53458.808368][ T215] RIP: 0033:0x7f2cbe1b270d
> [53458.812678][ T215] Code: Bad RIP value.
> [53458.816669][ T215] RSP: 002b:00007ffe6a4b18d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
> [53458.825143][ T215] RAX: ffffffffffffffda RBX: 0000000000000036 RCX: 00007f2cbe1b270d
> [53458.833070][ T215] RDX: 000000000000002c RSI: 0000000000000000 RDI: 000000000000005c
> [53458.840997][ T215] RBP: 0000000000000036 R08: 0000000000000108 R09: 00000000b5b5b5b5
> [53458.849155][ T215] R10: 000000000255b5a0 R11: 0000000000000246 R12: 0000000000000002
> [53458.857072][ T215] R13: 00007f2cbcb1c058 R14: 00007f2cbe0716c0 R15: 00007f2cbcb1c000
> [53458.865076][ T215]
> [53458.865076][ T215] Showing all locks held in the system:
> [53458.872936][ T215] 5 locks held by kworker/u64:0/8:
> [53458.878066][ T215] #0: ffff95b4734e4538 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work+0x454/0xba0
> [53458.888191][ T215] #1: ffffa459431abe18 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work+0x454/0xba0
> [53458.898177][ T215] #2: ffffffff955917b0 (pernet_ops_rwsem){++++}-{3:3}, at: cleanup_net+0x6d/0x600
> [53458.907469][ T215] #3: ffffffff955975a8 (rtnl_mutexl_mutex){+.+.}-{3:3}, at: rtnl_lock_killable+0x21/0x30
> [53459.304598][ T215] 1 lock held by trinity-c17/150651:
> [53459.309827][ T215] #0: ffffffff955975a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock+0x20/0x30
> [53459.318343][ T215] 2 locks held by trinity-c11/150712:
> [53459.323744][ T215] #0: ffffffff955917b0 (pernet_ops_rwsem){++++}-{3:3}, at: copy_net_ns+0x15c/0x26e
> [53459.333092][ T215] #1: ffffffff955975a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock_killable+0x21/0x30
> [53459.342364][ T215] 2 locks held by trinity-c7/150739:
> [53459.347768][ T215] #0: ffffffff955917b0 (pernet_ops_rwsem){++++}-{3:3}, at: copy_net_ns+0x15c/0x26e
> [53459.357179][ T215] #1: ffffffff955975a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock_killable+0x21/0x30
> [53459.366479][ T215] 2 locks held by trinity-c15/150758:
> [53459.372006][ T215] #0: ffffffff955917b0 (pernet_ops_rwsem){++++}-{3:3}, at: copy_net_ns+0x15c/0x26e
> [53459.381398][ T215] #1: ffffffff955975a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock_killable+0x21/0x30

If you have lots of threads doing networking configuration changes,
then contention on RTNL (as evidenced above) is expected behavior.

Thanx, Paul

2020-04-30 13:30:40

by Qian Cai

[permalink] [raw]
Subject: Re: rcu_barrier() + membarrier() + scheduler deadlock?



> On Apr 30, 2020, at 9:15 AM, Paul E. McKenney <[email protected]> wrote:
>>
>> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
>>
>> [53294.651754][T149877] futex_wake_op: trinity-c25 tries to shift op by -17; fix this program
>> [53323.947396][T150988] futex_wake_op: trinity-c6 tries to shift op by -5; fix this program
>> [53458.295837][ T215] INFO: task kworker/u64:0:8 blocked for more than 122 seconds.
>> [53458.304063][ T215] Tainted: G O L 5.7.0-rc3-next-20200429 #1
>> [53458.311568][ T215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [53458.320190][ T215] kworker/u64:0 D10584 8 2 0x90004000
>> [53458.326668][ T215] Workqueue: netns cleanup_net
>> [53458.331330][ T215] Call Trace:
>> [53458.334510][ T215] __schedule+0x47b/0xa50
>> [53458.338765][ T215] ? wait_for_completion+0x80/0x120
>> [53458.343920][ T215] schedule+0x59/0xd0
>> [53458.348013][ T215] schedule_timeout+0x10a/0x150
>> [53458.352762][ T215] ? wait_for_completion+0x80/0x120
>> [53458.357881][ T215] ? _raw_spin_unlock_irq+0x30/0x40
>> [53458.362997][ T215] ? trace_hardirqs_on+0x22/0x100
>> [53458.367948][ T215] ? wait_for_completion+0x80/0x120
>> [53458.373195][ T215] wait_for_completion+0xb4/0x120
>> [53458.378149][ T215] __flush_work+0x3ff/0x6e0
>> [53458.382586][ T215] ? init_pwq+0x210/0x210
>> [53458.386840][ T215] flush_work+0x20/0x30
>> [53458.390891][ T215] rollback_registered_many+0x3d6/0x950
>> [53458.396438][ T215] ? mark_held_locks+0x4e/0x80
>> [53458.401339][ T215] unregister_netdevice_many+0x5d/0x200
>> [53458.406816][ T215] default_device_exit_batch+0x213/0x240
>> [53458.412348][ T215] ? do_wait_intr_irq+0xe0/0xe0
>> [53458.417225][ T215] ? dev_change_net_namespace+0x6d0/0x6d0
>> [53458.423000][ T215] ops_exit_list+0xa2/0xc0
>> [53458.427367][ T215] cleanup_net+0x3d0/0x600
>> [53458.431778][ T215] process_one_work+0x560/0xba0
>> [53458.436629][ T215] worker_thread+0x80/0x5f0
>> [53458.441078][ T215] ? process_scheduled_works+0x90/0x90
>> [53458.446485][ T215] kthread+0x1de/0x200
>> [53458.450600][ T215] ? kthread_unpark+0xd0/0xd0
>> [53458.455227][ T215] ret_from_fork+0x27/0x50
>> [53458.460332][ T215] INFO: task trinity-c17:150651 blocked for more than 123 seconds.
>> [53458.468180][ T215] Tainted: G O L 5.7.0-rc3-next-20200429 #1
>> [53458.475924][ T215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [53458.484511][ T215] trinity-c17 D12312 150651 93301 0x10000004
>> [53458.490862][ T215] Call Trace:
>> [53458.494039][ T215] __schedule+0x47b/0xa50
>> [53458.498452][ T215] schedule+0x59/0xd0
>> [53458.502376][ T215] schedule_preempt_disabled+0x15/0x20
>> [53458.507809][ T215] __mutex_lock+0x6f2/0xbf0
>> [53458.512296][ T215] ? rtnl_lock+0x20/0x30
>> [53458.516484][ T215] mutex_lock_nested+0x31/0x40
>> [53458.521157][ T215] ? mutex_lock_nested+0x31/0x40
>> [53458.526195][ T215] rtnl_lock+0x20/0x30
>> [53458.530251][ T215] do_ip_setsockopt.isra.12+0xec/0x1b90
>> [53458.535875][ T215] ? find_held_lock+0x35/0xa0
>> [53458.540603][ T215] ? rb_insert_color+0x10f/0x390
>> [53458.545436][ T215] ? lock_acquire+0xcd/0x450
>> [53458.550126][ T215] ? find_held_lock+0x35/0xa0
>> [53458.554717][ T215] ? __cgroup_bpf_prog_array_is_empty+0x121/0x230
>> [53458.561127][ T215] ip_setsockopt+0x3e/0x90
>> [53458.565511][ T215] udp_setsockopt+0x49/0x80
>> [53458.570059][ T215] sock_common_setsockopt+0x6d/0x90
>> [53458.575303][ T215] __sys_setsockopt+0x194/0x2e0
>> [53458.580128][ T215] __x64_sys_setsockopt+0x70/0x90
>> [53458.585066][ T215] do_syscall_64+0x91/0xb10
>> [53458.589504][ T215] ? perf_call_bpf_enter+0x120/0x120
>> [53458.594741][ T215] ? trace_hardirqs_off_thunk+0x1a/0x1c
>> [53458.600467][ T215] entry_SYSCALL_64_after_hwframe+0x49/0xb3
>> [53458.606296][ T215] RIP: 0033:0x7f2cbe1b270d
>> [53458.610611][ T215] Code: Bad RIP value.
>> [53458.614570][ T215] RSP: 002b:00007ffe6a4b18d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
>> [53458.623094][ T215] RAX: ffffffffffffffda RBX: 0000000000000036 RCX: 00007f2cbe1b270d
>> [53458.631024][ T215] RDX: 000000000000002a RSI: 0000000000000000 RDI: 0000000000000060
>> [53458.638954][ T215] RBP: 0000000000000036 R08: 0000000000000088 R09: 00000000000000f0
>> [53458.647044][ T215] R10: 00000000025557f0 R11: 0000000000000246 R12: 0000000000000002
>> [53458.654960][ T215] R13: 00007f2cbcaeb058 R14: 00007f2cbe0716c0 R15: 00007f2cbcaeb000
>> [53458.662899][ T215] INFO: task trinity-c10:150896 blocked for more than 123 seconds.
>> [53458.670757][ T215] Tainted: G O L 5.7.0-rc3-next-20200429 #1
>> [53458.678380][ T215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [53458.687007][ T215] trinity-c10 D11512 150896 93301 0x10000004
>> [53458.693334][ T215] Call Trace:
>> [53458.696545][ T215] __schedule+0x47b/0xa50
>> [53458.700944][ T215] schedule+0x59/0xd0
>> [53458.704817][ T215] schedule_preempt_disabled+0x15/0x20
>> [53458.710246][ T215] __mutex_lock+0x6f2/0xbf0
>> [53458.714686][ T215] ? rtnl_lock+0x20/0x30
>> [53458.718851][ T215] mutex_lock_nested+0x31/0x40
>> [53458.723649][ T215] ? mutex_lock_nested+0x31/0x40
>> [53458.728556][ T215] rtnl_lock+0x20/0x30
>> [53458.732520][ T215] do_ip_setsockopt.isra.12+0xec/0x1b90
>> [53458.737998][ T215] ? find_held_lock+0x35/0xa0
>> [53458.742646][ T215] ? rb_insert_color+0x10f/0x390
>> [53458.747672][ T215] ? lock_acquire+0xcd/0x450
>> [53458.752236][ T215] ? find_held_lock+0x35/0xa0
>> [53458.756847][ T215] ? __cgroup_bpf_prog_array_is_empty+0x121/0x230
>> [53458.763169][ T215] ip_setsockopt+0x3e/0x90
>> [53458.767509][ T215] udp_setsockopt+0x49/0x80
>> [53458.772061][ T215] sock_common_setsockopt+0x6d/0x90
>> [53458.777192][ T215] __sys_setsockopt+0x194/0x2e0
>> [53458.781976][ T215] __x64_sys_setsockopt+0x70/0x90
>> [53458.786994][ T215] do_syscall_64+0x91/0xb10
>> [53458.791460][ T215] ? perf_call_bpf_enter+0x120/0x120
>> [53458.797008][ T215] ? trace_hardirqs_off_thunk+0x1a/0x1c
>> [53458.802523][ T215] entry_SYSCALL_64_after_hwframe+0x49/0xb3
>> [53458.808368][ T215] RIP: 0033:0x7f2cbe1b270d
>> [53458.812678][ T215] Code: Bad RIP value.
>> [53458.816669][ T215] RSP: 002b:00007ffe6a4b18d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
>> [53458.825143][ T215] RAX: ffffffffffffffda RBX: 0000000000000036 RCX: 00007f2cbe1b270d
>> [53458.833070][ T215] RDX: 000000000000002c RSI: 0000000000000000 RDI: 000000000000005c
>> [53458.840997][ T215] RBP: 0000000000000036 R08: 0000000000000108 R09: 00000000b5b5b5b5
>> [53458.849155][ T215] R10: 000000000255b5a0 R11: 0000000000000246 R12: 0000000000000002
>> [53458.857072][ T215] R13: 00007f2cbcb1c058 R14: 00007f2cbe0716c0 R15: 00007f2cbcb1c000
>> [53458.865076][ T215]
>> [53458.865076][ T215] Showing all locks held in the system:
>> [53458.872936][ T215] 5 locks held by kworker/u64:0/8:
>> [53458.878066][ T215] #0: ffff95b4734e4538 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work+0x454/0xba0
>> [53458.888191][ T215] #1: ffffa459431abe18 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work+0x454/0xba0
>> [53458.898177][ T215] #2: ffffffff955917b0 (pernet_ops_rwsem){++++}-{3:3}, at: cleanup_net+0x6d/0x600
>> [53458.907469][ T215] #3: ffffffff955975a8 (rtnl_mutexl_mutex){+.+.}-{3:3}, at: rtnl_lock_killable+0x21/0x30
>> [53459.304598][ T215] 1 lock held by trinity-c17/150651:
>> [53459.309827][ T215] #0: ffffffff955975a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock+0x20/0x30
>> [53459.318343][ T215] 2 locks held by trinity-c11/150712:
>> [53459.323744][ T215] #0: ffffffff955917b0 (pernet_ops_rwsem){++++}-{3:3}, at: copy_net_ns+0x15c/0x26e
>> [53459.333092][ T215] #1: ffffffff955975a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock_killable+0x21/0x30
>> [53459.342364][ T215] 2 locks held by trinity-c7/150739:
>> [53459.347768][ T215] #0: ffffffff955917b0 (pernet_ops_rwsem){++++}-{3:3}, at: copy_net_ns+0x15c/0x26e
>> [53459.357179][ T215] #1: ffffffff955975a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock_killable+0x21/0x30
>> [53459.366479][ T215] 2 locks held by trinity-c15/150758:
>> [53459.372006][ T215] #0: ffffffff955917b0 (pernet_ops_rwsem){++++}-{3:3}, at: copy_net_ns+0x15c/0x26e
>> [53459.381398][ T215] #1: ffffffff955975a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock_killable+0x21/0x30
>
> If you have lots of threads doing networking configuration changes,
> then contention on RTNL (as evidenced above) is expected behavior.

Right, I’ll remove the CAP_NET_ADMIN from the fuzzer container to see if it helps. I thought it was no biggie because it was inside a net namespace. I’ll also reduce the number of fuzzer child process (one per active CPU right now) to see if it could get a bit clear picture. At least, it triggers quite a few memory leak over there, so it could be some error handing is not robust enough. I’ll dig a bit.

unreferenced object 0xffff97008e4c4040 (size 176):
comm "softirq", pid 0, jiffies 4295173845 (age 32012.550s)
hex dump (first 32 bytes):
00 d0 a5 74 04 97 ff ff 40 72 1a 96 ff ff ff ff ...t....@r......
c1 a3 c5 95 ff ff ff ff 00 00 00 00 00 00 00 00 ................
backtrace:
[<0000000030483fae>] kmem_cache_alloc+0x184/0x430
[<000000007ae17545>] dst_alloc+0x8e/0x128
[<000000001efe9a1f>] rt_dst_alloc+0x6f/0x1e0
rt_dst_alloc at net/ipv4/route.c:1628
[<00000000e67d4dac>] ip_route_input_rcu+0xdfe/0x1640
ip_route_input_slow at net/ipv4/route.c:2218
(inlined by) ip_route_input_rcu at net/ipv4/route.c:2348
[<000000009f30cbc0>] ip_route_input_noref+0xab/0x1a0
[<000000004f53bd04>] arp_process+0x83a/0xf50
arp_process at net/ipv4/arp.c:813 (discriminator 1)
[<0000000061fd547d>] arp_rcv+0x276/0x330
[<0000000007dbfa7a>] __netif_receive_skb_list_core+0x4d2/0x500
[<0000000062d5f6d2>] netif_receive_skb_list_internal+0x4cb/0x7d0
[<000000002baa2b74>] gro_normal_list+0x55/0xc0
[<0000000093d04885>] napi_complete_done+0xea/0x350
[<00000000467dd088>] tg3_poll_msix+0x174/0x310 [tg3]
[<00000000498af7d9>] net_rx_action+0x278/0x890
[<000000001e81d7e6>] __do_softirq+0xd9/0x589
[<00000000087ee354>] irq_exit+0xa2/0xc0
[<000000001c4db0cd>] do_IRQ+0x87/0x180
unreferenced object 0xffff96ffe4218700 (size 176):
comm "softirq", pid 0, jiffies 4295173865 (age 32012.360s)
hex dump (first 32 bytes):
00 d0 a5 74 04 97 ff ff 40 72 1a 96 ff ff ff ff ...t....@r......
c1 a3 c5 95 ff ff ff ff 00 00 00 00 00 00 00 00 ................
backtrace:
[<0000000030483fae>] kmem_cache_alloc+0x184/0x430
[<000000007ae17545>] dst_alloc+0x8e/0x128
[<000000001efe9a1f>] rt_dst_alloc+0x6f/0x1e0
[<00000000e67d4dac>] ip_route_input_rcu+0xdfe/0x1640
[<000000009f30cbc0>] ip_route_input_noref+0xab/0x1a0
[<000000004f53bd04>] arp_process+0x83a/0xf50
[<0000000061fd547d>] arp_rcv+0x276/0x330
[<0000000007dbfa7a>] __netif_receive_skb_list_core+0x4d2/0x500
[<0000000062d5f6d2>] netif_receive_skb_list_internal+0x4cb/0x7d0
[<000000002baa2b74>] gro_normal_list+0x55/0xc0
[<0000000093d04885>] napi_complete_done+0xea/0x350
[<00000000467dd088>] tg3_poll_msix+0x174/0x310 [tg3]
[<00000000498af7d9>] net_rx_action+0x278/0x890
[<000000001e81d7e6>] __do_softirq+0xd9/0x589
[<00000000087ee354>] irq_exit+0xa2/0xc0
[<000000001c4db0cd>] do_IRQ+0x87/0x180

2020-04-30 15:49:09

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_barrier() + membarrier() + scheduler deadlock?

On Thu, Apr 30, 2020 at 09:28:32AM -0400, Qian Cai wrote:
> > On Apr 30, 2020, at 9:15 AM, Paul E. McKenney <[email protected]> wrote:
> >> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
> >>
> >> [53294.651754][T149877] futex_wake_op: trinity-c25 tries to shift op by -17; fix this program
> >> [53323.947396][T150988] futex_wake_op: trinity-c6 tries to shift op by -5; fix this program
> >> [53458.295837][ T215] INFO: task kworker/u64:0:8 blocked for more than 122 seconds.
> >> [53458.304063][ T215] Tainted: G O L 5.7.0-rc3-next-20200429 #1
> >> [53458.311568][ T215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> [53458.320190][ T215] kworker/u64:0 D10584 8 2 0x90004000
> >> [53458.326668][ T215] Workqueue: netns cleanup_net
> >> [53458.331330][ T215] Call Trace:
> >> [53458.334510][ T215] __schedule+0x47b/0xa50
> >> [53458.338765][ T215] ? wait_for_completion+0x80/0x120
> >> [53458.343920][ T215] schedule+0x59/0xd0
> >> [53458.348013][ T215] schedule_timeout+0x10a/0x150
> >> [53458.352762][ T215] ? wait_for_completion+0x80/0x120
> >> [53458.357881][ T215] ? _raw_spin_unlock_irq+0x30/0x40
> >> [53458.362997][ T215] ? trace_hardirqs_on+0x22/0x100
> >> [53458.367948][ T215] ? wait_for_completion+0x80/0x120
> >> [53458.373195][ T215] wait_for_completion+0xb4/0x120
> >> [53458.378149][ T215] __flush_work+0x3ff/0x6e0
> >> [53458.382586][ T215] ? init_pwq+0x210/0x210
> >> [53458.386840][ T215] flush_work+0x20/0x30
> >> [53458.390891][ T215] rollback_registered_many+0x3d6/0x950
> >> [53458.396438][ T215] ? mark_held_locks+0x4e/0x80
> >> [53458.401339][ T215] unregister_netdevice_many+0x5d/0x200
> >> [53458.406816][ T215] default_device_exit_batch+0x213/0x240
> >> [53458.412348][ T215] ? do_wait_intr_irq+0xe0/0xe0
> >> [53458.417225][ T215] ? dev_change_net_namespace+0x6d0/0x6d0
> >> [53458.423000][ T215] ops_exit_list+0xa2/0xc0
> >> [53458.427367][ T215] cleanup_net+0x3d0/0x600
> >> [53458.431778][ T215] process_one_work+0x560/0xba0
> >> [53458.436629][ T215] worker_thread+0x80/0x5f0
> >> [53458.441078][ T215] ? process_scheduled_works+0x90/0x90
> >> [53458.446485][ T215] kthread+0x1de/0x200
> >> [53458.450600][ T215] ? kthread_unpark+0xd0/0xd0
> >> [53458.455227][ T215] ret_from_fork+0x27/0x50
> >> [53458.460332][ T215] INFO: task trinity-c17:150651 blocked for more than 123 seconds.
> >> [53458.468180][ T215] Tainted: G O L 5.7.0-rc3-next-20200429 #1
> >> [53458.475924][ T215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> [53458.484511][ T215] trinity-c17 D12312 150651 93301 0x10000004
> >> [53458.490862][ T215] Call Trace:
> >> [53458.494039][ T215] __schedule+0x47b/0xa50
> >> [53458.498452][ T215] schedule+0x59/0xd0
> >> [53458.502376][ T215] schedule_preempt_disabled+0x15/0x20
> >> [53458.507809][ T215] __mutex_lock+0x6f2/0xbf0
> >> [53458.512296][ T215] ? rtnl_lock+0x20/0x30
> >> [53458.516484][ T215] mutex_lock_nested+0x31/0x40
> >> [53458.521157][ T215] ? mutex_lock_nested+0x31/0x40
> >> [53458.526195][ T215] rtnl_lock+0x20/0x30
> >> [53458.530251][ T215] do_ip_setsockopt.isra.12+0xec/0x1b90
> >> [53458.535875][ T215] ? find_held_lock+0x35/0xa0
> >> [53458.540603][ T215] ? rb_insert_color+0x10f/0x390
> >> [53458.545436][ T215] ? lock_acquire+0xcd/0x450
> >> [53458.550126][ T215] ? find_held_lock+0x35/0xa0
> >> [53458.554717][ T215] ? __cgroup_bpf_prog_array_is_empty+0x121/0x230
> >> [53458.561127][ T215] ip_setsockopt+0x3e/0x90
> >> [53458.565511][ T215] udp_setsockopt+0x49/0x80
> >> [53458.570059][ T215] sock_common_setsockopt+0x6d/0x90
> >> [53458.575303][ T215] __sys_setsockopt+0x194/0x2e0
> >> [53458.580128][ T215] __x64_sys_setsockopt+0x70/0x90
> >> [53458.585066][ T215] do_syscall_64+0x91/0xb10
> >> [53458.589504][ T215] ? perf_call_bpf_enter+0x120/0x120
> >> [53458.594741][ T215] ? trace_hardirqs_off_thunk+0x1a/0x1c
> >> [53458.600467][ T215] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> >> [53458.606296][ T215] RIP: 0033:0x7f2cbe1b270d
> >> [53458.610611][ T215] Code: Bad RIP value.
> >> [53458.614570][ T215] RSP: 002b:00007ffe6a4b18d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
> >> [53458.623094][ T215] RAX: ffffffffffffffda RBX: 0000000000000036 RCX: 00007f2cbe1b270d
> >> [53458.631024][ T215] RDX: 000000000000002a RSI: 0000000000000000 RDI: 0000000000000060
> >> [53458.638954][ T215] RBP: 0000000000000036 R08: 0000000000000088 R09: 00000000000000f0
> >> [53458.647044][ T215] R10: 00000000025557f0 R11: 0000000000000246 R12: 0000000000000002
> >> [53458.654960][ T215] R13: 00007f2cbcaeb058 R14: 00007f2cbe0716c0 R15: 00007f2cbcaeb000
> >> [53458.662899][ T215] INFO: task trinity-c10:150896 blocked for more than 123 seconds.
> >> [53458.670757][ T215] Tainted: G O L 5.7.0-rc3-next-20200429 #1
> >> [53458.678380][ T215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> [53458.687007][ T215] trinity-c10 D11512 150896 93301 0x10000004
> >> [53458.693334][ T215] Call Trace:
> >> [53458.696545][ T215] __schedule+0x47b/0xa50
> >> [53458.700944][ T215] schedule+0x59/0xd0
> >> [53458.704817][ T215] schedule_preempt_disabled+0x15/0x20
> >> [53458.710246][ T215] __mutex_lock+0x6f2/0xbf0
> >> [53458.714686][ T215] ? rtnl_lock+0x20/0x30
> >> [53458.718851][ T215] mutex_lock_nested+0x31/0x40
> >> [53458.723649][ T215] ? mutex_lock_nested+0x31/0x40
> >> [53458.728556][ T215] rtnl_lock+0x20/0x30
> >> [53458.732520][ T215] do_ip_setsockopt.isra.12+0xec/0x1b90
> >> [53458.737998][ T215] ? find_held_lock+0x35/0xa0
> >> [53458.742646][ T215] ? rb_insert_color+0x10f/0x390
> >> [53458.747672][ T215] ? lock_acquire+0xcd/0x450
> >> [53458.752236][ T215] ? find_held_lock+0x35/0xa0
> >> [53458.756847][ T215] ? __cgroup_bpf_prog_array_is_empty+0x121/0x230
> >> [53458.763169][ T215] ip_setsockopt+0x3e/0x90
> >> [53458.767509][ T215] udp_setsockopt+0x49/0x80
> >> [53458.772061][ T215] sock_common_setsockopt+0x6d/0x90
> >> [53458.777192][ T215] __sys_setsockopt+0x194/0x2e0
> >> [53458.781976][ T215] __x64_sys_setsockopt+0x70/0x90
> >> [53458.786994][ T215] do_syscall_64+0x91/0xb10
> >> [53458.791460][ T215] ? perf_call_bpf_enter+0x120/0x120
> >> [53458.797008][ T215] ? trace_hardirqs_off_thunk+0x1a/0x1c
> >> [53458.802523][ T215] entry_SYSCALL_64_after_hwframe+0x49/0xb3
> >> [53458.808368][ T215] RIP: 0033:0x7f2cbe1b270d
> >> [53458.812678][ T215] Code: Bad RIP value.
> >> [53458.816669][ T215] RSP: 002b:00007ffe6a4b18d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
> >> [53458.825143][ T215] RAX: ffffffffffffffda RBX: 0000000000000036 RCX: 00007f2cbe1b270d
> >> [53458.833070][ T215] RDX: 000000000000002c RSI: 0000000000000000 RDI: 000000000000005c
> >> [53458.840997][ T215] RBP: 0000000000000036 R08: 0000000000000108 R09: 00000000b5b5b5b5
> >> [53458.849155][ T215] R10: 000000000255b5a0 R11: 0000000000000246 R12: 0000000000000002
> >> [53458.857072][ T215] R13: 00007f2cbcb1c058 R14: 00007f2cbe0716c0 R15: 00007f2cbcb1c000
> >> [53458.865076][ T215]
> >> [53458.865076][ T215] Showing all locks held in the system:
> >> [53458.872936][ T215] 5 locks held by kworker/u64:0/8:
> >> [53458.878066][ T215] #0: ffff95b4734e4538 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work+0x454/0xba0
> >> [53458.888191][ T215] #1: ffffa459431abe18 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work+0x454/0xba0
> >> [53458.898177][ T215] #2: ffffffff955917b0 (pernet_ops_rwsem){++++}-{3:3}, at: cleanup_net+0x6d/0x600
> >> [53458.907469][ T215] #3: ffffffff955975a8 (rtnl_mutexl_mutex){+.+.}-{3:3}, at: rtnl_lock_killable+0x21/0x30
> >> [53459.304598][ T215] 1 lock held by trinity-c17/150651:
> >> [53459.309827][ T215] #0: ffffffff955975a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock+0x20/0x30
> >> [53459.318343][ T215] 2 locks held by trinity-c11/150712:
> >> [53459.323744][ T215] #0: ffffffff955917b0 (pernet_ops_rwsem){++++}-{3:3}, at: copy_net_ns+0x15c/0x26e
> >> [53459.333092][ T215] #1: ffffffff955975a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock_killable+0x21/0x30
> >> [53459.342364][ T215] 2 locks held by trinity-c7/150739:
> >> [53459.347768][ T215] #0: ffffffff955917b0 (pernet_ops_rwsem){++++}-{3:3}, at: copy_net_ns+0x15c/0x26e
> >> [53459.357179][ T215] #1: ffffffff955975a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock_killable+0x21/0x30
> >> [53459.366479][ T215] 2 locks held by trinity-c15/150758:
> >> [53459.372006][ T215] #0: ffffffff955917b0 (pernet_ops_rwsem){++++}-{3:3}, at: copy_net_ns+0x15c/0x26e
> >> [53459.381398][ T215] #1: ffffffff955975a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock_killable+0x21/0x30
> >
> > If you have lots of threads doing networking configuration changes,
> > then contention on RTNL (as evidenced above) is expected behavior.
>
> Right, I’ll remove the CAP_NET_ADMIN from the fuzzer container to see if it helps. I thought it was no biggie because it was inside a net namespace. I’ll also reduce the number of fuzzer child process (one per active CPU right now) to see if it could get a bit clear picture. At least, it triggers quite a few memory leak over there, so it could be some error handing is not robust enough. I’ll dig a bit.

Sounds like a plan!

On the leak below, I would suggest reaching out to the networking guys,
except that it looks like you already have. ;-)

Thanx, Paul

> unreferenced object 0xffff97008e4c4040 (size 176):
> comm "softirq", pid 0, jiffies 4295173845 (age 32012.550s)
> hex dump (first 32 bytes):
> 00 d0 a5 74 04 97 ff ff 40 72 1a 96 ff ff ff ff ...t....@r......
> c1 a3 c5 95 ff ff ff ff 00 00 00 00 00 00 00 00 ................
> backtrace:
> [<0000000030483fae>] kmem_cache_alloc+0x184/0x430
> [<000000007ae17545>] dst_alloc+0x8e/0x128
> [<000000001efe9a1f>] rt_dst_alloc+0x6f/0x1e0
> rt_dst_alloc at net/ipv4/route.c:1628
> [<00000000e67d4dac>] ip_route_input_rcu+0xdfe/0x1640
> ip_route_input_slow at net/ipv4/route.c:2218
> (inlined by) ip_route_input_rcu at net/ipv4/route.c:2348
> [<000000009f30cbc0>] ip_route_input_noref+0xab/0x1a0
> [<000000004f53bd04>] arp_process+0x83a/0xf50
> arp_process at net/ipv4/arp.c:813 (discriminator 1)
> [<0000000061fd547d>] arp_rcv+0x276/0x330
> [<0000000007dbfa7a>] __netif_receive_skb_list_core+0x4d2/0x500
> [<0000000062d5f6d2>] netif_receive_skb_list_internal+0x4cb/0x7d0
> [<000000002baa2b74>] gro_normal_list+0x55/0xc0
> [<0000000093d04885>] napi_complete_done+0xea/0x350
> [<00000000467dd088>] tg3_poll_msix+0x174/0x310 [tg3]
> [<00000000498af7d9>] net_rx_action+0x278/0x890
> [<000000001e81d7e6>] __do_softirq+0xd9/0x589
> [<00000000087ee354>] irq_exit+0xa2/0xc0
> [<000000001c4db0cd>] do_IRQ+0x87/0x180
> unreferenced object 0xffff96ffe4218700 (size 176):
> comm "softirq", pid 0, jiffies 4295173865 (age 32012.360s)
> hex dump (first 32 bytes):
> 00 d0 a5 74 04 97 ff ff 40 72 1a 96 ff ff ff ff ...t....@r......
> c1 a3 c5 95 ff ff ff ff 00 00 00 00 00 00 00 00 ................
> backtrace:
> [<0000000030483fae>] kmem_cache_alloc+0x184/0x430
> [<000000007ae17545>] dst_alloc+0x8e/0x128
> [<000000001efe9a1f>] rt_dst_alloc+0x6f/0x1e0
> [<00000000e67d4dac>] ip_route_input_rcu+0xdfe/0x1640
> [<000000009f30cbc0>] ip_route_input_noref+0xab/0x1a0
> [<000000004f53bd04>] arp_process+0x83a/0xf50
> [<0000000061fd547d>] arp_rcv+0x276/0x330
> [<0000000007dbfa7a>] __netif_receive_skb_list_core+0x4d2/0x500
> [<0000000062d5f6d2>] netif_receive_skb_list_internal+0x4cb/0x7d0
> [<000000002baa2b74>] gro_normal_list+0x55/0xc0
> [<0000000093d04885>] napi_complete_done+0xea/0x350
> [<00000000467dd088>] tg3_poll_msix+0x174/0x310 [tg3]
> [<00000000498af7d9>] net_rx_action+0x278/0x890
> [<000000001e81d7e6>] __do_softirq+0xd9/0x589
> [<00000000087ee354>] irq_exit+0xa2/0xc0
> [<000000001c4db0cd>] do_IRQ+0x87/0x180