Hi Frederic Weisbecker and kernel developers,
Greetings!
There is a task hung in "synchronize_rcu" in the v6.1-rc5 kernel.
I bisected the issue on Raptor Lake and on a server (no Atom small cores,
big cores only); both platforms' bisect results show that the
first bad commit is c597bfddc9e9e8a63817252b67c3ca0e544ace26:
"sched: Provide Kconfig support for default dynamic preempt mode"
[ 300.097166] INFO: task rcu_tasks_kthre:11 blocked for more than 147 seconds.
[ 300.097455] Not tainted 6.1.0-rc5-094226ad94f4 #1
[ 300.097641] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 300.097922] task:rcu_tasks_kthre state:D stack:0 pid:11 ppid:2 flags:0x00004000
[ 300.098230] Call Trace:
[ 300.098325] <TASK>
[ 300.098410] __schedule+0x2de/0x8f0
[ 300.098562] schedule+0x5b/0xe0
[ 300.098693] schedule_timeout+0x3f1/0x4b0
[ 300.098849] ? __sanitizer_cov_trace_pc+0x25/0x60
[ 300.099032] ? queue_delayed_work_on+0x82/0xc0
[ 300.099206] wait_for_completion+0x81/0x140
[ 300.099373] __synchronize_srcu.part.23+0x83/0xb0
[ 300.099558] ? __bpf_trace_rcu_stall_warning+0x20/0x20
[ 300.099757] synchronize_srcu+0xd6/0x100
[ 300.099913] rcu_tasks_postscan+0x19/0x20
[ 300.100070] rcu_tasks_wait_gp+0x108/0x290
[ 300.100230] ? _raw_spin_unlock+0x1d/0x40
[ 300.100389] rcu_tasks_one_gp+0x27f/0x370
[ 300.100546] ? rcu_tasks_postscan+0x20/0x20
[ 300.100709] rcu_tasks_kthread+0x37/0x50
[ 300.100863] kthread+0x14d/0x190
[ 300.100998] ? kthread_complete_and_exit+0x40/0x40
[ 300.101199] ret_from_fork+0x1f/0x30
[ 300.101347] </TASK>
[ 300.101457] INFO: task repro:568 blocked for more than 147 seconds.
[ 300.101688] Not tainted 6.1.0-rc5-094226ad94f4 #1
[ 300.101881] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 300.102162] task:repro state:D stack:0 pid:568 ppid:567 flags:0x00004004
[ 300.102471] Call Trace:
[ 300.102565] <TASK>
[ 300.102648] __schedule+0x2de/0x8f0
[ 300.102793] schedule+0x5b/0xe0
[ 300.102922] schedule_timeout+0x3f1/0x4b0
[ 300.103076] ? __schedule+0x2e6/0x8f0
[ 300.103224] ? __cond_resched+0x25/0x30
[ 300.103377] wait_for_completion+0x81/0x140
[ 300.103543] __wait_rcu_gp+0x121/0x150
[ 300.103691] synchronize_rcu_tasks_generic+0x52/0xa0
[ 300.103880] ? call_rcu_tasks_generic+0x230/0x230
[ 300.104059] ? __bpf_trace_rcu_stall_warning+0x20/0x20
[ 300.104253] synchronize_rcu_tasks+0x19/0x20
[ 300.104418] ftrace_shutdown+0x1cf/0x370
[ 300.104570] unregister_ftrace_function+0x33/0x210
[ 300.104750] ? __sanitizer_cov_trace_switch+0x57/0xa0
[ 300.104941] perf_ftrace_event_register+0x95/0xf0
[ 300.105140] ? perf_tp_event_init+0xa0/0xa0
[ 300.105324] perf_trace_destroy+0x38/0xa0
[ 300.105502] ? perf_tp_event_init+0xa0/0xa0
[ 300.105686] tp_perf_event_destroy+0x1e/0x30
[ 300.105875] _free_event+0xff/0x690
[ 300.106018] put_event+0x3c/0x50
[ 300.106152] perf_event_release_kernel+0x2f2/0x370
[ 300.106338] ? perf_event_release_kernel+0x9/0x370
[ 300.106536] ? perf_event_release_kernel+0x370/0x370
[ 300.106742] perf_release+0x22/0x30
[ 300.106883] __fput+0x10e/0x440
[ 300.107017] ____fput+0x1e/0x30
[ 300.107145] task_work_run+0xa6/0x100
[ 300.107291] exit_to_user_mode_prepare+0x173/0x180
[ 300.107480] syscall_exit_to_user_mode+0x1f/0x30
[ 300.107659] do_syscall_64+0x4a/0x90
[ 300.107803] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 300.108001] RIP: 0033:0x7f66e159f59d
[ 300.108139] RSP: 002b:00007fff47f04688 EFLAGS: 00000246 ORIG_RAX: 000000000000012a
[ 300.108419] RAX: ffffffffffffffea RBX: 0000000000000000 RCX: 00007f66e159f59d
[ 300.108679] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000200001c0
[ 300.108940] RBP: 00007fff47f04690 R08: 0000000000000000 R09: 00007fff47f046c0
[ 300.109226] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401180
[ 300.109516] R13: 00007fff47f047b0 R14: 0000000000000000 R15: 0000000000000000
[ 300.109809] </TASK>
The Kconfig, the reproducer code from syzkaller, and the bisect info from
Raptor Lake are attached.
More detailed info from RPL is at this link:
https://github.com/xupengfe/syzkaller_logs/tree/main/221115_105658_synchronize_rcu
If this is helpful and the issue gets fixed, please add a Reported-by tag from me.
Thanks!
BR.
On Mon, Nov 21, 2022 at 01:37:06PM +0800, Pengfei Xu wrote:
> Hi Frederic Weisbecker and kernel developers,
>
> Greetings!
> There is a task hung in "synchronize_rcu" in the v6.1-rc5 kernel.
>
> I bisected the issue on Raptor Lake and on a server (no Atom small cores,
> big cores only); both platforms' bisect results show that the
> first bad commit is c597bfddc9e9e8a63817252b67c3ca0e544ace26:
> "sched: Provide Kconfig support for default dynamic preempt mode"
>
> [ backtrace of rcu_tasks_kthre snipped; quoted in full above ]
Thanks for reporting this. Fortunately I managed to reproduce and debug it.
It took me a few days to understand the complicated circular dependency
involved.
So here is a summary:
1) TASK A calls unshare(CLONE_NEWPID); this creates a new PID namespace
that every subsequent child of TASK A will belong to. But TASK A doesn't
itself belong to that new PID namespace.
2) TASK A forks() and creates TASK B (a new thread group, so TASK B is a
thread group leader). TASK A stays attached to its PID namespace (let's say
PID_NS1) and TASK B is the first task belonging to the new PID namespace
created by unshare() (let's call it PID_NS2).
3) Since TASK B is the first task attached to PID_NS2, it becomes the PID_NS2
child reaper.
4) TASK A forks() again and creates TASK C, which gets attached to PID_NS2.
Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has
TASK B (belonging to PID_NS2) as its pid_namespace child_reaper.
5) TASK B exits and, since it is the child reaper for PID_NS2, it has to
kill all other tasks attached to PID_NS2 and wait for all of them to die
before reaping itself (zap_pid_ns_processes()). Note that it makes a
misleading assumption here, trusting that all tasks in PID_NS2 either
get reaped by a parent belonging to the same namespace or by TASK B
itself. It is confident that, since it deactivated its SIGCHLD handler,
all the remaining tasks will ultimately autoreap, and it waits for that
to happen. However, TASK C escapes that rule because it will get reaped
by its parent TASK A, which belongs to PID_NS1.
6) TASK A calls synchronize_rcu_tasks(), which leads to
synchronize_srcu(&tasks_rcu_exit_srcu).
7) TASK B is waiting for TASK C to get reaped (wrongly assuming it autoreaps).
But TASK B is inside a tasks_rcu_exit_srcu SRCU read-side critical section
(exit_notify() runs between exit_tasks_rcu_start() and
exit_tasks_rcu_finish()), blocking TASK A.
8) TASK C exits and, since TASK A is its parent, it waits for TASK A to reap
it. But TASK A can't, because TASK A waits for TASK B, which waits for TASK C.
So there is a circular dependency:
_ TASK A waits for TASK B to get out of tasks_rcu_exit_srcu SRCU critical
section
_ TASK B waits for TASK C to get reaped
_ TASK C waits for TASK A to reap it.
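To make the scenario concrete, here is a minimal userspace sketch of steps
1)-8) above. This is an illustration only, not the attached syzkaller
reproducer: the perf trigger is reduced to a placeholder comment, error
handling is omitted, and reliably winning the race between B's exit and C's
creation is not guaranteed.

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* 1) Children created from here on join a new PID namespace
	 *    (PID_NS2); TASK A itself stays in PID_NS1. */
	if (unshare(CLONE_NEWPID))
		exit(1);

	/* 2)-3) TASK B: first task in PID_NS2, hence its child reaper. */
	if (fork() == 0)
		exit(0);	/* 5) B exits and ends up waiting in
				 *    zap_pid_ns_processes() for every
				 *    other PID_NS2 task to be reaped */

	/* 4) TASK C: also in PID_NS2, but parented (and reaped) by A. */
	if (fork() == 0)
		exit(0);	/* 8) C stays a zombie until A reaps it */

	/* 6)-7) TASK A blocks in synchronize_rcu_tasks() before it gets
	 * a chance to reap C.  In the reported trace this is triggered
	 * by closing a perf tracepoint event, whose teardown goes
	 * through unregister_ftrace_function(). */
	/* close(perf_tracepoint_fd);  -- placeholder for that trigger */

	wait(NULL);	/* never reached: A is stuck above */
	wait(NULL);
	return 0;
}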
I have no idea how to solve this without violating the pid_namespace
rules and unshare() semantics (although I wish unshare(CLONE_NEWPID) had a less
error-prone behaviour than allowing more than one task to be created directly
into the same new namespace).
So probably having an SRCU read-side critical section within exit_notify() is
not a good idea. Is there a way to work around that for RCU Tasks?
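For reference, the placement in question looks roughly like this. This is a
heavily abridged sketch of the v6.1 exit path written from memory, not
verbatim kernel code:

/* kernel/rcu/tasks.h (sketch): the exiting task enters an SRCU read-side
 * section that rcu_tasks_postscan() -> synchronize_srcu() waits on. */
void exit_tasks_rcu_start(void)
{
	current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu);
}

/* kernel/exit.c (sketch, heavily abridged) */
void do_exit(long code)
{
	/* ... */
	exit_tasks_rcu_start();
	exit_notify(tsk, group_dead);	/* a pid_ns child reaper can reach
					 * zap_pid_ns_processes() from here
					 * and sleep indefinitely while the
					 * SRCU read lock is still held */
	/* ... */
	exit_tasks_rcu_finish();	/* __srcu_read_unlock() */
	do_task_dead();
}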
Thanks.
Hi Frederic Weisbecker,
On 2022-11-23 at 15:37:58 +0100, Frederic Weisbecker wrote:
> On Mon, Nov 21, 2022 at 01:37:06PM +0800, Pengfei Xu wrote:
> > Hi Frederic Weisbecker and kernel developers,
> >
> > Greetings!
> > There is a task hung in "synchronize_rcu" in the v6.1-rc5 kernel.
> >
> > I bisected the issue on Raptor Lake and on a server (no Atom small cores,
> > big cores only); both platforms' bisect results show that the
> > first bad commit is c597bfddc9e9e8a63817252b67c3ca0e544ace26:
> > "sched: Provide Kconfig support for default dynamic preempt mode"
> >
> > [ backtrace of rcu_tasks_kthre snipped; quoted in full above ]
>
> Thanks for reporting this. Fortunately I managed to reproduce and debug it.
> It took me a few days to understand the complicated circular dependency
> involved.
>
> [ detailed analysis of the circular dependency snipped; quoted in full above ]
>
> I have no idea how to solve this without violating the pid_namespace
> rules and unshare() semantics (although I wish unshare(CLONE_NEWPID) had a less
> error-prone behaviour than allowing more than one task to be created directly
> into the same new namespace).
>
> So probably having an SRCU read-side critical section within exit_notify() is
> not a good idea. Is there a way to work around that for RCU Tasks?
>
Thanks for the analysis!
One more piece of information: I tried to revert only this commit on top of
v6.1-rc5 mainline by script, but the kernel then failed to build, so I could
not confirm the bisect result is 100% accurate without passing the revert
verification step. I have provided all the information I could.
This issue is too difficult for me.
If I find more clues, I will update this thread.
Thanks!
BR.
On Wed, Nov 23, 2022 at 11:06:48PM +0100, Frederic Weisbecker wrote:
> On Wed, Nov 23, 2022 at 11:45:50PM +0800, Pengfei Xu wrote:
> > On 2022-11-23 at 15:37:58 +0100, Frederic Weisbecker wrote:
> > > I have no idea how to solve this without violating the pid_namespace
> > > rules and unshare() semantics (although I wish unshare(CLONE_NEWPID) had a less
> > > error-prone behaviour than allowing more than one task to be created directly
> > > into the same new namespace).
> > >
> > > So probably having an SRCU read-side critical section within exit_notify() is
> > > not a good idea. Is there a way to work around that for RCU Tasks?
> > >
> > Thanks for the analysis!
> > One more piece of information: I tried to revert only this commit on top of
> > v6.1-rc5 mainline by script, but the kernel then failed to build, so I could
> > not confirm the bisect result is 100% accurate without passing the revert
> > verification step. I have provided all the information I could.
>
> No problem, I managed to reproduce it with the latest upstream.
> I don't think the bisected commit is the culprit, though; it may just make
> the issue more likely to happen.
Frederic, Boqun, Neeraj, and I dug through this earlier today, and a
record of our wanderings may be found here:
https://docs.google.com/document/d/1hJxgiZ5TMZ4YJkdJPLAkRvq7sYQ-A7svgA8no6i-v8k/edit?usp=sharing
It looks like we can break the deadlock within RCU Tasks, but it also
looks like the namespace-PID semantics are at best an accident waiting
to happen. ;-)
Thanx, Paul
On Wed, Nov 23, 2022 at 11:45:50PM +0800, Pengfei Xu wrote:
> On 2022-11-23 at 15:37:58 +0100, Frederic Weisbecker wrote:
> > I have no idea how to solve this without violating the pid_namespace
> > rules and unshare() semantics (although I wish unshare(CLONE_NEWPID) had a less
> > error-prone behaviour than allowing more than one task to be created directly
> > into the same new namespace).
> >
> > So probably having an SRCU read-side critical section within exit_notify() is
> > not a good idea. Is there a way to work around that for RCU Tasks?
> >
> Thanks for the analysis!
> One more piece of information: I tried to revert only this commit on top of
> v6.1-rc5 mainline by script, but the kernel then failed to build, so I could
> not confirm the bisect result is 100% accurate without passing the revert
> verification step. I have provided all the information I could.
No problem, I managed to reproduce it with the latest upstream.
I don't think the bisected commit is the culprit, though; it may just make
the issue more likely to happen.
Thanks.