From: Yanfei Xu <[email protected]>
rcu_node->lock isn't released in rcu_print_task_stall() if the rcu_node
doesn't contain any tasks blocking the GP. However, this rcu_node->lock
is acquired again shortly afterwards in rcu_dump_cpu_stacks() when
ndetected is non-zero. As a result, the CPU hangs in this deadlock.
Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
Signed-off-by: Yanfei Xu <[email protected]>
---
v1->v2:
1. Change the lock function to the unlock function.
2. Add a Fixes tag.
kernel/rcu/tree_stall.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index b72311d24a9f..b09a7140ef77 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -267,8 +267,10 @@ static int rcu_print_task_stall(struct rcu_node *rnp, unsigned long flags)
struct task_struct *ts[8];
lockdep_assert_irqs_disabled();
- if (!rcu_preempt_blocked_readers_cgp(rnp))
+ if (!rcu_preempt_blocked_readers_cgp(rnp)) {
+ raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
return 0;
+ }
pr_err("\tTasks blocked on level-%d rcu_node (CPUs %d-%d):",
rnp->level, rnp->grplo, rnp->grphi);
t = list_entry(rnp->gp_tasks->prev,
--
2.27.0
Hi Paul,
Should I merge this patch and the previous one into a single patch? If so,
please tell me and I will do it. :)
In addition, without these two patches, the bug leads to a "BUG: scheduling
while atomic:" splat. The rcu_node->lock acquired in the tick interrupt is
never released, which leaves preemption disabled (preempt_count stays
elevated).
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 0-11):
(detected by 3, t=6504 jiffies, g=34033, q=10745911)
rcu: All QSes seen, last rcu_preempt kthread activity 28 (4295088530-4295088502), jiffies_till_next_fqs=1, root ->qsmask 0x1
BUG: scheduling while atomic: msgstress04/90186/0x00000002
INFO: lockdep is turned off.
Modules linked in: sch_fq_codel
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffff80001004d57c>] copy_process+0x678/0x2790
softirqs last enabled at (0): [<ffff80001004d57c>] copy_process+0x678/0x2790
softirqs last disabled at (0): [<0000000000000000>] 0x0
Preemption disabled at:
[<ffff800010402744>] find_and_remove_object+0x34/0xd0
CPU: 3 PID: 90186 Comm: msgstress04 Kdump: loaded Not tainted 5.12.2-yoctodev-standard #1
Hardware name: Marvell OcteonTX CN96XX board (DT)
Call trace:
dump_backtrace+0x0/0x2cc
show_stack+0x24/0x30
dump_stack+0x110/0x188
__schedule_bug+0x100/0x114
__schedule+0xe5c/0xfd4
schedule+0x70/0x16c
do_notify_resume+0xe4/0x19d0
work_pending+0xc/0x2a8
Regards,
Yanfei
On 5/16/21 5:50 PM, [email protected] wrote:
> [ . . . ]
On Sun, May 16, 2021 at 05:50:10PM +0800, [email protected] wrote:
> From: Yanfei Xu <[email protected]>
>
> rcu_node->lock isn't released in rcu_print_task_stall() if the rcu_node
> doesn't contain any tasks blocking the GP. However, this rcu_node->lock
> is acquired again shortly afterwards in rcu_dump_cpu_stacks() when
> ndetected is non-zero. As a result, the CPU hangs in this deadlock.
>
> Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
> Signed-off-by: Yanfei Xu <[email protected]>
Also a good catch, thank you! Queued for further review and testing,
wordsmithed as shown below. The rcutorture scripts have been known to
work on ARM in the past, and might still do so. (I test on x86.)
As always, please check to make sure that I didn't mess something up.
Thanx, Paul
------------------------------------------------------------------------
commit e0a9b77f245ae4fe1537120fd5319bf9e091618e
Author: Yanfei Xu <[email protected]>
Date: Sun May 16 17:50:10 2021 +0800
rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock
If rcu_print_task_stall() is invoked on an rcu_node structure that does
not contain any tasks blocking the current grace period, it takes an
early exit that fails to release that rcu_node structure's lock. This
results in a self-deadlock, which is detected by lockdep.
To reproduce this bug:
tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
This will also result in other complaints, including RCU's scheduler
hook complaining about blocking rather than preemption and an rcutorture
writer stall.
Only a partial RCU CPU stall warning message will be printed because of
the self-deadlock.
This commit therefore releases the lock on the rcu_print_task_stall()
function's early exit path.
Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
Signed-off-by: Yanfei Xu <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index a10ea1f1f81f..d574e3bbd929 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -267,8 +267,10 @@ static int rcu_print_task_stall(struct rcu_node *rnp, unsigned long flags)
struct task_struct *ts[8];
lockdep_assert_irqs_disabled();
- if (!rcu_preempt_blocked_readers_cgp(rnp))
+ if (!rcu_preempt_blocked_readers_cgp(rnp)) {
+ raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
return 0;
+ }
pr_err("\tTasks blocked on level-%d rcu_node (CPUs %d-%d):",
rnp->level, rnp->grplo, rnp->grphi);
t = list_entry(rnp->gp_tasks->prev,
On 5/17/21 6:58 AM, Paul E. McKenney wrote:
> On Sun, May 16, 2021 at 05:50:10PM +0800, [email protected] wrote:
>> From: Yanfei Xu <[email protected]>
>>
>> rcu_node->lock isn't released in rcu_print_task_stall() if the rcu_node
>> doesn't contain any tasks blocking the GP. However, this rcu_node->lock
>> is acquired again shortly afterwards in rcu_dump_cpu_stacks() when
>> ndetected is non-zero. As a result, the CPU hangs in this deadlock.
>>
>> Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
>> Signed-off-by: Yanfei Xu <[email protected]>
>
> Also a good catch, thank you! Queued for further review and testing,
> wordsmithed as shown below. The rcutorture scripts have been known to
> work on ARM in the past, and might still do so. (I test on x86.)
>
> As always, please check to make sure that I didn't mess something up.
>
Looks good to me, thanks!
Regards,
Yanfei