2015-08-25 08:00:14

by Wanpeng Li

Subject: [PATCH] sched: fix tsk->pi_lock isn't held when do_set_cpus_allowed()

[ 15.273708] ------------[ cut here ]------------
[ 15.274097] WARNING: CPU: 0 PID: 13 at kernel/sched/core.c:1156 do_set_cpus_allowed+0x7e/0x80()
[ 15.274857] Modules linked in:
[ 15.275101] CPU: 0 PID: 13 Comm: migration/0 Not tainted 4.2.0-rc1-00049-g25834c7 #2
[ 15.275674] 00000000 00000000 d21f1d24 c19228b2 00000000 d21f1d58 c1056a3b c1ba00e4
[ 15.276084] 00000000 0000000d c1ba17d8 00000484 c10838be 00000484 c10838be d21e5000
[ 15.276084] d2121900 d21e5158 d21f1d68 c1056b12 00000009 00000000 d21f1d7c c10838be
[ 15.276084] Call Trace:
[ 15.276084] [<c19228b2>] dump_stack+0x4b/0x75
[ 15.276084] [<c1056a3b>] warn_slowpath_common+0x8b/0xc0
[ 15.276084] [<c10838be>] ? do_set_cpus_allowed+0x7e/0x80
[ 15.276084] [<c10838be>] ? do_set_cpus_allowed+0x7e/0x80
[ 15.276084] [<c1056b12>] warn_slowpath_null+0x22/0x30
[ 15.276084] [<c10838be>] do_set_cpus_allowed+0x7e/0x80
[ 15.276084] [<c110154c>] cpuset_cpus_allowed_fallback+0x7c/0x170
[ 15.276084] [<c11014d0>] ? cpuset_cpus_allowed+0x180/0x180
[ 15.276084] [<c1083ae1>] select_fallback_rq+0x221/0x280
[ 15.276084] [<c1085073>] migration_call+0xe3/0x250
[ 15.276084] [<c1079e23>] notifier_call_chain+0x53/0x70
[ 15.276084] [<c1079e5e>] __raw_notifier_call_chain+0x1e/0x30
[ 15.276084] [<c1056cc8>] cpu_notify+0x28/0x50
[ 15.276084] [<c191e4d2>] take_cpu_down+0x22/0x40
[ 15.276084] [<c1102895>] multi_cpu_stop+0xd5/0x140
[ 15.276084] [<c11027c0>] ? __stop_cpus+0x80/0x80
[ 15.276084] [<c11025cc>] cpu_stopper_thread+0xbc/0x170
[ 15.276084] [<c1085ec9>] ? preempt_count_sub+0x9/0x50
[ 15.276084] [<c192b6a7>] ? _raw_spin_unlock_irq+0x37/0x50
[ 15.276084] [<c192b655>] ? _raw_spin_unlock_irqrestore+0x55/0x70
[ 15.276084] [<c10a9074>] ? trace_hardirqs_on_caller+0x144/0x1e0
[ 15.276084] [<c11024a5>] ? cpu_stop_should_run+0x35/0x40
[ 15.276084] [<c1085ec9>] ? preempt_count_sub+0x9/0x50
[ 15.276084] [<c192b641>] ? _raw_spin_unlock_irqrestore+0x41/0x70
[ 15.276084] [<c107c944>] smpboot_thread_fn+0x174/0x2f0
[ 15.276084] [<c107c7d0>] ? sort_range+0x30/0x30
[ 15.276084] [<c1078934>] kthread+0xc4/0xe0
[ 15.276084] [<c192c041>] ret_from_kernel_thread+0x21/0x30
[ 15.276084] [<c1078870>] ? kthread_create_on_node+0x180/0x180
[ 15.276084] ---[ end trace 15f4c86d404693b0 ]---

After commit 25834c73f93 ("sched: Fix a race between __kthread_bind()
and sched_setaffinity()"), do_set_cpus_allowed() should be called with
p->pi_lock held; however, this is currently not the case in the cpuset path.

This patch fixes it by holding p->pi_lock in the cpuset path.

Signed-off-by: Wanpeng Li <[email protected]>
---
kernel/cpuset.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index e414ae9..605ed66 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2376,8 +2376,12 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)

void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
{
+ unsigned long flags;
+
rcu_read_lock();
+ raw_spin_lock_irqsave(&tsk->pi_lock, flags);
do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
+ raw_spin_lock_irqsave(&tsk->pi_lock, flags);
rcu_read_unlock();

/*
--
1.7.1


2015-08-25 08:14:53

by Leo Yan

Subject: Re: [PATCH] sched: fix tsk->pi_lock isn't held when do_set_cpus_allowed()

Hi Wanpeng,

On Tue, Aug 25, 2015 at 03:59:54PM +0800, Wanpeng Li wrote:
> [...]
>
> After commit 25834c73f93 ("sched: Fix a race between __kthread_bind()
> and sched_setaffinity()"), do_set_cpus_allowed() should be called with
> p->pi_lock held; however, this is currently not the case in the cpuset path.
>
> This patch fixes it by holding p->pi_lock in the cpuset path.
>
> Signed-off-by: Wanpeng Li <[email protected]>
> ---
> kernel/cpuset.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index e414ae9..605ed66 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -2376,8 +2376,12 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
>
> void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> {
> + unsigned long flags;
> +
> rcu_read_lock();
> + raw_spin_lock_irqsave(&tsk->pi_lock, flags);
> do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
> + raw_spin_lock_irqsave(&tsk->pi_lock, flags);

Just curious, will acquiring the lock twice introduce a deadlock? ;)

Thanks,
Leo Yan

> rcu_read_unlock();
>
> /*
> --
> 1.7.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-08-25 08:25:06

by Wanpeng Li

Subject: Re: [PATCH] sched: fix tsk->pi_lock isn't held when do_set_cpus_allowed()

On 8/25/15 4:13 PM, Leo Yan wrote:
> Hi Wanpeng,
>
> On Tue, Aug 25, 2015 at 03:59:54PM +0800, Wanpeng Li wrote:
>> [...]
>>
>> After commit 25834c73f93 ("sched: Fix a race between __kthread_bind()
>> and sched_setaffinity()"), do_set_cpus_allowed() should be called with
>> p->pi_lock held; however, this is currently not the case in the cpuset path.
>>
>> This patch fixes it by holding p->pi_lock in the cpuset path.
>>
>> Signed-off-by: Wanpeng Li <[email protected]>
>> ---
>> kernel/cpuset.c | 4 ++++
>> 1 files changed, 4 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
>> index e414ae9..605ed66 100644
>> --- a/kernel/cpuset.c
>> +++ b/kernel/cpuset.c
>> @@ -2376,8 +2376,12 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
>>
>> void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
>> {
>> + unsigned long flags;
>> +
>> rcu_read_lock();
>> + raw_spin_lock_irqsave(&tsk->pi_lock, flags);
>> do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
>> + raw_spin_lock_irqsave(&tsk->pi_lock, flags);
> Just curious, will acquiring the lock twice introduce a deadlock? ;)

Could you point out where it acquires the lock twice?

Regards,
Wanpeng Li

>
> Thanks,
> Leo Yan
>
>> rcu_read_unlock();
>>
>> /*
>> --
>> 1.7.1
>>

2015-08-25 08:31:06

by Ingo Molnar

Subject: Re: [PATCH] sched: fix tsk->pi_lock isn't held when do_set_cpus_allowed()


* Wanpeng Li <[email protected]> wrote:

> >>--- a/kernel/cpuset.c
> >>+++ b/kernel/cpuset.c
> >>@@ -2376,8 +2376,12 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
> >> void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> >> {
> >>+ unsigned long flags;
> >>+
> >> rcu_read_lock();
> >>+ raw_spin_lock_irqsave(&tsk->pi_lock, flags);
> >> do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
> >>+ raw_spin_lock_irqsave(&tsk->pi_lock, flags);
> >Just curious, will acquiring the lock twice introduce a deadlock? ;)
>
> Could you point out where it acquires the lock twice?

In the code you quote?

Thanks,

Ingo

2015-08-25 08:38:23

by Wanpeng Li

Subject: Re: [PATCH] sched: fix tsk->pi_lock isn't held when do_set_cpus_allowed()

On 8/25/15 4:30 PM, Ingo Molnar wrote:
> * Wanpeng Li <[email protected]> wrote:
>
>>>> --- a/kernel/cpuset.c
>>>> +++ b/kernel/cpuset.c
>>>> @@ -2376,8 +2376,12 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
>>>> void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
>>>> {
>>>> + unsigned long flags;
>>>> +
>>>> rcu_read_lock();
>>>> + raw_spin_lock_irqsave(&tsk->pi_lock, flags);
>>>> do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
>>>> + raw_spin_lock_irqsave(&tsk->pi_lock, flags);
>>> Just curious, will acquiring the lock twice introduce a deadlock? ;)
>> Could you point out where it acquires the lock twice?
> In the code you quote?

Shame on me, sorry for sending out the wrong version. I will fix it soon in v2.

Regards,
Wanpeng Li

>
> Thanks,
>
> Ingo

2015-08-25 10:05:33

by Peter Zijlstra

Subject: Re: [PATCH] sched: fix tsk->pi_lock isn't held when do_set_cpus_allowed()

On Tue, Aug 25, 2015 at 03:59:54PM +0800, Wanpeng Li wrote:
> +++ b/kernel/cpuset.c
> @@ -2376,8 +2376,12 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
>
> void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> {
> + unsigned long flags;
> +
> rcu_read_lock();
> + raw_spin_lock_irqsave(&tsk->pi_lock, flags);
> do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
> + raw_spin_lock_irqsave(&tsk->pi_lock, flags);
> rcu_read_unlock();

Aside from the double lock thing that was already pointed out, I think
this is wrong, because the select_task_rq() call can already have
pi_lock held.

Taking it again would result in a deadlock.

Consider for instance:

try_to_wake_up()
raw_spin_lock_irqsave(->pi_lock)
select_task_rq()
select_fallback_rq()
cpuset_cpus_allowed_fallback()
raw_spin_lock_irqsave(->pi_lock)


The problem is with the migration path and should be fixed there.

2015-08-25 10:10:41

by Peter Zijlstra

Subject: Re: [PATCH] sched: fix tsk->pi_lock isn't held when do_set_cpus_allowed()

On Tue, Aug 25, 2015 at 12:05:27PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 25, 2015 at 03:59:54PM +0800, Wanpeng Li wrote:
> > +++ b/kernel/cpuset.c
> > @@ -2376,8 +2376,12 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
> >
> > void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> > {
> > + unsigned long flags;
> > +
> > rcu_read_lock();
> > + raw_spin_lock_irqsave(&tsk->pi_lock, flags);
> > do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
> > + raw_spin_lock_irqsave(&tsk->pi_lock, flags);
> > rcu_read_unlock();
>
> Aside from the double lock thing that was already pointed out, I think
> this is wrong, because the select_task_rq() call can already have
> pi_lock held.
>
> Taking it again would result in a deadlock.
>
> Consider for instance:
>
> try_to_wake_up()
> raw_spin_lock_irqsave(->pi_lock)
> select_task_rq()
> select_fallback_rq()
> cpuset_cpus_allowed_fallback()
> raw_spin_lock_irqsave(->pi_lock)
>
>
> The problem is with the migration path and should be fixed there.

Another problem, migration_call() will have rq->lock held, so you're
proposing to acquire pi_lock while holding rq->lock, this is an
inversion from the regular nesting order.
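This ordering hazard is exactly what lockdep catches mechanically. A toy single-threaded order checker (an illustration, nothing kernel-specific) records which lock class was held while another was acquired, and flags the opposite order, since two CPUs taking the two orders concurrently can deadlock (ABBA):

```c
#include <stdbool.h>

/* seen[a][b] records that class a was held while class b was acquired.
 * Acquiring in the reverse of an order already seen is flagged. */
enum { PI_LOCK, RQ_LOCK, NCLASSES };

static bool seen[NCLASSES][NCLASSES];
static int held[NCLASSES], nheld;

static bool acquire(int class)
{
    bool inversion = false;
    int i;

    for (i = 0; i < nheld; i++) {
        if (seen[class][held[i]])   /* reverse order was seen before */
            inversion = true;
        seen[held[i]][class] = true;
    }
    held[nheld++] = class;
    return inversion;
}

static void release(void) { nheld--; }

/* Regular nesting order is pi_lock outside rq->lock; the proposed
 * cpuset change, reached from migration_call() with rq->lock held,
 * would take them the other way around. */
bool inversion_demo(void)
{
    bool flagged;

    acquire(PI_LOCK);               /* normal order: pi_lock ... */
    acquire(RQ_LOCK);               /* ... then rq->lock */
    release(); release();

    acquire(RQ_LOCK);               /* migration path: rq->lock first ... */
    flagged = acquire(PI_LOCK);     /* ... then pi_lock: inversion */
    release(); release();
    return flagged;
}
```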

2015-08-25 10:18:43

by Wanpeng Li

Subject: Re: [PATCH] sched: fix tsk->pi_lock isn't held when do_set_cpus_allowed()

On 8/25/15 6:05 PM, Peter Zijlstra wrote:
> On Tue, Aug 25, 2015 at 03:59:54PM +0800, Wanpeng Li wrote:
>> +++ b/kernel/cpuset.c
>> @@ -2376,8 +2376,12 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
>>
>> void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
>> {
>> + unsigned long flags;
>> +
>> rcu_read_lock();
>> + raw_spin_lock_irqsave(&tsk->pi_lock, flags);
>> do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
>> + raw_spin_lock_irqsave(&tsk->pi_lock, flags);
>> rcu_read_unlock();
> Aside from the double lock thing that was already pointed out, I think
> this is wrong, because the select_task_rq() call can already have
> pi_lock held.
>
> Taking it again would result in a deadlock.
>
> Consider for instance:
>
> try_to_wake_up()
> raw_spin_lock_irqsave(->pi_lock)
> select_task_rq()
> select_fallback_rq()
> cpuset_cpus_allowed_fallback()
> raw_spin_lock_irqsave(->pi_lock)
>
>
> The problem is with the migration path and should be fixed there.

Indeed, it should be fixed in the migration path. I will try to work it out
and post a patch. :)

Regards,
Wanpeng Li

2015-08-25 10:32:56

by Peter Zijlstra

Subject: Re: [PATCH] sched: fix tsk->pi_lock isn't held when do_set_cpus_allowed()

On Tue, Aug 25, 2015 at 12:10:32PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 25, 2015 at 12:05:27PM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 25, 2015 at 03:59:54PM +0800, Wanpeng Li wrote:
> > > +++ b/kernel/cpuset.c
> > > @@ -2376,8 +2376,12 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
> > >
> > > void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> > > {
> > > + unsigned long flags;
> > > +
> > > rcu_read_lock();
> > > + raw_spin_lock_irqsave(&tsk->pi_lock, flags);
> > > do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
> > > + raw_spin_lock_irqsave(&tsk->pi_lock, flags);
> > > rcu_read_unlock();
> >
> > Aside from the double lock thing that was already pointed out, I think
> > this is wrong, because the select_task_rq() call can already have
> > pi_lock held.
> >
> > Taking it again would result in a deadlock.
> >
> > Consider for instance:
> >
> > try_to_wake_up()
> > raw_spin_lock_irqsave(->pi_lock)
> > select_task_rq()
> > select_fallback_rq()
> > cpuset_cpus_allowed_fallback()
> > raw_spin_lock_irqsave(->pi_lock)
> >
> >
> > The problem is with the migration path and should be fixed there.
>
> Another problem, migration_call() will have rq->lock held, so you're
> proposing to acquire pi_lock while holding rq->lock, this is an
> inversion from the regular nesting order.
>

So possibly, maybe (I'm still too wrecked to say for sure), something
like this would work:

WARN_ON(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
(p->on_rq && lockdep_is_held(&rq->lock))));

Instead of those two separate lockdep asserts.

Please consider carefully.
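Spelled out as a plain predicate (a userspace sketch of the rule the assertion would encode, not the kernel code): changing the mask is considered safe when pi_lock is held, or when the task is on a runqueue and that runqueue's lock is held, which is the situation the migration path is in.

```c
#include <stdbool.h>

/* The relaxed do_set_cpus_allowed() precondition being proposed:
 * either p->pi_lock is held, or the task is on a runqueue and the
 * corresponding rq->lock is held. */
bool mask_change_ok(bool pi_lock_held, bool on_rq, bool rq_lock_held)
{
    return pi_lock_held || (on_rq && rq_lock_held);
}
```

Under the original assertion pair, the migration path (rq->lock held, pi_lock not held) warns; under this combined predicate it is accepted.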

2015-08-27 13:53:17

by T. Zhou

Subject: Re: [PATCH] sched: fix tsk->pi_lock isn't held when do_set_cpus_allowed()

Hi,

On Tue, Aug 25, 2015 at 03:59:54PM +0800, Wanpeng Li wrote:
> [ 15.274097] WARNING: CPU: 0 PID: 13 at kernel/sched/core.c:1156 do_set_cpus_allowed+0x7e/0x80()
> [...]
> [ 15.276084] [<c10838be>] do_set_cpus_allowed+0x7e/0x80
> [ 15.276084] [<c110154c>] cpuset_cpus_allowed_fallback+0x7c/0x170
> [ 15.276084] [<c1083ae1>] select_fallback_rq+0x221/0x280
> [ 15.276084] [<c1085073>] migration_call+0xe3/0x250
> [ 15.276084] [<c1056cc8>] cpu_notify+0x28/0x50
> [ 15.276084] [<c191e4d2>] take_cpu_down+0x22/0x40
> [...]
>

no experiment from me (hate myself and me), just guessing this path:

take_cpu_down()
cpu_notify(CPU_DYING)
migration_call()

in migration_call(), there is the CPU_DYING case. Add these:

raw_spin_lock_irqsave(&p->pi_lock, flags);
raw_spin_lock(&rq->lock);
...
raw_spin_unlock(&rq->lock);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);

no inversed order of p->pi_lock and rq->lock (from Peter's review)
no locking of p->pi_lock twice (from Peter's review)

and in do_set_cpus_allowed(), add the following and delete the existing asserts:

WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
p->on_rq && lockdep_is_held(&task_rq(p)->lock)));
(from Peter's suggestion)

like what is used in set_task_cpu()

maybe there is a better or right solution :)

thanks,
--
Tao

2015-08-27 22:18:39

by Peter Zijlstra

Subject: Re: [PATCH] sched: fix tsk->pi_lock isn't held when do_set_cpus_allowed()

On Tue, Aug 25, 2015 at 07:47:44PM +0800, Wanpeng Li wrote:
> On 8/25/15 6:32 PM, Peter Zijlstra wrote:

> >So possibly, maybe (I'm still too wrecked to say for sure), something
> >like this would work:
> >
> > WARN_ON(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
> > (p->on_rq && lockdep_is_held(&rq->lock))));
> >
> >Instead of those two separate lockdep asserts.
> >
> >Please consider carefully.

So the normal rules for changing task_struct::cpus_allowed are holding
both pi_lock and rq->lock, such that holding either stabilizes the mask.

This is so that wakeup can happen without rq->lock and load-balance
without pi_lock.

From this we already get the relaxation that we can omit acquiring
rq->lock if the task is not on the rq, because in that case
load-balancing will not apply to it.

** these are the rules currently tested in do_set_cpus_allowed() **

Now, since __set_cpus_allowed_ptr() uses task_rq_lock() which
unconditionally acquires both locks, we could get away with holding just
rq->lock when on_rq for modification because that'd still exclude
__set_cpus_allowed_ptr(), it would also work against
__kthread_bind_mask() because that assumes !on_rq.

That said, this is all somewhat fragile.

> Since commit 5e16bbc2f ("sched: Streamline the task migration locking a
> little"), pi_lock is no longer held in the migrate_tasks() path; actually,
> pi_lock was already not held when calling select_fallback_rq(), though it
> was held in __migrate_task() before that commit. Then commit 25834c73f93
> ("sched: Fix a race between __kthread_bind() and sched_setaffinity()")
> added a lockdep_assert_held() in do_set_cpus_allowed(), and the bug is
> triggered. How about something like below:
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5186,6 +5186,15 @@ static void migrate_tasks(struct rq *dead_rq)
> BUG_ON(!next);
> next->sched_class->put_prev_task(rq, next);
>
> + raw_spin_unlock(&rq->lock);
> + raw_spin_lock(&next->pi_lock);
> + raw_spin_lock(&rq->lock);
> + if (!(task_rq(next) == rq && task_on_rq_queued(next))) {
> + raw_spin_unlock(&rq->lock);
> + raw_spin_unlock(&next->pi_lock);
> + continue;
> + }

Yeah, that's quite disgusting.. also you'll trip over the lockdep_pin if
you were to actually run this.

Now, I don't think dropping rq->lock is quite as disastrous as it
usually is because !cpu_active at this point, which means load-balance
will not interfere, but that too is somewhat fragile.


So we end up with a choice of two fragile options.. let me ponder that a
wee bit more.

2015-08-28 01:49:25

by Wanpeng Li

Subject: Re: [PATCH] sched: fix tsk->pi_lock isn't held when do_set_cpus_allowed()

On 8/28/15 6:18 AM, Peter Zijlstra wrote:
> On Tue, Aug 25, 2015 at 07:47:44PM +0800, Wanpeng Li wrote:
>> On 8/25/15 6:32 PM, Peter Zijlstra wrote:
>>> So possibly, maybe (I'm still too wrecked to say for sure), something
>>> like this would work:
>>>
>>> WARN_ON(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
>>> (p->on_rq && lockdep_is_held(&rq->lock))));
>>>
>>> Instead of those two separate lockdep asserts.
>>>
>>> Please consider carefully.
> So the normal rules for changing task_struct::cpus_allowed are holding
> both pi_lock and rq->lock, such that holding either stabilizes the mask.
>
> This is so that wakeup can happen without rq->lock and load-balance
> without pi_lock.
>
> From this we already get the relaxation that we can omit acquiring
> rq->lock if the task is not on the rq, because in that case
> load-balancing will not apply to it.
>
> ** these are the rules currently tested in do_set_cpus_allowed() **
>
> Now, since __set_cpus_allowed_ptr() uses task_rq_lock() which
> unconditionally acquires both locks, we could get away with holding just
> rq->lock when on_rq for modification because that'd still exclude
> __set_cpus_allowed_ptr(), it would also work against
> __kthread_bind_mask() because that assumes !on_rq.
>
> That said, this is all somewhat fragile.
>
>> Since commit 5e16bbc2f ("sched: Streamline the task migration locking a
>> little"), pi_lock is no longer held in the migrate_tasks() path; actually,
>> pi_lock was already not held when calling select_fallback_rq(), though it
>> was held in __migrate_task() before that commit. Then commit 25834c73f93
>> ("sched: Fix a race between __kthread_bind() and sched_setaffinity()")
>> added a lockdep_assert_held() in do_set_cpus_allowed(), and the bug is
>> triggered. How about something like below:
>>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -5186,6 +5186,15 @@ static void migrate_tasks(struct rq *dead_rq)
>> BUG_ON(!next);
>> next->sched_class->put_prev_task(rq, next);
>>
>> + raw_spin_unlock(&rq->lock);
>> + raw_spin_lock(&next->pi_lock);
>> + raw_spin_lock(&rq->lock);
>> + if (!(task_rq(next) == rq && task_on_rq_queued(next))) {
>> + raw_spin_unlock(&rq->lock);
>> + raw_spin_unlock(&next->pi_lock);
>> + continue;
>> + }
> Yeah, that's quite disgusting.. also you'll trip over the lockdep_pin if
> you were to actually run this.

Indeed. I will handle lockdep_pin in this code if you choose the second
fragile option. :-)

Regards,
Wanpeng Li

>
> Now, I don't think dropping rq->lock is quite as disastrous as it
> usually is because !cpu_active at this point, which means load-balance
> will not interfere, but that too is somewhat fragile.
>
>
> So we end up with a choice of two fragile options.. let me ponder that a
> wee bit more.