2021-01-28 00:18:40

by Valentin Schneider

Subject: [RFC PATCH] sched/core: Fix premature p->migration_pending completion

Fiddling some more with a TLA+ model of set_cpus_allowed_ptr() & friends
unearthed one more outstanding issue. This doesn't even involve
migrate_disable(), but rather affinity changes and execution of the stopper
racing with each other.

My own interpretation of the (lengthy) TLA+ splat (note the potential for
errors at each level) is:

Initial conditions:
  victim.cpus_mask = {CPU0, CPU1}

CPU0                                    CPU1                                    CPU<don't care>

switch_to(victim)
                                                                                set_cpus_allowed(victim, {CPU1})
                                                                                  kick CPU0 migration_cpu_stop({.dest_cpu = CPU1})
switch_to(stopper/0)
                                        // e.g. CFS load balance
                                        move_queued_task(CPU0, victim, CPU1);
                                        switch_to(victim)
                                                                                set_cpus_allowed(victim, {CPU0});
                                                                                  task_rq_unlock();
migration_cpu_stop(dest_cpu=CPU1)
  task_rq(p) != rq && pending
    kick CPU1 migration_cpu_stop({.dest_cpu = CPU1})

                                        switch_to(stopper/1)
                                        migration_cpu_stop(dest_cpu=CPU1)
                                          task_rq(p) == rq && pending
                                          __migrate_task(dest_cpu) // no-op
                                          complete_all() <-- !!! affinity is {CPU0} !!!

I believe there are two issues there (see the condensed sketch below):
- retriggering of migration_cpu_stop() from within migration_cpu_stop()
  itself doesn't change arg.dest_cpu
- we'll issue a complete_all() in the task_rq(p) == rq path of
  migration_cpu_stop() even if the dest_cpu has been superseded by a
  further affinity change.
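
To make those two issues concrete, here is a condensed sketch of the relevant
migration_cpu_stop() paths as I read them right now. Locking, the
migrate_disable() handling and the early bail-outs are trimmed, so take it as
an illustration rather than the actual code:

static int migration_cpu_stop(void *data)
{
        struct migration_arg *arg = data;
        struct set_affinity_pending *pending = arg->pending;
        struct task_struct *p = arg->task;
        int dest_cpu = arg->dest_cpu;
        struct rq *rq = this_rq();
        bool complete = false;
        struct rq_flags rf;

        /* ... p->pi_lock and rq->lock held from here on ... */
        pending = p->migration_pending;

        if (task_rq(p) == rq) {
                if (pending) {
                        p->migration_pending = NULL;
                        /* Issue 2: completion armed no matter what dest_cpu is */
                        complete = true;
                }

                /* dest_cpu may have been superseded by a later affinity change */
                if (task_on_rq_queued(p))
                        rq = __migrate_task(rq, &rf, p, dest_cpu);
        } else if (dest_cpu < 0 || pending) {
                /* ... bail-out checks trimmed ... */

                /*
                 * Issue 1: the retrigger reuses pending->arg as-is, so a
                 * possibly stale .dest_cpu travels along to the next CPU.
                 */
                stop_one_cpu_nowait(task_cpu(p), migration_cpu_stop,
                                    &pending->arg, &pending->stop_work);
        }

        /* ... */
        if (complete)
                complete_all(&pending->done);

        return 0;
}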

Something similar could happen with NUMA's migrate_task_to(), and arguably
any other user of migration_cpu_stop() with a .dest_cpu >= 0.
Consider:

CPU0                                    CPUX

switch_to(victim)
                                        migrate_task_to(victim, CPU1)
                                          kick CPU0 migration_cpu_stop({.dest_cpu = CPU1})

                                        set_cpus_allowed(victim, {CPU42})
                                          task_rq_unlock();
switch_to(stopper/0)
migration_cpu_stop(dest_cpu=CPU1)
  task_rq(p) == rq && pending
  __migrate_task(dest_cpu)
  complete_all() <-- !!! affinity is {CPU42} !!!
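
For reference, migrate_task_to() (trimmed, as I read it) only validates the
target against the allowed mask *before* queueing the stopper work; nothing
re-checks it by the time the stopper actually runs:

int migrate_task_to(struct task_struct *p, int target_cpu)
{
        struct migration_arg arg = { p, target_cpu };
        int curr_cpu = task_cpu(p);

        if (curr_cpu == target_cpu)
                return 0;

        /* The target is checked against the allowed mask here... */
        if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
                return -EINVAL;

        /* ...but can be invalidated again before this stopper gets to run */
        return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
}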

Prevent such premature completions by ensuring the dest_cpu in
migration_cpu_stop() is in the task's allowed cpumask.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/core.c | 32 ++++++++++++++++++++------------
1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 06b449942adf..b57326b0a742 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1923,20 +1923,28 @@ static int migration_cpu_stop(void *data)
                         complete = true;
                 }

-                /* migrate_enable() -- we must not race against SCA */
-                if (dest_cpu < 0) {
-                        /*
-                         * When this was migrate_enable() but we no longer
-                         * have a @pending, a concurrent SCA 'fixed' things
-                         * and we should be valid again. Nothing to do.
-                         */
-                        if (!pending) {
-                                WARN_ON_ONCE(!cpumask_test_cpu(task_cpu(p), &p->cpus_mask));
-                                goto out;
-                        }
+                /*
+                 * When this was migrate_enable() but we no longer
+                 * have a @pending, a concurrent SCA 'fixed' things
+                 * and we should be valid again.
+                 *
+                 * This can also be a stopper invocation that was 'fixed' by an
+                 * earlier one.
+                 *
+                 * Nothing to do.
+                 */
+                if ((dest_cpu < 0 || dest_cpu == cpu_of(rq)) && !pending) {
+                        WARN_ON_ONCE(!cpumask_test_cpu(task_cpu(p), &p->cpus_mask));
+                        goto out;
+                }

+                /*
+                 * Catch any affinity change between the stop_cpu() call and us
+                 * getting here.
+                 * For migrate_enable(), we just want to pick an allowed one.
+                 */
+                if (dest_cpu < 0 || !cpumask_test_cpu(dest_cpu, &p->cpus_mask))
                         dest_cpu = cpumask_any_distribute(&p->cpus_mask);
-                }

                 if (task_on_rq_queued(p))
                         rq = __migrate_task(rq, &rf, p, dest_cpu);
--
2.27.0


2021-01-28 20:00:35

by Valentin Schneider

Subject: Re: [RFC PATCH] sched/core: Fix premature p->migration_pending completion

On 29/01/21 01:02, Tao Zhou wrote:
> Hi,
>
> On Wed, Jan 27, 2021 at 07:30:35PM +0000, Valentin Schneider wrote:
>
>> Fiddling some more with a TLA+ model of set_cpus_allowed_ptr() & friends
>> unearthed one more outstanding issue. This doesn't even involve
>> migrate_disable(), but rather affinity changes and execution of the stopper
>> racing with each other.
>>
>> My own interpretation of the (lengthy) TLA+ splat (note the potential for
>> errors at each level) is:
>>
>> Initial conditions:
>> victim.cpus_mask = {CPU0, CPU1}
>>
>> CPU0 CPU1 CPU<don't care>
>>
>> switch_to(victim)
>> set_cpus_allowed(victim, {CPU1})
>> kick CPU0 migration_cpu_stop({.dest_cpu = CPU1})
>> switch_to(stopper/0)
>> // e.g. CFS load balance
>> move_queued_task(CPU0, victim, CPU1);
> ^^^^^^^^^^^^^^^^
>
> Why is move_queued_task() not attach_task()/detach_task() for CFS load..
>

Heh, I expected that one; it is indeed detach_task()/attach_task() for CFS
LB. I didn't want to make this any longer than it needed to be, and since
move_queued_task() is a composition of detach_task(), attach_task() and the
rq lock dance, I figured it would get the point across.

This does raise an "interesting" point, though: ATM I think this issue cannot
actually involve move_queued_task(), since all current move_queued_task()
calls are issued either from a stopper or from set_cpus_allowed_ptr().

CFS' detach_task() + attach_task() could do it, though.
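
For completeness, the CFS LB path I have in mind is roughly the below
(trimmed); note that it moves a queued task around without ever looking at
p->migration_pending, which is what lets it pull the rug from under an
in-flight stopper:

/* kernel/sched/fair.c, trimmed for illustration */
static void detach_task(struct task_struct *p, struct lb_env *env)
{
        lockdep_assert_held(&env->src_rq->lock);

        /* Dequeue from the source rq and point the task at the destination */
        deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
        set_task_cpu(p, env->dst_cpu);
}

static void attach_task(struct rq *rq, struct task_struct *p)
{
        lockdep_assert_held(&rq->lock);

        BUG_ON(task_rq(p) != rq);
        /* Enqueue on the destination rq; no affinity/pending bookkeeping here */
        activate_task(rq, p, ENQUEUE_NOCLOCK);
        check_preempt_curr(rq, p, 0);
}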

>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 06b449942adf..b57326b0a742 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1923,20 +1923,28 @@ static int migration_cpu_stop(void *data)
>> complete = true;
>> }
>>
>> - /* migrate_enable() -- we must not race against SCA */
>> - if (dest_cpu < 0) {
>> - /*
>> - * When this was migrate_enable() but we no longer
>> - * have a @pending, a concurrent SCA 'fixed' things
>> - * and we should be valid again. Nothing to do.
>> - */
>> - if (!pending) {
>> - WARN_ON_ONCE(!cpumask_test_cpu(task_cpu(p), &p->cpus_mask));
>> - goto out;
>> - }
>> + /*
>> + * When this was migrate_enable() but we no longer
>> + * have a @pending, a concurrent SCA 'fixed' things
>> + * and we should be valid again.
>> + *
>> + * This can also be a stopper invocation that was 'fixed' by an
>> + * earlier one.
>> + *
>> + * Nothing to do.
>> + */
>> + if ((dest_cpu < 0 || dest_cpu == cpu_of(rq)) && !pending) {
>
> When the condition 'dest_cpu == cpu_of(rq)' is true, pending is not NULL.
> The condition may be like this:
>
> if ((dest_cpu < 0 && !pending) || dest_cpu == cpu_of(rq))
>
> We want to choose one CPU in the new (currently modified) cpus_mask and
> complete all.
>

Consider the execution of migration_cpu_stop() in the above trace with
migrate_task_to(). We do have:
- dest_cpu == cpu_of(rq)
- p->migration_pending

but we do *not* want to bail out at this condition, because we need to fix
up dest_cpu.
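
i.e. with the proposed change, that execution walks through the two new
checks like so (annotated copy of the patched hunk):

        /*
         * Here: dest_cpu == cpu_of(rq), but @pending is set, so this
         * bail-out is *not* taken...
         */
        if ((dest_cpu < 0 || dest_cpu == cpu_of(rq)) && !pending) {
                WARN_ON_ONCE(!cpumask_test_cpu(task_cpu(p), &p->cpus_mask));
                goto out;
        }

        /*
         * ...and we land here instead: dest_cpu is no longer in
         * p->cpus_mask, so it gets fixed up before the __migrate_task()
         * and complete_all() further down.
         */
        if (dest_cpu < 0 || !cpumask_test_cpu(dest_cpu, &p->cpus_mask))
                dest_cpu = cpumask_any_distribute(&p->cpus_mask);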

2021-02-03 17:28:43

by Qais Yousef

Subject: Re: [RFC PATCH] sched/core: Fix premature p->migration_pending completion

On 01/27/21 19:30, Valentin Schneider wrote:
> Fiddling some more with a TLA+ model of set_cpus_allowed_ptr() & friends
> unearthed one more outstanding issue. This doesn't even involve
> migrate_disable(), but rather affinity changes and execution of the stopper
> racing with each other.
>
> My own interpretation of the (lengthy) TLA+ splat (note the potential for
> errors at each level) is:
>
> Initial conditions:
> victim.cpus_mask = {CPU0, CPU1}
>
> CPU0 CPU1 CPU<don't care>
>
> switch_to(victim)
> set_cpus_allowed(victim, {CPU1})
> kick CPU0 migration_cpu_stop({.dest_cpu = CPU1})
> switch_to(stopper/0)
> // e.g. CFS load balance
> move_queued_task(CPU0, victim, CPU1);
> switch_to(victim)
> set_cpus_allowed(victim, {CPU0});
> task_rq_unlock();
> migration_cpu_stop(dest_cpu=CPU1)

This migration stop is due to set_cpus_allowed(victim, {CPU1}), right?

> task_rq(p) != rq && pending
> kick CPU1 migration_cpu_stop({.dest_cpu = CPU1})
>
> switch_to(stopper/1)
> migration_cpu_stop(dest_cpu=CPU1)

And this migration stop is due to set_cpus_allowed(victim, {CPU0}), right?

If I didn't miss something, then dest_cpu should be CPU0 too, not CPU1 and the
task should be moved back to CPU0 as expected?

Thanks

--
Qais Yousef

> task_rq(p) == rq && pending
> __migrate_task(dest_cpu) // no-op
> complete_all() <-- !!! affinity is {CPU0} !!!
>
> I believe there are two issues there:
> - retriggering of migration_cpu_stop() from within migration_cpu_stop()
> itself doesn't change arg.dest_cpu
> - we'll issue a complete_all() in the task_rq(p) == rq path of
> migration_cpu_stop() even if the dest_cpu has been superseded by a
> further affinity change.
>
> Something similar could happen with NUMA's migrate_task_to(), and arguably
> any other user of migration_cpu_stop() with a .dest_cpu >= 0.
> Consider:
>
> CPU0 CPUX
>
> switch_to(victim)
> migrate_task_to(victim, CPU1)
> kick CPU0 migration_cpu_stop({.dest_cpu = CPU1})
>
> set_cpus_allowed(victim, {CPU42})
> task_rq_unlock();
> switch_to(stopper/0)
> migration_cpu_stop(dest_cpu=CPU1)
> task_rq(p) == rq && pending
> __migrate_task(dest_cpu)
> complete_all() <-- !!! affinity is {CPU42} !!!
>
> Prevent such premature completions by ensuring the dest_cpu in
> migration_cpu_stop() is in the task's allowed cpumask.
>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> kernel/sched/core.c | 32 ++++++++++++++++++++------------
> 1 file changed, 20 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 06b449942adf..b57326b0a742 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1923,20 +1923,28 @@ static int migration_cpu_stop(void *data)
> complete = true;
> }
>
> - /* migrate_enable() -- we must not race against SCA */
> - if (dest_cpu < 0) {
> - /*
> - * When this was migrate_enable() but we no longer
> - * have a @pending, a concurrent SCA 'fixed' things
> - * and we should be valid again. Nothing to do.
> - */
> - if (!pending) {
> - WARN_ON_ONCE(!cpumask_test_cpu(task_cpu(p), &p->cpus_mask));
> - goto out;
> - }
> + /*
> + * When this was migrate_enable() but we no longer
> + * have a @pending, a concurrent SCA 'fixed' things
> + * and we should be valid again.
> + *
> + * This can also be a stopper invocation that was 'fixed' by an
> + * earlier one.
> + *
> + * Nothing to do.
> + */
> + if ((dest_cpu < 0 || dest_cpu == cpu_of(rq)) && !pending) {
> + WARN_ON_ONCE(!cpumask_test_cpu(task_cpu(p), &p->cpus_mask));
> + goto out;
> + }
>
> + /*
> + * Catch any affinity change between the stop_cpu() call and us
> + * getting here.
> + * For migrate_enable(), we just want to pick an allowed one.
> + */
> + if (dest_cpu < 0 || !cpumask_test_cpu(dest_cpu, &p->cpus_mask))
> dest_cpu = cpumask_any_distribute(&p->cpus_mask);
> - }
>
> if (task_on_rq_queued(p))
> rq = __migrate_task(rq, &rf, p, dest_cpu);
> --
> 2.27.0
>

2021-02-03 19:03:45

by Valentin Schneider

Subject: Re: [RFC PATCH] sched/core: Fix premature p->migration_pending completion

On 03/02/21 17:23, Qais Yousef wrote:
> On 01/27/21 19:30, Valentin Schneider wrote:
>> Fiddling some more with a TLA+ model of set_cpus_allowed_ptr() & friends
>> unearthed one more outstanding issue. This doesn't even involve
>> migrate_disable(), but rather affinity changes and execution of the stopper
>> racing with each other.
>>
>> My own interpretation of the (lengthy) TLA+ splat (note the potential for
>> errors at each level) is:
>>
>> Initial conditions:
>> victim.cpus_mask = {CPU0, CPU1}
>>
>> CPU0 CPU1 CPU<don't care>
>>
>> switch_to(victim)
>> set_cpus_allowed(victim, {CPU1})
>> kick CPU0 migration_cpu_stop({.dest_cpu = CPU1})
>> switch_to(stopper/0)
>> // e.g. CFS load balance
>> move_queued_task(CPU0, victim, CPU1);
>> switch_to(victim)
>> set_cpus_allowed(victim, {CPU0});
>> task_rq_unlock();
>> migration_cpu_stop(dest_cpu=CPU1)
>
> This migration stop is due to set_cpus_allowed(victim, {CPU1}), right?
>

Right

>> task_rq(p) != rq && pending
>> kick CPU1 migration_cpu_stop({.dest_cpu = CPU1})
>>
>> switch_to(stopper/1)
>> migration_cpu_stop(dest_cpu=CPU1)
>
> And this migration stop is due to set_cpus_allowed(victim, {CPU0}), right?
>

Nein! This is a retriggering of the "current" stopper (the one triggered by
set_cpus_allowed(victim, {CPU1})); see the tail of the

  else if (dest_cpu < 0 || pending)

branch in migration_cpu_stop(). That's what I'm trying to hint at with the

  task_rq(p) != rq && pending

line.
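
To spell it out, the tail of that branch (trimmed) requeues the *same*
pending->arg, so the stopper kicked at CPU1 inherits .dest_cpu = CPU1 from the
first set_cpus_allowed(), even though the affinity has since become {CPU0}:

        } else if (dest_cpu < 0 || pending) {
                /* ... bail-out checks trimmed ... */

                /* pending->arg (and thus .dest_cpu) is reused untouched */
                stop_one_cpu_nowait(task_cpu(p), migration_cpu_stop,
                                    &pending->arg, &pending->stop_work);
        }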

> If I didn't miss something, then dest_cpu should be CPU0 too, not CPU1 and the
> task should be moved back to CPU0 as expected?
>
> Thanks
>
> --
> Qais Yousef

2021-02-04 15:36:14

by Qais Yousef

Subject: Re: [RFC PATCH] sched/core: Fix premature p->migration_pending completion

On 02/03/21 18:59, Valentin Schneider wrote:
> On 03/02/21 17:23, Qais Yousef wrote:
> > On 01/27/21 19:30, Valentin Schneider wrote:
> >> Fiddling some more with a TLA+ model of set_cpus_allowed_ptr() & friends
> >> unearthed one more outstanding issue. This doesn't even involve
> >> migrate_disable(), but rather affinity changes and execution of the stopper
> >> racing with each other.
> >>
> >> My own interpretation of the (lengthy) TLA+ splat (note the potential for
> >> errors at each level) is:
> >>
> >> Initial conditions:
> >> victim.cpus_mask = {CPU0, CPU1}
> >>
> >> CPU0 CPU1 CPU<don't care>
> >>
> >> switch_to(victim)
> >> set_cpus_allowed(victim, {CPU1})
> >> kick CPU0 migration_cpu_stop({.dest_cpu = CPU1})
> >> switch_to(stopper/0)
> >> // e.g. CFS load balance
> >> move_queued_task(CPU0, victim, CPU1);
> >> switch_to(victim)
> >> set_cpus_allowed(victim, {CPU0});
> >> task_rq_unlock();
> >> migration_cpu_stop(dest_cpu=CPU1)
> >
> > This migration stop is due to set_cpus_allowed(victim, {CPU1}), right?
> >
>
> Right
>
> >> task_rq(p) != rq && pending
> >> kick CPU1 migration_cpu_stop({.dest_cpu = CPU1})
> >>
> >> switch_to(stopper/1)
> >> migration_cpu_stop(dest_cpu=CPU1)
> >
> > And this migration stop is due to set_cpus_allowed(victim, {CPU0}), right?
> >
>
> Nein! This is a retriggering of the "current" stopper (triggered by
> set_cpus_allowed(victim, {CPU1})), see the tail of that
>
> else if (dest_cpu < 0 || pending)
>
> branch in migration_cpu_stop(), is what I'm trying to hint at with that
>
> task_rq(p) != rq && pending

Okay I see. But AFAIU, the work will be queued in order. So we should first
handle the set_cpus_allowed_ptr(victim, {CPU0}) before the retrigger, no?

So I see migration_cpu_stop() running 3 times

1. because of set_cpus_allowed(victim, {CPU1}) on CPU0
2. because of set_cpus_allowed(victim, {CPU0}) on CPU1
3. because of retrigger of '1' on CPU0

Thanks

--
Qais Yousef

2021-02-05 11:09:21

by Valentin Schneider

Subject: Re: [RFC PATCH] sched/core: Fix premature p->migration_pending completion

On 04/02/21 15:30, Qais Yousef wrote:
> On 02/03/21 18:59, Valentin Schneider wrote:
>> On 03/02/21 17:23, Qais Yousef wrote:
>> > On 01/27/21 19:30, Valentin Schneider wrote:
>> >> Initial conditions:
>> >> victim.cpus_mask = {CPU0, CPU1}
>> >>
>> >> CPU0 CPU1 CPU<don't care>
>> >>
>> >> switch_to(victim)
>> >> set_cpus_allowed(victim, {CPU1})
>> >> kick CPU0 migration_cpu_stop({.dest_cpu = CPU1})
>> >> switch_to(stopper/0)
>> >> // e.g. CFS load balance
>> >> move_queued_task(CPU0, victim, CPU1);
>> >> switch_to(victim)
>> >> set_cpus_allowed(victim, {CPU0});
>> >> task_rq_unlock();
>> >> migration_cpu_stop(dest_cpu=CPU1)
>> >
>> > This migration stop is due to set_cpus_allowed(victim, {CPU1}), right?
>> >
>>
>> Right
>>
>> >> task_rq(p) != rq && pending
>> >> kick CPU1 migration_cpu_stop({.dest_cpu = CPU1})
>> >>
>> >> switch_to(stopper/1)
>> >> migration_cpu_stop(dest_cpu=CPU1)
>> >
>> > And this migration stop is due to set_cpus_allowed(victim, {CPU0}), right?
>> >
>>
>> Nein! This is a retriggering of the "current" stopper (triggered by
>> set_cpus_allowed(victim, {CPU1})), see the tail of that
>>
>> else if (dest_cpu < 0 || pending)
>>
>> branch in migration_cpu_stop(), is what I'm trying to hint at with that
>>
>> task_rq(p) != rq && pending
>
> Okay I see. But AFAIU, the work will be queued in order. So we should first
> handle the set_cpus_allowed_ptr(victim, {CPU0}) before the retrigger, no?
>
> So I see migration_cpu_stop() running 3 times
>
> 1. because of set_cpus_allowed(victim, {CPU1}) on CPU0
> 2. because of set_cpus_allowed(victim, {CPU0}) on CPU1
> 3. because of retrigger of '1' on CPU0
>

On that 'CPU<don't care>' lane, I intentionally included task_rq_unlock()
but not 'kick CPU1 migration_cpu_stop({.dest_cpu = CPU0})'. IOW, there is
nothing in that trace that queues the stopper work for 2. - it *will* happen
at some point, but the harm will already have been done.

The migrate_task_to() example is potentially worse, because it doesn't rely
on which stopper work gets enqueued first - only that an extra affinity
change happens before the first stopper work grabs the pi_lock and completes.