From: Peter Zijlstra <[email protected]>
Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective of which logical
CPU does the reschedule).
There could be races in the core scheduler where a CPU is trying to pick
a task for its sibling when that CPU has just been offlined. We should
not schedule any tasks on the CPU in this case. Return an idle task
from pick_next_task for this situation.
NOTE: there is still potential for sibling rivalry.
NOTE: this is far too complicated; but thus far I've failed to
simplify it further.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Julien Desfossez <[email protected]>
Signed-off-by: Vineeth Remanan Pillai <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/core.c | 263 ++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 6 +-
2 files changed, 267 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b21bcab20da6..f51e5c4798c8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4113,7 +4113,7 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
* Pick up the highest-prio task:
*/
static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
const struct sched_class *class;
struct task_struct *p;
@@ -4169,6 +4169,262 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
BUG();
}
+#ifdef CONFIG_SCHED_CORE
+
+static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
+{
+ return is_idle_task(a) || (a->core_cookie == cookie);
+}
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+ if (is_idle_task(a) || is_idle_task(b))
+ return true;
+
+ return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest priority task for this runqueue if it matches
+ * rq->core->core_cookie or its priority is greater than max.
+ * - Else returns idle_task.
+ */
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+{
+ struct task_struct *class_pick, *cookie_pick;
+ unsigned long cookie = rq->core->core_cookie;
+
+ class_pick = class->pick_task(rq);
+ if (!class_pick)
+ return NULL;
+
+ if (!cookie) {
+ /*
+ * If class_pick is tagged, return it only if it has
+ * higher priority than max.
+ */
+ if (max && class_pick->core_cookie &&
+ prio_less(class_pick, max))
+ return idle_sched_class.pick_task(rq);
+
+ return class_pick;
+ }
+
+ /*
+ * If class_pick is idle or matches cookie, return early.
+ */
+ if (cookie_equals(class_pick, cookie))
+ return class_pick;
+
+ cookie_pick = sched_core_find(rq, cookie);
+
+ /*
+ * If class > max && class > cookie, it is the highest priority task on
+ * the core (so far) and it must be selected, otherwise we must go with
+ * the cookie pick in order to satisfy the constraint.
+ */
+ if (prio_less(cookie_pick, class_pick) &&
+ (!max || prio_less(max, class_pick)))
+ return class_pick;
+
+ return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+ struct task_struct *next, *max = NULL;
+ const struct sched_class *class;
+ const struct cpumask *smt_mask;
+ int i, j, cpu;
+ bool need_sync;
+
+ if (!sched_core_enabled(rq))
+ return __pick_next_task(rq, prev, rf);
+
+ /*
+ * If there were no {en,de}queues since we picked (IOW, the task
+ * pointers are all still valid), and we haven't scheduled the last
+ * pick yet, do so now.
+ */
+ if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+ rq->core->core_pick_seq != rq->core_sched_seq) {
+ WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+ next = rq->core_pick;
+ if (next != prev) {
+ put_prev_task(rq, prev);
+ set_next_task(rq, next);
+ }
+ return next;
+ }
+
+ prev->sched_class->put_prev_task(rq, prev);
+ if (!rq->nr_running)
+ newidle_balance(rq, rf);
+
+ cpu = cpu_of(rq);
+ smt_mask = cpu_smt_mask(cpu);
+
+ /*
+ * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+ *
+ * @task_seq guards the task state ({en,de}queues)
+ * @pick_seq is the @task_seq we did a selection on
+ * @sched_seq is the @pick_seq we scheduled
+ *
+ * However, preemptions can cause multiple picks on the same task set.
+ * 'Fix' this by also increasing @task_seq for every pick.
+ */
+ rq->core->core_task_seq++;
+ need_sync = !!rq->core->core_cookie;
+
+ /* reset state */
+ rq->core->core_cookie = 0UL;
+ for_each_cpu(i, smt_mask) {
+ struct rq *rq_i = cpu_rq(i);
+
+ rq_i->core_pick = NULL;
+
+ if (rq_i->core_forceidle) {
+ need_sync = true;
+ rq_i->core_forceidle = false;
+ }
+
+ if (i != cpu)
+ update_rq_clock(rq_i);
+ }
+
+ /*
+	 * Try and select tasks for each sibling in descending sched_class
+ * order.
+ */
+ for_each_class(class) {
+again:
+ for_each_cpu_wrap(i, smt_mask, cpu) {
+ struct rq *rq_i = cpu_rq(i);
+ struct task_struct *p;
+
+ if (rq_i->core_pick)
+ continue;
+
+ /*
+ * If this sibling doesn't yet have a suitable task to
+			 * run; ask for the most eligible task, given the
+ * highest priority task already selected for this
+ * core.
+ */
+ p = pick_task(rq_i, class, max);
+ if (!p) {
+ /*
+				 * If there weren't any cookies; we don't need
+ * to bother with the other siblings.
+ */
+ if (i == cpu && !need_sync)
+ goto next_class;
+
+ continue;
+ }
+
+ /*
+ * Optimize the 'normal' case where there aren't any
+ * cookies and we don't need to sync up.
+ */
+ if (i == cpu && !need_sync && !p->core_cookie) {
+ next = p;
+ goto done;
+ }
+
+ rq_i->core_pick = p;
+
+ /*
+ * If this new candidate is of higher priority than the
+ * previous; and they're incompatible; we need to wipe
+ * the slate and start over. pick_task makes sure that
+ * p's priority is more than max if it doesn't match
+ * max's cookie.
+ *
+ * NOTE: this is a linear max-filter and is thus bounded
+ * in execution time.
+ */
+ if (!max || !cookie_match(max, p)) {
+ struct task_struct *old_max = max;
+
+ rq->core->core_cookie = p->core_cookie;
+ max = p;
+
+ if (old_max) {
+ for_each_cpu(j, smt_mask) {
+ if (j == i)
+ continue;
+
+ cpu_rq(j)->core_pick = NULL;
+ }
+ goto again;
+ } else {
+ /*
+ * Once we select a task for a cpu, we
+ * should not be doing an unconstrained
+ * pick because it might starve a task
+ * on a forced idle cpu.
+ */
+ need_sync = true;
+ }
+
+ }
+ }
+next_class:;
+ }
+
+ rq->core->core_pick_seq = rq->core->core_task_seq;
+ next = rq->core_pick;
+ rq->core_sched_seq = rq->core->core_pick_seq;
+
+ /*
+ * Reschedule siblings
+ *
+ * NOTE: L1TF -- at this point we're no longer running the old task and
+ * sending an IPI (below) ensures the sibling will no longer be running
+ * their task. This ensures there is no inter-sibling overlap between
+ * non-matching user state.
+ */
+ for_each_cpu(i, smt_mask) {
+ struct rq *rq_i = cpu_rq(i);
+
+ WARN_ON_ONCE(!rq_i->core_pick);
+
+ if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
+ rq_i->core_forceidle = true;
+
+ if (i == cpu)
+ continue;
+
+ if (rq_i->curr != rq_i->core_pick)
+ resched_curr(rq_i);
+
+ /* Did we break L1TF mitigation requirements? */
+ WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+ }
+
+done:
+ set_next_task(rq, next);
+ return next;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+ return __pick_next_task(rq, prev, rf);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
/*
* __schedule() is the main scheduler function.
*
@@ -6931,7 +7187,12 @@ void __init sched_init(void)
#ifdef CONFIG_SCHED_CORE
rq->core = NULL;
+ rq->core_pick = NULL;
rq->core_enabled = 0;
+ rq->core_tree = RB_ROOT;
+ rq->core_forceidle = false;
+
+ rq->core_cookie = 0UL;
#endif
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c85c5a4bc21f..4a738093d731 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1032,11 +1032,16 @@ struct rq {
#ifdef CONFIG_SCHED_CORE
/* per rq */
struct rq *core;
+ struct task_struct *core_pick;
unsigned int core_enabled;
+ unsigned int core_sched_seq;
struct rb_root core_tree;
+ unsigned char core_forceidle;
/* shared state */
unsigned int core_task_seq;
+ unsigned int core_pick_seq;
+ unsigned long core_cookie;
#endif
};
@@ -1905,7 +1910,6 @@ static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
static inline void set_next_task(struct rq *rq, struct task_struct *next)
{
- WARN_ON_ONCE(rq->curr != next);
next->sched_class->set_next_task(rq, next, false);
}
--
2.17.1
On Tue, Jun 30, 2020 at 09:32:27PM +0000, Vineeth Remanan Pillai wrote:
> From: Peter Zijlstra <[email protected]>
>
> Instead of only selecting a local task, select a task for all SMT
> siblings for every reschedule on the core (irrespective which logical
> CPU does the reschedule).
>
> There could be races in core scheduler where a CPU is trying to pick
> a task for its sibling in core scheduler, when that CPU has just been
> offlined. We should not schedule any tasks on the CPU in this case.
> Return an idle task in pick_next_task for this situation.
>
> NOTE: there is still potential for siblings rivalry.
> NOTE: this is far too complicated; but thus far I've failed to
> simplify it further.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Julien Desfossez <[email protected]>
> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
> Signed-off-by: Aaron Lu <[email protected]>
> Signed-off-by: Tim Chen <[email protected]>
Hi Peter, Tim, all, the patch below fixes the hotplug issue described in its
Link tag. The patch description below details the issues fixed, and the patch
applies on top of this one.
------8<----------
From: "Joel Fernandes (Google)" <[email protected]>
Subject: [PATCH] sched: Fix CPU hotplug causing crashes in task selection logic
The selection logic does not run correctly if the current CPU is not in the
cpu_smt_mask (which it is not because the CPU is offlined when the stopper
finishes running and needs to switch to idle). There are also other issues
fixed by the patch I think such as: if some other sibling set core_pick to
something, however the selection logic on current cpu resets it before
selecting. In this case, we need to run the task selection logic again to
make sure it picks something if there is something to run. It might end up
picking the wrong task. Yet another issue was, if the stopper thread is an
unconstrained pick, then rq->core_pick is set. The next time task selection
logic runs when stopper needs to switch to idle, the current CPU is not in
the smt_mask. This causes the previous ->core_pick to be picked again which
happens to be the unconstrained task! so the stopper keeps getting selected
forever.
That and there are a few more safe guards and checks around checking/setting
rq->core_pick. To test it, I ran rcutorture and made it tag all torture
threads. Then ran it in hotplug mode (hotplugging every 200ms) and it hit the
issue. Now it runs for an hour or so without issue. (Torture testing debug
changes: https://bit.ly/38htfqK ).
Various fixes were tried causing varying degrees of crashes. Finally I found
that it is easiest to just add current CPU to the smt_mask's copy always.
This is so that task selection logic always runs on the current CPU which
called schedule().
Link: lore.kernel.org/r/[email protected]
Reported-by: Tim Chen <[email protected]>
Reported-by: Vineeth Pillai <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/sched/core.c | 35 +++++++++++++++++++++++++++++------
1 file changed, 29 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ede86fb37b4e8..a5604aa292e66 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4307,7 +4307,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
struct task_struct *next, *max = NULL;
const struct sched_class *class;
- const struct cpumask *smt_mask;
+ struct cpumask select_mask;
int i, j, cpu, occ = 0;
bool need_sync;
@@ -4334,7 +4334,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
finish_prev_task(rq, prev, rf);
cpu = cpu_of(rq);
- smt_mask = cpu_smt_mask(cpu);
+ /* Make a copy of cpu_smt_mask as we should not set that. */
+ cpumask_copy(&select_mask, cpu_smt_mask(cpu));
+
+ /*
+ * Always make sure current CPU is added to smt_mask so that below
+ * selection logic runs on it.
+ */
+ cpumask_set_cpu(cpu, &select_mask);
/*
* core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
@@ -4351,7 +4358,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
/* reset state */
rq->core->core_cookie = 0UL;
- for_each_cpu(i, smt_mask) {
+ for_each_cpu(i, &select_mask) {
struct rq *rq_i = cpu_rq(i);
rq_i->core_pick = NULL;
@@ -4371,7 +4378,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*/
for_each_class(class) {
again:
- for_each_cpu_wrap(i, smt_mask, cpu) {
+ for_each_cpu_wrap(i, &select_mask, cpu) {
struct rq *rq_i = cpu_rq(i);
struct task_struct *p;
@@ -4402,6 +4409,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*/
if (i == cpu && !need_sync && !p->core_cookie) {
next = p;
+ rq_i->core_pick = next;
+ rq_i->core_sched_seq = rq_i->core->core_pick_seq;
goto done;
}
@@ -4427,7 +4436,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
max = p;
if (old_max) {
- for_each_cpu(j, smt_mask) {
+ for_each_cpu(j, &select_mask) {
if (j == i)
continue;
@@ -4452,6 +4461,10 @@ next_class:;
rq->core->core_pick_seq = rq->core->core_task_seq;
next = rq->core_pick;
+
+ /* Something should have been selected for current CPU */
+ WARN_ON_ONCE(!next);
+
rq->core_sched_seq = rq->core->core_pick_seq;
/*
@@ -4462,7 +4475,7 @@ next_class:;
* their task. This ensures there is no inter-sibling overlap between
* non-matching user state.
*/
- for_each_cpu(i, smt_mask) {
+ for_each_cpu(i, &select_mask) {
struct rq *rq_i = cpu_rq(i);
WARN_ON_ONCE(!rq_i->core_pick);
@@ -4483,6 +4496,16 @@ next_class:;
}
done:
+ /*
+ * If we reset a sibling's core_pick, make sure that we picked a task
+ * for it, this is because we might have reset it though it was set to
+ * something by another selector. In this case we cannot leave it as
+ * NULL and should have found something for it.
+ */
+ for_each_cpu(i, &select_mask) {
+ WARN_ON_ONCE(!cpu_rq(i)->core_pick);
+ }
+
set_next_task(rq, next);
return next;
}
--
2.27.0.212.ge8ba1cc988-goog
On 7/1/20 4:28 PM, Joel Fernandes wrote:
> On Tue, Jun 30, 2020 at 09:32:27PM +0000, Vineeth Remanan Pillai wrote:
>> From: Peter Zijlstra <[email protected]>
>>
>> Instead of only selecting a local task, select a task for all SMT
>> siblings for every reschedule on the core (irrespective which logical
>> CPU does the reschedule).
>>
>> There could be races in core scheduler where a CPU is trying to pick
>> a task for its sibling in core scheduler, when that CPU has just been
>> offlined. We should not schedule any tasks on the CPU in this case.
>> Return an idle task in pick_next_task for this situation.
>>
>> NOTE: there is still potential for siblings rivalry.
>> NOTE: this is far too complicated; but thus far I've failed to
>> simplify it further.
>>
>> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>> Signed-off-by: Julien Desfossez <[email protected]>
>> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
>> Signed-off-by: Aaron Lu <[email protected]>
>> Signed-off-by: Tim Chen <[email protected]>
>
> Hi Peter, Tim, all, the below patch fixes the hotplug issue described in the
> below patch's Link tag. Patch description below describes the issues fixed
> and it applies on top of this patch.
>
> ------8<----------
>
> From: "Joel Fernandes (Google)" <[email protected]>
> Subject: [PATCH] sched: Fix CPU hotplug causing crashes in task selection logic
>
> The selection logic does not run correctly if the current CPU is not in the
> cpu_smt_mask (which it is not because the CPU is offlined when the stopper
> finishes running and needs to switch to idle). There are also other issues
> fixed by the patch I think such as: if some other sibling set core_pick to
> something, however the selection logic on current cpu resets it before
> selecting. In this case, we need to run the task selection logic again to
> make sure it picks something if there is something to run. It might end up
> picking the wrong task. Yet another issue was, if the stopper thread is an
> unconstrained pick, then rq->core_pick is set. The next time task selection
> logic runs when stopper needs to switch to idle, the current CPU is not in
> the smt_mask. This causes the previous ->core_pick to be picked again which
> happens to be the unconstrained task! so the stopper keeps getting selected
> forever.
>
> That and there are a few more safe guards and checks around checking/setting
> rq->core_pick. To test it, I ran rcutorture and made it tag all torture
> threads. Then ran it in hotplug mode (hotplugging every 200ms) and it hit the
> issue. Now it runs for an hour or so without issue. (Torture testing debug
> changes: https://bit.ly/38htfqK ).
>
> Various fixes were tried causing varying degrees of crashes. Finally I found
> that it is easiest to just add current CPU to the smt_mask's copy always.
> This is so that task selection logic always runs on the current CPU which
> called schedule().
It looks good to me.
Thanks.
Tim
On Wed, Jul 01, 2020 at 05:54:11PM -0700, Tim Chen wrote:
>
>
> On 7/1/20 4:28 PM, Joel Fernandes wrote:
> > On Tue, Jun 30, 2020 at 09:32:27PM +0000, Vineeth Remanan Pillai wrote:
> >> From: Peter Zijlstra <[email protected]>
> >>
> >> Instead of only selecting a local task, select a task for all SMT
> >> siblings for every reschedule on the core (irrespective which logical
> >> CPU does the reschedule).
> >>
> >> There could be races in core scheduler where a CPU is trying to pick
> >> a task for its sibling in core scheduler, when that CPU has just been
> >> offlined. We should not schedule any tasks on the CPU in this case.
> >> Return an idle task in pick_next_task for this situation.
> >>
> >> NOTE: there is still potential for siblings rivalry.
> >> NOTE: this is far too complicated; but thus far I've failed to
> >> simplify it further.
> >>
> >> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> >> Signed-off-by: Julien Desfossez <[email protected]>
> >> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
> >> Signed-off-by: Aaron Lu <[email protected]>
> >> Signed-off-by: Tim Chen <[email protected]>
> >
> > Hi Peter, Tim, all, the below patch fixes the hotplug issue described in the
> > below patch's Link tag. Patch description below describes the issues fixed
> > and it applies on top of this patch.
> >
> > ------8<----------
> >
> > From: "Joel Fernandes (Google)" <[email protected]>
> > Subject: [PATCH] sched: Fix CPU hotplug causing crashes in task selection logic
> >
> > The selection logic does not run correctly if the current CPU is not in the
> > cpu_smt_mask (which it is not because the CPU is offlined when the stopper
> > finishes running and needs to switch to idle). There are also other issues
> > fixed by the patch I think such as: if some other sibling set core_pick to
> > something, however the selection logic on current cpu resets it before
> > selecting. In this case, we need to run the task selection logic again to
> > make sure it picks something if there is something to run. It might end up
> > picking the wrong task. Yet another issue was, if the stopper thread is an
"It might end up picking the wrong task" needs to be: "We might end up
picking a different task but that's Ok".
> > unconstrained pick, then rq->core_pick is set. The next time task selection
> > logic runs when stopper needs to switch to idle, the current CPU is not in
> > the smt_mask. This causes the previous ->core_pick to be picked again which
> > happens to be the unconstrained task! so the stopper keeps getting selected
> > forever.
> >
> > That and there are a few more safe guards and checks around checking/setting
> > rq->core_pick. To test it, I ran rcutorture and made it tag all torture
> > threads. Then ran it in hotplug mode (hotplugging every 200ms) and it hit the
> > issue. Now it runs for an hour or so without issue. (Torture testing debug
> > changes: https://bit.ly/38htfqK ).
> >
> > Various fixes were tried causing varying degrees of crashes. Finally I found
> > that it is easiest to just add current CPU to the smt_mask's copy always.
> > This is so that task selection logic always runs on the current CPU which
> > called schedule().
>
>
> It looks good to me.
Thank you for your review! Could I add your Reviewed-by tag to the patch?
- Joel
> Thanks.
>
> Tim
On Thu, Jul 02, 2020 at 08:57:57AM -0400, Joel Fernandes wrote:
[...]
> > > unconstrained pick, then rq->core_pick is set. The next time task selection
> > > logic runs when stopper needs to switch to idle, the current CPU is not in
> > > the smt_mask. This causes the previous ->core_pick to be picked again which
> > > happens to be the unconstrained task! so the stopper keeps getting selected
> > > forever.
> > >
> > > That and there are a few more safe guards and checks around checking/setting
> > > rq->core_pick. To test it, I ran rcutorture and made it tag all torture
> > > threads. Then ran it in hotplug mode (hotplugging every 200ms) and it hit the
> > > issue. Now it runs for an hour or so without issue. (Torture testing debug
> > > changes: https://bit.ly/38htfqK ).
> > >
> > > Various fixes were tried causing varying degrees of crashes. Finally I found
> > > that it is easiest to just add current CPU to the smt_mask's copy always.
> > > This is so that task selection logic always runs on the current CPU which
> > > called schedule().
> >
> >
> > It looks good to me.
>
> Thank you for your review! Could I add your Reviewed-by tag to the patch?
Julien and Vineeth, here is my coresched tree updated with this patch for
when you are sending the next series:
git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git (branch coresched)
There are some trivial fixups to the debug patch, due to this commit. So
pulling from the above branch may save you some time.
thanks,
- Joel
On Wed, Jul 1, 2020 at 7:28 PM Joel Fernandes <[email protected]> wrote:
>
> From: "Joel Fernandes (Google)" <[email protected]>
> Subject: [PATCH] sched: Fix CPU hotplug causing crashes in task selection logic
>
> The selection logic does not run correctly if the current CPU is not in the
> cpu_smt_mask (which it is not because the CPU is offlined when the stopper
> finishes running and needs to switch to idle). There are also other issues
> fixed by the patch I think such as: if some other sibling set core_pick to
> something, however the selection logic on current cpu resets it before
> selecting. In this case, we need to run the task selection logic again to
> make sure it picks something if there is something to run. It might end up
> picking the wrong task.
>
I am not sure if this can happen. If the other sibling sets core_pick, it
will be under the core wide lock and it should set the core_sched_seq also
before releasing the lock. So when this cpu tries, it would see the core_pick
before resetting it. Is this the same case you were mentioning? Sorry if I
misunderstood the case you mentioned..
> Yet another issue was, if the stopper thread is an
> unconstrained pick, then rq->core_pick is set. The next time task selection
> logic runs when stopper needs to switch to idle, the current CPU is not in
> the smt_mask. This causes the previous ->core_pick to be picked again which
> happens to be the unconstrained task! so the stopper keeps getting selected
> forever.
>
I did not clearly understand this. During an unconstrained pick, current
cpu's core_pick is not set and tasks are not picked for siblings as well.
If it is observed being set in the v6 code, I think it should be a bug.
> That and there are a few more safe guards and checks around checking/setting
> rq->core_pick. To test it, I ran rcutorture and made it tag all torture
> threads. Then ran it in hotplug mode (hotplugging every 200ms) and it hit the
> issue. Now it runs for an hour or so without issue. (Torture testing debug
> changes: https://bit.ly/38htfqK ).
>
> Various fixes were tried causing varying degrees of crashes. Finally I found
> that it is easiest to just add current CPU to the smt_mask's copy always.
> This is so that task selection logic always runs on the current CPU which
> called schedule().
>
> [...]
> cpu = cpu_of(rq);
> - smt_mask = cpu_smt_mask(cpu);
> + /* Make a copy of cpu_smt_mask as we should not set that. */
> + cpumask_copy(&select_mask, cpu_smt_mask(cpu));
> +
> + /*
> + * Always make sure current CPU is added to smt_mask so that below
> + * selection logic runs on it.
> + */
> + cpumask_set_cpu(cpu, &select_mask);
>
I like this idea. Probably we can optimize it a bit. We get here with cpu
not in smt_mask only during an offline and online (including the boot-time
online) phase. So we could probably wrap it in an "if (unlikely())". Also,
during this time, it would be idle thread or some hotplug online thread that
would be runnable and no other tasks should be runnable on this cpu. So, I
think it makes sense to do an unconstrained pick rather than a costly sync
of all siblings. Probably something like:
cpumask_copy(&select_mask, cpu_smt_mask(cpu));
if (unlikely(!cpumask_test_cpu(cpu, &select_mask))) {
cpumask_set_cpu(cpu, &select_mask);
need_sync = false;
}
By setting need_sync to false, we will do an unconstrained pick and will
not sync with other siblings. I guess we need to reset need_sync after
or in the following for_each_cpu loop, because the loop may set it.
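Roughly (untested, and "hotplug_pick" is just an illustrative name), the reset
could be folded in with a local flag, something like:

	bool hotplug_pick = false;

	cpumask_copy(&select_mask, cpu_smt_mask(cpu));
	if (unlikely(!cpumask_test_cpu(cpu, &select_mask))) {
		/* Only possible around CPU offline/online. */
		cpumask_set_cpu(cpu, &select_mask);
		hotplug_pick = true;
	}

	/* ... the "reset state" loop runs here and may set need_sync ... */

	if (hotplug_pick)
		need_sync = false;

i.e. force the unconstrained pick even if the reset loop turned need_sync
back on.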
> /*
> * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
> @@ -4351,7 +4358,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> if (i == cpu && !need_sync && !p->core_cookie) {
> next = p;
> + rq_i->core_pick = next;
> + rq_i->core_sched_seq = rq_i->core->core_pick_seq;
>
I think we would not need these here. core_pick needs to be set only
for siblings if we are picking a task for them. For unconstrained pick,
we pick only for ourselves. Also, core_sched_seq need not be synced here.
We might already be synced with the existing core->core_pick_seq. Even
if it is not synced, I don't think it will cause an issue in subsequent
schedule events.
> done:
> + /*
> + * If we reset a sibling's core_pick, make sure that we picked a task
> + * for it, this is because we might have reset it though it was set to
> + * something by another selector. In this case we cannot leave it as
> + * NULL and should have found something for it.
> + */
> + for_each_cpu(i, &select_mask) {
> + WARN_ON_ONCE(!cpu_rq(i)->core_pick);
> + }
> +
I think this check will not be true always. For unconstrained pick, we
do not pick tasks for siblings and hence do not set core_pick for them.
So this WARN_ON will fire for unconstrained pick. Easily reproducible
by creating an empty cgroup and tagging it. Then only unconstrained
picks will happen and this WARN_ON fires. I guess this check after the
done label does not hold and could be removed.
Thanks,
Vineeth
On 7/2/20 5:57 AM, Joel Fernandes wrote:
> On Wed, Jul 01, 2020 at 05:54:11PM -0700, Tim Chen wrote:
>>
>>
>> On 7/1/20 4:28 PM, Joel Fernandes wrote:
>>> On Tue, Jun 30, 2020 at 09:32:27PM +0000, Vineeth Remanan Pillai wrote:
>>>> From: Peter Zijlstra <[email protected]>
>>>>
>>>> Instead of only selecting a local task, select a task for all SMT
>>>> siblings for every reschedule on the core (irrespective which logical
>>>> CPU does the reschedule).
>>>>
>>>> There could be races in core scheduler where a CPU is trying to pick
>>>> a task for its sibling in core scheduler, when that CPU has just been
>>>> offlined. We should not schedule any tasks on the CPU in this case.
>>>> Return an idle task in pick_next_task for this situation.
>>>>
>>>> NOTE: there is still potential for siblings rivalry.
>>>> NOTE: this is far too complicated; but thus far I've failed to
>>>> simplify it further.
>>>>
>>>> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>>>> Signed-off-by: Julien Desfossez <[email protected]>
>>>> Signed-off-by: Vineeth Remanan Pillai <[email protected]>
>>>> Signed-off-by: Aaron Lu <[email protected]>
>>>> Signed-off-by: Tim Chen <[email protected]>
>>>
>>> Hi Peter, Tim, all, the below patch fixes the hotplug issue described in the
>>> below patch's Link tag. Patch description below describes the issues fixed
>>> and it applies on top of this patch.
>>>
>>> ------8<----------
>>>
>>> From: "Joel Fernandes (Google)" <[email protected]>
>>> Subject: [PATCH] sched: Fix CPU hotplug causing crashes in task selection logic
>>>
>>> The selection logic does not run correctly if the current CPU is not in the
>>> cpu_smt_mask (which it is not because the CPU is offlined when the stopper
>>> finishes running and needs to switch to idle). There are also other issues
>>> fixed by the patch I think such as: if some other sibling set core_pick to
>>> something, however the selection logic on current cpu resets it before
>>> selecting. In this case, we need to run the task selection logic again to
>>> make sure it picks something if there is something to run. It might end up
>>> picking the wrong task. Yet another issue was, if the stopper thread is an
>
> "It might end up picking the wrong task" needs to be: "We might end up
> picking a different task but that's Ok".
>
>>> unconstrained pick, then rq->core_pick is set. The next time task selection
>>> logic runs when stopper needs to switch to idle, the current CPU is not in
>>> the smt_mask. This causes the previous ->core_pick to be picked again which
>>> happens to be the unconstrained task! so the stopper keeps getting selected
>>> forever.
>>>
>>> That and there are a few more safe guards and checks around checking/setting
>>> rq->core_pick. To test it, I ran rcutorture and made it tag all torture
>>> threads. Then ran it in hotplug mode (hotplugging every 200ms) and it hit the
>>> issue. Now it runs for an hour or so without issue. (Torture testing debug
>>> changes: https://bit.ly/38htfqK ).
>>>
>>> Various fixes were tried causing varying degrees of crashes. Finally I found
>>> that it is easiest to just add current CPU to the smt_mask's copy always.
>>> This is so that task selection logic always runs on the current CPU which
>>> called schedule().
>>
>>
>> It looks good to me.
>
> Thank you for your review! Could I add your Reviewed-by tag to the patch?
>
Sure.
Tim
On Fri, Jul 03, 2020 at 04:21:46PM -0400, Vineeth Remanan Pillai wrote:
> On Wed, Jul 1, 2020 at 7:28 PM Joel Fernandes <[email protected]> wrote:
> >
> > From: "Joel Fernandes (Google)" <[email protected]>
> > Subject: [PATCH] sched: Fix CPU hotplug causing crashes in task selection logic
> >
> > The selection logic does not run correctly if the current CPU is not in the
> > cpu_smt_mask (which it is not because the CPU is offlined when the stopper
> > finishes running and needs to switch to idle). There are also other issues
> > fixed by the patch I think such as: if some other sibling set core_pick to
> > something, however the selection logic on current cpu resets it before
> > selecting. In this case, we need to run the task selection logic again to
> > make sure it picks something if there is something to run. It might end up
> > picking the wrong task.
> >
> I am not sure if this can happen. If the other sibling sets core_pick, it
> will be under the core wide lock and it should set the core_sched_seq also
> before releasing the lock. So when this cpu tries, it would see the core_pick
> before resetting it. Is this the same case you were mentioning? Sorry if I
> misunderstood the case you mentioned..
Say you have a case where 3 siblings are all trying to enter the schedule
loop; call them A, B and C.
A picks something for B in core_pick. Now C comes and resets B's core_pick
before running the mega-loop, hoping to select something for it shortly.
However, C then does an unconstrained pick and forgets to set B's pick to
something.
I don't know if this can really happen - but this is why I added the warning
at the end of the patch. I think we should make the code more robust and
handle these kinds of cases.
> > Yet another issue was, if the stopper thread is an
> > unconstrained pick, then rq->core_pick is set. The next time task selection
> > logic runs when stopper needs to switch to idle, the current CPU is not in
> > the smt_mask. This causes the previous ->core_pick to be picked again which
> > happens to be the unconstrained task! so the stopper keeps getting selected
> > forever.
> >
> I did not clearly understand this. During an unconstrained pick, current
> cpu's core_pick is not set and tasks are not picked for siblings as well.
> If it is observed being set in the v6 code, I think it should be a bug.
Again, it is about making the code more robust. Why should we not set
rq->core_pick when we pick something? As we discussed in the private
discussion - we should make the code robust and consistent. Correctness is
not enough, the code has to be robust and maintainable.
I think in our private discussion, you agreed with me that there is no harm
in setting core_pick in this case.
> > That and there are a few more safe guards and checks around checking/setting
> > rq->core_pick. To test it, I ran rcutorture and made it tag all torture
> > threads. Then ran it in hotplug mode (hotplugging every 200ms) and it hit the
> > issue. Now it runs for an hour or so without issue. (Torture testing debug
> > changes: https://bit.ly/38htfqK ).
> >
> > Various fixes were tried causing varying degrees of crashes. Finally I found
> > that it is easiest to just add current CPU to the smt_mask's copy always.
> > This is so that task selection logic always runs on the current CPU which
> > called schedule().
> >
> > [...]
> > cpu = cpu_of(rq);
> > - smt_mask = cpu_smt_mask(cpu);
> > + /* Make a copy of cpu_smt_mask as we should not set that. */
> > + cpumask_copy(&select_mask, cpu_smt_mask(cpu));
> > +
> > + /*
> > + * Always make sure current CPU is added to smt_mask so that below
> > + * selection logic runs on it.
> > + */
> > + cpumask_set_cpu(cpu, &select_mask);
> >
> I like this idea. Probably we can optimize it a bit. We get here with cpu
> not in smt_mask only during an offline and online(including the boot time
> online) phase. So we could probably wrap it in an "if (unlikely())". Also,
> during this time, it would be idle thread or some hotplug online thread that
> would be runnable and no other tasks should be runnable on this cpu. So, I
> think it makes sense to do an unconstrained pick rather than a costly sync
> of all siblings. Probably something like:
>
> cpumask_copy(&select_mask, cpu_smt_mask(cpu));
> if (unlikely(!cpumask_test_cpu(cpu, &select_mask))) {
> cpumask_set_cpu(cpu, &select_mask);
> need_sync = false;
> }
Nah, more lines of code for no good reason, plus another branch, right? I'd
rather leave my one-liner alone than add 4 more lines :-)
> By setting need_sync to false, we will do an unconstrained pick and will
> not sync with other siblings. I guess we need to reset need_sync after
> or in the following for_each_cpu loop, because the loop may set it.
I don't know if we want to add more conditions really and make it more
confusing. If anything, I believe we should simplify the existing code more TBH.
> > /*
> > * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
> > @@ -4351,7 +4358,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>
> > if (i == cpu && !need_sync && !p->core_cookie) {
> > next = p;
> > + rq_i->core_pick = next;
> > + rq_i->core_sched_seq = rq_i->core->core_pick_seq;
> >
> I think we would not need these here. core_pick needs to be set only
> for siblings if we are picking a task for them. For unconstrained pick,
> we pick only for ourselves. Also, core_sched_seq need not be synced here.
> We might already be synced with the existing core->core_pick_seq. Even
> if it is not synced, I don't think it will cause an issue in subsequent
> schedule events.
As discussed both privately and above, there is no harm and it is good to
keep the code consistent. I'd rather have any task picking set core_pick and
core_sched_seq to prevent confusion.
And if anything is resetting an existing ->core_pick of a sibling in the
selection loop, it better set it to something sane.
> > done:
> > + /*
> > + * If we reset a sibling's core_pick, make sure that we picked a task
> > + * for it, this is because we might have reset it though it was set to
> > + * something by another selector. In this case we cannot leave it as
> > + * NULL and should have found something for it.
> > + */
> > + for_each_cpu(i, &select_mask) {
> > + WARN_ON_ONCE(!cpu_rq(i)->core_pick);
> > + }
> > +
> I think this check will not be true always. For unconstrained pick, we
> do not pick tasks for siblings and hence do not set core_pick for them.
> So this WARN_ON will fire for unconstrained pick. Easily reproducible
> by creating an empty cgroup and tagging it. Then only unconstrained
> picks will happen and this WARN_ON fires. I guess this check after the
> done label does not hold and could be removed.
As discussed above, in the > 2 SMT case, we don't really know if the warning will
fire or not. I would rather keep the warning just in case for the future.
Thanks!
- Joel
On Mon, Jul 6, 2020 at 10:09 AM Joel Fernandes <[email protected]> wrote:
>
> > I am not sure if this can happen. If the other sibling sets core_pick, it
> > will be under the core wide lock and it should set the core_sched_seq also
> > before releasing the lock. So when this cpu tries, it would see the core_pick
> > before resetting it. Is this the same case you were mentioning? Sorry if I
> > misunderstood the case you mentioned..
>
> If you have a case where you have 3 siblings all trying to enter the schedule
> loop. Call them A, B and C.
>
> A picks something for B in core_pick. Now C comes and resets B's core_pick
> before running the mega-loop, hoping to select something for it shortly.
> However, C then does an unconstrained pick and forgets to set B's pick to
> something.
>
> I don't know if this can really happen - but this is why I added the warning
> in the end of the patch. I think we should make the code more robust and
> handle these kind of cases.
>
I don't think this can happen. Each of the siblings takes the core-wide
lock before calling into pick_next_task. So this should not happen.
> Again, it is about making the code more robust. Why should not set
> rq->core_pick when we pick something? As we discussed in the private
> discussion - we should make the code robust and consistent. Correctness is
> not enough, the code has to be robust and maintainable.
>
> I think in our private discussion, you agreed with me that there is no harm
> in setting core_pick in this case.
>
I agreed there was no harm, because we wanted to use that in the last
check after the 'done' label. But now I see that adding that check after
the done label causes the WARN_ON to fire even in a valid case. Firing the
WARN_ON in a valid case is not good. So, if that WARN_ON check can be
removed, adding this is not necessary IMHO.
> > cpumask_copy(&select_mask, cpu_smt_mask(cpu));
> > if (unlikely(!cpumask_test_cpu(cpu, &select_mask))) {
> > cpumask_set_cpu(cpu, &select_mask);
> > need_sync = false;
> > }
>
> Nah, more lines of code for no good no reason, plus another branch right? I'd
> like to leave my one liner alone than adding 4 more lines :-)
>
Remember, this is the fast path. Every schedule() except for our sync
IPI reaches here. And we know that the smt_cpumask will be missing the cpu
only in hotplug cases, which are very rare. I was looking to add a bit more
code to make it clear that this setting is not always needed, and also to
optimize for the fast path.
> > By setting need_sync to false, we will do an unconstrained pick and will
> > not sync with other siblings. I guess we need to reset need_sync after
> > or in the following for_each_cpu loop, because the loop may set it.
>
> I don't know if we want to add more conditions really and make it more
> confusing. If anything, I believe we should simplify the existing code more TBH.
>
I don't think it's more confusing. This hotplug case is really rare
and we should wrap it in an unlikely conditional IMHO. Comments can
make the reasoning clearer. We are saving two things here: one is
unconditionally setting the cpu in the mask, and the second is the
unnecessary syncing of the siblings during hotplug.
> > I think we would not need these here. core_pick needs to be set only
> > for siblings if we are picking a task for them. For unconstrained pick,
> > we pick only for ourselves. Also, core_sched_seq need not be synced here.
> > We might already be synced with the existing core->core_pick_seq. Even
> > if it is not synced, I don't think it will cause an issue in subsequent
> > schedule events.
>
> As discussed both privately and above, there is no harm and it is good to
> keep the code consistent. I'd rather have any task picking set core_pick and
> core_sched_seq to prevent confusion.
>
> And if anything is resetting an existing ->core_pick of a sibling in the
> selection loop, it better set it to something sane.
>
As I mentioned, I was okay with this as you are using it down in the
WARN_ON check. But the WARN_ON check triggers even on valid cases, which
is bad. I don't think setting this here will make the code more robust IMHO.
core_pick is already NULL and I would like it to stay that way unless there
is a compelling reason to set it. The reason is, with it left NULL we would
catch any bad case entering the pick condition above (it will crash).
> > > done:
> > > + /*
> > > + * If we reset a sibling's core_pick, make sure that we picked a task
> > > + * for it, this is because we might have reset it though it was set to
> > > + * something by another selector. In this case we cannot leave it as
> > > + * NULL and should have found something for it.
> > > + */
> > > + for_each_cpu(i, &select_mask) {
> > > + WARN_ON_ONCE(!cpu_rq(i)->core_pick);
> > > + }
> > > +
> > I think this check will not be true always. For unconstrained pick, we
> > do not pick tasks for siblings and hence do not set core_pick for them.
> > So this WARN_ON will fire for unconstrained pick. Easily reproducible
> > by creating an empty cgroup and tagging it. Then only unconstrained
> > picks will happen and this WARN_ON fires. I guess this check after the
> > done label does not hold and could be removed.
>
> As discussed above, > 2 SMT case, we don't really know if the warning will
> fire or not. I would rather keep the warning just in case for the future.
>
I think I was not clear last time. This WARN_ON will fire on valid cases
if you have this check here. As I mentioned, an unconstrained pick picks only
for that cpu and not for any other siblings. This is by design. So for an
unconstrained pick, core_pick of all siblings will be NULL. We jump to the done
label on an unconstrained pick and this for loop goes through all the siblings
and finds that their core_pick is not set. Then the WARN_ON will fire. I have
reproduced this. We do not want it to fire, as it is correct logic not to
set core_pick for an unconstrained pick. Please let me know if this is not clear.
Thanks,
Vineeth
Hi Vineeth,
On Mon, Jul 06, 2020 at 10:38:27AM -0400, Vineeth Remanan Pillai wrote:
> On Mon, Jul 6, 2020 at 10:09 AM Joel Fernandes <[email protected]> wrote:
> >
> > > I am not sure if this can happen. If the other sibling sets core_pick, it
> > > will be under the core wide lock and it should set the core_sched_seq also
> > > before releasing the lock. So when this cpu tries, it would see the core_pick
> > > before resetting it. Is this the same case you were mentioning? Sorry if I
> > > misunderstood the case you mentioned..
> >
> > If you have a case where you have 3 siblings all trying to enter the schedule
> > loop. Call them A, B and C.
> >
> > A picks something for B in core_pick. Now C comes and resets B's core_pick
> > before running the mega-loop, hoping to select something for it shortly.
> > However, C then does an unconstrained pick and forgets to set B's pick to
> > something.
> >
> > I don't know if this can really happen - but this is why I added the warning
> > in the end of the patch. I think we should make the code more robust and
> > handle these kind of cases.
> >
> I don't think this can happen. Each of the sibling takes the core wide
> lock before calling into pick_next _task. So this should not happen.
So my patch is correct but the warnings I added were probably overkill.
About the warnings, Vineeth explained to me on IRC that the design was
initially done to set ->core_pick to NULL if nothing is being picked for a
sibling rq, and the fact that we don't increment that rq's core_sched_seq
means the rq it is being set for would not go and read core_pick.
And resetting ->core_pick should be ok, since a sibling will go select a
task for itself if its core_pick was NULL anyway.
The only requirement is that the selection code definitely selects something
for the current CPU, or idle. NULL is not an option.
So I guess we can drop the additional warnings I added, I was likely too
paranoid.
> > Again, it is about making the code more robust. Why should not set
> > rq->core_pick when we pick something? As we discussed in the private
> > discussion - we should make the code robust and consistent. Correctness is
> > not enough, the code has to be robust and maintainable.
> >
> > I think in our private discussion, you agreed with me that there is no harm
> > in setting core_pick in this case.
> >
> I agreed there was no harm, because we wanted to use that in the last
> check after 'done' label. But now I see that adding that check after
> done label cause the WARN_ON to fire even in valid case. Firing the
> WARN_ON in valid case is not good. So, if that WARN_ON check can be
> removed, adding this is not necessary IMHO.
Makes sense.
> > > cpumask_copy(&select_mask, cpu_smt_mask(cpu));
> > > if (unlikely(!cpumask_test_cpu(cpu, &select_mask))) {
> > > cpumask_set_cpu(cpu, &select_mask);
> > > need_sync = false;
> > > }
> >
> > Nah, more lines of code for no good no reason, plus another branch right? I'd
> > like to leave my one liner alone than adding 4 more lines :-)
> >
> Remember, this is the fast path. Every schedule() except for our sync
> IPI reaches here. And we are sure that smt_cpumask will not have cpu
> only on hotplug cases which is very rare. I feel adding more code to
> make it clear that this setting is not needed always and also optimizing for
> the fast path is what I was looking for.
It occurs to us that we may want to optimize this a bit more, because with my
patch we have to copy the cpumask on every schedule(), which may be
unnecessarily expensive for large CPU systems. I think we can do better --
probably by unconditionally running the selection code on the current CPU
without first preparing an intermediate mask.
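Completely untested, but just to sketch that direction: a single early check in
pick_next_task() (right after the sched_core_enabled() test, before the
core_pick fastpath) could fall back to a plain local pick when the CPU has
already been dropped from its cpu_smt_mask(), instead of copying the mask on
every schedule():

	int cpu = cpu_of(rq);

	if (unlikely(!cpumask_test_cpu(cpu, cpu_smt_mask(cpu)))) {
		/*
		 * The CPU is going offline (the stopper is switching to
		 * idle): nothing core-wide to coordinate, so do a local pick
		 * and make sure a stale core_pick cannot be reused by the
		 * fastpath when the CPU comes back online.
		 */
		rq->core_pick = NULL;
		return __pick_next_task(rq, prev, rf);
	}

Whether that covers all the hotplug corner cases would of course need the same
kind of torture testing as above.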
> > As discussed above, > 2 SMT case, we don't really know if the warning will
> > fire or not. I would rather keep the warning just in case for the future.
> >
> I think I was not clear last time. This WARN_ON will fire on valid cases
> if you have this check here. As I mentioned unconstrained pick, picks only
> for that cpu and not to any other siblings. This is by design. So for
> unconstrained pick, core_pick of all siblings will be NULL. We jump to done
> label on unconstrained pick and this for loop goes through all the siblings
> and finds that its core_pick is not set. Then thei WARN_ON will fire. I have
> reproduced this. We do not want it to fire as it is the correct logic not to
> set core_pick for unconstrained pick. Please let me know if this is not clear.
Agreed, I think my patch can be used as a starting point and we can optimize it
further.
Vineeth and I will continue to work on this and come up with a final patch, thanks!
- Joel