2021-08-18 00:58:19

by Josh Don

[permalink] [raw]
Subject: [PATCH] sched/core: fix pick_next_task 'max' tracking

For core-sched, pick_next_task will update the 'max' task if there is a
cookie mismatch (since in this case the new task must be of higher
priority than the current max). However, we fail to update 'max' if
we've found a task with a matching cookie and higher priority than
'max'.

This can result in extra iterations on SMT-X machines, where X > 2.

As an example, on a machine with SMT=3, on core 0, SMT-0 might pick
the following, in order:

- SMT-0: p1, with cookie A, and priority 10 (max = p1)
- SMT-1: p2, with cookie A, and priority 30 (max not updated here)
- SMT-2: p3, with cookie B, and priority 20 (max = p2)
> invalidate the other picks and retry

Here, we should have instead updated 'max' when picking for SMT-1. Note
that this code would eventually have righted itself, since the retry
loop would re-pick p2, and update 'max' accordingly. However, this patch
avoids the extra round-trip.

Signed-off-by: Josh Don <[email protected]>
---
kernel/sched/core.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3431939699dc..110ea7582a33 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5623,6 +5623,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
occ = 1;
goto again;
}
+ } else if (prio_less(max, p, fi_before)) {
+ max = p;
}
}
}
--
2.33.0.rc1.237.g0d66db33f3-goog


2021-08-18 04:39:06

by Tao Zhou

[permalink] [raw]
Subject: Re: [PATCH] sched/core: fix pick_next_task 'max' tracking

Hi Josh,

On Tue, Aug 17, 2021 at 05:56:15PM -0700, Josh Don wrote:
> For core-sched, pick_next_task will update the 'max' task if there is a
> cookie mismatch (since in this case the new task must be of higher
> priority than the current max). However, we fail to update 'max' if
> we've found a task with a matching cookie and higher priority than
> 'max'.
>
> This can result in extra iterations on SMT-X machines, where X > 2.
>
> As an example, on a machine with SMT=3, on core 0, SMT-0 might pick
> the following, in order:
>
> - SMT-0: p1, with cookie A, and priority 10 (max = p1)
> - SMT-1: p2, with cookie A, and priority 30 (max not updated here)

Thanks for your illustration. Good catch.
The guilty is 'cookie_equals(class_pick, cookie))' condition in pick_task()

> - SMT-2: p3, with cookie B, and priority 20 (max = p2)
> > invalidate the other picks and retry
>
> Here, we should have instead updated 'max' when picking for SMT-1. Note
> that this code would eventually have righted itself, since the retry
> loop would re-pick p2, and update 'max' accordingly. However, this patch
> avoids the extra round-trip.

This is correct then may increase the chance to retry. That means it is
more possible to filter the max first(not sure).

> Signed-off-by: Josh Don <[email protected]>
> ---
> kernel/sched/core.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3431939699dc..110ea7582a33 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5623,6 +5623,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> occ = 1;
> goto again;
> }
> + } else if (prio_less(max, p, fi_before)) {
> + max = p;
> }
> }
> }
> --
> 2.33.0.rc1.237.g0d66db33f3-goog
>


Thanks,
Tao

2021-08-18 15:19:56

by Tao Zhou

[permalink] [raw]
Subject: Re: [PATCH] sched/core: fix pick_next_task 'max' tracking

Hi,

On Wed, Aug 18, 2021 at 12:35:13PM +0800, Tao Zhou wrote:
> Hi Josh,
>
> On Tue, Aug 17, 2021 at 05:56:15PM -0700, Josh Don wrote:
> > For core-sched, pick_next_task will update the 'max' task if there is a
> > cookie mismatch (since in this case the new task must be of higher
> > priority than the current max). However, we fail to update 'max' if
> > we've found a task with a matching cookie and higher priority than
> > 'max'.
> >
> > This can result in extra iterations on SMT-X machines, where X > 2.
> >
> > As an example, on a machine with SMT=3, on core 0, SMT-0 might pick
> > the following, in order:
> >
> > - SMT-0: p1, with cookie A, and priority 10 (max = p1)
> > - SMT-1: p2, with cookie A, and priority 30 (max not updated here)
>
> Thanks for your illustration. Good catch.
> The guilty is 'cookie_equals(class_pick, cookie))' condition in pick_task()
>
> > - SMT-2: p3, with cookie B, and priority 20 (max = p2)
> > > invalidate the other picks and retry
> >
> > Here, we should have instead updated 'max' when picking for SMT-1. Note
> > that this code would eventually have righted itself, since the retry
> > loop would re-pick p2, and update 'max' accordingly. However, this patch
> > avoids the extra round-trip.
>
> This is correct then may increase the chance to retry. That means it is
> more possible to filter the max first(not sure).
>
> > Signed-off-by: Josh Don <[email protected]>
> > ---
> > kernel/sched/core.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 3431939699dc..110ea7582a33 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5623,6 +5623,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > occ = 1;
> > goto again;
> > }
> > + } else if (prio_less(max, p, fi_before)) {
> > + max = p;
> > }
> > }
> > }
> > --
> > 2.33.0.rc1.237.g0d66db33f3-goog
> >

My ugly patch need to continue, not compiled.
Rebase on the previous patch.

From 79f25599486c79ecb2e78875498b24f1320060b8 Mon Sep 17 00:00:00 2001
From: Tao Zhou <[email protected]>
Date: Wed, 18 Aug 2021 00:07:38 +0800
Subject: [PATCH] optimize pick_next_task()

sched/core: Optimize pick_next_task() not sure

---
kernel/sched/core.c | 78 ++++++++++++++++++++++++++++++--------------
kernel/sched/sched.h | 1 +
2 files changed, 54 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 20ffcc044134..18f236f75170 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5370,28 +5370,42 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
if (max && class_pick->core_cookie &&
prio_less(class_pick, max, in_fi))
return idle_sched_class.pick_task(rq);
-
- return class_pick;
}

- /*
- * If class_pick is idle or matches cookie, return early.
- */
- if (cookie_equals(class_pick, cookie))
- return class_pick;
+ return class_pick;
+}
+
+static task_struct *
+filter_max_prio(struct rq *rq, struct task_struct *class_pick, struct task_struct *max, bool in_fi)
+{
+ unsigned long cookie = rq->core->core_cookie;
+ struct task_struct *cookie_pick = NULL;
+ bool cookie_core_temp = false;

- cookie_pick = sched_core_find(rq, cookie);
+ rq->core_temp = class_pick;

- /*
- * If class > max && class > cookie, it is the highest priority task on
- * the core (so far) and it must be selected, otherwise we must go with
- * the cookie pick in order to satisfy the constraint.
- */
- if (prio_less(cookie_pick, class_pick, in_fi) &&
- (!max || prio_less(max, class_pick, in_fi)))
- return class_pick;
+ if (cookie) {
+ if (!cookie_equals(class_pick, cookie)) {
+ cookie_pick = sched_core_find(rq, cookie);
+ /*
+ * If class > max && class > cookie, it is the
+ * highest priority task on the core (so far)
+ * and it must be selected, otherwise we must
+ * go with the cookie pick in order to satisfy
+ * the constraint.
+ */
+ if (prio_less(cookie_pick, class_pick, in_fi) &&
+ (!max || prio_less(max, class_pick, in_fi)))
+ return class_pick;
+ cookie_core_temp = true;
+ } else if (prio_less(max, class_pick, fi_before))
+ return class_pick;
+ }

- return cookie_pick;
+ if (cookie_core_temp)
+ rq->core_temp = cookie_pick;
+
+ return NULL;
}

extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
@@ -5508,24 +5522,37 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* order.
*/
for_each_class(class) {
-again:
+ struct task_struct *class_pick, *cookie_pick;
+ struct rq *rq_i;
+
+ for_each_cpu_wrap(i, smt_mask, cpu) {
+ rq_i = cpu_rq(i);
+ class_pick = pick_task(rq_i, class, max, fi_before);
+ /*
+ * This sibling doesn't yet have a suitable task to run.
+ */
+ if (!class_pick)
+ continue;
+
+ if (filter_max_prio(rq_i, class_pick, max, fi_before))
+ max = class_pick;
+ }
+
for_each_cpu_wrap(i, smt_mask, cpu) {
- struct rq *rq_i = cpu_rq(i);
struct task_struct *p;
+ rq_i = cpu_rq(i);

if (rq_i->core_pick)
continue;

/*
- * If this sibling doesn't yet have a suitable task to
- * run; ask for the most eligible task, given the
- * highest priority task already selected for this
- * core.
+ * This sibling doesn't yet have a suitable task to run.
*/
- p = pick_task(rq_i, class, max, fi_before);
- if (!p)
+ if (!rq_i->core_temp)
continue;

+ p = rq_i->core_temp;
+
if (!is_task_rq_idle(p))
occ++;

@@ -9024,6 +9051,7 @@ void __init sched_init(void)
#ifdef CONFIG_SCHED_CORE
rq->core = NULL;
rq->core_pick = NULL;
+ rq->core_temp = NULL;
rq->core_enabled = 0;
rq->core_tree = RB_ROOT;
rq->core_forceidle = false;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14a41a243f7b..2b21a3846b8e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1089,6 +1089,7 @@ struct rq {
/* per rq */
struct rq *core;
struct task_struct *core_pick;
+ struct task_struct *core_temp;
unsigned int core_enabled;
unsigned int core_sched_seq;
struct rb_root core_tree;
--
2.31.1



Thanks,
Tao

2021-08-23 11:20:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/core: fix pick_next_task 'max' tracking

On Tue, Aug 17, 2021 at 05:56:15PM -0700, Josh Don wrote:
> For core-sched, pick_next_task will update the 'max' task if there is a
> cookie mismatch (since in this case the new task must be of higher
> priority than the current max). However, we fail to update 'max' if
> we've found a task with a matching cookie and higher priority than
> 'max'.
>
> This can result in extra iterations on SMT-X machines, where X > 2.
>
> As an example, on a machine with SMT=3, on core 0, SMT-0 might pick
> the following, in order:
>
> - SMT-0: p1, with cookie A, and priority 10 (max = p1)
> - SMT-1: p2, with cookie A, and priority 30 (max not updated here)
> - SMT-2: p3, with cookie B, and priority 20 (max = p2)
> > invalidate the other picks and retry
>
> Here, we should have instead updated 'max' when picking for SMT-1. Note
> that this code would eventually have righted itself, since the retry
> loop would re-pick p2, and update 'max' accordingly. However, this patch
> avoids the extra round-trip.

Going with the observation Tao made; how about we rewrite the whole lot
to not be mind-bending complicated :-)

How's this? It seems to build and pass the core-sched selftest thingy
(so it must be perfect, right? :-)

---
kernel/sched/core.c | 147 ++++++++++++++-------------------------------------
kernel/sched/sched.h | 1 +
2 files changed, 41 insertions(+), 107 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ceae25ea8a0e..e896250c2fb8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5404,66 +5404,18 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
return a->core_cookie == b->core_cookie;
}

-// XXX fairness/fwd progress conditions
-/*
- * Returns
- * - NULL if there is no runnable task for this class.
- * - the highest priority task for this runqueue if it matches
- * rq->core->core_cookie or its priority is greater than max.
- * - Else returns idle_task.
- */
-static struct task_struct *
-pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max, bool in_fi)
-{
- struct task_struct *class_pick, *cookie_pick;
- unsigned long cookie = rq->core->core_cookie;
-
- class_pick = class->pick_task(rq);
- if (!class_pick)
- return NULL;
-
- if (!cookie) {
- /*
- * If class_pick is tagged, return it only if it has
- * higher priority than max.
- */
- if (max && class_pick->core_cookie &&
- prio_less(class_pick, max, in_fi))
- return idle_sched_class.pick_task(rq);
-
- return class_pick;
- }
-
- /*
- * If class_pick is idle or matches cookie, return early.
- */
- if (cookie_equals(class_pick, cookie))
- return class_pick;
-
- cookie_pick = sched_core_find(rq, cookie);
-
- /*
- * If class > max && class > cookie, it is the highest priority task on
- * the core (so far) and it must be selected, otherwise we must go with
- * the cookie pick in order to satisfy the constraint.
- */
- if (prio_less(cookie_pick, class_pick, in_fi) &&
- (!max || prio_less(max, class_pick, in_fi)))
- return class_pick;
-
- return cookie_pick;
-}
-
extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);

static struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
- struct task_struct *next, *max = NULL;
+ struct task_struct *next, *p, *max = NULL;
const struct sched_class *class;
const struct cpumask *smt_mask;
bool fi_before = false;
- int i, j, cpu, occ = 0;
+ unsigned long cookie;
+ int i, cpu, occ = 0;
+ struct rq *rq_i;
bool need_sync;

if (!sched_core_enabled(rq))
@@ -5554,76 +5506,57 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
}
}

- for_each_cpu(i, smt_mask) {
- struct rq *rq_i = cpu_rq(i);
-
+ /*
+ * For each thread: do the regular task pick and find the max prio task
+ * amongst them.
+ *
+ * Tie-break prio towards the current CPU
+ */
+ for_each_cpu_wrap(i, smt_mask, cpu) {
+ rq_i = cpu_rq(i);
rq_i->core_pick = NULL;

if (i != cpu)
update_rq_clock(rq_i);
+
+ for_each_class(class) {
+ p = rq_i->core_temp = class->pick_task(rq_i);
+ if (p)
+ break;
+ }
+
+ if (!max || prio_less(max, p, fi_before))
+ max = p;
}

+ cookie = rq->core->core_cookie = max->core_cookie;
+
/*
- * Try and select tasks for each sibling in descending sched_class
- * order.
+ * For each thread: try and find a runnable task that matches @max or
+ * force idle.
*/
- for_each_class(class) {
-again:
- for_each_cpu_wrap(i, smt_mask, cpu) {
- struct rq *rq_i = cpu_rq(i);
- struct task_struct *p;
-
- if (rq_i->core_pick)
- continue;
+ for_each_cpu(i, smt_mask) {
+ rq_i = cpu_rq(i);
+ p = rq_i->core_temp;

- /*
- * If this sibling doesn't yet have a suitable task to
- * run; ask for the most eligible task, given the
- * highest priority task already selected for this
- * core.
- */
- p = pick_task(rq_i, class, max, fi_before);
+ if (!cookie_equals(p, cookie)) {
+ p = NULL;
+ if (cookie)
+ p = sched_core_find(rq_i, cookie);
if (!p)
- continue;
+ p = idle_sched_class.pick_task(rq_i);
+ }

- if (!is_task_rq_idle(p))
- occ++;
+ rq_i->core_pick = p;

- rq_i->core_pick = p;
- if (rq_i->idle == p && rq_i->nr_running) {
+ if (p == rq_i->idle) {
+ if (rq_i->nr_running) {
rq->core->core_forceidle = true;
if (!fi_before)
rq->core->core_forceidle_seq++;
}
-
- /*
- * If this new candidate is of higher priority than the
- * previous; and they're incompatible; we need to wipe
- * the slate and start over. pick_task makes sure that
- * p's priority is more than max if it doesn't match
- * max's cookie.
- *
- * NOTE: this is a linear max-filter and is thus bounded
- * in execution time.
- */
- if (!max || !cookie_match(max, p)) {
- struct task_struct *old_max = max;
-
- rq->core->core_cookie = p->core_cookie;
- max = p;
-
- if (old_max) {
- rq->core->core_forceidle = false;
- for_each_cpu(j, smt_mask) {
- if (j == i)
- continue;
-
- cpu_rq(j)->core_pick = NULL;
- }
- occ = 1;
- goto again;
- }
- }
+ } else {
+ occ++;
}
}

@@ -5643,7 +5576,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* non-matching user state.
*/
for_each_cpu(i, smt_mask) {
- struct rq *rq_i = cpu_rq(i);
+ rq_i = cpu_rq(i);

/*
* An online sibling might have gone offline before a task
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d9f8d73a1d84..0760b460903a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1091,6 +1091,7 @@ struct rq {
/* per rq */
struct rq *core;
struct task_struct *core_pick;
+ struct task_struct *core_temp;
unsigned int core_enabled;
unsigned int core_sched_seq;
struct rb_root core_tree;

2021-08-23 15:39:38

by Tao Zhou

[permalink] [raw]
Subject: Re: [PATCH] sched/core: fix pick_next_task 'max' tracking

Hi Peter,

On Mon, Aug 23, 2021 at 01:16:56PM +0200, Peter Zijlstra wrote:

> On Tue, Aug 17, 2021 at 05:56:15PM -0700, Josh Don wrote:
> > For core-sched, pick_next_task will update the 'max' task if there is a
> > cookie mismatch (since in this case the new task must be of higher
> > priority than the current max). However, we fail to update 'max' if
> > we've found a task with a matching cookie and higher priority than
> > 'max'.
> >
> > This can result in extra iterations on SMT-X machines, where X > 2.
> >
> > As an example, on a machine with SMT=3, on core 0, SMT-0 might pick
> > the following, in order:
> >
> > - SMT-0: p1, with cookie A, and priority 10 (max = p1)
> > - SMT-1: p2, with cookie A, and priority 30 (max not updated here)
> > - SMT-2: p3, with cookie B, and priority 20 (max = p2)
> > > invalidate the other picks and retry
> >
> > Here, we should have instead updated 'max' when picking for SMT-1. Note
> > that this code would eventually have righted itself, since the retry
> > loop would re-pick p2, and update 'max' accordingly. However, this patch
> > avoids the extra round-trip.
>
> Going with the observation Tao made; how about we rewrite the whole lot
> to not be mind-bending complicated :-)
>
> How's this? It seems to build and pass the core-sched selftest thingy
> (so it must be perfect, right? :-)
>
> ---
> kernel/sched/core.c | 147 ++++++++++++++-------------------------------------
> kernel/sched/sched.h | 1 +
> 2 files changed, 41 insertions(+), 107 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ceae25ea8a0e..e896250c2fb8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5404,66 +5404,18 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
> return a->core_cookie == b->core_cookie;
> }
>
> -// XXX fairness/fwd progress conditions
> -/*
> - * Returns
> - * - NULL if there is no runnable task for this class.
> - * - the highest priority task for this runqueue if it matches
> - * rq->core->core_cookie or its priority is greater than max.
> - * - Else returns idle_task.
> - */
> -static struct task_struct *
> -pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max, bool in_fi)
> -{
> - struct task_struct *class_pick, *cookie_pick;
> - unsigned long cookie = rq->core->core_cookie;
> -
> - class_pick = class->pick_task(rq);
> - if (!class_pick)
> - return NULL;
> -
> - if (!cookie) {
> - /*
> - * If class_pick is tagged, return it only if it has
> - * higher priority than max.
> - */
> - if (max && class_pick->core_cookie &&
> - prio_less(class_pick, max, in_fi))
> - return idle_sched_class.pick_task(rq);
> -
> - return class_pick;
> - }
> -
> - /*
> - * If class_pick is idle or matches cookie, return early.
> - */
> - if (cookie_equals(class_pick, cookie))
> - return class_pick;
> -
> - cookie_pick = sched_core_find(rq, cookie);
> -
> - /*
> - * If class > max && class > cookie, it is the highest priority task on
> - * the core (so far) and it must be selected, otherwise we must go with
> - * the cookie pick in order to satisfy the constraint.
> - */
> - if (prio_less(cookie_pick, class_pick, in_fi) &&
> - (!max || prio_less(max, class_pick, in_fi)))
> - return class_pick;
> -
> - return cookie_pick;
> -}
> -
> extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
>
> static struct task_struct *
> pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> {
> - struct task_struct *next, *max = NULL;
> + struct task_struct *next, *p, *max = NULL;
> const struct sched_class *class;
> const struct cpumask *smt_mask;
> bool fi_before = false;
> - int i, j, cpu, occ = 0;
> + unsigned long cookie;
> + int i, cpu, occ = 0;
> + struct rq *rq_i;
> bool need_sync;
>
> if (!sched_core_enabled(rq))
> @@ -5554,76 +5506,57 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> }
> }
>
> - for_each_cpu(i, smt_mask) {
> - struct rq *rq_i = cpu_rq(i);
> -
> + /*
> + * For each thread: do the regular task pick and find the max prio task
> + * amongst them.
> + *
> + * Tie-break prio towards the current CPU
> + */
> + for_each_cpu_wrap(i, smt_mask, cpu) {
> + rq_i = cpu_rq(i);
> rq_i->core_pick = NULL;
>
> if (i != cpu)
> update_rq_clock(rq_i);
> +
> + for_each_class(class) {
> + p = rq_i->core_temp = class->pick_task(rq_i);
> + if (p)
> + break;
> + }
> +
> + if (!max || prio_less(max, p, fi_before))

The above 'prio_less(max, p, fi_before)' condition covers the
case of Josh's fix if I'm not wrong.

The rewriting code looks clarity and simpler..

> + max = p;
> }
>
> + cookie = rq->core->core_cookie = max->core_cookie;
> +
> /*
> - * Try and select tasks for each sibling in descending sched_class
> - * order.
> + * For each thread: try and find a runnable task that matches @max or
> + * force idle.
> */
> - for_each_class(class) {
> -again:
> - for_each_cpu_wrap(i, smt_mask, cpu) {
> - struct rq *rq_i = cpu_rq(i);
> - struct task_struct *p;
> -
> - if (rq_i->core_pick)
> - continue;
> + for_each_cpu(i, smt_mask) {
> + rq_i = cpu_rq(i);
> + p = rq_i->core_temp;
>
> - /*
> - * If this sibling doesn't yet have a suitable task to
> - * run; ask for the most eligible task, given the
> - * highest priority task already selected for this
> - * core.
> - */
> - p = pick_task(rq_i, class, max, fi_before);
> + if (!cookie_equals(p, cookie)) {
> + p = NULL;
> + if (cookie)
> + p = sched_core_find(rq_i, cookie);
> if (!p)
> - continue;
> + p = idle_sched_class.pick_task(rq_i);
> + }
>
> - if (!is_task_rq_idle(p))
> - occ++;
> + rq_i->core_pick = p;
>
> - rq_i->core_pick = p;
> - if (rq_i->idle == p && rq_i->nr_running) {
> + if (p == rq_i->idle) {
> + if (rq_i->nr_running) {
> rq->core->core_forceidle = true;
> if (!fi_before)
> rq->core->core_forceidle_seq++;
> }
> -
> - /*
> - * If this new candidate is of higher priority than the
> - * previous; and they're incompatible; we need to wipe
> - * the slate and start over. pick_task makes sure that
> - * p's priority is more than max if it doesn't match
> - * max's cookie.
> - *
> - * NOTE: this is a linear max-filter and is thus bounded
> - * in execution time.
> - */
> - if (!max || !cookie_match(max, p)) {
> - struct task_struct *old_max = max;
> -
> - rq->core->core_cookie = p->core_cookie;
> - max = p;
> -
> - if (old_max) {
> - rq->core->core_forceidle = false;
> - for_each_cpu(j, smt_mask) {
> - if (j == i)
> - continue;
> -
> - cpu_rq(j)->core_pick = NULL;
> - }
> - occ = 1;
> - goto again;
> - }
> - }
> + } else {
> + occ++;
> }
> }
>
> @@ -5643,7 +5576,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> * non-matching user state.
> */
> for_each_cpu(i, smt_mask) {
> - struct rq *rq_i = cpu_rq(i);
> + rq_i = cpu_rq(i);
>
> /*
> * An online sibling might have gone offline before a task
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d9f8d73a1d84..0760b460903a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1091,6 +1091,7 @@ struct rq {
> /* per rq */
> struct rq *core;
> struct task_struct *core_pick;
> + struct task_struct *core_temp;
> unsigned int core_enabled;
> unsigned int core_sched_seq;
> struct rb_root core_tree;




Thanks,
Tao

2021-08-23 20:29:41

by Vineeth Pillai

[permalink] [raw]
Subject: Re: [PATCH] sched/core: fix pick_next_task 'max' tracking

Hi Peter,


> > Here, we should have instead updated 'max' when picking for SMT-1. Note
> > that this code would eventually have righted itself, since the retry
> > loop would re-pick p2, and update 'max' accordingly. However, this patch
> > avoids the extra round-trip.
>
> Going with the observation Tao made; how about we rewrite the whole lot
> to not be mind-bending complicated :-)
>
> How's this? It seems to build and pass the core-sched selftest thingy
> (so it must be perfect, right? :-)
>
Nice, the code is much simpler now :-). A minor suggestion down..

> - for_each_cpu(i, smt_mask) {
> - struct rq *rq_i = cpu_rq(i);
> -
> + /*
> + * For each thread: do the regular task pick and find the max prio task
> + * amongst them.
> + *
> + * Tie-break prio towards the current CPU
> + */
> + for_each_cpu_wrap(i, smt_mask, cpu) {
> + rq_i = cpu_rq(i);
> rq_i->core_pick = NULL;
>
> if (i != cpu)
> update_rq_clock(rq_i);
> +
> + for_each_class(class) {
> + p = rq_i->core_temp = class->pick_task(rq_i);
I think we can use core_pick to store the pick here and core_temp
might not be required. What do you feel?

> + if (p)
> + break;
> + }
> +
> + if (!max || prio_less(max, p, fi_before))
> + max = p;


Thanks,
Vineeth

2021-08-23 22:58:44

by Tao Zhou

[permalink] [raw]
Subject: Re: [PATCH] sched/core: fix pick_next_task 'max' tracking

Hi Vineeth,

On Mon, Aug 23, 2021 at 04:25:28PM -0400, Vineeth Pillai wrote:
> Hi Peter,
>
>
> > > Here, we should have instead updated 'max' when picking for SMT-1. Note
> > > that this code would eventually have righted itself, since the retry
> > > loop would re-pick p2, and update 'max' accordingly. However, this patch
> > > avoids the extra round-trip.
> >
> > Going with the observation Tao made; how about we rewrite the whole lot
> > to not be mind-bending complicated :-)
> >
> > How's this? It seems to build and pass the core-sched selftest thingy
> > (so it must be perfect, right? :-)
> >
> Nice, the code is much simpler now :-). A minor suggestion down..
>
> > - for_each_cpu(i, smt_mask) {
> > - struct rq *rq_i = cpu_rq(i);
> > -
> > + /*
> > + * For each thread: do the regular task pick and find the max prio task
> > + * amongst them.
> > + *
> > + * Tie-break prio towards the current CPU
> > + */
> > + for_each_cpu_wrap(i, smt_mask, cpu) {
> > + rq_i = cpu_rq(i);
> > rq_i->core_pick = NULL;
> >
> > if (i != cpu)
> > update_rq_clock(rq_i);
> > +
> > + for_each_class(class) {
> > + p = rq_i->core_temp = class->pick_task(rq_i);
> I think we can use core_pick to store the pick here and core_temp
> might not be required. What do you feel?

You're right.

The @core_temp load the class pick, the @core_pick is the final
pick(class or cookie). Using @core_pick to store class pick first
and then the final pick is right and save the bytes) but just a
little not clarity from my end :-)

> > + if (p)
> > + break;
> > + }
> > +
> > + if (!max || prio_less(max, p, fi_before))
> > + max = p;
>
>
> Thanks,
> Vineeth



Thanks,
Tao

2021-08-23 23:26:21

by Josh Don

[permalink] [raw]
Subject: Re: [PATCH] sched/core: fix pick_next_task 'max' tracking

Hi Peter,

On Mon, Aug 23, 2021 at 4:17 AM Peter Zijlstra <[email protected]> wrote:
[snip]
> + for_each_cpu(i, smt_mask) {
> + rq_i = cpu_rq(i);
> + p = rq_i->core_temp;
>
> - /*
> - * If this sibling doesn't yet have a suitable task to
> - * run; ask for the most eligible task, given the
> - * highest priority task already selected for this
> - * core.
> - */
> - p = pick_task(rq_i, class, max, fi_before);
> + if (!cookie_equals(p, cookie)) {
> + p = NULL;
> + if (cookie)
> + p = sched_core_find(rq_i, cookie);

In the case that 'max' has a zero cookie, shouldn't we search for a
match on this cpu if the original class pick ('p') had a non-zero
cookie? We don't enqueue tasks with zero cookie in the core_tree, so I
forget if there was some other reasoning here.

> if (!p)
> - continue;
> + p = idle_sched_class.pick_task(rq_i);
> + }

2021-08-24 03:11:24

by Tao Zhou

[permalink] [raw]
Subject: Re: [PATCH] sched/core: fix pick_next_task 'max' tracking

Hi Josh,

On Mon, Aug 23, 2021 at 04:24:26PM -0700, Josh Don wrote:
> Hi Peter,
>
> On Mon, Aug 23, 2021 at 4:17 AM Peter Zijlstra <[email protected]> wrote:
> [snip]
> > + for_each_cpu(i, smt_mask) {
> > + rq_i = cpu_rq(i);
> > + p = rq_i->core_temp;
> >
> > - /*
> > - * If this sibling doesn't yet have a suitable task to
> > - * run; ask for the most eligible task, given the
> > - * highest priority task already selected for this
> > - * core.
> > - */
> > - p = pick_task(rq_i, class, max, fi_before);
> > + if (!cookie_equals(p, cookie)) {
> > + p = NULL;
> > + if (cookie)
> > + p = sched_core_find(rq_i, cookie);
>
> In the case that 'max' has a zero cookie, shouldn't we search for a

In the original pick_task(), when cookie is zero, we choose class pick
task or force idle task.

> match on this cpu if the original class pick ('p') had a non-zero
> cookie? We don't enqueue tasks with zero cookie in the core_tree, so I
> forget if there was some other reasoning here.

So, no need to search the core_tree when cookie is zero.

But I'm not sure that force idle pick condition(in pick_task())
is also covered by this clause in the first filter max loop in
the rewrite..

'
if (!max || prio_less(max, p, fi_before))
max = p;
'

I just thought there are three(one add by Josh) places that can return
max;

1, !cookie condition and class_pick > max
2, cookie equal condition and class_pick > max(add by Josh)
3, cookie not equal condition and class_pick > max.

The rewrite change a little with the class pick, it loops all class to
find the max first(finally will get one task and not be possible return
NULL). This is not the same with the original.

It is very possible that I'm getting wrong, then please shout to me.

> > if (!p)
> > - continue;
> > + p = idle_sched_class.pick_task(rq_i);
> > + }



Thanks,
Tao

2021-08-24 08:56:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/core: fix pick_next_task 'max' tracking

On Mon, Aug 23, 2021 at 04:24:26PM -0700, Josh Don wrote:
> Hi Peter,
>
> On Mon, Aug 23, 2021 at 4:17 AM Peter Zijlstra <[email protected]> wrote:
> [snip]
> > + for_each_cpu(i, smt_mask) {
> > + rq_i = cpu_rq(i);
> > + p = rq_i->core_temp;
> >
> > - /*
> > - * If this sibling doesn't yet have a suitable task to
> > - * run; ask for the most eligible task, given the
> > - * highest priority task already selected for this
> > - * core.
> > - */
> > - p = pick_task(rq_i, class, max, fi_before);
> > + if (!cookie_equals(p, cookie)) {
> > + p = NULL;
> > + if (cookie)
> > + p = sched_core_find(rq_i, cookie);
>
> In the case that 'max' has a zero cookie, shouldn't we search for a
> match on this cpu if the original class pick ('p') had a non-zero
> cookie? We don't enqueue tasks with zero cookie in the core_tree, so I
> forget if there was some other reasoning here.

IIRC we don't keep 0-cookies in the tree. Lemme check.

Yeah, see sched_core_enqueue(), they bail for 0-cookie.

This is indeed sub-optimal, but also having the 0-cookie tasks in the
tree has other issues IIRC.

2021-08-24 09:06:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/core: fix pick_next_task 'max' tracking

On Mon, Aug 23, 2021 at 04:25:28PM -0400, Vineeth Pillai wrote:
> Hi Peter,
>
>
> > > Here, we should have instead updated 'max' when picking for SMT-1. Note
> > > that this code would eventually have righted itself, since the retry
> > > loop would re-pick p2, and update 'max' accordingly. However, this patch
> > > avoids the extra round-trip.
> >
> > Going with the observation Tao made; how about we rewrite the whole lot
> > to not be mind-bending complicated :-)
> >
> > How's this? It seems to build and pass the core-sched selftest thingy
> > (so it must be perfect, right? :-)
> >
> Nice, the code is much simpler now :-). A minor suggestion down..
>
> > - for_each_cpu(i, smt_mask) {
> > - struct rq *rq_i = cpu_rq(i);
> > -
> > + /*
> > + * For each thread: do the regular task pick and find the max prio task
> > + * amongst them.
> > + *
> > + * Tie-break prio towards the current CPU
> > + */
> > + for_each_cpu_wrap(i, smt_mask, cpu) {
> > + rq_i = cpu_rq(i);
> > rq_i->core_pick = NULL;
> >
> > if (i != cpu)
> > update_rq_clock(rq_i);
> > +
> > + for_each_class(class) {
> > + p = rq_i->core_temp = class->pick_task(rq_i);
> I think we can use core_pick to store the pick here and core_temp
> might not be required. What do you feel?

Indeed we can; makes the code a little less obvious but saves a few
bytes.

Let me go do that and also attempt a Changelog to go with it ;-)