2023-03-08 10:05:16

by Hao Jia

Subject: [PATCH] sched/core: Minor optimize pick_next_task() when core-sched enable

If core-sched is enabled, sometimes we will traverse each CPU on the core
to find the highest priority task 'max' on the entire core, and then try
to find a runnable task that matches @max.
But in the following case, the runnable task we choose is not the best one.

core max: task2 (cookie 0)

     rq0                     rq1
     task0(cookie non-zero)  task2(cookie 0)
     task1(cookie 0)
     task3(cookie 0)
     ...

     pick-task: idle         pick-task: task2

CPU0 and CPU1 are two CPUs on the same core, task0 and task2 are the
highest priority tasks on rq0 and rq1 respectively, task2 is @max
on the entire core.

When 'max' has a zero cookie, instead of continuing to search for a
runnable task on rq0 that matches @max's cookie, we choose idle for
rq0 directly.
In this case, it is clearly better to choose task1 to run on rq0,
which would increase CPU utilization.
Therefore, we also queue tasks with a zero cookie in the core_tree,
and record the number of non-zero-cookie tasks on each rq so that the
state of sched-core can still be detected.

Signed-off-by: Hao Jia <[email protected]>
---
kernel/sched/core.c | 29 +++++++++++++++--------------
kernel/sched/core_sched.c | 9 ++++-----
kernel/sched/sched.h | 1 +
3 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index af017e038b48..765cd14c52e1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -236,8 +236,8 @@ void sched_core_enqueue(struct rq *rq, struct task_struct *p)
{
rq->core->core_task_seq++;

- if (!p->core_cookie)
- return;
+ if (p->core_cookie)
+ rq->cookied_count++;

rb_add(&p->core_node, &rq->core_tree, rb_sched_core_less);
}
@@ -246,11 +246,16 @@ void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags)
{
rq->core->core_task_seq++;

+ if (p->core_cookie)
+ rq->cookied_count--;
+
if (sched_core_enqueued(p)) {
rb_erase(&p->core_node, &rq->core_tree);
RB_CLEAR_NODE(&p->core_node);
}

+ if (!sched_core_enabled(rq))
+ return;
/*
* Migrating the last task off the cpu, with the cpu in forced idle
* state. Reschedule to create an accounting edge for forced idle,
@@ -370,7 +375,7 @@ static void sched_core_assert_empty(void)
int cpu;

for_each_possible_cpu(cpu)
- WARN_ON_ONCE(!RB_EMPTY_ROOT(&cpu_rq(cpu)->core_tree));
+ WARN_ON_ONCE(cpu_rq(cpu)->cookied_count);
}

static void __sched_core_enable(void)
@@ -2061,14 +2066,12 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
uclamp_rq_inc(rq, p);
p->sched_class->enqueue_task(rq, p, flags);

- if (sched_core_enabled(rq))
- sched_core_enqueue(rq, p);
+ sched_core_enqueue(rq, p);
}

static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
{
- if (sched_core_enabled(rq))
- sched_core_dequeue(rq, p, flags);
+ sched_core_dequeue(rq, p, flags);

if (!(flags & DEQUEUE_NOCLOCK))
update_rq_clock(rq);
@@ -6126,13 +6129,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
rq_i = cpu_rq(i);
p = rq_i->core_pick;

- if (!cookie_equals(p, cookie)) {
- p = NULL;
- if (cookie)
- p = sched_core_find(rq_i, cookie);
- if (!p)
- p = idle_sched_class.pick_task(rq_i);
- }
+ if (!cookie_equals(p, cookie))
+ p = sched_core_find(rq_i, cookie);

rq_i->core_pick = p;

@@ -6333,6 +6331,7 @@ static void sched_core_cpu_starting(unsigned int cpu)
sched_core_lock(cpu, &flags);

WARN_ON_ONCE(rq->core != rq);
+ WARN_ON_ONCE(rq->cookied_count);

/* if we're the first, we'll be our own leader */
if (cpumask_weight(smt_mask) == 1)
@@ -6425,6 +6424,7 @@ static inline void sched_core_cpu_dying(unsigned int cpu)
{
struct rq *rq = cpu_rq(cpu);

+ WARN_ON_ONCE(rq->cookied_count);
if (rq->core != rq)
rq->core = rq;
}
@@ -9917,6 +9917,7 @@ void __init sched_init(void)
rq->core = rq;
rq->core_pick = NULL;
rq->core_enabled = 0;
+ rq->cookied_count = 0;
rq->core_tree = RB_ROOT;
rq->core_forceidle_count = 0;
rq->core_forceidle_occupation = 0;
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index a57fd8f27498..70f424abcc2b 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -56,6 +56,7 @@ static unsigned long sched_core_update_cookie(struct task_struct *p,
unsigned long old_cookie;
struct rq_flags rf;
struct rq *rq;
+ bool enqueued;

rq = task_rq_lock(p, &rf);

@@ -67,16 +68,14 @@ static unsigned long sched_core_update_cookie(struct task_struct *p,
*/
SCHED_WARN_ON((p->core_cookie || cookie) && !sched_core_enabled(rq));

- if (sched_core_enqueued(p))
+ enqueued = task_on_rq_queued(p);
+ if (enqueued)
sched_core_dequeue(rq, p, DEQUEUE_SAVE);

old_cookie = p->core_cookie;
p->core_cookie = cookie;

- /*
- * Consider the cases: !prev_cookie and !cookie.
- */
- if (cookie && task_on_rq_queued(p))
+ if (enqueued)
sched_core_enqueue(rq, p);

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3e8df6d31c1e..f5a0ee7fccae 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1148,6 +1148,7 @@ struct rq {
unsigned int core_task_seq;
unsigned int core_pick_seq;
unsigned long core_cookie;
+ unsigned int cookied_count;
unsigned int core_forceidle_count;
unsigned int core_forceidle_seq;
unsigned int core_forceidle_occupation;
--
2.11.0



2023-03-20 08:55:40

by Hao Jia

Subject: Re: [PATCH] sched/core: Minor optimize pick_next_task() when core-sched enable

kindly ping...

On 2023/3/8 Hao Jia wrote:
> If core-sched is enabled, sometimes we will traverse each CPU on the core
> to find the highest priority task 'max' on the entire core, and then try
> and find a runnable task that matches @max.
> But in the following case, the runnable task we choose is not the best one.
>
> core max: task2 (cookie 0)
>
>      rq0                     rq1
>      task0(cookie non-zero)  task2(cookie 0)
>      task1(cookie 0)
>      task3(cookie 0)
>      ...
>
>      pick-task: idle         pick-task: task2
>
> CPU0 and CPU1 are two CPUs on the same core, task0 and task2 are the
> highest priority tasks on rq0 and rq1 respectively, task2 is @max
> on the entire core.
>
> In the case that 'max' has a zero cookie, instead of continuing to
> search for a runnable task on rq0 that matches @max's cookie, we
> choose idle for rq0 directly.
> At this time, it is obviously better to choose task1 to run for rq0,
> which will increase the CPU utilization.
> Therefore, we queue tasks with zero cookies in core_tree, and record
> the number of non-zero cookie tasks of each rq to detect the status
> of the sched-core.
>
> Signed-off-by: Hao Jia <[email protected]>
> ---
> kernel/sched/core.c | 29 +++++++++++++++--------------
> kernel/sched/core_sched.c | 9 ++++-----
> kernel/sched/sched.h | 1 +
> 3 files changed, 20 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index af017e038b48..765cd14c52e1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -236,8 +236,8 @@ void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> {
> rq->core->core_task_seq++;
>
> - if (!p->core_cookie)
> - return;
> + if (p->core_cookie)
> + rq->cookied_count++;
>
> rb_add(&p->core_node, &rq->core_tree, rb_sched_core_less);
> }
> @@ -246,11 +246,16 @@ void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags)
> {
> rq->core->core_task_seq++;
>
> + if (p->core_cookie)
> + rq->cookied_count--;
> +
> if (sched_core_enqueued(p)) {
> rb_erase(&p->core_node, &rq->core_tree);
> RB_CLEAR_NODE(&p->core_node);
> }
>
> + if (!sched_core_enabled(rq))
> + return;
> /*
> * Migrating the last task off the cpu, with the cpu in forced idle
> * state. Reschedule to create an accounting edge for forced idle,
> @@ -370,7 +375,7 @@ static void sched_core_assert_empty(void)
> int cpu;
>
> for_each_possible_cpu(cpu)
> - WARN_ON_ONCE(!RB_EMPTY_ROOT(&cpu_rq(cpu)->core_tree));
> + WARN_ON_ONCE(cpu_rq(cpu)->cookied_count);
> }
>
> static void __sched_core_enable(void)
> @@ -2061,14 +2066,12 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> uclamp_rq_inc(rq, p);
> p->sched_class->enqueue_task(rq, p, flags);
>
> - if (sched_core_enabled(rq))
> - sched_core_enqueue(rq, p);
> + sched_core_enqueue(rq, p);
> }
>
> static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> {
> - if (sched_core_enabled(rq))
> - sched_core_dequeue(rq, p, flags);
> + sched_core_dequeue(rq, p, flags);
>
> if (!(flags & DEQUEUE_NOCLOCK))
> update_rq_clock(rq);
> @@ -6126,13 +6129,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> rq_i = cpu_rq(i);
> p = rq_i->core_pick;
>
> - if (!cookie_equals(p, cookie)) {
> - p = NULL;
> - if (cookie)
> - p = sched_core_find(rq_i, cookie);
> - if (!p)
> - p = idle_sched_class.pick_task(rq_i);
> - }
> + if (!cookie_equals(p, cookie))
> + p = sched_core_find(rq_i, cookie);
>
> rq_i->core_pick = p;
>
> @@ -6333,6 +6331,7 @@ static void sched_core_cpu_starting(unsigned int cpu)
> sched_core_lock(cpu, &flags);
>
> WARN_ON_ONCE(rq->core != rq);
> + WARN_ON_ONCE(rq->cookied_count);
>
> /* if we're the first, we'll be our own leader */
> if (cpumask_weight(smt_mask) == 1)
> @@ -6425,6 +6424,7 @@ static inline void sched_core_cpu_dying(unsigned int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
>
> + WARN_ON_ONCE(rq->cookied_count);
> if (rq->core != rq)
> rq->core = rq;
> }
> @@ -9917,6 +9917,7 @@ void __init sched_init(void)
> rq->core = rq;
> rq->core_pick = NULL;
> rq->core_enabled = 0;
> + rq->cookied_count = 0;
> rq->core_tree = RB_ROOT;
> rq->core_forceidle_count = 0;
> rq->core_forceidle_occupation = 0;
> diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
> index a57fd8f27498..70f424abcc2b 100644
> --- a/kernel/sched/core_sched.c
> +++ b/kernel/sched/core_sched.c
> @@ -56,6 +56,7 @@ static unsigned long sched_core_update_cookie(struct task_struct *p,
> unsigned long old_cookie;
> struct rq_flags rf;
> struct rq *rq;
> + bool enqueued;
>
> rq = task_rq_lock(p, &rf);
>
> @@ -67,16 +68,14 @@ static unsigned long sched_core_update_cookie(struct task_struct *p,
> */
> SCHED_WARN_ON((p->core_cookie || cookie) && !sched_core_enabled(rq));
>
> - if (sched_core_enqueued(p))
> + enqueued = task_on_rq_queued(p);
> + if (enqueued)
> sched_core_dequeue(rq, p, DEQUEUE_SAVE);
>
> old_cookie = p->core_cookie;
> p->core_cookie = cookie;
>
> - /*
> - * Consider the cases: !prev_cookie and !cookie.
> - */
> - if (cookie && task_on_rq_queued(p))
> + if (enqueued)
> sched_core_enqueue(rq, p);
>
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3e8df6d31c1e..f5a0ee7fccae 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1148,6 +1148,7 @@ struct rq {
> unsigned int core_task_seq;
> unsigned int core_pick_seq;
> unsigned long core_cookie;
> + unsigned int cookied_count;
> unsigned int core_forceidle_count;
> unsigned int core_forceidle_seq;
> unsigned int core_forceidle_occupation;

2023-03-20 11:39:53

by Peter Zijlstra

Subject: Re: [PATCH] sched/core: Minor optimize pick_next_task() when core-sched enable

On Wed, Mar 08, 2023 at 06:04:13PM +0800, Hao Jia wrote:

> core max: task2 (cookie 0)
>
>      rq0                     rq1
>      task0(cookie non-zero)  task2(cookie 0)
>      task1(cookie 0)
>      task3(cookie 0)
>      ...
>
>      pick-task: idle         pick-task: task2
>
> CPU0 and CPU1 are two CPUs on the same core, task0 and task2 are the
> highest priority tasks on rq0 and rq1 respectively, task2 is @max
> on the entire core.

I'm assuming this all starts by rq0 doing a pick and getting task0.
Because any other selection would go down the whole !need_sync route.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index af017e038b48..765cd14c52e1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -236,8 +236,8 @@ void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> {
> rq->core->core_task_seq++;
>
> - if (!p->core_cookie)
> - return;
> + if (p->core_cookie)
> + rq->cookied_count++;
>
> rb_add(&p->core_node, &rq->core_tree, rb_sched_core_less);
> }

> @@ -2061,14 +2066,12 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> uclamp_rq_inc(rq, p);
> p->sched_class->enqueue_task(rq, p, flags);
>
> - if (sched_core_enabled(rq))
> - sched_core_enqueue(rq, p);
> + sched_core_enqueue(rq, p);
> }
>
> static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> {
> - if (sched_core_enabled(rq))
> - sched_core_dequeue(rq, p, flags);
> + sched_core_dequeue(rq, p, flags);
>
> if (!(flags & DEQUEUE_NOCLOCK))
> update_rq_clock(rq);

Yeah, this is an absolute no-no, it makes the overhead of the second rb
tree unconditional.


2023-03-21 21:43:36

by Joel Fernandes

Subject: Re: [PATCH] sched/core: Minor optimize pick_next_task() when core-sched enable

On Mon, Mar 20, 2023 at 4:55 AM Hao Jia <[email protected]> wrote:
>
> kindly ping...
>
> On 2023/3/8 Hao Jia wrote:
> > If core-sched is enabled, sometimes we will traverse each CPU on the core
> > to find the highest priority task 'max' on the entire core, and then try
> > and find a runnable task that matches @max.
> > But in the following case, the runnable task we choose is not the best one.
> >
> > core max: task2 (cookie 0)
> >
> >      rq0                     rq1
> >      task0(cookie non-zero)  task2(cookie 0)
> >      task1(cookie 0)
> >      task3(cookie 0)
> >      ...
> >
> >      pick-task: idle         pick-task: task2
> >
> > CPU0 and CPU1 are two CPUs on the same core, task0 and task2 are the
> > highest priority tasks on rq0 and rq1 respectively, task2 is @max
> > on the entire core.
> >
> > In the case that 'max' has a zero cookie, instead of continuing to
> > search for a runnable task on rq0 that matches @max's cookie, we
> > choose idle for rq0 directly.
> > At this time, it is obviously better to choose task1 to run for rq0,
> > which will increase the CPU utilization.
> > Therefore, we queue tasks with zero cookies in core_tree, and record
> > the number of non-zero cookie tasks of each rq to detect the status
> > of the sched-core.

I do remember this as a known issue (more of a known but unimplemented
optimization) which happens when you have a high priority non-cookie
task which is in front of several low priority ones on the same
thread/rq. Adding +Vineeth Pillai to see if he remembers the issue.

I can try to take a look at it this week to make sense of your patch.
The code in upstream has changed quite a bit since we did this work on
the older kernels, so allow some time to page it all in.

Meanwhile, could you please provide some more details of your
usecase/workload and how this patch improves it? Also out of
curiosity, is bytedance using core scheduling for security or
something else?

Thanks,

- Joel


> >
> > Signed-off-by: Hao Jia <[email protected]>
> > ---
> > kernel/sched/core.c | 29 +++++++++++++++--------------
> > kernel/sched/core_sched.c | 9 ++++-----
> > kernel/sched/sched.h | 1 +
> > 3 files changed, 20 insertions(+), 19 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index af017e038b48..765cd14c52e1 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -236,8 +236,8 @@ void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> > {
> > rq->core->core_task_seq++;
> >
> > - if (!p->core_cookie)
> > - return;
> > + if (p->core_cookie)
> > + rq->cookied_count++;
> >
> > rb_add(&p->core_node, &rq->core_tree, rb_sched_core_less);
> > }
> > @@ -246,11 +246,16 @@ void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags)
> > {
> > rq->core->core_task_seq++;
> >
> > + if (p->core_cookie)
> > + rq->cookied_count--;
> > +
> > if (sched_core_enqueued(p)) {
> > rb_erase(&p->core_node, &rq->core_tree);
> > RB_CLEAR_NODE(&p->core_node);
> > }
> >
> > + if (!sched_core_enabled(rq))
> > + return;
> > /*
> > * Migrating the last task off the cpu, with the cpu in forced idle
> > * state. Reschedule to create an accounting edge for forced idle,
> > @@ -370,7 +375,7 @@ static void sched_core_assert_empty(void)
> > int cpu;
> >
> > for_each_possible_cpu(cpu)
> > - WARN_ON_ONCE(!RB_EMPTY_ROOT(&cpu_rq(cpu)->core_tree));
> > + WARN_ON_ONCE(cpu_rq(cpu)->cookied_count);
> > }
> >
> > static void __sched_core_enable(void)
> > @@ -2061,14 +2066,12 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> > uclamp_rq_inc(rq, p);
> > p->sched_class->enqueue_task(rq, p, flags);
> >
> > - if (sched_core_enabled(rq))
> > - sched_core_enqueue(rq, p);
> > + sched_core_enqueue(rq, p);
> > }
> >
> > static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> > {
> > - if (sched_core_enabled(rq))
> > - sched_core_dequeue(rq, p, flags);
> > + sched_core_dequeue(rq, p, flags);
> >
> > if (!(flags & DEQUEUE_NOCLOCK))
> > update_rq_clock(rq);
> > @@ -6126,13 +6129,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > rq_i = cpu_rq(i);
> > p = rq_i->core_pick;
> >
> > - if (!cookie_equals(p, cookie)) {
> > - p = NULL;
> > - if (cookie)
> > - p = sched_core_find(rq_i, cookie);
> > - if (!p)
> > - p = idle_sched_class.pick_task(rq_i);
> > - }
> > + if (!cookie_equals(p, cookie))
> > + p = sched_core_find(rq_i, cookie);
> >
> > rq_i->core_pick = p;
> >
> > @@ -6333,6 +6331,7 @@ static void sched_core_cpu_starting(unsigned int cpu)
> > sched_core_lock(cpu, &flags);
> >
> > WARN_ON_ONCE(rq->core != rq);
> > + WARN_ON_ONCE(rq->cookied_count);
> >
> > /* if we're the first, we'll be our own leader */
> > if (cpumask_weight(smt_mask) == 1)
> > @@ -6425,6 +6424,7 @@ static inline void sched_core_cpu_dying(unsigned int cpu)
> > {
> > struct rq *rq = cpu_rq(cpu);
> >
> > + WARN_ON_ONCE(rq->cookied_count);
> > if (rq->core != rq)
> > rq->core = rq;
> > }
> > @@ -9917,6 +9917,7 @@ void __init sched_init(void)
> > rq->core = rq;
> > rq->core_pick = NULL;
> > rq->core_enabled = 0;
> > + rq->cookied_count = 0;
> > rq->core_tree = RB_ROOT;
> > rq->core_forceidle_count = 0;
> > rq->core_forceidle_occupation = 0;
> > diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
> > index a57fd8f27498..70f424abcc2b 100644
> > --- a/kernel/sched/core_sched.c
> > +++ b/kernel/sched/core_sched.c
> > @@ -56,6 +56,7 @@ static unsigned long sched_core_update_cookie(struct task_struct *p,
> > unsigned long old_cookie;
> > struct rq_flags rf;
> > struct rq *rq;
> > + bool enqueued;
> >
> > rq = task_rq_lock(p, &rf);
> >
> > @@ -67,16 +68,14 @@ static unsigned long sched_core_update_cookie(struct task_struct *p,
> > */
> > SCHED_WARN_ON((p->core_cookie || cookie) && !sched_core_enabled(rq));
> >
> > - if (sched_core_enqueued(p))
> > + enqueued = task_on_rq_queued(p);
> > + if (enqueued)
> > sched_core_dequeue(rq, p, DEQUEUE_SAVE);
> >
> > old_cookie = p->core_cookie;
> > p->core_cookie = cookie;
> >
> > - /*
> > - * Consider the cases: !prev_cookie and !cookie.
> > - */
> > - if (cookie && task_on_rq_queued(p))
> > + if (enqueued)
> > sched_core_enqueue(rq, p);
> >
> > /*
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 3e8df6d31c1e..f5a0ee7fccae 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1148,6 +1148,7 @@ struct rq {
> > unsigned int core_task_seq;
> > unsigned int core_pick_seq;
> > unsigned long core_cookie;
> > + unsigned int cookied_count;
> > unsigned int core_forceidle_count;
> > unsigned int core_forceidle_seq;
> > unsigned int core_forceidle_occupation;

2023-03-22 03:17:13

by Hao Jia

Subject: Re: [External] Re: [PATCH] sched/core: Minor optimize pick_next_task() when core-sched enable



On 2023/3/22 Joel Fernandes wrote:
> On Mon, Mar 20, 2023 at 4:55 AM Hao Jia <[email protected]> wrote:
>>
>> kindly ping...
>>
>> On 2023/3/8 Hao Jia wrote:
>>> If core-sched is enabled, sometimes we will traverse each CPU on the core
>>> to find the highest priority task 'max' on the entire core, and then try
>>> and find a runnable task that matches @max.
>>> But in the following case, the runnable task we choose is not the best one.
>>>
>>> core max: task2 (cookie 0)
>>>
>>>      rq0                     rq1
>>>      task0(cookie non-zero)  task2(cookie 0)
>>>      task1(cookie 0)
>>>      task3(cookie 0)
>>>      ...
>>>
>>>      pick-task: idle         pick-task: task2
>>>
>>> CPU0 and CPU1 are two CPUs on the same core, task0 and task2 are the
>>> highest priority tasks on rq0 and rq1 respectively, task2 is @max
>>> on the entire core.
>>>
>>> In the case that 'max' has a zero cookie, instead of continuing to
>>> search for a runnable task on rq0 that matches @max's cookie, we
>>> choose idle for rq0 directly.
>>> At this time, it is obviously better to choose task1 to run for rq0,
>>> which will increase the CPU utilization.
>>> Therefore, we queue tasks with zero cookies in core_tree, and record
>>> the number of non-zero cookie tasks of each rq to detect the status
>>> of the sched-core.
>
> I do remember this as a known issue (more of a known but unimplemented
> optimization) which happens when you have a high priority non-cookie
> task which is in front of several low priority ones on the same
> thread/rq. Adding +Vineeth Pillai to see if he remembers the issue.

Thank you for sharing the information, I will check the previous emails.


>
> I can try to take a look at it this week to make sense of your patch.
> The code in upstream has changed quite a bit since we did this work on
> the older kernels, so allow some time to page it all in.

My patch makes non-cookie tasks also be queued in the core_tree if
CONFIG_SCHED_CORE is defined.
This allows us to quickly find the highest priority non-cookie task on
this rq through sched_core_find(). But as Peter said, it makes the
overhead of the second rb tree unconditional.

Finding the highest priority non-cookie task is complex and
time-consuming, especially with CONFIG_FAIR_GROUP_SCHED enabled.

I am now trying to let the force-idle CPU look for the highest priority
non-cookie task through a mechanism similar to queue_balance_callback.
I'm not sure if this is feasible; do you have a better suggestion?
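
Roughly, the shape I have in mind is something like the sketch below. It is
completely untested; sched_core_noncookie_pull() and the per-CPU callback
head are names I made up, it assumes the struct balance_callback form of
queue_balance_callback on current kernels, and the real call site and
locking still need to be worked out.

static DEFINE_PER_CPU(struct balance_callback, core_noncookie_cb);

static void sched_core_noncookie_pull(struct rq *rq)
{
        struct task_struct *p;

        /* Walk the runnable fair tasks looking for a zero-cookie one. */
        list_for_each_entry(p, &rq->cfs_tasks, se.group_node) {
                if (!p->core_cookie) {
                        /*
                         * Found a zero-cookie candidate; kick a re-pick.
                         * The pick path would still need to learn to
                         * accept it instead of forcing idle again.
                         */
                        resched_curr(rq);
                        break;
                }
        }
}

/*
 * Then, in the core-wide pick path, when rq_i would otherwise go
 * force-idle although @max has a zero cookie (illustrative only):
 *
 *      queue_balance_callback(rq_i, &per_cpu(core_noncookie_cb, i),
 *                             sched_core_noncookie_pull);
 */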

>
> Meanwhile, could you please provide some more details of your
> usecase/workload and how this patch improves it? Also out of
> curiosity, is bytedance using core scheduling for security or
> something else?
>

At ByteDance, we try to use core-sched to deploy high-priority services
(online) and low-priority services (offline) on the same physical machine.

core-sched can ensure that offline tasks (non-zero cookie) will not affect
online's L1 and L2 caches while online is running. In our scenario,
core-sched is very useful for online services.

Thanks,
Hao


> Thanks,
>
> - Joel
>
>
>>>
>>> Signed-off-by: Hao Jia <[email protected]>
>>> ---
>>> kernel/sched/core.c | 29 +++++++++++++++--------------
>>> kernel/sched/core_sched.c | 9 ++++-----
>>> kernel/sched/sched.h | 1 +
>>> 3 files changed, 20 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index af017e038b48..765cd14c52e1 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -236,8 +236,8 @@ void sched_core_enqueue(struct rq *rq, struct task_struct *p)
>>> {
>>> rq->core->core_task_seq++;
>>>
>>> - if (!p->core_cookie)
>>> - return;
>>> + if (p->core_cookie)
>>> + rq->cookied_count++;
>>>
>>> rb_add(&p->core_node, &rq->core_tree, rb_sched_core_less);
>>> }
>>> @@ -246,11 +246,16 @@ void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags)
>>> {
>>> rq->core->core_task_seq++;
>>>
>>> + if (p->core_cookie)
>>> + rq->cookied_count--;
>>> +
>>> if (sched_core_enqueued(p)) {
>>> rb_erase(&p->core_node, &rq->core_tree);
>>> RB_CLEAR_NODE(&p->core_node);
>>> }
>>>
>>> + if (!sched_core_enabled(rq))
>>> + return;
>>> /*
>>> * Migrating the last task off the cpu, with the cpu in forced idle
>>> * state. Reschedule to create an accounting edge for forced idle,
>>> @@ -370,7 +375,7 @@ static void sched_core_assert_empty(void)
>>> int cpu;
>>>
>>> for_each_possible_cpu(cpu)
>>> - WARN_ON_ONCE(!RB_EMPTY_ROOT(&cpu_rq(cpu)->core_tree));
>>> + WARN_ON_ONCE(cpu_rq(cpu)->cookied_count);
>>> }
>>>
>>> static void __sched_core_enable(void)

2023-03-22 20:46:36

by Vineeth Pillai

Subject: Re: [PATCH] sched/core: Minor optimize pick_next_task() when core-sched enable

Merging two threads.

On Tue, Mar 21, 2023 at 5:40 PM Joel Fernandes <[email protected]> wrote:
> >
> > CPU0 and CPU1 are two CPUs on the same core, task0 and task2 are the
> > highest priority tasks on rq0 and rq1 respectively, task2 is @max
> > on the entire core.

> I'm assuming this all starts by rq0 doing a pick and getting task0.
> Because any other selection would go down the whole !need_sync route.
>
I think this could also happen when rq1 starts the pick due to task2 wakeup
while task0 was running in rq0. In this case, core->core_cookie would be set
and we take the need_sync path I guess.

> > In the case that 'max' has a zero cookie, instead of continuing to
> > search for a runnable task on rq0 that matches @max's cookie, we
> > choose idle for rq0 directly.
> > At this time, it is obviously better to choose task1 to run for rq0,
> > which will increase the CPU utilization.
> > Therefore, we queue tasks with zero cookies in core_tree, and record
> > the number of non-zero cookie tasks of each rq to detect the status
> > of the sched-core.
>
> I do remember this as a known issue (more of a known but unimplemented
> optimization) which happens when you have a high priority non-cookie
> task which is in front of several low priority ones on the same
> thread/rq. Adding +Vineeth Pillai to see if he remembers the issue.
>
Yes, I remember this as one of the 2 issues we noticed, but could not get to
fixing it. Here non-cookied tasks are considered special as a side effect of
the implementation (non-cookied tasks are not in the core rbtree), and hence
we force idle if max is non-cookied and the highest prio task on the sibling
is cookied.

The other issue was - we don't update core rbtree when vruntime changes and
this can cause starvation of cookied task if there are more than one task with
the same cookie on an rq.

> > static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> > {
> > - if (sched_core_enabled(rq))
> > - sched_core_dequeue(rq, p, flags);
> > + sched_core_dequeue(rq, p, flags);
> >
> > if (!(flags & DEQUEUE_NOCLOCK))
> > update_rq_clock(rq);

> Yeah, this is an absolute no-no, it makes the overhead of the second rb
> tree unconditional.

I agree. Could we keep it conditional by enqueuing 0-cookied tasks only when
coresched is enabled, just like what we do for cookied tasks? There is still
the overhead of two trees storing all the runnable tasks, but in a different
order. We would also need to populate the core rbtree from the cfs rbtree on
coresched enable and empty it on coresched disable.
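
Something like the below untested sketch (based on your hunks) is what I
mean; the repopulation of the tree on enable and the emptying on disable
are not shown:

void sched_core_enqueue(struct rq *rq, struct task_struct *p)
{
        /* Keep the second tree conditional on core scheduling. */
        if (!sched_core_enabled(rq))
                return;

        rq->core->core_task_seq++;

        if (p->core_cookie)
                rq->cookied_count++;

        /*
         * Queue zero-cookie tasks too, so sched_core_find() can hand
         * back the best non-cookied task instead of forcing idle.
         */
        rb_add(&p->core_node, &rq->core_tree, rb_sched_core_less);
}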

Thanks,
Vineeth

2023-03-23 07:07:44

by Hao Jia

Subject: Re: [External] Re: [PATCH] sched/core: Minor optimize pick_next_task() when core-sched enable



On 2023/3/23 Vineeth Pillai wrote:
> Merging two threads.
>
> On Tue, Mar 21, 2023 at 5:40 PM Joel Fernandes <[email protected]> wrote:
>>>
>>> CPU0 and CPU1 are two CPUs on the same core, task0 and task2 are the
>>> highest priority tasks on rq0 and rq1 respectively, task2 is @max
>>> on the entire core.
>
>> I'm assuming this all starts by rq0 doing a pick and getting task0.
>> Because any other selection would go down the whole !need_sync route.
>>
> I think this could also happen when rq1 starts the pick due to task2 wakeup
> while task0 was running in rq0. In this case, core->core_cookie would be set
> and we take the need_sync path I guess.
>
>>> In the case that 'max' has a zero cookie, instead of continuing to
>>> search for a runnable task on rq0 that matches @max's cookie, we
>>> choose idle for rq0 directly.
>>> At this time, it is obviously better to choose task1 to run for rq0,
>>> which will increase the CPU utilization.
>>> Therefore, we queue tasks with zero cookies in core_tree, and record
>>> the number of non-zero cookie tasks of each rq to detect the status
>>> of the sched-core.
>>
>> I do remember this as a known issue (more of a known but unimplemented
>> optimization) which happens when you have a high priority non-cookie
>> task which is in front of several low priority ones on the same
>> thread/rq. Adding +Vineeth Pillai to see if he remembers the issue.
>>
> Yes, I remember this as one of the 2 issues we noticed, but could not get to
> fixing it. Here we have non-cookied tasks considered special as a side effect
> of implementation(non-cookied tasks not in core rbtree) and hence we force idle
> if max is non-cookied and the highest prio task on the sibling is cookied.
>
> The other issue was - we don't update core rbtree when vruntime changes and
> this can cause starvation of cookied task if there are more than one task with
> the same cookie on an rq.
>

If I understand correctly: when a cookied task is enqueued, the
difference delta1 between its vruntime and min_vruntime is very large,
while another task with the same cookie keeps dequeuing and enqueuing,
so the difference delta2 between its vruntime and min_vruntime is
always smaller than delta1?
I'm not sure if this is the case.

>>> static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>>> {
>>> - if (sched_core_enabled(rq))
>>> - sched_core_dequeue(rq, p, flags);
>>> + sched_core_dequeue(rq, p, flags);
>>>
>>> if (!(flags & DEQUEUE_NOCLOCK))
>>> update_rq_clock(rq);
>
>> Yeah, this is an absolute no-no, it makes the overhead of the second rb
>> tree unconditional.
>
> I agree. Could we keep it conditional by enqueuing 0-cookied tasks only when
> coresched is enabled, just like what we do for cookied tasks? This is still an
> overhead where we have two trees storing all the runnable tasks but in
> different order. We would also need to populate core rbtree from cfs rbtree
> on coresched enable and empty the tree on coresched disable.
>

I'm not sure if the other way is reasonable; I'm trying to provide a
function for each scheduling class to find the highest priority non-cookie
task.

For fair_sched_class, for example, we can traverse rq->cfs_tasks to do the
search. But this search may take a long time, so maybe we need to limit
the number of tasks we examine.
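
For fair_sched_class it might look roughly like the untested sketch below;
the function name and the search limit are made up here, and cfs_tasks is
not strictly ordered by priority, so this is only an approximation:

#define NONCOOKIE_SEARCH_LIMIT  32      /* arbitrary bound for this sketch */

static struct task_struct *pick_task_noncookie_fair(struct rq *rq)
{
        struct task_struct *p;
        int budget = NONCOOKIE_SEARCH_LIMIT;

        /* Walk the runnable fair tasks, giving up after a bounded number. */
        list_for_each_entry(p, &rq->cfs_tasks, se.group_node) {
                if (!p->core_cookie)
                        return p;
                if (!--budget)
                        break;
        }

        return NULL;
}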

Thanks,
Hao

2023-03-23 17:32:06

by Vineeth Pillai

Subject: Re: [External] Re: [PATCH] sched/core: Minor optimize pick_next_task() when core-sched enable

On Thu, Mar 23, 2023 at 3:03 AM Hao Jia <[email protected]> wrote:

> > The other issue was - we don't update core rbtree when vruntime changes and
> > this can cause starvation of cookied task if there are more than one task with
> > the same cookie on an rq.
> >
>
> If I understand correctly, when a cookied task is enqueued, the
> difference delta1 between its vruntime and min_vruntime is very large.
>
> Another task with the same cookie is very actively dequeuing and
> enqueuing, and the difference delta2 between its vruntime and
> min_vruntime is always smaller than delta1?
> I'm not sure if this is the case?

The case I was mentioning is about tasks that are continuously running
and hence always in the runqueue. sched_core_enqueue/dequeue is not
called, so their position in the core rbtree is static while their cfs
rbtree positions change as vruntime progresses.

BTW, this is a separate issue than the one you are targeting with this
fix. I just thought of mentioning it here as well..

> >> Yeah, this is an absolute no-no, it makes the overhead of the second rb
> >> tree unconditional.
> >
> > I agree. Could we keep it conditional by enqueuing 0-cookied tasks only when
> > coresched is enabled, just like what we do for cookied tasks? This is still an
> > overhead where we have two trees storing all the runnable tasks but in
> > different order. We would also need to populate core rbtree from cfs rbtree
> > on coresched enable and empty the tree on coresched disable.
> >
>
> I'm not sure if the other way is reasonable, I'm trying to provide a
> function for each scheduling class to find a highest priority non-cookie
> task.
>
> For example fair_sched_class, we can use rq->cfs_tasks to traverse the
> search. But this search may take a long time, maybe we need to limit the
> number of searches.

Yes, it can be time consuming depending on the number of cgroups and tasks
that are runnable. You could probably take some performance numbers to
see how much worse it is.

We could also have an optimization like marking a runqueue as having
non-cookied tasks and then doing the search only if it is marked. I haven't
thought much about it, but the search could hopefully be optimized.
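
For example, something like the sketch below, where noncookie_count is a
hypothetical per-rq field maintained at enqueue/dequeue time, mirroring the
cookied_count in your patch:

static inline bool rq_has_noncookie_tasks(struct rq *rq)
{
        /* Read under the rq lock in the pick path. */
        return rq->noncookie_count > 0;
}

/*
 * The pick path would then only fall back to the expensive scan when
 * rq_has_noncookie_tasks(rq_i) is true.
 */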

Thanks,
Vineeth

2023-03-24 06:56:03

by Hao Jia

Subject: Re: [External] Re: [PATCH] sched/core: Minor optimize pick_next_task() when core-sched enable



On 2023/3/24 Vineeth Pillai wrote:
> On Thu, Mar 23, 2023 at 3:03 AM Hao Jia <[email protected]> wrote:
>
>>> The other issue was - we don't update core rbtree when vruntime changes and
>>> this can cause starvation of cookied task if there are more than one task with
>>> the same cookie on an rq.
>>>
>>
>> If I understand correctly, when a cookied task is enqueued, the
>> difference delta1 between its vruntime and min_vruntime is very large.
>>
>> Another task with the same cookie is very actively dequeuing and
>> enqueuing, and the difference delta2 between its vruntime and
>> min_vruntime is always smaller than delta1?
>> I'm not sure if this is the case?
>
> This case I was mentioning is about tasks that are continuously running
> and hence always in the runqueue. sched_core_enqueue/dequeue is
> not called and hence their position in the core rbtree is static while cfs
> rbtree positions change as vruntime progresses.
>

Thanks for the detailed explanation.

> BTW, this is a separate issue than the one you are targeting with this
> fix. I just thought of mentioning it here as well..
>
>>>> Yeah, this is an absolute no-no, it makes the overhead of the second rb
>>>> tree unconditional.
>>>
>>> I agree. Could we keep it conditional by enqueuing 0-cookied tasks only when
>>> coresched is enabled, just like what we do for cookied tasks? This is still an
>>> overhead where we have two trees storing all the runnable tasks but in
>>> different order. We would also need to populate core rbtree from cfs rbtree
>>> on coresched enable and empty the tree on coresched disable.
>>>
>>
>> I'm not sure if the other way is reasonable, I'm trying to provide a
>> function for each scheduling class to find a highest priority non-cookie
>> task.
>>
>> For example fair_sched_class, we can use rq->cfs_tasks to traverse the
>> search. But this search may take a long time, maybe we need to limit the
>> number of searches.
>
> Yes, it can be time consuming based on the number of cgroups and tasks
> that are runnable. You could probably take some performance numbers to
> see how worse it is.

I agree, this can be very bad if there are a lot of tasks on the rq. But
using the cfs rbtree to find the highest priority non-cookie task becomes
very complicated when CONFIG_FAIR_GROUP_SCHED is enabled.

Thanks,
Hao

>
> We could also have some optimization like marking a runqueue having
> non-cookied tasks and then do the search only if it is marked. I haven't
> thought much about it, but search could be optimized hopefully.
>