2023-12-09 01:18:56

by Qais Yousef

Subject: [PATCH 0/3] sched: Generalize misfit load balance

Misfit load balancing was added to help handle HMP systems where we can make
a wrong decision at wake-up, thinking a task can run on a smaller core, but
then its characteristics change and it needs to migrate to a bigger core to
meet its performance demands.

With the addition of uclamp, we can encounter more cases where such wrong
placement decisions can be made and require the load balancer to take
corrective action.

Specifically, if a big task capped by uclamp_max was placed on a big core at
wake-up because EAS thought it was the most energy-efficient core at the time,
the dynamics of the system might change: other uncapped tasks might wake up on
the cluster, and there could be a new, more energy-efficient placement for the
capped task(s).

We can generalize misfit load balancing to handle different types of misfits
(whatever they may be) by simply giving it a reason. The reason can then
decide the type of action required.

The current misfit implementation is considered MISFIT_PERF, which means we
need to move the task to a better CPU to meet its performance requirements.

For UCLAMP_MAX I propose MISFIT_POWER, where we need to find a better
placement for the task to control its impact on power.

Once we have an API to annotate latency-sensitive tasks, it is anticipated
that MISFIT_LATENCY load balancing will be required to handle oversubscription,
spreading latency-sensitive tasks more evenly to reduce their wake-up latency.
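
To make this more concrete, the reason enum would grow roughly along these
lines. Note that patch 2 below only adds MISFIT_PERF; MISFIT_POWER comes with
the RFC in patch 3, and MISFIT_LATENCY is purely anticipated at this point:

typedef enum misfit_reason {
	MISFIT_PERF,	/* Requires moving to a more performant CPU */
	MISFIT_POWER,	/* Requires a better placement to control impact on power */
	MISFIT_LATENCY,	/* Anticipated: requires spreading to reduce wake-up latency */
} misfit_reason_t;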

Patch 1 splits misfit status update from misfit detection by adding a new
function is_misfit_task().
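
For reference, this is roughly the shape after patch 1, reconstructed from the
context visible in patch 2 below, before the reason argument is introduced:

static inline int is_misfit_task(struct task_struct *p, struct rq *rq)
{
	/* Pinned (or no) tasks can never be misfits. */
	if (!p || p->nr_cpus_allowed == 1)
		return 0;

	if (task_fits_cpu(p, cpu_of(rq)))
		return 0;

	return 1;
}

static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
{
	if (!sched_asym_cpucap_active())
		return;

	if (!is_misfit_task(p, rq)) {
		rq->misfit_task_load = 0;
		return;
	}

	/*
	 * Make sure that misfit_task_load will not be null even if
	 * task_h_load() returns 0.
	 */
	rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
}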

Patch 2 implements the generalization by adding a misfit reason, propagating
it correctly, and guarding the current misfit code with the MISFIT_PERF
reason.

Patch 3 is an RFC on a potential implementation for MISFIT_POWER.

Patches 1 and 2 were tested standalone with no regressions observed. They
should not introduce a functional change and can be considered for merging,
if they make sense, after addressing any review comments.

Patch 3 was only tested to verify it does what I expected it to do, but no real
power/perf testing was done. This is mainly because I was expecting to remove
uclamp max-aggregation [1], and the RFC I currently have (which I wrote many
months ago) is tied to detecting a task being uncapped by max-aggregation.
I need to rethink the detection mechanism.

Besides that, the logic relies on using find_energy_efficient_cpu() to find
the best potential new placement for the task. To do that, though, we need to
force every CPU to do the MISFIT_POWER load balance, as we don't know which
CPU should do the pull. There might be better ways to handle this, so feedback
and thoughts would be appreciated.
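
To illustrate the idea only, the pulling side could look something like the
hypothetical helper below. This is not the code in patch 3 (whose detection is
tied to max-aggregation), and it glosses over locking/RCU details:

static bool can_pull_misfit_power(struct task_struct *p, struct lb_env *env)
{
	int new_cpu;

	/* Only tasks capped by uclamp_max are MISFIT_POWER candidates. */
	if (uclamp_eff_value(p, UCLAMP_MAX) == SCHED_CAPACITY_SCALE)
		return false;

	/*
	 * Ask EAS where it would place the task now. Every CPU runs this
	 * check during its MISFIT_POWER load balance, and only the CPU
	 * that find_energy_efficient_cpu() picks does the pull.
	 */
	new_cpu = find_energy_efficient_cpu(p, env->src_cpu);

	return new_cpu == env->dst_cpu;
}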

[1] https://lore.kernel.org/lkml/[email protected]/

Thanks!

--
Qais Yousef

Qais Yousef (3):
sched/fair: Add is_misfit_task() function
sched/fair: Generalize misfit lb by adding a misfit reason
sched/fair: Implement new type of misfit MISFIT_POWER

kernel/sched/fair.c | 115 +++++++++++++++++++++++++++++++++++++------
kernel/sched/sched.h | 9 ++++
2 files changed, 110 insertions(+), 14 deletions(-)

--
2.34.1


2023-12-09 01:19:06

by Qais Yousef

Subject: [PATCH 2/3] sched/fair: Generalize misfit lb by adding a misfit reason

MISFIT_PERF is what is currently implemented. It indicates that the task
requires moving to a more performant CPU.

Guard misfit handling in find_busiest_queue() and update_sd_pick_busiest()
with MISFIT_PERF, as they explicitly assume this type of misfit.

This generalizes misfit lb to allow more types of misfits to be added,
like MISFIT_POWER to help uclamp_max be more effective, and
MISFIT_LATENCY to help spread latency-sensitive tasks in oversubscribed
conditions.

Signed-off-by: Qais Yousef (Google) <[email protected]>
---
kernel/sched/fair.c | 28 +++++++++++++++++++++++-----
kernel/sched/sched.h | 8 ++++++++
2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eb9e891182cc..dd49b89a6e3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5063,7 +5063,8 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
}

-static inline int is_misfit_task(struct task_struct *p, struct rq *rq)
+static inline int is_misfit_task(struct task_struct *p, struct rq *rq,
+ misfit_reason_t *reason)
{
if (!p || p->nr_cpus_allowed == 1)
return 0;
@@ -5071,16 +5072,21 @@ static inline int is_misfit_task(struct task_struct *p, struct rq *rq)
if (task_fits_cpu(p, cpu_of(rq)))
return 0;

+ if (reason)
+ *reason = MISFIT_PERF;
return 1;
}

static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
{
+ misfit_reason_t reason;
+
if (!sched_asym_cpucap_active())
return;

- if (!is_misfit_task(p, rq)) {
+ if (!is_misfit_task(p, rq, &reason)) {
rq->misfit_task_load = 0;
+ rq->misfit_reason = -1;
return;
}

@@ -5089,6 +5095,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
* task_h_load() returns 0.
*/
rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
+ rq->misfit_reason = reason;
}

#else /* CONFIG_SMP */
@@ -9111,7 +9118,7 @@ static int detach_tasks(struct lb_env *env)

case migrate_misfit:
/* This is not a misfit task */
- if (!is_misfit_task(p, cpu_rq(env->src_cpu)))
+ if (!is_misfit_task(p, cpu_rq(env->src_cpu), NULL))
goto next;

env->imbalance = 0;
@@ -9426,6 +9433,7 @@ struct sg_lb_stats {
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
unsigned int group_smt_balance; /* Task on busy SMT be moved */
unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+ misfit_reason_t group_misfit_reason;
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -9904,6 +9912,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
/* Check for a misfit task on the cpu */
if (sgs->group_misfit_task_load < rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
+ sgs->group_misfit_reason = rq->misfit_reason;
*sg_status |= SG_OVERLOAD;
}
} else if ((env->idle != CPU_NOT_IDLE) &&
@@ -9969,6 +9978,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
*/
if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
(sgs->group_type == group_misfit_task) &&
+ (sgs->group_misfit_reason == MISFIT_PERF) &&
(!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
sds->local_stat.group_type != group_has_spare))
return false;
@@ -10193,6 +10203,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,

for_each_cpu(i, sched_group_span(group)) {
struct rq *rq = cpu_rq(i);
+ misfit_reason_t reason;
unsigned int local;

sgs->group_load += cpu_load_without(rq, p);
@@ -10212,9 +10223,15 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,

/* Check if task fits in the CPU */
if (sd->flags & SD_ASYM_CPUCAPACITY &&
- sgs->group_misfit_task_load &&
- !is_misfit_task(p, rq))
+ sgs->group_misfit_task_load) {
+ if (!is_misfit_task(p, rq, &reason)) {
sgs->group_misfit_task_load = 0;
+ sgs->group_misfit_reason = -1;
+ } else {
+ sgs->group_misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
+ sgs->group_misfit_reason = reason;
+ }
+ }

}

@@ -11008,6 +11025,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* average load.
*/
if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ rq->misfit_reason == MISFIT_PERF &&
!capacity_greater(capacity_of(env->dst_cpu), capacity) &&
nr_running == 1)
continue;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58a54bda77d..399b6526afab 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -962,6 +962,10 @@ struct balance_callback {
void (*func)(struct rq *rq);
};

+typedef enum misfit_reason {
+ MISFIT_PERF, /* Requires moving to a more performant CPU */
+} misfit_reason_t;
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -1168,6 +1172,10 @@ struct rq {
call_single_data_t cfsb_csd;
struct list_head cfsb_csd_list;
#endif
+
+#ifdef CONFIG_SMP
+ misfit_reason_t misfit_reason;
+#endif
};

#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.34.1

2023-12-21 15:28:23

by Pierre Gondois

Subject: Re: [PATCH 0/3] sched: Generalize misfit load balance

Hello Qais,

On 12/9/23 02:17, Qais Yousef wrote:
> [...]
> Patch 3 was only tested to verify it does what I expected it to do, but no
> real power/perf testing was done. This is mainly because I was expecting to
> remove uclamp max-aggregation [1], and the RFC I currently have (which
> I wrote many months ago) is tied to detecting a task being uncapped by
> max-aggregation. I need to rethink the detection mechanism.

I tried to trigger the MISFIT_POWER misfit reason without success so far.
Would it be possible to provide a workload/test to reliably trigger the
condition?

Regards,
Pierre

2023-12-28 23:39:05

by Qais Yousef

Subject: Re: [PATCH 0/3] sched: Generalize misfit load balance

On 12/21/23 16:26, Pierre Gondois wrote:
> Hello Qais,
>
> On 12/9/23 02:17, Qais Yousef wrote:
> > [...]
>
> I tried to trigger the MISFIT_POWER misfit reason without success so far.
> Would it be possible to provide a workload/test to reliably trigger the
> condition?

I spawn a busy loop like

cat /dev/zero > /dev/null

Then use

uclampset -M 0 -p $PID

to toggle uclamp_max between 0 and 1024.

Try to load the system with some workload and you should see something like
the attached picture. Red boxes are periods where uclamp_max is 0; the rest is
for uclamp_max = 1024. Note how the task is constantly moved between CPUs when
capped.


Cheers

--
Qais Yousef


Attachments:
misfit_power.png (375.12 kB)