2023-07-17 22:03:53

by Qais Yousef

[permalink] [raw]
Subject: [PATCH v3 0/3] Fix a couple of corner cases in feec() when using uclamp_max

Changes in v3:

* Fix sign comparison problem in patch 1 (Thanks Vincent!)
* Simplify comparison and remove function in patch 2 (Thanks Dietmar!)

Changes in v2:

* Use long instead of unsigned long to keep the comparison simple
in spite of being inconsistent with how capacity type.
* Fix missing termination parenthesis that caused build error.
* Rebase on latest tip/sched/core and Vincent v5 of Unlink misift patch.

v1 link: https://lore.kernel.org/lkml/[email protected]/
v2 link: https://lore.kernel.org/lkml/[email protected]/

In v2 Dietmar has raised concerns about limitation in current EM calculations
that can end up packing more tasks on a cluster. While this is not ideal
situation and we need to fix it, but it is another independent problem that is
not introduced by this fix. I don't see a reason why we should couple them
rather than work on each problem independently. The packing behavior in
practice is actually not bad as if something is capped really hard, there's
a desire to keep them on this less performant clusters.

Patch 1 addresses a bug because forcing a task on a small CPU to honour
uclamp_max hint means we can end up with spare_capacity = 0; but the logic is
constructed such that spare_capacity = 0 leads to ignoring this CPU as
a candidate to compute_energy().

Patch 2 addresses a bug due to an optimization in feec() that could lead to
ignoring tasks whose uclamp_max = 0 but task_util(0) != 0.

Patch 3 adds a new tracepoint in compute_energy() as it was helpful in
debugging analyzing these two problems. It's generally useful to trace down
task placement decision based on EAS.

This is based on tip/sched/core.

Qais Yousef (3):
sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
sched/uclamp: Ignore (util == 0) optimization in feec() when
p_util_max = 0
sched/tp: Add new tracepoint to track compute energy computation

include/trace/events/sched.h | 4 ++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 36 ++++++++++++------------------------
3 files changed, 17 insertions(+), 24 deletions(-)

--
2.25.1



2023-07-17 22:13:54

by Qais Yousef

[permalink] [raw]
Subject: [PATCH v3 3/3] sched/tp: Add new tracepoint to track compute energy computation

It was useful to track feec() placement decision and debug the spare
capacity and optimization issues vs uclamp_max.

Signed-off-by: Qais Yousef (Google) <[email protected]>
---
include/trace/events/sched.h | 4 ++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 7 ++++++-
3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index fbb99a61f714..20cc884f72ff 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -735,6 +735,10 @@ DECLARE_TRACE(sched_update_nr_running_tp,
TP_PROTO(struct rq *rq, int change),
TP_ARGS(rq, change));

+DECLARE_TRACE(sched_compute_energy_tp,
+ TP_PROTO(struct task_struct *p, int dst_cpu, unsigned long energy),
+ TP_ARGS(p, dst_cpu, energy));
+
#endif /* _TRACE_SCHED_H */

/* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 83e36547af17..2deca2dca625 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -114,6 +114,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_overutilized_tp);
EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_cfs_tp);
EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);

DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c701f490ca4c..23e026393210 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7493,11 +7493,16 @@ compute_energy(struct energy_env *eenv, struct perf_domain *pd,
{
unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
unsigned long busy_time = eenv->pd_busy_time;
+ unsigned long energy;

if (dst_cpu >= 0)
busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);

- return em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
+ energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
+
+ trace_sched_compute_energy_tp(p, dst_cpu, energy);
+
+ return energy;
}

/*
--
2.25.1


2023-07-17 22:49:23

by Qais Yousef

[permalink] [raw]
Subject: [PATCH v3 2/3] sched/uclamp: Ignore (util == 0) optimization in feec() when p_util_max = 0

find_energy_efficient_cpu() bails out early if effective util of the
task is 0 as the delta at this point will be zero and there's nothing
for EAS to do. When uclamp is being used, this could lead to wrong
decisions when uclamp_max is set to 0. In this case the task is capped
to performance point 0, but it is actually running and consuming energy
and we can benefit from EAS energy calculations.

Rework the condition so that it bails out for when util is actually 0 or
uclamp_min is requesting a higher performance point.

We can do that without needing to use uclamp_task_util(); remove it.

Fixes: d81304bc6193 ("sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition")
Signed-off-by: Qais Yousef (Google) <[email protected]>
---
kernel/sched/fair.c | 18 +-----------------
1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d489eece5a0d..c701f490ca4c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4348,22 +4348,6 @@ static inline unsigned long task_util_est(struct task_struct *p)
return max(task_util(p), _task_util_est(p));
}

-#ifdef CONFIG_UCLAMP_TASK
-static inline unsigned long uclamp_task_util(struct task_struct *p,
- unsigned long uclamp_min,
- unsigned long uclamp_max)
-{
- return clamp(task_util_est(p), uclamp_min, uclamp_max);
-}
-#else
-static inline unsigned long uclamp_task_util(struct task_struct *p,
- unsigned long uclamp_min,
- unsigned long uclamp_max)
-{
- return task_util_est(p);
-}
-#endif
-
static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
struct task_struct *p)
{
@@ -7588,7 +7572,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
target = prev_cpu;

sync_entity_load_avg(&p->se);
- if (!uclamp_task_util(p, p_util_min, p_util_max))
+ if (!task_util_est(p) && p_util_min == 0)
goto unlock;

eenv_task_busy_time(&eenv, p, prev_cpu);
--
2.25.1


2023-07-17 22:53:55

by Qais Yousef

[permalink] [raw]
Subject: [PATCH v3 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0

When uclamp_max is being used, the util of the task could be higher than
the spare capacity of the CPU, but due to uclamp_max value we force fit
it there.

The way the condition for checking for max_spare_cap in
find_energy_efficient_cpu() was constructed; it ignored any CPU that has
its spare_cap less than or _equal_ to max_spare_cap. Since we initialize
max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and
hence ending up never performing compute_energy() for this cluster and
missing an opportunity for a better energy efficient placement to honour
uclamp_max setting.

max_spare_cap = 0;
cpu_cap = capacity_of(cpu) - task_util(p); // 0 if task_util(p) is high

...

util_fits_cpu(...); // will return true if uclamp_max forces it to fit

...

// this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
if (cpu_cap > max_spare_cap) {
max_spare_cap = cpu_cap;
max_spare_cap_cpu = cpu;
}

prev_spare_cap suffers from a similar problem.

Fix the logic by converting the variables into long and treating -1
value as 'not populated' instead of 0 which is a viable and correct
spare capacity value. We need to be careful signed comparison is used
when comparing with cpu_cap in one of the conditions.

Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
Signed-off-by: Qais Yousef (Google) <[email protected]>
---
kernel/sched/fair.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0cd1cdbae534..d489eece5a0d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7596,11 +7596,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
for (; pd; pd = pd->next) {
unsigned long util_min = p_util_min, util_max = p_util_max;
unsigned long cpu_cap, cpu_thermal_cap, util;
- unsigned long cur_delta, max_spare_cap = 0;
+ long prev_spare_cap = -1, max_spare_cap = -1;
unsigned long rq_util_min, rq_util_max;
- unsigned long prev_spare_cap = 0;
+ unsigned long cur_delta, base_energy;
int max_spare_cap_cpu = -1;
- unsigned long base_energy;
int fits, max_fits = -1;

cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
@@ -7663,7 +7662,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
prev_spare_cap = cpu_cap;
prev_fits = fits;
} else if ((fits > max_fits) ||
- ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
+ ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
/*
* Find the CPU with the maximum spare capacity
* among the remaining CPUs in the performance
@@ -7675,7 +7674,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
}
}

- if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
+ if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
continue;

eenv_pd_busy_time(&eenv, cpus, p);
@@ -7683,7 +7682,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
base_energy = compute_energy(&eenv, pd, cpus, p, -1);

/* Evaluate the energy impact of using prev_cpu. */
- if (prev_spare_cap > 0) {
+ if (prev_spare_cap > -1) {
prev_delta = compute_energy(&eenv, pd, cpus, p,
prev_cpu);
/* CPU utilization has changed */
--
2.25.1


2023-07-21 10:36:43

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] sched/uclamp: Ignore (util == 0) optimization in feec() when p_util_max = 0

On Mon, 17 Jul 2023 at 23:57, Qais Yousef <[email protected]> wrote:
>
> find_energy_efficient_cpu() bails out early if effective util of the
> task is 0 as the delta at this point will be zero and there's nothing
> for EAS to do. When uclamp is being used, this could lead to wrong
> decisions when uclamp_max is set to 0. In this case the task is capped
> to performance point 0, but it is actually running and consuming energy
> and we can benefit from EAS energy calculations.
>
> Rework the condition so that it bails out for when util is actually 0 or
> uclamp_min is requesting a higher performance point.
>
> We can do that without needing to use uclamp_task_util(); remove it.
>
> Fixes: d81304bc6193 ("sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition")
> Signed-off-by: Qais Yousef (Google) <[email protected]>

Reviewed-by: Vincent Guittot <[email protected]>

> ---
> kernel/sched/fair.c | 18 +-----------------
> 1 file changed, 1 insertion(+), 17 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d489eece5a0d..c701f490ca4c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4348,22 +4348,6 @@ static inline unsigned long task_util_est(struct task_struct *p)
> return max(task_util(p), _task_util_est(p));
> }
>
> -#ifdef CONFIG_UCLAMP_TASK
> -static inline unsigned long uclamp_task_util(struct task_struct *p,
> - unsigned long uclamp_min,
> - unsigned long uclamp_max)
> -{
> - return clamp(task_util_est(p), uclamp_min, uclamp_max);
> -}
> -#else
> -static inline unsigned long uclamp_task_util(struct task_struct *p,
> - unsigned long uclamp_min,
> - unsigned long uclamp_max)
> -{
> - return task_util_est(p);
> -}
> -#endif
> -
> static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> struct task_struct *p)
> {
> @@ -7588,7 +7572,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> target = prev_cpu;
>
> sync_entity_load_avg(&p->se);
> - if (!uclamp_task_util(p, p_util_min, p_util_max))
> + if (!task_util_est(p) && p_util_min == 0)
> goto unlock;
>
> eenv_task_busy_time(&eenv, p, prev_cpu);
> --
> 2.25.1
>

2023-07-21 10:44:57

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0

On Mon, 17 Jul 2023 at 23:57, Qais Yousef <[email protected]> wrote:
>
> When uclamp_max is being used, the util of the task could be higher than
> the spare capacity of the CPU, but due to uclamp_max value we force fit
> it there.
>
> The way the condition for checking for max_spare_cap in
> find_energy_efficient_cpu() was constructed; it ignored any CPU that has
> its spare_cap less than or _equal_ to max_spare_cap. Since we initialize
> max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and
> hence ending up never performing compute_energy() for this cluster and
> missing an opportunity for a better energy efficient placement to honour
> uclamp_max setting.
>
> max_spare_cap = 0;
> cpu_cap = capacity_of(cpu) - task_util(p); // 0 if task_util(p) is high
>
> ...
>
> util_fits_cpu(...); // will return true if uclamp_max forces it to fit
>
> ...
>
> // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
> if (cpu_cap > max_spare_cap) {
> max_spare_cap = cpu_cap;
> max_spare_cap_cpu = cpu;
> }
>
> prev_spare_cap suffers from a similar problem.
>
> Fix the logic by converting the variables into long and treating -1
> value as 'not populated' instead of 0 which is a viable and correct
> spare capacity value. We need to be careful signed comparison is used
> when comparing with cpu_cap in one of the conditions.
>
> Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
> Signed-off-by: Qais Yousef (Google) <[email protected]>

Reviewed-by: Vincent Guittot <[email protected]>

> ---
> kernel/sched/fair.c | 11 +++++------
> 1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0cd1cdbae534..d489eece5a0d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7596,11 +7596,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> for (; pd; pd = pd->next) {
> unsigned long util_min = p_util_min, util_max = p_util_max;
> unsigned long cpu_cap, cpu_thermal_cap, util;
> - unsigned long cur_delta, max_spare_cap = 0;
> + long prev_spare_cap = -1, max_spare_cap = -1;
> unsigned long rq_util_min, rq_util_max;
> - unsigned long prev_spare_cap = 0;
> + unsigned long cur_delta, base_energy;
> int max_spare_cap_cpu = -1;
> - unsigned long base_energy;
> int fits, max_fits = -1;
>
> cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
> @@ -7663,7 +7662,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> prev_spare_cap = cpu_cap;
> prev_fits = fits;
> } else if ((fits > max_fits) ||
> - ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
> + ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
> /*
> * Find the CPU with the maximum spare capacity
> * among the remaining CPUs in the performance
> @@ -7675,7 +7674,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> }
> }
>
> - if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
> + if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
> continue;
>
> eenv_pd_busy_time(&eenv, cpus, p);
> @@ -7683,7 +7682,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> base_energy = compute_energy(&eenv, pd, cpus, p, -1);
>
> /* Evaluate the energy impact of using prev_cpu. */
> - if (prev_spare_cap > 0) {
> + if (prev_spare_cap > -1) {
> prev_delta = compute_energy(&eenv, pd, cpus, p,
> prev_cpu);
> /* CPU utilization has changed */
> --
> 2.25.1
>

2023-08-22 11:50:26

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] sched/tp: Add new tracepoint to track compute energy computation



On 8/21/23 23:36, Qais Yousef wrote:
> Hi Lukasz
>
> On 08/21/23 10:04, Lukasz Luba wrote:
>> Hi Qais,
>>
>> On 7/17/23 22:57, Qais Yousef wrote:
>>> It was useful to track feec() placement decision and debug the spare
>>> capacity and optimization issues vs uclamp_max.
>>>
>>> Signed-off-by: Qais Yousef (Google) <[email protected]>
>>> ---
>>> include/trace/events/sched.h | 4 ++++
>>> kernel/sched/core.c | 1 +
>>> kernel/sched/fair.c | 7 ++++++-
>>> 3 files changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
>>> index fbb99a61f714..20cc884f72ff 100644
>>> --- a/include/trace/events/sched.h
>>> +++ b/include/trace/events/sched.h
>>> @@ -735,6 +735,10 @@ DECLARE_TRACE(sched_update_nr_running_tp,
>>> TP_PROTO(struct rq *rq, int change),
>>> TP_ARGS(rq, change));
>>> +DECLARE_TRACE(sched_compute_energy_tp,
>>> + TP_PROTO(struct task_struct *p, int dst_cpu, unsigned long energy),
>>> + TP_ARGS(p, dst_cpu, energy));
>>> +
>>> #endif /* _TRACE_SCHED_H */
>>> /* This part must be outside protection */
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 83e36547af17..2deca2dca625 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -114,6 +114,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_overutilized_tp);
>>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_cfs_tp);
>>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
>>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
>>> +EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
>>> DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index c701f490ca4c..23e026393210 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7493,11 +7493,16 @@ compute_energy(struct energy_env *eenv, struct perf_domain *pd,
>>> {
>>> unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
>>> unsigned long busy_time = eenv->pd_busy_time;
>>> + unsigned long energy;
>>> if (dst_cpu >= 0)
>>> busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
>>> - return em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
>>> + energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
>>> +
>>> + trace_sched_compute_energy_tp(p, dst_cpu, energy);
>>
>> You've probably missed to add the change that we discussed in v2:
>>
>> https://lore.kernel.org/lkml/20230221120832.x642tqohxv5nascr@airbuntu/
>>
>> The max_util and busy_time. This would help us in our tool.
>
> Ah, I did indeed. Sorry about that. Will send v4 then.

Thanks Qais!