2021-06-02 13:59:59

by Lukasz Luba

Subject: [PATCH 0/2] Add allowed CPU capacity knowledge to EAS

Hi all,

This patch set adds knowledge about reduced CPU capacity to the Energy
Model (EM) and the Energy Aware Scheduler (EAS). Currently, the SchedUtil
CPU frequency and the EM frequency are not aligned when a CPU is
thermally capped, which causes an energy estimation error. This patch set
passes the allowed CPU capacity to the EM (derived from the thermal
pressure signal), which improves the energy estimation. More information
about this mechanism can be found in the individual patches.

Regards,
Lukasz

Lukasz Luba (2):
sched/fair: Take thermal pressure into account while estimating energy
sched/cpufreq: Consider reduced CPU capacity in energy calculation

include/linux/energy_model.h | 16 +++++++++++++---
include/linux/sched/cpufreq.h | 2 +-
kernel/sched/cpufreq_schedutil.c | 1 +
kernel/sched/fair.c | 13 +++++++++++--
4 files changed, 26 insertions(+), 6 deletions(-)

--
2.17.1


2021-06-02 14:00:12

by Lukasz Luba

Subject: [PATCH 2/2] sched/cpufreq: Consider reduced CPU capacity in energy calculation

Energy Aware Scheduling (EAS) needs to predict the decisions made by
SchedUtil. map_util_freq() exists to do that.

There are corner cases where the max allowed frequency might be reduced
(due to thermal capping). SchedUtil, as a CPUFreq governor, is aware of
that, but EAS is not. This patch aims to address it.

SchedUtil stores the effective (possibly capped) frequency in the
'sugov_policy::next_freq' field. EAS has to predict that value, which is
the frequency actually used. That value is produced by a call to
cpufreq_driver_resolve_freq(), which clamps to the CPUFreq policy limits.
In the existing code EAS is not able to predict that real frequency.
This leads to energy estimation errors.

To avoid wrong energy estimation in EAS (due to frequency misprediction),
make sure that the step which calculates the Performance Domain frequency
is also aware of the allowed CPU capacity.

Furthermore, modify map_util_freq() so that it no longer inflates the
frequency value by 25%. Instead, use map_util_perf() to inflate the util
value in both places, SchedUtil and EAS, but for EAS clamp it to the max
allowed CPU capacity. In the end, both subsystems behave the same way and
agree on the real CPU frequency.

Signed-off-by: Lukasz Luba <[email protected]>
---
include/linux/energy_model.h | 16 +++++++++++++---
include/linux/sched/cpufreq.h | 2 +-
kernel/sched/cpufreq_schedutil.c | 1 +
kernel/sched/fair.c | 2 +-
4 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 757fc60658fa..3f221dbf5f95 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -91,6 +91,8 @@ void em_dev_unregister_perf_domain(struct device *dev);
* @pd : performance domain for which energy has to be estimated
* @max_util : highest utilization among CPUs of the domain
* @sum_util : sum of the utilization of all CPUs in the domain
+ * @allowed_cpu_cap : maximum allowed CPU capacity for the @pd, which
+ might reflect reduced frequency (due to thermal)
*
* This function must be used only for CPU devices. There is no validation,
* i.e. if the EM is a CPU type and has cpumask allocated. It is called from
@@ -100,7 +102,8 @@ void em_dev_unregister_perf_domain(struct device *dev);
* a capacity state satisfying the max utilization of the domain.
*/
static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
- unsigned long max_util, unsigned long sum_util)
+ unsigned long max_util, unsigned long sum_util,
+ unsigned long allowed_cpu_cap)
{
unsigned long freq, scale_cpu;
struct em_perf_state *ps;
@@ -112,11 +115,17 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
/*
* In order to predict the performance state, map the utilization of
* the most utilized CPU of the performance domain to a requested
- * frequency, like schedutil.
+ * frequency, like schedutil. Take also into account that the real
+ * frequency might be set lower (due to thermal capping). Thus, clamp
+ * max utilization to the allowed CPU capacity before calculating
+ * effective frequency.
*/
cpu = cpumask_first(to_cpumask(pd->cpus));
scale_cpu = arch_scale_cpu_capacity(cpu);
ps = &pd->table[pd->nr_perf_states - 1];
+
+ max_util = map_util_perf(max_util);
+ max_util = min(max_util, allowed_cpu_cap);
freq = map_util_freq(max_util, ps->frequency, scale_cpu);

/*
@@ -209,7 +218,8 @@ static inline struct em_perf_domain *em_pd_get(struct device *dev)
return NULL;
}
static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
- unsigned long max_util, unsigned long sum_util)
+ unsigned long max_util, unsigned long sum_util,
+ unsigned long allowed_cpu_cap)
{
return 0;
}
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index 6205578ab6ee..bdd31ab93bc5 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -26,7 +26,7 @@ bool cpufreq_this_cpu_can_update(struct cpufreq_policy *policy);
static inline unsigned long map_util_freq(unsigned long util,
unsigned long freq, unsigned long cap)
{
- return (freq + (freq >> 2)) * util / cap;
+ return freq * util / cap;
}

static inline unsigned long map_util_perf(unsigned long util)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 4f09afd2f321..57124614363d 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -151,6 +151,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
unsigned int freq = arch_scale_freq_invariant() ?
policy->cpuinfo.max_freq : policy->cur;

+ util = map_util_perf(util);
freq = map_util_freq(util, freq, max);

if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ca0a6f1408da..46064e18c287 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6588,7 +6588,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
max_util = max(max_util, cpu_util);
}

- return em_cpu_energy(pd->em_pd, max_util, sum_util);
+ return em_cpu_energy(pd->em_pd, max_util, sum_util, cpu_cap);
}

/*
--
2.17.1
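
For illustration, the frequency prediction described in patch 2/2 can be
sketched as standalone userspace C. This is only a sketch under stated
assumptions, not the kernel code: the sketch_* names are hypothetical, and
sketch_map_util_perf() assumes that map_util_perf() adds 25% headroom to
the utilization (util + util/4), mirroring the headroom that this patch
removes from map_util_freq().

```c
#include <assert.h>

/* Assumed behavior of map_util_perf(): add 25% headroom to util. */
static inline unsigned long sketch_map_util_perf(unsigned long util)
{
	return util + (util >> 2);
}

/* map_util_freq() before the patch: headroom applied to the frequency. */
static inline unsigned long sketch_old_map_util_freq(unsigned long util,
						     unsigned long freq,
						     unsigned long cap)
{
	assert(cap > 0);
	return (freq + (freq >> 2)) * util / cap;
}

/* map_util_freq() after the patch: plain proportional mapping. */
static inline unsigned long sketch_map_util_freq(unsigned long util,
						 unsigned long freq,
						 unsigned long cap)
{
	assert(cap > 0);
	return freq * util / cap;
}

/*
 * Hypothetical helper following the patched em_cpu_energy() steps:
 * inflate util, clamp it to the allowed capacity, then map to a frequency.
 */
static inline unsigned long sketch_predict_freq(unsigned long max_util,
						unsigned long max_freq,
						unsigned long scale_cpu,
						unsigned long allowed_cpu_cap)
{
	max_util = sketch_map_util_perf(max_util);
	if (max_util > allowed_cpu_cap)
		max_util = allowed_cpu_cap;
	return sketch_map_util_freq(max_util, max_freq, scale_cpu);
}
```

With max_util = 512, a 2000000 kHz top frequency and a capacity of 1024,
the uncapped prediction matches the old formula (1250000 kHz), while an
allowed capacity of 600 lowers the predicted frequency to 1171875 kHz.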

2021-06-02 14:00:26

by Lukasz Luba

Subject: [PATCH 1/2] sched/fair: Take thermal pressure into account while estimating energy

Energy Aware Scheduling (EAS) needs to be able to predict the frequency
requests made by the SchedUtil governor to properly estimate the energy
used in the future. It has to take into account CPU utilization and
forecast the Performance Domain (PD) frequency. There is a corner case
when the max allowed frequency is reduced due to thermal capping.
SchedUtil is aware of that reduced frequency, so it should also be taken
into account in EAS estimations.

SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of
a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping
to 'policy::max'. SchedUtil is responsible for respecting that upper
limit while setting the frequency through CPUFreq drivers. This effective
frequency is stored internally in 'sugov_policy::next_freq' and EAS has
to predict that value.

In the existing code, the raw value of arch_scale_cpu_capacity() is used
to clamp the CPU utilization returned from effective_cpu_util(). This
patch fixes the issue of an overestimated single-CPU utilization by
clamping to the allowed CPU capacity instead. The allowed CPU capacity is
the CPU capacity reduced by the thermal pressure signal. We rely on this
load-avg geometric series in a similar way to other mechanisms in the
scheduler.

Thanks to the knowledge about allowed CPU capacity, we don't get too
large a value for a single CPU utilization, which is then added to the
util sum. The util sum is used to estimate the energy of the whole PD.
To avoid wrong energy estimation in EAS (due to capped frequency), make
sure that the calculation of the util sum is aware of the allowed CPU
capacity.

Signed-off-by: Lukasz Luba <[email protected]>
---
kernel/sched/fair.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 161b92aa1c79..ca0a6f1408da 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6525,8 +6525,9 @@ static long
compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
{
struct cpumask *pd_mask = perf_domain_span(pd);
- unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
+ unsigned long _cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
unsigned long max_util = 0, sum_util = 0;
+ unsigned long cpu_cap = _cpu_cap;
int cpu;

/*
@@ -6558,6 +6559,14 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
cpu_util_next(cpu, p, -1) + task_util_est(p);
}

+ /*
+ * Take the thermal pressure from non-idle CPUs. They have
+ * most up-to-date information. For idle CPUs thermal pressure
+ * signal is not updated so often.
+ */
+ if (!idle_cpu(cpu))
+ cpu_cap = _cpu_cap - thermal_load_avg(cpu_rq(cpu));
+
/*
* Busy time computation: utilization clamping is not
* required since the ratio (sum_util / cpu_capacity)
--
2.17.1
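
The capping described in patch 1/2 can be sketched in standalone C. This
is a simplified model under stated assumptions, not the kernel code:
struct sketch_cpu and sketch_sum_util() are hypothetical, the thermal
pressure is passed in as a plain number instead of being read through
thermal_load_avg(), and it is assumed never to exceed the CPU capacity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CPU snapshot used only for this sketch. */
struct sketch_cpu {
	unsigned long util;             /* utilization before clamping */
	unsigned long thermal_pressure; /* capacity lost to thermal capping */
	int idle;                       /* non-zero if the CPU is idle */
};

/*
 * Sum the utilization of all CPUs in a performance domain, clamping each
 * CPU to the allowed capacity. As in the patch, only non-idle CPUs refresh
 * the allowed capacity, because their thermal pressure signal is the most
 * up to date; idle CPUs reuse the last value.
 */
static inline unsigned long sketch_sum_util(const struct sketch_cpu *cpus,
					    size_t nr_cpus,
					    unsigned long arch_cap)
{
	unsigned long cpu_cap = arch_cap;
	unsigned long sum_util = 0;
	size_t i;

	for (i = 0; i < nr_cpus; i++) {
		if (!cpus[i].idle) {
			/* Assumption: pressure never exceeds capacity. */
			assert(cpus[i].thermal_pressure <= arch_cap);
			cpu_cap = arch_cap - cpus[i].thermal_pressure;
		}
		sum_util += cpus[i].util < cpu_cap ? cpus[i].util : cpu_cap;
	}

	return sum_util;
}
```

For example, with a capacity of 1024, a busy CPU at util 900 under a
thermal pressure of 200 contributes only 824 to the sum instead of 900.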

2021-06-02 15:05:14

by Quentin Perret

Subject: Re: [PATCH 1/2] sched/fair: Take thermal pressure into account while estimating energy

Hi Lukasz,

On Wednesday 02 Jun 2021 at 14:56:08 (+0100), Lukasz Luba wrote:
> compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> {
> struct cpumask *pd_mask = perf_domain_span(pd);
> - unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
> + unsigned long _cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
> unsigned long max_util = 0, sum_util = 0;
> + unsigned long cpu_cap = _cpu_cap;
> int cpu;
>
> /*
> @@ -6558,6 +6559,14 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> cpu_util_next(cpu, p, -1) + task_util_est(p);
> }
>
> + /*
> + * Take the thermal pressure from non-idle CPUs. They have
> + * most up-to-date information. For idle CPUs thermal pressure
> + * signal is not updated so often.
> + */
> + if (!idle_cpu(cpu))
> + cpu_cap = _cpu_cap - thermal_load_avg(cpu_rq(cpu));

This messes up the irq time scaling no? Maybe move the capping in this
function instead of relying on effective_cpu_util() to do it for you?

> /*
> * Busy time computation: utilization clamping is not
> * required since the ratio (sum_util / cpu_capacity)
> --
> 2.17.1
>

2021-06-02 15:38:54

by Lukasz Luba

Subject: Re: [PATCH 1/2] sched/fair: Take thermal pressure into account while estimating energy

Hi Quentin,

On 6/2/21 4:00 PM, Quentin Perret wrote:
> Hi Lukasz,
>
> On Wednesday 02 Jun 2021 at 14:56:08 (+0100), Lukasz Luba wrote:
>> compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>> {
>> struct cpumask *pd_mask = perf_domain_span(pd);
>> - unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
>> + unsigned long _cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
>> unsigned long max_util = 0, sum_util = 0;
>> + unsigned long cpu_cap = _cpu_cap;
>> int cpu;
>>
>> /*
>> @@ -6558,6 +6559,14 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>> cpu_util_next(cpu, p, -1) + task_util_est(p);
>> }
>>
>> + /*
>> + * Take the thermal pressure from non-idle CPUs. They have
>> + * most up-to-date information. For idle CPUs thermal pressure
>> + * signal is not updated so often.
>> + */
>> + if (!idle_cpu(cpu))
>> + cpu_cap = _cpu_cap - thermal_load_avg(cpu_rq(cpu));
>
> This messes up the irq time scaling no? Maybe move the capping in this

You are talking about scale_irq_capacity(), which shrinks the util by
the fraction of time spent in irq. That fraction might differ slightly
(e.g. 8/9 vs 9/10) from the SchedUtil view, which passes the 'raw' arch
capacity. The irq part is then added, but to this slightly different
base util.

> function instead of relying on effective_cpu_util() to do it for you?

Agreed, since it would be more 'aligned' with how SchedUtil calls
effective_cpu_util(). I will clamp the returned value.

Thanks for pointing this out.

Regards,
Lukasz
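
The irq scaling difference discussed above can be made concrete with a
small standalone C sketch. It assumes that scale_irq_capacity() computes
util * (max - irq) / max and that the irq part is added back afterwards;
the real effective_cpu_util() does more than this, and the sketch_* names
are hypothetical.

```c
#include <assert.h>

/* Assumed shape of scale_irq_capacity(): shrink util by the irq share. */
static inline unsigned long sketch_scale_irq_capacity(unsigned long util,
						      unsigned long irq,
						      unsigned long max)
{
	assert(max > 0 && irq <= max);
	return util * (max - irq) / max;
}

/* Scale util by the non-irq share of 'max', then add the irq part back. */
static inline unsigned long sketch_irq_scaled_util(unsigned long util,
						   unsigned long irq,
						   unsigned long max)
{
	return sketch_scale_irq_capacity(util, irq, max) + irq;
}
```

With util = 450 and irq = 100, passing the raw capacity (max = 1000)
scales by 9/10 and yields 505, while passing a thermally reduced capacity
(max = 900) scales by 8/9 and yields 500. This is the small mismatch that
goes away when compute_energy() passes the raw capacity and clamps the
returned value itself.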