This is the first part of consolidating how the max compute capacity is
used in the scheduler and how we calculate the frequency for a level of
utilization.
It fixes some inconsistencies when computing the frequency for a given
utilization: there can be a mismatch between the energy model and schedutil.
The next step will be to distinguish between the original max compute
capacity of a CPU and what is currently available when a capping is
applied for a long time (i.e. seconds or more).
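For reference, the utilization-to-frequency conversion that this series makes
consistent across schedutil and the energy model boils down to scaling the
utilization by a reference frequency. A minimal sketch (illustrative only,
not part of the series; it ignores the headroom added by map_util_perf() and
the example_ name is made up):

static unsigned long example_map_util_freq(unsigned long util,
					   unsigned long ref_freq,
					   unsigned long scale_cpu)
{
	/*
	 * e.g. util = 512, ref_freq = 2000000 kHz, scale_cpu = 1024
	 * gives 1000000 kHz. If schedutil and the energy model use
	 * different ref_freq values, they compute different target
	 * frequencies for the same utilization.
	 */
	return util * ref_freq / scale_cpu;
}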
Changes since v1:
- Fixed typos.
- Added changes in cpufreq to use arch_scale_freq_ref() when calling
arch_set_freq_scale() (patch 3).
- arch_scale_freq_ref() is now always defined and returns 0 (as proposed
by Ionela) when not defined by the arch. This simplifies the code with
the addition of patch 3.
- Simplified the Energy Model, which now always uses arch_scale_freq_ref().
The latter returns 0 when not defined by the arch, instead of using the
last item of the perf domain. This is not a problem because, in this case,
the function is only defined for compilation purposes and we don't care
about the returned value. (patch 5)
- Added changes in cppc cpufreq to set capacity_ref_freq (patch 6).
- Added the Reviewed-by tag for patch 1, which only got a minor change, but
not for the others as I made some changes which could render the previous
Reviewed-by tags no longer relevant.
Vincent Guittot (6):
sched: consolidate and cleanup access to CPU's max compute capacity
topology: add a new arch_scale_freq_reference
cpufreq: use the fixed and coherent frequency for scaling capacity
cpufreq/schedutil: use a fixed reference frequency
energy_model: use a fixed reference frequency
cpufreq/cppc: set the frequency used for capacity computation
Documentation/scheduler/sched-capacity.rst | 13 +++++-----
arch/arm/include/asm/topology.h | 1 +
arch/arm64/include/asm/topology.h | 1 +
arch/riscv/include/asm/topology.h | 1 +
drivers/base/arch_topology.c | 29 +++++++++++-----------
drivers/cpufreq/cppc_cpufreq.c | 18 ++++++++++++++
drivers/cpufreq/cpufreq.c | 4 +--
include/linux/arch_topology.h | 7 ++++++
include/linux/cpufreq.h | 9 +++++++
include/linux/energy_model.h | 14 ++++++++---
kernel/sched/core.c | 2 +-
kernel/sched/cpudeadline.c | 2 +-
kernel/sched/cpufreq_schedutil.c | 26 +++++++++++++++++--
kernel/sched/deadline.c | 4 +--
kernel/sched/fair.c | 18 ++++++--------
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 6 -----
kernel/sched/topology.c | 7 ++++--
18 files changed, 113 insertions(+), 51 deletions(-)
--
2.34.1
cpuinfo.max_freq can change at runtime, because of boost for example. This
implies that the value could be different from the frequency that has been
used to compute the capacity of a CPU.
The new arch_scale_freq_ref() returns a fixed and coherent frequency
that can be used to compute the capacity for a given frequency.
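To illustrate why the third argument matters, the frequency scale factor
derived by the arch_set_freq_scale() implementation is roughly the ratio
below (a simplified sketch modelled on topology_set_freq_scale(), not the
exact code):

	/* scale factor in the [0 .. SCHED_CAPACITY_SCALE] range */
	scale = (cur_freq << SCHED_CAPACITY_SHIFT) / ref_freq;

With a runtime-variable cpuinfo.max_freq as the denominator, the same
cur_freq could yield different scale factors over time; arch_scale_freq_ref()
keeps the reference fixed and aligned with the one used to compute the CPU
capacity.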
Signed-off-by: Vincent Guittot <[email protected]>
---
drivers/cpufreq/cpufreq.c | 4 ++--
include/linux/cpufreq.h | 9 +++++++++
2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 60ed89000e82..8c4f9c2f9c44 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -454,7 +454,7 @@ void cpufreq_freq_transition_end(struct cpufreq_policy *policy,
arch_set_freq_scale(policy->related_cpus,
policy->cur,
- policy->cpuinfo.max_freq);
+ arch_scale_freq_ref(policy->cpu));
spin_lock(&policy->transition_lock);
policy->transition_ongoing = false;
@@ -2174,7 +2174,7 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
policy->cur = freq;
arch_set_freq_scale(policy->related_cpus, freq,
- policy->cpuinfo.max_freq);
+ arch_scale_freq_ref(policy->cpu));
cpufreq_stats_record_transition(policy, freq);
if (trace_cpu_frequency_enabled()) {
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 71d186d6933a..bbc483b4b6e5 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -1211,6 +1211,15 @@ void arch_set_freq_scale(const struct cpumask *cpus,
{
}
#endif
+
+#ifndef arch_scale_freq_ref
+static __always_inline
+unsigned int arch_scale_freq_ref(int cpu)
+{
+ return 0;
+}
+#endif
+
/* the following are really really optional */
extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs;
extern struct freq_attr cpufreq_freq_attr_scaling_boost_freqs;
--
2.34.1
Remove struct rq cpu_capacity_orig field and use arch_scale_cpu_capacity()
instead.
The scheduler uses 3 methods to get access to a CPU's max compute capacity:
- arch_scale_cpu_capacity(cpu) which is the default way to get CPU's capacity.
- cpu_capacity_orig field which is periodically updated with
arch_scale_cpu_capacity().
- capacity_orig_of(cpu) which encapsulates rq->cpu_capacity_orig.
There is no real need to save the value returned by arch_scale_cpu_capacity()
in struct rq. arch_scale_cpu_capacity() returns:
- either a per_cpu variable.
- or a const value for systems which have only one capacity.
Remove cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere.
No functional changes.
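For context, the two forms mentioned above look roughly like this (a sketch
of the existing helpers, not the exact kernel code):

/* default, for systems with a single capacity value */
static __always_inline unsigned long arch_scale_cpu_capacity(int cpu)
{
	return SCHED_CAPACITY_SCALE;
}

/* arch override, e.g. arm64 maps it to topology_get_cpu_scale() */
static inline unsigned long topology_get_cpu_scale(int cpu)
{
	return per_cpu(cpu_scale, cpu);
}

Either way the value is cheap to read directly, so caching it in struct rq
brings nothing.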
Some performance tests on Arm64:
- small SMP device (hikey): no noticeable changes
- HMP device (RB5): hackbench shows minor improvement (1-2%)
- large SMP (thx2): hackbench and tbench show minor improvement (1%)
Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
---
Documentation/scheduler/sched-capacity.rst | 13 +++++++------
kernel/sched/core.c | 2 +-
kernel/sched/cpudeadline.c | 2 +-
kernel/sched/deadline.c | 4 ++--
kernel/sched/fair.c | 18 ++++++++----------
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 6 ------
kernel/sched/topology.c | 7 +++++--
8 files changed, 25 insertions(+), 29 deletions(-)
diff --git a/Documentation/scheduler/sched-capacity.rst b/Documentation/scheduler/sched-capacity.rst
index e2c1cf743158..de414b33dd2a 100644
--- a/Documentation/scheduler/sched-capacity.rst
+++ b/Documentation/scheduler/sched-capacity.rst
@@ -39,14 +39,15 @@ per Hz, leading to::
-------------------
Two different capacity values are used within the scheduler. A CPU's
-``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
-attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to
-which some loss of available performance (e.g. time spent handling IRQs) is
-subtracted.
+``original capacity`` is its maximum attainable capacity, i.e. its maximum
+attainable performance level. This original capacity is returned by
+the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original
+capacity`` to which some loss of available performance (e.g. time spent
+handling IRQs) is subtracted.
Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
-while ``capacity_orig`` is class-agnostic. The rest of this document will use
-the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
+while ``original capacity`` is class-agnostic. The rest of this document will use
+the term ``capacity`` interchangeably with ``original capacity`` for the sake of
brevity.
1.3 Platform examples
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cf6d3fdd4eb5..a3f9cd52eec5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9929,7 +9929,7 @@ void __init sched_init(void)
#ifdef CONFIG_SMP
rq->sd = NULL;
rq->rd = NULL;
- rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
+ rq->cpu_capacity = SCHED_CAPACITY_SCALE;
rq->balance_callback = &balance_push_callback;
rq->active_balance = 0;
rq->next_balance = jiffies;
diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 57c92d751bcd..95baa12a1029 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -131,7 +131,7 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
if (!dl_task_fits_capacity(p, cpu)) {
cpumask_clear_cpu(cpu, later_mask);
- cap = capacity_orig_of(cpu);
+ cap = arch_scale_cpu_capacity(cpu);
if (cap > max_cap ||
(cpu == task_cpu(p) && cap == max_cap)) {
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d98408a274e5..7039a8d5ae9b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -132,7 +132,7 @@ static inline unsigned long __dl_bw_capacity(const struct cpumask *mask)
int i;
for_each_cpu_and(i, mask, cpu_active_mask)
- cap += capacity_orig_of(i);
+ cap += arch_scale_cpu_capacity(i);
return cap;
}
@@ -144,7 +144,7 @@ static inline unsigned long __dl_bw_capacity(const struct cpumask *mask)
static inline unsigned long dl_bw_capacity(int i)
{
if (!sched_asym_cpucap_active() &&
- capacity_orig_of(i) == SCHED_CAPACITY_SCALE) {
+ arch_scale_cpu_capacity(i) == SCHED_CAPACITY_SCALE) {
return dl_bw_cpus(i) << SCHED_CAPACITY_SHIFT;
} else {
RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04fbcbda97d5..7e2027c810e7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4713,7 +4713,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
* To avoid overestimation of actual task utilization, skip updates if
* we cannot grant there is idle time in this CPU.
*/
- if (task_util(p) > capacity_orig_of(cpu_of(rq_of(cfs_rq))))
+ if (task_util(p) > arch_scale_cpu_capacity(cpu_of(rq_of(cfs_rq))))
return;
/*
@@ -4761,14 +4761,14 @@ static inline int util_fits_cpu(unsigned long util,
return fits;
/*
- * We must use capacity_orig_of() for comparing against uclamp_min and
+ * We must use arch_scale_cpu_capacity() for comparing against uclamp_min and
* uclamp_max. We only care about capacity pressure (by using
* capacity_of()) for comparing against the real util.
*
* If a task is boosted to 1024 for example, we don't want a tiny
* pressure to skew the check whether it fits a CPU or not.
*
- * Similarly if a task is capped to capacity_orig_of(little_cpu), it
+ * Similarly if a task is capped to arch_scale_cpu_capacity(little_cpu), it
* should fit a little cpu even if there's some pressure.
*
* Only exception is for thermal pressure since it has a direct impact
@@ -4780,7 +4780,7 @@ static inline int util_fits_cpu(unsigned long util,
* For uclamp_max, we can tolerate a drop in performance level as the
* goal is to cap the task. So it's okay if it's getting less.
*/
- capacity_orig = capacity_orig_of(cpu);
+ capacity_orig = arch_scale_cpu_capacity(cpu);
capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
/*
@@ -7261,7 +7261,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
* Look for the CPU with best capacity.
*/
else if (fits < 0)
- cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
+ cpu_cap = arch_scale_cpu_capacity(cpu) - thermal_load_avg(cpu_rq(cpu));
/*
* First, select CPU which fits better (-1 being better than 0).
@@ -7503,7 +7503,7 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
util = max(util, util_est);
}
- return min(util, capacity_orig_of(cpu));
+ return min(util, arch_scale_cpu_capacity(cpu));
}
unsigned long cpu_util_cfs(int cpu)
@@ -9294,8 +9294,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
unsigned long capacity = scale_rt_capacity(cpu);
struct sched_group *sdg = sd->groups;
- cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu);
-
if (!capacity)
capacity = 1;
@@ -9371,7 +9369,7 @@ static inline int
check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
{
return ((rq->cpu_capacity * sd->imbalance_pct) <
- (rq->cpu_capacity_orig * 100));
+ (arch_scale_cpu_capacity(cpu_of(rq)) * 100));
}
/*
@@ -9382,7 +9380,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
{
return rq->misfit_task_load &&
- (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
+ (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
check_cpu_capacity(rq, sd));
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 88fc98601413..72f0a0767059 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -471,7 +471,7 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
min_cap = uclamp_eff_value(p, UCLAMP_MIN);
max_cap = uclamp_eff_value(p, UCLAMP_MAX);
- cpu_cap = capacity_orig_of(cpu);
+ cpu_cap = arch_scale_cpu_capacity(cpu);
return cpu_cap >= min(min_cap, max_cap);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 649eb9ec0657..74195eb39eaa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1033,7 +1033,6 @@ struct rq {
struct sched_domain __rcu *sd;
unsigned long cpu_capacity;
- unsigned long cpu_capacity_orig;
struct balance_callback *balance_callback;
@@ -2967,11 +2966,6 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif
#ifdef CONFIG_SMP
-static inline unsigned long capacity_orig_of(int cpu)
-{
- return cpu_rq(cpu)->cpu_capacity_orig;
-}
-
/**
* enum cpu_util_type - CPU utilization type
* @FREQUENCY_UTIL: Utilization used to select frequency
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a7b50bba7829..1cc595907363 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2488,12 +2488,15 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* Attach the domains */
rcu_read_lock();
for_each_cpu(i, cpu_map) {
+ unsigned long capacity;
+
rq = cpu_rq(i);
sd = *per_cpu_ptr(d.sd, i);
+ capacity = arch_scale_cpu_capacity(i);
/* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */
- if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
- WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
+ if (capacity > READ_ONCE(d.rd->max_cpu_capacity))
+ WRITE_ONCE(d.rd->max_cpu_capacity, capacity);
cpu_attach_domain(sd, d.rd, i);
}
--
2.34.1
Create a new method to get a unique and fixed max frequency. Currently
cpuinfo.max_freq or the highest (or last) state of the performance domain is
used as the max frequency when computing the frequency for a level of
utilization, but:
- cpuinfo.max_freq can change at runtime; boost is one example of such
a change.
- cpuinfo.max_freq and the last item of the PD can be different, leading to
different results between cpufreq and the energy model.
We need to save the reference frequency that has been used when computing
the CPUs capacity and use this fixed and coherent value to convert between
frequency and CPU's capacity.
In fact, we already save the frequency that has been used when computing
the capacity of each CPU. We extend the precision to save kHz instead of
MHz as currently done, and we modify the type to be aligned with the other
variables used when converting frequency to capacity and vice versa.
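As a rough worked example of how this reference frequency feeds the capacity
computation (illustrative numbers only): with a big CPU of raw capacity 1024
at capacity_ref_freq = 2000000 kHz and a little CPU of raw capacity 512 at
1800000 kHz, the products are 2048000000 and 921600000; after normalization
to SCHED_CAPACITY_SCALE the big CPU gets 1024 and the little one about 460.
Saving the reference in kHz instead of MHz only improves the precision of
this scaling, it does not change its logic.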
Signed-off-by: Vincent Guittot <[email protected]>
---
arch/arm/include/asm/topology.h | 1 +
arch/arm64/include/asm/topology.h | 1 +
arch/riscv/include/asm/topology.h | 1 +
drivers/base/arch_topology.c | 29 ++++++++++++++---------------
include/linux/arch_topology.h | 7 +++++++
5 files changed, 24 insertions(+), 15 deletions(-)
diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index c7d2510e5a78..853c4f81ba4a 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -13,6 +13,7 @@
#define arch_set_freq_scale topology_set_freq_scale
#define arch_scale_freq_capacity topology_get_freq_scale
#define arch_scale_freq_invariant topology_scale_freq_invariant
+#define arch_scale_freq_ref topology_get_freq_ref
#endif
/* Replace task scheduler's default cpu-invariant accounting */
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index 9fab663dd2de..a323b109b9c4 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -23,6 +23,7 @@ void update_freq_counters_refs(void);
#define arch_set_freq_scale topology_set_freq_scale
#define arch_scale_freq_capacity topology_get_freq_scale
#define arch_scale_freq_invariant topology_scale_freq_invariant
+#define arch_scale_freq_ref topology_get_freq_ref
#ifdef CONFIG_ACPI_CPPC_LIB
#define arch_init_invariance_cppc topology_init_cpu_capacity_cppc
diff --git a/arch/riscv/include/asm/topology.h b/arch/riscv/include/asm/topology.h
index e316ab3b77f3..61183688bdd5 100644
--- a/arch/riscv/include/asm/topology.h
+++ b/arch/riscv/include/asm/topology.h
@@ -9,6 +9,7 @@
#define arch_set_freq_scale topology_set_freq_scale
#define arch_scale_freq_capacity topology_get_freq_scale
#define arch_scale_freq_invariant topology_scale_freq_invariant
+#define arch_scale_freq_ref topology_get_freq_ref
/* Replace task scheduler's default cpu-invariant accounting */
#define arch_scale_cpu_capacity topology_get_cpu_scale
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index b741b5ba82bd..9a073c2d2086 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -19,6 +19,7 @@
#include <linux/init.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
+#include <linux/units.h>
#define CREATE_TRACE_POINTS
#include <trace/events/thermal_pressure.h>
@@ -26,7 +27,8 @@
static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
static struct cpumask scale_freq_counters_mask;
static bool scale_freq_invariant;
-static DEFINE_PER_CPU(u32, freq_factor) = 1;
+DEFINE_PER_CPU(unsigned long, capacity_ref_freq) = 1;
+EXPORT_PER_CPU_SYMBOL_GPL(capacity_ref_freq);
static bool supports_scale_freq_counters(const struct cpumask *cpus)
{
@@ -170,9 +172,9 @@ DEFINE_PER_CPU(unsigned long, thermal_pressure);
* operating on stale data when hot-plug is used for some CPUs. The
* @capped_freq reflects the currently allowed max CPUs frequency due to
* thermal capping. It might be also a boost frequency value, which is bigger
- * than the internal 'freq_factor' max frequency. In such case the pressure
- * value should simply be removed, since this is an indication that there is
- * no thermal throttling. The @capped_freq must be provided in kHz.
+ * than the internal 'capacity_ref_freq' max frequency. In such case the
+ * pressure value should simply be removed, since this is an indication that
+ * there is no thermal throttling. The @capped_freq must be provided in kHz.
*/
void topology_update_thermal_pressure(const struct cpumask *cpus,
unsigned long capped_freq)
@@ -183,10 +185,7 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
cpu = cpumask_first(cpus);
max_capacity = arch_scale_cpu_capacity(cpu);
- max_freq = per_cpu(freq_factor, cpu);
-
- /* Convert to MHz scale which is used in 'freq_factor' */
- capped_freq /= 1000;
+ max_freq = arch_scale_freq_ref(cpu);
/*
* Handle properly the boost frequencies, which should simply clean
@@ -279,13 +278,13 @@ void topology_normalize_cpu_scale(void)
capacity_scale = 1;
for_each_possible_cpu(cpu) {
- capacity = raw_capacity[cpu] * per_cpu(freq_factor, cpu);
+ capacity = raw_capacity[cpu] * per_cpu(capacity_ref_freq, cpu);
capacity_scale = max(capacity, capacity_scale);
}
pr_debug("cpu_capacity: capacity_scale=%llu\n", capacity_scale);
for_each_possible_cpu(cpu) {
- capacity = raw_capacity[cpu] * per_cpu(freq_factor, cpu);
+ capacity = raw_capacity[cpu] * per_cpu(capacity_ref_freq, cpu);
capacity = div64_u64(capacity << SCHED_CAPACITY_SHIFT,
capacity_scale);
topology_set_cpu_scale(cpu, capacity);
@@ -321,15 +320,15 @@ bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
cpu_node, raw_capacity[cpu]);
/*
- * Update freq_factor for calculating early boot cpu capacities.
+ * Update capacity_ref_freq for calculating early boot cpu capacities.
* For non-clk CPU DVFS mechanism, there's no way to get the
* frequency value now, assuming they are running at the same
- * frequency (by keeping the initial freq_factor value).
+ * frequency (by keeping the initial capacity_ref_freq value).
*/
cpu_clk = of_clk_get(cpu_node, 0);
if (!PTR_ERR_OR_ZERO(cpu_clk)) {
- per_cpu(freq_factor, cpu) =
- clk_get_rate(cpu_clk) / 1000;
+ per_cpu(capacity_ref_freq, cpu) =
+ clk_get_rate(cpu_clk) / HZ_PER_KHZ;
clk_put(cpu_clk);
}
} else {
@@ -411,7 +410,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus);
for_each_cpu(cpu, policy->related_cpus)
- per_cpu(freq_factor, cpu) = policy->cpuinfo.max_freq / 1000;
+ per_cpu(capacity_ref_freq, cpu) = policy->cpuinfo.max_freq;
if (cpumask_empty(cpus_to_visit)) {
topology_normalize_cpu_scale();
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index a07b510e7dc5..38ca6c76af56 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -27,6 +27,13 @@ static inline unsigned long topology_get_cpu_scale(int cpu)
void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity);
+DECLARE_PER_CPU(unsigned long, capacity_ref_freq);
+
+static inline unsigned long topology_get_freq_ref(int cpu)
+{
+ return per_cpu(capacity_ref_freq, cpu);
+}
+
DECLARE_PER_CPU(unsigned long, arch_freq_scale);
static inline unsigned long topology_get_freq_scale(int cpu)
--
2.34.1
The cppc cpufreq driver can register an artificial energy model. In such a
case, it also has to register the frequency that is used to define the CPU
capacity.
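For illustration only, the perf-to-frequency conversion used here can be
thought of as a linear scaling anchored at the nominal performance point.
This is a hypothetical simplification with made-up names, not the actual
cppc_cpufreq_perf_to_khz() implementation:

static unsigned int example_perf_to_khz(unsigned int perf,
					unsigned int nominal_perf,
					unsigned int nominal_freq_khz)
{
	/* assume frequency scales linearly with CPPC performance units */
	return (unsigned int)((u64)perf * nominal_freq_khz / nominal_perf);
}

Registering highest_perf converted to kHz therefore records the frequency at
which the CPU reaches its maximum capacity.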
Signed-off-by: Vincent Guittot <[email protected]>
---
drivers/cpufreq/cppc_cpufreq.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index fe08ca419b3d..24c6ba349f01 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -636,6 +636,21 @@ static int populate_efficiency_class(void)
return 0;
}
+
+static void cppc_cpufreq_set_capacity_ref_freq(struct cpufreq_policy *policy)
+{
+ struct cppc_perf_caps *perf_caps;
+ struct cppc_cpudata *cpu_data;
+ unsigned int ref_freq;
+
+ cpu_data = policy->driver_data;
+ perf_caps = &cpu_data->perf_caps;
+
+ ref_freq = cppc_cpufreq_perf_to_khz(cpu_data, perf_caps->highest_perf);
+
+ per_cpu(capacity_ref_freq, policy->cpu) = ref_freq;
+}
+
static void cppc_cpufreq_register_em(struct cpufreq_policy *policy)
{
struct cppc_cpudata *cpu_data;
@@ -643,6 +658,9 @@ static void cppc_cpufreq_register_em(struct cpufreq_policy *policy)
EM_ADV_DATA_CB(cppc_get_cpu_power, cppc_get_cpu_cost);
cpu_data = policy->driver_data;
+
+ cppc_cpufreq_set_capacity_ref_freq(policy);
+
em_dev_register_perf_domain(get_cpu_device(policy->cpu),
get_perf_level_count(policy), &em_cb,
cpu_data->shared_cpu_map, 0);
--
2.34.1
The last item of a performance domain is not always the performance point
that has been used to compute the CPU's capacity. This can lead to a
different target frequency compared with other parts of the system, like
schedutil, and would result in a wrong energy estimation.
A new arch_scale_freq_ref() is available to return a fixed and coherent
reference frequency that can be used when computing the CPU's frequency
for a level of utilization. Use this function to get this reference
frequency.
The energy model is never used without arch_scale_freq_ref() being defined,
but it can still be compiled. Define a default arch_scale_freq_ref()
returning 0 in such a case.
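As a rough illustration of the mismatch being fixed (numbers are made up):
with scale_cpu = 1024 and util = 512, schedutil using a 2000000 kHz reference
requests 1000000 kHz, while em_cpu_energy() using a 2200000 kHz last
performance state would compute 1100000 kHz and could therefore cost the
energy of a higher performance state than the one actually selected. Using
the same arch_scale_freq_ref() value on both sides removes that divergence.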
Signed-off-by: Vincent Guittot <[email protected]>
---
include/linux/energy_model.h | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index b9caa01dfac4..1b0c8490d4bd 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -204,6 +204,14 @@ struct em_perf_state *em_pd_get_efficient_state(struct em_perf_domain *pd,
return ps;
}
+#ifndef arch_scale_freq_ref
+static __always_inline
+unsigned int arch_scale_freq_ref(int cpu)
+{
+ return 0;
+}
+#endif
+
/**
* em_cpu_energy() - Estimates the energy consumed by the CPUs of a
* performance domain
@@ -224,7 +232,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
unsigned long max_util, unsigned long sum_util,
unsigned long allowed_cpu_cap)
{
- unsigned long freq, scale_cpu;
+ unsigned long freq, ref_freq, scale_cpu;
struct em_perf_state *ps;
int cpu;
@@ -241,11 +249,11 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
*/
cpu = cpumask_first(to_cpumask(pd->cpus));
scale_cpu = arch_scale_cpu_capacity(cpu);
- ps = &pd->table[pd->nr_perf_states - 1];
+ ref_freq = arch_scale_freq_ref(cpu);
max_util = map_util_perf(max_util);
max_util = min(max_util, allowed_cpu_cap);
- freq = map_util_freq(max_util, ps->frequency, scale_cpu);
+ freq = map_util_freq(max_util, ref_freq, scale_cpu);
/*
* Find the lowest performance state of the Energy Model above the
--
2.34.1
cpuinfo.max_freq can change at runtime, because of boost for example. This
implies that the value could be different from the one that has been
used when computing the capacity of a CPU.
The new arch_scale_freq_ref() returns a fixed and coherent reference
frequency that can be used when computing a frequency based on utilization.
Use this arch_scale_freq_ref() when available and fall back to the policy's
frequencies otherwise.
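To illustrate with made-up numbers: if the capacity of a CPU was computed at
2000000 kHz, util = 512 on a 1024-capacity CPU maps to 1000000 kHz. If boost
later raises cpuinfo.max_freq to 2200000 kHz, the current code would request
1100000 kHz for the same utilization. With arch_scale_freq_ref() returning
the fixed 2000000 kHz reference, the request stays at 1000000 kHz; when the
arch does not provide a reference, the previous behaviour (cpuinfo.max_freq
or policy->cur) is kept as a fallback.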
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 4492608b7d7f..1fa7e74add8f 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -114,6 +114,28 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy)
}
}
+/**
+ * get_capacity_ref_freq - get the reference frequency of a given CPU that
+ * has been used to correlate frequency and compute capacity.
+ * @policy: the cpufreq policy of the CPU in question.
+ *
+ * Return: the reference CPU frequency to compute a capacity.
+ */
+static __always_inline
+unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
+{
+ unsigned int freq = arch_scale_freq_ref(policy->cpu);
+
+ if (freq)
+ return freq;
+
+ if (arch_scale_freq_invariant())
+ return policy->cpuinfo.max_freq;
+
+ return policy->cur;
+}
+
/**
* get_next_freq - Compute a new frequency for a given cpufreq policy.
* @sg_policy: schedutil policy object to compute the new frequency for.
@@ -140,10 +162,10 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
unsigned long util, unsigned long max)
{
struct cpufreq_policy *policy = sg_policy->policy;
- unsigned int freq = arch_scale_freq_invariant() ?
- policy->cpuinfo.max_freq : policy->cur;
+ unsigned int freq;
util = map_util_perf(util);
+ freq = get_capacity_ref_freq(policy);
freq = map_util_freq(util, freq, max);
if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
--
2.34.1
* Vincent Guittot <[email protected]> wrote:
> Remove struct rq cpu_capacity_orig field and use arch_scale_cpu_capacity()
> instead.
>
> Scheduler uses 3 methods to get access to the CPU's max compute capacity:
> - arch_scale_cpu_capacity(cpu) which is the default way to get CPU's capacity.
> - cpu_capacity_orig field which is periodically updated with
> arch_scale_cpu_capacity().
> - capacity_orig_of(cpu) which encapsulates rq->cpu_capacity_orig.
>
> There is no real need to save the value returned by arch_scale_cpu_capacity()
> in struct rq. arch_scale_cpu_capacity() returns:
> - either a per_cpu variable.
> - or a const value for systems which have only one capacity.
>
> Remove cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere.
>
> No functional changes.
>
> some tests of Arm64
> small SMP device (hikey): no noticeable changes
> HMP device (RB5): hackbench shows minor improvement (1-2%)
> large smp (thx2): hackbench and tbench shows minor improvement (1%)
>
> Signed-off-by: Vincent Guittot <[email protected]>
> Reviewed-by: Dietmar Eggemann <[email protected]>
>
> ---
> Documentation/scheduler/sched-capacity.rst | 13 +++++++------
> kernel/sched/core.c | 2 +-
> kernel/sched/cpudeadline.c | 2 +-
> kernel/sched/deadline.c | 4 ++--
> kernel/sched/fair.c | 18 ++++++++----------
> kernel/sched/rt.c | 2 +-
> kernel/sched/sched.h | 6 ------
> kernel/sched/topology.c | 7 +++++--
> 8 files changed, 25 insertions(+), 29 deletions(-)
I've applied patch #1 to tip:sched/core, thanks!
Ingo
On 09-10-23, 12:36, Vincent Guittot wrote:
> cpuinfo.max_freq can change at runtime because of boost as an example. This
> implies that the value could be different from the frequency that has been
> used to compute the capacity of a CPU.
>
> The new arch_scale_freq_ref() returns a fixed and coherent frequency
> that can be used to compute the capacity for a given frequency.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> drivers/cpufreq/cpufreq.c | 4 ++--
> include/linux/cpufreq.h | 9 +++++++++
> 2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 60ed89000e82..8c4f9c2f9c44 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -454,7 +454,7 @@ void cpufreq_freq_transition_end(struct cpufreq_policy *policy,
>
> arch_set_freq_scale(policy->related_cpus,
> policy->cur,
> - policy->cpuinfo.max_freq);
> + arch_scale_freq_ref(policy->cpu));
>
> spin_lock(&policy->transition_lock);
> policy->transition_ongoing = false;
> @@ -2174,7 +2174,7 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
>
> policy->cur = freq;
> arch_set_freq_scale(policy->related_cpus, freq,
> - policy->cpuinfo.max_freq);
> + arch_scale_freq_ref(policy->cpu));
> cpufreq_stats_record_transition(policy, freq);
>
> if (trace_cpu_frequency_enabled()) {
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 71d186d6933a..bbc483b4b6e5 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -1211,6 +1211,15 @@ void arch_set_freq_scale(const struct cpumask *cpus,
> {
> }
> #endif
> +
> +#ifndef arch_scale_freq_ref
> +static __always_inline
> +unsigned int arch_scale_freq_ref(int cpu)
> +{
> + return 0;
> +}
> +#endif
> +
> /* the following are really really optional */
> extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs;
> extern struct freq_attr cpufreq_freq_attr_scaling_boost_freqs;
Acked-by: Viresh Kumar <[email protected]>
--
viresh
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 7bc263840bc3377186cb06b003ac287bb2f18ce2
Gitweb: https://git.kernel.org/tip/7bc263840bc3377186cb06b003ac287bb2f18ce2
Author: Vincent Guittot <[email protected]>
AuthorDate: Mon, 09 Oct 2023 12:36:16 +02:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Mon, 09 Oct 2023 12:59:48 +02:00
sched/topology: Consolidate and clean up access to a CPU's max compute capacity
Remove the rq::cpu_capacity_orig field and use arch_scale_cpu_capacity()
instead.
The scheduler uses 3 methods to get access to a CPU's max compute capacity:
- arch_scale_cpu_capacity(cpu) which is the default way to get a CPU's capacity.
- cpu_capacity_orig field which is periodically updated with
arch_scale_cpu_capacity().
- capacity_orig_of(cpu) which encapsulates rq->cpu_capacity_orig.
There is no real need to save the value returned by arch_scale_cpu_capacity()
in struct rq. arch_scale_cpu_capacity() returns:
- either a per_cpu variable.
- or a const value for systems which have only one capacity.
Remove rq::cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere.
No functional changes.
Some performance tests on Arm64:
- small SMP device (hikey): no noticeable changes
- HMP device (RB5): hackbench shows minor improvement (1-2%)
- large smp (thx2): hackbench and tbench shows minor improvement (1%)
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
Documentation/scheduler/sched-capacity.rst | 13 +++++++------
kernel/sched/core.c | 2 +-
kernel/sched/cpudeadline.c | 2 +-
kernel/sched/deadline.c | 4 ++--
kernel/sched/fair.c | 18 ++++++++----------
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 6 ------
kernel/sched/topology.c | 7 +++++--
8 files changed, 25 insertions(+), 29 deletions(-)
diff --git a/Documentation/scheduler/sched-capacity.rst b/Documentation/scheduler/sched-capacity.rst
index e2c1cf7..de414b3 100644
--- a/Documentation/scheduler/sched-capacity.rst
+++ b/Documentation/scheduler/sched-capacity.rst
@@ -39,14 +39,15 @@ per Hz, leading to::
-------------------
Two different capacity values are used within the scheduler. A CPU's
-``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
-attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to
-which some loss of available performance (e.g. time spent handling IRQs) is
-subtracted.
+``original capacity`` is its maximum attainable capacity, i.e. its maximum
+attainable performance level. This original capacity is returned by
+the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original
+capacity`` to which some loss of available performance (e.g. time spent
+handling IRQs) is subtracted.
Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
-while ``capacity_orig`` is class-agnostic. The rest of this document will use
-the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
+while ``original capacity`` is class-agnostic. The rest of this document will use
+the term ``capacity`` interchangeably with ``original capacity`` for the sake of
brevity.
1.3 Platform examples
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cf6d3fd..a3f9cd5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9929,7 +9929,7 @@ void __init sched_init(void)
#ifdef CONFIG_SMP
rq->sd = NULL;
rq->rd = NULL;
- rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
+ rq->cpu_capacity = SCHED_CAPACITY_SCALE;
rq->balance_callback = &balance_push_callback;
rq->active_balance = 0;
rq->next_balance = jiffies;
diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 57c92d7..95baa12 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -131,7 +131,7 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
if (!dl_task_fits_capacity(p, cpu)) {
cpumask_clear_cpu(cpu, later_mask);
- cap = capacity_orig_of(cpu);
+ cap = arch_scale_cpu_capacity(cpu);
if (cap > max_cap ||
(cpu == task_cpu(p) && cap == max_cap)) {
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d98408a..7039a8d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -132,7 +132,7 @@ static inline unsigned long __dl_bw_capacity(const struct cpumask *mask)
int i;
for_each_cpu_and(i, mask, cpu_active_mask)
- cap += capacity_orig_of(i);
+ cap += arch_scale_cpu_capacity(i);
return cap;
}
@@ -144,7 +144,7 @@ static inline unsigned long __dl_bw_capacity(const struct cpumask *mask)
static inline unsigned long dl_bw_capacity(int i)
{
if (!sched_asym_cpucap_active() &&
- capacity_orig_of(i) == SCHED_CAPACITY_SCALE) {
+ arch_scale_cpu_capacity(i) == SCHED_CAPACITY_SCALE) {
return dl_bw_cpus(i) << SCHED_CAPACITY_SHIFT;
} else {
RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 19bb4ac..e7c1baf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4669,7 +4669,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
* To avoid overestimation of actual task utilization, skip updates if
* we cannot grant there is idle time in this CPU.
*/
- if (task_util(p) > capacity_orig_of(cpu_of(rq_of(cfs_rq))))
+ if (task_util(p) > arch_scale_cpu_capacity(cpu_of(rq_of(cfs_rq))))
return;
/*
@@ -4717,14 +4717,14 @@ static inline int util_fits_cpu(unsigned long util,
return fits;
/*
- * We must use capacity_orig_of() for comparing against uclamp_min and
+ * We must use arch_scale_cpu_capacity() for comparing against uclamp_min and
* uclamp_max. We only care about capacity pressure (by using
* capacity_of()) for comparing against the real util.
*
* If a task is boosted to 1024 for example, we don't want a tiny
* pressure to skew the check whether it fits a CPU or not.
*
- * Similarly if a task is capped to capacity_orig_of(little_cpu), it
+ * Similarly if a task is capped to arch_scale_cpu_capacity(little_cpu), it
* should fit a little cpu even if there's some pressure.
*
* Only exception is for thermal pressure since it has a direct impact
@@ -4736,7 +4736,7 @@ static inline int util_fits_cpu(unsigned long util,
* For uclamp_max, we can tolerate a drop in performance level as the
* goal is to cap the task. So it's okay if it's getting less.
*/
- capacity_orig = capacity_orig_of(cpu);
+ capacity_orig = arch_scale_cpu_capacity(cpu);
capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
/*
@@ -7217,7 +7217,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
* Look for the CPU with best capacity.
*/
else if (fits < 0)
- cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
+ cpu_cap = arch_scale_cpu_capacity(cpu) - thermal_load_avg(cpu_rq(cpu));
/*
* First, select CPU which fits better (-1 being better than 0).
@@ -7459,7 +7459,7 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
util = max(util, util_est);
}
- return min(util, capacity_orig_of(cpu));
+ return min(util, arch_scale_cpu_capacity(cpu));
}
unsigned long cpu_util_cfs(int cpu)
@@ -9250,8 +9250,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
unsigned long capacity = scale_rt_capacity(cpu);
struct sched_group *sdg = sd->groups;
- cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu);
-
if (!capacity)
capacity = 1;
@@ -9327,7 +9325,7 @@ static inline int
check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
{
return ((rq->cpu_capacity * sd->imbalance_pct) <
- (rq->cpu_capacity_orig * 100));
+ (arch_scale_cpu_capacity(cpu_of(rq)) * 100));
}
/*
@@ -9338,7 +9336,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
{
return rq->misfit_task_load &&
- (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
+ (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
check_cpu_capacity(rq, sd));
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 76d82a0..e93b69e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -471,7 +471,7 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
min_cap = uclamp_eff_value(p, UCLAMP_MIN);
max_cap = uclamp_eff_value(p, UCLAMP_MAX);
- cpu_cap = capacity_orig_of(cpu);
+ cpu_cap = arch_scale_cpu_capacity(cpu);
return cpu_cap >= min(min_cap, max_cap);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 515eb4c..7e7fedc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1033,7 +1033,6 @@ struct rq {
struct sched_domain __rcu *sd;
unsigned long cpu_capacity;
- unsigned long cpu_capacity_orig;
struct balance_callback *balance_callback;
@@ -2967,11 +2966,6 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif
#ifdef CONFIG_SMP
-static inline unsigned long capacity_orig_of(int cpu)
-{
- return cpu_rq(cpu)->cpu_capacity_orig;
-}
-
/**
* enum cpu_util_type - CPU utilization type
* @FREQUENCY_UTIL: Utilization used to select frequency
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a7b50bb..1cc5959 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2488,12 +2488,15 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* Attach the domains */
rcu_read_lock();
for_each_cpu(i, cpu_map) {
+ unsigned long capacity;
+
rq = cpu_rq(i);
sd = *per_cpu_ptr(d.sd, i);
+ capacity = arch_scale_cpu_capacity(i);
/* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */
- if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
- WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
+ if (capacity > READ_ONCE(d.rd->max_cpu_capacity))
+ WRITE_ONCE(d.rd->max_cpu_capacity, capacity);
cpu_attach_domain(sd, d.rd, i);
}
Hello Vincent,
On 10/9/23 12:36, Vincent Guittot wrote:
> Create a new method to get a unique and fixed max frequency. Currently
> cpuinfo.max_freq or the highest (or last) state of performance domain are
> used as the max frequency when computing the frequency for a level of
> utilization but:
> - cpuinfo_max_freq can change at runtime. boost is one example of
> such change.
> - cpuinfo.max_freq and last item of the PD can be different leading to
> different results between cpufreq and energy model.
>
> We need to save the reference frequency that has been used when computing
> the CPUs capacity and use this fixed and coherent value to convert between
> frequency and CPU's capacity.
>
> In fact, we already save the frequency that has been used when computing
> the capacity of each CPU. We extend the precision to save khZ instead of
> Mhz currently and we modify the type to be aligned with other variables
> used when converting frequency to capacity and the other way.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> arch/arm/include/asm/topology.h | 1 +
> arch/arm64/include/asm/topology.h | 1 +
> arch/riscv/include/asm/topology.h | 1 +
> drivers/base/arch_topology.c | 29 ++++++++++++++---------------
> include/linux/arch_topology.h | 7 +++++++
> 5 files changed, 24 insertions(+), 15 deletions(-)
>
> diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> index c7d2510e5a78..853c4f81ba4a 100644
> --- a/arch/arm/include/asm/topology.h
> +++ b/arch/arm/include/asm/topology.h
> @@ -13,6 +13,7 @@
> #define arch_set_freq_scale topology_set_freq_scale
> #define arch_scale_freq_capacity topology_get_freq_scale
> #define arch_scale_freq_invariant topology_scale_freq_invariant
> +#define arch_scale_freq_ref topology_get_freq_ref
> #endif
>
> /* Replace task scheduler's default cpu-invariant accounting */
> diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
> index 9fab663dd2de..a323b109b9c4 100644
> --- a/arch/arm64/include/asm/topology.h
> +++ b/arch/arm64/include/asm/topology.h
> @@ -23,6 +23,7 @@ void update_freq_counters_refs(void);
> #define arch_set_freq_scale topology_set_freq_scale
> #define arch_scale_freq_capacity topology_get_freq_scale
> #define arch_scale_freq_invariant topology_scale_freq_invariant
> +#define arch_scale_freq_ref topology_get_freq_ref
>
> #ifdef CONFIG_ACPI_CPPC_LIB
> #define arch_init_invariance_cppc topology_init_cpu_capacity_cppc
> diff --git a/arch/riscv/include/asm/topology.h b/arch/riscv/include/asm/topology.h
> index e316ab3b77f3..61183688bdd5 100644
> --- a/arch/riscv/include/asm/topology.h
> +++ b/arch/riscv/include/asm/topology.h
> @@ -9,6 +9,7 @@
> #define arch_set_freq_scale topology_set_freq_scale
> #define arch_scale_freq_capacity topology_get_freq_scale
> #define arch_scale_freq_invariant topology_scale_freq_invariant
> +#define arch_scale_freq_ref topology_get_freq_ref
>
> /* Replace task scheduler's default cpu-invariant accounting */
> #define arch_scale_cpu_capacity topology_get_cpu_scale
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index b741b5ba82bd..9a073c2d2086 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -19,6 +19,7 @@
> #include <linux/init.h>
> #include <linux/rcupdate.h>
> #include <linux/sched.h>
> +#include <linux/units.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/thermal_pressure.h>
> @@ -26,7 +27,8 @@
> static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
> static struct cpumask scale_freq_counters_mask;
> static bool scale_freq_invariant;
> -static DEFINE_PER_CPU(u32, freq_factor) = 1;
> +DEFINE_PER_CPU(unsigned long, capacity_ref_freq) = 1;
> +EXPORT_PER_CPU_SYMBOL_GPL(capacity_ref_freq);
>
> static bool supports_scale_freq_counters(const struct cpumask *cpus)
> {
> @@ -170,9 +172,9 @@ DEFINE_PER_CPU(unsigned long, thermal_pressure);
> * operating on stale data when hot-plug is used for some CPUs. The
> * @capped_freq reflects the currently allowed max CPUs frequency due to
> * thermal capping. It might be also a boost frequency value, which is bigger
> - * than the internal 'freq_factor' max frequency. In such case the pressure
> - * value should simply be removed, since this is an indication that there is
> - * no thermal throttling. The @capped_freq must be provided in kHz.
> + * than the internal 'capacity_ref_freq' max frequency. In such case the
> + * pressure value should simply be removed, since this is an indication that
> + * there is no thermal throttling. The @capped_freq must be provided in kHz.
> */
> void topology_update_thermal_pressure(const struct cpumask *cpus,
> unsigned long capped_freq)
> @@ -183,10 +185,7 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
>
> cpu = cpumask_first(cpus);
> max_capacity = arch_scale_cpu_capacity(cpu);
> - max_freq = per_cpu(freq_factor, cpu);
> -
> - /* Convert to MHz scale which is used in 'freq_factor' */
> - capped_freq /= 1000;
> + max_freq = arch_scale_freq_ref(cpu);
>
> /*
> * Handle properly the boost frequencies, which should simply clean
> @@ -279,13 +278,13 @@ void topology_normalize_cpu_scale(void)
>
> capacity_scale = 1;
> for_each_possible_cpu(cpu) {
> - capacity = raw_capacity[cpu] * per_cpu(freq_factor, cpu);
> + capacity = raw_capacity[cpu] * per_cpu(capacity_ref_freq, cpu);
> capacity_scale = max(capacity, capacity_scale);
> }
>
> pr_debug("cpu_capacity: capacity_scale=%llu\n", capacity_scale);
> for_each_possible_cpu(cpu) {
> - capacity = raw_capacity[cpu] * per_cpu(freq_factor, cpu);
> + capacity = raw_capacity[cpu] * per_cpu(capacity_ref_freq, cpu);
> capacity = div64_u64(capacity << SCHED_CAPACITY_SHIFT,
> capacity_scale);
> topology_set_cpu_scale(cpu, capacity);
> @@ -321,15 +320,15 @@ bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
> cpu_node, raw_capacity[cpu]);
>
> /*
> - * Update freq_factor for calculating early boot cpu capacities.
> + * Update capacity_ref_freq for calculating early boot cpu capacities.
> * For non-clk CPU DVFS mechanism, there's no way to get the
> * frequency value now, assuming they are running at the same
> - * frequency (by keeping the initial freq_factor value).
> + * frequency (by keeping the initial capacity_ref_freq value).
> */
> cpu_clk = of_clk_get(cpu_node, 0);
> if (!PTR_ERR_OR_ZERO(cpu_clk)) {
> - per_cpu(freq_factor, cpu) =
> - clk_get_rate(cpu_clk) / 1000;
> + per_cpu(capacity_ref_freq, cpu) =
> + clk_get_rate(cpu_clk) / HZ_PER_KHZ;
> clk_put(cpu_clk);
> }
> } else {
> @@ -411,7 +410,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
> cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus);
>
> for_each_cpu(cpu, policy->related_cpus)
> - per_cpu(freq_factor, cpu) = policy->cpuinfo.max_freq / 1000;
> + per_cpu(capacity_ref_freq, cpu) = policy->cpuinfo.max_freq;
It seems init_cpu_capacity_callback() is only called when a policy is created
(cf. CPUFREQ_CREATE_POLICY). This means that CPU capacities are only updated
in this specific case, missing the cases where:
- boost is enabled. 'policy->cpuinfo.max_freq' is updated in the cpufreq driver,
but it seems it doesn't have any consequence regarding CPU capacities.
- a cpufreq driver is unplugged (e.g. rmmod cpufreq_driver.ko). In this case
the CPU capacities are only updated when plugging in the driver (e.g. insmod
cpufreq_driver.ko).
Regards,
Pierre
>
> if (cpumask_empty(cpus_to_visit)) {
> topology_normalize_cpu_scale();
> diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
> index a07b510e7dc5..38ca6c76af56 100644
> --- a/include/linux/arch_topology.h
> +++ b/include/linux/arch_topology.h
> @@ -27,6 +27,13 @@ static inline unsigned long topology_get_cpu_scale(int cpu)
>
> void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity);
>
> +DECLARE_PER_CPU(unsigned long, capacity_ref_freq);
> +
> +static inline unsigned long topology_get_freq_ref(int cpu)
> +{
> + return per_cpu(capacity_ref_freq, cpu);
> +}
> +
> DECLARE_PER_CPU(unsigned long, arch_freq_scale);
>
> static inline unsigned long topology_get_freq_scale(int cpu)
Hello Vincent,
On 10/9/23 12:36, Vincent Guittot wrote:
> cppc cpufreq driver can register an artificial energy model. In such case,
> it also have to register the frequency that is used to define the CPU
> capacity
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> drivers/cpufreq/cppc_cpufreq.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> index fe08ca419b3d..24c6ba349f01 100644
> --- a/drivers/cpufreq/cppc_cpufreq.c
> +++ b/drivers/cpufreq/cppc_cpufreq.c
> @@ -636,6 +636,21 @@ static int populate_efficiency_class(void)
> return 0;
> }
>
> +
> +static void cppc_cpufreq_set_capacity_ref_freq(struct cpufreq_policy *policy)
> +{
> + struct cppc_perf_caps *perf_caps;
> + struct cppc_cpudata *cpu_data;
> + unsigned int ref_freq;
> +
> + cpu_data = policy->driver_data;
> + perf_caps = &cpu_data->perf_caps;
> +
> + ref_freq = cppc_cpufreq_perf_to_khz(cpu_data, perf_caps->highest_perf);
> +
> + per_cpu(capacity_ref_freq, policy->cpu) = ref_freq;
'capacity_ref_freq' seems to be updated only if CONFIG_ENERGY_MODEL is set. However in
[1], get_capacity_ref_freq() relies on 'capacity_ref_freq'. The cpufreq_schedutil governor
should have a valid 'capacity_ref_freq' value set even when the CPPC cpufreq
driver is used without an energy model, I believe.
Also 'capacity_ref_freq' seems to be set only for 'policy->cpu'. I believe it should
be set for the whole perf domain in case this 'policy->cpu' goes offline.
Another thing, related to my comment on [1] and to [2]: for CPPC the max
capacity matches the boosting frequency. We have:
'non-boosted max capacity' < 'boosted max capacity'.
If boosting is not enabled, the CPU utilization can still go above the
'non-boosted max capacity'. The overutilization of the system seems to be
triggered by comparing the CPU util to the 'boosted max capacity', so systems
might not be detected as overutilized.
For the EAS energy computation, em_cpu_energy() tries to predict the frequency that will
be used. It is currently unknown to the function that the frequency request will be
clamped by __resolve_freq():
get_next_freq()
\-cpufreq_driver_resolve_freq()
\-__resolve_freq()
This means that the energy computation might use boosting frequencies, which are not
available.
Regards,
Pierre
[1]: [PATCH v2 4/6] cpufreq/schedutil: use a fixed reference frequency
[2]: https://lore.kernel.org/lkml/[email protected]/
> +}
> +
> static void cppc_cpufreq_register_em(struct cpufreq_policy *policy)
> {
> struct cppc_cpudata *cpu_data;
> @@ -643,6 +658,9 @@ static void cppc_cpufreq_register_em(struct cpufreq_policy *policy)
> EM_ADV_DATA_CB(cppc_get_cpu_power, cppc_get_cpu_cost);
>
> cpu_data = policy->driver_data;
> +
> + cppc_cpufreq_set_capacity_ref_freq(policy);
> +
> em_dev_register_perf_domain(get_cpu_device(policy->cpu),
> get_perf_level_count(policy), &em_cb,
> cpu_data->shared_cpu_map, 0);
On Wed, 11 Oct 2023 at 12:27, Pierre Gondois <[email protected]> wrote:
>
> Hello Vincent,
>
> On 10/9/23 12:36, Vincent Guittot wrote:
> > Create a new method to get a unique and fixed max frequency. Currently
> > cpuinfo.max_freq or the highest (or last) state of performance domain are
> > used as the max frequency when computing the frequency for a level of
> > utilization but:
> > - cpuinfo_max_freq can change at runtime. boost is one example of
> > such change.
> > - cpuinfo.max_freq and last item of the PD can be different leading to
> > different results between cpufreq and energy model.
> >
> > We need to save the reference frequency that has been used when computing
> > the CPUs capacity and use this fixed and coherent value to convert between
> > frequency and CPU's capacity.
> >
> > In fact, we already save the frequency that has been used when computing
> > the capacity of each CPU. We extend the precision to save khZ instead of
> > Mhz currently and we modify the type to be aligned with other variables
> > used when converting frequency to capacity and the other way.
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > arch/arm/include/asm/topology.h | 1 +
> > arch/arm64/include/asm/topology.h | 1 +
> > arch/riscv/include/asm/topology.h | 1 +
> > drivers/base/arch_topology.c | 29 ++++++++++++++---------------
> > include/linux/arch_topology.h | 7 +++++++
> > 5 files changed, 24 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> > index c7d2510e5a78..853c4f81ba4a 100644
> > --- a/arch/arm/include/asm/topology.h
> > +++ b/arch/arm/include/asm/topology.h
> > @@ -13,6 +13,7 @@
> > #define arch_set_freq_scale topology_set_freq_scale
> > #define arch_scale_freq_capacity topology_get_freq_scale
> > #define arch_scale_freq_invariant topology_scale_freq_invariant
> > +#define arch_scale_freq_ref topology_get_freq_ref
> > #endif
> >
> > /* Replace task scheduler's default cpu-invariant accounting */
> > diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
> > index 9fab663dd2de..a323b109b9c4 100644
> > --- a/arch/arm64/include/asm/topology.h
> > +++ b/arch/arm64/include/asm/topology.h
> > @@ -23,6 +23,7 @@ void update_freq_counters_refs(void);
> > #define arch_set_freq_scale topology_set_freq_scale
> > #define arch_scale_freq_capacity topology_get_freq_scale
> > #define arch_scale_freq_invariant topology_scale_freq_invariant
> > +#define arch_scale_freq_ref topology_get_freq_ref
> >
> > #ifdef CONFIG_ACPI_CPPC_LIB
> > #define arch_init_invariance_cppc topology_init_cpu_capacity_cppc
> > diff --git a/arch/riscv/include/asm/topology.h b/arch/riscv/include/asm/topology.h
> > index e316ab3b77f3..61183688bdd5 100644
> > --- a/arch/riscv/include/asm/topology.h
> > +++ b/arch/riscv/include/asm/topology.h
> > @@ -9,6 +9,7 @@
> > #define arch_set_freq_scale topology_set_freq_scale
> > #define arch_scale_freq_capacity topology_get_freq_scale
> > #define arch_scale_freq_invariant topology_scale_freq_invariant
> > +#define arch_scale_freq_ref topology_get_freq_ref
> >
> > /* Replace task scheduler's default cpu-invariant accounting */
> > #define arch_scale_cpu_capacity topology_get_cpu_scale
> > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > index b741b5ba82bd..9a073c2d2086 100644
> > --- a/drivers/base/arch_topology.c
> > +++ b/drivers/base/arch_topology.c
> > @@ -19,6 +19,7 @@
> > #include <linux/init.h>
> > #include <linux/rcupdate.h>
> > #include <linux/sched.h>
> > +#include <linux/units.h>
> >
> > #define CREATE_TRACE_POINTS
> > #include <trace/events/thermal_pressure.h>
> > @@ -26,7 +27,8 @@
> > static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
> > static struct cpumask scale_freq_counters_mask;
> > static bool scale_freq_invariant;
> > -static DEFINE_PER_CPU(u32, freq_factor) = 1;
> > +DEFINE_PER_CPU(unsigned long, capacity_ref_freq) = 1;
> > +EXPORT_PER_CPU_SYMBOL_GPL(capacity_ref_freq);
> >
> > static bool supports_scale_freq_counters(const struct cpumask *cpus)
> > {
> > @@ -170,9 +172,9 @@ DEFINE_PER_CPU(unsigned long, thermal_pressure);
> > * operating on stale data when hot-plug is used for some CPUs. The
> > * @capped_freq reflects the currently allowed max CPUs frequency due to
> > * thermal capping. It might be also a boost frequency value, which is bigger
> > - * than the internal 'freq_factor' max frequency. In such case the pressure
> > - * value should simply be removed, since this is an indication that there is
> > - * no thermal throttling. The @capped_freq must be provided in kHz.
> > + * than the internal 'capacity_ref_freq' max frequency. In such case the
> > + * pressure value should simply be removed, since this is an indication that
> > + * there is no thermal throttling. The @capped_freq must be provided in kHz.
> > */
> > void topology_update_thermal_pressure(const struct cpumask *cpus,
> > unsigned long capped_freq)
> > @@ -183,10 +185,7 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
> >
> > cpu = cpumask_first(cpus);
> > max_capacity = arch_scale_cpu_capacity(cpu);
> > - max_freq = per_cpu(freq_factor, cpu);
> > -
> > - /* Convert to MHz scale which is used in 'freq_factor' */
> > - capped_freq /= 1000;
> > + max_freq = arch_scale_freq_ref(cpu);
> >
> > /*
> > * Handle properly the boost frequencies, which should simply clean
> > @@ -279,13 +278,13 @@ void topology_normalize_cpu_scale(void)
> >
> > capacity_scale = 1;
> > for_each_possible_cpu(cpu) {
> > - capacity = raw_capacity[cpu] * per_cpu(freq_factor, cpu);
> > + capacity = raw_capacity[cpu] * per_cpu(capacity_ref_freq, cpu);
> > capacity_scale = max(capacity, capacity_scale);
> > }
> >
> > pr_debug("cpu_capacity: capacity_scale=%llu\n", capacity_scale);
> > for_each_possible_cpu(cpu) {
> > - capacity = raw_capacity[cpu] * per_cpu(freq_factor, cpu);
> > + capacity = raw_capacity[cpu] * per_cpu(capacity_ref_freq, cpu);
> > capacity = div64_u64(capacity << SCHED_CAPACITY_SHIFT,
> > capacity_scale);
> > topology_set_cpu_scale(cpu, capacity);
> > @@ -321,15 +320,15 @@ bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
> > cpu_node, raw_capacity[cpu]);
> >
> > /*
> > - * Update freq_factor for calculating early boot cpu capacities.
> > + * Update capacity_ref_freq for calculating early boot cpu capacities.
> > * For non-clk CPU DVFS mechanism, there's no way to get the
> > * frequency value now, assuming they are running at the same
> > - * frequency (by keeping the initial freq_factor value).
> > + * frequency (by keeping the initial capacity_ref_freq value).
> > */
> > cpu_clk = of_clk_get(cpu_node, 0);
> > if (!PTR_ERR_OR_ZERO(cpu_clk)) {
> > - per_cpu(freq_factor, cpu) =
> > - clk_get_rate(cpu_clk) / 1000;
> > + per_cpu(capacity_ref_freq, cpu) =
> > + clk_get_rate(cpu_clk) / HZ_PER_KHZ;
> > clk_put(cpu_clk);
> > }
> > } else {
> > @@ -411,7 +410,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
> > cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus);
> >
> > for_each_cpu(cpu, policy->related_cpus)
> > - per_cpu(freq_factor, cpu) = policy->cpuinfo.max_freq / 1000;
> > + per_cpu(capacity_ref_freq, cpu) = policy->cpuinfo.max_freq;
>
> It seems init_cpu_capacity_callback() is only called when a policy is created
> (cf. CPUFREQ_CREATE_POLICY). This means that CPU capacities are only updated
> in this specific case, eluding the cases where:
> - boost is enabled. 'policy->cpuinfo.max_freq' is updated in the cpufreq driver,
> but it seems it doesn't have any consequence regarding CPU capacities.
Yes, that's why we have to rely on something other than max_freq. And
one might not want to take boost into account in the compute capacity
because it's not sustainable. Nothing has changed on this side.
> - a cpufreq driver is unplugged (e.g. rmmod cpufreq_driver.ko). In this case
> the CPU capacities are only updated when plugging in the driver (e.g. insmod
> cpufreq_driver.ko).
same here
>
> Regards,
> Pierre
>
>
> >
> > if (cpumask_empty(cpus_to_visit)) {
> > topology_normalize_cpu_scale();
> > diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
> > index a07b510e7dc5..38ca6c76af56 100644
> > --- a/include/linux/arch_topology.h
> > +++ b/include/linux/arch_topology.h
> > @@ -27,6 +27,13 @@ static inline unsigned long topology_get_cpu_scale(int cpu)
> >
> > void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity);
> >
> > +DECLARE_PER_CPU(unsigned long, capacity_ref_freq);
> > +
> > +static inline unsigned long topology_get_freq_ref(int cpu)
> > +{
> > + return per_cpu(capacity_ref_freq, cpu);
> > +}
> > +
> > DECLARE_PER_CPU(unsigned long, arch_freq_scale);
> >
> > static inline unsigned long topology_get_freq_scale(int cpu)
On Wed, 11 Oct 2023 at 12:27, Pierre Gondois <[email protected]> wrote:
>
> Hello Vincent,
>
> On 10/9/23 12:36, Vincent Guittot wrote:
> > cppc cpufreq driver can register an artificial energy model. In such case,
> > it also have to register the frequency that is used to define the CPU
> > capacity
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > drivers/cpufreq/cppc_cpufreq.c | 18 ++++++++++++++++++
> > 1 file changed, 18 insertions(+)
> >
> > diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> > index fe08ca419b3d..24c6ba349f01 100644
> > --- a/drivers/cpufreq/cppc_cpufreq.c
> > +++ b/drivers/cpufreq/cppc_cpufreq.c
> > @@ -636,6 +636,21 @@ static int populate_efficiency_class(void)
> > return 0;
> > }
> >
> > +
> > +static void cppc_cpufreq_set_capacity_ref_freq(struct cpufreq_policy *policy)
> > +{
> > + struct cppc_perf_caps *perf_caps;
> > + struct cppc_cpudata *cpu_data;
> > + unsigned int ref_freq;
> > +
> > + cpu_data = policy->driver_data;
> > + perf_caps = &cpu_data->perf_caps;
> > +
> > + ref_freq = cppc_cpufreq_perf_to_khz(cpu_data, perf_caps->highest_perf);
> > +
> > + per_cpu(capacity_ref_freq, policy->cpu) = ref_freq;
>
> 'capacity_ref_freq' seems to be updated only if CONFIG_ENERGY_MODEL is set. However in
> [1], get_capacity_ref_freq() relies on 'capacity_ref_freq'. The cpufreq_schedutil governor
> should have a valid 'capacity_ref_freq' value set if the CPPC cpufreq driver is used
> without energy model I believe.
We can disable it by setting capacity_ref_freq to 0 so that it will
fall back on cpuinfo, like Intel and AMD which use the default
SCHED_CAPACITY_SCALE capacity.
Could you provide me with more details about your platform? I'm still
trying to understand how the CPU compute capacity is set up on your
system. How do you set the per_cpu cpu_scale variable? We should set the
reference frequency at the same time.
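For reference, the fallback added by patch 4 in schedutil looks like this
(comments added here only for clarity):

static __always_inline
unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
{
	unsigned int freq = arch_scale_freq_ref(policy->cpu);

	/* A non-zero capacity_ref_freq is used as the fixed reference. */
	if (freq)
		return freq;

	/* capacity_ref_freq == 0: fall back to the current behaviour. */
	if (arch_scale_freq_invariant())
		return policy->cpuinfo.max_freq;

	return policy->cur;
}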
>
> Also 'capacity_ref_freq' seems to be set only for 'policy->cpu'. I believe it should
> be set for the whole perf domain in case this 'policy->cpu' goes offline.
>
> Another thing, related my comment to [1] and to [2], for CPPC the max capacity matches
> the boosting frequency. We have:
> 'non-boosted max capacity' < 'boosted max capacity'.
> -
> If boosting is not enabled, the CPU utilization can still go above the 'non-boosted max
> capacity'. The overutilization of the system seems to be triggered by comparing the CPU
> util to the 'boosted max capacity'. So systems might not be detected as overutilized.
As Peter mentioned, we have to decide what the original compute
capacity of your CPUs is, which is usually the sustainable max compute
capacity, especially when using EAS and the EM.
>
> For the EAS energy computation, em_cpu_energy() tries to predict the frequency that will
> be used. It is currently unknown to the function that the frequency request will be
> clamped by __resolve_freq():
> get_next_freq()
> \-cpufreq_driver_resolve_freq()
> \-__resolve_freq()
> This means that the energy computation might use boosting frequencies, which are not
> available.
>
> Regards,
> Pierre
>
> [1]: [PATCH v2 4/6] cpufreq/schedutil: use a fixed reference frequency
> [2]: https://lore.kernel.org/lkml/[email protected]/
>
> > +}
> > +
> > static void cppc_cpufreq_register_em(struct cpufreq_policy *policy)
> > {
> > struct cppc_cpudata *cpu_data;
> > @@ -643,6 +658,9 @@ static void cppc_cpufreq_register_em(struct cpufreq_policy *policy)
> > EM_ADV_DATA_CB(cppc_get_cpu_power, cppc_get_cpu_cost);
> >
> > cpu_data = policy->driver_data;
> > +
> > + cppc_cpufreq_set_capacity_ref_freq(policy);
> > +
> > em_dev_register_perf_domain(get_cpu_device(policy->cpu),
> > get_perf_level_count(policy), &em_cb,
> > cpu_data->shared_cpu_map, 0);
Hi both,
On Wednesday 11 Oct 2023 at 16:25:46 (+0200), Vincent Guittot wrote:
> On Wed, 11 Oct 2023 at 12:27, Pierre Gondois <[email protected]> wrote:
> >
> > Hello Vincent,
> >
> > On 10/9/23 12:36, Vincent Guittot wrote:
> > > cppc cpufreq driver can register an artificial energy model. In such case,
> > > it also have to register the frequency that is used to define the CPU
> > > capacity
> > >
> > > Signed-off-by: Vincent Guittot <[email protected]>
> > > ---
> > > drivers/cpufreq/cppc_cpufreq.c | 18 ++++++++++++++++++
> > > 1 file changed, 18 insertions(+)
> > >
> > > diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> > > index fe08ca419b3d..24c6ba349f01 100644
> > > --- a/drivers/cpufreq/cppc_cpufreq.c
> > > +++ b/drivers/cpufreq/cppc_cpufreq.c
> > > @@ -636,6 +636,21 @@ static int populate_efficiency_class(void)
> > > return 0;
> > > }
> > >
> > > +
> > > +static void cppc_cpufreq_set_capacity_ref_freq(struct cpufreq_policy *policy)
> > > +{
> > > + struct cppc_perf_caps *perf_caps;
> > > + struct cppc_cpudata *cpu_data;
> > > + unsigned int ref_freq;
> > > +
> > > + cpu_data = policy->driver_data;
> > > + perf_caps = &cpu_data->perf_caps;
> > > +
> > > + ref_freq = cppc_cpufreq_perf_to_khz(cpu_data, perf_caps->highest_perf);
> > > +
> > > + per_cpu(capacity_ref_freq, policy->cpu) = ref_freq;
> >
> > 'capacity_ref_freq' seems to be updated only if CONFIG_ENERGY_MODEL is set. However in
> > [1], get_capacity_ref_freq() relies on 'capacity_ref_freq'. The cpufreq_schedutil governor
> > should have a valid 'capacity_ref_freq' value set if the CPPC cpufreq driver is used
> > without energy model I believe.
>
> we can disable it by setting capacity_ref_freq to 0 so it will
> fallback on cpuinfo like intel and amd which uses default
> SCHED_CAPACITY_SCALE capacity
>
> Could you provide me with more details about your platform ? I still
> try to understand how the cpu compute capacity is set up on your
> system. How do you set per_cpu cpu_scale variable ? we should set the
> ref freq at the same time
>
Yes, the best place to set it would be in:
drivers/base/arch_topology.c: topology_init_cpu_capacity_cppc()
But:
- That function reuses topology_normalize_cpu_scale() and when called
it needs to have capacity_ref_freq = 1. So either capacity_ref_freq
needs to be set for each CPU after topology_normalize_cpu_scale() is
called or we should not call topology_normalize_cpu_scale() here and
just unpack a CPPC specific version of it in
topology_init_cpu_capacity_cppc(). The latter is probably better as
we avoid iterating through all CPUs a couple of times.
- When set, capacity_ref_freq needs to be a "frequency" (at least
in reference to the reference frequencies provided by CPPC). So
cppc_cpufreq_khz_to_perf() and cppc_cpufreq_perf_to_khz() would need
to move to drivers/acpi/cppc_acpi.c. They don't have any dependency
on cpufreq (policies) so that should be alright.
topology_init_cpu_capacity_cppc() is a better place to set
capacity_ref_freq because one can do it for each CPU, and it not only
caters for the EAS case but also for frequency invariance, when
arch_set_freq_scale() is called, if no counters are supported.
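Roughly something like the untested sketch below; cppc_perf_to_khz() stands
for the perf-to-kHz conversion once it has moved to cppc_acpi.c, so both that
helper and the function name here are made up:

static void topology_init_capacity_ref_freq_cppc(void)
{
	struct cppc_perf_caps perf_caps;
	int cpu;

	for_each_possible_cpu(cpu) {
		if (cppc_get_perf_caps(cpu, &perf_caps))
			continue;

		/* Reference frequency matching the capacity (highest_perf). */
		per_cpu(capacity_ref_freq, cpu) =
			cppc_perf_to_khz(&perf_caps, perf_caps.highest_perf);
	}
}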
When counters are supported, there are still two loose threads:
- amu_fie_setup(): Vincent, would you mind completely removing
cpufreq_get_hw_max_freq() and reusing arch_scale_freq_ref() here?
- It would be nice if cppc_scale_freq_workfn() would use
arch_scale_freq_ref() as well, for consistency. But it would need
to be converted back to performance before use, so that would mean
extra work on the tick, which is not ideal.
Basically it would be good if what gets used for capacity
(arch_scale_freq_ref()) gets used for frequency invariance as well,
in all locations.
Thanks,
Ionela.
> >
> > Also 'capacity_ref_freq' seems to be set only for 'policy->cpu'. I believe it should
> > be set for the whole perf domain in case this 'policy->cpu' goes offline.
> >
> > Another thing, related my comment to [1] and to [2], for CPPC the max capacity matches
> > the boosting frequency. We have:
> > 'non-boosted max capacity' < 'boosted max capacity'.
> > -
> > If boosting is not enabled, the CPU utilization can still go above the 'non-boosted max
> > capacity'. The overutilization of the system seems to be triggered by comparing the CPU
> > util to the 'boosted max capacity'. So systems might not be detected as overutilized.
>
> As Peter mentioned, we have to decide what is the original compute
> capacity of your CPUs which is usually the sustainable max compute
> capacity, especially when using EAS and EM
>
> >
> > For the EAS energy computation, em_cpu_energy() tries to predict the frequency that will
> > be used. It is currently unknown to the function that the frequency request will be
> > clamped by __resolve_freq():
> > get_next_freq()
> > \-cpufreq_driver_resolve_freq()
> > \-__resolve_freq()
> > This means that the energy computation might use boosting frequencies, which are not
> > available.
> >
> > Regards,
> > Pierre
> >
> > [1]: [PATCH v2 4/6] cpufreq/schedutil: use a fixed reference frequency
> > [2]: https://lore.kernel.org/lkml/[email protected]/
> >
> > > +}
> > > +
> > > static void cppc_cpufreq_register_em(struct cpufreq_policy *policy)
> > > {
> > > struct cppc_cpudata *cpu_data;
> > > @@ -643,6 +658,9 @@ static void cppc_cpufreq_register_em(struct cpufreq_policy *policy)
> > > EM_ADV_DATA_CB(cppc_get_cpu_power, cppc_get_cpu_cost);
> > >
> > > cpu_data = policy->driver_data;
> > > +
> > > + cppc_cpufreq_set_capacity_ref_freq(policy);
> > > +
> > > em_dev_register_perf_domain(get_cpu_device(policy->cpu),
> > > get_perf_level_count(policy), &em_cb,
> > > cpu_data->shared_cpu_map, 0);
Hi Ionela,
On Mon, 16 Oct 2023 at 14:13, Ionela Voinescu <[email protected]> wrote:
>
> Hi both,
>
> On Wednesday 11 Oct 2023 at 16:25:46 (+0200), Vincent Guittot wrote:
> > On Wed, 11 Oct 2023 at 12:27, Pierre Gondois <[email protected]> wrote:
> > >
> > > Hello Vincent,
> > >
> > > On 10/9/23 12:36, Vincent Guittot wrote:
> > > > cppc cpufreq driver can register an artificial energy model. In such case,
> > > > it also have to register the frequency that is used to define the CPU
> > > > capacity
> > > >
> > > > Signed-off-by: Vincent Guittot <[email protected]>
> > > > ---
> > > > drivers/cpufreq/cppc_cpufreq.c | 18 ++++++++++++++++++
> > > > 1 file changed, 18 insertions(+)
> > > >
> > > > diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> > > > index fe08ca419b3d..24c6ba349f01 100644
> > > > --- a/drivers/cpufreq/cppc_cpufreq.c
> > > > +++ b/drivers/cpufreq/cppc_cpufreq.c
> > > > @@ -636,6 +636,21 @@ static int populate_efficiency_class(void)
> > > > return 0;
> > > > }
> > > >
> > > > +
> > > > +static void cppc_cpufreq_set_capacity_ref_freq(struct cpufreq_policy *policy)
> > > > +{
> > > > + struct cppc_perf_caps *perf_caps;
> > > > + struct cppc_cpudata *cpu_data;
> > > > + unsigned int ref_freq;
> > > > +
> > > > + cpu_data = policy->driver_data;
> > > > + perf_caps = &cpu_data->perf_caps;
> > > > +
> > > > + ref_freq = cppc_cpufreq_perf_to_khz(cpu_data, perf_caps->highest_perf);
> > > > +
> > > > + per_cpu(capacity_ref_freq, policy->cpu) = ref_freq;
> > >
> > > 'capacity_ref_freq' seems to be updated only if CONFIG_ENERGY_MODEL is set. However in
> > > [1], get_capacity_ref_freq() relies on 'capacity_ref_freq'. The cpufreq_schedutil governor
> > > should have a valid 'capacity_ref_freq' value set if the CPPC cpufreq driver is used
> > > without energy model I believe.
> >
> > we can disable it by setting capacity_ref_freq to 0 so it will
> > fallback on cpuinfo like intel and amd which uses default
> > SCHED_CAPACITY_SCALE capacity
> >
> > Could you provide me with more details about your platform ? I still
> > try to understand how the cpu compute capacity is set up on your
> > system. How do you set per_cpu cpu_scale variable ? we should set the
> > ref freq at the same time
> >
>
> Yes, the best place to set it would be in:
> drivers/base/arch_topology.c: topology_init_cpu_capacity_cppc()
Thanks. I didn't notice it
>
> But:
> - That function reuses topology_normalize_cpu_scale() and when called
> it needs to have capacity_ref_freq = 1. So either capacity_ref_freq
> needs to be set for each CPU after topology_normalize_cpu_scale() is
> called or we should not call topology_normalize_cpu_scale() here and
> just unpack a CPPC specific version of it in
> topology_init_cpu_capacity_cppc(). The latter is probably better as
> we avoid iterating through all CPUs a couple of times.
>
> - When set, capacity_ref_freq needs to be a "frequency" (at least
> in reference to the reference frequencies provided by CPPC). So
> cppc_cpufreq_khz_to_perf() and cppc_cpufreq_perf_to_khz() would need
> to move to drivers/acpi/cppc_acpi.c. They don't have any dependency
> on cpufreq (policies) so that should be alright.
>
> topology_init_cpu_capacity_cppc() is a better place to set
> capacity_ref_freq because one can do it for each CPU, and it not only
I agree, topology_init_cpu_capacity_cppc() is the best place to set
capacity_ref_freq.
> caters for the EAS case but also for frequency invariance, when
> arch_set_freq_scale() is called, if no counters are supported.
>
> When counters are supported, there are still two loose threads:
> - amu_fie_setup(): Vincent, would you mind completely removing
> cpufreq_get_hw_max_freq() and reusing arch_scale_freq_ref() here?
I wonder if we could have an ordering dependency problem, as both
init_cpu_capacity_notifier() and init_amu_fie_notifier() are
registered for the same CPUFREQ_POLICY_NOTIFIER event and I'm not sure
they will be called in the right order.
>
> - It would be nice if cppc_scale_freq_workfn() would use
> arch_scale_freq_ref() as well, for consistency. But it would need
> to be converted back to performance before use, so that would mean
> extra work on the tick, which is not ideal.
This one seems more complex, as it involves other architectures that are
not using arch_topology.c and would need more rework, so I would prefer
to make it a separate patchset.
Thanks
Vincent
>
> Basically it would be good if what gets used for capacity
> (arch_scale_freq_ref()) gets used for frequency invariance as well,
> in all locations.
>
> Thanks,
> Ionela.
>
> > >
> > > Also 'capacity_ref_freq' seems to be set only for 'policy->cpu'. I believe it should
> > > be set for the whole perf domain in case this 'policy->cpu' goes offline.
> > >
> > > Another thing, related my comment to [1] and to [2], for CPPC the max capacity matches
> > > the boosting frequency. We have:
> > > 'non-boosted max capacity' < 'boosted max capacity'.
> > > -
> > > If boosting is not enabled, the CPU utilization can still go above the 'non-boosted max
> > > capacity'. The overutilization of the system seems to be triggered by comparing the CPU
> > > util to the 'boosted max capacity'. So systems might not be detected as overutilized.
> >
> > As Peter mentioned, we have to decide what is the original compute
> > capacity of your CPUs which is usually the sustainable max compute
> > capacity, especially when using EAS and EM
> >
> > >
> > > For the EAS energy computation, em_cpu_energy() tries to predict the frequency that will
> > > be used. It is currently unknown to the function that the frequency request will be
> > > clamped by __resolve_freq():
> > > get_next_freq()
> > > \-cpufreq_driver_resolve_freq()
> > > \-__resolve_freq()
> > > This means that the energy computation might use boosting frequencies, which are not
> > > available.
> > >
> > > Regards,
> > > Pierre
> > >
> > > [1]: [PATCH v2 4/6] cpufreq/schedutil: use a fixed reference frequency
> > > [2]: https://lore.kernel.org/lkml/[email protected]/
> > >
> > > > +}
> > > > +
> > > > static void cppc_cpufreq_register_em(struct cpufreq_policy *policy)
> > > > {
> > > > struct cppc_cpudata *cpu_data;
> > > > @@ -643,6 +658,9 @@ static void cppc_cpufreq_register_em(struct cpufreq_policy *policy)
> > > > EM_ADV_DATA_CB(cppc_get_cpu_power, cppc_get_cpu_cost);
> > > >
> > > > cpu_data = policy->driver_data;
> > > > +
> > > > + cppc_cpufreq_set_capacity_ref_freq(policy);
> > > > +
> > > > em_dev_register_perf_domain(get_cpu_device(policy->cpu),
> > > > get_perf_level_count(policy), &em_cb,
> > > > cpu_data->shared_cpu_map, 0);
Hi,
On Monday 16 Oct 2023 at 17:32:03 (+0200), Vincent Guittot wrote:
> Hi Ionela,
>
> On Mon, 16 Oct 2023 at 14:13, Ionela Voinescu <[email protected]> wrote:
> >
> > Hi both,
> >
> > On Wednesday 11 Oct 2023 at 16:25:46 (+0200), Vincent Guittot wrote:
> > > On Wed, 11 Oct 2023 at 12:27, Pierre Gondois <[email protected]> wrote:
> > > >
> > > > Hello Vincent,
> > > >
> > > > On 10/9/23 12:36, Vincent Guittot wrote:
> > > > > cppc cpufreq driver can register an artificial energy model. In such case,
> > > > > it also have to register the frequency that is used to define the CPU
> > > > > capacity
> > > > >
> > > > > Signed-off-by: Vincent Guittot <[email protected]>
> > > > > ---
> > > > > drivers/cpufreq/cppc_cpufreq.c | 18 ++++++++++++++++++
> > > > > 1 file changed, 18 insertions(+)
> > > > >
> > > > > diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> > > > > index fe08ca419b3d..24c6ba349f01 100644
> > > > > --- a/drivers/cpufreq/cppc_cpufreq.c
> > > > > +++ b/drivers/cpufreq/cppc_cpufreq.c
> > > > > @@ -636,6 +636,21 @@ static int populate_efficiency_class(void)
> > > > > return 0;
> > > > > }
> > > > >
> > > > > +
> > > > > +static void cppc_cpufreq_set_capacity_ref_freq(struct cpufreq_policy *policy)
> > > > > +{
> > > > > + struct cppc_perf_caps *perf_caps;
> > > > > + struct cppc_cpudata *cpu_data;
> > > > > + unsigned int ref_freq;
> > > > > +
> > > > > + cpu_data = policy->driver_data;
> > > > > + perf_caps = &cpu_data->perf_caps;
> > > > > +
> > > > > + ref_freq = cppc_cpufreq_perf_to_khz(cpu_data, perf_caps->highest_perf);
> > > > > +
> > > > > + per_cpu(capacity_ref_freq, policy->cpu) = ref_freq;
> > > >
> > > > 'capacity_ref_freq' seems to be updated only if CONFIG_ENERGY_MODEL is set. However in
> > > > [1], get_capacity_ref_freq() relies on 'capacity_ref_freq'. The cpufreq_schedutil governor
> > > > should have a valid 'capacity_ref_freq' value set if the CPPC cpufreq driver is used
> > > > without energy model I believe.
> > >
> > > we can disable it by setting capacity_ref_freq to 0 so it will
> > > fallback on cpuinfo like intel and amd which uses default
> > > SCHED_CAPACITY_SCALE capacity
> > >
> > > Could you provide me with more details about your platform ? I still
> > > try to understand how the cpu compute capacity is set up on your
> > > system. How do you set per_cpu cpu_scale variable ? we should set the
> > > ref freq at the same time
> > >
> >
> > Yes, the best place to set it would be in:
> > drivers/base/arch_topology.c: topology_init_cpu_capacity_cppc()
>
> Thanks. I didn't notice it
>
> >
> > But:
> > - That function reuses topology_normalize_cpu_scale() and when called
> > it needs to have capacity_ref_freq = 1. So either capacity_ref_freq
> > needs to be set for each CPU after topology_normalize_cpu_scale() is
> > called or we should not call topology_normalize_cpu_scale() here and
> > just unpack a CPPC specific version of it in
> > topology_init_cpu_capacity_cppc(). The latter is probably better as
> > we avoid iterating through all CPUs a couple of times.
> >
> > - When set, capacity_ref_freq needs to be a "frequency" (at least
> > in reference to the reference frequencies provided by CPPC). So
> > cppc_cpufreq_khz_to_perf() and cppc_cpufreq_perf_to_khz() would need
> > to move to drivers/acpi/cppc_acpi.c. They don't have any dependency
> > on cpufreq (policies) so that should be alright.
> >
> > topology_init_cpu_capacity_cppc() is a better place to set
> > capacity_ref_freq because one can do it for each CPU, and it not only
>
> I agree, topology_init_cpu_capacity_cppc() is the best place to set
> capacity_ref_freq()
>
> > caters for the EAS case but also for frequency invariance, when
> > arch_set_freq_scale() is called, if no counters are supported.
> >
> > When counters are supported, there are still two loose threads:
> > - amu_fie_setup(): Vincent, would you mind completely removing
> > cpufreq_get_hw_max_freq() and reusing arch_scale_freq_ref() here?
>
> I wonder if we can have a ordering dependency problem as both
> init_cpu_capacity_notifier() and init_amu_fie_notifier() are
> registered for the same CPUFREQ_POLICY_NOTIFIER event and I'm not sure
> it will happen in the right ordering
Yes, you are right, this would be a problem for DT systems. With the
implementation above, ACPI systems would obtain capacity_ref_freq on
processor probe, so it should then be available at policy creation, when
amu_fie_setup() is called.
Initially I thought the only solution might be to move
freq_inv_set_max_ratio() in the arch topology driver to the same
callback that initialises capacity, but that quickly becomes ugly when
making it support both DT and ACPI systems. And then there's the
question of whether it belongs there.
But I think the better option is to wrap policy->cpuinfo.max_freq in
another getter function which can be used in both amu_fie_setup() and
init_cpu_capacity_callback(). This can be implemented in the
arch topology driver and exposed to the architecture specific topology
files.
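Something as simple as the sketch below would do (hypothetical helper, name
made up), so that both paths read the same source and it can be changed in
a single place later on:

static inline unsigned long topology_get_max_freq_ref(struct cpufreq_policy *policy)
{
	/* Single source for the reference used by capacity and FIE setup. */
	return policy->cpuinfo.max_freq;
}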
I'm not sure if this might be worth leaving for another patchset as
well. Let us know if you'd like us to help on these ones.
Thanks,
Ionela.
>
> >
> > - It would be nice if cppc_scale_freq_workfn() would use
> > arch_scale_freq_ref() as well, for consistency. But it would need
> > to be converted back to performance before use, so that would mean
> > extra work on the tick, which is not ideal.
>
> This once seems more complex as it implies other arch that are not
> using arch_topology.c and would need more rework so I would prefer to
> make it a separate patchset
>
> Thanks
> Vincent
>
> >
> > Basically it would be good if what gets used for capacity
> > (arch_scale_freq_ref()) gets used for frequency invariance as well,
> > in all locations.
> >
> > Thanks,
> > Ionela.
> >
> > > >
> > > > Also 'capacity_ref_freq' seems to be set only for 'policy->cpu'. I believe it should
> > > > be set for the whole perf domain in case this 'policy->cpu' goes offline.
> > > >
> > > > Another thing, related my comment to [1] and to [2], for CPPC the max capacity matches
> > > > the boosting frequency. We have:
> > > > 'non-boosted max capacity' < 'boosted max capacity'.
> > > > -
> > > > If boosting is not enabled, the CPU utilization can still go above the 'non-boosted max
> > > > capacity'. The overutilization of the system seems to be triggered by comparing the CPU
> > > > util to the 'boosted max capacity'. So systems might not be detected as overutilized.
> > >
> > > As Peter mentioned, we have to decide what is the original compute
> > > capacity of your CPUs which is usually the sustainable max compute
> > > capacity, especially when using EAS and EM
> > >
> > > >
> > > > For the EAS energy computation, em_cpu_energy() tries to predict the frequency that will
> > > > be used. It is currently unknown to the function that the frequency request will be
> > > > clamped by __resolve_freq():
> > > > get_next_freq()
> > > > \-cpufreq_driver_resolve_freq()
> > > > \-__resolve_freq()
> > > > This means that the energy computation might use boosting frequencies, which are not
> > > > available.
> > > >
> > > > Regards,
> > > > Pierre
> > > >
> > > > [1]: [PATCH v2 4/6] cpufreq/schedutil: use a fixed reference frequency
> > > > [2]: https://lore.kernel.org/lkml/[email protected]/
> > > >
> > > > > +}
> > > > > +
> > > > > static void cppc_cpufreq_register_em(struct cpufreq_policy *policy)
> > > > > {
> > > > > struct cppc_cpudata *cpu_data;
> > > > > @@ -643,6 +658,9 @@ static void cppc_cpufreq_register_em(struct cpufreq_policy *policy)
> > > > > EM_ADV_DATA_CB(cppc_get_cpu_power, cppc_get_cpu_cost);
> > > > >
> > > > > cpu_data = policy->driver_data;
> > > > > +
> > > > > + cppc_cpufreq_set_capacity_ref_freq(policy);
> > > > > +
> > > > > em_dev_register_perf_domain(get_cpu_device(policy->cpu),
> > > > > get_perf_level_count(policy), &em_cb,
> > > > > cpu_data->shared_cpu_map, 0);
On Tue, 17 Oct 2023 at 11:02, Ionela Voinescu <[email protected]> wrote:
>
> Hi,
>
> On Monday 16 Oct 2023 at 17:32:03 (+0200), Vincent Guittot wrote:
> > Hi Ionela,
> >
> > On Mon, 16 Oct 2023 at 14:13, Ionela Voinescu <[email protected]> wrote:
> > >
> > > Hi both,
> > >
> > > On Wednesday 11 Oct 2023 at 16:25:46 (+0200), Vincent Guittot wrote:
> > > > On Wed, 11 Oct 2023 at 12:27, Pierre Gondois <[email protected]> wrote:
> > > > >
> > > > > Hello Vincent,
> > > > >
> > > > > On 10/9/23 12:36, Vincent Guittot wrote:
> > > > > > cppc cpufreq driver can register an artificial energy model. In such case,
> > > > > > it also have to register the frequency that is used to define the CPU
> > > > > > capacity
> > > > > >
> > > > > > Signed-off-by: Vincent Guittot <[email protected]>
> > > > > > ---
> > > > > > drivers/cpufreq/cppc_cpufreq.c | 18 ++++++++++++++++++
> > > > > > 1 file changed, 18 insertions(+)
> > > > > >
> > > > > > diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> > > > > > index fe08ca419b3d..24c6ba349f01 100644
> > > > > > --- a/drivers/cpufreq/cppc_cpufreq.c
> > > > > > +++ b/drivers/cpufreq/cppc_cpufreq.c
> > > > > > @@ -636,6 +636,21 @@ static int populate_efficiency_class(void)
> > > > > > return 0;
> > > > > > }
> > > > > >
> > > > > > +
> > > > > > +static void cppc_cpufreq_set_capacity_ref_freq(struct cpufreq_policy *policy)
> > > > > > +{
> > > > > > + struct cppc_perf_caps *perf_caps;
> > > > > > + struct cppc_cpudata *cpu_data;
> > > > > > + unsigned int ref_freq;
> > > > > > +
> > > > > > + cpu_data = policy->driver_data;
> > > > > > + perf_caps = &cpu_data->perf_caps;
> > > > > > +
> > > > > > + ref_freq = cppc_cpufreq_perf_to_khz(cpu_data, perf_caps->highest_perf);
> > > > > > +
> > > > > > + per_cpu(capacity_ref_freq, policy->cpu) = ref_freq;
> > > > >
> > > > > 'capacity_ref_freq' seems to be updated only if CONFIG_ENERGY_MODEL is set. However in
> > > > > [1], get_capacity_ref_freq() relies on 'capacity_ref_freq'. The cpufreq_schedutil governor
> > > > > should have a valid 'capacity_ref_freq' value set if the CPPC cpufreq driver is used
> > > > > without energy model I believe.
> > > >
> > > > we can disable it by setting capacity_ref_freq to 0 so it will
> > > > fallback on cpuinfo like intel and amd which uses default
> > > > SCHED_CAPACITY_SCALE capacity
> > > >
> > > > Could you provide me with more details about your platform ? I still
> > > > try to understand how the cpu compute capacity is set up on your
> > > > system. How do you set per_cpu cpu_scale variable ? we should set the
> > > > ref freq at the same time
> > > >
> > >
> > > Yes, the best place to set it would be in:
> > > drivers/base/arch_topology.c: topology_init_cpu_capacity_cppc()
> >
> > Thanks. I didn't notice it
> >
> > >
> > > But:
> > > - That function reuses topology_normalize_cpu_scale() and when called
> > > it needs to have capacity_ref_freq = 1. So either capacity_ref_freq
> > > needs to be set for each CPU after topology_normalize_cpu_scale() is
> > > called or we should not call topology_normalize_cpu_scale() here and
> > > just unpack a CPPC specific version of it in
> > > topology_init_cpu_capacity_cppc(). The latter is probably better as
> > > we avoid iterating through all CPUs a couple of times.
> > >
> > > - When set, capacity_ref_freq needs to be a "frequency" (at least
> > > in reference to the reference frequencies provided by CPPC). So
> > > cppc_cpufreq_khz_to_perf() and cppc_cpufreq_perf_to_khz() would need
> > > to move to drivers/acpi/cppc_acpi.c. They don't have any dependency
> > > on cpufreq (policies) so that should be alright.
> > >
> > > topology_init_cpu_capacity_cppc() is a better place to set
> > > capacity_ref_freq because one can do it for each CPU, and it not only
> >
> > I agree, topology_init_cpu_capacity_cppc() is the best place to set
> > capacity_ref_freq()
> >
> > > caters for the EAS case but also for frequency invariance, when
> > > arch_set_freq_scale() is called, if no counters are supported.
> > >
> > > When counters are supported, there are still two loose threads:
> > > - amu_fie_setup(): Vincent, would you mind completely removing
> > > cpufreq_get_hw_max_freq() and reusing arch_scale_freq_ref() here?
> >
> > I wonder if we can have a ordering dependency problem as both
> > init_cpu_capacity_notifier() and init_amu_fie_notifier() are
> > registered for the same CPUFREQ_POLICY_NOTIFIER event and I'm not sure
> > it will happen in the right ordering
>
> Yes, you are right, this would be a problem for DT systems. With the
> implementation above, ACPI systems would obtain capacity_ref_freq on
> processor probe so it should be then available at policy creation when
> amu_fie_setup() would be called.
Yes, the problem is only for DT.
>
> Initially I thought the only solution might be to move
> freq_inv_set_max_ratio() in the arch topology driver to the same
> callback that initialises capacity, but that quickly becomes ugly with
> making it support both DT and ACPI systems. And then there's the
> question on whether it belongs there.
The goal would be to update the ratio while initializing everything
else. But this means that we must initialize the ratio in such a way
that the AMU will return SCHED_CAPACITY_SCALE by default until
arch_topology.c initializes it.
I will give it a try to check how ugly it will be.
>
> But I think the better option is to wrap policy->cpuinfo.max_freq in
> another getter function which can be used in both amu_fie_setup() and
> init_cpu_capacity_callback(). This can be implemented in the
> arch topology driver and exposed to the architecture specific topology
> files.
>
> I'm not sure if this might be worth leaving for another patchset as
> well. Let us know if you'd like us to help on theses ones.
>
> Thanks,
> Ionela.
>
> >
> > >
> > > - It would be nice if cppc_scale_freq_workfn() would use
> > > arch_scale_freq_ref() as well, for consistency. But it would need
> > > to be converted back to performance before use, so that would mean
> > > extra work on the tick, which is not ideal.
> >
> > This once seems more complex as it implies other arch that are not
> > using arch_topology.c and would need more rework so I would prefer to
> > make it a separate patchset
> >
> > Thanks
> > Vincent
> >
> > >
> > > Basically it would be good if what gets used for capacity
> > > (arch_scale_freq_ref()) gets used for frequency invariance as well,
> > > in all locations.
> > >
> > > Thanks,
> > > Ionela.
> > >
> > > > >
> > > > > Also 'capacity_ref_freq' seems to be set only for 'policy->cpu'. I believe it should
> > > > > be set for the whole perf domain in case this 'policy->cpu' goes offline.
> > > > >
> > > > > Another thing, related my comment to [1] and to [2], for CPPC the max capacity matches
> > > > > the boosting frequency. We have:
> > > > > 'non-boosted max capacity' < 'boosted max capacity'.
> > > > > -
> > > > > If boosting is not enabled, the CPU utilization can still go above the 'non-boosted max
> > > > > capacity'. The overutilization of the system seems to be triggered by comparing the CPU
> > > > > util to the 'boosted max capacity'. So systems might not be detected as overutilized.
> > > >
> > > > As Peter mentioned, we have to decide what is the original compute
> > > > capacity of your CPUs which is usually the sustainable max compute
> > > > capacity, especially when using EAS and EM
> > > >
> > > > >
> > > > > For the EAS energy computation, em_cpu_energy() tries to predict the frequency that will
> > > > > be used. It is currently unknown to the function that the frequency request will be
> > > > > clamped by __resolve_freq():
> > > > > get_next_freq()
> > > > > \-cpufreq_driver_resolve_freq()
> > > > > \-__resolve_freq()
> > > > > This means that the energy computation might use boosting frequencies, which are not
> > > > > available.
> > > > >
> > > > > Regards,
> > > > > Pierre
> > > > >
> > > > > [1]: [PATCH v2 4/6] cpufreq/schedutil: use a fixed reference frequency
> > > > > [2]: https://lore.kernel.org/lkml/[email protected]/
> > > > >
> > > > > > +}
> > > > > > +
> > > > > > static void cppc_cpufreq_register_em(struct cpufreq_policy *policy)
> > > > > > {
> > > > > > struct cppc_cpudata *cpu_data;
> > > > > > @@ -643,6 +658,9 @@ static void cppc_cpufreq_register_em(struct cpufreq_policy *policy)
> > > > > > EM_ADV_DATA_CB(cppc_get_cpu_power, cppc_get_cpu_cost);
> > > > > >
> > > > > > cpu_data = policy->driver_data;
> > > > > > +
> > > > > > + cppc_cpufreq_set_capacity_ref_freq(policy);
> > > > > > +
> > > > > > em_dev_register_perf_domain(get_cpu_device(policy->cpu),
> > > > > > get_perf_level_count(policy), &em_cb,
> > > > > > cpu_data->shared_cpu_map, 0);
Hi Vincent,
On 10/9/23 11:36, Vincent Guittot wrote:
> The last item of a performance domain is not always the performance point
> that has been used to compute CPU's capacity. This can lead to different
> target frequency compared with other part of the system like schedutil and
> would result in wrong energy estimation.
>
> A new arch_scale_freq_ref() is available to return a fixed and coherent
> frequency reference that can be used when computing the CPU's frequency
> for an level of utilization. Use this function to get this reference
> frequency.
>
> Energy model is never used without defining arch_scale_freq_ref() but
> can be compiled. Define a default arch_scale_freq_ref() returning 0
> in such case.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> include/linux/energy_model.h | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
LGTM, taking into account that with patch 2/6 we don't include any
boost frequency (so no change w.r.t. the current EAS situation).
Reviewed-by: Lukasz Luba <[email protected]>
On 10/9/23 11:36, Vincent Guittot wrote:
> Create a new method to get a unique and fixed max frequency. Currently
> cpuinfo.max_freq or the highest (or last) state of performance domain are
> used as the max frequency when computing the frequency for a level of
> utilization but:
> - cpuinfo_max_freq can change at runtime. boost is one example of
> such change.
> - cpuinfo.max_freq and last item of the PD can be different leading to
> different results between cpufreq and energy model.
>
> We need to save the reference frequency that has been used when computing
> the CPUs capacity and use this fixed and coherent value to convert between
> frequency and CPU's capacity.
>
> In fact, we already save the frequency that has been used when computing
> the capacity of each CPU. We extend the precision to save khZ instead of
> Mhz currently and we modify the type to be aligned with other variables
> used when converting frequency to capacity and the other way.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> arch/arm/include/asm/topology.h | 1 +
> arch/arm64/include/asm/topology.h | 1 +
> arch/riscv/include/asm/topology.h | 1 +
> drivers/base/arch_topology.c | 29 ++++++++++++++---------------
> include/linux/arch_topology.h | 7 +++++++
> 5 files changed, 24 insertions(+), 15 deletions(-)
[snip]
> @@ -170,9 +172,9 @@ DEFINE_PER_CPU(unsigned long, thermal_pressure);
> * operating on stale data when hot-plug is used for some CPUs. The
> * @capped_freq reflects the currently allowed max CPUs frequency due to
> * thermal capping. It might be also a boost frequency value, which is bigger
> - * than the internal 'freq_factor' max frequency. In such case the pressure
> - * value should simply be removed, since this is an indication that there is
> - * no thermal throttling. The @capped_freq must be provided in kHz.
> + * than the internal 'capacity_ref_freq' max frequency. In such case the
> + * pressure value should simply be removed, since this is an indication that
> + * there is no thermal throttling. The @capped_freq must be provided in kHz.
> */
> void topology_update_thermal_pressure(const struct cpumask *cpus,
> unsigned long capped_freq)
> @@ -183,10 +185,7 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
>
> cpu = cpumask_first(cpus);
> max_capacity = arch_scale_cpu_capacity(cpu);
> - max_freq = per_cpu(freq_factor, cpu);
> -
> - /* Convert to MHz scale which is used in 'freq_factor' */
> - capped_freq /= 1000;
I do like this small speed-up and I'm happy that it's possible with this
new design. (IIRC some of your platforms can update the thermal
pressure quite a few times per second - that might be the reason why
you see some speed-ups mentioned in the cover letter.)
> + max_freq = arch_scale_freq_ref(cpu);
>
LGTM
Reviewed-by: Lukasz Luba <[email protected]>
On 10/9/23 11:36, Vincent Guittot wrote:
> cpuinfo.max_freq can change at runtime because of boost as an example. This
> implies that the value could be different from the frequency that has been
> used to compute the capacity of a CPU.
>
> The new arch_scale_freq_ref() returns a fixed and coherent frequency
> that can be used to compute the capacity for a given frequency.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> drivers/cpufreq/cpufreq.c | 4 ++--
> include/linux/cpufreq.h | 9 +++++++++
> 2 files changed, 11 insertions(+), 2 deletions(-)
>
Reviewed-by: Lukasz Luba <[email protected]>
On 10/9/23 11:36, Vincent Guittot wrote:
> cpuinfo.max_freq can change at runtime because of boost as an example. This
> implies that the value could be different than the one that has been
> used when computing the capacity of a CPU.
>
> The new arch_scale_freq_ref() returns a fixed and coherent reference
> frequency that can be used when computing a frequency based on utilization.
>
> Use this arch_scale_freq_ref() when available and fallback to
> policy otherwise.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/cpufreq_schedutil.c | 26 ++++++++++++++++++++++++--
> 1 file changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 4492608b7d7f..1fa7e74add8f 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -114,6 +114,28 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy)
> }
> }
>
> +/**
> + * cpufreq_get_capacity_ref_freq - get the reference frequency of a given CPU that
> + * has been used to correlate frequency and compute capacity.
> + * @policy: the cpufreq policy of the CPU in question.
> + * @use_current: Fallback to current freq instead of policy->cpuinfo.max_freq.
> + *
> + * Return: the reference CPU frequency to compute a capacity.
> + */
> +static __always_inline
> +unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
> +{
> + unsigned int freq = arch_scale_freq_ref(policy->cpu);
> +
> + if (freq)
> + return freq;
Looks good; for the platforms where this inline function returns 0,
this will be optimized out.
> +
> + if (arch_scale_freq_invariant())
> + return policy->cpuinfo.max_freq;
> +
> + return policy->cur;
> +}
> +
> /**
> * get_next_freq - Compute a new frequency for a given cpufreq policy.
> * @sg_policy: schedutil policy object to compute the new frequency for.
> @@ -140,10 +162,10 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> unsigned long util, unsigned long max)
> {
> struct cpufreq_policy *policy = sg_policy->policy;
> - unsigned int freq = arch_scale_freq_invariant() ?
> - policy->cpuinfo.max_freq : policy->cur;
> + unsigned int freq;
>
> util = map_util_perf(util);
> + freq = get_capacity_ref_freq(policy);
> freq = map_util_freq(util, freq, max);
>
> if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
Reviewed-by: Lukasz Luba <[email protected]>
On 10/9/23 11:36, Vincent Guittot wrote:
> This is the 1st part of consolidating how the max compute capacity is
> used in the scheduler and how we calculate the frequency for a level of
> utilization.
>
> Fix some unconsistancy when computing frequency for an utilization. There
> can be a mismatch between energy model and schedutil.
>
> Next step will be to make a difference between the original
> max compute capacity of a CPU and what is currently available when
> there is a capping applying forever (i.e. seconds or more).
>
>
I have tested the patches apart from the last one (CPPC) because the
platform didn't have that. EAS is working OK, so feel free to add:
Tested-by: Lukasz Luba <[email protected]>
apart from patch 6/6 with the CPPC changes.
Regards,
Lukasz