2020-01-22 17:37:17

by Douglas RAILLARD

Subject: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware

Make schedutil cpufreq governor energy-aware.

- patch 1 introduces a function to retrieve a frequency given a base
frequency and an energy cost margin.
- patch 2 links Energy Model perf_domain to sugov_policy.
- patch 3 updates get_next_freq() to make use of the Energy Model.
- patch 4 adds sugov_cpu_ramp_boost() function.
- patch 5 updates sugov_update_(single|shared)() to make use of
sugov_cpu_ramp_boost().
- patch 6 introduces a tracepoint in get_next_freq() for
testing/debugging. Since it's not a trace event, it's not exposed to
userspace in a directly usable way, allowing for painless future
updates/removal.

The benefits of using the EM in schedutil are twofold:

1) Selecting the highest possible frequency for a given cost. Some
platforms can have lower frequencies that are less efficient than
higher ones, in which case they should be skipped for most purposes.
They can still be useful to give more freedom to thermal throttling
mechanisms, but not under normal circumstances.
note: the EM framework will warn about such OPPs with "hertz/watts
ratio non-monotonically decreasing".

2) Driving the frequency selection with power in mind, in addition to
maximizing the utilization of the non-idle CPUs in the system.
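To make point 1) above concrete, here is a toy computation (standalone C, with made-up numbers rather than data from any real platform) of why a less efficient low OPP costs more energy than a faster one for the same amount of work:

```c
#include <assert.h>

/*
 * Toy model of OPP efficiency as defined in patch 1:
 * efficiency = capacity / power, and for a fixed amount of work
 *   time = work / capacity, energy = power * time = work / efficiency.
 * All numbers used with this helper are illustrative only.
 */
static unsigned long opp_energy(unsigned long work,
				unsigned long capacity,
				unsigned long power)
{
	return work * power / capacity;
}
```

An OPP of capacity 512 drawing 150 power units spends 300 energy units on 1024 units of work, while a faster OPP of capacity 1024 drawing 200 spends only 200: outside of thermal throttling considerations, the lower OPP is never worth selecting.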

Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and
enabled in schedutil by
"sched/cpufreq: Hook em_pd_get_higher_freq() into get_next_freq()".

Point 2) is enabled in
"sched/cpufreq: Boost schedutil frequency ramp up". It allows using
higher frequencies when it is known that the true utilization of
currently running tasks exceeds their previous stable point.
The benefits are:

* Boosting the frequency when the behavior of a runnable task changes,
leading to an increase in utilization. That shortens the frequency
ramp up duration, which in turn allows the utilization signal to
reach stable values more quickly. Since the allowed frequency boost is
bounded in energy, it will behave consistently across platforms,
regardless of the OPP cost range.

* The boost is only transient, and should not significantly impact the
energy consumed by workloads with very stable utilization signals.

This has been lightly tested with an rt-app task ramping from 10% to 75%
utilization on a big core.

v1 -> v2:

* Split the new sugov_cpu_ramp_boost() from the existing
sugov_cpu_is_busy(), as they seem to pursue different goals.

* Implement sugov_cpu_ramp_boost() based on CFS util_avg and
util_est_enqueued signals, rather than using idle calls count.
This makes the ramp boost much more accurate in finding boost
opportunities, and gives a "continuous" output rather than a boolean.

* Add EM_COST_MARGIN_SCALE=1024 to represent the
margin values of em_pd_get_higher_freq().

v2 -> v3:

* Check util_avg >= sg_cpu->util_avg in sugov_cpu_ramp_boost_update()
to avoid boosting when the utilization is decreasing.

* Add a tracepoint for testing.

v3 -> v4:

* em_pd_get_higher_freq() now interprets the margin as absolute,
rather than relative to the cost of the base frequency.

* Modify misleading comment in em_pd_get_higher_freq() since min_freq
can actually be higher than the max available frequency in normal
operations.

Douglas RAILLARD (6):
PM: Introduce em_pd_get_higher_freq()
sched/cpufreq: Attach perf domain to sugov policy
sched/cpufreq: Hook em_pd_get_higher_freq() into get_next_freq()
sched/cpufreq: Introduce sugov_cpu_ramp_boost
sched/cpufreq: Boost schedutil frequency ramp up
sched/cpufreq: Add schedutil_em_tp tracepoint

include/linux/energy_model.h | 56 ++++++++++++++
include/trace/events/power.h | 9 +++
kernel/sched/cpufreq_schedutil.c | 124 +++++++++++++++++++++++++++++--
3 files changed, 182 insertions(+), 7 deletions(-)

--
2.24.1


2020-01-22 17:37:19

by Douglas RAILLARD

Subject: [RFC PATCH v4 1/6] PM: Introduce em_pd_get_higher_freq()

em_pd_get_higher_freq() returns a frequency greater than or equal to the
provided one while taking into account a given cost margin. It also
skips inefficient OPPs that have a higher cost than another one with a
higher frequency (ordering OPPs by cost or efficiency leads to the same
result within a given CPU).

The efficiency of an OPP is measured as efficiency=capacity/power. OPPs
with the same efficiency are assumed to be equivalent, since they
consume the same energy for a given amount of work. That work may take
more or less time depending on the frequency, but the energy consumed
is the same.
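The selection logic can be sketched as a simplified standalone model of this patch's helper (the OPP table and all numbers below are illustrative, and the helper name is shortened):

```c
#include <assert.h>

#define EM_COST_MARGIN_SCALE	1024UL

/* Illustrative OPP table entry: frequency and abstract cost,
 * sorted by increasing frequency as in the EM framework. */
struct opp { unsigned long freq, cost; };

/*
 * Simplified model of em_pd_get_higher_freq(): allow any OPP whose
 * cost stays within cost_margin (expressed on EM_COST_MARGIN_SCALE,
 * relative to the most expensive OPP) of the cheapest OPP satisfying
 * min_freq, and return the fastest such OPP. Inefficient OPPs, i.e.
 * ones costing more than a faster OPP, are skipped for free.
 */
static unsigned long pd_get_higher_freq(const struct opp *table, int n,
					unsigned long min_freq,
					unsigned long cost_margin)
{
	unsigned long max_cost = table[n - 1].cost;
	unsigned long max_allowed_cost = 0;
	int i;

	cost_margin = cost_margin * max_cost / EM_COST_MARGIN_SCALE;

	/* Compute the maximum allowed cost */
	for (i = 0; i < n; i++) {
		if (table[i].freq >= min_freq) {
			max_allowed_cost = table[i].cost + cost_margin;
			break;
		}
	}

	/* Highest frequency not exceeding the allowed cost */
	for (i = n - 1; i >= 0; i--) {
		if (table[i].cost <= max_allowed_cost)
			return table[i].freq;
	}

	/* min_freq above the fastest OPP: nothing better to pick */
	return min_freq;
}
```

With a table where the 600 unit OPP costs more than the 800 unit one, requesting 600 with a zero margin returns 800: the inefficient OPP is skipped at no extra cost.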

Signed-off-by: Douglas RAILLARD <[email protected]>
---
include/linux/energy_model.h | 56 ++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index d249b88a4d5a..8855e6892724 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -159,6 +159,56 @@ static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
return pd->nr_cap_states;
}

+#define EM_COST_MARGIN_SCALE 1024U
+
+/**
+ * em_pd_get_higher_freq() - Get the highest frequency that does not exceed the
+ * given cost margin compared to min_freq
+ * @pd : performance domain for which this must be done
+ * @min_freq : minimum frequency to return
+ * @cost_margin : allowed cost margin on the EM_COST_MARGIN_SCALE scale. The
+ * maximum value of the scale maps to the highest cost in that perf domain.
+ *
+ * Return: the chosen frequency, guaranteed to be at least as high as min_freq.
+ */
+static inline unsigned long em_pd_get_higher_freq(struct em_perf_domain *pd,
+ unsigned long min_freq, unsigned long cost_margin)
+{
+ unsigned long max_cost;
+ unsigned long max_allowed_cost = 0;
+ struct em_cap_state *cs;
+ int i;
+
+ if (!pd)
+ return min_freq;
+
+ max_cost = pd->table[pd->nr_cap_states - 1].cost;
+ cost_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;
+
+ /* Compute the maximum allowed cost */
+ for (i = 0; i < pd->nr_cap_states; i++) {
+ cs = &pd->table[i];
+ if (cs->frequency >= min_freq) {
+ max_allowed_cost = cs->cost + cost_margin;
+ break;
+ }
+ }
+
+ /* Find the highest frequency that will not exceed the cost margin */
+ for (i = pd->nr_cap_states-1; i >= 0; i--) {
+ cs = &pd->table[i];
+ if (cs->cost <= max_allowed_cost)
+ return cs->frequency;
+ }
+
+ /*
+ * min_freq can be higher than the highest available frequency since
+ * map_util_freq() will multiply the minimum frequency by some amount.
+ * This can allow it to be higher than the maximum achievable frequency.
+ */
+ return min_freq;
+}
+
#else
struct em_data_callback {};
#define EM_DATA_CB(_active_power_cb) { }
@@ -181,6 +231,12 @@ static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
{
return 0;
}
+
+static inline unsigned long em_pd_get_higher_freq(struct em_perf_domain *pd,
+ unsigned long min_freq, unsigned long cost_margin)
+{
+ return min_freq;
+}
#endif

#endif
--
2.24.1

2020-01-22 17:37:24

by Douglas RAILLARD

Subject: [RFC PATCH v4 2/6] sched/cpufreq: Attach perf domain to sugov policy

Attach an Energy Model perf_domain to each sugov_policy to prepare the
ground for energy-aware schedutil.

Signed-off-by: Douglas RAILLARD <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 41 ++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 9b8916fd00a2..09e284dc918a 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -42,6 +42,10 @@ struct sugov_policy {

bool limits_changed;
bool need_freq_update;
+
+#ifdef CONFIG_ENERGY_MODEL
+ struct em_perf_domain *pd;
+#endif
};

struct sugov_cpu {
@@ -66,6 +70,40 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);

/************************ Governor internals ***********************/

+#ifdef CONFIG_ENERGY_MODEL
+static void sugov_policy_attach_pd(struct sugov_policy *sg_policy)
+{
+ struct cpufreq_policy *policy = sg_policy->policy;
+ struct em_perf_domain *pd;
+
+ sg_policy->pd = NULL;
+ pd = em_cpu_get(policy->cpu);
+ if (!pd)
+ return;
+
+ if (cpumask_equal(policy->related_cpus, to_cpumask(pd->cpus))) {
+ sg_policy->pd = pd;
+ } else {
+ pr_warn("%s: Not all CPUs in schedutil policy %u share the same perf domain, no perf domain for that policy will be registered\n",
+ __func__, policy->cpu);
+ }
+}
+
+static struct em_perf_domain *
+sugov_policy_get_pd(struct sugov_policy *sg_policy)
+{
+ return sg_policy->pd;
+}
+#else /* CONFIG_ENERGY_MODEL */
+static void sugov_policy_attach_pd(struct sugov_policy *sg_policy) {}
+
+static struct em_perf_domain *
+sugov_policy_get_pd(struct sugov_policy *sg_policy)
+{
+ return NULL;
+}
+#endif /* CONFIG_ENERGY_MODEL */
+
static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
{
s64 delta_ns;
@@ -859,6 +897,9 @@ static int sugov_start(struct cpufreq_policy *policy)
sugov_update_shared :
sugov_update_single);
}
+
+ sugov_policy_attach_pd(sg_policy);
+
return 0;
}

--
2.24.1

2020-01-22 17:37:48

by Douglas RAILLARD

Subject: [RFC PATCH v4 3/6] sched/cpufreq: Hook em_pd_get_higher_freq() into get_next_freq()

Choose the highest OPP for a given energy cost, which allows skipping
lower frequencies that are not cheaper in terms of consumed power. These
frequencies can still be interesting to keep in the energy model to give
more freedom to thermal throttling, but should not be selected under
normal circumstances.

This also prepares the ground for energy-aware frequency boosting.

Signed-off-by: Douglas RAILLARD <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 09e284dc918a..608963da4916 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -10,6 +10,7 @@

#include "sched.h"

+#include <linux/energy_model.h>
#include <linux/sched/cpufreq.h>
#include <trace/events/power.h>

@@ -210,9 +211,16 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
struct cpufreq_policy *policy = sg_policy->policy;
unsigned int freq = arch_scale_freq_invariant() ?
policy->cpuinfo.max_freq : policy->cur;
+ struct em_perf_domain *pd = sugov_policy_get_pd(sg_policy);

freq = map_util_freq(util, freq, max);

+ /*
+ * Try to get a higher frequency if one is available, given the extra
+ * power we are ready to spend.
+ */
+ freq = em_pd_get_higher_freq(pd, freq, 0);
+
if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
return sg_policy->next_freq;

--
2.24.1

2020-01-22 17:38:04

by Douglas RAILLARD

Subject: [RFC PATCH v4 6/6] sched/cpufreq: Add schedutil_em_tp tracepoint

Introduce a new tracepoint reporting the effect of using the Energy
Model inside get_next_freq() in schedutil.

Signed-off-by: Douglas RAILLARD <[email protected]>
---
include/trace/events/power.h | 9 +++++++++
kernel/sched/cpufreq_schedutil.c | 20 ++++++++++++++------
2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index 7457e238e1b7..a3df4e915f5b 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -525,6 +525,15 @@ DEFINE_EVENT(dev_pm_qos_request, dev_pm_qos_remove_request,

TP_ARGS(name, type, new_value)
);
+
+DECLARE_TRACE(schedutil_em_tp,
+ TP_PROTO(unsigned int cpu, unsigned long util,
+ unsigned int cost_margin, unsigned int policy_cost_margin,
+ unsigned int base_freq, unsigned int boosted_freq),
+ TP_ARGS(cpu, util, cost_margin, policy_cost_margin, base_freq,
+ boosted_freq)
+);
+
#endif /* _TRACE_POWER_H */

/* This part must be outside protection */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 9a7617ea7bf4..8909c752c06f 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -14,6 +14,8 @@
#include <linux/sched/cpufreq.h>
#include <trace/events/power.h>

+EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_em_tp);
+
#define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8)

struct sugov_tunables {
@@ -225,7 +227,7 @@ static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu)

/**
* get_next_freq - Compute a new frequency for a given cpufreq policy.
- * @sg_policy: schedutil policy object to compute the new frequency for.
+ * @sg_cpu: schedutil CPU object to compute the new frequency for.
* @util: Current CPU utilization.
* @max: CPU capacity.
* @boost: Extra power that can be spent on top of the minimum amount of power
@@ -248,22 +250,28 @@ static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu)
* next_freq (as calculated above) is returned, subject to policy min/max and
* cpufreq driver limitations.
*/
-static unsigned int get_next_freq(struct sugov_policy *sg_policy,
+static unsigned int get_next_freq(struct sugov_cpu *sg_cpu,
unsigned long util, unsigned long max,
unsigned long boost)
{
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
struct cpufreq_policy *policy = sg_policy->policy;
unsigned int freq = arch_scale_freq_invariant() ?
policy->cpuinfo.max_freq : policy->cur;
struct em_perf_domain *pd = sugov_policy_get_pd(sg_policy);
+ unsigned int base_freq;

- freq = map_util_freq(util, freq, max);
+ base_freq = map_util_freq(util, freq, max);

/*
* Try to get a higher frequency if one is available, given the extra
* power we are ready to spend.
*/
- freq = em_pd_get_higher_freq(pd, freq, boost);
+ freq = em_pd_get_higher_freq(pd, base_freq, boost);
+
+ trace_schedutil_em_tp(sg_cpu->cpu, util,
+ sugov_cpu_ramp_boost(sg_cpu), boost,
+ base_freq, freq);

if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
return sg_policy->next_freq;
@@ -562,7 +570,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
ramp_boost = sugov_cpu_ramp_boost_update(sg_cpu);
max = sg_cpu->max;
util = sugov_iowait_apply(sg_cpu, time, util, max);
- next_f = get_next_freq(sg_policy, util, max, ramp_boost);
+ next_f = get_next_freq(sg_cpu, util, max, ramp_boost);
/*
* Do not reduce the frequency if the CPU has not been idle
* recently, as the reduction is likely to be premature then.
@@ -616,7 +624,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
}
}

- return get_next_freq(sg_policy, util, max, ramp_boost);
+ return get_next_freq(sg_cpu, util, max, ramp_boost);
}

static void
--
2.24.1

2020-01-22 17:38:10

by Douglas RAILLARD

Subject: [RFC PATCH v4 5/6] sched/cpufreq: Boost schedutil frequency ramp up

In some situations, it can be worthwhile to temporarily spend more
power if that provides a useful frequency boost.

Use the new sugov_cpu_ramp_boost() function to drive an energy-aware
boost, on top of the minimal required frequency.

As that boost number is not accurate (and cannot be, without a crystal
ball), we only use it in a way that gives direct control over the power
it is going to cost. This keeps a platform-independent level of control
over the average power, while allowing frequency bursts when we know a
(set of) tasks can make use of them.

In shared policies, the maximum of all CPUs' boosts is used. Since the
extra power expenditure is bounded, it cannot skyrocket even on
platforms with a large number of cores in the same frequency domain
and/or a very high ratio between the lowest and highest OPP cost.
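The shared-policy aggregation can be sketched in isolation as follows (a simplified standalone model of what sugov_next_freq_shared() does below; the helper name is made up):

```c
#include <assert.h>

/*
 * Simplified model of the shared-policy case: the boost applied to the
 * frequency domain is the max of the per-CPU boosts, so it is bounded
 * by the largest single-CPU boost regardless of how many CPUs share
 * the domain.
 */
static unsigned long policy_ramp_boost(const unsigned long *boosts, int n)
{
	unsigned long ret = 0;
	int i;

	for (i = 0; i < n; i++)
		if (boosts[i] > ret)
			ret = boosts[i];
	return ret;
}
```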

Signed-off-by: Douglas RAILLARD <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 25a410a1ff6a..9a7617ea7bf4 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -228,6 +228,9 @@ static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu)
* @sg_policy: schedutil policy object to compute the new frequency for.
* @util: Current CPU utilization.
* @max: CPU capacity.
+ * @boost: Extra power that can be spent on top of the minimum amount of power
+ * required to meet capacity requirements, as a percentage between 0 and
+ * EM_COST_MARGIN_SCALE.
*
* If the utilization is frequency-invariant, choose the new frequency to be
* proportional to it, that is
@@ -246,7 +249,8 @@ static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu)
* cpufreq driver limitations.
*/
static unsigned int get_next_freq(struct sugov_policy *sg_policy,
- unsigned long util, unsigned long max)
+ unsigned long util, unsigned long max,
+ unsigned long boost)
{
struct cpufreq_policy *policy = sg_policy->policy;
unsigned int freq = arch_scale_freq_invariant() ?
@@ -259,7 +263,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
* Try to get a higher frequency if one is available, given the extra
* power we are ready to spend.
*/
- freq = em_pd_get_higher_freq(pd, freq, 0);
+ freq = em_pd_get_higher_freq(pd, freq, boost);

if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
return sg_policy->next_freq;
@@ -541,6 +545,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
unsigned long util, max;
unsigned int next_f;
bool busy;
+ unsigned long ramp_boost = 0;

sugov_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
@@ -554,10 +559,10 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
busy = !sg_policy->need_freq_update && sugov_cpu_is_busy(sg_cpu);

util = sugov_get_util(sg_cpu);
- sugov_cpu_ramp_boost_update(sg_cpu);
+ ramp_boost = sugov_cpu_ramp_boost_update(sg_cpu);
max = sg_cpu->max;
util = sugov_iowait_apply(sg_cpu, time, util, max);
- next_f = get_next_freq(sg_policy, util, max);
+ next_f = get_next_freq(sg_policy, util, max, ramp_boost);
/*
* Do not reduce the frequency if the CPU has not been idle
* recently, as the reduction is likely to be premature then.
@@ -589,14 +594,19 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
struct cpufreq_policy *policy = sg_policy->policy;
unsigned long util = 0, max = 1;
unsigned int j;
+ unsigned long ramp_boost = 0;

for_each_cpu(j, policy->cpus) {
struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
- unsigned long j_util, j_max;
+ unsigned long j_util, j_max, j_ramp_boost;

j_util = sugov_get_util(j_sg_cpu);
if (j_sg_cpu == sg_cpu)
- sugov_cpu_ramp_boost_update(sg_cpu);
+ j_ramp_boost = sugov_cpu_ramp_boost_update(sg_cpu);
+ else
+ j_ramp_boost = sugov_cpu_ramp_boost(j_sg_cpu);
+ ramp_boost = max(ramp_boost, j_ramp_boost);
+
j_max = j_sg_cpu->max;
j_util = sugov_iowait_apply(j_sg_cpu, time, j_util, j_max);

@@ -606,7 +616,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
}
}

- return get_next_freq(sg_policy, util, max);
+ return get_next_freq(sg_policy, util, max, ramp_boost);
}

static void
--
2.24.1

2020-01-22 17:38:52

by Douglas RAILLARD

Subject: [RFC PATCH v4 4/6] sched/cpufreq: Introduce sugov_cpu_ramp_boost

Use the dynamics of the utilization signals to detect when the
utilization of a set of tasks starts increasing because of a change in
the tasks' behavior. This allows detecting when spending extra power on
a faster frequency ramp up would benefit the reactivity of the system.

This ramp boost is computed as the difference between util_avg and
util_est_enqueued. This number roughly represents a lower bound on how
much extra utilization these tasks are actually using, compared to our
best current stable knowledge of it (which is util_est_enqueued).

When the set of runnable tasks changes, the boost is disabled as the
impact of blocked utilization on util_avg will make the delta with
util_est_enqueued not very informative.
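The update rule described above can be modeled standalone as follows (simplified from the patch below; the struct and helper names are trimmed for illustration):

```c
#include <assert.h>

/* Last observed per-CPU signals, as cached in struct sugov_cpu. */
struct cpu_state { unsigned long util_est_enqueued, util_avg; };

/*
 * Simplified model of sugov_cpu_ramp_boost_update(): boost only when
 * the set of enqueued tasks is stable (util_est_enqueued unchanged),
 * utilization is not decreasing, and util_avg overshoots the last
 * stable estimate. The boost is the size of that overshoot.
 */
static unsigned long ramp_boost_update(struct cpu_state *s,
				       unsigned long util_est_enqueued,
				       unsigned long util_avg)
{
	unsigned long boost = 0;

	if (util_est_enqueued == s->util_est_enqueued &&
	    util_avg >= s->util_avg &&
	    util_avg > util_est_enqueued)
		boost = util_avg - util_est_enqueued;

	s->util_est_enqueued = util_est_enqueued;
	s->util_avg = util_avg;
	return boost;
}
```

The first call after an enqueue/dequeue reports no boost since util_est_enqueued changed; once the task set is stable, the overshoot of util_avg over util_est_enqueued becomes the boost, and a decreasing util_avg disables it again.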

Signed-off-by: Douglas RAILLARD <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 43 ++++++++++++++++++++++++++++++++
1 file changed, 43 insertions(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 608963da4916..25a410a1ff6a 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -61,6 +61,10 @@ struct sugov_cpu {
unsigned long bw_dl;
unsigned long max;

+ unsigned long ramp_boost;
+ unsigned long util_est_enqueued;
+ unsigned long util_avg;
+
/* The field below is for single-CPU policies only: */
#ifdef CONFIG_NO_HZ_COMMON
unsigned long saved_idle_calls;
@@ -183,6 +187,42 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy, u64 time,
}
}

+static unsigned long sugov_cpu_ramp_boost(struct sugov_cpu *sg_cpu)
+{
+ return READ_ONCE(sg_cpu->ramp_boost);
+}
+
+static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu)
+{
+ struct rq *rq = cpu_rq(sg_cpu->cpu);
+ unsigned long util_est_enqueued;
+ unsigned long util_avg;
+ unsigned long boost = 0;
+
+ util_est_enqueued = READ_ONCE(rq->cfs.avg.util_est.enqueued);
+ util_avg = READ_ONCE(rq->cfs.avg.util_avg);
+
+ /*
+ * Boost when util_avg becomes higher than the previous stable
+ * knowledge of the enqueued tasks' set util, which is CPU's
+ * util_est_enqueued.
+ *
+ * We try to spot changes in the workload itself, so we want to
+ * avoid the noise of tasks being enqueued/dequeued. To do that,
+ * we only trigger boosting when the "amount of work" enqueued
+ * is stable.
+ */
+ if (util_est_enqueued == sg_cpu->util_est_enqueued &&
+ util_avg >= sg_cpu->util_avg &&
+ util_avg > util_est_enqueued)
+ boost = util_avg - util_est_enqueued;
+
+ sg_cpu->util_est_enqueued = util_est_enqueued;
+ sg_cpu->util_avg = util_avg;
+ WRITE_ONCE(sg_cpu->ramp_boost, boost);
+ return boost;
+}
+
/**
* get_next_freq - Compute a new frequency for a given cpufreq policy.
* @sg_policy: schedutil policy object to compute the new frequency for.
@@ -514,6 +554,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
busy = !sg_policy->need_freq_update && sugov_cpu_is_busy(sg_cpu);

util = sugov_get_util(sg_cpu);
+ sugov_cpu_ramp_boost_update(sg_cpu);
max = sg_cpu->max;
util = sugov_iowait_apply(sg_cpu, time, util, max);
next_f = get_next_freq(sg_policy, util, max);
@@ -554,6 +595,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
unsigned long j_util, j_max;

j_util = sugov_get_util(j_sg_cpu);
+ if (j_sg_cpu == sg_cpu)
+ sugov_cpu_ramp_boost_update(sg_cpu);
j_max = j_sg_cpu->max;
j_util = sugov_iowait_apply(j_sg_cpu, time, j_util, j_max);

--
2.24.1

2020-01-22 18:15:35

by Douglas RAILLARD

Subject: Re: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware

Hi Peter,

Since v3 was posted a while ago, here is a short recap of the
outstanding comments:

* The boost margin was relative, but we came to the conclusion that it
would make more sense to make it absolute (done in this v4).

* The main remaining point of uncertainty was why defining
boost=(util - util_est) makes sense. The justification is that we use a
PELT-shaped signal to drive the frequency, so using a PELT-shaped signal
for the boost makes sense for the same reasons.

AFAIK there is no specific criterion to meet for the shape of the
frequency selection signal for anything other than periodic tasks (if we
don't add other constraints on top), so (util - util_est)=(util -
constant) seems as good as anything else. Especially since util is
deemed a good fit in practice for frequency selection. Let me know if I
missed anything on that front.


v3 thread: https://lore.kernel.org/lkml/[email protected]/

Cheers,
Douglas

On 1/22/20 5:35 PM, Douglas RAILLARD wrote:
> Make schedutil cpufreq governor energy-aware.
>
> - patch 1 introduces a function to retrieve a frequency given a base
> frequency and an energy cost margin.
> - patch 2 links Energy Model perf_domain to sugov_policy.
> - patch 3 updates get_next_freq() to make use of the Energy Model.
> - patch 4 adds sugov_cpu_ramp_boost() function.
> - patch 5 updates sugov_update_(single|shared)() to make use of
> sugov_cpu_ramp_boost().
> - patch 6 introduces a tracepoint in get_next_freq() for
> testing/debugging. Since it's not a trace event, it's not exposed to
> userspace in a directly usable way, allowing for painless future
> updates/removal.
>
> The benefits of using the EM in schedutil are twofold:
>
> 1) Selecting the highest possible frequency for a given cost. Some
> platforms can have lower frequencies that are less efficient than
> higher ones, in which case they should be skipped for most purposes.
> They can still be useful to give more freedom to thermal throttling
> mechanisms, but not under normal circumstances.
> note: the EM framework will warn about such OPPs "hertz/watts ratio
> non-monotonically decreasing"
>
> 2) Driving the frequency selection with power in mind, in addition to
> maximizing the utilization of the non-idle CPUs in the system.
>
> Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and
> enabled in schedutil by
> "sched/cpufreq: Hook em_pd_get_higher_freq() into get_next_freq()".
>
> Point 2) is enabled in
> "sched/cpufreq: Boost schedutil frequency ramp up". It allows using
> higher frequencies when it is known that the true utilization of
> currently running tasks is exceeding their previous stable point.
> The benefits are:
>
> * Boosting the frequency when the behavior of a runnable task changes,
> leading to an increase in utilization. That shortens the frequency
> ramp up duration, which in turn allows the utilization signal to
> reach stable values more quickly. Since the allowed frequency boost is
> bounded in energy, it will behave consistently across platforms,
> regardless of the OPP cost range.
>
> * The boost is only transient, and should not significantly impact the
> energy consumed by workloads with very stable utilization signals.
>
> This has been lightly tested with an rt-app task ramping from 10% to 75%
> utilization on a big core.
>
> v1 -> v2:
>
> * Split the new sugov_cpu_ramp_boost() from the existing
> sugov_cpu_is_busy() as they seem to seek a different goal.
>
> * Implement sugov_cpu_ramp_boost() based on CFS util_avg and
> util_est_enqueued signals, rather than using idle calls count.
> This makes the ramp boost much more accurate in finding boost
> opportunities, and gives a "continuous" output rather than a boolean.
>
> * Add EM_COST_MARGIN_SCALE=1024 to represent the
> margin values of em_pd_get_higher_freq().
>
> v2 -> v3:
>
> * Check util_avg >= sg_cpu->util_avg in sugov_cpu_ramp_boost_update()
> to avoid boosting when the utilization is decreasing.
>
> * Add a tracepoint for testing.
>
> v3 -> v4:
>
> * em_pd_get_higher_freq() now interprets the margin as absolute,
> rather than relative to the cost of the base frequency.
>
> * Modify misleading comment in em_pd_get_higher_freq() since min_freq
> can actually be higher than the max available frequency in normal
> operations.
>
> Douglas RAILLARD (6):
> PM: Introduce em_pd_get_higher_freq()
> sched/cpufreq: Attach perf domain to sugov policy
> sched/cpufreq: Hook em_pd_get_higher_freq() into get_next_freq()
> sched/cpufreq: Introduce sugov_cpu_ramp_boost
> sched/cpufreq: Boost schedutil frequency ramp up
> sched/cpufreq: Add schedutil_em_tp tracepoint
>
> include/linux/energy_model.h | 56 ++++++++++++++
> include/trace/events/power.h | 9 +++
> kernel/sched/cpufreq_schedutil.c | 124 +++++++++++++++++++++++++++++--
> 3 files changed, 182 insertions(+), 7 deletions(-)
>

2020-01-23 15:46:34

by Rafael J. Wysocki

Subject: Re: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware

On Wed, Jan 22, 2020 at 6:36 PM Douglas RAILLARD
<[email protected]> wrote:
>
> Make schedutil cpufreq governor energy-aware.

I have to say that your terminology is confusing to me, like what
exactly does "energy-aware" mean in the first place?

> - patch 1 introduces a function to retrieve a frequency given a base
> frequency and an energy cost margin.
> - patch 2 links Energy Model perf_domain to sugov_policy.
> - patch 3 updates get_next_freq() to make use of the Energy Model.
> - patch 4 adds sugov_cpu_ramp_boost() function.
> - patch 5 updates sugov_update_(single|shared)() to make use of
> sugov_cpu_ramp_boost().
> - patch 6 introduces a tracepoint in get_next_freq() for
> testing/debugging. Since it's not a trace event, it's not exposed to
> userspace in a directly usable way, allowing for painless future
> updates/removal.
>
> The benefits of using the EM in schedutil are twofold:

I guess you mean using the EM directly in schedutil (note that it is
used indirectly already, because of EAS), but that needs to be clearly
stated.

> 1) Selecting the highest possible frequency for a given cost. Some
> platforms can have lower frequencies that are less efficient than
> higher ones, in which case they should be skipped for most purposes.
> They can still be useful to give more freedom to thermal throttling
> mechanisms, but not under normal circumstances.
> note: the EM framework will warn about such OPPs "hertz/watts ratio
> non-monotonically decreasing"

While all of that is fair enough for platforms using the EM, do you
realize that the EM is not available on the majority of architectures
(including some fairly significant ones) and so adding overhead
related to it for all of them is quite less than welcome?

> 2) Driving the frequency selection with power in mind, in addition to
> maximizing the utilization of the non-idle CPUs in the system.

Care to explain this? I'm totally unsure what you mean here.

> Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and
> enabled in schedutil by
> "sched/cpufreq: Hook em_pd_get_higher_freq() into get_next_freq()".
>
> Point 2) is enabled in
> "sched/cpufreq: Boost schedutil frequency ramp up". It allows using
> higher frequencies when it is known that the true utilization of
> currently running tasks is exceeding their previous stable point.

Please explain "true utilization" and "stable point".

> The benefits are:
>
> * Boosting the frequency when the behavior of a runnable task changes,
> leading to an increase in utilization. That shortens the frequency
> ramp up duration, which in turn allows the utilization signal to
> reach stable values more quickly. Since the allowed frequency boost is
> bounded in energy, it will behave consistently across platforms,
> regardless of the OPP cost range.

Sounds good.

Can you please describe the algorithm applied to achieve that?

> * The boost is only transient, and should not significantly impact the
> energy consumed by workloads with very stable utilization signals.

2020-01-23 15:56:32

by Rafael J. Wysocki

Subject: Re: [RFC PATCH v4 4/6] sched/cpufreq: Introduce sugov_cpu_ramp_boost

On Wed, Jan 22, 2020 at 6:36 PM Douglas RAILLARD
<[email protected]> wrote:
>
> Use the utilization signals dynamic to detect when the utilization of a
> set of tasks starts increasing because of a change in tasks' behavior.
> This allows detecting when spending extra power for faster frequency
> ramp up response would be beneficial to the reactivity of the system.
>
> This ramp boost is computed as the difference between util_avg and
> util_est_enqueued. This number somehow represents a lower bound of how
> much extra utilization these tasks are actually using, compared to our
> best current stable knowledge of it (which is util_est_enqueued).
>
> When the set of runnable tasks changes, the boost is disabled as the
> impact of blocked utilization on util_avg will make the delta with
> util_est_enqueued not very informative.
>
> Signed-off-by: Douglas RAILLARD <[email protected]>
> ---
> kernel/sched/cpufreq_schedutil.c | 43 ++++++++++++++++++++++++++++++++
> 1 file changed, 43 insertions(+)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 608963da4916..25a410a1ff6a 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -61,6 +61,10 @@ struct sugov_cpu {
> unsigned long bw_dl;
> unsigned long max;
>
> + unsigned long ramp_boost;
> + unsigned long util_est_enqueued;
> + unsigned long util_avg;
> +
> /* The field below is for single-CPU policies only: */
> #ifdef CONFIG_NO_HZ_COMMON
> unsigned long saved_idle_calls;
> @@ -183,6 +187,42 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy, u64 time,
> }
> }
>
> +static unsigned long sugov_cpu_ramp_boost(struct sugov_cpu *sg_cpu)
> +{
> + return READ_ONCE(sg_cpu->ramp_boost);
> +}

Where exactly is this function used?

> +
> +static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu)
> +{
> + struct rq *rq = cpu_rq(sg_cpu->cpu);
> + unsigned long util_est_enqueued;
> + unsigned long util_avg;
> + unsigned long boost = 0;
> +
> + util_est_enqueued = READ_ONCE(rq->cfs.avg.util_est.enqueued);
> + util_avg = READ_ONCE(rq->cfs.avg.util_avg);
> +
> + /*
> + * Boost when util_avg becomes higher than the previous stable
> + * knowledge of the enqueued tasks' set util, which is CPU's
> + * util_est_enqueued.
> + *
> + * We try to spot changes in the workload itself, so we want to
> + * avoid the noise of tasks being enqueued/dequeued. To do that,
> + * we only trigger boosting when the "amount of work" enqueued
> + * is stable.
> + */
> + if (util_est_enqueued == sg_cpu->util_est_enqueued &&
> + util_avg >= sg_cpu->util_avg &&
> + util_avg > util_est_enqueued)
> + boost = util_avg - util_est_enqueued;
> +
> + sg_cpu->util_est_enqueued = util_est_enqueued;
> + sg_cpu->util_avg = util_avg;
> + WRITE_ONCE(sg_cpu->ramp_boost, boost);
> + return boost;
> +}
> +
> /**
> * get_next_freq - Compute a new frequency for a given cpufreq policy.
> * @sg_policy: schedutil policy object to compute the new frequency for.
> @@ -514,6 +554,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
> busy = !sg_policy->need_freq_update && sugov_cpu_is_busy(sg_cpu);
>
> util = sugov_get_util(sg_cpu);
> + sugov_cpu_ramp_boost_update(sg_cpu);
> max = sg_cpu->max;
> util = sugov_iowait_apply(sg_cpu, time, util, max);
> next_f = get_next_freq(sg_policy, util, max);
> @@ -554,6 +595,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> unsigned long j_util, j_max;
>
> j_util = sugov_get_util(j_sg_cpu);
> + if (j_sg_cpu == sg_cpu)
> + sugov_cpu_ramp_boost_update(sg_cpu);
> j_max = j_sg_cpu->max;
> j_util = sugov_iowait_apply(j_sg_cpu, time, j_util, j_max);
>
> --

2020-01-23 16:18:28

by Quentin Perret

Subject: Re: [RFC PATCH v4 3/6] sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()

On Wednesday 22 Jan 2020 at 17:35:35 (+0000), Douglas RAILLARD wrote:
> @@ -210,9 +211,16 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> struct cpufreq_policy *policy = sg_policy->policy;
> unsigned int freq = arch_scale_freq_invariant() ?
> policy->cpuinfo.max_freq : policy->cur;
> + struct em_perf_domain *pd = sugov_policy_get_pd(sg_policy);
>
> freq = map_util_freq(util, freq, max);
>
> + /*
> + * Try to get a higher frequency if one is available, given the extra
> + * power we are ready to spend.
> + */
> + freq = em_pd_get_higher_freq(pd, freq, 0);

I find it sad that the call just below to cpufreq_driver_resolve_freq()
and cpufreq_frequency_table_target() iterates the OPPs all over again.
It's especially a shame since most existing users of the EM stuff do
have a cpufreq frequency table.

Have you looked at hooking this inside cpufreq_driver_resolve_freq()
instead ? If we have a well-formed EM available, the call to
cpufreq_frequency_table_target() feels redundant, so we might want to
skip it.

Thoughts ?

Quentin

2020-01-23 17:20:42

by Douglas RAILLARD

Subject: Re: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware

Hi Rafael,

On 1/23/20 3:43 PM, Rafael J. Wysocki wrote:
> On Wed, Jan 22, 2020 at 6:36 PM Douglas RAILLARD
> <[email protected]> wrote:
>>
>> Make schedutil cpufreq governor energy-aware.
>
> I have to say that your terminology is confusing to me, like what
> exactly does "energy-aware" mean in the first place?

It should probably be rephrased as "Make the schedutil cpufreq governor
use the energy model". Schedutil is indeed already energy-aware in a
sense, since it tries to use the lowest frequency that still gets the job
done (kind of).

>
>> - patch 1 introduces a function to retrieve a frequency given a base
>> frequency and an energy cost margin.
>> - patch 2 links Energy Model perf_domain to sugov_policy.
>> - patch 3 updates get_next_freq() to make use of the Energy Model.
>> - patch 4 adds sugov_cpu_ramp_boost() function.
>> - patch 5 updates sugov_update_(single|shared)() to make use of
>> sugov_cpu_ramp_boost().
>> - patch 6 introduces a tracepoint in get_next_freq() for
>> testing/debugging. Since it's not a trace event, it's not exposed to
>> userspace in a directly usable way, allowing for painless future
>> updates/removal.
>>
>> The benefits of using the EM in schedutil are twofold:
>
> I guess you mean using the EM directly in schedutil (note that it is
> used indirectly already, because of EAS), but that needs to be clearly
> stated.

In the current state (of the code and of my knowledge), the EM "leaks" into
schedutil only through the fact that tasks are moved around by EAS, so the
CPU util seen by schedutil is impacted compared to the same workload on a
non-EAS setup.

Other than that, the only energy-related information schedutil uses is
the assumption that lower freq == better efficiency. Explicit use of the
EM allows refining this assumption.

>
>> 1) Selecting the highest possible frequency for a given cost. Some
>> platforms can have lower frequencies that are less efficient than
>> higher ones, in which case they should be skipped for most purposes.
>> They can still be useful to give more freedom to thermal throttling
>> mechanisms, but not under normal circumstances.
>> note: the EM framework will warn about such OPPs "hertz/watts ratio
>> non-monotonically decreasing"
>
> While all of that is fair enough for platforms using the EM, do you
> realize that the EM is not available on the majority of architectures
> (including some fairly significant ones) and so adding overhead
> related to it for all of them is quite less than welcome?

When CONFIG_ENERGY_MODEL is not defined, em_pd_get_higher_freq() is
defined to a static inline no-op function, so that feature won't incur
overhead (patch 1+2+3).

Patch 4 and 5 do add some new logic that could be used on any platform.
Current code will use the boost as an energy margin, but it would be
straightforward to make a util-based version (like iowait boost) on
non-EM platforms.

>> 2) Driving the frequency selection with power in mind, in addition to
>> maximizing the utilization of the non-idle CPUs in the system.
>
> Care to explain this? I'm totally unsure what you mean here.

Currently, schedutil basically tailors the CPU capacity to the util of
the tasks on it. That's all good for periodic tasks, but there are
situations where we can do better than assuming the task is periodic
with a fixed duty cycle.

The case improved by this series is when a task increases its duty
cycle. In that specific case, it can be a good idea to increase the
frequency until the util stabilizes again. We don't have a crystal ball,
so we can't adjust the freq right away. However, we do want to avoid
starving the task of CPU speed until schedutil realizes it needs it.
Using the EM here allows boosting within reasonable limits, without
destroying the average energy consumption.

>
>> Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and
>> enabled in schedutil by
>> "sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()".
>>
>> Point 2) is enabled in
>> "sched/cpufreq: Boost schedutil frequency ramp up". It allows using
>> higher frequencies when it is known that the true utilization of
>> currently running tasks is exceeding their previous stable point.
>
> Please explain "true utilization" and "stable point".

"true utilization" would be an instantaneous duty cycle. If a task
suddenly starts doing twice as much work, its "true utilization" will
double instantly. "stable point" would be util est enqueued here. If a
task is periodic, util est enqueued will be constant once it reaches a
steady state. As soon as the duty cycle of the task changes, util est
enqueued will change.

>
>> The benefits are:
>>
>> * Boosting the frequency when the behavior of a runnable task changes,
>> leading to an increase in utilization. That shortens the frequency
>> ramp up duration, which in turns allows the utilization signal to
>> reach stable values quicker. Since the allowed frequency boost is
>> bounded in energy, it will behave consistently across platforms,
>> regardless of the OPP cost range.
>
> Sounds good.
>
> Can you please describe the algorithm applied to achieve that?

The util est enqueued of a task is basically a snapshot of the util of
the task just before it's dequeued. This means that when the util has
stabilized, util est enqueued will be a constant signal. Specifically,
util est enqueued will be an upper bound of the swing of util avg.

When the task starts doing more work than at the previous activation,
its util avg will rise above the current util est enqueued. This means
we cannot assume anymore that util est enqueued represents an upper
bound of the duty cycle, so we can decide to boost until util avg
"stabilizes" again [note].

At the CPU level, we can track that in the rq aggregated signals:
- "stable rq's util est enqueued" is assumed to mean "same set of
enqueued tasks as the last time we looked at that rq".

- task util est enqueued and util avg can be replaced by the rq
signal. This will hide cases where a task's util increases while another
one decreases by the same amount.

The limitations of both assumptions can be fixed by more invasive
changes (a rq cookie to know the set of enqueued tasks and an
OR-aggregated per-task flag to ask for boosting), but these heuristics
allow using the existing signals with changes limited to schedutil.

Once we detected this situation, we can decide to boost. We don't want
black&white boosting, since a tiny increase in util should lead to a
tiny boost. Here, we use (util - util_est_enqueued). If the increase is
small, that boost will be small.


[note]:
util avg of a periodic task never actually stabilizes, it just enters an
interval and never leaves it. When the duty cycle changes, it will leave
that interval to enter another one. The centre of that interval is the
task's duty cycle.


>> * The boost is only transient, and should not impact a lot the energy
>> consumed of workloads with very stable utilization signals.

Thanks,
Douglas

2020-01-23 17:23:35

by Douglas RAILLARD

Subject: Re: [RFC PATCH v4 4/6] sched/cpufreq: Introduce sugov_cpu_ramp_boost



On 1/23/20 3:55 PM, Rafael J. Wysocki wrote:
> On Wed, Jan 22, 2020 at 6:36 PM Douglas RAILLARD
> <[email protected]> wrote:
>>
>> Use the utilization signals dynamic to detect when the utilization of a
>> set of tasks starts increasing because of a change in tasks' behavior.
>> This allows detecting when spending extra power for faster frequency
>> ramp up response would be beneficial to the reactivity of the system.
>>
>> This ramp boost is computed as the difference between util_avg and
>> util_est_enqueued. This number somehow represents a lower bound of how
>> much extra utilization this tasks is actually using, compared to our
>> best current stable knowledge of it (which is util_est_enqueued).
>>
>> When the set of runnable tasks changes, the boost is disabled as the
>> impact of blocked utilization on util_avg will make the delta with
>> util_est_enqueued not very informative.
>>
>> Signed-off-by: Douglas RAILLARD <[email protected]>
>> ---
>> kernel/sched/cpufreq_schedutil.c | 43 ++++++++++++++++++++++++++++++++
>> 1 file changed, 43 insertions(+)
>>
>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>> index 608963da4916..25a410a1ff6a 100644
>> --- a/kernel/sched/cpufreq_schedutil.c
>> +++ b/kernel/sched/cpufreq_schedutil.c
>> @@ -61,6 +61,10 @@ struct sugov_cpu {
>> unsigned long bw_dl;
>> unsigned long max;
>>
>> + unsigned long ramp_boost;
>> + unsigned long util_est_enqueued;
>> + unsigned long util_avg;
>> +
>> /* The field below is for single-CPU policies only: */
>> #ifdef CONFIG_NO_HZ_COMMON
>> unsigned long saved_idle_calls;
>> @@ -183,6 +187,42 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy, u64 time,
>> }
>> }
>>
>> +static unsigned long sugov_cpu_ramp_boost(struct sugov_cpu *sg_cpu)
>> +{
>> + return READ_ONCE(sg_cpu->ramp_boost);
>> +}
>
> Where exactly is this function used?

In the next commit where the boost value is actually used to do
something. The function is introduced here to keep the
WRITE_ONCE/READ_ONCE pair together.

>
>> +
>> +static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu)
>> +{
>> + struct rq *rq = cpu_rq(sg_cpu->cpu);
>> + unsigned long util_est_enqueued;
>> + unsigned long util_avg;
>> + unsigned long boost = 0;
>> +
>> + util_est_enqueued = READ_ONCE(rq->cfs.avg.util_est.enqueued);
>> + util_avg = READ_ONCE(rq->cfs.avg.util_avg);
>> +
>> + /*
>> + * Boost when util_avg becomes higher than the previous stable
>> + * knowledge of the enqueued tasks' set util, which is CPU's
>> + * util_est_enqueued.
>> + *
>> + * We try to spot changes in the workload itself, so we want to
>> + * avoid the noise of tasks being enqueued/dequeued. To do that,
>> + * we only trigger boosting when the "amount of work" enqueued
>> + * is stable.
>> + */
>> + if (util_est_enqueued == sg_cpu->util_est_enqueued &&
>> + util_avg >= sg_cpu->util_avg &&
>> + util_avg > util_est_enqueued)
>> + boost = util_avg - util_est_enqueued;
>> +
>> + sg_cpu->util_est_enqueued = util_est_enqueued;
>> + sg_cpu->util_avg = util_avg;
>> + WRITE_ONCE(sg_cpu->ramp_boost, boost);
>> + return boost;
>> +}
>> +
>> /**
>> * get_next_freq - Compute a new frequency for a given cpufreq policy.
>> * @sg_policy: schedutil policy object to compute the new frequency for.
>> @@ -514,6 +554,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>> busy = !sg_policy->need_freq_update && sugov_cpu_is_busy(sg_cpu);
>>
>> util = sugov_get_util(sg_cpu);
>> + sugov_cpu_ramp_boost_update(sg_cpu);
>> max = sg_cpu->max;
>> util = sugov_iowait_apply(sg_cpu, time, util, max);
>> next_f = get_next_freq(sg_policy, util, max);
>> @@ -554,6 +595,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
>> unsigned long j_util, j_max;
>>
>> j_util = sugov_get_util(j_sg_cpu);
>> + if (j_sg_cpu == sg_cpu)
>> + sugov_cpu_ramp_boost_update(sg_cpu);
>> j_max = j_sg_cpu->max;
>> j_util = sugov_iowait_apply(j_sg_cpu, time, j_util, j_max);
>>
>> --

2020-01-23 17:55:42

by Douglas RAILLARD

Subject: Re: [RFC PATCH v4 3/6] sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()



On 1/23/20 4:16 PM, Quentin Perret wrote:
> On Wednesday 22 Jan 2020 at 17:35:35 (+0000), Douglas RAILLARD wrote:
>> @@ -210,9 +211,16 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>> struct cpufreq_policy *policy = sg_policy->policy;
>> unsigned int freq = arch_scale_freq_invariant() ?
>> policy->cpuinfo.max_freq : policy->cur;
>> + struct em_perf_domain *pd = sugov_policy_get_pd(sg_policy);
>>
>> freq = map_util_freq(util, freq, max);
>>
>> + /*
>> + * Try to get a higher frequency if one is available, given the extra
>> + * power we are ready to spend.
>> + */
>> + freq = em_pd_get_higher_freq(pd, freq, 0);
>
> I find it sad that the call just below to cpufreq_driver_resolve_freq()
> and cpufreq_frequency_table_target() iterates the OPPs all over again.
> It's especially a shame since most existing users of the EM stuff do
> have a cpufreq frequency table.
>
> Have you looked at hooking this inside cpufreq_driver_resolve_freq()
> instead ? If we have a well-formed EM available, the call to
> cpufreq_frequency_table_target() feels redundant, so we might want to
> skip it.

We can't really move the call to em_pd_get_higher_freq() into
cpufreq_driver_resolve_freq() since that's a schedutil-specific feature,
and we would lose the !sg_policy->need_freq_update optimization.

Maybe we can add a flag to cpufreq_driver_resolve_freq() that promises
that the frequency is already a valid one. We have to be careful though,
since a number of things can make that untrue:
- em_pd_get_higher_freq() will return the passed freq verbatim if it's
higher than the max freq, so em_pd_get_higher_freq() will have to set
the flag itself in case that logic changes.
- policy limits can change the value
- future things could tinker with the freq and forget to reset the flag.

If you think it's worth it I can make these changes.

> Thoughts ?
>
> Quentin
>

Cheers,
Douglas

2020-01-23 21:26:03

by Rafael J. Wysocki

Subject: Re: [RFC PATCH v4 4/6] sched/cpufreq: Introduce sugov_cpu_ramp_boost

On Thu, Jan 23, 2020 at 6:21 PM Douglas Raillard
<[email protected]> wrote:
>
>
>
> On 1/23/20 3:55 PM, Rafael J. Wysocki wrote:
> > On Wed, Jan 22, 2020 at 6:36 PM Douglas RAILLARD
> > <[email protected]> wrote:
> >>
> >> Use the utilization signals dynamic to detect when the utilization of a
> >> set of tasks starts increasing because of a change in tasks' behavior.
> >> This allows detecting when spending extra power for faster frequency
> >> ramp up response would be beneficial to the reactivity of the system.
> >>
> >> This ramp boost is computed as the difference between util_avg and
> >> util_est_enqueued. This number somehow represents a lower bound of how
> >> much extra utilization this tasks is actually using, compared to our
> >> best current stable knowledge of it (which is util_est_enqueued).
> >>
> >> When the set of runnable tasks changes, the boost is disabled as the
> >> impact of blocked utilization on util_avg will make the delta with
> >> util_est_enqueued not very informative.
> >>
> >> Signed-off-by: Douglas RAILLARD <[email protected]>
> >> ---
> >> kernel/sched/cpufreq_schedutil.c | 43 ++++++++++++++++++++++++++++++++
> >> 1 file changed, 43 insertions(+)
> >>
> >> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> >> index 608963da4916..25a410a1ff6a 100644
> >> --- a/kernel/sched/cpufreq_schedutil.c
> >> +++ b/kernel/sched/cpufreq_schedutil.c
> >> @@ -61,6 +61,10 @@ struct sugov_cpu {
> >> unsigned long bw_dl;
> >> unsigned long max;
> >>
> >> + unsigned long ramp_boost;
> >> + unsigned long util_est_enqueued;
> >> + unsigned long util_avg;
> >> +
> >> /* The field below is for single-CPU policies only: */
> >> #ifdef CONFIG_NO_HZ_COMMON
> >> unsigned long saved_idle_calls;
> >> @@ -183,6 +187,42 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy, u64 time,
> >> }
> >> }
> >>
> >> +static unsigned long sugov_cpu_ramp_boost(struct sugov_cpu *sg_cpu)
> >> +{
> >> + return READ_ONCE(sg_cpu->ramp_boost);
> >> +}
> >
> > Where exactly is this function used?
>
> In the next commit where the boost value is actually used to do
> something. The function is introduced here to keep the
> WRITE_ONCE/READ_ONCE pair together.

But ramp_boost itself is not really used in this patch either, AFAICS.

2020-01-24 15:19:07

by Quentin Perret

Subject: Re: [RFC PATCH v4 3/6] sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()

On Friday 24 Jan 2020 at 14:37:04 (+0000), Quentin Perret wrote:
> On Thursday 23 Jan 2020 at 17:52:53 (+0000), Douglas Raillard wrote:
> > We can't really move the call to em_pd_get_higher_freq() into
> > cpufreq_driver_resolve_freq() since that's a schedutil-specific feature,
> > and we would loose the !sg_policy->need_freq_update optimization.
>
> Depends how you do it. You could add a new method to cpufreq_policy that

s/cpufreq_policy/cpufreq_governor

> is defined only for sugov or something along those lines. And you'd call
> that instead of cpufreq_frequency_table_target() when that makes sense.
>
> > Maybe we can add a flag to cpufreq_driver_resolve_freq() that promises
> > that the frequency is already a valid one. We have to be careful though,
> > since a number of things can make that untrue:
> > - em_pd_get_higher_freq() will return the passed freq verbatim if it's
> > higher than the max freq, so em_pd_get_higher_freq() will have to set
> > the flag itself in case that logic changes.
> > - policy limits can change the value
> > - future things could tinker with the freq and forget to reset the flag.
> >
> > If you think it's worth it I can make these changes.
>
> The thing is, not only with the current patch we end up iterating the
> frequencies twice for nothing, but also I think it'd be interesting to
> use the EM for consistency with EAS. It'd be nice to use the same data
> structure for the predictions we do in compute_energy() and for the
> actual request.
>
> Thoughts ?
>
> Quentin

2020-01-24 20:51:09

by Quentin Perret

Subject: Re: [RFC PATCH v4 3/6] sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()

On Thursday 23 Jan 2020 at 17:52:53 (+0000), Douglas Raillard wrote:
> We can't really move the call to em_pd_get_higher_freq() into
> cpufreq_driver_resolve_freq() since that's a schedutil-specific feature,
> and we would loose the !sg_policy->need_freq_update optimization.

Depends how you do it. You could add a new method to cpufreq_policy that
is defined only for sugov or something along those lines. And you'd call
that instead of cpufreq_frequency_table_target() when that makes sense.

> Maybe we can add a flag to cpufreq_driver_resolve_freq() that promises
> that the frequency is already a valid one. We have to be careful though,
> since a number of things can make that untrue:
> - em_pd_get_higher_freq() will return the passed freq verbatim if it's
> higher than the max freq, so em_pd_get_higher_freq() will have to set
> the flag itself in case that logic changes.
> - policy limits can change the value
> - future things could tinker with the freq and forget to reset the flag.
>
> If you think it's worth it I can make these changes.

The thing is, not only with the current patch we end up iterating the
frequencies twice for nothing, but also I think it'd be interesting to
use the EM for consistency with EAS. It'd be nice to use the same data
structure for the predictions we do in compute_energy() and for the
actual request.

Thoughts ?

Quentin

2020-01-27 17:18:03

by Vincent Guittot

Subject: Re: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware

On Wed, 22 Jan 2020 at 18:36, Douglas RAILLARD <[email protected]> wrote:
>
> Make schedutil cpufreq governor energy-aware.
>
> - patch 1 introduces a function to retrieve a frequency given a base
> frequency and an energy cost margin.
> - patch 2 links Energy Model perf_domain to sugov_policy.
> - patch 3 updates get_next_freq() to make use of the Energy Model.
> - patch 4 adds sugov_cpu_ramp_boost() function.
> - patch 5 updates sugov_update_(single|shared)() to make use of
> sugov_cpu_ramp_boost().
> - patch 6 introduces a tracepoint in get_next_freq() for
> testing/debugging. Since it's not a trace event, it's not exposed to
> userspace in a directly usable way, allowing for painless future
> updates/removal.
>
> The benefits of using the EM in schedutil are twofold:
>
> 1) Selecting the highest possible frequency for a given cost. Some
> platforms can have lower frequencies that are less efficient than
> higher ones, in which case they should be skipped for most purposes.

This makes sense. Why use a lower frequency when a higher one is more
power efficient?

> They can still be useful to give more freedom to thermal throttling
> mechanisms, but not under normal circumstances.
> note: the EM framework will warn about such OPPs "hertz/watts ratio
> non-monotonically decreasing"
>
> 2) Driving the frequency selection with power in mind, in addition to
> maximizing the utilization of the non-idle CPUs in the system.
>
> Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and
> enabled in schedutil by
> "sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()".
>
> Point 2) is enabled in
> "sched/cpufreq: Boost schedutil frequency ramp up". It allows using
> higher frequencies when it is known that the true utilization of
> currently running tasks is exceeding their previous stable point.
> The benefits are:
>
> * Boosting the frequency when the behavior of a runnable task changes,
> leading to an increase in utilization. That shortens the frequency
> ramp up duration, which in turns allows the utilization signal to
> reach stable values quicker. Since the allowed frequency boost is
> bounded in energy, it will behave consistently across platforms,
> regardless of the OPP cost range.

Could you explain this a bit more ?

>
> * The boost is only transient, and should not impact a lot the energy
> consumed of workloads with very stable utilization signals.
>
> This has been lightly tested with a rtapp task ramping from 10% to 75%
> utilisation on a big core.

Which kind of UC are you targeting ?

Do you have some benchmark showing the benefit and how you can bound
the increase of energy ?

The benefit of point 2) is less obvious to me. We already have uclamp,
which helps to override the "utilization" seen by schedutil in order to
boost or cap the frequency when certain tasks are running. I'm curious
to see what the benefit of this would be on top.

>
> v1 -> v2:
>
> * Split the new sugov_cpu_ramp_boost() from the existing
> sugov_cpu_is_busy() as they seem to seek a different goal.
>
> * Implement sugov_cpu_ramp_boost() based on CFS util_avg and
> util_est_enqueued signals, rather than using idle calls count.
> This makes the ramp boost much more accurate in finding boost
> opportunities, and give a "continuous" output rather than a boolean.
>
> * Add EM_COST_MARGIN_SCALE=1024 to represent the
> margin values of em_pd_get_higher_freq().
>
> v2 -> v3:
>
> * Check util_avg >= sg_cpu->util_avg in sugov_cpu_ramp_boost_update()
> to avoid boosting when the utilization is decreasing.
>
> * Add a tracepoint for testing.
>
> v3 -> v4:
>
> * em_pd_get_higher_freq() now interprets the margin as absolute,
> rather than relative to the cost of the base frequency.
>
> * Modify misleading comment in em_pd_get_higher_freq() since min_freq
> can actually be higher than the max available frequency in normal
> operations.
>
> Douglas RAILLARD (6):
> PM: Introduce em_pd_get_higher_freq()
> sched/cpufreq: Attach perf domain to sugov policy
> sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()
> sched/cpufreq: Introduce sugov_cpu_ramp_boost
> sched/cpufreq: Boost schedutil frequency ramp up
> sched/cpufreq: Add schedutil_em_tp tracepoint
>
> include/linux/energy_model.h | 56 ++++++++++++++
> include/trace/events/power.h | 9 +++
> kernel/sched/cpufreq_schedutil.c | 124 +++++++++++++++++++++++++++++--
> 3 files changed, 182 insertions(+), 7 deletions(-)
>
> --
> 2.24.1
>

2020-01-28 15:39:23

by Douglas RAILLARD

Subject: Re: [RFC PATCH v4 4/6] sched/cpufreq: Introduce sugov_cpu_ramp_boost



On 1/23/20 9:02 PM, Rafael J. Wysocki wrote:
> On Thu, Jan 23, 2020 at 6:21 PM Douglas Raillard
> <[email protected]> wrote:
>>
>>
>>
>> On 1/23/20 3:55 PM, Rafael J. Wysocki wrote:
>>> On Wed, Jan 22, 2020 at 6:36 PM Douglas RAILLARD
>>> <[email protected]> wrote:
>>>>
>>>> Use the utilization signals dynamic to detect when the utilization of a
>>>> set of tasks starts increasing because of a change in tasks' behavior.
>>>> This allows detecting when spending extra power for faster frequency
>>>> ramp up response would be beneficial to the reactivity of the system.
>>>>
>>>> This ramp boost is computed as the difference between util_avg and
>>>> util_est_enqueued. This number somehow represents a lower bound of how
>>>> much extra utilization this tasks is actually using, compared to our
>>>> best current stable knowledge of it (which is util_est_enqueued).
>>>>
>>>> When the set of runnable tasks changes, the boost is disabled as the
>>>> impact of blocked utilization on util_avg will make the delta with
>>>> util_est_enqueued not very informative.
>>>>
>>>> Signed-off-by: Douglas RAILLARD <[email protected]>
>>>> ---
>>>> kernel/sched/cpufreq_schedutil.c | 43 ++++++++++++++++++++++++++++++++
>>>> 1 file changed, 43 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>>>> index 608963da4916..25a410a1ff6a 100644
>>>> --- a/kernel/sched/cpufreq_schedutil.c
>>>> +++ b/kernel/sched/cpufreq_schedutil.c
>>>> @@ -61,6 +61,10 @@ struct sugov_cpu {
>>>> unsigned long bw_dl;
>>>> unsigned long max;
>>>>
>>>> + unsigned long ramp_boost;
>>>> + unsigned long util_est_enqueued;
>>>> + unsigned long util_avg;
>>>> +
>>>> /* The field below is for single-CPU policies only: */
>>>> #ifdef CONFIG_NO_HZ_COMMON
>>>> unsigned long saved_idle_calls;
>>>> @@ -183,6 +187,42 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy, u64 time,
>>>> }
>>>> }
>>>>
>>>> +static unsigned long sugov_cpu_ramp_boost(struct sugov_cpu *sg_cpu)
>>>> +{
>>>> + return READ_ONCE(sg_cpu->ramp_boost);
>>>> +}
>>>
>>> Where exactly is this function used?
>>
>> In the next commit where the boost value is actually used to do
>> something. The function is introduced here to keep the
>> WRITE_ONCE/READ_ONCE pair together.
>
> But ramp_boost itself is not really used in this patch too AFAICS.

I'll squash that patch with the next one where it's actually used then:
sched/cpufreq: Boost schedutil frequency ramp up

Thanks,
Douglas

2020-02-10 11:38:50

by Douglas RAILLARD

Subject: Re: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware

Hi Vincent,

On 1/27/20 5:16 PM, Vincent Guittot wrote:
> On Wed, 22 Jan 2020 at 18:36, Douglas RAILLARD <[email protected]> wrote:
>>
>> Make schedutil cpufreq governor energy-aware.
>>
>> - patch 1 introduces a function to retrieve a frequency given a base
>> frequency and an energy cost margin.
>> - patch 2 links Energy Model perf_domain to sugov_policy.
>> - patch 3 updates get_next_freq() to make use of the Energy Model.
>> - patch 4 adds sugov_cpu_ramp_boost() function.
>> - patch 5 updates sugov_update_(single|shared)() to make use of
>> sugov_cpu_ramp_boost().
>> - patch 6 introduces a tracepoint in get_next_freq() for
>> testing/debugging. Since it's not a trace event, it's not exposed to
>> userspace in a directly usable way, allowing for painless future
>> updates/removal.
>>
>> The benefits of using the EM in schedutil are twofold:
>>
>> 1) Selecting the highest possible frequency for a given cost. Some
>> platforms can have lower frequencies that are less efficient than
>> higher ones, in which case they should be skipped for most purposes.
>
> This make sense. Why using a lower frequency when a higher one is more
> power efficient

Apparently in some cases it can be useful for thermal capping. AFAIU the
alternative is to race to idle at a more efficient OPP (the idle
injection work from Linaro).

>> They can still be useful to give more freedom to thermal throttling
>> mechanisms, but not under normal circumstances.
>> note: the EM framework will warn about such OPPs "hertz/watts ratio
>> non-monotonically decreasing"
>>
>> 2) Driving the frequency selection with power in mind, in addition to
>> maximizing the utilization of the non-idle CPUs in the system.
>>
>> Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and
>> enabled in schedutil by
>> "sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()".
>>
>> Point 2) is enabled in
>> "sched/cpufreq: Boost schedutil frequency ramp up". It allows using
>> higher frequencies when it is known that the true utilization of
>> currently running tasks is exceeding their previous stable point.
>> The benefits are:
>>
>> * Boosting the frequency when the behavior of a runnable task changes,
>> leading to an increase in utilization. That shortens the frequency
>> ramp up duration, which in turns allows the utilization signal to
>> reach stable values quicker. Since the allowed frequency boost is
>> bounded in energy, it will behave consistently across platforms,
>> regardless of the OPP cost range.
>
> Could you explain this a bit more ?

The goal is to detect when the task starts asking for more CPU time than
it did during the previous period. At this stage, we don't know how much
more, so we increase the frequency faster to allow the signals to settle
more quickly.

The PELT signal does increase independently of the chosen frequency,
but only until idle time shows up. At that point, the util will drop
again, and the frequency with it.

>>
>> * The boost is only transient, and should not impact a lot the energy
>> consumed of workloads with very stable utilization signals.
>>
>> This has been lightly tested with a rtapp task ramping from 10% to 75%
>> utilisation on a big core.
>
> Which kind of UC are you targeting ?

One case is tasks with "random" behavior, like threads in thread pools
that can end up doing very different things. There may be other cases as
well, but I'll need to do more extensive testing with actual applications.

>
> Do you have some benchmark showing the benefit and how you can bound
> the increase of energy ?

In the test setup described above, it increases the energy consumption
by ~2.5%.

I also did some preliminary experiments with reducing the margin taken
in map_util_freq(), which becomes less necessary if the frequency is
boosted in the cases where the util increase is getting out of hand.
That can recover some amount of the lost power.

The real cost in practice heavily depends on:
* the workloads (if its util jumps around, it will boost more frequently)
* the discrete frequencies available (if boosting does not bring us to
the next freq, no boost is actually applied).

>
> The benefit of point2 is less obvious for me. We already have uclamp
> which helps to overwrite the "utilization" that is seen by schedutil
> to boost or cap the frequency when some tasks are running. I'm curious
> to see what would be the benefit of this on top.

uclamp is only useful when a target utilization is known beforehand by
the task itself or some kind of manager. In all the cases relying on
plain PELT, we can decrease the freq change reaction time.

Note that schedutil is already built around duty cycle detection, with
a bias toward higher frequencies when the task period increases (using
util_est_enqueued). What this series brings is a way to detect when
util_est_enqueued turns from a set-point into a lower bound.

>>
>> v1 -> v2:
>>
>> * Split the new sugov_cpu_ramp_boost() from the existing
>> sugov_cpu_is_busy() as they seem to seek a different goal.
>>
>> * Implement sugov_cpu_ramp_boost() based on CFS util_avg and
>> util_est_enqueued signals, rather than using idle calls count.
>> This makes the ramp boost much more accurate in finding boost
>> opportunities, and give a "continuous" output rather than a boolean.
>>
>> * Add EM_COST_MARGIN_SCALE=1024 to represent the
>> margin values of em_pd_get_higher_freq().
>>
>> v2 -> v3:
>>
>> * Check util_avg >= sg_cpu->util_avg in sugov_cpu_ramp_boost_update()
>> to avoid boosting when the utilization is decreasing.
>>
>> * Add a tracepoint for testing.
>>
>> v3 -> v4:
>>
>> * em_pd_get_higher_freq() now interprets the margin as absolute,
>> rather than relative to the cost of the base frequency.
>>
>> * Modify misleading comment in em_pd_get_higher_freq() since min_freq
>> can actually be higher than the max available frequency in normal
>> operations.
>>
>> Douglas RAILLARD (6):
>> PM: Introduce em_pd_get_higher_freq()
>> sched/cpufreq: Attach perf domain to sugov policy
>> sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()
>> sched/cpufreq: Introduce sugov_cpu_ramp_boost
>> sched/cpufreq: Boost schedutil frequency ramp up
>> sched/cpufreq: Add schedutil_em_tp tracepoint
>>
>> include/linux/energy_model.h | 56 ++++++++++++++
>> include/trace/events/power.h | 9 +++
>> kernel/sched/cpufreq_schedutil.c | 124 +++++++++++++++++++++++++++++--
>> 3 files changed, 182 insertions(+), 7 deletions(-)
>>
>> --
>> 2.24.1
>>

2020-02-10 13:08:33

by Peter Zijlstra

Subject: Re: [RFC PATCH v4 4/6] sched/cpufreq: Introduce sugov_cpu_ramp_boost

On Wed, Jan 22, 2020 at 05:35:36PM +0000, Douglas RAILLARD wrote:

> +static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu)
> +{
> +	struct rq *rq = cpu_rq(sg_cpu->cpu);
> +	unsigned long util_est_enqueued;
> +	unsigned long util_avg;
> +	unsigned long boost = 0;
> +

Should we NO-OP this function when !sched_feat(UTIL_EST) ?

> +	util_est_enqueued = READ_ONCE(rq->cfs.avg.util_est.enqueued);

Otherwise you're reading garbage here, no?

> +	util_avg = READ_ONCE(rq->cfs.avg.util_avg);
> +
> +	/*
> +	 * Boost when util_avg becomes higher than the previous stable
> +	 * knowledge of the enqueued tasks' set util, which is CPU's
> +	 * util_est_enqueued.
> +	 *
> +	 * We try to spot changes in the workload itself, so we want to
> +	 * avoid the noise of tasks being enqueued/dequeued. To do that,
> +	 * we only trigger boosting when the "amount of work" enqueued
> +	 * is stable.
> +	 */
> +	if (util_est_enqueued == sg_cpu->util_est_enqueued &&
> +	    util_avg >= sg_cpu->util_avg &&
> +	    util_avg > util_est_enqueued)
> +		boost = util_avg - util_est_enqueued;
> +
> +	sg_cpu->util_est_enqueued = util_est_enqueued;
> +	sg_cpu->util_avg = util_avg;
> +	WRITE_ONCE(sg_cpu->ramp_boost, boost);
> +	return boost;
> +}

2020-02-10 13:22:01

by Peter Zijlstra

Subject: Re: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware

On Wed, Jan 22, 2020 at 06:14:24PM +0000, Douglas Raillard wrote:
> Hi Peter,
>
> Since the v3 was posted a while ago, here is a short recap of the hanging
> comments:
>
> * The boost margin was relative, but we came to the conclusion it would make
> more sense to make it absolute (done in that v4).

As per (patch #1):

+	max_cost = pd->table[pd->nr_cap_states - 1].cost;
+	cost_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;

So we'll allow the boost to double energy consumption (or rather, since
you cannot go above the max OPP, we're allowed that).

> * The main remaining blur point was why defining boost=(util - util_est) makes
> sense. The justification for that is that we use PELT-shaped signal to drive
> the frequency, so using a PELT-shaped signal for the boost makes sense for the
> same reasons.

As per (patch #4):

+	unsigned long boost = 0;

+	if (util_est_enqueued == sg_cpu->util_est_enqueued &&
+	    util_avg >= sg_cpu->util_avg &&
+	    util_avg > util_est_enqueued)
+		boost = util_avg - util_est_enqueued;

The result of that is not, strictly speaking, a PELT shaped signal.
Although when it is !0 the curves are similar, albeit offset.

> AFAIK there is no specific criteria to meet for frequency selection signal shape
> for anything else than periodic tasks (if we don't add other constraints on
> top), so (util - util_est)=(util - constant) seems as good as anything else.
> Especially since util is deemed to be a good fit in practice for frequency
> selection. Let me know if I missed anything on that front.


Given:

sugov_get_util() <- cpu_util_cfs() <- UTIL_EST ? util_est.enqueued : util_avg.

our next_f becomes:

next_f = 1.25 * util_est * max_freq / max;

so our min_freq in em_pd_get_higher_freq() will already be compensated
for the offset.

So even when:

boost = util_avg - util_est

is small, despite util_avg being huge (~1024), due to large util_est,
we'll still get an effective boost to max_cost ASSUMING cs[].cost and
cost_margin have the same curve.

They have not.

assuming cs[].cost ~ f^3, and given our cost_margin ~ f, that leaves a
factor f^2 on the table.

So the higher the min_freq, the less effective the boost.

Maybe it all works out in practise, but I'm missing a big picture
description of it all somewhere.

2020-02-10 13:31:59

by Peter Zijlstra

Subject: Re: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware

On Thu, Jan 23, 2020 at 05:16:52PM +0000, Douglas Raillard wrote:
> Hi Rafael,
>
> On 1/23/20 3:43 PM, Rafael J. Wysocki wrote:
> > On Wed, Jan 22, 2020 at 6:36 PM Douglas RAILLARD
> > <[email protected]> wrote:
> >>
> >> Make schedutil cpufreq governor energy-aware.
> >
> > I have to say that your terminology is confusing to me, like what
> > exactly does "energy-aware" mean in the first place?
>
> Should be better rephrased as "Make schedutil cpufreq governor use the
> energy model" I guess. Schedutil is indeed already energy aware since it
> tries to use the lowest frequency possible for the job to be done (kind of).

So ARM64 will soon get x86-like power management if I read these here
patches right:

https://lkml.kernel.org/r/[email protected]

And I'm thinking a part of Rafael's concerns will also apply to those
platforms.

> Other than that, the only energy-related information schedutil uses is
> the assumption that lower freq == better efficiency. Explicit use of the
> EM allows to refine this assumption.

I'm thinking that such platforms guarantee this on their own, if not,
there just isn't anything we can do about it, so that assumption is
fair.

(I've always found it weird to have less efficient OPPs listed anyway)

> >> 1) Selecting the highest possible frequency for a given cost. Some
> >> platforms can have lower frequencies that are less efficient than
> >> higher ones, in which case they should be skipped for most purposes.
> >> They can still be useful to give more freedom to thermal throttling
> >> mechanisms, but not under normal circumstances.
> >> note: the EM framework will warn about such OPPs "hertz/watts ratio
> >> non-monotonically decreasing"
> >
> > While all of that is fair enough for platforms using the EM, do you
> > realize that the EM is not available on the majority of architectures
> > (including some fairly significant ones) and so adding overhead
> > related to it for all of them is quite less than welcome?
>
> When CONFIG_ENERGY_MODEL is not defined, em_pd_get_higher_freq() is
> defined to a static inline no-op function, so that feature won't incur
> overhead (patch 1+2+3).
>
> Patch 4 and 5 do add some new logic that could be used on any platform.
> Current code will use the boost as an energy margin, but it would be
> straightforward to make a util-based version (like iowait boost) on
> non-EM platforms.

Right, so the condition 'util_avg > util_est' makes sense to trigger
some sort of boost off of.

What kind would make sense for these platforms? One possibility would
be, instead of frobbing the energy margin as you do here, to frob the C
in get_next_freq().

(I have vague memories of this being proposed earlier; it also avoids
that double OPP iteration thing complained about elsewhere in this
thread if I'm not mistaken).


That is; I'm thinking it is important (esp. now that we got frequency
invariance sorted for x86), to have this patch also work for !EM
architectures (as those ARM64-AMU things would be).


2020-02-13 10:50:40

by Douglas RAILLARD

Subject: Re: [RFC PATCH v4 4/6] sched/cpufreq: Introduce sugov_cpu_ramp_boost



On 2/10/20 1:08 PM, Peter Zijlstra wrote:
> On Wed, Jan 22, 2020 at 05:35:36PM +0000, Douglas RAILLARD wrote:
>
>> +static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu)
>> +{
>> +	struct rq *rq = cpu_rq(sg_cpu->cpu);
>> +	unsigned long util_est_enqueued;
>> +	unsigned long util_avg;
>> +	unsigned long boost = 0;
>> +
>
> Should we NO-OP this function when !sched_feat(UTIL_EST) ?
>
>> +	util_est_enqueued = READ_ONCE(rq->cfs.avg.util_est.enqueued);

> Otherwise you're reading garbage here, no?

Most likely indeed. The boosting should be disabled in that case.

>
>> +	util_avg = READ_ONCE(rq->cfs.avg.util_avg);
>> +
>> +	/*
>> +	 * Boost when util_avg becomes higher than the previous stable
>> +	 * knowledge of the enqueued tasks' set util, which is CPU's
>> +	 * util_est_enqueued.
>> +	 *
>> +	 * We try to spot changes in the workload itself, so we want to
>> +	 * avoid the noise of tasks being enqueued/dequeued. To do that,
>> +	 * we only trigger boosting when the "amount of work" enqueued
>> +	 * is stable.
>> +	 */
>> +	if (util_est_enqueued == sg_cpu->util_est_enqueued &&
>> +	    util_avg >= sg_cpu->util_avg &&
>> +	    util_avg > util_est_enqueued)
>> +		boost = util_avg - util_est_enqueued;
>> +
>> +	sg_cpu->util_est_enqueued = util_est_enqueued;
>> +	sg_cpu->util_avg = util_avg;
>> +	WRITE_ONCE(sg_cpu->ramp_boost, boost);
>> +	return boost;
>> +}
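For what it's worth, here is a standalone userspace sketch of what the
guarded version could look like. This is a model for discussion, not the
actual kernel patch: `sched_feat_util_est` and `struct sg_cpu_model` are
stand-ins for sched_feat(UTIL_EST) and the relevant sugov_cpu fields.

```c
/* Minimal stand-ins for the kernel state used by the patch. */
struct sg_cpu_model {
	unsigned long util_est_enqueued;
	unsigned long util_avg;
	unsigned long ramp_boost;
};

static int sched_feat_util_est = 1;	/* models sched_feat(UTIL_EST) */

static unsigned long ramp_boost_update(struct sg_cpu_model *sg_cpu,
				       unsigned long util_est_enqueued,
				       unsigned long util_avg)
{
	unsigned long boost = 0;

	/*
	 * Proposed guard: without UTIL_EST, util_est.enqueued is never
	 * updated, so the comparisons below would be meaningless.
	 */
	if (!sched_feat_util_est)
		return 0;

	if (util_est_enqueued == sg_cpu->util_est_enqueued &&
	    util_avg >= sg_cpu->util_avg &&
	    util_avg > util_est_enqueued)
		boost = util_avg - util_est_enqueued;

	sg_cpu->util_est_enqueued = util_est_enqueued;
	sg_cpu->util_avg = util_avg;
	sg_cpu->ramp_boost = boost;
	return boost;
}
```

With the feature flag cleared, the function becomes a pure no-op and
the stale state is simply never consulted.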

2020-02-13 11:56:20

by Douglas RAILLARD

Subject: Re: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware

On 2/10/20 1:30 PM, Peter Zijlstra wrote:
> On Thu, Jan 23, 2020 at 05:16:52PM +0000, Douglas Raillard wrote:
>> Hi Rafael,
>>
>> On 1/23/20 3:43 PM, Rafael J. Wysocki wrote:
>>> On Wed, Jan 22, 2020 at 6:36 PM Douglas RAILLARD
>>> <[email protected]> wrote:
>>>>
>>>> Make schedutil cpufreq governor energy-aware.
>>>
>>> I have to say that your terminology is confusing to me, like what
>>> exactly does "energy-aware" mean in the first place?
>>
>> Should be better rephrased as "Make schedutil cpufreq governor use the
>> energy model" I guess. Schedutil is indeed already energy aware since it
>> tries to use the lowest frequency possible for the job to be done (kind of).
>
> So ARM64 will soon get x86-like power management if I read these here
> patches right:
>
> https://lkml.kernel.org/r/[email protected]
>
> And I'm thinking a part of Rafael's concerns will also apply to those
> platforms.

AFAIU there is an important difference: ARM64 firmware should not end up
increasing frequency on its own, it should only cap the frequency. That
means that the situation stays the same for that boost:

Let's say you let schedutil select a freq that is +2% more power
hungry. That will probably not be enough to make it jump to the next
OPP, so you end up not boosting. Now if there is a firmware that decides
for some reason to cap the frequency, it will be a similar situation.

>
>> Other than that, the only energy-related information schedutil uses is
>> the assumption that lower freq == better efficiency. Explicit use of the
>> EM allows to refine this assumption.
>
> I'm thinking that such platforms guarantee this on their own, if not,
> there just isn't anything we can do about it, so that assumption is
> fair.
>
> (I've always found it weird to have less efficient OPPs listed anyway)

Ultimately, (mostly) the piece of code involved in thermal capping needs
to know about these inefficient OPPs (be it the firmware or some kernel
subsystem). The rest of the world doesn't need to care.

>>>> 1) Selecting the highest possible frequency for a given cost. Some
>>>> platforms can have lower frequencies that are less efficient than
>>>> higher ones, in which case they should be skipped for most purposes.
>>>> They can still be useful to give more freedom to thermal throttling
>>>> mechanisms, but not under normal circumstances.
>>>> note: the EM framework will warn about such OPPs "hertz/watts ratio
>>>> non-monotonically decreasing"
>>>
>>> While all of that is fair enough for platforms using the EM, do you
>>> realize that the EM is not available on the majority of architectures
>>> (including some fairly significant ones) and so adding overhead
>>> related to it for all of them is quite less than welcome?
>>
>> When CONFIG_ENERGY_MODEL is not defined, em_pd_get_higher_freq() is
>> defined to a static inline no-op function, so that feature won't incur
>> overhead (patch 1+2+3).
>>
>> Patch 4 and 5 do add some new logic that could be used on any platform.
>> Current code will use the boost as an energy margin, but it would be
>> straightforward to make a util-based version (like iowait boost) on
>> non-EM platforms.
>
> Right, so the condition 'util_avg > util_est' makes sense to trigger
> some sort of boost off of.
>
> What kind would make sense for these platforms? One possibility would
> be, instead of frobbing the energy margin as you do here, to frob the C
> in get_next_freq().

If I'm correct, changing the C value would be somewhat similar to the
relative boosting I had in a previous version. Maybe adding a fixed
offset would give more predictable results as was discussed with Vincent
Guittot. In any case, it would change the perceived util (like iowait
boost).

> (I have vague memories of this being proposed earlier; it also avoids
> that double OPP iteration thing complained about elsewhere in this
> thread if I'm not mistaken).

It should be possible to get rid of the double iteration mentioned by
Quentin. Choosing to boost the util or the energy boils down to:

1) If you care more about predictable battery life (or energy bill) than
predictability of the boost feature, EM should be used.

2) If you don't have an EM or you care more about having a predictable
boost for a given workload, use util (or disable that boost).

The rationale is that with 1), you will get a different speed boost for
a given workload depending on the other things executing at the same
time, as the speed-up is not linear in the task-related metric (util -
util_est). If you are already at a high freq because of another
workload, the speed-up will be small, because the next 100MHz will cost
much more than the same +100MHz delta starting from a low OPP.

> That is; I'm thinking it is important (esp. now that we got frequency
> invariance sorted for x86), to have this patch also work for !EM
> architectures (as those ARM64-AMU things would be).

For sure, that feature is supposed to help in cases that would be
impossible to pinpoint with hardware, since it has to know what tasks
execute.

2020-02-13 13:20:56

by Peter Zijlstra

Subject: Re: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware

On Thu, Feb 13, 2020 at 11:55:32AM +0000, Douglas Raillard wrote:
> On 2/10/20 1:30 PM, Peter Zijlstra wrote:

> > So ARM64 will soon get x86-like power management if I read these here
> > patches right:
> >
> > https://lkml.kernel.org/r/[email protected]
> >
> > And I'm thinking a part of Rafael's concerns will also apply to those
> > platforms.
>
> AFAIU there is an important difference: ARM64 firmware should not end up
> increasing frequency on its own, it should only cap the frequency. That
> means that the situation stays the same for that boost:
>
> Let's say you let schedutil selecting a freq that is +2% more power
> hungry. That will probably not be enough to make it jump to the next
> OPP, so you end up not boosting. Now if there is a firmware that decides
> for some reasons to cap frequency, it will be a similar situation.

The moment you give out OPP selection to a 3rd party (be it firmware or
a micro-controller) things are uncertain at best anyway.

Still, in general, if you give it higher input, it tends to at least
consider going faster -- which might be all you can ask for...

So I'm not exactly seeing what your argument is here.

> > Right, so the condition 'util_avg > util_est' makes sense to trigger
> > some sort of boost off of.
> >
> > What kind would make sense for these platforms? One possibility would
> > be, instead of frobbing the energy margin as you do here, to frob the
> > C in get_next_freq().
>
> If I'm correct, changing the C value would be somewhat similar to the
> relative boosting I had in a previous version. Maybe adding a fixed
> offset would give more predictable results as was discussed with Vincent
> Guittot. In any case, it would change the perceived util (like iowait
> boost).

It depends a bit on what you change C into. If we do something trivial
like:
     { 1.25 ; !(util_avg > util_est)
C := {
     { 2    ;  (util_avg > util_est)

ie. a binary selection of constants, then yes, I suppose that is the
case.

But nothing stops us from making it more complicated; or having it
depend on the presence of EM data.
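As a rough standalone sketch of that idea (the 1.25 value matches the
current get_next_freq() margin; the boosted C = 2 is purely
illustrative, and the frequency-invariance handling of the real
function is omitted):

```c
/*
 * Model of get_next_freq() with a boost-dependent capacity margin C:
 * next_f = C * max_freq * util / max, where C switches between 1.25
 * and 2 depending on whether util_avg > util_est.
 */
static unsigned long next_freq_model(unsigned long util, unsigned long max,
				     unsigned long max_freq, int boosting)
{
	if (boosting)
		return 2 * max_freq * util / max;	/* C = 2 */

	/* C = 1.25, computed as util + util/4 like the kernel does */
	return max_freq * (util + (util >> 2)) / max;
}
```

For util = max/2 on a 1000 MHz part, that would give 625 MHz unboosted
and 1000 MHz boosted, with no EM lookup involved at all.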

> > (I have vague memories of this being proposed earlier; it also avoids
> > that double OPP iteration thing complained about elsewhere in this
> > thread if I'm not mistaken).
>
> It should be possible to get rid of the double iteration mentioned by
> Quentin. Choosing to boost the util or the energy boils down to:
>
> 1) If you care more about predictable battery life (or energy bill) than
> predictability of the boost feature, EM should be used.
>
> 2) If you don't have an EM or you care more about having a predictable
> boost for a given workload, use util (or disable that boost).
>
> The rational is that with 1), you will get a different speed boost for a
> given workload depending on the other things executing at the same time,
> as the speed up is not linear with the task-related metric (util -
> util_est). If you are already at high freq because of another workload,
> the speed up will be small because the next 100Mhz will cost much more
> than the same +100Mhz delta starting from a low OPP.

It's just that I'm not seeing how 1 actually works or provides that more
predictable battery life I suppose. We have this other sub-thread to
argue about that :-)

> > That is; I'm thinking it is important (esp. now that we got frequency
> > invariance sorted for x86), to have this patch also work for !EM
> > architectures (as those ARM64-AMU things would be).
>
> For sure, that feature is supposed to help in cases that would be
> impossible to pinpoint with hardware, since it has to know what tasks
> execute.

OK, so I'm thinking we're agreeing that it would be good to have this
support !EM systems too.

2020-02-13 17:51:10

by Douglas RAILLARD

Subject: Re: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware

On 2/10/20 1:21 PM, Peter Zijlstra wrote:
> On Wed, Jan 22, 2020 at 06:14:24PM +0000, Douglas Raillard wrote:
>> Hi Peter,
>>
>> Since the v3 was posted a while ago, here is a short recap of the hanging
>> comments:
>>
>> * The boost margin was relative, but we came to the conclusion it would make
>> more sense to make it absolute (done in that v4).
>
> As per (patch #1):
>
> +	max_cost = pd->table[pd->nr_cap_states - 1].cost;
> +	cost_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;
>
> So we'll allow the boost to double energy consumption (or rather, since
> you cannot go above the max OPP, we're allowed that).

Indeed. This might need some tweaking based on testing, maybe +50% is
enough, or maybe +200% is even better.


>> * The main remaining blur point was why defining boost=(util - util_est) makes
>> sense. The justification for that is that we use PELT-shaped signal to drive
>> the frequency, so using a PELT-shaped signal for the boost makes sense for the
>> same reasons.
>
> As per (patch #4):
>
> +	unsigned long boost = 0;
>
> +	if (util_est_enqueued == sg_cpu->util_est_enqueued &&
> +	    util_avg >= sg_cpu->util_avg &&
> +	    util_avg > util_est_enqueued)
> +		boost = util_avg - util_est_enqueued;
>
> The result of that is not, strictly speaking, a PELT shaped signal.
> Although when it is !0 the curves are similar, albeit offset.

Yes, it has the same rate of increase as PELT.

>
>> AFAIK there is no specific criteria to meet for frequency selection signal shape
>> for anything else than periodic tasks (if we don't add other constraints on
>> top), so (util - util_est)=(util - constant) seems as good as anything else.
>> Especially since util is deemed to be a good fit in practice for frequency
>> selection. Let me know if I missed anything on that front.
>
>
> Given:
>
> sugov_get_util() <- cpu_util_cfs() <- UTIL_EST ? util_est.enqueued : util_avg.

cpu_util_cfs uses max_t (maybe irrelevant for this discussion):
UTIL_EST ? max(util_est.enqueued, util_avg) : util_avg

> our next_f becomes:
>
> next_f = 1.25 * util_est * max_freq / max;

> so our min_freq in em_pd_get_higher_freq() will already be compensated
> for the offset.

Yes, the boost is added on top of the existing behavior.

> So even when:
>
> boost = util_avg - util_est
>
> is small, despite util_avg being huge (~1024), due to large util_est,
> we'll still get an effective boost to max_cost ASSUMING cs[].cost and
> cost_margin have the same curve.

I'm not sure to follow, cs[].cost can be plotted against cs[].freq, but
cost_margin is a time-based signal (the boost value), so it would be
plotted against time.

>
> They have not.
>
> assuming cs[].cost ~ f^3, and given our cost_margin ~ f, that leaves a
> factor f^2 on the table.

I'm guessing that you arrived to `cost_margin ~ f` this way:

cost_margin = util - util_est_enqueued
cost_margin = util - constant

# with constant small enough
cost_margin ~ util

# with util ~ 1/f
cost_margin ~ 1/f

In the case you describe, `constant` is actually almost equal to `util`
so `cost_margin !~ util`, and that series assumes frequency-invariant
util_avg so `util !~ 1/f` (I'll probably have to fix that).

> So the higher the min_freq, the less effective the boost.

Yes, since the boost is allowing a fixed amount of extra power. Higher
OPPs are less efficient than lower ones, so if min_freq is high, we
won't speed up as much as if min_freq was low.

> Maybe it all works out in practise, but I'm missing a big picture

Here is a big picture :)

https://gist.github.com/douglas-raillard-arm/f76586428836ec70c6db372993e0b731#file-ramp_boost-svg

The board is a Juno R0, with a periodic task pinned on a big CPU
(capa=1024):
* phase 1: 5% duty cycle (=51 PELT units)
* phase 2: 75% duty cycle (=768 PELT units)

Legend:
* blue square wave: when the task executes (like in kernelshark)
* base_cost = cost of frequency as selected by schedutil in normal
operations
* allowed_cost = base_cost + cost_margin
* util = util_avg

note: the small gaps right after the duty cycle transition between
t=4.15 and 4.25 are due to sugov task executing, so there is no dequeue
and no util_est update.

> description of it all somewhere.

Now a textual version of it:

em_pd_get_higher_freq() does the following:

# Turn the abstract cost margin on the EM_COST_MARGIN_SCALE into a
# concrete value. cost_margin=EM_COST_MARGIN_SCALE will give a concrete
# value of "max_cost", which is the highest OPP on that CPU.
concrete_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;

# Then it finds the lowest OPP satisfying min_freq:
min_opp = OPP_AT_FREQ(min_freq)

# It takes the cost associated, and finds the highest OPP that has a
# cost lower than that:
max_cost = COST_OF(min_opp) + concrete_margin

final_freq = MAX(
    FREQ_OF(opp)
    for opp in available_opps
    if COST_OF(opp) <= max_cost
)

So this means that:
util - util_est_enqueued ~= 0
=> cost_margin ~= 0
=> concrete_cost_margin ~= 0
=> max_cost = COST_OF(min_opp) + 0
=> final_freq = FREQ_OF(min_opp)

The effective boost is ~0, so you will get the current behaviour of
schedutil.
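The same lookup as a standalone C sketch, using a made-up OPP table
(costs roughly cubic in frequency; illustrative numbers, not the kernel
implementation):

```c
#include <stddef.h>

#define EM_COST_MARGIN_SCALE	1024UL

struct opp { unsigned long freq, cost; };

/* Hypothetical big-CPU OPP table, sorted by frequency. */
static const struct opp table[] = {
	{ 450,  50 }, { 625, 110 }, { 800, 210 }, { 950, 380 }, { 1100, 590 },
};
#define NR_OPPS	(sizeof(table) / sizeof(table[0]))

static unsigned long pd_get_higher_freq(unsigned long min_freq,
					unsigned long cost_margin)
{
	unsigned long max_cost = table[NR_OPPS - 1].cost;
	unsigned long margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;
	unsigned long allowed, freq;
	size_t i, lo;

	/* Lowest OPP satisfying min_freq (clamped to the max OPP). */
	for (lo = 0; lo < NR_OPPS - 1; lo++) {
		if (table[lo].freq >= min_freq)
			break;
	}

	allowed = table[lo].cost + margin;
	freq = table[lo].freq;

	/* Highest OPP whose cost still fits in the allowed budget. */
	for (i = lo + 1; i < NR_OPPS; i++) {
		if (table[i].cost <= allowed)
			freq = table[i].freq;
	}
	return freq;
}
```

With cost_margin=0 this returns min_freq's OPP unchanged; with
cost_margin=256 (a quarter of the scale) and min_freq=625, the allowed
cost becomes 110 + 147 = 257, which buys the 800 MHz OPP but not the
950 MHz one.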

If the task starts needing more cycles than during its previous period,
`util - util_est_enqueued` will grow like util since util_est_enqueued
is constant. The longer we wait, the higher the boost, until the task
goes to sleep again.

At next wakeup, util_est_enqueued has caught up and either:
1) util becomes stable, so no more boosting
2) util keeps increasing, so go for another round of boosting


Thanks,
Douglas

2020-02-14 12:22:58

by Peter Zijlstra

Subject: Re: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware

On Thu, Feb 13, 2020 at 05:49:48PM +0000, Douglas Raillard wrote:

> > So even when:
> >
> > boost = util_avg - util_est
> >
> > is small, despite util_avg being huge (~1024), due to large util_est,
> > we'll still get an effective boost to max_cost ASSUMING cs[].cost and
> > cost_margin have the same curve.
>
> I'm not sure to follow, cs[].cost can be plotted against cs[].freq, but
> cost_margin is a time-based signal (the boost value), so it would be
> plotted against time.

Suppose we have the normalized energy vs frequency curve: x^3

( P ~ V^2 * f, due to lack of better: V ~ f -> P ~ f^3 )

1 +--------------------------------------------------------------------+
| + + + + *|
| x**3 ******* |
| ** |
0.8 |-+ ** +-|
| ** |
| * |
| ** |
0.6 |-+ ** +-|
| ** |
| ** |
| *** |
0.4 |-+ *** +-|
| ** |
| *** |
| *** |
0.2 |-+ **** +-|
| **** |
| ****** |
| + ********** + + |
0 +--------------------------------------------------------------------+
0 0.2 0.4 0.6 0.8 1


where x is our normalized frequency and y is the normalized energy.

Further, remember that schedutil does (per construction; for lack of
better):

f ~ u

So at u=0.6, we're at f=0.6 and P=0.2

+	boost = util_avg - util_est_enqueued;

So for util_est = 0.6, we're limited to: boost = 0.4.

+	max_cost = pd->table[pd->nr_cap_states - 1].cost;
+	cost_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;

Which then gives:

cost_margin = boost = 0.4

And we find that:

P' = P + cost_margin = 0.2 + 0.4 = 0.6 < 1

So even though we set out to allow a 100% boost in energy usage, we
were in fact incapable of achieving it, because our cost_margin is
linear in u while the energy (or cost) curve is cubic in u.

That was my argument; but I think that now that I've expanded on it, I
see a flaw, because when we do have boost = 0.4, this means util_avg =
1, and we would've selected f = 1, and boosting would've been pointless.

So let me try again:

f = util_avg, P = f^3, boost = util_avg - util_est

P' = util_avg ^ 3 + util_avg - util_est

And I'm then failing to make further sense of that; it of course means
that P'(u) is larger than P(2u) for some u, but I don't think we set
that as a goal either.
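A quick numeric check of that last observation, in normalized units
(the values are picked purely for illustration):

```c
/* P = f^3 with f ~ u; the boosted cost is P'(u) = u^3 + (u - u_est). */
static double P(double u)
{
	return u * u * u;
}

static double P_boosted(double u, double u_est)
{
	return P(u) + (u - u_est);
}
```

At u = 0.2 with u_est = 0.05, P'(u) = 0.008 + 0.15 = 0.158, while
P(2u) = 0.064, so the boosted cost at u can indeed exceed the cost of
running at twice the frequency.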

Let me ponder this a little more while I go read the rest of your email.

2020-02-14 12:53:22

by Peter Zijlstra

Subject: Re: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware

On Thu, Feb 13, 2020 at 05:49:48PM +0000, Douglas Raillard wrote:
> On 2/10/20 1:21 PM, Peter Zijlstra wrote:

> > assuming cs[].cost ~ f^3, and given our cost_margin ~ f, that leaves a
> > factor f^2 on the table.
>
> I'm guessing that you arrived to `cost_margin ~ f` this way:
>
> cost_margin = util - util_est_enqueued
> cost_margin = util - constant
>
> # with constant small enough
> cost_margin ~ util
>
> # with util ~ 1/f
> cost_margin ~ 1/f
>
> In the case you describe, `constant` is actually almost equal to `util`
> so `cost_margin ~! util`, and that series assumes frequency invariant
> util_avg so `util !~ 1/f` (I'll probably have to fix that).

Nah, perhaps already clear from the other email; but it goes like:

boost = util_avg - util_est
cost_margin = boost * C = C * util_avg - C * util_est

And since u ~ f (per schedutil construction), cost_margin is a function
linear in either u or f.

> > So the higher the min_freq, the less effective the boost.
>
> Yes, since the boost is allowing a fixed amount of extra power. Higher
> OPPs are less efficient than lower ones, so if min_freq is high, we
> won't speed up as much as if min_freq was low.
>
> > Maybe it all works out in practise, but I'm missing a big picture
>
> Here is a big picture :)
>
> https://gist.github.com/douglas-raillard-arm/f76586428836ec70c6db372993e0b731#file-ramp_boost-svg
>
> The board is a Juno R0, with a periodic task pinned on a big CPU
> (capa=1024):
> * phase 1: 5% duty cycle (=51 PELT units)
> * phase 2: 75% duty cycle (=768 PELT units)
>
> Legend:
> * blue square wave: when the task executes (like in kernelshark)
> * base_cost = cost of frequency as selected by schedutil in normal
> operations
> * allowed_cost = base_cost + cost_margin
> * util = util_avg
>
> note: the small gaps right after the duty cycle transition between
> t=4.15 and 4.25 are due to sugov task executing, so there is no dequeue
> and no util_est update.

I'm confused by the giant drop in frequency (blue line) around 4.18

schedutil shouldn't select f < max(util_avg, util_est), which is
violated right about there.

I'm also confused by the base_cost line; how can that be flat until
somewhere around 4.16. Sadly there is no line for pure schedutil freq to
compare against.

Other than that, I can see the green line is consistent with
util_avg>util_est, and how it helps grow the frequency (blue).

2020-02-14 13:37:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware

On Thu, Feb 13, 2020 at 05:49:48PM +0000, Douglas Raillard wrote:

> > description of it all somewhere.
>
> Now a textual version of it:
>
> em_pd_get_higher_freq() does the following:
>
> # Turn the abstract cost margin on the EM_COST_MARGIN_SCALE into a
> # concrete value. cost_margin=EM_COST_MARGIN_SCALE will give a concrete
> # value of "max_cost", which is the highest OPP on that CPU.
> concrete_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;
>
> # Then it finds the lowest OPP satisfying min_freq:
> min_opp = OPP_AT_FREQ(min_freq)
>
> # It takes the cost associated, and finds the highest OPP that has a
> # cost lower than that:
> max_cost = COST_OF(min_opp) + concrete_margin
>
> final_freq = MAX(
> FREQ_OF(opp)
> for opp in available_opps
> if COST_OF(opp) <= max_cost
> )

Right; I got that.

> So this means that:
> util - util_est_enqueued ~= 0

Only if you assume the task will get scheduled out reasonably frequently.

> => cost_margin ~= 0
> => concrete_cost_margin ~= 0
> => max_cost = COST_OF(min_opp) + 0
> => final_freq = FREQ_OF(min_opp)
>
> The effective boost is ~0, so you will get the current behaviour of
> schedutil.

But the argument holds; because if things don't get scheduled out, we'll
peg u = 1 and hit f = 1 and all is well anyway.

Which is a useful property; it shows that in the steady state, this
patch-set is a NOP, but the above argument only relies on 'util_avg >
util_est' being used as a trigger.

> If the task starts needing more cycles than during its previous period,
> `util - util_est_enqueued` will grow like util since util_est_enqueued
> is constant. The longer we wait, the higher the boost, until the task
> goes to sleep again.
>
> At next wakeup, util_est_enqueued has caught up and either:
> 1) util becomes stable, so no more boosting
> 2) util keeps increasing, so go for another round of boosting

Agreed; however elsewhere you wrote:

> 1) If you care more about predictable battery life (or energy bill) than
> predictability of the boost feature, EM should be used.
>
> 2) If you don't have an EM or you care more about having a predictable
> boost for a given workload, use util (or disable that boost).

This is the part I'm still not sure about; how do the specifics of the
cost_margin setup lead to 1), or how would some frobbing with frequency
selection destroy that property.

2020-02-27 15:52:26

by Douglas RAILLARD

[permalink] [raw]
Subject: Re: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware

Hi Peter,

On 2/13/20 1:20 PM, Peter Zijlstra wrote:
> On Thu, Feb 13, 2020 at 11:55:32AM +0000, Douglas Raillard wrote:
>> On 2/10/20 1:30 PM, Peter Zijlstra wrote:
>
>>> So ARM64 will soon get x86-like power management if I read these here
>>> patches right:
>>>
>>> https://lkml.kernel.org/r/[email protected]
>>>
>>> And I'm thinking a part of Rafael's concerns will also apply to those
>>> platforms.
>>
>> AFAIU there is an important difference: ARM64 firmware should not end up
>> increasing frequency on its own, it should only cap the frequency. That
>> means that the situation stays the same for that boost:
>>
>> Let's say you let schedutil selecting a freq that is +2% more power
>> hungry. That will probably not be enough to make it jump to the next
>> OPP, so you end up not boosting. Now if there is a firmware that decides
>> for some reasons to cap frequency, it will be a similar situation.
>
> The moment you give out OPP selection to a 3rd party (be it firmware or
> a micro-controller) things are uncertain at best anyway.
>
> Still, in general, if you give it higher input, it tends to at least
> consider going faster -- which might be all you can ask for...
>
> So I'm not exactly seeing what your argument is here.

My point is that a +2% boost will give *up to* +2% increase in energy
use. With or without a fancy firmware, having cost_margin > 0 does not
mean you will always actually get a speedup.

>>> Right, so the condition 'util_avg > util_est' makes sense to trigger
>>> some sort of boost off of.
>>>
>>> What kind would make sense for these platforms? One possibility would be
>>> to instead of frobbing the energy margin, as you do here, to frob the C
>>> in get_next_freq().
>>
>> If I'm correct, changing the C value would be somewhat similar to the
>> relative boosting I had in a previous version. Maybe adding a fixed
>> offset would give more predictable results as was discussed with Vincent
>> Guittot. In any case, it would change the perceived util (like iowait
>> boost).
>
> It depends a bit on what you change C into. If we do something trivial
> like:
> 1.25 ; !(util_avg > util_est)
> C := {
> 2 ; (util_avg > util_est)
>
> ie. a binary selection of constants, then yes, I suppose that is the
> case.
>
> But nothing stops us from making it more complicated; or having it
> depend on the presence of EM data.

The series currently fiddles with energy cost directly, but it's
possible to have the exact same effect by fiddling with util if we have
the function `(base_util, cost_margin) -> boosted_util`. It's just that it
forces us to:
1. map util to freq
2. find a higher freq for the given cost_margin
3. Map freq to util
4. Re-inject the new util, which will eventually get remapped to a freq

While it's easier to just do it directly:
1. map util to freq
2. find higher_freq for the given cost_margin
3. Use the increased freq

>
>>> (I have vague memories of this being proposed earlier; it also avoids
>>> that double OPP iteration thing complained about elsewhere in this
>>> thread if I'm not mistaken).
>>
>> It should be possible to get rid of the double iteration mentioned by
>> Quentin. Choosing to boost the util or the energy boils down to:
>>
>> 1) If you care more about predictable battery life (or energy bill) than
>> predictability of the boost feature, EM should be used.
>>
>> 2) If you don't have an EM or you care more about having a predictable
>> boost for a given workload, use util (or disable that boost).
>>
>> The rationale is that with 1), you will get a different speed boost for a
>> given workload depending on the other things executing at the same time,
>> as the speedup is not linear with the task-related metric (util -
>> util_est). If you are already at a high freq because of another workload,
>> the speedup will be small because the next 100MHz will cost much more
>> than the same +100MHz delta starting from a low OPP.
>
> It's just that I'm not seeing how 1 actually works or provides that more
> predictable battery life I suppose. We have this other sub-thread to
> argue about that :-)

Ok, I've posted the answer there, so this thread can focus on
boost=(util - util_est_enqueued) logic, and the other one on how to make
actual use of the boost value.

>>> That is; I'm thinking it is important (esp. now that we got frequency
>>> invariance sorted for x86), to have this patch also work for !EM
>>> architectures (as those ARM64-AMU things would be).
>>
>> For sure, that feature is supposed to help in cases that would be
>> impossible to pinpoint with hardware, since it has to know what tasks
>> execute.
>
> OK, so I'm thinking we're agreeing that it would be good to have this
> support !EM systems too.
>

Thanks,
Douglas

2020-02-27 15:53:22

by Douglas RAILLARD

[permalink] [raw]
Subject: Re: [RFC PATCH v4 3/6] sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()

Let this thread be about util boosting vs energy boosting.

Short recap of the conversation:

Peter:
>>> (I have vague memories of this (the util boosting) being proposed
earlier; it also avoids
>>> that double OPP iteration thing complained about elsewhere in this
>>> thread if I'm not mistaken).
>>

Douglas:
>> It should be possible to get rid of the double iteration mentioned by
>> Quentin. Choosing to boost the util or the energy boils down to:
>>
>> 1) If you care more about predictable battery life (or energy bill) than
>> predictability of the boost feature, EM should be used.
>>
>> 2) If you don't have an EM or you care more about having a predictable
>> boost for a given workload, use util (or disable that boost).
>>
>> The rationale is that with 1), you will get a different speed boost for a
>> given workload depending on the other things executing at the same time,
>> as the speedup is not linear with the task-related metric (util -
>> util_est). If you are already at a high freq because of another workload,
>> the speedup will be small because the next 100MHz will cost much more
>> than the same +100MHz delta starting from a low OPP.

Peter:
>
> It's just that I'm not seeing how 1 actually works or provides that more
> predictable battery life I suppose. We have this other sub-thread to
> argue about that :-)

Here is a more detailed version of util boost vs energy boost.
The goal of what follows is to show that the util-based boost gives a
predictable performance speedup, while the energy-based boost gives a
predictable increase in energy consumption.

In both cases, the boost value is computed as (util - ue.enqueued),
and only its interpretation differs:
* as an increase in util for util-based boost
* as an OPP cost margin for the energy-based boost



util(wload) = util_avg(wload)

util_est_enqueued(wload)
    | wload is enqueued = f(util_avg(wload))
    | otherwise         = 0

(wloadA + wloadB) denotes a "virtual" task that is running whenever either
wloadA or wloadB is running.

# Assuming wloadA and wloadB are attached to the same rq:
util(wloadA + wloadB) = util(wloadA) + util(wloadB)

# Assuming wloadA and wloadB do not preempt each other:
util_est_enqueued(wloadA + wloadB) =
util_est_enqueued(wloadA) + util_est_enqueued(wloadB)

# boostU(wload) is the increase of util due to the util-based boost.
# boostE(wload) is the increase of *util* due to the energy-based boost.

boostU(wload)
    | wload enqueued and util(wload) > util_est_enqueued(wload) =
          util(wload) - util_est_enqueued(wload)
    | otherwise = 0

boostU(wloadA + wloadB) = util(wloadA + wloadB) -
util_est_enqueued(wloadA + wloadB)
boostU(wloadA + wloadB) = util(wloadA) + util(wloadB) -
util_est_enqueued(wloadA) - util_est_enqueued(wloadB)
boostU(wloadA + wloadB) = boostU(wloadA) + boostU(wloadB)

# If we now interpret the same boost value as a freq energy cost margin:
boostE(wload)
    | wload enqueued and util(wload) > util_est_enqueued(wload) =
          apply_boostE(util(wload), util_est_enqueued(wload),
                       map_util_freq(util(wload)))
    | otherwise = 0

# with:
apply_boostE(util, util_est_enqueued, base_freq) =
    boost = util - util_est_enqueued
    boosted_freq = freq_at(boost + cost_of(base_freq))
    freq_delta = boosted_freq - base_freq
    speedup = 1 + freq_delta / max_freq
    return util * speedup

Since the freq_at(cost) function is not linear in cost
and util(wloadA + wloadB) = util(wloadA) + util(wloadB),
apply_boostE() is not a linear function of wload, which means:
boostE(wloadA + wloadB) != boostE(wloadA) + boostE(wloadB)

This means the speedup (util increase) given by boostE cannot be evaluated for
one task without considering all other attached tasks on the same rq. Therefore
the speedup introduced by boostE is not easily predictable.
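To make that non-linearity concrete, here is a toy Python model of the two
interpretations. The OPP table and all helper names are made up for
illustration, and the cost-margin scaling used in the real series is omitted:

```python
MAX_FREQ = 1100
# Made-up OPP table: (freq, cost), with higher OPPs being less efficient
OPPS = [(450, 67.0), (625, 72.2), (800, 80.9), (950, 90.2), (1100, 100.0)]

def map_util_freq(util):
    # schedutil-style mapping: lowest OPP above 1.25 * max_freq * util / 1024
    target = 1.25 * MAX_FREQ * util / 1024
    return next((f for f, _ in OPPS if f >= target), OPPS[-1][0])

def cost_of(freq):
    return dict(OPPS)[freq]

def freq_at(max_cost):
    # highest OPP whose cost fits under max_cost
    return max(f for f, c in OPPS if c <= max_cost)

def boostU(util, util_est):
    # boost interpreted as a util increase
    return max(util - util_est, 0)

def boostE(util, util_est):
    # boost interpreted as an OPP cost margin; returns the resulting
    # increase of util
    boost = max(util - util_est, 0)
    base_freq = map_util_freq(util)
    boosted_freq = freq_at(boost + cost_of(base_freq))
    speedup = 1 + (boosted_freq - base_freq) / MAX_FREQ
    return util * speedup - util

# Two tasks attached to the same rq
uA, ueA = 300, 250
uB, ueB = 400, 320

# boostU composes linearly across tasks ...
assert boostU(uA + uB, ueA + ueB) == boostU(uA, ueA) + boostU(uB, ueB)

# ... while boostE does not, because freq_at(cost) is not linear in cost
combined = boostE(uA + uB, ueA + ueB)
separate = boostE(uA, ueA) + boostE(uB, ueB)
assert combined != separate
```

With these numbers, the combined workload gets a much smaller util increase
than the sum of the per-task ones, which is exactly the unpredictability
discussed above.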

On the other hand, the increase in energy cost is linear in the workload for
the energy-based boost, but not for the util-based boost.

That made me realize that to achieve this, EM_COST_MARGIN_SCALE needs to
map to the highest cost in the system, not the highest cost of the
current CPU; I will fix that.

Also, util_est_enqueued is not linear in the general case when wloadA and
wloadB preempt each other. There could be ways of making that work, but it's
probably a better idea to move the logic to the task level and aggregate the
flag at runqueue level. This will also allow the boost to work when the set of
enqueued tasks varies, which is likely to happen in a real system.

2020-03-11 12:25:46

by Douglas RAILLARD

[permalink] [raw]
Subject: Re: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware

Hi Peter,

The preemption stack unwound back to schedutil :)

On 2/14/20 12:52 PM, Peter Zijlstra wrote:
> On Thu, Feb 13, 2020 at 05:49:48PM +0000, Douglas Raillard wrote:
>> On 2/10/20 1:21 PM, Peter Zijlstra wrote:
>
>>> assuming cs[].cost ~ f^3, and given our cost_margin ~ f, that leaves a
>>> factor f^2 on the table.
>>
>> I'm guessing that you arrived at `cost_margin ~ f` this way:
>>
>> cost_margin = util - util_est_enqueued
>> cost_margin = util - constant
>>
>> # with constant small enough
>> cost_margin ~ util
>>
>> # with util ~ 1/f
>> cost_margin ~ 1/f
>>
>> In the case you describe, `constant` is actually almost equal to `util`
>> so `cost_margin !~ util`, and that series assumes frequency invariant
>> util_avg so `util !~ 1/f` (I'll probably have to fix that).
>
> Nah, perhaps already clear from the other email; but it goes like:
>
> boost = util_avg - util_est
> cost_margin = boost * C = C * util_avg - C * util_est
>
> And since u ~ f (per schedutil construction), cost_margin is a function
> linear in either u or f.

cost_margin(u) is not linear in `u` or `f`, as that does not hold:

cost_margin(A*u) = A*u - CONSTANT
                != A*u - A*CONSTANT
                 = A * (u - CONSTANT)
                 = A * cost_margin(u)

This would only approximately work if CONSTANT was much smaller than u,
which it isn't.

The tricky part is that CONSTANT is actually CONSTANT(u), but the
relation between u and CONSTANT is time-dependent.

That said, the problem can be simplified if we split it in two phases:

* when boost != 0 for a given task, in which case CONSTANT(u) is a true
constant by construction, i.e. cost_margin(u) !~ u.

* when boost = 0 for a given task: all that matters is that only one
task is boosted at a time, so that we can easily relate task composition
and boost composition as done in:
https://lore.kernel.org/lkml/[email protected]/
note: as mentioned in that email, this reasoning relies on
util_est_enqueued(wload) at rq level being linear in wload, which does
not hold with preemption. That would be fixed by working on task signals
rather than rq aggregated ones.

>
>>> So the higher the min_freq, the less effective the boost.

Ultimately I can't remember exactly what the cost_margin(u) linearity
was intended to be used for, but the relative impact of boosting does
become smaller if u is already high.

This is by design, as the dynamic of that boosting is purposely mostly
tied to local PELT variations, by removing a fixed offset from it. I
don't think big tasks specifically deserve more boosting, as that would
imply the uncertainty on their new period is higher. We would need a
real-world trace to evaluate how duty cycle changes relate to the
initial duty cycle. That should not be particularly difficult to do,
but the trace parsing infrastructure in LISA currently does not handle
large traces very nicely, so this will probably have to wait until I get
around to fixing that by using libtraceevent from the trace-cmd project.

This means boosting currently depends on two things:
1) the PELT parameters (half life and window size)
2) the former task period

1) is fixed by definition in the kernel.
2) will make boosting more aggressive for short periods (i.e. the boost
will rise faster). The initial rate of increase of boosting once it
triggers should be related to the former task period in (approximately)
the following way:

| period (s) | boost rate of increase (PELT unit/ms) |
|-------------:|----------------------------------------:|
| 0 | 21.3839 |
| 0.016 | 15.3101 |
| 0.032 | 10.9615 |
| 0.048 | 7.84807 |
| 0.064 | 5.61895 |
| 0.08 | 4.02297 |
| 0.096 | 2.8803 |
| 0.112 | 2.0622 |
| 0.128 | 1.47646 |


This assumes that while the task is running:

    util_avg(t) = util_avg(t_enqueue) + 1024 * (1 - e^(-(t - t_enqueue)/tau))

with tau:

    # tau as defined by:
    # https://en.wikipedia.org/wiki/Low-pass_filter#Simple_infinite_impulse_response_filter
    # alpha as defined in https://en.wikipedia.org/wiki/Moving_average
    decay = (1 / 2)^(1 / half_life)
    alpha = 1 - decay
    window = 1024 * 1024 * 1e-9
    tau = window * ((1 - alpha) / alpha)
    tau = 0.047886417385345964
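As a cross-check, tau and the rate table above can be reproduced with the
snippet below; the rate expression (the derivative of the util_avg(t)
expression, evaluated one former-period after enqueue) is my reading of the
numbers rather than kernel code:

```python
import math

HALF_LIFE = 32                      # PELT half life, in ~1ms windows
decay = 0.5 ** (1 / HALF_LIFE)
alpha = 1 - decay
window = 1024 * 1024 * 1e-9         # PELT window (~1ms), in seconds
tau = window * (1 - alpha) / alpha  # low-pass filter time constant
print(tau)                          # ~0.0478864

def boost_rate(period):
    # Initial boost rate of increase in PELT units/ms: derivative of
    # util_avg(t) = util_avg(t_enqueue) + 1024 * (1 - e^(-(t - t_enqueue)/tau))
    # evaluated at (t - t_enqueue) = period
    return (1024 / tau) * math.exp(-period / tau) / 1000

for period in (0, 0.016, 0.032, 0.048, 0.064, 0.08, 0.096, 0.112, 0.128):
    print(f"{period:6.3f} {boost_rate(period):8.4f}")
```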
>>
>> Yes, since the boost is allowing a fixed amount of extra power. Higher
>> OPPs are less efficient than lower ones, so if min_freq is high, we
>> won't speed up as much as if min_freq was low.
>>
>>> Maybe it all works out in practise, but I'm missing a big picture
>>
>> Here is a big picture :)
>>
>> https://gist.github.com/douglas-raillard-arm/f76586428836ec70c6db372993e0b731#file-ramp_boost-svg
>>
>> The board is a Juno R0, with a periodic task pinned on a big CPU
>> (capa=1024):
>> * phase 1: 5% duty cycle (=51 PELT units)
>> * phase 2: 75% duty cycle (=768 PELT units)
>>
>> Legend:
>> * blue square wave: when the task executes (like in kernelshark)
>> * base_cost = cost of frequency as selected by schedutil in normal
>> operations
>> * allowed_cost = base_cost + cost_margin
>> * util = util_avg
>>
>> note: the small gaps right after the duty cycle transition between
>> t=4.15 and 4.25 are due to sugov task executing, so there is no dequeue
>> and no util_est update.
>
> I'm confused by the giant drop in frequency (blue line) around 4.18
>
> schedutil shouldn't select f < max(util_avg, util_est), which is
> violated right about there.

AFAICT that's due to the boost being disabled, which makes the allowed
cost smaller. The boost is disabled because the task's
util_est_enqueued has been updated since the last schedutil invocation,
so the rq-level value changes, which in turn disables boosting.

Here are the normalized costs on the CPU the workload was run on:

| freq | cost | capa |
|--------:|---------:|---------:|
| 450000 | 67.086 | 418.909 |
| 625000 | 72.1509 | 581.818 |
| 800000 | 80.8962 | 744.727 |
| 950000 | 90.1688 | 884.364 |
| 1100000 | 100 | 1024 |

In our case, it seems that schedutil is invoked when the task is
preempted (presumably by the schedutil kthread) and we end up with:
util_avg ~= 370
util_est_enqueued ~= 427.
We therefore end up at freq=625000 (capa=581).
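For reference, here is how the em_pd_get_higher_freq() logic described
earlier in the thread plays out on this cost table (a Python sketch, not the
kernel implementation; helper names are illustrative):

```python
EM_COST_MARGIN_SCALE = 1024

# Juno R0 big CPU OPPs: (freq_hz, normalized_cost)
OPPS = [
    (450000, 67.086),
    (625000, 72.1509),
    (800000, 80.8962),
    (950000, 90.1688),
    (1100000, 100.0),
]

def em_pd_get_higher_freq(min_freq, cost_margin):
    """Highest frequency whose cost stays within cost_margin
    (on EM_COST_MARGIN_SCALE) above the cost of min_freq."""
    max_cost = OPPS[-1][1]
    concrete_margin = cost_margin * max_cost / EM_COST_MARGIN_SCALE
    # cost of the lowest OPP satisfying min_freq
    min_opp_cost = min(c for f, c in OPPS if f >= min_freq)
    allowed_cost = min_opp_cost + concrete_margin
    return max(f for f, c in OPPS if c <= allowed_cost)

# No boost: plain schedutil choice
assert em_pd_get_higher_freq(450000, 0) == 450000
# A ~6% cost margin is enough to jump from 450MHz to 625MHz ...
assert em_pd_get_higher_freq(450000, 62) == 625000
# ... but the same margin starting from 950MHz buys nothing, since the
# next OPP up costs much more
assert em_pd_get_higher_freq(950000, 62) == 950000
```

This also illustrates the earlier point that the higher min_freq is, the less
effective a given cost margin becomes.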

> I'm also confused by the base_cost line; how can that be flat until
> somewhere around 4.16.

The minimum frequency on the board being used provides a capacity of
418, so it stays flat until max(util_avg, util_est) goes above that.

> Sadly there is no line for pure schedutil freq to
> compare against.

I could add that one fairly easily. Would you be interested in the
"real" frequency selected by normal schedutil, or the "ideal" frequency
schedutil would like to apply?

> Other than that, I can see the green line is consistent with
> util_avg>util_est, and how it helps grow the frequency (blue).
>

2020-03-11 12:42:10

by Douglas RAILLARD

[permalink] [raw]
Subject: Re: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware



On 2/14/20 1:37 PM, Peter Zijlstra wrote:
> On Thu, Feb 13, 2020 at 05:49:48PM +0000, Douglas Raillard wrote:
>
>>> description of it all somewhere.
>>
>> Now a textual version of it:
>>
>> em_pd_get_higher_freq() does the following:
>>
>> # Turn the abstract cost margin on the EM_COST_MARGIN_SCALE into a
>> # concrete value. cost_margin=EM_COST_MARGIN_SCALE will give a concrete
>> # value of "max_cost", which is the highest OPP on that CPU.
>> concrete_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;
>>
>> # Then it finds the lowest OPP satisfying min_freq:
>> min_opp = OPP_AT_FREQ(min_freq)
>>
>> # It takes the cost associated, and finds the highest OPP that has a
>> # cost lower than that:
>> max_cost = COST_OF(min_opp) + concrete_margin
>>
>> final_freq = MAX(
>> FREQ_OF(opp)
>> for opp in available_opps
>> if COST_OF(opp) <= max_cost
>> )
>
> Right; I got that.
>
>> So this means that:
>> util - util_est_enqueued ~= 0
>
> Only if you assume the task will get scheduled out reasonably frequently.
>
>> => cost_margin ~= 0
>> => concrete_cost_margin ~= 0
>> => max_cost = COST_OF(min_opp) + 0
>> => final_freq = FREQ_OF(min_opp)
>>
>> The effective boost is ~0, so you will get the current behaviour of
>> schedutil.
>
> But the argument holds; because if things don't get scheduled out, we'll
> peg u = 1 and hit f = 1 and all is well anyway.
>
> Which is a useful property; it shows that in the steady state, this
> patch-set is a NOP, but the above argument only relies on 'util_avg >
> util_est' being used as a trigger.

Yes, `util_avg > util_est` can only happen when the task's duty cycle is
changing, which does not happen at steady state.

Either it's periodic and the boost is legitimate, or it's not periodic
and, for the purpose of boosting, we assume it's a periodic task well
represented by its last activation and sleep.

Tasks with a high variability in their activation durations (i.e. not
periodic at all) will likely get more boosting on average, which is
probably good: since we can't predict much about them, when in doubt we
tilt the behaviour of schedutil toward racing to completion.

>> If the task starts needing more cycles than during its previous period,
>> `util - util_est_enqueued` will grow like util since util_est_enqueued
>> is constant. The longer we wait, the higher the boost, until the task
>> goes to sleep again.
>>
>> At next wakeup, util_est_enqueued has caught up and either:
>> 1) util becomes stable, so no more boosting
>> 2) util keeps increasing, so go for another round of boosting
>
> Agreed; however elsewhere you wrote:
>
>> 1) If you care more about predictable battery life (or energy bill) than
>> predictability of the boost feature, EM should be used.
>>
>> 2) If you don't have an EM or you care more about having a predictable
>> boost for a given workload, use util (or disable that boost).
>
> This is the part I'm still not sure about; how do the specifics of the
> cost_margin setup lead to 1), or how would some frobbing with frequency
> selection destroy that property.

This should be answered by this other thread:
https://lore.kernel.org/lkml/[email protected]/#t