2019-10-11 13:46:07

by Douglas RAILLARD

[permalink] [raw]
Subject: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware

Make schedutil cpufreq governor energy-aware.

- patch 1 introduces a function to retrieve a frequency given a base
frequency and an energy cost margin.
- patch 2 links Energy Model perf_domain to sugov_policy.
- patch 3 updates get_next_freq() to make use of the Energy Model.
- patch 4 adds sugov_cpu_ramp_boost() function.
- patch 5 updates sugov_update_(single|shared)() to make use of
sugov_cpu_ramp_boost().
- patch 6 introduces a tracepoint in get_next_freq() for
testing/debugging. Since it's not a trace event, it's not exposed to
userspace in a directly usable way, allowing for painless future
updates/removal.

The benefits of using the EM in schedutil are twofold:

1) Selecting the highest possible frequency for a given cost. Some
platforms can have lower frequencies that are less efficient than
higher ones, in which case they should be skipped for most purposes.
They can still be useful to give more freedom to thermal throttling
mechanisms, but not under normal circumstances.
note: the EM framework will warn about such OPPs "hertz/watts ratio
non-monotonically decreasing"

2) Driving the frequency selection with power in mind, in addition to
maximizing the utilization of the non-idle CPUs in the system.

Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and
enabled in schedutil by
"sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()".

Point 2) is enabled in
"sched/cpufreq: Boost schedutil frequency ramp up". It allows using
higher frequencies when it is known that the true utilization of
currently running tasks is exceeding their previous stable point.
The benefits are:

* Boosting the frequency when the behavior of a runnable task changes,
leading to an increase in utilization. That shortens the frequency
ramp up duration, which in turns allows the utilization signal to
reach stable values quicker. Since the allowed frequency boost is
bounded in energy, it will behave consistently across platforms,
regardless of the OPP cost range.

* The boost is only transient, and should not impact a lot the energy
consumed of workloads with very stable utilization signals.

This has been ligthly tested with a rtapp task ramping from 10% to 75%
utilisation on a big core. Results are improved by fast ramp-up
EWMA [1], since it greatly reduces the oscillation in frequency at first
idle when ramping up.

[1] [PATCH] sched/fair: util_est: fast ramp-up EWMA on utilization increases
Message-ID: <[email protected]>
https://lore.kernel.org/lkml/[email protected]/


v1 -> v2:

* Split the new sugov_cpu_ramp_boost() from the existing
sugov_cpu_is_busy() as they seem to seek a different goal.

* Implement sugov_cpu_ramp_boost() based on CFS util_avg and
util_est_enqueued signals, rather than using idle calls count.
This makes the ramp boost much more accurate in finding boost
opportunities, and give a "continuous" output rather than a boolean.

* Add EM_COST_MARGIN_SCALE=1024 to represent the
margin values of em_pd_get_higher_freq().

v2 -> v3:

* Check util_avg >= sg_cpu->util_avg in sugov_cpu_ramp_boost_update()
to avoid boosting when the utilization is decreasing.

* Add a tracepoint for testing.

Douglas RAILLARD (6):
PM: Introduce em_pd_get_higher_freq()
sched/cpufreq: Attach perf domain to sugov policy
sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()
sched/cpufreq: Introduce sugov_cpu_ramp_boost
sched/cpufreq: Boost schedutil frequency ramp up
sched/cpufreq: Add schedutil_em_tp tracepoint

include/linux/energy_model.h | 53 ++++++++++++++
include/trace/events/power.h | 9 +++
kernel/sched/cpufreq_schedutil.c | 122 +++++++++++++++++++++++++++++--
3 files changed, 178 insertions(+), 6 deletions(-)

--
2.23.0



2019-10-11 13:46:18

by Douglas RAILLARD

[permalink] [raw]
Subject: [RFC PATCH v3 1/6] PM: Introduce em_pd_get_higher_freq()

em_pd_get_higher_freq() returns a frequency greater or equal to the
provided one while taking into account a given cost margin. It also
skips inefficient OPPs that have a higher cost than another one with a
higher frequency.

The efficiency of an OPP is measured as efficiency=capacity/power.
OPPs with the same efficiency are assumed to be equivalent, since they
will consume as much energy for a given amount of work to do. That may
take more or less time depending on the frequency, but will consume the
same energy.

Signed-off-by: Douglas RAILLARD <[email protected]>
---
include/linux/energy_model.h | 53 ++++++++++++++++++++++++++++++++++++
1 file changed, 53 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index d249b88a4d5a..dd6a35f099ea 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -159,6 +159,53 @@ static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
return pd->nr_cap_states;
}

+#define EM_COST_MARGIN_SCALE 1024U
+
+/**
+ * em_pd_get_higher_freq() - Get the highest frequency that does not exceed the
+ * given cost margin compared to min_freq
+ * @pd : performance domain for which this must be done
+ * @min_freq : minimum frequency to return
+ * @cost_margin : allowed margin compared to min_freq, on the
+ * EM_COST_MARGIN_SCALE scale.
+ *
+ * Return: the chosen frequency, guaranteed to be at least as high as min_freq.
+ */
+static inline unsigned long em_pd_get_higher_freq(struct em_perf_domain *pd,
+ unsigned long min_freq, unsigned long cost_margin)
+{
+ unsigned long max_cost = 0;
+ struct em_cap_state *cs;
+ int i;
+
+ if (!pd)
+ return min_freq;
+
+ /* Compute the maximum allowed cost */
+ for (i = 0; i < pd->nr_cap_states; i++) {
+ cs = &pd->table[i];
+ if (cs->frequency >= min_freq) {
+ max_cost = cs->cost +
+ (cs->cost * cost_margin) / EM_COST_MARGIN_SCALE;
+ break;
+ }
+ }
+
+ /* Find the highest frequency that will not exceed the cost margin */
+ for (i = pd->nr_cap_states-1; i >= 0; i--) {
+ cs = &pd->table[i];
+ if (cs->cost <= max_cost)
+ return cs->frequency;
+ }
+
+ /*
+ * We should normally never reach here, unless min_freq was higher than
+ * the highest available frequency, which is not expected to happen.
+ */
+ return min_freq;
+}
+
+
#else
struct em_data_callback {};
#define EM_DATA_CB(_active_power_cb) { }
@@ -181,6 +228,12 @@ static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
{
return 0;
}
+
+static inline unsigned long em_pd_get_higher_freq(struct em_perf_domain *pd,
+ unsigned long min_freq, unsigned long cost_margin)
+{
+ return min_freq;
+}
#endif

#endif
--
2.23.0

2019-10-11 13:46:27

by Douglas RAILLARD

[permalink] [raw]
Subject: [RFC PATCH v3 2/6] sched/cpufreq: Attach perf domain to sugov policy

Attach an Energy Model perf_domain to each sugov_policy to prepare the
ground for energy-aware schedutil.

Signed-off-by: Douglas RAILLARD <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 39 ++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index ba9e8309eec7..9abda58827c0 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -42,6 +42,10 @@ struct sugov_policy {

bool limits_changed;
bool need_freq_update;
+
+#ifdef CONFIG_ENERGY_MODEL
+ struct em_perf_domain *pd;
+#endif
};

struct sugov_cpu {
@@ -66,6 +70,38 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);

/************************ Governor internals ***********************/

+#ifdef CONFIG_ENERGY_MODEL
+static void sugov_policy_attach_pd(struct sugov_policy *sg_policy)
+{
+ struct em_perf_domain *pd;
+ struct cpufreq_policy *policy = sg_policy->policy;
+
+ sg_policy->pd = NULL;
+ pd = em_cpu_get(policy->cpu);
+ if (!pd)
+ return;
+
+ if (cpumask_equal(policy->related_cpus, to_cpumask(pd->cpus)))
+ sg_policy->pd = pd;
+ else
+ pr_warn("%s: Not all CPUs in schedutil policy %u share the same perf domain, no perf domain for that policy will be registered\n",
+ __func__, policy->cpu);
+}
+
+static struct em_perf_domain *sugov_policy_get_pd(
+ struct sugov_policy *sg_policy)
+{
+ return sg_policy->pd;
+}
+#else /* CONFIG_ENERGY_MODEL */
+static void sugov_policy_attach_pd(struct sugov_policy *sg_policy) {}
+static struct em_perf_domain *sugov_policy_get_pd(
+ struct sugov_policy *sg_policy)
+{
+ return NULL;
+}
+#endif /* CONFIG_ENERGY_MODEL */
+
static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
{
s64 delta_ns;
@@ -859,6 +895,9 @@ static int sugov_start(struct cpufreq_policy *policy)
sugov_update_shared :
sugov_update_single);
}
+
+ sugov_policy_attach_pd(sg_policy);
+
return 0;
}

--
2.23.0

2019-10-11 13:46:35

by Douglas RAILLARD

[permalink] [raw]
Subject: [RFC PATCH v3 4/6] sched/cpufreq: Introduce sugov_cpu_ramp_boost

Use the utilization signals dynamic to detect when the utilization of a
set of tasks starts increasing because of a change in tasks' behavior.
This allows detecting when spending extra power for faster frequency
ramp up response would be beneficial to the reactivity of the system.

This ramp boost is computed as the difference
util_avg-util_est_enqueued. This number somehow represents a lower bound
of how much extra utilization this tasks is actually using, compared to
our best current stable knowledge of it (which is util_est_enqueued).

When the set of runnable tasks changes, the boost is disabled as the
impact of blocked utilization on util_avg will make the delta with
util_est_enqueued not very informative.

Signed-off-by: Douglas RAILLARD <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 44 ++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index aab8c0498dd1..c118f85d1f3d 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -61,6 +61,10 @@ struct sugov_cpu {
unsigned long bw_dl;
unsigned long max;

+ unsigned long ramp_boost;
+ unsigned long util_est_enqueued;
+ unsigned long util_avg;
+
/* The field below is for single-CPU policies only: */
#ifdef CONFIG_NO_HZ_COMMON
unsigned long saved_idle_calls;
@@ -181,6 +185,42 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy, u64 time,
}
}

+static unsigned long sugov_cpu_ramp_boost(struct sugov_cpu *sg_cpu)
+{
+ return READ_ONCE(sg_cpu->ramp_boost);
+}
+
+static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu)
+{
+ struct rq *rq = cpu_rq(sg_cpu->cpu);
+ unsigned long util_est_enqueued;
+ unsigned long util_avg;
+ unsigned long boost = 0;
+
+ util_est_enqueued = READ_ONCE(rq->cfs.avg.util_est.enqueued);
+ util_avg = READ_ONCE(rq->cfs.avg.util_avg);
+
+ /*
+ * Boost when util_avg becomes higher than the previous stable
+ * knowledge of the enqueued tasks' set util, which is CPU's
+ * util_est_enqueued.
+ *
+ * We try to spot changes in the workload itself, so we want to
+ * avoid the noise of tasks being enqueued/dequeued. To do that,
+ * we only trigger boosting when the "amount of work' enqueued
+ * is stable.
+ */
+ if (util_est_enqueued == sg_cpu->util_est_enqueued &&
+ util_avg >= sg_cpu->util_avg &&
+ util_avg > util_est_enqueued)
+ boost = util_avg - util_est_enqueued;
+
+ sg_cpu->util_est_enqueued = util_est_enqueued;
+ sg_cpu->util_avg = util_avg;
+ WRITE_ONCE(sg_cpu->ramp_boost, boost);
+ return boost;
+}
+
/**
* get_next_freq - Compute a new frequency for a given cpufreq policy.
* @sg_policy: schedutil policy object to compute the new frequency for.
@@ -512,6 +552,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
busy = !sg_policy->need_freq_update && sugov_cpu_is_busy(sg_cpu);

util = sugov_get_util(sg_cpu);
+ sugov_cpu_ramp_boost_update(sg_cpu);
max = sg_cpu->max;
util = sugov_iowait_apply(sg_cpu, time, util, max);
next_f = get_next_freq(sg_policy, util, max);
@@ -552,6 +593,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
unsigned long j_util, j_max;

j_util = sugov_get_util(j_sg_cpu);
+ if (j_sg_cpu == sg_cpu)
+ sugov_cpu_ramp_boost_update(sg_cpu);
j_max = j_sg_cpu->max;
j_util = sugov_iowait_apply(j_sg_cpu, time, j_util, j_max);

@@ -561,6 +604,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
}
}

+
return get_next_freq(sg_policy, util, max);
}

--
2.23.0

2019-10-11 13:46:46

by Douglas RAILLARD

[permalink] [raw]
Subject: [RFC PATCH v3 5/6] sched/cpufreq: Boost schedutil frequency ramp up

In some situations, it can be interesting to spend temporarily more
power if that can give a useful frequency boost.

Use the new sugov_cpu_ramp_boost() function to drive an energy-aware
boost, on top of the minimal required frequency.

As that boost number is not accurate (and cannot be without a crystal
ball), we only use it in a way that allows direct control over the power
it is going to cost. This allows keeping a platform-independent level of
control over the average power, while allowing for frequency bursts when
we know a (set of) tasks can make use of it.

In shared policies, the maximum of all CPU's boost is used. Since the
extra power expenditure is bounded, it cannot skyrocket even on
platforms with a large number of cores in the same frequency domain
and/or very high ratio between lowest and highest OPP cost.

Signed-off-by: Douglas RAILLARD <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index c118f85d1f3d..7c1a749fb6ef 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -226,6 +226,9 @@ static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu)
* @sg_policy: schedutil policy object to compute the new frequency for.
* @util: Current CPU utilization.
* @max: CPU capacity.
+ * @boost: Extra power that can be spent on top of the minimum amount of power
+ * required to meet capacity requirements, as a percentage between 0 and
+ * EM_COST_MARGIN_SCALE.
*
* If the utilization is frequency-invariant, choose the new frequency to be
* proportional to it, that is
@@ -244,7 +247,8 @@ static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu)
* cpufreq driver limitations.
*/
static unsigned int get_next_freq(struct sugov_policy *sg_policy,
- unsigned long util, unsigned long max)
+ unsigned long util, unsigned long max,
+ unsigned long boost)
{
struct cpufreq_policy *policy = sg_policy->policy;
unsigned int freq = arch_scale_freq_invariant() ?
@@ -257,7 +261,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
* Try to get a higher frequency if one is available, given the extra
* power we are ready to spend.
*/
- freq = em_pd_get_higher_freq(pd, freq, 0);
+ freq = em_pd_get_higher_freq(pd, freq, boost);

if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
return sg_policy->next_freq;
@@ -539,6 +543,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
unsigned long util, max;
unsigned int next_f;
bool busy;
+ unsigned long ramp_boost = 0;

sugov_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
@@ -552,10 +557,10 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
busy = !sg_policy->need_freq_update && sugov_cpu_is_busy(sg_cpu);

util = sugov_get_util(sg_cpu);
- sugov_cpu_ramp_boost_update(sg_cpu);
+ ramp_boost = sugov_cpu_ramp_boost_update(sg_cpu);
max = sg_cpu->max;
util = sugov_iowait_apply(sg_cpu, time, util, max);
- next_f = get_next_freq(sg_policy, util, max);
+ next_f = get_next_freq(sg_policy, util, max, ramp_boost);
/*
* Do not reduce the frequency if the CPU has not been idle
* recently, as the reduction is likely to be premature then.
@@ -587,6 +592,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
struct cpufreq_policy *policy = sg_policy->policy;
unsigned long util = 0, max = 1;
unsigned int j;
+ unsigned long ramp_boost = 0;
+ unsigned long j_ramp_boost = 0;

for_each_cpu(j, policy->cpus) {
struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
@@ -594,7 +601,11 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)

j_util = sugov_get_util(j_sg_cpu);
if (j_sg_cpu == sg_cpu)
- sugov_cpu_ramp_boost_update(sg_cpu);
+ j_ramp_boost = sugov_cpu_ramp_boost_update(sg_cpu);
+ else
+ j_ramp_boost = sugov_cpu_ramp_boost(j_sg_cpu);
+ ramp_boost = max(ramp_boost, j_ramp_boost);
+
j_max = j_sg_cpu->max;
j_util = sugov_iowait_apply(j_sg_cpu, time, j_util, j_max);

@@ -605,7 +616,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
}


- return get_next_freq(sg_policy, util, max);
+ return get_next_freq(sg_policy, util, max, ramp_boost);
}

static void
--
2.23.0

2019-10-11 13:46:59

by Douglas RAILLARD

[permalink] [raw]
Subject: [RFC PATCH v3 6/6] sched/cpufreq: Add schedutil_em_tp tracepoint

Introduce a new tracepoint reporting the effect of using the Energy
Model inside get_next_freq() in schedutil.

Signed-off-by: Douglas RAILLARD <[email protected]>
---
include/trace/events/power.h | 9 +++++++++
kernel/sched/cpufreq_schedutil.c | 20 ++++++++++++++------
2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/power.h b/include/trace/events/power.h
index f7aece721aed..87a14f5208a7 100644
--- a/include/trace/events/power.h
+++ b/include/trace/events/power.h
@@ -529,6 +529,15 @@ DEFINE_EVENT(dev_pm_qos_request, dev_pm_qos_remove_request,

TP_ARGS(name, type, new_value)
);
+
+DECLARE_TRACE(schedutil_em_tp,
+ TP_PROTO(unsigned int cpu, unsigned long util,
+ unsigned int cost_margin, unsigned int policy_cost_margin,
+ unsigned int base_freq, unsigned int boosted_freq),
+ TP_ARGS(cpu, util, cost_margin, policy_cost_margin, base_freq,
+ boosted_freq)
+);
+
#endif /* _TRACE_POWER_H */

/* This part must be outside protection */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 7c1a749fb6ef..076bbb69ff42 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -14,6 +14,8 @@
#include <linux/sched/cpufreq.h>
#include <trace/events/power.h>

+EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_em_tp);
+
#define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8)

struct sugov_tunables {
@@ -223,7 +225,7 @@ static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu)

/**
* get_next_freq - Compute a new frequency for a given cpufreq policy.
- * @sg_policy: schedutil policy object to compute the new frequency for.
+ * @sg_cpu: schedutil CPU object to compute the new frequency for.
* @util: Current CPU utilization.
* @max: CPU capacity.
* @boost: Extra power that can be spent on top of the minimum amount of power
@@ -246,22 +248,28 @@ static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu)
* next_freq (as calculated above) is returned, subject to policy min/max and
* cpufreq driver limitations.
*/
-static unsigned int get_next_freq(struct sugov_policy *sg_policy,
+static unsigned int get_next_freq(struct sugov_cpu *sg_cpu,
unsigned long util, unsigned long max,
unsigned long boost)
{
+ struct sugov_policy *sg_policy = sg_cpu->sg_policy;
struct cpufreq_policy *policy = sg_policy->policy;
unsigned int freq = arch_scale_freq_invariant() ?
policy->cpuinfo.max_freq : policy->cur;
struct em_perf_domain *pd = sugov_policy_get_pd(sg_policy);
+ unsigned int base_freq;

- freq = map_util_freq(util, freq, max);
+ base_freq = map_util_freq(util, freq, max);

/*
* Try to get a higher frequency if one is available, given the extra
* power we are ready to spend.
*/
- freq = em_pd_get_higher_freq(pd, freq, boost);
+ freq = em_pd_get_higher_freq(pd, base_freq, boost);
+
+ trace_schedutil_em_tp(sg_cpu->cpu, util,
+ sugov_cpu_ramp_boost(sg_cpu), boost,
+ base_freq, freq);

if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
return sg_policy->next_freq;
@@ -560,7 +568,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
ramp_boost = sugov_cpu_ramp_boost_update(sg_cpu);
max = sg_cpu->max;
util = sugov_iowait_apply(sg_cpu, time, util, max);
- next_f = get_next_freq(sg_policy, util, max, ramp_boost);
+ next_f = get_next_freq(sg_cpu, util, max, ramp_boost);
/*
* Do not reduce the frequency if the CPU has not been idle
* recently, as the reduction is likely to be premature then.
@@ -616,7 +624,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
}


- return get_next_freq(sg_policy, util, max, ramp_boost);
+ return get_next_freq(sg_cpu, util, max, ramp_boost);
}

static void
--
2.23.0

2019-10-11 13:47:20

by Douglas RAILLARD

[permalink] [raw]
Subject: [RFC PATCH v3 3/6] sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()

Choose the highest OPP for a given energy cost, allowing to skip lower
frequencies that would not be cheaper in terms of consumed power. These
frequencies can still be interesting to keep in the energy model to give
more freedom to thermal throttling, but should not be selected under
normal circumstances.

This also prepares the ground for energy-aware frequency boosting.

Signed-off-by: Douglas RAILLARD <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 9abda58827c0..aab8c0498dd1 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -10,6 +10,7 @@

#include "sched.h"

+#include <linux/energy_model.h>
#include <linux/sched/cpufreq.h>
#include <trace/events/power.h>

@@ -208,9 +209,16 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
struct cpufreq_policy *policy = sg_policy->policy;
unsigned int freq = arch_scale_freq_invariant() ?
policy->cpuinfo.max_freq : policy->cur;
+ struct em_perf_domain *pd = sugov_policy_get_pd(sg_policy);

freq = map_util_freq(util, freq, max);

+ /*
+ * Try to get a higher frequency if one is available, given the extra
+ * power we are ready to spend.
+ */
+ freq = em_pd_get_higher_freq(pd, freq, 0);
+
if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
return sg_policy->next_freq;

--
2.23.0

2019-10-14 14:34:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 4/6] sched/cpufreq: Introduce sugov_cpu_ramp_boost

On Fri, Oct 11, 2019 at 02:44:58PM +0100, Douglas RAILLARD wrote:
> Use the utilization signals dynamic to detect when the utilization of a
> set of tasks starts increasing because of a change in tasks' behavior.
> This allows detecting when spending extra power for faster frequency
> ramp up response would be beneficial to the reactivity of the system.
>
> This ramp boost is computed as the difference
> util_avg-util_est_enqueued. This number somehow represents a lower bound

That reads funny, maybe 'as the difference between util_avg and
util_est_enqueued' ?

> of how much extra utilization this tasks is actually using, compared to
> our best current stable knowledge of it (which is util_est_enqueued).
>
> When the set of runnable tasks changes, the boost is disabled as the
> impact of blocked utilization on util_avg will make the delta with
> util_est_enqueued not very informative.

> @@ -561,6 +604,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> }
> }
>
> +
> return get_next_freq(sg_policy, util, max);
> }

Surely we can do without this extra whitespace? :-)

2019-10-14 14:56:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware

On Fri, Oct 11, 2019 at 02:44:54PM +0100, Douglas RAILLARD wrote:
> This has been ligthly tested with a rtapp task ramping from 10% to 75%
> utilisation on a big core. Results are improved by fast ramp-up
> EWMA [1], since it greatly reduces the oscillation in frequency at first
> idle when ramping up.
>
> [1] [PATCH] sched/fair: util_est: fast ramp-up EWMA on utilization increases
> Message-ID: <[email protected]>
> https://lore.kernel.org/lkml/[email protected]/


I don't really see anything fundamentally weird here. Any actual numbers
or other means of quantifying the improvement these patches bring?

2019-10-14 16:15:39

by Douglas RAILLARD

[permalink] [raw]
Subject: Re: [RFC PATCH v3 4/6] sched/cpufreq: Introduce sugov_cpu_ramp_boost

Hi Peter,

On 10/14/19 3:33 PM, Peter Zijlstra wrote:
> On Fri, Oct 11, 2019 at 02:44:58PM +0100, Douglas RAILLARD wrote:
>> Use the utilization signals dynamic to detect when the utilization of a
>> set of tasks starts increasing because of a change in tasks' behavior.
>> This allows detecting when spending extra power for faster frequency
>> ramp up response would be beneficial to the reactivity of the system.
>>
>> This ramp boost is computed as the difference
>> util_avg-util_est_enqueued. This number somehow represents a lower bound
>
> That reads funny, maybe 'as the difference between util_avg and
> util_est_enqueued' ?

Indeed, it was not clear that it was a formula. Talking about formulas, I remember laying down
the relations between the various flavors of util signals in the v2 thread. This could be
turned rather easily into a doc page for PELT, along with a signal-processing-friendly
accurate description of how the PELT signals are built. Would such a patch be of any
interest the kernel tree ?

>> of how much extra utilization this tasks is actually using, compared to
>> our best current stable knowledge of it (which is util_est_enqueued).
>>
>> When the set of runnable tasks changes, the boost is disabled as the
>> impact of blocked utilization on util_avg will make the delta with
>> util_est_enqueued not very informative.
>
>> @@ -561,6 +604,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
>> }
>> }
>>
>> +
>> return get_next_freq(sg_policy, util, max);
>> }
>
> Surely we can do without this extra whitespace? :-)
>
woops ...

Cheers,
Douglas

2019-10-14 16:28:34

by Douglas RAILLARD

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware

Hi Peter,

On 10/14/19 3:53 PM, Peter Zijlstra wrote:
> On Fri, Oct 11, 2019 at 02:44:54PM +0100, Douglas RAILLARD wrote:
>> This has been ligthly tested with a rtapp task ramping from 10% to 75%
>> utilisation on a big core. Results are improved by fast ramp-up
>> EWMA [1], since it greatly reduces the oscillation in frequency at first
>> idle when ramping up.
>>
>> [1] [PATCH] sched/fair: util_est: fast ramp-up EWMA on utilization increases
>> Message-ID: <[email protected]>
>> https://lore.kernel.org/lkml/[email protected]/
>
>
> I don't really see anything fundamentally weird here. Any actual numbers
> or other means of quantifying the improvement these patches bring?
>

I posted some numbers based on a similar experiment on the v2 of that series that
are still applicable:

TL;DR the rt-app negative slack is divided by 1.75 by this series, with an
increase of 3% in total energy consumption. There is a burst every 0.6s, and
the average power consumption increase is proportional to the average number
of bursts.


The workload is an rt-app task ramping up from 5% to 75% util in one big step,
pinned on a big core. The whole cycle is 0.6s long (0.3s at 5% followed by 0.3s at 75%).
This cycle is repeated 20 times and the average of boosting is taken.

The test device is a Google Pixel 3 (Qcom Snapdragon 845) phone.
It has a lot more OPPs than a hikey 960, so gradations in boosting
are better reflected on frequency selection.

avg slack (higher=better):
Average time between task sleep and its next periodic activation.
See rt-app doc: https://github.com/scheduler-tools/rt-app/blob/9a50d76f726d7c325c82ac8c7ed9ed70e1c97937/doc/tutorial.txt#L631

avg negative slack (lower in absolute value=better):
Same as avg slack, but only taking into account negative values.
Negative slack means a task activation did not have enough time to complete before the next
periodic activation fired, which is what we want to avoid.

boost energy overhead (lower=better):
Extra power consumption induced by ramp boost, assuming continuous OPP space (infinite number of OPP)
and single-CPU policies. In practice, fixed number of OPP decrease this value, and more CPU per policy increases it,
since boost(policy) = max(boost(cpu) foreach cpu of policy)).

Without ramp boost:
+--------------------+--------------------+
|avg slack (us) |avg negative slack |
| |(us) |
+--------------------+--------------------+
|6598.72 |-10217.13 |
|6595.49 |-10200.13 |
|6613.72 |-10401.06 |
|6600.29 |-9860.872 |
|6605.53 |-10057.64 |
|6612.05 |-10267.50 |
|6599.01 |-9939.60 |
|6593.79 |-9445.633 |
|6613.56 |-10276.75 |
|6595.44 |-9751.770 |
+--------------------+--------------------+
|average |
+--------------------+--------------------+
|6602.76 |-10041.81 |
+--------------------+--------------------+


With ramp boost enabled:
+--------------------+--------------------+--------------------+
|boost energy |avg slack (us) |avg negative slack |
|overhead (%) | |(us) |
+--------------------+--------------------+--------------------+
|3.05 |7148.93 |-5664.26 |
|3.04 |7144.69 |-5667.77 |
|3.05 |7149.05 |-5698.31 |
|2.97 |7126.71 |-6040.23 |
|3.02 |7140.28 |-5826.78 |
|3.03 |7135.11 |-5749.62 |
|3.05 |7140.24 |-5750.0 |
|3.05 |7144.84 |-5667.04 |
|3.07 |7157.30 |-5656.65 |
|3.06 |7154.65 |-5653.76 |
+--------------------+--------------------+--------------------+
|average |
+--------------------+--------------------+--------------------+
|3.039000 |7144.18 |-5737.44 |
+--------------------+--------------------+--------------------+


The negative slack is due to missed activations while the utilization signals
increase during the big utilization step. Ramp boost is designed to boost frequency during
that phase, which materializes in 1.75 less negative slack, for an extra power
consumption under 3%.


Regards,
Douglas

2019-10-18 10:19:04

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/6] PM: Introduce em_pd_get_higher_freq()

On 11/10/2019 15:44, Douglas RAILLARD wrote:

[...]

> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index d249b88a4d5a..dd6a35f099ea 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -159,6 +159,53 @@ static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
> return pd->nr_cap_states;
> }
>
> +#define EM_COST_MARGIN_SCALE 1024U
> +
> +/**
> + * em_pd_get_higher_freq() - Get the highest frequency that does not exceed the
> + * given cost margin compared to min_freq
> + * @pd : performance domain for which this must be done
> + * @min_freq : minimum frequency to return
> + * @cost_margin : allowed margin compared to min_freq, on the
> + * EM_COST_MARGIN_SCALE scale.
> + *
> + * Return: the chosen frequency, guaranteed to be at least as high as min_freq.
> + */
> +static inline unsigned long em_pd_get_higher_freq(struct em_perf_domain *pd,
> + unsigned long min_freq, unsigned long cost_margin)
> +{
> + unsigned long max_cost = 0;
> + struct em_cap_state *cs;
> + int i;
> +
> + if (!pd)
> + return min_freq;
> +
> + /* Compute the maximum allowed cost */
> + for (i = 0; i < pd->nr_cap_states; i++) {
> + cs = &pd->table[i];
> + if (cs->frequency >= min_freq) {
> + max_cost = cs->cost +
> + (cs->cost * cost_margin) / EM_COST_MARGIN_SCALE;
> + break;
> + }
> + }
> +
> + /* Find the highest frequency that will not exceed the cost margin */
> + for (i = pd->nr_cap_states-1; i >= 0; i--) {
> + cs = &pd->table[i];
> + if (cs->cost <= max_cost)
> + return cs->frequency;
> + }
> +
> + /*
> + * We should normally never reach here, unless min_freq was higher than
> + * the highest available frequency, which is not expected to happen.
> + */
> + return min_freq;
> +}
> +
> +

Why two blank lines?

> #else

Doesn't apply cleanly on v5.4-rc3. There seems to be a line missing?

27871f7a8a341 (Quentin Perret 2018-12-03 09:56:16 +0000 163) struct
em_perf_domain {};

[...]

2019-10-18 10:19:16

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/6] sched/cpufreq: Attach perf domain to sugov policy

On 11/10/2019 15:44, Douglas RAILLARD wrote:

[...]

> @@ -66,6 +70,38 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
>
> /************************ Governor internals ***********************/
>
> +#ifdef CONFIG_ENERGY_MODEL
> +static void sugov_policy_attach_pd(struct sugov_policy *sg_policy)
> +{
> + struct em_perf_domain *pd;
> + struct cpufreq_policy *policy = sg_policy->policy;

Shouldn't always order local variable declarations from longest to
shortest line?

> +
> + sg_policy->pd = NULL;
> + pd = em_cpu_get(policy->cpu);
> + if (!pd)
> + return;
> +
> + if (cpumask_equal(policy->related_cpus, to_cpumask(pd->cpus)))
> + sg_policy->pd = pd;
> + else
> + pr_warn("%s: Not all CPUs in schedutil policy %u share the same perf domain, no perf domain for that policy will be registered\n",
> + __func__, policy->cpu);

Maybe {} because of 2 lines?

> +}
> +
> +static struct em_perf_domain *sugov_policy_get_pd(
> + struct sugov_policy *sg_policy)


Maybe this way? This format is already used in this file.

static struct em_perf_domain *
sugov_policy_get_pd(struct sugov_policy *sg_policy)


> +{
> + return sg_policy->pd;
> +}
> +#else /* CONFIG_ENERGY_MODEL */
> +static void sugov_policy_attach_pd(struct sugov_policy *sg_policy) {}
> +static struct em_perf_domain *sugov_policy_get_pd(
> + struct sugov_policy *sg_policy)
> +{
> + return NULL;
> +}
> +#endif /* CONFIG_ENERGY_MODEL */
> +
> static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> {
> s64 delta_ns;
> @@ -859,6 +895,9 @@ static int sugov_start(struct cpufreq_policy *policy)
> sugov_update_shared :
> sugov_update_single);
> }
> +
> + sugov_policy_attach_pd(sg_policy);
> +
> return 0;
> }

A sugov_policy_detach_pd() called from sugov_stop() (doing for instance
the g_policy->pd = NULL) is not needed?

2019-10-18 10:19:24

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH v3 4/6] sched/cpufreq: Introduce sugov_cpu_ramp_boost

On 11/10/2019 15:44, Douglas RAILLARD wrote:

[...]

> @@ -181,6 +185,42 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy, u64 time,
> }
> }
>
> +static unsigned long sugov_cpu_ramp_boost(struct sugov_cpu *sg_cpu)
> +{
> + return READ_ONCE(sg_cpu->ramp_boost);
> +}
> +
> +static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu)
> +{
> + struct rq *rq = cpu_rq(sg_cpu->cpu);
> + unsigned long util_est_enqueued;
> + unsigned long util_avg;
> + unsigned long boost = 0;
> +
> + util_est_enqueued = READ_ONCE(rq->cfs.avg.util_est.enqueued);
> + util_avg = READ_ONCE(rq->cfs.avg.util_avg);
> +
> + /*
> + * Boost when util_avg becomes higher than the previous stable
> + * knowledge of the enqueued tasks' set util, which is CPU's
> + * util_est_enqueued.
> + *
> + * We try to spot changes in the workload itself, so we want to
> + * avoid the noise of tasks being enqueued/dequeued. To do that,
> + * we only trigger boosting when the "amount of work' enqueued

s/"amount of work'/"amount of work" or 'amount of work'

[...]

> @@ -552,6 +593,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> unsigned long j_util, j_max;
>
> j_util = sugov_get_util(j_sg_cpu);
> + if (j_sg_cpu == sg_cpu)
> + sugov_cpu_ramp_boost_update(sg_cpu);

Can you not call this already in sugov_update_shared(), like in the
sugov_update_single() case?

diff --git a/kernel/sched/cpufreq_schedutil.c
b/kernel/sched/cpufreq_schedutil.c
index e35c20b42780..4c53f63a537d 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -595,8 +595,6 @@ static unsigned int sugov_next_freq_shared(struct
sugov_cpu *sg_cpu, u64 time)
unsigned long j_util, j_max;

j_util = sugov_get_util(j_sg_cpu);
- if (j_sg_cpu == sg_cpu)
- sugov_cpu_ramp_boost_update(sg_cpu);
j_max = j_sg_cpu->max;
j_util = sugov_iowait_apply(j_sg_cpu, time, j_util, j_max);

@@ -625,6 +623,7 @@ sugov_update_shared(struct update_util_data *hook,
u64 time, unsigned int flags)
ignore_dl_rate_limit(sg_cpu, sg_policy);

if (sugov_should_update_freq(sg_policy, time)) {
+ sugov_cpu_ramp_boost_update(sg_cpu);
next_f = sugov_next_freq_shared(sg_cpu, time);

[...]

2019-10-18 10:22:32

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/6] sched/cpufreq: Boost schedutil frequency ramp up

On 11/10/2019 15:44, Douglas RAILLARD wrote:

[...]

> @@ -539,6 +543,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
> unsigned long util, max;
> unsigned int next_f;
> bool busy;
> + unsigned long ramp_boost = 0;

Shouldn't always order local variable declarations from longest to
shortest line?

> sugov_iowait_boost(sg_cpu, time, flags);
> sg_cpu->last_update = time;
> @@ -552,10 +557,10 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
> busy = !sg_policy->need_freq_update && sugov_cpu_is_busy(sg_cpu);
>
> util = sugov_get_util(sg_cpu);
> - sugov_cpu_ramp_boost_update(sg_cpu);
> + ramp_boost = sugov_cpu_ramp_boost_update(sg_cpu);
> max = sg_cpu->max;
> util = sugov_iowait_apply(sg_cpu, time, util, max);
> - next_f = get_next_freq(sg_policy, util, max);
> + next_f = get_next_freq(sg_policy, util, max, ramp_boost);
> /*
> * Do not reduce the frequency if the CPU has not been idle
> * recently, as the reduction is likely to be premature then.
> @@ -587,6 +592,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> struct cpufreq_policy *policy = sg_policy->policy;
> unsigned long util = 0, max = 1;
> unsigned int j;
> + unsigned long ramp_boost = 0;
> + unsigned long j_ramp_boost = 0;

Shouldn't always order local variable declarations from longest to
shortest line?

> for_each_cpu(j, policy->cpus) {
> struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
> @@ -594,7 +601,11 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
>
> j_util = sugov_get_util(j_sg_cpu);
> if (j_sg_cpu == sg_cpu)
> - sugov_cpu_ramp_boost_update(sg_cpu);
> + j_ramp_boost = sugov_cpu_ramp_boost_update(sg_cpu);
> + else
> + j_ramp_boost = sugov_cpu_ramp_boost(j_sg_cpu);
> + ramp_boost = max(ramp_boost, j_ramp_boost);

Ah, that's why you call sugov_cpu_ramp_boost_update() from
sugov_next_freq_shared() in 4/6. So sugov_cpu_ramp_boost_update() is
more a sugov_cpu_ramp_boost(..., int update)? Minor detail though.

[...]

2019-10-18 14:00:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware

On Mon, Oct 14, 2019 at 04:50:24PM +0100, Douglas Raillard wrote:

> I posted some numbers based on a similar experiment on the v2 of that series that
> are still applicable:
>
> TL;DR the rt-app negative slack is divided by 1.75 by this series, with an
> increase of 3% in total energy consumption. There is a burst every 0.6s, and
> the average power consumption increase is proportional to the average number
> of bursts.
>
>
> The workload is an rt-app task ramping up from 5% to 75% util in one big step,
> pinned on a big core. The whole cycle is 0.6s long (0.3s at 5% followed by 0.3s at 75%).
> This cycle is repeated 20 times and the average of boosting is taken.
>
> The test device is a Google Pixel 3 (Qcom Snapdragon 845) phone.
> It has a lot more OPPs than a hikey 960, so gradations in boosting
> are better reflected on frequency selection.
>
> avg slack (higher=better):
> Average time between task sleep and its next periodic activation.
> See rt-app doc: https://github.com/scheduler-tools/rt-app/blob/9a50d76f726d7c325c82ac8c7ed9ed70e1c97937/doc/tutorial.txt#L631
>
> avg negative slack (lower in absolute value=better):
> Same as avg slack, but only taking into account negative values.
> Negative slack means a task activation did not have enough time to complete before the next
> periodic activation fired, which is what we want to avoid.
>
> boost energy overhead (lower=better):
> Extra power consumption induced by ramp boost, assuming continuous OPP space (infinite number of OPP)
> and single-CPU policies. In practice, fixed number of OPP decrease this value, and more CPU per policy increases it,
> since boost(policy) = max(boost(cpu) foreach cpu of policy)).
>
> Without ramp boost:
> +--------------------+--------------------+
> |avg slack (us) |avg negative slack |
> | |(us) |
> +--------------------+--------------------+
> |6598.72 |-10217.13 |
> |6595.49 |-10200.13 |
> |6613.72 |-10401.06 |
> |6600.29 |-9860.872 |
> |6605.53 |-10057.64 |
> |6612.05 |-10267.50 |
> |6599.01 |-9939.60 |
> |6593.79 |-9445.633 |
> |6613.56 |-10276.75 |
> |6595.44 |-9751.770 |
> +--------------------+--------------------+
> |average |
> +--------------------+--------------------+
> |6602.76 |-10041.81 |
> +--------------------+--------------------+
>
>
> With ramp boost enabled:
> +--------------------+--------------------+--------------------+
> |boost energy |avg slack (us) |avg negative slack |
> |overhead (%) | |(us) |
> +--------------------+--------------------+--------------------+
> |3.05 |7148.93 |-5664.26 |
> |3.04 |7144.69 |-5667.77 |
> |3.05 |7149.05 |-5698.31 |
> |2.97 |7126.71 |-6040.23 |
> |3.02 |7140.28 |-5826.78 |
> |3.03 |7135.11 |-5749.62 |
> |3.05 |7140.24 |-5750.0 |
> |3.05 |7144.84 |-5667.04 |
> |3.07 |7157.30 |-5656.65 |
> |3.06 |7154.65 |-5653.76 |
> +--------------------+--------------------+--------------------+
> |average |
> +--------------------+--------------------+--------------------+
> |3.039000 |7144.18 |-5737.44 |
> +--------------------+--------------------+--------------------+
>
>
> The negative slack is due to missed activations while the utilization signals
> increase during the big utilization step. Ramp boost is designed to boost frequency during
> that phase, which materializes in 1.75 less negative slack, for an extra power
> consumption under 3%.

OK, so I think I see what it is doing, and why.

Normally we use (map_util_freq):

freq = C * max_freq * util / max ; C=1.25

But here, when util is increasing, we effectively increase our C to
allow picking a higher OPP. Because of that higher OPP we finish our
work sooner (avg slack increases) and miss our activation less often
(avg neg slack decreases).

Now, the thing is, we use map_util_freq() in more places, and should we
not reflect this increase in C for all of them? That is, why is this
patch changing get_next_freq() and not map_util_freq().

I don't think that question is answered in the Changelogs.

Exactly because it does change the energy consumption (it must) should
that not also be reflected in the EAS logic?

I'm still thinking about the exact means you're using to raise C; that
is, the 'util - util_est' as cost_margin. It hurts my brain still.

2019-10-18 14:06:13

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/6] PM: Introduce em_pd_get_higher_freq()

On 11/10/2019 15:44, Douglas RAILLARD wrote:

[...]

> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index d249b88a4d5a..dd6a35f099ea 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -159,6 +159,53 @@ static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
> return pd->nr_cap_states;
> }
>
> +#define EM_COST_MARGIN_SCALE 1024U
> +
> +/**
> + * em_pd_get_higher_freq() - Get the highest frequency that does not exceed the
> + * given cost margin compared to min_freq
> + * @pd : performance domain for which this must be done
> + * @min_freq : minimum frequency to return
> + * @cost_margin : allowed margin compared to min_freq, on the
> + * EM_COST_MARGIN_SCALE scale.
> + *
> + * Return: the chosen frequency, guaranteed to be at least as high as min_freq.
> + */
> +static inline unsigned long em_pd_get_higher_freq(struct em_perf_domain *pd,
> + unsigned long min_freq, unsigned long cost_margin)
> +{
> + unsigned long max_cost = 0;
> + struct em_cap_state *cs;
> + int i;
> +
> + if (!pd)
> + return min_freq;
> +
> + /* Compute the maximum allowed cost */
> + for (i = 0; i < pd->nr_cap_states; i++) {
> + cs = &pd->table[i];
> + if (cs->frequency >= min_freq) {
> + max_cost = cs->cost +
> + (cs->cost * cost_margin) / EM_COST_MARGIN_SCALE;

Maybe you could mention in the header that this is the place where the
algorithm could be tuned. (even though it doesn't offer any tuning
interface, which is a good thing).

> + break;
> + }
> + }
> +
> + /* Find the highest frequency that will not exceed the cost margin */
> + for (i = pd->nr_cap_states-1; i >= 0; i--) {
> + cs = &pd->table[i];
> + if (cs->cost <= max_cost)
> + return cs->frequency;
> + }
> +
> + /*
> + * We should normally never reach here, unless min_freq was higher than
> + * the highest available frequency, which is not expected to happen.
> + */

Maybe add a WARN_ONCE(1, "foobar"); here to indicate this unlikely event
(CPUfreq and EM framwork out of sync)?

[...]

2019-10-18 15:32:40

by Douglas RAILLARD

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/6] sched/cpufreq: Attach perf domain to sugov policy

Hi Dietmar,

On 10/17/19 9:57 AM, Dietmar Eggemann wrote:
> On 11/10/2019 15:44, Douglas RAILLARD wrote:
>
> [...]
>
>> @@ -66,6 +70,38 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
>>
>> /************************ Governor internals ***********************/
>>
>> +#ifdef CONFIG_ENERGY_MODEL
>> +static void sugov_policy_attach_pd(struct sugov_policy *sg_policy)
>> +{
>> + struct em_perf_domain *pd;
>> + struct cpufreq_policy *policy = sg_policy->policy;
>
> Shouldn't always order local variable declarations from longest to
> shortest line?

Can't find any reference to that rule in the coding style, although I'm happy to change order
if that's deemed useful.

>
>> +
>> + sg_policy->pd = NULL;
>> + pd = em_cpu_get(policy->cpu);
>> + if (!pd)
>> + return;
>> +
>> + if (cpumask_equal(policy->related_cpus, to_cpumask(pd->cpus)))
>> + sg_policy->pd = pd;
>> + else
>> + pr_warn("%s: Not all CPUs in schedutil policy %u share the same perf domain, no perf domain for that policy will be registered\n",
>> + __func__, policy->cpu);
>
> Maybe {} because of 2 lines?

+1

>> +}
>> +
>> +static struct em_perf_domain *sugov_policy_get_pd(
>> + struct sugov_policy *sg_policy)
>
>
> Maybe this way? This format is already used in this file.
>
> static struct em_perf_domain *
> sugov_policy_get_pd(struct sugov_policy *sg_policy)
>

I also prefer this kind of non-indented form that always stays indented across renames :)

>> +{
>> + return sg_policy->pd;
>> +}
>> +#else /* CONFIG_ENERGY_MODEL */
>> +static void sugov_policy_attach_pd(struct sugov_policy *sg_policy) {}
>> +static struct em_perf_domain *sugov_policy_get_pd(
>> + struct sugov_policy *sg_policy)
>> +{
>> + return NULL;
>> +}
>> +#endif /* CONFIG_ENERGY_MODEL */
>> +
>> static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
>> {
>> s64 delta_ns;
>> @@ -859,6 +895,9 @@ static int sugov_start(struct cpufreq_policy *policy)
>> sugov_update_shared :
>> sugov_update_single);
>> }
>> +
>> + sugov_policy_attach_pd(sg_policy);
>> +
>> return 0;
>> }
>
> A sugov_policy_detach_pd() called from sugov_stop() (doing for instance
> the g_policy->pd = NULL) is not needed?

From what I could see, sugov_stop() will always be followed by sugov_start() before
it's used again, so that does not seem more risky than not de-initializing sg_cpu's
for example.

2019-10-18 15:38:39

by Douglas RAILLARD

[permalink] [raw]
Subject: Re: [RFC PATCH v3 1/6] PM: Introduce em_pd_get_higher_freq()

Hi Dietmar,

On 10/17/19 10:58 AM, Dietmar Eggemann wrote:
> On 11/10/2019 15:44, Douglas RAILLARD wrote:
>
> [...]
>
>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>> index d249b88a4d5a..dd6a35f099ea 100644
>> --- a/include/linux/energy_model.h
>> +++ b/include/linux/energy_model.h
>> @@ -159,6 +159,53 @@ static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
>> return pd->nr_cap_states;
>> }
>>
>> +#define EM_COST_MARGIN_SCALE 1024U
>> +
>> +/**
>> + * em_pd_get_higher_freq() - Get the highest frequency that does not exceed the
>> + * given cost margin compared to min_freq
>> + * @pd : performance domain for which this must be done
>> + * @min_freq : minimum frequency to return
>> + * @cost_margin : allowed margin compared to min_freq, on the
>> + * EM_COST_MARGIN_SCALE scale.
>> + *
>> + * Return: the chosen frequency, guaranteed to be at least as high as min_freq.
>> + */
>> +static inline unsigned long em_pd_get_higher_freq(struct em_perf_domain *pd,
>> + unsigned long min_freq, unsigned long cost_margin)
>> +{
>> + unsigned long max_cost = 0;
>> + struct em_cap_state *cs;
>> + int i;
>> +
>> + if (!pd)
>> + return min_freq;
>> +
>> + /* Compute the maximum allowed cost */
>> + for (i = 0; i < pd->nr_cap_states; i++) {
>> + cs = &pd->table[i];
>> + if (cs->frequency >= min_freq) {
>> + max_cost = cs->cost +
>> + (cs->cost * cost_margin) / EM_COST_MARGIN_SCALE;
>
> Maybe you could mention in the header that this is the place where the
> algorithm could be tuned. (even though it doesn't offer any tuning
> interface, which is a good thing).

I'm not sure what you mean, the patch "title" contains "em_pd_get_higher_freq()", should it
also mention where exactly inside the function the margin logic is implemented ?

>> + break;
>> + }
>> + }
>> +
>> + /* Find the highest frequency that will not exceed the cost margin */
>> + for (i = pd->nr_cap_states-1; i >= 0; i--) {
>> + cs = &pd->table[i];
>> + if (cs->cost <= max_cost)
>> + return cs->frequency;
>> + }
>> +
>> + /*
>> + * We should normally never reach here, unless min_freq was higher than
>> + * the highest available frequency, which is not expected to happen.
>> + */
>
> Maybe add a WARN_ONCE(1, "foobar"); here to indicate this unlikely event
> (CPUfreq and EM framwork out of sync)?

Will do.

> [...]
>

2019-10-18 15:39:00

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware

On Thursday 17 Oct 2019 at 11:50:15 (+0200), Peter Zijlstra wrote:
> Now, the thing is, we use map_util_freq() in more places, and should we
> not reflect this increase in C for all of them? That is, why is this
> patch changing get_next_freq() and not map_util_freq().
>
> I don't think that question is answered in the Changelogs.
>
> Exactly because it does change the energy consumption (it must) should
> that not also be reflected in the EAS logic?

Right that shouldn't hurt and keep things consistent. That probably
won't have a huge impact in practice (the boost should be != 0 only when
the util signals haven't converged IIUC, which is a case where the EAS
calculation is already 'wrong' anyway), but that still feels like the
right thing to do.

> I'm still thinking about the exact means you're using to raise C; that
> is, the 'util - util_est' as cost_margin. It hurts my brain still.

+1 ...

Thanks,
Quentin

2019-10-18 15:40:39

by Douglas RAILLARD

[permalink] [raw]
Subject: Re: [RFC PATCH v3 4/6] sched/cpufreq: Introduce sugov_cpu_ramp_boost



On 10/17/19 9:57 AM, Dietmar Eggemann wrote:
> On 11/10/2019 15:44, Douglas RAILLARD wrote:
>
> [...]
>
>> @@ -181,6 +185,42 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy, u64 time,
>> }
>> }
>>
>> +static unsigned long sugov_cpu_ramp_boost(struct sugov_cpu *sg_cpu)
>> +{
>> + return READ_ONCE(sg_cpu->ramp_boost);
>> +}
>> +
>> +static unsigned long sugov_cpu_ramp_boost_update(struct sugov_cpu *sg_cpu)
>> +{
>> + struct rq *rq = cpu_rq(sg_cpu->cpu);
>> + unsigned long util_est_enqueued;
>> + unsigned long util_avg;
>> + unsigned long boost = 0;
>> +
>> + util_est_enqueued = READ_ONCE(rq->cfs.avg.util_est.enqueued);
>> + util_avg = READ_ONCE(rq->cfs.avg.util_avg);
>> +
>> + /*
>> + * Boost when util_avg becomes higher than the previous stable
>> + * knowledge of the enqueued tasks' set util, which is CPU's
>> + * util_est_enqueued.
>> + *
>> + * We try to spot changes in the workload itself, so we want to
>> + * avoid the noise of tasks being enqueued/dequeued. To do that,
>> + * we only trigger boosting when the "amount of work' enqueued
>
> s/"amount of work'/"amount of work" or 'amount of work'
>
> [...]
>
>> @@ -552,6 +593,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
>> unsigned long j_util, j_max;
>>
>> j_util = sugov_get_util(j_sg_cpu);
>> + if (j_sg_cpu == sg_cpu)
>> + sugov_cpu_ramp_boost_update(sg_cpu);
>
> Can you not call this already in sugov_update_shared(), like in the
> sugov_update_single() case?

The next commit in the series needs to aggregate the ramp_boost of all CPUs in the policy,
so this call will end up here anyway, unless we want to set the value at previous level and
query it back again in the loop. I don't mind either way, but since no option seem
faster than the other, I went for clustering the ramp boost code rather than spreading it at
all levels.


> diff --git a/kernel/sched/cpufreq_schedutil.c
> b/kernel/sched/cpufreq_schedutil.c
> index e35c20b42780..4c53f63a537d 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -595,8 +595,6 @@ static unsigned int sugov_next_freq_shared(struct
> sugov_cpu *sg_cpu, u64 time)
> unsigned long j_util, j_max;
>
> j_util = sugov_get_util(j_sg_cpu);
> - if (j_sg_cpu == sg_cpu)
> - sugov_cpu_ramp_boost_update(sg_cpu);
> j_max = j_sg_cpu->max;
> j_util = sugov_iowait_apply(j_sg_cpu, time, j_util, j_max);
>
> @@ -625,6 +623,7 @@ sugov_update_shared(struct update_util_data *hook,
> u64 time, unsigned int flags)
> ignore_dl_rate_limit(sg_cpu, sg_policy);
>
> if (sugov_should_update_freq(sg_policy, time)) {
> + sugov_cpu_ramp_boost_update(sg_cpu);
> next_f = sugov_next_freq_shared(sg_cpu, time);
>
> [...]
>

2019-10-18 16:16:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware

On Thu, Oct 17, 2019 at 12:11:16PM +0100, Quentin Perret wrote:
> On Thursday 17 Oct 2019 at 11:50:15 (+0200), Peter Zijlstra wrote:
> > Now, the thing is, we use map_util_freq() in more places, and should we
> > not reflect this increase in C for all of them? That is, why is this
> > patch changing get_next_freq() and not map_util_freq().
> >
> > I don't think that question is answered in the Changelogs.
> >
> > Exactly because it does change the energy consumption (it must) should
> > that not also be reflected in the EAS logic?
>
> Right that shouldn't hurt and keep things consistent. That probably
> won't have a huge impact in practice (the boost should be != 0 only when
> the util signals haven't converged IIUC, which is a case where the EAS
> calculation is already 'wrong' anyway), but that still feels like the
> right thing to do.

It only boosts when 'rq->cfs.avg.util' increases while
'rq->cfs.avg.util_est.enqueued' remains unchanged (and util > util_est
obv).

This condition can be true for select_task_rq_fair(), because that is
ran before we do enqueue_task_fair() (for obvious raisins).

> > I'm still thinking about the exact means you're using to raise C; that
> > is, the 'util - util_est' as cost_margin. It hurts my brain still.
>
> +1 ...

cost_i = capacity_i / power_i ; for the i-th OPP

We then do: 'x = util - util_avg' and use that on the first
OPP >= min_freq:

cost(x) = cost_j + cost_j * x ; freq_j >= min_freq
= cost_j * (1 + x)

And then we find the highest OPP that has:

cost_k <= cost(x)

Now, 'P = C*V^2*f', which under 'V ~ f' turns into 'P ~ f^3'.

(this assumption is important because we don't have V_i, but know that
when f increases, V also increases and the linear relation is the
simplest)

This then gives us:

capacity_i = f_i / f_max
power_i ~ f_i ^ 3
cost_i = capacity_i / power_i
~ (f_i / f_max) / f_i^3
~ 1 / (f_max * f_i^2)

(your changelog already called if efficiency, but then went on calling
it cost; as per the above, you see that higher frequencies have lower
efficiency, as expected)

cost(x) then turns into something like:

cost(x) = cost_j * (1 + x)
~ (1 + x) / (f_max * f_j^2)

We then get the following equation (assuming inf. OPPs):

cost_k = cost(x)

1 / (f_max * f_k^2) = (1 + x) / (f_max * f_j^2)

From which we can solve f_k:

f_k = f_j / sqrt(1 + x) ; x = util - util_est

Which, given positive 'x' results in f_k < f_j, IOW. we're selecting a
lower frequency.

Given that 'cost' really is efficiency, and we've made the equations
about selecting a higher efficiency, that makes sense in so far that it
would always end up at the knee in the graph (our most efficient OPP).

Is this really what we're wanting to do?

2019-10-18 19:43:11

by Douglas RAILLARD

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware

Hi Peter,

On 10/17/19 10:50 AM, Peter Zijlstra wrote:
> On Mon, Oct 14, 2019 at 04:50:24PM +0100, Douglas Raillard wrote:
>
>> I posted some numbers based on a similar experiment on the v2 of that series that
>> are still applicable:
>>
>> TL;DR the rt-app negative slack is divided by 1.75 by this series, with an
>> increase of 3% in total energy consumption. There is a burst every 0.6s, and
>> the average power consumption increase is proportional to the average number
>> of bursts.
>>
>>
>> The workload is an rt-app task ramping up from 5% to 75% util in one big step,
>> pinned on a big core. The whole cycle is 0.6s long (0.3s at 5% followed by 0.3s at 75%).
>> This cycle is repeated 20 times and the average of boosting is taken.
>>
>> The test device is a Google Pixel 3 (Qcom Snapdragon 845) phone.
>> It has a lot more OPPs than a hikey 960, so gradations in boosting
>> are better reflected on frequency selection.
>>
>> avg slack (higher=better):
>> Average time between task sleep and its next periodic activation.
>> See rt-app doc: https://github.com/scheduler-tools/rt-app/blob/9a50d76f726d7c325c82ac8c7ed9ed70e1c97937/doc/tutorial.txt#L631
>>
>> avg negative slack (lower in absolute value=better):
>> Same as avg slack, but only taking into account negative values.
>> Negative slack means a task activation did not have enough time to complete before the next
>> periodic activation fired, which is what we want to avoid.
>>
>> boost energy overhead (lower=better):
>> Extra power consumption induced by ramp boost, assuming continuous OPP space (infinite number of OPP)
>> and single-CPU policies. In practice, fixed number of OPP decrease this value, and more CPU per policy increases it,
>> since boost(policy) = max(boost(cpu) foreach cpu of policy)).
>>
>> Without ramp boost:
>> +--------------------+--------------------+
>> |avg slack (us) |avg negative slack |
>> | |(us) |
>> +--------------------+--------------------+
>> |6598.72 |-10217.13 |
>> |6595.49 |-10200.13 |
>> |6613.72 |-10401.06 |
>> |6600.29 |-9860.872 |
>> |6605.53 |-10057.64 |
>> |6612.05 |-10267.50 |
>> |6599.01 |-9939.60 |
>> |6593.79 |-9445.633 |
>> |6613.56 |-10276.75 |
>> |6595.44 |-9751.770 |
>> +--------------------+--------------------+
>> |average |
>> +--------------------+--------------------+
>> |6602.76 |-10041.81 |
>> +--------------------+--------------------+
>>
>>
>> With ramp boost enabled:
>> +--------------------+--------------------+--------------------+
>> |boost energy |avg slack (us) |avg negative slack |
>> |overhead (%) | |(us) |
>> +--------------------+--------------------+--------------------+
>> |3.05 |7148.93 |-5664.26 |
>> |3.04 |7144.69 |-5667.77 |
>> |3.05 |7149.05 |-5698.31 |
>> |2.97 |7126.71 |-6040.23 |
>> |3.02 |7140.28 |-5826.78 |
>> |3.03 |7135.11 |-5749.62 |
>> |3.05 |7140.24 |-5750.0 |
>> |3.05 |7144.84 |-5667.04 |
>> |3.07 |7157.30 |-5656.65 |
>> |3.06 |7154.65 |-5653.76 |
>> +--------------------+--------------------+--------------------+
>> |average |
>> +--------------------+--------------------+--------------------+
>> |3.039000 |7144.18 |-5737.44 |
>> +--------------------+--------------------+--------------------+
>>
>>
>> The negative slack is due to missed activations while the utilization signals
>> increase during the big utilization step. Ramp boost is designed to boost frequency during
>> that phase, which materializes in 1.75 less negative slack, for an extra power
>> consumption under 3%.
>
> OK, so I think I see what it is doing, and why.
>
> Normally we use (map_util_freq):
>
> freq = C * max_freq * util / max ; C=1.25
>
> But here, when util is increasing, we effectively increase our C to
> allow picking a higher OPP. Because of that higher OPP we finish our
> work sooner (avg slack increases)

Indeed. The increase in avg slack is much smaller than the absolute decrease in negative slack,
which means we don't boost when that's not necessary
(i.e. on activations where the slack is already positive).

> and miss our activation less often
> (avg neg slack decreases).

We actually miss the activation as often, but by a lesser amount of time,
which is desirable. Preventing from missing activations altogether could
probably be achieved by very aggressive boosting, at which point it might
be better/easier to just assume util=1024 and let it decrease to its
correct value. None of that sounds very appealing in the general case though.

> Now, the thing is, we use map_util_freq() in more places, and should we
> not reflect this increase in C for all of them? That is, why is this
> patch changing get_next_freq() and not map_util_freq().
>
> I don't think that question is answered in the Changelogs.
>
> Exactly because it does change the energy consumption (it must) should
> that not also be reflected in the EAS logic?

map_util_freq() is only used in schedutil and in EAS by compute_energy(), so
I retarget the question at compute_energy(). It is supposed to compute
the energy consumed by a given CPU if a given task was migrated on it.
This implies some assumptions on util signals:
1) util(_est.enqueued) of the task is representative of the task's duty cycle
(the semantic of util is somehow blurry for aperiodic tasks AFAIK).
2) util of the task is CPU-invariant
3) task util + target CPU util = new target CPU util for the foreseeable future,
i.e. the amount of future we can predict with reasonable accuracy. Otherwise
we would end up moving things around all the time.

Since ramp boost is designed to be transient, integrating it (indirectly) in "target CPU util" will
add some noise to the placement decision, potentially rendering it obsolete as soon
as the boosting stops. Basing a costly decision on that does not sound like a good idea IMHO.

> I'm still thinking about the exact means you're using to raise C; that
> is, the 'util - util_est' as cost_margin. It hurts my brain still.

util_est is currently the best approximation of the actual portion of the CPU the task needs:
1) for periodic tasks, it's not too far from the duty cycle, and is always higher

2) for aperiodic tasks, it (indirectly) takes into account the total time it took
to complete the previous activation, so the signal is not 100% composed of logical signals
only relevant for periodic tasks (although it's a big part of it).

3) Point 1) and 2) together allows util_est to adapt to periodic tasks that changes
their duty cycle over time, without needing a very long history (the last task period
is sufficient).

For periodic tasks, the distance between instantaneous util_avg and the actual task
duty cycle indicates somehow what is our best guess of the (potential) change in the task
duty cycle.

util_est is the threshold (assuming util_avg increasing) for util_avg after which we know
for sure that even if the task stopped right now, its duty cycle would be higher than
during the previous period.
This means for a given task and with (util >= util_est):

1) util - util_est == 0 means the task duty cycle will be equal to the one during
during the previous activation, if the tasks stopped executing right now.

2) util - util_est > 0 means the task duty cycle will be higher to the one during
during the previous activation, if the tasks stopped executing right now.

Using the difference (util - util_est) will therefore give these properties to the boost signal:
* no boost will be applied as long as the task has a constant or decreasing duty cycle.

* when we can detect that the duty cycle increases, we temporarily increase the frequency.
We start with a slight increase, and the longer we wait for the current period to finish,
the more we boost, since the more likely it is that the task has a much larger duty cycle
than anticipated. More specifically, the evaluation of "how much more" is done the exact
same way as it is done for PELT, since the dynamic of the boost is "inherited" from PELT.

Now if the task is aperiodic, the boost will allow reaching the highest frequency faster,
which may or may not be desired. Ultimately, it's not more or less wrong than just picking
the freq based on util_est alone, since util_est is already somewhat meaningless for aperiodic
tasks. It just allows reaching the max freq at some point without waiting for too long, which is
all what we can do without more info on the task.

When applying these boosting rules on the runqueue util signals, we are able to detect if at least one
task needs boosting according to these rules. That only holds as long as the history we look at is
the result of a stable set of tasks, i.e. no tasks added or removed from the rq.


Regards,
Douglas

2019-10-18 20:28:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware

On Thu, Oct 17, 2019 at 03:23:04PM +0100, Douglas Raillard wrote:
> On 10/17/19 10:50 AM, Peter Zijlstra wrote:
> > Now, the thing is, we use map_util_freq() in more places, and should we
> > not reflect this increase in C for all of them? That is, why is this
> > patch changing get_next_freq() and not map_util_freq().
> >
> > I don't think that question is answered in the Changelogs.
> >
> > Exactly because it does change the energy consumption (it must) should
> > that not also be reflected in the EAS logic?
>
> map_util_freq() is only used in schedutil and in EAS by compute_energy(), so
> I retarget the question at compute_energy(). It is supposed to compute
> the energy consumed by a given CPU if a given task was migrated on it.
> This implies some assumptions on util signals:
> 1) util(_est.enqueued) of the task is representative of the task's
> duty cycle (the semantic of util is somehow blurry for aperiodic tasks
> AFAIK).
> 2) util of the task is CPU-invariant

( we know this to be false, but do indeed make this assumption because
simplicity, taking IPC differences into account would just complicate
things more )

> 3) task util + target CPU util = new target CPU util for the
> foreseeable future, i.e. the amount of future we can predict with
> reasonable accuracy. Otherwise we would end up moving things around
> all the time.
>
> Since ramp boost is designed to be transient, integrating it
> (indirectly) in "target CPU util" will add some noise to the placement
> decision, potentially rendering it obsolete as soon as the boosting
> stops. Basing a costly decision on that does not sound like a good
> idea IMHO.

Well, we _hope_ recent past is a reasonable predictor for the near
future. We of course know it'll be complete crap every so often, but
hope that on average it is true more than false :-)

Anyway, the above seems like a sensible enough argument, and seems
worthy of being part of the Changelog. Also maybe a comment in schedutil
as to why there is a map_util_freq() modifier there.

2019-10-18 22:15:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware

On Thu, Oct 17, 2019 at 03:23:04PM +0100, Douglas Raillard wrote:
> On 10/17/19 10:50 AM, Peter Zijlstra wrote:

> > I'm still thinking about the exact means you're using to raise C; that
> > is, the 'util - util_est' as cost_margin. It hurts my brain still.
>
> util_est is currently the best approximation of the actual portion of the CPU the task needs:
> 1) for periodic tasks, it's not too far from the duty cycle, and is always higher
>
> 2) for aperiodic tasks, it (indirectly) takes into account the total time it took
> to complete the previous activation, so the signal is not 100% composed of logical signals
> only relevant for periodic tasks (although it's a big part of it).
>
> 3) Point 1) and 2) together allows util_est to adapt to periodic tasks that changes
> their duty cycle over time, without needing a very long history (the last task period
> is sufficient).
>
> For periodic tasks, the distance between instantaneous util_avg and the actual task
> duty cycle indicates somehow what is our best guess of the (potential) change in the task
> duty cycle.
>
> util_est is the threshold (assuming util_avg increasing) for util_avg after which we know
> for sure that even if the task stopped right now, its duty cycle would be higher than
> during the previous period.
> This means for a given task and with (util >= util_est):
>
> 1) util - util_est == 0 means the task duty cycle will be equal to the one during
> during the previous activation, if the tasks stopped executing right now.
>
> 2) util - util_est > 0 means the task duty cycle will be higher to the one during
> during the previous activation, if the tasks stopped executing right now.

So far I can follow, 2) is indeed a fairly sane indication that
utilization is growing.

> Using the difference (util - util_est) will therefore give these properties to the boost signal:
> * no boost will be applied as long as the task has a constant or decreasing duty cycle.
>
> * when we can detect that the duty cycle increases, we temporarily increase the frequency.
> We start with a slight increase, and the longer we wait for the current period to finish,
> the more we boost, since the more likely it is that the task has a much larger duty cycle
> than anticipated. More specifically, the evaluation of "how much more" is done the exact
> same way as it is done for PELT, since the dynamic of the boost is "inherited" from PELT.

Right, because as long it keeps running, util_est will not be changed,
so the difference will continue to increase.

What I don't see is how that that difference makes sense as input to:

cost(x) : (1 + x) * cost_j

I suppose that limits the additional OPP to twice the previously
selected cost / efficiency (see the confusion from that other email).
But given that efficency drops (or costs rise) for higher OPPs that
still doesn't really make sense..

> Now if the task is aperiodic, the boost will allow reaching the highest frequency faster,
> which may or may not be desired. Ultimately, it's not more or less wrong than just picking
> the freq based on util_est alone, since util_est is already somewhat meaningless for aperiodic
> tasks. It just allows reaching the max freq at some point without waiting for too long, which is
> all what we can do without more info on the task.
>
> When applying these boosting rules on the runqueue util signals, we are able to detect if at least one
> task needs boosting according to these rules. That only holds as long as the history we look at is
> the result of a stable set of tasks, i.e. no tasks added or removed from the rq.

So while I agree that 2) is a reasonable signal to work from, everything
that comes after is still much confusing me.

2019-10-19 00:14:13

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware

On 17/10/2019 16:11, Peter Zijlstra wrote:
> On Thu, Oct 17, 2019 at 12:11:16PM +0100, Quentin Perret wrote:

[...]

> It only boosts when 'rq->cfs.avg.util' increases while
> 'rq->cfs.avg.util_est.enqueued' remains unchanged (and util > util_est
> obv).
>
> This condition can be true for select_task_rq_fair(), because that is
> ran before we do enqueue_task_fair() (for obvious raisins).
>
>>> I'm still thinking about the exact means you're using to raise C; that
>>> is, the 'util - util_est' as cost_margin. It hurts my brain still.
>>
>> +1 ...
>
> cost_i = capacity_i / power_i ; for the i-th OPP

I get confused by this definition. efficiency=capacity/power but the
cs->cost value used in em_pd_get_higher_freq() is defined as

cs_cost = cs->power * cpu_max_freq / cs->freq [energy_model.h]

> We then do: 'x = util - util_avg' and use that on the first
> OPP >= min_freq:
>
> cost(x) = cost_j + cost_j * x ; freq_j >= min_freq
> = cost_j * (1 + x)
>
> And then we find the highest OPP that has:
>
> cost_k <= cost(x)

[...]

2019-10-19 07:56:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware

On Fri, Oct 18, 2019 at 09:44:44AM +0200, Dietmar Eggemann wrote:
> On 17/10/2019 16:11, Peter Zijlstra wrote:
> > On Thu, Oct 17, 2019 at 12:11:16PM +0100, Quentin Perret wrote:
>
> [...]
>
> > It only boosts when 'rq->cfs.avg.util' increases while
> > 'rq->cfs.avg.util_est.enqueued' remains unchanged (and util > util_est
> > obv).
> >
> > This condition can be true for select_task_rq_fair(), because that is
> > ran before we do enqueue_task_fair() (for obvious raisins).
> >
> >>> I'm still thinking about the exact means you're using to raise C; that
> >>> is, the 'util - util_est' as cost_margin. It hurts my brain still.
> >>
> >> +1 ...
> >
> > cost_i = capacity_i / power_i ; for the i-th OPP
>
> I get confused by this definition. efficiency=capacity/power but the
> cs->cost value used in em_pd_get_higher_freq() is defined as
>
> cs_cost = cs->power * cpu_max_freq / cs->freq [energy_model.h]

Well, chalk that up to confusion inspired by the Changelog of patch 1.

Let me redo it with that formula then.

2019-10-19 07:59:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware

On Fri, Oct 18, 2019 at 09:44:44AM +0200, Dietmar Eggemann wrote:
> On 17/10/2019 16:11, Peter Zijlstra wrote:
> > On Thu, Oct 17, 2019 at 12:11:16PM +0100, Quentin Perret wrote:
>
> [...]
>
> > It only boosts when 'rq->cfs.avg.util' increases while
> > 'rq->cfs.avg.util_est.enqueued' remains unchanged (and util > util_est
> > obv).
> >
> > This condition can be true for select_task_rq_fair(), because that is
> > ran before we do enqueue_task_fair() (for obvious raisins).
> >
> >>> I'm still thinking about the exact means you're using to raise C; that
> >>> is, the 'util - util_est' as cost_margin. It hurts my brain still.
> >>
> >> +1 ...
> >
> > cost_i = capacity_i / power_i ; for the i-th OPP
>
> I get confused by this definition. efficiency=capacity/power but the
> cs->cost value used in em_pd_get_higher_freq() is defined as
>
> cs_cost = cs->power * cpu_max_freq / cs->freq [energy_model.h]

cost_i = power_i * f_max / f_i

cost(x) = cost_j * (1 + x) ; f_j >= min_freq

cost_k <= cost(x)

P = C*V^2*f, V ~ f -> P ~ f^3

cost_i ~ f_i^3 * f_max / f_i
= f_i^2 * f_max

cost(x) = (1 + x) * f_j^2 * f_max

cost_k = cost(x)

f_k^2 * f_max = (1 + x) * f_j^2 * f_max

f_k = sqrt(1 + x) * f_j

Which does indeed make more sense... However, I still struggle with
using our 'x = util - util_est' as input for an OPP specific increase.

2019-10-19 08:18:30

by Douglas RAILLARD

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware



On 10/17/19 8:07 PM, Peter Zijlstra wrote:
> On Thu, Oct 17, 2019 at 03:23:04PM +0100, Douglas Raillard wrote:
>> On 10/17/19 10:50 AM, Peter Zijlstra wrote:
>
>>> I'm still thinking about the exact means you're using to raise C; that
>>> is, the 'util - util_est' as cost_margin. It hurts my brain still.
>>
>> util_est is currently the best approximation of the actual portion of the CPU the task needs:
>> 1) for periodic tasks, it's not too far from the duty cycle, and is always higher
>>
>> 2) for aperiodic tasks, it (indirectly) takes into account the total time it took
>> to complete the previous activation, so the signal is not 100% composed of logical signals
>> only relevant for periodic tasks (although it's a big part of it).
>>
>> 3) Point 1) and 2) together allows util_est to adapt to periodic tasks that changes
>> their duty cycle over time, without needing a very long history (the last task period
>> is sufficient).
>>
>> For periodic tasks, the distance between instantaneous util_avg and the actual task
>> duty cycle indicates somehow what is our best guess of the (potential) change in the task
>> duty cycle.
>>
>> util_est is the threshold (assuming util_avg increasing) for util_avg after which we know
>> for sure that even if the task stopped right now, its duty cycle would be higher than
>> during the previous period.
>> This means for a given task and with (util >= util_est):
>>
>> 1) util - util_est == 0 means the task duty cycle will be equal to the one during
>> during the previous activation, if the tasks stopped executing right now.
>>
>> 2) util - util_est > 0 means the task duty cycle will be higher to the one during
>> during the previous activation, if the tasks stopped executing right now.
>
> So far I can follow, 2) is indeed a fairly sane indication that
> utilization is growing.
>
>> Using the difference (util - util_est) will therefore give these properties to the boost signal:
>> * no boost will be applied as long as the task has a constant or decreasing duty cycle.
>>
>> * when we can detect that the duty cycle increases, we temporarily increase the frequency.
>> We start with a slight increase, and the longer we wait for the current period to finish,
>> the more we boost, since the more likely it is that the task has a much larger duty cycle
>> than anticipated. More specifically, the evaluation of "how much more" is done the exact
>> same way as it is done for PELT, since the dynamic of the boost is "inherited" from PELT.
>
> Right, because as long it keeps running, util_est will not be changed,
> so the difference will continue to increase.
>
> What I don't see is how that that difference makes sense as input to:
>
> cost(x) : (1 + x) * cost_j

The actual input is:
x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est)

Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor
of 1 is not directly reflected in the code but is important for units
consistency.

Non-zero x means that we are going to allow spending more energy than
what schedutil currently thinks of being the minimal energy.
(it's actually not that minimal, since we have this 1.25 factor, plus
the fact that we use util_est.enqueued, which will always over-estimate
the actual util of the task).

>
> I suppose that limits the additional OPP to twice the previously
> selected cost / efficiency (see the confusion from that other email).
> But given that efficency drops (or costs rise) for higher OPPs that
> still doesn't really make sense..
Yes, this current limit to +100% freq boosting is somehow arbitrary and
could probably benefit from being tunable in some way (Kconfig option
maybe). When (margin > 0), we end up selecting an OPP that has a higher
cost than the one strictly required, which is expected. The goal is to
speed things up at the expense of more power consumed to achieve the
same work, hence at a lower efficiency (== higher cost).

That's the main reason why this boosting apply a margin on the cost of
the selected OPP rather than just inflating the util. This allows
controlling directly how much more power (battery life) we are going to
spend to achieve some work that we know could be achieved with less
power. This "how much more" is the margin. A policy for such boosting
must obviously be quite picky in when it decides to boost (not boosting
all the time). Decreasing the amount of negative slack is one situation
where spending a bit more power to ensure the work is done in time can
be more important than just efficiency, within reasonable limits. (the
eternal efficiency vs throughput vs latency trade-off).

>
>> Now if the task is aperiodic, the boost will allow reaching the highest frequency faster,
>> which may or may not be desired. Ultimately, it's not more or less wrong than just picking
>> the freq based on util_est alone, since util_est is already somewhat meaningless for aperiodic
>> tasks. It just allows reaching the max freq at some point without waiting for too long, which is
>> all what we can do without more info on the task.
>>
>> When applying these boosting rules on the runqueue util signals, we are able to detect if at least one
>> task needs boosting according to these rules. That only holds as long as the history we look at is
>> the result of a stable set of tasks, i.e. no tasks added or removed from the rq.
>
> So while I agree that 2) is a reasonable signal to work from, everything
> that comes after is still much confusing me.


"Now if the task is aperiodic ...": What I mean by that is that AFAIK,
there isn't really any fundamentally right or wrong way of choosing
frequencies if the tasks we are dealing with are not periodic, unless
you add more constraints to the problem. We currently base the decision
on an overestimation of some kind of average running time per second of
the task. The average being the EWMA implemented by PELT, not to be
confused with util_est.ewma that adds an extra EWMA on top.

What "window" of time used for that average (or EWMA half life in our
case) will change the value of the average for aperiodic tasks. The
choice of that half life is driven by how much of the task history we
want to take into account. That is driven by how often we expect tasks
to change their "execution profile" on average so to speak (a thread
pool picking disparate work items at random would change its profile
very often for example).

Once this window or half life is chosen, we can ensure that the CPU will
use a frequency high enough to avoid work piling up more than what can
be computed using a simple proportional controller. The goal of the
schedutil controller is to make sure that:
current CPU capa == current util * 1.25

All that to say that in the aperiodic case, some pieces of the setup are
directly provided by the policy (PELT half life), which are empirically
determined to perform well, without any way of computing an provably
optimal value (unless we know for sure exactly when a task is going to
change its workload and CPU share it will require). For periodic tasks,
we can compute the exact frequency that will lead to using 100% of the
CPU just by looking at the duty cycle of the tasks, and that's more or
less what schedutil does.


"When applying these boosting rules on the runqueue util signals ...":
Assuming the set of enqueued tasks stays the same between 2 observations
from schedutil, if we see the rq util_avg increase above its
util_est.enqueued, that means that at least one task had its util_avg go
above util_est.enqueued. We might miss some boosting opportunities if
some (util - util_est) compensates:
TASK_1(util - util_est) = - TASK_2(util - util_est)
but working on the aggregated value is much easier in schedutil, to
avoid crawling the list of entities.

2019-10-19 08:21:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware

On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote:

> > What I don't see is how that that difference makes sense as input to:
> >
> > cost(x) : (1 + x) * cost_j
>
> The actual input is:
> x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est)
>
> Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1
> is not directly reflected in the code but is important for units
> consistency.

But completely irrelevant for the actual math and conceptual
understanding. Just because computers suck at real numbers, and floats
are expensive, doesn't mean we have to burden ourselves with fixed point
when writing equations.

Also, as a physicist I'm prone to normalizing everything to 1, because
that's lazy.

> > I suppose that limits the additional OPP to twice the previously
> > selected cost / efficiency (see the confusion from that other email).
> > But given that efficency drops (or costs rise) for higher OPPs that
> > still doesn't really make sense..

> Yes, this current limit to +100% freq boosting is somehow arbitrary and
> could probably benefit from being tunable in some way (Kconfig option
> maybe). When (margin > 0), we end up selecting an OPP that has a higher cost
> than the one strictly required, which is expected. The goal is to speed
> things up at the expense of more power consumed to achieve the same work,
> hence at a lower efficiency (== higher cost).

No, no Kconfig knobs.

> That's the main reason why this boosting apply a margin on the cost of the
> selected OPP rather than just inflating the util. This allows controlling
> directly how much more power (battery life) we are going to spend to achieve
> some work that we know could be achieved with less power.

But you're not; the margin is relative to the OPP, it is not absolute.

Or rather, the only actual limit is in relation to the max OPP. So you
have very little actual control over how much more energy you're
spending.

> > So while I agree that 2) is a reasonable signal to work from, everything
> > that comes after is still much confusing me.

> "When applying these boosting rules on the runqueue util signals ...":
> Assuming the set of enqueued tasks stays the same between 2 observations
> from schedutil, if we see the rq util_avg increase above its
> util_est.enqueued, that means that at least one task had its util_avg go
> above util_est.enqueued. We might miss some boosting opportunities if some
> (util - util_est) compensates:
> TASK_1(util - util_est) = - TASK_2(util - util_est)
> but working on the aggregated value is much easier in schedutil, to avoid
> crawling the list of entities.

That still does not explain why 'util - util_est', when >0, makes for a
sensible input into an OPP relative function.

I agree that 'util - util_est', when >0, indicates utilization is
increasing (for the aperiodic blah blah blah). But after that I'm still
confused.

2019-10-19 08:32:11

by Douglas RAILLARD

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware



On 10/18/19 1:07 PM, Peter Zijlstra wrote:
> On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote:
>
>>> What I don't see is how that that difference makes sense as input to:
>>>
>>> cost(x) : (1 + x) * cost_j
>>
>> The actual input is:
>> x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est)
>>
>> Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1
>> is not directly reflected in the code but is important for units
>> consistency.
>
> But completely irrelevant for the actual math and conceptual
> understanding.

> how that that difference makes sense as input to
I was unsure if you referred to the units being inconsistent or the
actual way of computing values being strange, so I provided some
justification for both.

> Just because computers suck at real numbers, and floats
> are expensive, doesn't mean we have to burden ourselves with fixed point
> when writing equations.
>
> Also, as a physicist I'm prone to normalizing everything to 1, because
> that's lazy.
>
>>> I suppose that limits the additional OPP to twice the previously
>>> selected cost / efficiency (see the confusion from that other email).
>>> But given that efficency drops (or costs rise) for higher OPPs that
>>> still doesn't really make sense..
>
>> Yes, this current limit to +100% freq boosting is somehow arbitrary and
>> could probably benefit from being tunable in some way (Kconfig option
>> maybe). When (margin > 0), we end up selecting an OPP that has a higher cost
>> than the one strictly required, which is expected. The goal is to speed
>> things up at the expense of more power consumed to achieve the same work,
>> hence at a lower efficiency (== higher cost).
>
> No, no Kconfig knobs.
>
>> That's the main reason why this boosting apply a margin on the cost of the
>> selected OPP rather than just inflating the util. This allows controlling
>> directly how much more power (battery life) we are going to spend to achieve
>> some work that we know could be achieved with less power.
>
> But you're not; the margin is relative to the OPP, it is not absolute.

Considering a CPU with 1024 max capacity (since we are not talking about
migrations here, we can ignore CPU invariance):

work = normalized number of iterations of a given busy loop
# Thanks to freq invariance
work = util (between 0 and 1)
util = f/f_max

# f(work) is the min freq that is admissible for "work", which we will
# abbreviate as "f"
f(work) = work * f_max

# from struct em_cap_state doc in energy_model.h
cost(f) = power(f) * f_max / f
cost(f) = power(f) / util
cost(f) = power(f) / work
power(f) = cost(f) * work

boosted_cost(f) = cost(f) + x
boosted_power(f) = boosted_cost(f) * work
boosted_power(f) = (cost(f) + x) * work

# Let's normalize cost() so we can forget about f and deal only with work.
cost'(work) = cost(f)/cost(f_max)
x' = x/cost(f_max)
boosted_power'(work) = (cost'(work) + x') * work
boosted_power'(work) = cost'(work) * work + x' * work
boosted_power'(work) = power'(work) + x' * work
boosted_power'(work) = power'(work) + A(work)

# Over a duration T, spend an extra B unit of energy
B(work) = A(work) * T
lost_battery_percent(work) = 100 * B(work)/total_battery_energy
lost_battery_percent(work) = 100 * T * x' * work /total_battery_energy
lost_battery_percent(work) =
(100 * T / cost(f_max) / total_battery_energy) * x * work

This means that the effect of boosting on battery life is proportional
to "x" unless I made a mistake somewhere.

>
> Or rather, the only actual limit is in relation to the max OPP. So you
> have very little actual control over how much more energy you're
> spending.
>
>>> So while I agree that 2) is a reasonable signal to work from, everything
>>> that comes after is still much confusing me.
>
>> "When applying these boosting rules on the runqueue util signals ...":
>> Assuming the set of enqueued tasks stays the same between 2 observations
>> from schedutil, if we see the rq util_avg increase above its
>> util_est.enqueued, that means that at least one task had its util_avg go
>> above util_est.enqueued. We might miss some boosting opportunities if some
>> (util - util_est) compensates:
>> TASK_1(util - util_est) = - TASK_2(util - util_est)
>> but working on the aggregated value is much easier in schedutil, to avoid
>> crawling the list of entities.
>
> That still does not explain why 'util - util_est', when >0, makes for a
> sensible input into an OPP relative function > I agree that 'util - util_est', when >0, indicates utilization is
> increasing (for the aperiodic blah blah blah). But after that I'm still
> confused.

For the same reason PELT makes a sensible input for OPP selection.
Currently, OPP selection is based on max(util_avg, util_est.enqueued)
(from cpu_util_cfs in sched.h), so as soon as we have
(util - util_est > 0), the OPP will be selected according to util_avg.
In a way, using util_avg there is already some kind of boosting.

Since the boosting is essentially (util - constant), it grows the same
way as util. If we think of (util - util_est) as being some estimation
of how wrong we were in the estimation of the task "true" utilization of
the CPU, then it makes sense to feed that to the boost. The wronger we
were, the more we want to boost, because the more time passes, the more
the scheduler realizes it actually does not know what the task needs. In
doubt, provide a higher freq than usual until we get to know this task
better. When that happens (at the next period), boosting is disabled and
we revert to the usual behavior (aka margin=0).

Hope we are converging to some wording that makes sense.

2019-10-19 08:32:24

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware

On Fri, 18 Oct 2019 at 16:44, Douglas Raillard <[email protected]> wrote:
>
>
>
> On 10/18/19 1:07 PM, Peter Zijlstra wrote:
> > On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote:
> >
> >>> What I don't see is how that that difference makes sense as input to:
> >>>
> >>> cost(x) : (1 + x) * cost_j
> >>
> >> The actual input is:
> >> x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est)
> >>
> >> Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1
> >> is not directly reflected in the code but is important for units
> >> consistency.
> >
> > But completely irrelevant for the actual math and conceptual
> > understanding.
>
> > how that that difference makes sense as input to
> I was unsure if you referred to the units being inconsistent or the
> actual way of computing values being strange, so I provided some
> justification for both.
>
> > Just because computers suck at real numbers, and floats
> > are expensive, doesn't mean we have to burden ourselves with fixed point
> > when writing equations.
> >
> > Also, as a physicist I'm prone to normalizing everything to 1, because
> > that's lazy.
> >
> >>> I suppose that limits the additional OPP to twice the previously
> >>> selected cost / efficiency (see the confusion from that other email).
> >>> But given that efficency drops (or costs rise) for higher OPPs that
> >>> still doesn't really make sense..
> >
> >> Yes, this current limit to +100% freq boosting is somehow arbitrary and
> >> could probably benefit from being tunable in some way (Kconfig option
> >> maybe). When (margin > 0), we end up selecting an OPP that has a higher cost
> >> than the one strictly required, which is expected. The goal is to speed
> >> things up at the expense of more power consumed to achieve the same work,
> >> hence at a lower efficiency (== higher cost).
> >
> > No, no Kconfig knobs.
> >
> >> That's the main reason why this boosting apply a margin on the cost of the
> >> selected OPP rather than just inflating the util. This allows controlling
> >> directly how much more power (battery life) we are going to spend to achieve
> >> some work that we know could be achieved with less power.
> >
> > But you're not; the margin is relative to the OPP, it is not absolute.
>
> Considering a CPU with 1024 max capacity (since we are not talking about
> migrations here, we can ignore CPU invariance):
>
> work = normalized number of iterations of a given busy loop
> # Thanks to freq invariance
> work = util (between 0 and 1)
> util = f/f_max
>
> # f(work) is the min freq that is admissible for "work", which we will
> # abbreviate as "f"
> f(work) = work * f_max
>
> # from struct em_cap_state doc in energy_model.h
> cost(f) = power(f) * f_max / f
> cost(f) = power(f) / util
> cost(f) = power(f) / work
> power(f) = cost(f) * work
>
> boosted_cost(f) = cost(f) + x
> boosted_power(f) = boosted_cost(f) * work
> boosted_power(f) = (cost(f) + x) * work
>
> # Let's normalize cost() so we can forget about f and deal only with work.
> cost'(work) = cost(f)/cost(f_max)
> x' = x/cost(f_max)
> boosted_power'(work) = (cost'(work) + x') * work
> boosted_power'(work) = cost'(work) * work + x' * work
> boosted_power'(work) = power'(work) + x' * work
> boosted_power'(work) = power'(work) + A(work)
>
> # Over a duration T, spend an extra B unit of energy
> B(work) = A(work) * T
> lost_battery_percent(work) = 100 * B(work)/total_battery_energy
> lost_battery_percent(work) = 100 * T * x' * work /total_battery_energy
> lost_battery_percent(work) =
> (100 * T / cost(f_max) / total_battery_energy) * x * work
>
> This means that the effect of boosting on battery life is proportional
> to "x" unless I made a mistake somewhere.

Because the boost is relative to cost(f) and cost is not linear to the
frequency, I don't think that it's is a linear relation.

>
> >
> > Or rather, the only actual limit is in relation to the max OPP. So you
> > have very little actual control over how much more energy you're
> > spending.
> >
> >>> So while I agree that 2) is a reasonable signal to work from, everything
> >>> that comes after is still much confusing me.
> >
> >> "When applying these boosting rules on the runqueue util signals ...":
> >> Assuming the set of enqueued tasks stays the same between 2 observations
> >> from schedutil, if we see the rq util_avg increase above its
> >> util_est.enqueued, that means that at least one task had its util_avg go
> >> above util_est.enqueued. We might miss some boosting opportunities if some
> >> (util - util_est) compensates:
> >> TASK_1(util - util_est) = - TASK_2(util - util_est)
> >> but working on the aggregated value is much easier in schedutil, to avoid
> >> crawling the list of entities.
> >
> > That still does not explain why 'util - util_est', when >0, makes for a
> > sensible input into an OPP relative function > I agree that 'util - util_est', when >0, indicates utilization is
> > increasing (for the aperiodic blah blah blah). But after that I'm still
> > confused.
>
> For the same reason PELT makes a sensible input for OPP selection.
> Currently, OPP selection is based on max(util_avg, util_est.enqueued)
> (from cpu_util_cfs in sched.h), so as soon as we have
> (util - util_est > 0), the OPP will be selected according to util_avg.
> In a way, using util_avg there is already some kind of boosting.
>
> Since the boosting is essentially (util - constant), it grows the same
> way as util. If we think of (util - util_est) as being some estimation
> of how wrong we were in the estimation of the task "true" utilization of
> the CPU, then it makes sense to feed that to the boost. The wronger we
> were, the more we want to boost, because the more time passes, the more
> the scheduler realizes it actually does not know what the task needs. In
> doubt, provide a higher freq than usual until we get to know this task
> better. When that happens (at the next period), boosting is disabled and
> we revert to the usual behavior (aka margin=0).
>
> Hope we are converging to some wording that makes sense.

2019-10-19 08:34:08

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware

On Fri, 18 Oct 2019 at 16:44, Douglas Raillard <[email protected]> wrote:
>
>
>
> On 10/18/19 1:07 PM, Peter Zijlstra wrote:
> > On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote:
> >
> >>> What I don't see is how that that difference makes sense as input to:
> >>>
> >>> cost(x) : (1 + x) * cost_j
> >>
> >> The actual input is:
> >> x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est)
> >>
> >> Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1
> >> is not directly reflected in the code but is important for units
> >> consistency.
> >
> > But completely irrelevant for the actual math and conceptual
> > understanding.
>
> > how that that difference makes sense as input to
> I was unsure if you referred to the units being inconsistent or the
> actual way of computing values being strange, so I provided some
> justification for both.
>
> > Just because computers suck at real numbers, and floats
> > are expensive, doesn't mean we have to burden ourselves with fixed point
> > when writing equations.
> >
> > Also, as a physicist I'm prone to normalizing everything to 1, because
> > that's lazy.
> >
> >>> I suppose that limits the additional OPP to twice the previously
> >>> selected cost / efficiency (see the confusion from that other email).
> >>> But given that efficency drops (or costs rise) for higher OPPs that
> >>> still doesn't really make sense..
> >
> >> Yes, this current limit to +100% freq boosting is somehow arbitrary and
> >> could probably benefit from being tunable in some way (Kconfig option
> >> maybe). When (margin > 0), we end up selecting an OPP that has a higher cost
> >> than the one strictly required, which is expected. The goal is to speed
> >> things up at the expense of more power consumed to achieve the same work,
> >> hence at a lower efficiency (== higher cost).
> >
> > No, no Kconfig knobs.
> >
> >> That's the main reason why this boosting apply a margin on the cost of the
> >> selected OPP rather than just inflating the util. This allows controlling
> >> directly how much more power (battery life) we are going to spend to achieve
> >> some work that we know could be achieved with less power.
> >
> > But you're not; the margin is relative to the OPP, it is not absolute.
>
> Considering a CPU with 1024 max capacity (since we are not talking about
> migrations here, we can ignore CPU invariance):
>
> work = normalized number of iterations of a given busy loop
> # Thanks to freq invariance
> work = util (between 0 and 1)
> util = f/f_max
>
> # f(work) is the min freq that is admissible for "work", which we will
> # abbreviate as "f"
> f(work) = work * f_max
>
> # from struct em_cap_state doc in energy_model.h
> cost(f) = power(f) * f_max / f
> cost(f) = power(f) / util
> cost(f) = power(f) / work
> power(f) = cost(f) * work
>
> boosted_cost(f) = cost(f) + x

In em_pd_get_higher_freq, the boost is a % of cost(f) so it should be
boosted_cost(f)=cost(f)1+ cost(f)*x

> boosted_power(f) = boosted_cost(f) * work
> boosted_power(f) = (cost(f) + x) * work
>
> # Let's normalize cost() so we can forget about f and deal only with work.
> cost'(work) = cost(f)/cost(f_max)
> x' = x/cost(f_max)
> boosted_power'(work) = (cost'(work) + x') * work
> boosted_power'(work) = cost'(work) * work + x' * work
> boosted_power'(work) = power'(work) + x' * work
> boosted_power'(work) = power'(work) + A(work)
>
> # Over a duration T, spend an extra B unit of energy
> B(work) = A(work) * T
> lost_battery_percent(work) = 100 * B(work)/total_battery_energy
> lost_battery_percent(work) = 100 * T * x' * work /total_battery_energy
> lost_battery_percent(work) =
> (100 * T / cost(f_max) / total_battery_energy) * x * work
>
> This means that the effect of boosting on battery life is proportional
> to "x" unless I made a mistake somewhere.
>
> >
> > Or rather, the only actual limit is in relation to the max OPP. So you
> > have very little actual control over how much more energy you're
> > spending.
> >
> >>> So while I agree that 2) is a reasonable signal to work from, everything
> >>> that comes after is still much confusing me.
> >
> >> "When applying these boosting rules on the runqueue util signals ...":
> >> Assuming the set of enqueued tasks stays the same between 2 observations
> >> from schedutil, if we see the rq util_avg increase above its
> >> util_est.enqueued, that means that at least one task had its util_avg go
> >> above util_est.enqueued. We might miss some boosting opportunities if some
> >> (util - util_est) compensates:
> >> TASK_1(util - util_est) = - TASK_2(util - util_est)
> >> but working on the aggregated value is much easier in schedutil, to avoid
> >> crawling the list of entities.
> >
> > That still does not explain why 'util - util_est', when >0, makes for a
> > sensible input into an OPP relative function > I agree that 'util - util_est', when >0, indicates utilization is
> > increasing (for the aperiodic blah blah blah). But after that I'm still
> > confused.
>
> For the same reason PELT makes a sensible input for OPP selection.
> Currently, OPP selection is based on max(util_avg, util_est.enqueued)
> (from cpu_util_cfs in sched.h), so as soon as we have
> (util - util_est > 0), the OPP will be selected according to util_avg.
> In a way, using util_avg there is already some kind of boosting.
>
> Since the boosting is essentially (util - constant), it grows the same
> way as util. If we think of (util - util_est) as being some estimation
> of how wrong we were in the estimation of the task "true" utilization of
> the CPU, then it makes sense to feed that to the boost. The wronger we
> were, the more we want to boost, because the more time passes, the more
> the scheduler realizes it actually does not know what the task needs. In
> doubt, provide a higher freq than usual until we get to know this task
> better. When that happens (at the next period), boosting is disabled and
> we revert to the usual behavior (aka margin=0).
>
> Hope we are converging to some wording that makes sense.

2019-10-19 08:39:01

by Douglas RAILLARD

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware



On 10/18/19 4:15 PM, Vincent Guittot wrote:
> On Fri, 18 Oct 2019 at 16:44, Douglas Raillard <[email protected]> wrote:
>>
>>
>>
>> On 10/18/19 1:07 PM, Peter Zijlstra wrote:
>>> On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote:
>>>
>>>>> What I don't see is how that that difference makes sense as input to:
>>>>>
>>>>> cost(x) : (1 + x) * cost_j
>>>>
>>>> The actual input is:
>>>> x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est)
>>>>
>>>> Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1
>>>> is not directly reflected in the code but is important for units
>>>> consistency.
>>>
>>> But completely irrelevant for the actual math and conceptual
>>> understanding.
>>
>> > how that that difference makes sense as input to
>> I was unsure if you referred to the units being inconsistent or the
>> actual way of computing values being strange, so I provided some
>> justification for both.
>>
>>> Just because computers suck at real numbers, and floats
>>> are expensive, doesn't mean we have to burden ourselves with fixed point
>>> when writing equations.
>>>
>>> Also, as a physicist I'm prone to normalizing everything to 1, because
>>> that's lazy.
>>>
>>>>> I suppose that limits the additional OPP to twice the previously
>>>>> selected cost / efficiency (see the confusion from that other email).
>>>>> But given that efficency drops (or costs rise) for higher OPPs that
>>>>> still doesn't really make sense..
>>>
>>>> Yes, this current limit to +100% freq boosting is somehow arbitrary and
>>>> could probably benefit from being tunable in some way (Kconfig option
>>>> maybe). When (margin > 0), we end up selecting an OPP that has a higher cost
>>>> than the one strictly required, which is expected. The goal is to speed
>>>> things up at the expense of more power consumed to achieve the same work,
>>>> hence at a lower efficiency (== higher cost).
>>>
>>> No, no Kconfig knobs.
>>>
>>>> That's the main reason why this boosting apply a margin on the cost of the
>>>> selected OPP rather than just inflating the util. This allows controlling
>>>> directly how much more power (battery life) we are going to spend to achieve
>>>> some work that we know could be achieved with less power.
>>>
>>> But you're not; the margin is relative to the OPP, it is not absolute.
>>
>> Considering a CPU with 1024 max capacity (since we are not talking about
>> migrations here, we can ignore CPU invariance):
>>
>> work = normalized number of iterations of a given busy loop
>> # Thanks to freq invariance
>> work = util (between 0 and 1)
>> util = f/f_max
>>
>> # f(work) is the min freq that is admissible for "work", which we will
>> # abbreviate as "f"
>> f(work) = work * f_max
>>
>> # from struct em_cap_state doc in energy_model.h
>> cost(f) = power(f) * f_max / f
>> cost(f) = power(f) / util
>> cost(f) = power(f) / work
>> power(f) = cost(f) * work
>>
>> boosted_cost(f) = cost(f) + x
>
> In em_pd_get_higher_freq, the boost is a % of cost(f) so it should be
> boosted_cost(f)=cost(f)1+ cost(f)*x

Good point, this means that we need to change "x" in these equations:
x = cost(f) * margin

Which leads to:
lost_battery_percent(work) =
(100 * T / cost(f_max) / total_battery_energy) * cost'(work) * margin * work

lost_battery_percent(work) is still proportional to something that can
easily be traced and averaged (cost'(work,t) * margin(work,t)). At the
end of the day, since the impact depends on whether the workload will
make the condition to trigger, tracing is necessary to see how it
performs.

Other than that, I agree that the thing becomes simpler if
em_pd_get_higher_freq() takes an absolute margin (as a per-1024 of max
cost) rather than something proportional to cost(f). I'll make the
change for v4.

>> boosted_power(f) = boosted_cost(f) * work
>> boosted_power(f) = (cost(f) + x) * work
>>
>> # Let's normalize cost() so we can forget about f and deal only with work.
>> cost'(work) = cost(f)/cost(f_max)
>> x' = x/cost(f_max)
>> boosted_power'(work) = (cost'(work) + x') * work
>> boosted_power'(work) = cost'(work) * work + x' * work
>> boosted_power'(work) = power'(work) + x' * work
>> boosted_power'(work) = power'(work) + A(work)
>>
>> # Over a duration T, spend an extra B unit of energy
>> B(work) = A(work) * T
>> lost_battery_percent(work) = 100 * B(work)/total_battery_energy
>> lost_battery_percent(work) = 100 * T * x' * work /total_battery_energy
>> lost_battery_percent(work) =
>> (100 * T / cost(f_max) / total_battery_energy) * x * work
>>
>> This means that the effect of boosting on battery life is proportional
>> to "x" unless I made a mistake somewhere.
>>
>>>
>>> Or rather, the only actual limit is in relation to the max OPP. So you
>>> have very little actual control over how much more energy you're
>>> spending.
>>>
>>>>> So while I agree that 2) is a reasonable signal to work from, everything
>>>>> that comes after is still much confusing me.
>>>
>>>> "When applying these boosting rules on the runqueue util signals ...":
>>>> Assuming the set of enqueued tasks stays the same between 2 observations
>>>> from schedutil, if we see the rq util_avg increase above its
>>>> util_est.enqueued, that means that at least one task had its util_avg go
>>>> above util_est.enqueued. We might miss some boosting opportunities if some
>>>> (util - util_est) compensates:
>>>> TASK_1(util - util_est) = - TASK_2(util - util_est)
>>>> but working on the aggregated value is much easier in schedutil, to avoid
>>>> crawling the list of entities.
>>>
>>> That still does not explain why 'util - util_est', when >0, makes for a
>>> sensible input into an OPP relative function > I agree that 'util - util_est', when >0, indicates utilization is
>>> increasing (for the aperiodic blah blah blah). But after that I'm still
>>> confused.
>>
>> For the same reason PELT makes a sensible input for OPP selection.
>> Currently, OPP selection is based on max(util_avg, util_est.enqueued)
>> (from cpu_util_cfs in sched.h), so as soon as we have
>> (util - util_est > 0), the OPP will be selected according to util_avg.
>> In a way, using util_avg there is already some kind of boosting.
>>
>> Since the boosting is essentially (util - constant), it grows the same
>> way as util. If we think of (util - util_est) as being some estimation
>> of how wrong we were in the estimation of the task "true" utilization of
>> the CPU, then it makes sense to feed that to the boost. The wronger we
>> were, the more we want to boost, because the more time passes, the more
>> the scheduler realizes it actually does not know what the task needs. In
>> doubt, provide a higher freq than usual until we get to know this task
>> better. When that happens (at the next period), boosting is disabled and
>> we revert to the usual behavior (aka margin=0).
>>
>> Hope we are converging to some wording that makes sense.

2019-10-19 08:53:56

by Douglas RAILLARD

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/6] sched/cpufreq: Make schedutil energy aware



On 10/18/19 8:59 AM, Peter Zijlstra wrote:
> On Fri, Oct 18, 2019 at 09:44:44AM +0200, Dietmar Eggemann wrote:
>> On 17/10/2019 16:11, Peter Zijlstra wrote:
>>> On Thu, Oct 17, 2019 at 12:11:16PM +0100, Quentin Perret wrote:
>>
>> [...]
>>
>>> It only boosts when 'rq->cfs.avg.util' increases while
>>> 'rq->cfs.avg.util_est.enqueued' remains unchanged (and util > util_est
>>> obv).
>>>
>>> This condition can be true for select_task_rq_fair(), because that is
>>> ran before we do enqueue_task_fair() (for obvious raisins).
>>>
>>>>> I'm still thinking about the exact means you're using to raise C; that
>>>>> is, the 'util - util_est' as cost_margin. It hurts my brain still.
>>>>
>>>> +1 ...
>>>
>>> cost_i = capacity_i / power_i ; for the i-th OPP
>>
>> I get confused by this definition. efficiency=capacity/power but the
>> cs->cost value used in em_pd_get_higher_freq() is defined as
>>
>> cs_cost = cs->power * cpu_max_freq / cs->freq [energy_model.h]
>
> Well, chalk that up to confusion inspired by the Changelog of patch 1.

I've updated the commit message to include that ordering OPPs by
increasing efficiency=capa/power on one CPU leads to the same ordering
as ordering by decreasing cost=power*f_max/f.

efficiency=(cpu_max_capa/1024) * (f/f_max) / power
efficiency=(cpu_max_capa/1024) / cost


> Let me redo it with that formula then.
>