2021-12-09 16:12:34

by Vincent Donnefort

[permalink] [raw]
Subject: [PATCH 0/4] feec() energy margin removal

find_energy_efficient_cpu() (feec()) will migrate a task to save energy only
if it saves at least 6% of the total energy consumed by the system. This
conservative approach is a problem on a system where a lot of small tasks
create a huge load overall: very few of them will be allowed to migrate to a
smaller CPU, wasting a lot of energy. Instead of trying to determine yet
another margin, let's try to remove it.
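
For reference, the check as it stands today looks like this (it is the
condition removed by the last patch of this series); the >> 4 shift is where
the ~6% (1/16) threshold comes from:

	if ((prev_delta == ULONG_MAX) ||
	    (prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
		target = best_energy_cpu;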

The first elements of this patch-set are various fixes and improvements that
stabilize task_util and ensure energy comparison fairness across all CPUs of
the topology. Only once those are fixed can we completely remove the margin
and let feec() aggressively place tasks and save energy.

This has been validated in two different ways:

First, using LISA's eas_behaviour test suite. This is composed of a set of
scenarios that verify whether the task placement is optimal. No failures have
been observed and it also improved some tests such as Ramp-Down (as the
placement is now more energy oriented) and *ThreeSmall (as no bouncing between
clusters happens anymore).

* Hikey960: 100% PASSED
* DB-845C: 100% PASSED
* RB5: 100% PASSED

Second, using an Android benchmark: PCMark2 on a Pixel4, with a lot of
backports to have a scheduler as close as possible to mainline.

+------------+------------------+------------------+
| Test       | Perf             | Energy [1]       |
+------------+------------------+------------------+
| Web2       | -0.3% pval 0.03  | -1.8% pval 0.00  |
| Video2     | -0.3% pval 0.13  | -5.6% pval 0.00  |
| Photo2 [2] | -3.8% pval 0.00  | -1% pval 0.00    |
| Writing2   | 0% pval 0.13     | -1% pval 0.00    |
| Data2      | 0% pval 0.8      | -0.43% pval 0.00 |
+------------+------------------+------------------+

The margin removal lets the kernel make the best use of the Energy Model:
tasks are more likely to be placed where they fit and this saves a substantial
amount of energy, while having a limited impact on performance.

[1] This is an energy estimation based on the CPU activity and the Energy Model
for this device. "All models are wrong but some are useful"; yes, this is an
imperfect estimation that doesn't take into account some idle states and shared
power rails. Nonetheless it is based on the information the kernel has during
runtime and it proves the scheduler can make better decisions based solely on
that data.

[2] This is the only performance impact observed. Debugging this test showed
no issue with task placement. The better score was solely due to some critical
threads being held on better performing CPUs. If a thread needs a higher
capacity CPU, the placement must result from a user input (with e.g. uclamp
min) instead of the thread being artificially held on less efficient CPUs by
feec(). Notice also that the experiment didn't use the Android-only
latency_sensitive feature, which would hide this problem on a real-life device.


Vincent Donnefort (4):
sched/fair: Provide u64 read for 32-bits arch helper
sched/fair: Decay task PELT values during migration
sched/fair: Remove task_util from effective utilization in feec()
sched/fair: Remove the energy margin in feec()

kernel/sched/core.c | 7 ++
kernel/sched/fair.c | 263 ++++++++++++++++++++++---------------------
kernel/sched/sched.h | 46 +++++++-
3 files changed, 189 insertions(+), 127 deletions(-)

--
2.25.1



2021-12-09 16:12:36

by Vincent Donnefort

[permalink] [raw]
Subject: [PATCH 1/4] sched/fair: Provide u64 read for 32-bits arch helper

Introduce the macro helpers u64_u32_{store,load}() to factorize lockless
accesses to u64 variables on 32-bit architectures.

The users are, for now, cfs_rq.min_vruntime and sched_avg.last_update_time. To
accommodate the latter, where the copy lies outside of the structure
(cfs_rq.last_update_time_copy instead of sched_avg.last_update_time_copy),
use the _copy() version of those helpers.

Those new helpers encapsulate the smp_rmb() and smp_wmb() synchronization and
therefore have a small penalty in set_task_rq_fair() and init_cfs_rq().
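
For illustration, the intended usage pattern (mirroring the hunks below):

	/* Writer: store the value, then publish its copy (smp_wmb() in between) */
	u64_u32_store(cfs_rq->min_vruntime,
		      max_vruntime(cfs_rq->min_vruntime, vruntime));

	/* Reader: re-read until the value and its copy match (smp_rmb() in between) */
	se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);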

Signed-off-by: Vincent Donnefort <[email protected]>

---
This is a follow-up for:

https://lkml.kernel.org/r/[email protected]
---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da85b2938bff..3e6182dabc99 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -568,11 +568,8 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
}

/* ensure we never gain time by being placed backwards. */
- cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
-#ifndef CONFIG_64BIT
- smp_wmb();
- cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
-#endif
+ u64_u32_store(cfs_rq->min_vruntime,
+ max_vruntime(cfs_rq->min_vruntime, vruntime));
}

static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
@@ -3247,6 +3244,11 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
}

#ifdef CONFIG_SMP
+static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
+{
+ return u64_u32_load_copy(cfs_rq->avg.last_update_time,
+ cfs_rq->last_update_time_copy);
+}
#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* Because list_add_leaf_cfs_rq always places a child cfs_rq on the list
@@ -3357,27 +3359,9 @@ void set_task_rq_fair(struct sched_entity *se,
if (!(se->avg.last_update_time && prev))
return;

-#ifndef CONFIG_64BIT
- {
- u64 p_last_update_time_copy;
- u64 n_last_update_time_copy;
-
- do {
- p_last_update_time_copy = prev->load_last_update_time_copy;
- n_last_update_time_copy = next->load_last_update_time_copy;
-
- smp_rmb();
+ p_last_update_time = cfs_rq_last_update_time(prev);
+ n_last_update_time = cfs_rq_last_update_time(next);

- p_last_update_time = prev->avg.last_update_time;
- n_last_update_time = next->avg.last_update_time;
-
- } while (p_last_update_time != p_last_update_time_copy ||
- n_last_update_time != n_last_update_time_copy);
- }
-#else
- p_last_update_time = prev->avg.last_update_time;
- n_last_update_time = next->avg.last_update_time;
-#endif
__update_load_avg_blocked_se(p_last_update_time, se);
se->avg.last_update_time = n_last_update_time;
}
@@ -3701,8 +3685,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
decayed |= __update_load_avg_cfs_rq(now, cfs_rq);

#ifndef CONFIG_64BIT
- smp_wmb();
- cfs_rq->load_last_update_time_copy = sa->last_update_time;
+ u64_u32_store_copy(sa->last_update_time,
+ cfs_rq->last_update_time_copy,
+ sa->last_update_time);
#endif

return decayed;
@@ -3835,27 +3820,6 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
}
}

-#ifndef CONFIG_64BIT
-static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
-{
- u64 last_update_time_copy;
- u64 last_update_time;
-
- do {
- last_update_time_copy = cfs_rq->load_last_update_time_copy;
- smp_rmb();
- last_update_time = cfs_rq->avg.last_update_time;
- } while (last_update_time != last_update_time_copy);
-
- return last_update_time;
-}
-#else
-static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
-{
- return cfs_rq->avg.last_update_time;
-}
-#endif
-
/*
* Synchronize entity load avg of dequeued entity without locking
* the previous rq.
@@ -6951,21 +6915,8 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
if (READ_ONCE(p->__state) == TASK_WAKING) {
struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);
- u64 min_vruntime;

-#ifndef CONFIG_64BIT
- u64 min_vruntime_copy;
-
- do {
- min_vruntime_copy = cfs_rq->min_vruntime_copy;
- smp_rmb();
- min_vruntime = cfs_rq->min_vruntime;
- } while (min_vruntime != min_vruntime_copy);
-#else
- min_vruntime = cfs_rq->min_vruntime;
-#endif
-
- se->vruntime -= min_vruntime;
+ se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
}

if (p->on_rq == TASK_ON_RQ_MIGRATING) {
@@ -11388,10 +11339,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
void init_cfs_rq(struct cfs_rq *cfs_rq)
{
cfs_rq->tasks_timeline = RB_ROOT_CACHED;
- cfs_rq->min_vruntime = (u64)(-(1LL << 20));
-#ifndef CONFIG_64BIT
- cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
-#endif
+ u64_u32_store(cfs_rq->min_vruntime, (u64)(-(1LL << 20)));
#ifdef CONFIG_SMP
raw_spin_lock_init(&cfs_rq->removed.lock);
#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a00fc7057d97..1bd2058d7bb5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -528,6 +528,45 @@ struct cfs_bandwidth { };

#endif /* CONFIG_CGROUP_SCHED */

+/*
+ * u64_u32_load/u64_u32_store
+ *
+ * Use a copy of a u64 value to protect against data race. This is only
+ * applicable for 32-bits architectures.
+ */
+#ifdef CONFIG_64BIT
+# define u64_u32_load_copy(var, copy) var
+# define u64_u32_store_copy(var, copy, val) (var = val)
+#else
+# define u64_u32_load_copy(var, copy) \
+({ \
+ u64 __val, __val_copy; \
+ do { \
+ __val_copy = copy; \
+ /* \
+ * paired with u64_u32_store, ordering access \
+ * to var and copy. \
+ */ \
+ smp_rmb(); \
+ __val = var; \
+ } while (__val != __val_copy); \
+ __val; \
+})
+# define u64_u32_store_copy(var, copy, val) \
+do { \
+ typeof(val) __val = (val); \
+ var = __val; \
+ /* \
+ * paired with u64_u32_load, ordering access to var and \
+ * copy. \
+ */ \
+ smp_wmb(); \
+ copy = __val; \
+} while (0)
+#endif
+# define u64_u32_load(var) u64_u32_load_copy(var, var##_copy)
+# define u64_u32_store(var, val) u64_u32_store_copy(var, var##_copy, val)
+
/* CFS-related fields in a runqueue */
struct cfs_rq {
struct load_weight load;
@@ -568,7 +607,7 @@ struct cfs_rq {
*/
struct sched_avg avg;
#ifndef CONFIG_64BIT
- u64 load_last_update_time_copy;
+ u64 last_update_time_copy;
#endif
struct {
raw_spinlock_t lock ____cacheline_aligned;
--
2.25.1


2021-12-09 16:12:37

by Vincent Donnefort

[permalink] [raw]
Subject: [PATCH 2/4] sched/fair: Decay task PELT values during migration

Before being migrated to a new CPU, a task sees its PELT values synchronized
with the rq's last_update_time. Once done, that same task will also have its
sched_avg last_update_time reset. This means the time between the migration
and the last clock update (B) will not be accounted for in util_avg and a
discontinuity will appear. This issue is amplified by the PELT clock scaling.
If the clock hasn't been updated while the CPU is idle, clock_pelt will not be
aligned with clock_task and that time (A) will also be lost.

---------|----- A -----|-----------|------- B -----|>
     clock_pelt   clock_task     clock            now

This is especially problematic for asymmetric CPU capacity systems which
need stable util_avg signals for task placement and energy estimation.

Ideally, this problem would be solved by updating the runqueue clocks before
the migration. But that would require taking the runqueue lock, which is
quite expensive [1]. Instead, estimate the missing time and update the task
util_avg with that value:

A + B = clock_task - clock_pelt + sched_clock_cpu() - clock

Neither clock_task, clock_pelt nor clock can be accessed without the
runqueue lock. The new runqueue member clock_pelt_lag is therefore created
and encodes those three values:

clock_pelt_lag = clock - clock_task + clock_pelt

And we can then write the missing time as follows:

A + B = sched_clock_cpu() - clock_pelt_lag

The B part of the missing time is however an estimation that doesn't take
into account IRQ and paravirt time.

Now that we have an estimation for A + B, we can create an estimator for the
PELT value at the time of the migration. We need for this purpose to inject
last_update_time, which is a combination of both clock_pelt and
lost_idle_time. The latter is a time value which is completely lost from a
PELT point of view and must be ignored. And finally, we can write:

  rq_clock_pelt_estimator() = last_update_time + A + B
                            = last_update_time +
                              sched_clock_cpu() - clock_pelt_lag
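
Expanding clock_pelt_lag confirms that this indeed recovers A + B:

  sched_clock_cpu() - clock_pelt_lag
      = sched_clock_cpu() - (clock - clock_task + clock_pelt)
      = (clock_task - clock_pelt) + (sched_clock_cpu() - clock)
      = A + B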

[1] https://lore.kernel.org/all/[email protected]/

Signed-off-by: Vincent Donnefort <[email protected]>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f2611b9cf503..3649716d1de7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -607,6 +607,12 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
}
}

+static void update_rq_clock_pelt_lag(struct rq *rq)
+{
+ u64_u32_store(rq->clock_pelt_lag,
+ rq->clock - rq->clock_task + rq->clock_pelt);
+}
+
/*
* RQ-clock updating methods:
*/
@@ -663,6 +669,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
update_irq_load_avg(rq, irq_delta + steal);
#endif
update_rq_clock_pelt(rq, delta);
+ update_rq_clock_pelt_lag(rq);
}

void update_rq_clock(struct rq *rq)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e6182dabc99..9d640d519866 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6899,6 +6899,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)

static void detach_entity_cfs_rq(struct sched_entity *se);

+static u64 rq_clock_pelt_estimator(struct rq *rq)
+{
+ u64 pelt_lag = sched_clock_cpu(cpu_of(rq)) -
+ u64_u32_load(rq->clock_pelt_lag);
+
+ return cfs_rq_last_update_time(&rq->cfs) + pelt_lag;
+}
+
/*
* Called immediately before a task is migrated to a new CPU; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
@@ -6906,6 +6914,9 @@ static void detach_entity_cfs_rq(struct sched_entity *se);
*/
static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
{
+ struct sched_entity *se = &p->se;
+ struct rq *rq = task_rq(p);
+
/*
* As blocked tasks retain absolute vruntime the migration needs to
* deal with this by subtracting the old and adding the new
@@ -6913,7 +6924,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
* the task on the new runqueue.
*/
if (READ_ONCE(p->__state) == TASK_WAKING) {
- struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);

se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
@@ -6924,26 +6934,29 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
* In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
* rq->lock and can modify state directly.
*/
- lockdep_assert_rq_held(task_rq(p));
- detach_entity_cfs_rq(&p->se);
+ lockdep_assert_rq_held(rq);
+ detach_entity_cfs_rq(se);

} else {
+ remove_entity_load_avg(se);
+
/*
- * We are supposed to update the task to "current" time, then
- * its up to date and ready to go to new CPU/cfs_rq. But we
- * have difficulty in getting what current time is, so simply
- * throw away the out-of-date time. This will result in the
- * wakee task is less decayed, but giving the wakee more load
- * sounds not bad.
+ * Here, the task's PELT values have been updated according to
+ * the current rq's clock. But if that clock hasn't been
+ * updated in a while, a substantial idle time will be missed,
+ * leading to an inflation after wake-up on the new rq.
+ *
+ * Estimate the PELT clock lag, and update sched_avg to ensure
+ * PELT continuity after migration.
*/
- remove_entity_load_avg(&p->se);
+ __update_load_avg_blocked_se(rq_clock_pelt_estimator(rq), se);
}

/* Tell new CPU we are migrated */
- p->se.avg.last_update_time = 0;
+ se->avg.last_update_time = 0;

/* We have migrated, no longer consider this task hot */
- p->se.exec_start = 0;
+ se->exec_start = 0;

update_scan_period(p, new_cpu);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1bd2058d7bb5..6cd128de8666 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1027,8 +1027,13 @@ struct rq {
/* Ensure that all clocks are in the same cache line */
u64 clock_task ____cacheline_aligned;
u64 clock_pelt;
+ u64 clock_pelt_lag;
unsigned long lost_idle_time;

+#ifndef CONFIG_64BIT
+ u64 clock_pelt_lag_copy;
+#endif
+
atomic_t nr_iowait;

#ifdef CONFIG_SCHED_DEBUG
--
2.25.1


2021-12-09 16:12:41

by Vincent Donnefort

[permalink] [raw]
Subject: [PATCH 3/4] sched/fair: Remove task_util from effective utilization in feec()

The energy estimation in find_energy_efficient_cpu() (feec()) relies on the
computation of the effective utilization for each CPU of a perf domain (PD).
This effective utilization is then used as an estimation of the busy time for
this PD. The function effective_cpu_util(), which gives this value, scales the
utilization relative to the IRQ pressure on the CPU to take into account that
the IRQ time is hidden from the task clock. The IRQ scaling is as follows:

effective_cpu_util = irq + (cpu_cap - irq)/cpu_cap * util

Where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of
the CPU and irq the IRQ avg time.
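
As an example, with cpu_cap = 1024, irq = 256 and util = 512, this gives
256 + (1024 - 256)/1024 * 512 = 256 + 384 = 640.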

If we now take as an example a task placement which doesn't raise the OPP on
the candidate CPU, we can write the energy delta as:

  delta = OPPcost/cpu_cap * (effective_cpu_util(cpu_util + task_util) -
                             effective_cpu_util(cpu_util))
        = OPPcost/cpu_cap * (cpu_cap - irq)/cpu_cap * task_util
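
The second line follows from expanding the IRQ scaling above:

  effective_cpu_util(cpu_util + task_util) - effective_cpu_util(cpu_util)
      = [irq + (cpu_cap - irq)/cpu_cap * (cpu_util + task_util)]
        - [irq + (cpu_cap - irq)/cpu_cap * cpu_util]
      = (cpu_cap - irq)/cpu_cap * task_util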

We end-up with an energy delta depending on the IRQ avg time, which is a
problem: first the time spent on IRQs by a CPU has no effect on the
additional energy that would be consumed by a task. Second, we don't want
to favour a CPU with a higher IRQ avg time value.

Nonetheless, we need to take the IRQ avg time into account. If a task
placement raises the PD's frequency, it will increase the energy cost for
the entire time where the CPU is busy. A solution is to only use
effective_cpu_util() with the CPU contribution part. The task contribution
is added separately and scaled according to prev_cpu's IRQ time.
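
Concretely (see task_busy_time() below), the task contribution that gets
added on top of the PD busy time then becomes:

  (cpu_cap - irq)/cpu_cap * task_util_est(p)

with irq and cpu_cap taken from prev_cpu.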

No change for the FREQUENCY_UTIL component of the energy estimation. We
still want to get the actual frequency that would be selected after the
task placement.

Signed-off-by: Vincent Donnefort <[email protected]>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9d640d519866..c40252a96671 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6599,23 +6599,83 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
return min(util, capacity_orig_of(cpu));
}

+/*
+ * Compute the task busy time for compute_energy(). This time cannot be
+ * injected directly into effective_cpu_util() because of the IRQ scaling.
+ * The latter only makes sense with the most recent CPUs where the task has
+ * run.
+ */
+static inline unsigned long
+task_busy_time(struct task_struct *p, int prev_cpu)
+{
+ unsigned long cpu_cap = arch_scale_cpu_capacity(prev_cpu);
+ unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu));
+
+ return scale_irq_capacity(task_util_est(p), irq, cpu_cap);
+}
+
+/*
+ * Compute the busy time for compute_energy(). Based on the utilization, it
+ * however doesn't take into account clamping since the ratio (utilization /
+ * cpu_capacity) is already enough to scale the EM reported power consumption
+ * at the (eventually clamped) cpu_capacity.
+ *
+ * The contribution of the task @p for which we want to estimate the energy
+ * cost is removed (by cpu_util_next()) and must be compted separalyte (see
+ * task_busy_time). This ensures:
+ *
+ * - A stable PD utilization, no matter which CPU of that PD we want to place
+ * the task on.
+ *
+ * - A fair comparison between CPUs as the task contribution (task_util())
+ * will always be the same no matter which CPU utilization we rely on
+ * (util_avg or util_est).
+ *
+ * Returns the busy time for @pd. @pd_task_busy_time gives the sum of pd's and
+ * task's busy time.
+ */
+static inline unsigned long
+pd_task_busy_time(struct task_struct *p, struct perf_domain *pd,
+ unsigned long cpu_cap, unsigned long tsk_busy_time,
+ unsigned long *pd_tsk_busy_time)
+{
+ unsigned long max_cap, pd_cap = 0, pd_busy_time = 0;
+ struct cpumask *pd_mask = perf_domain_span(pd);
+ int cpu;
+
+ max_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
+
+ /* see compute_energy() */
+ for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
+ unsigned long util = cpu_util_next(cpu, p, -1);
+
+ pd_cap += cpu_cap;
+ pd_busy_time += effective_cpu_util(cpu, util, max_cap,
+ ENERGY_UTIL, NULL);
+ }
+
+ pd_busy_time = min(pd_cap, pd_busy_time);
+ *pd_tsk_busy_time = min(pd_cap, pd_busy_time + tsk_busy_time);
+
+ return pd_busy_time;
+}
+
/*
* compute_energy(): Estimates the energy that @pd would consume if @p was
- * migrated to @dst_cpu. compute_energy() predicts what will be the utilization
- * landscape of @pd's CPUs after the task migration, and uses the Energy Model
- * to compute what would be the energy if we decided to actually migrate that
- * task.
+ * migrated to @dst_cpu. compute_energy(), in conjunction with
+ * pd_task_busy_time() predicts what will be the utilization landscape of @pd's
+ * CPUs after the task migration, and uses the Energy Model to compute what
+ * would be the energy if we decided to actually migrate that task.
*/
static long
-compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
+compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd,
+ unsigned long cpu_cap, unsigned long busy_time)
{
struct cpumask *pd_mask = perf_domain_span(pd);
- unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
- unsigned long max_util = 0, sum_util = 0;
- unsigned long _cpu_cap = cpu_cap;
+ unsigned long max_cap, max_util = 0;
int cpu;

- _cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
+ max_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));

/*
* The capacity state of CPUs of the current rd can be driven by CPUs
@@ -6628,34 +6688,11 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
*/
for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
- unsigned long cpu_util, util_running = util_freq;
+ unsigned long cpu_util;
struct task_struct *tsk = NULL;

- /*
- * When @p is placed on @cpu:
- *
- * util_running = max(cpu_util, cpu_util_est) +
- * max(task_util, _task_util_est)
- *
- * while cpu_util_next is: max(cpu_util + task_util,
- * cpu_util_est + _task_util_est)
- */
- if (cpu == dst_cpu) {
+ if (cpu == dst_cpu)
tsk = p;
- util_running =
- cpu_util_next(cpu, p, -1) + task_util_est(p);
- }
-
- /*
- * Busy time computation: utilization clamping is not
- * required since the ratio (sum_util / cpu_capacity)
- * is already enough to scale the EM reported power
- * consumption at the (eventually clamped) cpu_capacity.
- */
- cpu_util = effective_cpu_util(cpu, util_running, cpu_cap,
- ENERGY_UTIL, NULL);
-
- sum_util += min(cpu_util, _cpu_cap);

/*
* Performance domain frequency: utilization clamping
@@ -6664,12 +6701,12 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* NOTE: in case RT tasks are running, by default the
* FREQUENCY_UTIL's utilization can be max OPP.
*/
- cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
+ cpu_util = effective_cpu_util(cpu, util_freq, max_cap,
FREQUENCY_UTIL, tsk);
- max_util = max(max_util, min(cpu_util, _cpu_cap));
+ max_util = max(max_util, min(cpu_util, cpu_cap));
}

- return em_cpu_energy(pd->em_pd, max_util, sum_util, _cpu_cap);
+ return em_cpu_energy(pd->em_pd, max_util, busy_time, cpu_cap);
}

/*
@@ -6714,9 +6751,11 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
+ unsigned long tsk_busy_time, pd_busy_time, pd_tsk_busy_time;
struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
int cpu, best_energy_cpu = prev_cpu, target = -1;
- unsigned long cpu_cap, util, base_energy = 0;
+ unsigned long cpu_cap, cpu_thermal_cap, util;
+ unsigned long base_energy = 0;
struct sched_domain *sd;
struct perf_domain *pd;

@@ -6741,6 +6780,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (!task_util_est(p))
goto unlock;

+ tsk_busy_time = task_busy_time(p, prev_cpu);
+
for (; pd; pd = pd->next) {
unsigned long cur_delta, spare_cap, max_spare_cap = 0;
bool compute_prev_delta = false;
@@ -6783,13 +6824,27 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (max_spare_cap_cpu < 0 && !compute_prev_delta)
continue;

+ /* Account thermal pressure for the energy estimation */
+ cpu_thermal_cap = arch_scale_cpu_capacity(
+ cpumask_first(perf_domain_span(pd)));
+ cpu_thermal_cap -= arch_scale_thermal_pressure(
+ cpumask_first(perf_domain_span(pd)));
+
+ /* Get the effective utilization for energy estimation */
+ pd_busy_time = pd_task_busy_time(p, pd, cpu_thermal_cap,
+ tsk_busy_time,
+ &pd_tsk_busy_time);
+
/* Compute the 'base' energy of the pd, without @p */
- base_energy_pd = compute_energy(p, -1, pd);
+ base_energy_pd = compute_energy(p, -1, pd, cpu_thermal_cap,
+ pd_busy_time);
base_energy += base_energy_pd;

/* Evaluate the energy impact of using prev_cpu. */
if (compute_prev_delta) {
- prev_delta = compute_energy(p, prev_cpu, pd);
+ prev_delta = compute_energy(p, prev_cpu, pd,
+ cpu_thermal_cap,
+ pd_tsk_busy_time);
if (prev_delta < base_energy_pd)
goto unlock;
prev_delta -= base_energy_pd;
@@ -6798,7 +6853,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

/* Evaluate the energy impact of using max_spare_cap_cpu. */
if (max_spare_cap_cpu >= 0) {
- cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
+ cur_delta = compute_energy(p, max_spare_cap_cpu, pd,
+ cpu_thermal_cap,
+ pd_tsk_busy_time);
if (cur_delta < base_energy_pd)
goto unlock;
cur_delta -= base_energy_pd;
--
2.25.1


2021-12-09 16:12:44

by Vincent Donnefort

[permalink] [raw]
Subject: [PATCH 4/4] sched/fair: Remove the energy margin in feec()

find_energy_efficient_cpu() integrates a margin to protect tasks from
bouncing back and forth from one CPU to another. This margin is set as being
6% of the total current energy estimated on the system. This however does not
work for two reasons:

1. The energy estimation is not a good absolute value:

The function compute_energy() used in feec() is a good estimation for task
placement as it allows comparing the energy with and without a task. The
computed delta will give a good overview of the cost for a certain task
placement. It, however, doesn't work as an absolute estimation for the total
energy of the system. First, it adds the contribution of idle CPUs into the
energy; second, it mixes util_avg with util_est values. util_avg integrates
the near history of a CPU's usage; it doesn't tell at all what the current
utilization is. A system that has been quite busy in the near past will hold
a very high energy and thus a high margin, preventing any task migration to a
lower capacity CPU and wasting energy. It even creates a negative feedback
loop: by holding the tasks on a less efficient CPU, the margin contributes to
keeping the energy high.

2. The margin handicaps small tasks:

On a system where the workload is composed mostly of small tasks (which is
often the case on Android), the overall energy will be high enough to create
a margin none of those tasks can cross. E.g. on a Pixel4, a small utilization
of 5% on all the CPUs creates a global estimated energy of 140 joules, as per
the Energy Model declaration of that same device. This means that, after
applying the 6% margin, any migration must save more than 8 joules to happen.
No task with a utilization lower than 40 would then be able to migrate away
from the biggest CPU of the system.
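
For scale: the margin in the code is a >> 4 shift of the estimated energy,
i.e. 1/16 ~= 6.25%, and 140 J / 16 ~= 8.75 J, hence the "more than 8 joules"
figure above.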

The 6% of the overall system energy was brought by the following patch:

(eb92692b2544 sched/fair: Speed-up energy-aware wake-ups)

It was previously 6% of the prev_cpu energy. Also, the following one
made this margin value conditional on the clusters where the task fits:

(8d4c97c105ca sched/fair: Only compute base_energy_pd if necessary)

We could simply revert that margin change to what it was, but the original
version didn't have strong grounds either and, as demonstrated in (1.), the
estimated energy isn't a good absolute value anyway. Instead, remove it
completely. This is made possible by recent changes that improved energy
estimation comparison fairness (sched/fair: Remove task_util from effective
utilization in feec()) and task utilization stabilization (sched/fair: Decay
task PELT values during migration).

Without a margin, we could have feared bouncing between CPUs. But running
LISA's eas_behaviour test coverage on three different platforms (Hikey960,
RB-5 and DB-845) showed no issue and even fixed previously known failures.

Removing the energy margin enables more energy-optimized placements for a
more energy efficient system.

Signed-off-by: Vincent Donnefort <[email protected]>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c40252a96671..a36d37f0b227 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6755,7 +6755,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
int cpu, best_energy_cpu = prev_cpu, target = -1;
unsigned long cpu_cap, cpu_thermal_cap, util;
- unsigned long base_energy = 0;
struct sched_domain *sd;
struct perf_domain *pd;

@@ -6838,7 +6837,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
/* Compute the 'base' energy of the pd, without @p */
base_energy_pd = compute_energy(p, -1, pd, cpu_thermal_cap,
pd_busy_time);
- base_energy += base_energy_pd;

/* Evaluate the energy impact of using prev_cpu. */
if (compute_prev_delta) {
@@ -6867,12 +6865,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
}
rcu_read_unlock();

- /*
- * Pick the best CPU if prev_cpu cannot be used, or if it saves at
- * least 6% of the energy used by prev_cpu.
- */
- if ((prev_delta == ULONG_MAX) ||
- (prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
+ if (best_delta < prev_delta)
target = best_energy_cpu;

return target;
--
2.25.1


2021-12-20 11:26:37

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Decay task PELT values during migration

On 09.12.21 17:11, Vincent Donnefort wrote:

[...]

> @@ -6899,6 +6899,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>
> static void detach_entity_cfs_rq(struct sched_entity *se);
>
> +static u64 rq_clock_pelt_estimator(struct rq *rq)
> +{
> + u64 pelt_lag = sched_clock_cpu(cpu_of(rq)) -
> + u64_u32_load(rq->clock_pelt_lag);
> +
> + return cfs_rq_last_update_time(&rq->cfs) + pelt_lag;

Why do you use `avg.last_update_time` (lut) of the root cfs_rq here?

p's lut was just synced to cfs_rq_of(se)'s lut in

migrate_task_rq_fair() (1) -> remove_entity_load_avg() ->
sync_entity_load_avg(se) (2)

[...]

> @@ -6924,26 +6934,29 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
> * rq->lock and can modify state directly.
> */
> - lockdep_assert_rq_held(task_rq(p));
> - detach_entity_cfs_rq(&p->se);
> + lockdep_assert_rq_held(rq);
> + detach_entity_cfs_rq(se);
>
> } else {
> + remove_entity_load_avg(se);
> +
> /*
> - * We are supposed to update the task to "current" time, then
> - * its up to date and ready to go to new CPU/cfs_rq. But we
> - * have difficulty in getting what current time is, so simply
> - * throw away the out-of-date time. This will result in the
> - * wakee task is less decayed, but giving the wakee more load
> - * sounds not bad.
> + * Here, the task's PELT values have been updated according to
> + * the current rq's clock. But if that clock hasn't been
> + * updated in a while, a substantial idle time will be missed,
> + * leading to an inflation after wake-up on the new rq.
> + *
> + * Estimate the PELT clock lag, and update sched_avg to ensure
> + * PELT continuity after migration.
> */
> - remove_entity_load_avg(&p->se);
> + __update_load_avg_blocked_se(rq_clock_pelt_estimator(rq), se);

We do __update_load_avg_blocked_se() now twice for p, 1. in (2) and then
in (1) again.

[...]


2021-12-20 16:10:02

by Vincent Donnefort

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Decay task PELT values during migration

On Mon, Dec 20, 2021 at 12:26:23PM +0100, Dietmar Eggemann wrote:
> On 09.12.21 17:11, Vincent Donnefort wrote:
>
> [...]
>
> > @@ -6899,6 +6899,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >
> > static void detach_entity_cfs_rq(struct sched_entity *se);
> >
> > +static u64 rq_clock_pelt_estimator(struct rq *rq)
> > +{
> > + u64 pelt_lag = sched_clock_cpu(cpu_of(rq)) -
> > + u64_u32_load(rq->clock_pelt_lag);
> > +
> > + return cfs_rq_last_update_time(&rq->cfs) + pelt_lag;
>
> Why do you use `avg.last_update_time` (lut) of the root cfs_rq here?
>
> p's lut was just synced to cfs_rq_of(se)'s lut in
>
> migrate_task_rq_fair() (1) -> remove_entity_load_avg() ->
> sync_entity_load_avg(se) (2)

Huum, indeed, the estimation is an offset on top of the se's last_update_time,
which I suppose could be different from the rq's root cfs_rq's.

I'll add a sched_entity argument to this function, to use either cfs_rq_of(se)'s
or the se's last_update_time.
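
Something along these lines (hypothetical sketch, not what is in this
posting):

	static u64 rq_clock_pelt_estimator(struct rq *rq, struct sched_entity *se)
	{
		u64 pelt_lag = sched_clock_cpu(cpu_of(rq)) -
			       u64_u32_load(rq->clock_pelt_lag);

		/* offset from the se's cfs_rq, not the root cfs_rq */
		return cfs_rq_last_update_time(cfs_rq_of(se)) + pelt_lag;
	}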


> [...]
>
> > @@ -6924,26 +6934,29 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> > * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
> > * rq->lock and can modify state directly.
> > */
> > - lockdep_assert_rq_held(task_rq(p));
> > - detach_entity_cfs_rq(&p->se);
> > + lockdep_assert_rq_held(rq);
> > + detach_entity_cfs_rq(se);
> >
> > } else {
> > + remove_entity_load_avg(se);
> > +
> > /*
> > - * We are supposed to update the task to "current" time, then
> > - * its up to date and ready to go to new CPU/cfs_rq. But we
> > - * have difficulty in getting what current time is, so simply
> > - * throw away the out-of-date time. This will result in the
> > - * wakee task is less decayed, but giving the wakee more load
> > - * sounds not bad.
> > + * Here, the task's PELT values have been updated according to
> > + * the current rq's clock. But if that clock hasn't been
> > + * updated in a while, a substantial idle time will be missed,
> > + * leading to an inflation after wake-up on the new rq.
> > + *
> > + * Estimate the PELT clock lag, and update sched_avg to ensure
> > + * PELT continuity after migration.
> > */
> > - remove_entity_load_avg(&p->se);
> > + __update_load_avg_blocked_se(rq_clock_pelt_estimator(rq), se);
>
> We do __update_load_avg_blocked_se() now twice for p, 1. in (2) and then
> in (1) again.

The first __update_load_avg_blocked_se() ensures the se is aligned with the
cfs_rq's clock and then updates the "removed" struct accordingly. We couldn't
use the estimator there, it would break that structure.

>
> [...]
>

2021-12-21 12:46:06

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH 2/4] sched/fair: Decay task PELT values during migration

On 20.12.21 17:09, Vincent Donnefort wrote:
> On Mon, Dec 20, 2021 at 12:26:23PM +0100, Dietmar Eggemann wrote:
>> On 09.12.21 17:11, Vincent Donnefort wrote:

[...]

>> Why do you use `avg.last_update_time` (lut) of the root cfs_rq here?
>>
>> p's lut was just synced to cfs_rq_of(se)'s lut in
>>
>> migrate_task_rq_fair() (1) -> remove_entity_load_avg() ->
>> sync_entity_load_avg(se) (2)
>
> Huum, indeed, the estimation is an offset on top of the se's last_update_time,
> which I suppose could be different from the rq's root cfs_rq's.
>
> I'll add a sched_entity argument to this function, to use either cfs_rq_of(se)'s
> or the se's last_update_time.

OK, or an `u64 now or lut`.

[...]

>>> } else {
>>> + remove_entity_load_avg(se);
>>> +
>>> /*
>>> - * We are supposed to update the task to "current" time, then
>>> - * its up to date and ready to go to new CPU/cfs_rq. But we
>>> - * have difficulty in getting what current time is, so simply
>>> - * throw away the out-of-date time. This will result in the
>>> - * wakee task is less decayed, but giving the wakee more load
>>> - * sounds not bad.
>>> + * Here, the task's PELT values have been updated according to
>>> + * the current rq's clock. But if that clock hasn't been
>>> + * updated in a while, a substantial idle time will be missed,
>>> + * leading to an inflation after wake-up on the new rq.
>>> + *
>>> + * Estimate the PELT clock lag, and update sched_avg to ensure
>>> + * PELT continuity after migration.
>>> */
>>> - remove_entity_load_avg(&p->se);
>>> + __update_load_avg_blocked_se(rq_clock_pelt_estimator(rq), se);
>>
>> We do __update_load_avg_blocked_se() now twice for p, 1. in (2) and then
>> in (1) again.
>
> The first __update_load_avg_blocked_se() ensures the se is aligned with the
> cfs_rq's clock and then updates the "removed" struct accordingly. We couldn't
> use the estimator there, it would break that structure.

You're right. I missed this bit.

Related to this: Looks like on CAS/EAS we don't rely on
remove_entity_load_avg()->sync_entity_load_avg(se) since it is already
called during select_task_rq().

2022-01-04 17:27:40

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH 3/4] sched/fair: Remove task_util from effective utilization in feec()

On 09/12/2021 17:11, Vincent Donnefort wrote:
> The energy estimation in find_energy_efficient_cpu() (feec()) relies on
> the computation of the effective utilization for each CPU of a perf domain
> (PD). This effective utilization is then used as an estimation of the busy
> time for this pd. The function effective_cpu_util() which gives this value,
> scales the utilization relative to IRQ pressure on the CPU to take into
> account that the IRQ time is hidden from the task clock. The IRQ scaling is
> as follow:
>
> effective_cpu_util = irq + (cpu_cap - irq)/cpu_cap * util

effective_cpu_util stands for cpu_util (effective_cpu_util(...,
ENERGY_UTIL, ...)), right? (1)

> Where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of
> the CPU and irq the IRQ avg time.
>
> If now we take as an example a task placement which doesn't raise the OPP
> on the candidate CPU, we can write the energy delta as:
>
> delta = OPPcost/cpu_cap * (effective_cpu_util(cpu_util + task_util) -
> effective_cpu_util(cpu_util))
> = OPPcost/cpu_cap * (cpu_cap - irq)/cpu_cap * task_util
>
> We end-up with an energy delta depending on the IRQ avg time, which is a
> problem: first the time spent on IRQs by a CPU has no effect on the
> additional energy that would be consumed by a task. Second, we don't want
> to favour a CPU with a higher IRQ avg time value.
>
> Nonetheless, we need to take the IRQ avg time into account. If a task
> placement raises the PD's frequency, it will increase the energy cost for
> the entire time where the CPU is busy. A solution is to only use
> effective_cpu_util() with the CPU contribution part. The task contribution
> is added separately and scaled according to prev_cpu's IRQ time.

This whole idea looks like a follow-up of commit 0372e1cf70c2
("sched/fair: Fix task utilization accountability in compute_energy()").

I forgot why we still use cpu_util_next(cpu, p, dst_cpu) for
FREQUENCY_UTIL though?

> No change for the FREQUENCY_UTIL component of the energy estimation. We

OK, it indirectly says so. (1)

[...]

> @@ -6599,23 +6599,83 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> return min(util, capacity_orig_of(cpu));
> }
>
> +/*
> + * Compute the task busy time for compute_energy(). This time cannot be
> + * injected directly into effective_cpu_util() because of the IRQ scaling.
> + * The latter only makes sense with the most recent CPUs where the task has
> + * run.
> + */
> +static inline unsigned long
> +task_busy_time(struct task_struct *p, int prev_cpu)
> +{
> + unsigned long cpu_cap = arch_scale_cpu_capacity(prev_cpu);

s/cpu_cap/max_cap ... to stay consistent

> + unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu));

What about irq >= max_cap ?

effective_cpu_util() has the following condition:

if (unlikely(irq >= max_cap))
return max_cap;
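
E.g. a hypothetical sketch, applying the same guard here (and the max_cap
naming suggested above):

	static inline unsigned long
	task_busy_time(struct task_struct *p, int prev_cpu)
	{
		unsigned long max_cap = arch_scale_cpu_capacity(prev_cpu);
		unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu));

		/* an IRQ-saturated CPU cannot be more than fully busy */
		if (unlikely(irq >= max_cap))
			return max_cap;

		return scale_irq_capacity(task_util_est(p), irq, max_cap);
	}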

[...]

> + * The contribution of the task @p for which we want to estimate the energy
> + * cost is removed (by cpu_util_next()) and must be compted separalyte (see

s/compted separalyte/calculated separately ?

[...]

> +static inline unsigned long
> +pd_task_busy_time(struct task_struct *p, struct perf_domain *pd,
> + unsigned long cpu_cap, unsigned long tsk_busy_time,
> + unsigned long *pd_tsk_busy_time)
> +{
> + unsigned long max_cap, pd_cap = 0, pd_busy_time = 0;
> + struct cpumask *pd_mask = perf_domain_span(pd);
> + int cpu;
> +
> + max_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));

In case we use 'sched, drivers: Remove max param from
effective_cpu_util()/sched_cpu_util()'

https://gitlab.arm.com/linux-arm/linux-de/-/commit/150e753e861285e82e9d7c593f1f26075c34e124

we would not have to have max_cap for effective_cpu_util(). This would
make the code easier to understand since we already have to pass pd_cap
and cpu_cap (cpu_thermal_cap) now.

> +
> + /* see compute_energy() */
> + for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {

Somehow unrelated ... We now use cpu_online_mask 3 times per PD to
iterate over the CPUs. cpu_online_mask can change during the run-queue
selection.

We could reuse `select_idle_mask` to create one cpumask at the beginning
of feec() and pass it down the functions:

See `sched/fair: Rename select_idle_mask to select_rq_mask` and
`sched/fair: Use the same cpumask per-PD throughout
find_energy_efficient_cpu()`

https://gitlab.arm.com/linux-arm/linux-de/-/commit/ec5bc27a298dd1352dbaff5809743128fa351075

https://gitlab.arm.com/linux-arm/linux-de/-/commit/f73b19968a65b07b0ad5bd1dff721ed1a675a24b

> + unsigned long util = cpu_util_next(cpu, p, -1);
> +
> + pd_cap += cpu_cap;
> + pd_busy_time += effective_cpu_util(cpu, util, max_cap,
> + ENERGY_UTIL, NULL);
> + }
> +
> + pd_busy_time = min(pd_cap, pd_busy_time);
> + *pd_tsk_busy_time = min(pd_cap, pd_busy_time + tsk_busy_time);

We do `sum_util += min(cpu_util, _cpu_cap (cpu_thermal_cap))` in
compute_energy() so far but now you sum up PD capacity (pd_cap) first
and then compare the sum_util (i.e. pd_busy_time) with pd_cap. Why?

cpu_util = effective_cpu_util(..., FREQUENCY_UTIL, ...) still uses
min(cpu_util, _cpu_cap).

> +
> + return pd_busy_time;

This function seems to be a little bit overloaded. It's called
pd_task_busy_time but returns `pd` busy time and `pd + task` busy time.

In case we could calculate pd_cap outside pd_task_busy_time(), then this
function could only return pd_busy_time, and pd_tsk_busy_time could also be
calculated outside the function.

See `XXX: Calculate pd_cap & pd_tsk_busy_time outside pd_task_busy_time()`

https://gitlab.arm.com/linux-arm/linux-de/-/commit/4512d681fe32eb5e2862c4c5d5a03b1d84129e26

[...]

> @@ -6628,34 +6688,11 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> */
> for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
> - unsigned long cpu_util, util_running = util_freq;
> + unsigned long cpu_util;
> struct task_struct *tsk = NULL;
>
> - /*
> - * When @p is placed on @cpu:
> - *
> - * util_running = max(cpu_util, cpu_util_est) +
> - * max(task_util, _task_util_est)
> - *
> - * while cpu_util_next is: max(cpu_util + task_util,
> - * cpu_util_est + _task_util_est)
> - */
> - if (cpu == dst_cpu) {
> + if (cpu == dst_cpu)
> tsk = p;
> - util_running =
> - cpu_util_next(cpu, p, -1) + task_util_est(p);
> - }
> -
> - /*
> - * Busy time computation: utilization clamping is not
> - * required since the ratio (sum_util / cpu_capacity)
> - * is already enough to scale the EM reported power
> - * consumption at the (eventually clamped) cpu_capacity.
> - */
> - cpu_util = effective_cpu_util(cpu, util_running, cpu_cap,
> - ENERGY_UTIL, NULL);
> -
> - sum_util += min(cpu_util, _cpu_cap);
>
> /*
> * Performance domain frequency: utilization clamping
> @@ -6664,12 +6701,12 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> * NOTE: in case RT tasks are running, by default the
> * FREQUENCY_UTIL's utilization can be max OPP.
> */
> - cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
> + cpu_util = effective_cpu_util(cpu, util_freq, max_cap,
> FREQUENCY_UTIL, tsk);
> - max_util = max(max_util, min(cpu_util, _cpu_cap));
> + max_util = max(max_util, min(cpu_util, cpu_cap));
> }

It's hard to understand since it is unbalanced that `busy time` is
calculated in pd_busy_time() whereas `max_util` is still calculated in
compute_energy().

IMHO it would be much easier to understand if there would be a
pd_max_util() as well so that we could do:

busy_time = get_pd_busy_time()
max_util = get_pd_max_util(..., -1, ...)
base_energy = compute_energy(..., max_util, busy_time, ...)

busy_time = min(busy_time + tsk_busy_time, pd_cap);
if (compute_prev_delta)
max_util = get_pd_max_util(..., prev_cpu, ...)
prev_delta = compute_energy(..., max_util, busy_time, ...)
...

See `XXX: Split get_pd_max_util() from compute_energy()`

https://gitlab.arm.com/linux-arm/linux-de/-/commit/6d79929ee7c8675a127158187786dd1d6b6dd355

[...]

> @@ -6783,13 +6824,27 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> if (max_spare_cap_cpu < 0 && !compute_prev_delta)
> continue;
>
> + /* Account thermal pressure for the energy estimation */
> + cpu_thermal_cap = arch_scale_cpu_capacity(
> + cpumask_first(perf_domain_span(pd)));
> + cpu_thermal_cap -= arch_scale_thermal_pressure(
> + cpumask_first(perf_domain_span(pd)));

Yes, we should calculate cpu_thermal_cap only once per PD. Can you make
this a little bit easier to read by getting `cpu =
cpumask_first(perf_domain_span(pd));` first?

[...]

2022-01-12 10:21:20

by Vincent Donnefort

[permalink] [raw]
Subject: Re: [PATCH 3/4] sched/fair: Remove task_util from effective utilization in feec()

[...]

> >
> > We end-up with an energy delta depending on the IRQ avg time, which is a
> > problem: first the time spent on IRQs by a CPU has no effect on the
> > additional energy that would be consumed by a task. Second, we don't want
> > to favour a CPU with a higher IRQ avg time value.
> >
> > Nonetheless, we need to take the IRQ avg time into account. If a task
> > placement raises the PD's frequency, it will increase the energy cost for
> > the entire time where the CPU is busy. A solution is to only use
> > effective_cpu_util() with the CPU contribution part. The task contribution
> > is added separately and scaled according to prev_cpu's IRQ time.
>
> This whole idea looks like a follow-up of commit 0372e1cf70c2
> ("sched/fair: Fix task utilization accountability in compute_energy()").
>
> I forgot why we still use cpu_util_next(cpu, p, dst_cpu) for
> FREQUENCY_UTIL though?

Yes, it's a generalised version.

cpu_util_next() gives the utilization we expect from the task placement,
which will then be turned into a "schedutil" effective utilization.

>
> > No change for the FREQUENCY_UTIL component of the energy estimation. We
>
> OK, it indirectly says so. (1)
>
> [...]
>
> > @@ -6599,23 +6599,83 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> > return min(util, capacity_orig_of(cpu));
> > }
> >
> > +/*
> > + * Compute the task busy time for compute_energy(). This time cannot be
> > + * injected directly into effective_cpu_util() because of the IRQ scaling.
> > + * The latter only makes sense with the most recent CPUs where the task has
> > + * run.
> > + */
> > +static inline unsigned long
> > +task_busy_time(struct task_struct *p, int prev_cpu)
> > +{
> > + unsigned long cpu_cap = arch_scale_cpu_capacity(prev_cpu);
>
> s/cpu_cap/max_cap ... to stay consistent

Ack

>
> > + unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu));
>
> What about irq >= max_cap ?
>
> effective_cpu_util() has the following condition:
>
> if (unlikely(irq >= max_cap))
> return max_cap;

Ack

>
> [...]
>
> > + * The contribution of the task @p for which we want to estimate the energy
> > + * cost is removed (by cpu_util_next()) and must be compted separalyte (see
>
> s/compted separalyte/calculated separately ?

Woh ...

>
> [...]
>
> > +static inline unsigned long
> > +pd_task_busy_time(struct task_struct *p, struct perf_domain *pd,
> > + unsigned long cpu_cap, unsigned long tsk_busy_time,
> > + unsigned long *pd_tsk_busy_time)
> > +{
> > + unsigned long max_cap, pd_cap = 0, pd_busy_time = 0;
> > + struct cpumask *pd_mask = perf_domain_span(pd);
> > + int cpu;
> > +
> > + max_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
>
> In case we use 'sched, drivers: Remove max param from
> effective_cpu_util()/sched_cpu_util()'
>
> https://gitlab.arm.com/linux-arm/linux-de/-/commit/150e753e861285e82e9d7c593f1f26075c34e124
>
> we would not have to have max_cap for effective_cpu_util(). This would
> make the code easier to understand since we already have to pass pd_cap
> and cpu_cap (cpu_thermal_cap) now.

I've taken your patches for the V2.

>
> > +
> > + /* see compute_energy() */
> > + for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
>
> Somehow unrelated ... We now use cpu_online_mask 3 times per PD to
> iterate over the CPUs. cpu_online_mask can change during the run-queue
> selection.
>
> We could reuse `select_idle_mask` to create one cpumask at the beginning
> of feec() and pass it down the functions:
>
> See `sched/fair: Rename select_idle_mask to select_rq_mask` and
> `sched/fair: Use the same cpumask per-PD throughout
> find_energy_efficient_cpu()`
>
> https://gitlab.arm.com/linux-arm/linux-de/-/commit/ec5bc27a298dd1352dbaff5809743128fa351075
>
> https://gitlab.arm.com/linux-arm/linux-de/-/commit/f73b19968a65b07b0ad5bd1dff721ed1a675a24b

I'll take this one as well.

>
> > + unsigned long util = cpu_util_next(cpu, p, -1);
> > +
> > + pd_cap += cpu_cap;
> > + pd_busy_time += effective_cpu_util(cpu, util, max_cap,
> > + ENERGY_UTIL, NULL);
> > + }
> > +
> > + pd_busy_time = min(pd_cap, pd_busy_time);
> > + *pd_tsk_busy_time = min(pd_cap, pd_busy_time + tsk_busy_time);
>
> We do `sum_util += min(cpu_util, _cpu_cap (cpu_thermal_cap))` in
> compute_energy() so far but now you sum up PD capacity (pd_cap) first
> and then compare the sum_util (i.e. pd_busy_time) with pd_cap. Why?
>
> cpu_util = effective_cpu_util(..., FREQUENCY_UTIL, ...) still uses
> min(cpu_util, _cpu_cap).
>
> > +
> > + return pd_busy_time;
>
> This function seems to be a little bit overloaded. It's called
> pd_task_busy_time but returns `pd` busy time and `pd + task` busy time.
>
> In case we could calculate pd_cap outside pd_task_busy_time(), then this
> function could only return pd_busy_time, and pd_tsk_busy_time could also be
> calculated outside the function.

Ok, but it'll make feec() even bigger.

What about a struct describing the utilization/busy-time landscape that could
be updated all along feec() and finally given to compute_energy()?
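
Something like this hypothetical sketch (name and fields made up, not part of
this series):

	struct energy_env {
		unsigned long task_busy_time;	/* @p's busy time, IRQ scaled */
		unsigned long pd_busy_time;	/* PD busy time without @p */
		unsigned long cpu_cap;		/* thermal-capped CPU capacity */
		unsigned long pd_cap;		/* sum of the PD's CPU capacities */
	};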

>
> See `XXX: Calculate pd_cap & pd_tsk_busy_time outside pd_task_busy_time()`
>
> https://gitlab.arm.com/linux-arm/linux-de/-/commit/4512d681fe32eb5e2862c4c5d5a03b1d84129e26
>
> [...]
>
> > @@ -6628,34 +6688,11 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> > */
> > for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> > unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
> > - unsigned long cpu_util, util_running = util_freq;
> > + unsigned long cpu_util;
> > struct task_struct *tsk = NULL;
> >
> > - /*
> > - * When @p is placed on @cpu:
> > - *
> > - * util_running = max(cpu_util, cpu_util_est) +
> > - * max(task_util, _task_util_est)
> > - *
> > - * while cpu_util_next is: max(cpu_util + task_util,
> > - * cpu_util_est + _task_util_est)
> > - */
> > - if (cpu == dst_cpu) {
> > + if (cpu == dst_cpu)
> > tsk = p;
> > - util_running =
> > - cpu_util_next(cpu, p, -1) + task_util_est(p);
> > - }
> > -
> > - /*
> > - * Busy time computation: utilization clamping is not
> > - * required since the ratio (sum_util / cpu_capacity)
> > - * is already enough to scale the EM reported power
> > - * consumption at the (eventually clamped) cpu_capacity.
> > - */
> > - cpu_util = effective_cpu_util(cpu, util_running, cpu_cap,
> > - ENERGY_UTIL, NULL);
> > -
> > - sum_util += min(cpu_util, _cpu_cap);
> >
> > /*
> > * Performance domain frequency: utilization clamping
> > @@ -6664,12 +6701,12 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> > * NOTE: in case RT tasks are running, by default the
> > * FREQUENCY_UTIL's utilization can be max OPP.
> > */
> > - cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
> > + cpu_util = effective_cpu_util(cpu, util_freq, max_cap,
> > FREQUENCY_UTIL, tsk);
> > - max_util = max(max_util, min(cpu_util, _cpu_cap));
> > + max_util = max(max_util, min(cpu_util, cpu_cap));
> > }
>
> It's hard to understand since it is unbalanced that `busy time` is
> calculated in pd_busy_time() whereas `max_util` is still calculated in
> compute_energy().
>
> IMHO it would be much easier to understand if there would be a
> pd_max_util() as well so that we could do:
>
> busy_time = get_pd_busy_time()
> max_util = get_pd_max_util(..., -1, ...)
> base_energy = compute_energy(..., max_util, busy_time, ...)
>
> busy_time = min(busy_time + tsk_busy_time, pd_cap);
> if (compute_prev_delta)
> max_util = get_pd_max_util(..., prev_cpu, ...)
> prev_delta = compute_energy(..., max_util, busy_time, ...)
> ...

I'll do that for the V2... but feec() will grow even more :)

>
> See `XXX: Split get_pd_max_util() from compute_energy()`
>
> https://gitlab.arm.com/linux-arm/linux-de/-/commit/6d79929ee7c8675a127158187786dd1d6b6dd355
>
> [...]
>
> > @@ -6783,13 +6824,27 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > if (max_spare_cap_cpu < 0 && !compute_prev_delta)
> > continue;
> >
> > + /* Account thermal pressure for the energy estimation */
> > + cpu_thermal_cap = arch_scale_cpu_capacity(
> > + cpumask_first(perf_domain_span(pd)));
> > + cpu_thermal_cap -= arch_scale_thermal_pressure(
> > + cpumask_first(perf_domain_span(pd)));
>
> Yes, we should calculate cpu_thermal_cap only once per PD. Can you make
> this a little bit easier to read by getting `cpu =
> cpumask_first(perf_domain_span(pd));` first?

Ack.