Hi,
Here's a new version of the patch-set to get rid of the energy margin in
feec(). Many thanks to all for the insightful comments I got.
find_energy_efficient_cpu() (feec()) will migrate a task to save energy only
if it saves at least 6% of the total energy consumed by the system. This
conservative approach is a problem on a system where a lot of small tasks
create a high overall load: very few of them will be allowed to migrate to a
smaller CPU, wasting a lot of energy. Instead of trying to determine yet
another margin, let's try to remove it.
The first elements of this patch-set are various fixes and improvements that
stabilize task_util and ensure energy-comparison fairness across all CPUs of
the topology. Only once those are in place can we completely remove the margin
and let feec() aggressively place tasks and save energy.
This has been validated in two different ways:
First, using LISA's eas_behaviour test suite. This is composed of a set of
scenarios verifying that the task placement is optimal. No failures have been
observed and it also improved some tests such as Ramp-Down (as the
placement is now more energy oriented) and *ThreeSmall (as no bouncing
between clusters happens anymore).
* Hikey960: 100% PASSED
* DB-845C: 100% PASSED
* RB5: 100% PASSED
Second, using an Android benchmark: PCMark2 on a Pixel4, with a lot of
backports to have a scheduler as close as possible to mainline.
+------------+-----------------+-----------------+
| Test | Perf | Energy [1] |
+------------+-----------------+-----------------+
| Web2 | -0.3% pval 0.03 | -1.8% pval 0.00 |
| Video2 | -0.3% pval 0.13 | -5.6% pval 0.00 |
| Photo2 [2] | -3.8% pval 0.00 | -1% pval 0.00 |
| Writing2 | 0% pval 0.13 | -1% pval 0.00 |
| Data2 | 0% pval 0.8 | -0.43 pval 0.00 |
+------------+-----------------+-----------------+
The margin removal lets the kernel make the best use of the Energy Model:
tasks are more likely to be placed where they fit, which saves a substantial
amount of energy while having a limited impact on performance.
[1] This is an energy estimation based on the CPU activity and the Energy
Model for this device. "All models are wrong but some are useful"; yes,
this is an imperfect estimation that doesn't take into account some idle
states and shared power rails. Nonetheless, it is based on the information
the kernel has at runtime, and it shows the scheduler can make better
decisions based solely on that data.
[2] This is the only performance impact observed. Debugging this test showed
no issue with task placement. The previously better score was solely due to
some critical threads being held on better-performing CPUs. If a thread needs
a higher-capacity CPU, the placement must result from a user input (with
e.g. uclamp min) instead of the thread being artificially held on less
efficient CPUs by feec(). Notice also that the experiment didn't use the
Android-only latency_sensitive feature, which would hide this problem on a
real-life device.
v8 -> v9:
- PELT migration decay: Fix barriers to prevent overestimation. (Vincent
G.)
- PELT migration decay: Fix CONFIG_GROUP_SCHED=n build.
- Various readability improvements. (Dietmar)
- Collect Reviewed-by tags.
v7 -> v8:
- PELT migration decay: Refine estimation computation. (Vincent G.)
- PELT migration decay: Do not apply estimation if load_avg is decayed
(Tao)
- PELT migration decay: throttled_pelt_idle update ordering for the
update_blocked_load case. (Vincent G.)
v6 -> v7:
- PELT migration decay: Add missing clock_pelt_idle updates.
- PELT migration decay: Fix PELT scaling delta for CONFIG_CFS_BANDWIDTH.
v4 -> v5:
- PELT migration decay: timestamp only at idle time (Vincent G.)
- PELT migration decay: split timestamp values (enter_idle /
clock_pelt_idle) (Vincent G.)
v3 -> v4:
- Minor cosmetic changes (Dietmar)
v2 -> v3:
- feec(): introduce energy_env struct (Dietmar)
- PELT migration decay: Only apply when src CPU is idle (Vincent G.)
- PELT migration decay: Do not apply when cfs_rq is throttled
- PELT migration decay: Snapshot the lag at cfs_rq's level
v1 -> v2:
- Fix PELT migration last_update_time (previously root cfs_rq's).
- Add Dietmar's patches to refactor feec()'s CPU loop.
- feec(): renaming busy time functions get_{pd,tsk}_busy_time()
- feec(): pd_cap computation in the first for_each_cpu loop.
- feec(): create get_pd_max_util() function (previously within
compute_energy())
- feec(): rename base_energy_pd to base_energy.
Dietmar Eggemann (3):
sched, drivers: Remove max param from
effective_cpu_util()/sched_cpu_util()
sched/fair: Rename select_idle_mask to select_rq_mask
sched/fair: Use the same cpumask per-PD throughout
find_energy_efficient_cpu()
Vincent Donnefort (4):
sched/fair: Provide u64 read for 32-bits arch helper
sched/fair: Decay task PELT values during wakeup migration
sched/fair: Remove task_util from effective utilization in feec()
sched/fair: Remove the energy margin in feec()
drivers/powercap/dtpm_cpu.c | 33 +--
drivers/thermal/cpufreq_cooling.c | 6 +-
include/linux/sched.h | 2 +-
kernel/sched/core.c | 15 +-
kernel/sched/cpufreq_schedutil.c | 5 +-
kernel/sched/fair.c | 465 +++++++++++++++++++-----------
kernel/sched/pelt.h | 40 ++-
kernel/sched/sched.h | 53 +++-
8 files changed, 395 insertions(+), 224 deletions(-)
--
2.36.1.124.g0e6072fb45-goog
From: Vincent Donnefort <[email protected]>
Before being migrated to a new CPU, a task sees its PELT values
synchronized with rq last_update_time. Once done, that same task will also
have its sched_avg last_update_time reset. This means the time between
the migration and the last clock update will not be accounted for in
util_avg and a discontinuity will appear. This issue is amplified by the
PELT clock scaling. It currently takes one tick after the CPU goes idle
for clock_pelt to catch up with clock_task.
This is especially problematic for asymmetric CPU capacity systems which
need stable util_avg signals for task placement and energy estimation.
Ideally, this problem would be solved by updating the runqueue clocks
before the migration. But that would require taking the runqueue lock
which is quite expensive [1]. Instead estimate the missing time and update
the task util_avg with that value.
To that end, we need sched_clock_cpu() but it is a costly function. Limit
its usage to the case where the source CPU is idle, as we know this is when
the clock is most at risk of being outdated. In such a case, let's call
cfs_idle_lag the delta between the rq_clock_pelt() value at rq idle and at
cfs_rq idle, and rq_idle_lag the delta between "now" and rq_clock() at rq
idle.
The estimated PELT clock is then:
last_update_time + (the cfs_rq's last_update_time)
cfs_idle_lag + (delta between cfs_rq's update and rq's update)
rq_idle_lag (delta between rq's update and now)
last_update_time = cfs_rq_clock_pelt()
= rq_clock_pelt() - cfs->throttled_clock_pelt_time
cfs_idle_lag = rq_clock_pelt()@rq_idle -
rq_clock_pelt()@cfs_rq_idle
rq_idle_lag = sched_clock_cpu() - rq_clock()@rq_idle
The rq_clock_pelt() from last_update_time being the same as
rq_clock_pelt()@cfs_rq_idle, we can write:
estimation = rq_clock_pelt()@rq_idle - cfs->throttled_clock_pelt_time +
sched_clock_cpu() - rq_clock()@rq_idle
Since these clocks are not accessible without taking the rq lock, the
following timestamps are introduced:
rq_clock_pelt()@rq_idle is rq->clock_pelt_idle
rq_clock()@rq_idle is rq->enter_idle
cfs->throttled_clock_pelt_time is cfs_rq->throttled_pelt_idle
The rq_idle_lag part of the missing time is however an estimation that
doesn't take into account IRQ and Paravirt time.
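To make the arithmetic above concrete, here is a small, self-contained
user-space sketch plugging made-up microsecond values into the estimation
(illustrative only; the variables mirror the kernel fields named above, but
none of this is kernel code):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		/* Hypothetical snapshots, in microseconds: */
		uint64_t clock_pelt_idle = 1200; /* rq_clock_pelt()@rq_idle */
		uint64_t enter_idle      = 1500; /* rq_clock()@rq_idle */
		uint64_t throttled       = 0;    /* cfs->throttled_clock_pelt_time */
		uint64_t sched_clock_now = 2500; /* sched_clock_cpu() */
		uint64_t lut             = 1000; /* cfs_rq's last_update_time */

		uint64_t now = clock_pelt_idle - throttled;

		if (now < lut)
			/* The cfs_rq's clock is more recent than the estimate. */
			now = lut;
		else
			/* Add the wall-clock time spent idle since rq_idle. */
			now += sched_clock_now - enter_idle;

		/* Prints: estimated PELT "now" = 2200 us (stale last_update_time = 1000 us) */
		printf("estimated PELT \"now\" = %llu us (stale last_update_time = %llu us)\n",
		       (unsigned long long)now, (unsigned long long)lut);
		return 0;
	}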
[1] https://lore.kernel.org/all/[email protected]/
Signed-off-by: Vincent Donnefort <[email protected]>
Signed-off-by: Vincent Donnefort <[email protected]>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 05614d9b919c..df5e6e565b4d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3310,6 +3310,29 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
}
#ifdef CONFIG_SMP
+static inline bool load_avg_is_decayed(struct sched_avg *sa)
+{
+ if (sa->load_sum)
+ return false;
+
+ if (sa->util_sum)
+ return false;
+
+ if (sa->runnable_sum)
+ return false;
+
+ /*
+ * _avg must be null when _sum are null because _avg = _sum / divider
+ * Make sure that rounding and/or propagation of PELT values never
+ * break this.
+ */
+ SCHED_WARN_ON(sa->load_avg ||
+ sa->util_avg ||
+ sa->runnable_avg);
+
+ return true;
+}
+
static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
{
return u64_u32_load_copy(cfs_rq->avg.last_update_time,
@@ -3347,27 +3370,12 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
if (cfs_rq->load.weight)
return false;
- if (cfs_rq->avg.load_sum)
- return false;
-
- if (cfs_rq->avg.util_sum)
- return false;
-
- if (cfs_rq->avg.runnable_sum)
+ if (!load_avg_is_decayed(&cfs_rq->avg))
return false;
if (child_cfs_rq_on_list(cfs_rq))
return false;
- /*
- * _avg must be null when _sum are null because _avg = _sum / divider
- * Make sure that rounding and/or propagation of PELT values never
- * break this.
- */
- SCHED_WARN_ON(cfs_rq->avg.load_avg ||
- cfs_rq->avg.util_avg ||
- cfs_rq->avg.runnable_avg);
-
return true;
}
@@ -3706,6 +3714,88 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
#endif /* CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_NO_HZ_COMMON
+static inline void migrate_se_pelt_lag(struct sched_entity *se)
+{
+ u64 throttled = 0, now, lut;
+ struct cfs_rq *cfs_rq;
+ struct rq *rq;
+ bool is_idle;
+
+ if (load_avg_is_decayed(&se->avg))
+ return;
+
+ cfs_rq = cfs_rq_of(se);
+ rq = rq_of(cfs_rq);
+
+ rcu_read_lock();
+ is_idle = is_idle_task(rcu_dereference(rq->curr));
+ rcu_read_unlock();
+
+ /*
+ * The lag estimation comes with a cost we don't want to pay all the
+ * time. Hence, limiting to the case where the source CPU is idle and
+ * we know we are at the greatest risk to have an outdated clock.
+ */
+ if (!is_idle)
+ return;
+
+ /*
+ * Estimated "now" is: last_update_time + cfs_idle_lag + rq_idle_lag, where:
+ *
+ * last_update_time (the cfs_rq's last_update_time)
+ * = cfs_rq_clock_pelt()
+ * = rq_clock_pelt() - cfs->throttled_clock_pelt_time
+ *
+ * cfs_idle_lag (delta between cfs_rq's update and rq's update)
+ * = rq_clock_pelt()@rq_idle - rq_clock_pelt()@cfs_rq_idle
+ *
+ * rq_idle_lag (delta between rq's update and now)
+ * = sched_clock_cpu() - rq_clock()@rq_idle
+ *
+ * The rq_clock_pelt() from last_update_time being the same as
+ * rq_clock_pelt()@cfs_rq_idle, we can write:
+ *
+ * now = rq_clock_pelt()@rq_idle - cfs->throttled_clock_pelt_time +
+ * sched_clock_cpu() - rq_clock()@rq_idle
+ * Where:
+ * rq_clock_pelt()@rq_idle is rq->clock_pelt_idle
+ * rq_clock()@rq_idle is rq->enter_idle
+ * cfs->throttled_clock_pelt_time is cfs_rq->throttled_pelt_idle
+ */
+
+#ifdef CONFIG_CFS_BANDWIDTH
+ throttled = u64_u32_load(cfs_rq->throttled_pelt_idle);
+ /* The clock has been stopped for throttling */
+ if (throttled == U64_MAX)
+ return;
+#endif
+ now = u64_u32_load(rq->clock_pelt_idle);
+ /*
+ * Paired with _update_idle_rq_clock_pelt. It ensures at the worst case
+ * is observed the old clock_pelt_idle value and the new enter_idle,
+ * which lead to an understimation. The opposite would lead to an
+ * overestimation.
+ */
+ smp_rmb();
+ lut = cfs_rq_last_update_time(cfs_rq);
+
+ now -= throttled;
+ if (now < lut)
+ /*
+ * cfs_rq->avg.last_update_time is more recent than our
+ * estimation, let's use it.
+ */
+ now = lut;
+ else
+ now += sched_clock_cpu(cpu_of(rq)) - u64_u32_load(rq->enter_idle);
+
+ __update_load_avg_blocked_se(now, se);
+}
+#else
+static void migrate_se_pelt_lag(struct sched_entity *se) {}
+#endif
+
/**
* update_cfs_rq_load_avg - update the cfs_rq's load/util averages
* @now: current time, as per cfs_rq_clock_pelt()
@@ -4437,6 +4527,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
*/
if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
update_min_vruntime(cfs_rq);
+
+ if (cfs_rq->nr_running == 0)
+ update_idle_cfs_rq_clock_pelt(cfs_rq);
}
/*
@@ -6911,6 +7004,8 @@ static void detach_entity_cfs_rq(struct sched_entity *se);
*/
static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
{
+ struct sched_entity *se = &p->se;
+
/*
* As blocked tasks retain absolute vruntime the migration needs to
* deal with this by subtracting the old and adding the new
@@ -6918,7 +7013,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
* the task on the new runqueue.
*/
if (READ_ONCE(p->__state) == TASK_WAKING) {
- struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);
se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
@@ -6930,25 +7024,29 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
* rq->lock and can modify state directly.
*/
lockdep_assert_rq_held(task_rq(p));
- detach_entity_cfs_rq(&p->se);
+ detach_entity_cfs_rq(se);
} else {
+ remove_entity_load_avg(se);
+
/*
- * We are supposed to update the task to "current" time, then
- * its up to date and ready to go to new CPU/cfs_rq. But we
- * have difficulty in getting what current time is, so simply
- * throw away the out-of-date time. This will result in the
- * wakee task is less decayed, but giving the wakee more load
- * sounds not bad.
+ * Here, the task's PELT values have been updated according to
+ * the current rq's clock. But if that clock hasn't been
+ * updated in a while, a substantial idle time will be missed,
+ * leading to an inflation after wake-up on the new rq.
+ *
+ * Estimate the missing time from the cfs_rq last_update_time
+ * and update sched_avg to improve the PELT continuity after
+ * migration.
*/
- remove_entity_load_avg(&p->se);
+ migrate_se_pelt_lag(se);
}
/* Tell new CPU we are migrated */
- p->se.avg.last_update_time = 0;
+ se->avg.last_update_time = 0;
/* We have migrated, no longer consider this task hot */
- p->se.exec_start = 0;
+ se->exec_start = 0;
update_scan_period(p, new_cpu);
}
@@ -8114,6 +8212,10 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
update_tg_load_avg(cfs_rq);
+ /* sync clock_pelt_idle with last update */
+ if (cfs_rq->nr_running == 0)
+ update_idle_cfs_rq_clock_pelt(cfs_rq);
+
if (cfs_rq == &rq->cfs)
decayed = true;
}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 4ff2ed4f8fa1..647e5fcc041b 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -61,6 +61,25 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
WRITE_ONCE(avg->util_est.enqueued, enqueued);
}
+static inline u64 rq_clock_pelt(struct rq *rq)
+{
+ lockdep_assert_rq_held(rq);
+ assert_clock_updated(rq);
+
+ return rq->clock_pelt - rq->lost_idle_time;
+}
+
+/* The rq is idle, we can sync to clock_task */
+static inline void _update_idle_rq_clock_pelt(struct rq *rq)
+{
+ rq->clock_pelt = rq_clock_task(rq);
+
+ u64_u32_store(rq->enter_idle, rq_clock(rq));
+ /* Paired with smp_rmb in migrate_se_pelt_lag */
+ smp_wmb();
+ u64_u32_store(rq->clock_pelt_idle, rq_clock_pelt(rq));
+}
+
/*
* The clock_pelt scales the time to reflect the effective amount of
* computation done during the running delta time but then sync back to
@@ -76,8 +95,7 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
{
if (unlikely(is_idle_task(rq->curr))) {
- /* The rq is idle, we can sync to clock_task */
- rq->clock_pelt = rq_clock_task(rq);
+ _update_idle_rq_clock_pelt(rq);
return;
}
@@ -130,17 +148,23 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
*/
if (util_sum >= divider)
rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
+
+ _update_idle_rq_clock_pelt(rq);
}
-static inline u64 rq_clock_pelt(struct rq *rq)
+#ifdef CONFIG_CFS_BANDWIDTH
+static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
- lockdep_assert_rq_held(rq);
- assert_clock_updated(rq);
+ u64 throttled;
- return rq->clock_pelt - rq->lost_idle_time;
+ if (unlikely(cfs_rq->throttle_count))
+ throttled = U64_MAX;
+ else
+ throttled = cfs_rq->throttled_clock_pelt_time;
+
+ u64_u32_store(cfs_rq->throttled_pelt_idle, throttled);
}
-#ifdef CONFIG_CFS_BANDWIDTH
/* rq->task_clock normalized against any time this cfs_rq has spent throttled */
static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
@@ -150,6 +174,7 @@ static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;
}
#else
+static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) { }
static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
return rq_clock_pelt(rq_of(cfs_rq));
@@ -204,6 +229,7 @@ update_rq_clock_pelt(struct rq *rq, s64 delta) { }
static inline void
update_idle_rq_clock_pelt(struct rq *rq) { }
+static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) { }
#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bf4a0ec98678..97bc26e5c8af 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -648,6 +648,10 @@ struct cfs_rq {
int runtime_enabled;
s64 runtime_remaining;
+ u64 throttled_pelt_idle;
+#ifndef CONFIG_64BIT
+ u64 throttled_pelt_idle_copy;
+#endif
u64 throttled_clock;
u64 throttled_clock_pelt;
u64 throttled_clock_pelt_time;
@@ -1020,6 +1024,12 @@ struct rq {
u64 clock_task ____cacheline_aligned;
u64 clock_pelt;
unsigned long lost_idle_time;
+ u64 clock_pelt_idle;
+ u64 enter_idle;
+#ifndef CONFIG_64BIT
+ u64 clock_pelt_idle_copy;
+ u64 enter_idle_copy;
+#endif
atomic_t nr_iowait;
--
2.36.1.124.g0e6072fb45-goog
From: Dietmar Eggemann <[email protected]>
Decouple the name of the per-cpu cpumask select_idle_mask from its usage
in select_idle_[cpu/capacity]() of the CFS run-queue selection
(select_task_rq_fair()).
This is to support the reuse of this cpumask in the Energy Aware
Scheduling (EAS) path (find_energy_efficient_cpu()) of the CFS run-queue
selection.
Signed-off-by: Dietmar Eggemann <[email protected]>
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c531976ee960..68f5eb8a1de7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9502,7 +9502,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
#endif
DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
-DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
+DECLARE_PER_CPU(cpumask_var_t, select_rq_mask);
void __init sched_init(void)
{
@@ -9551,7 +9551,7 @@ void __init sched_init(void)
for_each_possible_cpu(i) {
per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
cpumask_size(), GFP_KERNEL, cpu_to_node(i));
- per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
+ per_cpu(select_rq_mask, i) = (cpumask_var_t)kzalloc_node(
cpumask_size(), GFP_KERNEL, cpu_to_node(i));
}
#endif /* CONFIG_CPUMASK_OFFSTACK */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 73a9dc522b73..2d7bba2f1da2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5897,7 +5897,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Working cpumask for: load_balance, load_balance_newidle. */
DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
-DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+DEFINE_PER_CPU(cpumask_var_t, select_rq_mask);
#ifdef CONFIG_NO_HZ_COMMON
@@ -6387,7 +6387,7 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
*/
static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
{
- struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
int i, cpu, idle_cpu = -1, nr = INT_MAX;
struct rq *this_rq = this_rq();
int this = smp_processor_id();
@@ -6473,7 +6473,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
int cpu, best_cpu = -1;
struct cpumask *cpus;
- cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+ cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
task_util = uclamp_task_util(p);
--
2.36.1.124.g0e6072fb45-goog
From: Dietmar Eggemann <[email protected]>
effective_cpu_util() already has an `int cpu' parameter which allows
retrieving the CPU capacity scale factor (or maximum CPU capacity) inside
this function via arch_scale_cpu_capacity(cpu).
A lot of code calling effective_cpu_util() (or the shim
sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call
arch_scale_cpu_capacity() already.
But not having to pass it into effective_cpu_util() will make the EAS
wake-up code easier, especially when the maximum CPU capacity reduced
by the thermal pressure is passed through the EAS wake-up functions.
Due to the asymmetric CPU capacity support of arm/arm64 architectures,
arch_scale_cpu_capacity(int cpu) is a per-CPU variable read, accessed via
per_cpu(cpu_scale, cpu), on such a system.
On all other architectures it is a compile-time constant
(SCHED_CAPACITY_SCALE).
Signed-off-by: Dietmar Eggemann <[email protected]>
diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c
index f5eced0842b3..6a88eb7e9f75 100644
--- a/drivers/powercap/dtpm_cpu.c
+++ b/drivers/powercap/dtpm_cpu.c
@@ -71,34 +71,19 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
static u64 scale_pd_power_uw(struct cpumask *pd_mask, u64 power)
{
- unsigned long max = 0, sum_util = 0;
+ unsigned long max, sum_util = 0;
int cpu;
- for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
-
- /*
- * The capacity is the same for all CPUs belonging to
- * the same perf domain, so a single call to
- * arch_scale_cpu_capacity() is enough. However, we
- * need the CPU parameter to be initialized by the
- * loop, so the call ends up in this block.
- *
- * We can initialize 'max' with a cpumask_first() call
- * before the loop but the bits computation is not
- * worth given the arch_scale_cpu_capacity() just
- * returns a value where the resulting assembly code
- * will be optimized by the compiler.
- */
- max = arch_scale_cpu_capacity(cpu);
- sum_util += sched_cpu_util(cpu, max);
- }
-
/*
- * In the improbable case where all the CPUs of the perf
- * domain are offline, 'max' will be zero and will lead to an
- * illegal operation with a zero division.
+ * The capacity is the same for all CPUs belonging to
+ * the same perf domain.
*/
- return max ? (power * ((sum_util << 10) / max)) >> 10 : 0;
+ max = arch_scale_cpu_capacity(cpumask_first(pd_mask));
+
+ for_each_cpu_and(cpu, pd_mask, cpu_online_mask)
+ sum_util += sched_cpu_util(cpu);
+
+ return (power * ((sum_util << 10) / max)) >> 10;
}
static u64 get_pd_power_uw(struct dtpm *dtpm)
diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index b8151d95a806..b263b0fde03c 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -137,11 +137,9 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
int cpu_idx)
{
- unsigned long max = arch_scale_cpu_capacity(cpu);
- unsigned long util;
+ unsigned long util = sched_cpu_util(cpu);
- util = sched_cpu_util(cpu, max);
- return (util * 100) / max;
+ return (util * 100) / arch_scale_cpu_capacity(cpu);
}
#else /* !CONFIG_SMP */
static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c46f3a63b758..88b8817b827d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2257,7 +2257,7 @@ static inline bool owner_on_cpu(struct task_struct *owner)
}
/* Returns effective CPU energy utilization, as seen by the scheduler */
-unsigned long sched_cpu_util(int cpu, unsigned long max);
+unsigned long sched_cpu_util(int cpu);
#endif /* CONFIG_SMP */
#ifdef CONFIG_RSEQ
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 53596842f0d8..c531976ee960 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7107,12 +7107,14 @@ struct task_struct *idle_task(int cpu)
* required to meet deadlines.
*/
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
- unsigned long max, enum cpu_util_type type,
+ enum cpu_util_type type,
struct task_struct *p)
{
- unsigned long dl_util, util, irq;
+ unsigned long dl_util, util, irq, max;
struct rq *rq = cpu_rq(cpu);
+ max = arch_scale_cpu_capacity(cpu);
+
if (!uclamp_is_used() &&
type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
return max;
@@ -7192,10 +7194,9 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
return min(max, util);
}
-unsigned long sched_cpu_util(int cpu, unsigned long max)
+unsigned long sched_cpu_util(int cpu)
{
- return effective_cpu_util(cpu, cpu_util_cfs(cpu), max,
- ENERGY_UTIL, NULL);
+ return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
}
#endif /* CONFIG_SMP */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 3dbf351d12d5..1207c78f85c1 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -157,11 +157,10 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
static void sugov_get_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
- unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
- sg_cpu->max = max;
+ sg_cpu->max = arch_scale_cpu_capacity(sg_cpu->cpu);
sg_cpu->bw_dl = cpu_bw_dl(rq);
- sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu), max,
+ sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu),
FREQUENCY_UTIL, NULL);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df5e6e565b4d..73a9dc522b73 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6703,12 +6703,11 @@ static long
compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
{
struct cpumask *pd_mask = perf_domain_span(pd);
- unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
- unsigned long max_util = 0, sum_util = 0;
- unsigned long _cpu_cap = cpu_cap;
+ unsigned long max_util = 0, sum_util = 0, cpu_cap;
int cpu;
- _cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
+ cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
+ cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
/*
* The capacity state of CPUs of the current rd can be driven by CPUs
@@ -6745,10 +6744,10 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* is already enough to scale the EM reported power
* consumption at the (eventually clamped) cpu_capacity.
*/
- cpu_util = effective_cpu_util(cpu, util_running, cpu_cap,
- ENERGY_UTIL, NULL);
+ cpu_util = effective_cpu_util(cpu, util_running, ENERGY_UTIL,
+ NULL);
- sum_util += min(cpu_util, _cpu_cap);
+ sum_util += min(cpu_util, cpu_cap);
/*
* Performance domain frequency: utilization clamping
@@ -6757,12 +6756,12 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* NOTE: in case RT tasks are running, by default the
* FREQUENCY_UTIL's utilization can be max OPP.
*/
- cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
- FREQUENCY_UTIL, tsk);
- max_util = max(max_util, min(cpu_util, _cpu_cap));
+ cpu_util = effective_cpu_util(cpu, util_freq, FREQUENCY_UTIL,
+ tsk);
+ max_util = max(max_util, min(cpu_util, cpu_cap));
}
- return em_cpu_energy(pd->em_pd, max_util, sum_util, _cpu_cap);
+ return em_cpu_energy(pd->em_pd, max_util, sum_util, cpu_cap);
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 97bc26e5c8af..07b7c50bd987 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2895,7 +2895,7 @@ enum cpu_util_type {
};
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
- unsigned long max, enum cpu_util_type type,
+ enum cpu_util_type type,
struct task_struct *p);
static inline unsigned long cpu_bw_dl(struct rq *rq)
--
2.36.1.124.g0e6072fb45-goog
From: Dietmar Eggemann <[email protected]>
The Perf Domain (PD) cpumask (struct em_perf_domain.cpus) stays
invariant after Energy Model creation, i.e. it is not updated after
CPU hotplug operations.
That's why the PD mask is used in conjunction with the cpu_online_mask
(or Sched Domain cpumask). Thereby the cpu_online_mask is fetched
multiple times (in compute_energy()) during a run-queue selection
for a task.
cpu_online_mask may change during this time which can lead to wrong
energy calculations.
To be able to avoid this, use the select_rq_mask per-cpu cpumask to
create a cpumask out of PD cpumask and cpu_online_mask and pass it
through the function calls of the EAS run-queue selection path.
The PD cpumask for max_spare_cap_cpu/compute_prev_delta selection
(find_energy_efficient_cpu()) is now ANDed not only with the SD mask
but also with the cpu_online_mask. This is fine since this cpumask
has to be in sync with the one used for energy computation
(compute_energy()).
An exclusive cpuset setup with at least one asymmetric CPU capacity
island (hence the additional AND with the SD cpumask) is the obvious
exception here.
Signed-off-by: Dietmar Eggemann <[email protected]>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d7bba2f1da2..57074f27c0d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6700,14 +6700,14 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
* task.
*/
static long
-compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
+compute_energy(struct task_struct *p, int dst_cpu, struct cpumask *cpus,
+ struct perf_domain *pd)
{
- struct cpumask *pd_mask = perf_domain_span(pd);
unsigned long max_util = 0, sum_util = 0, cpu_cap;
int cpu;
- cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
- cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
+ cpu_cap = arch_scale_cpu_capacity(cpumask_first(cpus));
+ cpu_cap -= arch_scale_thermal_pressure(cpumask_first(cpus));
/*
* The capacity state of CPUs of the current rd can be driven by CPUs
@@ -6718,7 +6718,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* If an entire pd is outside of the current rd, it will not appear in
* its pd list and will not be accounted by compute_energy().
*/
- for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
+ for_each_cpu(cpu, cpus) {
unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
unsigned long cpu_util, util_running = util_freq;
struct task_struct *tsk = NULL;
@@ -6805,6 +6805,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
*/
static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
int cpu, best_energy_cpu = prev_cpu, target = -1;
@@ -6839,7 +6840,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
unsigned long base_energy_pd;
int max_spare_cap_cpu = -1;
- for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
+ cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
+
+ for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
continue;
@@ -6876,12 +6879,12 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
continue;
/* Compute the 'base' energy of the pd, without @p */
- base_energy_pd = compute_energy(p, -1, pd);
+ base_energy_pd = compute_energy(p, -1, cpus, pd);
base_energy += base_energy_pd;
/* Evaluate the energy impact of using prev_cpu. */
if (compute_prev_delta) {
- prev_delta = compute_energy(p, prev_cpu, pd);
+ prev_delta = compute_energy(p, prev_cpu, cpus, pd);
if (prev_delta < base_energy_pd)
goto unlock;
prev_delta -= base_energy_pd;
@@ -6890,7 +6893,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
/* Evaluate the energy impact of using max_spare_cap_cpu. */
if (max_spare_cap_cpu >= 0) {
- cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
+ cur_delta = compute_energy(p, max_spare_cap_cpu, cpus,
+ pd);
if (cur_delta < base_energy_pd)
goto unlock;
cur_delta -= base_energy_pd;
--
2.36.1.124.g0e6072fb45-goog
From: Vincent Donnefort <[email protected]>
The energy estimation in find_energy_efficient_cpu() (feec()) relies on
the computation of the effective utilization for each CPU of a perf domain
(PD). This effective utilization is then used as an estimation of the busy
time for this PD. The function effective_cpu_util(), which gives this value,
scales the utilization relative to IRQ pressure on the CPU to take into
account that the IRQ time is hidden from the task clock. The IRQ scaling is
as follows:
effective_cpu_util = irq + (cpu_cap - irq)/cpu_cap * util
Where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of
the CPU and irq the IRQ avg time.
If we now take as an example a task placement which doesn't raise the OPP
on the candidate CPU, we can write the energy delta as:
delta = OPPcost/cpu_cap * (effective_cpu_util(cpu_util + task_util) -
effective_cpu_util(cpu_util))
= OPPcost/cpu_cap * (cpu_cap - irq)/cpu_cap * task_util
We end up with an energy delta that depends on the IRQ avg time, which is a
problem: first, the time spent on IRQs by a CPU has no effect on the
additional energy that would be consumed by the task. Second, we don't want
to favour a CPU with a higher IRQ avg time.
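As a purely illustrative example (made-up numbers, not from any platform),
take cpu_cap = 1024 and task_util = 100, and compare two candidate CPUs:

  irq = 0:    delta = OPPcost/1024 * (1024 - 0)/1024   * 100 = OPPcost/1024 * 100
  irq = 256:  delta = OPPcost/1024 * (1024 - 256)/1024 * 100 = OPPcost/1024 * 75

The same task therefore looks 25% cheaper on the CPU with the higher IRQ
time, which is exactly the bias described above.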
Nonetheless, we need to take the IRQ avg time into account. If a task
placement raises the PD's frequency, it will increase the energy cost for
the entire time where the CPU is busy. A solution is to only use
effective_cpu_util() with the CPU contribution part. The task contribution
is added separately and scaled according to prev_cpu's IRQ time.
No change for the FREQUENCY_UTIL component of the energy estimation. We
still want to get the actual frequency that would be selected after the
task placement.
Signed-off-by: Vincent Donnefort <[email protected]>
Signed-off-by: Vincent Donnefort <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57074f27c0d2..5586b6848858 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6693,61 +6693,96 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
}
/*
- * compute_energy(): Estimates the energy that @pd would consume if @p was
- * migrated to @dst_cpu. compute_energy() predicts what will be the utilization
- * landscape of @pd's CPUs after the task migration, and uses the Energy Model
- * to compute what would be the energy if we decided to actually migrate that
- * task.
+ * energy_env - Utilization landscape for energy estimation.
+ * @task_busy_time: Utilization contribution by the task for which we test the
+ * placement. Given by eenv_task_busy_time().
+ * @pd_busy_time: Utilization of the whole perf domain without the task
+ * contribution. Given by eenv_pd_busy_time().
+ * @cpu_cap: Maximum CPU capacity for the perf domain.
+ * @pd_cap: Entire perf domain capacity. (pd->nr_cpus * cpu_cap).
+ */
+struct energy_env {
+ unsigned long task_busy_time;
+ unsigned long pd_busy_time;
+ unsigned long cpu_cap;
+ unsigned long pd_cap;
+};
+
+/*
+ * Compute the task busy time for compute_energy(). This time cannot be
+ * injected directly into effective_cpu_util() because of the IRQ scaling.
+ * The latter only makes sense with the most recent CPUs where the task has
+ * run.
+ */
+static inline void eenv_task_busy_time(struct energy_env *eenv,
+ struct task_struct *p, int prev_cpu)
+{
+ unsigned long busy_time, max_cap = arch_scale_cpu_capacity(prev_cpu);
+ unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu));
+
+ if (unlikely(irq >= max_cap))
+ busy_time = max_cap;
+ else
+ busy_time = scale_irq_capacity(task_util_est(p), irq, max_cap);
+
+ eenv->task_busy_time = busy_time;
+}
+
+/*
+ * Compute the perf_domain (PD) busy time for compute_energy(). Based on the
+ * utilization for each @pd_cpus, it however doesn't take into account
+ * clamping since the ratio (utilization / cpu_capacity) is already enough to
+ * scale the EM reported power consumption at the (eventually clamped)
+ * cpu_capacity.
+ *
+ * The contribution of the task @p for which we want to estimate the
+ * energy cost is removed (by cpu_util_next()) and must be calculated
+ * separately (see eenv_task_busy_time). This ensures:
+ *
+ * - A stable PD utilization, no matter which CPU of that PD we want to place
+ * the task on.
+ *
+ * - A fair comparison between CPUs as the task contribution (task_util())
+ * will always be the same no matter which CPU utilization we rely on
+ * (util_avg or util_est).
+ *
+ * Set @eenv busy time for the PD that spans @pd_cpus. This busy time can't
+ * exceed @eenv->pd_cap.
*/
-static long
-compute_energy(struct task_struct *p, int dst_cpu, struct cpumask *cpus,
- struct perf_domain *pd)
+static inline void eenv_pd_busy_time(struct energy_env *eenv,
+ struct cpumask *pd_cpus,
+ struct task_struct *p)
{
- unsigned long max_util = 0, sum_util = 0, cpu_cap;
+ unsigned long busy_time = 0;
int cpu;
- cpu_cap = arch_scale_cpu_capacity(cpumask_first(cpus));
- cpu_cap -= arch_scale_thermal_pressure(cpumask_first(cpus));
+ for_each_cpu(cpu, pd_cpus) {
+ unsigned long util = cpu_util_next(cpu, p, -1);
- /*
- * The capacity state of CPUs of the current rd can be driven by CPUs
- * of another rd if they belong to the same pd. So, account for the
- * utilization of these CPUs too by masking pd with cpu_online_mask
- * instead of the rd span.
- *
- * If an entire pd is outside of the current rd, it will not appear in
- * its pd list and will not be accounted by compute_energy().
- */
- for_each_cpu(cpu, cpus) {
- unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
- unsigned long cpu_util, util_running = util_freq;
- struct task_struct *tsk = NULL;
+ busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
+ }
- /*
- * When @p is placed on @cpu:
- *
- * util_running = max(cpu_util, cpu_util_est) +
- * max(task_util, _task_util_est)
- *
- * while cpu_util_next is: max(cpu_util + task_util,
- * cpu_util_est + _task_util_est)
- */
- if (cpu == dst_cpu) {
- tsk = p;
- util_running =
- cpu_util_next(cpu, p, -1) + task_util_est(p);
- }
+ eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
+}
- /*
- * Busy time computation: utilization clamping is not
- * required since the ratio (sum_util / cpu_capacity)
- * is already enough to scale the EM reported power
- * consumption at the (eventually clamped) cpu_capacity.
- */
- cpu_util = effective_cpu_util(cpu, util_running, ENERGY_UTIL,
- NULL);
+/*
+ * Compute the maximum utilization for compute_energy() when the task @p
+ * is placed on the cpu @dst_cpu.
+ *
+ * Returns the maximum utilization among @eenv->cpus. This utilization can't
+ * exceed @eenv->cpu_cap.
+ */
+static inline unsigned long
+eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
+ struct task_struct *p, int dst_cpu)
+{
+ unsigned long max_util = 0;
+ int cpu;
- sum_util += min(cpu_util, cpu_cap);
+ for_each_cpu(cpu, pd_cpus) {
+ struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
+ unsigned long util = cpu_util_next(cpu, p, dst_cpu);
+ unsigned long cpu_util;
/*
* Performance domain frequency: utilization clamping
@@ -6756,12 +6791,29 @@ compute_energy(struct task_struct *p, int dst_cpu, struct cpumask *cpus,
* NOTE: in case RT tasks are running, by default the
* FREQUENCY_UTIL's utilization can be max OPP.
*/
- cpu_util = effective_cpu_util(cpu, util_freq, FREQUENCY_UTIL,
- tsk);
- max_util = max(max_util, min(cpu_util, cpu_cap));
+ cpu_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
+ max_util = max(max_util, cpu_util);
}
- return em_cpu_energy(pd->em_pd, max_util, sum_util, cpu_cap);
+ return min(max_util, eenv->cpu_cap);
+}
+
+/*
+ * compute_energy(): Use the Energy Model to estimate the energy that @pd would
+ * consume for a given utilization landscape @eenv. If @dst_cpu < 0 the task
+ * contribution is removed from the energy estimation.
+ */
+static inline unsigned long
+compute_energy(struct energy_env *eenv, struct perf_domain *pd,
+ struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
+{
+ unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
+ unsigned long busy_time = eenv->pd_busy_time;
+
+ if (dst_cpu >= 0)
+ busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
+
+ return em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
}
/*
@@ -6807,11 +6859,12 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
- struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
int cpu, best_energy_cpu = prev_cpu, target = -1;
- unsigned long cpu_cap, util, base_energy = 0;
+ struct root_domain *rd = this_rq()->rd;
+ unsigned long base_energy = 0;
struct sched_domain *sd;
struct perf_domain *pd;
+ struct energy_env eenv;
rcu_read_lock();
pd = rcu_dereference(rd->pd);
@@ -6834,22 +6887,36 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (!task_util_est(p))
goto unlock;
+ eenv_task_busy_time(&eenv, p, prev_cpu);
+
for (; pd; pd = pd->next) {
- unsigned long cur_delta, spare_cap, max_spare_cap = 0;
+ unsigned long cpu_cap, cpu_thermal_cap, util;
+ unsigned long cur_delta, max_spare_cap = 0;
bool compute_prev_delta = false;
unsigned long base_energy_pd;
int max_spare_cap_cpu = -1;
cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
- for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
+ /* Account thermal pressure for the energy estimation */
+ cpu = cpumask_first(cpus);
+ cpu_thermal_cap = arch_scale_cpu_capacity(cpu);
+ cpu_thermal_cap -= arch_scale_thermal_pressure(cpu);
+
+ eenv.cpu_cap = cpu_thermal_cap;
+ eenv.pd_cap = 0;
+
+ for_each_cpu(cpu, cpus) {
+ eenv.pd_cap += cpu_thermal_cap;
+
+ if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
+ continue;
+
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
continue;
util = cpu_util_next(cpu, p, cpu);
cpu_cap = capacity_of(cpu);
- spare_cap = cpu_cap;
- lsub_positive(&spare_cap, util);
/*
* Skip CPUs that cannot satisfy the capacity request.
@@ -6862,15 +6929,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (!fits_capacity(util, cpu_cap))
continue;
+ lsub_positive(&cpu_cap, util);
+
if (cpu == prev_cpu) {
/* Always use prev_cpu as a candidate. */
compute_prev_delta = true;
- } else if (spare_cap > max_spare_cap) {
+ } else if (cpu_cap > max_spare_cap) {
/*
* Find the CPU with the maximum spare capacity
* in the performance domain.
*/
- max_spare_cap = spare_cap;
+ max_spare_cap = cpu_cap;
max_spare_cap_cpu = cpu;
}
}
@@ -6878,13 +6947,15 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (max_spare_cap_cpu < 0 && !compute_prev_delta)
continue;
+ eenv_pd_busy_time(&eenv, cpus, p);
/* Compute the 'base' energy of the pd, without @p */
- base_energy_pd = compute_energy(p, -1, cpus, pd);
+ base_energy_pd = compute_energy(&eenv, pd, cpus, p, -1);
base_energy += base_energy_pd;
/* Evaluate the energy impact of using prev_cpu. */
if (compute_prev_delta) {
- prev_delta = compute_energy(p, prev_cpu, cpus, pd);
+ prev_delta = compute_energy(&eenv, pd, cpus, p,
+ prev_cpu);
if (prev_delta < base_energy_pd)
goto unlock;
prev_delta -= base_energy_pd;
@@ -6893,8 +6964,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
/* Evaluate the energy impact of using max_spare_cap_cpu. */
if (max_spare_cap_cpu >= 0) {
- cur_delta = compute_energy(p, max_spare_cap_cpu, cpus,
- pd);
+ cur_delta = compute_energy(&eenv, pd, cpus, p,
+ max_spare_cap_cpu);
if (cur_delta < base_energy_pd)
goto unlock;
cur_delta -= base_energy_pd;
--
2.36.1.124.g0e6072fb45-goog
From: Vincent Donnefort <[email protected]>
find_energy_efficient_cpu() integrates a margin to protect tasks from
bouncing back and forth between CPUs. This margin is set to 6% of the total
current energy estimated for the system. This however does not work, for two
reasons:
1. The energy estimation is not a good absolute value:
compute_energy() used in feec() is a good estimation for task placement as
it allows comparing the energy with and without a task. The computed delta
will give a good overview of the cost for a certain task placement. It,
however, doesn't work as an absolute estimation of the total energy of the
system. First, it adds the contribution of idle CPUs into the energy;
second, it mixes util_avg with util_est values. util_avg contains the near
history of a CPU's usage; it doesn't tell at all what the current
utilization is. A system that has been quite busy in the near past will
hold a very high energy estimate and therefore a high margin, preventing
any task migration to a lower-capacity CPU and wasting energy. It even
creates a self-reinforcing loop: by holding the tasks on a less efficient
CPU, the margin contributes to keeping the energy high.
2. The margin handicaps small tasks:
On a system where the workload is composed mostly of small tasks (which is
often the case on Android), the overall energy will be high enough to
create a margin none of those tasks can cross. On a Pixel4, a small
utilization of 5% on all the CPUs creates a global estimated energy of 140
joules, as per the Energy Model declaration of that same device. This
means that, after applying the 6% margin, any migration must save more than
8 joules to happen. No task with a utilization lower than 40 would then be
able to migrate away from the biggest CPU of the system.
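For illustration, reading the "6%" as the (prev_delta + base_energy) >> 4
check removed below, i.e. roughly 1/16 of the estimated energy:

  140 J / 16 ~= 8.75 J

so a candidate placement has to save on the order of 8-9 joules over
prev_cpu before the migration is accepted, which a task with a utilization
below ~40 cannot produce with this device's Energy Model.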
The 6% of the overall system energy was brought by the following patch:
(eb92692b2544 sched/fair: Speed-up energy-aware wake-ups)
It was previously 6% of the prev_cpu energy. Also, the following one
made this margin value conditional on the clusters where the task fits:
(8d4c97c105ca sched/fair: Only compute base_energy_pd if necessary)
We could simply revert that margin change to what it was, but the original
version didn't have strong grounds either, and as demonstrated in (1.) the
estimated energy isn't a good absolute value. Instead, remove it completely.
This is made possible by recent changes that improved energy estimation
comparison fairness (sched/fair: Remove task_util from effective utilization
in feec()) (PM: EM: Increase energy calculation precision) and task
utilization stabilization (sched/fair: Decay task util_avg during migration).
Without a margin, we could have feared bouncing between CPUs. But running
LISA's eas_behaviour test coverage on three different platforms (Hikey960,
RB-5 and DB-845) showed no issue.
Removing the energy margin enables more energy-optimized placements for a
more energy efficient system.
Signed-off-by: Vincent Donnefort <[email protected]>
Signed-off-by: Vincent Donnefort <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5586b6848858..92907b384265 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6859,9 +6859,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
- int cpu, best_energy_cpu = prev_cpu, target = -1;
struct root_domain *rd = this_rq()->rd;
- unsigned long base_energy = 0;
+ int cpu, best_energy_cpu, target = -1;
struct sched_domain *sd;
struct perf_domain *pd;
struct energy_env eenv;
@@ -6893,8 +6892,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
unsigned long cpu_cap, cpu_thermal_cap, util;
unsigned long cur_delta, max_spare_cap = 0;
bool compute_prev_delta = false;
- unsigned long base_energy_pd;
int max_spare_cap_cpu = -1;
+ unsigned long base_energy;
cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
@@ -6949,16 +6948,15 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
eenv_pd_busy_time(&eenv, cpus, p);
/* Compute the 'base' energy of the pd, without @p */
- base_energy_pd = compute_energy(&eenv, pd, cpus, p, -1);
- base_energy += base_energy_pd;
+ base_energy = compute_energy(&eenv, pd, cpus, p, -1);
/* Evaluate the energy impact of using prev_cpu. */
if (compute_prev_delta) {
prev_delta = compute_energy(&eenv, pd, cpus, p,
prev_cpu);
- if (prev_delta < base_energy_pd)
+ if (prev_delta < base_energy)
goto unlock;
- prev_delta -= base_energy_pd;
+ prev_delta -= base_energy;
best_delta = min(best_delta, prev_delta);
}
@@ -6966,9 +6964,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (max_spare_cap_cpu >= 0) {
cur_delta = compute_energy(&eenv, pd, cpus, p,
max_spare_cap_cpu);
- if (cur_delta < base_energy_pd)
+ if (cur_delta < base_energy)
goto unlock;
- cur_delta -= base_energy_pd;
+ cur_delta -= base_energy;
if (cur_delta < best_delta) {
best_delta = cur_delta;
best_energy_cpu = max_spare_cap_cpu;
@@ -6977,12 +6975,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
}
rcu_read_unlock();
- /*
- * Pick the best CPU if prev_cpu cannot be used, or if it saves at
- * least 6% of the energy used by prev_cpu.
- */
- if ((prev_delta == ULONG_MAX) ||
- (prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
+ if (best_delta < prev_delta)
target = best_energy_cpu;
return target;
--
2.36.1.124.g0e6072fb45-goog
- Vincent Donnefort <[email protected]>
On 23/05/2022 17:51, Vincent Donnefort wrote:
> From: Vincent Donnefort <[email protected]>
[...]
> @@ -6834,22 +6887,36 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> if (!task_util_est(p))
> goto unlock;
>
> + eenv_task_busy_time(&eenv, p, prev_cpu);
> +
> for (; pd; pd = pd->next) {
> - unsigned long cur_delta, spare_cap, max_spare_cap = 0;
> + unsigned long cpu_cap, cpu_thermal_cap, util;
> + unsigned long cur_delta, max_spare_cap = 0;
> bool compute_prev_delta = false;
> unsigned long base_energy_pd;
> int max_spare_cap_cpu = -1;
>
> cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
Internal EAS testing of this patch-set version has revealed that this
doesn't work against the LTP CPU hotplug stress test. `struct cpumask *cpus`
can't be used when it is empty. This can happen in case all of the PD's CPUs
are hotplugged out, since we `and` the invariant PD cpumask with
cpu_online_mask. We need a:
+ if (cpumask_empty(cpus))
+ continue;
+
here.
> - for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
> + /* Account thermal pressure for the energy estimation */
> + cpu = cpumask_first(cpus);
> + cpu_thermal_cap = arch_scale_cpu_capacity(cpu);
> + cpu_thermal_cap -= arch_scale_thermal_pressure(cpu);
[...]
- Vincent Donnefort <[email protected]>
On 23/05/2022 17:51, Vincent Donnefort wrote:
> From: Vincent Donnefort <[email protected]>
[...]
> [1] https://lore.kernel.org/all/[email protected]/
minor:
I get `WARNING: Possible unwrapped commit description (prefer a maximum
75 chars per line)`. If you use
https://lkml.kernel.org/r/[email protected]
this warning disappears.
[...]
> +static inline void migrate_se_pelt_lag(struct sched_entity *se)
> +{
> + u64 throttled = 0, now, lut;
> + struct cfs_rq *cfs_rq;
> + struct rq *rq;
> + bool is_idle;
> +
> + if (load_avg_is_decayed(&se->avg))
> + return;
> +
> + cfs_rq = cfs_rq_of(se);
> + rq = rq_of(cfs_rq);
> +
> + rcu_read_lock();
> + is_idle = is_idle_task(rcu_dereference(rq->curr));
> + rcu_read_unlock();
> +
> + /*
> + * The lag estimation comes with a cost we don't want to pay all the
> + * time. Hence, limiting to the case where the source CPU is idle and
> + * we know we are at the greatest risk to have an outdated clock.
> + */
> + if (!is_idle)
> + return;
> +
> + /*
> + * Estimated "now" is: last_update_time + cfs_idle_lag + rq_idle_lag, where:
> + *
> + * last_update_time (the cfs_rq's last_update_time)
> + * = cfs_rq_clock_pelt()
> + * = rq_clock_pelt() - cfs->throttled_clock_pelt_time
So this line is always:
= rq_clock_pelt()@cfs_rq_idle -
cfs->throttled_clock_pelt_time@cfs_rq_idle
since we only execute this code when idle. Which then IMHO explains (1)
better.
> + *
> + * cfs_idle_lag (delta between cfs_rq's update and rq's update)
> + * = rq_clock_pelt()@rq_idle - rq_clock_pelt()@cfs_rq_idle
> + *
> + * rq_idle_lag (delta between rq's update and now)
> + * = sched_clock_cpu() - rq_clock()@rq_idle
> + *
> + * The rq_clock_pelt() from last_update_time being the same as
> + * rq_clock_pelt()@cfs_rq_idle, we can write:
--> (1) ^^^
> + *
> + * now = rq_clock_pelt()@rq_idle - cfs->throttled_clock_pelt_time +
> + * sched_clock_cpu() - rq_clock()@rq_idle
> + * Where:
> + * rq_clock_pelt()@rq_idle is rq->clock_pelt_idle
> + * rq_clock()@rq_idle is rq->enter_idle
> + * cfs->throttled_clock_pelt_time is cfs_rq->throttled_pelt_idle
To understand this better:
cfs->throttled_clock_pelt_time@cfs_rq_idle is
cfs_rq->throttled_pelt_idle
[...]
> + /*
> + * Paired with _update_idle_rq_clock_pelt. It ensures at the worst case
minor:
s/_update_idle_rq_clock_pelt/_update_idle_rq_clock_pelt()
> + * is observed the old clock_pelt_idle value and the new enter_idle,
> + * which lead to an understimation. The opposite would lead to an
s/understimation/underestimation
[...]
> @@ -8114,6 +8212,10 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
> if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
> update_tg_load_avg(cfs_rq);
>
> + /* sync clock_pelt_idle with last update */
update_idle_cfs_rq_clock_pelt() syncs cfs_rq->throttled_pelt_idle with
cfs_rq->throttled_clock_pelt_time. Not sure what `clock_pelt_idle` and
`last update` here mean?
[...]
> +/* The rq is idle, we can sync to clock_task */
> +static inline void _update_idle_rq_clock_pelt(struct rq *rq)
> +{
> + rq->clock_pelt = rq_clock_task(rq);
> +
> + u64_u32_store(rq->enter_idle, rq_clock(rq));
> + /* Paired with smp_rmb in migrate_se_pelt_lag */
minor:
s/migrate_se_pelt_lag/migrate_se_pelt_lag()
[...]
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index bf4a0ec98678..97bc26e5c8af 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -648,6 +648,10 @@ struct cfs_rq {
> int runtime_enabled;
> s64 runtime_remaining;
>
> + u64 throttled_pelt_idle;
> +#ifndef CONFIG_64BIT
> + u64 throttled_pelt_idle_copy;
> +#endif
> u64 throttled_clock;
> u64 throttled_clock_pelt;
> u64 throttled_clock_pelt_time;
> @@ -1020,6 +1024,12 @@ struct rq {
> u64 clock_task ____cacheline_aligned;
> u64 clock_pelt;
> unsigned long lost_idle_time;
> + u64 clock_pelt_idle;
> + u64 enter_idle;
> +#ifndef CONFIG_64BIT
> + u64 clock_pelt_idle_copy;
> + u64 enter_idle_copy;
> +#endif
>
> atomic_t nr_iowait;
`throttled_pelt_idle`, `clock_pelt_idle` and `enter_idle` are clock
snapshots taken when the cfs_rq resp. the rq goes idle. But the naming does
not really show this relation, and that makes reading those equations rather
difficult. What about something like `throttled_clock_pelt_time_enter_idle`,
`clock_pelt_enter_idle`, `clock_enter_idle`? Especially the first one is too
long, but something which shows that those are clock snapshots taken when
entering idle would IMHO improve readability in migrate_se_pelt_lag().
Besides these small issues:
Reviewed-by: Dietmar Eggemann <[email protected]>
On Mon, 23 May 2022 at 17:52, Vincent Donnefort <[email protected]> wrote:
>
> From: Vincent Donnefort <[email protected]>
>
> Before being migrated to a new CPU, a task sees its PELT values
> synchronized with rq last_update_time. Once done, that same task will also
it's the cfs_rq last_update_time not the rq
> have its sched_avg last_update_time reset. This means the time between
> the migration and the last clock update will not be accounted for in
> util_avg and a discontinuity will appear. This issue is amplified by the
> PELT clock scaling. It takes currently one tick after the CPU being idle
> to let clock_pelt catching up clock_task.
>
> This is especially problematic for asymmetric CPU capacity systems which
> need stable util_avg signals for task placement and energy estimation.
>
> Ideally, this problem would be solved by updating the runqueue clocks
> before the migration. But that would require taking the runqueue lock
> which is quite expensive [1]. Instead estimate the missing time and update
> the task util_avg with that value.
>
> To that end, we need sched_clock_cpu() but it is a costly function. Limit
> the usage to the case where the source CPU is idle as we know this is when
> the clock is having the biggest risk of being outdated. In this such case,
> let's call it cfs_idle_lag the delta time between the rq_clock_pelt value
> at rq idle and cfs_rq idle. And rq_idle_lag the delta between "now" and
> the rq_clock_pelt at rq idle.
>
> The estimated PELT clock is then:
>
> last_update_time + (the cfs_rq's last_update_time)
> cfs_idle_lag + (delta between cfs_rq's update and rq's update)
> rq_idle_lag (delta between rq's update and now)
moving the '+' to the beginning of the line makes it easier to understand:
last_update_time (the cfs_rq's last_update_time)
+ cfs_idle_lag (delta between cfs_rq's update and rq's update)
+ rq_idle_lag (delta between rq's update and now)
>
> last_update_time = cfs_rq_clock_pelt()
> = rq_clock_pelt() - cfs->throttled_clock_pelt_time
>
> cfs_idle_lag = rq_clock_pelt()@rq_idle -
> rq_clock_pelt()@cfs_rq_idle
>
> rq_idle_lag = sched_clock_cpu() - rq_clock()@rq_idle
>
> The rq_clock_pelt() from last_update_time being the same as
> rq_clock_pelt()@cfs_rq_idle, we can write:
>
> estimation = rq_clock_pelt()@rq_idle - cfs->throttled_clock_pelt_time +
> sched_clock_cpu() - rq_clock()@rq_idle
>
> The clocks being not accessible without the rq lock taken, some timestamps
> are created:
>
> rq_clock_pelt()@rq_idle is rq->clock_pelt_idle
> rq_clock()@rq_idle is rq->enter_idle
> cfs->throttled_clock_pelt_time is cfs_rq->throttled_pelt_idle
>
> The rq_idle_lag part of the missing time is however an estimation that
> doesn't take into account IRQ and Paravirt time.
>
> [1] https://lore.kernel.org/all/[email protected]/
>
> Signed-off-by: Vincent Donnefort <[email protected]>
> Signed-off-by: Vincent Donnefort <[email protected]>
minor typo but otherwise
Reviewed-by: Vincent Guittot <[email protected]>
Thanks
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 05614d9b919c..df5e6e565b4d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3310,6 +3310,29 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
> }
>
> #ifdef CONFIG_SMP
> +static inline bool load_avg_is_decayed(struct sched_avg *sa)
> +{
> + if (sa->load_sum)
> + return false;
> +
> + if (sa->util_sum)
> + return false;
> +
> + if (sa->runnable_sum)
> + return false;
> +
> + /*
> + * _avg must be null when _sum are null because _avg = _sum / divider
> + * Make sure that rounding and/or propagation of PELT values never
> + * break this.
> + */
> + SCHED_WARN_ON(sa->load_avg ||
> + sa->util_avg ||
> + sa->runnable_avg);
> +
> + return true;
> +}
> +
> static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
> {
> return u64_u32_load_copy(cfs_rq->avg.last_update_time,
> @@ -3347,27 +3370,12 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> if (cfs_rq->load.weight)
> return false;
>
> - if (cfs_rq->avg.load_sum)
> - return false;
> -
> - if (cfs_rq->avg.util_sum)
> - return false;
> -
> - if (cfs_rq->avg.runnable_sum)
> + if (!load_avg_is_decayed(&cfs_rq->avg))
> return false;
>
> if (child_cfs_rq_on_list(cfs_rq))
> return false;
>
> - /*
> - * _avg must be null when _sum are null because _avg = _sum / divider
> - * Make sure that rounding and/or propagation of PELT values never
> - * break this.
> - */
> - SCHED_WARN_ON(cfs_rq->avg.load_avg ||
> - cfs_rq->avg.util_avg ||
> - cfs_rq->avg.runnable_avg);
> -
> return true;
> }
>
> @@ -3706,6 +3714,88 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
>
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> +#ifdef CONFIG_NO_HZ_COMMON
> +static inline void migrate_se_pelt_lag(struct sched_entity *se)
> +{
> + u64 throttled = 0, now, lut;
> + struct cfs_rq *cfs_rq;
> + struct rq *rq;
> + bool is_idle;
> +
> + if (load_avg_is_decayed(&se->avg))
> + return;
> +
> + cfs_rq = cfs_rq_of(se);
> + rq = rq_of(cfs_rq);
> +
> + rcu_read_lock();
> + is_idle = is_idle_task(rcu_dereference(rq->curr));
> + rcu_read_unlock();
> +
> + /*
> + * The lag estimation comes with a cost we don't want to pay all the
> + * time. Hence, limiting to the case where the source CPU is idle and
> + * we know we are at the greatest risk to have an outdated clock.
> + */
> + if (!is_idle)
> + return;
> +
> + /*
> + * Estimated "now" is: last_update_time + cfs_idle_lag + rq_idle_lag, where:
> + *
> + * last_update_time (the cfs_rq's last_update_time)
> + * = cfs_rq_clock_pelt()
> + * = rq_clock_pelt() - cfs->throttled_clock_pelt_time
> + *
> + * cfs_idle_lag (delta between cfs_rq's update and rq's update)
> + * = rq_clock_pelt()@rq_idle - rq_clock_pelt()@cfs_rq_idle
> + *
> + * rq_idle_lag (delta between rq's update and now)
> + * = sched_clock_cpu() - rq_clock()@rq_idle
> + *
> + * The rq_clock_pelt() from last_update_time being the same as
> + * rq_clock_pelt()@cfs_rq_idle, we can write:
> + *
> + * now = rq_clock_pelt()@rq_idle - cfs->throttled_clock_pelt_time +
> + * sched_clock_cpu() - rq_clock()@rq_idle
> + * Where:
> + * rq_clock_pelt()@rq_idle is rq->clock_pelt_idle
> + * rq_clock()@rq_idle is rq->enter_idle
> + * cfs->throttled_clock_pelt_time is cfs_rq->throttled_pelt_idle
> + */
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> + throttled = u64_u32_load(cfs_rq->throttled_pelt_idle);
> + /* The clock has been stopped for throttling */
> + if (throttled == U64_MAX)
> + return;
> +#endif
> + now = u64_u32_load(rq->clock_pelt_idle);
> + /*
> + * Paired with _update_idle_rq_clock_pelt. It ensures at the worst case
> + * is observed the old clock_pelt_idle value and the new enter_idle,
> + * which lead to an understimation. The opposite would lead to an
s/understimation/underestimation/
> + * overestimation.
> + */
> + smp_rmb();
> + lut = cfs_rq_last_update_time(cfs_rq);
> +
> + now -= throttled;
> + if (now < lut)
> + /*
> + * cfs_rq->avg.last_update_time is more recent than our
> + * estimation, let's use it.
> + */
> + now = lut;
> + else
> + now += sched_clock_cpu(cpu_of(rq)) - u64_u32_load(rq->enter_idle);
> +
> + __update_load_avg_blocked_se(now, se);
> +}
> +#else
> +static void migrate_se_pelt_lag(struct sched_entity *se) {}
> +#endif
> +
> /**
> * update_cfs_rq_load_avg - update the cfs_rq's load/util averages
> * @now: current time, as per cfs_rq_clock_pelt()
> @@ -4437,6 +4527,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> */
> if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
> update_min_vruntime(cfs_rq);
> +
> + if (cfs_rq->nr_running == 0)
> + update_idle_cfs_rq_clock_pelt(cfs_rq);
> }
>
> /*
> @@ -6911,6 +7004,8 @@ static void detach_entity_cfs_rq(struct sched_entity *se);
> */
> static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> {
> + struct sched_entity *se = &p->se;
> +
> /*
> * As blocked tasks retain absolute vruntime the migration needs to
> * deal with this by subtracting the old and adding the new
> @@ -6918,7 +7013,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> * the task on the new runqueue.
> */
> if (READ_ONCE(p->__state) == TASK_WAKING) {
> - struct sched_entity *se = &p->se;
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
> se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
> @@ -6930,25 +7024,29 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> * rq->lock and can modify state directly.
> */
> lockdep_assert_rq_held(task_rq(p));
> - detach_entity_cfs_rq(&p->se);
> + detach_entity_cfs_rq(se);
>
> } else {
> + remove_entity_load_avg(se);
> +
> /*
> - * We are supposed to update the task to "current" time, then
> - * its up to date and ready to go to new CPU/cfs_rq. But we
> - * have difficulty in getting what current time is, so simply
> - * throw away the out-of-date time. This will result in the
> - * wakee task is less decayed, but giving the wakee more load
> - * sounds not bad.
> + * Here, the task's PELT values have been updated according to
> + * the current rq's clock. But if that clock hasn't been
> + * updated in a while, a substantial idle time will be missed,
> + * leading to an inflation after wake-up on the new rq.
> + *
> + * Estimate the missing time from the cfs_rq last_update_time
> + * and update sched_avg to improve the PELT continuity after
> + * migration.
> */
> - remove_entity_load_avg(&p->se);
> + migrate_se_pelt_lag(se);
> }
>
> /* Tell new CPU we are migrated */
> - p->se.avg.last_update_time = 0;
> + se->avg.last_update_time = 0;
>
> /* We have migrated, no longer consider this task hot */
> - p->se.exec_start = 0;
> + se->exec_start = 0;
>
> update_scan_period(p, new_cpu);
> }
> @@ -8114,6 +8212,10 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
> if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
> update_tg_load_avg(cfs_rq);
>
> + /* sync clock_pelt_idle with last update */
> + if (cfs_rq->nr_running == 0)
> + update_idle_cfs_rq_clock_pelt(cfs_rq);
> +
> if (cfs_rq == &rq->cfs)
> decayed = true;
> }
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index 4ff2ed4f8fa1..647e5fcc041b 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -61,6 +61,25 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
> WRITE_ONCE(avg->util_est.enqueued, enqueued);
> }
>
> +static inline u64 rq_clock_pelt(struct rq *rq)
> +{
> + lockdep_assert_rq_held(rq);
> + assert_clock_updated(rq);
> +
> + return rq->clock_pelt - rq->lost_idle_time;
> +}
> +
> +/* The rq is idle, we can sync to clock_task */
> +static inline void _update_idle_rq_clock_pelt(struct rq *rq)
> +{
> + rq->clock_pelt = rq_clock_task(rq);
> +
> + u64_u32_store(rq->enter_idle, rq_clock(rq));
> + /* Paired with smp_rmb in migrate_se_pelt_lag */
> + smp_wmb();
> + u64_u32_store(rq->clock_pelt_idle, rq_clock_pelt(rq));
> +}
> +
> /*
> * The clock_pelt scales the time to reflect the effective amount of
> * computation done during the running delta time but then sync back to
> @@ -76,8 +95,7 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
> static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
> {
> if (unlikely(is_idle_task(rq->curr))) {
> - /* The rq is idle, we can sync to clock_task */
> - rq->clock_pelt = rq_clock_task(rq);
> + _update_idle_rq_clock_pelt(rq);
> return;
> }
>
> @@ -130,17 +148,23 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
> */
> if (util_sum >= divider)
> rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> +
> + _update_idle_rq_clock_pelt(rq);
> }
>
> -static inline u64 rq_clock_pelt(struct rq *rq)
> +#ifdef CONFIG_CFS_BANDWIDTH
> +static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
> {
> - lockdep_assert_rq_held(rq);
> - assert_clock_updated(rq);
> + u64 throttled;
>
> - return rq->clock_pelt - rq->lost_idle_time;
> + if (unlikely(cfs_rq->throttle_count))
> + throttled = U64_MAX;
> + else
> + throttled = cfs_rq->throttled_clock_pelt_time;
> +
> + u64_u32_store(cfs_rq->throttled_pelt_idle, throttled);
> }
>
> -#ifdef CONFIG_CFS_BANDWIDTH
> /* rq->task_clock normalized against any time this cfs_rq has spent throttled */
> static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
> {
> @@ -150,6 +174,7 @@ static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
> return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;
> }
> #else
> +static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) { }
> static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
> {
> return rq_clock_pelt(rq_of(cfs_rq));
> @@ -204,6 +229,7 @@ update_rq_clock_pelt(struct rq *rq, s64 delta) { }
> static inline void
> update_idle_rq_clock_pelt(struct rq *rq) { }
>
> +static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) { }
> #endif
>
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index bf4a0ec98678..97bc26e5c8af 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -648,6 +648,10 @@ struct cfs_rq {
> int runtime_enabled;
> s64 runtime_remaining;
>
> + u64 throttled_pelt_idle;
> +#ifndef CONFIG_64BIT
> + u64 throttled_pelt_idle_copy;
> +#endif
> u64 throttled_clock;
> u64 throttled_clock_pelt;
> u64 throttled_clock_pelt_time;
> @@ -1020,6 +1024,12 @@ struct rq {
> u64 clock_task ____cacheline_aligned;
> u64 clock_pelt;
> unsigned long lost_idle_time;
> + u64 clock_pelt_idle;
> + u64 enter_idle;
> +#ifndef CONFIG_64BIT
> + u64 clock_pelt_idle_copy;
> + u64 enter_idle_copy;
> +#endif
>
> atomic_t nr_iowait;
>
> --
> 2.36.1.124.g0e6072fb45-goog
>
On Mon, 23 May 2022 at 17:52, Vincent Donnefort <[email protected]> wrote:
>
> From: Dietmar Eggemann <[email protected]>
>
> effective_cpu_util() already has a `int cpu' parameter which allows to
> retrieve the CPU capacity scale factor (or maximum CPU capacity) inside
> this function via an arch_scale_cpu_capacity(cpu).
>
> A lot of code calling effective_cpu_util() (or the shim
> sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call
> arch_scale_cpu_capacity() already.
> But not having to pass it into effective_cpu_util() will make the EAS
> wake-up code easier, especially when the maximum CPU capacity reduced
> by the thermal pressure is passed through the EAS wake-up functions.
>
> Due to the asymmetric CPU capacity support of arm/arm64 architectures,
> arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via
> per_cpu(cpu_scale, cpu) on such a system.
> On all other architectures it is a compile-time constant
> (SCHED_CAPACITY_SCALE).
>
> Signed-off-by: Dietmar Eggemann <[email protected]>
Acked-by: Vincent Guittot <[email protected]>
>
> diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c
> index f5eced0842b3..6a88eb7e9f75 100644
> --- a/drivers/powercap/dtpm_cpu.c
> +++ b/drivers/powercap/dtpm_cpu.c
> @@ -71,34 +71,19 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
>
> static u64 scale_pd_power_uw(struct cpumask *pd_mask, u64 power)
> {
> - unsigned long max = 0, sum_util = 0;
> + unsigned long max, sum_util = 0;
> int cpu;
>
> - for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> -
> - /*
> - * The capacity is the same for all CPUs belonging to
> - * the same perf domain, so a single call to
> - * arch_scale_cpu_capacity() is enough. However, we
> - * need the CPU parameter to be initialized by the
> - * loop, so the call ends up in this block.
> - *
> - * We can initialize 'max' with a cpumask_first() call
> - * before the loop but the bits computation is not
> - * worth given the arch_scale_cpu_capacity() just
> - * returns a value where the resulting assembly code
> - * will be optimized by the compiler.
> - */
> - max = arch_scale_cpu_capacity(cpu);
> - sum_util += sched_cpu_util(cpu, max);
> - }
> -
> /*
> - * In the improbable case where all the CPUs of the perf
> - * domain are offline, 'max' will be zero and will lead to an
> - * illegal operation with a zero division.
> + * The capacity is the same for all CPUs belonging to
> + * the same perf domain.
> */
> - return max ? (power * ((sum_util << 10) / max)) >> 10 : 0;
> + max = arch_scale_cpu_capacity(cpumask_first(pd_mask));
> +
> + for_each_cpu_and(cpu, pd_mask, cpu_online_mask)
> + sum_util += sched_cpu_util(cpu);
> +
> + return (power * ((sum_util << 10) / max)) >> 10;
> }
>
> static u64 get_pd_power_uw(struct dtpm *dtpm)
> diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
> index b8151d95a806..b263b0fde03c 100644
> --- a/drivers/thermal/cpufreq_cooling.c
> +++ b/drivers/thermal/cpufreq_cooling.c
> @@ -137,11 +137,9 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
> static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
> int cpu_idx)
> {
> - unsigned long max = arch_scale_cpu_capacity(cpu);
> - unsigned long util;
> + unsigned long util = sched_cpu_util(cpu);
>
> - util = sched_cpu_util(cpu, max);
> - return (util * 100) / max;
> + return (util * 100) / arch_scale_cpu_capacity(cpu);
> }
> #else /* !CONFIG_SMP */
> static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c46f3a63b758..88b8817b827d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2257,7 +2257,7 @@ static inline bool owner_on_cpu(struct task_struct *owner)
> }
>
> /* Returns effective CPU energy utilization, as seen by the scheduler */
> -unsigned long sched_cpu_util(int cpu, unsigned long max);
> +unsigned long sched_cpu_util(int cpu);
> #endif /* CONFIG_SMP */
>
> #ifdef CONFIG_RSEQ
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 53596842f0d8..c531976ee960 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7107,12 +7107,14 @@ struct task_struct *idle_task(int cpu)
> * required to meet deadlines.
> */
> unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> - unsigned long max, enum cpu_util_type type,
> + enum cpu_util_type type,
> struct task_struct *p)
> {
> - unsigned long dl_util, util, irq;
> + unsigned long dl_util, util, irq, max;
> struct rq *rq = cpu_rq(cpu);
>
> + max = arch_scale_cpu_capacity(cpu);
> +
> if (!uclamp_is_used() &&
> type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
> return max;
> @@ -7192,10 +7194,9 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> return min(max, util);
> }
>
> -unsigned long sched_cpu_util(int cpu, unsigned long max)
> +unsigned long sched_cpu_util(int cpu)
> {
> - return effective_cpu_util(cpu, cpu_util_cfs(cpu), max,
> - ENERGY_UTIL, NULL);
> + return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
> }
> #endif /* CONFIG_SMP */
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 3dbf351d12d5..1207c78f85c1 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -157,11 +157,10 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> static void sugov_get_util(struct sugov_cpu *sg_cpu)
> {
> struct rq *rq = cpu_rq(sg_cpu->cpu);
> - unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
>
> - sg_cpu->max = max;
> + sg_cpu->max = arch_scale_cpu_capacity(sg_cpu->cpu);
> sg_cpu->bw_dl = cpu_bw_dl(rq);
> - sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu), max,
> + sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu),
> FREQUENCY_UTIL, NULL);
> }
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df5e6e565b4d..73a9dc522b73 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6703,12 +6703,11 @@ static long
> compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> {
> struct cpumask *pd_mask = perf_domain_span(pd);
> - unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
> - unsigned long max_util = 0, sum_util = 0;
> - unsigned long _cpu_cap = cpu_cap;
> + unsigned long max_util = 0, sum_util = 0, cpu_cap;
> int cpu;
>
> - _cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
> + cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
> + cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
>
> /*
> * The capacity state of CPUs of the current rd can be driven by CPUs
> @@ -6745,10 +6744,10 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> * is already enough to scale the EM reported power
> * consumption at the (eventually clamped) cpu_capacity.
> */
> - cpu_util = effective_cpu_util(cpu, util_running, cpu_cap,
> - ENERGY_UTIL, NULL);
> + cpu_util = effective_cpu_util(cpu, util_running, ENERGY_UTIL,
> + NULL);
>
> - sum_util += min(cpu_util, _cpu_cap);
> + sum_util += min(cpu_util, cpu_cap);
>
> /*
> * Performance domain frequency: utilization clamping
> @@ -6757,12 +6756,12 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> * NOTE: in case RT tasks are running, by default the
> * FREQUENCY_UTIL's utilization can be max OPP.
> */
> - cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
> - FREQUENCY_UTIL, tsk);
> - max_util = max(max_util, min(cpu_util, _cpu_cap));
> + cpu_util = effective_cpu_util(cpu, util_freq, FREQUENCY_UTIL,
> + tsk);
> + max_util = max(max_util, min(cpu_util, cpu_cap));
> }
>
> - return em_cpu_energy(pd->em_pd, max_util, sum_util, _cpu_cap);
> + return em_cpu_energy(pd->em_pd, max_util, sum_util, cpu_cap);
> }
>
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 97bc26e5c8af..07b7c50bd987 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2895,7 +2895,7 @@ enum cpu_util_type {
> };
>
> unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> - unsigned long max, enum cpu_util_type type,
> + enum cpu_util_type type,
> struct task_struct *p);
>
> static inline unsigned long cpu_bw_dl(struct rq *rq)
> --
> 2.36.1.124.g0e6072fb45-goog
>
On Mon, 23 May 2022 at 17:52, Vincent Donnefort <[email protected]> wrote:
>
> From: Dietmar Eggemann <[email protected]>
>
> The Perf Domain (PD) cpumask (struct em_perf_domain.cpus) stays
> invariant after Energy Model creation, i.e. it is not updated after
> CPU hotplug operations.
>
> That's why the PD mask is used in conjunction with the cpu_online_mask
> (or Sched Domain cpumask). Thereby the cpu_online_mask is fetched
> multiple times (in compute_energy()) during a run-queue selection
> for a task.
>
> cpu_online_mask may change during this time which can lead to wrong
> energy calculations.
>
> To be able to avoid this, use the select_rq_mask per-cpu cpumask to
> create a cpumask out of PD cpumask and cpu_online_mask and pass it
> through the function calls of the EAS run-queue selection path.
>
> The PD cpumask for max_spare_cap_cpu/compute_prev_delta selection
> (find_energy_efficient_cpu()) is now ANDed not only with the SD mask
> but also with the cpu_online_mask. This is fine since this cpumask
> has to be in sync with the one used for energy computation
> (compute_energy()).
> An exclusive cpuset setup with at least one asymmetric CPU capacity
> island (hence the additional AND with the SD cpumask) is the obvious
> exception here.
>
> Signed-off-by: Dietmar Eggemann <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2d7bba2f1da2..57074f27c0d2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6700,14 +6700,14 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
> * task.
> */
> static long
> -compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> +compute_energy(struct task_struct *p, int dst_cpu, struct cpumask *cpus,
> + struct perf_domain *pd)
> {
> - struct cpumask *pd_mask = perf_domain_span(pd);
> unsigned long max_util = 0, sum_util = 0, cpu_cap;
> int cpu;
>
> - cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
> - cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
> + cpu_cap = arch_scale_cpu_capacity(cpumask_first(cpus));
> + cpu_cap -= arch_scale_thermal_pressure(cpumask_first(cpus));
>
> /*
> * The capacity state of CPUs of the current rd can be driven by CPUs
> @@ -6718,7 +6718,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> * If an entire pd is outside of the current rd, it will not appear in
> * its pd list and will not be accounted by compute_energy().
> */
> - for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> + for_each_cpu(cpu, cpus) {
> unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
> unsigned long cpu_util, util_running = util_freq;
> struct task_struct *tsk = NULL;
> @@ -6805,6 +6805,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> */
> static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> {
> + struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
> struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
> int cpu, best_energy_cpu = prev_cpu, target = -1;
> @@ -6839,7 +6840,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> unsigned long base_energy_pd;
> int max_spare_cap_cpu = -1;
>
> - for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
> + cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
> +
> + for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
> if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> continue;
>
> @@ -6876,12 +6879,12 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> continue;
>
> /* Compute the 'base' energy of the pd, without @p */
> - base_energy_pd = compute_energy(p, -1, pd);
> + base_energy_pd = compute_energy(p, -1, cpus, pd);
> base_energy += base_energy_pd;
>
> /* Evaluate the energy impact of using prev_cpu. */
> if (compute_prev_delta) {
> - prev_delta = compute_energy(p, prev_cpu, pd);
> + prev_delta = compute_energy(p, prev_cpu, cpus, pd);
> if (prev_delta < base_energy_pd)
> goto unlock;
> prev_delta -= base_energy_pd;
> @@ -6890,7 +6893,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>
> /* Evaluate the energy impact of using max_spare_cap_cpu. */
> if (max_spare_cap_cpu >= 0) {
> - cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
> + cur_delta = compute_energy(p, max_spare_cap_cpu, cpus,
> + pd);
> if (cur_delta < base_energy_pd)
> goto unlock;
> cur_delta -= base_energy_pd;
> --
> 2.36.1.124.g0e6072fb45-goog
>
On Mon, 23 May 2022 at 17:52, Vincent Donnefort <[email protected]> wrote:
>
> From: Vincent Donnefort <[email protected]>
>
> The energy estimation in find_energy_efficient_cpu() (feec()) relies on
> the computation of the effective utilization for each CPU of a perf domain
> (PD). This effective utilization is then used as an estimation of the busy
> time for this pd. The function effective_cpu_util() which gives this value,
> scales the utilization relative to IRQ pressure on the CPU to take into
> account that the IRQ time is hidden from the task clock. The IRQ scaling is
> as follow:
>
> effective_cpu_util = irq + (cpu_cap - irq)/cpu_cap * util
>
> Where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of
> the CPU and irq the IRQ avg time.
>
> If now we take as an example a task placement which doesn't raise the OPP
> on the candidate CPU, we can write the energy delta as:
>
> delta = OPPcost/cpu_cap * (effective_cpu_util(cpu_util + task_util) -
> effective_cpu_util(cpu_util))
> = OPPcost/cpu_cap * (cpu_cap - irq)/cpu_cap * task_util
>
> We end-up with an energy delta depending on the IRQ avg time, which is a
> problem: first the time spent on IRQs by a CPU has no effect on the
> additional energy that would be consumed by a task. Second, we don't want
> to favour a CPU with a higher IRQ avg time value.
>
> Nonetheless, we need to take the IRQ avg time into account. If a task
> placement raises the PD's frequency, it will increase the energy cost for
> the entire time where the CPU is busy. A solution is to only use
> effective_cpu_util() with the CPU contribution part. The task contribution
> is added separately and scaled according to prev_cpu's IRQ time.
>
> No change for the FREQUENCY_UTIL component of the energy estimation. We
> still want to get the actual frequency that would be selected after the
> task placement.
>
> Signed-off-by: Vincent Donnefort <[email protected]>
> Signed-off-by: Vincent Donnefort <[email protected]>
> Reviewed-by: Dietmar Eggemann <[email protected]>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 57074f27c0d2..5586b6848858 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6693,61 +6693,96 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
> }
>
> /*
> - * compute_energy(): Estimates the energy that @pd would consume if @p was
> - * migrated to @dst_cpu. compute_energy() predicts what will be the utilization
> - * landscape of @pd's CPUs after the task migration, and uses the Energy Model
> - * to compute what would be the energy if we decided to actually migrate that
> - * task.
> + * energy_env - Utilization landscape for energy estimation.
> + * @task_busy_time: Utilization contribution by the task for which we test the
> + * placement. Given by eenv_task_busy_time().
> + * @pd_busy_time: Utilization of the whole perf domain without the task
> + * contribution. Given by eenv_pd_busy_time().
> + * @cpu_cap: Maximum CPU capacity for the perf domain.
> + * @pd_cap: Entire perf domain capacity. (pd->nr_cpus * cpu_cap).
> + */
> +struct energy_env {
> + unsigned long task_busy_time;
> + unsigned long pd_busy_time;
> + unsigned long cpu_cap;
> + unsigned long pd_cap;
> +};
> +
> +/*
> + * Compute the task busy time for compute_energy(). This time cannot be
> + * injected directly into effective_cpu_util() because of the IRQ scaling.
> + * The latter only makes sense with the most recent CPUs where the task has
> + * run.
> + */
> +static inline void eenv_task_busy_time(struct energy_env *eenv,
> + struct task_struct *p, int prev_cpu)
> +{
> + unsigned long busy_time, max_cap = arch_scale_cpu_capacity(prev_cpu);
> + unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu));
> +
> + if (unlikely(irq >= max_cap))
> + busy_time = max_cap;
> + else
> + busy_time = scale_irq_capacity(task_util_est(p), irq, max_cap);
> +
> + eenv->task_busy_time = busy_time;
> +}
> +
> +/*
> + * Compute the perf_domain (PD) busy time for compute_energy(). Based on the
> + * utilization for each @pd_cpus, it however doesn't take into account
> + * clamping since the ratio (utilization / cpu_capacity) is already enough to
> + * scale the EM reported power consumption at the (eventually clamped)
> + * cpu_capacity.
> + *
> + * The contribution of the task @p for which we want to estimate the
> + * energy cost is removed (by cpu_util_next()) and must be calculated
> + * separately (see eenv_task_busy_time). This ensures:
> + *
> + * - A stable PD utilization, no matter which CPU of that PD we want to place
> + * the task on.
> + *
> + * - A fair comparison between CPUs as the task contribution (task_util())
> + * will always be the same no matter which CPU utilization we rely on
> + * (util_avg or util_est).
> + *
> + * Set @eenv busy time for the PD that spans @pd_cpus. This busy time can't
> + * exceed @eenv->pd_cap.
> */
> -static long
> -compute_energy(struct task_struct *p, int dst_cpu, struct cpumask *cpus,
> - struct perf_domain *pd)
> +static inline void eenv_pd_busy_time(struct energy_env *eenv,
> + struct cpumask *pd_cpus,
> + struct task_struct *p)
> {
> - unsigned long max_util = 0, sum_util = 0, cpu_cap;
> + unsigned long busy_time = 0;
> int cpu;
>
> - cpu_cap = arch_scale_cpu_capacity(cpumask_first(cpus));
> - cpu_cap -= arch_scale_thermal_pressure(cpumask_first(cpus));
> + for_each_cpu(cpu, pd_cpus) {
> + unsigned long util = cpu_util_next(cpu, p, -1);
>
> - /*
> - * The capacity state of CPUs of the current rd can be driven by CPUs
> - * of another rd if they belong to the same pd. So, account for the
> - * utilization of these CPUs too by masking pd with cpu_online_mask
> - * instead of the rd span.
> - *
> - * If an entire pd is outside of the current rd, it will not appear in
> - * its pd list and will not be accounted by compute_energy().
> - */
> - for_each_cpu(cpu, cpus) {
> - unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
> - unsigned long cpu_util, util_running = util_freq;
> - struct task_struct *tsk = NULL;
> + busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
> + }
>
> - /*
> - * When @p is placed on @cpu:
> - *
> - * util_running = max(cpu_util, cpu_util_est) +
> - * max(task_util, _task_util_est)
> - *
> - * while cpu_util_next is: max(cpu_util + task_util,
> - * cpu_util_est + _task_util_est)
> - */
> - if (cpu == dst_cpu) {
> - tsk = p;
> - util_running =
> - cpu_util_next(cpu, p, -1) + task_util_est(p);
> - }
> + eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
> +}
>
> - /*
> - * Busy time computation: utilization clamping is not
> - * required since the ratio (sum_util / cpu_capacity)
> - * is already enough to scale the EM reported power
> - * consumption at the (eventually clamped) cpu_capacity.
> - */
> - cpu_util = effective_cpu_util(cpu, util_running, ENERGY_UTIL,
> - NULL);
> +/*
> + * Compute the maximum utilization for compute_energy() when the task @p
> + * is placed on the cpu @dst_cpu.
> + *
> + * Returns the maximum utilization among @eenv->cpus. This utilization can't
> + * exceed @eenv->cpu_cap.
> + */
> +static inline unsigned long
> +eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
> + struct task_struct *p, int dst_cpu)
> +{
> + unsigned long max_util = 0;
> + int cpu;
>
> - sum_util += min(cpu_util, cpu_cap);
> + for_each_cpu(cpu, pd_cpus) {
> + struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
> + unsigned long util = cpu_util_next(cpu, p, dst_cpu);
> + unsigned long cpu_util;
>
> /*
> * Performance domain frequency: utilization clamping
> @@ -6756,12 +6791,29 @@ compute_energy(struct task_struct *p, int dst_cpu, struct cpumask *cpus,
> * NOTE: in case RT tasks are running, by default the
> * FREQUENCY_UTIL's utilization can be max OPP.
> */
> - cpu_util = effective_cpu_util(cpu, util_freq, FREQUENCY_UTIL,
> - tsk);
> - max_util = max(max_util, min(cpu_util, cpu_cap));
> + cpu_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
> + max_util = max(max_util, cpu_util);
> }
>
> - return em_cpu_energy(pd->em_pd, max_util, sum_util, cpu_cap);
> + return min(max_util, eenv->cpu_cap);
> +}
> +
> +/*
> + * compute_energy(): Use the Energy Model to estimate the energy that @pd would
> + * consume for a given utilization landscape @eenv. If @dst_cpu < 0 the task
I find this comment a bit confusing because compute_energy() adds the
task contribution if dst_cpu >= 0 but doesn't remove it. The fact that
eenv->pd_busy_time has been previously computed without the
contribution of the task is outside the scope of this function,
whereas the comment suggests that the removal will happen in
compute_energy().
> + * contribution is removed from the energy estimation.
> + */
> +static inline unsigned long
> +compute_energy(struct energy_env *eenv, struct perf_domain *pd,
> + struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
> +{
> + unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
> + unsigned long busy_time = eenv->pd_busy_time;
> +
> + if (dst_cpu >= 0)
> + busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
> +
> + return em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
> }
>
> /*
> @@ -6807,11 +6859,12 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> {
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
> - struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
> int cpu, best_energy_cpu = prev_cpu, target = -1;
> - unsigned long cpu_cap, util, base_energy = 0;
> + struct root_domain *rd = this_rq()->rd;
> + unsigned long base_energy = 0;
> struct sched_domain *sd;
> struct perf_domain *pd;
> + struct energy_env eenv;
>
> rcu_read_lock();
> pd = rcu_dereference(rd->pd);
> @@ -6834,22 +6887,36 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> if (!task_util_est(p))
> goto unlock;
>
> + eenv_task_busy_time(&eenv, p, prev_cpu);
> +
> for (; pd; pd = pd->next) {
> - unsigned long cur_delta, spare_cap, max_spare_cap = 0;
> + unsigned long cpu_cap, cpu_thermal_cap, util;
> + unsigned long cur_delta, max_spare_cap = 0;
> bool compute_prev_delta = false;
> unsigned long base_energy_pd;
> int max_spare_cap_cpu = -1;
>
> cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
>
> - for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
> + /* Account thermal pressure for the energy estimation */
> + cpu = cpumask_first(cpus);
> + cpu_thermal_cap = arch_scale_cpu_capacity(cpu);
> + cpu_thermal_cap -= arch_scale_thermal_pressure(cpu);
> +
> + eenv.cpu_cap = cpu_thermal_cap;
> + eenv.pd_cap = 0;
> +
> + for_each_cpu(cpu, cpus) {
> + eenv.pd_cap += cpu_thermal_cap;
> +
> + if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
> + continue;
> +
> if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> continue;
>
> util = cpu_util_next(cpu, p, cpu);
> cpu_cap = capacity_of(cpu);
> - spare_cap = cpu_cap;
> - lsub_positive(&spare_cap, util);
>
> /*
> * Skip CPUs that cannot satisfy the capacity request.
> @@ -6862,15 +6929,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> if (!fits_capacity(util, cpu_cap))
> continue;
>
> + lsub_positive(&cpu_cap, util);
> +
> if (cpu == prev_cpu) {
> /* Always use prev_cpu as a candidate. */
> compute_prev_delta = true;
> - } else if (spare_cap > max_spare_cap) {
> + } else if (cpu_cap > max_spare_cap) {
> /*
> * Find the CPU with the maximum spare capacity
> * in the performance domain.
> */
> - max_spare_cap = spare_cap;
> + max_spare_cap = cpu_cap;
> max_spare_cap_cpu = cpu;
> }
> }
> @@ -6878,13 +6947,15 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> if (max_spare_cap_cpu < 0 && !compute_prev_delta)
> continue;
>
> + eenv_pd_busy_time(&eenv, cpus, p);
> /* Compute the 'base' energy of the pd, without @p */
> - base_energy_pd = compute_energy(p, -1, cpus, pd);
> + base_energy_pd = compute_energy(&eenv, pd, cpus, p, -1);
> base_energy += base_energy_pd;
>
> /* Evaluate the energy impact of using prev_cpu. */
> if (compute_prev_delta) {
> - prev_delta = compute_energy(p, prev_cpu, cpus, pd);
> + prev_delta = compute_energy(&eenv, pd, cpus, p,
> + prev_cpu);
> if (prev_delta < base_energy_pd)
side question:
- base_energy_pd is the energy for the perf domain without task p
- prev_delta is the energy for the same perf domain if task p is put on dst_cpu
How can prev_delta be lower than base_energy_pd?
If dst_cpu doesn't belong to the perf domain, prev_delta should be
equal to base_energy_pd.
If dst_cpu belongs to the perf domain, compute_energy() should return
something higher because the busy_time will be higher.
> goto unlock;
> prev_delta -= base_energy_pd;
> @@ -6893,8 +6964,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>
> /* Evaluate the energy impact of using max_spare_cap_cpu. */
> if (max_spare_cap_cpu >= 0) {
> - cur_delta = compute_energy(p, max_spare_cap_cpu, cpus,
> - pd);
> + cur_delta = compute_energy(&eenv, pd, cpus, p,
> + max_spare_cap_cpu);
> if (cur_delta < base_energy_pd)
same question as above
> goto unlock;
> cur_delta -= base_energy_pd;
> --
> 2.36.1.124.g0e6072fb45-goog
>
On Mon, 23 May 2022 at 17:52, Vincent Donnefort <[email protected]> wrote:
>
> From: Dietmar Eggemann <[email protected]>
>
> Decouple the name of the per-cpu cpumask select_idle_mask from its usage
> in select_idle_[cpu/capacity]() of the CFS run-queue selection
> (select_task_rq_fair()).
>
> This is to support the reuse of this cpumask in the Energy Aware
> Scheduling (EAS) path (find_energy_efficient_cpu()) of the CFS run-queue
> selection.
>
> Signed-off-by: Dietmar Eggemann <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c531976ee960..68f5eb8a1de7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9502,7 +9502,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
> #endif
>
> DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
> -DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
> +DECLARE_PER_CPU(cpumask_var_t, select_rq_mask);
>
> void __init sched_init(void)
> {
> @@ -9551,7 +9551,7 @@ void __init sched_init(void)
> for_each_possible_cpu(i) {
> per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
> cpumask_size(), GFP_KERNEL, cpu_to_node(i));
> - per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
> + per_cpu(select_rq_mask, i) = (cpumask_var_t)kzalloc_node(
> cpumask_size(), GFP_KERNEL, cpu_to_node(i));
> }
> #endif /* CONFIG_CPUMASK_OFFSTACK */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 73a9dc522b73..2d7bba2f1da2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5897,7 +5897,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
> /* Working cpumask for: load_balance, load_balance_newidle. */
> DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
> -DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
> +DEFINE_PER_CPU(cpumask_var_t, select_rq_mask);
>
> #ifdef CONFIG_NO_HZ_COMMON
>
> @@ -6387,7 +6387,7 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
> */
> static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
> {
> - struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> + struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> int i, cpu, idle_cpu = -1, nr = INT_MAX;
> struct rq *this_rq = this_rq();
> int this = smp_processor_id();
> @@ -6473,7 +6473,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> int cpu, best_cpu = -1;
> struct cpumask *cpus;
>
> - cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> + cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
> task_util = uclamp_task_util(p);
> --
> 2.36.1.124.g0e6072fb45-goog
>
[...]
> > @@ -8114,6 +8212,10 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
> > if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
> > update_tg_load_avg(cfs_rq);
> >
> > + /* sync clock_pelt_idle with last update */
>
> update_idle_cfs_rq_clock_pelt() syncs cfs_rq->throttled_pelt_idle with
> cfs_rq->throttled_clock_pelt_time. Not sure what `clock_pelt_idle` and
> `last update` here mean?
Indeed, this comment is not helpful at all. What matters here is that the cfs_rq
is idle and we need to update the throttled_pelt_idle accordingly.
>
> [...]
>
> > +/* The rq is idle, we can sync to clock_task */
> > +static inline void _update_idle_rq_clock_pelt(struct rq *rq)
> > +{
> > + rq->clock_pelt = rq_clock_task(rq);
> > +
> > + u64_u32_store(rq->enter_idle, rq_clock(rq));
> > + /* Paired with smp_rmb in migrate_se_pelt_lag */
>
> minor:
>
> s/migrate_se_pelt_lag/migrate_se_pelt_lag()
>
> [...]
>
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index bf4a0ec98678..97bc26e5c8af 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -648,6 +648,10 @@ struct cfs_rq {
> > int runtime_enabled;
> > s64 runtime_remaining;
> >
> > + u64 throttled_pelt_idle;
> > +#ifndef CONFIG_64BIT
> > + u64 throttled_pelt_idle_copy;
> > +#endif
> > u64 throttled_clock;
> > u64 throttled_clock_pelt;
> > u64 throttled_clock_pelt_time;
> > @@ -1020,6 +1024,12 @@ struct rq {
> > u64 clock_task ____cacheline_aligned;
> > u64 clock_pelt;
> > unsigned long lost_idle_time;
> > + u64 clock_pelt_idle;
> > + u64 enter_idle;
> > +#ifndef CONFIG_64BIT
> > + u64 clock_pelt_idle_copy;
> > + u64 enter_idle_copy;
> > +#endif
> >
> > atomic_t nr_iowait;
>
> `throttled_pelt_idle`, `clock_pelt_idle` and `enter_idle` are clock
> snapshots when cfs_rq resp. rq go idle. But the naming does not really
> show this relation. And this makes reading those equations rather difficult.
>
> What about something like `throttled_clock_pelt_time_enter_idle`,
> `clock_pelt_enter_idle`, `clock_enter_idle`? Especially the first one is
> too long but something which shows that those are clock snapshots when
> enter idle would IMHO augment readability in migrate_se_pelt_lag().
What if I drop the "enter"?
clock_idle;
clock_pelt_idle;
throttled_clock_pelt_time_idle;
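i.e., roughly (a sketch of the resulting members):

        /* struct rq */
        u64 clock_pelt_idle;    /* clock_pelt when entering idle */
        u64 clock_idle;         /* clock when entering idle */

        /* struct cfs_rq */
        u64 throttled_clock_pelt_time_idle;     /* snapshot when entering idle */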
>
> Besides these small issues:
>
> Reviewed-by: Dietmar Eggemann <[email protected]>
Thanks!
On Tue, May 31, 2022 at 10:17:01AM +0200, Dietmar Eggemann wrote:
> - Vincent Donnefort <[email protected]>
>
> On 23/05/2022 17:51, Vincent Donnefort wrote:
> > From: Vincent Donnefort <[email protected]>
>
> [...]
>
> > @@ -6834,22 +6887,36 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > if (!task_util_est(p))
> > goto unlock;
> >
> > + eenv_task_busy_time(&eenv, p, prev_cpu);
> > +
> > for (; pd; pd = pd->next) {
> > - unsigned long cur_delta, spare_cap, max_spare_cap = 0;
> > + unsigned long cpu_cap, cpu_thermal_cap, util;
> > + unsigned long cur_delta, max_spare_cap = 0;
> > bool compute_prev_delta = false;
> > unsigned long base_energy_pd;
> > int max_spare_cap_cpu = -1;
> >
> > cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
>
> Internal EAS testing of this patch-set version has revealed that this
> doesn't work against LTP CPU hotplug stress test. `struct cpumask *cpus`
> can't be used when it is empty. This can happen in case all PD CPUs are
> hotplugged out since we `and` the invariant PD cpumask with
> cpu_online_mask. We need a:
>
> + if (cpumask_empty(cpus))
> + continue;
> +
>
> here.
Good catch, thanks for trying the test suite.
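Something like this then, i.e. (a sketch of how the check would sit in
find_energy_efficient_cpu()):

        for (; pd; pd = pd->next) {
                ...
                cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);

                if (cpumask_empty(cpus))
                        continue;

                /* Account thermal pressure for the energy estimation */
                cpu = cpumask_first(cpus);
                ...
        }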
>
> > - for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
> > + /* Account thermal pressure for the energy estimation */
> > + cpu = cpumask_first(cpus);
> > + cpu_thermal_cap = arch_scale_cpu_capacity(cpu);
> > + cpu_thermal_cap -= arch_scale_thermal_pressure(cpu);
>
> [...]
[...]
> > +
> > +/*
> > + * compute_energy(): Use the Energy Model to estimate the energy that @pd would
> > + * consume for a given utilization landscape @eenv. If @dst_cpu < 0 the task
>
> I find this comment a bit confusing because compute_energy() adds the
> task contribution if dst_cpu >= 0 but doesn't remove it. The fact that
> eenv->pd_busy_time has been previously computed without the
> contribution of the task, is outside the scope of this this function
> whereas the comment suggest that the remove will happen in
> compute_energy()
Arg, leftover from a previous version where this function was adding or removing
the contribution. I'll update!
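Maybe something along these lines (proposed wording only):

        /*
         * compute_energy(): Use the Energy Model to estimate the energy that @pd
         * would consume for a given utilization landscape @eenv. When @dst_cpu >= 0,
         * the task's busy time (@eenv->task_busy_time) is added on top of
         * @eenv->pd_busy_time, which is expected to have been computed without the
         * task's contribution (see eenv_pd_busy_time()).
         */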
>
> > + * contribution is removed from the energy estimation.
> > + */
> > +static inline unsigned long
> > +compute_energy(struct energy_env *eenv, struct perf_domain *pd,
> > + struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
> > +{
> > + unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
> > + unsigned long busy_time = eenv->pd_busy_time;
> > +
> > + if (dst_cpu >= 0)
> > + busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
> > +
> > + return em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
> > }
> >
[...]
> > @@ -6878,13 +6947,15 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > if (max_spare_cap_cpu < 0 && !compute_prev_delta)
> > continue;
> >
> > + eenv_pd_busy_time(&eenv, cpus, p);
> > /* Compute the 'base' energy of the pd, without @p */
> > - base_energy_pd = compute_energy(p, -1, cpus, pd);
> > + base_energy_pd = compute_energy(&eenv, pd, cpus, p, -1);
> > base_energy += base_energy_pd;
> >
> > /* Evaluate the energy impact of using prev_cpu. */
> > if (compute_prev_delta) {
> > - prev_delta = compute_energy(p, prev_cpu, cpus, pd);
> > + prev_delta = compute_energy(&eenv, pd, cpus, p,
> > + prev_cpu);
> > if (prev_delta < base_energy_pd)
>
> side question:
> -base_energy_pd is the energy for the perf domain without task p
> -prev_delta is the energy for the same perf domain if task p is put on dst_cpu
>
> How can prev_delta be lower than base_energy ?
It can happen if the utilization of one of the CPUs is updated in the middle of feec().
>
> if dst_cpu doesn't belong to the perf domain, prev_delta should be
> equal to base_energy_pd
> if dst_cpu belongs to the perf domain, the compute_energy should be
> higher because the busy_time will be higher
>
> > goto unlock;
> > prev_delta -= base_energy_pd;
> > @@ -6893,8 +6964,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >
> > /* Evaluate the energy impact of using max_spare_cap_cpu. */
> > if (max_spare_cap_cpu >= 0) {
> > - cur_delta = compute_energy(p, max_spare_cap_cpu, cpus,
> > - pd);
> > + cur_delta = compute_energy(&eenv, pd, cpus, p,
> > + max_spare_cap_cpu);
> > if (cur_delta < base_energy_pd)
>
> same question as above
>
> > goto unlock;
> > cur_delta -= base_energy_pd;
> > --
> > 2.36.1.124.g0e6072fb45-goog
> >
On Mon, 6 Jun 2022 at 11:41, Vincent Donnefort <[email protected]> wrote:
>
> [...]
>
> > > +
> > > +/*
> > > + * compute_energy(): Use the Energy Model to estimate the energy that @pd would
> > > + * consume for a given utilization landscape @eenv. If @dst_cpu < 0 the task
> >
> > I find this comment a bit confusing because compute_energy() adds the
> > task contribution if dst_cpu >= 0 but doesn't remove it. The fact that
> > eenv->pd_busy_time has been previously computed without the
> > contribution of the task, is outside the scope of this this function
> > whereas the comment suggest that the remove will happen in
> > compute_energy()
>
> Arg, leftover from a previous version where this function was adding or removing
> the contribution. I'll update!
>
> >
> > > + * contribution is removed from the energy estimation.
> > > + */
> > > +static inline unsigned long
> > > +compute_energy(struct energy_env *eenv, struct perf_domain *pd,
> > > + struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
> > > +{
> > > + unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
> > > + unsigned long busy_time = eenv->pd_busy_time;
> > > +
> > > + if (dst_cpu >= 0)
> > > + busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
> > > +
> > > + return em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
> > > }
> > >
>
> [...]
>
> > > @@ -6878,13 +6947,15 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > if (max_spare_cap_cpu < 0 && !compute_prev_delta)
> > > continue;
> > >
> > > + eenv_pd_busy_time(&eenv, cpus, p);
> > > /* Compute the 'base' energy of the pd, without @p */
> > > - base_energy_pd = compute_energy(p, -1, cpus, pd);
> > > + base_energy_pd = compute_energy(&eenv, pd, cpus, p, -1);
> > > base_energy += base_energy_pd;
> > >
> > > /* Evaluate the energy impact of using prev_cpu. */
> > > if (compute_prev_delta) {
> > > - prev_delta = compute_energy(p, prev_cpu, cpus, pd);
> > > + prev_delta = compute_energy(&eenv, pd, cpus, p,
> > > + prev_cpu);
> > > if (prev_delta < base_energy_pd)
> >
> > side question:
> > -base_energy_pd is the energy for the perf domain without task p
> > -prev_delta is the energy for the same perf domain if task p is put on dst_cpu
> >
> > How can prev_delta be lower than base_energy ?
>
> It can happen if one of the CPU utilization is updated in the middle of feec().
Ok. A comment would be helpful
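For instance, something along these lines next to those checks (just a
sketch):

        /*
         * The CPUs' utilization can be concurrently updated while feec()
         * runs, so the deltas can end up lower than the base energy, in
         * which case the candidate is discarded.
         */
        if (prev_delta < base_energy_pd)
                goto unlock;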
>
> >
> > if dst_cpu doesn't belong to the perf domain, prev_delta should be
> > equal to base_energy_pd
> > if dst_cpu belongs to the perf domain, the compute_energy should be
> > higher because the busy_time will be higher
> >
> > > goto unlock;
> > > prev_delta -= base_energy_pd;
> > > @@ -6893,8 +6964,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > >
> > > /* Evaluate the energy impact of using max_spare_cap_cpu. */
> > > if (max_spare_cap_cpu >= 0) {
> > > - cur_delta = compute_energy(p, max_spare_cap_cpu, cpus,
> > > - pd);
> > > + cur_delta = compute_energy(&eenv, pd, cpus, p,
> > > + max_spare_cap_cpu);
> > > if (cur_delta < base_energy_pd)
> >
> > same question as above
> >
> > > goto unlock;
> > > cur_delta -= base_energy_pd;
> > > --
> > > 2.36.1.124.g0e6072fb45-goog
> > >
On Mon, 6 Jun 2022 at 11:31, Vincent Donnefort <[email protected]> wrote:
>
> [...]
> > > @@ -8114,6 +8212,10 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
> > > if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
> > > update_tg_load_avg(cfs_rq);
> > >
> > > + /* sync clock_pelt_idle with last update */
> >
> > update_idle_cfs_rq_clock_pelt() syncs cfs_rq->throttled_pelt_idle with
> > cfs_rq->throttled_clock_pelt_time. Not sure what `clock_pelt_idle` and
> > `last update` here mean?
>
>
> Indeed, this comment is not helpful at all. What matters here is that the cfs_rq
> is idle and we need to update the throttled_pelt_idle accordingly.
>
> >
> > [...]
> >
> > > +/* The rq is idle, we can sync to clock_task */
> > > +static inline void _update_idle_rq_clock_pelt(struct rq *rq)
> > > +{
> > > + rq->clock_pelt = rq_clock_task(rq);
> > > +
> > > + u64_u32_store(rq->enter_idle, rq_clock(rq));
> > > + /* Paired with smp_rmb in migrate_se_pelt_lag */
> >
> > minor:
> >
> > s/migrate_se_pelt_lag/migrate_se_pelt_lag()
> >
> > [...]
> >
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index bf4a0ec98678..97bc26e5c8af 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -648,6 +648,10 @@ struct cfs_rq {
> > > int runtime_enabled;
> > > s64 runtime_remaining;
> > >
> > > + u64 throttled_pelt_idle;
> > > +#ifndef CONFIG_64BIT
> > > + u64 throttled_pelt_idle_copy;
> > > +#endif
> > > u64 throttled_clock;
> > > u64 throttled_clock_pelt;
> > > u64 throttled_clock_pelt_time;
> > > @@ -1020,6 +1024,12 @@ struct rq {
> > > u64 clock_task ____cacheline_aligned;
> > > u64 clock_pelt;
> > > unsigned long lost_idle_time;
> > > + u64 clock_pelt_idle;
> > > + u64 enter_idle;
> > > +#ifndef CONFIG_64BIT
> > > + u64 clock_pelt_idle_copy;
> > > + u64 enter_idle_copy;
> > > +#endif
> > >
> > > atomic_t nr_iowait;
> >
> > `throttled_pelt_idle`, `clock_pelt_idle` and `enter_idle` are clock
> > snapshots when cfs_rq resp. rq go idle. But the naming does not really
> > show this relation. And this makes reading those equations rather difficult.
> >
> > What about something like `throttled_clock_pelt_time_enter_idle`,
> > `clock_pelt_enter_idle`, `clock_enter_idle`? Especially the first one is
> > too long but something which shows that those are clock snapshots when
> > enter idle would IMHO augment readability in migrate_se_pelt_lag().
>
> What if I drop the "enter"?
>
> clock_idle;
> clock_pelt_idle;
> throttled_clock_pelt_time_idle;
and you can even remove the _time for throttled_clock_pelt_idle
>
> >
> > Besides these small issues:
> >
> > Reviewed-by: Dietmar Eggemann <[email protected]>
>
> Thanks!
On Tue, 7 Jun 2022 at 12:03, Vincent Donnefort <[email protected]> wrote:
>
> [...]
>
> > > >
> > > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > > > index bf4a0ec98678..97bc26e5c8af 100644
> > > > > --- a/kernel/sched/sched.h
> > > > > +++ b/kernel/sched/sched.h
> > > > > @@ -648,6 +648,10 @@ struct cfs_rq {
> > > > > int runtime_enabled;
> > > > > s64 runtime_remaining;
> > > > >
> > > > > + u64 throttled_pelt_idle;
> > > > > +#ifndef CONFIG_64BIT
> > > > > + u64 throttled_pelt_idle_copy;
> > > > > +#endif
> > > > > u64 throttled_clock;
> > > > > u64 throttled_clock_pelt;
> > > > > u64 throttled_clock_pelt_time;
> > > > > @@ -1020,6 +1024,12 @@ struct rq {
> > > > > u64 clock_task ____cacheline_aligned;
> > > > > u64 clock_pelt;
> > > > > unsigned long lost_idle_time;
> > > > > + u64 clock_pelt_idle;
> > > > > + u64 enter_idle;
> > > > > +#ifndef CONFIG_64BIT
> > > > > + u64 clock_pelt_idle_copy;
> > > > > + u64 enter_idle_copy;
> > > > > +#endif
> > > > >
> > > > > atomic_t nr_iowait;
> > > >
> > > > `throttled_pelt_idle`, `clock_pelt_idle` and `enter_idle` are clock
> > > > snapshots when cfs_rq resp. rq go idle. But the naming does not really
> > > > show this relation. And this makes reading those equations rather difficult.
> > > >
> > > > What about something like `throttled_clock_pelt_time_enter_idle`,
> > > > `clock_pelt_enter_idle`, `clock_enter_idle`? Especially the first one is
> > > > too long but something which shows that those are clock snapshots when
> > > > enter idle would IMHO augment readability in migrate_se_pelt_lag().
> > >
> > > What if I drop the "enter"?
> > >
> > > clock_idle;
> > > clock_pelt_idle;
> > > throttled_clock_pelt_time_idle;
> >
> > and you can even remove the _time for throttled_clock_pelt_idle
> >
>
> Hum, "throttled_clock_pelt" already exists, while what we really snapshot is
> "throttled_clock_pelt_time".
What is snapshotted is throttled_clock_pelt when entering idle, hence my
proposal of "throttled_clock_pelt_idle": that name doesn't exist yet and it
reflects that this is throttled_clock_pelt when the cfs_rq enters idle,
just like clock_pelt_idle reflects clock_pelt when entering idle.
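As a sketch with that name (the value that gets stored stays the same):

        /* Snapshot taken when the cfs_rq enters idle */
        u64_u32_store(cfs_rq->throttled_clock_pelt_idle, throttled);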
[...]
> > >
> > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > > index bf4a0ec98678..97bc26e5c8af 100644
> > > > --- a/kernel/sched/sched.h
> > > > +++ b/kernel/sched/sched.h
> > > > @@ -648,6 +648,10 @@ struct cfs_rq {
> > > > int runtime_enabled;
> > > > s64 runtime_remaining;
> > > >
> > > > + u64 throttled_pelt_idle;
> > > > +#ifndef CONFIG_64BIT
> > > > + u64 throttled_pelt_idle_copy;
> > > > +#endif
> > > > u64 throttled_clock;
> > > > u64 throttled_clock_pelt;
> > > > u64 throttled_clock_pelt_time;
> > > > @@ -1020,6 +1024,12 @@ struct rq {
> > > > u64 clock_task ____cacheline_aligned;
> > > > u64 clock_pelt;
> > > > unsigned long lost_idle_time;
> > > > + u64 clock_pelt_idle;
> > > > + u64 enter_idle;
> > > > +#ifndef CONFIG_64BIT
> > > > + u64 clock_pelt_idle_copy;
> > > > + u64 enter_idle_copy;
> > > > +#endif
> > > >
> > > > atomic_t nr_iowait;
> > >
> > > `throttled_pelt_idle`, `clock_pelt_idle` and `enter_idle` are clock
> > > snapshots when cfs_rq resp. rq go idle. But the naming does not really
> > > show this relation. And this makes reading those equations rather difficult.
> > >
> > > What about something like `throttled_clock_pelt_time_enter_idle`,
> > > `clock_pelt_enter_idle`, `clock_enter_idle`? Especially the first one is
> > > too long but something which shows that those are clock snapshots when
> > > enter idle would IMHO augment readability in migrate_se_pelt_lag().
> >
> > What if I drop the "enter"?
> >
> > clock_idle;
> > clock_pelt_idle;
> > throttled_clock_pelt_time_idle;
>
> and you can even remove the _time for throttled_clock_pelt_idle
>
Hum, "throttled_clock_pelt" already exists, while what we really snapshot is
"throttled_clock_pelt_time".