LinuxLists.cc - [PATCH v3 0/7] feec() energy margin removal

2022-03-09 00:40:14

Subject: [PATCH v3 0/7] feec() energy margin removal

find_energy_efficient_cpu() (feec()) will migrate a task to save energy only
if it saves at least 6% of the total energy consumed by the system. This
conservative approach is a problem on a system where a lot of small tasks
together create a huge overall load: very few of them will be allowed to
migrate to a smaller CPU, wasting a lot of energy. Instead of trying to
determine yet another margin, let's try to remove it.
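
For illustration, the margin test can be sketched in user-space C as follows.
The helper name and its parameters are assumptions made for this sketch, and
the 1/16 shift (6.25%) is how the "6%" threshold is modelled here; it is not
the kernel code itself.

```c
#include <stdbool.h>

/*
 * Hypothetical helper modelling the margin check.
 * prev_delta:  energy cost of keeping the task on prev_cpu
 * best_delta:  energy cost of the task on the best candidate CPU
 * base_energy: estimated energy of the system without the task
 */
static bool saves_enough_energy(unsigned long prev_delta,
				unsigned long best_delta,
				unsigned long base_energy)
{
	/* No saving at all: never migrate for energy reasons. */
	if (best_delta >= prev_delta)
		return false;

	/* Migrate only if the saving exceeds 1/16 of the total energy. */
	return (prev_delta - best_delta) > ((prev_delta + base_energy) >> 4);
}
```

With a base energy of 900 and a prev_delta of 100, a candidate that costs 90
is rejected (10 is below the threshold of 62), while one that costs 20 is
accepted. When base_energy is dominated by many small tasks, the threshold
grows and almost no single placement can pass it.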

The first elements of this patch-set are various fixes and improvements that
stabilize task_util and ensure energy comparison fairness across all CPUs of
the topology. Only once those are fixed can we completely remove the margin
and let feec() aggressively place tasks and save energy.

This has been validated in two different ways:

First, using LISA's eas_behaviour test suite. It is composed of a set of
scenarios and verifies whether the task placement is optimal. No failures
have been observed, and some tests improved, such as Ramp-Down (as the
placement is now more energy oriented) and *ThreeSmall (as no bouncing
between clusters happens anymore).

* Hikey960: 100% PASSED
* DB-845C: 100% PASSED
* RB5: 100% PASSED

Second, using an Android benchmark: PCMark2 on a Pixel4, with a lot of
backports to have a scheduler as close as we can to mainline.

+------------+-----------------+-----------------+
| Test       | Perf            | Energy [1]      |
+------------+-----------------+-----------------+
| Web2       | -0.3% pval 0.03 | -1.8% pval 0.00 |
| Video2     | -0.3% pval 0.13 | -5.6% pval 0.00 |
| Photo2 [2] | -3.8% pval 0.00 | -1% pval 0.00   |
| Writing2   | 0% pval 0.13    | -1% pval 0.00   |
| Data2      | 0% pval 0.8     | -0.43 pval 0.00 |
+------------+-----------------+-----------------+

The margin removal lets the kernel make the best use of the Energy Model:
tasks are more likely to be placed where they fit, and this saves a
substantial amount of energy while having a limited impact on performance.

[1] This is an energy estimation based on the CPU activity and the Energy Model
for this device. "All models are wrong but some are useful"; yes, this is an
imperfect estimation that doesn't take into account some idle states and shared
power rails. Nonetheless, this is based on the information the kernel has at
runtime, and it proves the scheduler can make better decisions based solely on
that data.

[2] This is the only performance impact observed. The debugging of this test
showed no issue with task placement. The better baseline score was solely due
to some critical threads being held on better-performing CPUs. If a thread
needs a higher capacity CPU, the placement must result from a user input (e.g.
with uclamp min) instead of the thread being artificially held on less
efficient CPUs by feec(). Note also that the experiment didn't use the
Android-only latency_sensitive feature, which would hide this problem on a
real-life device.

v2 -> v3:
- feec(): introduce energy_env struct (Dietmar)
- PELT migration decay: Only apply when src CPU is idle (Vincent G.)
- PELT migration decay: Do not apply when cfs_rq is throttled
- PELT migration decay: Snapshot the lag at cfs_rq's level

v1 -> v2:
- Fix PELT migration last_update_time (previously root cfs_rq's).
- Add Dietmar's patches to refactor feec()'s CPU loop.
- feec(): renaming busy time functions get_{pd,tsk}_busy_time()
- feec(): pd_cap computation in the first for_each_cpu loop.
- feec(): create get_pd_max_util() function (previously within compute_energy())
- feec(): rename base_energy_pd to base_energy.

Dietmar Eggemann (3):
  sched, drivers: Remove max param from
    effective_cpu_util()/sched_cpu_util()
  sched/fair: Rename select_idle_mask to select_rq_mask
  sched/fair: Use the same cpumask per-PD throughout
    find_energy_efficient_cpu()

Vincent Donnefort (4):
  sched/fair: Provide u64 read for 32-bits arch helper
  sched/fair: Decay task PELT values during migration
  sched/fair: Remove task_util from effective utilization in feec()
  sched/fair: Remove the energy margin in feec()

drivers/powercap/dtpm_cpu.c | 33 +--
drivers/thermal/cpufreq_cooling.c | 6 +-
include/linux/sched.h | 2 +-
kernel/sched/core.c | 15 +-
kernel/sched/cpufreq_schedutil.c | 5 +-
kernel/sched/fair.c | 376 ++++++++++++++++++------------
kernel/sched/sched.h | 49 +++-
7 files changed, 295 insertions(+), 191 deletions(-)

--
2.25.1

2022-03-09 00:41:20

by Vincent Donnefort

Subject: [PATCH v3 5/7] sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()

From: Dietmar Eggemann <[email protected]>

The Perf Domain (PD) cpumask (struct em_perf_domain.cpus) stays
invariant after Energy Model creation, i.e. it is not updated after
CPU hotplug operations.

That's why the PD mask is used in conjunction with the cpu_online_mask
(or Sched Domain cpumask). Thereby the cpu_online_mask is fetched
multiple times (in compute_energy()) during a run-queue selection
for a task.

cpu_online_mask may change during this time which can lead to wrong
energy calculations.

To be able to avoid this, use the select_rq_mask per-cpu cpumask to
create a cpumask out of PD cpumask and cpu_online_mask and pass it
through the function calls of the EAS run-queue selection path.

The PD cpumask for max_spare_cap_cpu/compute_prev_delta selection
(find_energy_efficient_cpu()) is now ANDed not only with the SD mask
but also with the cpu_online_mask. This is fine since this cpumask
has to be in sync with the one used for energy computation
(compute_energy()).
An exclusive cpuset setup with at least one asymmetric CPU capacity
island (hence the additional AND with the SD cpumask) is the obvious
exception here.
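
The idea of taking one snapshot of the mask and reusing it can be sketched in
user-space C as follows. The 64-bit toy_cpumask_t type and both helper names
are assumptions for this sketch; they stand in for struct cpumask and the
kernel cpumask helpers.

```c
#include <stdint.h>

/* Toy mask for up to 64 CPUs, standing in for struct cpumask (assumption). */
typedef uint64_t toy_cpumask_t;

/* Snapshot once: the PD span ANDed with the CPUs that are online right now. */
static toy_cpumask_t snapshot_pd_cpus(toy_cpumask_t pd_span,
				      toy_cpumask_t online)
{
	return pd_span & online;
}

/*
 * Later steps walk the snapshot, not the live online mask, so a hotplug
 * event in between cannot make two energy computations disagree.
 */
static unsigned int count_cpus(toy_cpumask_t cpus)
{
	unsigned int n = 0;

	for (; cpus; cpus &= cpus - 1)	/* clear the lowest set bit */
		n++;

	return n;
}
```

For a PD spanning CPUs 4-7 (0xF0) with CPUs 0-5 online (0x3F), the snapshot is
0x30, and every later energy computation for that PD sees the same two CPUs.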

Signed-off-by: Dietmar Eggemann <[email protected]>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0ebfaa2fc1f4..07de5c63c75f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6572,14 +6572,14 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
* task.
*/
static long
-compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
+compute_energy(struct task_struct *p, int dst_cpu, struct cpumask *cpus,
+ struct perf_domain *pd)
{
- struct cpumask *pd_mask = perf_domain_span(pd);
unsigned long max_util = 0, sum_util = 0, cpu_cap;
int cpu;

- cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
- cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
+ cpu_cap = arch_scale_cpu_capacity(cpumask_first(cpus));
+ cpu_cap -= arch_scale_thermal_pressure(cpumask_first(cpus));

/*
* The capacity state of CPUs of the current rd can be driven by CPUs
@@ -6590,7 +6590,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* If an entire pd is outside of the current rd, it will not appear in
* its pd list and will not be accounted by compute_energy().
*/
- for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
+ for_each_cpu(cpu, cpus) {
unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
unsigned long cpu_util, util_running = util_freq;
struct task_struct *tsk = NULL;
@@ -6677,6 +6677,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
*/
static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
int cpu, best_energy_cpu = prev_cpu, target = -1;
@@ -6711,7 +6712,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
unsigned long base_energy_pd;
int max_spare_cap_cpu = -1;

- for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
+ cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
+
+ for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
continue;

@@ -6748,12 +6751,12 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
continue;

/* Compute the 'base' energy of the pd, without @p */
- base_energy_pd = compute_energy(p, -1, pd);
+ base_energy_pd = compute_energy(p, -1, cpus, pd);
base_energy += base_energy_pd;

/* Evaluate the energy impact of using prev_cpu. */
if (compute_prev_delta) {
- prev_delta = compute_energy(p, prev_cpu, pd);
+ prev_delta = compute_energy(p, prev_cpu, cpus, pd);
if (prev_delta < base_energy_pd)
goto unlock;
prev_delta -= base_energy_pd;
@@ -6762,7 +6765,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

/* Evaluate the energy impact of using max_spare_cap_cpu. */
if (max_spare_cap_cpu >= 0) {
- cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
+ cur_delta = compute_energy(p, max_spare_cap_cpu, cpus,
+ pd);
if (cur_delta < base_energy_pd)
goto unlock;
cur_delta -= base_energy_pd;
--
2.25.1

2022-03-09 01:10:11

by Vincent Donnefort

Subject: [PATCH v3 6/7] sched/fair: Remove task_util from effective utilization in feec()

The energy estimation in find_energy_efficient_cpu() (feec()) relies on
the computation of the effective utilization for each CPU of a perf domain
(PD). This effective utilization is then used as an estimation of the busy
time for this PD. The function effective_cpu_util(), which gives this value,
scales the utilization relative to IRQ pressure on the CPU to take into
account that the IRQ time is hidden from the task clock. The IRQ scaling is
as follows:

effective_cpu_util = irq + (cpu_cap - irq)/cpu_cap * util

Where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of
the CPU and irq the IRQ avg time.
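
As a rough user-space sketch of this scaling (the function name is
hypothetical, and the integer arithmetic and the clamping to cpu_cap are
assumptions of the sketch, not a copy of effective_cpu_util()):

```c
/*
 * Hypothetical model of the IRQ scaling:
 *   effective = irq + (cpu_cap - irq) / cpu_cap * util
 */
static unsigned long irq_scaled_util(unsigned long util, unsigned long irq,
				     unsigned long cpu_cap)
{
	unsigned long eff;

	/* IRQ time alone already saturates the CPU. */
	if (irq >= cpu_cap)
		return cpu_cap;

	/* Only the (cpu_cap - irq) share of the CPU is visible to tasks. */
	eff = irq + util * (cpu_cap - irq) / cpu_cap;

	return eff < cpu_cap ? eff : cpu_cap;
}
```

With cpu_cap = 1024 and util = 512, an irq of 0 leaves the utilization at 512,
whereas an irq of 256 gives 256 + 512 * 768 / 1024 = 640.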

If now we take as an example a task placement which doesn't raise the OPP
on the candidate CPU, we can write the energy delta as:

  delta = OPPcost/cpu_cap * (effective_cpu_util(cpu_util + task_util) -
                             effective_cpu_util(cpu_util))
        = OPPcost/cpu_cap * (cpu_cap - irq)/cpu_cap * task_util

We end up with an energy delta depending on the IRQ avg time, which is a
problem: first, the time spent on IRQs by a CPU has no effect on the
additional energy that would be consumed by a task. Second, we don't want
to favour a CPU with a higher IRQ avg time value.
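
Spelling out the intermediate step of the delta, with E(u) as shorthand
(introduced here only for brevity) for effective_cpu_util() applied to a
utilization u:

```latex
\begin{align*}
E(u) &= \mathrm{irq} + \frac{\mathrm{cpu\_cap} - \mathrm{irq}}{\mathrm{cpu\_cap}}\, u \\
E(\mathrm{cpu\_util} + \mathrm{task\_util}) - E(\mathrm{cpu\_util})
  &= \frac{\mathrm{cpu\_cap} - \mathrm{irq}}{\mathrm{cpu\_cap}}\, \mathrm{task\_util} \\
\mathrm{delta} &= \frac{\mathrm{OPPcost}}{\mathrm{cpu\_cap}} \cdot
  \frac{\mathrm{cpu\_cap} - \mathrm{irq}}{\mathrm{cpu\_cap}} \cdot \mathrm{task\_util}
\end{align*}
```

The irq term and the cpu_util term cancel in the difference, leaving only the
scaling factor. As a made-up numeric example, with cpu_cap = 1024 and
task_util = 100, the task contributes 100 when irq = 0 but only 75 when
irq = 256, so the CPU with more IRQ time looks cheaper for the same task.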

Nonetheless, we need to take the IRQ avg time into account. If a task
placement raises the PD's frequency, it will increase the energy cost for
the entire time where the CPU is busy. A solution is to only use
effective_cpu_util() with the CPU contribution part. The task contribution
is added separately and scaled according to prev_cpu's IRQ time.

No change for the FREQUENCY_UTIL component of the energy estimation. We
still want to get the actual frequency that would be selected after the
task placement.

Signed-off-by: Vincent Donnefort <[email protected]>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 07de5c63c75f..b48ba181c8ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6565,61 +6565,97 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
}

/*
- * compute_energy(): Estimates the energy that @pd would consume if @p was
- * migrated to @dst_cpu. compute_energy() predicts what will be the utilization
- * landscape of @pd's CPUs after the task migration, and uses the Energy Model
- * to compute what would be the energy if we decided to actually migrate that
- * task.
+ * energy_env - Utilization landscape for energy estimation.
+ * @task_busy_time: Utilization contribution by the task for which we test the
+ * placement. Given by eenv_task_busy_time().
+ * @pd_busy_time: Utilization of the whole perf domain without the task
+ * contribution. Given by eenv_pd_busy_time().
+ * @cpu_cap: Maximum CPU capacity for the perf domain.
+ * @pd_cap: Entire perf domain capacity. (pd->nr_cpus * cpu_cap).
+ */
+struct energy_env {
+ unsigned long task_busy_time;
+ unsigned long pd_busy_time;
+ unsigned long cpu_cap;
+ unsigned long pd_cap;
+};
+
+/*
+ * Compute the task busy time for compute_energy(). This time cannot be
+ * injected directly into effective_cpu_util() because of the IRQ scaling.
+ * The latter only makes sense with the most recent CPUs where the task has
+ * run.
+ */
+static inline void eenv_task_busy_time(struct energy_env *eenv,
+ struct task_struct *p, int prev_cpu)
+{
+ unsigned long max_cap = arch_scale_cpu_capacity(prev_cpu);
+ unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu));
+
+ if (unlikely(irq >= max_cap)) {
+ eenv->task_busy_time = max_cap;
+ return;
+ }
+
+ eenv->task_busy_time =
+ scale_irq_capacity(task_util_est(p), irq, max_cap);
+}
+
+/*
+ * Compute the perf_domain (PD) busy time for compute_energy(). Based on the
+ * utilization for each @pd_cpus, it however doesn't take into account
+ * clamping since the ratio (utilization / cpu_capacity) is already enough to
+ * scale the EM reported power consumption at the (eventually clamped)
+ * cpu_capacity.
+ *
+ * The contribution of the task @p for which we want to estimate the
+ * energy cost is removed (by cpu_util_next()) and must be calculated
+ * separately (see eenv_task_busy_time). This ensures:
+ *
+ * - A stable PD utilization, no matter which CPU of that PD we want to place
+ * the task on.
+ *
+ * - A fair comparison between CPUs as the task contribution (task_util())
+ * will always be the same no matter which CPU utilization we rely on
+ * (util_avg or util_est).
+ *
+ * Set @eenv busy time for the PD that spans @pd_cpus. This busy time can't
+ * exceed @eenv->pd_cap.
*/
-static long
-compute_energy(struct task_struct *p, int dst_cpu, struct cpumask *cpus,
- struct perf_domain *pd)
+static inline void eenv_pd_busy_time(struct energy_env *eenv,
+ struct cpumask *pd_cpus,
+ struct task_struct *p)
{
- unsigned long max_util = 0, sum_util = 0, cpu_cap;
+ unsigned long busy_time = 0;
int cpu;

- cpu_cap = arch_scale_cpu_capacity(cpumask_first(cpus));
- cpu_cap -= arch_scale_thermal_pressure(cpumask_first(cpus));
+ for_each_cpu(cpu, pd_cpus) {
+ unsigned long util = cpu_util_next(cpu, p, -1);

- /*
- * The capacity state of CPUs of the current rd can be driven by CPUs
- * of another rd if they belong to the same pd. So, account for the
- * utilization of these CPUs too by masking pd with cpu_online_mask
- * instead of the rd span.
- *
- * If an entire pd is outside of the current rd, it will not appear in
- * its pd list and will not be accounted by compute_energy().
- */
- for_each_cpu(cpu, cpus) {
- unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
- unsigned long cpu_util, util_running = util_freq;
- struct task_struct *tsk = NULL;
+ busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
+ }

- /*
- * When @p is placed on @cpu:
- *
- * util_running = max(cpu_util, cpu_util_est) +
- * max(task_util, _task_util_est)
- *
- * while cpu_util_next is: max(cpu_util + task_util,
- * cpu_util_est + _task_util_est)
- */
- if (cpu == dst_cpu) {
- tsk = p;
- util_running =
- cpu_util_next(cpu, p, -1) + task_util_est(p);
- }
+ eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
+}

- /*
- * Busy time computation: utilization clamping is not
- * required since the ratio (sum_util / cpu_capacity)
- * is already enough to scale the EM reported power
- * consumption at the (eventually clamped) cpu_capacity.
- */
- cpu_util = effective_cpu_util(cpu, util_running, ENERGY_UTIL,
- NULL);
+/*
+ * Compute the maximum utilization for compute_energy() when the task @p
+ * is placed on the cpu @dst_cpu.
+ *
+ * Returns the maximum utilization among @eenv->cpus. This utilization can't
+ * exceed @eenv->cpu_cap.
+ */
+static inline unsigned long
+eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
+ struct task_struct *p, int dst_cpu)
+{
+ unsigned long max_util = 0;
+ int cpu;

- sum_util += min(cpu_util, cpu_cap);
+ for_each_cpu(cpu, pd_cpus) {
+ struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
+ unsigned long util = cpu_util_next(cpu, p, dst_cpu);
+ unsigned long cpu_util;

/*
* Performance domain frequency: utilization clamping
@@ -6628,12 +6664,30 @@ compute_energy(struct task_struct *p, int dst_cpu, struct cpumask *cpus,
* NOTE: in case RT tasks are running, by default the
* FREQUENCY_UTIL's utilization can be max OPP.
*/
- cpu_util = effective_cpu_util(cpu, util_freq, FREQUENCY_UTIL,
- tsk);
- max_util = max(max_util, min(cpu_util, cpu_cap));
+ cpu_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
+ max_util = max(max_util, cpu_util);
}

- return em_cpu_energy(pd->em_pd, max_util, sum_util, cpu_cap);
+ return min(max_util, eenv->cpu_cap);
+}
+
+/*
+ * compute_energy(): Use the Energy Model to estimate the energy that @pd would
+ * consume for a given utilization landscape @eenv. If @dst_cpu < 0 the task
+ * contribution is removed from the energy estimation.
+ */
+static inline unsigned long
+compute_energy(struct energy_env *eenv, struct perf_domain *pd,
+ struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
+{
+ unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
+ unsigned long busy_time = eenv->pd_busy_time;
+
+ if (dst_cpu >= 0)
+ busy_time = min(eenv->pd_cap,
+ eenv->pd_busy_time + eenv->task_busy_time);
+
+ return em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
}

/*
@@ -6681,9 +6735,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
int cpu, best_energy_cpu = prev_cpu, target = -1;
- unsigned long cpu_cap, util, base_energy = 0;
+ unsigned long cpu_cap, cpu_thermal_cap, util;
+ unsigned long base_energy = 0;
struct sched_domain *sd;
struct perf_domain *pd;
+ struct energy_env eenv;

rcu_read_lock();
pd = rcu_dereference(rd->pd);
@@ -6706,6 +6762,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (!task_util_est(p))
goto unlock;

+ eenv_task_busy_time(&eenv, p, prev_cpu);
+
for (; pd; pd = pd->next) {
unsigned long cur_delta, spare_cap, max_spare_cap = 0;
bool compute_prev_delta = false;
@@ -6714,7 +6772,20 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);

- for_each_cpu_and(cpu, cpus, sched_domain_span(sd)) {
+ /* Account thermal pressure for the energy estimation */
+ cpu = cpumask_first(cpus);
+ cpu_thermal_cap = arch_scale_cpu_capacity(cpu);
+ cpu_thermal_cap -= arch_scale_thermal_pressure(cpu);
+
+ eenv.cpu_cap = cpu_thermal_cap;
+ eenv.pd_cap = 0;
+
+ for_each_cpu(cpu, cpus) {
+ eenv.pd_cap += cpu_thermal_cap;
+
+ if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
+ continue;
+
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
continue;

@@ -6751,12 +6822,14 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
continue;

/* Compute the 'base' energy of the pd, without @p */
- base_energy_pd = compute_energy(p, -1, cpus, pd);
+ eenv_pd_busy_time(&eenv, cpus, p);
+ base_energy_pd = compute_energy(&eenv, pd, cpus, p, -1);
base_energy += base_energy_pd;

/* Evaluate the energy impact of using prev_cpu. */
if (compute_prev_delta) {
- prev_delta = compute_energy(p, prev_cpu, cpus, pd);
+ prev_delta = compute_energy(&eenv, pd, cpus, p,
+ prev_cpu);
if (prev_delta < base_energy_pd)
goto unlock;
prev_delta -= base_energy_pd;
@@ -6765,8 +6838,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

/* Evaluate the energy impact of using max_spare_cap_cpu. */
if (max_spare_cap_cpu >= 0) {
- cur_delta = compute_energy(p, max_spare_cap_cpu, cpus,
- pd);
+ cur_delta = compute_energy(&eenv, pd, cpus, p,
+ max_spare_cap_cpu);
if (cur_delta < base_energy_pd)
goto unlock;
cur_delta -= base_energy_pd;
--
2.25.1

2022-03-09 01:50:21

by Vincent Donnefort

Subject: [PATCH v3 4/7] sched/fair: Rename select_idle_mask to select_rq_mask

From: Dietmar Eggemann <[email protected]>

Decouple the name of the per-cpu cpumask select_idle_mask from its usage
in select_idle_[cpu/capacity]() of the CFS run-queue selection
(select_task_rq_fair()).

This is to support the reuse of this cpumask in the Energy Aware
Scheduling (EAS) path (find_energy_efficient_cpu()) of the CFS run-queue
selection.

Signed-off-by: Dietmar Eggemann <[email protected]>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a38d27abdf8d..d0363766b4b0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9293,7 +9293,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
#endif

DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
-DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
+DECLARE_PER_CPU(cpumask_var_t, select_rq_mask);

void __init sched_init(void)
{
@@ -9342,7 +9342,7 @@ void __init sched_init(void)
for_each_possible_cpu(i) {
per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
cpumask_size(), GFP_KERNEL, cpu_to_node(i));
- per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
+ per_cpu(select_rq_mask, i) = (cpumask_var_t)kzalloc_node(
cpumask_size(), GFP_KERNEL, cpu_to_node(i));
}
#endif /* CONFIG_CPUMASK_OFFSTACK */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bbc44c3bc47c..0ebfaa2fc1f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5726,7 +5726,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

/* Working cpumask for: load_balance, load_balance_newidle. */
DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
-DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+DEFINE_PER_CPU(cpumask_var_t, select_rq_mask);

#ifdef CONFIG_NO_HZ_COMMON

@@ -6216,7 +6216,7 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
*/
static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
{
- struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
int i, cpu, idle_cpu = -1, nr = INT_MAX;
struct rq *this_rq = this_rq();
int this = smp_processor_id();
@@ -6302,7 +6302,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
int cpu, best_cpu = -1;
struct cpumask *cpus;

- cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+ cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

task_util = uclamp_task_util(p);
--
2.25.1

2022-03-09 01:58:32

by Vincent Donnefort

Subject: [PATCH v3 2/7] sched/fair: Decay task PELT values during migration

Before being migrated to a new CPU, a task sees its PELT values
synchronized with rq last_update_time. Once done, that same task will also
have its sched_avg last_update_time reset. This means the time between
the migration and the last clock update (B) will not be accounted for in
util_avg and a discontinuity will appear. This issue is amplified by the
PELT clock scaling. If the clock hasn't been updated while the CPU is
idle, clock_pelt will not be aligned with clock_task and that time (A)
will also be lost.

---------|----- A -----|-----------|------- B -----|>
    clock_pelt    clock_task     clock            now

This is especially problematic for asymmetric CPU capacity systems which
need stable util_avg signals for task placement and energy estimation.
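
To give a feel for the size of the discontinuity, here is a small
floating-point model of the PELT decay, where a contribution halves every 32
periods of 1024us. The names are hypothetical and the kernel uses fixed-point
arithmetic rather than doubles, so take this as a sketch only.

```c
/* Decay factor per 1024us period, chosen so that y^32 = 0.5. */
#define TOY_PELT_Y 0.97857206

/* Utilization left after @periods idle periods (toy model). */
static double toy_decay(double util, unsigned int periods)
{
	while (periods--)
		util *= TOY_PELT_Y;

	return util;
}
```

In this model, a task with a util_avg of 400 that has really been sleeping for
about 32ms should be seen at roughly 200. If that time is not accounted for,
the task is placed as if it were twice as big.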

Ideally, this problem would be solved by updating the runqueue clocks
before the migration. But that would require taking the runqueue lock
which is quite expensive [1]. Instead estimate the missing time and update
the task util_avg with that value:

A + B = clock_task - clock_pelt + sched_clock_cpu() - clock

Neither clock_task, clock_pelt nor clock can be accessed without the
runqueue lock. The new cfs_rq last_update_lag is therefore created and
encode those three values when the last_update_time value for that very
same cfs_rq is updated.

last_update_lag = clock - clock_task + clock_pelt

And we can then write the missing time as follows:

A + B = sched_clock_cpu() - last_update_lag

The B part of the missing time is however an estimation that doesn't take
into account IRQ and Paravirt time.
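
The two expressions above can be checked against each other with a small
user-space sketch. The function names are hypothetical and the clock values
are passed in as plain integers, which is an assumption of the sketch.

```c
#include <stdint.h>

/* Snapshot taken under the rq lock, when last_update_time is updated. */
static uint64_t toy_last_update_lag(uint64_t clock, uint64_t clock_task,
				    uint64_t clock_pelt)
{
	return clock - clock_task + clock_pelt;
}

/* A + B, computed later without the rq lock from the snapshot alone. */
static uint64_t toy_missing_time(uint64_t lag, uint64_t sched_clock_now)
{
	return sched_clock_now - lag;
}
```

With clock = 1000, clock_task = 900 and clock_pelt = 700, the lag is 800. At
sched_clock_cpu() = 1500 the missing time is 700, which matches
A = 900 - 700 = 200 plus B = 1500 - 1000 = 500.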

Now we have an estimation for A + B, we can create an estimator for the
PELT value at the time of the migration. We need for this purpose to
inject last_update_time which is a combination of both clock_pelt and
lost_idle_time. The latter is a time value which is completely lost form a
PELT point of view and must be ignored. And finally, we can write:

  rq_clock_pelt_estimator() = last_update_time + A + B
                            = last_update_time +
                              sched_clock_cpu() - last_update_lag

This estimation has a cost, mostly due to sched_clock_cpu(). Limit the
usage to the case where the source CPU is idle, as we know this is when the
clock is at the greatest risk of being outdated.
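
Put together, the estimator and its two guards (source CPU idle, clock not
stopped by throttling) can be modelled as follows. This is a user-space
sketch: the function name and signature are hypothetical, and UINT64_MAX
plays the role of the "throttled" marker.

```c
#include <stdint.h>

/*
 * Returns the timestamp the task's PELT values should be decayed to.
 * When no estimation applies, last_update_time is returned unchanged,
 * meaning no extra decay.
 */
static uint64_t toy_pelt_now_estimate(uint64_t last_update_time,
				      uint64_t last_update_lag,
				      uint64_t sched_clock_now,
				      int src_cpu_idle)
{
	/* Only pay for the estimation when the source CPU is idle. */
	if (!src_cpu_idle)
		return last_update_time;

	/* The PELT clock was stopped by throttling: nothing to add. */
	if (last_update_lag == UINT64_MAX)
		return last_update_time;

	return last_update_time + sched_clock_now - last_update_lag;
}
```

With last_update_time = 5000, a lag of 800 and sched_clock_cpu() = 1500, the
estimate is 5700 on an idle source CPU and stays at 5000 otherwise.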

[1] https://lore.kernel.org/all/[email protected]/

Signed-off-by: Vincent Donnefort <[email protected]>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 99ea9540ece4..1f83616a44d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3625,6 +3625,22 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum

#endif /* CONFIG_FAIR_GROUP_SCHED */

+#ifdef CONFIG_NO_HZ_COMMON
+static inline void update_cfs_rq_lag(struct cfs_rq *cfs_rq)
+{
+ struct rq *rq = rq_of(cfs_rq);
+
+ u64_u32_store(cfs_rq->last_update_lag,
+#ifdef CONFIG_CFS_BANDWIDTH
+ /* Timer stopped by throttling */
+ unlikely(cfs_rq->throttle_count) ? U64_MAX :
+#endif
+ rq->clock - rq->clock_task + rq->clock_pelt);
+}
+#else
+static void update_cfs_rq_lag(struct cfs_rq *cfs_rq) {}
+#endif
+
/**
* update_cfs_rq_load_avg - update the cfs_rq's load/util averages
* @now: current time, as per cfs_rq_clock_pelt()
@@ -3688,6 +3704,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
cfs_rq->last_update_time_copy,
sa->last_update_time);
#endif
+ update_cfs_rq_lag(cfs_rq);

return decayed;
}
@@ -6852,6 +6869,44 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)

static void detach_entity_cfs_rq(struct sched_entity *se);

+#ifdef CONFIG_NO_HZ_COMMON
+static inline void migrate_se_pelt_lag(struct sched_entity *se)
+{
+ u64 now, last_update_lag;
+ struct cfs_rq *cfs_rq;
+ struct rq *rq;
+ bool is_idle;
+
+ cfs_rq = cfs_rq_of(se);
+ rq = rq_of(cfs_rq);
+
+ rcu_read_lock();
+ is_idle = is_idle_task(rcu_dereference(rq->curr));
+ rcu_read_unlock();
+
+ /*
+ * The lag estimation comes with a cost we don't want to pay all the
+ * time. Hence, limiting to the case where the source CPU is idle and
+ * we know we are at the greatest risk to have an outdated clock.
+ */
+ if (!is_idle)
+ return;
+
+ last_update_lag = u64_u32_load(cfs_rq->last_update_lag);
+
+ /* The clock has been stopped for throttling */
+ if (last_update_lag == U64_MAX)
+ return;
+
+ now = se->avg.last_update_time - last_update_lag +
+ sched_clock_cpu(cpu_of(rq));
+
+ __update_load_avg_blocked_se(now, se);
+}
+#else
+static void migrate_se_pelt_lag(struct sched_entity *se) {}
+#endif
+
/*
* Called immediately before a task is migrated to a new CPU; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
@@ -6859,6 +6914,9 @@ static void detach_entity_cfs_rq(struct sched_entity *se);
*/
static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
{
+ struct sched_entity *se = &p->se;
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
/*
* As blocked tasks retain absolute vruntime the migration needs to
* deal with this by subtracting the old and adding the new
@@ -6866,8 +6924,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
* the task on the new runqueue.
*/
if (READ_ONCE(p->__state) == TASK_WAKING) {
- struct sched_entity *se = &p->se;
- struct cfs_rq *cfs_rq = cfs_rq_of(se);

se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
}
@@ -6878,25 +6934,28 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
* rq->lock and can modify state directly.
*/
lockdep_assert_rq_held(task_rq(p));
- detach_entity_cfs_rq(&p->se);
+ detach_entity_cfs_rq(se);

} else {
+ remove_entity_load_avg(se);
+
/*
- * We are supposed to update the task to "current" time, then
- * its up to date and ready to go to new CPU/cfs_rq. But we
- * have difficulty in getting what current time is, so simply
- * throw away the out-of-date time. This will result in the
- * wakee task is less decayed, but giving the wakee more load
- * sounds not bad.
+ * Here, the task's PELT values have been updated according to
+ * the current rq's clock. But if that clock hasn't been
+ * updated in a while, a substantial idle time will be missed,
+ * leading to an inflation after wake-up on the new rq.
+ *
+ * Estimate the missing time from the rq clock and update
+ * sched_avg to improve the PELT continuity after migration.
*/
- remove_entity_load_avg(&p->se);
+ migrate_se_pelt_lag(se);
}

/* Tell new CPU we are migrated */
- p->se.avg.last_update_time = 0;
+ se->avg.last_update_time = 0;

/* We have migrated, no longer consider this task hot */
- p->se.exec_start = 0;
+ se->exec_start = 0;

update_scan_period(p, new_cpu);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f1a445efdc63..982691ffe9a1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -608,6 +608,12 @@ struct cfs_rq {
struct sched_avg avg;
#ifndef CONFIG_64BIT
u64 last_update_time_copy;
+#endif
+#ifdef CONFIG_NO_HZ_COMMON
+ u64 last_update_lag;
+#ifndef CONFIG_64BIT
+ u64 last_update_lag_copy;
+#endif
#endif
struct {
raw_spinlock_t lock ____cacheline_aligned;
--
2.25.1

2022-03-21 20:58:55

by Dietmar Eggemann

Subject: Re: [PATCH v3 2/7] sched/fair: Decay task PELT values during migration

On 08/03/2022 19:19, Vincent Donnefort wrote:

Patch header:

s/during migration/during wakeup migration ?

To make sure it's not related to lb migration (TASK_ON_RQ_MIGRATING).

[...]

> Neither clock_task, clock_pelt nor clock can be accessed without the
> runqueue lock. The new cfs_rq last_update_lag is therefore created and
> encode those three values when the last_update_time value for that very

s/encode/encodes ... It's not really encoding?

[...]

> Now we have an estimation for A + B, we can create an estimator for the
> PELT value at the time of the migration. We need for this purpose to
> inject last_update_time which is a combination of both clock_pelt and
> lost_idle_time. The latter is a time value which is completely lost form a

s/form/from

> PELT point of view and must be ignored. And finally, we can write:
>
> rq_clock_pelt_estimator() = last_update_time + A + B
> = last_update_time +
> sched_clock_cpu() - last_update_lag

rq_clock_pelt_estimator() did exist in v2 but does not in v3 anymore.
Might be misleading when people search for it.

[...]

> @@ -3688,6 +3704,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> cfs_rq->last_update_time_copy,
> sa->last_update_time);
> #endif
> + update_cfs_rq_lag(cfs_rq);
>
> return decayed;
> }
> @@ -6852,6 +6869,44 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>
> static void detach_entity_cfs_rq(struct sched_entity *se);
>
> +#ifdef CONFIG_NO_HZ_COMMON

Couldn't you put update_cfs_rq_lag() and migrate_se_pelt_lag() under one
CONFIG_NO_HZ_COMMON?

> +static inline void migrate_se_pelt_lag(struct sched_entity *se)

[...]

> + /*
> + * The lag estimation comes with a cost we don't want to pay all the
> + * time. Hence, limiting to the case where the source CPU is idle and
> + * we know we are at the greatest risk to have an outdated clock.
> + */
> + if (!is_idle)
> + return;
> +
> + last_update_lag = u64_u32_load(cfs_rq->last_update_lag);

So each taskgroup has its own last_update_lag. I guess it works since we
sync se in migrate_task_rq_fair() -> remove_entity_load_avg() ->
sync_entity_load_avg() with its cfs_rq.

[...]

> @@ -6859,6 +6914,9 @@ static void detach_entity_cfs_rq(struct sched_entity *se);
> */
> static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> {
> + struct sched_entity *se = &p->se;
> + struct cfs_rq *cfs_rq = cfs_rq_of(se);

This line can stay in the if condition below since cfs_rq is only used
there. The brackets are also still there (A).

[...]

> @@ -6866,8 +6924,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> * the task on the new runqueue.
> */
> if (READ_ONCE(p->__state) == TASK_WAKING) {
> - struct sched_entity *se = &p->se; <--- (A)
> - struct cfs_rq *cfs_rq = cfs_rq_of(se);

[...]

2022-03-22 12:57:37

by Dietmar Eggemann

Subject: Re: [PATCH v3 6/7] sched/fair: Remove task_util from effective utilization in feec()

On 08/03/2022 19:19, Vincent Donnefort wrote:

[...]

> Signed-off-by: Vincent Donnefort <[email protected]>

LGTM, just a couple of small remarks.

[...]

> @@ -6681,9 +6735,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
> struct root_domain *rd = cpu_rq(smp_processor_id())->rd;

s/cpu_rq(smp_processor_id())/this_rq()

Maybe you can clean this up with this patch?

> int cpu, best_energy_cpu = prev_cpu, target = -1;
> - unsigned long cpu_cap, util, base_energy = 0;
> + unsigned long cpu_cap, cpu_thermal_cap, util;

cpu_cap, cpu_thermal_cap and util can be defined inside the
`for (; pd; pd = pd->next)` scope below.

[...]

> @@ -6706,6 +6762,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> if (!task_util_est(p))
> goto unlock;
>
> + eenv_task_busy_time(&eenv, p, prev_cpu);
> +
> for (; pd; pd = pd->next) {
> unsigned long cur_delta, spare_cap, max_spare_cap = 0;

cpu_cap and spare_cap could be combined into one to save on local
variables. E.g. cpu_cap

@@ -6908,8 +6908,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
util = cpu_util_next(cpu, p, cpu);
cpu_cap = capacity_of(cpu);
- spare_cap = cpu_cap;
- lsub_positive(&spare_cap, util);
/*
* Skip CPUs that cannot satisfy the capacity request.
@@ -6922,15 +6920,14 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (!fits_capacity(util, cpu_cap))
continue;
+ lsub_positive(&cpu_cap, util);
+
if (cpu == prev_cpu) {
/* Always use prev_cpu as a candidate. */
compute_prev_delta = true;
- } else if (spare_cap > max_spare_cap) {
- /*
- * Find the CPU with the maximum spare capacity
- * in the performance domain.
- */
- max_spare_cap = spare_cap;
+ } else if (cpu_cap > max_spare_cap) {
+ /* Find CPU with max spare capacity in PD. */
+ max_spare_cap = cpu_cap;
max_spare_cap_cpu = cpu;
}

[...]