LinuxLists.cc - [PATCH v6 0/7] feec() energy margin removal

2022-04-27 10:18:08

Subject: [PATCH v6 0/7] feec() energy margin removal

find_energy_efficient_cpu() (feec()) will migrate a task to save energy only
if it saves at least 6% of the total energy consumed by the system. This
conservative approach is a problem on a system where a lot of small tasks
create a huge overall load: very few of them will be allowed to migrate
to a smaller CPU, wasting a lot of energy. Instead of trying to determine yet
another margin, let's try to remove it.

The first elements of this patch-set are various fixes and improvements that
stabilize task_util and ensure energy comparison fairness across all CPUs of
the topology. Only once those are fixed can we completely remove the margin
and let feec() aggressively place tasks and save energy.

This has been validated in two different ways:

First, using LISA's eas_behaviour test suite. It is composed of a set of
scenarios and verifies whether the task placement is optimal. No failures have
been observed, and some tests even improved, such as Ramp-Down (as the
placement is now more energy oriented) and *ThreeSmall (as no bouncing between
clusters happens anymore).

* Hikey960: 100% PASSED
* DB-845C: 100% PASSED
* RB5: 100% PASSED

Second, using an Android benchmark: PCMark2 on a Pixel4, with a lot of
backports to have a scheduler as close as possible to mainline.

+------------+-----------------+------------------+
| Test       | Perf            | Energy [1]       |
+------------+-----------------+------------------+
| Web2       | -0.3% pval 0.03 | -1.8%  pval 0.00 |
| Video2     | -0.3% pval 0.13 | -5.6%  pval 0.00 |
| Photo2 [2] | -3.8% pval 0.00 | -1%    pval 0.00 |
| Writing2   |  0%   pval 0.13 | -1%    pval 0.00 |
| Data2      |  0%   pval 0.8  | -0.43% pval 0.00 |
+------------+-----------------+------------------+

Removing the margin lets the kernel make the best use of the Energy Model:
tasks are more likely to be placed where they fit, and this saves a
substantial amount of energy while having a limited impact on performance.

[1] This is an energy estimation based on the CPU activity and the Energy Model
for this device. "All models are wrong but some are useful"; yes, this is an
imperfect estimation that doesn't take into account some idle states and shared
power rails. Nonetheless, it is based on the information the kernel has at
runtime, and it shows the scheduler can make better decisions based solely on
that data.

[2] This is the only performance impact observed. The debugging of this test
showed no issue with task placement. The better baseline score was solely due
to some critical threads being held on better performing CPUs. If a thread
needs a higher capacity CPU, the placement must result from a user input (with
e.g. uclamp min) instead of the thread being artificially held on less
efficient CPUs by feec(). Note also that the experiment didn't use the
Android-only latency_sensitive feature, which would hide this problem on a
real-life device.

v5 -> v6:
- Fix !CONFIG_SMP build.

v4 -> v5:
- PELT migration decay: timestamp only at idle time (Vincent G.)
- PELT migration decay: split timestamp values (enter_idle / clock_pelt_idle)
(Vincent G.)

v3 -> v4:
- Minor cosmetic changes (Dietmar)

v2 -> v3:
- feec(): introduce energy_env struct (Dietmar)
- PELT migration decay: Only apply when src CPU is idle (Vincent G.)
- PELT migration decay: Do not apply when cfs_rq is throttled
- PELT migration decay: Snapshot the lag at cfs_rq's level

v1 -> v2:
- Fix PELT migration last_update_time (previously root cfs_rq's).
- Add Dietmar's patches to refactor feec()'s CPU loop.
- feec(): renaming busy time functions get_{pd,tsk}_busy_time()
- feec(): pd_cap computation in the first for_each_cpu loop.
- feec(): create get_pd_max_util() function (previously within compute_energy())
- feec(): rename base_energy_pd to base_energy.

Dietmar Eggemann (3):
  sched, drivers: Remove max param from
    effective_cpu_util()/sched_cpu_util()
  sched/fair: Rename select_idle_mask to select_rq_mask
  sched/fair: Use the same cpumask per-PD throughout
    find_energy_efficient_cpu()

Vincent Donnefort (4):
  sched/fair: Provide u64 read for 32-bits arch helper
  sched/fair: Decay task PELT values during wakeup migration
  sched/fair: Remove task_util from effective utilization in feec()
  sched/fair: Remove the energy margin in feec()

 drivers/powercap/dtpm_cpu.c       |  33 +--
 drivers/thermal/cpufreq_cooling.c |   6 +-
 include/linux/sched.h             |   2 +-
 kernel/sched/core.c               |  15 +-
 kernel/sched/cpufreq_schedutil.c  |   5 +-
 kernel/sched/fair.c               | 379 ++++++++++++++++++------------
 kernel/sched/pelt.h               |  28 ++-
 kernel/sched/sched.h              |  53 ++++-
 8 files changed, 318 insertions(+), 203 deletions(-)

--
2.25.1

2022-04-27 11:27:02

by Vincent Donnefort

Subject: [PATCH v6 2/7] sched/fair: Decay task PELT values during wakeup migration

Before being migrated to a new CPU, a task sees its PELT values
synchronized with rq last_update_time. Once done, that same task will also
have its sched_avg last_update_time reset. This means the time between
the migration and the last clock update (B) will not be accounted for in
util_avg and a discontinuity will appear. This issue is amplified by the
PELT clock scaling. If the clock hasn't been updated while the CPU is
idle, clock_pelt will not be aligned with clock_task and that time (A)
will also be lost.

  ---------|----- A -----|-----------|------- B -----|>
      clock_pelt    clock_task     clock            now

This is especially problematic for asymmetric CPU capacity systems which
need stable util_avg signals for task placement and energy estimation.
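
To give a sense of scale for the lost time, here is a minimal sketch of PELT's
geometric decay, assuming the usual parameters (a period of about 1 ms and a
half-life of 32 periods). decay_util() and PELT_Y are illustrative names, not
kernel symbols, and the kernel itself uses fixed-point arithmetic rather than
doubles.

```c
#include <assert.h>

/* y such that y^32 == 0.5 (approximation, illustrative only). */
#define PELT_Y 0.9785720620877001

/*
 * Hypothetical helper: decay a utilization value over 'periods' PELT
 * periods (roughly 1 ms each) during which the task did not run.
 */
static double decay_util(double util, unsigned int periods)
{
	unsigned int i;

	assert(util >= 0.0);

	/* Every 32 periods halves the contribution. */
	while (periods >= 32) {
		util *= 0.5;
		periods -= 32;
	}

	/* Remaining periods decay by y each. */
	for (i = 0; i < periods; i++)
		util *= PELT_Y;

	return util;
}
```

With these assumptions, a blocked task with a util_avg of 512 whose last 32 ms
of idle time (A + B) go unaccounted would arrive on the new CPU still reporting
512, where decay_util(512.0, 32) would give 256.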

Ideally, this problem would be solved by updating the runqueue clocks
before the migration. But that would require taking the runqueue lock
which is quite expensive [1]. Instead estimate the missing time and update
the task util_avg with that value:

A + B = clock_task - clock_pelt + sched_clock_cpu() - clock

sched_clock_cpu() is a costly function. Limit its usage to the case where
the source CPU is idle, as we know this is when the clock is at the
greatest risk of being outdated.

Neither clock_task, clock_pelt nor clock can be accessed without the
runqueue lock. We then need to store those values in a timestamp variable
which can be accessed during the migration. rq's enter_idle will give the
wall-clock time when the rq went idle. We then have:

  B = sched_clock_cpu() - rq->enter_idle.

Then, to catch up with the PELT clock scaling (A), there are two cases:

  * !CFS_BANDWIDTH: We can simply use clock_task(). This value is stored
    in rq's clock_pelt_idle, before the rq enters idle. The estimated time
    is then:

      rq->clock_pelt_idle + sched_clock_cpu() - rq->enter_idle.

  * CFS_BANDWIDTH: We can't catch up with clock_task because of the
    throttled_clock_task_time offset. cfs_rq's clock_pelt_idle then
    gives the PELT clock when the cfs_rq becomes idle. This gives:

      A = rq->clock_pelt_idle - cfs_rq->clock_pelt_idle

    And gives the following estimated time:

      cfs_rq->last_update_time +
      rq->clock_pelt_idle - cfs_rq->clock_pelt_idle +   (A)
      sched_clock_cpu() - rq->enter_idle                (B)

The (B) part of the missing time is however an estimation that doesn't
take into account IRQ and Paravirt time.
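
The two estimates above can be sketched in plain C as follows. This is only an
illustration of the formulas as written in this message: struct idle_snapshot
and both functions are hypothetical, the fields stand in for the rq and cfs_rq
timestamps of the same name, and sched_clock_now stands in for the value
returned by sched_clock_cpu().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the timestamps named in the commit message. */
struct idle_snapshot {
	uint64_t rq_clock_pelt_idle;      /* rq->clock_pelt_idle */
	uint64_t rq_enter_idle;           /* rq->enter_idle */
	uint64_t cfs_rq_clock_pelt_idle;  /* cfs_rq->clock_pelt_idle */
	uint64_t cfs_rq_last_update_time; /* cfs_rq->last_update_time */
};

/* !CFS_BANDWIDTH: rq->clock_pelt_idle + B */
static uint64_t estimate_now_no_bandwidth(const struct idle_snapshot *s,
					  uint64_t sched_clock_now)
{
	uint64_t b = sched_clock_now - s->rq_enter_idle;

	return s->rq_clock_pelt_idle + b;
}

/* CFS_BANDWIDTH: cfs_rq->last_update_time + A + B */
static uint64_t estimate_now_bandwidth(const struct idle_snapshot *s,
				       uint64_t sched_clock_now)
{
	uint64_t a = s->rq_clock_pelt_idle - s->cfs_rq_clock_pelt_idle;
	uint64_t b = sched_clock_now - s->rq_enter_idle;

	return s->cfs_rq_last_update_time + a + b;
}
```

With made-up values (rq_clock_pelt_idle = 1000, rq_enter_idle = 5000,
cfs_rq_clock_pelt_idle = 900, cfs_rq_last_update_time = 800) and a current
clock of 5300, B is 300 and A is 100, so the estimates are 1300 and 1200
respectively.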

[1] https://lore.kernel.org/all/[email protected]/

Signed-off-by: Vincent Donnefort <[email protected]>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index abd1feeec0c2..1256e2c0e2e2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3694,6 +3694,48 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum

#endif /* CONFIG_FAIR_GROUP_SCHED */

+#ifdef CONFIG_NO_HZ_COMMON
+static inline void migrate_se_pelt_lag(struct sched_entity *se)
+{
+ struct cfs_rq *cfs_rq;
+ struct rq *rq;
+ bool is_idle;
+ u64 now;
+
+ cfs_rq = cfs_rq_of(se);
+ rq = rq_of(cfs_rq);
+
+ rcu_read_lock();
+ is_idle = is_idle_task(rcu_dereference(rq->curr));
+ rcu_read_unlock();
+
+ /*
+ * The lag estimation comes with a cost we don't want to pay all the
+ * time. Hence, limiting to the case where the source CPU is idle and
+ * we know we are at the greatest risk to have an outdated clock.
+ */
+ if (!is_idle)
+ return;
+
+#ifdef CONFIG_CFS_BANDWIDTH
+ now = u64_u32_load(cfs_rq->clock_pelt_idle);
+ /* The clock has been stopped for throttling */
+ if (now == U64_MAX)
+ return;
+
+ now += cfs_rq_last_update_time(cfs_rq);
+ now -= u64_u32_load(rq->clock_pelt_idle);
+#else
+ now = u64_u32_load(rq->clock_pelt_idle);
+#endif
+ now += sched_clock_cpu(cpu_of(rq)) - u64_u32_load(rq->enter_idle);
+
+ __update_load_avg_blocked_se(now, se);
+}
+#else
+static void migrate_se_pelt_lag(struct sched_entity *se) {}
+#endif
+
/**
* update_cfs_rq_load_avg - update the cfs_rq's load/util averages
* @now: current time, as per cfs_rq_clock_pelt()
@@ -4429,6 +4471,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
*/
if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
update_min_vruntime(cfs_rq);
+
+ if (cfs_rq->nr_running == 0)
+ update_idle_cfs_rq_clock_pelt(cfs_rq);
}

/*
@@ -6946,6 +6991,8 @@ static void detach_entity_cfs_rq(struct sched_entity *se);
*/
static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
{
+ struct sched_entity *se = &p->se;
+
/*
* As blocked tasks retain absolute vruntime the migration needs to
* deal with this by subtracting the old and adding the new
@@ -6953,7 +7000,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
* the task on the new runqueue.
*/
if (READ_ONCE(p->__state) == TASK_WAKING) {
- struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);

se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
@@ -6965,25 +7011,29 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
* rq->lock and can modify state directly.
*/
lockdep_assert_rq_held(task_rq(p));
- detach_entity_cfs_rq(&p->se);
+ detach_entity_cfs_rq(se);

} else {
+ remove_entity_load_avg(se);
+
/*
- * We are supposed to update the task to "current" time, then
- * its up to date and ready to go to new CPU/cfs_rq. But we
- * have difficulty in getting what current time is, so simply
- * throw away the out-of-date time. This will result in the
- * wakee task is less decayed, but giving the wakee more load
- * sounds not bad.
+ * Here, the task's PELT values have been updated according to
+ * the current rq's clock. But if that clock hasn't been
+ * updated in a while, a substantial idle time will be missed,
+ * leading to an inflation after wake-up on the new rq.
+ *
+ * Estimate the missing time from the cfs_rq last_update_time
+ * and update sched_avg to improve the PELT continuity after
+ * migration.
*/
- remove_entity_load_avg(&p->se);
+ migrate_se_pelt_lag(se);
}

/* Tell new CPU we are migrated */
- p->se.avg.last_update_time = 0;
+ se->avg.last_update_time = 0;

/* We have migrated, no longer consider this task hot */
- p->se.exec_start = 0;
+ se->exec_start = 0;

update_scan_period(p, new_cpu);
}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 4ff2ed4f8fa1..0380f750adbe 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -103,6 +103,14 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
rq->clock_pelt += delta;
}

+static inline u64 rq_clock_pelt(struct rq *rq)
+{
+ lockdep_assert_rq_held(rq);
+ assert_clock_updated(rq);
+
+ return rq->clock_pelt - rq->lost_idle_time;
+}
+
/*
* When rq becomes idle, we have to check if it has lost idle time
* because it was fully busy. A rq is fully used when the /Sum util_sum
@@ -130,17 +138,24 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
*/
if (util_sum >= divider)
rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
-}

-static inline u64 rq_clock_pelt(struct rq *rq)
-{
- lockdep_assert_rq_held(rq);
- assert_clock_updated(rq);
+ /* The rq is idle, we can sync with clock_task */
+ rq->clock_pelt = rq_clock_task(rq);

- return rq->clock_pelt - rq->lost_idle_time;
+ u64_u32_store(rq->enter_idle, rq_clock(rq));
+ u64_u32_store(rq->clock_pelt_idle, rq_clock_pelt(rq));
}

#ifdef CONFIG_CFS_BANDWIDTH
+static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
+{
+ if (unlikely(cfs_rq->throttle_count))
+ u64_u32_store(cfs_rq->clock_pelt_idle, U64_MAX);
+ else
+ u64_u32_store(cfs_rq->clock_pelt_idle,
+ rq_clock_pelt(rq_of(cfs_rq)));
+}
+
/* rq->task_clock normalized against any time this cfs_rq has spent throttled */
static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
@@ -150,6 +165,7 @@ static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;
}
#else
+static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) { }
static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
return rq_clock_pelt(rq_of(cfs_rq));
@@ -204,6 +220,7 @@ update_rq_clock_pelt(struct rq *rq, s64 delta) { }
static inline void
update_idle_rq_clock_pelt(struct rq *rq) { }

+static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) { }
#endif

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e2cf6e48b165..07014e8cbae2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -641,6 +641,10 @@ struct cfs_rq {
int runtime_enabled;
s64 runtime_remaining;

+ u64 clock_pelt_idle;
+#ifndef CONFIG_64BIT
+ u64 clock_pelt_idle_copy;
+#endif
u64 throttled_clock;
u64 throttled_clock_pelt;
u64 throttled_clock_pelt_time;
@@ -1013,6 +1017,12 @@ struct rq {
u64 clock_task ____cacheline_aligned;
u64 clock_pelt;
unsigned long lost_idle_time;
+ u64 clock_pelt_idle;
+ u64 enter_idle;
+#ifndef CONFIG_64BIT
+ u64 clock_pelt_idle_copy;
+ u64 enter_idle_copy;
+#endif

atomic_t nr_iowait;

--
2.25.1

2022-04-27 11:28:43

by Tao Zhou

Subject: Re: [PATCH v6 2/7] sched/fair: Decay task PELT values during wakeup migration

On Tue, Apr 26, 2022 at 10:35:01AM +0100, Vincent Donnefort wrote:

> Before being migrated to a new CPU, a task sees its PELT values
> synchronized with rq last_update_time. Once done, that same task will also
> have its sched_avg last_update_time reset. This means the time between
> the migration and the last clock update (B) will not be accounted for in
> util_avg and a discontinuity will appear. This issue is amplified by the
> PELT clock scaling. If the clock hasn't been updated while the CPU is
> idle, clock_pelt will not be aligned with clock_task and that time (A)
> will be also lost.
>
> ---------|----- A -----|-----------|------- B -----|>
> clock_pelt clock_task clock now
>
> This is especially problematic for asymmetric CPU capacity systems which
> need stable util_avg signals for task placement and energy estimation.
>
> Ideally, this problem would be solved by updating the runqueue clocks
> before the migration. But that would require taking the runqueue lock
> which is quite expensive [1]. Instead estimate the missing time and update
> the task util_avg with that value:
>
> A + B = clock_task - clock_pelt + sched_clock_cpu() - clock
>
> sched_clock_cpu() is a costly function. Limit the usage to the case where
> the source CPU is idle as we know this is when the clock is having the
> biggest risk of being outdated.
>
> Neither clock_task, clock_pelt nor clock can be accessed without the
> runqueue lock. We then need to store those values in a timestamp variable
> which can be accessed during the migration. rq's enter_idle will give the
> wall-clock time when the rq went idle. We have then:
>
> B = sched_clock_cpu() - rq->enter_idle.
>
> Then, to catch-up the PELT clock scaling (A), two cases:
>
> * !CFS_BANDWIDTH: We can simply use clock_task(). This value is stored
> in rq's clock_pelt_idle, before the rq enters idle. The estimated time
> is then:
>
> rq->clock_pelt_idle + sched_clock_cpu() - rq->enter_idle.
>
> * CFS_BANDWIDTH: We can't catch-up with clock_task because of the
> throttled_clock_task_time offset. cfs_rq's clock_pelt_idle is then
> giving the PELT clock when the cfs_rq becomes idle. This gives:
>
> A = rq->clock_pelt_idle - cfs_rq->clock_pelt_idle

The code calculating A below is not consistent with this. The order is reversed.

> And gives the following estimated time:
>
> cfs_rq->last_update_time +
> rq->clock_pelt_idle - cfs_rq->clock_pelt_idle + (A)
> sched_clock_cpu() - rq->enter_idle (B)
>
> The (B) part of the missing time is however an estimation that doesn't
> take into account IRQ and Paravirt time.
>
> [1] https://lore.kernel.org/all/[email protected]/
>
> Signed-off-by: Vincent Donnefort <[email protected]>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index abd1feeec0c2..1256e2c0e2e2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3694,6 +3694,48 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
>
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> +#ifdef CONFIG_NO_HZ_COMMON
> +static inline void migrate_se_pelt_lag(struct sched_entity *se)
> +{
> + struct cfs_rq *cfs_rq;
> + struct rq *rq;
> + bool is_idle;
> + u64 now;
> +
> + cfs_rq = cfs_rq_of(se);
> + rq = rq_of(cfs_rq);
> +
> + rcu_read_lock();
> + is_idle = is_idle_task(rcu_dereference(rq->curr));
> + rcu_read_unlock();
> +
> + /*
> + * The lag estimation comes with a cost we don't want to pay all the
> + * time. Hence, limiting to the case where the source CPU is idle and
> + * we know we are at the greatest risk to have an outdated clock.
> + */
> + if (!is_idle)
> + return;
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> + now = u64_u32_load(cfs_rq->clock_pelt_idle);
> + /* The clock has been stopped for throttling */
> + if (now == U64_MAX)
> + return;
> +
> + now += cfs_rq_last_update_time(cfs_rq);
> + now -= u64_u32_load(rq->clock_pelt_idle);
> +#else
> + now = u64_u32_load(rq->clock_pelt_idle);
> +#endif
> + now += sched_clock_cpu(cpu_of(rq)) - u64_u32_load(rq->enter_idle);
> +
> + __update_load_avg_blocked_se(now, se);
> +}
> +#else
> +static void migrate_se_pelt_lag(struct sched_entity *se) {}
> +#endif
> +
> /**
> * update_cfs_rq_load_avg - update the cfs_rq's load/util averages
> * @now: current time, as per cfs_rq_clock_pelt()
> @@ -4429,6 +4471,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> */
> if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
> update_min_vruntime(cfs_rq);
> +
> + if (cfs_rq->nr_running == 0)
> + update_idle_cfs_rq_clock_pelt(cfs_rq);
> }
>
> /*
> @@ -6946,6 +6991,8 @@ static void detach_entity_cfs_rq(struct sched_entity *se);
> */
> static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> {
> + struct sched_entity *se = &p->se;
> +
> /*
> * As blocked tasks retain absolute vruntime the migration needs to
> * deal with this by subtracting the old and adding the new
> @@ -6953,7 +7000,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> * the task on the new runqueue.
> */
> if (READ_ONCE(p->__state) == TASK_WAKING) {
> - struct sched_entity *se = &p->se;
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
> se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
> @@ -6965,25 +7011,29 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> * rq->lock and can modify state directly.
> */
> lockdep_assert_rq_held(task_rq(p));
> - detach_entity_cfs_rq(&p->se);
> + detach_entity_cfs_rq(se);
>
> } else {
> + remove_entity_load_avg(se);
> +
> /*
> - * We are supposed to update the task to "current" time, then
> - * its up to date and ready to go to new CPU/cfs_rq. But we
> - * have difficulty in getting what current time is, so simply
> - * throw away the out-of-date time. This will result in the
> - * wakee task is less decayed, but giving the wakee more load
> - * sounds not bad.
> + * Here, the task's PELT values have been updated according to
> + * the current rq's clock. But if that clock hasn't been
> + * updated in a while, a substantial idle time will be missed,
> + * leading to an inflation after wake-up on the new rq.
> + *
> + * Estimate the missing time from the cfs_rq last_update_time
> + * and update sched_avg to improve the PELT continuity after
> + * migration.
> */
> - remove_entity_load_avg(&p->se);
> + migrate_se_pelt_lag(se);
> }
>
> /* Tell new CPU we are migrated */
> - p->se.avg.last_update_time = 0;
> + se->avg.last_update_time = 0;
>
> /* We have migrated, no longer consider this task hot */
> - p->se.exec_start = 0;
> + se->exec_start = 0;
>
> update_scan_period(p, new_cpu);
> }
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index 4ff2ed4f8fa1..0380f750adbe 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -103,6 +103,14 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
> rq->clock_pelt += delta;
> }
>
> +static inline u64 rq_clock_pelt(struct rq *rq)
> +{
> + lockdep_assert_rq_held(rq);
> + assert_clock_updated(rq);
> +
> + return rq->clock_pelt - rq->lost_idle_time;
> +}
> +
> /*
> * When rq becomes idle, we have to check if it has lost idle time
> * because it was fully busy. A rq is fully used when the /Sum util_sum
> @@ -130,17 +138,24 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
> */
> if (util_sum >= divider)
> rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> -}
>
> -static inline u64 rq_clock_pelt(struct rq *rq)
> -{
> - lockdep_assert_rq_held(rq);
> - assert_clock_updated(rq);
> + /* The rq is idle, we can sync with clock_task */
> + rq->clock_pelt = rq_clock_task(rq);
>
> - return rq->clock_pelt - rq->lost_idle_time;
> + u64_u32_store(rq->enter_idle, rq_clock(rq));
> + u64_u32_store(rq->clock_pelt_idle, rq_clock_pelt(rq));
> }
>
> #ifdef CONFIG_CFS_BANDWIDTH
> +static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
> +{
> + if (unlikely(cfs_rq->throttle_count))
> + u64_u32_store(cfs_rq->clock_pelt_idle, U64_MAX);
> + else
> + u64_u32_store(cfs_rq->clock_pelt_idle,
> + rq_clock_pelt(rq_of(cfs_rq)));
> +}
> +
> /* rq->task_clock normalized against any time this cfs_rq has spent throttled */
> static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
> {
> @@ -150,6 +165,7 @@ static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
> return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;
> }
> #else
> +static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) { }
> static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
> {
> return rq_clock_pelt(rq_of(cfs_rq));
> @@ -204,6 +220,7 @@ update_rq_clock_pelt(struct rq *rq, s64 delta) { }
> static inline void
> update_idle_rq_clock_pelt(struct rq *rq) { }
>
> +static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) { }
> #endif
>
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e2cf6e48b165..07014e8cbae2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -641,6 +641,10 @@ struct cfs_rq {
> int runtime_enabled;
> s64 runtime_remaining;
>
> + u64 clock_pelt_idle;
> +#ifndef CONFIG_64BIT
> + u64 clock_pelt_idle_copy;
> +#endif
> u64 throttled_clock;
> u64 throttled_clock_pelt;
> u64 throttled_clock_pelt_time;
> @@ -1013,6 +1017,12 @@ struct rq {
> u64 clock_task ____cacheline_aligned;
> u64 clock_pelt;
> unsigned long lost_idle_time;
> + u64 clock_pelt_idle;
> + u64 enter_idle;
> +#ifndef CONFIG_64BIT
> + u64 clock_pelt_idle_copy;
> + u64 enter_idle_copy;
> +#endif
>
> atomic_t nr_iowait;
>
> --
> 2.25.1
>

2022-04-27 11:40:28

by Vincent Donnefort

Subject: [PATCH v6 7/7] sched/fair: Remove the energy margin in feec()

find_energy_efficient_cpu() integrates a margin to protect tasks from
bouncing back and forth from one CPU to another. This margin is set to
6% of the total current energy estimated on the system. This, however, does
not work, for two reasons:

1. The energy estimation is not a good absolute value:

compute_energy() used in feec() is a good estimation for task placement as
it allows comparing the energy with and without a task. The computed
delta will give a good overview of the cost for a certain task placement.
It, however, doesn't work as an absolute estimation for the total energy
of the system. First, it adds the contribution of idle CPUs to the
energy; second, it mixes util_avg with util_est values. util_avg contains
the recent history of a CPU's usage; it doesn't tell at all what the current
utilization is. A system that has been quite busy in the recent past will
hold a very high energy and then a high margin, preventing any task
migration to a lower capacity CPU and wasting energy. It even creates a
self-reinforcing feedback loop: by holding the tasks on a less efficient CPU,
the margin contributes to keeping the energy high.

2. The margin handicaps small tasks:

On a system where the workload is composed mostly of small tasks (which is
often the case on Android), the overall energy will be high enough to
create a margin none of those tasks can cross. On a Pixel4, a small
utilization of 5% on all the CPUs creates a global estimated energy of 140
joules, as per the Energy Model declaration of that same device. This
means, after applying the 6% margin, that any migration must save more than
8 joules to happen. No task with a utilization lower than 40 would then be
able to migrate away from the biggest CPU of the system.
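
As a rough illustration of how the margin blocks small savings, here is a
minimal sketch of the final decision with and without the margin, modelled on
the condition this patch removes and the one it adds. Both function names are
hypothetical, and the values in the usage note are arbitrary units rather than
measurements. Note that the existing check uses a right shift by 4, i.e. 1/16
(6.25%), for the "6%".

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the old condition: pick the best CPU if
 * prev_cpu cannot be used, or if the saving exceeds 1/16 of the total
 * estimated energy. Callers are assumed to pass best_delta <= prev_delta.
 */
static bool migrates_with_margin(unsigned long prev_delta,
				 unsigned long best_delta,
				 unsigned long base_energy)
{
	if (prev_delta == ULONG_MAX)
		return true;

	assert(best_delta <= prev_delta);

	return (prev_delta - best_delta) > ((prev_delta + base_energy) >> 4);
}

/* Hypothetical helper mirroring the new condition: any saving is enough. */
static bool migrates_without_margin(unsigned long prev_delta,
				    unsigned long best_delta)
{
	return best_delta < prev_delta;
}
```

With a base energy of 140, a prev_delta of 10 and a best_delta of 5, the
saving of 5 does not exceed (10 + 140) >> 4 = 9, so the old check keeps the
task on prev_cpu, while the new check lets it move.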

The 6% of the overall system energy was brought by the following patch:

(eb92692b2544 sched/fair: Speed-up energy-aware wake-ups)

It was previously 6% of the prev_cpu energy. Also, the following one
made this margin value conditional on the clusters where the task fits:

(8d4c97c105ca sched/fair: Only compute base_energy_pd if necessary)

We could simply revert that margin change to what it was, but the original
version didn't have strong grounds either and, as demonstrated in (1.), the
estimated energy isn't a good absolute value. Instead, remove it
completely. This is made possible by recent changes that improved
energy estimation comparison fairness (sched/fair: Remove task_util from
effective utilization in feec()) (PM: EM: Increase energy calculation
precision) and task utilization stabilization (sched/fair: Decay task
PELT values during wakeup migration).

Without a margin, we could have feared bouncing between CPUs. But running
LISA's eas_behaviour test coverage on three different platforms (Hikey960,
RB-5 and DB-845) showed no issue.

Removing the energy margin enables more energy-optimized placements for a
more energy efficient system.

Signed-off-by: Vincent Donnefort <[email protected]>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a9941299547b..49ac5958aa69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6848,9 +6848,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
- int cpu, best_energy_cpu = prev_cpu, target = -1;
struct root_domain *rd = this_rq()->rd;
- unsigned long base_energy = 0;
+ int cpu, best_energy_cpu, target = -1;
struct sched_domain *sd;
struct perf_domain *pd;
struct energy_env eenv;
@@ -6882,8 +6881,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
unsigned long cpu_cap, cpu_thermal_cap, util;
unsigned long cur_delta, max_spare_cap = 0;
bool compute_prev_delta = false;
- unsigned long base_energy_pd;
int max_spare_cap_cpu = -1;
+ unsigned long base_energy;

cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);

@@ -6938,16 +6937,15 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

/* Compute the 'base' energy of the pd, without @p */
eenv_pd_busy_time(&eenv, cpus, p);
- base_energy_pd = compute_energy(&eenv, pd, cpus, p, -1);
- base_energy += base_energy_pd;
+ base_energy = compute_energy(&eenv, pd, cpus, p, -1);

/* Evaluate the energy impact of using prev_cpu. */
if (compute_prev_delta) {
prev_delta = compute_energy(&eenv, pd, cpus, p,
prev_cpu);
- if (prev_delta < base_energy_pd)
+ if (prev_delta < base_energy)
goto unlock;
- prev_delta -= base_energy_pd;
+ prev_delta -= base_energy;
best_delta = min(best_delta, prev_delta);
}

@@ -6955,9 +6953,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (max_spare_cap_cpu >= 0) {
cur_delta = compute_energy(&eenv, pd, cpus, p,
max_spare_cap_cpu);
- if (cur_delta < base_energy_pd)
+ if (cur_delta < base_energy)
goto unlock;
- cur_delta -= base_energy_pd;
+ cur_delta -= base_energy;
if (cur_delta < best_delta) {
best_delta = cur_delta;
best_energy_cpu = max_spare_cap_cpu;
@@ -6966,12 +6964,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
}
rcu_read_unlock();

- /*
- * Pick the best CPU if prev_cpu cannot be used, or if it saves at
- * least 6% of the energy used by prev_cpu.
- */
- if ((prev_delta == ULONG_MAX) ||
- (prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
+ if (best_delta < prev_delta)
target = best_energy_cpu;

return target;
--
2.25.1

2022-04-27 11:41:41

by Vincent Donnefort

Subject: Re: [PATCH v6 2/7] sched/fair: Decay task PELT values during wakeup migration

On 27/04/2022 10:25, Tao Zhou wrote:
> On Tue, Apr 26, 2022 at 10:35:01AM +0100, Vincent Donnefort wrote:
>
>> Before being migrated to a new CPU, a task sees its PELT values
>> synchronized with rq last_update_time. Once done, that same task will also
>> have its sched_avg last_update_time reset. This means the time between
>> the migration and the last clock update (B) will not be accounted for in
>> util_avg and a discontinuity will appear. This issue is amplified by the
>> PELT clock scaling. If the clock hasn't been updated while the CPU is
>> idle, clock_pelt will not be aligned with clock_task and that time (A)
>> will be also lost.
>>
>> ---------|----- A -----|-----------|------- B -----|>
>> clock_pelt clock_task clock now
>>
>> This is especially problematic for asymmetric CPU capacity systems which
>> need stable util_avg signals for task placement and energy estimation.
>>
>> Ideally, this problem would be solved by updating the runqueue clocks
>> before the migration. But that would require taking the runqueue lock
>> which is quite expensive [1]. Instead estimate the missing time and update
>> the task util_avg with that value:
>>
>> A + B = clock_task - clock_pelt + sched_clock_cpu() - clock
>>
>> sched_clock_cpu() is a costly function. Limit the usage to the case where
>> the source CPU is idle as we know this is when the clock is having the
>> biggest risk of being outdated.
>>
>> Neither clock_task, clock_pelt nor clock can be accessed without the
>> runqueue lock. We then need to store those values in a timestamp variable
>> which can be accessed during the migration. rq's enter_idle will give the
>> wall-clock time when the rq went idle. We have then:
>>
>> B = sched_clock_cpu() - rq->enter_idle.
>>
>> Then, to catch-up the PELT clock scaling (A), two cases:
>>
>> * !CFS_BANDWIDTH: We can simply use clock_task(). This value is stored
>> in rq's clock_pelt_idle, before the rq enters idle. The estimated time
>> is then:
>>
>> rq->clock_pelt_idle + sched_clock_cpu() - rq->enter_idle.
>>
>> * CFS_BANDWIDTH: We can't catch-up with clock_task because of the
>> throttled_clock_task_time offset. cfs_rq's clock_pelt_idle is then
>> giving the PELT clock when the cfs_rq becomes idle. This gives:
>>
>> A = rq->clock_pelt_idle - cfs_rq->clock_pelt_idle
>
> The code calculating A below is not consistent with this. The order is reversed.
>

Good catch, but the comment is actually correct; the code is not.
rq->clock_pelt_idle is updated _after_ cfs_rq->clock_pelt_idle. (see
previous email to Vincent)
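
To make the discrepancy concrete, here is a small sketch comparing both
orderings with made-up timestamps. The function and parameter names are
hypothetical stand-ins for the fields discussed above; now_as_posted() mirrors
the arithmetic of the CONFIG_CFS_BANDWIDTH branch in the posted patch, and b
stands for sched_clock_cpu() - rq->enter_idle.

```c
#include <assert.h>
#include <stdint.h>

/* Commit message: last_update_time + (rq idle - cfs_rq idle) + B */
static uint64_t now_as_described(uint64_t last_update_time,
				 uint64_t rq_clock_pelt_idle,
				 uint64_t cfs_rq_clock_pelt_idle,
				 uint64_t b)
{
	return last_update_time +
	       (rq_clock_pelt_idle - cfs_rq_clock_pelt_idle) + b;
}

/* Posted code: cfs_rq idle + last_update_time - rq idle + B */
static uint64_t now_as_posted(uint64_t last_update_time,
			      uint64_t rq_clock_pelt_idle,
			      uint64_t cfs_rq_clock_pelt_idle,
			      uint64_t b)
{
	uint64_t now = cfs_rq_clock_pelt_idle;

	now += last_update_time;
	now -= rq_clock_pelt_idle;
	now += b;

	return now;
}
```

With last_update_time = 800, rq_clock_pelt_idle = 1000,
cfs_rq_clock_pelt_idle = 900 and b = 300, the described formula gives 1200
while the posted arithmetic gives 1000: A is subtracted instead of added, so
the two results differ by 2 * A.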

Thanks,

[...]