2019-09-12 09:50:27

by Quentin Perret

[permalink] [raw]
Subject: [PATCH] sched/fair: Speed-up energy-aware wake-ups

From: Quentin Perret <[email protected]>

EAS computes the energy impact of migrating a waking task when deciding
on which CPU it should run. However, the current approach is known to
have a high algorithmic complexity, which can result in prohibitively
high wake-up latencies on systems with complex energy models, such as
systems with per-CPU DVFS. On such systems, the algorithm complexity is
in O(n^2) (ignoring the cost of searching for performance states in the
EM) with 'n' the number of CPUs.

To address this, re-factor the EAS wake-up path to compute the energy
'delta' (with and without the task) on a per-performance domain basis,
rather than system-wide, which brings the complexity down to O(n).

No functional changes intended.

Signed-off-by: Quentin Perret <[email protected]>
---

Test results
~~~~~~~~~~~~

* Setup: Tested on a Google Pixel 3, with a Snapdragon 845 (4+4 CPUs,
A55/A75). Base kernel is 5.3-rc5 + Pixel3 specific patches. Android
userspace, no graphics.

* Test case: Run a periodic rt-app task, with 16ms period, ramping down
from 70% to 10%, in 5% steps of 500 ms each (json avail. at [1]).
Frequencies of all CPUs are pinned to max (using scaling_min_freq
CPUFreq sysfs entries) to reduce variability. The time to run
select_task_rq_fair() is measured using the function profiler
(/sys/kernel/debug/tracing/trace_stat/function*). See the test script
for more details [2].

Test 1:
I hacked the DT to 'fake' per-CPU DVFS. That is, we end up with one
CPUFreq policy per CPU (8 policies in total). Since all frequencies are
pinned to max for the test, this should have no impact on the actual
frequency selection, but it does in the EAS calculation.

+---------------------------+----------------------------------+
| Without patch | With patch |
+-----+-----+----------+----------+-----+-----------------+----------+
| CPU | Hit | Avg (us) | s^2 (us) | Hit | Avg (us) | s^2 (us) |
|-----+-----+----------+----------+-----+-----------------+----------+
| 0 | 274 | 38.303 | 1750.239 | 401 | 14.126 (-63.1%) | 146.625 |
| 1 | 197 | 49.529 | 1695.852 | 314 | 16.135 (-67.4%) | 167.525 |
| 2 | 142 | 34.296 | 1758.665 | 302 | 14.133 (-58.8%) | 130.071 |
| 3 | 172 | 31.734 | 1490.975 | 641 | 14.637 (-53.9%) | 139.189 |
| 4 | 316 | 7.834 | 178.217 | 425 | 5.413 (-30.9%) | 20.803 |
| 5 | 447 | 8.424 | 144.638 | 556 | 5.929 (-29.6%) | 27.301 |
| 6 | 581 | 14.886 | 346.793 | 456 | 5.711 (-61.6%) | 23.124 |
| 7 | 456 | 10.005 | 211.187 | 997 | 4.708 (-52.9%) | 21.144 |
+-----+-----+----------+----------+-----+-----------------+----------+
* Hit, Avg and s^2 are as reported by the function profiler

Test 2:
I also ran the same test with a normal DT, with 2 CPUFreq policies, to
see if this causes regressions in the most common case.

+---------------------------+----------------------------------+
| Without patch | With patch |
+-----+-----+----------+----------+-----+-----------------+----------+
| CPU | Hit | Avg (us) | s^2 (us) | Hit | Avg (us) | s^2 (us) |
|-----+-----+----------+----------+-----+-----------------+----------+
| 0 | 345 | 22.184 | 215.321 | 580 | 18.635 (-16.0%) | 146.892 |
| 1 | 358 | 18.597 | 200.596 | 438 | 12.934 (-30.5%) | 104.604 |
| 2 | 359 | 25.566 | 200.217 | 397 | 10.826 (-57.7%) | 74.021 |
| 3 | 362 | 16.881 | 200.291 | 718 | 11.455 (-32.1%) | 102.280 |
| 4 | 457 | 3.822 | 9.895 | 757 | 4.616 (+20.8%) | 13.369 |
| 5 | 344 | 4.301 | 7.121 | 594 | 5.320 (+23.7%) | 18.798 |
| 6 | 472 | 4.326 | 7.849 | 464 | 5.648 (+30.6%) | 22.022 |
| 7 | 331 | 4.630 | 13.937 | 408 | 5.299 (+14.4%) | 18.273 |
+-----+-----+----------+----------+-----+-----------------+----------+
* Hit, Avg and s^2 are as reported by the function profiler

In addition to these two tests, I also ran 50 iterations of the Lisa
EAS functional test suite [3] with this patch applied on Arm Juno r0,
Arm Juno r2, Arm TC2 and Hikey960, and could not see any regressions
(all EAS functional tests are passing).

[1] https://paste.debian.net/1100055/
[2] https://paste.debian.net/1100057/
[3] https://github.com/ARM-software/lisa/blob/master/lisa/tests/scheduler/eas_behaviour.py
---
kernel/sched/fair.c | 110 ++++++++++++++++++++------------------------
1 file changed, 50 insertions(+), 60 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc9cfeaac8bd..8a521087e11d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6277,69 +6277,55 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
}

/*
- * compute_energy(): Estimates the energy that would be consumed if @p was
+ * compute_energy(): Estimates the energy that @pd would consume if @p was
* migrated to @dst_cpu. compute_energy() predicts what will be the utilization
- * landscape of the * CPUs after the task migration, and uses the Energy Model
+ * landscape of @pd's CPUs after the task migration, and uses the Energy Model
* to compute what would be the energy if we decided to actually migrate that
* task.
*/
static long
compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
{
- unsigned int max_util, util_cfs, cpu_util, cpu_cap;
- unsigned long sum_util, energy = 0;
- struct task_struct *tsk;
+ struct cpumask *pd_mask = perf_domain_span(pd);
+ unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
+ unsigned long max_util = 0, sum_util = 0;
int cpu;

- for (; pd; pd = pd->next) {
- struct cpumask *pd_mask = perf_domain_span(pd);
+ /*
+ * The capacity state of CPUs of the current rd can be driven by CPUs
+ * of another rd if they belong to the same pd. So, account for the
+ * utilization of these CPUs too by masking pd with cpu_online_mask
+ * instead of the rd span.
+ *
+ * If an entire pd is outside of the current rd, it will not appear in
+ * its pd list and will not be accounted by compute_energy().
+ */
+ for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
+ unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
+ struct task_struct *tsk = cpu == dst_cpu ? p : NULL;

/*
- * The energy model mandates all the CPUs of a performance
- * domain have the same capacity.
+ * Busy time computation: utilization clamping is not
+ * required since the ratio (sum_util / cpu_capacity)
+ * is already enough to scale the EM reported power
+ * consumption at the (eventually clamped) cpu_capacity.
*/
- cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
- max_util = sum_util = 0;
+ sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap,
+ ENERGY_UTIL, NULL);

/*
- * The capacity state of CPUs of the current rd can be driven by
- * CPUs of another rd if they belong to the same performance
- * domain. So, account for the utilization of these CPUs too
- * by masking pd with cpu_online_mask instead of the rd span.
- *
- * If an entire performance domain is outside of the current rd,
- * it will not appear in its pd list and will not be accounted
- * by compute_energy().
+ * Performance domain frequency: utilization clamping
+ * must be considered since it affects the selection
+ * of the performance domain frequency.
+ * NOTE: in case RT tasks are running, by default the
+ * FREQUENCY_UTIL's utilization can be max OPP.
*/
- for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
- util_cfs = cpu_util_next(cpu, p, dst_cpu);
-
- /*
- * Busy time computation: utilization clamping is not
- * required since the ratio (sum_util / cpu_capacity)
- * is already enough to scale the EM reported power
- * consumption at the (eventually clamped) cpu_capacity.
- */
- sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap,
- ENERGY_UTIL, NULL);
-
- /*
- * Performance domain frequency: utilization clamping
- * must be considered since it affects the selection
- * of the performance domain frequency.
- * NOTE: in case RT tasks are running, by default the
- * FREQUENCY_UTIL's utilization can be max OPP.
- */
- tsk = cpu == dst_cpu ? p : NULL;
- cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap,
- FREQUENCY_UTIL, tsk);
- max_util = max(max_util, cpu_util);
- }
-
- energy += em_pd_energy(pd->em_pd, max_util, sum_util);
+ cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap,
+ FREQUENCY_UTIL, tsk);
+ max_util = max(max_util, cpu_util);
}

- return energy;
+ return em_pd_energy(pd->em_pd, max_util, sum_util);
}

/*
@@ -6381,21 +6367,19 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* other use-cases too. So, until someone finds a better way to solve this,
* let's keep things simple by re-using the existing slow path.
*/
-
static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
- unsigned long prev_energy = ULONG_MAX, best_energy = ULONG_MAX;
+ unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
+ unsigned long cpu_cap, util, base_energy = 0;
int cpu, best_energy_cpu = prev_cpu;
- struct perf_domain *head, *pd;
- unsigned long cpu_cap, util;
struct sched_domain *sd;
+ struct perf_domain *pd;

rcu_read_lock();
pd = rcu_dereference(rd->pd);
if (!pd || READ_ONCE(rd->overutilized))
goto fail;
- head = pd;

/*
* Energy-aware wake-up happens on the lowest sched_domain starting
@@ -6412,9 +6396,14 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
goto unlock;

for (; pd; pd = pd->next) {
- unsigned long cur_energy, spare_cap, max_spare_cap = 0;
+ unsigned long cur_delta, spare_cap, max_spare_cap = 0;
+ unsigned long base_energy_pd;
int max_spare_cap_cpu = -1;

+ /* Compute the 'base' energy of the pd, without @p */
+ base_energy_pd = compute_energy(p, -1, pd);
+ base_energy += base_energy_pd;
+
for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
continue;
@@ -6427,9 +6416,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

/* Always use prev_cpu as a candidate. */
if (cpu == prev_cpu) {
- prev_energy = compute_energy(p, prev_cpu, head);
- best_energy = min(best_energy, prev_energy);
- continue;
+ prev_delta = compute_energy(p, prev_cpu, pd);
+ prev_delta -= base_energy_pd;
+ best_delta = min(best_delta, prev_delta);
}

/*
@@ -6445,9 +6434,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

/* Evaluate the energy impact of using this CPU. */
if (max_spare_cap_cpu >= 0) {
- cur_energy = compute_energy(p, max_spare_cap_cpu, head);
- if (cur_energy < best_energy) {
- best_energy = cur_energy;
+ cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
+ cur_delta -= base_energy_pd;
+ if (cur_delta < best_delta) {
+ best_delta = cur_delta;
best_energy_cpu = max_spare_cap_cpu;
}
}
@@ -6459,10 +6449,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
* Pick the best CPU if prev_cpu cannot be used, or if it saves at
* least 6% of the energy used by prev_cpu.
*/
- if (prev_energy == ULONG_MAX)
+ if (prev_delta == ULONG_MAX)
return best_energy_cpu;

- if ((prev_energy - best_energy) > (prev_energy >> 4))
+ if ((prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
return best_energy_cpu;

return prev_cpu;
--
2.22.1


2019-09-13 22:46:45

by tip-bot2 for Jacob Pan

[permalink] [raw]
Subject: [tip: sched/core] sched/fair: Speed-up energy-aware wake-ups

The following commit has been merged into the sched/core branch of tip:

Commit-ID: eb92692b2544d3f415887dbbc98499843dfe568b
Gitweb: https://git.kernel.org/tip/eb92692b2544d3f415887dbbc98499843dfe568b
Author: Quentin Perret <[email protected]>
AuthorDate: Thu, 12 Sep 2019 11:44:04 +02:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Fri, 13 Sep 2019 07:45:17 +02:00

sched/fair: Speed-up energy-aware wake-ups

EAS computes the energy impact of migrating a waking task when deciding
on which CPU it should run. However, the current approach is known to
have a high algorithmic complexity, which can result in prohibitively
high wake-up latencies on systems with complex energy models, such as
systems with per-CPU DVFS. On such systems, the algorithm complexity is
in O(n^2) (ignoring the cost of searching for performance states in the
EM) with 'n' the number of CPUs.

To address this, re-factor the EAS wake-up path to compute the energy
'delta' (with and without the task) on a per-performance domain basis,
rather than system-wide, which brings the complexity down to O(n).

No functional changes intended.

Test results
~~~~~~~~~~~~

* Setup: Tested on a Google Pixel 3, with a Snapdragon 845 (4+4 CPUs,
A55/A75). Base kernel is 5.3-rc5 + Pixel3 specific patches. Android
userspace, no graphics.

* Test case: Run a periodic rt-app task, with 16ms period, ramping down
from 70% to 10%, in 5% steps of 500 ms each (json avail. at [1]).
Frequencies of all CPUs are pinned to max (using scaling_min_freq
CPUFreq sysfs entries) to reduce variability. The time to run
select_task_rq_fair() is measured using the function profiler
(/sys/kernel/debug/tracing/trace_stat/function*). See the test script
for more details [2].

Test 1:

I hacked the DT to 'fake' per-CPU DVFS. That is, we end up with one
CPUFreq policy per CPU (8 policies in total). Since all frequencies are
pinned to max for the test, this should have no impact on the actual
frequency selection, but it does in the EAS calculation.

+---------------------------+----------------------------------+
| Without patch | With patch |
+-----+-----+----------+----------+-----+-----------------+----------+
| CPU | Hit | Avg (us) | s^2 (us) | Hit | Avg (us) | s^2 (us) |
|-----+-----+----------+----------+-----+-----------------+----------+
| 0 | 274 | 38.303 | 1750.239 | 401 | 14.126 (-63.1%) | 146.625 |
| 1 | 197 | 49.529 | 1695.852 | 314 | 16.135 (-67.4%) | 167.525 |
| 2 | 142 | 34.296 | 1758.665 | 302 | 14.133 (-58.8%) | 130.071 |
| 3 | 172 | 31.734 | 1490.975 | 641 | 14.637 (-53.9%) | 139.189 |
| 4 | 316 | 7.834 | 178.217 | 425 | 5.413 (-30.9%) | 20.803 |
| 5 | 447 | 8.424 | 144.638 | 556 | 5.929 (-29.6%) | 27.301 |
| 6 | 581 | 14.886 | 346.793 | 456 | 5.711 (-61.6%) | 23.124 |
| 7 | 456 | 10.005 | 211.187 | 997 | 4.708 (-52.9%) | 21.144 |
+-----+-----+----------+----------+-----+-----------------+----------+
* Hit, Avg and s^2 are as reported by the function profiler

Test 2:
I also ran the same test with a normal DT, with 2 CPUFreq policies, to
see if this causes regressions in the most common case.

+---------------------------+----------------------------------+
| Without patch | With patch |
+-----+-----+----------+----------+-----+-----------------+----------+
| CPU | Hit | Avg (us) | s^2 (us) | Hit | Avg (us) | s^2 (us) |
|-----+-----+----------+----------+-----+-----------------+----------+
| 0 | 345 | 22.184 | 215.321 | 580 | 18.635 (-16.0%) | 146.892 |
| 1 | 358 | 18.597 | 200.596 | 438 | 12.934 (-30.5%) | 104.604 |
| 2 | 359 | 25.566 | 200.217 | 397 | 10.826 (-57.7%) | 74.021 |
| 3 | 362 | 16.881 | 200.291 | 718 | 11.455 (-32.1%) | 102.280 |
| 4 | 457 | 3.822 | 9.895 | 757 | 4.616 (+20.8%) | 13.369 |
| 5 | 344 | 4.301 | 7.121 | 594 | 5.320 (+23.7%) | 18.798 |
| 6 | 472 | 4.326 | 7.849 | 464 | 5.648 (+30.6%) | 22.022 |
| 7 | 331 | 4.630 | 13.937 | 408 | 5.299 (+14.4%) | 18.273 |
+-----+-----+----------+----------+-----+-----------------+----------+
* Hit, Avg and s^2 are as reported by the function profiler

In addition to these two tests, I also ran 50 iterations of the Lisa
EAS functional test suite [3] with this patch applied on Arm Juno r0,
Arm Juno r2, Arm TC2 and Hikey960, and could not see any regressions
(all EAS functional tests are passing).

[1] https://paste.debian.net/1100055/
[2] https://paste.debian.net/1100057/
[3] https://github.com/ARM-software/lisa/blob/master/lisa/tests/scheduler/eas_behaviour.py

Signed-off-by: Quentin Perret <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 110 +++++++++++++++++++------------------------
1 file changed, 50 insertions(+), 60 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1054d2c..8b66511 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6251,69 +6251,55 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
}

/*
- * compute_energy(): Estimates the energy that would be consumed if @p was
+ * compute_energy(): Estimates the energy that @pd would consume if @p was
* migrated to @dst_cpu. compute_energy() predicts what will be the utilization
- * landscape of the * CPUs after the task migration, and uses the Energy Model
+ * landscape of @pd's CPUs after the task migration, and uses the Energy Model
* to compute what would be the energy if we decided to actually migrate that
* task.
*/
static long
compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
{
- unsigned int max_util, util_cfs, cpu_util, cpu_cap;
- unsigned long sum_util, energy = 0;
- struct task_struct *tsk;
+ struct cpumask *pd_mask = perf_domain_span(pd);
+ unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
+ unsigned long max_util = 0, sum_util = 0;
int cpu;

- for (; pd; pd = pd->next) {
- struct cpumask *pd_mask = perf_domain_span(pd);
+ /*
+ * The capacity state of CPUs of the current rd can be driven by CPUs
+ * of another rd if they belong to the same pd. So, account for the
+ * utilization of these CPUs too by masking pd with cpu_online_mask
+ * instead of the rd span.
+ *
+ * If an entire pd is outside of the current rd, it will not appear in
+ * its pd list and will not be accounted by compute_energy().
+ */
+ for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
+ unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
+ struct task_struct *tsk = cpu == dst_cpu ? p : NULL;

/*
- * The energy model mandates all the CPUs of a performance
- * domain have the same capacity.
+ * Busy time computation: utilization clamping is not
+ * required since the ratio (sum_util / cpu_capacity)
+ * is already enough to scale the EM reported power
+ * consumption at the (eventually clamped) cpu_capacity.
*/
- cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
- max_util = sum_util = 0;
+ sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap,
+ ENERGY_UTIL, NULL);

/*
- * The capacity state of CPUs of the current rd can be driven by
- * CPUs of another rd if they belong to the same performance
- * domain. So, account for the utilization of these CPUs too
- * by masking pd with cpu_online_mask instead of the rd span.
- *
- * If an entire performance domain is outside of the current rd,
- * it will not appear in its pd list and will not be accounted
- * by compute_energy().
+ * Performance domain frequency: utilization clamping
+ * must be considered since it affects the selection
+ * of the performance domain frequency.
+ * NOTE: in case RT tasks are running, by default the
+ * FREQUENCY_UTIL's utilization can be max OPP.
*/
- for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
- util_cfs = cpu_util_next(cpu, p, dst_cpu);
-
- /*
- * Busy time computation: utilization clamping is not
- * required since the ratio (sum_util / cpu_capacity)
- * is already enough to scale the EM reported power
- * consumption at the (eventually clamped) cpu_capacity.
- */
- sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap,
- ENERGY_UTIL, NULL);
-
- /*
- * Performance domain frequency: utilization clamping
- * must be considered since it affects the selection
- * of the performance domain frequency.
- * NOTE: in case RT tasks are running, by default the
- * FREQUENCY_UTIL's utilization can be max OPP.
- */
- tsk = cpu == dst_cpu ? p : NULL;
- cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap,
- FREQUENCY_UTIL, tsk);
- max_util = max(max_util, cpu_util);
- }
-
- energy += em_pd_energy(pd->em_pd, max_util, sum_util);
+ cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap,
+ FREQUENCY_UTIL, tsk);
+ max_util = max(max_util, cpu_util);
}

- return energy;
+ return em_pd_energy(pd->em_pd, max_util, sum_util);
}

/*
@@ -6355,21 +6341,19 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* other use-cases too. So, until someone finds a better way to solve this,
* let's keep things simple by re-using the existing slow path.
*/
-
static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
- unsigned long prev_energy = ULONG_MAX, best_energy = ULONG_MAX;
+ unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
+ unsigned long cpu_cap, util, base_energy = 0;
int cpu, best_energy_cpu = prev_cpu;
- struct perf_domain *head, *pd;
- unsigned long cpu_cap, util;
struct sched_domain *sd;
+ struct perf_domain *pd;

rcu_read_lock();
pd = rcu_dereference(rd->pd);
if (!pd || READ_ONCE(rd->overutilized))
goto fail;
- head = pd;

/*
* Energy-aware wake-up happens on the lowest sched_domain starting
@@ -6386,9 +6370,14 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
goto unlock;

for (; pd; pd = pd->next) {
- unsigned long cur_energy, spare_cap, max_spare_cap = 0;
+ unsigned long cur_delta, spare_cap, max_spare_cap = 0;
+ unsigned long base_energy_pd;
int max_spare_cap_cpu = -1;

+ /* Compute the 'base' energy of the pd, without @p */
+ base_energy_pd = compute_energy(p, -1, pd);
+ base_energy += base_energy_pd;
+
for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
continue;
@@ -6401,9 +6390,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

/* Always use prev_cpu as a candidate. */
if (cpu == prev_cpu) {
- prev_energy = compute_energy(p, prev_cpu, head);
- best_energy = min(best_energy, prev_energy);
- continue;
+ prev_delta = compute_energy(p, prev_cpu, pd);
+ prev_delta -= base_energy_pd;
+ best_delta = min(best_delta, prev_delta);
}

/*
@@ -6419,9 +6408,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

/* Evaluate the energy impact of using this CPU. */
if (max_spare_cap_cpu >= 0) {
- cur_energy = compute_energy(p, max_spare_cap_cpu, head);
- if (cur_energy < best_energy) {
- best_energy = cur_energy;
+ cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
+ cur_delta -= base_energy_pd;
+ if (cur_delta < best_delta) {
+ best_delta = cur_delta;
best_energy_cpu = max_spare_cap_cpu;
}
}
@@ -6433,10 +6423,10 @@ unlock:
* Pick the best CPU if prev_cpu cannot be used, or if it saves at
* least 6% of the energy used by prev_cpu.
*/
- if (prev_energy == ULONG_MAX)
+ if (prev_delta == ULONG_MAX)
return best_energy_cpu;

- if ((prev_energy - best_energy) > (prev_energy >> 4))
+ if ((prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
return best_energy_cpu;

return prev_cpu;

2019-09-20 19:10:38

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Speed-up energy-aware wake-ups

Hi Pavan,

On Friday 20 Sep 2019 at 08:32:15 (+0530), Pavan Kondeti wrote:
> Earlier, we are not checking the spare capacity for the prev_cpu. Now that the
> continue statement is removed, prev_cpu could also be the max_spare_cap_cpu.
> Actually that makes sense. Because there is no reason why we want to select
> another CPU which has less spare capacity than previous CPU.
>
> Is this behavior intentional?

The intent was indeed to not compute the energy for another CPU in
prev_cpu's perf domain if prev_cpu is the one with max spare cap -- it
is useless to do so since this other CPU cannot 'beat' prev_cpu and
will never be chosen in the end.

But I did miss that we'd end up computing the energy for prev_cpu
twice ... Harmless but useless. So yeah, let's optimize that case too :)

> When prev_cpu == max_spare_cap_cpu, we are evaluating the energy again for the
> same CPU below. That could have been skipped by returning prev_cpu when
> prev_cpu == max_spare_cap_cpu.

Right, something like the patch below ? My test results are still
looking good with it applied.

Thanks for the careful review,
Quentin
---
From 7b8258287f180a2c383ebe397e8129f5f898ffbe Mon Sep 17 00:00:00 2001
From: Quentin Perret <[email protected]>
Date: Fri, 20 Sep 2019 09:07:20 +0100
Subject: [PATCH] sched/fair: Avoid redundant EAS calculation

The EAS wake-up path computes the system energy for several CPU
candidates: the CPU with maximum spare capacity in each performance
domain, and the prev_cpu. However, if prev_cpu also happens to be the
CPU with maximum spare capacity in its performance domain, the energy
calculation is still done twice, unnecessarily.

Add a condition to filter out this corner case before doing the energy
calculation.

Reported-by: Pavan Kondeti <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4bbf68c3161..7399382bc291 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6412,7 +6412,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
}

/* Evaluate the energy impact of using this CPU. */
- if (max_spare_cap_cpu >= 0) {
+ if (max_spare_cap_cpu >= 0 && max_spare_cap_cpu != prev_cpu) {
cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
cur_delta -= base_energy_pd;
if (cur_delta < best_delta) {
--
2.22.1

2019-09-20 19:15:26

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Speed-up energy-aware wake-ups

On Friday 20 Sep 2019 at 16:03:38 (+0530), Pavan Kondeti wrote:
> +1. Looks good to me.

Cool, thanks.

Peter/Ingo, is there anything else I should do ? Should I resend the
patch 'properly' (that is, not inline) ?

Thanks,
Quentin

2019-09-20 22:52:58

by Pavankumar Kondeti

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Speed-up energy-aware wake-ups

Hi Quentin,

On Thu, Sep 12, 2019 at 11:44:04AM +0200, Quentin Perret wrote:
> From: Quentin Perret <[email protected]>
>
> EAS computes the energy impact of migrating a waking task when deciding
> on which CPU it should run. However, the current approach is known to
> have a high algorithmic complexity, which can result in prohibitively
> high wake-up latencies on systems with complex energy models, such as
> systems with per-CPU DVFS. On such systems, the algorithm complexity is
> in O(n^2) (ignoring the cost of searching for performance states in the
> EM) with 'n' the number of CPUs.
>
> To address this, re-factor the EAS wake-up path to compute the energy
> 'delta' (with and without the task) on a per-performance domain basis,
> rather than system-wide, which brings the complexity down to O(n).
>
> No functional changes intended.
>
> Signed-off-by: Quentin Perret <[email protected]>
>

[snip]

> /*
> @@ -6381,21 +6367,19 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> * other use-cases too. So, until someone finds a better way to solve this,
> * let's keep things simple by re-using the existing slow path.
> */
> -
> static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> {
> - unsigned long prev_energy = ULONG_MAX, best_energy = ULONG_MAX;
> + unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
> struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
> + unsigned long cpu_cap, util, base_energy = 0;
> int cpu, best_energy_cpu = prev_cpu;
> - struct perf_domain *head, *pd;
> - unsigned long cpu_cap, util;
> struct sched_domain *sd;
> + struct perf_domain *pd;
>
> rcu_read_lock();
> pd = rcu_dereference(rd->pd);
> if (!pd || READ_ONCE(rd->overutilized))
> goto fail;
> - head = pd;
>
> /*
> * Energy-aware wake-up happens on the lowest sched_domain starting
> @@ -6412,9 +6396,14 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> goto unlock;
>
> for (; pd; pd = pd->next) {
> - unsigned long cur_energy, spare_cap, max_spare_cap = 0;
> + unsigned long cur_delta, spare_cap, max_spare_cap = 0;
> + unsigned long base_energy_pd;
> int max_spare_cap_cpu = -1;
>
> + /* Compute the 'base' energy of the pd, without @p */
> + base_energy_pd = compute_energy(p, -1, pd);
> + base_energy += base_energy_pd;
> +
> for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
> if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> continue;
> @@ -6427,9 +6416,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>
> /* Always use prev_cpu as a candidate. */
> if (cpu == prev_cpu) {
> - prev_energy = compute_energy(p, prev_cpu, head);
> - best_energy = min(best_energy, prev_energy);
> - continue;
> + prev_delta = compute_energy(p, prev_cpu, pd);
> + prev_delta -= base_energy_pd;
> + best_delta = min(best_delta, prev_delta);
> }

Earlier, we are not checking the spare capacity for the prev_cpu. Now that the
continue statement is removed, prev_cpu could also be the max_spare_cap_cpu.
Actually that makes sense. Because there is no reason why we want to select
another CPU which has less spare capacity than previous CPU.

Is this behavior intentional?

When prev_cpu == max_spare_cap_cpu, we are evaluating the energy again for the
same CPU below. That could have been skipped by returning prev_cpu when
prev_cpu == max_spare_cap_cpu.

>
> /*
> @@ -6445,9 +6434,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>
> /* Evaluate the energy impact of using this CPU. */
> if (max_spare_cap_cpu >= 0) {
> - cur_energy = compute_energy(p, max_spare_cap_cpu, head);
> - if (cur_energy < best_energy) {
> - best_energy = cur_energy;
> + cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
> + cur_delta -= base_energy_pd;
> + if (cur_delta < best_delta) {
> + best_delta = cur_delta;
> best_energy_cpu = max_spare_cap_cpu;
> }
> }
> @@ -6459,10 +6449,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> * Pick the best CPU if prev_cpu cannot be used, or if it saves at
> * least 6% of the energy used by prev_cpu.
> */
> - if (prev_energy == ULONG_MAX)
> + if (prev_delta == ULONG_MAX)
> return best_energy_cpu;
>
> - if ((prev_energy - best_energy) > (prev_energy >> 4))
> + if ((prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
> return best_energy_cpu;
>
> return prev_cpu;
> --
> 2.22.1
>

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

2019-09-21 13:56:28

by Pavankumar Kondeti

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Speed-up energy-aware wake-ups

Hi Quentin,

On Fri, Sep 20, 2019 at 11:41:15AM +0200, Quentin Perret wrote:
> Hi Pavan,
>
> On Friday 20 Sep 2019 at 08:32:15 (+0530), Pavan Kondeti wrote:
> > Earlier, we are not checking the spare capacity for the prev_cpu. Now that the
> > continue statement is removed, prev_cpu could also be the max_spare_cap_cpu.
> > Actually that makes sense. Because there is no reason why we want to select
> > another CPU which has less spare capacity than previous CPU.
> >
> > Is this behavior intentional?
>
> The intent was indeed to not compute the energy for another CPU in
> prev_cpu's perf domain if prev_cpu is the one with max spare cap -- it
> is useless to do so since this other CPU cannot 'beat' prev_cpu and
> will never be chosen in the end.

Yes. Selecting the prev_cpu is the correct decision.

>
> But I did miss that we'd end up computing the energy for prev_cpu
> twice ... Harmless but useless. So yeah, let's optimize that case too :)
>
> > When prev_cpu == max_spare_cap_cpu, we are evaluating the energy again for the
> > same CPU below. That could have been skipped by returning prev_cpu when
> > prev_cpu == max_spare_cap_cpu.
>
> Right, something like the patch below ? My test results are still
> looking good with it applied.
>
> Thanks for the careful review,
> Quentin
> ---
> From 7b8258287f180a2c383ebe397e8129f5f898ffbe Mon Sep 17 00:00:00 2001
> From: Quentin Perret <[email protected]>
> Date: Fri, 20 Sep 2019 09:07:20 +0100
> Subject: [PATCH] sched/fair: Avoid redundant EAS calculation
>
> The EAS wake-up path computes the system energy for several CPU
> candidates: the CPU with maximum spare capacity in each performance
> domain, and the prev_cpu. However, if prev_cpu also happens to be the
> CPU with maximum spare capacity in its performance domain, the energy
> calculation is still done twice, unnecessarily.
>
> Add a condition to filter out this corner case before doing the energy
> calculation.
>
> Reported-by: Pavan Kondeti <[email protected]>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d4bbf68c3161..7399382bc291 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6412,7 +6412,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> }
>
> /* Evaluate the energy impact of using this CPU. */
> - if (max_spare_cap_cpu >= 0) {
> + if (max_spare_cap_cpu >= 0 && max_spare_cap_cpu != prev_cpu) {
> cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
> cur_delta -= base_energy_pd;
> if (cur_delta < best_delta) {
> --
> 2.22.1
>

+1. Looks good to me.

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

2019-09-26 09:21:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Speed-up energy-aware wake-ups

On Fri, Sep 20, 2019 at 01:23:00PM +0200, Quentin Perret wrote:
> On Friday 20 Sep 2019 at 16:03:38 (+0530), Pavan Kondeti wrote:
> > +1. Looks good to me.
>
> Cool, thanks.
>
> Peter/Ingo, is there anything else I should do ? Should I resend the
> patch 'properly' (that is, not inline) ?

Got it, thanks!

2019-09-27 08:13:40

by tip-bot2 for Jacob Pan

[permalink] [raw]
Subject: [tip: sched/urgent] sched/fair: Avoid redundant EAS calculation

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 4892f51ad54ddff2883a60b6ad4323c1f632a9d6
Gitweb: https://git.kernel.org/tip/4892f51ad54ddff2883a60b6ad4323c1f632a9d6
Author: Quentin Perret <[email protected]>
AuthorDate: Fri, 20 Sep 2019 11:41:15 +02:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Wed, 25 Sep 2019 17:42:32 +02:00

sched/fair: Avoid redundant EAS calculation

The EAS wake-up path computes the system energy for several CPU
candidates: the CPU with maximum spare capacity in each performance
domain, and the prev_cpu. However, if prev_cpu also happens to be the
CPU with maximum spare capacity in its performance domain, the energy
calculation is still done twice, unnecessarily.

Add a condition to filter out this corner case before doing the energy
calculation.

Reported-by: Pavan Kondeti <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Fixes: eb92692b2544 ("sched/fair: Speed-up energy-aware wake-ups")
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dfdac90..83ab35e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6389,7 +6389,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
}

/* Evaluate the energy impact of using this CPU. */
- if (max_spare_cap_cpu >= 0) {
+ if (max_spare_cap_cpu >= 0 && max_spare_cap_cpu != prev_cpu) {
cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
cur_delta -= base_energy_pd;
if (cur_delta < best_delta) {