2018-03-20 09:45:47

by Dietmar Eggemann

Subject: [RFC PATCH 0/6] Energy Aware Scheduling

1. Overview

The Energy Aware Scheduler (EAS), based on Morten Rasmussen's posting
on LKML [1], is currently part of the AOSP Common Kernel and runs on
today's smartphones with Arm's big.LITTLE CPUs.
Based on the experience gained over the last two and a half years in
product development, we propose an energy-model-based task placement
for CPUs with asymmetric core capacities (e.g. Arm big.LITTLE or
DynamIQ), to align with the EAS adopted by the AOSP Common Kernel. We
have developed a simplified energy model, based on the physical
active power/performance curve of each core type, using existing SoC
power/performance data already known to the kernel. The energy model
is used to select the most energy-efficient CPU on which to place each
task, taking utilization into account.

1.1 Energy Model

A CPU with asymmetric core capacities features cores with significantly
different energy and performance characteristics. As the configurations
can vary greatly from one SoC to another, designing an energy-efficient
scheduling heuristic that performs well on a broad spectrum of
platforms is particularly hard.
This proposal attempts to solve this issue by providing the scheduler
with an energy model of the platform, which enables the energy impact
of scheduling decisions to be estimated in a generic way. The energy
model is kept very simple: it represents only the active power of CPUs
at all available P-states and relies on data already present in the
kernel (so far used only by the thermal subsystem).
This proposal does not include the power consumption of C-states and
cluster-level resources, which were part of the original posting [1],
for two reasons: first, their impact on task placement decisions
appears to be negligible on modern asymmetric platforms, and second,
they require additional infrastructure and data (e.g. new DT entries).
The scheduler is also informed of the span of frequency domains, hence
enabling an accurate accounting of the energy costs of frequency
changes. This is especially important for future Arm CPU topologies
(DynamIQ), where the span of scheduling domains can differ from the
span of frequency domains.
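
To make the shape of the model concrete, here is a minimal sketch that
mirrors the per-OPP data structure introduced in patch 2 and the
per-CPU active energy estimate used by compute_energy() in patch 4
(illustrative only; the real code also applies schedutil's 1.25 OPP
selection margin, which is omitted here):

---8<---
/* One entry per OPP, as introduced in patch 2. */
struct capacity_state {
	unsigned long cap;	/* compute capacity at this OPP */
	unsigned long power;	/* active power at this OPP */
};

/*
 * Active energy contribution of one CPU at capacity state "cs" with
 * utilization "util": the OPP's power scaled by the CPU's expected
 * percentage of busy time (util / cap), as in patch 4.
 */
static unsigned long cpu_active_energy(struct capacity_state *cs,
				       unsigned long util)
{
	return cs->power * util / cs->cap;
}
---8<---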

1.2 Overutilization/Tipping Point

The primary job of the task scheduler is to deliver the highest
possible throughput with minimal latency. As utilization rises, the
opportunities for the scheduler to save energy become rarer. There
must be spare CPU time available to place tasks based on utilization
in an energy-aware fashion, i.e. to pack tasks on energy-efficient
CPUs without unnecessarily constraining the task throughput. This
spare CPU time decreases towards zero as the utilization of the
system rises.
To cope with this situation, we introduce the concept of
overutilization in order to enable/disable EAS depending on system
utilization. The point at which a system switches from being not
overutilized to being overutilized, or vice versa, is called the
tipping point. A per sched domain tipping point indicator
implementation is introduced here.
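
As a minimal sketch of the condition behind the indicator (this is the
check introduced by patches 1 and 3, where capacity_margin is 1280,
i.e. a 1.25 margin, so a CPU counts as overutilized above ~80% of its
capacity):

---8<---
static inline int util_fits_capacity(unsigned long util,
				     unsigned long capacity)
{
	return capacity * 1024 > util * capacity_margin;
}

static inline int cpu_overutilized(int cpu)
{
	return !util_fits_capacity(cpu_util(cpu), capacity_of(cpu));
}
---8<---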

1.3 Wakeup path

On a system which has an energy model, the energy-aware wakeup path
takes precedence over the affine and capacity-based wakeup paths,
provided that the lowest sched domain of the task's previous CPU is
not overutilized. The energy-aware algorithm tries to find, among the
CPUs of the highest non-overutilized domain which includes both the
previous and the current CPU, a target CPU on which placing the task
would add the least energy consumption. The energy model is only
enabled on CPUs with asymmetric core capacities (SD_ASYM_CPUCAPACITY).
These systems typically have eight or fewer cores.
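
Stripped down to its core, the selection introduced in patch 5 is the
following sketch (simplified; the real find_energy_efficient_cpu()
also verifies that the task fits each candidate CPU and only migrates
away from prev_cpu if at least ~1.5% of its energy is saved):

---8<---
unsigned long cur_energy, best_energy;
int cpu, best_cpu = prev_cpu;

best_energy = compute_energy(p, prev_cpu);

for_each_cpu_and(cpu, &p->cpus_allowed, sched_domain_span(sd)) {
	cur_energy = compute_energy(p, cpu);
	if (cur_energy < best_energy) {
		best_energy = cur_energy;
		best_cpu = cpu;
	}
}
---8<---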

2. Tests

Two fundamentally different tests were executed. First, the energy
test case shows the impact of this patch-set on energy consumption,
using a synthetic set of tasks. Second, the performance test case
provides the conventional hackbench metric numbers.

The tests run on two arm64 big.LITTLE platforms: Hikey960 (4xA73 +
4xA53) and Juno r0 (2xA57 + 4xA53).

Base kernel is tip/sched/core (4.16-rc4), with some Hikey960- and
Juno-specific patches, the SD_ASYM_CPUCAPACITY flag set at DIE sched
domain level for arm64, and schedutil as the cpufreq governor [2].

2.1 Energy test case

10 iterations of between 10 and 50 periodic rt-app tasks (16ms period,
5% duty-cycle) running for 30 seconds, with energy measurement. Unit
is Joules.
The goal is to save energy, so lower is better.

2.1.1 Hikey960

Energy is measured with an ACME Cape on an instrumented board. Numbers
include consumption of big and little CPUs, LPDDR memory, GPU and most
of the other small components on the board. They do not include the
consumption of the radio chip (turned off anyway) or of the external
connectors.

+----------+-----------------+------------------------+
| | Without patches | With patches |
+----------+---------+-------+-----------------+------+
| Tasks nb | Mean | RSD* | Mean | RSD* |
+----------+---------+-------+-----------------+------+
| 10 | 41.50 | 1.1% | 37.43 (-9.81%) | 2.0% |
| 20 | 55.51 | 0.7% | 50.74 (-8.59%) | 1.5% |
| 30 | 75.39 | 0.4% | 70.36 (-6.67%) | 7.3% |
| 40 | 95.82 | 0.3% | 89.90 (-6.18%) | 1.5% |
| 50 | 121.53 | 0.9% | 112.61 (-7.34%) | 0.9% |
+----------+---------+-------+-----------------+------+

2.1.2 Juno r0

Energy is measured with the onboard energy meter. Numbers include
consumption of big and little CPUs.

+----------+-----------------+------------------------+
| | Without patches | With patches |
+----------+--------+--------+-----------------+------+
| Tasks nb | Mean | RSD* | Mean | RSD* |
+----------+--------+--------+-----------------+------+
| 10 | 11.52 | 1.1% | 7.67 (-33.42%) | 2.8% |
| 20 | 19.25 | 0.9% | 13.39 (-30.44%) | 1.8% |
| 30 | 28.73 | 1.3% | 21.85 (-31.49%) | 0.6% |
| 40 | 37.58 | 0.9% | 31.40 (-16.44%) | 0.4% |
| 50 | 47.24 | 0.6% | 45.37 ( -3.96%) | 0.6% |
+----------+--------+--------+-----------------+------+

2.2 Performance test case

30 iterations of perf bench sched messaging --pipe --thread --group G
--loop L with G=[1 2 4 8] and L=50000 (Hikey960)/16000 (Juno r0).

2.2.1 Hikey960

The impact of thermal capping was mitigated thanks to a heatsink, a
fan, and a 10 sec delay between two successive executions.

+----------------+-----------------+------------------------+
| | Without patches | With patches |
+--------+-------+---------+-------+----------------+-------+
| Groups | Tasks | Mean | RSD* | Mean | RSD* |
+--------+-------+---------+-------+----------------+-------+
| 1 | 40 | 8.01 | 1.70% | 8.16 (+1.90%) | 1.79% |
| 2 | 80 | 15.59 | 0.76% | 15.79 (+1.33%) | 0.92% |
| 4 | 160 | 32.23 | 0.70% | 32.46 (+0.72%) | 0.55% |
| 8 | 320 | 66.93 | 0.46% | 67.40 (+0.69%) | 0.37% |
+--------+-------+---------+-------+----------------+-------+

2.2.2 Juno r0

+----------------+-----------------+------------------------+
| | Without patches | With patches |
+--------+-------+---------+-------+----------------+-------+
| Groups | Tasks | Mean | RSD* | Mean | RSD* |
+--------+-------+---------+-------+----------------+-------+
| 1 | 40 | 8.37 | 0.12% | 8.33 ( 0.00%) | 0.08% |
| 2 | 80 | 14.63 | 0.12% | 14.49 (-0.01%) | 0.14% |
| 4 | 160 | 27.17 | 0.14% | 26.80 (-0.01%) | 0.14% |
| 8 | 320 | 52.50 | 0.25% | 51.54 (-0.02%) | 0.23% |
+--------+-------+---------+-------+----------------+-------+

*RSD: Relative Standard Deviation (std dev / mean)

3. Dependencies

This series depends on additional infrastructure being merged in the
OPP core. As this infrastructure can also be useful for other clients,
the related patches have been posted separately [3].

[1] https://lkml.org/lkml/2015/7/7/754
[2] http://www.linux-arm.org/git?p=linux-de.git;a=shortlog;h=refs/heads/upstream/eas_v1_base
[3] https://marc.info/?l=linux-pm&m=151635516419249&w=2

Dietmar Eggemann (1):
sched/fair: Create util_fits_capacity()

Quentin Perret (4):
sched: Introduce energy models of CPUs
sched/fair: Introduce an energy estimation helper function
sched/fair: Select an energy-efficient CPU on task wake-up
drivers: base: arch_topology.c: Enable EAS for arm/arm64 platforms

Thara Gopinath (1):
sched: Add over-utilization/tipping point indicator

drivers/base/arch_topology.c | 2 +
include/linux/sched/energy.h | 31 ++++++
include/linux/sched/topology.h | 1 +
kernel/sched/Makefile | 2 +-
kernel/sched/energy.c | 190 ++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 226 +++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 12 +--
8 files changed, 449 insertions(+), 16 deletions(-)
create mode 100644 include/linux/sched/energy.h
create mode 100644 kernel/sched/energy.c

--
2.11.0



2018-03-20 09:45:57

by Dietmar Eggemann

Subject: [RFC PATCH 4/6] sched/fair: Introduce an energy estimation helper function

From: Quentin Perret <[email protected]>

In preparation for the definition of an energy-aware wakeup path, a
helper function is provided to estimate the impact on system energy
when a specific task wakes up on a specific CPU. compute_energy()
estimates the OPPs to be reached by all frequency domains and
estimates the consumption of each online CPU according to its energy
model and its percentage of busy time.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 81 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6c72a5e7b1b0..76bd46502486 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu)
}

/*
+ * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
+ */
+static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
+{
+ unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
+ unsigned long capacity = capacity_orig_of(cpu);
+
+ /*
+ * If p is where it should be, or if it has no impact on cpu, there is
+ * not much to do.
+ */
+ if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
+ goto clamp_util;
+
+ if (dst_cpu == cpu)
+ util += task_util(p);
+ else
+ util = max_t(long, util - task_util(p), 0);
+
+clamp_util:
+ return (util >= capacity) ? capacity : util;
+}
+
+/*
* Disable WAKE_AFFINE in the case where task @p doesn't fit in the
* capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
*
@@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
return !util_fits_capacity(task_util(p), min_cap);
}

+static struct capacity_state *find_cap_state(int cpu, unsigned long util)
+{
+ struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
+ struct capacity_state *cs = NULL;
+ int i;
+
+ /*
+ * As the goal is to estimate the OPP reached for a specific util
+ * value, mimic the behaviour of schedutil with a 1.25 coefficient
+ */
+ util += util >> 2;
+
+ for (i = 0; i < em->nr_cap_states; i++) {
+ cs = &em->cap_states[i];
+ if (cs->cap >= util)
+ break;
+ }
+
+ return cs;
+}
+
+static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
+{
+ unsigned long util, fdom_max_util;
+ struct capacity_state *cs;
+ unsigned long energy = 0;
+ struct freq_domain *fdom;
+ int cpu;
+
+ for_each_freq_domain(fdom) {
+ fdom_max_util = 0;
+ for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
+ util = cpu_util_next(cpu, p, dst_cpu);
+ fdom_max_util = max(util, fdom_max_util);
+ }
+
+ /*
+ * Here we assume that the capacity states of CPUs belonging to
+ * the same frequency domains are shared. Hence, we look at the
+ * capacity state of the first CPU and re-use it for all.
+ */
+ cpu = cpumask_first(&(fdom->span));
+ cs = find_cap_state(cpu, fdom_max_util);
+
+ /*
+ * The energy consumed by each CPU is derived from the power
+ * it dissipates at the expected OPP and its percentage of
+ * busy time.
+ */
+ for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
+ util = cpu_util_next(cpu, p, dst_cpu);
+ energy += cs->power * util / cs->cap;
+ }
+ }
+ return energy;
+}
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
--
2.11.0


2018-03-20 09:47:18

by Dietmar Eggemann

Subject: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up

From: Quentin Perret <[email protected]>

In case an energy model is available, waking tasks are re-routed into
a new energy-aware placement algorithm. The CPUs eligible for the
energy-aware wakeup path are restricted to the highest non-overutilized
sched_domain containing prev_cpu and this_cpu. If no such domain is
found, the task goes through the usual wake-up path, hence energy-aware
placement happens only in lightly utilized scenarios.

The selection of the most energy-efficient CPU for a task is achieved by
estimating the impact on system-level active energy resulting from the
placement of the task on each candidate CPU. The best CPU energy-wise is
then selected if it saves a large enough amount of energy with respect to
prev_cpu.

Although it has already shown significant benefits on some existing
targets, this brute-force approach clearly cannot scale to platforms
with numerous CPUs. This patch is an attempt to do something useful,
as writing a fast heuristic that performs reasonably well on a broad
spectrum of architectures isn't an easy task. As a consequence, the
scope of usability
of the energy-aware wake-up path is restricted to systems with the
SD_ASYM_CPUCAPACITY flag set. These systems not only show the most
promising opportunities for saving energy but also typically feature a
limited number of logical CPUs.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 71 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 76bd46502486..65a1bead0773 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
return energy;
}

+static bool task_fits(struct task_struct *p, int cpu)
+{
+ unsigned long next_util = cpu_util_next(cpu, p, cpu);
+
+ return util_fits_capacity(next_util, capacity_orig_of(cpu));
+}
+
+static int find_energy_efficient_cpu(struct sched_domain *sd,
+ struct task_struct *p, int prev_cpu)
+{
+ unsigned long cur_energy, prev_energy, best_energy;
+ int cpu, best_cpu = prev_cpu;
+
+ if (!task_util(p))
+ return prev_cpu;
+
+ /* Compute the energy impact of leaving the task on prev_cpu. */
+ prev_energy = best_energy = compute_energy(p, prev_cpu);
+
+ /* Look for the CPU that minimizes the energy. */
+ for_each_cpu_and(cpu, &p->cpus_allowed, sched_domain_span(sd)) {
+ if (!task_fits(p, cpu) || cpu == prev_cpu)
+ continue;
+ cur_energy = compute_energy(p, cpu);
+ if (cur_energy < best_energy) {
+ best_energy = cur_energy;
+ best_cpu = cpu;
+ }
+ }
+
+ /*
+ * We pick the best CPU only if it saves at least 1.5% of the
+ * energy used by prev_cpu.
+ */
+ if ((prev_energy - best_energy) > (prev_energy >> 6))
+ return best_cpu;
+
+ return prev_cpu;
+}
+
+static inline bool wake_energy(struct task_struct *p, int prev_cpu)
+{
+ struct sched_domain *sd;
+
+ if (!static_branch_unlikely(&sched_energy_present))
+ return false;
+
+ sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd);
+ if (!sd || sd_overutilized(sd))
+ return false;
+
+ return true;
+}
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6529,18 +6583,22 @@ static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
+ struct sched_domain *energy_sd = NULL;
int cpu = smp_processor_id();
int new_cpu = prev_cpu;
- int want_affine = 0;
+ int want_affine = 0, want_energy = 0;
int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);

+ rcu_read_lock();
+
if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
+ want_energy = wake_energy(p, prev_cpu);
want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
- && cpumask_test_cpu(cpu, &p->cpus_allowed);
+ && cpumask_test_cpu(cpu, &p->cpus_allowed)
+ && !want_energy;
}

- rcu_read_lock();
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
break;
@@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
break;
}

+ /*
+ * Energy-aware task placement is performed on the highest
+ * non-overutilized domain spanning over cpu and prev_cpu.
+ */
+ if (want_energy && !sd_overutilized(tmp) &&
+ cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
+ energy_sd = tmp;
+
if (tmp->flags & sd_flag)
sd = tmp;
else if (!want_affine)
@@ -6586,6 +6652,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
if (want_affine)
current->recent_used_cpu = cpu;
}
+ } else if (energy_sd) {
+ new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
} else {
new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
}
--
2.11.0


2018-03-20 09:47:42

by Dietmar Eggemann

Subject: [RFC PATCH 2/6] sched: Introduce energy models of CPUs

From: Quentin Perret <[email protected]>

The energy consumption of each CPU in the system is modeled with a list
of values representing its dissipated power and compute capacity at each
available Operating Performance Point (OPP). These values are derived
from existing information in the kernel (currently used by the thermal
subsystem) and don't require the introduction of new platform-specific
tunables. The energy model is also provided with a simple representation
of all frequency domains as cpumasks, hence enabling the scheduler to be
aware of dependencies between CPUs. The data required to build the energy
model is provided by the OPP library which enables an abstract view of
the platform from the scheduler. The new data structures holding these
models and the routines to populate them are stored in
kernel/sched/energy.c.

For the sake of simplicity, it is assumed in the energy model that all
CPUs in a frequency domain share the same micro-architecture. As long as
this assumption is correct, the energy models of different CPUs belonging
to the same frequency domain are equal. Hence, this commit builds only one
energy model per frequency domain, and links all relevant CPUs to it in
order to save time and memory. If needed for future hardware platforms,
relaxing this assumption should imply relatively simple modifications in
the code but a significantly higher algorithmic complexity.

As it appears that energy-aware scheduling really makes a difference on
heterogeneous systems (e.g. big.LITTLE platforms), it is restricted to
systems having:

1. SD_ASYM_CPUCAPACITY flag set
2. Dynamic Voltage and Frequency Scaling (DVFS) is enabled
3. Available power estimates for the OPPs of all possible CPUs

Moreover, the scheduler is notified of the energy model availability
using a static key in order to minimize the overhead on non-energy-aware
systems.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>

---
This patch depends on additional infrastructure being merged in the OPP
core. As this infrastructure can also be useful for other clients, the
related patches have been posted separately [1].

[1] https://marc.info/?l=linux-pm&m=151635516419249&w=2
---
include/linux/sched/energy.h | 31 +++++++
kernel/sched/Makefile | 2 +-
kernel/sched/energy.c | 190 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 222 insertions(+), 1 deletion(-)
create mode 100644 include/linux/sched/energy.h
create mode 100644 kernel/sched/energy.c

diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
new file mode 100644
index 000000000000..b4f43564ffe4
--- /dev/null
+++ b/include/linux/sched/energy.h
@@ -0,0 +1,31 @@
+#ifndef _LINUX_SCHED_ENERGY_H
+#define _LINUX_SCHED_ENERGY_H
+
+#ifdef CONFIG_SMP
+struct capacity_state {
+ unsigned long cap; /* compute capacity */
+ unsigned long power; /* power consumption at this compute capacity */
+};
+
+struct sched_energy_model {
+ int nr_cap_states;
+ struct capacity_state *cap_states;
+};
+
+struct freq_domain {
+ struct list_head next;
+ cpumask_t span;
+};
+
+extern struct sched_energy_model ** __percpu energy_model;
+extern struct static_key_false sched_energy_present;
+extern struct list_head freq_domains;
+#define for_each_freq_domain(fdom) \
+ list_for_each_entry(fdom, &freq_domains, next)
+
+void init_sched_energy(void);
+#else
+static inline void init_sched_energy(void) { }
+#endif
+
+#endif
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index d9a02b318108..912972ad4dbc 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
obj-y += idle.o fair.o rt.o deadline.o
obj-y += wait.o wait_bit.o swait.o completion.o

-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o energy.o
obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/energy.c b/kernel/sched/energy.c
new file mode 100644
index 000000000000..4662c993e096
--- /dev/null
+++ b/kernel/sched/energy.c
@@ -0,0 +1,190 @@
+/*
+ * Released under the GPLv2 only.
+ * SPDX-License-Identifier: GPL-2.0
+ *
+ * Energy-aware scheduling models
+ *
+ * Copyright (C) 2018, Arm Ltd.
+ * Written by: Quentin Perret, Arm Ltd.
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file "COPYING" in the main directory of this archive
+ * for more details.
+ */
+
+#define pr_fmt(fmt) "sched-energy: " fmt
+
+#include <linux/sched/topology.h>
+#include <linux/sched/energy.h>
+#include <linux/pm_opp.h>
+
+#include "sched.h"
+
+DEFINE_STATIC_KEY_FALSE(sched_energy_present);
+struct sched_energy_model ** __percpu energy_model;
+
+/*
+ * A copy of the cpumasks representing the frequency domains is kept private
+ * to the scheduler. They are stacked in a dynamically allocated linked list
+ * as we don't know how many frequency domains the system has.
+ */
+LIST_HEAD(freq_domains);
+
+#ifdef CONFIG_PM_OPP
+static struct sched_energy_model *build_energy_model(int cpu)
+{
+ unsigned long cap_scale = arch_scale_cpu_capacity(NULL, cpu);
+ unsigned long cap, freq, power, max_freq = ULONG_MAX;
+ unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
+ struct sched_energy_model *em = NULL;
+ struct device *cpu_dev;
+ struct dev_pm_opp *opp;
+ int opp_cnt, i;
+
+ cpu_dev = get_cpu_device(cpu);
+ if (!cpu_dev) {
+ pr_err("CPU%d: Failed to get device\n", cpu);
+ return NULL;
+ }
+
+ opp_cnt = dev_pm_opp_get_opp_count(cpu_dev);
+ if (opp_cnt <= 0) {
+ pr_err("CPU%d: Failed to get # of available OPPs.\n", cpu);
+ return NULL;
+ }
+
+ opp = dev_pm_opp_find_freq_floor(cpu_dev, &max_freq);
+ if (IS_ERR(opp)) {
+ pr_err("CPU%d: Failed to get max frequency.\n", cpu);
+ return NULL;
+ }
+
+ dev_pm_opp_put(opp);
+ if (!max_freq) {
+ pr_err("CPU%d: Found null max frequency.\n", cpu);
+ return NULL;
+ }
+
+ em = kzalloc(sizeof(*em), GFP_KERNEL);
+ if (!em)
+ return NULL;
+
+ em->cap_states = kcalloc(opp_cnt, sizeof(*em->cap_states), GFP_KERNEL);
+ if (!em->cap_states)
+ goto free_em;
+
+ for (i = 0, freq = 0; i < opp_cnt; i++, freq++) {
+ opp = dev_pm_opp_find_freq_ceil(cpu_dev, &freq);
+ if (IS_ERR(opp)) {
+ pr_err("CPU%d: Failed to get OPP %d.\n", cpu, i+1);
+ goto free_cs;
+ }
+
+ power = dev_pm_opp_get_power(opp);
+ dev_pm_opp_put(opp);
+ if (!power || !freq)
+ goto free_cs;
+
+ cap = freq * cap_scale / max_freq;
+ em->cap_states[i].power = power;
+ em->cap_states[i].cap = cap;
+
+ /*
+ * The capacity/watts efficiency ratio should decrease as the
+ * frequency grows on sane platforms. If not, warn the user
+ * that some high OPPs are more power efficient than some
+ * of the lower ones.
+ */
+ opp_eff = (cap << 20) / power;
+ if (opp_eff >= prev_opp_eff)
+ pr_warn("CPU%d: cap/pwr: OPP%d > OPP%d\n", cpu, i, i-1);
+ prev_opp_eff = opp_eff;
+ }
+
+ em->nr_cap_states = opp_cnt;
+ return em;
+
+free_cs:
+ kfree(em->cap_states);
+free_em:
+ kfree(em);
+ return NULL;
+}
+
+static void free_energy_model(void)
+{
+ struct sched_energy_model *em;
+ struct freq_domain *tmp, *pos;
+ int cpu;
+
+ list_for_each_entry_safe(pos, tmp, &freq_domains, next) {
+ cpu = cpumask_first(&(pos->span));
+ em = *per_cpu_ptr(energy_model, cpu);
+ if (em) {
+ kfree(em->cap_states);
+ kfree(em);
+ }
+
+ list_del(&(pos->next));
+ kfree(pos);
+ }
+
+ free_percpu(energy_model);
+}
+
+void init_sched_energy(void)
+{
+ struct freq_domain *fdom;
+ struct sched_energy_model *em;
+ struct device *cpu_dev;
+ int cpu, ret, fdom_cpu;
+
+ /* Energy Aware Scheduling is used for asymmetric systems only. */
+ if (!lowest_flag_domain(smp_processor_id(), SD_ASYM_CPUCAPACITY))
+ return;
+
+ energy_model = alloc_percpu(struct sched_energy_model *);
+ if (!energy_model)
+ goto exit_fail;
+
+ for_each_possible_cpu(cpu) {
+ if (*per_cpu_ptr(energy_model, cpu))
+ continue;
+
+ /* Keep a copy of the sharing_cpus mask */
+ fdom = kzalloc(sizeof(struct freq_domain), GFP_KERNEL);
+ if (!fdom)
+ goto free_em;
+
+ cpu_dev = get_cpu_device(cpu);
+ ret = dev_pm_opp_get_sharing_cpus(cpu_dev, &(fdom->span));
+ if (ret)
+ goto free_em;
+ list_add(&(fdom->next), &freq_domains);
+
+ /*
+ * Build the energy model of one CPU, and link it to all CPUs
+ * in its frequency domain. This should be correct as long as
+ * they share the same micro-architecture.
+ */
+ fdom_cpu = cpumask_first(&(fdom->span));
+ em = build_energy_model(fdom_cpu);
+ if (!em)
+ goto free_em;
+
+ for_each_cpu(fdom_cpu, &(fdom->span))
+ *per_cpu_ptr(energy_model, fdom_cpu) = em;
+ }
+
+ static_branch_enable(&sched_energy_present);
+
+ pr_info("Energy Aware Scheduling started.\n");
+ return;
+free_em:
+ free_energy_model();
+exit_fail:
+ pr_err("Energy Aware Scheduling initialization failed.\n");
+}
+#else
+void init_sched_energy(void) {}
+#endif
--
2.11.0


2018-03-20 09:47:59

by Dietmar Eggemann

Subject: [RFC PATCH 6/6] drivers: base: arch_topology.c: Enable EAS for arm/arm64 platforms

From: Quentin Perret <[email protected]>

Energy Aware Scheduling (EAS) has to be started from the arch code.
This commit enables it from the arch topology driver for arm/arm64
systems, hence enabling a better support for Arm big.LITTLE and future
DynamIQ architectures.

Cc: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
drivers/base/arch_topology.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 52ec5174bcb1..e2206ea16538 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -15,6 +15,7 @@
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/sched/topology.h>
+#include <linux/sched/energy.h>

DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;

@@ -204,6 +205,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
free_raw_capacity();
pr_debug("cpu_capacity: parsing done\n");
schedule_work(&parsing_done_work);
+ init_sched_energy();
}

return 0;
--
2.11.0


2018-03-20 09:48:09

by Dietmar Eggemann

Subject: [RFC PATCH 3/6] sched: Add over-utilization/tipping point indicator

From: Thara Gopinath <[email protected]>

Energy-aware scheduling should only operate when the system is not
overutilized. There must be cpu time available to place tasks based on
utilization in an energy-aware fashion, i.e. to pack tasks on
energy-efficient cpus without harming the overall throughput.

In case the system operates above this tipping point, tasks have to
be placed based on task and cpu load in the classical way of spreading
tasks across as many cpus as possible.

The point at which a system switches from being not overutilized to
being overutilized is called the tipping point.

Such a tipping point indicator, with a sched domain as the system
boundary, is introduced here. As soon as one cpu of a sched domain is
overutilized, the whole sched domain is declared overutilized as well.
A cpu becomes overutilized when its utilization is higher than 80%
(capacity_margin) of its capacity.

The implementation takes advantage of the shared sched domain data
structure, which is shared across all per-cpu views of a sched domain
level. The new overutilized flag is placed in this shared sched domain.

Load balancing is skipped in case the energy model is present and the
sched domain is not overutilized because under this condition the
predominantly load-per-capacity driven load-balancer should not
interfere with the energy-aware wakeup placement based on utilization.

In case the total utilization of a sched domain is greater than the
total sched domain capacity, the overutilized flag is set at the parent
sched domain level to let other sched groups help to get rid of the
overutilization of cpus.

Signed-off-by: Thara Gopinath <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
include/linux/sched/topology.h | 1 +
kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 12 +++-----
4 files changed, 67 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..dd001c232646 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -72,6 +72,7 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+ int overutilized;
};

struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf7b485ddf60..6c72a5e7b1b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5197,6 +5197,28 @@ static inline void hrtick_update(struct rq *rq)
}
#endif

+#ifdef CONFIG_SMP
+static inline int cpu_overutilized(int cpu);
+
+static inline int sd_overutilized(struct sched_domain *sd)
+{
+ return READ_ONCE(sd->shared->overutilized);
+}
+
+static inline void update_overutilized_status(struct rq *rq)
+{
+ struct sched_domain *sd;
+
+ rcu_read_lock();
+ sd = rcu_dereference(rq->sd);
+ if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
+ WRITE_ONCE(sd->shared->overutilized, 1);
+ rcu_read_unlock();
+}
+#else
+static inline void update_overutilized_status(struct rq *rq) {}
+#endif /* CONFIG_SMP */
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -5246,8 +5268,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_group(se);
}

- if (!se)
+ if (!se) {
add_nr_running(rq, 1);
+ update_overutilized_status(rq);
+ }

hrtick_update(rq);
}
@@ -6379,6 +6403,11 @@ static inline int util_fits_capacity(unsigned long util, unsigned long capacity)
return capacity * 1024 > util * capacity_margin;
}

+static inline int cpu_overutilized(int cpu)
+{
+ return !util_fits_capacity(cpu_util(cpu), capacity_of(cpu));
+}
+
/*
* Disable WAKE_AFFINE in the case where task @p doesn't fit in the
* capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
@@ -7617,6 +7646,7 @@ struct sd_lb_stats {
unsigned long total_running;
unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_capacity; /* Total capacity of all groups in sd */
+ unsigned long total_util; /* Total util of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */

struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
@@ -7637,6 +7667,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
.total_running = 0UL,
.total_load = 0UL,
.total_capacity = 0UL,
+ .total_util = 0UL,
.busiest_stat = {
.avg_load = 0UL,
.sum_nr_running = 0,
@@ -7933,11 +7964,12 @@ static bool update_nohz_stats(struct rq *rq, bool force)
* @local_group: Does group contain this_cpu.
* @sgs: variable to hold the statistics for this group.
* @overload: Indicate more than one runnable task for any CPU.
+ * @overutilized: Indicate overutilization for any CPU.
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
int local_group, struct sg_lb_stats *sgs,
- bool *overload)
+ bool *overload, int *overutilized)
{
unsigned long load;
int i, nr_running;
@@ -7974,6 +8006,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
*/
if (!nr_running && idle_cpu(i))
sgs->idle_cpus++;
+
+ if (cpu_overutilized(i))
+ *overutilized = 1;
}

/* Adjust by relative CPU capacity of the group */
@@ -8101,6 +8136,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sg_lb_stats tmp_sgs;
int load_idx, prefer_sibling = 0;
bool overload = false;
+ int overutilized = 0;

if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;
@@ -8127,7 +8163,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}

update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
- &overload);
+ &overload, &overutilized);

if (local_group)
goto next_group;
@@ -8159,6 +8195,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
sds->total_running += sgs->sum_nr_running;
sds->total_load += sgs->group_load;
sds->total_capacity += sgs->group_capacity;
+ sds->total_util += sgs->group_util;

sg = sg->next;
} while (sg != env->sd->groups);
@@ -8180,6 +8217,17 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
if (env->dst_rq->rd->overload != overload)
env->dst_rq->rd->overload = overload;
}
+
+ if (READ_ONCE(env->sd->shared->overutilized) != overutilized)
+ WRITE_ONCE(env->sd->shared->overutilized, overutilized);
+
+ /*
+ * If the domain util is greater than domain capacity, load balancing
+ * needs to be done at the next sched domain level as well.
+ */
+ if (env->sd->parent &&
+ !util_fits_capacity(sds->total_util, sds->total_capacity))
+ WRITE_ONCE(env->sd->parent->shared->overutilized, 1);
}

/**
@@ -9055,6 +9103,10 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
}
max_cost += sd->max_newidle_lb_cost;

+ if (static_branch_unlikely(&sched_energy_present) &&
+ !sd_overutilized(sd))
+ continue;
+
if (!(sd->flags & SD_LOAD_BALANCE))
continue;

@@ -9622,6 +9674,10 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
break;
}

+ if (static_branch_unlikely(&sched_energy_present) &&
+ !sd_overutilized(sd))
+ continue;
+
if (sd->flags & SD_BALANCE_NEWIDLE) {
t0 = sched_clock_cpu(this_cpu);

@@ -9755,6 +9811,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
+
+ update_overutilized_status(rq);
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 22909ffc04fb..55ec6a4c4712 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -11,6 +11,7 @@
#include <linux/sched/cputime.h>
#include <linux/sched/deadline.h>
#include <linux/sched/debug.h>
+#include <linux/sched/energy.h>
#include <linux/sched/hotplug.h>
#include <linux/sched/idle.h>
#include <linux/sched/init.h>
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 64cc564f5255..c8b7c7665ab2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1184,15 +1184,11 @@ sd_init(struct sched_domain_topology_level *tl,
sd->idle_idx = 1;
}

- /*
- * For all levels sharing cache; connect a sched_domain_shared
- * instance.
- */
- if (sd->flags & SD_SHARE_PKG_RESOURCES) {
- sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
- atomic_inc(&sd->shared->ref);
+ sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+ atomic_inc(&sd->shared->ref);
+
+ if (sd->flags & SD_SHARE_PKG_RESOURCES)
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
- }

sd->private = sdd;

--
2.11.0


2018-03-20 09:48:26

by Dietmar Eggemann

Subject: [RFC PATCH 1/6] sched/fair: Create util_fits_capacity()

The check whether a given utilization fits into a given capacity is
factored out into a separate function.

Currently it is only used in wake_cap() but it will be re-used to
figure out if a cpu or a scheduler group is over-utilized.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3582117e1580..bf7b485ddf60 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6374,6 +6374,11 @@ static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
return (util >= capacity) ? capacity : util;
}

+static inline int util_fits_capacity(unsigned long util, unsigned long capacity)
+{
+ return capacity * 1024 > util * capacity_margin;
+}
+
/*
* Disable WAKE_AFFINE in the case where task @p doesn't fit in the
* capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
@@ -6395,7 +6400,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
/* Bring task utilization in sync with prev_cpu */
sync_entity_load_avg(&p->se);

- return min_cap * 1024 < task_util(p) * capacity_margin;
+ return !util_fits_capacity(task_util(p), min_cap);
}

/*
--
2.11.0


2018-03-20 09:51:07

by Greg Kroah-Hartman

Subject: Re: [RFC PATCH 6/6] drivers: base: arch_topology.c: Enable EAS for arm/arm64 platforms

On Tue, Mar 20, 2018 at 09:43:12AM +0000, Dietmar Eggemann wrote:
> From: Quentin Perret <[email protected]>
>
> Energy Aware Scheduling (EAS) has to be started from the arch code.

Ok, but:

> This commit enables it from the arch topology driver for arm/arm64
> systems, hence enabling a better support for Arm big.LITTLE and future
> DynamIQ architectures.

Why does this have to be in the driver core code just for those specific
types of cpus?



>
> Cc: Greg Kroah-Hartman <[email protected]>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> drivers/base/arch_topology.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 52ec5174bcb1..e2206ea16538 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -15,6 +15,7 @@
> #include <linux/slab.h>
> #include <linux/string.h>
> #include <linux/sched/topology.h>
> +#include <linux/sched/energy.h>
>
> DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
>
> @@ -204,6 +205,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
> free_raw_capacity();
> pr_debug("cpu_capacity: parsing done\n");
> schedule_work(&parsing_done_work);
> + init_sched_energy();

This is not arch-specific code only for arm.

Don't you have ARM cpu bringup code somewhere? Shouldn't this call be
in there? It feels odd that this scheduler change is buried way down
here...

thanks,

greg k-h

2018-03-20 09:53:59

by Greg Kroah-Hartman

Subject: Re: [RFC PATCH 2/6] sched: Introduce energy models of CPUs

On Tue, Mar 20, 2018 at 09:43:08AM +0000, Dietmar Eggemann wrote:
> From: Quentin Perret <[email protected]>
>
> The energy consumption of each CPU in the system is modeled with a list
> of values representing its dissipated power and compute capacity at each
> available Operating Performance Point (OPP). These values are derived
> from existing information in the kernel (currently used by the thermal
> subsystem) and don't require the introduction of new platform-specific
> tunables. The energy model is also provided with a simple representation
> of all frequency domains as cpumasks, hence enabling the scheduler to be
> aware of dependencies between CPUs. The data required to build the energy
> model is provided by the OPP library which enables an abstract view of
> the platform from the scheduler. The new data structures holding these
> models and the routines to populate them are stored in
> kernel/sched/energy.c.
>
> For the sake of simplicity, it is assumed in the energy model that all
> CPUs in a frequency domain share the same micro-architecture. As long as
> this assumption is correct, the energy models of different CPUs belonging
> to the same frequency domain are equal. Hence, this commit builds only one
> energy model per frequency domain, and links all relevant CPUs to it in
> order to save time and memory. If needed for future hardware platforms,
> relaxing this assumption should imply relatively simple modifications in
> the code but a significantly higher algorithmic complexity.
>
> As it appears that energy-aware scheduling really makes a difference on
> heterogeneous systems (e.g. big.LITTLE platforms), it is restricted to
> systems having:
>
> 1. SD_ASYM_CPUCAPACITY flag set
> 2. Dynamic Voltage and Frequency Scaling (DVFS) is enabled
> 3. Available power estimates for the OPPs of all possible CPUs
>
> Moreover, the scheduler is notified of the energy model availability
> using a static key in order to minimize the overhead on non-energy-aware
> systems.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Quentin Perret <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
>
> ---
> This patch depends on additional infrastructure being merged in the OPP
> core. As this infrastructure can also be useful for other clients, the
> related patches have been posted separately [1].
>
> [1] https://marc.info/?l=linux-pm&m=151635516419249&w=2
> ---
> include/linux/sched/energy.h | 31 +++++++
> kernel/sched/Makefile | 2 +-
> kernel/sched/energy.c | 190 +++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 222 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/sched/energy.h
> create mode 100644 kernel/sched/energy.c
>
> diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
> new file mode 100644
> index 000000000000..b4f43564ffe4
> --- /dev/null
> +++ b/include/linux/sched/energy.h
> @@ -0,0 +1,31 @@
> +#ifndef _LINUX_SCHED_ENERGY_H
> +#define _LINUX_SCHED_ENERGY_H

No copyright or license info? Not good :(

> --- /dev/null
> +++ b/kernel/sched/energy.c
> @@ -0,0 +1,190 @@
> +/*
> + * Released under the GPLv2 only.
> + * SPDX-License-Identifier: GPL-2.0

Please read the documentation for the SPDX lines on how to do them
correctly. Newer versions of checkpatch.pl will catch this, but that is
in linux-next for the moment.

And once you have the SPDX line, the "Released under..." line is not
needed.


> + *
> + * Energy-aware scheduling models
> + *
> + * Copyright (C) 2018, Arm Ltd.
> + * Written by: Quentin Perret, Arm Ltd.
> + *
> + * This file is subject to the terms and conditions of the GNU General Public
> + * License. See the file "COPYING" in the main directory of this archive
> + * for more details.

This paragraph is not needed at all.

> + */
> +
> +#define pr_fmt(fmt) "sched-energy: " fmt
> +
> +#include <linux/sched/topology.h>
> +#include <linux/sched/energy.h>
> +#include <linux/pm_opp.h>
> +
> +#include "sched.h"
> +
> +DEFINE_STATIC_KEY_FALSE(sched_energy_present);
> +struct sched_energy_model ** __percpu energy_model;
> +
> +/*
> + * A copy of the cpumasks representing the frequency domains is kept private
> + * to the scheduler. They are stacked in a dynamically allocated linked list
> + * as we don't know how many frequency domains the system has.
> + */
> +LIST_HEAD(freq_domains);

global variable? If so, please prefix it with something more unique
than "freq_".

> +#ifdef CONFIG_PM_OPP

#ifdefs go in .h files, not .c files, right?

thanks,

greg k-h

2018-03-20 15:22:37

by Dietmar Eggemann

Subject: Re: [RFC PATCH 6/6] drivers: base: arch_topology.c: Enable EAS for arm/arm64 platforms

On 03/20/2018 10:49 AM, Greg Kroah-Hartman wrote:
> On Tue, Mar 20, 2018 at 09:43:12AM +0000, Dietmar Eggemann wrote:
>> From: Quentin Perret <[email protected]>
>>
>> Energy Aware Scheduling (EAS) has to be started from the arch code.
>
> Ok, but:
>
>> This commit enables it from the arch topology driver for arm/arm64
>> systems, hence enabling a better support for Arm big.LITTLE and future
>> DynamIQ architectures.
>
> Why does this have to be in the driver core code just for those specific
> types of cpus?

The arch_topology driver is functionality shared between the arm and
arm64 arches. So far it handles the correct setting of the per-cpu
cpu_scale values, which are then used by the task scheduler to set the
correct cpu capacity values. This provides cpu-invariant load tracking
for arm and arm64 big.LITTLE systems.

The driver was initially created so that we didn't have to duplicate
the same code in the arm and arm64 arches.

big.LITTLE platforms have to set appropriate cpu node
capacity-dmips-mhz properties (e.g. arch/arm64/boot/dts/arm/juno.dts)
to enable the functionality of the driver. The cpu_scale value depends
on the capacity-dmips-mhz property as well as on the max frequency of
a logical cpu (hence the dependency on cpufreq).

A corner case would be a big.LITTLE platform purely based on max
frequency differences (e.g. Google Pixel, Qualcomm MSM8996 Snapdragon
821, quad-core (2x2.15 GHz Kryo & 2x1.6 GHz Kryo)). For such a system,
capacity-dmips-mhz should be set to the same value for all logical
cpus (e.g. 1024).

We would like to use the same code (shared between arm and arm64) to
initialize the energy model and start EAS for arm and arm64 systems.

>> Cc: Greg Kroah-Hartman <[email protected]>
>> Signed-off-by: Quentin Perret <[email protected]>
>> ---
>> drivers/base/arch_topology.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
>> index 52ec5174bcb1..e2206ea16538 100644
>> --- a/drivers/base/arch_topology.c
>> +++ b/drivers/base/arch_topology.c
>> @@ -15,6 +15,7 @@
>> #include <linux/slab.h>
>> #include <linux/string.h>
>> #include <linux/sched/topology.h>
>> +#include <linux/sched/energy.h>
>>
>> DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
>>
>> @@ -204,6 +205,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
>> free_raw_capacity();
>> pr_debug("cpu_capacity: parsing done\n");
>> schedule_work(&parsing_done_work);
>> + init_sched_energy();
>
> This is not arch-specific code only for arm.

The driver requires CONFIG_GENERIC_ARCH_TOPOLOGY which is currently only
set in arch/arm/Kconfig and arch/arm64/Kconfig.

> Don't you have a ARM cpu bringup code somewhere? Shouldn't this call be
> in there? It feels odd that this scheduler change is buried way down
> here...

The big benefit is that we don't have to duplicate code for arm and
arm64. In its current form, EAS should only be started once cpufreq is
initialized for all possible cpus, and this way we don't have to signal
back from the driver to the arch that this initialization is done.

I agree that none of this is obvious so the patch should explain the
requirements and design better.

-- Dietmar

[...]

2018-03-21 00:46:41

by Quentin Perret

Subject: Re: [RFC PATCH 2/6] sched: Introduce energy models of CPUs

On Tuesday 20 Mar 2018 at 10:52:15 (+0100), Greg Kroah-Hartman wrote:
> On Tue, Mar 20, 2018 at 09:43:08AM +0000, Dietmar Eggemann wrote:
> > From: Quentin Perret <[email protected]>
> >
> > The energy consumption of each CPU in the system is modeled with a list
> > of values representing its dissipated power and compute capacity at each
> > available Operating Performance Point (OPP). These values are derived
> > from existing information in the kernel (currently used by the thermal
> > subsystem) and don't require the introduction of new platform-specific
> > tunables. The energy model is also provided with a simple representation
> > of all frequency domains as cpumasks, hence enabling the scheduler to be
> > aware of dependencies between CPUs. The data required to build the energy
> > model is provided by the OPP library which enables an abstract view of
> > the platform from the scheduler. The new data structures holding these
> > models and the routines to populate them are stored in
> > kernel/sched/energy.c.
> >
> > For the sake of simplicity, it is assumed in the energy model that all
> > CPUs in a frequency domain share the same micro-architecture. As long as
> > this assumption is correct, the energy models of different CPUs belonging
> > to the same frequency domain are equal. Hence, this commit builds only one
> > energy model per frequency domain, and links all relevant CPUs to it in
> > order to save time and memory. If needed for future hardware platforms,
> > relaxing this assumption should imply relatively simple modifications in
> > the code but a significantly higher algorithmic complexity.
> >
> > As it appears that energy-aware scheduling really makes a difference on
> > heterogeneous systems (e.g. big.LITTLE platforms), it is restricted to
> > systems having:
> >
> > 1. SD_ASYM_CPUCAPACITY flag set
> > 2. Dynamic Voltage and Frequency Scaling (DVFS) is enabled
> > 3. Available power estimates for the OPPs of all possible CPUs
> >
> > Moreover, the scheduler is notified of the energy model availability
> > using a static key in order to minimize the overhead on non-energy-aware
> > systems.
> >
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Signed-off-by: Quentin Perret <[email protected]>
> > Signed-off-by: Dietmar Eggemann <[email protected]>
> >
> > ---
> > This patch depends on additional infrastructure being merged in the OPP
> > core. As this infrastructure can also be useful for other clients, the
> > related patches have been posted separately [1].
> >
> > [1] https://marc.info/?l=linux-pm&m=151635516419249&w=2
> > ---
> > include/linux/sched/energy.h | 31 +++++++
> > kernel/sched/Makefile | 2 +-
> > kernel/sched/energy.c | 190 +++++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 222 insertions(+), 1 deletion(-)
> > create mode 100644 include/linux/sched/energy.h
> > create mode 100644 kernel/sched/energy.c
> >
> > diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
> > new file mode 100644
> > index 000000000000..b4f43564ffe4
> > --- /dev/null
> > +++ b/include/linux/sched/energy.h
> > @@ -0,0 +1,31 @@
> > +#ifndef _LINUX_SCHED_ENERGY_H
> > +#define _LINUX_SCHED_ENERGY_H
>
> No copyright or license info? Not good :(
>
> > --- /dev/null
> > +++ b/kernel/sched/energy.c
> > @@ -0,0 +1,190 @@
> > +/*
> > + * Released under the GPLv2 only.
> > + * SPDX-License-Identifier: GPL-2.0
>
> Please read the documentation for the SPDX lines on how to do them
> correctly. Newer versions of checkpatch.pl will catch this, but that is
> in linux-next for the moment.
>
> And once you have the SPDX line, the "Released under..." line is not
> needed.
>
>
> > + *
> > + * Energy-aware scheduling models
> > + *
> > + * Copyright (C) 2018, Arm Ltd.
> > + * Written by: Quentin Perret, Arm Ltd.
> > + *
> > + * This file is subject to the terms and conditions of the GNU General Public
> > + * License. See the file "COPYING" in the main directory of this archive
> > + * for more details.
>
> This paragraph is not needed at all.

Right, I will fix all the licence issues and add one to the new header
file. I took example on existing files a while ago when I first wrote
the patches and forgot to update them later on. Sorry about that.

>
> > + */
> > +
> > +#define pr_fmt(fmt) "sched-energy: " fmt
> > +
> > +#include <linux/sched/topology.h>
> > +#include <linux/sched/energy.h>
> > +#include <linux/pm_opp.h>
> > +
> > +#include "sched.h"
> > +
> > +DEFINE_STATIC_KEY_FALSE(sched_energy_present);
> > +struct sched_energy_model ** __percpu energy_model;
> > +
> > +/*
> > + * A copy of the cpumasks representing the frequency domains is kept private
> > + * to the scheduler. They are stacked in a dynamically allocated linked list
> > + * as we don't know how many frequency domains the system has.
> > + */
> > +LIST_HEAD(freq_domains);
>
> global variable? If so, please prefix it with something more unique
> than "freq_".

Will do.

>
> > +#ifdef CONFIG_PM_OPP
>
> #ifdefs go in .h files, not .c files, right?

Yes good point. Actually, I might be able to tweak only kernel/sched/Makefile
to ensure we have CONFIG_PM_OPP. I will look into it.

>
> thanks,
>
> greg k-h

Thanks,
Quentin

2018-03-21 09:06:18

by Juri Lelli

Subject: Re: [RFC PATCH 4/6] sched/fair: Introduce an energy estimation helper function

Hi,

On 20/03/18 09:43, Dietmar Eggemann wrote:
> From: Quentin Perret <[email protected]>
>
> In preparation for the definition of an energy-aware wakeup path, a
> helper function is provided to estimate the consequence on system energy
> when a specific task wakes-up on a specific CPU. compute_energy()
> estimates the OPPs to be reached by all frequency domains and estimates
> the consumption of each online CPU according to its energy model and its
> percentage of busy time.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Quentin Perret <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 81 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6c72a5e7b1b0..76bd46502486 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu)
> }
>
> /*
> + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> + */
> +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> +{
> + unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;

What about other classes? Shouldn't we now also take into account
DEADLINE (as schedutil does)?

BTW, we now also have a getter method in sched/sched.h; it takes
UTIL_EST into account, though. Do we need to take that into account when
estimating energy consumption?

> + unsigned long capacity = capacity_orig_of(cpu);
> +
> + /*
> + * If p is where it should be, or if it has no impact on cpu, there is
> + * not much to do.
> + */
> + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> + goto clamp_util;
> +
> + if (dst_cpu == cpu)
> + util += task_util(p);
> + else
> + util = max_t(long, util - task_util(p), 0);
> +
> +clamp_util:
> + return (util >= capacity) ? capacity : util;
> +}
> +
> +/*
> * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
> * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
> *
> @@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> return !util_fits_capacity(task_util(p), min_cap);
> }
>
> +static struct capacity_state *find_cap_state(int cpu, unsigned long util)
> +{
> + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> + struct capacity_state *cs = NULL;
> + int i;
> +
> + /*
> + * As the goal is to estimate the OPP reached for a specific util
> + * value, mimic the behaviour of schedutil with a 1.25 coefficient
> + */
> + util += util >> 2;

What about other governors (ondemand for example). Is this supposed to
work only when schedutil is in use (if so we should probably make it
conditional on that)?

Also, even when schedutil is in use, shouldn't we ask it for a util
"computation" instead of replicating its _current_ heuristic? I fear
the two might diverge in the future.

Best,

- Juri

2018-03-21 12:27:54

by Patrick Bellasi

Subject: Re: [RFC PATCH 4/6] sched/fair: Introduce an energy estimation helper function

On 21-Mar 10:04, Juri Lelli wrote:
> Hi,
>
> On 20/03/18 09:43, Dietmar Eggemann wrote:
> > From: Quentin Perret <[email protected]>
> >
> > In preparation for the definition of an energy-aware wakeup path, a
> > helper function is provided to estimate the consequence on system energy
> > when a specific task wakes-up on a specific CPU. compute_energy()
> > estimates the OPPs to be reached by all frequency domains and estimates
> > the consumption of each online CPU according to its energy model and its
> > percentage of busy time.
> >
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Signed-off-by: Quentin Perret <[email protected]>
> > Signed-off-by: Dietmar Eggemann <[email protected]>
> > ---
> > kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 81 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6c72a5e7b1b0..76bd46502486 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu)
> > }
> >
> > /*
> > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> > + */
> > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> > +{
> > + unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
>
> What about other classes? Shouldn't we now also take into account
> DEADLINE (as schedutil does)?

Good point, although that would likely require factoring the
utilization aggregation function out of schedutil, wouldn't it?

> BTW, we now also have a getter method in sched/sched.h; it takes
> UTIL_EST into account, though. Do we need to take that into account when
> estimating energy consumption?

Actually I think that this whole function can be written "just" as:

---8<---
unsigned long util = cpu_util_wake(cpu);

if (cpu != dst_cpu)
	return util;

return min(util + task_util(p), capacity_orig_of(cpu));
---8<---

which will reuse existing functions as well as getting for free other
stuff (like the CPU util_est).

Considering your observation above, it also makes it easy to add the
DEADLINE and RT utilization into util, just before returning the
value.
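
For example (just a sketch on top of the snippet above, with
cpu_util_dl() being the DL getter from kernel/sched/sched.h):

---8<---
unsigned long util = cpu_util_wake(cpu);

/* DL contribution, the same signal schedutil looks at */
util += cpu_util_dl(cpu_rq(cpu));

if (cpu == dst_cpu)
	util += task_util(p);

return min(util, capacity_orig_of(cpu));
---8<---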

> > + unsigned long capacity = capacity_orig_of(cpu);
> > +
> > + /*
> > + * If p is where it should be, or if it has no impact on cpu, there is
> > + * not much to do.
> > + */
> > + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> > + goto clamp_util;
> > +
> > + if (dst_cpu == cpu)
> > + util += task_util(p);
> > + else
> > + util = max_t(long, util - task_util(p), 0);
> > +
> > +clamp_util:
> > + return (util >= capacity) ? capacity : util;
> > +}
> > +
> > +/*
> > * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
> > * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
> > *
> > @@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > return !util_fits_capacity(task_util(p), min_cap);
> > }
> >
> > +static struct capacity_state *find_cap_state(int cpu, unsigned long util)
> > +{
> > + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> > + struct capacity_state *cs = NULL;
> > + int i;
> > +
> > + /*
> > + * As the goal is to estimate the OPP reached for a specific util
> > + * value, mimic the behaviour of schedutil with a 1.25 coefficient
> > + */
> > + util += util >> 2;
>
> What about other governors (ondemand for example). Is this supposed to
> work only when schedutil is in use (if so we should probably make it
> conditional on that)?

Yes, I would say that EAS mostly makes sense when you have a "minimum"
control on OPPs... otherwise all the energy estimations are really
fuzzy.

> Also, even when schedutil is in use, shouldn't we ask it for a util
> "computation" instead of replicating its _current_ heuristic?

Are you proposing to have the 1.25 factor only here and remove it from
schedutil?

> I fear the two might diverge in the future.

That could be avoided by factoring the "compensation" factor out of
schedutil into a proper function to be used by all the interested
players, couldn't it?

--
#include <best/regards.h>

Patrick Bellasi

2018-03-21 12:40:37

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/fair: Introduce an energy estimation helper function

On 20-Mar 09:43, Dietmar Eggemann wrote:
> From: Quentin Perret <[email protected]>

[...]

> +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> +{
> + unsigned long util, fdom_max_util;
> + struct capacity_state *cs;
> + unsigned long energy = 0;
> + struct freq_domain *fdom;
> + int cpu;
> +
> + for_each_freq_domain(fdom) {
> + fdom_max_util = 0;
> + for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
> + util = cpu_util_next(cpu, p, dst_cpu);

Would be nice to find a way to cache all these util and reuse them
below... even just to ensure data consistency between the "cs"
computation and its usage...

> + fdom_max_util = max(util, fdom_max_util);
> + }
> +
> + /*
> + * Here we assume that the capacity states of CPUs belonging to
> + * the same frequency domains are shared. Hence, we look at the
> + * capacity state of the first CPU and re-use it for all.
> + */
> + cpu = cpumask_first(&(fdom->span));
> + cs = find_cap_state(cpu, fdom_max_util);
^^^^

The above code could theoretically return NULL, although likely EAS is
completely disabled if em->nb_cap_states == 0, right?

If that's the case then, in the previous function, you can certainly
avoid the initialization of *cs and maybe also add an explicit:

BUG_ON(em->nb_cap_states == 0);

which helps even just as "in code documentation".

But, I'm not sure if maintainers like BUG_ON in scheduler code :)


> +
> + /*
> + * The energy consumed by each CPU is derived from the power
^

Should we make more explicit that this is just the "active" energy
consumed?

> + * it dissipates at the expected OPP and its percentage of
> + * busy time.
> + */
> + for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
> + util = cpu_util_next(cpu, p, dst_cpu);
> + energy += cs->power * util / cs->cap;
> + }
> + }

nit-pick: empty line before return?

> + return energy;
> +}
> +
> /*
> * select_task_rq_fair: Select target runqueue for the waking task in domains
> * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> --
> 2.11.0
>

--
#include <best/regards.h>

Patrick Bellasi

2018-03-21 13:02:06

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/fair: Introduce an energy estimation helper function

On 21/03/18 12:26, Patrick Bellasi wrote:
> On 21-Mar 10:04, Juri Lelli wrote:
> > Hi,
> >
> > On 20/03/18 09:43, Dietmar Eggemann wrote:
> > > From: Quentin Perret <[email protected]>
> > >
> > > In preparation for the definition of an energy-aware wakeup path, a
> > > helper function is provided to estimate the consequence on system energy
> > > when a specific task wakes-up on a specific CPU. compute_energy()
> > > estimates the OPPs to be reached by all frequency domains and estimates
> > > the consumption of each online CPU according to its energy model and its
> > > percentage of busy time.
> > >
> > > Cc: Ingo Molnar <[email protected]>
> > > Cc: Peter Zijlstra <[email protected]>
> > > Signed-off-by: Quentin Perret <[email protected]>
> > > Signed-off-by: Dietmar Eggemann <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 81 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 6c72a5e7b1b0..76bd46502486 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu)
> > > }
> > >
> > > /*
> > > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> > > + */
> > > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> > > +{
> > > + unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
> >
> > What about other classes? Shouldn't we now also take into account
> > DEADLINE (as schedutil does)?
>
> Good point, although that would likely require factoring the
> utilization aggregation function out of schedutil, wouldn't it?

Maybe, or simply use getter methods and aggregate again here.

>
> > BTW, we now also have a getter method in sched/sched.h; it takes
> > UTIL_EST into account, though. Do we need to take that into account when
> > estimating energy consumption?
>
> Actually I think that this whole function can be written "just" as:
>
> ---8<---
> unsigned long util = cpu_util_wake(cpu);
>
> if (cpu != dst_cpu)
> 	return util;
>
> return min(util + task_util(p), capacity_orig_of(cpu));
> ---8<---
>
> which will reuse existing functions as well as getting for free other
> stuff (like the CPU util_est).
>
> Considering your observation above, it also makes it easy to add the
> DEADLINE and RT utilization into util, just before returning the
> value.

Well, for RT we should probably consider the fact that schedutil is
going to select max OPP...

Apart from that I guess it could work like you said.

>
> > > + unsigned long capacity = capacity_orig_of(cpu);
> > > +
> > > + /*
> > > + * If p is where it should be, or if it has no impact on cpu, there is
> > > + * not much to do.
> > > + */
> > > + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> > > + goto clamp_util;
> > > +
> > > + if (dst_cpu == cpu)
> > > + util += task_util(p);
> > > + else
> > > + util = max_t(long, util - task_util(p), 0);
> > > +
> > > +clamp_util:
> > > + return (util >= capacity) ? capacity : util;
> > > +}
> > > +
> > > +/*
> > > * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
> > > * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
> > > *
> > > @@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > > return !util_fits_capacity(task_util(p), min_cap);
> > > }
> > >
> > > +static struct capacity_state *find_cap_state(int cpu, unsigned long util)
> > > +{
> > > + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> > > + struct capacity_state *cs = NULL;
> > > + int i;
> > > +
> > > + /*
> > > + * As the goal is to estimate the OPP reached for a specific util
> > > + * value, mimic the behaviour of schedutil with a 1.25 coefficient
> > > + */
> > > + util += util >> 2;
> >
> > What about other governors (ondemand for example). Is this supposed to
> > work only when schedutil is in use (if so we should probably make it
> > conditional on that)?
>
> Yes, I would say that EAS mostly makes sense when you have a "minimum"
> control on OPPs... otherwise all the energy estimations are really
> fuzzy.

Makes sense to me. Shouldn't we then make all this conditional on using
schedutil?

>
> > Also, even when schedutil is in use, shouldn't we ask it for a util
> > "computation" instead of replicating its _current_ heuristic?
>
> Are you proposing to have the 1.25 factor only here and remove it from
> schedutil?

I'm only saying that we probably shouldn't have two places where we add
this 1.25 factor to utilization before using it, as in the future one of
the two might modify that 1.25 to something else and then we'll have
problems. So, maybe a common wrapper that adds such a factor?
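
For instance, something like the following (sketch only, the name is
made up), to be used both by schedutil and by find_cap_state():

---8<---
/* Map a utilization value to the level targeted for OPP selection. */
static inline unsigned long map_util_to_opp(unsigned long util)
{
	/* util * 1.25, schedutil's current margin */
	return util + (util >> 2);
}
---8<---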

>
> > I fear the two might diverge in the future.
>
> > That could be avoided by factoring the "compensation" factor out of
> > schedutil into a proper function to be used by all the interested
> > players, couldn't it?

And I should have read till the end before writing the above paragraph
it seems. :)

Thanks,

- Juri

2018-03-21 13:57:42

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/fair: Introduce an energy estimation helper function

On Wednesday 21 Mar 2018 at 13:59:25 (+0100), Juri Lelli wrote:
> On 21/03/18 12:26, Patrick Bellasi wrote:
> > On 21-Mar 10:04, Juri Lelli wrote:
> > > Hi,
> > >
> > > On 20/03/18 09:43, Dietmar Eggemann wrote:
> > > > From: Quentin Perret <[email protected]>
> > > >
> > > > In preparation for the definition of an energy-aware wakeup path, a
> > > > helper function is provided to estimate the consequence on system energy
> > > > when a specific task wakes-up on a specific CPU. compute_energy()
> > > > estimates the OPPs to be reached by all frequency domains and estimates
> > > > the consumption of each online CPU according to its energy model and its
> > > > percentage of busy time.
> > > >
> > > > Cc: Ingo Molnar <[email protected]>
> > > > Cc: Peter Zijlstra <[email protected]>
> > > > Signed-off-by: Quentin Perret <[email protected]>
> > > > Signed-off-by: Dietmar Eggemann <[email protected]>
> > > > ---
> > > > kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 81 insertions(+)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 6c72a5e7b1b0..76bd46502486 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu)
> > > > }
> > > >
> > > > /*
> > > > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> > > > + */
> > > > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> > > > +{
> > > > + unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
> > >
> > > What about other classes? Shouldn't we now also take into account
> > > DEADLINE (as schedutil does)?
> >
> > Good point, although that would likely require factoring the
> > utilization aggregation function out of schedutil, wouldn't it?
>
> Maybe, or simply use getter methods and aggregate again here.

I agree with you both, taking DL into account here is most likely the
right thing to do. Thinking about this, there are other places in those
patches where we should really use capacity_of() instead of
capacity_orig_of() (in task_fits() in patch 5/6 for ex) in order to
avoid CPUs under heavy RT pressure. I'll try to improve the integration
with other scheduling classes in v2.

>
> >
> > > BTW, we now also have a getter method in sched/sched.h; it takes
> > > UTIL_EST into account, though. Do we need to take that into account when
> > > estimating energy consumption?

Yes, I think that using UTIL_EST makes a lot of sense for energy
calculation. This is what is used for frequency selection (with
schedutil obviously) and this is also our best guess on how much time
a task will spend on a CPU. V2 will be rebased on the latest
tip/sched/core and I'll make sure to integrate things better with
util_est.

> >
> > Actually I think that this whole function can be written "just" as:
> >
> > ---8<---
> > unsigned long util = cpu_util_wake(cpu);
> >
> > if (cpu != dst_cpu)
> > 	return util;
> >
> > return min(util + task_util(p), capacity_orig_of(cpu));
> > ---8<---
> >
> > which will reuse existing functions as well as getting for free other
> > stuff (like the CPU util_est).
> >
> > Considering your observation above, it also makes it easy to add the
> > DEADLINE and RT utilization into util, just before returning the
> > value.
>
> Well, for RT we should probably consider the fact that schedutil is
> going to select max OPP...

Right, but I need to think about the right place to put that, and how to
compute the energy accurately in this case. Some modification might also
be required in find_cap_state() (patch 5/6).

>
> Apart from that I guess it could work like you said.
>
> >
> > > > + unsigned long capacity = capacity_orig_of(cpu);
> > > > +
> > > > + /*
> > > > + * If p is where it should be, or if it has no impact on cpu, there is
> > > > + * not much to do.
> > > > + */
> > > > + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> > > > + goto clamp_util;
> > > > +
> > > > + if (dst_cpu == cpu)
> > > > + util += task_util(p);
> > > > + else
> > > > + util = max_t(long, util - task_util(p), 0);
> > > > +
> > > > +clamp_util:
> > > > + return (util >= capacity) ? capacity : util;
> > > > +}
> > > > +
> > > > +/*
> > > > * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
> > > > * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
> > > > *
> > > > @@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > > > return !util_fits_capacity(task_util(p), min_cap);
> > > > }
> > > >
> > > > +static struct capacity_state *find_cap_state(int cpu, unsigned long util)
> > > > +{
> > > > + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> > > > + struct capacity_state *cs = NULL;
> > > > + int i;
> > > > +
> > > > + /*
> > > > + * As the goal is to estimate the OPP reached for a specific util
> > > > + * value, mimic the behaviour of schedutil with a 1.25 coefficient
> > > > + */
> > > > + util += util >> 2;
> > >
> > > What about other governors (ondemand for example). Is this supposed to
> > > work only when schedutil is in use (if so we should probably make it
> > > conditional on that)?
> >
> > Yes, I would say that EAS mostly makes sense when you have a "minimum"
> > control on OPPs... otherwise all the energy estimations are really
> > fuzzy.
>
> Makes sense to me. Shouldn't we then make all this conditional on
> schedutil?

So, in theory, EAS could make sense even for other governors than
schedutil. Even with the performance governor it is probably more
energy efficient (although users using "performance" probably don't care
about energy, but that's just an example) to place small tasks onto little
CPUs up to a certain point given by the energy model. The ideal solution
would be to change the behaviour of find_cap_state() depending on the
governor being used, but I don't know if this extra complexity is worth
it really.
I'm happy to make all this conditional on schedutil as a first step and
we can see later if that makes sense to extend EAS to other use-cases.

>
> >
> > > Also, even when schedutil is in use, shouldn't we ask it for a util
> > > "computation" instead of replicating its _current_ heuristic?
> >
> > Are you proposing to have the 1.25 factor only here and remove it from
> > schedutil?
>
> I'm only saying that we probably shouldn't have two places where we add
> this 1.25 factor to utilization before using it, as in the future one of
> the two might modify that 1.25 to something else and then we'll have
> problems. So, maybe a common wrapper that adds such a factor?

Ok, I can definitely factorize this code between schedutil and EAS. And
BTW, would it make sense to make schedutil use "capacity_margin" instead
of an arbitrary value ? The semantics feels pretty close. Out of curiosity,
what was the reason to use C=1.25 in the first place ?

>
> >
> > > I fear the two might diverge in the future.
> >
> > That could be avoided by factoring the "compensation" factor out of
> > schedutil into a proper function to be used by all the interested
> > players, couldn't it?
>
> And I should have read till the end before writing the above paragraph
> it seems. :)
>
> Thanks,
>
> - Juri

Thank you very much for the feedback !

2018-03-21 14:06:55

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/fair: Introduce an energy estimation helper function

On Wednesday 21 Mar 2018 at 12:26:21 (+0000), Patrick Bellasi wrote:
> On 21-Mar 10:04, Juri Lelli wrote:
> > Hi,
> >
> > On 20/03/18 09:43, Dietmar Eggemann wrote:
> > > From: Quentin Perret <[email protected]>
> > >
> > > In preparation for the definition of an energy-aware wakeup path, a
> > > helper function is provided to estimate the consequence on system energy
> > > when a specific task wakes-up on a specific CPU. compute_energy()
> > > estimates the OPPs to be reached by all frequency domains and estimates
> > > the consumption of each online CPU according to its energy model and its
> > > percentage of busy time.
> > >
> > > Cc: Ingo Molnar <[email protected]>
> > > Cc: Peter Zijlstra <[email protected]>
> > > Signed-off-by: Quentin Perret <[email protected]>
> > > Signed-off-by: Dietmar Eggemann <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 81 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 6c72a5e7b1b0..76bd46502486 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6409,6 +6409,30 @@ static inline int cpu_overutilized(int cpu)
> > > }
> > >
> > > /*
> > > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> > > + */
> > > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> > > +{
> > > + unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
> >
> > What about other classes? Shouldn't we now also take into account
> > DEADLINE (as schedutil does)?
>
> Good point, although that would likely require factoring the
> utilization aggregation function out of schedutil, wouldn't it?
>
> > BTW, we now also have a getter method in sched/sched.h; it takes
> > UTIL_EST into account, though. Do we need to take that into account when
> > estimating energy consumption?
>
> Actually I think that this whole function can be written "just" as:
>
> ---8<---
> unsigned long util = cpu_util_wake(cpu);
>
> if (cpu != dst_cpu)
> > 	return util;
>
> return min(util + task_util(p), capacity_orig_of(cpu));
> ---8<---
>

Yes this should be functionally equivalent. However, with your
suggestion you can potentially remove the task contribution from the
cpu_util in cpu_util_wake() and then add it back right after if
cpu==dst_cpu. This is sub-optimal and that's why I implemented things
slightly differently. But maybe this optimization really is too small to
justify the extra complexity involved ...

> which will reuse existing functions as well as getting for free other
> stuff (like the CPU util_est).
>
> Considering your observation above, it also makes it easy to add the
> DEADLINE and RT utilization into util, just before returning the
> value.
>
> > > + unsigned long capacity = capacity_orig_of(cpu);
> > > +
> > > + /*
> > > + * If p is where it should be, or if it has no impact on cpu, there is
> > > + * not much to do.
> > > + */
> > > + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> > > + goto clamp_util;
> > > +
> > > + if (dst_cpu == cpu)
> > > + util += task_util(p);
> > > + else
> > > + util = max_t(long, util - task_util(p), 0);
> > > +
> > > +clamp_util:
> > > + return (util >= capacity) ? capacity : util;
> > > +}
> > > +
> > > +/*
> > > * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
> > > * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
> > > *
> > > @@ -6432,6 +6456,63 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > > return !util_fits_capacity(task_util(p), min_cap);
> > > }
> > >
> > > +static struct capacity_state *find_cap_state(int cpu, unsigned long util)
> > > +{
> > > + struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> > > + struct capacity_state *cs = NULL;
> > > + int i;
> > > +
> > > + /*
> > > + * As the goal is to estimate the OPP reached for a specific util
> > > + * value, mimic the behaviour of schedutil with a 1.25 coefficient
> > > + */
> > > + util += util >> 2;
> >
> > What about other governors (ondemand for example). Is this supposed to
> > work only when schedutil is in use (if so we should probably make it
> > conditional on that)?
>
> Yes, I would say that EAS mostly makes sense when you have a "minimum"
> control on OPPs... otherwise all the energy estimations are really
> fuzzy.
>
> > Also, even when schedutil is in use, shouldn't we ask it for a util
> > "computation" instead of replicating its _current_ heuristic?
>
> Are you proposing to have the 1.25 factor only here and remove it from
> schedutil?
>
> > I fear the two might diverge in the future.
>
> That could be avoided by factoring the "compensation" factor out of
> schedutil into a proper function to be used by all the interested
> players, couldn't it?
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2018-03-21 14:28:07

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/fair: Introduce an energy estimation helper function

On Wednesday 21 Mar 2018 at 12:39:21 (+0000), Patrick Bellasi wrote:
> On 20-Mar 09:43, Dietmar Eggemann wrote:
> > From: Quentin Perret <[email protected]>
>
> [...]
>
> > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> > +{
> > + unsigned long util, fdom_max_util;
> > + struct capacity_state *cs;
> > + unsigned long energy = 0;
> > + struct freq_domain *fdom;
> > + int cpu;
> > +
> > + for_each_freq_domain(fdom) {
> > + fdom_max_util = 0;
> > + for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
> > + util = cpu_util_next(cpu, p, dst_cpu);
>
> Would be nice to find a way to cache all these util and reuse them
> below... even just to ensure data consistency between the "cs"
> computation and its usage...

So actually, what I can do is add something like

fdom_tot_util += util;

to this loop and compute

energy = cs->power * fdom_tot_util / cs->cap;

only once, instead of having the second loop to compute the energy. We don't
have to scale the util for each and every CPU since they share the same
cap state. That would save some divisions and ensure the consistency
between the selection of the cap state and the associated energy
computation. What do you think ?
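
IOW, something like this (sketch on top of this patch, untested, with
fdom_tot_util as a new local):

---8<---
for_each_freq_domain(fdom) {
	fdom_max_util = 0;
	fdom_tot_util = 0;
	for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
		util = cpu_util_next(cpu, p, dst_cpu);
		fdom_max_util = max(util, fdom_max_util);
		fdom_tot_util += util;
	}

	cpu = cpumask_first(&(fdom->span));
	cs = find_cap_state(cpu, fdom_max_util);

	/* All CPUs of fdom share cs, so scale the sum only once. */
	energy += cs->power * fdom_tot_util / cs->cap;
}
---8<---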

Or maybe you were talking about consistency between several consecutive
calls to compute_energy() ?

>
> > + fdom_max_util = max(util, fdom_max_util);
> > + }
> > +
> > + /*
> > + * Here we assume that the capacity states of CPUs belonging to
> > + * the same frequency domains are shared. Hence, we look at the
> > + * capacity state of the first CPU and re-use it for all.
> > + */
> > + cpu = cpumask_first(&(fdom->span));
> > + cs = find_cap_state(cpu, fdom_max_util);
> ^^^^
>
> The above code could theoretically return NULL, although likely EAS is
> completely disabled if em->nb_cap_states == 0, right?

That's right. sched_energy_present cannot be enabled with
em->nb_cap_states == 0, and compute_energy() is never called without
sched_energy_present in the proposed implementation.

>
> If that's the case then, in the previous function, you can certainly
> avoid the initialization of *cs and maybe also add an explicit:
>
> BUG_ON(em->nb_cap_states == 0);
>
> which helps even just as "in code documentation".
>
> But, I'm not sure if maintainers like BUG_ON in scheduler code :)

Yes, I'm not sure about the BUG_ON either :). I agree that it would be
nice to document somewhere that compute_energy() is unsafe to call
without sched_energy_present. I can simply add a proper doc comment to
this function actually. Would that work ?

>
>
> > +
> > + /*
> > + * The energy consumed by each CPU is derived from the power
> ^
>
> Should we make more explicit that this is just the "active" energy
> consumed?

Ok, np.

>
> > + * it dissipates at the expected OPP and its percentage of
> > + * busy time.
> > + */
> > + for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
> > + util = cpu_util_next(cpu, p, dst_cpu);
> > + energy += cs->power * util / cs->cap;
> > + }
> > + }
>
> nit-pick: empty line before return?

Will do.

>
> > + return energy;
> > +}
> > +
> > /*
> > * select_task_rq_fair: Select target runqueue for the waking task in domains
> > * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> > --
> > 2.11.0
> >
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

Thanks,
Quentin

2018-03-21 14:52:23

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/fair: Introduce an energy estimation helper function

On 21/03/18 14:26, Quentin Perret wrote:
> On Wednesday 21 Mar 2018 at 12:39:21 (+0000), Patrick Bellasi wrote:
> > On 20-Mar 09:43, Dietmar Eggemann wrote:

[...]

> >
> > If that's the case then, in the previous function, you can certainly
> > avoid the initialization of *cs and maybe also add an explicit:
> >
> > BUG_ON(em->nb_cap_states == 0);
> >
> > which helps even just as "in code documentation".
> >
> > But, I'm not sure if maintainers like BUG_ON in scheduler code :)
>
> Yes, I'm not sure about the BUG_ON either :). I agree that it would be
> nice to document somewhere that compute_energy() is unsafe to call
> without sched_energy_present. I can simply add a proper doc comment to
> this function actually. Would that work ?

If it is something that must not happen and it is also non-recoverable
at runtime, then...

$ git grep BUG_ON -- kernel/sched/ | wc -l
50

:)

2018-03-21 15:18:10

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/fair: Introduce an energy estimation helper function

On 21/03/18 13:55, Quentin Perret wrote:
> On Wednesday 21 Mar 2018 at 13:59:25 (+0100), Juri Lelli wrote:
> > On 21/03/18 12:26, Patrick Bellasi wrote:
> > > On 21-Mar 10:04, Juri Lelli wrote:

[...]

> > > > > + /*
> > > > > + * As the goal is to estimate the OPP reached for a specific util
> > > > > + * value, mimic the behaviour of schedutil with a 1.25 coefficient
> > > > > + */
> > > > > + util += util >> 2;
> > > >
> > > > What about other governors (ondemand for example). Is this supposed to
> > > > work only when schedutil is in use (if so we should probably make it
> > > > conditional on that)?
> > >
> > > Yes, I would say that EAS mostly makes sense when you have a "minimum"
> > > control on OPPs... otherwise all the energy estimations are really
> > > fuzzy.
> >
> > Makes sense to me. Shouldn't we then make all this conditional on using
> > schedutil?
>
> So, in theory, EAS could make sense even for other governors than
> schedutil. Even with the performance governor it is probably more
> energy efficient (although users using "performance" probably don't care
> about energy, but that's just an example) to place small tasks onto little
> CPUs up to a certain point given by the energy model. The ideal solution
> would be to change the behaviour of find_cap_state() depending on the
> governor being used, but I don't know if this extra complexity is worth
> it really.
> I'm happy to make all this conditional on schedutil as a first step and
> we can see later if that makes sense to extend EAS to other use-cases.

I agree that EAS still makes sense even for !schedutil cases (your
performance example being one of them, powersave maybe another one?).
Making it work with ondemand is tricky, though.

So, not sure what's the best thing to do, but we should be at least aware
of limitations.

>
> >
> > >
> > > > Also, even when schedutil is in use, shouldn't we ask it for a util
> > > > "computation" instead of replicating its _current_ heuristic?
> > >
> > > Are you proposing to have the 1.25 factor only here and remove it from
> > > schedutil?
> >
> > I'm only saying that we probably shouldn't have two places where we add
> > this 1.25 factor to utilization before using it, as in the future one of
> > the two might modify that 1.25 to something else and then we'll have
> > problems. So, maybe a common wrapper that adds such a factor?
>
> Ok, I can definitely factor this code out between schedutil and EAS. And
> BTW, would it make sense to make schedutil use "capacity_margin" instead
> of an arbitrary value ? The semantics feels pretty close. Out of curiosity,
> what was the reason to use C=1.25 in the first place ?

I seem to remember it was chosen based on experiments, but I may well
be wrong and Rafael, Viresh, or others will correct me. :)

>
> >
> > >
> > > > I fear the two might diverge in the future.
> > >
> > > That could be avoided by factoring out from schedutil the
> > > "compensation" factor into a proper function to be used by all the
> > > interested playes, isn't it?
> >
> > And I should have read till the end before writing the above paragraph
> > it seems. :)
> >
> > Thanks,
> >
> > - Juri
>
> Thank you very much for the feedback !

No problem. I'm of course very interested in this. Could you please
directly Cc me ([email protected]) in next versions?

Thanks,

- Juri

2018-03-21 15:37:45

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up

On 20-Mar 09:43, Dietmar Eggemann wrote:
> From: Quentin Perret <[email protected]>

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 76bd46502486..65a1bead0773 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> return energy;
> }
>
> +static bool task_fits(struct task_struct *p, int cpu)
> +{
> + unsigned long next_util = cpu_util_next(cpu, p, cpu);
> +
> + return util_fits_capacity(next_util, capacity_orig_of(cpu));
^^^^^^^^^^^^^^^^^^^^^

Since we are scheduling CFS tasks here, shouldn't we rather use
capacity_of() to account for RT/IRQ pressure ?

> +}
> +
> +static int find_energy_efficient_cpu(struct sched_domain *sd,
> + struct task_struct *p, int prev_cpu)
> +{
> + unsigned long cur_energy, prev_energy, best_energy;
> + int cpu, best_cpu = prev_cpu;
> +
> + if (!task_util(p))

We are still waking up a task... what if the task was previously
running on a big CPU which is now idle?

I understand that from a _relative_ energy_diff standpoint there is
not much to do for a 0 utilization task. However, for those tasks we
can still try to return the most energy efficient CPU among the ones
in their cpus_allowed mask.

It should have a relatively low overhead (maybe contained in a fallback
most_energy_efficient_cpu() kind of function) and would allow us, for
example on ARM big.LITTLE systems, to consolidate those tasks on
LITTLE CPUs instead of, say, keeping them running on a big CPU.
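
Just to make the idea more concrete, a sketch (entirely hypothetical,
reusing find_cap_state() from this series to get the lowest OPP):

---8<---
static int most_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
	unsigned long power, best_power = ULONG_MAX;
	int cpu, best_cpu = prev_cpu;

	for_each_cpu(cpu, &p->cpus_allowed) {
		/* Power at the lowest OPP of cpu's frequency domain */
		power = find_cap_state(cpu, 0)->power;
		if (power < best_power) {
			best_power = power;
			best_cpu = cpu;
		}
	}

	return best_cpu;
}
---8<---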

> + return prev_cpu;
> +
> + /* Compute the energy impact of leaving the task on prev_cpu. */
> + prev_energy = best_energy = compute_energy(p, prev_cpu);
> +
> + /* Look for the CPU that minimizes the energy. */
^^^^^^^^^^
nit-pick: I would say "best_energy" explicitly here, just to avoid
confusion about (im)possible overflows in the following if check ;)

> + for_each_cpu_and(cpu, &p->cpus_allowed, sched_domain_span(sd)) {
> + if (!task_fits(p, cpu) || cpu == prev_cpu)

nit-pick: to me it would read better as:

if (cpu == prev_cpu)
	continue;
if (!task_fits(p, cpu))
	continue;

but it's more a matter of (personal) taste than efficiency.

> + continue;
> + cur_energy = compute_energy(p, cpu);
> + if (cur_energy < best_energy) {
> + best_energy = cur_energy;
> + best_cpu = cpu;
> + }
> + }
> +
> + /*
> + * We pick the best CPU only if it saves at least 1.5% of the
> + * energy used by prev_cpu.
> + */
> + if ((prev_energy - best_energy) > (prev_energy >> 6))
> + return best_cpu;
> +
> + return prev_cpu;
> +}

[...]

> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> break;
> }
>
> + /*
> + * Energy-aware task placement is performed on the highest
> + * non-overutilized domain spanning over cpu and prev_cpu.
> + */
> + if (want_energy && !sd_overutilized(tmp) &&
> + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> + energy_sd = tmp;
> +

Not entirely sure, but I was trying to understand if we can avoid
modifying the definition of want_affine (in the previous chunk) by
moving this block before the previous "if (want_affine..." (in mainline
but not in this chunk), which would become an else, e.g.

if (want_energy && !sd_overutilized(tmp) &&
    // ...
else if (want_affine &&
    // ...

Isn't that the same?

Maybe there is a code path I'm missing... but otherwise it seems a
more self-contained modification of select_task_rq_fair...

> if (tmp->flags & sd_flag)
> sd = tmp;
> else if (!want_affine)
> @@ -6586,6 +6652,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> if (want_affine)
> current->recent_used_cpu = cpu;
> }
> + } else if (energy_sd) {
> + new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
> } else {
> new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
> }

--
#include <best/regards.h>

Patrick Bellasi

2018-03-21 15:56:24

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/fair: Introduce an energy estimation helper function

On 21-Mar 14:26, Quentin Perret wrote:
> On Wednesday 21 Mar 2018 at 12:39:21 (+0000), Patrick Bellasi wrote:
> > On 20-Mar 09:43, Dietmar Eggemann wrote:
> > > From: Quentin Perret <[email protected]>
> >
> > [...]
> >
> > > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> > > +{
> > > + unsigned long util, fdom_max_util;
> > > + struct capacity_state *cs;
> > > + unsigned long energy = 0;
> > > + struct freq_domain *fdom;
> > > + int cpu;
> > > +
> > > + for_each_freq_domain(fdom) {
> > > + fdom_max_util = 0;
> > > + for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
> > > + util = cpu_util_next(cpu, p, dst_cpu);
> >
> > Would be nice to find a way to cache all these util and reuse them
> > below... even just to ensure data consistency between the "cs"
> > computation and its usage...
>
> So actually, what I can do is add something like
>
> fdom_tot_util += util;
>
> to this loop and compute
>
> energy = cs->power * fdom_tot_util / cs->cap;
>
> only once, instead of having the second loop to compute the energy. We don't
> have to scale the util for each and every CPU since they share the same
> cap state. That would save some divisions and ensure the consistency
> between the selection of the cap state and the associated energy
> computation. What do you think ?

Right, I would say that under the hypothesis that we are in the same
frequency domain (and we are, because of fdom->span), that's basically
doing:

sum_i(P_x * U_i / C_x) => P_x / C_x * sum_i(U_i)

Where (C_x, P_x) are the EM reported capacity and power for the
expected frequency domain OPP.

> Or maybe you were talking about consistency between several consecutive
> calls to compute_energy() ?

Nope, the above +1

> > > + fdom_max_util = max(util, fdom_max_util);
> > > + }
> > > +
> > > + /*
> > > + * Here we assume that the capacity states of CPUs belonging to
> > > + * the same frequency domains are shared. Hence, we look at the
> > > + * capacity state of the first CPU and re-use it for all.
> > > + */
> > > + cpu = cpumask_first(&(fdom->span));
> > > + cs = find_cap_state(cpu, fdom_max_util);
> > ^^^^
> >
> > The above code could theoretically return NULL, although likely EAS is
> > completely disabled if em->nb_cap_states == 0, right?
>
> That's right. sched_energy_present cannot be enabled with
> em->nb_cap_states == 0, and compute_energy() is never called without
> sched_energy_present in the proposed implementation.
>
> >
> > If that's the case then, in the previous function, you can certainly
> > avoid the initialization of *cs and maybe also add an explicit:
> >
> > BUG_ON(em->nb_cap_states == 0);
> >
> > which helps even just as "in code documentation".
> >
> > But, I'm not sure if maintainers like BUG_ON in scheduler code :)
>
> Yes, I'm not sure about the BUG_ON either :).

FWIW, there are already some BUG_ONs in fair.c... thus, if they can
pinpoint a specific bug in case of errors, they should be acceptable ?

> I agree that it would be nice to document somewhere that
> compute_energy() is unsafe to call without sched_energy_present.
> I can simply add a proper doc comment to this function actually.
> Would that work ?

Right, it's just that _maybe_ an explicit BUG_ON improves the
documentation by making the error more explicit when testing ?

Thus, I would probably add both... but Peter will tell you for sure ;)

--
#include <best/regards.h>

Patrick Bellasi

2018-03-21 16:27:13

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/fair: Introduce an energy estimation helper function

On Wed, Mar 21, 2018 at 04:15:13PM +0100, Juri Lelli wrote:
> On 21/03/18 13:55, Quentin Perret wrote:
> > On Wednesday 21 Mar 2018 at 13:59:25 (+0100), Juri Lelli wrote:
> > > On 21/03/18 12:26, Patrick Bellasi wrote:
> > > > On 21-Mar 10:04, Juri Lelli wrote:
>
> [...]
>
> > > > > > + /*
> > > > > > + * As the goal is to estimate the OPP reached for a specific util
> > > > > > + * value, mimic the behaviour of schedutil with a 1.25 coefficient
> > > > > > + */
> > > > > > + util += util >> 2;
> > > > >
> > > > > What about other governors (ondemand for example). Is this supposed to
> > > > > work only when schedutil is in use (if so we should probably make it
> > > > > conditional on that)?
> > > >
> > > > Yes, I would say that EAS mostly makes sense when you have a "minimum"
> > > > control on OPPs... otherwise all the energy estimations are really
> > > > fuzzy.
> > >
> > > Makes sense to me. Shouldn't we then make all this conditional on using
> > > schedutil?
> >
> > So, in theory, EAS could make sense even for other governors than
> > schedutil. Even with the performance governor it is probably more
> > energy efficient (although users using "performance" probably don't care
> > about energy, but that's just an example) to place small tasks onto little
> > CPUs up to a certain point given by the energy model. The ideal solution
> > would be to change the behaviour of find_cap_state() depending on the
> > governor being used, but I don't know if this extra complexity is worth
> > it really.
> > I'm happy to make all this conditional on schedutil as a first step and
> > we can see later if that makes sense to extend EAS to other use-cases.
>
> I agree that EAS still makes sense even for !schedutil cases (your
> performance example being one of them, powersave maybe another one?).
> Making it work with ondemand is tricky, though.
>
> So, not sure what's the best thing to do, but we should be at least aware
> of limitations.

I would suggest making as few assumptions about the OPP selection as
possible. Even when we do use schedutil, there could be a number of
reasons why we don't actually get the OPP that schedutil requests (thermal,
hardware-says-no,...).

In the previous energy model-driven scheduling postings, years back, I
went with the assumption that OPP would follow the utilization. So if we
put more tasks on a cpu, the OPP would increase to match. If cpufreq or
hardware decided to go faster, that is fine but it could lead to
suboptimal decisions.

We could call into schedutil somehow to make sure that we at least
request the same OPP as the energy model assumes if the overhead is
small and we can present schedutil with all the information it needs to
choose the OPP for the proposed task placement. I wonder if it is worth
it, or if we should just decide on a simple assumption on OPP selection
for energy estimation and stick it in a comment.

Morten

2018-03-21 17:05:06

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/fair: Introduce an energy estimation helper function

On 21/03/18 16:26, Morten Rasmussen wrote:
> On Wed, Mar 21, 2018 at 04:15:13PM +0100, Juri Lelli wrote:
> > On 21/03/18 13:55, Quentin Perret wrote:
> > > On Wednesday 21 Mar 2018 at 13:59:25 (+0100), Juri Lelli wrote:
> > > > On 21/03/18 12:26, Patrick Bellasi wrote:
> > > > > On 21-Mar 10:04, Juri Lelli wrote:
> >
> > [...]
> >
> > > > > > > + /*
> > > > > > > + * As the goal is to estimate the OPP reached for a specific util
> > > > > > > + * value, mimic the behaviour of schedutil with a 1.25 coefficient
> > > > > > > + */
> > > > > > > + util += util >> 2;
> > > > > >
> > > > > > What about other governors (ondemand for example). Is this supposed to
> > > > > > work only when schedutil is in use (if so we should probably make it
> > > > > > conditional on that)?
> > > > >
> > > > > Yes, I would say that EAS mostly makes sense when you have a "minimum"
> > > > > control on OPPs... otherwise all the energy estimations are really
> > > > > fuzzy.
> > > >
> > > > Makes sense to me. Shouldn't we then make all this conditional on using
> > > > schedutil?
> > >
> > > So, in theory, EAS could make sense even for other governors than
> > > schedutil. Even with the performance governor it is probably more
> > > energy efficient (although users using "performance" probably don't care
> > > about energy, but that's just an example) to place small tasks onto little
> > > CPUs up to a certain point given by the energy model. The ideal solution
> > > would be to change the behaviour of find_cap_state() depending on the
> > > governor being used, but I don't know if this extra complexity is worth
> > > it really.
> > > I'm happy to make all this conditional on schedutil as a first step and
> > > we can see later if that makes sense to extend EAS to other use-cases.
> >
> > I agree that EAS still makes sense even for !schedutil cases (your
> > performance example being one of them, powersave maybe another one?).
> > Making it work with ondemand is tricky, though.
> >
> > So, not sure what's the best thing to do, but we should be at least aware
> > of limitations.
>
> I would suggest making as few assumptions about the OPP selection as
> possible. Even when we do use schedutil, there could be a number of
> reasons why we don't actually get the OPP that schedutil requests (thermal,
> hardware-says-no,...).
>
> In the previous energy model-driven scheduling postings, years back, I
> went with the assumption that OPP would follow the utilization. So if we
> put more tasks on a cpu, the OPP would increase to match. If cpufreq or
> hardware decided to go faster, that is fine but it could lead to
> suboptimal decisions.
>
> We could call into schedutil somehow to make sure that we at least
> request the same OPP as the energy model assumes if the overhead is
> small and we can present schedutil with all the information it needs to
> choose the OPP for the proposed task placement. I wonder if it is worth
> it, or if we should just decide on a simple assumption on OPP selection
> for energy estimation and stick it in a comment.

Right, I see your point. Refactoring the 1.25 coefficient calculation
into some getter method hopefully shouldn't add much overhead, but yes,
it might not give us much in terms of correctness for certain situations.

Best,

- Juri

2018-03-21 21:17:21

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/fair: Introduce an energy estimation helper function

On 03/21/2018 03:02 PM, Quentin Perret wrote:
> On Wednesday 21 Mar 2018 at 12:26:21 (+0000), Patrick Bellasi wrote:
>> On 21-Mar 10:04, Juri Lelli wrote:
>>> Hi,
>>>
>>> On 20/03/18 09:43, Dietmar Eggemann wrote:
>>>> From: Quentin Perret <[email protected]>

[...]

>> Actually I think that this whole function can be written "just" as:
>>
>> ---8<---
>> unsigned long util = cpu_util_wake(cpu);
>>
>> if (cpu != dst_cpu)
>> 	return util;
>>
>> return min(util + task_util(p), capacity_orig_of(cpu));
>> ---8<---
>>
>
> Yes this should be functionally equivalent. However, with your
> suggestion you can potentially remove the task contribution from the
> cpu_util in cpu_util_wake() and then add it back right after if
> cpu==dst_cpu. This is sub-optimal and that's why I implemented things
> slightly differently. But maybe this optimization really is too small to
> justify the extra complexity involved ...

What about merging both functions by adding an additional 'int dst_cpu'
parameter to cpu_util_wake() (only lightly tested, w/o util_est):

--->8---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 65a1bead0773..4d4f104d5b3d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5860,11 +5860,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
}

static inline unsigned long task_util(struct task_struct *p);
-static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
+static unsigned long cpu_util_wake(int cpu, int dst_cpu, struct task_struct *p);

static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
{
- return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0);
+ return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, -1, p), 0);
}

/*
@@ -6384,16 +6384,22 @@ static inline unsigned long task_util(struct task_struct *p)
* cpu_util_wake: Compute CPU utilization with any contributions from
* the waking task p removed.
*/
-static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
+static unsigned long cpu_util_wake(int cpu, int dst_cpu, struct task_struct *p)
{
unsigned long util, capacity;

/* Task has no contribution or is new */
- if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
+ if ((cpu != task_cpu(p) && cpu != dst_cpu) ||
+ dst_cpu == task_cpu(p) || !p->se.avg.last_update_time)
return cpu_util(cpu);

capacity = capacity_orig_of(cpu);
- util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
+ util = cpu_rq(cpu)->cfs.avg.util_avg;
+
+ if (likely(dst_cpu != cpu))
+ util = max_t(long, util - task_util(p), 0);
+ else
+ util += task_util(p);

return (util >= capacity) ? capacity : util;
}
@@ -6409,30 +6415,6 @@ static inline int cpu_overutilized(int cpu)
}

/*
- * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
- */
-static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
-{
- unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
- unsigned long capacity = capacity_orig_of(cpu);
-
- /*
- * If p is where it should be, or if it has no impact on cpu, there is
- * not much to do.
- */
- if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
- goto clamp_util;
-
- if (dst_cpu == cpu)
- util += task_util(p);
- else
- util = max_t(long, util - task_util(p), 0);
-
-clamp_util:
- return (util >= capacity) ? capacity : util;
-}
-
-/*
* Disable WAKE_AFFINE in the case where task @p doesn't fit in the
* capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
*
@@ -6488,7 +6470,7 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
for_each_freq_domain(fdom) {
fdom_max_util = 0;
for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
- util = cpu_util_next(cpu, p, dst_cpu);
+ util = cpu_util_wake(cpu, dst_cpu, p);
fdom_max_util = max(util, fdom_max_util);
}

@@ -6506,7 +6488,7 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
* busy time.
*/
for_each_cpu_and(cpu, &(fdom->span), cpu_online_mask) {
- util = cpu_util_next(cpu, p, dst_cpu);
+ util = cpu_util_wake(cpu, dst_cpu, p);
energy += cs->power * util / cs->cap;
}
}



2018-03-22 05:06:52

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/fair: Introduce an energy estimation helper function

On Wednesday 21 Mar 2018 at 15:54:58 (+0000), Patrick Bellasi wrote:
> On 21-Mar 14:26, Quentin Perret wrote:
> > On Wednesday 21 Mar 2018 at 12:39:21 (+0000), Patrick Bellasi wrote:
> > > On 20-Mar 09:43, Dietmar Eggemann wrote:
> > > > From: Quentin Perret <[email protected]>

[...]

> > So actually, what I can do is add something like
> >
> > fdom_tot_util += util;
> >
> > to this loop and compute
> >
> > energy = cs->power * fdom_tot_util / cs->cap;
> >
> > only once, instead of having the second loop to compute the energy. We don't
> > have to scale the util for each and every CPU since they share the same
> > cap state. That would save some divisions and ensure the consistency
> > between the selection of the cap state and the associated energy
> > computation. What do you think ?
>
> Right, I would say that under the hypothesis that we are in the same
> frequency domain (and we are, because of fdom->span), that's basically
> doing:
>
> sum_i(P_x * U_i / C_x) => P_x / C_x * sum_i(U_i)
>
> Where (C_x, P_x) are the EM reported capacity and power for the
> expected frequency domain OPP.
>

Yes that's exactly that. I'll do the change in v2.

> > Or maybe you were talking about consistency between several consecutive
> > calls to compute_energy() ?
>
> Nope, the above +1
>

[...]

> > I agree that it would be nice to document somewhere that
> > compute_energy() is unsafe to call without sched_energy_present.
> > I can simply add a proper doc comment to this function actually.
> > Would that work ?
>
> Right, it's just that _maybe_ an explicit BUG_ON improves the
> documentation by making the error more explicit when testing ?
>
> Thus, I would probably add both... but Peter will tell you for sure ;)
>

Right, but I'm still not sure if the BUG_ON is the right thing to do. I
mean, if we really want to make this check, then we could also try
to recover into a working state ... If we enter compute_energy() without
having an energy model, and if we detect it in time, we could bail out
and disable sched_energy_present ASAP with an error message for example.
That would let us know if EAS is broken without making the system
unusable.
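
Something along these lines in compute_energy(), for example (sketch
only; actually flipping the sched_energy_present static key from here
would have to be deferred, since jump label patching can sleep):

---8<---
cs = find_cap_state(cpu, fdom_max_util);
if (WARN_ON_ONCE(!cs))
	return 0; /* No energy model: report no cost, callers keep prev_cpu */
---8<---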

Anyways, if there is a general agreement, or if the maintainers think
that the BUG_ON is the right thing to do here, I'm happy to change that
in future versions :)

2018-03-22 16:29:27

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up

Hi,

On Tue, Mar 20, 2018 at 2:43 AM, Dietmar Eggemann
<[email protected]> wrote:
>
> From: Quentin Perret <[email protected]>
>
> In case an energy model is available, waking tasks are re-routed into a
> new energy-aware placement algorithm. The eligible CPUs to be used in the
> energy-aware wakeup path are restricted to the highest non-overutilized
> sched_domain containing prev_cpu and this_cpu. If no such domain is found,
> the tasks go through the usual wake-up path, hence energy-aware placement
> happens only in lightly utilized scenarios.
>
> The selection of the most energy-efficient CPU for a task is achieved by
> estimating the impact on system-level active energy resulting from the
> placement of the task on each candidate CPU. The best CPU energy-wise is
> then selected if it saves a large enough amount of energy with respect to
> prev_cpu.
>
> Although it has already shown significant benefits on some existing
> targets, this brute force approach clearly cannot scale to platforms with
> numerous CPUs. This patch is an attempt to do something useful as writing
> a fast heuristic that performs reasonably well on a broad spectrum of
> architectures isn't an easy task. As a consequence, the scope of usability
> of the energy-aware wake-up path is restricted to systems with the
> SD_ASYM_CPUCAPACITY flag set. These systems not only show the most
> promising opportunities for saving energy but also typically feature a
> limited number of logical CPUs.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Quentin Perret <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 71 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 76bd46502486..65a1bead0773 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> return energy;
> }
>
> +static bool task_fits(struct task_struct *p, int cpu)
> +{
> + unsigned long next_util = cpu_util_next(cpu, p, cpu);
> +
> + return util_fits_capacity(next_util, capacity_orig_of(cpu));
> +}
> +
> +static int find_energy_efficient_cpu(struct sched_domain *sd,
> + struct task_struct *p, int prev_cpu)
> +{
> + unsigned long cur_energy, prev_energy, best_energy;
> + int cpu, best_cpu = prev_cpu;
> +
> + if (!task_util(p))
> + return prev_cpu;
> +
> + /* Compute the energy impact of leaving the task on prev_cpu. */
> + prev_energy = best_energy = compute_energy(p, prev_cpu);

Is it possible that before the wakeup, the task's affinity is changed
so that p->cpus_allowed no longer contains prev_cpu ? In that case
prev_energy wouldn't matter since the previous CPU is no longer an option?

> +
> + /* Look for the CPU that minimizes the energy. */
> + for_each_cpu_and(cpu, &p->cpus_allowed, sched_domain_span(sd)) {
> + if (!task_fits(p, cpu) || cpu == prev_cpu)
> + continue;
> + cur_energy = compute_energy(p, cpu);
> + if (cur_energy < best_energy) {
> + best_energy = cur_energy;
> + best_cpu = cpu;
> + }
> + }
> +
> + /*
> + * We pick the best CPU only if it saves at least 1.5% of the
> + * energy used by prev_cpu.
> + */
> + if ((prev_energy - best_energy) > (prev_energy >> 6))
> + return best_cpu;
> +
> + return prev_cpu;
> +}
> +
> +static inline bool wake_energy(struct task_struct *p, int prev_cpu)
> +{
> + struct sched_domain *sd;
> +
> + if (!static_branch_unlikely(&sched_energy_present))
> + return false;
> +
> + sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd);
> + if (!sd || sd_overutilized(sd))

Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here?

> + return false;
> +
> + return true;
> +}
> +
> /*
> * select_task_rq_fair: Select target runqueue for the waking task in domains
> * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -6529,18 +6583,22 @@ static int
> select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
> {
> struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> + struct sched_domain *energy_sd = NULL;
> int cpu = smp_processor_id();
> int new_cpu = prev_cpu;
> - int want_affine = 0;
> + int want_affine = 0, want_energy = 0;
> int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
>
> + rcu_read_lock();
> +
> if (sd_flag & SD_BALANCE_WAKE) {
> record_wakee(p);
> + want_energy = wake_energy(p, prev_cpu);
> want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> - && cpumask_test_cpu(cpu, &p->cpus_allowed);
> + && cpumask_test_cpu(cpu, &p->cpus_allowed)
> + && !want_energy;
> }
>
> - rcu_read_lock();
> for_each_domain(cpu, tmp) {
> if (!(tmp->flags & SD_LOAD_BALANCE))
> break;
> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> break;
> }
>
> + /*
> + * Energy-aware task placement is performed on the highest
> + * non-overutilized domain spanning over cpu and prev_cpu.
> + */
> + if (want_energy && !sd_overutilized(tmp) &&
> + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))

Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here for tmp level?


> + energy_sd = tmp;
> +
> if (tmp->flags & sd_flag)
> sd = tmp;
> else if (!want_affine)
> @@ -6586,6 +6652,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> if (want_affine)
> current->recent_used_cpu = cpu;
> }
> + } else if (energy_sd) {
> + new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);

Even if want_affine = 0 (want_energy = 1), we can have sd = NULL if
sd_flag and tmp->flags don't match. In this case we wont enter the EAS
selection path because sd will be = NULL. So should you be setting sd
= NULL explicitly if energy_sd != NULL , or rather do the if
(energy_sd) before doing the if (!sd) ?

If you still want to keep the logic this way, then probably you should
also check that (tmp->flags & sd_flag) is true in the loop. That way
energy_sd won't be set at all, since we're basically saying we don't
want to do the wake-up across this sd (in an energy-aware fashion in
this case) if the domain flags don't match the wake-up sd_flag.

thanks,

- Joel

2018-03-22 18:07:47

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up

On 22-Mar 09:27, Joel Fernandes wrote:
> Hi,
>
> On Tue, Mar 20, 2018 at 2:43 AM, Dietmar Eggemann
> <[email protected]> wrote:
> >
> > From: Quentin Perret <[email protected]>

[...]

> > +static inline bool wake_energy(struct task_struct *p, int prev_cpu)
> > +{
> > + struct sched_domain *sd;
> > +
> > + if (!static_branch_unlikely(&sched_energy_present))
> > + return false;
> > +
> > + sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd);
> > + if (!sd || sd_overutilized(sd))
>
> Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here?

I think that should already be covered by the static key check
above...

>
> > + return false;
> > +
> > + return true;
> > +}
> > +
> > /*
> > * select_task_rq_fair: Select target runqueue for the waking task in domains
> > * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> > @@ -6529,18 +6583,22 @@ static int
> > select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
> > {
> > struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> > + struct sched_domain *energy_sd = NULL;
> > int cpu = smp_processor_id();
> > int new_cpu = prev_cpu;
> > - int want_affine = 0;
> > + int want_affine = 0, want_energy = 0;
> > int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> >
> > + rcu_read_lock();
> > +
> > if (sd_flag & SD_BALANCE_WAKE) {
> > record_wakee(p);
> > + want_energy = wake_energy(p, prev_cpu);
> > want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> > - && cpumask_test_cpu(cpu, &p->cpus_allowed);
> > + && cpumask_test_cpu(cpu, &p->cpus_allowed)
> > + && !want_energy;
> > }
> >
> > - rcu_read_lock();
> > for_each_domain(cpu, tmp) {
> > if (!(tmp->flags & SD_LOAD_BALANCE))
> > break;
> > @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> > break;
> > }
> >
> > + /*
> > + * Energy-aware task placement is performed on the highest
> > + * non-overutilized domain spanning over cpu and prev_cpu.
> > + */
> > + if (want_energy && !sd_overutilized(tmp) &&
> > + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
>
> Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here for tmp level?

... and this then should be covered by the previous check in
wake_energy(), which sets want_energy.

>
> > + energy_sd = tmp;
> > +
> > if (tmp->flags & sd_flag)
> > sd = tmp;
> > else if (!want_affine)
> > @@ -6586,6 +6652,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> > if (want_affine)
> > current->recent_used_cpu = cpu;
> > }
> > + } else if (energy_sd) {
> > + new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
>
> Even if want_affine = 0 (want_energy = 1), we can have sd = NULL if
> sd_flag and tmp->flags don't match. In this case we wont enter the EAS
> selection path because sd will be = NULL. So should you be setting sd
> = NULL explicitly if energy_sd != NULL , or rather do the if
> (energy_sd) before doing the if (!sd) ?

That's the same thing I was also proposing in my reply to this patch.
But in my case the point was mainly to make the code easier to
follow... which in the end also avoids all the considerations on
dependencies you describe above.

Joel, can you have a look at what I proposed... I was not entirely
sure whether we'd miss some code paths doing it that way.

> If you still want to keep the logic this way, then probably you should
> also check that (tmp->flags & sd_flag) is true in the loop. That way
> energy_sd won't be set at all, since we're basically saying we don't
> want to do the wake-up across this sd (in an energy-aware fashion in
> this case) if the domain flags don't match the wake-up sd_flag.
>
> thanks,
>
> - Joel

--
#include <best/regards.h>

Patrick Bellasi

2018-03-22 20:11:34

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up

On Wed, Mar 21, 2018 at 8:35 AM, Patrick Bellasi
<[email protected]> wrote:
> [...]
>
>> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> break;
>> }
>>
>> + /*
>> + * Energy-aware task placement is performed on the highest
>> + * non-overutilized domain spanning over cpu and prev_cpu.
>> + */
>> + if (want_energy && !sd_overutilized(tmp) &&
>> + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
>> + energy_sd = tmp;
>> +
>
> Not entirely sure, but I was trying to understand if we can avoid to
> modify the definition of want_affine (in the previous chunk) and move
> this block before the previous "if (want_affine..." (in mainline but
> not in this chunk), which will became an else, e.g.
>
> if (want_energy && !sd_overutilized(tmp) &&
> // ...
> else if (want_energy && !sd_overutilized(tmp) &&
> // ...
>
> Isn't that the same?
>
> Maybe there is a code path I'm missing... but otherwise it seems a
> more self contained modification of select_task_rq_fair...

Just replying to this here Patrick instead of the other thread.

I think this is the right place for the block from Quentin quoted
above because we want to search for the highest domain that is
!overutilized and look among those for the candidates. So from that
perspective, we can't move the block to the beginning and it seems to
be in the right place. My main concern on the other thread was
different, I was talking about the cases where sd_flag & tmp->flags
don't match. In that case, sd = NULL would trump EAS and I was
wondering if that's the right thing to do...

thanks,

- Joel

[...]

2018-03-22 20:21:49

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up

On Thu, Mar 22, 2018 at 11:06 AM, Patrick Bellasi
<[email protected]> wrote:
[..]
>> > +static inline bool wake_energy(struct task_struct *p, int prev_cpu)
>> > +{
>> > + struct sched_domain *sd;
>> > +
>> > + if (!static_branch_unlikely(&sched_energy_present))
>> > + return false;
>> > +
>> > + sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd);
>> > + if (!sd || sd_overutilized(sd))
>>
>> Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here?
>
> I think that should already be covered by the static key check
> above...
>

I understand there is an assumption like that, but I was wondering
whether it's more future-proof to check it explicitly. I am OK if
everyone thinks it's a valid assumption...

>>
>> > + return false;
>> > +
>> > + return true;
>> > +}
>> > +
>> > /*
>> > * select_task_rq_fair: Select target runqueue for the waking task in domains
>> > * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
>> > @@ -6529,18 +6583,22 @@ static int
>> > select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
>> > {
>> > struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>> > + struct sched_domain *energy_sd = NULL;
>> > int cpu = smp_processor_id();
>> > int new_cpu = prev_cpu;
>> > - int want_affine = 0;
>> > + int want_affine = 0, want_energy = 0;
>> > int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
>> >
>> > + rcu_read_lock();
>> > +
>> > if (sd_flag & SD_BALANCE_WAKE) {
>> > record_wakee(p);
>> > + want_energy = wake_energy(p, prev_cpu);
>> > want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
>> > - && cpumask_test_cpu(cpu, &p->cpus_allowed);
>> > + && cpumask_test_cpu(cpu, &p->cpus_allowed)
>> > + && !want_energy;
>> > }
>> >
>> > - rcu_read_lock();
>> > for_each_domain(cpu, tmp) {
>> > if (!(tmp->flags & SD_LOAD_BALANCE))
>> > break;
>> > @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> > break;
>> > }
>> >
>> > + /*
>> > + * Energy-aware task placement is performed on the highest
>> > + * non-overutilized domain spanning over cpu and prev_cpu.
>> > + */
>> > + if (want_energy && !sd_overutilized(tmp) &&
>> > + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
>>
>> Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here for tmp level?
>
> ... and this then should be covered by the previous check in
> wake_energy(), which sets want_energy.

Right, but in a scenario which probably doesn't exist today, where we
have both SD_ASYM_CPUCAPACITY and !SD_ASYM_CPUCAPACITY domains in the
hierarchy for which want_energy = 1, I was wondering whether it's more
future-proof to check it and not make assumptions...

>>
>> > + energy_sd = tmp;
>> > +
>> > if (tmp->flags & sd_flag)
>> > sd = tmp;
>> > else if (!want_affine)
>> > @@ -6586,6 +6652,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> > if (want_affine)
>> > current->recent_used_cpu = cpu;
>> > }
>> > + } else if (energy_sd) {
>> > + new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
>>
>> Even if want_affine = 0 (want_energy = 1), we can have sd = NULL if
>> sd_flag and tmp->flags don't match. In this case we wont enter the EAS
>> selection path because sd will be = NULL. So should you be setting sd
>> = NULL explicitly if energy_sd != NULL , or rather do the if
>> (energy_sd) before doing the if (!sd) ?
>
> That's the same thing I was also proposing in my reply to this patch.
> But in my case the point was mainly to make the code easier to
> follow... which in the end also avoids all the considerations on
> dependencies you describe above.
>
> Joel, can you have a look at what I proposed... I was not entirely
> sure whether we'd miss some code paths doing it that way.

Replied to this in the other thread.

thanks,

- Joel

2018-03-23 15:49:19

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up

On Thu, Mar 22, 2018 at 01:10:22PM -0700, Joel Fernandes wrote:
> On Wed, Mar 21, 2018 at 8:35 AM, Patrick Bellasi
> <[email protected]> wrote:
> > [...]
> >
> >> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >> break;
> >> }
> >>
> >> + /*
> >> + * Energy-aware task placement is performed on the highest
> >> + * non-overutilized domain spanning over cpu and prev_cpu.
> >> + */
> >> + if (want_energy && !sd_overutilized(tmp) &&
> >> + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> >> + energy_sd = tmp;
> >> +
> >
> > Not entirely sure, but I was trying to understand if we can avoid to
> > modify the definition of want_affine (in the previous chunk) and move
> > this block before the previous "if (want_affine..." (in mainline but
> > not in this chunk), which will became an else, e.g.
> >
> > if (want_energy && !sd_overutilized(tmp) &&
> > // ...
> > else if (want_energy && !sd_overutilized(tmp) &&
> > // ...
> >
> > Isn't that the same?
> >
> > Maybe there is a code path I'm missing... but otherwise it seems a
> > more self contained modification of select_task_rq_fair...
>
> Just replying to this here Patrick instead of the other thread.
>
> I think this is the right place for the block from Quentin quoted
> above because we want to search for the highest domain that is
> !overutilized and look among those for the candidates. So from that
> perspective, we can't move the block to the beginning and it seems to
> be in the right place. My main concern on the other thread was
> different, I was talking about the cases where sd_flag & tmp->flags
> don't match. In that case, sd = NULL would trump EAS and I was
> wondering if that's the right thing to do...

You mean if SD_BALANCE_WAKE isn't set on sched_domains?

The current code seems to rely on that flag to be set to work correctly.
Otherwise, the loop might bail out on !want_affine and we end up doing
the find_energy_efficient_cpu() on the lowest level sched_domain even if
there is a higher level one which isn't over-utilized.

However, SD_BALANCE_WAKE should be set if SD_ASYM_CPUCAPACITY is set so
sd == NULL shouldn't be possible? This only holds as long as we only
want EAS for asymmetric systems.

2018-03-23 16:02:26

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up

On Thu, Mar 22, 2018 at 09:27:43AM -0700, Joel Fernandes wrote:
> Hi,
>
> On Tue, Mar 20, 2018 at 2:43 AM, Dietmar Eggemann
> <[email protected]> wrote:
> >
> > From: Quentin Perret <[email protected]>
> >
> > In case an energy model is available, waking tasks are re-routed into a
> > new energy-aware placement algorithm. The eligible CPUs to be used in the
> > energy-aware wakeup path are restricted to the highest non-overutilized
> > sched_domain containing prev_cpu and this_cpu. If no such domain is found,
> > the tasks go through the usual wake-up path, hence energy-aware placement
> > happens only in lightly utilized scenarios.
> >
> > The selection of the most energy-efficient CPU for a task is achieved by
> > estimating the impact on system-level active energy resulting from the
> > placement of the task on each candidate CPU. The best CPU energy-wise is
> > then selected if it saves a large enough amount of energy with respect to
> > prev_cpu.
> >
> > Although it has already shown significant benefits on some existing
> > targets, this brute force approach clearly cannot scale to platforms with
> > numerous CPUs. This patch is an attempt to do something useful as writing
> > a fast heuristic that performs reasonably well on a broad spectrum of
> > architectures isn't an easy task. As a consequence, the scope of usability
> > of the energy-aware wake-up path is restricted to systems with the
> > SD_ASYM_CPUCAPACITY flag set. These systems not only show the most
> > promising opportunities for saving energy but also typically feature a
> > limited number of logical CPUs.
> >
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Signed-off-by: Quentin Perret <[email protected]>
> > Signed-off-by: Dietmar Eggemann <[email protected]>
> > ---
> > kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 71 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 76bd46502486..65a1bead0773 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> > return energy;
> > }
> >
> > +static bool task_fits(struct task_struct *p, int cpu)
> > +{
> > + unsigned long next_util = cpu_util_next(cpu, p, cpu);
> > +
> > + return util_fits_capacity(next_util, capacity_orig_of(cpu));
> > +}
> > +
> > +static int find_energy_efficient_cpu(struct sched_domain *sd,
> > + struct task_struct *p, int prev_cpu)
> > +{
> > + unsigned long cur_energy, prev_energy, best_energy;
> > + int cpu, best_cpu = prev_cpu;
> > +
> > + if (!task_util(p))
> > + return prev_cpu;
> > +
> > + /* Compute the energy impact of leaving the task on prev_cpu. */
> > + prev_energy = best_energy = compute_energy(p, prev_cpu);
>
> Is it possible that before the wakeup, the task's affinity is changed
> so that p->cpus_allowed no longer contains prev_cpu ? In that case
> prev_energy wouldn't matter since previous CPU is no longer an option?

It is possible to wake-up with a disallowed prev_cpu. In fact
select_idle_sibling() may happily return a disallowed cpu in that case.
The mistake gets fixed in select_task_rq() which uses
select_fallback_rq() to find an allowed cpu instead.

Could we fix the issue in find_energy_efficient_cpu() by a simple test
like below

if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
prev_energy = best_energy = compute_energy(p, prev_cpu);
else
prev_energy = best_energy = ULONG_MAX;
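
Folded into the patch's find_energy_efficient_cpu(), that would look
roughly like below (untested sketch, reusing the helpers from the
patch above):

static int find_energy_efficient_cpu(struct sched_domain *sd,
				     struct task_struct *p, int prev_cpu)
{
	unsigned long cur_energy, prev_energy, best_energy;
	int cpu, best_cpu = prev_cpu;

	if (!task_util(p))
		return prev_cpu;

	/* prev_cpu may have become disallowed since the task last ran. */
	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
		prev_energy = best_energy = compute_energy(p, prev_cpu);
	else
		prev_energy = best_energy = ULONG_MAX;

	/* Look for the allowed CPU that minimizes the energy. */
	for_each_cpu_and(cpu, &p->cpus_allowed, sched_domain_span(sd)) {
		if (!task_fits(p, cpu) || cpu == prev_cpu)
			continue;
		cur_energy = compute_energy(p, cpu);
		if (cur_energy < best_energy) {
			best_energy = cur_energy;
			best_cpu = cpu;
		}
	}

	/* Pick best_cpu only if it saves enough energy wrt prev_cpu. */
	if ((prev_energy - best_energy) > (prev_energy >> 6))
		return best_cpu;

	return prev_cpu;
}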

2018-03-24 00:37:26

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up

On Fri, Mar 23, 2018 at 9:00 AM, Morten Rasmussen
<[email protected]> wrote:
> On Thu, Mar 22, 2018 at 09:27:43AM -0700, Joel Fernandes wrote:
>> >
>> > In case an energy model is available, waking tasks are re-routed into a
>> > new energy-aware placement algorithm. The eligible CPUs to be used in the
>> > energy-aware wakeup path are restricted to the highest non-overutilized
>> > sched_domain containing prev_cpu and this_cpu. If no such domain is found,
>> > the tasks go through the usual wake-up path, hence energy-aware placement
>> > happens only in lightly utilized scenarios.
>> >
>> > The selection of the most energy-efficient CPU for a task is achieved by
>> > estimating the impact on system-level active energy resulting from the
>> > placement of the task on each candidate CPU. The best CPU energy-wise is
>> > then selected if it saves a large enough amount of energy with respect to
>> > prev_cpu.
>> >
>> > Although it has already shown significant benefits on some existing
>> > targets, this brute force approach clearly cannot scale to platforms with
>> > numerous CPUs. This patch is an attempt to do something useful as writing
>> > a fast heuristic that performs reasonably well on a broad spectrum of
>> > architectures isn't an easy task. As a consequence, the scope of usability
>> > of the energy-aware wake-up path is restricted to systems with the
>> > SD_ASYM_CPUCAPACITY flag set. These systems not only show the most
>> > promising opportunities for saving energy but also typically feature a
>> > limited number of logical CPUs.
>> >
>> > Cc: Ingo Molnar <[email protected]>
>> > Cc: Peter Zijlstra <[email protected]>
>> > Signed-off-by: Quentin Perret <[email protected]>
>> > Signed-off-by: Dietmar Eggemann <[email protected]>
>> > ---
>> > kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>> > 1 file changed, 71 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 76bd46502486..65a1bead0773 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
>> > return energy;
>> > }
>> >
>> > +static bool task_fits(struct task_struct *p, int cpu)
>> > +{
>> > + unsigned long next_util = cpu_util_next(cpu, p, cpu);
>> > +
>> > + return util_fits_capacity(next_util, capacity_orig_of(cpu));
>> > +}
>> > +
>> > +static int find_energy_efficient_cpu(struct sched_domain *sd,
>> > + struct task_struct *p, int prev_cpu)
>> > +{
>> > + unsigned long cur_energy, prev_energy, best_energy;
>> > + int cpu, best_cpu = prev_cpu;
>> > +
>> > + if (!task_util(p))
>> > + return prev_cpu;
>> > +
>> > + /* Compute the energy impact of leaving the task on prev_cpu. */
>> > + prev_energy = best_energy = compute_energy(p, prev_cpu);
>>
>> Is it possible that before the wakeup, the task's affinity is changed
>> so that p->cpus_allowed no longer contains prev_cpu ? In that case
>> prev_energy wouldn't matter since previous CPU is no longer an option?
>
> It is possible to wake-up with a disallowed prev_cpu. In fact
> select_idle_sibling() may happily return a disallowed cpu in that case.
> The mistake gets fixed in select_task_rq() which uses
> select_fallback_rq() to find an allowed cpu instead.
>
> Could we fix the issue in find_energy_efficient_cpu() by a simple test
> like below
>
> if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> prev_energy = best_energy = compute_energy(p, prev_cpu);
> else
> prev_energy = best_energy = ULONG_MAX;

Yes, I think setting it to ULONG_MAX in this case is OK with me.

thanks,

- Joel

2018-03-24 01:15:07

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up

Hi Morten,

On Fri, Mar 23, 2018 at 8:47 AM, Morten Rasmussen
<[email protected]> wrote:
> On Thu, Mar 22, 2018 at 01:10:22PM -0700, Joel Fernandes wrote:
>> On Wed, Mar 21, 2018 at 8:35 AM, Patrick Bellasi
>> <[email protected]> wrote:
>> > [...]
>> >
>> >> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> >> break;
>> >> }
>> >>
>> >> + /*
>> >> + * Energy-aware task placement is performed on the highest
>> >> + * non-overutilized domain spanning over cpu and prev_cpu.
>> >> + */
>> >> + if (want_energy && !sd_overutilized(tmp) &&
>> >> + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
>> >> + energy_sd = tmp;
>> >> +
>> >
>> > Not entirely sure, but I was trying to understand if we can avoid to
>> > modify the definition of want_affine (in the previous chunk) and move
>> > this block before the previous "if (want_affine..." (in mainline but
>> > not in this chunk), which will became an else, e.g.
>> >
>> > if (want_energy && !sd_overutilized(tmp) &&
>> > // ...
>> > else if (want_energy && !sd_overutilized(tmp) &&
>> > // ...
>> >
>> > Isn't that the same?
>> >
>> > Maybe there is a code path I'm missing... but otherwise it seems a
>> > more self contained modification of select_task_rq_fair...
>>
>> Just replying to this here Patrick instead of the other thread.
>>
>> I think this is the right place for the block from Quentin quoted
>> above because we want to search for the highest domain that is
>> !overutilized and look among those for the candidates. So from that
>> perspective, we can't move the block to the beginning and it seems to
>> be in the right place. My main concern on the other thread was
>> different, I was talking about the cases where sd_flag & tmp->flags
>> don't match. In that case, sd = NULL would trump EAS and I was
>> wondering if that's the right thing to do...
>
> You mean if SD_BALANCE_WAKE isn't set on sched_domains?

Yes.

> The current code seems to rely on that flag to be set to work correctly.
> Otherwise, the loop might bail out on !want_affine and we end up doing
> the find_energy_efficient_cpu() on the lowest level sched_domain even if
> there is a higher level one which isn't over-utilized.
>
> However, SD_BALANCE_WAKE should be set if SD_ASYM_CPUCAPACITY is set so
> sd == NULL shouldn't be possible? This only holds as long as we only
> want EAS for asymmetric systems.

Yes, I see you had topology code that set SD_BALANCE_WAKE for ASYM. It
makes sense to me then, thanks for the clarification.

Still, I feel it is a bit tedious/confusing when reading the code to
work out why sd is checked first before doing
find_energy_efficient_cpu() (and that sd will be != NULL for ASYM
systems). If energy_sd is set, then we can just proceed with EAS
without checking that sd != NULL. This function in mainline is already
pretty confusing as it is :-(

Regards,

- Joel

2018-03-24 01:23:34

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up

On Friday 23 Mar 2018 at 15:47:45 (+0000), Morten Rasmussen wrote:
> On Thu, Mar 22, 2018 at 01:10:22PM -0700, Joel Fernandes wrote:
> > On Wed, Mar 21, 2018 at 8:35 AM, Patrick Bellasi
> > <[email protected]> wrote:
> > > [...]
> > >
> > >> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> > >> break;
> > >> }
> > >>
> > >> + /*
> > >> + * Energy-aware task placement is performed on the highest
> > >> + * non-overutilized domain spanning over cpu and prev_cpu.
> > >> + */
> > >> + if (want_energy && !sd_overutilized(tmp) &&
> > >> + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> > >> + energy_sd = tmp;
> > >> +
> > >
> > > Not entirely sure, but I was trying to understand if we can avoid to
> > > modify the definition of want_affine (in the previous chunk) and move
> > > this block before the previous "if (want_affine..." (in mainline but
> > > not in this chunk), which will became an else, e.g.
> > >
> > > if (want_energy && !sd_overutilized(tmp) &&
> > > // ...
> > > else if (want_energy && !sd_overutilized(tmp) &&
> > > // ...
> > >
> > > Isn't that the same?
> > >
> > > Maybe there is a code path I'm missing... but otherwise it seems a
> > > more self contained modification of select_task_rq_fair...
> >
> > Just replying to this here Patrick instead of the other thread.
> >
> > I think this is the right place for the block from Quentin quoted
> > above because we want to search for the highest domain that is
> > !overutilized and look among those for the candidates. So from that
> > perspective, we can't move the block to the beginning and it seems to
> > be in the right place. My main concern on the other thread was
> > different, I was talking about the cases where sd_flag & tmp->flags
> > don't match. In that case, sd = NULL would trump EAS and I was
> > wondering if that's the right thing to do...
>
> You mean if SD_BALANCE_WAKE isn't set on sched_domains?
>
> The current code seems to rely on that flag to be set to work correctly.
> Otherwise, the loop might bail out on !want_affine and we end up doing
> the find_energy_efficient_cpu() on the lowest level sched_domain even if
> there is a higher level one which isn't over-utilized.
>
> However, SD_BALANCE_WAKE should be set if SD_ASYM_CPUCAPACITY is set so
> sd == NULL shouldn't be possible? This only holds as long as we only
> want EAS for asymmetric systems.

That's correct, we are under the assumption that the SD_ASYM_CPUCAPACITY
flag is set somewhere in the hierarchy here. If a sched domain has this
flag set, SD_BALANCE_WAKE is propagated to all lower sched domains
(see sd_init() in kernel/sched/topology.c) so we should be fine.
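
For reference, the propagation in sd_init() looks roughly like this
(paraphrased from kernel/sched/topology.c, see the actual code for the
exact context):

	if (sd->flags & SD_ASYM_CPUCAPACITY) {
		struct sched_domain *t = sd;

		/* Turn on SD_BALANCE_WAKE for all lower domains. */
		for_each_lower_domain(t)
			t->flags |= SD_BALANCE_WAKE;
	}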

2018-03-24 01:35:42

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up

On Friday 23 Mar 2018 at 18:13:56 (-0700), Joel Fernandes wrote:
> Hi Morten,
>
> On Fri, Mar 23, 2018 at 8:47 AM, Morten Rasmussen
> <[email protected]> wrote:
> > On Thu, Mar 22, 2018 at 01:10:22PM -0700, Joel Fernandes wrote:

[...]

> > You mean if SD_BALANCE_WAKE isn't set on sched_domains?
>
> Yes.
>
> > The current code seems to rely on that flag to be set to work correctly.
> > Otherwise, the loop might bail out on !want_affine and we end up doing
> > the find_energy_efficient_cpu() on the lowest level sched_domain even if
> > there is a higher level one which isn't over-utilized.
> >
> > However, SD_BALANCE_WAKE should be set if SD_ASYM_CPUCAPACITY is set so
> > sd == NULL shouldn't be possible? This only holds as long as we only
> > want EAS for asymmetric systems.
>
> Yes, I see you had topology code that set SD_BALANCE_WAKE for ASYM. It
> makes sense to me then, thanks for the clarification.
>
> Still, I feel it is a bit tedious/confusing when reading the code to
> work out why sd is checked first before doing
> find_energy_efficient_cpu() (and that sd will be != NULL for ASYM
> systems). If energy_sd is set, then we can just proceed with EAS
> without checking that sd != NULL. This function in mainline is already
> pretty confusing as it is :-(

Right, I see your point. The code is correct as is, but I agree that
having the code structured as

if (energy_sd) {
new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
} else if (!sd) {
...

might be easier to understand and functionally equivalent. What do you
think ?

Thanks,
Quentin

2018-03-24 01:49:55

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up

On Thursday 22 Mar 2018 at 13:19:03 (-0700), Joel Fernandes wrote:
> On Thu, Mar 22, 2018 at 11:06 AM, Patrick Bellasi
> <[email protected]> wrote:

[...]

> >> > @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >> > break;
> >> > }
> >> >
> >> > + /*
> >> > + * Energy-aware task placement is performed on the highest
> >> > + * non-overutilized domain spanning over cpu and prev_cpu.
> >> > + */
> >> > + if (want_energy && !sd_overutilized(tmp) &&
> >> > + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> >>
> >> Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here for tmp level?
> >
> > ... and this then should be covered by the previous check in
> > wake_energy(), which sets want_energy.
>
> Right, but in a scenario which probably doesn't exist today, where we
> have both SD_ASYM_CPUCAPACITY and !SD_ASYM_CPUCAPACITY domains in the
> hierarchy for which want_energy = 1, I was wondering whether it's more
> future-proof to check it and not make assumptions...

So we can definitely have cases where SD_ASYM_CPUCAPACITY is not set at all
sd levels. Today, on mobile systems, this flag is typically set only at DIE
level for big.LITTLE platforms, and not at MC level.
We enable EAS if we find _at least_ one domain that has this flag in the
hierarchy, just to make sure we don't enable EAS for symmetric platform.
It's just a way to check a property about the topology when EAS starts, not
really a way to actually select the sd at which we do scheduling at
runtime.
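
Something like the following (illustrative only, hypothetical helper
name) is what that init-time property check boils down to:

static bool cpu_has_asym_domain(int cpu)
{
	struct sched_domain *sd;
	bool asym = false;

	rcu_read_lock();
	for_each_domain(cpu, sd) {
		if (sd->flags & SD_ASYM_CPUCAPACITY) {
			asym = true;
			break;
		}
	}
	rcu_read_unlock();

	return asym;
}

EAS is enabled only if this returns true for at least one CPU; it says
nothing about which sd level the energy-aware placement runs at.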

I hope that helps !

Thanks,
Quentin


2018-03-24 06:08:13

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up



On March 23, 2018 6:34:22 PM PDT, Quentin Perret <[email protected]> wrote:
>On Friday 23 Mar 2018 at 18:13:56 (-0700), Joel Fernandes wrote:
>> Hi Morten,
>>
>> On Fri, Mar 23, 2018 at 8:47 AM, Morten Rasmussen
>> <[email protected]> wrote:
>> > On Thu, Mar 22, 2018 at 01:10:22PM -0700, Joel Fernandes wrote:
>
>[...]
>
>> > You mean if SD_BALANCE_WAKE isn't set on sched_domains?
>>
>> Yes.
>>
>> > The current code seems to rely on that flag to be set to work correctly.
>> > Otherwise, the loop might bail out on !want_affine and we end up doing
>> > the find_energy_efficient_cpu() on the lowest level sched_domain even if
>> > there is a higher level one which isn't over-utilized.
>> >
>> > However, SD_BALANCE_WAKE should be set if SD_ASYM_CPUCAPACITY is set so
>> > sd == NULL shouldn't be possible? This only holds as long as we only
>> > want EAS for asymmetric systems.
>>
>> Yes, I see you had topology code that set SD_BALANCE_WAKE for ASYM. It
>> makes sense to me then, thanks for the clarification.
>>
>> Still, I feel it is a bit tedious/confusing when reading the code to
>> work out why sd is checked first before doing
>> find_energy_efficient_cpu() (and that sd will be != NULL for ASYM
>> systems). If energy_sd is set, then we can just proceed with EAS
>> without checking that sd != NULL. This function in mainline is already
>> pretty confusing as it is :-(
>
>Right, I see your point. The code is correct as is, but I agree that
>having the code structured as
>
> if (energy_sd) {
> new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
> } else if (!sd) {
> ...
>
>might be easier to understand and functionally equivalent. What do you
>think ?

Yeah definitely. Go for it.

- Joel


--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2018-03-25 00:13:33

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up

On Fri, Mar 23, 2018 at 6:47 PM, Quentin Perret <[email protected]> wrote:
> On Thursday 22 Mar 2018 at 13:19:03 (-0700), Joel Fernandes wrote:
>> On Thu, Mar 22, 2018 at 11:06 AM, Patrick Bellasi
>> <[email protected]> wrote:
>
> [...]
>
>> >> > @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> >> > break;
>> >> > }
>> >> >
>> >> > + /*
>> >> > + * Energy-aware task placement is performed on the highest
>> >> > + * non-overutilized domain spanning over cpu and prev_cpu.
>> >> > + */
>> >> > + if (want_energy && !sd_overutilized(tmp) &&
>> >> > + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
>> >>
>> >> Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here for tmp level?
>> >
>> > ... and this then should be covered by the previous check in
>> > wake_energy(), which sets want_energy.
>>
>> Right, but in a scenario which probably doesn't exist today, where we
>> have both SD_ASYM_CPUCAPACITY and !SD_ASYM_CPUCAPACITY domains in the
>> hierarchy for which want_energy = 1, I was wondering whether it's more
>> future-proof to check it and not make assumptions...
>
> So we can definitely have cases where SD_ASYM_CPUCAPACITY is not set at all
> sd levels. Today, on mobile systems, this flag is typically set only at DIE
> level for big.LITTLE platforms, and not at MC level.
> We enable EAS if we find _at least_ one domain that has this flag in the
> hierarchy, just to make sure we don't enable EAS for symmetric platform.
> It's just a way to check a property about the topology when EAS starts, not
> really a way to actually select the sd at which we do scheduling at
> runtime.

Yes, OK, you're right, we do have the ASYM flag set at some sd levels
but not others at the moment. Sorry about the hasty comment. I
understand what you're doing now, and I am OK with that.

thanks,

- Joel

2018-03-25 01:40:11

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up

On Friday 23 Mar 2018 at 16:00:59 (+0000), Morten Rasmussen wrote:
> On Thu, Mar 22, 2018 at 09:27:43AM -0700, Joel Fernandes wrote:
> > Hi,
> >
> > On Tue, Mar 20, 2018 at 2:43 AM, Dietmar Eggemann
> > <[email protected]> wrote:

[...]

> > Is it possible that before the wakeup, the task's affinity is changed
> > so that p->cpus_allowed no longer contains prev_cpu ? In that case
> > prev_energy wouldn't matter since previous CPU is no longer an option?
>
> It is possible to wake-up with a disallowed prev_cpu. In fact
> select_idle_sibling() may happily return a disallowed cpu in that case.
> The mistake gets fixed in select_task_rq() which uses
> select_fallback_rq() to find an allowed cpu instead.
>
> Could we fix the issue in find_energy_efficient_cpu() by a simple test
> like below
>
> if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> prev_energy = best_energy = compute_energy(p, prev_cpu);
> else
> prev_energy = best_energy = ULONG_MAX;

Right, that should work. I'll change this in v2.

Thanks,
Quentin

2018-03-25 01:54:44

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] sched/fair: Select an energy-efficient CPU on task wake-up

On Wednesday 21 Mar 2018 at 15:35:18 (+0000), Patrick Bellasi wrote:
> On 20-Mar 09:43, Dietmar Eggemann wrote:
> > From: Quentin Perret <[email protected]>
>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 76bd46502486..65a1bead0773 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> > return energy;
> > }
> >
> > +static bool task_fits(struct task_struct *p, int cpu)
> > +{
> > + unsigned long next_util = cpu_util_next(cpu, p, cpu);
> > +
> > + return util_fits_capacity(next_util, capacity_orig_of(cpu));
> ^^^^^^^^^^^^^^^^^^^^^
>
> Since we are scheduling CFS tasks here, shouldn't we rather use
> capacity_of() to account for RT/IRQ pressure?

Yes, definitely. I'll change this in v2.

>
> > +}
> > +
> > +static int find_energy_efficient_cpu(struct sched_domain *sd,
> > + struct task_struct *p, int prev_cpu)
> > +{
> > + unsigned long cur_energy, prev_energy, best_energy;
> > + int cpu, best_cpu = prev_cpu;
> > +
> > + if (!task_util(p))
>
> We are still waking up a task... what if the task was previously
> running on a big CPU which is now idle?
>
> I understand that from a _relative_ energy_diff standpoint there is
> not much to do for a 0 utilization task. However, for those tasks we
> can still try to return the most energy efficient CPU among the ones
> in their cpus_allowed mask.
>
> It should be a relatively low overhead (maybe contained in a fallback
> most_energy_efficient_cpu() kind of function) which allows, for
> example on ARM big.LITTLE systems, to consolidate those tasks on
> LITTLE CPUs instead of, for example, keeping them running on a big CPU.

Hmmmm, so the difficult thing about a task with 0 util is that you don't
know if this is really a small task, or a big task with a very long
period. The only useful thing you know for sure about the task is where
it ran last time, so I guess it makes sense to use that information
rather than making assumptions. There is no perfect solution using the
util_avg of the task.

Now, UTIL_EST is changing the game here. If we use it for task placement
(which I think is the right thing to do), this issue should be a lot
easier to solve. What do you think ?
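
For illustration, a simplified version of the helper from the UTIL_EST
series (assumed name and shape, not part of this patchset) would be:

static inline unsigned long task_util_est(struct task_struct *p)
{
	struct util_est ue = READ_ONCE(p->se.avg.util_est);

	/* Take the max of the PELT signal and the estimated utilization. */
	return max(task_util(p), (unsigned long)ue.enqueued);
}

The !task_util(p) bail-out above would then become !task_util_est(p).
Since util_est is sampled at dequeue and doesn't decay while the task
sleeps, a big task with a long period would keep a meaningful value.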

Thanks,
Quentin

2018-03-25 13:49:46

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] sched: Introduce energy models of CPUs

On Tuesday 20 Mar 2018 at 10:52:15 (+0100), Greg Kroah-Hartman wrote:
> On Tue, Mar 20, 2018 at 09:43:08AM +0000, Dietmar Eggemann wrote:
> > From: Quentin Perret <[email protected]>

[...]

> > +#ifdef CONFIG_PM_OPP
>
> #ifdefs go in .h files, not .c files, right?
>

So, after looking into this, my suggestion would be to: 1) remove the
#ifdef CONFIG_PM_OPP from energy.c entirely; 2) make sure
init_sched_energy() is stubbed properly for !CONFIG_SMP and
!CONFIG_PM_OPP in include/linux/sched/energy.h; 3) relocate the global
variables (energy_model, freq_domains, ...) to fair.c; and 4) modify
kernel/sched/Makefile with something like:

ifeq ($(CONFIG_PM_OPP),y)
obj-$(CONFIG_SMP) += energy.o
endif

That way, energy.c is not compiled if not needed by the arch, and the
ifdefs are kept within header files and Makefiles.

Would that work ?

Thanks,
Quentin

2018-03-26 22:28:30

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] sched: Introduce energy models of CPUs

On 03/25/2018 03:48 PM, Quentin Perret wrote:
> On Tuesday 20 Mar 2018 at 10:52:15 (+0100), Greg Kroah-Hartman wrote:
>> On Tue, Mar 20, 2018 at 09:43:08AM +0000, Dietmar Eggemann wrote:
>>> From: Quentin Perret <[email protected]>
>
> [...]
>
>>> +#ifdef CONFIG_PM_OPP
>>
>> #ifdefs go in .h files, not .c files, right?
>>
>
> So, after looking into this, my suggestion would be to: 1) remove the
> #ifdef CONFIG_PM_OPP from energy.c entirely; 2) make sure
> init_sched_energy() is stubbed properly for !CONFIG_SMP and
> !CONFIG_PM_OPP in include/linux/sched/energy.h; 3) relocate the global
> variables (energy_model, freq_domains, ...) to fair.c; and 4) modify
> kernel/sched/Makefile with something like:
>
> ifeq ($(CONFIG_PM_OPP),y)
> obj-$(CONFIG_SMP) += energy.o
> endif
>
> That way, energy.c is not compiled if not needed by the arch, and the
> ifdefs are kept within header files and Makefiles.
>
> Would that work ?

Could we extend this idea a little bit further and leave the global
variables in energy.c? Ideally, all energy interfaces would be
declared or stubbed in energy.h.

We could hide the access to energy_model, sched_energy_present and
freq_domains behind functions.

It would be nice to also provide find_energy_efficient_cpu() in
energy.c, but it itself uses a lot of fair.c stuff, and we do want to
avoid function calls in the wakeup path, right?

Boot-tested on juno r0 (CONFIG_SMP=y and CONFIG_PM_OPP=y). Build-tested on
i386 (CONFIG_SMP and CONFIG_PM_OPP not set), arm multi_v5_defconfig (CONFIG_SMP
not set and CONFIG_PM_OPP=y).

in energy.h:

#ifdef CONFIG_SMP
#ifdef CONFIG_PM_OPP
static inline struct capacity_state *find_cap_state(int cpu, unsigned long util)
{
...
}
static inline bool sched_energy_enabled(void)
{
...
}
static inline struct list_head *get_freq_domains(void)
{
...
}
#endif
#endif

#if !defined(CONFIG_SMP) || !defined(CONFIG_PM_OPP)
static inline bool sched_energy_enabled(void) { return false; }
static inline struct list_head *get_freq_domains(void) { return NULL; }
static inline struct capacity_state *
find_cap_state(int cpu, unsigned long util) { return NULL; }
static inline void init_sched_energy(void) { }
#endif

#define for_each_freq_domain(fdom) \
list_for_each_entry(fdom, get_freq_domains(), next)

--->8---

diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
index b4f43564ffe4..5aad03ec5b30 100644
--- a/include/linux/sched/energy.h
+++ b/include/linux/sched/energy.h
@@ -20,12 +20,49 @@ struct freq_domain {
extern struct sched_energy_model ** __percpu energy_model;
extern struct static_key_false sched_energy_present;
extern struct list_head freq_domains;
-#define for_each_freq_domain(fdom) \
- list_for_each_entry(fdom, &freq_domains, next)

-void init_sched_energy(void);
-#else
-static inline void init_sched_energy(void) { }
+#ifdef CONFIG_PM_OPP
+static inline bool sched_energy_enabled(void)
+{
+ return static_branch_unlikely(&sched_energy_present);
+}
+
+static inline struct list_head *get_freq_domains(void)
+{
+ return &freq_domains;
+}
+
+static inline
+struct capacity_state *find_cap_state(int cpu, unsigned long util)
+{
+ struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
+ struct capacity_state *cs = NULL;
+ int i;
+
+ util += util >> 2;
+
+ for (i = 0; i < em->nr_cap_states; i++) {
+ cs = &em->cap_states[i];
+ if (cs->cap >= util)
+ break;
+ }
+
+ return cs;
+}
+
+extern void init_sched_energy(void);
#endif
+#endif /* CONFIG_SMP */

+#if !defined(CONFIG_SMP) || !defined(CONFIG_PM_OPP)
+static inline bool sched_energy_enabled(void) { return false; }
+static inline struct list_head *get_freq_domains(void) { return NULL; }
+static inline struct capacity_state *
+find_cap_state(int cpu, unsigned long util) { return NULL; }
+static inline void init_sched_energy(void) { }
#endif
+
+#define for_each_freq_domain(fdom) \
+ list_for_each_entry(fdom, get_freq_domains(), next)
+
+#endif /* _LINUX_SCHED_ENERGY_H */
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 912972ad4dbc..e34bec3ae353 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
obj-y += idle.o fair.o rt.o deadline.o
obj-y += wait.o wait_bit.o swait.o completion.o

-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o energy.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o
obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
@@ -29,3 +29,6 @@ obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_CPU_ISOLATION) += isolation.o
+ifeq ($(CONFIG_PM_OPP),y)
+ obj-$(CONFIG_SMP) += energy.o
+endif
diff --git a/kernel/sched/energy.c b/kernel/sched/energy.c
index 4662c993e096..1de8226943b9 100644
--- a/kernel/sched/energy.c
+++ b/kernel/sched/energy.c
@@ -30,7 +30,6 @@ struct sched_energy_model ** __percpu energy_model;
*/
LIST_HEAD(freq_domains);

-#ifdef CONFIG_PM_OPP
static struct sched_energy_model *build_energy_model(int cpu)
{
unsigned long cap_scale = arch_scale_cpu_capacity(NULL, cpu);
@@ -185,6 +184,3 @@ void init_sched_energy(void)
exit_fail:
pr_err("Energy Aware Scheduling initialization failed.\n");
}
-#else
-void init_sched_energy(void) {}
-#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 65a1bead0773..b3e6a2656b68 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6456,27 +6456,6 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
return !util_fits_capacity(task_util(p), min_cap);
}

-static struct capacity_state *find_cap_state(int cpu, unsigned long util)
-{
- struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
- struct capacity_state *cs = NULL;
- int i;
-
- /*
- * As the goal is to estimate the OPP reached for a specific util
- * value, mimic the behaviour of schedutil with a 1.25 coefficient
- */
- util += util >> 2;
-
- for (i = 0; i < em->nr_cap_states; i++) {
- cs = &em->cap_states[i];
- if (cs->cap >= util)
- break;
- }
-
- return cs;
-}
-
static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
{
unsigned long util, fdom_max_util;
@@ -6557,7 +6536,7 @@ static inline bool wake_energy(struct task_struct *p, int prev_cpu)
{
struct sched_domain *sd;

- if (!static_branch_unlikely(&sched_energy_present))
+ if (!sched_energy_enabled())
return false;

sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd);
@@ -9252,8 +9231,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
}
max_cost += sd->max_newidle_lb_cost;

- if (static_branch_unlikely(&sched_energy_present) &&
- !sd_overutilized(sd))
+ if (sched_energy_enabled() && !sd_overutilized(sd))
continue;

if (!(sd->flags & SD_LOAD_BALANCE))
@@ -9823,8 +9801,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
break;
}

- if (static_branch_unlikely(&sched_energy_present) &&
- !sd_overutilized(sd))
+ if (sched_energy_enabled() && !sd_overutilized(sd))
continue;

if (sd->flags & SD_BALANCE_NEWIDLE) {



2018-04-09 09:43:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] sched: Add over-utilization/tipping point indicator


(I know there is a new version out; but I was reading through this to
catch up with the discussion)

On Tue, Mar 20, 2018 at 09:43:09AM +0000, Dietmar Eggemann wrote:
> +static inline int sd_overutilized(struct sched_domain *sd)
> +{
> + return READ_ONCE(sd->shared->overutilized);
> +}
> +
> +static inline void update_overutilized_status(struct rq *rq)
> +{
> + struct sched_domain *sd;
> +
> + rcu_read_lock();
> + sd = rcu_dereference(rq->sd);
> + if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
> + WRITE_ONCE(sd->shared->overutilized, 1);
> + rcu_read_unlock();
> +}
> +#else

I think you ought to go have a look at the end of
kernel/sched/topology.c:sd_init(), where it says:

/*
* For all levels sharing cache; connect a sched_domain_shared
* instance.
*/
if (sd->flags & SD_SHARE_PKG_RESOURCES) {
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
}

Because if I read all this correctly, your code assumes sd->shared
exists unconditionally, while the quoted bit only ensures it does so <=
LLC.

2018-04-09 09:53:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] sched: Add over-utilization/tipping point indicator

On Mon, Apr 09, 2018 at 11:40:01AM +0200, Peter Zijlstra wrote:
>
> (I know there is a new version out; but I was reading through this to
> catch up with the discussion)
>
> On Tue, Mar 20, 2018 at 09:43:09AM +0000, Dietmar Eggemann wrote:
> > +static inline int sd_overutilized(struct sched_domain *sd)
> > +{
> > + return READ_ONCE(sd->shared->overutilized);
> > +}
> > +
> > +static inline void update_overutilized_status(struct rq *rq)
> > +{
> > + struct sched_domain *sd;
> > +
> > + rcu_read_lock();
> > + sd = rcu_dereference(rq->sd);
> > + if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
> > + WRITE_ONCE(sd->shared->overutilized, 1);
> > + rcu_read_unlock();
> > +}
> > +#else
>
> I think you ought to go have a look at the end of
> kernel/sched/topology.c:sd_init(), where it says:
>
> /*
> * For all levels sharing cache; connect a sched_domain_shared
> * instance.
> */
> if (sd->flags & SD_SHARE_PKG_RESOURCES) {
> sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> atomic_inc(&sd->shared->ref);
> atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> }
>
> Because if I read all this correctly, your code assumes sd->shared
> exists unconditionally, while the quoted bit only ensures it does so <=
> LLC.

Argh, n/m, I should read the whole patch before commenting I suppose ;-)

2018-04-09 09:56:25

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] sched: Add over-utilization/tipping point indicator

On 04/09/2018 11:40 AM, Peter Zijlstra wrote:
>
> (I know there is a new version out; but I was reading through this to
> catch up with the discussion)
>
> On Tue, Mar 20, 2018 at 09:43:09AM +0000, Dietmar Eggemann wrote:
>> +static inline int sd_overutilized(struct sched_domain *sd)
>> +{
>> + return READ_ONCE(sd->shared->overutilized);
>> +}
>> +
>> +static inline void update_overutilized_status(struct rq *rq)
>> +{
>> + struct sched_domain *sd;
>> +
>> + rcu_read_lock();
>> + sd = rcu_dereference(rq->sd);
>> + if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
>> + WRITE_ONCE(sd->shared->overutilized, 1);
>> + rcu_read_unlock();
>> +}
>> +#else
>
> I think you ought to go have a look at the end of
> kernel/sched/topology.c:sd_init(), where it says:
>
> /*
> * For all levels sharing cache; connect a sched_domain_shared
> * instance.
> */
> if (sd->flags & SD_SHARE_PKG_RESOURCES) {
> sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> atomic_inc(&sd->shared->ref);
> atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> }
>
> Because if I read all this correctly, your code assumes sd->shared
> exists unconditionally, while the quoted bit only ensures it does so <=
> LLC.
>

But the patch changes this part further down.

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 64cc564f5255..c8b7c7665ab2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1184,15 +1184,11 @@ sd_init(struct sched_domain_topology_level *tl,
sd->idle_idx = 1;
}

- /*
- * For all levels sharing cache; connect a sched_domain_shared
- * instance.
- */
- if (sd->flags & SD_SHARE_PKG_RESOURCES) {
- sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
- atomic_inc(&sd->shared->ref);
+ sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+ atomic_inc(&sd->shared->ref);
+
+ if (sd->flags & SD_SHARE_PKG_RESOURCES)
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
- }

2018-04-09 11:53:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] sched: Add over-utilization/tipping point indicator

On Mon, Apr 09, 2018 at 11:53:00AM +0200, Dietmar Eggemann wrote:
> But the patch changes this part further down.

Yes.. let's call it Monday morning sickness :-)


2018-04-09 12:08:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] sched: Introduce energy models of CPUs

On Tue, Mar 20, 2018 at 09:43:08AM +0000, Dietmar Eggemann wrote:
> From: Quentin Perret <[email protected]>
>
> The energy consumption of each CPU in the system is modeled with a list
> of values representing its dissipated power and compute capacity at each
> available Operating Performance Point (OPP). These values are derived
> from existing information in the kernel (currently used by the thermal
> subsystem) and don't require the introduction of new platform-specific
> tunables. The energy model is also provided with a simple representation
> of all frequency domains as cpumasks, hence enabling the scheduler to be
> aware of dependencies between CPUs. The data required to build the energy
> model is provided by the OPP library which enables an abstract view of
> the platform from the scheduler. The new data structures holding these
> models and the routines to populate them are stored in
> kernel/sched/energy.c.
>
> For the sake of simplicity, it is assumed in the energy model that all
> CPUs in a frequency domain share the same micro-architecture. As long as
> this assumption is correct, the energy models of different CPUs belonging
> to the same frequency domain are equal. Hence, this commit builds only one
> energy model per frequency domain, and links all relevant CPUs to it in
> order to save time and memory. If needed for future hardware platforms,
> relaxing this assumption should imply relatively simple modifications in
> the code but a significantly higher algorithmic complexity.

What this doesn't mention is why this isn't part of the regular topology
bits. IIRC this is because the frequency domains don't necessarily need
to align with the existing topology, but this completely fails to state
any of that.

Also, since I'm not at all familiar with DT and the OPP library stuff,
this code is completely unreadable to me and there isn't a nice comment
to help me along.

2018-04-09 13:49:52

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] sched: Introduce energy models of CPUs

On Monday 09 Apr 2018 at 14:01:11 (+0200), Peter Zijlstra wrote:
> On Tue, Mar 20, 2018 at 09:43:08AM +0000, Dietmar Eggemann wrote:
> > From: Quentin Perret <[email protected]>
> >
> > The energy consumption of each CPU in the system is modeled with a list
> > of values representing its dissipated power and compute capacity at each
> > available Operating Performance Point (OPP). These values are derived
> > from existing information in the kernel (currently used by the thermal
> > subsystem) and don't require the introduction of new platform-specific
> > tunables. The energy model is also provided with a simple representation
> > of all frequency domains as cpumasks, hence enabling the scheduler to be
> > aware of dependencies between CPUs. The data required to build the energy
> > model is provided by the OPP library which enables an abstract view of
> > the platform from the scheduler. The new data structures holding these
> > models and the routines to populate them are stored in
> > kernel/sched/energy.c.
> >
> > For the sake of simplicity, it is assumed in the energy model that all
> > CPUs in a frequency domain share the same micro-architecture. As long as
> > this assumption is correct, the energy models of different CPUs belonging
> > to the same frequency domain are equal. Hence, this commit builds only one
> > energy model per frequency domain, and links all relevant CPUs to it in
> > order to save time and memory. If needed for future hardware platforms,
> > relaxing this assumption should imply relatively simple modifications in
> > the code but a significantly higher algorithmic complexity.
>
> What this doesn't mention is why this isn't part of the regular topology
> bits. IIRC this is because the frequency domains don't necessarily need
> to align with the existing topology, but this completely fails to state
> any of that.

Yes that's the main reason. Frequency domains and scheduling domains don't
necessarily align. That used to be the case for big.LITTLE platforms, but
not anymore with DynamIQ ...

>
> Also, since I'm not at all familiar with DT and the OPP library stuff,
> this code is completely unreadable to me and there isn't a nice comment
> to help me along.

Right, so I can definitely fix that. Comments in the code and a better
commit message should help hopefully. And also, it has already been
suggested that a documentation file should be added alongside the code
for this patchset, so I'll make sure we add that for the next version.
In the meantime, here is a (hopefully) better explanation below.

In this specific patch, we are basically trying to figure out the
boundaries of frequency domains, and the power consumed by each CPU
at each OPP, to make them available to the scheduler. The important
thing here is that, in both cases, we rely on the OPP library to
keep the code as platform-agnostic as possible.

In the case of the frequency domains for example, the cpufreq driver is
in charge of specifying the CPUs that are sharing frequencies. That
information can come from DT, or SCPI, or SCMI, or whatever -- we
probably shouldn't have to care about that from the scheduler's
standpoint. That's why using dev_pm_opp_get_sharing_cpus() is handy:
the OPP library gives us the digested information we need.
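
For illustration, the lookup boils down to something like this (a rough
sketch, error handling mostly omitted; get_cpu_device() hands us the
struct device of a CPU):

        struct cpumask span;
        struct device *cpu_dev = get_cpu_device(cpu);

        /* Ask the OPP library which CPUs share their clock with "cpu" */
        if (dev_pm_opp_get_sharing_cpus(cpu_dev, &span))
                return; /* no sharing information available */

        /* "span" now describes the frequency domain containing "cpu" */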

The power values (dev_pm_opp_get_power) we use right now are those
already used by the thermal subsystem (IPA), which means we don't have
to introduce any new DT binding whatsoever. In the near future, the power
values could also come from other sources (SCMI for example), and again it's
probably not the scheduler's job to care about those things, so the OPP
library is helping us again. As mentioned in the notes, this approach
currently depends on other patches relating to these things, which are
already on the list [1].

The rest of the code in this patch just iterates over the CPUs, frequency
domains and OPPs. The algorithm is more or less the following (see the
sketch after the list):

1. find a frequency domain which hasn't been visited yet;
2. estimate the power and capacity of a CPU in this freq domain at each
possible OPP;
3. map all CPUs in the freq domain to this list of <capacity, power> tuples;
4. go to 1.
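
To make the steps above more concrete, here is a rough sketch of that
loop. Bear in mind this is illustrative only: em_cap_state is a made-up
stand-in for the data structures introduced by this patch,
dev_pm_opp_get_power() comes from the dependent patches in [1], and
error handling is mostly omitted:

        struct em_cap_state {
                unsigned long cap;      /* compute capacity */
                unsigned long power;    /* active power (mW) */
        };

        static void build_energy_models(void)
        {
                cpumask_var_t visited, span;
                int cpu, i;

                if (!zalloc_cpumask_var(&visited, GFP_KERNEL) ||
                    !zalloc_cpumask_var(&span, GFP_KERNEL))
                        return;

                for_each_possible_cpu(cpu) {
                        struct device *dev = get_cpu_device(cpu);
                        unsigned long freq = 0, max_freq;
                        struct em_cap_state *table;
                        int nr_opps;

                        /* 1. Skip freq domains which have already been visited */
                        if (cpumask_test_cpu(cpu, visited))
                                continue;

                        /* Span of the freq domain, as seen by the OPP library */
                        if (dev_pm_opp_get_sharing_cpus(dev, span))
                                continue;

                        /* 2. One <capacity, power> tuple per OPP of one CPU */
                        nr_opps = dev_pm_opp_get_opp_count(dev);
                        table = kcalloc(nr_opps, sizeof(*table), GFP_KERNEL);

                        for (i = 0; i < nr_opps; i++, freq++) {
                                struct dev_pm_opp *opp;

                                opp = dev_pm_opp_find_freq_ceil(dev, &freq);
                                table[i].power = dev_pm_opp_get_power(opp);
                                table[i].cap = freq;
                                dev_pm_opp_put(opp);
                        }

                        /*
                         * Turn frequencies into capacities. A real implementation
                         * would also scale by the max capacity of the CPU type.
                         */
                        max_freq = table[nr_opps - 1].cap;
                        for (i = 0; i < nr_opps; i++)
                                table[i].cap = table[i].cap *
                                               SCHED_CAPACITY_SCALE / max_freq;

                        /* 3. Map all CPUs of "span" to "table" (storage elided) */

                        /* 4. Mark the whole freq domain as visited */
                        cpumask_or(visited, visited, span);
                }

                free_cpumask_var(span);
                free_cpumask_var(visited);
        }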

I hope that makes sense.

Thanks,
Quentin

[1] https://marc.info/?l=linux-pm&m=151635516419249&w=2

2018-04-09 15:36:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] sched: Introduce energy models of CPUs

On Mon, Apr 09, 2018 at 02:45:11PM +0100, Quentin Perret wrote:

> In this specific patch, we are basically trying to figure out the
> boundaries of frequency domains, and the power consumed by each CPU
> at each OPP, to make them available to the scheduler. The important
> thing here is that, in both cases, we rely on the OPP library to
> keep the code as platform-agnostic as possible.

AFAICT the only users of this PM_OPP stuff are a bunch of ARM platforms.
Granted, nobody else has built a big.LITTLE style system, so that might
all be fine I suppose.

It won't be until some !ARM chip comes along that we'll know how
generically usable any of this really is.

> In the case of the frequency domains for example, the cpufreq driver is
> in charge of specifying the CPUs that are sharing frequencies. That
> information can come from DT, or SCPI, or SCMI, or whatever -- we
> probably shouldn't have to care about that from the scheduler's
> standpoint. That's why using dev_pm_opp_get_sharing_cpus() is handy:
> the OPP library gives us the digested information we need.

So I kinda would've expected to just ask cpufreq, which after all already
knows these things. Why did we need to invent this pm_opp thing?

Cpufreq has tons of supported architectures, pm_opp not so much.

> The power values (dev_pm_opp_get_power) we use right now are those
> already used by the thermal subsystem (IPA), which means we don't have

I love an IPA style beer, but I'm thinking that's not the same IPA,
right :-)

> to introduce any new DT binding whatsoever. In the near future, the power
> values could also come from other sources (SCMI for example), and again it's
> probably not the scheduler's job to care about those things, so the OPP
> library is helping us again. As mentioned in the notes, this approach
> currently depends on other patches relating to these things, which are
> already on the list [1].

Is there any !ARM thermal driver? (clearly I'm not up-to-date on things
thermal).

2018-04-09 16:46:29

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] sched: Introduce energy models of CPUs

On Monday 09 Apr 2018 at 17:32:33 (+0200), Peter Zijlstra wrote:
> On Mon, Apr 09, 2018 at 02:45:11PM +0100, Quentin Perret wrote:
>
> > In this specific patch, we are basically trying to figure out the
> > boundaries of frequency domains, and the power consumed by each CPU
> > at each OPP, to make them available to the scheduler. The important
> > thing here is that, in both cases, we rely on the OPP library to
> > keep the code as platform-agnostic as possible.
>
> AFAICT the only users of this PM_OPP stuff are a bunch of ARM platforms.

That's correct.

> Granted, nobody else has built a big.LITTLE style system, so that might
> all be fine I suppose.
>
> It won't be until some !ARM chip comes along that we'll know how
> generically usable any of this really is.
>

Right. There is already a lot of diversity in the Arm ecosystem that has
to be managed. That's what I meant by platform-agnostic. Now, I agree
that it should be discussed whether or not this is enough for other
archs ...

It might be reasonable to expect archs that want to use EAS to expose
their OPPs in the OPP lib. That should be harmless, and EAS needs to know
about the OPPs, so they should be made visible, ideally somewhere generic.
Otherwise, the interface with EAS has to be defined only by the energy
model data structures, and the actual energy model loading procedure
becomes free-form arch code.

I quite like the first idea from a pure design standpoint, but I could
also understand if maintainers of other archs were reluctant to
have new dependencies on PM_OPP ...

> > In the case of the frequency domains for example, the cpufreq driver is
> > in charge of specifying the CPUs that are sharing frequencies. That
> > information can come from DT, or SCPI, or SCMI, or whatever -- we
> > probably shouldn't have to care about that from the scheduler's
> > standpoint. That's why using dev_pm_opp_get_sharing_cpus() is handy:
> > the OPP library gives us the digested information we need.
>
> So I kinda would've expected to just ask cpufreq, which after all already
> knows these things. Why did we need to invent this pm_opp thing?

Yes, we can definitely rely on cpufreq for this one. There is a "strong"
dependency on PM_OPP to get power values, so I decided to use PM_OPP for
the frequency domains as well, for consistency. But I can change that if
needed.
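
For reference, a sketch of what the cpufreq-based version could look
like (assuming "span" is a cpumask to be filled in):

        struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);

        if (policy) {
                /* All CPUs sharing a clock with "cpu", online or not */
                cpumask_copy(span, policy->related_cpus);
                cpufreq_cpu_put(policy);
        }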

>
> Cpufreq has tons of supported architectures, pm_opp not so much.
>
> > The power values (dev_pm_opp_get_power) we use right now are those
> > already used by the thermal subsystem (IPA), which means we don't have
>
> I love an IPA style beer, but I'm thinking that's not the same IPA,
> right :-)

Well, both can help to cool things down, in a way ... :-)

The IPA I'm talking about stands for Intelligent Power Allocator. It's a
thermal governor that uses a power model of the platform to allocate
power budgets to CPUs & GPUs using a control loop. The code is in
drivers/thermal/power_allocator.c if this is of interest.
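
For the curious, the allocation step, once stripped of the PID details,
boils down to a proportional split of a power budget. A simplified
sketch, loosely modelled on the divvy-up logic in that file (not the
actual code):

        /*
         * Grant each actor (e.g. CPU or GPU cooling device) a share of
         * the budget proportional to the power it requested. The real
         * governor also runs a PID loop to derive the budget from the
         * temperature error.
         */
        static void divvy_up(const u32 *req, u32 *granted, int num, u32 budget)
        {
                u64 total = 0;
                int i;

                for (i = 0; i < num; i++)
                        total += req[i];

                if (!total)
                        return;

                for (i = 0; i < num; i++)
                        granted[i] = div64_u64((u64)req[i] * budget, total);
        }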

>
> > to introduce any new DT binding whatsoever. In the near future, the power
> > values could also come from other sources (SCMI for example), and again it's
> > probably not the scheduler's job to care about those things, so the OPP
> > library is helping us again. As mentioned in the notes, this approach
> > currently depends on other patches relating to these things, which are
> > already on the list [1].
>
> Is there any !ARM thermal driver? (clearly I'm not up-to-date on things
> thermal).

I don't think so.

Thanks,
Quentin


2018-04-10 07:00:02

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] sched: Introduce energy models of CPUs

On Mon, Apr 9, 2018 at 6:42 PM, Quentin Perret <[email protected]> wrote:
> On Monday 09 Apr 2018 at 17:32:33 (+0200), Peter Zijlstra wrote:
>> On Mon, Apr 09, 2018 at 02:45:11PM +0100, Quentin Perret wrote:
>>
>> > In this specific patch, we are basically trying to figure out the
>> > boundaries of frequency domains, and the power consumed by each CPU
>> > at each OPP, to make them available to the scheduler. The important
>> > thing here is that, in both cases, we rely on the OPP library to
>> > keep the code as platform-agnostic as possible.
>>
>> AFAICT the only users of this PM_OPP stuff are a bunch of ARM platforms.
>
> That's correct.
>
>> Granted, nobody else has built a big.LITTLE style system, so that might
>> all be fine I suppose.
>>
>> It won't be until some !ARM chip comes along that we'll know how
>> generically usable any of this really is.
>>
>
> Right. There is already a lot of diversity in the Arm ecosystem that has
> to be managed. That's what I meant by platform-agnostic. Now, I agree
> that it should be discussed whether or not this is enough for other
> archs ...

Even for ARM64 w/ ACPI, mind you.

> It might be reasonable to expect archs that want to use EAS to expose
> their OPPs in the OPP lib. That should be harmless, and EAS needs to know
> about the OPPs, so they should be made visible, ideally somewhere generic.
> Otherwise, the interface with EAS has to be defined only by the energy
> model data structures, and the actual energy model loading procedure
> becomes free-form arch code.
>
> I quite like the first idea from a pure design standpoint, but I could
> also understand if maintainers of other archs were reluctant to
> have new dependencies on PM_OPP ...

Not just reluctant, I would think.

Depending on PM_OPP directly here is like depending on ACPI directly.
Would you agree with the latter?

>> > In the case of the frequency domains for example, the cpufreq driver is
>> > in charge of specifying the CPUs that are sharing frequencies. That
>> > information can come from DT, or SCPI, or SCMI, or whatever -- we
>> > probably shouldn't have to care about that from the scheduler's
>> > standpoint. That's why using dev_pm_opp_get_sharing_cpus() is handy:
>> > the OPP library gives us the digested information we need.
>>
>> So I kinda would've expected to just ask cpufreq, which after all already
>> knows these things. Why did we need to invent this pm_opp thing?
>
> Yes, we can definitely rely on cpufreq for this one. There is a "strong"
> dependency on PM_OPP to get power values, so I decided to use PM_OPP for
> the frequency domains as well, for consistency. But I can change that if
> needed.

Yes, please.

>>
>> Cpufreq has tons of supported architectures, pm_opp not so much.
>>
>> > The power values (dev_pm_opp_get_power) we use right now are those
>> > already used by the thermal subsystem (IPA), which means we don't have
>>
>> I love an IPA style beer, but I'm thinking that's not the same IPA,
>> right :-)
>
> Well, both can help to cool things down, in a way ... :-)
>
> The IPA I'm talking about stands for Intelligent Power Allocator. It's a
> thermal governor that uses a power model of the platform to allocate
> power budgets to CPUs & GPUs using a control loop. The code is in
> drivers/thermal/power_allocator.c if this is of interest.
>
>>
>> > to introduce any new DT binding whatsoever. In the near future, the power
>> > values could also come from other sources (SCMI for example), and again it's
>> > probably not the scheduler's job to care about those things, so the OPP
>> > library is helping us again. As mentioned in the notes, this approach
>> > currently depends on other patches relating to these things, which are
>> > already on the list [1].
>>
>> Is there any !ARM thermal driver? (clearly I'm not up-to-date on things
>> thermal).
>
> I don't think so.

No, there isn't, AFAICS.

Thanks!

2018-04-10 09:37:02

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] sched: Introduce energy models of CPUs

On Tuesday 10 Apr 2018 at 08:55:14 (+0200), Rafael J. Wysocki wrote:
> On Mon, Apr 9, 2018 at 6:42 PM, Quentin Perret <[email protected]> wrote:
> > On Monday 09 Apr 2018 at 17:32:33 (+0200), Peter Zijlstra wrote:
> >> On Mon, Apr 09, 2018 at 02:45:11PM +0100, Quentin Perret wrote:

[...]

> > I quite like the first idea from a pure design standpoint, but I could
> > also understand if maintainers of other archs were reluctant to
> > have new dependencies on PM_OPP ...
>
> Not just reluctant, I would think.
>
> Depending on PM_OPP directly here is like depending on ACPI directly.
> Would you agree with the latter?

Right, I see your point. I was suggesting to use PM_OPP only to make the
OPPs *visible*, nothing else. That doesn't mean all archs would have
to use dev_pm_opp_set_rate() or anything; they could just keep on doing
DVFS their own way. PM_OPP would just be a common way to make OPPs
visible outside of their subsystem, which should be harmless. The point
is to keep the energy model loading code common to all archs.
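
Concretely, making the OPPs visible could be as simple as a couple of
calls from platform code (the frequencies and voltages below are made
up):

        /* Two made-up OPPs: 500 MHz at 800 mV, 1 GHz at 1.1 V */
        dev_pm_opp_add(cpu_dev, 500000000, 800000);
        dev_pm_opp_add(cpu_dev, 1000000000, 1100000);

        /*
         * The OPPs are now discoverable through the OPP library
         * (dev_pm_opp_find_freq_ceil() and friends) without the arch
         * ever calling dev_pm_opp_set_rate() for actual DVFS.
         */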

Another solution would be to let the archs populate the energy model
data-structures themselves, and turn the current energy.c file into
arm/arm64-specific code, for example.
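
Roughly speaking, the scheduler would then only define the data
structures and a registration hook, and filling them in would become
arch code. A sketch of that idea (the names below are made up, not the
ones from this patch):

        struct em_cap_state {
                unsigned long cap;      /* compute capacity */
                unsigned long power;    /* active power (mW) */
        };

        struct em_freq_domain {
                struct cpumask span;    /* CPUs sharing a clock */
                int nr_cap_states;
                struct em_cap_state *cap_states;
        };

        /* Implemented once in the scheduler, called from arch boot code */
        int em_register_freq_domain(struct em_freq_domain *fd);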

Overall, I guess the question is whether or not PM_OPP is the right
interface for EAS across multiple archs ... That sounds like an
interesting discussion topic for OSPM next week, so thanks a lot for
raising this point!

Regards,
Quentin

2018-04-10 10:26:21

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] sched: Introduce energy models of CPUs

On Tue, Apr 10, 2018 at 11:31 AM, Quentin Perret <[email protected]> wrote:
> On Tuesday 10 Apr 2018 at 08:55:14 (+0200), Rafael J. Wysocki wrote:
>> On Mon, Apr 9, 2018 at 6:42 PM, Quentin Perret <[email protected]> wrote:
>> > On Monday 09 Apr 2018 at 17:32:33 (+0200), Peter Zijlstra wrote:
>> >> On Mon, Apr 09, 2018 at 02:45:11PM +0100, Quentin Perret wrote:
>
> [...]
>
>> > I quite like the first idea from a pure design standpoint, but I could
>> > also understand if maintainers of other archs were reluctant to
>> > have new dependencies on PM_OPP ...
>>
>> Not just reluctant, I would think.
>>
>> Depending on PM_OPP directly here is like depending on ACPI directly.
>> Would you agree with the latter?
>
> Right, I see your point. I was suggesting to use PM_OPP only to make the
> OPPs *visible*, nothing else. That doesn't mean all archs would have
> to use dev_pm_opp_set_rate() or anything; they could just keep on doing
> DVFS their own way. PM_OPP would just be a common way to make OPPs
> visible outside of their subsystem, which should be harmless. The point
> is to keep the energy model loading code common to all archs.
>
> Another solution would be to let the archs populate the energy model
> data-structures themselves, and turn the current energy.c file into
> arm/arm64-specific code, for example.
>
> Overall, I guess the question is whether or not PM_OPP is the right
> interface for EAS across multiple archs ... That sounds like an interesting
> discussion topic for OSPM next week,

I agree.

> so thanks a lot for raising this point!

And moreover, we already have cpufreq and cpuidle that use their own
representations of the same information, generally coming from lower
layers. They do that because they need to work with different
platforms that generally represent the low-level information
differently. I don't see why that principle doesn't apply to EAS.

Maybe there should be a common data structure to be used by them all,
but I'm quite confident that PM_OPP is not suitable for this purpose
in general.
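
For instance (a made-up sketch of that idea, not an existing API), such
a common representation could be as simple as one <frequency, power>
table per frequency domain, owned by neither PM_OPP nor the scheduler,
and filled in by whatever platform layer (DT, ACPI, SCMI, ...) knows
the numbers:

        struct perf_state {
                unsigned int freq;      /* kHz, as in cpufreq */
                unsigned int power;     /* mW */
        };

        struct perf_domain_info {
                struct cpumask cpus;    /* CPUs sharing a clock */
                int nr_states;
                struct perf_state *states;
        };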