1. Overview
The Energy Aware Scheduler (EAS), based on Morten Rasmussen's posting on
LKML [1], is currently part of the AOSP Common Kernel and runs on today's
smartphones with Arm's big.LITTLE CPUs. This series implements a new and
largely simplified version of EAS based on an Energy Model (EM) of the
platform with only costs for the active states of the CPUs.
Previous versions of this patch-set (i.e. [2]) relied on the PM_OPP
framework to provide the scheduler with an Energy Model. As agreed during
the 2nd OSPM summit, this new revision removes this dependency by
implementing a new independent EM framework. This framework aggregates
the power values provided by drivers into a table for each frequency
domain in the system. Those tables are then available to interested
clients (e.g. the task scheduler or the thermal subsystem) via
platform-agnostic APIs. The topology code of the scheduler is modified
accordingly to take a reference on the online frequency domains, hence
guaranteeing fast access to the shared EM data structures in
latency-sensitive code paths. The modifications required to make the
thermal subsystem use the new Energy Model framework are not covered by
this patch-set.
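The tables the framework aggregates can be pictured with a small
user-space model. The code below is purely illustrative (the struct,
values and helper are invented for this sketch, not the kernel API
introduced later in the series); it shows how a per-domain list of
<frequency, power, capacity> tuples supports a simple energy estimate:

```c
#include <assert.h>

/*
 * Illustrative model of one frequency domain's table: one
 * <frequency, power, capacity> tuple per capacity state, sorted by
 * increasing frequency. The numbers are made up for this sketch.
 */
struct cap_state {
	unsigned long frequency;	/* kHz */
	unsigned long power;		/* mW */
	unsigned long capacity;		/* out of 1024 */
};

static const struct cap_state example_table[] = {
	{  500000,  100,  256 },
	{ 1000000,  300,  512 },
	{ 2000000, 1000, 1024 },
};

/*
 * Energy estimate in the spirit of em_fd_energy() later in the series:
 * map the highest utilization in the domain to a frequency, pick the
 * lowest capacity state at or above it, and scale that state's power
 * by the domain's total busy time.
 */
static unsigned long fd_energy(const struct cap_state *table, int nr_states,
			       unsigned long max_util, unsigned long sum_util)
{
	const struct cap_state *cs = &table[nr_states - 1];
	unsigned long freq = max_util * cs->frequency / cs->capacity;
	int i;

	for (i = 0; i < nr_states; i++) {
		cs = &table[i];
		if (cs->frequency >= freq)
			break;
	}

	return cs->power * sum_util / cs->capacity;
}
```

With max_util = 512 and sum_util = 768 on this invented table, the
domain would run its 1 GHz state and the estimate is 300 * 768 / 512 = 450.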
The v2 of this patch-set used a per-scheduling domain overutilization
flag, which has been abandoned in v3 in favour of a simpler but equally
efficient system-wide implementation attached to the root domain (like
the existing overload flag). Consequently, the integration of EAS in the
wake-up path has been reworked to accommodate this change using a
scheduling domain shortcut.
The patch-set is now arranged as follows:
- Patches 1-2 refactor code from schedutil and the scheduler to
simplify the implementation of the EM framework;
- Patches 3-4 introduce the centralized EM framework;
- Patch 5 changes the scheduler topology code to make it aware of the EM;
- Patch 6 implements the overutilization mechanism;
- Patches 7-9 introduce the energy-aware wake-up path in the CFS class;
- Patch 10 starts EAS for arm/arm64 from the arch_topology driver.
2. Test results
Two fundamentally different tests were executed. First, the energy test
case shows the impact of this patch-set on energy consumption using a
synthetic set of tasks. Second, the performance test case provides the
conventional hackbench metric numbers.
The tests run on two arm64 big.LITTLE platforms: Hikey960 (4xA73 +
4xA53) and Juno r0 (2xA57 + 4xA53).
Base kernel is tip/sched/core (4.17-rc3), with some Hikey960 and
Juno specific patches, the SD_ASYM_CPUCAPACITY flag set at DIE sched
domain level for arm64 and schedutil as cpufreq governor [4].
2.1 Energy test case
10 iterations of between 10 and 50 periodic rt-app tasks (16ms period,
5% duty-cycle) for 30 seconds with energy measurement. Unit is Joules.
The goal is to save energy, so lower is better.
2.1.1 Hikey960
Energy is measured with an ACME Cape on an instrumented board. Numbers
include consumption of big and little CPUs, LPDDR memory, GPU and most
of the other small components on the board. They do not include
consumption of the radio chip (turned-off anyway) and external
connectors.
+----------+-----------------+------------------------+
|          | Without patches | With patches           |
+----------+--------+--------+-----------------+------+
| Tasks nb | Mean   | RSD*   | Mean            | RSD* |
+----------+--------+--------+-----------------+------+
| 10       | 33.45  | 1.2%   | 28.97 (-13.39%) | 2.0% |
| 20       | 45.45  | 1.7%   | 42.76 (-5.92%)  | 0.8% |
| 30       | 65.06  | 0.2%   | 64.85 (-0.32%)  | 4.7% |
| 40       | 85.67  | 0.7%   | 77.98 (-8.98%)  | 2.8% |
| 50       | 110.14 | 0.9%   | 99.34 (-9.81%)  | 2.0% |
+----------+--------+--------+-----------------+------+
2.1.2 Juno r0
Energy is measured with the onboard energy meter. Numbers include
consumption of big and little CPUs.
+----------+-----------------+------------------------+
|          | Without patches | With patches           |
+----------+--------+--------+-----------------+------+
| Tasks nb | Mean   | RSD*   | Mean            | RSD* |
+----------+--------+--------+-----------------+------+
| 10       | 10.40  | 3.0%   | 7.00 (-32.69%)  | 2.5% |
| 20       | 18.47  | 1.1%   | 12.88 (-30.27%) | 2.4% |
| 30       | 27.97  | 2.2%   | 21.26 (-23.99%) | 0.2% |
| 40       | 36.86  | 1.2%   | 30.63 (-16.90%) | 0.4% |
| 50       | 46.79  | 0.5%   | 45.85 (-2.01%)  | 0.7% |
+----------+--------+--------+-----------------+------+
2.2 Performance test case
30 iterations of perf bench sched messaging --pipe --thread --group G
--loop L with G=[1 2 4 8] and L=50000 (Hikey960)/16000 (Juno r0).
2.2.1 Hikey960
The impact of thermal capping was mitigated thanks to a heatsink, a
fan, and a 10 sec delay between two successive executions.
+----------------+-----------------+------------------------+
|                | Without patches | With patches           |
+--------+-------+---------+-------+----------------+-------+
| Groups | Tasks | Mean    | RSD*  | Mean           | RSD*  |
+--------+-------+---------+-------+----------------+-------+
| 1      | 40    | 8.75    | 0.99% | 9.46 (+8.11%)  | 3.34% |
| 2      | 80    | 15.64   | 0.68% | 15.96 (+2.05%) | 0.71% |
| 4      | 160   | 31.58   | 0.65% | 32.22 (+2.03%) | 0.61% |
| 8      | 320   | 65.53   | 0.37% | 66.43 (+1.37%) | 0.36% |
+--------+-------+---------+-------+----------------+-------+
2.2.2 Juno r0
+----------------+-----------------+------------------------+
|                | Without patches | With patches           |
+--------+-------+---------+-------+----------------+-------+
| Groups | Tasks | Mean    | RSD*  | Mean           | RSD*  |
+--------+-------+---------+-------+----------------+-------+
| 1      | 40    | 8.25    | 0.11% | 8.21 (-0.48%)  | 0.10% |
| 2      | 80    | 14.40   | 0.14% | 14.37 (-0.21%) | 0.12% |
| 4      | 160   | 26.72   | 0.24% | 26.73 (+0.04%) | 0.14% |
| 8      | 320   | 52.89   | 0.10% | 52.87 (-0.04%) | 0.23% |
+--------+-------+---------+-------+----------------+-------+
*RSD: Relative Standard Deviation (std dev / mean)
3. Changes between versions
Changes v2[2]->v3:
- Removed the PM_OPP dependency by implementing a new EM framework
- Modified the scheduler topology code to take references on the EM data
structures
- Simplified the overutilization mechanism into a system-wide flag
- Reworked the integration in the wake-up path using the sd_ea shortcut
- Rebased on tip/sched/core (247f2f6f3c70 "sched/core: Don't schedule
threads on pre-empted vCPUs")
Changes v1[3]->v2:
- Reworked interface between fair.c and energy.[ch] (Remove #ifdef
CONFIG_PM_OPP from energy.c) (Greg KH)
- Fixed licence & header issue in energy.[ch] (Greg KH)
- Reordered EAS path in select_task_rq_fair() (Joel)
- Avoid prev_cpu if not allowed in select_task_rq_fair() (Morten/Joel)
- Refactored compute_energy() (Patrick)
- Account for RT/IRQ pressure in task_fits() (Patrick)
- Use UTIL_EST and DL utilization during OPP estimation (Patrick/Juri)
- Optimize selection of CPU candidates in the energy-aware wake-up path
- Rebased on top of tip/sched/core (commit b720342849fe "sched/core:
  Update preempt_notifier_key to modern API")
[1] https://lkml.org/lkml/2015/7/7/754
[2] https://marc.info/?l=linux-kernel&m=152302902427143&w=2
[3] https://marc.info/?l=linux-kernel&m=152153905805048&w=2
[4] http://linux-arm.org/git?p=linux-qp.git;a=shortlog;h=refs/heads/upstream/eas_v3
Morten Rasmussen (1):
sched: Add over-utilization/tipping point indicator
Quentin Perret (9):
sched: Relocate arch_scale_cpu_capacity
sched/cpufreq: Factor out utilization to frequency mapping
PM: Introduce an Energy Model management framework
PM / EM: Expose the Energy Model in sysfs
sched/topology: Reference the Energy Model of CPUs when available
sched/fair: Introduce an energy estimation helper function
sched: Lowest energy aware balancing sched_domain level pointer
sched/fair: Select an energy-efficient CPU on task wake-up
arch_topology: Start Energy Aware Scheduling
drivers/base/arch_topology.c | 19 ++
include/linux/energy_model.h | 123 +++++++++++
include/linux/sched/cpufreq.h | 6 +
include/linux/sched/topology.h | 19 ++
kernel/power/Kconfig | 15 ++
kernel/power/Makefile | 2 +
kernel/power/energy_model.c | 343 +++++++++++++++++++++++++++++++
kernel/sched/cpufreq_schedutil.c | 3 +-
kernel/sched/fair.c | 186 ++++++++++++++++-
kernel/sched/sched.h | 51 +++--
kernel/sched/topology.c | 117 +++++++++++
11 files changed, 860 insertions(+), 24 deletions(-)
create mode 100644 include/linux/energy_model.h
create mode 100644 kernel/power/energy_model.c
--
2.17.0
By default, arch_scale_cpu_capacity() is only visible from within the
kernel/sched folder. Relocate it to include/linux/sched/topology.h to
make it visible to other clients needing to know about the capacity of
CPUs, such as the Energy Model framework.
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
include/linux/sched/topology.h | 19 +++++++++++++++++++
kernel/sched/sched.h | 18 ------------------
2 files changed, 19 insertions(+), 18 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..1e24e88bee6d 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -202,6 +202,17 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
# define SD_INIT_NAME(type)
#endif
+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+{
+ if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+ return sd->smt_gain / sd->span_weight;
+
+ return SCHED_CAPACITY_SCALE;
+}
+#endif
+
#else /* CONFIG_SMP */
struct sched_domain_attr;
@@ -217,6 +228,14 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
return true;
}
+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
+{
+ return SCHED_CAPACITY_SCALE;
+}
+#endif
+
#endif /* !CONFIG_SMP */
static inline int task_node(const struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15750c222ca2..ce562d3b7526 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1724,30 +1724,12 @@ unsigned long arch_scale_freq_capacity(int cpu)
#ifdef CONFIG_SMP
extern void sched_avg_update(struct rq *rq);
-#ifndef arch_scale_cpu_capacity
-static __always_inline
-unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
-{
- if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
- return sd->smt_gain / sd->span_weight;
-
- return SCHED_CAPACITY_SCALE;
-}
-#endif
-
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
rq->rt_avg += rt_delta * arch_scale_freq_capacity(cpu_of(rq));
sched_avg_update(rq);
}
#else
-#ifndef arch_scale_cpu_capacity
-static __always_inline
-unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
-{
- return SCHED_CAPACITY_SCALE;
-}
-#endif
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
static inline void sched_avg_update(struct rq *rq) { }
#endif
--
2.17.0
Several subsystems in the kernel (scheduler and/or thermal at the time
of writing) can benefit from knowing about the energy consumed by CPUs.
Yet, this information can come from different sources (DT or firmware for
example), in different formats, hence making it hard to exploit without
a standard API.
This patch attempts to solve this issue by introducing a centralized
Energy Model (EM) framework which can be used to interface the data
providers with the client subsystems. This framework standardizes the
API to expose power costs, and to access them from multiple locations.
The current design assumes that all CPUs in a frequency domain share the
same micro-architecture. As such, the EM data is structured in a
per-frequency-domain fashion. Drivers aware of frequency domains
(typically, but not limited to, CPUFreq drivers) are expected to register
data in the EM framework using the em_register_freq_domain() API. To do
so, the drivers must provide a callback function that will be called by
the EM framework to populate the tables. As of today, only the active
power of the CPUs is considered. For each frequency domain, the EM
includes a list of <frequency, power, capacity> tuples for the capacity
states of the domain alongside a cpumask covering the involved CPUs.
The EM framework also provides an API to re-scale the capacity values
of the model asynchronously, after it has been created. This is required
for architectures where the capacity scale factor of CPUs can change at
run-time. This is the case for Arm/Arm64 for example where the
arch_topology driver recomputes the capacity scale factors of the CPUs
after the maximum frequency of all CPUs has been discovered. Although
complex, the process of creating and re-scaling the EM has to be kept in
two separate steps to fulfill the needs of the different users. The thermal
subsystem doesn't use the capacity values and shouldn't have dependencies
on subsystems providing them. On the other hand, the task scheduler needs
the capacity values, and it will benefit from seeing them up-to-date when
applicable.
Because of this need for asynchronous update, the capacity state table
of each frequency domain is protected by RCU, guaranteeing safe
modification of the table and fast access for readers in latency-sensitive
code paths.
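The re-scaling step itself is simple arithmetic; the following is a
user-space sketch of the fd_update_cs_table() computation (the RCU
copy/replace machinery of the real code is deliberately left out, and
the helper name is invented for this sketch):

```c
#include <assert.h>

/*
 * Capacities are linear in frequency, anchored so that the highest OPP
 * of the domain gets the CPU's capacity scale factor (cmax). This
 * mirrors the per-state computation done when the EM is created and
 * when em_rescale_cpu_capacity() runs.
 */
static unsigned long rescaled_capacity(unsigned long freq, unsigned long fmax,
				       unsigned long cmax)
{
	return cmax * freq / fmax;
}
```

For example, a CPU with capacity scale factor 1024 whose domain tops out
at 2 GHz gets capacity 256 for its 500 MHz state.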
Cc: Peter Zijlstra <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
include/linux/energy_model.h | 122 +++++++++++++++++
kernel/power/Kconfig | 15 +++
kernel/power/Makefile | 2 +
kernel/power/energy_model.c | 249 +++++++++++++++++++++++++++++++++++
4 files changed, 388 insertions(+)
create mode 100644 include/linux/energy_model.h
create mode 100644 kernel/power/energy_model.c
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
new file mode 100644
index 000000000000..edde888852ba
--- /dev/null
+++ b/include/linux/energy_model.h
@@ -0,0 +1,122 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_ENERGY_MODEL_H
+#define _LINUX_ENERGY_MODEL_H
+#include <linux/types.h>
+#include <linux/cpumask.h>
+#include <linux/jump_label.h>
+#include <linux/rcupdate.h>
+#include <linux/kobject.h>
+#include <linux/sched/cpufreq.h>
+
+#ifdef CONFIG_ENERGY_MODEL
+struct em_cap_state {
+ unsigned long capacity;
+ unsigned long frequency;
+ unsigned long power;
+};
+
+struct em_cs_table {
+ struct em_cap_state *state;
+ int nr_cap_states;
+ struct rcu_head rcu;
+};
+
+struct em_freq_domain {
+ struct em_cs_table *cs_table;
+ cpumask_t cpus;
+};
+
+struct em_data_callback {
+ /**
+ * active_power() - Provide power at the next capacity state of a CPU
+ * @power : Active power at the capacity state (modified)
+ * @freq : Frequency at the capacity state (modified)
+ * @cpu : CPU for which we do this operation
+ *
+ * active_power() must find the lowest capacity state of 'cpu' above
+ * 'freq' and update 'power' and 'freq' to the matching active power
+ * and frequency.
+ *
+ * Return 0 on success.
+ */
+ int (*active_power) (unsigned long *power, unsigned long *freq, int cpu);
+};
+
+int em_register_freq_domain(cpumask_t *span, unsigned int nr_states,
+ struct em_data_callback *cb);
+void em_rescale_cpu_capacity(void);
+struct em_freq_domain *em_cpu_get(int cpu);
+
+/**
+ * em_fd_energy() - Estimates the energy consumed by the CPUs of a freq. domain
+ * @fd : frequency domain for which energy has to be estimated
+ * @max_util : highest utilization among CPUs of the domain
+ * @sum_util : sum of the utilization of all CPUs in the domain
+ *
+ * Return: the sum of the energy consumed by the CPUs of the domain assuming
+ * a capacity state satisfying the max utilization of the domain.
+ */
+static inline unsigned long em_fd_energy(struct em_freq_domain *fd,
+ unsigned long max_util, unsigned long sum_util)
+{
+ struct em_cs_table *cs_table;
+ struct em_cap_state *cs;
+ unsigned long freq;
+ int i;
+
+ cs_table = rcu_dereference(fd->cs_table);
+ if (!cs_table)
+ return 0;
+
+ /* Map the utilization value to a frequency */
+ cs = &cs_table->state[cs_table->nr_cap_states - 1];
+ freq = map_util_freq(max_util, cs->frequency, cs->capacity);
+
+ /* Find the lowest capacity state above this frequency */
+ for (i = 0; i < cs_table->nr_cap_states; i++) {
+ cs = &cs_table->state[i];
+ if (cs->frequency >= freq)
+ break;
+ }
+
+ return cs->power * sum_util / cs->capacity;
+}
+
+/**
+ * em_fd_nr_cap_states() - Get the number of capacity states of a freq. domain
+ * @fd : frequency domain for which want to do this
+ *
+ * Return: the number of capacity state in the frequency domain table
+ */
+static inline int em_fd_nr_cap_states(struct em_freq_domain *fd)
+{
+ struct em_cs_table *table = rcu_dereference(fd->cs_table);
+
+ return table->nr_cap_states;
+}
+
+#else
+struct em_freq_domain;
+struct em_data_callback;
+static inline int em_register_freq_domain(cpumask_t *span,
+ unsigned int nr_states, struct em_data_callback *cb)
+{
+ return -ENOTSUPP;
+}
+static inline struct em_freq_domain *em_cpu_get(int cpu)
+{
+ return NULL;
+}
+static inline unsigned long em_fd_energy(struct em_freq_domain *fd,
+ unsigned long max_util, unsigned long sum_util)
+{
+ return 0;
+}
+static inline int em_fd_nr_cap_states(struct em_freq_domain *fd)
+{
+ return 0;
+}
+static inline void em_rescale_cpu_capacity(void) { }
+#endif
+
+#endif
diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig
index e880ca22c5a5..b9e2b92e3be1 100644
--- a/kernel/power/Kconfig
+++ b/kernel/power/Kconfig
@@ -297,3 +297,18 @@ config PM_GENERIC_DOMAINS_OF
config CPU_PM
bool
+
+config ENERGY_MODEL
+ bool "Energy Model for CPUs"
+ depends on SMP
+ depends on CPU_FREQ
+ default n
+ help
+ Several subsystems (thermal and/or the task scheduler for example)
+ can leverage information about the energy consumed by CPUs to make
+ smarter decisions. This config option enables the framework from
+ which a user can access the energy models.
+
+ The exact usage of the energy model is subsystem-dependent.
+
+ If in doubt, say N.
diff --git a/kernel/power/Makefile b/kernel/power/Makefile
index a3f79f0eef36..e7e47d9be1e5 100644
--- a/kernel/power/Makefile
+++ b/kernel/power/Makefile
@@ -15,3 +15,5 @@ obj-$(CONFIG_PM_AUTOSLEEP) += autosleep.o
obj-$(CONFIG_PM_WAKELOCKS) += wakelock.o
obj-$(CONFIG_MAGIC_SYSRQ) += poweroff.o
+
+obj-$(CONFIG_ENERGY_MODEL) += energy_model.o
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
new file mode 100644
index 000000000000..a2eece7007a8
--- /dev/null
+++ b/kernel/power/energy_model.c
@@ -0,0 +1,249 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Energy Model of CPUs
+ *
+ * Copyright (c) 2018, Arm ltd.
+ * Written by: Quentin Perret, Arm ltd.
+ */
+
+#define pr_fmt(fmt) "energy_model: " fmt
+
+#include <linux/cpu.h>
+#include <linux/slab.h>
+#include <linux/cpumask.h>
+#include <linux/energy_model.h>
+#include <linux/sched/topology.h>
+
+/* Mapping of each CPU to the frequency domain to which it belongs. */
+static DEFINE_PER_CPU(struct em_freq_domain *, em_data);
+
+/*
+ * Protects the access to em_data. Readers of em_data can be in RCU-critical
+ * sections, and can't afford to sleep.
+ */
+static DEFINE_RWLOCK(em_data_lock);
+
+/*
+ * Mutex serializing the registrations of frequency domains. It allows the
+ * callbacks defined by drivers to sleep.
+ */
+static DEFINE_MUTEX(em_fd_mutex);
+
+static struct em_cs_table *alloc_cs_table(int nr_states)
+{
+ struct em_cs_table *cs_table;
+
+ cs_table = kzalloc(sizeof(*cs_table), GFP_NOWAIT);
+ if (!cs_table)
+ return NULL;
+
+ cs_table->state = kcalloc(nr_states, sizeof(*cs_table->state),
+ GFP_NOWAIT);
+ if (!cs_table->state) {
+ kfree(cs_table);
+ return NULL;
+ }
+
+ cs_table->nr_cap_states = nr_states;
+
+ return cs_table;
+}
+
+static void free_cs_table(struct em_cs_table *table)
+{
+ if (table) {
+ kfree(table->state);
+ kfree(table);
+ }
+}
+
+static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
+{
+ unsigned long cmax = arch_scale_cpu_capacity(NULL, cpu);
+ int max_cap_state = cs_table->nr_cap_states - 1;
+ unsigned long fmax = cs_table->state[max_cap_state].frequency;
+ int i;
+
+ for (i = 0; i < cs_table->nr_cap_states; i++)
+ cs_table->state[i].capacity = cmax *
+ cs_table->state[i].frequency / fmax;
+}
+
+static struct em_freq_domain *em_create_fd(cpumask_t *span, int nr_states,
+ struct em_data_callback *cb)
+{
+ unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
+ int i, ret, cpu = cpumask_first(span);
+ struct em_freq_domain *fd;
+ unsigned long power, freq;
+
+ if (!cb->active_power)
+ return NULL;
+
+ fd = kzalloc(sizeof(*fd), GFP_KERNEL);
+ if (!fd)
+ return NULL;
+
+ fd->cs_table = alloc_cs_table(nr_states);
+ if (!fd->cs_table)
+ goto free_fd;
+
+ /* Copy the span of the frequency domain */
+ cpumask_copy(&fd->cpus, span);
+
+ /* Build the list of capacity states for this freq domain */
+ for (i = 0, freq = 0; i < nr_states; i++, freq++) {
+ ret = cb->active_power(&power, &freq, cpu);
+ if (ret)
+ goto free_cs_table;
+
+ fd->cs_table->state[i].power = power;
+ fd->cs_table->state[i].frequency = freq;
+
+ /*
+ * The hertz/watts efficiency ratio should decrease as the
+ * frequency grows on sane platforms. If not, warn the user
+ * that some high OPPs are more power efficient than some
+ * of the lower ones.
+ */
+ opp_eff = freq / power;
+ if (opp_eff >= prev_opp_eff)
+ pr_warn("%*pbl: hz/watt efficiency: OPP %d >= OPP%d\n",
+ cpumask_pr_args(span), i, i - 1);
+ prev_opp_eff = opp_eff;
+ }
+ fd_update_cs_table(fd->cs_table, cpu);
+
+ return fd;
+
+free_cs_table:
+ free_cs_table(fd->cs_table);
+free_fd:
+ kfree(fd);
+
+ return NULL;
+}
+
+static void rcu_free_cs_table(struct rcu_head *rp)
+{
+ struct em_cs_table *table;
+
+ table = container_of(rp, struct em_cs_table, rcu);
+ free_cs_table(table);
+}
+
+/**
+ * em_rescale_cpu_capacity() - Re-scale capacity values of the Energy Model
+ *
+ * This re-scales the capacity values for all capacity states of all frequency
+ * domains of the Energy Model. This should be used when the capacity values
+ * of the CPUs are updated at run-time, after the EM was registered.
+ */
+void em_rescale_cpu_capacity(void)
+{
+ struct em_cs_table *old_table, *new_table;
+ struct em_freq_domain *fd;
+ unsigned long flags;
+ int nr_states, cpu;
+
+ read_lock_irqsave(&em_data_lock, flags);
+ for_each_cpu(cpu, cpu_possible_mask) {
+ fd = per_cpu(em_data, cpu);
+ if (!fd || cpu != cpumask_first(&fd->cpus))
+ continue;
+
+ /* Copy the existing table. */
+ old_table = rcu_dereference(fd->cs_table);
+ nr_states = old_table->nr_cap_states;
+ new_table = alloc_cs_table(nr_states);
+ if (!new_table) {
+ read_unlock_irqrestore(&em_data_lock, flags);
+ return;
+ }
+ memcpy(new_table->state, old_table->state,
+ nr_states * sizeof(*new_table->state));
+
+ /* Re-scale the capacity values on the copy. */
+ fd_update_cs_table(new_table, cpumask_first(&fd->cpus));
+
+ /* Replace the table with the rescaled version. */
+ rcu_assign_pointer(fd->cs_table, new_table);
+ call_rcu(&old_table->rcu, rcu_free_cs_table);
+ }
+ read_unlock_irqrestore(&em_data_lock, flags);
+ pr_debug("Re-scaled CPU capacities\n");
+}
+EXPORT_SYMBOL_GPL(em_rescale_cpu_capacity);
+
+/**
+ * em_cpu_get() - Return the frequency domain for a CPU
+ * @cpu : CPU to find the frequency domain for
+ *
+ * Return: the frequency domain to which 'cpu' belongs, or NULL if it doesn't
+ * exist.
+ */
+struct em_freq_domain *em_cpu_get(int cpu)
+{
+ struct em_freq_domain *fd;
+ unsigned long flags;
+
+ read_lock_irqsave(&em_data_lock, flags);
+ fd = per_cpu(em_data, cpu);
+ read_unlock_irqrestore(&em_data_lock, flags);
+
+ return fd;
+}
+EXPORT_SYMBOL_GPL(em_cpu_get);
+
+/**
+ * em_register_freq_domain() - Register the Energy Model of a frequency domain
+ * @span : Mask of CPUs in the frequency domain
+ * @nr_states : Number of capacity states to register
+ * @cb : Callback functions providing the data of the Energy Model
+ *
+ * Create Energy Model tables for a frequency domain using the callbacks
+ * defined in cb.
+ *
+ * If multiple clients register the same frequency domain, all but the first
+ * registration will be ignored.
+ *
+ * Return 0 on success
+ */
+int em_register_freq_domain(cpumask_t *span, unsigned int nr_states,
+ struct em_data_callback *cb)
+{
+ struct em_freq_domain *fd;
+ unsigned long flags;
+ int cpu, ret = 0;
+
+ if (!span || !nr_states || !cb)
+ return -EINVAL;
+
+ mutex_lock(&em_fd_mutex);
+
+ /* Make sure we don't register again an existing domain. */
+ for_each_cpu(cpu, span) {
+ if (per_cpu(em_data, cpu)) {
+ ret = -EEXIST;
+ goto unlock;
+ }
+ }
+
+ /* Create the frequency domain and add it to the Energy Model. */
+ fd = em_create_fd(span, nr_states, cb);
+ if (!fd) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ write_lock_irqsave(&em_data_lock, flags);
+ for_each_cpu(cpu, span)
+ per_cpu(em_data, cpu) = fd;
+ write_unlock_irqrestore(&em_data_lock, flags);
+
+ pr_debug("Created freq domain %*pbl\n", cpumask_pr_args(span));
+unlock:
+ mutex_unlock(&em_fd_mutex);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(em_register_freq_domain);
--
2.17.0
From: Morten Rasmussen <[email protected]>
Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
above the tipping point, task placement is done the traditional way based
on load_avg, spreading the tasks across as many cpus as possible based
on priority-scaled load to preserve smp_nice. Below the tipping point we
want to use util_avg instead. We need to define a criterion for when we
make the switch.
The util_avg for each cpu converges towards 100% (1024) regardless of
how many additional tasks we may put on it. If we define
over-utilized as:
sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)
some individual cpus may be over-utilized running multiple tasks even
when the above condition is false. That should be okay as long as we try
to spread the tasks out to avoid per-cpu over-utilization as much as
possible and if all tasks have the _same_ priority. If the latter isn't
true, we have to consider priority to preserve smp_nice.
For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks
getting their own as we have 1.5*n_cpus tasks in total and 55%+55% is less
over-utilized than 55%+60% for those cpus that have to be shared. The
system utilization is only 85% of the system capacity, but we are
breaking smp_nice.
To be sure not to break smp_nice, we have defined over-utilization
conservatively as when any cpu in the system is fully utilized at its
highest frequency instead:
cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity
IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg
to factor in priority to preserve smp_nice.
With this definition, we can skip periodic load-balance as no cpu has an
always-running task when the system is not over-utilized. All tasks will
be periodic and we can balance them at wake-up. This conservative
condition does however mean that some scenarios that could benefit from
energy-aware decisions even if one cpu is fully utilized would not get
those benefits.
For systems where some cpus might have reduced capacity (RT-pressure
and/or big.LITTLE), we want periodic load-balance checks as soon as just
a single cpu is fully utilized, as it might be one of those with reduced
capacity, and in that case we want to migrate it.
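The tipping-point condition above can be modeled in user space as
follows. Note that capacity_margin = 1280 (~20% headroom) matches the
scheduler's existing default at the time of writing, but treat that
value as an assumption of this sketch:

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE	1024UL

/*
 * Model of the per-cpu tipping-point test: a cpu counts as
 * over-utilized when its utilization, inflated by the capacity margin,
 * exceeds its capacity. This is the cross-multiplied form of
 * util + margin > capacity.
 */
static bool cpu_overutilized(unsigned long capacity, unsigned long util,
			     unsigned long capacity_margin)
{
	return capacity * SCHED_CAPACITY_SCALE < util * capacity_margin;
}
```

With a 1280/1024 margin, a big cpu of capacity 1024 tips over just above
util 819 (80%), and a little cpu of capacity 512 just above util 409.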
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
kernel/sched/fair.c | 47 +++++++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 3 +++
2 files changed, 47 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1f6a23a5b451..ec797d7ede83 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5345,6 +5345,24 @@ static inline void hrtick_update(struct rq *rq)
}
#endif
+#ifdef CONFIG_SMP
+static inline unsigned long cpu_util(int cpu);
+static unsigned long capacity_of(int cpu);
+
+static inline bool cpu_overutilized(int cpu)
+{
+ return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
+}
+
+static inline void update_overutilized_status(struct rq *rq)
+{
+ if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+ WRITE_ONCE(rq->rd->overutilized, 1);
+}
+#else
+static inline void update_overutilized_status(struct rq *rq) { }
+#endif
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -5355,6 +5373,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ int task_new = !(flags & ENQUEUE_WAKEUP);
/*
* If in_iowait is set, the code below may not trigger any cpufreq
@@ -5394,8 +5413,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_group(se);
}
- if (!se)
+ if (!se) {
add_nr_running(rq, 1);
+ if (!task_new)
+ update_overutilized_status(rq);
+
+ }
util_est_enqueue(&rq->cfs, p);
hrtick_update(rq);
@@ -8121,11 +8144,12 @@ static bool update_nohz_stats(struct rq *rq, bool force)
* @local_group: Does group contain this_cpu.
* @sgs: variable to hold the statistics for this group.
* @overload: Indicate more than one runnable task for any CPU.
+ * @overutilized: Indicate overutilization for any CPU.
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
int local_group, struct sg_lb_stats *sgs,
- bool *overload)
+ bool *overload, int *overutilized)
{
unsigned long load;
int i, nr_running;
@@ -8152,6 +8176,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (nr_running > 1)
*overload = true;
+ if (cpu_overutilized(i))
+ *overutilized = 1;
+
#ifdef CONFIG_NUMA_BALANCING
sgs->nr_numa_running += rq->nr_numa_running;
sgs->nr_preferred_running += rq->nr_preferred_running;
@@ -8289,6 +8316,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sg_lb_stats tmp_sgs;
int load_idx, prefer_sibling = 0;
bool overload = false;
+ int overutilized = 0;
if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;
@@ -8315,7 +8343,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}
update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
- &overload);
+ &overload, &overutilized);
if (local_group)
goto next_group;
@@ -8367,6 +8395,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* update overload indicator if we are at root domain */
if (env->dst_rq->rd->overload != overload)
env->dst_rq->rd->overload = overload;
+
+ /* Update over-utilization (tipping point, U >= 0) indicator */
+ if (READ_ONCE(env->dst_rq->rd->overutilized) != overutilized)
+ WRITE_ONCE(env->dst_rq->rd->overutilized, overutilized);
+ } else {
+ if (!READ_ONCE(env->dst_rq->rd->overutilized) && overutilized)
+ WRITE_ONCE(env->dst_rq->rd->overutilized, 1);
}
}
@@ -8586,6 +8621,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
* this level.
*/
update_sd_lb_stats(env, &sds);
+
+ if (sched_energy_enabled() && !READ_ONCE(env->dst_rq->rd->overutilized))
+ goto out_balanced;
+
local = &sds.local_stat;
busiest = &sds.busiest_stat;
@@ -9943,6 +9982,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
+
+ update_overutilized_status(task_rq(curr));
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7c517076a74a..ef5d4ebc205e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -692,6 +692,9 @@ struct root_domain {
/* Indicate more than one runnable task for any CPU */
bool overload;
+ /* Indicate one or more cpus over-utilized (tipping point) */
+ int overutilized;
+
/*
* The bit corresponding to a CPU gets set here if such CPU has more
* than one runnable -deadline task (as it is below for RT tasks).
--
2.17.0
In preparation for the definition of an energy-aware wakeup path, a
helper function is provided to estimate the impact on system energy
when a specific task wakes up on a specific CPU. compute_energy()
estimates the capacity state to be reached by all frequency domains and
estimates the consumption of each online CPU according to its Energy Model
and its percentage of busy time.
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
kernel/sched/fair.c | 55 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 2 +-
2 files changed, 56 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ec797d7ede83..1f7029258df2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6628,6 +6628,61 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
return min_cap * 1024 < task_util(p) * capacity_margin;
}
+/*
+ * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
+ */
+static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
+{
+ unsigned long util, util_est;
+ struct cfs_rq *cfs_rq;
+
+ /* Task is where it should be, or has no impact on cpu */
+ if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
+ return cpu_util(cpu);
+
+ cfs_rq = &cpu_rq(cpu)->cfs;
+ util = READ_ONCE(cfs_rq->avg.util_avg);
+
+ if (dst_cpu == cpu)
+ util += task_util(p);
+ else
+ util = max_t(long, util - task_util(p), 0);
+
+ if (sched_feat(UTIL_EST)) {
+ util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+ if (dst_cpu == cpu)
+ util_est += _task_util_est(p);
+ else
+ util_est = max_t(long, util_est - _task_util_est(p), 0);
+ util = max(util, util_est);
+ }
+
+ return min_t(unsigned long, util, capacity_orig_of(cpu));
+}
+
+static long compute_energy(struct task_struct *p, int dst_cpu)
+{
+ long util, max_util, sum_util, energy = 0;
+ struct sched_energy_fd *sfd;
+ int cpu;
+
+ for_each_freq_domain(sfd) {
+ max_util = sum_util = 0;
+ for_each_cpu_and(cpu, freq_domain_span(sfd), cpu_online_mask) {
+ util = cpu_util_next(cpu, p, dst_cpu);
+ util += cpu_util_dl(cpu_rq(cpu));
+ /* XXX: add RT util_avg when available. */
+
+ max_util = max(util, max_util);
+ sum_util += util;
+ }
+
+ energy += em_fd_energy(sfd->fd, max_util, sum_util);
+ }
+
+ return energy;
+}
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ef5d4ebc205e..0dd895554f78 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2148,7 +2148,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
# define arch_scale_freq_invariant() false
#endif
-#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+#ifdef CONFIG_SMP
static inline unsigned long cpu_util_dl(struct rq *rq)
{
return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
--
2.17.0
If an energy model is available, and if the system isn't overutilized,
waking tasks are re-routed into a new energy-aware placement algorithm.
The selection of an energy-efficient CPU for a task is achieved by
estimating the impact on system-level active energy resulting from the
placement of the task on the CPU with the highest spare capacity in each
frequency domain. This strategy spreads tasks in a frequency domain and
avoids overly aggressive task packing. The best CPU energy-wise is then
selected if it saves a large enough amount of energy with respect to
prev_cpu.
Although it has already shown significant benefits on some existing
targets, this approach cannot scale to platforms with numerous CPUs.
This patch is an attempt to do something useful, as writing a fast
heuristic that performs reasonably well on a broad spectrum of
architectures isn't an easy task. As such, the scope of usability of the
energy-aware wake-up path is restricted to systems with the
SD_ASYM_CPUCAPACITY flag set, and where the EM isn't too complex.
In addition, the energy-aware wake-up path is accessible only if
sched_energy_enabled() is true. For systems which don't meet all
dependencies for EAS (CONFIG_ENERGY_MODEL for example) at compile time,
sched_energy_enabled() defaults to a constant "false" value, hence
letting the compiler remove the unused EAS code entirely.
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
kernel/sched/fair.c | 84 ++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 83 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1f7029258df2..eb44829be17f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6683,6 +6683,80 @@ static long compute_energy(struct task_struct *p, int dst_cpu)
return energy;
}
+static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
+{
+ unsigned long cur_energy, prev_energy, best_energy, cpu_cap, task_util;
+ int cpu, best_energy_cpu = prev_cpu;
+ struct sched_energy_fd *sfd;
+ struct sched_domain *sd;
+
+ sync_entity_load_avg(&p->se);
+
+ task_util = task_util_est(p);
+ if (!task_util)
+ return prev_cpu;
+
+ /*
+ * Energy-aware wake-up happens on the lowest sched_domain starting
+ * from sd_ea spanning over this_cpu and prev_cpu.
+ */
+ sd = rcu_dereference(*this_cpu_ptr(&sd_ea));
+ while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
+ sd = sd->parent;
+ if (!sd)
+ return -1;
+
+ if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
+ prev_energy = best_energy = compute_energy(p, prev_cpu);
+ else
+ prev_energy = best_energy = ULONG_MAX;
+
+ for_each_freq_domain(sfd) {
+ unsigned long spare_cap, max_spare_cap = 0;
+ int max_spare_cap_cpu = -1;
+ unsigned long util;
+
+ /* Find the CPU with the max spare cap in the freq. dom. */
+ for_each_cpu_and(cpu, freq_domain_span(sfd), sched_domain_span(sd)) {
+ if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
+ continue;
+
+ if (cpu == prev_cpu)
+ continue;
+
+ /* Skip CPUs that will be overutilized */
+ util = cpu_util_wake(cpu, p) + task_util;
+ cpu_cap = capacity_of(cpu);
+ if (cpu_cap * 1024 < util * capacity_margin)
+ continue;
+
+ spare_cap = cpu_cap - util;
+ if (spare_cap > max_spare_cap) {
+ max_spare_cap = spare_cap;
+ max_spare_cap_cpu = cpu;
+ }
+ }
+
+ /* Evaluate the energy impact of using this CPU. */
+ if (max_spare_cap_cpu >= 0) {
+ cur_energy = compute_energy(p, max_spare_cap_cpu);
+ if (cur_energy < best_energy) {
+ best_energy = cur_energy;
+ best_energy_cpu = max_spare_cap_cpu;
+ }
+ }
+ }
+
+ /*
+ * We pick the best CPU only if it saves at least 1.5% of the
+ * energy used by prev_cpu.
+ */
+ if ((prev_energy - best_energy) > (prev_energy >> 6))
+ return best_energy_cpu;
+
+ return prev_cpu;
+}
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6701,16 +6775,23 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
struct sched_domain *tmp, *sd = NULL;
int cpu = smp_processor_id();
int new_cpu = prev_cpu;
- int want_affine = 0;
+ int want_affine = 0, want_energy = 0;
int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
+ want_energy = sched_energy_enabled() &&
+ !READ_ONCE(cpu_rq(cpu)->rd->overutilized);
want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
&& cpumask_test_cpu(cpu, &p->cpus_allowed);
}
rcu_read_lock();
+ if (want_energy) {
+ new_cpu = find_energy_efficient_cpu(p, prev_cpu);
+ goto unlock;
+ }
+
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
break;
@@ -6745,6 +6826,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
if (want_affine)
current->recent_used_cpu = cpu;
}
+unlock:
rcu_read_unlock();
return new_cpu;
--
2.17.0
Energy Aware Scheduling starts when the scheduling domains are built if the
Energy Model is present and all conditions are met. However, in the typical
case of Arm/Arm64 systems, the Energy Model is provided after the scheduling
domains are first built at boot time, which results in EAS staying
disabled.
This commit fixes this issue by re-building the scheduling domains from the
arch_topology driver, once CPUFreq is up and running and the capacities
of the CPUs have been updated to their final values.
Cc: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
drivers/base/arch_topology.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index e7cb0c6ade81..7f9fa10ef940 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -15,6 +15,8 @@
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/sched/topology.h>
+#include <linux/energy_model.h>
+#include <linux/cpuset.h>
DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
@@ -173,6 +175,9 @@ static cpumask_var_t cpus_to_visit;
static void parsing_done_workfn(struct work_struct *work);
static DECLARE_WORK(parsing_done_work, parsing_done_workfn);
+static void start_eas_workfn(struct work_struct *work);
+static DECLARE_WORK(start_eas_work, start_eas_workfn);
+
static int
init_cpu_capacity_callback(struct notifier_block *nb,
unsigned long val,
@@ -204,6 +209,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
free_raw_capacity();
pr_debug("cpu_capacity: parsing done\n");
schedule_work(&parsing_done_work);
+ schedule_work(&start_eas_work);
}
return 0;
@@ -249,6 +255,19 @@ static void parsing_done_workfn(struct work_struct *work)
free_cpumask_var(cpus_to_visit);
}
+static void start_eas_workfn(struct work_struct *work)
+{
+ /* Make sure the EM knows about the updated CPU capacities. */
+ rcu_read_lock();
+ em_rescale_cpu_capacity();
+ rcu_read_unlock();
+
+ /* Inform the scheduler about the EM availability. */
+ cpus_read_lock();
+ rebuild_sched_domains();
+ cpus_read_unlock();
+}
+
#else
core_initcall(free_raw_capacity);
#endif
--
2.17.0
The schedutil governor maps utilization values to frequencies by applying
a 25% margin. Since this sort of mapping mechanism can be needed by other
users (e.g. EAS), factor the utilization-to-frequency mapping code out
of schedutil and move it to include/linux/sched/cpufreq.h to avoid code
duplication. The new map_util_freq() function is inlined to avoid
overheads.
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
include/linux/sched/cpufreq.h | 6 ++++++
kernel/sched/cpufreq_schedutil.c | 3 ++-
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index 59667444669f..afa940cd50dc 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -20,6 +20,12 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
void (*func)(struct update_util_data *data, u64 time,
unsigned int flags));
void cpufreq_remove_update_util_hook(int cpu);
+
+static inline unsigned long map_util_freq(unsigned long util,
+ unsigned long freq, unsigned long cap)
+{
+ return (freq + (freq >> 2)) * util / cap;
+}
#endif /* CONFIG_CPU_FREQ */
#endif /* _LINUX_SCHED_CPUFREQ_H */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 63014cff76a5..4fdc6eb7c88b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -13,6 +13,7 @@
#include "sched.h"
+#include <linux/sched/cpufreq.h>
#include <trace/events/power.h>
struct sugov_tunables {
@@ -163,7 +164,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
unsigned int freq = arch_scale_freq_invariant() ?
policy->cpuinfo.max_freq : policy->cur;
- freq = (freq + (freq >> 2)) * util / max;
+ freq = map_util_freq(util, freq, max);
if (freq == sg_policy->cached_raw_freq && sg_policy->next_freq != UINT_MAX)
return sg_policy->next_freq;
--
2.17.0
Add another member to the family of per-cpu sched_domain shortcut
pointers. This one, sd_ea, points to the lowest level at which energy
aware scheduling should be used.
Generally speaking, the largest opportunity to save energy via scheduling
comes from a smarter exploitation of heterogeneous platforms (e.g.
big.LITTLE). Consequently, the sd_ea shortcut is wired to the lowest
scheduling domain at which the SD_ASYM_CPUCAPACITY flag is set. For
example, it is possible to apply energy-aware scheduling within a socket
on a multi-socket system, as long as each socket has an asymmetric
topology. Cross-socket wake-up balancing will only happen when the
system is over-utilized, or this_cpu and prev_cpu are in different
sockets.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Suggested-by: Morten Rasmussen <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 4 ++++
2 files changed, 5 insertions(+)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0dd895554f78..31535198f545 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1151,6 +1151,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(struct sched_domain *, sd_ea);
struct sched_group_capacity {
atomic_t ref;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 3e22c798f18d..8e7ee37771bb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -398,6 +398,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_PER_CPU(struct sched_domain *, sd_ea);
static void update_top_cache_domain(int cpu)
{
@@ -423,6 +424,9 @@ static void update_top_cache_domain(int cpu)
sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
+
+ sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);
+ rcu_assign_pointer(per_cpu(sd_ea, cpu), sd);
}
/*
--
2.17.0
This exposes the Energy Model (read-only) of all frequency domains in
sysfs for convenience. To do so, a parent kobject is added to the CPU
subsystem under the umbrella of which a kobject for each frequency
domain is attached.
For example, the resulting hierarchy is as follows for a platform with
two frequency domains:
/sys/devices/system/cpu/energy_model
├── fd0
│ ├── capacity
│ ├── cpus
│ ├── frequency
│ └── power
└── fd4
├── capacity
├── cpus
├── frequency
└── power
In this implementation, the kobject abstraction is only used as a
convenient way of exposing data to sysfs. However, it could also be
used in the future to allocate and release frequency domains in a more
dynamic way using reference counting.
Cc: Peter Zijlstra <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
include/linux/energy_model.h | 1 +
kernel/power/energy_model.c | 94 ++++++++++++++++++++++++++++++++++++
2 files changed, 95 insertions(+)
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index edde888852ba..ed79903a9721 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -24,6 +24,7 @@ struct em_cs_table {
struct em_freq_domain {
struct em_cs_table *cs_table;
cpumask_t cpus;
+ struct kobject kobj;
};
struct em_data_callback {
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index a2eece7007a8..6ad53f1cf7e6 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -29,6 +29,86 @@ static DEFINE_RWLOCK(em_data_lock);
*/
static DEFINE_MUTEX(em_fd_mutex);
+static struct kobject *em_kobject;
+
+/* Getters for the attributes of em_freq_domain objects */
+struct em_fd_attr {
+ struct attribute attr;
+ ssize_t (*show)(struct em_freq_domain *fd, char *buf);
+ ssize_t (*store)(struct em_freq_domain *fd, const char *buf, size_t s);
+};
+
+#define EM_ATTR_LEN 13
+#define show_table_attr(_attr) \
+static ssize_t show_##_attr(struct em_freq_domain *fd, char *buf) \
+{ \
+ ssize_t cnt = 0; \
+ int i; \
+ struct em_cs_table *table; \
+ rcu_read_lock(); \
+ table = rcu_dereference(fd->cs_table);\
+ for (i = 0; i < table->nr_cap_states; i++) { \
+ if (cnt >= (ssize_t) (PAGE_SIZE / sizeof(char) \
+ - (EM_ATTR_LEN + 2))) \
+ goto out; \
+ cnt += scnprintf(&buf[cnt], EM_ATTR_LEN + 1, "%lu ", \
+ table->state[i]._attr); \
+ } \
+out: \
+ rcu_read_unlock(); \
+ cnt += sprintf(&buf[cnt], "\n"); \
+ return cnt; \
+}
+
+show_table_attr(power);
+show_table_attr(frequency);
+show_table_attr(capacity);
+
+static ssize_t show_cpus(struct em_freq_domain *fd, char *buf)
+{
+ return sprintf(buf, "%*pbl\n", cpumask_pr_args(&fd->cpus));
+}
+
+#define fd_attr(_name) em_fd_##_name##_attr
+#define define_fd_attr(_name) static struct em_fd_attr fd_attr(_name) = \
+ __ATTR(_name, 0444, show_##_name, NULL)
+
+define_fd_attr(power);
+define_fd_attr(frequency);
+define_fd_attr(capacity);
+define_fd_attr(cpus);
+
+static struct attribute *em_fd_default_attrs[] = {
+ &fd_attr(power).attr,
+ &fd_attr(frequency).attr,
+ &fd_attr(capacity).attr,
+ &fd_attr(cpus).attr,
+ NULL
+};
+
+#define to_fd(k) container_of(k, struct em_freq_domain, kobj)
+#define to_fd_attr(a) container_of(a, struct em_fd_attr, attr)
+
+static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf)
+{
+ struct em_freq_domain *fd = to_fd(kobj);
+ struct em_fd_attr *fd_attr = to_fd_attr(attr);
+ ssize_t ret;
+
+ ret = fd_attr->show(fd, buf);
+
+ return ret;
+}
+
+static const struct sysfs_ops em_fd_sysfs_ops = {
+ .show = show,
+};
+
+static struct kobj_type ktype_em_fd = {
+ .sysfs_ops = &em_fd_sysfs_ops,
+ .default_attrs = em_fd_default_attrs,
+};
+
static struct em_cs_table *alloc_cs_table(int nr_states)
{
struct em_cs_table *cs_table;
@@ -114,6 +194,11 @@ static struct em_freq_domain *em_create_fd(cpumask_t *span, int nr_states,
}
fd_update_cs_table(fd->cs_table, cpu);
+ ret = kobject_init_and_add(&fd->kobj, &ktype_em_fd, em_kobject, "fd%u",
+ cpu);
+ if (ret)
+ pr_warn("%*pbl: failed kobject init\n", cpumask_pr_args(span));
+
return fd;
free_cs_table:
@@ -221,6 +306,15 @@ int em_register_freq_domain(cpumask_t *span, unsigned int nr_states,
mutex_lock(&em_fd_mutex);
+ if (!em_kobject) {
+ em_kobject = kobject_create_and_add("energy_model",
+ &cpu_subsys.dev_root->kobj);
+ if (!em_kobject) {
+ ret = -ENODEV;
+ goto unlock;
+ }
+ }
+
/* Make sure we don't register again an existing domain. */
for_each_cpu(cpu, span) {
if (per_cpu(em_data, cpu)) {
--
2.17.0
In order to use EAS, the task scheduler has to know about the Energy
Model (EM) of the platform. This commit extends the scheduler topology
code to take references on the frequency domain objects of the EM
framework for all online CPUs. Hence, the availability of the EM for
those CPUs is guaranteed to the scheduler at runtime without further
checks in latency-sensitive code paths (i.e. task wake-up).
A (RCU-protected) private list of online frequency domains is maintained
by the scheduler to enable fast iterations. Furthermore, the availability
of an EM is notified to the rest of the scheduler with a static key,
which ensures a low impact on non-EAS systems.
Energy Aware Scheduling can be started if and only if:
1. all online CPUs are covered by the EM;
2. the EM complexity is low enough to keep scheduling overheads low;
3. the platform has an asymmetric CPU capacity topology (detected by
looking for the SD_ASYM_CPUCAPACITY flag in the sched_domain
hierarchy).
The sched_energy_enabled() function, which returns the status of the
static key, is stubbed to false when CONFIG_ENERGY_MODEL=n, hence making
sure that all the code behind it can be compiled out by constant
propagation.
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
kernel/sched/sched.h | 27 ++++++++++
kernel/sched/topology.c | 113 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 140 insertions(+)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ce562d3b7526..7c517076a74a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -63,6 +63,7 @@
#include <linux/syscalls.h>
#include <linux/task_work.h>
#include <linux/tsacct_kern.h>
+#include <linux/energy_model.h>
#include <asm/tlb.h>
@@ -2162,3 +2163,29 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)
return util;
}
#endif
+
+struct sched_energy_fd {
+ struct em_freq_domain *fd;
+ struct list_head next;
+ struct rcu_head rcu;
+};
+
+#ifdef CONFIG_ENERGY_MODEL
+extern struct static_key_false sched_energy_present;
+static inline bool sched_energy_enabled(void)
+{
+ return static_branch_unlikely(&sched_energy_present);
+}
+
+extern struct list_head sched_energy_fd_list;
+#define for_each_freq_domain(sfd) \
+ list_for_each_entry_rcu(sfd, &sched_energy_fd_list, next)
+#define freq_domain_span(sfd) (&((sfd)->fd->cpus))
+#else
+static inline bool sched_energy_enabled(void)
+{
+ return false;
+}
+#define for_each_freq_domain(sfd) for (sfd = NULL; sfd;)
+#define freq_domain_span(sfd) NULL
+#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 64cc564f5255..3e22c798f18d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1500,6 +1500,116 @@ void sched_domains_numa_masks_clear(unsigned int cpu)
#endif /* CONFIG_NUMA */
+#ifdef CONFIG_ENERGY_MODEL
+
+/*
+ * The complexity of the Energy Model is defined as the product of the number
+ * of frequency domains with the sum of the number of CPUs and the total
+ * number of OPPs in all frequency domains. It is generally not a good idea
+ * to use such a model on very complex platform because of the associated
+ * scheduling overheads. The arbitrary constraint below prevents that. It
+ * makes EAS usable up to 16 CPUs with per-CPU DVFS and less than 8 OPPs each,
+ * for example.
+ */
+#define EM_MAX_COMPLEXITY 2048
+
+DEFINE_STATIC_KEY_FALSE(sched_energy_present);
+LIST_HEAD(sched_energy_fd_list);
+
+static struct sched_energy_fd *find_sched_energy_fd(int cpu)
+{
+ struct sched_energy_fd *sfd;
+
+ for_each_freq_domain(sfd) {
+ if (cpumask_test_cpu(cpu, freq_domain_span(sfd)))
+ return sfd;
+ }
+
+ return NULL;
+}
+
+static void free_sched_energy_fd(struct rcu_head *rp)
+{
+ struct sched_energy_fd *sfd;
+
+ sfd = container_of(rp, struct sched_energy_fd, rcu);
+ kfree(sfd);
+}
+
+static void build_sched_energy(void)
+{
+ struct sched_energy_fd *sfd, *tmp;
+ struct em_freq_domain *fd;
+ struct sched_domain *sd;
+ int cpu, nr_fd = 0, nr_opp = 0;
+
+ rcu_read_lock();
+
+ /* Disable EAS entirely whenever the system isn't asymmetric. */
+ cpu = cpumask_first(cpu_online_mask);
+ sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);
+ if (!sd) {
+ pr_debug("%s: no SD_ASYM_CPUCAPACITY\n", __func__);
+ goto disable;
+ }
+
+ /* Make sure to have an energy model for all CPUs. */
+ for_each_online_cpu(cpu) {
+ /* Skip CPUs with a known energy model. */
+ sfd = find_sched_energy_fd(cpu);
+ if (sfd)
+ continue;
+
+ /* Add the energy model of others. */
+ fd = em_cpu_get(cpu);
+ if (!fd)
+ goto disable;
+ sfd = kzalloc(sizeof(*sfd), GFP_NOWAIT);
+ if (!sfd)
+ goto disable;
+ sfd->fd = fd;
+ list_add_rcu(&sfd->next, &sched_energy_fd_list);
+ }
+
+ list_for_each_entry_safe(sfd, tmp, &sched_energy_fd_list, next) {
+ if (cpumask_intersects(freq_domain_span(sfd),
+ cpu_online_mask)) {
+ nr_opp += em_fd_nr_cap_states(sfd->fd);
+ nr_fd++;
+ continue;
+ }
+
+ /* Remove the unused frequency domains */
+ list_del_rcu(&sfd->next);
+ call_rcu(&sfd->rcu, free_sched_energy_fd);
+ }
+
+ /* Bail out if the Energy Model complexity is too high. */
+ if (nr_fd * (nr_opp + num_online_cpus()) > EM_MAX_COMPLEXITY) {
+ pr_warn("%s: EM complexity too high, stopping EAS", __func__);
+ goto disable;
+ }
+
+ rcu_read_unlock();
+ static_branch_enable_cpuslocked(&sched_energy_present);
+ pr_debug("%s: EAS started\n", __func__);
+ return;
+
+disable:
+ rcu_read_unlock();
+ static_branch_disable_cpuslocked(&sched_energy_present);
+
+ /* Destroy the list */
+ list_for_each_entry_safe(sfd, tmp, &sched_energy_fd_list, next) {
+ list_del_rcu(&sfd->next);
+ call_rcu(&sfd->rcu, free_sched_energy_fd);
+ }
+ pr_debug("%s: EAS stopped\n", __func__);
+}
+#else
+static void build_sched_energy(void) { }
+#endif
+
static int __sdt_alloc(const struct cpumask *cpu_map)
{
struct sched_domain_topology_level *tl;
@@ -1913,6 +2023,9 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
;
}
+ /* Try to start sched energy. */
+ build_sched_energy();
+
/* Remember the new sched domains: */
if (doms_cur != &fallback_doms)
free_sched_domains(doms_cur, ndoms_cur);
--
2.17.0
Hi,
The main new feature of this patch-set is the Energy Model framework we
discussed at OSPM. The core of it and the APIs are all implemented in
patch 03/10, so this is really where I was hoping to get some feedback.
Is the overall idea of this framework reasonable? What do you think
about the interfaces? Is the new config option acceptable? And also,
is there any way I can make reviewing this patch easier?
Thanks!
Quentin
On 05/21/2018 04:24 PM, Quentin Perret wrote:
> Several subsystems in the kernel (scheduler and/or thermal at the time
> of writing) can benefit from knowing about the energy consumed by CPUs.
> Yet, this information can come from different sources (DT or firmware for
> example), in different formats, hence making it hard to exploit without
> a standard API.
>
> This patch attempts to solve this issue by introducing a centralized
> Energy Model (EM) framework which can be used to interface the data
> providers with the client subsystems. This framework standardizes the
> API to expose power costs, and to access them from multiple locations.
>
> The current design assumes that all CPUs in a frequency domain share the
> same micro-architecture. As such, the EM data is structured in a
> per-frequency-domain fashion. Drivers aware of frequency domains
> (typically, but not limited to, CPUFreq drivers) are expected to register
> data in the EM framework using the em_register_freq_domain() API. To do
> so, the drivers must provide a callback function that will be called by
> the EM framework to populate the tables. As of today, only the active
> power of the CPUs is considered. For each frequency domain, the EM
> includes a list of <frequency, power, capacity> tuples for the capacity
> states of the domain alongside a cpumask covering the involved CPUs.
>
> The EM framework also provides an API to re-scale the capacity values
> of the model asynchronously, after it has been created. This is required
> for architectures where the capacity scale factor of CPUs can change at
> run-time. This is the case for Arm/Arm64 for example where the
> arch_topology driver recomputes the capacity scale factors of the CPUs
> after the maximum frequency of all CPUs has been discovered. Although
> complex, the process of creating and re-scaling the EM has to be kept in
> two separate steps to fulfill the needs of the different users. The thermal
> subsystem doesn't use the capacity values and shouldn't have dependencies
> on subsystems providing them. On the other hand, the task scheduler needs
> the capacity values, and it will benefit from seeing them up-to-date when
> applicable.
>
> Because of this need for asynchronous update, the capacity state table
> of each frequency domain is protected by RCU, hence guaranteeing a safe
> modification of the table and a fast access to readers in latency-sensitive
> code paths.
>
> Cc: Peter Zijlstra <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Signed-off-by: Quentin Perret <[email protected]>
[...]
> +static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> +{
> + unsigned long cmax = arch_scale_cpu_capacity(NULL, cpu);
> + int max_cap_state = cs_table->nr_cap_states - 1;
> + unsigned long fmax = cs_table->state[max_cap_state].frequency;
> + int i;
> +
> + for (i = 0; i < cs_table->nr_cap_states; i++)
> + cs_table->state[i].capacity = cmax *
> + cs_table->state[i].frequency / fmax;
> +}
This has issues on a 32-bit system. cs_table->state[i].capacity (unsigned
long) overflows with the frequency values stored in Hz.
Maybe something like this to cure it:
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 6ad53f1cf7e6..c13b3eb8bf35 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -144,9 +144,11 @@ static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
unsigned long fmax = cs_table->state[max_cap_state].frequency;
int i;
- for (i = 0; i < cs_table->nr_cap_states; i++)
- cs_table->state[i].capacity = cmax *
- cs_table->state[i].frequency / fmax;
+ for (i = 0; i < cs_table->nr_cap_states; i++) {
+ u64 val = (u64)cmax * cs_table->state[i].frequency;
+ do_div(val, fmax);
+ cs_table->state[i].capacity = (unsigned long)val;
+ }
}
This brings me to another question. Let's say there are multiple users of
the Energy Model in the system. Shouldn't the units of frequency and power
be standardized, maybe MHz and mW?
The task scheduler doesn't care since it is only interested in power diffs,
but other users might.
[...]
Hi Dietmar,
On Wednesday 06 Jun 2018 at 15:12:15 (+0200), Dietmar Eggemann wrote:
> > +static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> > +{
> > + unsigned long cmax = arch_scale_cpu_capacity(NULL, cpu);
> > + int max_cap_state = cs_table->nr_cap_states - 1;
> > + unsigned long fmax = cs_table->state[max_cap_state].frequency;
> > + int i;
> > +
> > + for (i = 0; i < cs_table->nr_cap_states; i++)
> > + cs_table->state[i].capacity = cmax *
> > + cs_table->state[i].frequency / fmax;
> > +}
>
> This has issues on a 32bit system. cs_table->state[i].capacity (unsigned
> long) overflows with the frequency values stored in Hz.
Ah, thank you very much for pointing this out! I haven't tried on a 32-bit
machine yet, my bad. I'll fix that for v4.
>
> Maybe something like this to cure it:
>
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 6ad53f1cf7e6..c13b3eb8bf35 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -144,9 +144,11 @@ static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> unsigned long fmax = cs_table->state[max_cap_state].frequency;
> int i;
>
> - for (i = 0; i < cs_table->nr_cap_states; i++)
> - cs_table->state[i].capacity = cmax *
> - cs_table->state[i].frequency / fmax;
> + for (i = 0; i < cs_table->nr_cap_states; i++) {
> + u64 val = (u64)cmax * cs_table->state[i].frequency;
> + do_div(val, fmax);
> + cs_table->state[i].capacity = (unsigned long)val;
> + }
> }
Hmmm yes, that should work.
>
> This brings me to another question. Let's say there are multiple users of
> the Energy Model in the system. Shouldn't the units of frequency and power
> not standardized, maybe Mhz and mW?
> The task scheduler doesn't care since it is only interested in power diffs
> but other user might do.
So the good thing about specifying units is that we can probably assume
ranges on the values. If the power is in mW, assuming that we're talking
about a single CPU, it'll probably fit in 16 bits. 65 W/core should be
a reasonable upper bound?
But there are also vendors who might not be happy with disclosing absolute
values ... These are sometimes considered sensitive and only relative
numbers are discussed publicly. Now, you can also argue that we already
have units specified in IPA for example, and that it doesn't really matter if
a driver "lies" about the real value, as long as the ratios are correct.
And I guess that anyone can do measurement on the hardware and get those
values anyway. So specifying a unit (mW) for the power is probably a
good idea.
For the frequency, I would rather keep it consistent with what CPUFreq
manages. But that should be specified somewhere, I agree.
What do you think ?
Thanks !
Quentin
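For reference, the failure mode is easy to reproduce outside the kernel. A
minimal sketch, assuming a 32-bit `unsigned long` (emulated here with
`uint32_t`, since frequencies are stored in Hz) and with plain 64-bit division
standing in for `do_div()`:

```c
#include <stdint.h>

/* Emulate the 32-bit case: with a 32-bit unsigned long, the intermediate
 * product cmax * freq wraps at 2^32 before the division happens. */
static uint32_t scale_cap_32(uint32_t cmax, uint32_t freq, uint32_t fmax)
{
	return cmax * freq / fmax;	/* product silently truncated */
}

/* The proposed fix: widen the intermediate to 64 bits before dividing. */
static uint32_t scale_cap_64(uint32_t cmax, uint32_t freq, uint32_t fmax)
{
	uint64_t val = (uint64_t)cmax * freq;

	return (uint32_t)(val / fmax);	/* do_div(val, fmax) in the kernel */
}
```

With cmax = 1024, freq = 1 GHz and fmax = 2 GHz, the 32-bit version truncates
the 1.024e12 intermediate product and returns a bogus capacity, while the
widened version returns the expected 512. On 64-bit kernels `unsigned long`
is already 64 bits wide, which is why the problem only shows up on 32-bit
builds.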
On 06/06/18 15:37, Quentin Perret wrote:
> Hi Dietmar,
>
> On Wednesday 06 Jun 2018 at 15:12:15 (+0200), Dietmar Eggemann wrote:
> > > +static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> > > +{
> > > + unsigned long cmax = arch_scale_cpu_capacity(NULL, cpu);
> > > + int max_cap_state = cs_table->nr_cap_states - 1;
> > > + unsigned long fmax = cs_table->state[max_cap_state].frequency;
> > > + int i;
> > > +
> > > + for (i = 0; i < cs_table->nr_cap_states; i++)
> > > + cs_table->state[i].capacity = cmax *
> > > + cs_table->state[i].frequency / fmax;
> > > +}
> >
> > This has issues on a 32bit system. cs_table->state[i].capacity (unsigned
> > long) overflows with the frequency values stored in Hz.
>
> Ah, thank you very much for pointing this out ! I haven't tried on a 32bit
> machine yet, my bad. I'll fix that for v4.
>
> >
> > Maybe something like this to cure it:
> >
> > diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> > index 6ad53f1cf7e6..c13b3eb8bf35 100644
> > --- a/kernel/power/energy_model.c
> > +++ b/kernel/power/energy_model.c
> > @@ -144,9 +144,11 @@ static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> > unsigned long fmax = cs_table->state[max_cap_state].frequency;
> > int i;
> >
> > - for (i = 0; i < cs_table->nr_cap_states; i++)
> > - cs_table->state[i].capacity = cmax *
> > - cs_table->state[i].frequency / fmax;
> > + for (i = 0; i < cs_table->nr_cap_states; i++) {
> > + u64 val = (u64)cmax * cs_table->state[i].frequency;
> > + do_div(val, fmax);
> > + cs_table->state[i].capacity = (unsigned long)val;
> > + }
> > }
>
> Hmmm yes, that should work.
>
> >
> > This brings me to another question. Let's say there are multiple users of
> > the Energy Model in the system. Shouldn't the units of frequency and power
> > be standardized, maybe MHz and mW?
> > The task scheduler doesn't care since it is only interested in power diffs
> > but other users might.
>
> So the good thing about specifying units is that we can probably assume
> ranges on the values. If the power is in mW, assuming that we're talking
> about a single CPU, it'll probably fit in 16 bits. 65W/core should be
> a reasonable upper-bound ?
> But there are also vendors who might not be happy with disclosing absolute
> values ... These are sometimes considered sensitive and only relative
> numbers are discussed publicly. Now, you can also argue that we already
> have units specified in IPA for ex, and that it doesn't really matter if
> a driver "lies" about the real value, as long as the ratios are correct.
> And I guess that anyone can do measurement on the hardware and get those
> values anyway. So specifying a unit (mW) for the power is probably a
> good idea.
Mmm, I remember we fought quite a bit while getting capacity-dmips-mhz
binding accepted, and one of the musts was that the values were going to
be normalized. So, normalized power values again maybe?
Best,
- Juri
On Wednesday 06 Jun 2018 at 17:20:00 (+0200), Juri Lelli wrote:
> > > This brings me to another question. Let's say there are multiple users of
> > > the Energy Model in the system. Shouldn't the units of frequency and power
> > > be standardized, maybe MHz and mW?
> > > The task scheduler doesn't care since it is only interested in power diffs
> > > but other users might.
> >
> > So the good thing about specifying units is that we can probably assume
> > ranges on the values. If the power is in mW, assuming that we're talking
> > about a single CPU, it'll probably fit in 16 bits. 65W/core should be
> > a reasonable upper-bound ?
> > But there are also vendors who might not be happy with disclosing absolute
> > values ... These are sometimes considered sensitive and only relative
> > numbers are discussed publicly. Now, you can also argue that we already
> > have units specified in IPA for ex, and that it doesn't really matter if
> > a driver "lies" about the real value, as long as the ratios are correct.
> > And I guess that anyone can do measurement on the hardware and get those
> > values anyway. So specifying a unit (mW) for the power is probably a
> > good idea.
>
> Mmm, I remember we fought quite a bit while getting capacity-dmips-mhz
> binding accepted, and one of the musts was that the values were going to
> be normalized. So, normalized power values again maybe?
Hmmm, that's a very good point ... There should be no problems on the
scheduler side -- we're only interested in correct ratios. But I'm not
sure on the thermal side ... I will double check that.
Javi, Viresh, Eduardo: any thoughts about this ?
Thanks !
Quentin
On Wednesday 06 Jun 2018 at 16:29:50 (+0100), Quentin Perret wrote:
> On Wednesday 06 Jun 2018 at 17:20:00 (+0200), Juri Lelli wrote:
> > > > This brings me to another question. Let's say there are multiple users of
> > > > the Energy Model in the system. Shouldn't the units of frequency and power
> > > > be standardized, maybe MHz and mW?
> > > > The task scheduler doesn't care since it is only interested in power diffs
> > > > but other users might.
> > >
> > > So the good thing about specifying units is that we can probably assume
> > > ranges on the values. If the power is in mW, assuming that we're talking
> > > about a single CPU, it'll probably fit in 16 bits. 65W/core should be
> > > a reasonable upper-bound ?
> > > But there are also vendors who might not be happy with disclosing absolute
> > > values ... These are sometimes considered sensitive and only relative
> > > numbers are discussed publicly. Now, you can also argue that we already
> > > have units specified in IPA for ex, and that it doesn't really matter if
> > > a driver "lies" about the real value, as long as the ratios are correct.
> > > And I guess that anyone can do measurement on the hardware and get those
> > > values anyway. So specifying a unit (mW) for the power is probably a
> > > good idea.
> >
> > Mmm, I remember we fought quite a bit while getting capacity-dmips-mhz
> > binding accepted, and one of the musts was that the values were going to
> > be normalized. So, normalized power values again maybe?
>
> Hmmm, that's a very good point ... There should be no problems on the
> scheduler side -- we're only interested in correct ratios. But I'm not
> sure on the thermal side ... I will double check that.
So, IPA needs to compare the power of the CPUs with the power of other
things (e.g. GPUs). So we can't normalize the power of the CPUs without
normalizing in the same scale the power of the other devices. I see two
possibilities:
1) we don't normalize the CPU power values, we specify them in mW, and
we document (and maybe throw a warning if we see an issue at runtime)
the max range of values. The max expected power for a single core
could be 65K for ex (16bits). And based on that we can verify
overflow and precision issues in the algorithms, and we keep it easy
to compare the CPU power numbers with other devices.
2) we normalize the power values, but that means that the EM framework
has to manage not only CPUs, but also other types of devices, and
normalize their power values as well. That's required to keep the
scale consistent across all of them, and keep comparisons doable.
But if we do this, we still have to keep a normalized and a "raw"
version of the power for all devices. And the "raw" power must still
be in the same unit across all devices, otherwise the re-scaling is
broken. The main benefit of doing this is that the range of
acceptable "raw" power values can be larger, probably 32bits, and
that the precision of the normalized range is arbitrary.
I feel like 2) involves a lot of complexity, and not so many benefits,
so I'd be happy to go with 1). Unless I forgot something ?
Thanks,
Quentin
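A hypothetical sketch of what option 2) implies (the helper name and the
0..1024 scale are made up for illustration): every device must report its
"raw" power in one common unit before it can be projected onto the shared
normalized scale.

```c
/* Hypothetical: project a raw power value (common unit across all
 * devices, e.g. mW) onto a shared 0..1024 normalized scale. The scale
 * only stays comparable across devices if max_raw covers them all and
 * every raw value uses the same unit. */
static unsigned long em_normalize_power(unsigned long raw, unsigned long max_raw)
{
	return (1024UL * raw) / max_raw;
}
```

If one driver reported raw power in mW and another in uW, the two normalized
values would no longer be comparable across devices, which is exactly the
re-scaling breakage described in option 2).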
On Wednesday 06 Jun 2018 at 18:47:34 (+0200), Juri Lelli wrote:
> Hi Quentin,
>
> On 21/05/18 15:24, Quentin Perret wrote:
>
> [...]
>
> > +#ifdef CONFIG_ENERGY_MODEL
>
> [...]
>
> > +struct em_data_callback {
> > + /**
> > + * active_power() - Provide power at the next capacity state of a CPU
> > + * @power : Active power at the capacity state (modified)
> > + * @freq : Frequency at the capacity state (modified)
> > + * @cpu : CPU for which we do this operation
> > + *
> > + * active_power() must find the lowest capacity state of 'cpu' above
> > + * 'freq' and update 'power' and 'freq' to the matching active power
> > + * and frequency.
> > + *
> > + * Return 0 on success.
> > + */
> > + int (*active_power) (unsigned long *power, unsigned long *freq, int cpu);
> > +};
>
> [...]
>
> > +#else
> > +struct em_freq_domain;
> > +struct em_data_callback;
>
> This doesn't compile for CONFIG_ENERGY_MODEL=n:
>
> drivers/cpufreq/cpufreq-dt.c:164:9: error: variable 'em_cb' has initializer but incomplete type
> struct em_data_callback em_cb = { &dev_pm_opp_of_estimate_power };
> ^~~~~~~~~~~~~~~~
Ah, this is an issue with a patch which isn't part of this patch-set ...
But yeah, that's a valid point :-)
> Guess you need some sort of macro to init the struct (w/o introducing
> ifdeffery into drivers code)?
Yes, the EM framework should provide a generic way of doing this
initialization with nice stubs for CONFIG_ENERGY_MODEL=n, otherwise we
risk wild fixes in the drivers I suppose. I'll fix that in v4.
Thanks for pointing that out !
Quentin
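One possible shape for such stubs (a sketch only; the `EM_DATA_CB` macro name
and the stub layout are assumptions until v4 exists): the framework header
provides a single initializer macro with a matching no-op variant for
CONFIG_ENERGY_MODEL=n, so driver code compiles unchanged in both
configurations.

```c
/* Hypothetical header fragment; CONFIG_ENERGY_MODEL stands in for the
 * real Kconfig symbol. Flip it to 0 to exercise the stub path. */
#define CONFIG_ENERGY_MODEL 1

#if CONFIG_ENERGY_MODEL
struct em_data_callback {
	int (*active_power)(unsigned long *power, unsigned long *freq, int cpu);
};
#define EM_DATA_CB(_cb) { .active_power = &_cb }
#else
struct em_data_callback { int unused; };
#define EM_DATA_CB(_cb) { 0 }	/* stub: keeps the type complete */
#endif

/* Toy driver callback, for illustration only. */
static int toy_active_power(unsigned long *power, unsigned long *freq, int cpu)
{
	(void)cpu;
	*power = 100;		/* mW */
	*freq = 1000000;	/* kHz */
	return 0;
}

/* Driver-side initialization is identical in both configurations,
 * with no #ifdef in the driver itself. */
static struct em_data_callback em_cb = EM_DATA_CB(toy_active_power);
```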
Hi Quentin,
On 21/05/18 15:24, Quentin Perret wrote:
[...]
> +#ifdef CONFIG_ENERGY_MODEL
[...]
> +struct em_data_callback {
> + /**
> + * active_power() - Provide power at the next capacity state of a CPU
> + * @power : Active power at the capacity state (modified)
> + * @freq : Frequency at the capacity state (modified)
> + * @cpu : CPU for which we do this operation
> + *
> + * active_power() must find the lowest capacity state of 'cpu' above
> + * 'freq' and update 'power' and 'freq' to the matching active power
> + * and frequency.
> + *
> + * Return 0 on success.
> + */
> + int (*active_power) (unsigned long *power, unsigned long *freq, int cpu);
> +};
[...]
> +#else
> +struct em_freq_domain;
> +struct em_data_callback;
This doesn't compile for CONFIG_ENERGY_MODEL=n:
drivers/cpufreq/cpufreq-dt.c:164:9: error: variable 'em_cb' has initializer but incomplete type
struct em_data_callback em_cb = { &dev_pm_opp_of_estimate_power };
^~~~~~~~~~~~~~~~
Guess you need some sort of macro to init the struct (w/o introducing
ifdeffery into drivers code)?
Best,
- Juri
On 21/05/18 15:24, Quentin Perret wrote:
> Several subsystems in the kernel (scheduler and/or thermal at the time
> of writing) can benefit from knowing about the energy consumed by CPUs.
> Yet, this information can come from different sources (DT or firmware for
> example), in different formats, hence making it hard to exploit without
> a standard API.
>
> This patch attempts to solve this issue by introducing a centralized
> Energy Model (EM) framework which can be used to interface the data
> providers with the client subsystems. This framework standardizes the
> API to expose power costs, and to access them from multiple locations.
>
> The current design assumes that all CPUs in a frequency domain share the
> same micro-architecture. As such, the EM data is structured in a
> per-frequency-domain fashion. Drivers aware of frequency domains
> (typically, but not limited to, CPUFreq drivers) are expected to register
> data in the EM framework using the em_register_freq_domain() API. To do
> so, the drivers must provide a callback function that will be called by
> the EM framework to populate the tables. As of today, only the active
> power of the CPUs is considered. For each frequency domain, the EM
> includes a list of <frequency, power, capacity> tuples for the capacity
> states of the domain alongside a cpumask covering the involved CPUs.
>
> The EM framework also provides an API to re-scale the capacity values
> of the model asynchronously, after it has been created. This is required
> for architectures where the capacity scale factor of CPUs can change at
> run-time. This is the case for Arm/Arm64 for example where the
> arch_topology driver recomputes the capacity scale factors of the CPUs
> after the maximum frequency of all CPUs has been discovered. Although
> complex, the process of creating and re-scaling the EM has to be kept in
> two separate steps to fulfill the needs of the different users. The thermal
> subsystem doesn't use the capacity values and shouldn't have dependencies
> on subsystems providing them. On the other hand, the task scheduler needs
> the capacity values, and it will benefit from seeing them up-to-date when
> applicable.
>
> Because of this need for asynchronous update, the capacity state table
> of each frequency domain is protected by RCU, hence guaranteeing a safe
> modification of the table and a fast access to readers in latency-sensitive
> code paths.
>
> Cc: Peter Zijlstra <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
OK, I think I'll start with a few comments while I get more into
understanding the set. :)
> +static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> +{
> + unsigned long cmax = arch_scale_cpu_capacity(NULL, cpu);
> + int max_cap_state = cs_table->nr_cap_states - 1;
^
You don't need this on the stack, right?
> + unsigned long fmax = cs_table->state[max_cap_state].frequency;
> + int i;
> +
> + for (i = 0; i < cs_table->nr_cap_states; i++)
> + cs_table->state[i].capacity = cmax *
> + cs_table->state[i].frequency / fmax;
> +}
> +
> +static struct em_freq_domain *em_create_fd(cpumask_t *span, int nr_states,
> + struct em_data_callback *cb)
> +{
> + unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
> + int i, ret, cpu = cpumask_first(span);
> + struct em_freq_domain *fd;
> + unsigned long power, freq;
> +
> + if (!cb->active_power)
> + return NULL;
> +
> + fd = kzalloc(sizeof(*fd), GFP_KERNEL);
> + if (!fd)
> + return NULL;
> +
> + fd->cs_table = alloc_cs_table(nr_states);
Mmm, don't you need to rcu_assign_pointer this first one as well?
> + if (!fd->cs_table)
> + goto free_fd;
> +
> + /* Copy the span of the frequency domain */
> + cpumask_copy(&fd->cpus, span);
> +
> + /* Build the list of capacity states for this freq domain */
> + for (i = 0, freq = 0; i < nr_states; i++, freq++) {
^ ^
The fact that this relies on active_power() to use ceil OPP for a given
freq might deserve a comment. Also, is this behaviour of active_power()
standardized?
> + ret = cb->active_power(&power, &freq, cpu);
> + if (ret)
> + goto free_cs_table;
> +
> + fd->cs_table->state[i].power = power;
> + fd->cs_table->state[i].frequency = freq;
> +
> + /*
> + * The hertz/watts efficiency ratio should decrease as the
> + * frequency grows on sane platforms. If not, warn the user
> + * that some high OPPs are more power efficient than some
> + * of the lower ones.
> + */
> + opp_eff = freq / power;
> + if (opp_eff >= prev_opp_eff)
> + pr_warn("%*pbl: hz/watt efficiency: OPP %d >= OPP%d\n",
> + cpumask_pr_args(span), i, i - 1);
> + prev_opp_eff = opp_eff;
> + }
> + fd_update_cs_table(fd->cs_table, cpu);
> +
> + return fd;
> +
> +free_cs_table:
> + free_cs_table(fd->cs_table);
> +free_fd:
> + kfree(fd);
> +
> + return NULL;
> +}
> +
> +static void rcu_free_cs_table(struct rcu_head *rp)
> +{
> + struct em_cs_table *table;
> +
> + table = container_of(rp, struct em_cs_table, rcu);
> + free_cs_table(table);
> +}
> +
> +/**
> + * em_rescale_cpu_capacity() - Re-scale capacity values of the Energy Model
> + *
> + * This re-scales the capacity values for all capacity states of all frequency
> + * domains of the Energy Model. This should be used when the capacity values
> + * of the CPUs are updated at run-time, after the EM was registered.
> + */
> +void em_rescale_cpu_capacity(void)
So, is this thought to be called eventually also after thermal capping
events and such?
> +{
> + struct em_cs_table *old_table, *new_table;
> + struct em_freq_domain *fd;
> + unsigned long flags;
> + int nr_states, cpu;
> +
> + read_lock_irqsave(&em_data_lock, flags);
Don't you need write_lock_ here, since you are going to exchange the
em tables?
> + for_each_cpu(cpu, cpu_possible_mask) {
> + fd = per_cpu(em_data, cpu);
> + if (!fd || cpu != cpumask_first(&fd->cpus))
> + continue;
> +
> + /* Copy the existing table. */
> + old_table = rcu_dereference(fd->cs_table);
> + nr_states = old_table->nr_cap_states;
> + new_table = alloc_cs_table(nr_states);
> + if (!new_table) {
> + read_unlock_irqrestore(&em_data_lock, flags);
> + return;
> + }
> + memcpy(new_table->state, old_table->state,
> + nr_states * sizeof(*new_table->state));
> +
> + /* Re-scale the capacity values on the copy. */
> + fd_update_cs_table(new_table, cpumask_first(&fd->cpus));
> +
> + /* Replace the table with the rescaled version. */
> + rcu_assign_pointer(fd->cs_table, new_table);
> + call_rcu(&old_table->rcu, rcu_free_cs_table);
> + }
> + read_unlock_irqrestore(&em_data_lock, flags);
> + pr_debug("Re-scaled CPU capacities\n");
> +}
> +EXPORT_SYMBOL_GPL(em_rescale_cpu_capacity);
> +
> +/**
> + * em_cpu_get() - Return the frequency domain for a CPU
> + * @cpu : CPU to find the frequency domain for
> + *
> + * Return: the frequency domain to which 'cpu' belongs, or NULL if it doesn't
> + * exist.
> + */
> +struct em_freq_domain *em_cpu_get(int cpu)
> +{
> + struct em_freq_domain *fd;
> + unsigned long flags;
> +
> + read_lock_irqsave(&em_data_lock, flags);
> + fd = per_cpu(em_data, cpu);
> + read_unlock_irqrestore(&em_data_lock, flags);
> +
> + return fd;
> +}
> +EXPORT_SYMBOL_GPL(em_cpu_get);
Mmm, this gets complicated pretty fast eh? :)
I had to go back and forth between patches to start understanding the
different data structures and how they are used, and I'm not sure yet
I've got the full picture. I guess some nice diagram (cover letter or
documentation patch) would help a lot.
Locking of such data structures is pretty involved as well, adding
comments/docs shouldn't harm. :)
Best,
- Juri
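To illustrate the copy-then-publish scheme the patch uses for the capacity
state tables, here is a simplified user-space sketch (illustrative only: a
direct pointer store and free() stand in for rcu_assign_pointer() and
call_rcu(), and the structures are trimmed down):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct em_cap_state { unsigned long frequency, power, capacity; };
struct em_cs_table { int nr_cap_states; struct em_cap_state state[]; };

/* Rebuild the capacity column on a private copy, then publish it with a
 * single pointer store. Readers keep dereferencing the old table until
 * the swap; in the kernel, call_rcu() would defer freeing it until all
 * pre-existing readers are done. */
static int rescale_table(struct em_cs_table **tablep, unsigned long cmax)
{
	struct em_cs_table *cur = *tablep, *dst;
	unsigned long fmax = cur->state[cur->nr_cap_states - 1].frequency;
	size_t sz = sizeof(*cur) + cur->nr_cap_states * sizeof(cur->state[0]);
	int i;

	dst = malloc(sz);
	if (!dst)
		return -1;
	memcpy(dst, cur, sz);

	for (i = 0; i < dst->nr_cap_states; i++)
		dst->state[i].capacity = (unsigned long)
			((uint64_t)cmax * dst->state[i].frequency / fmax);

	*tablep = dst;	/* rcu_assign_pointer() in the kernel */
	free(cur);	/* deferred via call_rcu() in the kernel */
	return 0;
}
```

The key property is that readers never see a half-rescaled table: they either
observe the old pointer or the fully built new one.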
Hi,
On 21/05/18 15:25, Quentin Perret wrote:
> In order to use EAS, the task scheduler has to know about the Energy
> Model (EM) of the platform. This commit extends the scheduler topology
> code to take references on the frequency domains objects of the EM
> framework for all online CPUs. Hence, the availability of the EM for
> those CPUs is guaranteed to the scheduler at runtime without further
> checks in latency sensitive code paths (i.e. task wake-up).
>
> A (RCU-protected) private list of online frequency domains is maintained
> by the scheduler to enable fast iterations. Furthermore, the availability
> of an EM is notified to the rest of the scheduler with a static key,
> which ensures a low impact on non-EAS systems.
>
> Energy Aware Scheduling can be started if and only if:
> 1. all online CPUs are covered by the EM;
> 2. the EM complexity is low enough to keep scheduling overheads low;
> 3. the platform has an asymmetric CPU capacity topology (detected by
> looking for the SD_ASYM_CPUCAPACITY flag in the sched_domain
> hierarchy).
Not sure about this. What about multi-freq-domain systems with the same max
capacity? I understand that most of the energy savings come from selecting
the right (big/LITTLE) cluster, but EM should still be useful to drive
OPP selection (that was one of the use-cases we discussed lately IIRC)
and also to decide between packing or spreading, no?
> The sched_energy_enabled() function which returns the status of the
> static key is stubbed to false when CONFIG_ENERGY_MODEL=n, hence making
> sure that all the code behind it can be compiled out by constant
> propagation.
Actually, do we need a config option at all? Shouldn't the static key
(and RCU machinery) guard against unwanted overheads when EM is not
present/used?
I was thinking it should be pretty similar to schedutil setup, no?
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> kernel/sched/sched.h | 27 ++++++++++
> kernel/sched/topology.c | 113 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 140 insertions(+)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ce562d3b7526..7c517076a74a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -63,6 +63,7 @@
> #include <linux/syscalls.h>
> #include <linux/task_work.h>
> #include <linux/tsacct_kern.h>
> +#include <linux/energy_model.h>
>
> #include <asm/tlb.h>
>
> @@ -2162,3 +2163,29 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)
> return util;
> }
> #endif
> +
> +struct sched_energy_fd {
> + struct em_freq_domain *fd;
> + struct list_head next;
> + struct rcu_head rcu;
> +};
> +
> +#ifdef CONFIG_ENERGY_MODEL
> +extern struct static_key_false sched_energy_present;
> +static inline bool sched_energy_enabled(void)
> +{
> + return static_branch_unlikely(&sched_energy_present);
> +}
> +
> +extern struct list_head sched_energy_fd_list;
> +#define for_each_freq_domain(sfd) \
> + list_for_each_entry_rcu(sfd, &sched_energy_fd_list, next)
> +#define freq_domain_span(sfd) (&((sfd)->fd->cpus))
> +#else
> +static inline bool sched_energy_enabled(void)
> +{
> + return false;
> +}
> +#define for_each_freq_domain(sfd) for (sfd = NULL; sfd;)
> +#define freq_domain_span(sfd) NULL
> +#endif
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 64cc564f5255..3e22c798f18d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1500,6 +1500,116 @@ void sched_domains_numa_masks_clear(unsigned int cpu)
>
> #endif /* CONFIG_NUMA */
>
> +#ifdef CONFIG_ENERGY_MODEL
> +
> +/*
> + * The complexity of the Energy Model is defined as the product of the number
> + * of frequency domains with the sum of the number of CPUs and the total
> + * number of OPPs in all frequency domains. It is generally not a good idea
> + * to use such a model on very complex platform because of the associated
> + * scheduling overheads. The arbitrary constraint below prevents that. It
> + * makes EAS usable up to 16 CPUs with per-CPU DVFS and less than 8 OPPs each,
> + * for example.
> + */
> +#define EM_MAX_COMPLEXITY 2048
Do we really need this hardcoded constant?
I guess if one spent time deriving an EM for a big system with lots of
OPPs, she/he already knows what she/he is doing? :)
> +
> +DEFINE_STATIC_KEY_FALSE(sched_energy_present);
> +LIST_HEAD(sched_energy_fd_list);
> +
> +static struct sched_energy_fd *find_sched_energy_fd(int cpu)
> +{
> + struct sched_energy_fd *sfd;
> +
> + for_each_freq_domain(sfd) {
> + if (cpumask_test_cpu(cpu, freq_domain_span(sfd)))
> + return sfd;
> + }
> +
> + return NULL;
> +}
> +
> +static void free_sched_energy_fd(struct rcu_head *rp)
> +{
> + struct sched_energy_fd *sfd;
> +
> + sfd = container_of(rp, struct sched_energy_fd, rcu);
> + kfree(sfd);
> +}
> +
> +static void build_sched_energy(void)
> +{
> + struct sched_energy_fd *sfd, *tmp;
> + struct em_freq_domain *fd;
> + struct sched_domain *sd;
> + int cpu, nr_fd = 0, nr_opp = 0;
> +
> + rcu_read_lock();
> +
> + /* Disable EAS entirely whenever the system isn't asymmetric. */
> + cpu = cpumask_first(cpu_online_mask);
> + sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);
> + if (!sd) {
> + pr_debug("%s: no SD_ASYM_CPUCAPACITY\n", __func__);
> + goto disable;
> + }
> +
> + /* Make sure to have an energy model for all CPUs. */
> + for_each_online_cpu(cpu) {
> + /* Skip CPUs with a known energy model. */
> + sfd = find_sched_energy_fd(cpu);
> + if (sfd)
> + continue;
> +
> + /* Add the energy model of others. */
> + fd = em_cpu_get(cpu);
> + if (!fd)
> + goto disable;
> + sfd = kzalloc(sizeof(*sfd), GFP_NOWAIT);
> + if (!sfd)
> + goto disable;
> + sfd->fd = fd;
> + list_add_rcu(&sfd->next, &sched_energy_fd_list);
> + }
> +
> + list_for_each_entry_safe(sfd, tmp, &sched_energy_fd_list, next) {
> + if (cpumask_intersects(freq_domain_span(sfd),
> + cpu_online_mask)) {
> + nr_opp += em_fd_nr_cap_states(sfd->fd);
> + nr_fd++;
> + continue;
> + }
> +
> + /* Remove the unused frequency domains */
> + list_del_rcu(&sfd->next);
> + call_rcu(&sfd->rcu, free_sched_energy_fd);
Unused because of? Hotplug?
Not sure, but I guess you have considered the idea of tearing all this
down when sched domains are destroyed and then rebuilding it again? Why
did you decide for this approach? Or maybe I just missed where you do
that. :/
> + }
> +
> + /* Bail out if the Energy Model complexity is too high. */
> + if (nr_fd * (nr_opp + num_online_cpus()) > EM_MAX_COMPLEXITY) {
> + pr_warn("%s: EM complexity too high, stopping EAS", __func__);
> + goto disable;
> + }
> +
> + rcu_read_unlock();
> + static_branch_enable_cpuslocked(&sched_energy_present);
> + pr_debug("%s: EAS started\n", __func__);
I'd vote for a pr_info here instead, maybe printing info about the em as
well. Looks pretty useful to me to have that in dmesg. Maybe guarded by
sched_debug?
Best,
- Juri
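For what it's worth, the complexity bound from the patch is easy to
sanity-check in isolation (the helper name is made up for illustration):

```c
#define EM_MAX_COMPLEXITY 2048

/* Mirrors the check in build_sched_energy(): the EM is rejected when
 * nr_fd * (nr_opp + nr_cpus) exceeds EM_MAX_COMPLEXITY, where nr_opp is
 * the total OPP count summed over all frequency domains. */
static int em_complexity_ok(int nr_fd, int nr_opp, int nr_cpus)
{
	return nr_fd * (nr_opp + nr_cpus) <= EM_MAX_COMPLEXITY;
}
```

With 16 CPUs and per-CPU DVFS, 7 OPPs per CPU gives 16 * (112 + 16) = 2048
(accepted) while 8 OPPs per CPU gives 16 * (128 + 16) = 2304 (rejected),
which matches the "up to 16 CPUs with per-CPU DVFS and less than 8 OPPs each"
note in the patch comment.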
Hi Juri,
On Thursday 07 Jun 2018 at 16:44:09 (+0200), Juri Lelli wrote:
> On 21/05/18 15:24, Quentin Perret wrote:
> > Several subsystems in the kernel (scheduler and/or thermal at the time
> > of writing) can benefit from knowing about the energy consumed by CPUs.
> > Yet, this information can come from different sources (DT or firmware for
> > example), in different formats, hence making it hard to exploit without
> > a standard API.
> >
> > This patch attempts to solve this issue by introducing a centralized
> > Energy Model (EM) framework which can be used to interface the data
> > providers with the client subsystems. This framework standardizes the
> > API to expose power costs, and to access them from multiple locations.
> >
> > The current design assumes that all CPUs in a frequency domain share the
> > same micro-architecture. As such, the EM data is structured in a
> > per-frequency-domain fashion. Drivers aware of frequency domains
> > (typically, but not limited to, CPUFreq drivers) are expected to register
> > data in the EM framework using the em_register_freq_domain() API. To do
> > so, the drivers must provide a callback function that will be called by
> > the EM framework to populate the tables. As of today, only the active
> > power of the CPUs is considered. For each frequency domain, the EM
> > includes a list of <frequency, power, capacity> tuples for the capacity
> > states of the domain alongside a cpumask covering the involved CPUs.
> >
> > The EM framework also provides an API to re-scale the capacity values
> > of the model asynchronously, after it has been created. This is required
> > for architectures where the capacity scale factor of CPUs can change at
> > run-time. This is the case for Arm/Arm64 for example where the
> > arch_topology driver recomputes the capacity scale factors of the CPUs
> > after the maximum frequency of all CPUs has been discovered. Although
> > complex, the process of creating and re-scaling the EM has to be kept in
> > two separate steps to fulfill the needs of the different users. The thermal
> > subsystem doesn't use the capacity values and shouldn't have dependencies
> > on subsystems providing them. On the other hand, the task scheduler needs
> > the capacity values, and it will benefit from seeing them up-to-date when
> > applicable.
> >
> > Because of this need for asynchronous update, the capacity state table
> > of each frequency domain is protected by RCU, hence guaranteeing a safe
> > modification of the table and a fast access to readers in latency-sensitive
> > code paths.
> >
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: "Rafael J. Wysocki" <[email protected]>
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
>
> OK, I think I'll start with a few comments while I get more into
> understanding the set. :)
:-)
>
> > +static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> > +{
> > + unsigned long cmax = arch_scale_cpu_capacity(NULL, cpu);
> > + int max_cap_state = cs_table->nr_cap_states - 1;
> ^
> You don't need this on the stack, right?
Oh, why not ?
>
> > + unsigned long fmax = cs_table->state[max_cap_state].frequency;
> > + int i;
> > +
> > + for (i = 0; i < cs_table->nr_cap_states; i++)
> > + cs_table->state[i].capacity = cmax *
> > + cs_table->state[i].frequency / fmax;
> > +}
> > +
> > +static struct em_freq_domain *em_create_fd(cpumask_t *span, int nr_states,
> > + struct em_data_callback *cb)
> > +{
> > + unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
> > + int i, ret, cpu = cpumask_first(span);
> > + struct em_freq_domain *fd;
> > + unsigned long power, freq;
> > +
> > + if (!cb->active_power)
> > + return NULL;
> > +
> > + fd = kzalloc(sizeof(*fd), GFP_KERNEL);
> > + if (!fd)
> > + return NULL;
> > +
> > + fd->cs_table = alloc_cs_table(nr_states);
>
> Mmm, don't you need to rcu_assign_pointer this first one as well?
Hmmmm, nobody can be using this at this point, but yes, it'd be better
to keep that consistent I suppose ...
>
> > + if (!fd->cs_table)
> > + goto free_fd;
> > +
> > + /* Copy the span of the frequency domain */
> > + cpumask_copy(&fd->cpus, span);
> > +
> > + /* Build the list of capacity states for this freq domain */
> > + for (i = 0, freq = 0; i < nr_states; i++, freq++) {
> ^ ^
> The fact that this relies on active_power() to use ceil OPP for a given
> freq might deserve a comment. Also, is this behaviour of active_power()
> standardized?
Right, this can get confusing pretty quickly. There is a comment in
include/linux/energy_model.h where the expected behaviour of
active_power is explained, but a reminder above this function shouldn't
hurt.
>
> > + ret = cb->active_power(&power, &freq, cpu);
> > + if (ret)
> > + goto free_cs_table;
> > +
> > + fd->cs_table->state[i].power = power;
> > + fd->cs_table->state[i].frequency = freq;
> > +
> > + /*
> > + * The hertz/watts efficiency ratio should decrease as the
> > + * frequency grows on sane platforms. If not, warn the user
> > + * that some high OPPs are more power efficient than some
> > + * of the lower ones.
> > + */
> > + opp_eff = freq / power;
> > + if (opp_eff >= prev_opp_eff)
> > + pr_warn("%*pbl: hz/watt efficiency: OPP %d >= OPP%d\n",
> > + cpumask_pr_args(span), i, i - 1);
> > + prev_opp_eff = opp_eff;
> > + }
> > + fd_update_cs_table(fd->cs_table, cpu);
> > +
> > + return fd;
> > +
> > +free_cs_table:
> > + free_cs_table(fd->cs_table);
> > +free_fd:
> > + kfree(fd);
> > +
> > + return NULL;
> > +}
> > +
> > +static void rcu_free_cs_table(struct rcu_head *rp)
> > +{
> > + struct em_cs_table *table;
> > +
> > + table = container_of(rp, struct em_cs_table, rcu);
> > + free_cs_table(table);
> > +}
> > +
> > +/**
> > + * em_rescale_cpu_capacity() - Re-scale capacity values of the Energy Model
> > + *
> > + * This re-scales the capacity values for all capacity states of all frequency
> > + * domains of the Energy Model. This should be used when the capacity values
> > + * of the CPUs are updated at run-time, after the EM was registered.
> > + */
> > +void em_rescale_cpu_capacity(void)
>
> So, is this thought to be called eventually also after thermal capping
> events and such?
The true reason is that the frequency domains will typically be
registered in the EM framework _before_ the arch_topology driver kicks
in on arm/arm64. That means that the EM tables are created, and only
after, the cpu capacities are updated. So we basically need to update
those capacities to be up-to-date.
The reason we need to keep those two steps separate (registering the
freq domains and re-scaling the capacities) in the EM framework is
because thermal doesn't care about the cpu capacities. It is a perfectly
acceptable configuration to use IPA without having capacity-dmips-mhz
values in the DT for ex.
Now, since we have a RCU protection on the EM tables, we might decide in
the future to use the opportunity to modify the tables at run-time for
other reasons. Thermal capping could be one I guess.
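To illustrate the re-scaling being discussed, here is a minimal userspace
sketch of the linear operation that fd_update_cs_table() performs in the
patch. The struct and function names are simplified stand-ins for the
kernel structures, not the actual API:

```c
#include <assert.h>

/* Simplified model of one capacity state (stand-in for the kernel's
 * capacity-state struct); capacities are scaled against a max of 1024,
 * like SCHED_CAPACITY_SCALE. */
struct cap_state {
	unsigned long frequency;	/* kHz */
	unsigned long capacity;
};

/*
 * Linear re-scaling as done by fd_update_cs_table() in the patch:
 * capacity[i] = cmax * freq[i] / fmax, where fmax is the frequency of
 * the highest OPP and cmax the CPU's max capacity.
 */
static void rescale_cap_states(struct cap_state *state, int nr_states,
			       unsigned long cmax)
{
	unsigned long fmax = state[nr_states - 1].frequency;
	int i;

	for (i = 0; i < nr_states; i++)
		state[i].capacity = cmax * state[i].frequency / fmax;
}
```

This is why the two-step registration works: the table can be built with
placeholder capacities first, and re-scaled once arch_topology has
settled the final cmax values.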
>
> > +{
> > + struct em_cs_table *old_table, *new_table;
> > + struct em_freq_domain *fd;
> > + unsigned long flags;
> > + int nr_states, cpu;
> > +
> > + read_lock_irqsave(&em_data_lock, flags);
>
> Don't you need write_lock_ here, since you are going to exchange the
> em tables?
This lock protects the per_cpu() variable itself. Here we only read
pointers from that per_cpu variable, and we modify one attribute in
the pointed structure. We don't modify the per_cpu table itself. Does
that make sense?
>
> > + for_each_cpu(cpu, cpu_possible_mask) {
> > + fd = per_cpu(em_data, cpu);
> > + if (!fd || cpu != cpumask_first(&fd->cpus))
> > + continue;
> > +
> > + /* Copy the existing table. */
> > + old_table = rcu_dereference(fd->cs_table);
> > + nr_states = old_table->nr_cap_states;
> > + new_table = alloc_cs_table(nr_states);
> > + if (!new_table) {
> > + read_unlock_irqrestore(&em_data_lock, flags);
> > + return;
> > + }
> > + memcpy(new_table->state, old_table->state,
> > + nr_states * sizeof(*new_table->state));
> > +
> > + /* Re-scale the capacity values on the copy. */
> > + fd_update_cs_table(new_table, cpumask_first(&fd->cpus));
> > +
> > + /* Replace the table with the rescaled version. */
> > + rcu_assign_pointer(fd->cs_table, new_table);
> > + call_rcu(&old_table->rcu, rcu_free_cs_table);
> > + }
> > + read_unlock_irqrestore(&em_data_lock, flags);
> > + pr_debug("Re-scaled CPU capacities\n");
> > +}
> > +EXPORT_SYMBOL_GPL(em_rescale_cpu_capacity);
> > +
> > +/**
> > + * em_cpu_get() - Return the frequency domain for a CPU
> > + * @cpu : CPU to find the frequency domain for
> > + *
> > + * Return: the frequency domain to which 'cpu' belongs, or NULL if it doesn't
> > + * exist.
> > + */
> > +struct em_freq_domain *em_cpu_get(int cpu)
> > +{
> > + struct em_freq_domain *fd;
> > + unsigned long flags;
> > +
> > + read_lock_irqsave(&em_data_lock, flags);
> > + fd = per_cpu(em_data, cpu);
> > + read_unlock_irqrestore(&em_data_lock, flags);
> > +
> > + return fd;
> > +}
> > +EXPORT_SYMBOL_GPL(em_cpu_get);
>
> Mmm, this gets complicated pretty fast eh? :)
Yeah, hopefully I'll be able to explain/clarify that :-).
>
> I had to go back and forth between patches to start understanding the
> different data structures and how they are used, and I'm not sure yet
> I've got the full picture. I guess some nice diagram (cover letter or
> documentation patch) would help a lot.
Right, so I'd like very much to write a nice documentation patch once we
are more or less OK with the overall design of this framework, but I
felt like it was a little bit early for that. If we finally decide that
what I did is totally stupid and that it'd be better to do things
completely differently, my nice documentation patch would be a lot of
effort for nothing.
But I agree that at the same time all this complex code has to be
explained. Hopefully the existing comments can help with that.
Otherwise, I'm more than happy to answer all questions :-)
>
> Locking of such data structures is pretty involved as well, adding
> comments/docs shouldn't harm. :)
Message received. If I do need to come up with a brand new
design/implementation for v4, I'll make sure to add more comments.
> Best,
>
> - Juri
Thanks !
Quentin
On 06/07/2018 05:19 PM, Quentin Perret wrote:
> Hi Juri,
>
> On Thursday 07 Jun 2018 at 16:44:09 (+0200), Juri Lelli wrote:
>> On 21/05/18 15:24, Quentin Perret wrote:
[...]
>> Mmm, this gets complicated pretty fast eh? :)
>
> Yeah, hopefully I'll be able to explain/clarify that :-).
>
>>
>> I had to go back and forth between patches to start understanding the
>> different data structures and how they are used, and I'm not sure yet
>> I've got the full picture. I guess some nice diagram (cover letter or
>> documentation patch) would help a lot.
+1 on the diagram.
> Right, so I'd like very much to write a nice documentation patch once we
> are more or less OK with the overall design of this framework, but I
> felt like it was a little bit early for that. If we finally decide that
> what I did is totally stupid and that it'd be better to do things
> completely differently, my nice documentation patch would be a lot of
> effort for nothing.
>
> But I agree that at the same time all this complex code has to be
> explained. Hopefully the existing comments can help with that.
> Otherwise, I'm more than happy to answer all questions :-)
I'm not sure that the current API is the final one. Not sure that
em_rescale_cpu_capacity() is really needed.
We should first clarify the provider-consumer relation. Are multiple
providers allowed and, if yes, are they allowed to provide partial EM
data? Do we really want to allow this overwriting of old EM data
(em_rescale_cpu_capacity())? In case multiple providers are allowed, is
there some kind of priority involved?
The re-scaling thing comes from the requirement that the final cpu
capacity values are only known after the arch_topology driver was able
to scale the dmips-capacity-mhz values with policy->cpuinfo.max_freq. So
why can't we create the EM on arm/arm64 after this? Even though we would
then be forced to get the cpufreq-related cpumask from somewhere.
I guess the easiest model will be that the Energy Model (EM) is fully
initialized with one init call (from the arch) and fixed after that.
In case the EM should not be tied to cpufreq, the interface
em_create_fd(cpumask_t *span, int nr_states, struct em_data_callback
*cb) seems ok.
IMHO, part of the reason why this might be harder to understand is the
fact that the patches show the use of the second init call,
'em_rescale_cpu_capacity()', but not the first one,
'em_register_freq_domain()'. I guess that Quentin wanted to keep the set
as small as possible.
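For context, the provider side would feed the framework through the
em_data_callback/active_power interface mentioned above. The following
userspace sketch models that registration path; the names come from the
thread, but the struct layouts, the example OPP data and the build loop
are simplified, hypothetical stand-ins based on the quoted em_create_fd()
code, not the actual kernel API:

```c
#include <stdlib.h>
#include <assert.h>

/* Simplified model of one (frequency, power) capacity state. */
struct em_cap_state {
	unsigned long frequency;	/* kHz */
	unsigned long power;		/* mW */
};

struct em_data_callback {
	/*
	 * Fill in the power of the ceil OPP for *freq and update *freq
	 * to that OPP's exact frequency; return non-zero on failure.
	 */
	int (*active_power)(unsigned long *power, unsigned long *freq);
};

/* Example provider data: 3 OPPs of a hypothetical CPU. */
static const unsigned long opp_freq[] = { 500000, 1000000, 2000000 };
static const unsigned long opp_power[] = { 100, 300, 900 };

static int example_active_power(unsigned long *power, unsigned long *freq)
{
	int i;

	for (i = 0; i < 3; i++) {
		if (opp_freq[i] >= *freq) {	/* ceil OPP */
			*freq = opp_freq[i];
			*power = opp_power[i];
			return 0;
		}
	}
	return -1;
}

/*
 * Build the capacity-state table by querying the callback with
 * freq + 1 at each step, mirroring the loop in em_create_fd().
 */
static struct em_cap_state *build_cs_table(int nr_states,
					   struct em_data_callback *cb)
{
	struct em_cap_state *table;
	unsigned long power, freq = 0;
	int i;

	if (!cb || !cb->active_power)
		return NULL;

	table = calloc(nr_states, sizeof(*table));
	if (!table)
		return NULL;

	for (i = 0; i < nr_states; i++, freq++) {
		if (cb->active_power(&power, &freq)) {
			free(table);
			return NULL;
		}
		table[i].frequency = freq;
		table[i].power = power;
	}
	return table;
}
```

The freq + 1 stepping is what makes the ceil-OPP convention of
active_power() walk the OPPs in ascending order, one state per
iteration.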
[...]
On 06/06/2018 06:26 PM, Quentin Perret wrote:
> On Wednesday 06 Jun 2018 at 16:29:50 (+0100), Quentin Perret wrote:
>> On Wednesday 06 Jun 2018 at 17:20:00 (+0200), Juri Lelli wrote:
>>>>> This brings me to another question. Let's say there are multiple users of
>>>>> the Energy Model in the system. Shouldn't the units of frequency and power
>>>>> be standardized, maybe MHz and mW?
>>>>> The task scheduler doesn't care since it is only interested in power diffs
>>>>> but other users might do.
>>>>
>>>> So the good thing about specifying units is that we can probably assume
>>>> ranges on the values. If the power is in mW, assuming that we're talking
>>>> about a single CPU, it'll probably fit in 16 bits. 65W/core should be
>>>> a reasonable upper-bound?
>>>> But there are also vendors who might not be happy with disclosing absolute
>>>> values ... These are sometimes considered sensitive and only relative
>>>> numbers are discussed publicly. Now, you can also argue that we already
>>>> have units specified in IPA for ex, and that it doesn't really matter if
>>>> a driver "lies" about the real value, as long as the ratios are correct.
>>>> And I guess that anyone can do measurement on the hardware and get those
>>>> values anyway. So specifying a unit (mW) for the power is probably a
>>>> good idea.
>>>
>>> Mmm, I remember we fought quite a bit while getting capacity-dmips-mhz
>>> binding accepted, and one of the musts was that the values were going to
>>> be normalized. So, normalized power values again maybe?
>>
>> Hmmm, that's a very good point ... There should be no problems on the
>> scheduler side -- we're only interested in correct ratios. But I'm not
>> sure on the thermal side ... I will double check that.
>
> So, IPA needs to compare the power of the CPUs with the power of other
> things (e.g. GPUs). So we can't normalize the power of the CPUs without
> normalizing in the same scale the power of the other devices. I see two
> possibilities:
>
> 1) we don't normalize the CPU power values, we specify them in mW, and
> we document (and maybe throw a warning if we see an issue at runtime)
> the max range of values. The max expected power for a single core
> could be 65K for ex (16bits). And based on that we can verify
> overflow and precision issues in the algorithms, and we keep it easy
> to compare the CPU power numbers with other devices.
I would say we need 1). 32bit values with units and proper documentation
of the possible ranges.
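As a quick sanity check of the 16-bit argument above: 0xFFFF mW is about
65.5 W per core, and multiplying such a power value by a utilization
ratio scaled to SCHED_CAPACITY_SCALE (1024) stays comfortably within 32
bits. A small sketch (the helper name and the scaled-power computation
are illustrative assumptions, not code from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Ranges discussed above: power in mW capped to 16 bits, utilization
 * scaled to SCHED_CAPACITY_SCALE. */
#define SCHED_CAPACITY_SCALE	1024u
#define EM_MAX_POWER		0xFFFFu		/* 65535 mW, ~65 W per core */

static uint32_t em_scaled_power(uint32_t power_mw, uint32_t util)
{
	/* Worst case: 65535 * 1024 = 67,107,840 < 2^31, no overflow. */
	return power_mw * util / SCHED_CAPACITY_SCALE;
}
```

So with documented mW units and a 16-bit bound, 32-bit arithmetic is
safe for such products; precision loss only comes from the final
division.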
[...]
On Thursday 07 Jun 2018 at 16:44:22 (+0200), Juri Lelli wrote:
> Hi,
>
> On 21/05/18 15:25, Quentin Perret wrote:
> > In order to use EAS, the task scheduler has to know about the Energy
> > Model (EM) of the platform. This commit extends the scheduler topology
> > code to take references on the frequency domains objects of the EM
> > framework for all online CPUs. Hence, the availability of the EM for
> > those CPUs is guaranteed to the scheduler at runtime without further
> > checks in latency sensitive code paths (i.e. task wake-up).
> >
> > A (RCU-protected) private list of online frequency domains is maintained
> > by the scheduler to enable fast iterations. Furthermore, the availability
> > of an EM is notified to the rest of the scheduler with a static key,
> > which ensures a low impact on non-EAS systems.
> >
> > Energy Aware Scheduling can be started if and only if:
> > 1. all online CPUs are covered by the EM;
> > 2. the EM complexity is low enough to keep scheduling overheads low;
> > 3. the platform has an asymmetric CPU capacity topology (detected by
> > looking for the SD_ASYM_CPUCAPACITY flag in the sched_domain
> > hierarchy).
>
> Not sure about this. How about multi-freq domain same max capacity
> systems. I understand that most of the energy savings come from selecting
> the right (big/LITTLE) cluster, but EM should still be useful to drive
> OPP selection (that was one of the use-cases we discussed lately IIRC)
> and also to decide between packing or spreading, no?
So, let's discuss the usage of the EM for frequency selection first,
and its usage for task placement after.
For frequency selection, schedutil could definitely use the EM as
provided by the framework introduced in patch 03/10. We could definitely
use that to make policy decisions (jump faster to the so called "knee"
if there is one for ex). This is true for symmetric and asymmetric
systems. And I consider that independent from this patch. This patch is
about providing the scheduler with an EM to bias _task placement_.
So, about the task placement ... There are cases (at least theoretical
ones) where EAS _could_ help on symmetric systems, but I have never
been able to measure any real benefits in practice. Most of the time,
it's a good idea from an energy standpoint to just spread the tasks
and to keep the OPPs as low as possible on symmetric systems, which is
already what CFS does. Of course you can come up with specific counter
examples, but the question is whether or not these (corner) cases are
that important. They might or might not, it's not so easy to tell.
On asymmetric systems, it is pretty clear that there is a massive
potential for saving energy with a different task placement strategy.
So, since the big savings are there, our idea was basically to address
that first, while we minimize the risk of hurting others (server folks
for ex). I guess that enabling EAS for asymmetric systems can be seen as
an incremental step. We should be able to extend the scope of EAS to
symmetric systems later, if proven useful.
Another thing is that, if you are using an asymmetric system (e.g.
big.LITTLE), it is a good indication that energy/battery life is probably
important for your use-case, and that you might be ready to "pay" the
cost of EAS to save energy. This isn't that obvious for symmetric
systems.
>
> > The sched_energy_enabled() function which returns the status of the
> > static key is stubbed to false when CONFIG_ENERGY_MODEL=n, hence making
> > sure that all the code behind it can be compiled out by constant
> > propagation.
>
> Actually, do we need a config option at all? Shouldn't the static key
> (and RCU machinery) guard against unwanted overheads when EM is not
> present/used?
Well, the ENERGY_MODEL config option comes from the EM framework,
independently from what the scheduler does with it. But I just thought
that we might as well use it if it's there. But yeah, we don't _have_ to
play with this Kconfig option, just using the static key could be fine.
I don't have a strong opinion on that :-)
>
> I was thinking it should be pretty similar to schedutil setup, no?
>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
> > kernel/sched/sched.h | 27 ++++++++++
> > kernel/sched/topology.c | 113 ++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 140 insertions(+)
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index ce562d3b7526..7c517076a74a 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -63,6 +63,7 @@
> > #include <linux/syscalls.h>
> > #include <linux/task_work.h>
> > #include <linux/tsacct_kern.h>
> > +#include <linux/energy_model.h>
> >
> > #include <asm/tlb.h>
> >
> > @@ -2162,3 +2163,29 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)
> > return util;
> > }
> > #endif
> > +
> > +struct sched_energy_fd {
> > + struct em_freq_domain *fd;
> > + struct list_head next;
> > + struct rcu_head rcu;
> > +};
> > +
> > +#ifdef CONFIG_ENERGY_MODEL
> > +extern struct static_key_false sched_energy_present;
> > +static inline bool sched_energy_enabled(void)
> > +{
> > + return static_branch_unlikely(&sched_energy_present);
> > +}
> > +
> > +extern struct list_head sched_energy_fd_list;
> > +#define for_each_freq_domain(sfd) \
> > + list_for_each_entry_rcu(sfd, &sched_energy_fd_list, next)
> > +#define freq_domain_span(sfd) (&((sfd)->fd->cpus))
> > +#else
> > +static inline bool sched_energy_enabled(void)
> > +{
> > + return false;
> > +}
> > +#define for_each_freq_domain(sfd) for (sfd = NULL; sfd;)
> > +#define freq_domain_span(sfd) NULL
> > +#endif
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 64cc564f5255..3e22c798f18d 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1500,6 +1500,116 @@ void sched_domains_numa_masks_clear(unsigned int cpu)
> >
> > #endif /* CONFIG_NUMA */
> >
> > +#ifdef CONFIG_ENERGY_MODEL
> > +
> > +/*
> > + * The complexity of the Energy Model is defined as the product of the number
> > + * of frequency domains with the sum of the number of CPUs and the total
> > + * number of OPPs in all frequency domains. It is generally not a good idea
> > + * to use such a model on very complex platforms because of the associated
> > + * scheduling overheads. The arbitrary constraint below prevents that. It
> > + * makes EAS usable up to 16 CPUs with per-CPU DVFS and less than 8 OPPs each,
> > + * for example.
> > + */
> > +#define EM_MAX_COMPLEXITY 2048
>
> Do we really need this hardcoded constant?
>
> I guess if one spent time deriving an EM for a big system with lot of
> OPPs, she/he already knows what is doing? :)
Yeah probably. But we already agreed with Peter that since the complexity
of the algorithm in the wake-up path can become quite bad, it might be a
good idea to at least have a warning for that.
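The bound is easy to check numerically. A sketch of the complexity
formula from the quoted patch (the helper names are illustrative; only
the formula and EM_MAX_COMPLEXITY come from the patch): a 4+4
big.LITTLE system with two domains of 5 OPPs each gives 2 * (10 + 8) =
36, while 16 per-CPU domains with 7 OPPs each gives 16 * (112 + 16) =
2048, exactly at the limit, consistent with the "less than 8 OPPs each"
wording in the comment:

```c
#include <assert.h>

/* Complexity bound from the patch: C = nr_fd * (nr_opp + nr_cpus),
 * with nr_opp the total OPP count over all frequency domains. */
#define EM_MAX_COMPLEXITY	2048

static int em_complexity(int nr_fd, int nr_opp, int nr_cpus)
{
	return nr_fd * (nr_opp + nr_cpus);
}

static int em_complexity_ok(int nr_fd, int nr_opp, int nr_cpus)
{
	return em_complexity(nr_fd, nr_opp, nr_cpus) <= EM_MAX_COMPLEXITY;
}
```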
>
> > +
> > +DEFINE_STATIC_KEY_FALSE(sched_energy_present);
> > +LIST_HEAD(sched_energy_fd_list);
> > +
> > +static struct sched_energy_fd *find_sched_energy_fd(int cpu)
> > +{
> > + struct sched_energy_fd *sfd;
> > +
> > + for_each_freq_domain(sfd) {
> > + if (cpumask_test_cpu(cpu, freq_domain_span(sfd)))
> > + return sfd;
> > + }
> > +
> > + return NULL;
> > +}
> > +
> > +static void free_sched_energy_fd(struct rcu_head *rp)
> > +{
> > + struct sched_energy_fd *sfd;
> > +
> > + sfd = container_of(rp, struct sched_energy_fd, rcu);
> > + kfree(sfd);
> > +}
> > +
> > +static void build_sched_energy(void)
> > +{
> > + struct sched_energy_fd *sfd, *tmp;
> > + struct em_freq_domain *fd;
> > + struct sched_domain *sd;
> > + int cpu, nr_fd = 0, nr_opp = 0;
> > +
> > + rcu_read_lock();
> > +
> > + /* Disable EAS entirely whenever the system isn't asymmetric. */
> > + cpu = cpumask_first(cpu_online_mask);
> > + sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);
> > + if (!sd) {
> > + pr_debug("%s: no SD_ASYM_CPUCAPACITY\n", __func__);
> > + goto disable;
> > + }
> > +
> > + /* Make sure to have an energy model for all CPUs. */
> > + for_each_online_cpu(cpu) {
> > + /* Skip CPUs with a known energy model. */
> > + sfd = find_sched_energy_fd(cpu);
> > + if (sfd)
> > + continue;
> > +
> > + /* Add the energy model of others. */
> > + fd = em_cpu_get(cpu);
> > + if (!fd)
> > + goto disable;
> > + sfd = kzalloc(sizeof(*sfd), GFP_NOWAIT);
> > + if (!sfd)
> > + goto disable;
> > + sfd->fd = fd;
> > + list_add_rcu(&sfd->next, &sched_energy_fd_list);
> > + }
> > +
> > + list_for_each_entry_safe(sfd, tmp, &sched_energy_fd_list, next) {
> > + if (cpumask_intersects(freq_domain_span(sfd),
> > + cpu_online_mask)) {
> > + nr_opp += em_fd_nr_cap_states(sfd->fd);
> > + nr_fd++;
> > + continue;
> > + }
> > +
> > + /* Remove the unused frequency domains */
> > + list_del_rcu(&sfd->next);
> > + call_rcu(&sfd->rcu, free_sched_energy_fd);
>
> Unused because of? Hotplug?
Yes. The list of frequency domains is just convenient because we need to
iterate over them in the wake-up path. Now, if you hotplug out all the
CPUs of a frequency domain, it is safe to remove it from the list
because the scheduler shouldn't migrate tasks to/from those CPUs while
they're offline. And that's one less element in the list, so iterating
over the entire list is faster.
>
> Not sure, but I guess you have considered the idea of tearing all this
> down when sched domains are destroyed and then rebuilding it again? Why
> did you decide for this approach? Or maybe I just missed where you do
> that. :/
Well, it's easy to detect the frequency domains we should keep and the
ones we can toss in the trash. So it's just more efficient to do it that way
than rebuilding everything I guess. But I'm happy to destroy and rebuild
the whole thing every time if this is easier to understand and better
aligned with what the existing topology code does :-)
>
> > + }
> > +
> > + /* Bail out if the Energy Model complexity is too high. */
> > + if (nr_fd * (nr_opp + num_online_cpus()) > EM_MAX_COMPLEXITY) {
> > + pr_warn("%s: EM complexity too high, stopping EAS", __func__);
> > + goto disable;
> > + }
> > +
> > + rcu_read_unlock();
> > + static_branch_enable_cpuslocked(&sched_energy_present);
> > + pr_debug("%s: EAS started\n", __func__);
>
> I'd vote for a pr_info here instead, maybe printing info about the em as
> well. Looks pretty useful to me to have that in dmesg. Maybe guarded by
> sched_debug?
No problem with that :-). And I actually noticed just now that pr_debug
isn't used anywhere in topology.c so I should also align with the
existing code ...
Thanks !
Quentin
On 07/06/18 16:19, Quentin Perret wrote:
> Hi Juri,
>
> On Thursday 07 Jun 2018 at 16:44:09 (+0200), Juri Lelli wrote:
> > On 21/05/18 15:24, Quentin Perret wrote:
[...]
> > > +static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> > > +{
> > > + unsigned long cmax = arch_scale_cpu_capacity(NULL, cpu);
> > > + int max_cap_state = cs_table->nr_cap_states - 1;
> > ^
> > You don't need this on the stack, right?
>
> Oh, why not ?
>
Because you use it only once here below? Anyway, it's more of a
(debatable) nitpick than anything.
> > > + unsigned long fmax = cs_table->state[max_cap_state].frequency;
> > > + int i;
> > > +
> > > + for (i = 0; i < cs_table->nr_cap_states; i++)
> > > + cs_table->state[i].capacity = cmax *
> > > + cs_table->state[i].frequency / fmax;
> > > +}
> > > +
> > > +static struct em_freq_domain *em_create_fd(cpumask_t *span, int nr_states,
> > > + struct em_data_callback *cb)
> > > +{
> > > + unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
> > > + int i, ret, cpu = cpumask_first(span);
> > > + struct em_freq_domain *fd;
> > > + unsigned long power, freq;
> > > +
> > > + if (!cb->active_power)
> > > + return NULL;
> > > +
> > > + fd = kzalloc(sizeof(*fd), GFP_KERNEL);
> > > + if (!fd)
> > > + return NULL;
> > > +
> > > + fd->cs_table = alloc_cs_table(nr_states);
> >
> > Mmm, don't you need to rcu_assign_pointer this first one as well?
>
> Hmmmm, nobody can be using this at this point, but yes, it'd be better
> to keep that consistent I suppose ...
Yeah, same thing I thought as well.
> > > + if (!fd->cs_table)
> > > + goto free_fd;
> > > +
> > > + /* Copy the span of the frequency domain */
> > > + cpumask_copy(&fd->cpus, span);
> > > +
> > > + /* Build the list of capacity states for this freq domain */
> > > + for (i = 0, freq = 0; i < nr_states; i++, freq++) {
> > ^ ^
> > The fact that this relies on active_power() to use ceil OPP for a given
> > freq might deserve a comment. Also, is this behaviour of active_power()
> > standardized?
>
> Right, this can get confusing pretty quickly. There is a comment in
> include/linux/energy_model.h where the expected behaviour of
> active_power is explained, but a reminder above this function shouldn't
> hurt.
Mmm, I wonder if you could actually check that the returned freq values
are consistent with the assumption (just in case one didn't do
homework).
> > > + ret = cb->active_power(&power, &freq, cpu);
> > > + if (ret)
> > > + goto free_cs_table;
[...]
> > > +/**
> > > + * em_rescale_cpu_capacity() - Re-scale capacity values of the Energy Model
> > > + *
> > > + * This re-scales the capacity values for all capacity states of all frequency
> > > + * domains of the Energy Model. This should be used when the capacity values
> > > + * of the CPUs are updated at run-time, after the EM was registered.
> > > + */
> > > +void em_rescale_cpu_capacity(void)
> >
> > So, is this thought to be called eventually also after thermal capping
> > events and such?
>
> The true reason is that the frequency domains will typically be
> registered in the EM framework _before_ the arch_topology driver kicks
> in on arm/arm64. That means that the EM tables are created, and only
> after, the CPU capacities are updated. So we basically need to re-scale
> the EM capacity values afterwards to keep them up-to-date.
>
> The reason we need to keep those two steps separate (registering the
> freq domains and re-scaling the capacities) in the EM framework is
> because thermal doesn't care about the cpu capacities. It is a perfectly
> acceptable configuration to use IPA without having dmips-capacity-mhz
> values in the DT for ex.
>
> Now, since we have a RCU protection on the EM tables, we might decide in
> the future to use the opportunity to modify the tables at run-time for
> other reasons. Thermal capping could be one I guess.
OK. Makes sense.
> > > +{
> > > + struct em_cs_table *old_table, *new_table;
> > > + struct em_freq_domain *fd;
> > > + unsigned long flags;
> > > + int nr_states, cpu;
> > > +
> > > + read_lock_irqsave(&em_data_lock, flags);
> >
> > Don't you need write_lock_ here, since you are going to exchange the
> > em tables?
>
> This lock protects the per_cpu() variable itself. Here we only read
> pointers from that per_cpu variable, and we modify one attribute in
> the pointed structure. We don't modify the per_cpu table itself. Does
> that make sense?
So, I don't seem to understand what protects the rcu_assign_pointer(s)
below (as in
https://elixir.bootlin.com/linux/latest/source/Documentation/RCU/whatisRCU.txt#L395).
> > > + for_each_cpu(cpu, cpu_possible_mask) {
> > > + fd = per_cpu(em_data, cpu);
> > > + if (!fd || cpu != cpumask_first(&fd->cpus))
> > > + continue;
> > > +
> > > + /* Copy the existing table. */
> > > + old_table = rcu_dereference(fd->cs_table);
> > > + nr_states = old_table->nr_cap_states;
> > > + new_table = alloc_cs_table(nr_states);
> > > + if (!new_table) {
> > > + read_unlock_irqrestore(&em_data_lock, flags);
> > > + return;
> > > + }
> > > + memcpy(new_table->state, old_table->state,
> > > + nr_states * sizeof(*new_table->state));
> > > +
> > > + /* Re-scale the capacity values on the copy. */
> > > + fd_update_cs_table(new_table, cpumask_first(&fd->cpus));
> > > +
> > > + /* Replace the table with the rescaled version. */
> > > + rcu_assign_pointer(fd->cs_table, new_table);
> > > + call_rcu(&old_table->rcu, rcu_free_cs_table);
> > > + }
> > > + read_unlock_irqrestore(&em_data_lock, flags);
> > > + pr_debug("Re-scaled CPU capacities\n");
> > > +}
> > > +EXPORT_SYMBOL_GPL(em_rescale_cpu_capacity);
> > > +
> > > +/**
> > > + * em_cpu_get() - Return the frequency domain for a CPU
> > > + * @cpu : CPU to find the frequency domain for
> > > + *
> > > + * Return: the frequency domain to which 'cpu' belongs, or NULL if it doesn't
> > > + * exist.
> > > + */
> > > +struct em_freq_domain *em_cpu_get(int cpu)
> > > +{
> > > + struct em_freq_domain *fd;
> > > + unsigned long flags;
> > > +
> > > + read_lock_irqsave(&em_data_lock, flags);
> > > + fd = per_cpu(em_data, cpu);
> > > + read_unlock_irqrestore(&em_data_lock, flags);
> > > +
> > > + return fd;
> > > +}
> > > +EXPORT_SYMBOL_GPL(em_cpu_get);
> >
> > Mmm, this gets complicated pretty fast eh? :)
>
> Yeah, hopefully I'll be able to explain/clarify that :-).
>
> >
> > I had to go back and forth between patches to start understanding the
> > different data structures and how they are used, and I'm not sure yet
> > I've got the full picture. I guess some nice diagram (cover letter or
> > documentation patch) would help a lot.
>
> Right, so I'd like very much to write a nice documentation patch once we
> are more or less OK with the overall design of this framework, but I
> felt like it was a little bit early for that. If we finally decide that
> what I did is totally stupid and that it'd be better to do things
> completely differently, my nice documentation patch would be a lot of
> effort for nothing.
>
> But I agree that at the same time all this complex code has to be
> explained. Hopefully the existing comments can help with that.
> Otherwise, I'm more than happy to answer all questions :-)
Thanks for your answers, but I guess my point was that a bit more info
about how this all stays together (maybe in the cover letter) would have
still helped reviewers.
Anyway, no big deal.
> > Locking of such data structures is pretty involved as well, adding
> > comments/docs shouldn't harm. :)
>
> Message received. If I do need to come up with a brand new
> design/implementation for v4, I'll make sure to add more comments.
I'd vote for adding docs even if design turns out to be good and you
only need to refresh patches. ;)
On 07/06/18 17:02, Quentin Perret wrote:
> On Thursday 07 Jun 2018 at 16:44:22 (+0200), Juri Lelli wrote:
> > Hi,
> >
> > On 21/05/18 15:25, Quentin Perret wrote:
> > > In order to use EAS, the task scheduler has to know about the Energy
> > > Model (EM) of the platform. This commit extends the scheduler topology
> > > code to take references on the frequency domains objects of the EM
> > > framework for all online CPUs. Hence, the availability of the EM for
> > > those CPUs is guaranteed to the scheduler at runtime without further
> > > checks in latency sensitive code paths (i.e. task wake-up).
> > >
> > > A (RCU-protected) private list of online frequency domains is maintained
> > > by the scheduler to enable fast iterations. Furthermore, the availability
> > > of an EM is notified to the rest of the scheduler with a static key,
> > > which ensures a low impact on non-EAS systems.
> > >
> > > Energy Aware Scheduling can be started if and only if:
> > > 1. all online CPUs are covered by the EM;
> > > 2. the EM complexity is low enough to keep scheduling overheads low;
> > > 3. the platform has an asymmetric CPU capacity topology (detected by
> > > looking for the SD_ASYM_CPUCAPACITY flag in the sched_domain
> > > hierarchy).
> >
> > Not sure about this. How about multi-freq domain same max capacity
> > systems. I understand that most of the energy savings come from selecting
> > the right (big/LITTLE) cluster, but EM should still be useful to drive
> > OPP selection (that was one of the use-cases we discussed lately IIRC)
> > and also to decide between packing or spreading, no?
>
> So, let's discuss the usage of the EM for frequency selection first,
> and its usage for task placement after.
>
> For frequency selection, schedutil could definitely use the EM as
> provided by the framework introduced in patch 03/10. We could definitely
> use that to make policy decisions (jump faster to the so called "knee"
> if there is one for ex). This is true for symmetric and asymmetric
> systems. And I consider that independent from this patch. This patch is
> about providing the scheduler with an EM to bias _task placement_.
>
> So, about the task placement ... There are cases (at least theoretical
> ones) where EAS _could_ help on symmetric systems, but I have never
> been able to measure any real benefits in practice. Most of the time,
> it's a good idea from an energy standpoint to just spread the tasks
> and to keep the OPPs as low as possible on symmetric systems, which is
> already what CFS does. Of course you can come up with specific counter
> examples, but the question is whether or not these (corner) cases are
> that important. They might or might not, it's not so easy to tell.
>
> On asymmetric systems, it is pretty clear that there is a massive
> potential for saving energy with a different task placement strategy.
> So, since the big savings are there, our idea was basically to address
> that first, while we minimize the risk of hurting others (server folks
> for ex). I guess that enabling EAS for asymmetric systems can be seen as
> an incremental step. We should be able to extend the scope of EAS to
> symmetric systems later, if proven useful.
>
> Another thing is that, if you are using an asymmetric system (e.g.
> big.LITTLE), it is a good indication that energy/battery life is probably
> important for your use-case, and that you might be ready to "pay" the
> cost of EAS to save energy. This isn't that obvious for symmetric
> systems.
Ok, I buy the step by step approach starting from the use case that
seems to fit most. But I still feel that having something like point 3
stated (or in the code) might stop people from trying to see if having an EM
around might help other cases (freq, sym, etc.).
Also, having no EM data present should equally result in disabling the
whole thing, so not much (at all?) overhead for whoever is simply not
providing data, no?
[...]
> > > + list_for_each_entry_safe(sfd, tmp, &sched_energy_fd_list, next) {
> > > + if (cpumask_intersects(freq_domain_span(sfd),
> > > + cpu_online_mask)) {
> > > + nr_opp += em_fd_nr_cap_states(sfd->fd);
> > > + nr_fd++;
> > > + continue;
> > > + }
> > > +
> > > + /* Remove the unused frequency domains */
> > > + list_del_rcu(&sfd->next);
> > > + call_rcu(&sfd->rcu, free_sched_energy_fd);
> >
> > Unused because of? Hotplug?
>
> Yes. The list of frequency domains is just convenient because we need to
> iterate over them in the wake-up path. Now, if you hotplug out all the
> CPUs of a frequency domain, it is safe to remove it from the list
> because the scheduler shouldn't migrate tasks to/from those CPUs while
> they're offline. And that's one less element in the list, so iterating
> over the entire list is faster.
OK, I mainly asked to be sure that I understood the comment. I guess
some stress test involving hotplug and iterating over the list would
best answer which way is the safest. :)
On Thursday 07 Jun 2018 at 18:04:19 (+0200), Juri Lelli wrote:
> On 07/06/18 16:19, Quentin Perret wrote:
> > On Thursday 07 Jun 2018 at 16:44:09 (+0200), Juri Lelli wrote:
> > > > + if (!fd->cs_table)
> > > > + goto free_fd;
> > > > +
> > > > + /* Copy the span of the frequency domain */
> > > > + cpumask_copy(&fd->cpus, span);
> > > > +
> > > > + /* Build the list of capacity states for this freq domain */
> > > > + for (i = 0, freq = 0; i < nr_states; i++, freq++) {
> > > ^ ^
> > > The fact that this relies on active_power() to use ceil OPP for a given
> > > freq might deserve a comment. Also, is this behaviour of active_power()
> > > standardized?
> >
> > Right, this can get confusing pretty quickly. There is a comment in
> > include/linux/energy_model.h where the expected behaviour of
> > active_power is explained, but a reminder above this function shouldn't
> > hurt.
>
> Mmm, not sure if you could actually check that returned freq values are
> consistent with the assumption (just in case one didn't do
> homework).
Right, that's a good point. I'll add checks on the parameters modified by
active_power(). Monotonically increasing freq, monotonically increasing
power as well I guess, something along those lines.
> > > > +{
> > > > + struct em_cs_table *old_table, *new_table;
> > > > + struct em_freq_domain *fd;
> > > > + unsigned long flags;
> > > > + int nr_states, cpu;
> > > > +
> > > > + read_lock_irqsave(&em_data_lock, flags);
> > >
> > > Don't you need write_lock_ here, since you are going to exchange the
> > > em tables?
> >
> > This lock protects the per_cpu() variable itself. Here we only read
> > pointers from that per_cpu variable, and we modify one attribute in
> > the pointed structure. We don't modify the per_cpu table itself. Does
> > that make sense ?
>
> So, I don't seem to understand what protects the rcu_assign_pointer(s)
> below (as in
> https://elixir.bootlin.com/linux/latest/source/Documentation/RCU/whatisRCU.txt#L395).
Sigh, that's not right :(
I take back my previous message, the write lock _is_ needed. Thanks for
pointing that out ...
Thanks,
Quentin
On Thursday 07 Jun 2018 at 18:29:10 (+0200), Juri Lelli wrote:
> On 07/06/18 17:02, Quentin Perret wrote:
> > On Thursday 07 Jun 2018 at 16:44:22 (+0200), Juri Lelli wrote:
> > > Not sure about this. How about multi-freq domain same max capacity
> > > systems. I understand that most of the energy saving come from selecting
> > > the right (big/LITTLE) cluster, but EM should still be useful to drive
> > > OPP selection (that was one of the use-cases we discussed lately IIRC)
> > > and also to decide between packing or spreading, no?
> >
> > So, let's discuss the usage of the EM for frequency selection first,
> > and its usage for task placement after.
> >
> > For frequency selection, schedutil could definitely use the EM as
> > provided by the framework introduced in patch 03/10. We could definitely
> > use that to make policy decisions (jump faster to the so called "knee"
> > if there is one for ex). This is true for symmetric and asymmetric
> > system. And I consider that independent from this patch. This patch is
> > about providing the scheduler with an EM to bias _task placement_.
> >
> > So, about the task placement ... There are cases (at least theoretical
> > ones) where EAS _could_ help on symmetric systems, but I have never
> > been able to measure any real benefits in practice. Most of the time,
> > it's a good idea from an energy standpoint to just spread the tasks
> > and to keep the OPPs as low as possible on symmetric systems, which is
> > already what CFS does. Of course you can come-up with specific counter
> > examples, but the question is whether or not these (corner) cases are
> > that important. They might or might not, it's not so easy to tell.
> >
> > On asymmetric systems, it is pretty clear that there is a massive
> > potential for saving energy with a different task placement strategy.
> > So, since the big savings are there, our idea was basically to address
> > that first, while we minimize the risk of hurting others (server folks
> > for ex). I guess that enabling EAS for asymmetric systems can be seen as
> > an incremental step. We should be able to extend the scope of EAS to
> > symmetric systems later, if proven useful.
> >
> > Another thing is that, if you are using an asymmetric system (e.g.
> > big.LITTLE), it is a good indication that energy/battery life is probably
> > important for your use-case, and that you might be ready to "pay" the
> > cost of EAS to save energy. This isn't that obvious for symmetric
> > systems.
>
> Ok, I buy the step by step approach starting from the use case that
> seems to fit most. But I still feel that having something like 3. stated
> (or in the code) might stop people from trying to see if having an EM
> around might help other cases (freq, sym, etc.).
Ok, I see what you mean. What I should make more clear is that this
patch-set really is split in two relatively independent parts. Patches
01 to 04 introduce a centralized EM framework, which doesn't depend on
the scheduler, or thermal, or schedutil, or anything. It's an
independent thing. And then you can see patches 05 to 10 as _one
possible use-case_ for this framework: EAS.
I'm not convinced that patches 01-04 can live on their own though. I
assume it must be pretty hard to understand how this whole framework can
be used if there isn't an example of a user alongside it ...
> Also, having no EM data present should equally result in disabling the
> whole thing, so not much (if any) overhead for whoever is simply not
> providing data, no?
Right, but some users might want to have an EM without EAS I guess ...
Otherwise, the other solution would be to have a new knob (a sched_feat
for ex ?) to let users disable EAS if they're not interested in saving
energy.
Thanks,
Quentin
Hi Dietmar,
On Thursday 07 Jun 2018 at 17:55:32 (+0200), Dietmar Eggemann wrote:
> On 06/07/2018 05:19 PM, Quentin Perret wrote:
> > Hi Juri,
> >
> > On Thursday 07 Jun 2018 at 16:44:09 (+0200), Juri Lelli wrote:
> > > On 21/05/18 15:24, Quentin Perret wrote:
>
> [...]
>
> > > Mmm, this gets complicated pretty fast eh? :)
> >
> > Yeah, hopefully I'll be able to explain/clarify that :-).
> >
> > >
> > > I had to go back and forth between patches to start understanding the
> > > different data structures and how they are used, and I'm not sure yet
> > > I've got the full picture. I guess some nice diagram (cover letter or
> > > documentation patch) would help a lot.
>
> +1 on the diagram.
:-)
>
> > Right, so I'd like very much to write a nice documentation patch once we
> > are more or less OK with the overall design of this framework, but I
> > felt like it was a little bit early for that. If we finally decide that
> > what I did is totally stupid and that it'd be better to do things
> > completely differently, my nice documentation patch would be a lot of
> > efforts for nothing.
> >
> > But I agree that at the same time all this complex code has to be
> > explained. Hopefully the existing comments can help with that.
> > Otherwise, I'm more than happy to answer all questions :-)
>
> I'm not sure that the current API is the final one. Not sure that
> em_rescale_cpu_capacity() is really needed.
I understand why this specific part of the design can be confusing, but
I couldn't find another _clean_ way to deal with the fact that re-scaling
the CPU capacities at run-time can happen, and that the different
clients (thermal, scheduler) have different needs when it comes to
CPU capacities. But suggestions are more than welcome !
> We should first clarify the provider - consumer relation. Are multiple
> providers allowed, if yes, are they allowed to provide partial EM data? Do
> we really want to allow this overwriting of old EM data
> (em_rescale_cpu_capacity()). In case multiple providers are allowed, is there
> some kind of priority involved?
The comment above em_register_freq_domain() explains that, at least
partially. In the current implementation, if multiple providers register
the same frequency domain, all but the first will be ignored. The reason
I implemented it that way is that: 1) it's simple; 2) it should
cover the current use-cases for EAS and IPA.
But we could do something more clever. We could add a parameter to
em_register_freq_domain() that would represent some sort of priority. In
this case, if multiple providers register the same freq domain, the
higher priority would override the lower priority. For example, power
values coming from firmware could overwrite power values estimated with
P=CV^2f.
> The re-scaling thing comes from the requirement that the final cpu capacity
> values are only known after the arch_topology driver was able to scale the
> dmips capacity values with the policy->cpuinfo.max_freq, but why can't we
> create the EM on arm/arm64 after this?
What if you don't have dmips-capacity-mhz values in the DT and still
want to use IPA ? There is no good reason to create a dependency between
the thermal subsystem and the arch_topology driver IMO.
> Even though we would be forced to get cpufreq's related cpumask from
> somewhere.
That's the easy part. The difficult part is, where do you get power
values from ? You have to let the lower layers register those values
in a centralized location on a voluntary basis. And then it becomes easy
for consumers to access that data, because they know where it is.
> I guess the easiest model will be that the Energy Model (EM) is fully
> initialized with one init call (from the arch) and fixed after that.
Again, I don't think that's possible. You have to let the lower layers
tell you where the power values come from, at the very least. You could
let the archs do that aggregation I suppose, but I don't really see the
benefit over one centralized framework with a generic interface ...
What's your opinion ?
>
> In case the EM should not be tight to cpufreq, the interface
> em_create_fd(cpumask_t *span, int nr_states, struct em_data_callback *cb)
> seems ok.
Cool :-)
> IMHO, part of the problem why this might be harder to understand is the fact
> that the patches show the use of the 2nd init call
> 'em_rescale_cpu_capacity()' but not the 1st one, 'em_register_freq_domain()'.
> I guess that Quentin wanted to keep the set as small as possible.
Yes, this is confusing. I'm now starting to think that patch 10/10 should
probably not be part of this patch-set, especially if I don't provide
the patches registering the freq domains from the CPUFreq drivers. And
it's the only "Arm-specific" patch in this arch-independent patch-set.
So I think I'll drop patch 10/10 for v4 ... That part should be
discussed separately, with the rest of the Arm-specific changes.
Thanks !
Quentin
On 08/06/18 09:25, Quentin Perret wrote:
> Hi Dietmar,
>
> On Thursday 07 Jun 2018 at 17:55:32 (+0200), Dietmar Eggemann wrote:
[...]
> > IMHO, part of the problem why this might be harder to understand is the fact
> > that the patches show the use of the 2nd init call
> > 'em_rescale_cpu_capacity()' but not the 1st one, 'em_register_freq_domain()'.
> > I guess that Quentin wanted to keep the set as small as possible.
>
> Yes, this is confusing. I'm now starting to think that patch 10/10 should
> probably not be part of this patch-set, especially if I don't provide
> the patches registering the freq domains from the CPUFreq drivers. And
> it's the only "Arm-specific" patch in this arch-independent patch-set.
>
> So I think I'll drop patch 10/10 for v4 ... That part should be
> discussed separately, with the rest of the Arm-specific changes.
Mmm, I would actually vote to at least have one example showing how and
where the em_register_freq_domain() is going to be used. I had to look
at the repo you referenced, since I think it's quite a fundamental piece
for understanding the design, IMHO.
Hi,
On 21/05/18 15:25, Quentin Perret wrote:
[...]
> +static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> +{
> + unsigned long cur_energy, prev_energy, best_energy, cpu_cap, task_util;
> + int cpu, best_energy_cpu = prev_cpu;
> + struct sched_energy_fd *sfd;
> + struct sched_domain *sd;
> +
> + sync_entity_load_avg(&p->se);
> +
> + task_util = task_util_est(p);
> + if (!task_util)
> + return prev_cpu;
> +
> + /*
> + * Energy-aware wake-up happens on the lowest sched_domain starting
> + * from sd_ea spanning over this_cpu and prev_cpu.
> + */
> + sd = rcu_dereference(*this_cpu_ptr(&sd_ea));
> + while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> + sd = sd->parent;
> + if (!sd)
> + return -1;
Shouldn't this be return prev_cpu?
> +
> + if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> + prev_energy = best_energy = compute_energy(p, prev_cpu);
> + else
> + prev_energy = best_energy = ULONG_MAX;
> +
> + for_each_freq_domain(sfd) {
> + unsigned long spare_cap, max_spare_cap = 0;
> + int max_spare_cap_cpu = -1;
> + unsigned long util;
> +
> + /* Find the CPU with the max spare cap in the freq. dom. */
I understand this being a heuristic to cut some overhead, but shouldn't
the model tell between packing vs. spreading?
Thanks,
-Juri
On 21/05/18 15:25, Quentin Perret wrote:
[...]
> +static long compute_energy(struct task_struct *p, int dst_cpu)
> +{
> + long util, max_util, sum_util, energy = 0;
> + struct sched_energy_fd *sfd;
> + int cpu;
> +
> + for_each_freq_domain(sfd) {
> + max_util = sum_util = 0;
> + for_each_cpu_and(cpu, freq_domain_span(sfd), cpu_online_mask) {
> + util = cpu_util_next(cpu, p, dst_cpu);
> + util += cpu_util_dl(cpu_rq(cpu));
> + /* XXX: add RT util_avg when available. */
em_fd_energy() below uses this to know which power to pick in the freq
table. So, if you have any RT task running on cpu freq will be at max
anyway. It seems to me that in this case max_util for the freq_domain
must be max_freq (w/o considering rt.util_avg as schedutil does). Then
you could probably still use rt.util_avg to get the percentage of busy
time with sum_util?
> +
> + max_util = max(util, max_util);
> + sum_util += util;
> + }
> +
> + energy += em_fd_energy(sfd->fd, max_util, sum_util);
> + }
> +
> + return energy;
> +}
Best,
- Juri
On Friday 08 Jun 2018 at 11:36:13 (+0200), Juri Lelli wrote:
> On 08/06/18 09:25, Quentin Perret wrote:
> > Hi Dietmar,
> >
> > On Thursday 07 Jun 2018 at 17:55:32 (+0200), Dietmar Eggemann wrote:
>
> [...]
>
> > > IMHO, part of the problem why this might be harder to understand is the fact
> > > that the patches show the use of the 2nd init call
> > > 'em_rescale_cpu_capacity()' but not the 1st one, 'em_register_freq_domain()'.
> > > I guess that Quentin wanted to keep the set as small as possible.
> >
> > Yes, this is confusing. I'm now starting to think that patch 10/10 should
> > probably not be part of this patch-set, especially if I don't provide
> > the patches registering the freq domains from the CPUFreq drivers. And
> > it's the only "Arm-specific" patch in this arch-independent patch-set.
> >
> > So I think I'll drop patch 10/10 for v4 ... That part should be
> > discussed separately, with the rest of the Arm-specific changes.
>
> Mmm, I would actually vote to at least have one example showing how and
> where the em_register_freq_domain() is going to be used. I had to look
> at the repo you referenced, since I think it's quite a fundamental piece
> for understanding the design, IMHO.
Hmmm I see your point. OK, having an example will help. I'll keep patch
10/10 and add an other one tweaking cpufreq-dt to give an example. But
I'll mark the two as OPTIONAL. I really hope we can reach an agreement
on the core design ideas before discussing too much the details on the
driver side.
There are several valid places where em_register_freq_domain() can be
called. But the exact way of doing so will be platform-dependent, and
driver-dependent, so let's agree on what we want to know from the
drivers first :-)
Thanks,
Quentin
On Friday 08 Jun 2018 at 12:24:46 (+0200), Juri Lelli wrote:
> Hi,
>
> On 21/05/18 15:25, Quentin Perret wrote:
>
> [...]
>
> > +static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > +{
> > + unsigned long cur_energy, prev_energy, best_energy, cpu_cap, task_util;
> > + int cpu, best_energy_cpu = prev_cpu;
> > + struct sched_energy_fd *sfd;
> > + struct sched_domain *sd;
> > +
> > + sync_entity_load_avg(&p->se);
> > +
> > + task_util = task_util_est(p);
> > + if (!task_util)
> > + return prev_cpu;
> > +
> > + /*
> > + * Energy-aware wake-up happens on the lowest sched_domain starting
> > + * from sd_ea spanning over this_cpu and prev_cpu.
> > + */
> > + sd = rcu_dereference(*this_cpu_ptr(&sd_ea));
> > + while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > + sd = sd->parent;
> > + if (!sd)
> > + return -1;
>
> Shouldn't this be return prev_cpu?
Well, you shouldn't be entering this function without an sd_ea pointer,
so this case is a sort of bug I think. By returning -1 I think we should
end-up picking a CPU using select_fallback_rq(), which sort of makes
sense ?
>
> > +
> > + if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> > + prev_energy = best_energy = compute_energy(p, prev_cpu);
> > + else
> > + prev_energy = best_energy = ULONG_MAX;
> > +
> > + for_each_freq_domain(sfd) {
> > + unsigned long spare_cap, max_spare_cap = 0;
> > + int max_spare_cap_cpu = -1;
> > + unsigned long util;
> > +
> > + /* Find the CPU with the max spare cap in the freq. dom. */
>
> I understand this being a heuristic to cut some overhead, but shouldn't
> the model tell between packing vs. spreading?
Ah, that's a very interesting one :-) !
So, with only active costs of the CPUs in the model, we can't really
tell what's best between packing or spreading between identical CPUs if
the migration of the task doesn't change the OPP request.
In a frequency domain, all the "best" CPU candidates for a task are
those for which we'll request a low OPP. When there are several CPUs for
which the OPP request will be the same, we just don't know which one to
pick from an energy standpoint, because we don't have other energy costs
(for idle states for ex) to break the tie.
With this EM, the interesting thing is that if you assume that OPP
requests follow utilization, you are _guaranteed_ that the CPU with
the max spare capacity in a freq domain will always be among the best
candidates of this freq domain. And since we don't know how to
differentiate those candidates, why not use this one ?
Yes, it _might_ be better from an energy standpoint to pack small tasks
on a CPU in order to let other CPUs go in deeper idle states. But that
also hurts your chances to go cluster idle. Which solution is the best ?
It depends, and we have no ways to tell with this EM.
This approach basically favors cluster-packing, and spreading inside a
cluster. That should at least be a good thing for latency, and this is
consistent with the idea that most of the energy savings come from the
asymmetry of the system, and not so much from breaking the tie between
identical CPUs. That's also the reason why EAS is enabled only if your
system has SD_ASYM_CPUCAPACITY set, as we already discussed for patch
05/10 :-).
Does that make sense ?
Thanks,
Quentin
On 08/06/18 12:19, Quentin Perret wrote:
> On Friday 08 Jun 2018 at 12:24:46 (+0200), Juri Lelli wrote:
> > Hi,
> >
> > On 21/05/18 15:25, Quentin Perret wrote:
> >
> > [...]
> >
> > > +static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > +{
> > > + unsigned long cur_energy, prev_energy, best_energy, cpu_cap, task_util;
> > > + int cpu, best_energy_cpu = prev_cpu;
> > > + struct sched_energy_fd *sfd;
> > > + struct sched_domain *sd;
> > > +
> > > + sync_entity_load_avg(&p->se);
> > > +
> > > + task_util = task_util_est(p);
> > > + if (!task_util)
> > > + return prev_cpu;
> > > +
> > > + /*
> > > + * Energy-aware wake-up happens on the lowest sched_domain starting
> > > + * from sd_ea spanning over this_cpu and prev_cpu.
> > > + */
> > > + sd = rcu_dereference(*this_cpu_ptr(&sd_ea));
> > > + while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > > + sd = sd->parent;
> > > + if (!sd)
> > > + return -1;
> >
> > Shouldn't this be return prev_cpu?
>
> Well, you shouldn't be entering this function without an sd_ea pointer,
> so this case is a sort of bug I think. By returning -1 I think we should
> end up picking a CPU using select_fallback_rq(), which sort of makes
> sense ?
I fear cpumask_test_cpu() and such won't be happy with a -1 arg.
If it's a recoverable bug, I'd say return prev and WARN_ON_ONCE() ?
> > > +
> > > + if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> > > + prev_energy = best_energy = compute_energy(p, prev_cpu);
> > > + else
> > > + prev_energy = best_energy = ULONG_MAX;
> > > +
> > > + for_each_freq_domain(sfd) {
> > > + unsigned long spare_cap, max_spare_cap = 0;
> > > + int max_spare_cap_cpu = -1;
> > > + unsigned long util;
> > > +
> > > + /* Find the CPU with the max spare cap in the freq. dom. */
> >
> > I understand this being a heuristic to cut some overhead, but shouldn't
> > the model tell between packing vs. spreading?
>
> Ah, that's a very interesting one :-) !
>
> So, with only active costs of the CPUs in the model, we can't really
> tell what's best between packing or spreading between identical CPUs if
> the migration of the task doesn't change the OPP request.
>
> In a frequency domain, all the "best" CPU candidates for a task are
> those for which we'll request a low OPP. When there are several CPUs for
> which the OPP request will be the same, we just don't know which one to
> pick from an energy standpoint, because we don't have other energy costs
> (for idle states for ex) to break the tie.
>
> With this EM, the interesting thing is that if you assume that OPP
> requests follow utilization, you are _guaranteed_ that the CPU with
> the max spare capacity in a freq domain will always be among the best
> candidates of this freq domain. And since we don't know how to
> differentiate those candidates, why not use this one ?
>
> Yes, it _might_ be better from an energy standpoint to pack small tasks
> on a CPU in order to let other CPUs go in deeper idle states. But that
> also hurts your chances to go cluster idle. Which solution is the best ?
> It depends, and we have no ways to tell with this EM.
>
> This approach basically favors cluster-packing, and spreading inside a
> cluster. That should at least be a good thing for latency, and this is
> consistent with the idea that most of the energy savings come from the
> asymmetry of the system, and not so much from breaking the tie between
> identical CPUs. That's also the reason why EAS is enabled only if your
> system has SD_ASYM_CPUCAPACITY set, as we already discussed for patch
> 05/10 :-).
>
> Does that make sense ?
Yes, thanks for the explanation. It would probably make sense to copy
and paste your text above somewhere in comment/doc for future ref.
On 06/08/2018 10:25 AM, Quentin Perret wrote:
> Hi Dietmar,
>
> On Thursday 07 Jun 2018 at 17:55:32 (+0200), Dietmar Eggemann wrote:
>> On 06/07/2018 05:19 PM, Quentin Perret wrote:
>>> Hi Juri,
>>>
>>> On Thursday 07 Jun 2018 at 16:44:09 (+0200), Juri Lelli wrote:
>>>> On 21/05/18 15:24, Quentin Perret wrote:
[...]
> The comment above em_register_freq_domain() explains that, at least
> partially. In the current implementation, if multiple providers register
> the same frequency domain, all but the first will be ignored. The reason
> I implemented it that way is that: 1) it's simple; 2) it should
> cover the current use-cases for EAS and IPA.
>
> But we could do something more clever. We could add a parameter to
> em_register_freq_domain() that would represent some sort of priority. In
> this case, if multiple providers register the same freq domain, the
> higher priority would override the lower priority. For example, power
> values coming from firmware could overwrite power values estimated with
> P=CV^2f.
In your current '(3)* Arm/Arm64 init code' (* see at the end of this
email) you have this dev_pm_opp_of_estimate_power() em_data_callback
active_power function.
Let's say thermal and the task scheduler would initialize the EM
independently. They would still end up using C from dt, and f, V and P
from the OPP library in your example.
IMHO, this information should be only provided once from one source per
platform.
>> The re-scaling thing comes from the requirement that the final cpu capacity
>> values are only known after the arch_topology driver was able to scale the
>> dmips capacity values with the policy->cpuinfo.max_freq, but why can't we
>> create the EM on arm/arm64 after this?
>
> What if you don't have dmips-capacity-mhz values in the DT and still
> want to use IPA ? There is no good reason to create a dependency between
> the thermal subsystem and the arch_topology driver IMO.
Mmmmh, that's correct. So it can't be simply called in
init_cpu_capacity_callback() [drivers/base/arch_topology.c] in case the
cpus_to_visit mask is empty. There is this dependency that cpufreq can
be loaded at any time, requiring this re-scaling of capacity values ...
That's not nice ...
>> Even though we would be forced to get cpufreq's related cpumask from
>> somewhere.
>
> That's the easy part. The difficult part is, where do you get power
> values from ? You have to let the lower layers register those values
> in a centralized location on a voluntary basis. And then it becomes easy
> for consumers to access that data, because they know where it is.
The code in the arch could use the same struct em_data_callback em_cb =
{ &dev_pm_opp_of_estimate_power } that the cpufreq driver is currently
using?
>> I guess the easiest model will be that the Energy Model (EM) is fully
>> initialized with one init call (from the arch) and fixed after that.
>
> Again, I don't think that's possible. You have to let the lower layers
> tell you where the power values come from, at the very least. You could
> let the archs do that aggregation I suppose, but I don't really see the
> benefit over one centralized framework with a generic interface ...
> What's your opinion ?
Don't understand the '... let the lower layers tell you where the power
values come from ...' part. Where is the difference whether the arch or
the cpufreq driver uses em_data_callback?
[...]
> So I think I'll drop patch 10/10 for v4 ... That part should be
> discussed separately, with the rest of the Arm-specific changes.
Maybe 3 clearly separated parts of the patch-set; (1) EM (2) EAS uses EM
(3) Arm/Arm64 init code ?
[...]
On Friday 08 Jun 2018 at 14:39:33 (+0200), Dietmar Eggemann wrote:
> In your current '(3)* Arm/Arm64 init code' (* see at the end of this email)
> you have this dev_pm_opp_of_estimate_power() em_data_callback active_power
> function.
>
> Let's say thermal and the task scheduler would initialize the EM
> independently. They would still end up using C from dt, and f, V and P from
> opp library in your example.
Ah, that's where the confusion comes from ... The task scheduler or
thermal _don't_ initialize the EM. They just use it if it's there.
The _drivers_ (typically, but not limited to, CPUFreq drivers)
initialize the EM. I'll try to make that clearer in the diagram for v4.
> IMHO, this information should be only provided once from one source per
> platform.
Yes, I think so too. That's why all but the first registration of the
same frequency domain are ignored.
> > > Even though we would be forced to get cpufreq's related cpumask from
> > > somewhere.
> >
> > That's the easy part. The difficult part is, where do you get power
> > values from ? You have to let the lower layers register those values
> > in a centralized location on a voluntary basis. And then it becomes easy
> > for consumers to access that data, because they know where it is.
>
> The code in the arch could use the same struct em_data_callback em_cb = {
> &dev_pm_opp_of_estimate_power } that the cpufreq driver is currently using?
How do you know from the arch code if you should get power from DT
with dev_pm_opp_of_estimate_power or use another callback that reads
power from firmware (SCMI) ?
I don't think it is reasonable to assume a single source of information for
an arch. It is already an incorrect assumption even if you just look at
the Arm world.
> > Again, I don't think that's possible. You have to let the lower layers
> > tell you where the power values come from, at the very least. You could
> > let the archs do that aggregation I suppose, but I don't really see the
> > benefit over one centralized framework with a generic interface ...
> > What's your opinion ?
>
> Don't understand the '... let the lower layers tell you where the power
> values come from ...' part. Where is the difference whether the arch or the
> cpufreq driver uses em_data_callback?
Because different CPUFreq drivers can be used for one arch. There are
different CPUFreq drivers because there are different ways of getting
information about the platform, even just for the Arm world (DT, SCPI,
SCMI, ...). It's the same thing for power values, they don't necessarily
come from DT.
The point of having a centralized EM framework with a standardized
callback prototype is flexibility. You can implement a callback that
estimates power from the DT. You can implement a callback that reads
power from firmware. But you can also have a completely ad-hoc EM
provider in a module if you like. All you have to do to provide data to
the framework is respect the callback API.
> > So I think I'll drop patch 10/10 for v4 ... That part should be
> > discussed separately, with the rest of the Arm-specific changes.
>
> Maybe 3 clearly separated parts of the patch-set; (1) EM (2) EAS uses EM (3)
> Arm/Arm64 init code ?
Right, that sounds like the right thing to do :-)
Thanks,
Quentin
On Wed, Jun 06, 2018 at 05:26:47PM +0100, Quentin Perret wrote:
> On Wednesday 06 Jun 2018 at 16:29:50 (+0100), Quentin Perret wrote:
> > On Wednesday 06 Jun 2018 at 17:20:00 (+0200), Juri Lelli wrote:
> > > > > This brings me to another question. Let's say there are multiple users of
> > > > > the Energy Model in the system. Shouldn't the units of frequency and power
> > > > > be standardized, maybe MHz and mW?
> > > > > The task scheduler doesn't care since it is only interested in power diffs
> > > > > but other users might.
> > > >
> > > > So the good thing about specifying units is that we can probably assume
> > > > ranges on the values. If the power is in mW, assuming that we're talking
> > > > about a single CPU, it'll probably fit in 16 bits. 65W/core should be
> > > > a reasonable upper-bound ?
> > > > But there are also vendors who might not be happy with disclosing absolute
> > > > values ... These are sometimes considered sensitive and only relative
> > > > numbers are discussed publicly. Now, you can also argue that we already
> > > > have units specified in IPA for ex, and that it doesn't really matter if
> > > > a driver "lies" about the real value, as long as the ratios are correct.
> > > > And I guess that anyone can do measurement on the hardware and get those
> > > > values anyway. So specifying a unit (mW) for the power is probably a
> > > > good idea.
> > >
> > > Mmm, I remember we fought quite a bit while getting capacity-dmips-mhz
> > > binding accepted, and one of the musts was that the values were going to
> > > be normalized. So, normalized power values again maybe?
> >
> > Hmmm, that's a very good point ... There should be no problems on the
> > scheduler side -- we're only interested in correct ratios. But I'm not
> > sure on the thermal side ... I will double check that.
>
> So, IPA needs to compare the power of the CPUs with the power of other
> things (e.g. GPUs). So we can't normalize the power of the CPUs without
> normalizing in the same scale the power of the other devices. I see two
> possibilities:
>
> 1) we don't normalize the CPU power values, we specify them in mW, and
> we document (and maybe throw a warning if we see an issue at runtime)
> the max range of values. The max expected power for a single core
> could be 65K for ex (16bits). And based on that we can verify
> overflow and precision issues in the algorithms, and we keep it easy
> to compare the CPU power numbers with other devices.
>
> 2) we normalize the power values, but that means that the EM framework
> has to manage not only CPUs, but also other types of devices, and
> normalize their power values as well. That's required to keep the
> scale consistent across all of them, and keep comparisons doable.
> But if we do this, we still have to keep a normalized and a "raw"
> version of the power for all devices. And the "raw" power must still
> be in the same unit across all devices, otherwise the re-scaling is
> broken. The main benefit of doing this is that the range of
> acceptable "raw" power values can be larger, probably 32bits, and
> that the precision of the normalized range is arbitrary.
>
> I feel like 2) involves a lot of complexity, and not so many benefits,
> so I'd be happy to go with 1). Unless I forgot something ?
From the thermal point of view, the power values don't need to have
any given unit, as long as the values are comparable to each other.
Do we need to normalize anything in the kernel though? Can't we just
assume that whatever the platform is telling us is correct? Quentin
mentioned it earlier: sometimes absolute values are considered
sensitive and we only get ones that are correct relative to the rest
of the system.
(This reminds me that the units in
Documentation/thermal/cpu-cooling-api.txt are wrong and I need to fix
it :-X )
Cheers,
Javi
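The range assumption behind option 1) above can be made concrete with a short
sketch (illustrative userspace C; the helper name and the warning are made up,
not part of the proposed API): power values are in mW, and anything that does
not fit in 16 bits gets flagged.

```c
#include <stdio.h>

/*
 * Option 1) sketch: power is specified in mW and expected to fit in
 * 16 bits per CPU (<= 65535 mW, i.e. ~65W/core as an upper bound).
 * em_power_in_range() is a hypothetical helper for illustration only.
 */
#define EM_MAX_POWER 0xFFFFUL

static int em_power_in_range(unsigned long power_mw)
{
	if (power_mw > EM_MAX_POWER) {
		fprintf(stderr, "EM: power value %lu mW out of range\n",
			power_mw);
		return 0;
	}
	return 1;
}
```

Documenting the bound this way is what makes overflow and precision analysis
possible in the consumers, whether or not drivers report "true" values.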
Hi Javi,
On Friday 08 Jun 2018 at 14:39:42 (+0100), Javi Merino wrote:
> On Wed, Jun 06, 2018 at 05:26:47PM +0100, Quentin Perret wrote:
> > On Wednesday 06 Jun 2018 at 16:29:50 (+0100), Quentin Perret wrote:
> > > On Wednesday 06 Jun 2018 at 17:20:00 (+0200), Juri Lelli wrote:
> > > > > > This brings me to another question. Let's say there are multiple users of
> > > > > > the Energy Model in the system. Shouldn't the units of frequency and power
> > > > > > be standardized, maybe MHz and mW?
> > > > > > The task scheduler doesn't care since it is only interested in power diffs
> > > > > > but other users might.
> > > > >
> > > > > So the good thing about specifying units is that we can probably assume
> > > > > ranges on the values. If the power is in mW, assuming that we're talking
> > > > > about a single CPU, it'll probably fit in 16 bits. 65W/core should be
> > > > > a reasonable upper-bound ?
> > > > > But there are also vendors who might not be happy with disclosing absolute
> > > > > values ... These are sometimes considered sensitive and only relative
> > > > > numbers are discussed publicly. Now, you can also argue that we already
> > > > > have units specified in IPA for ex, and that it doesn't really matter if
> > > > > a driver "lies" about the real value, as long as the ratios are correct.
> > > > > And I guess that anyone can do measurement on the hardware and get those
> > > > > values anyway. So specifying a unit (mW) for the power is probably a
> > > > > good idea.
> > > >
> > > > > Mmm, I remember we fought quite a bit while getting capacity-dmips-mhz
> > > > binding accepted, and one of the musts was that the values were going to
> > > > be normalized. So, normalized power values again maybe?
> > >
> > > Hmmm, that's a very good point ... There should be no problems on the
> > > scheduler side -- we're only interested in correct ratios. But I'm not
> > > sure on the thermal side ... I will double check that.
> >
> > So, IPA needs to compare the power of the CPUs with the power of other
> > things (e.g. GPUs). So we can't normalize the power of the CPUs without
> > normalizing in the same scale the power of the other devices. I see two
> > possibilities:
> >
> > 1) we don't normalize the CPU power values, we specify them in mW, and
> > we document (and maybe throw a warning if we see an issue at runtime)
> > the max range of values. The max expected power for a single core
> > could be 65K for ex (16bits). And based on that we can verify
> > overflow and precision issues in the algorithms, and we keep it easy
> > to compare the CPU power numbers with other devices.
> >
> > 2) we normalize the power values, but that means that the EM framework
> > has to manage not only CPUs, but also other types of devices, and
> > normalize their power values as well. That's required to keep the
> > scale consistent across all of them, and keep comparisons doable.
> > But if we do this, we still have to keep a normalized and a "raw"
> > version of the power for all devices. And the "raw" power must still
> > be in the same unit across all devices, otherwise the re-scaling is
> > broken. The main benefit of doing this is that the range of
> > acceptable "raw" power values can be larger, probably 32bits, and
> > that the precision of the normalized range is arbitrary.
> >
> > I feel like 2) involves a lot of complexity, and not so many benefits,
> > so I'd be happy to go with 1). Unless I forgot something ?
>
> From the thermal point of view, the power values don't need to have
> any given unit, as long as the values are comparable to each other.
OK, thanks for confirming that :-)
> Do we need to normalize anything in the kernel though? Can't we just
> assume that whatever the platform is telling us is correct? Quentin
> mentioned it earlier: sometimes absolute values are considered
> sensitive and we only get ones that are correct relative to the rest
> of the system.
I'm happy to specify the units as mW and let the drivers lie about the
true values. At least that helps them lie coherently if another
subsystem requires power in uW for example.
>
> (This reminds me that the units in
> Documentation/thermal/cpu-cooling-api.txt are wrong and I need to fix
> it :-X )
;-)
Thanks,
Quentin
On Friday 08 Jun 2018 at 13:59:28 (+0200), Juri Lelli wrote:
> On 08/06/18 12:19, Quentin Perret wrote:
> > On Friday 08 Jun 2018 at 12:24:46 (+0200), Juri Lelli wrote:
> > > Hi,
> > >
> > > On 21/05/18 15:25, Quentin Perret wrote:
> > >
> > > [...]
> > >
> > > > +static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > > +{
> > > > + unsigned long cur_energy, prev_energy, best_energy, cpu_cap, task_util;
> > > > + int cpu, best_energy_cpu = prev_cpu;
> > > > + struct sched_energy_fd *sfd;
> > > > + struct sched_domain *sd;
> > > > +
> > > > + sync_entity_load_avg(&p->se);
> > > > +
> > > > + task_util = task_util_est(p);
> > > > + if (!task_util)
> > > > + return prev_cpu;
> > > > +
> > > > + /*
> > > > + * Energy-aware wake-up happens on the lowest sched_domain starting
> > > > + * from sd_ea spanning over this_cpu and prev_cpu.
> > > > + */
> > > > + sd = rcu_dereference(*this_cpu_ptr(&sd_ea));
> > > > + while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > > > + sd = sd->parent;
> > > > + if (!sd)
> > > > + return -1;
> > >
> > > Shouldn't this be return prev_cpu?
> >
> > Well, you shouldn't be entering this function without an sd_ea pointer,
> > so this case is a sort of bug I think. By returning -1 I think we should
> > end up picking a CPU using select_fallback_rq(), which sort of makes
> > sense ?
>
> I fear cpumask_test_cpu() and such won't be happy with a -1 arg.
> If it's a recoverable bug, I'd say return prev and WARN_ON_ONCE() ?
Hmmm, yes, prev + WARN_ON_ONCE is probably appropriate here then.
>
> > > > +
> > > > + if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> > > > + prev_energy = best_energy = compute_energy(p, prev_cpu);
> > > > + else
> > > > + prev_energy = best_energy = ULONG_MAX;
> > > > +
> > > > + for_each_freq_domain(sfd) {
> > > > + unsigned long spare_cap, max_spare_cap = 0;
> > > > + int max_spare_cap_cpu = -1;
> > > > + unsigned long util;
> > > > +
> > > > + /* Find the CPU with the max spare cap in the freq. dom. */
> > >
> > > I understand this being a heuristic to cut some overhead, but shouldn't
> > > the model tell between packing vs. spreading?
> >
> > Ah, that's a very interesting one :-) !
> >
> > So, with only active costs of the CPUs in the model, we can't really
> > tell what's best between packing or spreading between identical CPUs if
> > the migration of the task doesn't change the OPP request.
> >
> > In a frequency domain, all the "best" CPU candidates for a task are
> > those for which we'll request a low OPP. When there are several CPUs for
> > which the OPP request will be the same, we just don't know which one to
> > pick from an energy standpoint, because we don't have other energy costs
> > (for idle states for ex) to break the tie.
> >
> > With this EM, the interesting thing is that if you assume that OPP
> > requests follow utilization, you are _guaranteed_ that the CPU with
> > the max spare capacity in a freq domain will always be among the best
> > candidates of this freq domain. And since we don't know how to
> > differentiate those candidates, why not use this one?
> >
> > Yes, it _might_ be better from an energy standpoint to pack small tasks
> > on a CPU in order to let other CPUs go in deeper idle states. But that
> > also hurts your chances to go cluster idle. Which solution is the best ?
> > It depends, and we have no ways to tell with this EM.
> >
> > This approach basically favors cluster-packing, and spreading inside a
> > cluster. That should at least be a good thing for latency, and this is
> > consistent with the idea that most of the energy savings come from the
> > asymmetry of the system, and not so much from breaking the tie between
> > identical CPUs. That's also the reason why EAS is enabled only if your
> > system has SD_ASYM_CPUCAPACITY set, as we already discussed for patch
> > 05/10 :-).
> >
> > Does that make sense ?
>
> Yes, thanks for the explanation. It would probably make sense to copy
> and paste your text above somewhere in comment/doc for future ref.
OK, will do.
Thanks !
Quentin
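The max-spare-capacity heuristic explained above reduces, for one frequency
domain, to the following stand-alone sketch (plain userspace C; the arrays are
made-up stand-ins for the kernel's per-CPU data, and the helper name is
illustrative):

```c
/*
 * Within one frequency domain, the candidate is simply the CPU with
 * the maximum spare capacity (capacity - utilization). Any CPU with
 * no spare capacity left is skipped.
 */
static int find_max_spare_cap_cpu(const unsigned long *cap,
				  const unsigned long *util, int nr_cpus)
{
	unsigned long spare, max_spare = 0;
	int cpu, best = -1;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (util[cpu] >= cap[cpu])
			continue;	/* no spare capacity left */
		spare = cap[cpu] - util[cpu];
		if (spare > max_spare) {
			max_spare = spare;
			best = cpu;
		}
	}
	return best;
}
```

With capacities {1024, 1024, 512} and utilizations {800, 200, 100}, CPU 1 wins
with 824 units of spare capacity: all three would request comparable OPPs, and
the tie is broken in favour of the most headroom.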
On 06/08/2018 03:11 PM, Quentin Perret wrote:
> On Friday 08 Jun 2018 at 14:39:33 (+0200), Dietmar Eggemann wrote:
[...]
>>>> Even though we would be forced to get cpufreq's related cpumask from
>>>> somewhere.
>>>
>>> That's the easy part. The difficult part is, where do you get power
>>> values from ? You have to let the lower layers register those values
>>> in a centralized location on a voluntary basis. And then it becomes easy
>>> for consumers to access that data, because they know where it is.
>>
>> The code in the arch could use the same struct em_data_callback em_cb = {
>> &dev_pm_opp_of_estimate_power } that the cpufreq driver is currently using?
>
> How do you know from the arch code if you should get power from DT
> with dev_pm_opp_of_estimate_power or use another callback that reads
> power from firmware (SCMI) ?
Ah, ok, cpufreq dt, scpi and arm_big_little are dt, cpufreq scmi can be
different ...
>
> I don't think it is reasonable to assume a single source of information for
> an arch. It is already an incorrect assumption even if you just look at
> the Arm world.
Ok, I see.
>
>>> Again, I don't think that's possible. You have to let the lower layers
>>> tell you where the power values come from, at the very least. You could
>>> let the archs do that aggregation I suppose, but I don't really see the
>>> benefit over one centralized framework with a generic interface ...
>>> What's your opinion ?
>>
>> Don't understand the '... let the lower layers tell you where the power
>> values come from ...' part. Where is the difference whether the arch or the
>> cpufreq driver uses em_data_callback?
>
> Because different CPUFreq drivers can be used for one arch. There are
> different CPUFreq drivers because there are different ways of getting
> information about the platform, even just for the Arm world (DT, SCPI,
> SCMI, ...). It's the same thing for power values, they don't necessarily
> come from DT.
scpi is dt ? At least scpi-cpufreq.c uses this
dev_pm_opp_of_estimate_power too.
> The point of having a centralized EM framework with a standardized
> callback prototype is flexibility. You can implement a callback that
> estimates power from the DT. You can implement a callback that reads
> power from firmware. But you can also have a completely ad-hoc EM
> provider in a module if you like. All you have to do to provide data to
> the framework is respect the callback API.
IMHO, this idea is good, there should also be users of this outside
arm/arm64 ...
[...]
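The callback-based registration idea discussed above can be sketched as
follows. The struct loosely mirrors the patch-set's em_data_callback, but the
extra state argument, the toy provider and the table-building helper are all
made up for illustration; a real provider would read DT, firmware, or anything
else behind the same interface:

```c
/* Loosely modelled on the patch-set's em_data_callback; illustrative only. */
struct em_data_callback {
	int (*active_power)(unsigned long *power, unsigned long *freq,
			    int cpu, int state);
};

/* Toy provider: pretends each successive OPP runs faster and costs more. */
static int toy_active_power(unsigned long *power, unsigned long *freq,
			    int cpu, int state)
{
	*freq = 500000UL * (state + 1);			/* kHz */
	*power = 100UL * (state + 1) * (state + 1);	/* mW, made up */
	return 0;
}

/*
 * The framework side: query the provider once per performance state to
 * aggregate the frequency-domain table, whatever the data source is.
 */
static int em_build_table(const struct em_data_callback *cb, int cpu,
			  int nr_states, unsigned long *power_tbl,
			  unsigned long *freq_tbl)
{
	int i, ret;

	for (i = 0; i < nr_states; i++) {
		ret = cb->active_power(&power_tbl[i], &freq_tbl[i], cpu, i);
		if (ret)
			return ret;
	}
	return 0;
}
```

The point is that only the provider changes between DT, SCPI, SCMI, or an
ad-hoc module; the aggregation and the consumers stay identical.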
On Friday 08 Jun 2018 at 18:39:56 (+0200), Dietmar Eggemann wrote:
> On 06/08/2018 03:11 PM, Quentin Perret wrote:
> > On Friday 08 Jun 2018 at 14:39:33 (+0200), Dietmar Eggemann wrote:
>
> [...]
>
> > > > > Even though we would be forced to get cpufreq's related cpumask from
> > > > > somewhere.
> > > >
> > > > That's the easy part. The difficult part is, where do you get power
> > > > values from ? You have to let the lower layers register those values
> > > > in a centralized location on a voluntary basis. And then it becomes easy
> > > > for consumers to access that data, because they know where it is.
> > >
> > > The code in the arch could use the same struct em_data_callback em_cb = {
> > > &dev_pm_opp_of_estimate_power } that the cpufreq driver is currently using?
> >
> > How do you know from the arch code if you should get power from DT
> > with dev_pm_opp_of_estimate_power or use another callback that reads
> > power from firmware (SCMI) ?
>
> Ah, ok, cpufreq dt, scpi and arm_big_little are dt, cpufreq scmi can be
> different ...
>
> >
> > I don't think it is reasonable to assume a single source of information for
> > an arch. It is already an incorrect assumption even if you just look at
> > the Arm world.
>
> Ok, I see.
>
> >
> > > > Again, I don't think that's possible. You have to let the lower layers
> > > > tell you where the power values come from, at the very least. You could
> > > > let the archs do that aggregation I suppose, but I don't really see the
> > > > benefit over one centralized framework with a generic interface ...
> > > > What's your opinion ?
> > >
> > > Don't understand the '... let the lower layers tell you where the power
> > > values come from ...' part. Where is the difference whether the arch or the
> > > cpufreq driver uses em_data_callback?
> >
> > Because different CPUFreq drivers can be used for one arch. There are
> > different CPUFreq drivers because there are different ways of getting
> > information about the platform, even just for the Arm world (DT, SCPI,
> > SCMI, ...). It's the same thing for power values, they don't necessarily
> > come from DT.
>
> scpi is dt ? At least scpi-cpufreq.c uses this dev_pm_opp_of_estimate_power
> too.
Hrmpf, SCPI isn't really DT. The voltage and freq values don't come from
DT at least. The reason dev_pm_opp_of_estimate_power() is in PM_OPP in my
> tree is precisely because PM_OPP takes care of the SCPI/DT abstraction
> for me. That makes it easier to factor out a bit of code between cpufreq-dt
> and scpi-cpufreq to compute the C*V^2*f estimate, but that's a different story,
and could be implemented differently. The common thing between the two is
that the "C" value is in DT. The rest is fairly different.
>
> > The point of having a centralized EM framework with a standardized
> > callback prototype is flexibility. You can implement a callback that
> > estimates power from the DT. You can implement a callback that reads
> > power from firmware. But you can also have a completely ad-hoc EM
> > provider in a module if you like. All you have to do to provide data to
> > the framework is respect the callback API.
>
> IMHO, this idea is good, there should also be users of this outside arm/arm64
Cool :-). Indeed one of the things I had in mind is that basically this
API based on a callback lets you put what you like in the EM. In the Arm
world, we will most likely register power values for all OPPs when we
know them. But if you want to use this on x86 on a platform where the OS
isn't aware of the OPPs for ex, you can still register something in there
if you really want to. It doesn't have to be "real" OPPs.
Thanks !
Quentin
On Thu, Jun 07, 2018 at 06:04:19PM +0200, Juri Lelli wrote:
> On 07/06/18 16:19, Quentin Perret wrote:
> > Hi Juri,
> >
> > On Thursday 07 Jun 2018 at 16:44:09 (+0200), Juri Lelli wrote:
> > > On 21/05/18 15:24, Quentin Perret wrote:
>
> [...]
>
> > > > +static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> > > > +{
> > > > + unsigned long cmax = arch_scale_cpu_capacity(NULL, cpu);
> > > > + int max_cap_state = cs_table->nr_cap_states - 1;
> > > ^
> > > You don't need this on the stack, right?
> >
> > Oh, why not ?
> >
>
> Because you use it only once here below? Anyway, more a (debatable)
> nitpick than anything.
The compiler optimizes that for you because it knows that it is used
only once. It doesn't put it on the stack, it uses a register. As it
is, it's more readable so I'd rather keep it.
For reference, this is the code gcc 7.3 generates for arm64 for
fd_update_cs_table() (which is inlined in em_rescale_cpu_capacity()):
x27 holds the address to cs_table
1ac: b9400b63 ldr w3, [x27, #8] ; w3 = cs_table->nr_cap_states
1b0: b9406fa4 ldr w4, [x29, #108] ; w4 = 0x18 (sizeof(struct em_cap_state))
1b4: f9400362 ldr x2, [x27] ; x2 = &cs_table[state]
1b8: 51000461 sub w1, w3, #0x1 ; w1 max_cap_state = cs_table->nr_cap_states - 1
[...]
1cc: 9b240821 smaddl x1, w1, w4, x2 ; x1 = &cs_table->state[max_cap_state]
[...]
1d4: f9400427 ldr x7, [x1, #8] ; x7 fmax = cs_table->state[max_cap_state].frequency
[...] ; calculates cmax * cs_table->state[i].frequency in x0
200: 9ac70800 udiv x0, x0, x7 ; x0 = x0 / fmax
; x0 is then stored to cs_table->state[i].capacity
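For reference, the rescaling that code computes boils down to
capacity = cmax * f / fmax for each capacity state, with fmax taken from the
highest state. A stand-alone sketch of the same arithmetic (plain userspace C;
the struct layout is simplified from the quoted patch):

```c
/* Simplified from the patch-set's em_cap_state for illustration. */
struct em_cap_state {
	unsigned long capacity;
	unsigned long frequency;	/* kHz */
	unsigned long power;		/* mW */
};

/*
 * Each state's capacity is the CPU's max capacity scaled linearly by
 * frequency: capacity = cmax * f / fmax. The table is assumed sorted,
 * so fmax comes from the last (highest) capacity state.
 */
static void rescale_cap_states(struct em_cap_state *state, int nr_states,
			       unsigned long cmax)
{
	unsigned long fmax = state[nr_states - 1].frequency;
	int i;

	for (i = 0; i < nr_states; i++)
		state[i].capacity = cmax * state[i].frequency / fmax;
}
```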
On Fri, Jun 08, 2018 at 04:47:39PM +0100, Quentin Perret wrote:
> Hi Javi,
>
> On Friday 08 Jun 2018 at 14:39:42 (+0100), Javi Merino wrote:
> > On Wed, Jun 06, 2018 at 05:26:47PM +0100, Quentin Perret wrote:
> > > On Wednesday 06 Jun 2018 at 16:29:50 (+0100), Quentin Perret wrote:
> > > > On Wednesday 06 Jun 2018 at 17:20:00 (+0200), Juri Lelli wrote:
> > > > > > > This brings me to another question. Let's say there are multiple users of
> > > > > > > the Energy Model in the system. Shouldn't the units of frequency and power
> > > > > > > be standardized, maybe MHz and mW?
> > > > > > > The task scheduler doesn't care since it is only interested in power diffs
> > > > > > > but other users might.
> > > > > >
> > > > > > So the good thing about specifying units is that we can probably assume
> > > > > > ranges on the values. If the power is in mW, assuming that we're talking
> > > > > > about a single CPU, it'll probably fit in 16 bits. 65W/core should be
> > > > > > a reasonable upper-bound ?
> > > > > > But there are also vendors who might not be happy with disclosing absolute
> > > > > > values ... These are sometimes considered sensitive and only relative
> > > > > > numbers are discussed publicly. Now, you can also argue that we already
> > > > > > have units specified in IPA for ex, and that it doesn't really matter if
> > > > > > a driver "lies" about the real value, as long as the ratios are correct.
> > > > > > And I guess that anyone can do measurement on the hardware and get those
> > > > > > values anyway. So specifying a unit (mW) for the power is probably a
> > > > > > good idea.
> > > > >
> > > > > Mmm, I remember we fought quite a bit while getting capacity-dmips-mhz
> > > > > binding accepted, and one of the musts was that the values were going to
> > > > > be normalized. So, normalized power values again maybe?
> > > >
> > > > Hmmm, that's a very good point ... There should be no problems on the
> > > > scheduler side -- we're only interested in correct ratios. But I'm not
> > > > sure on the thermal side ... I will double check that.
> > >
> > > So, IPA needs to compare the power of the CPUs with the power of other
> > > things (e.g. GPUs). So we can't normalize the power of the CPUs without
> > > normalizing in the same scale the power of the other devices. I see two
> > > possibilities:
> > >
> > > 1) we don't normalize the CPU power values, we specify them in mW, and
> > > we document (and maybe throw a warning if we see an issue at runtime)
> > > the max range of values. The max expected power for a single core
> > > could be 65K for ex (16bits). And based on that we can verify
> > > overflow and precision issues in the algorithms, and we keep it easy
> > > to compare the CPU power numbers with other devices.
> > >
> > > 2) we normalize the power values, but that means that the EM framework
> > > has to manage not only CPUs, but also other types of devices, and
> > > normalize their power values as well. That's required to keep the
> > > scale consistent across all of them, and keep comparisons doable.
> > > But if we do this, we still have to keep a normalized and a "raw"
> > > version of the power for all devices. And the "raw" power must still
> > > be in the same unit across all devices, otherwise the re-scaling is
> > > broken. The main benefit of doing this is that the range of
> > > acceptable "raw" power values can be larger, probably 32bits, and
> > > that the precision of the normalized range is arbitrary.
> > >
> > > I feel like 2) involves a lot of complexity, and not so many benefits,
> > > so I'd be happy to go with 1). Unless I forgot something ?
> >
> > From the thermal point of view, the power values don't need to have
> > any given unit, as long as the values are comparable to each other.
>
> OK, thanks for confirming that :-)
>
> > Do we need to normalize anything in the kernel though? Can't we just
> > assume that whatever the platform is telling us is correct? Quentin
> > mentioned it earlier: sometimes absolute values are considered
> > sensitive and we only get ones that are correct relative to the rest
> > of the system.
>
> I'm happy to specify the units as mW and let the drivers lie about the
> true values. At least that helps them lie coherently if another
> subsystem requires power in uW for example.
I think this is a good option.
Cheers,
Javi
On Mon, May 21, 2018 at 03:25:04PM +0100, Quentin Perret wrote:
<snip>
> + if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> + prev_energy = best_energy = compute_energy(p, prev_cpu);
> + else
> + prev_energy = best_energy = ULONG_MAX;
> +
> + for_each_freq_domain(sfd) {
> + unsigned long spare_cap, max_spare_cap = 0;
> + int max_spare_cap_cpu = -1;
> + unsigned long util;
> +
> + /* Find the CPU with the max spare cap in the freq. dom. */
> + for_each_cpu_and(cpu, freq_domain_span(sfd), sched_domain_span(sd)) {
> + if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> + continue;
> +
> + if (cpu == prev_cpu)
> + continue;
> +
> + /* Skip CPUs that will be overutilized */
> + util = cpu_util_wake(cpu, p) + task_util;
> + cpu_cap = capacity_of(cpu);
> + if (cpu_cap * 1024 < util * capacity_margin)
> + continue;
> +
> + spare_cap = cpu_cap - util;
> + if (spare_cap > max_spare_cap) {
> + max_spare_cap = spare_cap;
> + max_spare_cap_cpu = cpu;
> + }
> + }
> +
> + /* Evaluate the energy impact of using this CPU. */
> + if (max_spare_cap_cpu >= 0) {
> + cur_energy = compute_energy(p, max_spare_cap_cpu);
> + if (cur_energy < best_energy) {
> + best_energy = cur_energy;
> + best_energy_cpu = max_spare_cap_cpu;
> + }
> + }
> + }
> +
> + /*
> + * We pick the best CPU only if it saves at least 1.5% of the
> + * energy used by prev_cpu.
> + */
> + if ((prev_energy - best_energy) > (prev_energy >> 6))
> + return best_energy_cpu;
> +
> + return prev_cpu;
> +}
We are comparing the best_energy_cpu against prev_cpu even when prev_cpu
can't accommodate the waking task. Is this intentional? Shouldn't we
discard prev_cpu if it can't accommodate the task?
This can potentially make a BIG task run on a lower-capacity CPU until
the load balancer kicks in and corrects the situation.
Thanks,
Pavan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
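A side note on the margin in the quoted code: the "1.5%" comment is
implemented as a right shift, so the actual threshold is prev_energy / 64,
roughly 1.56%. A stand-alone version of that comparison (the helper name is
illustrative, not from the patch):

```c
/*
 * Migrating away from prev_cpu requires a saving strictly greater
 * than prev_energy >> 6, i.e. prev_energy / 64 (~1.56%). best_energy
 * is never larger than prev_energy in the caller, so the unsigned
 * subtraction cannot underflow.
 */
static int worth_migrating(unsigned long prev_energy,
			   unsigned long best_energy)
{
	return (prev_energy - best_energy) > (prev_energy >> 6);
}
```

A saving of exactly 1/64 is not enough: the strict comparison keeps the task
on prev_cpu unless the alternative is measurably cheaper.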
On Mon, May 21, 2018 at 03:25:01PM +0100, Quentin Perret wrote:
<snip>
> util_est_enqueue(&rq->cfs, p);
> hrtick_update(rq);
> @@ -8121,11 +8144,12 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> * @local_group: Does group contain this_cpu.
> * @sgs: variable to hold the statistics for this group.
> * @overload: Indicate more than one runnable task for any CPU.
> + * @overutilized: Indicate overutilization for any CPU.
> */
> static inline void update_sg_lb_stats(struct lb_env *env,
> struct sched_group *group, int load_idx,
> int local_group, struct sg_lb_stats *sgs,
> - bool *overload)
> + bool *overload, int *overutilized)
> {
> unsigned long load;
> int i, nr_running;
> @@ -8152,6 +8176,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> if (nr_running > 1)
> *overload = true;
>
> + if (cpu_overutilized(i))
> + *overutilized = 1;
> +
There is no need to check if every CPU is overutilized or not once
*overutilized is marked as true, right?
<snip>
>
> @@ -8586,6 +8621,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> * this level.
> */
> update_sd_lb_stats(env, &sds);
> +
> + if (sched_energy_enabled() && !READ_ONCE(env->dst_rq->rd->overutilized))
> + goto out_balanced;
> +
Is there any reason for sending no-hz idle kicks but bailing out here when
the system is not overutilized?
Hi Pavan,
On Tuesday 19 Jun 2018 at 10:36:01 (+0530), Pavan Kondeti wrote:
> On Mon, May 21, 2018 at 03:25:04PM +0100, Quentin Perret wrote:
>
> <snip>
>
> > + if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> > + prev_energy = best_energy = compute_energy(p, prev_cpu);
> > + else
> > + prev_energy = best_energy = ULONG_MAX;
> > +
> > + for_each_freq_domain(sfd) {
> > + unsigned long spare_cap, max_spare_cap = 0;
> > + int max_spare_cap_cpu = -1;
> > + unsigned long util;
> > +
> > + /* Find the CPU with the max spare cap in the freq. dom. */
> > + for_each_cpu_and(cpu, freq_domain_span(sfd), sched_domain_span(sd)) {
> > + if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> > + continue;
> > +
> > + if (cpu == prev_cpu)
> > + continue;
> > +
> > + /* Skip CPUs that will be overutilized */
> > + util = cpu_util_wake(cpu, p) + task_util;
> > + cpu_cap = capacity_of(cpu);
> > + if (cpu_cap * 1024 < util * capacity_margin)
> > + continue;
> > +
> > + spare_cap = cpu_cap - util;
> > + if (spare_cap > max_spare_cap) {
> > + max_spare_cap = spare_cap;
> > + max_spare_cap_cpu = cpu;
> > + }
> > + }
> > +
> > + /* Evaluate the energy impact of using this CPU. */
> > + if (max_spare_cap_cpu >= 0) {
> > + cur_energy = compute_energy(p, max_spare_cap_cpu);
> > + if (cur_energy < best_energy) {
> > + best_energy = cur_energy;
> > + best_energy_cpu = max_spare_cap_cpu;
> > + }
> > + }
> > + }
> > +
> > + /*
> > + * We pick the best CPU only if it saves at least 1.5% of the
> > + * energy used by prev_cpu.
> > + */
> > + if ((prev_energy - best_energy) > (prev_energy >> 6))
> > + return best_energy_cpu;
> > +
> > + return prev_cpu;
> > +}
>
> We are comparing the best_energy_cpu against prev_cpu even when prev_cpu
> can't accommodate the waking task. Is this intentional? Shouldn't we
> discard prev_cpu if it can't accommodate the task?
>
> This can potentially make a BIG task run on a lower-capacity CPU until
> the load balancer kicks in and corrects the situation.
We shouldn't enter find_energy_efficient_cpu() in the first place if the
system is overutilized, so that shouldn't be too much of an issue in
general.
But yeah, there is one small corner case where prev_cpu is overutilized
and the system has not been flagged as such yet (when the task wakes up
shortly before the tick for ex). I think it's possible to cover this case
by extending the "if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))"
condition at the top of the function with a check on capacity_margin.
I'll change that in v4.
Thanks !
Quentin
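The extension proposed above could look roughly like the following sketch
(illustrative helper, not the actual v4 patch; the 1024/1280 margin is taken
from the check that already skips overutilized candidates in the quoted code):

```c
/* ~20% headroom, matching the capacity_margin used in the quoted code. */
#define CAPACITY_MARGIN 1280

/*
 * prev_cpu is a valid energy baseline only if the task is allowed
 * there *and* fits within the capacity margin; otherwise its energy
 * is treated as ULONG_MAX, so any other candidate wins.
 */
static int prev_cpu_fits(unsigned long prev_util, unsigned long prev_cap,
			 int allowed)
{
	return allowed && (prev_cap * 1024 >= prev_util * CAPACITY_MARGIN);
}
```

This covers the corner case where prev_cpu is overutilized but the system-wide
flag has not been raised yet.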
On Tue, Jun 19, 2018 at 08:57:23AM +0100, Quentin Perret wrote:
> Hi Pavan,
>
> On Tuesday 19 Jun 2018 at 10:36:01 (+0530), Pavan Kondeti wrote:
> > On Mon, May 21, 2018 at 03:25:04PM +0100, Quentin Perret wrote:
> >
> > <snip>
> >
> > > + if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> > > + prev_energy = best_energy = compute_energy(p, prev_cpu);
> > > + else
> > > + prev_energy = best_energy = ULONG_MAX;
> > > +
> > > + for_each_freq_domain(sfd) {
> > > + unsigned long spare_cap, max_spare_cap = 0;
> > > + int max_spare_cap_cpu = -1;
> > > + unsigned long util;
> > > +
> > > + /* Find the CPU with the max spare cap in the freq. dom. */
> > > + for_each_cpu_and(cpu, freq_domain_span(sfd), sched_domain_span(sd)) {
> > > + if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> > > + continue;
> > > +
> > > + if (cpu == prev_cpu)
> > > + continue;
> > > +
> > > + /* Skip CPUs that will be overutilized */
> > > + util = cpu_util_wake(cpu, p) + task_util;
> > > + cpu_cap = capacity_of(cpu);
> > > + if (cpu_cap * 1024 < util * capacity_margin)
> > > + continue;
> > > +
> > > + spare_cap = cpu_cap - util;
> > > + if (spare_cap > max_spare_cap) {
> > > + max_spare_cap = spare_cap;
> > > + max_spare_cap_cpu = cpu;
> > > + }
> > > + }
> > > +
> > > + /* Evaluate the energy impact of using this CPU. */
> > > + if (max_spare_cap_cpu >= 0) {
> > > + cur_energy = compute_energy(p, max_spare_cap_cpu);
> > > + if (cur_energy < best_energy) {
> > > + best_energy = cur_energy;
> > > + best_energy_cpu = max_spare_cap_cpu;
> > > + }
> > > + }
> > > + }
> > > +
> > > + /*
> > > + * We pick the best CPU only if it saves at least 1.5% of the
> > > + * energy used by prev_cpu.
> > > + */
> > > + if ((prev_energy - best_energy) > (prev_energy >> 6))
> > > + return best_energy_cpu;
> > > +
> > > + return prev_cpu;
> > > +}
> >
> > We are comparing the best_energy_cpu against prev_cpu even when prev_cpu
> > can't accommodate the waking task. Is this intentional? Shouldn't we
> > discard prev_cpu if it can't accommodate the task?
> >
> > This can potentially make a BIG task run on a lower-capacity CPU until
> > the load balancer kicks in and corrects the situation.
>
> We shouldn't enter find_energy_efficient_cpu() in the first place if the
> system is overutilized, so that shouldn't be too much of an issue in
> general.
>
With UTIL_EST enabled, the overutilization condition may get turned off when
that one BIG task goes to sleep. When it wakes up again, we may place it on
the previous CPU due to the above-mentioned issue.
It is not just about an existing overutilization condition: by placing this
task on prev_cpu, we may enter the overutilized state.
> But yeah, there is one small corner case where prev_cpu is overutilized
> and the system has not been flagged as such yet (when the task wakes up
> shortly before the tick for ex). I think it's possible to cover this case
> by extending the "if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))"
> condition at the top of the function with a check on capacity_margin.
>
LGTM.
On Mon, May 21, 2018 at 03:25:05PM +0100, Quentin Perret wrote:
<snip>
> +static void start_eas_workfn(struct work_struct *work);
> +static DECLARE_WORK(start_eas_work, start_eas_workfn);
> +
> static int
> init_cpu_capacity_callback(struct notifier_block *nb,
> unsigned long val,
> @@ -204,6 +209,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
> free_raw_capacity();
> pr_debug("cpu_capacity: parsing done\n");
> schedule_work(&parsing_done_work);
> + schedule_work(&start_eas_work);
> }
>
> return 0;
> @@ -249,6 +255,19 @@ static void parsing_done_workfn(struct work_struct *work)
> free_cpumask_var(cpus_to_visit);
> }
>
> +static void start_eas_workfn(struct work_struct *work)
> +{
> + /* Make sure the EM knows about the updated CPU capacities. */
> + rcu_read_lock();
> + em_rescale_cpu_capacity();
> + rcu_read_unlock();
> +
> + /* Inform the scheduler about the EM availability. */
> + cpus_read_lock();
> + rebuild_sched_domains();
> + cpus_read_unlock();
> +}
Rebuilding the sched domains is unnecessary for platforms that don't have
an energy model. In fact, we can completely avoid scheduling this work.
There seems to be a sysfs interface exposed by this driver to change cpu_scale.
Should we worry about it? I don't know what the use case is for changing
cpu_scale from user space.
Are we expecting that the energy model is populated by this time? If it is
not, shouldn't we build the sched domains again after the energy model is
populated?
Thanks,
Pavan
Hi Pavan,
On Tuesday 19 Jun 2018 at 14:48:41 (+0530), Pavan Kondeti wrote:
> On Mon, May 21, 2018 at 03:25:05PM +0100, Quentin Perret wrote:
>
> <snip>
>
> > +static void start_eas_workfn(struct work_struct *work);
> > +static DECLARE_WORK(start_eas_work, start_eas_workfn);
> > +
> > static int
> > init_cpu_capacity_callback(struct notifier_block *nb,
> > unsigned long val,
> > @@ -204,6 +209,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
> > free_raw_capacity();
> > pr_debug("cpu_capacity: parsing done\n");
> > schedule_work(&parsing_done_work);
> > + schedule_work(&start_eas_work);
> > }
> >
> > return 0;
> > @@ -249,6 +255,19 @@ static void parsing_done_workfn(struct work_struct *work)
> > free_cpumask_var(cpus_to_visit);
> > }
> >
> > +static void start_eas_workfn(struct work_struct *work)
> > +{
> > + /* Make sure the EM knows about the updated CPU capacities. */
> > + rcu_read_lock();
> > + em_rescale_cpu_capacity();
> > + rcu_read_unlock();
> > +
> > + /* Inform the scheduler about the EM availability. */
> > + cpus_read_lock();
> > + rebuild_sched_domains();
> > + cpus_read_unlock();
> > +}
>
> Rebuilding the sched domains is unnecessary for the platform that don't have
> energy-model. In fact, we can completely avoid scheduling this work.
Good point, we might be able to save a little bit of work there.
Now, you will reach this point only if capacity-dmips-mhz values are
specified in DT for your platform, which is usually done only for
big.LITTLE-like Arm platforms. And there is a good chance that you will want
an Energy Model as well for those platforms, so in practice that shouldn't
make a massive difference, I suppose ...
> There seems to be a sysfs interface exposed by this driver to change cpu_scale.
> Should we worry about it? I don't know what is the usecase for changing the
> cpu_scale from user space.
This is something I've been wondering as well. TBH, I'm not sure what to
do in this case. And I'm not sure what the use case is either.
Debugging purposes, I assume?
Juri, did you have a specific use-case for this feature when the
arch_topology driver was first introduced ? Or was it just to align
with the existing arm/arm64 code ?
The reason I didn't call the em_rescale_cpu_capacity() function when CPU
capacities are written from sysfs is that this interface makes it possible to
have CPUs with different capacities in the same freq_domain, which is against
the assumptions we have in the EM framework.
>
> Are we expecting that the energy-model is populated by this time? If it is
> not, should not we build the sched domains again after the energy-model is
> populated?
Yes, the assumption is that drivers have provided the EM by this time.
I'll send a v4 of this patch set very soon with an example of cpufreq
driver modification.
Thanks,
Quentin
On 19/06/18 10:40, Quentin Perret wrote:
> Hi Pavan,
>
> On Tuesday 19 Jun 2018 at 14:48:41 (+0530), Pavan Kondeti wrote:
[...]
> > There seems to be a sysfs interface exposed by this driver to change cpu_scale.
> > Should we worry about it? I don't know what is the usecase for changing the
> > cpu_scale from user space.
>
> This is something I've been wondering as well. TBH, I'm not sure what to
> do in this case. And I'm not sure to know what is the use-case either.
> Debugging purpose I assume ?
>
> Juri, did you have a specific use-case for this feature when the
> arch_topology driver was first introduced ? Or was it just to align
> with the existing arm/arm64 code ?
It was requested (IIRC) because DT might have bogus values and not be
easily modifiable. So, this is another way to get things right for your
platform at runtime.
On Mon, May 21, 2018 at 03:25:02PM +0100, Quentin Perret wrote:
<snip>
>
> +/*
> + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> + */
> +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> +{
> + unsigned long util, util_est;
> + struct cfs_rq *cfs_rq;
> +
> + /* Task is where it should be, or has no impact on cpu */
> + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> + return cpu_util(cpu);
> +
> + cfs_rq = &cpu_rq(cpu)->cfs;
> + util = READ_ONCE(cfs_rq->avg.util_avg);
> +
> + if (dst_cpu == cpu)
> + util += task_util(p);
> + else
> + util = max_t(long, util - task_util(p), 0);
> +
> + if (sched_feat(UTIL_EST)) {
> + util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> + if (dst_cpu == cpu)
> + util_est += _task_util_est(p);
> + else
> + util_est = max_t(long, util_est - _task_util_est(p), 0);
In the UTIL_EST case, the waking task will not have any contribution in the
previous CPU's util_est, so we can just use the previous CPU's util_est as-is.
> + util = max(util, util_est);
> + }
> +
> + return min_t(unsigned long, util, capacity_orig_of(cpu));
> +}
> +
On Tuesday 19 Jun 2018 at 15:21:40 (+0530), Pavan Kondeti wrote:
> On Mon, May 21, 2018 at 03:25:02PM +0100, Quentin Perret wrote:
>
> <snip>
>
> >
> > +/*
> > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> > + */
> > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> > +{
> > + unsigned long util, util_est;
> > + struct cfs_rq *cfs_rq;
> > +
> > + /* Task is where it should be, or has no impact on cpu */
> > + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> > + return cpu_util(cpu);
> > +
> > + cfs_rq = &cpu_rq(cpu)->cfs;
> > + util = READ_ONCE(cfs_rq->avg.util_avg);
> > +
> > + if (dst_cpu == cpu)
> > + util += task_util(p);
> > + else
> > + util = max_t(long, util - task_util(p), 0);
> > +
> > + if (sched_feat(UTIL_EST)) {
> > + util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > + if (dst_cpu == cpu)
> > + util_est += _task_util_est(p);
> > + else
> > + util_est = max_t(long, util_est - _task_util_est(p), 0);
>
> For UTIL_EST case, the waking task will not have any contribution in the
> previous CPU's util_est. So we can just use the previous CPU util_est as is.
Right, good catch, I actually spotted that one as well and already fixed it
for v4 ;-)
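For the archive, a userspace sketch of what the corrected logic would look like; field accesses are flattened into plain parameters and all names here are stand-ins, not the actual v4 code:

```c
#include <assert.h>

/* Sketch of the corrected UTIL_EST handling in cpu_util_next(): a
 * sleeping task has no contribution left in its previous CPU's util_est,
 * so nothing must be subtracted on the "task moves away" path. */
static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long cpu_util_next_sketch(unsigned long cpu_util_avg,
					  unsigned long cpu_util_est,
					  unsigned long task_util,
					  unsigned long task_util_est,
					  int dst_is_this_cpu)
{
	unsigned long util = cpu_util_avg;
	unsigned long util_est = cpu_util_est;

	if (dst_is_this_cpu) {
		util += task_util;
		util_est += task_util_est;
	} else {
		/* PELT still carries the blocked task's contribution in
		 * util_avg, so remove it; util_est is left untouched. */
		util = util >= task_util ? util - task_util : 0;
	}

	return max_ul(util, util_est);
}
```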
On Tuesday 19 Jun 2018 at 11:47:14 (+0200), Juri Lelli wrote:
> On 19/06/18 10:40, Quentin Perret wrote:
> > Hi Pavan,
> >
> > On Tuesday 19 Jun 2018 at 14:48:41 (+0530), Pavan Kondeti wrote:
>
> [...]
>
> > > There seems to be a sysfs interface exposed by this driver to change cpu_scale.
> > > Should we worry about it? I don't know what is the usecase for changing the
> > > cpu_scale from user space.
> >
> > This is something I've been wondering as well. TBH, I'm not sure what to
> > do in this case. And I'm not sure to know what is the use-case either.
> > Debugging purpose I assume ?
> >
> > Juri, did you have a specific use-case for this feature when the
> > arch_topology driver was first introduced ? Or was it just to align
> > with the existing arm/arm64 code ?
>
> It was requested (IIRC) because DT might have bogus values and not be
> easily modifiable. So, this is another way to get things right for your
> platform at runtime.
Right, but that also allows you to set different capacities to CPUs
inside the same freq domain, which isn't supported by the EM framework,
at least for now. So I would prefer to assume that your values in DT must
be correct to use EAS, and leave the code as-is for now.
On 19/06/18 11:02, Quentin Perret wrote:
> On Tuesday 19 Jun 2018 at 11:47:14 (+0200), Juri Lelli wrote:
> > On 19/06/18 10:40, Quentin Perret wrote:
> > > Hi Pavan,
> > >
> > > On Tuesday 19 Jun 2018 at 14:48:41 (+0530), Pavan Kondeti wrote:
> >
> > [...]
> >
> > > > There seems to be a sysfs interface exposed by this driver to change cpu_scale.
> > > > Should we worry about it? I don't know what is the usecase for changing the
> > > > cpu_scale from user space.
> > >
> > > This is something I've been wondering as well. TBH, I'm not sure what to
> > > do in this case. And I'm not sure to know what is the use-case either.
> > > Debugging purpose I assume ?
> > >
> > > Juri, did you have a specific use-case for this feature when the
> > > arch_topology driver was first introduced ? Or was it just to align
> > > with the existing arm/arm64 code ?
> >
> > It was requested (IIRC) because DT might have bogus values and not be
> > easily modifiable. So, this is another way to get things right for your
> > platform at runtime.
>
> Right, but that also allows you to set different capacities to CPUs
> inside the same freq domain, which isn't supported by the EM framework,
> at least for now. So I would prefer to assume that your values in DT must
> be correct to use EAS, and leave the code as-is for now.
It's actually built on the (current) assumption that siblings share
capacity [1], so it seems to align with what EM requires.
[1] https://elixir.bootlin.com/linux/latest/source/drivers/base/arch_topology.c#L71
On Tuesday 19 Jun 2018 at 12:19:01 (+0200), Juri Lelli wrote:
> On 19/06/18 11:02, Quentin Perret wrote:
> > On Tuesday 19 Jun 2018 at 11:47:14 (+0200), Juri Lelli wrote:
> > > On 19/06/18 10:40, Quentin Perret wrote:
> > > > Hi Pavan,
> > > >
> > > > On Tuesday 19 Jun 2018 at 14:48:41 (+0530), Pavan Kondeti wrote:
> > >
> > > [...]
> > >
> > > > > There seems to be a sysfs interface exposed by this driver to change cpu_scale.
> > > > > Should we worry about it? I don't know what is the usecase for changing the
> > > > > cpu_scale from user space.
> > > >
> > > > This is something I've been wondering as well. TBH, I'm not sure what to
> > > > do in this case. And I'm not sure to know what is the use-case either.
> > > > Debugging purpose I assume ?
> > > >
> > > > Juri, did you have a specific use-case for this feature when the
> > > > arch_topology driver was first introduced ? Or was it just to align
> > > > with the existing arm/arm64 code ?
> > >
> > > It was requested (IIRC) because DT might have bogus values and not be
> > > easily modifiable. So, this is another way to get things right for your
> > > platform at runtime.
> >
> > Right, but that also allows you to set different capacities to CPUs
> > inside the same freq domain, which isn't supported by the EM framework,
> > at least for now. So I would prefer to assume that your values in DT must
> > be correct to use EAS, and leave the code as-is for now.
>
> It's actually built on the (current) assumption that siblings share
> capacity [1], so it seems to align with what EM requires.
>
> [1] https://elixir.bootlin.com/linux/latest/source/drivers/base/arch_topology.c#L71
But there is no hard guarantee that the core_sibling mask and the
frequency domains are aligned :-(
Hikey 620 is an example where they might be misaligned (I think)
On 06/19/2018 09:01 AM, Pavan Kondeti wrote:
> On Mon, May 21, 2018 at 03:25:01PM +0100, Quentin Perret wrote:
[...]
>> @@ -8152,6 +8176,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> if (nr_running > 1)
>> *overload = true;
>>
>> + if (cpu_overutilized(i))
>> + *overutilized = 1;
>> +
>
> There is no need to check if every CPU is overutilized or not once
> *overutilized is marked as true, right?
True, so you want to check *overutilized before calling
cpu_overutilized() to save a little bit of compute?
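A toy illustration of the short-circuit being suggested; the utilization/capacity inputs and the 1280 margin are made up for the example:

```c
#include <assert.h>

/* Sketch: stop evaluating cpu_overutilized() once the flag is already
 * set. The margin value mirrors the capacity_margin convention but is
 * an assumption here. */
#define SCHED_CAPACITY_SCALE 1024UL
#define CAPACITY_MARGIN      1280UL

static int checks_done; /* counts how many CPUs were actually evaluated */

static int cpu_overutilized(unsigned long util, unsigned long capacity)
{
	checks_done++;
	return capacity * SCHED_CAPACITY_SCALE < util * CAPACITY_MARGIN;
}

static int group_overutilized(const unsigned long *utils,
			      const unsigned long *caps, int nr_cpus)
{
	int overutilized = 0;

	for (int i = 0; i < nr_cpus; i++) {
		/* Short-circuit: skip the check once the flag is set. */
		if (!overutilized && cpu_overutilized(utils[i], caps[i]))
			overutilized = 1;
	}
	return overutilized;
}
```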
[...]
>> @@ -8586,6 +8621,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>> * this level.
>> */
>> update_sd_lb_stats(env, &sds);
>> +
>> + if (sched_energy_enabled() && !READ_ONCE(env->dst_rq->rd->overutilized))
>> + goto out_balanced;
>> +
>
> Is there any reason for sending no-hz idle kicks but bailing out here when
> system is not overutilized?
Even if a system is not overutilized, we want to update stale CPU
blocked load and utilization, so NOHZ_STATS_KICK has to get through.
So calling find_busiest_group() -> update_sd_lb_stats() ->
update_sg_lb_stats() to possibly execute update_nohz_stats() is IMHO the
right thing to do.
On 19/06/18 11:25, Quentin Perret wrote:
> On Tuesday 19 Jun 2018 at 12:19:01 (+0200), Juri Lelli wrote:
> > On 19/06/18 11:02, Quentin Perret wrote:
> > > On Tuesday 19 Jun 2018 at 11:47:14 (+0200), Juri Lelli wrote:
> > > > On 19/06/18 10:40, Quentin Perret wrote:
> > > > > Hi Pavan,
> > > > >
> > > > > On Tuesday 19 Jun 2018 at 14:48:41 (+0530), Pavan Kondeti wrote:
> > > >
> > > > [...]
> > > >
> > > > > > There seems to be a sysfs interface exposed by this driver to change cpu_scale.
> > > > > > Should we worry about it? I don't know what is the usecase for changing the
> > > > > > cpu_scale from user space.
> > > > >
> > > > > This is something I've been wondering as well. TBH, I'm not sure what to
> > > > > do in this case. And I'm not sure to know what is the use-case either.
> > > > > Debugging purpose I assume ?
> > > > >
> > > > > Juri, did you have a specific use-case for this feature when the
> > > > > arch_topology driver was first introduced ? Or was it just to align
> > > > > with the existing arm/arm64 code ?
> > > >
> > > > It was requested (IIRC) because DT might have bogus values and not be
> > > > easily modifiable. So, this is another way to get things right for your
> > > > platform at runtime.
> > >
> > > Right, but that also allows you to set different capacities to CPUs
> > > inside the same freq domain, which isn't supported by the EM framework,
> > > at least for now. So I would prefer to assume that your values in DT must
> > > be correct to use EAS, and leave the code as-is for now.
> >
> > It's actually built on the (current) assumption that siblings share
> > capacity [1], so it seems to align with what EM requires.
> >
> > [1] https://elixir.bootlin.com/linux/latest/source/drivers/base/arch_topology.c#L71
>
> But there is no hard guarantee that the core_sibling mask and the
> frequency domains are aligned :-(
>
> Hikey 620 is an example where they might be misaligned (I think)
Yep. In this case you'd need to write cpu_capacity twice (for each
cluster). I think.
On Tuesday 19 Jun 2018 at 12:31:08 (+0200), Juri Lelli wrote:
> On 19/06/18 11:25, Quentin Perret wrote:
> > On Tuesday 19 Jun 2018 at 12:19:01 (+0200), Juri Lelli wrote:
> > > On 19/06/18 11:02, Quentin Perret wrote:
> > > > On Tuesday 19 Jun 2018 at 11:47:14 (+0200), Juri Lelli wrote:
> > > > > On 19/06/18 10:40, Quentin Perret wrote:
> > > > > > Hi Pavan,
> > > > > >
> > > > > > On Tuesday 19 Jun 2018 at 14:48:41 (+0530), Pavan Kondeti wrote:
> > > > >
> > > > > [...]
> > > > >
> > > > > > > There seems to be a sysfs interface exposed by this driver to change cpu_scale.
> > > > > > > Should we worry about it? I don't know what is the usecase for changing the
> > > > > > > cpu_scale from user space.
> > > > > >
> > > > > > This is something I've been wondering as well. TBH, I'm not sure what to
> > > > > > do in this case. And I'm not sure to know what is the use-case either.
> > > > > > Debugging purpose I assume ?
> > > > > >
> > > > > > Juri, did you have a specific use-case for this feature when the
> > > > > > arch_topology driver was first introduced ? Or was it just to align
> > > > > > with the existing arm/arm64 code ?
> > > > >
> > > > > It was requested (IIRC) because DT might have bogus values and not be
> > > > > easily modifiable. So, this is another way to get things right for your
> > > > > platform at runtime.
> > > >
> > > > Right, but that also allows you to set different capacities to CPUs
> > > > inside the same freq domain, which isn't supported by the EM framework,
> > > > at least for now. So I would prefer to assume that your values in DT must
> > > > be correct to use EAS, and leave the code as-is for now.
> > >
> > > It's actually built on the (current) assumption that siblings share
> > > capacity [1], so it seems to align with what EM requires.
> > >
> > > [1] https://elixir.bootlin.com/linux/latest/source/drivers/base/arch_topology.c#L71
> >
> > But there is no hard guarantee that the core_sibling mask and the
> > frequency domains are aligned :-(
> >
> > Hikey 620 is an example where they might be misaligned (I think)
>
> Yep. In this case you'd need to write cpu_capacity twice (for each
> cluster). I think.
Ok, I'll keep the code as-is for now, and we can discuss further on v4.
I guess we need to agree on the need for em_rescale_cpu_capacity() before
going too deep in the Arm-specific changes to call it :-)
Thanks !
Quentin
On Mon, May 21, 2018 at 03:24:58PM +0100, Quentin Perret wrote:
> +struct em_freq_domain {
> + struct em_cs_table *cs_table;
> + cpumask_t cpus;
> +};
https://lkml.kernel.org/r/[email protected]
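Assuming the linked message is the usual objection to embedding a full cpumask_t in a dynamically allocated struct, the alternative would be a trailing mask sized at allocation time. A plain-C sketch of that idiom; the struct and helper names are invented here, not the eventual patch:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch: replace an embedded cpumask_t with a flexible array member
 * sized for the actual number of CPUs when the domain is allocated. */
#define BITS_PER_LONG (8 * sizeof(unsigned long))

struct em_freq_domain_sketch {
	void *cs_table;		/* stand-in for struct em_cs_table * */
	unsigned long cpus[];	/* trailing mask, sized at alloc time */
};

static struct em_freq_domain_sketch *fd_alloc(int nr_cpus)
{
	size_t words = (nr_cpus + BITS_PER_LONG - 1) / BITS_PER_LONG;

	return calloc(1, sizeof(struct em_freq_domain_sketch) +
			 words * sizeof(unsigned long));
}

static void fd_set_cpu(struct em_freq_domain_sketch *fd, int cpu)
{
	fd->cpus[cpu / BITS_PER_LONG] |= 1UL << (cpu % BITS_PER_LONG);
}

static int fd_test_cpu(struct em_freq_domain_sketch *fd, int cpu)
{
	return !!(fd->cpus[cpu / BITS_PER_LONG] &
		  (1UL << (cpu % BITS_PER_LONG)));
}
```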
On Mon, May 21, 2018 at 03:24:58PM +0100, Quentin Perret wrote:
> + read_lock_irqsave(&em_data_lock, flags);
> + for_each_cpu(cpu, cpu_possible_mask) {
I know we're likely to only use this on small systems, but this pattern
is a very bad one. Please look at alternatives.
On Mon, May 21, 2018 at 03:24:58PM +0100, Quentin Perret wrote:
> +struct em_freq_domain *em_cpu_get(int cpu)
> +{
> + struct em_freq_domain *fd;
> + unsigned long flags;
> +
> + read_lock_irqsave(&em_data_lock, flags);
> + fd = per_cpu(em_data, cpu);
> + read_unlock_irqrestore(&em_data_lock, flags);
Why can't this use RCU? This is the exact thing read_locks are terrible
at and RCU excels at.
> +
> + return fd;
> +}
> +EXPORT_SYMBOL_GPL(em_cpu_get);
On Mon, May 21, 2018 at 03:24:59PM +0100, Quentin Perret wrote:
> This exposes the Energy Model (read-only) of all frequency domains in
> sysfs for convenience. To do so, a parent kobject is added to the CPU
> subsystem under the umbrella of which a kobject for each frequency
> domain is attached.
>
> The resulting hierarchy is as follows for a platform with two frequency
> domains for example:
>
> /sys/devices/system/cpu/energy_model
> ├── fd0
> │ ├── capacity
> │ ├── cpus
> │ ├── frequency
> │ └── power
> └── fd4
> ├── capacity
> ├── cpus
> ├── frequency
> └── power
>
Given that each FD can have multiple {freq,power} tuples and sysfs has a
one value per file policy, how does this work?
On Mon, May 21, 2018 at 03:25:00PM +0100, Quentin Perret wrote:
> In order to use EAS, the task scheduler has to know about the Energy
> Model (EM) of the platform. This commit extends the scheduler topology
> code to take references on the frequency domains objects of the EM
> framework for all online CPUs. Hence, the availability of the EM for
> those CPUs is guaranteed to the scheduler at runtime without further
> checks in latency sensitive code paths (i.e. task wake-up).
I'm confused by this patch,... what does it do? Why is em_cpu_get()
(after you fix it) not sufficient?
On Tuesday 19 Jun 2018 at 13:07:34 (+0200), Peter Zijlstra wrote:
> On Mon, May 21, 2018 at 03:24:58PM +0100, Quentin Perret wrote:
> > +struct em_freq_domain {
> > + struct em_cs_table *cs_table;
> > + cpumask_t cpus;
> > +};
>
> https://lkml.kernel.org/r/[email protected]
Ok, I'll change that
On Tuesday 19 Jun 2018 at 13:31:06 (+0200), Peter Zijlstra wrote:
> On Mon, May 21, 2018 at 03:24:58PM +0100, Quentin Perret wrote:
> > + read_lock_irqsave(&em_data_lock, flags);
> > + for_each_cpu(cpu, cpu_possible_mask) {
>
> I know we're likely to only use this on small systems, but this pattern
> is a very bad one. Please look at alternatives.
Ok, this isn't supposed to be called very often (only once, at boot
time, for Arm for example), but I see your point.
On Tuesday 19 Jun 2018 at 13:34:08 (+0200), Peter Zijlstra wrote:
> On Mon, May 21, 2018 at 03:24:58PM +0100, Quentin Perret wrote:
> > +struct em_freq_domain *em_cpu_get(int cpu)
> > +{
> > + struct em_freq_domain *fd;
> > + unsigned long flags;
> > +
> > + read_lock_irqsave(&em_data_lock, flags);
> > + fd = per_cpu(em_data, cpu);
> > + read_unlock_irqrestore(&em_data_lock, flags);
>
> Why can't this use RCU? This is the exact thing read_locks are terrible
> at and RCU excels at.
So the idea was that clients (the scheduler, for instance) can get a reference
to a frequency domain object once, and they're guaranteed it always
exists without asking for it again.
For example, my proposal was to have the scheduler (patch 05) build its
own private list of frequency domains on which it can iterate efficiently
in the wake-up path. If we protect this per_cpu variable with RCU, then
this isn't possible anymore. The scheduler would have to re-ask
em_cpu_get() at every wake-up, and that makes iterating over frequency
domains a whole lot more complex.
Does that make any sense?
Thanks,
Quentin
On Tuesday 19 Jun 2018 at 14:16:43 (+0200), Peter Zijlstra wrote:
> On Mon, May 21, 2018 at 03:24:59PM +0100, Quentin Perret wrote:
> > This exposes the Energy Model (read-only) of all frequency domains in
> > sysfs for convenience. To do so, a parent kobject is added to the CPU
> > subsystem under the umbrella of which a kobject for each frequency
> > domain is attached.
> >
> > The resulting hierarchy is as follows for a platform with two frequency
> > domains for example:
> >
> > /sys/devices/system/cpu/energy_model
> > ├── fd0
> > │ ├── capacity
> > │ ├── cpus
> > │ ├── frequency
> > │ └── power
> > └── fd4
> > ├── capacity
> > ├── cpus
> > ├── frequency
> > └── power
> >
>
> Given that each FD can have multiple {freq,power} tuples and sysfs has a
> one value per file policy, how does this work?
This is meant to look a little bit like the sysfs entries of CPUFreq
policies, so you get something like this:
$ cat /sys/devices/system/cpu/energy_model/fd0/capacity
133 250 351 428 462
$ cat /sys/devices/system/cpu/energy_model/fd0/frequency
533000 999000 1402000 1709000 1844000
$ cat /sys/devices/system/cpu/energy_model/fd0/power
28 70 124 187 245
$ cat /sys/devices/system/cpu/energy_model/fd0/cpus
0-3
For example, CPUFreq exposes available governors and frequencies as:
$ cat /sys/devices/system/cpu/cpufreq/policy0/scaling_available_governors
conservative ondemand userspace powersave performance schedutil
$ cat /sys/devices/system/cpu/cpufreq/policy0/scaling_available_frequencies
533000 999000 1402000 1709000 1844000
On Tue, Jun 19, 2018 at 01:58:58PM +0100, Quentin Perret wrote:
> On Tuesday 19 Jun 2018 at 13:34:08 (+0200), Peter Zijlstra wrote:
> > On Mon, May 21, 2018 at 03:24:58PM +0100, Quentin Perret wrote:
> > > +struct em_freq_domain *em_cpu_get(int cpu)
> > > +{
> > > + struct em_freq_domain *fd;
> > > + unsigned long flags;
> > > +
> > > + read_lock_irqsave(&em_data_lock, flags);
> > > + fd = per_cpu(em_data, cpu);
> > > + read_unlock_irqrestore(&em_data_lock, flags);
> >
> > Why can't this use RCU? This is the exact thing read_locks are terrible
> > at and RCU excels at.
>
> So the idea was that clients (the scheduler for ex) can get a reference
> to a frequency domain object once, and they're guaranteed it always
> exists without asking for it again.
>
> For example, my proposal was to have the scheduler (patch 05) build its
> own private list of frequency domains on which it can iterate efficiently
> in the wake-up path. If we protect this per_cpu variable with RCU, then
> this isn't possible any-more. The scheduler will have to re-ask
> em_cpu_get() at every wake-up, and that makes iterating over frequency
> domains a whole lot more complex.
>
> Does that make any sense ?
None whatsoever... The lock doesn't guarantee stability any more than
RCU does.
If you hand out the pointer and then drop the read-lock, the write-lock
can proceed and change the pointer right after you.
The very easiest solution is to never change the data, as I think was
suggested elsewhere in the thread. Construct the thing once and then
never mutate.
On Tuesday 19 Jun 2018 at 14:26:32 (+0200), Peter Zijlstra wrote:
> On Mon, May 21, 2018 at 03:25:00PM +0100, Quentin Perret wrote:
> > In order to use EAS, the task scheduler has to know about the Energy
> > Model (EM) of the platform. This commit extends the scheduler topology
> > code to take references on the frequency domains objects of the EM
> > framework for all online CPUs. Hence, the availability of the EM for
> > those CPUs is guaranteed to the scheduler at runtime without further
> > checks in latency sensitive code paths (i.e. task wake-up).
>
> I'm confused by this patch,... what does it do? Why is em_cpu_get()
> (after you fix it) not sufficient?
Hmm, so maybe the confusing part is that this patch does two things:
1. it checks that all conditions for starting EAS are met
(SD_ASYM_CPUCAPACITY is set, the EM covers all online CPUs, the EM
isn't too complex to be used during wakeup);
2. it builds a list of frequency domains for the private use of the
scheduler in latency sensitive code paths, and sets the static key
So I guess your question is more about 2. It is nice to have a list of
frequency domains because that makes iteration over frequency domains
in the wake-up path very easy, and efficient (for_each_freq_domain() is
used in find_energy_efficient_cpu() and compute_energy(), patches 07 and
09/10).
And also, by making the scheduler maintain that list, we can be more
hotplug-aware. If you hotplug out all CPUs of a freq domain, the scheduler
can remove it from its list and have one less element to iterate over.
The idea was to remove the unused things on hotplug, just like for
sched domains.
I think that not having that list would mean playing with cpumasks in
find_energy_efficient_cpu() and in compute_energy() to keep track of the
CPUs we have visited, and things like that. That's doable but probably more
complex, and not more efficient, I think.
Is the overall idea any clearer ?
Thanks,
Quentin
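To make the data-structure discussion above concrete, the list idea can be sketched in a few lines of plain C; all names are invented for illustration (the real code deals with cpumask-based domains, not a single representative CPU):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the "private list of frequency domains": topology code links
 * the domains once, and the wake-up path iterates with a trivial macro
 * instead of recomputing cpumask intersections. */
struct freq_domain {
	struct freq_domain *next;
	int first_cpu;		/* stand-in for the domain's cpumask */
};

#define for_each_freq_domain(fd, head) \
	for ((fd) = (head); (fd); (fd) = (fd)->next)

static int count_freq_domains(struct freq_domain *head)
{
	struct freq_domain *fd;
	int n = 0;

	for_each_freq_domain(fd, head)
		n++;
	return n;
}
```

Hotplugging out a whole domain would then just unlink one node, leaving one less element to visit at wake-up.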
On Tuesday 19 Jun 2018 at 15:23:38 (+0200), Peter Zijlstra wrote:
> On Tue, Jun 19, 2018 at 01:58:58PM +0100, Quentin Perret wrote:
> > On Tuesday 19 Jun 2018 at 13:34:08 (+0200), Peter Zijlstra wrote:
> > > On Mon, May 21, 2018 at 03:24:58PM +0100, Quentin Perret wrote:
> > > > +struct em_freq_domain *em_cpu_get(int cpu)
> > > > +{
> > > > + struct em_freq_domain *fd;
> > > > + unsigned long flags;
> > > > +
> > > > + read_lock_irqsave(&em_data_lock, flags);
> > > > + fd = per_cpu(em_data, cpu);
> > > > + read_unlock_irqrestore(&em_data_lock, flags);
> > >
> > > Why can't this use RCU? This is the exact thing read_locks are terrible
> > > at and RCU excels at.
> >
> > So the idea was that clients (the scheduler for ex) can get a reference
> > to a frequency domain object once, and they're guaranteed it always
> > exists without asking for it again.
> >
> > For example, my proposal was to have the scheduler (patch 05) build its
> > own private list of frequency domains on which it can iterate efficiently
> > in the wake-up path. If we protect this per_cpu variable with RCU, then
> > this isn't possible any-more. The scheduler will have to re-ask
> > em_cpu_get() at every wake-up, and that makes iterating over frequency
> > domains a whole lot more complex.
> >
> > Does that make any sense ?
>
> None whatsoever... The lock doesn't guarantee stability any more than
> RCU does.
>
> If you hand out the pointer and then drop the read-lock, the write-lock
> can proceed and change the pointer right after you.
>
> The very easiest solution is to never change the data, as I think was
> suggested elsewhere in the thread. Construct the thing once and then
> never mutate.
This is actually what is done. We never write twice to the per_cpu
array itself. One of the fields (the table) in the structure pointed to from
the per_cpu array can change, but not the pointer to the structure
itself.
The only reason this lock is here is to ensure the atomicity of the
write happening in em_register_freq_domain(). But that write can happen
only once, the first time the frequency domain is registered.
But maybe I could use something simpler than a lock in this case?
Would WRITE_ONCE/READ_ONCE be enough to ensure that atomicity, for
example?
Thanks,
Quentin
On Tue, Jun 19, 2018 at 02:38:45PM +0100, Quentin Perret wrote:
> But maybe I could use something simpler than a lock in this case ?
> Would WRITE_ONCE/READ_ONCE be enough to ensure that atomicity for
> example ?
Yes, since it's a single pointer, smp_store_release() + READ_ONCE()
should be sufficient (these are the foundations of RCU).
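A userspace analogue of that pattern, using C11 atomics in place of the kernel's smp_store_release()/READ_ONCE(); this only illustrates the ordering, it is not the kernel macros themselves:

```c
#include <assert.h>
#include <stdatomic.h>

/* Publish-once pattern: the writer fully initialises the object, then
 * publishes the pointer with a release store; readers pick it up without
 * any lock. Names are invented for the sketch. */
struct em_data_sketch {
	int nr_cap_states;
};

static _Atomic(struct em_data_sketch *) published;

static void em_publish(struct em_data_sketch *fd)
{
	/* All prior writes to *fd are ordered before the pointer becomes
	 * visible, mirroring smp_store_release(). */
	atomic_store_explicit(&published, fd, memory_order_release);
}

static struct em_data_sketch *em_get(void)
{
	/* Mirrors READ_ONCE() + the address dependency readers rely on. */
	return atomic_load_explicit(&published, memory_order_consume);
}
```

Since the pointer is written exactly once at registration time, readers either see NULL (not registered yet) or a fully constructed, never-mutated object.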
On Tue, Jun 19, 2018 at 04:16:42PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 19, 2018 at 02:38:45PM +0100, Quentin Perret wrote:
> > But maybe I could use something simpler than a lock in this case ?
> > Would WRITE_ONCE/READ_ONCE be enough to ensure that atomicity for
> > example ?
>
> Yes, since it's a single pointer, smp_store_release() + READ_ONCE()
> should be sufficient (these are the foundations of RCU).
Note that per_cpu()/this_cpu_read() and friends should imply READ_ONCE().
On Tuesday 19 Jun 2018 at 16:16:42 (+0200), Peter Zijlstra wrote:
> On Tue, Jun 19, 2018 at 02:38:45PM +0100, Quentin Perret wrote:
> > But maybe I could use something simpler than a lock in this case ?
> > Would WRITE_ONCE/READ_ONCE be enough to ensure that atomicity for
> > example ?
>
> Yes, since it's a single pointer, smp_store_release() + READ_ONCE()
> should be sufficient (these are the foundations of RCU).
OK, good, I'll get rid of the spinlock in v4 (and read more about RCU
foundations then :-))
Thanks,
Quentin
On Tue, Jun 19, 2018 at 04:21:16PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 19, 2018 at 04:16:42PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 19, 2018 at 02:38:45PM +0100, Quentin Perret wrote:
> > > But maybe I could use something simpler than a lock in this case ?
> > > Would WRITE_ONCE/READ_ONCE be enough to ensure that atomicity for
> > > example ?
> >
> > Yes, since it's a single pointer, smp_store_release() + READ_ONCE()
> > should be sufficient (these are the foundations of RCU).
>
> Note that per_cpu()/this_cpu_read() and friends should imply READ_ONCE().
Mark reminded me that per_cpu() does not in fact imply that, but
READ_ONCE(per_cpu()) should work.
On Tue, Jun 19, 2018 at 02:24:49PM +0100, Quentin Perret wrote:
> On Tuesday 19 Jun 2018 at 14:26:32 (+0200), Peter Zijlstra wrote:
> > I'm confused by this patch,... what does it do? Why is em_cpu_get()
> > (after you fix it) not sufficient?
>
> Hmm, so maybe the confusing part is that this patch does two things:
> 1. it checks all conditions for starting EAS are met
> (SD_ASYM_CPUCAPACITY is set, the EM covers all online CPUs, the EM
> isn't too complex to be used during wakeup);
> 2. it builds a list of frequency domains for the private use of the
> scheduler in latency sensitive code paths,
3. and sets the static key
> So I guess your question is more about 2. It is nice to have a list of
> frequency domains because that makes iteration over frequency domains
> in the wake-up path very easy, and efficient (for_each_freq_domain() is
> used in find_energy_efficient_cpu() and compute_energy(), patches 07 and
> 09/10).
> And also, by making the scheduler maintain that list, we can be more
> hotplug-aware. If you hotplug out all CPUs of a freq domain, the scheduler
> can remove it from its list and have one less element to iterate against.
> The idea was to remove the unused things on hotplug, just like for
> sched domains.
>
> I think that not having that list would mean to play with cpumasks in
> find_energy_efficient_cpu() and in compute_energy() to keep track of the
> CPUs we have visited and stuff like that. That's doable but probably more
> complex, and not more efficient, I think.
>
> Is the overall idea any clearer ?
Right, so I would not do that many things at once. Also, be more explicit
about what data structure you add, and why.
That said, I think the whole for_each_freq_domain() thing as done is
broken. You've completely ignored the arguments to
partition_sched_domains(). What happens if you create partitions right
along the frequency domains?
So you really want an argument to for_each_freq_domain() to indicate
whose frequency domains you want to iterate. And then I think it's
easiest if you hook into build_sched_domains() instead, because you
really want a list per root_domain I suspect (and an empty list if there
is but one entry on).
On Tuesday 19 Jun 2018 at 18:20:42 (+0200), Peter Zijlstra wrote:
> Right, so I would not do that many things at once. Also be more explicit
> about what data structure, and why.
OK, I can probably split that patch into two smaller patches. One that
introduces and enables the static_key (or something else to replace it,
see my reply below); and another one to create the list of frequency
domains.
> That said, I think the whole for_each_freq_domain() thing as done is
> broken. You've completely ignored the arguments to
> partition_sched_domains(). What happens if you create partitions right
> along the frequency domains?
>
> So you really want an argument to for_each_freq_domain() to indicate
> whose frequency domains you want to iterate. And then I think it's
> easiest if you hook into build_sched_domains() instead, because you
> really want a list per root_domain I suspect
Hmm right, the current implementation is broken with multiple root
domains... We rarely use that in mobile but the code should work
regardless I suppose.
I agree having one list per root_domain should be better. I'll change that
in v4. But I also think the idea of a global static_key is broken then.
We need a per root_domain thing as well. The SD_ASYM_CPUCAPACITY flag
might be set on one hierarchy and not the other, for example. Not sure if
attaching a static_key to each root_domain really makes sense though ...
Would replacing the static_key by a flag attached to the root_domain be
reasonable? The CONFIG_ENERGY_MODEL config option could help us make
sure there is no performance impact when EAS can't be used, with
something like this:
static bool sched_energy_enabled(struct root_domain *rd)
{
#ifdef CONFIG_ENERGY_MODEL
	return rd->eas_enabled;
#else
	return false;
#endif
}
Or maybe a sched_feat instead of that ifdefery?
> (and an empty list if there is but one entry on).
Sorry but I didn't understand that ...
Thanks,
Quentin
On Tue, Jun 19, 2018 at 06:13:17PM +0100, Quentin Perret wrote:
> I agree having one list per root_domain should be better. I'll change that
> in v4. But I also think the idea of a global static_key is broken then.
Static keys are global per definition, there is only a single copy of
the code.
You have to enable it if there is a root domain with more than one
freq-domain.
> We need a per root_domain thing as well. The SD_ASYM_CPUCAPACITY flag
> might be set on one hierarchy and not the other, for example. Not sure if
> attaching a static_key to each root_domain really makes sense though ...
>
> Would replacing the static_key by a flag attached to the root_domain be
> reasonable ?
Keep the static key as is, enable if any root domain needs it.
> > (and an empty list if there is but one entry on).
>
> Sorry but I didn't understand that ...
If you have only a single freq-domain, there is nothing to do, right?
On Tuesday 19 Jun 2018 at 20:42:50 (+0200), Peter Zijlstra wrote:
> On Tue, Jun 19, 2018 at 06:13:17PM +0100, Quentin Perret wrote:
> > Would replacing the static_key by a flag attached to the root_domain be
> > reasonable ?
>
> Keep the static key as is, enable if any root domain needs it.
OK. So basically the semantics of the static key become: it is set if
at least one root domain has a non-empty list of freq domains. And a
root_domain has a non-empty list of frequency domains if it meets all
conditions for EAS (SD_ASYM_CPUCAPACITY flag set, low complexity for
the "portion" of the EM covering it, ...).
And then, checking the status of the list (empty or not) for a specific
root domain can replace my flag I guess. I'll need to check that somewhere
around select_task_rq_fair() in addition to the check on the static key to
decide if a task can go through find_energy_efficient_cpu().
> > Sorry but I didn't understand that ...
>
> If you have only a single freq-domain, there is nothing to do, right?
Ah, right. Since we already agreed that all CPUs in a freq domain must
have the same micro-arch, that's correct. Checking the presence of the
SD_ASYM_CPUCAPACITY flag should cover this case I guess.
Thanks,
Quentin