2018-08-20 09:45:57

by Quentin Perret

Subject: [PATCH v6 00/14] Energy Aware Scheduling

This patch series introduces Energy Aware Scheduling (EAS) for CFS tasks
on platforms with asymmetric CPU topologies (e.g. Arm big.LITTLE).

For more details about the ideas behind it and the overall design,
please refer to the cover letter of version 5 [1].


1. Version History
------------------

Changes v5[1]->v6:
- Rebased on Peter’s sched/core branch (that includes Morten's misfit
patches [2] and the automatic detection of SD_ASYM_CPUCAPACITY [3])
- Removed patch 13/14 (not needed with the automatic flag detection)
- Added patch creating a dependency between sugov and EAS
- Renamed frequency domains to performance domains to avoid encoding
  overly deep assumptions about the HW in the code
- Renamed the sd_ea shortcut to sd_asym_cpucapacity
- Added comment to explain why new tasks are not accounted when
detecting the 'overutilized' flag
- Added comment explaining why forkees don’t go in
find_energy_efficient_cpu()

Changes v4[4]->v5:
- Removed the RCU protection of the EM tables and the associated
 need for em_rescale_cpu_capacity().
- Factorized schedutil’s PELT aggregation function with EAS
- Improved comments/doc in the EM framework
- Added check on the uarch of CPUs in one fd in the EM framework
- Reduced CONFIG_ENERGY_MODEL ifdefery in kernel/sched/topology.c
- Cleaned-up update_sg_lb_stats parameters
- Improved comments in compute_energy() to explain the multi-rd
 scenarios

Changes v3[5]->v4:
- Replaced spinlock in EM framework by smp_store_release/READ_ONCE
- Fixed missing locks to protect rcu_assign_pointer in EM framework
- Fixed capacity calculation in EM framework on 32 bits system
- Fixed compilation issue for CONFIG_ENERGY_MODEL=n
- Removed cpumask from struct em_freq_domain, now dynamically allocated
- Power costs of the EM are specified in milliwatts
- Added example of CPUFreq driver modification
- Added doc/comments in the EM framework and better commit header
- Fixed integration issue with util_est in cpu_util_next()
- Changed scheduler topology code to have one freq. dom. list per rd
- Split sched topology patch in smaller patches
- Added doc/comments explaining the heuristic in the wake-up path
- Changed energy threshold for migration from 1.5% to 6%

Changes v2[6]->v3:
- Removed the PM_OPP dependency by implementing a new EM framework
- Modified the scheduler topology code to take references on the EM data
 structures
- Simplified the overutilization mechanism into a system-wide flag
- Reworked the integration in the wake-up path using the sd_ea shortcut
- Rebased on tip/sched/core (247f2f6f3c70 "sched/core: Don't schedule
 threads on pre-empted vCPUs")

Changes v1[7]->v2:
- Reworked interface between fair.c and energy.[ch] (Remove #ifdef
 CONFIG_PM_OPP from energy.c) (Greg KH)
- Fixed licence & header issue in energy.[ch] (Greg KH)
- Reordered EAS path in select_task_rq_fair() (Joel)
- Avoid prev_cpu if not allowed in select_task_rq_fair() (Morten/Joel)
- Refactored compute_energy() (Patrick)
- Account for RT/IRQ pressure in task_fits() (Patrick)
- Use UTIL_EST and DL utilization during OPP estimation (Patrick/Juri)
- Optimize selection of CPU candidates in the energy-aware wake-up path
- Rebased on top of tip/sched/core (commit b720342849fe “sched/core:
 Update Preempt_notifier_key to modern API”)


2. Test results
---------------

Two fundamentally different tests were executed. First, the energy test
case shows the impact of this patch set on energy consumption using a
synthetic set of tasks. Second, the performance test case provides the
conventional hackbench metric numbers.

The tests run on two arm64 big.LITTLE platforms: Hikey960 (4xA73 +
4xA53) and Juno r0 (2xA57 + 4xA53).

Base kernel is tip/sched/core (4.18-rc5), with some Hikey960 and Juno
specific patches, the SD_ASYM_CPUCAPACITY flag set at DIE sched domain
level for arm64 and schedutil as cpufreq governor [8].

2.1 Energy test case

10 iterations of between 10 and 50 periodic rt-app tasks (16ms period,
5% duty-cycle) for 30 seconds with energy measurement. Unit is Joules.
The goal is to save energy, so lower is better.

2.1.1 Hikey960

Energy is measured with an ACME Cape on an instrumented board. Numbers
include consumption of big and little CPUs, LPDDR memory, GPU and most
of the other small components on the board. They do not include
consumption of the radio chip (turned-off anyway) and external
connectors.

+----------+-----------------+-------------------------+
|          | Without patches | With patches            |
+----------+--------+--------+------------------+------+
| Tasks nb |  Mean  |  RSD*  | Mean             | RSD* |
+----------+--------+--------+------------------+------+
|       10 |  34.33 |   4.8% | 30.51 (-11.13%)  | 6.4% |
|       20 |  52.84 |   1.9% | 44.15 (-16.45%)  | 2.0% |
|       30 |  66.20 |   1.8% | 60.14 (-9.15%)   | 4.8% |
|       40 |  90.83 |   2.5% | 86.91 (-4.32%)   | 2.7% |
|       50 | 136.76 |   4.6% | 108.90 (-20.37%) | 4.7% |
+----------+--------+--------+------------------+------+

2.1.2 Juno r0

Energy is measured with the onboard energy meter. Numbers include
consumption of big and little CPUs.

+----------+-----------------+------------------------+
|          | Without patches | With patches           |
+----------+--------+--------+-----------------+------+
| Tasks nb |  Mean  |  RSD*  | Mean            | RSD* |
+----------+--------+--------+-----------------+------+
|       10 |  11.48 |   3.2% |  8.09 (-29.53%) | 3.1% |
|       20 |  20.84 |   3.4% | 14.38 (-31.00%) | 1.1% |
|       30 |  32.94 |   3.2% | 23.97 (-27.23%) | 1.0% |
|       40 |  46.05 |   0.5% | 37.82 (-17.87%) | 6.2% |
|       50 |  57.25 |   0.5% | 55.30 ( -3.41%) | 0.5% |
+----------+--------+--------+-----------------+------+


2.2 Performance test case

30 iterations of perf bench sched messaging --pipe --thread --group G
--loop L with G=[1 2 4 8] and L=50000 (Hikey960)/16000 (Juno r0).

2.2.1 Hikey960

The impact of thermal capping was mitigated thanks to a heatsink, a
fan, and a 30 sec delay between two successive executions. IPA was
disabled to reduce the stddev.

+----------------+-----------------+------------------------+
|                | Without patches | With patches           |
+--------+-------+---------+-------+----------------+-------+
| Groups | Tasks | Mean    | RSD*  | Mean           | RSD*  |
+--------+-------+---------+-------+----------------+-------+
|      1 |    40 |    8.04 | 0.88% |  8.22 (+2.31%) | 1.76% |
|      2 |    80 |   14.78 | 0.67% | 14.83 (+0.35%) | 0.59% |
|      4 |   160 |   30.92 | 0.57% | 30.95 (+0.09%) | 0.51% |
|      8 |   320 |   65.54 | 0.32% | 65.57 (+0.04%) | 0.46% |
+--------+-------+---------+-------+----------------+-------+

2.2.2 Juno r0

+----------------+-----------------+-----------------------+
|                | Without patches | With patches          |
+--------+-------+---------+-------+---------------+-------+
| Groups | Tasks | Mean    | RSD*  | Mean          | RSD*  |
+--------+-------+---------+-------+---------------+-------+
|      1 |    40 |    7.74 | 0.13% |  7.82 (0.01%) | 0.12% |
|      2 |    80 |   14.27 | 0.15% | 14.27 (0.00%) | 0.14% |
|      4 |   160 |   27.07 | 0.35% | 26.96 (0.00%) | 0.18% |
|      8 |   320 |   55.14 | 1.81% | 55.21 (0.00%) | 1.29% |
+--------+-------+---------+-------+---------------+-------+

*RSD: Relative Standard Deviation (std dev / mean)


[1] https://marc.info/?l=linux-pm&m=153243513908731&w=2
[2] https://marc.info/?l=linux-kernel&m=153069968022982&w=2
[3] https://marc.info/?l=linux-kernel&m=153209362826476&w=2
[4] https://marc.info/?l=linux-kernel&m=153018606728533&w=2
[5] https://marc.info/?l=linux-kernel&m=152691273111941&w=2
[6] https://marc.info/?l=linux-kernel&m=152302902427143&w=2
[7] https://marc.info/?l=linux-kernel&m=152153905805048&w=2
[8] http://www.linux-arm.org/git?p=linux-qp.git;a=shortlog;h=refs/heads/upstream/eas_v6

Morten Rasmussen (1):
sched: Add over-utilization/tipping point indicator

Quentin Perret (13):
sched: Relocate arch_scale_cpu_capacity
sched/cpufreq: Factor out utilization to frequency mapping
PM: Introduce an Energy Model management framework
PM / EM: Expose the Energy Model in sysfs
sched/topology: Reference the Energy Model of CPUs when available
sched/topology: Lowest CPU asymmetry sched_domain level pointer
sched/topology: Introduce sched_energy_present static key
sched/fair: Clean-up update_sg_lb_stats parameters
sched/cpufreq: Refactor the utilization aggregation method
sched/fair: Introduce an energy estimation helper function
sched/fair: Select an energy-efficient CPU on task wake-up
sched/topology: Make Energy Aware Scheduling depend on schedutil
OPTIONAL: cpufreq: dt: Register an Energy Model

drivers/cpufreq/cpufreq-dt.c | 45 ++++-
drivers/cpufreq/cpufreq.c | 4 +
include/linux/cpufreq.h | 1 +
include/linux/energy_model.h | 162 +++++++++++++++++
include/linux/sched/cpufreq.h | 6 +
include/linux/sched/topology.h | 19 ++
kernel/power/Kconfig | 15 ++
kernel/power/Makefile | 2 +
kernel/power/energy_model.c | 289 +++++++++++++++++++++++++++++
kernel/sched/cpufreq_schedutil.c | 136 ++++++++++----
kernel/sched/fair.c | 301 ++++++++++++++++++++++++++++---
kernel/sched/sched.h | 65 ++++---
kernel/sched/topology.c | 231 +++++++++++++++++++++++-
13 files changed, 1195 insertions(+), 81 deletions(-)
create mode 100644 include/linux/energy_model.h
create mode 100644 kernel/power/energy_model.c

--
2.17.1



2018-08-20 09:46:00

by Quentin Perret

Subject: [PATCH v6 01/14] sched: Relocate arch_scale_cpu_capacity

By default, arch_scale_cpu_capacity() is only visible from within the
kernel/sched folder. Relocate it to include/linux/sched/topology.h to
make it visible to other clients needing to know about the capacity of
CPUs, such as the Energy Model framework.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
include/linux/sched/topology.h | 19 +++++++++++++++++++
kernel/sched/sched.h | 21 ---------------------
2 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 4fe2e49ab13b..1aaa14462207 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -202,6 +202,17 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
# define SD_INIT_NAME(type)
#endif

+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+{
+ if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+ return sd->smt_gain / sd->span_weight;
+
+ return SCHED_CAPACITY_SCALE;
+}
+#endif
+
#else /* CONFIG_SMP */

struct sched_domain_attr;
@@ -217,6 +228,14 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
return true;
}

+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
+{
+ return SCHED_CAPACITY_SCALE;
+}
+#endif
+
#endif /* !CONFIG_SMP */

static inline int task_node(const struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 85b3a2bf6c2b..481e6759441b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1761,27 +1761,6 @@ unsigned long arch_scale_freq_capacity(int cpu)
}
#endif

-#ifdef CONFIG_SMP
-#ifndef arch_scale_cpu_capacity
-static __always_inline
-unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
-{
- if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
- return sd->smt_gain / sd->span_weight;
-
- return SCHED_CAPACITY_SCALE;
-}
-#endif
-#else
-#ifndef arch_scale_cpu_capacity
-static __always_inline
-unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
-{
- return SCHED_CAPACITY_SCALE;
-}
-#endif
-#endif
-
struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
__acquires(rq->lock);

--
2.17.1


2018-08-20 09:46:08

by Quentin Perret

Subject: [PATCH v6 02/14] sched/cpufreq: Factor out utilization to frequency mapping

The schedutil governor maps utilization values to frequencies by applying
a 25% margin. Since this sort of mapping mechanism can be needed by other
users (e.g. EAS), factor the utilization-to-frequency mapping code out
of schedutil and move it to include/linux/sched/cpufreq.h to avoid code
duplication. The new map_util_freq() function is inlined to avoid
overheads.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
include/linux/sched/cpufreq.h | 6 ++++++
kernel/sched/cpufreq_schedutil.c | 3 ++-
2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index 59667444669f..afa940cd50dc 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -20,6 +20,12 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
void (*func)(struct update_util_data *data, u64 time,
unsigned int flags));
void cpufreq_remove_update_util_hook(int cpu);
+
+static inline unsigned long map_util_freq(unsigned long util,
+ unsigned long freq, unsigned long cap)
+{
+ return (freq + (freq >> 2)) * util / cap;
+}
#endif /* CONFIG_CPU_FREQ */

#endif /* _LINUX_SCHED_CPUFREQ_H */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 3fffad3bc8a8..f5206c950143 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -13,6 +13,7 @@

#include "sched.h"

+#include <linux/sched/cpufreq.h>
#include <trace/events/power.h>

struct sugov_tunables {
@@ -167,7 +168,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
unsigned int freq = arch_scale_freq_invariant() ?
policy->cpuinfo.max_freq : policy->cur;

- freq = (freq + (freq >> 2)) * util / max;
+ freq = map_util_freq(util, freq, max);

if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
return sg_policy->next_freq;
--
2.17.1


2018-08-20 09:46:22

by Quentin Perret

Subject: [PATCH v6 04/14] PM / EM: Expose the Energy Model in sysfs

Expose the Energy Model (read-only) of all performance domains in sysfs
for convenience. To do so, add a kobject to the CPU subsystem under the
umbrella of which a kobject for each performance domain is attached.

For example, the resulting hierarchy is as follows for a platform with
two performance domains:

/sys/devices/system/cpu/energy_model
├── pd0
│   ├── cost
│   ├── cpus
│   ├── frequency
│   └── power
└── pd4
    ├── cost
    ├── cpus
    ├── frequency
    └── power

In this implementation, the kobject abstraction is only used as a
convenient way of exposing data to sysfs. However, it could also be
used in the future to allocate and release performance domains in a more
dynamic way using reference counting.

Cc: Peter Zijlstra <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
include/linux/energy_model.h | 1 +
kernel/power/energy_model.c | 90 ++++++++++++++++++++++++++++++++++++
2 files changed, 91 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index b89b5596c976..5cfdfa25bf0f 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -19,6 +19,7 @@ struct em_cap_state {
struct em_perf_domain {
struct em_cap_state *table; /* Capacity states, in ascending order. */
int nr_cap_states;
+ struct kobject kobj;
unsigned long cpus[0]; /* CPUs of the frequency domain. */
};

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 2aa6f94343d0..a25d23d73d4e 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -23,6 +23,82 @@ static DEFINE_PER_CPU(struct em_perf_domain *, em_data);
*/
static DEFINE_MUTEX(em_pd_mutex);

+static struct kobject *em_kobject;
+
+/* Getters for the attributes of em_perf_domain objects */
+struct em_pd_attr {
+ struct attribute attr;
+ ssize_t (*show)(struct em_perf_domain *pd, char *buf);
+ ssize_t (*store)(struct em_perf_domain *pd, const char *buf, size_t s);
+};
+
+#define EM_ATTR_LEN 13
+#define show_table_attr(_attr) \
+static ssize_t show_##_attr(struct em_perf_domain *pd, char *buf) \
+{ \
+ ssize_t cnt = 0; \
+ int i; \
+ for (i = 0; i < pd->nr_cap_states; i++) { \
+ if (cnt >= (ssize_t) (PAGE_SIZE / sizeof(char) \
+ - (EM_ATTR_LEN + 2))) \
+ goto out; \
+ cnt += scnprintf(&buf[cnt], EM_ATTR_LEN + 1, "%lu ", \
+ pd->table[i]._attr); \
+ } \
+out: \
+ cnt += sprintf(&buf[cnt], "\n"); \
+ return cnt; \
+}
+
+show_table_attr(power);
+show_table_attr(frequency);
+show_table_attr(cost);
+
+static ssize_t show_cpus(struct em_perf_domain *pd, char *buf)
+{
+ return sprintf(buf, "%*pbl\n", cpumask_pr_args(to_cpumask(pd->cpus)));
+}
+
+#define pd_attr(_name) em_pd_##_name##_attr
+#define define_pd_attr(_name) static struct em_pd_attr pd_attr(_name) = \
+ __ATTR(_name, 0444, show_##_name, NULL)
+
+define_pd_attr(power);
+define_pd_attr(frequency);
+define_pd_attr(cost);
+define_pd_attr(cpus);
+
+static struct attribute *em_pd_default_attrs[] = {
+ &pd_attr(power).attr,
+ &pd_attr(frequency).attr,
+ &pd_attr(cost).attr,
+ &pd_attr(cpus).attr,
+ NULL
+};
+
+#define to_pd(k) container_of(k, struct em_perf_domain, kobj)
+#define to_pd_attr(a) container_of(a, struct em_pd_attr, attr)
+
+static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf)
+{
+ struct em_perf_domain *pd = to_pd(kobj);
+ struct em_pd_attr *pd_attr = to_pd_attr(attr);
+ ssize_t ret;
+
+ ret = pd_attr->show(pd, buf);
+
+ return ret;
+}
+
+static const struct sysfs_ops em_pd_sysfs_ops = {
+ .show = show,
+};
+
+static struct kobj_type ktype_em_pd = {
+ .sysfs_ops = &em_pd_sysfs_ops,
+ .default_attrs = em_pd_default_attrs,
+};
+
static struct em_perf_domain *em_create_pd(cpumask_t *span, int nr_states,
struct em_data_callback *cb)
{
@@ -102,6 +178,11 @@ static struct em_perf_domain *em_create_pd(cpumask_t *span, int nr_states,
pd->nr_cap_states = nr_states;
cpumask_copy(to_cpumask(pd->cpus), span);

+ ret = kobject_init_and_add(&pd->kobj, &ktype_em_pd, em_kobject,
+ "pd%u", cpu);
+ if (ret)
+ pr_err("pd%d: failed kobject_init_and_add(): %d\n", cpu, ret);
+
return pd;

free_cs_table:
@@ -155,6 +236,15 @@ int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
*/
mutex_lock(&em_pd_mutex);

+ if (!em_kobject) {
+ em_kobject = kobject_create_and_add("energy_model",
+ &cpu_subsys.dev_root->kobj);
+ if (!em_kobject) {
+ ret = -ENODEV;
+ goto unlock;
+ }
+ }
+
for_each_cpu(cpu, span) {
/* Make sure we don't register again an existing domain. */
if (READ_ONCE(per_cpu(em_data, cpu))) {
--
2.17.1


2018-08-20 09:46:30

by Quentin Perret

Subject: [PATCH v6 05/14] sched/topology: Reference the Energy Model of CPUs when available

The existing scheduling domain hierarchy is defined to map to the cache
topology of the system. However, Energy Aware Scheduling (EAS) requires
more knowledge about the platform, and specifically needs to know about
the span of Performance Domains (PD), which do not always align with
caches.

To address this issue, use the Energy Model (EM) of the system to extend
the scheduler topology code with a representation of the PDs, alongside
the scheduling domains. More specifically, a linked list of PDs is
attached to each root domain. When multiple root domains are in use,
each list contains only the PDs covering the CPUs of its root domain. If
a PD spans over CPUs of two different root domains, it will be
duplicated in both lists.

The lists are fully maintained by the scheduler from
partition_sched_domains() in order to cope with hotplug and cpuset
changes. As for scheduling domains, the lists are protected by RCU to
ensure safe concurrent updates.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
kernel/sched/sched.h | 21 +++++++
kernel/sched/topology.c | 134 ++++++++++++++++++++++++++++++++++++++--
2 files changed, 151 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 481e6759441b..97d79429e755 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -44,6 +44,7 @@
#include <linux/ctype.h>
#include <linux/debugfs.h>
#include <linux/delayacct.h>
+#include <linux/energy_model.h>
#include <linux/init_task.h>
#include <linux/kprobes.h>
#include <linux/kthread.h>
@@ -700,6 +701,12 @@ static inline bool sched_asym_prefer(int a, int b)
return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
}

+struct perf_domain {
+ struct em_perf_domain *obj;
+ struct perf_domain *next;
+ struct rcu_head rcu;
+};
+
/*
* We add the notion of a root-domain which will be used to define per-domain
* variables. Each exclusive cpuset essentially defines an island domain by
@@ -752,6 +759,12 @@ struct root_domain {
struct cpupri cpupri;

unsigned long max_cpu_capacity;
+
+ /*
+ * NULL-terminated list of performance domains intersecting with the
+ * CPUs of the rd. Protected by RCU.
+ */
+ struct perf_domain *pd;
};

extern struct root_domain def_root_domain;
@@ -2228,3 +2241,11 @@ unsigned long scale_irq_capacity(unsigned long util, unsigned long irq, unsigned
return util;
}
#endif
+
+#ifdef CONFIG_SMP
+#ifdef CONFIG_ENERGY_MODEL
+#define perf_domain_span(pd) (to_cpumask(((pd)->obj->cpus)))
+#else
+#define perf_domain_span(pd) NULL
+#endif
+#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index c4444ed77a55..545e2c7b6e66 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -201,6 +201,116 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
return 1;
}

+#ifdef CONFIG_ENERGY_MODEL
+static void free_pd(struct perf_domain *pd)
+{
+ struct perf_domain *tmp;
+
+ while (pd) {
+ tmp = pd->next;
+ kfree(pd);
+ pd = tmp;
+ }
+}
+
+static struct perf_domain *find_pd(struct perf_domain *pd, int cpu)
+{
+ while (pd) {
+ if (cpumask_test_cpu(cpu, perf_domain_span(pd)))
+ return pd;
+ pd = pd->next;
+ }
+
+ return NULL;
+}
+
+static struct perf_domain *pd_init(int cpu)
+{
+ struct em_perf_domain *obj = em_cpu_get(cpu);
+ struct perf_domain *pd;
+
+ if (!obj) {
+ if (sched_debug())
+ pr_info("%s: no EM found for CPU%d\n", __func__, cpu);
+ return NULL;
+ }
+
+ pd = kzalloc(sizeof(*pd), GFP_KERNEL);
+ if (!pd)
+ return NULL;
+ pd->obj = obj;
+
+ return pd;
+}
+
+static void perf_domain_debug(const struct cpumask *cpu_map,
+ struct perf_domain *pd)
+{
+ if (!sched_debug() || !pd)
+ return;
+
+ printk(KERN_DEBUG "root_domain %*pbl: ", cpumask_pr_args(cpu_map));
+
+ while (pd) {
+ printk(KERN_CONT " pd%d:{ cpus=%*pbl nr_cstate=%d }",
+ cpumask_first(perf_domain_span(pd)),
+ cpumask_pr_args(perf_domain_span(pd)),
+ em_pd_nr_cap_states(pd->obj));
+ pd = pd->next;
+ }
+
+ printk(KERN_CONT "\n");
+}
+
+static void destroy_perf_domain_rcu(struct rcu_head *rp)
+{
+ struct perf_domain *pd;
+
+ pd = container_of(rp, struct perf_domain, rcu);
+ free_pd(pd);
+}
+
+static void build_perf_domains(const struct cpumask *cpu_map)
+{
+ struct perf_domain *pd = NULL, *tmp;
+ int cpu = cpumask_first(cpu_map);
+ struct root_domain *rd = cpu_rq(cpu)->rd;
+ int i;
+
+ for_each_cpu(i, cpu_map) {
+ /* Skip already covered CPUs. */
+ if (find_pd(pd, i))
+ continue;
+
+ /* Create the new pd and add it to the local list. */
+ tmp = pd_init(i);
+ if (!tmp)
+ goto free;
+ tmp->next = pd;
+ pd = tmp;
+ }
+
+ perf_domain_debug(cpu_map, pd);
+
+ /* Attach the new list of performance domains to the root domain. */
+ tmp = rd->pd;
+ rcu_assign_pointer(rd->pd, pd);
+ if (tmp)
+ call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
+
+ return;
+
+free:
+ free_pd(pd);
+ tmp = rd->pd;
+ rcu_assign_pointer(rd->pd, NULL);
+ if (tmp)
+ call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
+}
+#else
+static void free_pd(struct perf_domain *pd) { }
+#endif
+
static void free_rootdomain(struct rcu_head *rcu)
{
struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
@@ -211,6 +321,7 @@ static void free_rootdomain(struct rcu_head *rcu)
free_cpumask_var(rd->rto_mask);
free_cpumask_var(rd->online);
free_cpumask_var(rd->span);
+ free_pd(rd->pd);
kfree(rd);
}

@@ -1964,8 +2075,8 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
/* Destroy deleted domains: */
for (i = 0; i < ndoms_cur; i++) {
for (j = 0; j < n && !new_topology; j++) {
- if (cpumask_equal(doms_cur[i], doms_new[j])
- && dattrs_equal(dattr_cur, i, dattr_new, j))
+ if (cpumask_equal(doms_cur[i], doms_new[j]) &&
+ dattrs_equal(dattr_cur, i, dattr_new, j))
goto match1;
}
/* No match - a current sched domain not in new doms_new[] */
@@ -1985,8 +2096,8 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
/* Build new domains: */
for (i = 0; i < ndoms_new; i++) {
for (j = 0; j < n && !new_topology; j++) {
- if (cpumask_equal(doms_new[i], doms_cur[j])
- && dattrs_equal(dattr_new, i, dattr_cur, j))
+ if (cpumask_equal(doms_new[i], doms_cur[j]) &&
+ dattrs_equal(dattr_new, i, dattr_cur, j))
goto match2;
}
/* No match - add a new doms_new */
@@ -1995,6 +2106,21 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
;
}

+#ifdef CONFIG_ENERGY_MODEL
+ /* Build perf. domains: */
+ for (i = 0; i < ndoms_new; i++) {
+ for (j = 0; j < n; j++) {
+ if (cpumask_equal(doms_new[i], doms_cur[j]) &&
+ cpu_rq(cpumask_first(doms_cur[j]))->rd->pd)
+ goto match3;
+ }
+ /* No match - add perf. domains for a new rd */
+ build_perf_domains(doms_new[i]);
+match3:
+ ;
+ }
+#endif
+
/* Remember the new sched domains: */
if (doms_cur != &fallback_doms)
free_sched_domains(doms_cur, ndoms_cur);
--
2.17.1


2018-08-20 09:46:35

by Quentin Perret

Subject: [PATCH v6 06/14] sched/topology: Lowest CPU asymmetry sched_domain level pointer

Add another member to the family of per-cpu sched_domain shortcut
pointers. This one, sd_asym_cpucapacity, points to the lowest level
at which the SD_ASYM_CPUCAPACITY flag is set. While at it, rename the
sd_asym shortcut to sd_asym_packing to avoid confusion.

Generally speaking, the largest opportunity to save energy via
scheduling comes from a smarter exploitation of heterogeneous platforms
(i.e. big.LITTLE). Consequently, the sd_asym_cpucapacity shortcut will
be used at first as the lowest domain where Energy-Aware Scheduling
(EAS) should be applied. For example, it is possible to apply EAS within
a socket on a multi-socket system, as long as each socket has an
asymmetric topology. Cross-socket wake-up balancing will only happen
when the system is over-utilized, or when this_cpu and prev_cpu are in
different sockets.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Suggested-by: Morten Rasmussen <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
kernel/sched/fair.c | 2 +-
kernel/sched/sched.h | 3 ++-
kernel/sched/topology.c | 8 ++++++--
3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c60616796483..c28c7adcc7b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9167,7 +9167,7 @@ static void nohz_balancer_kick(struct rq *rq)
}
}

- sd = rcu_dereference(per_cpu(sd_asym, cpu));
+ sd = rcu_dereference(per_cpu(sd_asym_packing, cpu));
if (sd) {
for_each_cpu(i, sched_domain_span(sd)) {
if (i == cpu ||
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 97d79429e755..4b884e467545 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1203,7 +1203,8 @@ DECLARE_PER_CPU(int, sd_llc_size);
DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
-DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(struct sched_domain *, sd_asym_packing);
+DECLARE_PER_CPU(struct sched_domain *, sd_asym_cpucapacity);
extern struct static_key_false sched_asym_cpucapacity;

struct sched_group_capacity {
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 545e2c7b6e66..4c6a36a8d7b8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -508,7 +508,8 @@ DEFINE_PER_CPU(int, sd_llc_size);
DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
-DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_PER_CPU(struct sched_domain *, sd_asym_packing);
+DEFINE_PER_CPU(struct sched_domain *, sd_asym_cpucapacity);
DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);

static void update_top_cache_domain(int cpu)
@@ -534,7 +535,10 @@ static void update_top_cache_domain(int cpu)
rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);

sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
- rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
+ rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
+
+ sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);
+ rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
}

/*
--
2.17.1


2018-08-20 09:46:45

by Quentin Perret

Subject: [PATCH v6 07/14] sched/topology: Introduce sched_energy_present static key

In order to ensure a minimal performance impact on non-energy-aware
systems, introduce a static_key guarding the access to Energy-Aware
Scheduling (EAS) code.

The static key is set iff all the following conditions are met for at
least one root domain:
1. all online CPUs of the root domain are covered by the Energy
Model (EM);
2. the complexity of the root domain's EM is low enough to keep
scheduling overheads low;
3. the root domain has an asymmetric CPU capacity topology (detected
by looking for the SD_ASYM_CPUCAPACITY flag in the sched_domain
hierarchy).

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 77 ++++++++++++++++++++++++++++++++++++++++-
2 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4b884e467545..cb3d6afdb114 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1421,6 +1421,7 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =

extern struct static_key_false sched_numa_balancing;
extern struct static_key_false sched_schedstats;
+extern struct static_key_false sched_energy_present;

static inline u64 global_rt_period(void)
{
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 4c6a36a8d7b8..1cb86a0ef00f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -200,6 +200,14 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)

return 1;
}
+/*
+ * This static_key is set if at least one root domain meets all the following
+ * conditions:
+ * 1. all CPUs of the root domain are covered by the EM;
+ * 2. the EM complexity is low enough to keep scheduling overheads low;
+ * 3. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy.
+ */
+DEFINE_STATIC_KEY_FALSE(sched_energy_present);

#ifdef CONFIG_ENERGY_MODEL
static void free_pd(struct perf_domain *pd)
@@ -270,12 +278,34 @@ static void destroy_perf_domain_rcu(struct rcu_head *rp)
free_pd(pd);
}

+/*
+ * The complexity of the Energy Model is defined as: nr_pd * (nr_cpus + nr_cs)
+ * with: 'nr_pd' the number of performance domains; 'nr_cpus' the number of
+ * CPUs; and 'nr_cs' the sum of the capacity states numbers of all performance
+ * domains.
+ *
+ * It is generally not a good idea to use such a model in the wake-up path on
+ * very complex platforms because of the associated scheduling overheads. The
+ * arbitrary constraint below prevents that. It makes EAS usable up to 16 CPUs
+ * with per-CPU DVFS and less than 8 capacity states each, for example.
+ */
+#define EM_MAX_COMPLEXITY 2048
+
static void build_perf_domains(const struct cpumask *cpu_map)
{
+ int i, nr_pd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map);
struct perf_domain *pd = NULL, *tmp;
int cpu = cpumask_first(cpu_map);
struct root_domain *rd = cpu_rq(cpu)->rd;
- int i;
+
+ /* EAS is enabled for asymmetric CPU capacity topologies. */
+ if (!per_cpu(sd_asym_cpucapacity, cpu)) {
+ if (sched_debug()) {
+ pr_info("rd %*pbl: CPUs do not have asymmetric capacities\n",
+ cpumask_pr_args(cpu_map));
+ }
+ goto free;
+ }

for_each_cpu(i, cpu_map) {
/* Skip already covered CPUs. */
@@ -288,6 +318,21 @@ static void build_perf_domains(const struct cpumask *cpu_map)
goto free;
tmp->next = pd;
pd = tmp;
+
+ /*
+ * Count performance domains and capacity states for the
+ * complexity check.
+ */
+ nr_pd++;
+ nr_cs += em_pd_nr_cap_states(pd->obj);
+ }
+
+ /* Bail out if the Energy Model complexity is too high. */
+ if (nr_pd * (nr_cs + nr_cpus) > EM_MAX_COMPLEXITY) {
+ if (sched_debug())
+ pr_info("rd %*pbl: EM complexity is too high\n ",
+ cpumask_pr_args(cpu_map));
+ goto free;
}

perf_domain_debug(cpu_map, pd);
@@ -307,6 +352,35 @@ static void build_perf_domains(const struct cpumask *cpu_map)
if (tmp)
call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
}
+
+static void sched_energy_start(int ndoms_new, cpumask_var_t doms_new[])
+{
+ /*
+ * The conditions for EAS to start are checked during the creation of
+ * root domains. If one of them meets all conditions, it will have a
+ * non-null list of performance domains.
+ */
+ while (ndoms_new) {
+ if (cpu_rq(cpumask_first(doms_new[ndoms_new - 1]))->rd->pd)
+ goto enable;
+ ndoms_new--;
+ }
+
+ if (static_branch_unlikely(&sched_energy_present)) {
+ if (sched_debug())
+ pr_info("%s: stopping EAS\n", __func__);
+ static_branch_disable_cpuslocked(&sched_energy_present);
+ }
+
+ return;
+
+enable:
+ if (!static_branch_unlikely(&sched_energy_present)) {
+ if (sched_debug())
+ pr_info("%s: starting EAS\n", __func__);
+ static_branch_enable_cpuslocked(&sched_energy_present);
+ }
+}
#else
static void free_pd(struct perf_domain *pd) { }
#endif
@@ -2123,6 +2197,7 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
match3:
;
}
+ sched_energy_start(ndoms_new, doms_new);
#endif

/* Remember the new sched domains: */
--
2.17.1
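
The EM_MAX_COMPLEXITY guard in the hunk above reduces to a single integer
check, which can be sanity-checked outside the kernel. The sketch below is
illustrative only and not part of the patch; the topologies used in the
examples are hypothetical:

```c
#include <assert.h>

#define EM_MAX_COMPLEXITY	2048

/* Same expression as the check in build_perf_domains() above. */
static int em_complexity_ok(int nr_pd, int nr_cpus, int nr_cs)
{
	return nr_pd * (nr_cs + nr_cpus) <= EM_MAX_COMPLEXITY;
}
```

With 16 CPUs, per-CPU DVFS and 7 capacity states per CPU (nr_pd = 16,
nr_cpus = 16, nr_cs = 112), the product is exactly 2048 and EAS is allowed;
with 8 states per CPU it reaches 2304 and the root domain is rejected,
matching the "less than 8 capacity states" wording of the comment.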


2018-08-20 09:46:49

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v6 03/14] PM: Introduce an Energy Model management framework

Several subsystems in the kernel (task scheduler and/or thermal at the
time of writing) can benefit from knowing about the energy consumed by
CPUs. Yet, this information can come from different sources (DT or
firmware for example), in different formats, hence making it hard to
exploit without a standard API.

As an attempt to address this, introduce a centralized Energy Model
(EM) management framework which aggregates the power values provided
by drivers into a table for each performance domain in the system. The
power cost tables are made available to interested clients (e.g. task
scheduler or thermal) via platform-agnostic APIs. The overall design
is represented by the diagram below (focused on Arm-related drivers as
an example, but applicable to any architecture):

+---------------+ +-----------------+ +-------------+
| Thermal (IPA) | | Scheduler (EAS) | | Other |
+---------------+ +-----------------+ +-------------+
| | em_pd_energy() |
| | em_cpu_get() |
+-----------+ | +----------+
| | |
v v v
+---------------------+
| |
| Energy Model |
| |
| Framework |
| |
+---------------------+
^ ^ ^
| | | em_register_perf_domain()
+----------+ | +---------+
| | |
+---------------+ +---------------+ +--------------+
| cpufreq-dt | | arm_scmi | | Other |
+---------------+ +---------------+ +--------------+
^ ^ ^
| | |
+--------------+ +---------------+ +--------------+
| Device Tree | | Firmware | | ? |
+--------------+ +---------------+ +--------------+

Drivers (typically, but not limited to, CPUFreq drivers) can register
data in the EM framework using the em_register_perf_domain() API. The
calling driver must provide a callback function with a standardized
signature that will be used by the EM framework to build the power
cost tables of the performance domain. This design should offer a lot of
flexibility to calling drivers, which are free to read information
from any location and to use any technique to compute power costs.
Moreover, the capacity states registered by drivers in the EM framework
are not required to match real performance states of the target. This
is particularly important on targets where the performance states are
not known by the OS.
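
To make the callback contract concrete, here is a sketch of what a
driver-side active_power() implementation could look like; the OPP table is
entirely made up, and real drivers would read such data from DT or firmware
as described above:

```c
#include <assert.h>

/* Hypothetical OPP table for one CPU: {frequency in kHz, power in mW}. */
static const unsigned long opps[][2] = {
	{  500000,  80 },
	{ 1000000, 220 },
	{ 1500000, 500 },
};

/*
 * Sketch of an em_data_callback::active_power() implementation: ceil
 * *freq to the lowest OPP at or above it, and report that OPP's power.
 */
static int demo_active_power(unsigned long *power, unsigned long *freq,
			     int cpu)
{
	unsigned int i;

	(void)cpu; /* one shared table for all CPUs in this sketch */
	for (i = 0; i < sizeof(opps) / sizeof(opps[0]); i++) {
		if (opps[i][0] >= *freq) {
			*freq = opps[i][0];
			*power = opps[i][1];
			return 0;
		}
	}
	return -1; /* no OPP at or above *freq */
}
```

em_create_pd() starts from freq = 0 and increments it after each state, so
successive calls walk such a table from the lowest OPP upwards.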

The power cost coefficients managed by the EM framework are specified in
milli-watts. Although the two potential users of those coefficients (IPA
and EAS) only need relative correctness, IPA specifically needs to
compare the power of CPUs with the power of other components (GPUs, for
example), which are still expressed in absolute terms in their
respective subsystems. Hence, specifying the power of CPUs in
milli-watts should help transition IPA to the EM framework
without introducing new problems, by keeping units comparable across
sub-systems.
In the longer term, the EMs of devices other than CPUs could also be
managed by the EM framework, which would make it possible to drop the
absolute unit. However, this is not required as a first step, so this
extension of the EM framework is left for later.

On the client side, the EM framework offers APIs to access the power
cost tables of a CPU (em_cpu_get()), and to estimate the energy
consumed by the CPUs of a performance domain (em_pd_energy()). Clients
such as the task scheduler can then use these APIs to access the shared
data structures holding the Energy Model of CPUs.
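
The energy estimation described above boils down to two integer formulas:
the 'cost' coefficient precomputed per capacity state at registration time,
and the per-domain product evaluated at wake-up (equations (1) to (4) in the
patch). A minimal userspace mirror, with made-up numbers, might look like:

```c
#include <assert.h>

/*
 * cost = max_frequency * power / frequency, precomputed per capacity
 * state at registration time (frequencies in kHz, power in mW).
 */
static unsigned long em_cs_cost(unsigned long power, unsigned long freq,
				unsigned long max_freq)
{
	return max_freq * power / freq;
}

/*
 * Estimated energy of a performance domain at a given capacity state:
 * pd_nrg = cost * sum_util / scale_cpu.
 */
static unsigned long em_pd_nrg(unsigned long cost, unsigned long sum_util,
			       unsigned long scale_cpu)
{
	return cost * sum_util / scale_cpu;
}
```

For example, a 500 MHz state drawing 100 mW in a domain whose highest state
runs at 1 GHz gets a cost of 200, and a summed utilization of 512 on CPUs of
scale 1024 then yields an estimate of 100.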

Cc: Peter Zijlstra <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
include/linux/energy_model.h | 161 ++++++++++++++++++++++++++++
kernel/power/Kconfig | 15 +++
kernel/power/Makefile | 2 +
kernel/power/energy_model.c | 199 +++++++++++++++++++++++++++++++++++
4 files changed, 377 insertions(+)
create mode 100644 include/linux/energy_model.h
create mode 100644 kernel/power/energy_model.c

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
new file mode 100644
index 000000000000..b89b5596c976
--- /dev/null
+++ b/include/linux/energy_model.h
@@ -0,0 +1,161 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_ENERGY_MODEL_H
+#define _LINUX_ENERGY_MODEL_H
+#include <linux/cpumask.h>
+#include <linux/jump_label.h>
+#include <linux/kobject.h>
+#include <linux/rcupdate.h>
+#include <linux/sched/cpufreq.h>
+#include <linux/sched/topology.h>
+#include <linux/types.h>
+
+#ifdef CONFIG_ENERGY_MODEL
+struct em_cap_state {
+ unsigned long frequency; /* Kilo-hertz */
+ unsigned long power; /* Milli-watts */
+ unsigned long cost; /* power * max_frequency / frequency */
+};
+
+struct em_perf_domain {
+ struct em_cap_state *table; /* Capacity states, in ascending order. */
+ int nr_cap_states;
+ unsigned long cpus[0]; /* CPUs of the frequency domain. */
+};
+
+#define EM_CPU_MAX_POWER 0xFFFF
+
+struct em_data_callback {
+ /**
+ * active_power() - Provide power at the next capacity state of a CPU
+ * @power : Active power at the capacity state in mW (modified)
+ * @freq : Frequency at the capacity state in kHz (modified)
+ * @cpu : CPU for which we do this operation
+ *
+ * active_power() must find the lowest capacity state of 'cpu' above
+ * 'freq' and update 'power' and 'freq' to the matching active power
+ * and frequency.
+ *
+ * The power is the one of a single CPU in the domain, expressed in
+ * milli-watts. It is expected to fit in the [0, EM_CPU_MAX_POWER]
+ * range.
+ *
+ * Return 0 on success.
+ */
+ int (*active_power)(unsigned long *power, unsigned long *freq, int cpu);
+};
+#define EM_DATA_CB(_active_power_cb) { .active_power = &_active_power_cb }
+
+struct em_perf_domain *em_cpu_get(int cpu);
+int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
+ struct em_data_callback *cb);
+
+/**
+ * em_pd_energy() - Estimates the energy consumed by the CPUs of a perf. domain
+ * @pd : performance domain for which energy has to be estimated
+ * @max_util : highest utilization among CPUs of the domain
+ * @sum_util : sum of the utilization of all CPUs in the domain
+ *
+ * Return: the sum of the energy consumed by the CPUs of the domain assuming
+ * a capacity state satisfying the max utilization of the domain.
+ */
+static inline unsigned long em_pd_energy(struct em_perf_domain *pd,
+ unsigned long max_util, unsigned long sum_util)
+{
+ unsigned long freq, scale_cpu;
+ struct em_cap_state *cs;
+ int i, cpu;
+
+ /*
+ * In order to predict the capacity state, map the utilization of the
+ * most utilized CPU of the performance domain to a requested frequency,
+ * like schedutil.
+ */
+ cpu = cpumask_first(to_cpumask(pd->cpus));
+ scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+ cs = &pd->table[pd->nr_cap_states - 1];
+ freq = map_util_freq(max_util, cs->frequency, scale_cpu);
+
+ /*
+ * Find the lowest capacity state of the Energy Model above the
+ * requested frequency.
+ */
+ for (i = 0; i < pd->nr_cap_states; i++) {
+ cs = &pd->table[i];
+ if (cs->frequency >= freq)
+ break;
+ }
+
+ /*
+ * The capacity of a CPU in the domain at that capacity state (cs)
+ * can be computed as:
+ *
+ * cs->freq * scale_cpu
+ * cs->cap = -------------------- (1)
+ * cpu_max_freq
+ *
+ * So, the energy consumed by this CPU at that capacity state is:
+ *
+ * cs->power * cpu_util
+ * cpu_nrg = -------------------- (2)
+ * cs->cap
+ *
+ * since 'cpu_util / cs->cap' represents its percentage of busy time.
+ * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
+ * of two terms:
+ *
+ * cs->power * cpu_max_freq cpu_util
+ * cpu_nrg = ------------------------ * --------- (3)
+ * cs->freq scale_cpu
+ *
+ * The first term is static, and is stored in the em_cap_state struct
+ * as 'cs->cost'.
+ *
+ * Since all CPUs of the domain have the same micro-architecture, they
+ * share the same 'cs->cost', and the same CPU capacity. Hence, the
+ * total energy of the domain (which is the simple sum of the energy of
+ * all of its CPUs) can be factorized as:
+ *
+ * cs->cost * \Sum cpu_util
+ * pd_nrg = ------------------------ (4)
+ * scale_cpu
+ */
+ return cs->cost * sum_util / scale_cpu;
+}
+
+/**
+ * em_pd_nr_cap_states() - Get the number of capacity states of a perf. domain
+ * @pd : performance domain for which this must be done
+ *
+ * Return: the number of capacity states in the performance domain table
+ */
+static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
+{
+ return pd->nr_cap_states;
+}
+
+#else
+struct em_perf_domain {};
+struct em_data_callback {};
+#define EM_DATA_CB(_active_power_cb) { }
+
+static inline int em_register_perf_domain(cpumask_t *span,
+ unsigned int nr_states, struct em_data_callback *cb)
+{
+ return -EINVAL;
+}
+static inline struct em_perf_domain *em_cpu_get(int cpu)
+{
+ return NULL;
+}
+static inline unsigned long em_pd_energy(struct em_perf_domain *pd,
+ unsigned long max_util, unsigned long sum_util)
+{
+ return 0;
+}
+static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
+{
+ return 0;
+}
+#endif
+
+#endif
diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig
index e880ca22c5a5..6f6db452dc7d 100644
--- a/kernel/power/Kconfig
+++ b/kernel/power/Kconfig
@@ -297,3 +297,18 @@ config PM_GENERIC_DOMAINS_OF

config CPU_PM
bool
+
+config ENERGY_MODEL
+ bool "Energy Model for CPUs"
+ depends on SMP
+ depends on CPU_FREQ
+ default n
+ help
+ Several subsystems (thermal and/or the task scheduler for example)
+ can leverage information about the energy consumed by CPUs to make
+ smarter decisions. This config option enables the framework from
+ which subsystems can access the energy models.
+
+ The exact usage of the energy model is subsystem-dependent.
+
+ If in doubt, say N.
diff --git a/kernel/power/Makefile b/kernel/power/Makefile
index a3f79f0eef36..e7e47d9be1e5 100644
--- a/kernel/power/Makefile
+++ b/kernel/power/Makefile
@@ -15,3 +15,5 @@ obj-$(CONFIG_PM_AUTOSLEEP) += autosleep.o
obj-$(CONFIG_PM_WAKELOCKS) += wakelock.o

obj-$(CONFIG_MAGIC_SYSRQ) += poweroff.o
+
+obj-$(CONFIG_ENERGY_MODEL) += energy_model.o
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
new file mode 100644
index 000000000000..2aa6f94343d0
--- /dev/null
+++ b/kernel/power/energy_model.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Energy Model of CPUs
+ *
+ * Copyright (c) 2018, Arm ltd.
+ * Written by: Quentin Perret, Arm ltd.
+ */
+
+#define pr_fmt(fmt) "energy_model: " fmt
+
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/energy_model.h>
+#include <linux/sched/topology.h>
+#include <linux/slab.h>
+
+/* Mapping of each CPU to the performance domain to which it belongs. */
+static DEFINE_PER_CPU(struct em_perf_domain *, em_data);
+
+/*
+ * Mutex serializing the registrations of performance domains and letting
+ * callbacks defined by drivers sleep.
+ */
+static DEFINE_MUTEX(em_pd_mutex);
+
+static struct em_perf_domain *em_create_pd(cpumask_t *span, int nr_states,
+ struct em_data_callback *cb)
+{
+ unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
+ unsigned long power, freq, prev_freq = 0;
+ int i, ret, cpu = cpumask_first(span);
+ struct em_cap_state *table;
+ struct em_perf_domain *pd;
+ u64 fmax;
+
+ if (!cb->active_power)
+ return NULL;
+
+ pd = kzalloc(sizeof(*pd) + cpumask_size(), GFP_KERNEL);
+ if (!pd)
+ return NULL;
+
+ table = kcalloc(nr_states, sizeof(*table), GFP_KERNEL);
+ if (!table)
+ goto free_pd;
+
+ /* Build the list of capacity states for this performance domain */
+ for (i = 0, freq = 0; i < nr_states; i++, freq++) {
+ /*
+ * active_power() is a driver callback which ceils 'freq' to
+ * lowest capacity state of 'cpu' above 'freq' and updates
+ * 'power' and 'freq' accordingly.
+ */
+ ret = cb->active_power(&power, &freq, cpu);
+ if (ret) {
+ pr_err("pd%d: invalid cap. state: %d\n", cpu, ret);
+ goto free_cs_table;
+ }
+
+ /*
+ * We expect the driver callback to increase the frequency for
+ * higher capacity states.
+ */
+ if (freq <= prev_freq) {
+ pr_err("pd%d: non-increasing freq: %lu\n", cpu, freq);
+ goto free_cs_table;
+ }
+
+ /*
+ * The power returned by active_power() is expected to be in
+ * milli-watts and to fit into 16 bits.
+ */
+ if (power > EM_CPU_MAX_POWER) {
+ pr_err("pd%d: power out of scale: %lu\n", cpu, power);
+ goto free_cs_table;
+ }
+
+ table[i].power = power;
+ table[i].frequency = prev_freq = freq;
+
+ /*
+ * The hertz/watts efficiency ratio should decrease as the
+ * frequency grows on sane platforms. But this isn't always
+ * true in practice so warn the user if a higher OPP is more
+ * power efficient than a lower one.
+ */
+ opp_eff = freq / power;
+ if (opp_eff >= prev_opp_eff)
+ pr_warn("pd%d: hertz/watts ratio non-monotonically decreasing: OPP%d >= OPP%d\n",
+ cpu, i, i - 1);
+ prev_opp_eff = opp_eff;
+ }
+
+ /* Compute the cost of each capacity_state. */
+ fmax = (u64) table[nr_states - 1].frequency;
+ for (i = 0; i < nr_states; i++) {
+ table[i].cost = div64_u64(fmax * table[i].power,
+ table[i].frequency);
+ }
+
+ pd->table = table;
+ pd->nr_cap_states = nr_states;
+ cpumask_copy(to_cpumask(pd->cpus), span);
+
+ return pd;
+
+free_cs_table:
+ kfree(table);
+free_pd:
+ kfree(pd);
+
+ return NULL;
+}
+
+/**
+ * em_cpu_get() - Return the performance domain for a CPU
+ * @cpu : CPU to find the performance domain for
+ *
+ * Return: the performance domain to which 'cpu' belongs, or NULL if it doesn't
+ * exist.
+ */
+struct em_perf_domain *em_cpu_get(int cpu)
+{
+ return READ_ONCE(per_cpu(em_data, cpu));
+}
+EXPORT_SYMBOL_GPL(em_cpu_get);
+
+/**
+ * em_register_perf_domain() - Register the Energy Model of a performance domain
+ * @span : Mask of CPUs in the performance domain
+ * @nr_states : Number of capacity states to register
+ * @cb : Callback functions providing the data of the Energy Model
+ *
+ * Create Energy Model tables for a performance domain using the callbacks
+ * defined in cb.
+ *
+ * If multiple clients register the same performance domain, all but the first
+ * registration will be ignored.
+ *
+ * Return 0 on success
+ */
+int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
+ struct em_data_callback *cb)
+{
+ unsigned long cap, prev_cap = 0;
+ struct em_perf_domain *pd;
+ int cpu, ret = 0;
+
+ if (!span || !nr_states || !cb)
+ return -EINVAL;
+
+ /*
+ * Use a mutex to serialize the registration of performance domains and
+ * let the driver-defined callback functions sleep.
+ */
+ mutex_lock(&em_pd_mutex);
+
+ for_each_cpu(cpu, span) {
+ /* Make sure we don't register again an existing domain. */
+ if (READ_ONCE(per_cpu(em_data, cpu))) {
+ ret = -EEXIST;
+ goto unlock;
+ }
+
+ /*
+ * All CPUs of a domain must have the same micro-architecture
+ * since they all share the same table.
+ */
+ cap = arch_scale_cpu_capacity(NULL, cpu);
+ if (prev_cap && prev_cap != cap) {
+ pr_err("CPUs of %*pbl must have the same capacity\n",
+ cpumask_pr_args(span));
+ ret = -EINVAL;
+ goto unlock;
+ }
+ prev_cap = cap;
+ }
+
+ /* Create the performance domain and add it to the Energy Model. */
+ pd = em_create_pd(span, nr_states, cb);
+ if (!pd) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ for_each_cpu(cpu, span) {
+ /*
+ * The per-cpu array can be concurrently accessed from
+ * em_cpu_get().
+ */
+ smp_store_release(per_cpu_ptr(&em_data, cpu), pd);
+ }
+
+ pr_debug("Created perf domain %*pbl\n", cpumask_pr_args(span));
+unlock:
+ mutex_unlock(&em_pd_mutex);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(em_register_perf_domain);
--
2.17.1


2018-08-20 09:46:53

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v6 08/14] sched/fair: Clean-up update_sg_lb_stats parameters

In preparation for the introduction of a new root domain flag which can
be set during load balance (the 'overutilized' flag), clean-up the set
of parameters passed to update_sg_lb_stats(). More specifically, the
'local_group' and 'local_idx' parameters can be removed since they can
easily be reconstructed from within the function.

While at it, transform the 'overload' parameter into a flag stored in
the 'sg_status' parameter, hence facilitating the definition of new
flags when needed.
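
The bool-to-bitmask change can be sketched in isolation: callers OR group
status bits into a single int, so new conditions only need a new flag value.
SG_OVERLOAD below mirrors the constant this patch adds to sched.h; the loop
is a stand-in for the per-CPU scan in update_sg_lb_stats():

```c
#include <assert.h>

#define SG_OVERLOAD	0x1	/* More than one runnable task on a CPU. */

/* Accumulate group status flags instead of passing a bool around. */
static int sg_status_demo(const int *nr_running, int nr_cpus)
{
	int i, sg_status = 0;

	for (i = 0; i < nr_cpus; i++) {
		if (nr_running[i] > 1)
			sg_status |= SG_OVERLOAD;
	}
	return sg_status;
}
```

Patch 09 in this series adds a second flag (SG_OVERUTILIZED) to the same
int, which is exactly what this refactoring enables.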

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Suggested-by: Valentin Schneider <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
kernel/sched/fair.c | 27 +++++++++++----------------
kernel/sched/sched.h | 3 +++
2 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c28c7adcc7b3..23381feae4ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7782,16 +7782,16 @@ static bool update_nohz_stats(struct rq *rq, bool force)
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @env: The load balancing environment.
* @group: sched_group whose statistics are to be updated.
- * @load_idx: Load index of sched_domain of this_cpu for load calc.
- * @local_group: Does group contain this_cpu.
* @sgs: variable to hold the statistics for this group.
- * @overload: Indicate pullable load (e.g. >1 runnable task).
+ * @sg_status: Holds flag indicating the status of the sched_group
*/
static inline void update_sg_lb_stats(struct lb_env *env,
- struct sched_group *group, int load_idx,
- int local_group, struct sg_lb_stats *sgs,
- bool *overload)
+ struct sched_group *group,
+ struct sg_lb_stats *sgs,
+ int *sg_status)
{
+ int local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));
+ int load_idx = get_sd_load_idx(env->sd, env->idle);
unsigned long load;
int i, nr_running;

@@ -7815,7 +7815,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,

nr_running = rq->nr_running;
if (nr_running > 1)
- *overload = true;
+ *sg_status |= SG_OVERLOAD;

#ifdef CONFIG_NUMA_BALANCING
sgs->nr_numa_running += rq->nr_numa_running;
@@ -7831,7 +7831,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
sgs->group_misfit_task_load < rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
- *overload = 1;
+ *sg_status |= SG_OVERLOAD;
}
}

@@ -7976,17 +7976,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
- int load_idx;
- bool overload = false;
bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+ int sg_status = 0;

#ifdef CONFIG_NO_HZ_COMMON
if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
env->flags |= LBF_NOHZ_STATS;
#endif

- load_idx = get_sd_load_idx(env->sd, env->idle);
-
do {
struct sg_lb_stats *sgs = &tmp_sgs;
int local_group;
@@ -8001,8 +7998,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
update_group_capacity(env->sd, env->dst_cpu);
}

- update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
- &overload);
+ update_sg_lb_stats(env, sg, sgs, &sg_status);

if (local_group)
goto next_group;
@@ -8052,8 +8048,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

if (!env->sd->parent) {
/* update overload indicator if we are at root domain */
- if (READ_ONCE(env->dst_rq->rd->overload) != overload)
- WRITE_ONCE(env->dst_rq->rd->overload, overload);
+ WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cb3d6afdb114..c6c0cf71d03b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -707,6 +707,9 @@ struct perf_domain {
struct rcu_head rcu;
};

+/* Scheduling group status flags */
+#define SG_OVERLOAD 0x1 /* More than one runnable task on a CPU. */
+
/*
* We add the notion of a root-domain which will be used to define per-domain
* variables. Each exclusive cpuset essentially defines an island domain by
--
2.17.1


2018-08-20 09:46:59

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v6 09/14] sched: Add over-utilization/tipping point indicator

From: Morten Rasmussen <[email protected]>

Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
above the tipping point, task placement is done the traditional way
based on load_avg, spreading the tasks across as many cpus as possible
based on priority-scaled load to preserve smp_nice. Below the tipping
point we want to use util_avg instead. We need to define a criterion
for when we make the switch.

The util_avg for each cpu converges towards 100% regardless of how many
additional tasks we may put on it. If we define over-utilized as:

sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)

some individual cpus may be over-utilized running multiple tasks even
when the above condition is false. That should be okay as long as we try
to spread the tasks out to avoid per-cpu over-utilization as much as
possible and if all tasks have the _same_ priority. If the latter isn't
true, we have to consider priority to preserve smp_nice.

For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks
getting their own, as we have 1.5*n_cpus tasks in total and 55%+55% is
less over-utilized than 55%+60% for those cpus that have to be shared.
The system utilization is only 85% of the system capacity, but we are
breaking smp_nice.

To be sure not to break smp_nice, we have defined over-utilization
conservatively as when any cpu in the system is fully utilized at its
highest frequency instead:

cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity

IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg
to factor in priority to preserve smp_nice.

With this definition, we can skip periodic load-balance as no cpu has an
always-running task when the system is not over-utilized. All tasks will
be periodic and we can balance them at wake-up. This conservative
condition does, however, mean that scenarios which could still benefit
from energy-aware decisions while one cpu is fully utilized will not
get those benefits.

For systems where some cpus might have reduced capacity (RT-pressure
and/or big.LITTLE), we want periodic load-balance checks as soon as
just a single cpu is fully utilized, as it might be one of those with
reduced capacity, and in that case we want to migrate it.
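
In the patch, this conservative definition becomes a single integer
comparison per CPU. The userspace mirror below assumes the capacity_margin
value of 1280 used by fair.c at this time (i.e. a CPU counts as
over-utilized once its utilization exceeds 1024/1280 = 80% of its capacity):

```c
#include <assert.h>

/* Assumed from fair.c of this era: margin of 1280/1024. */
#define CAPACITY_MARGIN	1280

/* Userspace mirror of cpu_overutilized() from the patch below. */
static int overutilized(unsigned long capacity, unsigned long util)
{
	return capacity * 1024 < util * CAPACITY_MARGIN;
}
```

A big CPU of capacity 1024 is thus over-utilized above util ~819, and a
LITTLE CPU of capacity 446 is already over-utilized at util 400, which is
what triggers the periodic-balance fallback on asymmetric systems.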

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
[ Added a comment explaining why new tasks are not accounted during
overutilization detection ]
Signed-off-by: Quentin Perret <[email protected]>
---
kernel/sched/fair.c | 59 ++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 4 +++
2 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23381feae4ec..00729ff55fa3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5001,6 +5001,24 @@ static inline void hrtick_update(struct rq *rq)
}
#endif

+#ifdef CONFIG_SMP
+static inline unsigned long cpu_util(int cpu);
+static unsigned long capacity_of(int cpu);
+
+static inline bool cpu_overutilized(int cpu)
+{
+ return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
+}
+
+static inline void update_overutilized_status(struct rq *rq)
+{
+ if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+ WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
+}
+#else
+static inline void update_overutilized_status(struct rq *rq) { }
+#endif
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -5058,8 +5076,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_group(se);
}

- if (!se)
+ if (!se) {
add_nr_running(rq, 1);
+ /*
+ * Since new tasks are assigned an initial util_avg equal to
+ * half of the spare capacity of their CPU, tiny tasks have the
+ * ability to cross the overutilized threshold, which will
+ * result in the load balancer ruining all the task placement
+ * done by EAS. As a way to mitigate that effect, do not account
+ * for the first enqueue operation of new tasks during the
+ * overutilized flag detection.
+ *
+ * A better way of solving this problem would be to wait for
+ * the PELT signals of tasks to converge before taking them
+ * into account, but that is not straightforward to implement,
+ * and the following generally works well enough in practice.
+ */
+ if (flags & ENQUEUE_WAKEUP)
+ update_overutilized_status(rq);
+
+ }

hrtick_update(rq);
}
@@ -7817,6 +7853,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (nr_running > 1)
*sg_status |= SG_OVERLOAD;

+ if (cpu_overutilized(i))
+ *sg_status |= SG_OVERUTILIZED;
+
#ifdef CONFIG_NUMA_BALANCING
sgs->nr_numa_running += rq->nr_numa_running;
sgs->nr_preferred_running += rq->nr_preferred_running;
@@ -8047,8 +8086,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
env->fbq_type = fbq_classify_group(&sds->busiest_stat);

if (!env->sd->parent) {
+ struct root_domain *rd = env->dst_rq->rd;
+
/* update overload indicator if we are at root domain */
- WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
+ WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);
+
+ /* Update over-utilization (tipping point, U >= 0) indicator */
+ WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
+ } else if (sg_status & SG_OVERUTILIZED) {
+ WRITE_ONCE(env->dst_rq->rd->overutilized, SG_OVERUTILIZED);
}
}

@@ -8275,6 +8321,14 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
* this level.
*/
update_sd_lb_stats(env, &sds);
+
+ if (static_branch_unlikely(&sched_energy_present)) {
+ struct root_domain *rd = env->dst_rq->rd;
+
+ if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
+ goto out_balanced;
+ }
+
local = &sds.local_stat;
busiest = &sds.busiest_stat;

@@ -9666,6 +9720,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
task_tick_numa(rq, curr);

update_misfit_status(curr, rq);
+ update_overutilized_status(task_rq(curr));
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c6c0cf71d03b..6a0f8d1ca2d2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -709,6 +709,7 @@ struct perf_domain {

/* Scheduling group status flags */
#define SG_OVERLOAD 0x1 /* More than one runnable task on a CPU. */
+#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */

/*
* We add the notion of a root-domain which will be used to define per-domain
@@ -732,6 +733,9 @@ struct root_domain {
*/
int overload;

+ /* Indicate one or more cpus over-utilized (tipping point) */
+ int overutilized;
+
/*
* The bit corresponding to a CPU gets set here if such CPU has more
* than one runnable -deadline task (as it is below for RT tasks).
--
2.17.1


2018-08-20 09:47:02

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v6 10/14] sched/cpufreq: Refactor the utilization aggregation method

Schedutil aggregates the PELT signals of CFS, RT, DL and IRQ in order
to decide which frequency to request. Energy Aware Scheduling (EAS)
needs to be able to predict those requests to assess the energy impact
of scheduling decisions. However, the PELT signals aggregation is only
done in schedutil for now, hence making it hard to synchronize it with
EAS.

To address this issue, introduce schedutil_freq_util() to perform the
aforementioned aggregation and make it available to other parts of the
scheduler. Since frequency selection and energy estimation still need
to deal with RT and DL signals slightly differently, schedutil_freq_util()
is called with a different 'type' parameter in those two contexts, and
returns an aggregated utilization signal accordingly.
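
Stripped of the IRQ scaling and of the runnable-RT shortcut, the two
behaviours selected by the 'type' parameter can be sketched as below; the
ENERGY_UTIL name for the second type is assumed from the rest of the series:

```c
#include <assert.h>

enum schedutil_type { FREQUENCY_UTIL, ENERGY_UTIL };

/*
 * Simplified sketch of the CFS/RT/DL aggregation performed by
 * schedutil_freq_util(); not the real kernel function.
 */
static unsigned long freq_util(enum schedutil_type type,
			       unsigned long util_cfs, unsigned long util_rt,
			       unsigned long util_dl, unsigned long bw_dl,
			       unsigned long max)
{
	unsigned long util = util_cfs + util_rt;

	if (type == FREQUENCY_UTIL) {
		/* Check saturation with util_dl, but request bw_dl. */
		if (util + util_dl >= max)
			return max;
		util += bw_dl;
	} else {
		/* Energy estimation wants the actual DL running time. */
		util += util_dl;
	}
	return util < max ? util : max;
}
```

The key asymmetry is that frequency selection grants DEADLINE its reserved
bandwidth (bw_dl), while energy estimation sums the DL utilization actually
consumed (util_dl).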

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 86 +++++++++++++++++++++-----------
kernel/sched/sched.h | 14 ++++++
2 files changed, 72 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index f5206c950143..8356cb0072a6 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -198,15 +198,15 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
* based on the task model parameters and gives the minimal utilization
* required to meet deadlines.
*/
-static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
+unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
+ enum schedutil_type type)
{
- struct rq *rq = cpu_rq(sg_cpu->cpu);
+ struct rq *rq = cpu_rq(cpu);
unsigned long util, irq, max;

- sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
- sg_cpu->bw_dl = cpu_bw_dl(rq);
+ max = arch_scale_cpu_capacity(NULL, cpu);

- if (rt_rq_is_runnable(&rq->rt))
+ if (type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt))
return max;

/*
@@ -224,20 +224,33 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
*/
- util = cpu_util_cfs(rq);
+ util = util_cfs;
util += cpu_util_rt(rq);

- /*
- * We do not make cpu_util_dl() a permanent part of this sum because we
- * want to use cpu_bw_dl() later on, but we need to check if the
- * CFS+RT+DL sum is saturated (ie. no idle time) such that we select
- * f_max when there is no idle time.
- *
- * NOTE: numerical errors or stop class might cause us to not quite hit
- * saturation when we should -- something for later.
- */
- if ((util + cpu_util_dl(rq)) >= max)
- return max;
+ if (type == FREQUENCY_UTIL) {
+ /*
+ * For frequency selection we do not make cpu_util_dl() a
+ * permanent part of this sum because we want to use
+ * cpu_bw_dl() later on, but we need to check if the
+ * CFS+RT+DL sum is saturated (ie. no idle time) such
+ * that we select f_max when there is no idle time.
+ *
+ * NOTE: numerical errors or stop class might cause us
+ * to not quite hit saturation when we should --
+ * something for later.
+ */
+
+ if ((util + cpu_util_dl(rq)) >= max)
+ return max;
+ } else {
+ /*
+ * OTOH, for energy computation we need the estimated
+ * running time, so include util_dl and ignore dl_bw.
+ */
+ util += cpu_util_dl(rq);
+ if (util >= max)
+ return max;
+ }

/*
* There is still idle time; further improve the number by using the
@@ -251,17 +264,34 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
util = scale_irq_capacity(util, irq, max);
util += irq;

- /*
- * Bandwidth required by DEADLINE must always be granted while, for
- * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
- * to gracefully reduce the frequency when no tasks show up for longer
- * periods of time.
- *
- * Ideally we would like to set bw_dl as min/guaranteed freq and util +
- * bw_dl as requested freq. However, cpufreq is not yet ready for such
- * an interface. So, we only do the latter for now.
- */
- return min(max, util + sg_cpu->bw_dl);
+ if (type == FREQUENCY_UTIL) {
+ /*
+ * Bandwidth required by DEADLINE must always be granted
+ * while, for FAIR and RT, we use blocked utilization of
+ * IDLE CPUs as a mechanism to gracefully reduce the
+ * frequency when no tasks show up for longer periods of
+ * time.
+ *
+ * Ideally we would like to set bw_dl as min/guaranteed
+ * freq and util + bw_dl as requested freq. However,
+ * cpufreq is not yet ready for such an interface. So,
+ * we only do the latter for now.
+ */
+ util += cpu_bw_dl(rq);
+ }
+
+ return min(max, util);
+}
+
+static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
+{
+ struct rq *rq = cpu_rq(sg_cpu->cpu);
+ unsigned long util = cpu_util_cfs(rq);
+
+ sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
+ sg_cpu->bw_dl = cpu_bw_dl(rq);
+
+ return schedutil_freq_util(sg_cpu->cpu, util, FREQUENCY_UTIL);
}

/**
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6a0f8d1ca2d2..e594a854977f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2194,7 +2194,15 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
# define arch_scale_freq_invariant() false
#endif

+enum schedutil_type {
+ FREQUENCY_UTIL,
+ ENERGY_UTIL,
+};
+
#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
+ enum schedutil_type type);
+
static inline unsigned long cpu_bw_dl(struct rq *rq)
{
return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
@@ -2221,6 +2229,12 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
{
return READ_ONCE(rq->avg_rt.util_avg);
}
+#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+static inline unsigned long schedutil_freq_util(int cpu, unsigned long util,
+ enum schedutil_type type)
+{
+ return util;
+}
#endif

#ifdef HAVE_SCHED_AVG_IRQ
--
2.17.1


2018-08-20 09:47:08

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v6 11/14] sched/fair: Introduce an energy estimation helper function

In preparation for the definition of an energy-aware wakeup path,
introduce a helper function to estimate the impact on system energy
when a specific task wakes up on a specific CPU. compute_energy()
estimates the capacity state to be reached by all performance domains
and estimates the consumption of each online CPU according to its Energy
Model and its percentage of busy time.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
kernel/sched/fair.c | 77 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 77 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 00729ff55fa3..c4e34368795b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6262,6 +6262,83 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
return !task_fits_capacity(p, min_cap);
}

+/*
+ * Predicts what cpu_util(@cpu) would return if @p was migrated (and enqueued)
+ * to @dst_cpu.
+ */
+static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
+{
+ struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+ unsigned long util_est, util = READ_ONCE(cfs_rq->avg.util_avg);
+
+ /*
+ * If @p migrates from @cpu to another, remove its contribution. Or,
+ * if @p migrates from another CPU to @cpu, add its contribution. In
+ * the other cases, @cpu is not impacted by the migration, so the
+ * util_avg should already be correct.
+ */
+ if (task_cpu(p) == cpu && dst_cpu != cpu)
+ util = max_t(long, util - task_util(p), 0);
+ else if (task_cpu(p) != cpu && dst_cpu == cpu)
+ util += task_util(p);
+
+ if (sched_feat(UTIL_EST)) {
+ util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+
+ /*
+ * During wake-up, the task isn't enqueued yet and doesn't
+ * appear in the cfs_rq->avg.util_est.enqueued of any rq,
+ * so just add it (if needed) to "simulate" what will be
+ * cpu_util() after the task has been enqueued.
+ */
+ if (dst_cpu == cpu)
+ util_est += _task_util_est(p);
+
+ util = max(util, util_est);
+ }
+
+ return min_t(unsigned long, util, capacity_orig_of(cpu));
+}
+
+/*
+ * compute_energy(): Estimates the energy that would be consumed if @p was
+ * migrated to @dst_cpu. compute_energy() predicts what will be the utilization
+ * landscape of the CPUs after the task migration, and uses the Energy Model
+ * to compute what would be the energy if we decided to actually migrate that
+ * task.
+ */
+static long compute_energy(struct task_struct *p, int dst_cpu,
+ struct perf_domain *pd)
+{
+ long util, max_util, sum_util, energy = 0;
+ int cpu;
+
+ while (pd) {
+ max_util = sum_util = 0;
+ /*
+ * The capacity state of CPUs of the current rd can be driven by
+ * CPUs of another rd if they belong to the same performance
+ * domain. So, account for the utilization of these CPUs too
+ * by masking pd with cpu_online_mask instead of the rd span.
+ *
+ * If an entire performance domain is outside of the current rd,
+ * it will not appear in its pd list and will not be accounted
+ * by compute_energy().
+ */
+ for_each_cpu_and(cpu, perf_domain_span(pd), cpu_online_mask) {
+ util = cpu_util_next(cpu, p, dst_cpu);
+ util = schedutil_freq_util(cpu, util, ENERGY_UTIL);
+ max_util = max(util, max_util);
+ sum_util += util;
+ }
+
+ energy += em_pd_energy(pd->obj, max_util, sum_util);
+ pd = pd->next;
+ }
+
+ return energy;
+}
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
--
2.17.1


2018-08-20 09:47:11

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v6 12/14] sched/fair: Select an energy-efficient CPU on task wake-up

If an Energy Model (EM) is available and if the system isn't
overutilized, re-route waking tasks into an energy-aware placement
algorithm. The selection of an energy-efficient CPU for a task
is achieved by estimating the impact on system-level active energy
resulting from the placement of the task on the CPU with the highest
spare capacity in each performance domain. This strategy spreads tasks
in a performance domain and avoids overly aggressive task packing. The
best CPU energy-wise is then selected if it saves a large enough amount
of energy with respect to prev_cpu.

Although it has already shown significant benefits on some existing
targets, this approach cannot scale to platforms with numerous CPUs.
This is an attempt to do something useful, as writing a fast heuristic
that performs reasonably well on a broad spectrum of architectures isn't
an easy task. As such, the scope of usability of the energy-aware
wake-up path is restricted to systems with the SD_ASYM_CPUCAPACITY flag
set, and where the EM isn't too complex.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
kernel/sched/fair.c | 138 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 135 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c4e34368795b..7b311ff0294b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6339,6 +6339,113 @@ static long compute_energy(struct task_struct *p, int dst_cpu,
return energy;
}

+/*
+ * find_energy_efficient_cpu(): Find most energy-efficient target CPU for the
+ * waking task. find_energy_efficient_cpu() looks for the CPU with maximum
+ * spare capacity in each performance domain and uses it as a potential
+ * candidate to execute the task. Then, it uses the Energy Model to figure
+ * out which of the CPU candidates is the most energy-efficient.
+ *
+ * The rationale for this heuristic is as follows. In a performance domain,
+ * all the most energy efficient CPU candidates (according to the Energy
+ * Model) are those for which we'll request a low frequency. When there are
+ * several CPUs for which the frequency request will be the same, we don't
+ * have enough data to break the tie between them, because the Energy Model
+ * only includes active power costs. With this model, if we assume that
+ * frequency requests follow utilization (e.g. using schedutil), the CPU with
+ * the maximum spare capacity in a performance domain is guaranteed to be among
+ * the best candidates of the performance domain.
+ *
+ * In practice, it could be preferable from an energy standpoint to pack
+ * small tasks on a CPU in order to let other CPUs go in deeper idle states,
+ * but that could also hurt our chances to go cluster idle, and we have no
+ * ways to tell with the current Energy Model if this is actually a good
+ * idea or not. So, find_energy_efficient_cpu() basically favors
+ * cluster-packing, and spreading inside a cluster. That should at least be
+ * a good thing for latency, and this is consistent with the idea that most
+ * of the energy savings of EAS come from the asymmetry of the system, and
+ * not so much from breaking the tie between identical CPUs. That's also the
+ * reason why EAS is enabled in the topology code only for systems where
+ * SD_ASYM_CPUCAPACITY is set.
+ */
+static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu,
+ struct perf_domain *pd)
+{
+ unsigned long prev_energy = ULONG_MAX, best_energy = ULONG_MAX;
+ int cpu, best_energy_cpu = prev_cpu;
+ struct perf_domain *head = pd;
+ unsigned long cpu_cap, util;
+ struct sched_domain *sd;
+
+ sync_entity_load_avg(&p->se);
+
+ if (!task_util_est(p))
+ return prev_cpu;
+
+ /*
+ * Energy-aware wake-up happens on the lowest sched_domain starting
+ * from sd_asym_cpucapacity spanning over this_cpu and prev_cpu.
+ */
+ sd = rcu_dereference(*this_cpu_ptr(&sd_asym_cpucapacity));
+ while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
+ sd = sd->parent;
+ if (!sd)
+ return prev_cpu;
+
+ while (pd) {
+ unsigned long cur_energy, spare_cap, max_spare_cap = 0;
+ int max_spare_cap_cpu = -1;
+
+ for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
+ if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
+ continue;
+
+ /* Skip CPUs that will be overutilized. */
+ util = cpu_util_next(cpu, p, cpu);
+ cpu_cap = capacity_of(cpu);
+ if (cpu_cap * 1024 < util * capacity_margin)
+ continue;
+
+ /* Always use prev_cpu as a candidate. */
+ if (cpu == prev_cpu) {
+ prev_energy = compute_energy(p, prev_cpu, head);
+ if (prev_energy < best_energy)
+ best_energy = prev_energy;
+ continue;
+ }
+
+ /*
+ * Find the CPU with the maximum spare capacity in
+ * the performance domain
+ */
+ spare_cap = cpu_cap - util;
+ if (spare_cap > max_spare_cap) {
+ max_spare_cap = spare_cap;
+ max_spare_cap_cpu = cpu;
+ }
+ }
+
+ /* Evaluate the energy impact of using this CPU. */
+ if (max_spare_cap_cpu >= 0) {
+ cur_energy = compute_energy(p, max_spare_cap_cpu, head);
+ if (cur_energy < best_energy) {
+ best_energy = cur_energy;
+ best_energy_cpu = max_spare_cap_cpu;
+ }
+ }
+ pd = pd->next;
+ }
+
+ /*
+ * Pick the best CPU only if it saves at least 6% of the
+ * energy used by prev_cpu.
+ */
+ if ((prev_energy - best_energy) > (prev_energy >> 4))
+ return best_energy_cpu;
+
+ return prev_cpu;
+}
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6360,13 +6467,37 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
int want_affine = 0;
int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);

+ rcu_read_lock();
if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
- want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
- && cpumask_test_cpu(cpu, &p->cpus_allowed);
+
+ /*
+ * Forkees are not accepted in the energy-aware wake-up path
+ * because they don't have any useful utilization data yet and
+ * it's not possible to forecast their impact on energy
+ * consumption. Consequently, they will be placed by
+ * find_idlest_cpu() on the least loaded CPU, which might turn
+ * out to be energy-inefficient in some use-cases. The
+ * alternative would be to bias new tasks towards specific types
+ * of CPUs first, or to try to infer their util_avg from the
+ * parent task, but those heuristics could hurt other use-cases
+ * too. So, until someone finds a better way to solve this,
+ * let's keep things simple by re-using the existing slow path.
+ */
+ if (static_branch_unlikely(&sched_energy_present)) {
+ struct root_domain *rd = cpu_rq(cpu)->rd;
+ struct perf_domain *pd = rcu_dereference(rd->pd);
+
+ if (pd && !READ_ONCE(rd->overutilized)) {
+ new_cpu = find_energy_efficient_cpu(p, prev_cpu, pd);
+ goto unlock;
+ }
+ }
+
+ want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
+ cpumask_test_cpu(cpu, &p->cpus_allowed);
}

- rcu_read_lock();
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
break;
@@ -6401,6 +6532,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
if (want_affine)
current->recent_used_cpu = cpu;
}
+unlock:
rcu_read_unlock();

return new_cpu;
--
2.17.1


2018-08-20 09:47:13

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v6 13/14] sched/topology: Make Energy Aware Scheduling depend on schedutil

Energy Aware Scheduling (EAS) is designed with the assumption that
frequencies of CPUs follow their utilization value. When using a CPUFreq
governor other than schedutil, the chances of this assumption being true
are small, if any. When schedutil is being used, EAS' predictions are at
least consistent with the frequency requests. Although those requests
have no guarantees to be honored by the hardware, they should at least
guide DVFS in the right direction and provide some hope in regards to the
EAS model being accurate.

To make sure EAS is only used in a sane configuration, create a strong
dependency on schedutil being used. Since having sugov compiled-in does
not provide that guarantee, extend the existing CPUFreq policy notifier
with a new case on governor changes. That allows the scheduler to
register a callback on this notifier to rebuild the scheduling domains
when governors are changed, and enable/disable EAS accordingly.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>

---
This patch could probably be squashed into another one, but I kept it
separate to ease the review. Also, it's probably optional as not having
it will not 'break' things per se.

I went for the smallest possible solution I could find, which has the
good side of being simple, but it's definitely not the only one.

Another possibility would be to hook things in sugov_start() and
sugov_stop(), but that requires some more work. In this case, it
wouldn't be possible to just re-build the sched_domains() from there,
because when sugov_stop() is called, the 'governor' field of the policy
hasn't been updated yet, so the condition (if gov == schedutil) in
build_freq_domains() doesn't work.

To work around the issue we'll need to find a way to pass a cpumask to
the topology code to specifically say 'sugov has been stopped on these
CPUs'. That would mean more code to handle that, but that would also
mean we don't have to mess around with the CPUFreq notifiers ...

Not sure what's best, so all feedback is more than welcome.
---
drivers/cpufreq/cpufreq.c | 4 +++
include/linux/cpufreq.h | 1 +
kernel/sched/cpufreq_schedutil.c | 47 ++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 4 +--
kernel/sched/topology.c | 20 ++++++++++++--
5 files changed, 68 insertions(+), 8 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index b0dfd3222013..bed0a511c504 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2271,6 +2271,10 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
ret = cpufreq_start_governor(policy);
if (!ret) {
pr_debug("cpufreq: governor change\n");
+ /* Notification of the new governor */
+ blocking_notifier_call_chain(
+ &cpufreq_policy_notifier_list,
+ CPUFREQ_GOVERNOR, policy);
return 0;
}
cpufreq_exit_governor(policy);
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 882a9b9e34bc..a4435b5ef3f9 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -437,6 +437,7 @@ static inline void cpufreq_resume(void) {}
/* Policy Notifiers */
#define CPUFREQ_ADJUST (0)
#define CPUFREQ_NOTIFY (1)
+#define CPUFREQ_GOVERNOR (2)

#ifdef CONFIG_CPU_FREQ
int cpufreq_register_notifier(struct notifier_block *nb, unsigned int list);
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 8356cb0072a6..e138b5288af4 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -632,7 +632,7 @@ static struct kobj_type sugov_tunables_ktype = {

/********************** cpufreq governor interface *********************/

-static struct cpufreq_governor schedutil_gov;
+struct cpufreq_governor schedutil_gov;

static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
{
@@ -891,7 +891,7 @@ static void sugov_limits(struct cpufreq_policy *policy)
sg_policy->need_freq_update = true;
}

-static struct cpufreq_governor schedutil_gov = {
+struct cpufreq_governor schedutil_gov = {
.name = "schedutil",
.owner = THIS_MODULE,
.dynamic_switching = true,
@@ -914,3 +914,46 @@ static int __init sugov_register(void)
return cpufreq_register_governor(&schedutil_gov);
}
fs_initcall(sugov_register);
+
+#ifdef CONFIG_ENERGY_MODEL
+extern bool sched_energy_update;
+static DEFINE_MUTEX(rebuild_sd_mutex);
+/*
+ * EAS shouldn't be attempted without sugov, so rebuild the sched_domains
+ * on governor changes to make sure the scheduler knows about it.
+ */
+static void rebuild_sd_workfn(struct work_struct *work)
+{
+ mutex_lock(&rebuild_sd_mutex);
+ sched_energy_update = true;
+ rebuild_sched_domains();
+ sched_energy_update = false;
+ mutex_unlock(&rebuild_sd_mutex);
+}
+static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
+
+static int rebuild_sd_callback(struct notifier_block *nb, unsigned long val,
+ void *data)
+{
+ if (val != CPUFREQ_GOVERNOR)
+ return 0;
+ /*
+ * Sched_domains cannot be rebuilt from a notifier context, so use a
+ * workqueue.
+ */
+ schedule_work(&rebuild_sd_work);
+
+ return 0;
+}
+
+static struct notifier_block rebuild_sd_notifier = {
+ .notifier_call = rebuild_sd_callback,
+};
+
+static int register_cpufreq_notifier(void)
+{
+ return cpufreq_register_notifier(&rebuild_sd_notifier,
+ CPUFREQ_POLICY_NOTIFIER);
+}
+core_initcall(register_cpufreq_notifier);
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e594a854977f..915766600568 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2265,10 +2265,8 @@ unsigned long scale_irq_capacity(unsigned long util, unsigned long irq, unsigned
}
#endif

-#ifdef CONFIG_SMP
-#ifdef CONFIG_ENERGY_MODEL
+#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
#define perf_domain_span(pd) (to_cpumask(((pd)->obj->cpus)))
#else
#define perf_domain_span(pd) NULL
#endif
-#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1cb86a0ef00f..2b6df8edca2a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -209,7 +209,9 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
*/
DEFINE_STATIC_KEY_FALSE(sched_energy_present);

-#ifdef CONFIG_ENERGY_MODEL
+#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
+bool sched_energy_update;
+
static void free_pd(struct perf_domain *pd)
{
struct perf_domain *tmp;
@@ -291,12 +293,15 @@ static void destroy_perf_domain_rcu(struct rcu_head *rp)
*/
#define EM_MAX_COMPLEXITY 2048

+extern struct cpufreq_governor schedutil_gov;
static void build_perf_domains(const struct cpumask *cpu_map)
{
int i, nr_pd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map);
struct perf_domain *pd = NULL, *tmp;
int cpu = cpumask_first(cpu_map);
struct root_domain *rd = cpu_rq(cpu)->rd;
+ struct cpufreq_policy *policy;
+ struct cpufreq_governor *gov;

/* EAS is enabled for asymmetric CPU capacity topologies. */
if (!per_cpu(sd_asym_cpucapacity, cpu)) {
@@ -312,6 +317,15 @@ static void build_perf_domains(const struct cpumask *cpu_map)
if (find_pd(pd, i))
continue;

+ /* Do not attempt EAS if schedutil is not being used. */
+ policy = cpufreq_cpu_get(i);
+ if (!policy)
+ goto free;
+ gov = policy->governor;
+ cpufreq_cpu_put(policy);
+ if (gov != &schedutil_gov)
+ goto free;
+
/* Create the new pd and add it to the local list. */
tmp = pd_init(i);
if (!tmp)
@@ -2184,10 +2198,10 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
;
}

-#ifdef CONFIG_ENERGY_MODEL
+#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
/* Build perf. domains: */
for (i = 0; i < ndoms_new; i++) {
- for (j = 0; j < n; j++) {
+ for (j = 0; j < n && !sched_energy_update; j++) {
if (cpumask_equal(doms_new[i], doms_cur[j]) &&
cpu_rq(cpumask_first(doms_cur[j]))->rd->pd)
goto match3;
--
2.17.1


2018-08-20 09:47:18

by Quentin Perret

[permalink] [raw]
Subject: [PATCH v6 14/14] OPTIONAL: cpufreq: dt: Register an Energy Model

*******************************************************************
* This patch illustrates the usage of the newly introduced Energy *
* Model framework and isn't supposed to be merged as-is. *
*******************************************************************

The Energy Model framework provides an API to register the active power
of CPUs. Call this API from the cpufreq-dt driver with an estimation
of the power as P = C * V^2 * f, where C is the capacitance of the CPU
and V and f are the voltage and frequency of the OPP, respectively.

The CPU capacitance is read from the "dynamic-power-coefficient" DT
binding (originally introduced for thermal/IPA), and the voltage and
frequency values from PM_OPP.

Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
drivers/cpufreq/cpufreq-dt.c | 45 +++++++++++++++++++++++++++++++++++-
1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/cpufreq-dt.c b/drivers/cpufreq/cpufreq-dt.c
index 0a9ebf00be46..fd5903091631 100644
--- a/drivers/cpufreq/cpufreq-dt.c
+++ b/drivers/cpufreq/cpufreq-dt.c
@@ -16,6 +16,7 @@
#include <linux/cpu_cooling.h>
#include <linux/cpufreq.h>
#include <linux/cpumask.h>
+#include <linux/energy_model.h>
#include <linux/err.h>
#include <linux/module.h>
#include <linux/of.h>
@@ -149,8 +150,47 @@ static int resources_available(void)
return 0;
}

+static int of_est_power(unsigned long *mW, unsigned long *KHz, int cpu)
+{
+ unsigned long mV, Hz, MHz;
+ struct device *cpu_dev;
+ struct dev_pm_opp *opp;
+ struct device_node *np;
+ u32 cap;
+ u64 tmp;
+
+ cpu_dev = get_cpu_device(cpu);
+ if (!cpu_dev)
+ return -ENODEV;
+
+ np = of_node_get(cpu_dev->of_node);
+ if (!np)
+ return -EINVAL;
+
+ if (of_property_read_u32(np, "dynamic-power-coefficient", &cap))
+ return -EINVAL;
+
+ Hz = *KHz * 1000;
+ opp = dev_pm_opp_find_freq_ceil(cpu_dev, &Hz);
+ if (IS_ERR(opp))
+ return -EINVAL;
+
+ mV = dev_pm_opp_get_voltage(opp) / 1000;
+ dev_pm_opp_put(opp);
+
+ MHz = Hz / 1000000;
+ tmp = (u64)cap * mV * mV * MHz;
+ do_div(tmp, 1000000000);
+
+ *mW = (unsigned long)tmp;
+ *KHz = Hz / 1000;
+
+ return 0;
+}
+
static int cpufreq_init(struct cpufreq_policy *policy)
{
+ struct em_data_callback em_cb = EM_DATA_CB(of_est_power);
struct cpufreq_frequency_table *freq_table;
struct opp_table *opp_table = NULL;
struct private_data *priv;
@@ -159,7 +199,7 @@ static int cpufreq_init(struct cpufreq_policy *policy)
unsigned int transition_latency;
bool fallback = false;
const char *name;
- int ret;
+ int ret, nr_opp;

cpu_dev = get_cpu_device(policy->cpu);
if (!cpu_dev) {
@@ -226,6 +266,7 @@ static int cpufreq_init(struct cpufreq_policy *policy)
ret = -EPROBE_DEFER;
goto out_free_opp;
}
+ nr_opp = ret;

if (fallback) {
cpumask_setall(policy->cpus);
@@ -278,6 +319,8 @@ static int cpufreq_init(struct cpufreq_policy *policy)
policy->cpuinfo.transition_latency = transition_latency;
policy->dvfs_possible_from_any_cpu = true;

+ em_register_perf_domain(policy->cpus, nr_opp, &em_cb);
+
return 0;

out_free_cpufreq_table:
--
2.17.1


2018-08-29 10:06:14

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 03/14] PM: Introduce an Energy Model management framework

Hi Quentin,
few possible optimizations related to the (simplified) following
code:

On 20-Aug 10:44, Quentin Perret wrote:

[...]

> +struct em_perf_domain {
> + struct em_cap_state *table; /* Capacity states, in ascending order. */
> + int nr_cap_states;
> + unsigned long cpus[0]; /* CPUs of the frequency domain. */
> +};

[...]

> +static DEFINE_PER_CPU(struct em_perf_domain *, em_data);

[...]

> +struct em_perf_domain *em_cpu_get(int cpu)
> +{
> + return READ_ONCE(per_cpu(em_data, cpu));
> +}

[...]

> +int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
> + struct em_data_callback *cb)
> +{

[...]

> + mutex_lock(&em_pd_mutex);
> +
[...]

> + for_each_cpu(cpu, span) {
> + if (READ_ONCE(per_cpu(em_data, cpu))) {
> + ret = -EEXIST;
> + goto unlock;
> + }

[...]

> + for_each_cpu(cpu, span) {
> + /*
> + * The per-cpu array can be concurrently accessed from
> + * em_cpu_get().
> + */
> + smp_store_release(per_cpu_ptr(&em_data, cpu), pd);
> + }

[...]

> +unlock:
> + mutex_unlock(&em_pd_mutex);
> +}


In the loop above we use smp_store_release() to propagate the pointer
setting in a PER_CPU(em_data), whose ultimate goal is to protect
em_register_perf_domain() from multiple clients registering the same
power domain.

I think there are two possible optimizations there:

1. use of a single memory barrier

Since we are already em_pd_mutex protected, i.e. there cannot be
concurrent writers, we can use a single memory barrier after the
loop, i.e.

for_each_cpu(cpu, span)
WRITE_ONCE()
smp_wmb()

which should be just enough to ensure that all other CPUs will see
the pointer set once we release the mutex

2. avoid using PER_CPU variables

Apart from the initialization code, i.e. boot time, the em_data is
expected to be read only, isn't it?

If that's the case, I think that using PER_CPU variables is not
strictly required while it unnecessarily increases the cache pressure.

In the worst case we can end up with one cache line for each CPU to
host just an 8B pointer, instead of using that single cache line to host
up to 8 pointers if we use just an array, i.e.

struct em_perf_domain *em_data[NR_CPUS]
____cacheline_aligned_in_smp __read_mostly;

Consider also that up to 8 pointers in a single cache line means
that a single cache line can be enough to access the EM from all
the CPUs of almost every modern mobile phone SoC.

Not entirely sure if PER_CPU uses less overall memory in case you
have many fewer CPUs than the compile-time defined NR_CPUS.
But still, if the above makes sense, you get an 8x gain
factor between the number of write-allocated .data..percpu sections
and the value of NR_CPUS. Meaning that in the worst case we allocate
the same amount of memory using NR_CPUS=64 (the default on arm64)
while running on an 8-CPU system... but we should still get less
cluster cache pressure at run-time with the array approach, 1
cache line vs 4.

Best,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-08-29 13:29:43

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 03/14] PM: Introduce an Energy Model management framework

Hi Patrick,

On Wednesday 29 Aug 2018 at 11:04:35 (+0100), Patrick Bellasi wrote:
> In the loop above we use smp_store_release() to propagate the pointer
> setting in a PER_CPU(em_data), whose ultimate goal is to protect
> em_register_perf_domain() from multiple clients registering the same
> power domain.
>
> I think there are two possible optimizations there:
>
> 1. use of a single memory barrier
>
> Since we are already em_pd_mutex protected, i.e. there cannot be
> concurrent writers, we can use a single memory barrier after the
> loop, i.e.
>
> for_each_cpu(cpu, span)
> WRITE_ONCE()
> smp_wmb()
>
> which should be just enough to ensure that all other CPUs will see
> the pointer set once we release the mutex

Right, I'm actually wondering if the memory barrier is needed at all ...
The mutex lock()/unlock() should already ensure the ordering I want, no?

WRITE_ONCE() should prevent load/store tearing with concurrent em_cpu_get(),
and the release/acquire semantics of mutex lock/unlock should be enough to
serialize the memory accesses of concurrent em_register_perf_domain() calls
properly ...

Hmm, let me read memory-barriers.txt again.

> 2. avoid using PER_CPU variables
>
> Apart from the initialization code, i.e. boot time, the em_data is
> expected to be read only, isn't it?

That's right. It's not only read only, it's also not read very often (in
the use-cases I have in mind at least). The scheduler for example will
call em_cpu_get() once when sched domains are built, and keep the
reference instead of calling it again.

> If that's the case, I think that using PER_CPU variables is not
> strictly required while it unnecessarily increases the cache pressure.
>
> In the worst case we can end up with one cache line for each CPU to
> host just an 8B pointer, instead of using that single cache line to host
> up to 8 pointers if we use just an array, i.e.
>
> struct em_perf_domain *em_data[NR_CPUS]
> ____cacheline_aligned_in_smp __read_mostly;
>
> Consider also that up to 8 pointers in a single cache line means
> that a single cache line can be enough to access the EM from all
> the CPUs of almost every modern mobile phone SoC.
>
> Not entirely sure if PER_CPU uses less overall memory in case you
> have many fewer CPUs than the compile-time defined NR_CPUS.
> But still, if the above makes sense, you get an 8x gain
> factor between the number of write-allocated .data..percpu sections
> and the value of NR_CPUS. Meaning that in the worst case we allocate
> the same amount of memory using NR_CPUS=64 (the default on arm64)
> while running on an 8-CPU system... but we should still get less
> cluster cache pressure at run-time with the array approach, 1
> cache line vs 4.

Right, using per_cpu() might cause to bring in cache things you don't
really care about (other non-related per_cpu stuff), but that shouldn't
waste memory I think. I mean, if my em_data var is the first in a cache
line, the rest of the cache line will most likely be used by other
per_cpu variables anyway ...

As you suggested, the alternative would be to have a simple array. I'm
fine with this TBH. But I would probably allocate it dynamically using
nr_cpu_ids rather than a static NR_CPUS-wide array -- the
registration of perf domains usually happens late enough in the boot
process.

What do you think ?

Thanks
Quentin

2018-08-29 16:24:26

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 05/14] sched/topology: Reference the Energy Model of CPUs when available

Hi Quentin,
likely the organization and naming in this file would require some
lovely refactoring and cleanup... but of course that's outside the
scope of this patch.

Still, maybe we can improve its current state a bit according to the
following comments?

On 20-Aug 10:44, Quentin Perret wrote:
> The existing scheduling domain hierarchy is defined to map to the cache
> topology of the system. However, Energy Aware Scheduling (EAS) requires
> more knowledge about the platform, and specifically needs to know about
> the span of Performance Domains (PD), which do not always align with
> caches.
>
> To address this issue, use the Energy Model (EM) of the system to extend
> the scheduler topology code with a representation of the PDs, alongside
> the scheduling domains. More specifically, a linked list of PDs is
> attached to each root domain. When multiple root domains are in use,
> each list contains only the PDs covering the CPUs of its root domain. If
> a PD spans over CPUs of two different root domains, it will be
> duplicated in both lists.
>
> The lists are fully maintained by the scheduler from
> partition_sched_domains() in order to cope with hotplug and cpuset
> changes. As for scheduling domains, the lists are protected by RCU to
> ensure safe concurrent updates.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> kernel/sched/sched.h | 21 +++++++
> kernel/sched/topology.c | 134 ++++++++++++++++++++++++++++++++++++++--
> 2 files changed, 151 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 481e6759441b..97d79429e755 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -44,6 +44,7 @@
> #include <linux/ctype.h>
> #include <linux/debugfs.h>
> #include <linux/delayacct.h>
> +#include <linux/energy_model.h>
> #include <linux/init_task.h>
> #include <linux/kprobes.h>
> #include <linux/kthread.h>
> @@ -700,6 +701,12 @@ static inline bool sched_asym_prefer(int a, int b)
> return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
> }
>
> +struct perf_domain {
> + struct em_perf_domain *obj;
> + struct perf_domain *next;
> + struct rcu_head rcu;
> +};
> +
> /*
> * We add the notion of a root-domain which will be used to define per-domain
> * variables. Each exclusive cpuset essentially defines an island domain by
> @@ -752,6 +759,12 @@ struct root_domain {
> struct cpupri cpupri;
>
> unsigned long max_cpu_capacity;
> +
> + /*
> + * NULL-terminated list of performance domains intersecting with the
> + * CPUs of the rd. Protected by RCU.
> + */
> + struct perf_domain *pd;
> };
>
> extern struct root_domain def_root_domain;
> @@ -2228,3 +2241,11 @@ unsigned long scale_irq_capacity(unsigned long util, unsigned long irq, unsigned
> return util;
> }
> #endif
> +
> +#ifdef CONFIG_SMP
> +#ifdef CONFIG_ENERGY_MODEL
> +#define perf_domain_span(pd) (to_cpumask(((pd)->obj->cpus)))
> +#else
> +#define perf_domain_span(pd) NULL
> +#endif
> +#endif
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index c4444ed77a55..545e2c7b6e66 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -201,6 +201,116 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
> return 1;
> }
>
> +#ifdef CONFIG_ENERGY_MODEL
> +static void free_pd(struct perf_domain *pd)
> +{
> + struct perf_domain *tmp;
> +
> + while (pd) {
> + tmp = pd->next;
> + kfree(pd);
> + pd = tmp;
> + }
> +}
> +
> +static struct perf_domain *find_pd(struct perf_domain *pd, int cpu)
> +{
> + while (pd) {
> + if (cpumask_test_cpu(cpu, perf_domain_span(pd)))
> + return pd;
> + pd = pd->next;
> + }
> +
> + return NULL;
> +}
> +
> +static struct perf_domain *pd_init(int cpu)
> +{
> + struct em_perf_domain *obj = em_cpu_get(cpu);
> + struct perf_domain *pd;
> +
> + if (!obj) {
> + if (sched_debug())
> + pr_info("%s: no EM found for CPU%d\n", __func__, cpu);
> + return NULL;
> + }
> +
> + pd = kzalloc(sizeof(*pd), GFP_KERNEL);
> + if (!pd)
> + return NULL;
> + pd->obj = obj;
> +
> + return pd;
> +}
> +
> +
> +static void perf_domain_debug(const struct cpumask *cpu_map,
> + struct perf_domain *pd)
> +{
> + if (!sched_debug() || !pd)
> + return;
> +
> + printk(KERN_DEBUG "root_domain %*pbl: ", cpumask_pr_args(cpu_map));
> +
> + while (pd) {
> + printk(KERN_CONT " pd%d:{ cpus=%*pbl nr_cstate=%d }",
> + cpumask_first(perf_domain_span(pd)),
> + cpumask_pr_args(perf_domain_span(pd)),
> + em_pd_nr_cap_states(pd->obj));
> + pd = pd->next;
> + }
> +
> + printk(KERN_CONT "\n");
> +}
> +
> +static void destroy_perf_domain_rcu(struct rcu_head *rp)
> +{
> + struct perf_domain *pd;
> +
> + pd = container_of(rp, struct perf_domain, rcu);
> + free_pd(pd);
> +}
> +
> +static void build_perf_domains(const struct cpumask *cpu_map)
> +{
> + struct perf_domain *pd = NULL, *tmp;
> + int cpu = cpumask_first(cpu_map);
> + struct root_domain *rd = cpu_rq(cpu)->rd;
> + int i;
> +
> + for_each_cpu(i, cpu_map) {
> + /* Skip already covered CPUs. */
> + if (find_pd(pd, i))
> + continue;
> +
> + /* Create the new pd and add it to the local list. */
> + tmp = pd_init(i);
> + if (!tmp)
> + goto free;
> + tmp->next = pd;
> + pd = tmp;
> + }
> +
> + perf_domain_debug(cpu_map, pd);
> +
> + /* Attach the new list of performance domains to the root domain. */
> + tmp = rd->pd;
> + rcu_assign_pointer(rd->pd, pd);
> + if (tmp)
> + call_rcu(&tmp->rcu, destroy_perf_domain_rcu);

We have:

sched_cpu_activate/cpuset_cpu_inactive
cpuset_cpu_active/sched_cpu_deactivate
partition_sched_domains
build_perf_domains

thus here we are building new SDs and, specifically, above we are
attaching the local list "pd" to a _new_ root domain... so there
cannot already be users of these new SDs and root domain at this
stage, can there?

Do we really need that rcu_assign_pointer ?
Is the rcu_assign_pointer there just to "match" the following call_rcu ?

What about this path:

sched_init_domains
partition_sched_domains

in which case we do not call build_perf_domains... is that intended ?

> +
> + return;
> +
> +free:
> + free_pd(pd);
> + tmp = rd->pd;
> + rcu_assign_pointer(rd->pd, NULL);
> + if (tmp)
> + call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
> +}

All the above functions use different naming conventions:

"_pd" suffix, "pd_" prefix and "perf_domain_" prefix.

You do it like that because it better matches the corresponding
call sites further down the file.

However, since we are into a "CONFIG_ENERGY_MODEL" guarded section,
why not start using a common prefix for all PD related functions?

I rather like "perf_domain_" (or "pd_") as a prefix and I would try to
use it for all the functions you defined above:

perf_domain_free
perf_domain_find
perf_domain_debug
perf_domain_destroy_rcu
perf_domain_build

> +#else
> +static void free_pd(struct perf_domain *pd) { }
> +#endif

Maybe better:

#endif /* CONFIG_ENERGY_MODEL */

> +
> static void free_rootdomain(struct rcu_head *rcu)
> {
> struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
> @@ -211,6 +321,7 @@ static void free_rootdomain(struct rcu_head *rcu)
> free_cpumask_var(rd->rto_mask);
> free_cpumask_var(rd->online);
> free_cpumask_var(rd->span);
> + free_pd(rd->pd);
> kfree(rd);
> }
>
> @@ -1964,8 +2075,8 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> /* Destroy deleted domains: */
> for (i = 0; i < ndoms_cur; i++) {
> for (j = 0; j < n && !new_topology; j++) {
> - if (cpumask_equal(doms_cur[i], doms_new[j])
> - && dattrs_equal(dattr_cur, i, dattr_new, j))
> + if (cpumask_equal(doms_cur[i], doms_new[j]) &&
> + dattrs_equal(dattr_cur, i, dattr_new, j))

This chunk looks more like a cleanup which is not really changing
anything: is it intentional?

> goto match1;
> }
> /* No match - a current sched domain not in new doms_new[] */
> @@ -1985,8 +2096,8 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> /* Build new domains: */
> for (i = 0; i < ndoms_new; i++) {
> for (j = 0; j < n && !new_topology; j++) {
> - if (cpumask_equal(doms_new[i], doms_cur[j])
> - && dattrs_equal(dattr_new, i, dattr_cur, j))
> + if (cpumask_equal(doms_new[i], doms_cur[j]) &&
> + dattrs_equal(dattr_new, i, dattr_cur, j))


Same comment for the chunk above

> goto match2;
> }
> /* No match - add a new doms_new */
> @@ -1995,6 +2106,21 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> ;
> }
>
> +#ifdef CONFIG_ENERGY_MODEL
> + /* Build perf. domains: */
> + for (i = 0; i < ndoms_new; i++) {
> + for (j = 0; j < n; j++) {
> + if (cpumask_equal(doms_new[i], doms_cur[j]) &&
> + cpu_rq(cpumask_first(doms_cur[j]))->rd->pd)
> + goto match3;
> + }
> + /* No match - add perf. domains for a new rd */
> + build_perf_domains(doms_new[i]);
> +match3:
> + ;
> + }
> +#endif
> +


Since we already have a CONFIG_ENERGY_MODEL guarded section above,
maybe we can s/build_perf_domains/build_perf_root_domain/ and use
build_perf_domains to provide an inline function for the chunk above
in the guarded section at the beginning of the file ?

The #else section will then provide just an empty implementation.

Something like the diff below seems to work, and it should do the
"cleanup" job by also moving the definition of the global variables
(required by some functions) to the beginning of the source file.

Perhaps that's a bit of cleanup code that maintainer can accept...
but... to be verified. ;)

---8<---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 2b6df8edca2a..647667b89180 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -10,6 +10,22 @@ DEFINE_MUTEX(sched_domains_mutex);
cpumask_var_t sched_domains_tmpmask;
cpumask_var_t sched_domains_tmpmask2;

+/* Current sched domains: */
+static cpumask_var_t *doms_cur;
+
+/* Number of sched domains in 'doms_cur': */
+static int ndoms_cur;
+
+/* Attribues of custom domains in 'doms_cur' */
+static struct sched_domain_attr *dattr_cur;
+
+/*
+ * Special case: If a kmalloc() of a doms_cur partition (array of
+ * cpumask) fails, then fallback to a single sched domain,
+ * as determined by the single cpumask fallback_doms.
+ */
+static cpumask_var_t fallback_doms;
+
#ifdef CONFIG_SCHED_DEBUG

static int __init sched_debug_setup(char *str)
@@ -294,7 +310,7 @@ static void destroy_perf_domain_rcu(struct rcu_head *rp)
#define EM_MAX_COMPLEXITY 2048

extern struct cpufreq_governor schedutil_gov;
-static void build_perf_domains(const struct cpumask *cpu_map)
+static void build_perf_root_domain(const struct cpumask *cpu_map)
{
int i, nr_pd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map);
struct perf_domain *pd = NULL, *tmp;
@@ -395,9 +411,30 @@ static void sched_energy_start(int ndoms_new, cpumask_var_t doms_new[])
static_branch_enable_cpuslocked(&sched_energy_present);
}
}
+
+static void build_perf_domains(int ndoms_new, cpumask_var_t doms_new[])
+{
+ int i, j;
+
+ /* Build perf. domains: */
+ for (i = 0; i < ndoms_new; i++) {
+ for (j = 0; j < ndoms_cur; j++) {
+ if (cpumask_equal(doms_new[i], doms_cur[j]) &&
+ cpu_rq(cpumask_first(doms_cur[j]))->rd->pd)
+ goto done;
+ }
+ build_perf_root_domain(doms_new[i]);
+done:
+ ;
+ }
+ sched_energy_start(ndoms_new, doms_new);
+}
+
#else
static void free_pd(struct perf_domain *pd) { }
-#endif
+static void build_perf_domains(int ndoms_new, cpumask_var_t doms_new[]) { }
+#endif /* CONFIG_ENERGY_MODEL */

static void free_rootdomain(struct rcu_head *rcu)
{
@@ -2004,22 +2041,6 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
return ret;
}

-/* Current sched domains: */
-static cpumask_var_t *doms_cur;
-
-/* Number of sched domains in 'doms_cur': */
-static int ndoms_cur;
-
-/* Attribues of custom domains in 'doms_cur' */
-static struct sched_domain_attr *dattr_cur;
-
-/*
- * Special case: If a kmalloc() of a doms_cur partition (array of
- * cpumask) fails, then fallback to a single sched domain,
- * as determined by the single cpumask fallback_doms.
- */
-static cpumask_var_t fallback_doms;
-
/*
* arch_update_cpu_topology lets virtualized architectures update the
* CPU core maps. It is supposed to return 1 if the topology changed
@@ -2198,21 +2219,7 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
;
}

-#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
- /* Build perf. domains: */
- for (i = 0; i < ndoms_new; i++) {
- for (j = 0; j < n && !sched_energy_update; j++) {
- if (cpumask_equal(doms_new[i], doms_cur[j]) &&
- cpu_rq(cpumask_first(doms_cur[j]))->rd->pd)
- goto match3;
- }
- /* No match - add perf. domains for a new rd */
- build_perf_domains(doms_new[i]);
-match3:
- ;
- }
- sched_energy_start(ndoms_new, doms_new);
-#endif
+ build_perf_domains(ndoms_new, doms_new);

/* Remember the new sched domains: */
if (doms_cur != &fallback_doms)
---8<---



> /* Remember the new sched domains: */
> if (doms_cur != &fallback_doms)
> free_sched_domains(doms_cur, ndoms_cur);
> --
> 2.17.1
>

--
#include <best/regards.h>

Patrick Bellasi

2018-08-29 16:52:56

by Patrick Bellasi

Subject: Re: [PATCH v6 07/14] sched/topology: Introduce sched_energy_present static key

Hi Quentin,
a couple of minor notes/questions follow...

Best,
Patrick

On 20-Aug 10:44, Quentin Perret wrote:
> In order to ensure a minimal performance impact on non-energy-aware
> systems, introduce a static_key guarding the access to Energy-Aware
> Scheduling (EAS) code.
>
> The static key is set iff all the following conditions are met for at
> least one root domain:
> 1. all online CPUs of the root domain are covered by the Energy
> Model (EM);
> 2. the complexity of the root domain's EM is low enough to keep
> scheduling overheads low;
> 3. the root domain has an asymmetric CPU capacity topology (detected
> by looking for the SD_ASYM_CPUCAPACITY flag in the sched_domain
> hierarchy).
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> kernel/sched/sched.h | 1 +
> kernel/sched/topology.c | 77 ++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 77 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4b884e467545..cb3d6afdb114 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1421,6 +1421,7 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =
>
> extern struct static_key_false sched_numa_balancing;
> extern struct static_key_false sched_schedstats;
> +extern struct static_key_false sched_energy_present;
>
> static inline u64 global_rt_period(void)
> {
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 4c6a36a8d7b8..1cb86a0ef00f 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -200,6 +200,14 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
>
> return 1;
> }
> +/*
> + * This static_key is set if at least one root domain meets all the following
> + * conditions:
> + * 1. all CPUs of the root domain are covered by the EM;
> + * 2. the EM complexity is low enough to keep scheduling overheads low;
> + * 3. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy.
> + */
> +DEFINE_STATIC_KEY_FALSE(sched_energy_present);
>
> #ifdef CONFIG_ENERGY_MODEL
> static void free_pd(struct perf_domain *pd)
> @@ -270,12 +278,34 @@ static void destroy_perf_domain_rcu(struct rcu_head *rp)
> free_pd(pd);
> }
>
> +/*
> + * The complexity of the Energy Model is defined as: nr_pd * (nr_cpus + nr_cs)
> + * with: 'nr_pd' the number of performance domains; 'nr_cpus' the number of
> + * CPUs; and 'nr_cs' the sum of the capacity states numbers of all performance
> + * domains.
> + *
> + * It is generally not a good idea to use such a model in the wake-up path on
> + * very complex platforms because of the associated scheduling overheads. The
> + * arbitrary constraint below prevents that. It makes EAS usable up to 16 CPUs
> + * with per-CPU DVFS and less than 8 capacity states each, for example.

According to the formula above, that should give a "complexity value" of:

16 * (16 + 9) = 384

while, 2K complexity seems more like a 40xCPUs system with 8 OPPs.

Maybe we should update either the example or the constant below ?

> + */
> +#define EM_MAX_COMPLEXITY 2048
> +
> static void build_perf_domains(const struct cpumask *cpu_map)
> {
> + int i, nr_pd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map);
> struct perf_domain *pd = NULL, *tmp;
> int cpu = cpumask_first(cpu_map);
> struct root_domain *rd = cpu_rq(cpu)->rd;
> - int i;
> +
> + /* EAS is enabled for asymmetric CPU capacity topologies. */
> + if (!per_cpu(sd_asym_cpucapacity, cpu)) {
> + if (sched_debug()) {
> + pr_info("rd %*pbl: CPUs do not have asymmetric capacities\n",
> + cpumask_pr_args(cpu_map));
> + }
> + goto free;
> + }
>
> for_each_cpu(i, cpu_map) {
> /* Skip already covered CPUs. */
> @@ -288,6 +318,21 @@ static void build_perf_domains(const struct cpumask *cpu_map)
> goto free;
> tmp->next = pd;
> pd = tmp;
> +
> + /*
> + * Count performance domains and capacity states for the
> + * complexity check.
> + */
> + nr_pd++;

A special case where EAS is not going to be used is for systems where
nr_pd matches the number of online CPUs, isn't it ?

If that's the case, then, by caching this nr_pd you can probably check
this condition in sched_energy_start() and bail out even faster by
avoiding a scan of all the doms_new's pd ?


> + nr_cs += em_pd_nr_cap_states(pd->obj);
> + }
> +
> + /* Bail out if the Energy Model complexity is too high. */
> + if (nr_pd * (nr_cs + nr_cpus) > EM_MAX_COMPLEXITY) {
> + if (sched_debug())
> + pr_info("rd %*pbl: EM complexity is too high\n ",
> + cpumask_pr_args(cpu_map));
> + goto free;
> }
>
> perf_domain_debug(cpu_map, pd);
> @@ -307,6 +352,35 @@ static void build_perf_domains(const struct cpumask *cpu_map)
> if (tmp)
> call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
> }
> +
> +static void sched_energy_start(int ndoms_new, cpumask_var_t doms_new[])
> +{
> + /*
> + * The conditions for EAS to start are checked during the creation of
> + * root domains. If one of them meets all conditions, it will have a
> + * non-null list of performance domains.
> + */
> + while (ndoms_new) {
> + if (cpu_rq(cpumask_first(doms_new[ndoms_new - 1]))->rd->pd)
> + goto enable;
> + ndoms_new--;
> + }
> +
> + if (static_branch_unlikely(&sched_energy_present)) {
^^^^^^^^
Is this defined unlikely to reduce overheads on systems which never
satisfy all the conditions above while still rebuilding SDs from time
to time?

> + if (sched_debug())
> + pr_info("%s: stopping EAS\n", __func__);
> + static_branch_disable_cpuslocked(&sched_energy_present);
> + }
> +
> + return;
> +
> +enable:
> + if (!static_branch_unlikely(&sched_energy_present)) {
> + if (sched_debug())
> + pr_info("%s: starting EAS\n", __func__);
> + static_branch_enable_cpuslocked(&sched_energy_present);
> + }
> +}
> #else
> static void free_pd(struct perf_domain *pd) { }
> #endif
> @@ -2123,6 +2197,7 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> match3:
> ;
> }
> + sched_energy_start(ndoms_new, doms_new);
> #endif
>
> /* Remember the new sched domains: */
> --
> 2.17.1
>

--
#include <best/regards.h>

Patrick Bellasi

2018-08-29 16:57:34

by Quentin Perret

Subject: Re: [PATCH v6 05/14] sched/topology: Reference the Energy Model of CPUs when available

On Wednesday 29 Aug 2018 at 17:22:38 (+0100), Patrick Bellasi wrote:
> > +static void build_perf_domains(const struct cpumask *cpu_map)
> > +{
> > + struct perf_domain *pd = NULL, *tmp;
> > + int cpu = cpumask_first(cpu_map);
> > + struct root_domain *rd = cpu_rq(cpu)->rd;
> > + int i;
> > +
> > + for_each_cpu(i, cpu_map) {
> > + /* Skip already covered CPUs. */
> > + if (find_pd(pd, i))
> > + continue;
> > +
> > + /* Create the new pd and add it to the local list. */
> > + tmp = pd_init(i);
> > + if (!tmp)
> > + goto free;
> > + tmp->next = pd;
> > + pd = tmp;
> > + }
> > +
> > + perf_domain_debug(cpu_map, pd);
> > +
> > + /* Attach the new list of performance domains to the root domain. */
> > + tmp = rd->pd;
> > + rcu_assign_pointer(rd->pd, pd);
> > + if (tmp)
> > + call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
>
> We have:
>
> sched_cpu_activate/cpuset_cpu_inactive
> cpuset_cpu_active/sched_cpu_deactivate
> partition_sched_domains
> build_perf_domains
>
> thus here we are building new SDs and, specifically, above we are
> attaching the local list "pd" to a _new_ root domain... so there
> cannot already be users of these new SDs and root domain at this
> stage, can there?

Hmm, actually you can end up here even if the rd isn't new. That would
happen if you call rebuild_sched_domains() after the EM has been
registered for example. At this point, you might skip
detach_destroy_domains() and build_sched_domains() from
partition_sched_domains(), but still call build_perf_domains(), which
would then attach the pd list to the current rd.

That's one reason why rcu_assign_pointer() is probably a good idea. And
it's also nice from a doc standpoint I suppose.

>
> Do we really need that rcu_assign_pointer ?
> Is the rcu_assign_pointer there just to "match" the following call_rcu ?
>
> What about this path:
>
> sched_init_domains
> partition_sched_domains
>
> in which case we do not call build_perf_domains... is that intended ?

I assume you meant:

sched_init_domains
build_sched_domains

Is that right ?

If yes, I didn't bother calling build_perf_domains() from there because
I don't think there is a single platform out there which would have a
registered Energy Model that early in the boot process. Or maybe there
is one I don't know ?

Anyway, that is probably easy to fix, if need be.

> > +
> > + return;
> > +
> > +free:
> > + free_pd(pd);
> > + tmp = rd->pd;
> > + rcu_assign_pointer(rd->pd, NULL);
> > + if (tmp)
> > + call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
> > +}
>
> All the above functions use different naming conventions:
>
> "_pd" suffix, "pd_" prefix and "perf_domain_" prefix.
>
> You do it like that because it better matches the corresponding
> call sites further down the file.

That's right. The functions are supposed to vaguely look like existing
functions dealing with sched domains.

> However, since we are into a "CONFIG_ENERGY_MODEL" guarded section,
> why not start using a common prefix for all PD related functions?
>
> I rather like "perf_domain_" (or "pd_") as a prefix and I would try to
> use it for all the functions you defined above:
>
> perf_domain_free
> perf_domain_find
> perf_domain_debug
> perf_domain_destroy_rcu
> perf_domain_build

I kinda like the idea of keeping things consistent with the existing
code TBH. Especially because I'm terrible at naming things ... But if
there is a general agreement that I should rename everything I won't
argue.

> > +#else
> > +static void free_pd(struct perf_domain *pd) { }
> > +#endif
>
> Maybe better:
>
> #endif /* CONFIG_ENERGY_MODEL */

Ack

>
> > +
> > static void free_rootdomain(struct rcu_head *rcu)
> > {
> > struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
> > @@ -211,6 +321,7 @@ static void free_rootdomain(struct rcu_head *rcu)
> > free_cpumask_var(rd->rto_mask);
> > free_cpumask_var(rd->online);
> > free_cpumask_var(rd->span);
> > + free_pd(rd->pd);
> > kfree(rd);
> > }
> >
> > @@ -1964,8 +2075,8 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> > /* Destroy deleted domains: */
> > for (i = 0; i < ndoms_cur; i++) {
> > for (j = 0; j < n && !new_topology; j++) {
> > - if (cpumask_equal(doms_cur[i], doms_new[j])
> > - && dattrs_equal(dattr_cur, i, dattr_new, j))
> > + if (cpumask_equal(doms_cur[i], doms_new[j]) &&
> > + dattrs_equal(dattr_cur, i, dattr_new, j))
>
> This chunk looks more like a cleanup which is not really changing
> anything: is it intentional?

Yep, that's a cleanup Peter requested:
[email protected]

>
> > goto match1;
> > }
> > /* No match - a current sched domain not in new doms_new[] */
> > @@ -1985,8 +2096,8 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> > /* Build new domains: */
> > for (i = 0; i < ndoms_new; i++) {
> > for (j = 0; j < n && !new_topology; j++) {
> > - if (cpumask_equal(doms_new[i], doms_cur[j])
> > - && dattrs_equal(dattr_new, i, dattr_cur, j))
> > + if (cpumask_equal(doms_new[i], doms_cur[j]) &&
> > + dattrs_equal(dattr_new, i, dattr_cur, j))
>
>
> Same comment for the chunk above

Ditto :-)

>
> > goto match2;
> > }
> > /* No match - add a new doms_new */
> > @@ -1995,6 +2106,21 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> > ;
> > }
> >
> > +#ifdef CONFIG_ENERGY_MODEL
> > + /* Build perf. domains: */
> > + for (i = 0; i < ndoms_new; i++) {
> > + for (j = 0; j < n; j++) {
> > + if (cpumask_equal(doms_new[i], doms_cur[j]) &&
> > + cpu_rq(cpumask_first(doms_cur[j]))->rd->pd)
> > + goto match3;
> > + }
> > + /* No match - add perf. domains for a new rd */
> > + build_perf_domains(doms_new[i]);
> > +match3:
> > + ;
> > + }
> > +#endif
> > +
>
>
> Since we already have a CONFIG_ENERGY_MODEL guarded section above,
> maybe we can s/build_perf_domains/build_perf_root_domain/ and use
> build_perf_domains to provide an inline function for the chunk above
> in the guarded section at the beginning of the file ?
>
> The #else section will then provide just an empty implementation.

The only reason I didn't do this is that I wanted to keep all the
logic to skip (or not) building things centralized in
partition_sched_domains(), simply because I find it easier to
understand.

> Something like the diff below seems to work, and it should do the
> "cleanup" job by also moving the definition of the global variables
> (required by some functions) to the beginning of the source file.
>
> Perhaps that's a bit of cleanup code that maintainer can accept...
> but... to be verified. ;)

Right, I guess it's mainly a matter of 'taste' here. So let's see ... :-)

Thanks !
Quentin

2018-08-29 17:21:37

by Quentin Perret

Subject: Re: [PATCH v6 07/14] sched/topology: Introduce sched_energy_present static key

On Wednesday 29 Aug 2018 at 17:50:58 (+0100), Patrick Bellasi wrote:
> > +/*
> > + * The complexity of the Energy Model is defined as: nr_pd * (nr_cpus + nr_cs)
> > + * with: 'nr_pd' the number of performance domains; 'nr_cpus' the number of
> > + * CPUs; and 'nr_cs' the sum of the capacity states numbers of all performance
> > + * domains.
> > + *
> > + * It is generally not a good idea to use such a model in the wake-up path on
> > + * very complex platforms because of the associated scheduling overheads. The
> > + * arbitrary constraint below prevents that. It makes EAS usable up to 16 CPUs
> > + * with per-CPU DVFS and less than 8 capacity states each, for example.
>
> According to the formula above, that should give a "complexity value" of:
>
> 16 * (16 + 9) = 384
>
> while, 2K complexity seems more like a 40xCPUs system with 8 OPPs.
>
> Maybe we should update either the example or the constant below ?

Hmm I guess the example isn't really clear. 'nr_cs' is the _sum_ of the
number of OPPs of all perf. domains. So, in the example above, if you
have 16 CPUs with per-CPU DVFS, and each DVFS island has 8 OPPs, then
nr_cs = 16 * 8 = 128.

So if you apply the formula you get C = 16 * (16 + 128) = 2304, which is
more than EM_MAX_COMPLEXITY, so EAS cannot start.

If the DVFS island had 7 OPPs instead of 8 (for example) you would get
nr_cs = 112, C = 2048, and so EAS could start.

I can try to re-work that comment to explain things a bit better ...

>
> > + */
> > +#define EM_MAX_COMPLEXITY 2048
> > +
> > static void build_perf_domains(const struct cpumask *cpu_map)
> > {
> > + int i, nr_pd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map);
> > struct perf_domain *pd = NULL, *tmp;
> > int cpu = cpumask_first(cpu_map);
> > struct root_domain *rd = cpu_rq(cpu)->rd;
> > - int i;
> > +
> > + /* EAS is enabled for asymmetric CPU capacity topologies. */
> > + if (!per_cpu(sd_asym_cpucapacity, cpu)) {
> > + if (sched_debug()) {
> > + pr_info("rd %*pbl: CPUs do not have asymmetric capacities\n",
> > + cpumask_pr_args(cpu_map));
> > + }
> > + goto free;
> > + }
> >
> > for_each_cpu(i, cpu_map) {
> > /* Skip already covered CPUs. */
> > @@ -288,6 +318,21 @@ static void build_perf_domains(const struct cpumask *cpu_map)
> > goto free;
> > tmp->next = pd;
> > pd = tmp;
> > +
> > + /*
> > + * Count performance domains and capacity states for the
> > + * complexity check.
> > + */
> > + nr_pd++;
>
> A special case where EAS is not going to be used is for systems where
> nr_pd matches the number of online CPUs, isn't it ?

Well, it depends. Say you have only 4 CPUs with 3 OPPs each. Even with
per-CPU DVFS the complexity is low enough to start EAS. I don't really
see a good reason for not doing so, no?

>
> If that's the case, then, by caching this nr_pd you can probably check
> this condition in sched_energy_start() and bail out even faster by
> avoiding a scan of all the doms_new's pd ?
>
>
> > + nr_cs += em_pd_nr_cap_states(pd->obj);
> > + }
> > +
> > + /* Bail out if the Energy Model complexity is too high. */
> > + if (nr_pd * (nr_cs + nr_cpus) > EM_MAX_COMPLEXITY) {
> > + if (sched_debug())
> > + pr_info("rd %*pbl: EM complexity is too high\n ",
> > + cpumask_pr_args(cpu_map));
> > + goto free;
> > }
> >
> > perf_domain_debug(cpu_map, pd);
> > @@ -307,6 +352,35 @@ static void build_perf_domains(const struct cpumask *cpu_map)
> > if (tmp)
> > call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
> > }
> > +
> > +static void sched_energy_start(int ndoms_new, cpumask_var_t doms_new[])
> > +{
> > + /*
> > + * The conditions for EAS to start are checked during the creation of
> > + * root domains. If one of them meets all conditions, it will have a
> > + * non-null list of performance domains.
> > + */
> > + while (ndoms_new) {
> > + if (cpu_rq(cpumask_first(doms_new[ndoms_new - 1]))->rd->pd)
> > + goto enable;
> > + ndoms_new--;
> > + }
> > +
> > + if (static_branch_unlikely(&sched_energy_present)) {
> ^^^^^^^^
> Is this defined unlikely to reduce overheads on systems which never
> satisfy all the conditions above while still rebuilding SDs from time
> to time?

Something like that. I just thought that the case where EAS needs to be
disabled after being enabled isn't very common. I mean, the most typical
use-case is, EAS is enabled at boot and stays enabled forever, or EAS
never gets enabled.

Enabling/disabling EAS because of hotplug (for example) can definitely
happen, but that shouldn't be the case very often in practice, I think.
So we can optimize things out a bit I suppose.

Thanks !
Quentin

2018-08-30 09:25:25

by Patrick Bellasi

Subject: Re: [PATCH v6 07/14] sched/topology: Introduce sched_energy_present static key

On 29-Aug 18:20, Quentin Perret wrote:
> On Wednesday 29 Aug 2018 at 17:50:58 (+0100), Patrick Bellasi wrote:
> > > +/*
> > > + * The complexity of the Energy Model is defined as: nr_pd * (nr_cpus + nr_cs)
> > > + * with: 'nr_pd' the number of performance domains; 'nr_cpus' the number of
> > > + * CPUs; and 'nr_cs' the sum of the capacity states numbers of all performance
> > > + * domains.
> > > + *
> > > + * It is generally not a good idea to use such a model in the wake-up path on
> > > + * very complex platforms because of the associated scheduling overheads. The
> > > + * arbitrary constraint below prevents that. It makes EAS usable up to 16 CPUs
> > > + * with per-CPU DVFS and less than 8 capacity states each, for example.
> >
> > According to the formula above, that should give a "complexity value" of:
> >
> > 16 * (16 + 9) = 384
> >
> > while, 2K complexity seems more like a 40xCPUs system with 8 OPPs.
> >
> > Maybe we should update either the example or the constant below ?
>
> Hmm I guess the example isn't really clear. 'nr_cs' is the _sum_ of the
> number of OPPs of all perf. domains. So, in the example above, if you
> have 16 CPUs with per-CPU DVFS, and each DVFS island has 8 OPPs, then
> nr_cs = 16 * 8 = 128.
>
> So if you apply the formula you get C = 16 * (16 + 128) = 2304, which is
> more than EM_MAX_COMPLEXITY, so EAS cannot start.
>
> If the DVFS island had 7 OPPs instead of 8 (for example) you would get
> nr_cs = 112, C = 2048, and so EAS could start.

Right, I see now.

> I can try to re-work that comment to explain things a bit better ...

Yes, dunno if it's just me but perhaps a bit of rephrasing could help.

Alternatively, why not have this comment and check after patches
11 and 12?

> > > + */
> > > +#define EM_MAX_COMPLEXITY 2048
> > > +
> > > static void build_perf_domains(const struct cpumask *cpu_map)
> > > {
> > > + int i, nr_pd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map);
> > > struct perf_domain *pd = NULL, *tmp;
> > > int cpu = cpumask_first(cpu_map);
> > > struct root_domain *rd = cpu_rq(cpu)->rd;
> > > - int i;
> > > +
> > > + /* EAS is enabled for asymmetric CPU capacity topologies. */
> > > + if (!per_cpu(sd_asym_cpucapacity, cpu)) {
> > > + if (sched_debug()) {
> > > + pr_info("rd %*pbl: CPUs do not have asymmetric capacities\n",
> > > + cpumask_pr_args(cpu_map));
> > > + }
> > > + goto free;
> > > + }
> > >
> > > for_each_cpu(i, cpu_map) {
> > > /* Skip already covered CPUs. */
> > > @@ -288,6 +318,21 @@ static void build_perf_domains(const struct cpumask *cpu_map)
> > > goto free;
> > > tmp->next = pd;
> > > pd = tmp;
> > > +
> > > + /*
> > > + * Count performance domains and capacity states for the
> > > + * complexity check.
> > > + */
> > > + nr_pd++;
> >
> > A special case where EAS is not going to be used is for systems where
> > nr_pd matches the number of online CPUs, isn't it ?
>
> Well, it depends. Say you have only 4 CPUs with 3 OPPs each. Even with
> per-CPU DVFS the complexity is low enough to start EAS. I don't really
> see a good reason for not doing so no ?

Right... I was totally confused by the idea that we don't
run EAS if we just have 1 CPU per PD... my bad!

Although on those systems, since we don't have idle costs, shouldn't a
spreading strategy always be the best from an energy-efficiency
standpoint?

> > If that's the case, then, by caching this nr_pd you can probably check
> > this condition in the sched_energy_start() and bail out even faster by
> > avoiding to scan all the doms_new's pd ?
> >
> >
> > > + nr_cs += em_pd_nr_cap_states(pd->obj);
> > > + }
> > > +
> > > + /* Bail out if the Energy Model complexity is too high. */
> > > + if (nr_pd * (nr_cs + nr_cpus) > EM_MAX_COMPLEXITY) {
> > > + if (sched_debug())
> > > + pr_info("rd %*pbl: EM complexity is too high\n ",
> > > + cpumask_pr_args(cpu_map));
> > > + goto free;
> > > }
> > >
> > > perf_domain_debug(cpu_map, pd);
> > > @@ -307,6 +352,35 @@ static void build_perf_domains(const struct cpumask *cpu_map)
> > > if (tmp)
> > > call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
> > > }
> > > +
> > > +static void sched_energy_start(int ndoms_new, cpumask_var_t doms_new[])
> > > +{
> > > + /*
> > > + * The conditions for EAS to start are checked during the creation of
> > > + * root domains. If one of them meets all conditions, it will have a
> > > + * non-null list of performance domains.
> > > + */
> > > + while (ndoms_new) {
> > > + if (cpu_rq(cpumask_first(doms_new[ndoms_new - 1]))->rd->pd)
> > > + goto enable;
> > > + ndoms_new--;
> > > + }
> > > +
> > > + if (static_branch_unlikely(&sched_energy_present)) {
> > ^^^^^^^^
> > Is this defined unlikely to reduce overheads on systems which never
> > satisfy all the conditions above while still rebuilding SDs from time to
> > time?
>
> Something like that. I just thought that the case where EAS needs to be
> disabled after being enabled isn't very common. I mean, in the most typical
> use-case, EAS is either enabled at boot and stays enabled forever, or EAS
> never gets enabled.

Right, if we have EAS compiled in... we are likely to have it enabled.

> Enabling/disabling EAS because of hotplug (for example) can definitely
> happen, but that shouldn't be the case very often in practice, I think.

Would say yes on sane platforms, i.e. where hotplug is not being used
for power/thermal management... but hopefully EAS should improve on
that side ;)

> So we can optimize things out a bit I suppose.

Right thanks!

--
#include <best/regards.h>

Patrick Bellasi

2018-08-30 09:59:15

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 07/14] sched/topology: Introduce sched_energy_present static key

Hi Patrick,

On Thursday 30 Aug 2018 at 10:23:29 (+0100), Patrick Bellasi wrote:
> Yes, dunno if it's just me but perhaps a bit of rephrasing could help.

Ok, so what about something a little bit more explicit like:

/*
* The complexity of the Energy Model is defined as:
*
* C = nr_pd * (nr_cpus + nr_cs)
*
* with parameters defined as:
* - nr_pd: the number of performance domains
* - nr_cpus: the number of CPUs
* - nr_cs: the sum of the number of capacity states of all performance
* domains (for example, on a system with 2 performance domains,
* with 10 capacity states each, nr_cs = 2 * 10 = 20).
*
* It is generally not a good idea to use such a model in the wake-up path on
* very complex platforms because of the associated scheduling overheads. The
* arbitrary constraint below prevents that. It makes EAS usable up to 16 CPUs
* with per-CPU DVFS and less than 8 capacity states each, for example.
*/

> Alternatively, why not have this comment and check after patches
> 11 and 12?

Oh, you don't like it here ? What's wrong with it :-) ?

> Right... I was totally confused by the idea that we don't
> run EAS if we just have 1 CPU per PD... my bad!
>
> Although on those systems, since we don't have idle costs, shouldn't a
> spreading strategy always be the best from an energy-efficiency
> standpoint?

Even with per-CPU DVFS, if you have big and little CPUs it matters
_where_ you execute a task, and you'll still need an Energy Model to
make good decisions in a generic way. But yes, there is definitely some
room for improvement for those platforms. That's something we could
improve later on top of this series I suppose.

Thanks for looking at the patches !
Quentin

2018-08-30 10:01:47

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 05/14] sched/topology: Reference the Energy Model of CPUs when available

On 29-Aug 17:56, Quentin Perret wrote:
> On Wednesday 29 Aug 2018 at 17:22:38 (+0100), Patrick Bellasi wrote:
> > > +static void build_perf_domains(const struct cpumask *cpu_map)
> > > +{
> > > + struct perf_domain *pd = NULL, *tmp;
> > > + int cpu = cpumask_first(cpu_map);
> > > + struct root_domain *rd = cpu_rq(cpu)->rd;
> > > + int i;
> > > +
> > > + for_each_cpu(i, cpu_map) {
> > > + /* Skip already covered CPUs. */
> > > + if (find_pd(pd, i))
> > > + continue;
> > > +
> > > + /* Create the new pd and add it to the local list. */
> > > + tmp = pd_init(i);
> > > + if (!tmp)
> > > + goto free;
> > > + tmp->next = pd;
> > > + pd = tmp;
> > > + }
> > > +
> > > + perf_domain_debug(cpu_map, pd);
> > > +
> > > + /* Attach the new list of performance domains to the root domain. */
> > > + tmp = rd->pd;
> > > + rcu_assign_pointer(rd->pd, pd);
> > > + if (tmp)
> > > + call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
> >
> > We have:
> >
> > sched_cpu_activate/cpuset_cpu_inactive
> > cpuset_cpu_active/sched_cpu_deactivate
> > partition_sched_domains
> > build_perf_domains
> >
> > thus here we are building new SDs and, specifically, above we are
> > attaching the local list "pd" to a _new_ root domain... thus, there
> > cannot already be users of these new SDs and root domain at this stage,
> > isn't it ?
>
> Hmm, actually you can end up here even if the rd isn't new. That would
> happen if you call rebuild_sched_domains() after the EM has been
> registered for example.
> At this point, you might skip
> detach_destroy_domains() and build_sched_domains() from
> partition_sched_domains(), but still call build_perf_domains(), which
> would then attach the pd list to the current rd.

Ok... then it's just me that needs to go back and better study how and
when SDs are rebuilt.

> That's one reason why rcu_assign_pointer() is probably a good idea. And
> it's also nice from a doc standpoint I suppose.

If we can call into build_perf_domains() and reach the assignment
above with an existing RD, then yes, it makes perfect sense.

> > Do we really need that rcu_assign_pointer ?
> > Is the rcu_assign_pointer there just to "match" the following call_rcu ?
> >
> > What about this path:
> >
> > sched_init_domains
> > partition_sched_domains
> >
> > in which case we do not call build_perf_domains... is that intended ?
>
> I assume you meant:
>
> sched_init_domains
> build_sched_domains
>
> Is that right ?

Mmm... yes... seems so.

> If yes, I didn't bother calling build_perf_domains() from there because
> I don't think there is a single platform out there which would have a
> registered Energy Model that early in the boot process. Or maybe there
> is one I don't know ?

Dunno... but, in any case, we probably don't care about using EAS until
boot completes, do we?

Just to better understand, what is the most common activation path we expect ?

1. system boot
2. CPUs online
3. CPUFreq initialization
4. EM registered
X. ???
N. partition_sched_domains
build_perf_domains

IOW, who do we expect to call build_perf_domains for the first time ?

> Anyway, that is probably easy to fix, if need be.
>
> > > +
> > > + return;
> > > +
> > > +free:
> > > + free_pd(pd);
> > > + tmp = rd->pd;
> > > + rcu_assign_pointer(rd->pd, NULL);
> > > + if (tmp)
> > > + call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
> > > +}
> >
> > All the above functions use different naming conventions:
> >
> > "_pd" suffix, "pd_" prefix and "perf_domain_" prefix.
> >
> > and you do it like that because it better matches the corresponding
> > call sites following down the file.
>
> That's right. The functions are supposed to vaguely look like existing
> functions dealing with sched domains.
>
> > However, since we are into a "CONFIG_ENERGY_MODEL" guarded section,
> > why not start using a common prefix for all PD related functions?
> >
> > I very like "perf_domain_" (or "pd_") as a prefix and I would try to
> > use it for all the functions you defined above:
> >
> > perf_domain_free
> > perf_domain_find
> > perf_domain_debug
> > perf_domain_destroy_rcu
> > perf_domain_build
>
> I kinda like the idea of keeping things consistent with the existing
> code TBH. Especially because I'm terrible at naming things ... But if
> there is a general agreement that I should rename everything I won't
> argue.

I've just got the impression that naming in this file is a bit
fuzzy and it could be worth a cleanup... but of course there is also
value in minimizing the changes.

Was just wondering if a better file organization in general could help
make topology.c easier to compile for humans... but yes... let's
keep this for another time ;)

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-08-30 10:20:09

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 07/14] sched/topology: Introduce sched_energy_present static key

On 30-Aug 10:57, Quentin Perret wrote:
> Hi Patrick,
>
> On Thursday 30 Aug 2018 at 10:23:29 (+0100), Patrick Bellasi wrote:
> > Yes, dunno if it's just me but perhaps a bit of rephrasing could help.
>
> Ok, so what about something a little bit more explicit like:
>
> /*
> * The complexity of the Energy Model is defined as:
> *
> * C = nr_pd * (nr_cpus + nr_cs)
> *
> * with parameters defined as:
> * - nr_pd: the number of performance domains
> * - nr_cpus: the number of CPUs
> * - nr_cs: the sum of the number of capacity states of all performance
> * domains (for example, on a system with 2 performance domains,
> * with 10 capacity states each, nr_cs = 2 * 10 = 20).
> *
> * It is generally not a good idea to use such a model in the wake-up path on
> * very complex platforms because of the associated scheduling overheads. The
> * arbitrary constraint below prevents that. It makes EAS usable up to 16 CPUs
> * with per-CPU DVFS and less than 8 capacity states each, for example.
> */

That looks better to me.

> > Alternatively, why not have this comment and check after patches
> > 11 and 12?
>
> Oh, you don't like it here ? What's wrong with it :-) ?

Was just saying that since you refer to the complexity of an algo,
which is effectively introduced by a later patch, it could be easier
to understand where the magic numbers come from if you keep the
comment closer to the algo... but it's not really something major.
At the end, when you look at the full series, you still have all the
full picture... thus, if the maintainers are ok with it being here:
it's perfectly fine for me too ;)

> > Right... I was totally confused by the idea that we don't
> > run EAS if we just have 1 CPU per PD... my bad!
> >
> > Although on those systems, since we don't have idle costs, shouldn't a
> > spreading strategy always be the best from an energy-efficiency
> > standpoint?
>
> Even with per-CPU DVFS, if you have big and little CPUs it matters
> _where_ you execute a task, and you'll still need an Energy Model to
> make good decisions in a generic way.

Very good point! ;)

> But yes, there is definitely some room for improvement for those
> platforms. That's something we could improve later on top of this
> series I suppose.

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-08-30 10:48:55

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 05/14] sched/topology: Reference the Energy Model of CPUs when available

On Thursday 30 Aug 2018 at 11:00:20 (+0100), Patrick Bellasi wrote:
> Dunno... but, in any case, we probably don't care about using EAS until
> boot completes, do we?

So, as of now, EAS will typically start soon after CPUFreq. I don't see a
point in starting before, and I'm not sure if starting it later is
really what we want either ... It probably makes sense to start the
DVFS-related features (CPUFreq, EAS, ...) together.

> Just to better understand, what is the most common activation path we expect ?
>
> 1. system boot
> 2. CPUs online
> 3. CPUFreq initialization
> 4. EM registered
> X. ???

X is sort of arch dependent (like the EM registration actually). For
arm/arm64, the arch topology driver (see drivers/base/arch_topology.c)
will rebuild the sched_domains once the capacities of CPU are
discovered. In previous versions, I had a patch that did that explicitly:

https://lore.kernel.org/lkml/[email protected]/

However, I dropped this patch for v6 because I rebased the series on top
of Morten's automatic flag detection patches, which happen to already do
that for me. More precisely, Morten's patches will rebuild the sched
domains if the SD_ASYM_CPUCAPACITY flag needs to be set somewhere in the
hierarchy. EAS depends on this flag anyway, so I guess it's fine to rely
on this rebuild to start EAS at the same time.

I have been wondering for a while if such a 'loose' way of starting EAS
was alright or not. My conclusion was that it should be OK for now,
because that should suit all the users I can see in the short/medium
term. An alternative could be to introduce something like a notifier in
the EM framework in order to let other subsystems know that the EM is
complete or something along those lines. And then we could register a
callback on this notifier to rebuild the sched_domains. But that's added
complexity, which isn't really needed (yet). If that needs to be done to
accommodate the needs of new users/archs, we can still implement that
later :-)

> N. partition_sched_domains
> build_perf_domains
>
> IOW, who do we expect to call build_perf_domains for the first time ?

On arm/arm64, when the arch_topology driver calls rebuild_sched_domains.

Thanks !
Quentin

2018-08-30 12:52:25

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 05/14] sched/topology: Reference the Energy Model of CPUs when available

On 30-Aug 11:47, Quentin Perret wrote:
> On Thursday 30 Aug 2018 at 11:00:20 (+0100), Patrick Bellasi wrote:
> > Dunno... but, in any case, we probably don't care about using EAS until
> > boot completes, do we?
>
> So, as of now, EAS will typically start soon after CPUFreq. I don't see a
> point in starting before, and I'm not sure if starting it later is
> really what we want either ... It probably makes sense to start the
> DVFS-related features (CPUFreq, EAS, ...) together.
>
> > Just to better understand, what is the most common activation path we expect ?
> >
> > 1. system boot
> > 2. CPUs online
> > 3. CPUFreq initialization
> > 4. EM registered
> > X. ???
>
> X is sort of arch dependent (like the EM registration actually). For
> arm/arm64, the arch topology driver (see drivers/base/arch_topology.c)
> will rebuild the sched_domains once the capacities of CPU are
> discovered. In previous versions, I had a patch that did that explicitly:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> However, I dropped this patch for v6 because I rebased the series on top
> of Morten's automatic flag detection patches, which happen to already do
> that for me. More precisely, Morten's patches will rebuild the sched
> domains if the SD_ASYM_CPUCAPACITY flag needs to be set somewhere in the
> hierarchy. EAS depends on this flag anyway, so I guess it's fine to rely
> on this rebuild to start EAS at the same time.
>
> I have been wondering for a while if such a 'loose' way of starting EAS
> was alright or not. My conclusion was that it should be OK for now,
> because that should suit all the users I can see in the short/medium
> term. An alternative could be to introduce something like a notifier in
> the EM framework in order to let other subsystems know that the EM is
> complete or something along those lines. And then we could register a
> callback on this notifier to rebuild the sched_domains. But that's added
> complexity, which isn't really needed (yet). If that needs to be done to
> accommodate the needs of new users/archs, we can still implement that
> later :-)
>
> > N. partition_sched_domains
> > build_perf_domains
> >
> > IOW, who do we expect to call build_perf_domains for the first time ?
>
> On arm/arm64, when the arch_topology driver calls rebuild_sched_domains.

Great, I missed that point, which you actually call out in the cover
letter. Thanks also for the more comprehensive explanation above!

> Thanks !
> Quentin

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-08-31 09:06:18

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 03/14] PM: Introduce an Energy Model management framework

On 29-Aug 14:28, Quentin Perret wrote:
> Hi Patrick,
>
> On Wednesday 29 Aug 2018 at 11:04:35 (+0100), Patrick Bellasi wrote:
> > In the loop above we use smp_store_release() to propagate the pointer
> > setting in a PER_CPU(em_data), whose ultimate goal is to protect
> > em_register_perf_domain() from multiple clients registering the same
> > power domain.
> >
> > I think there are two possible optimizations there:
> >
> > 1. use of a single memory barrier
> >
> > Since we are already em_pd_mutex protected, i.e. there cannot be
> > concurrent writers, we can use one single memory barrier after the
> > loop, i.e.
> >
> > for_each_cpu(cpu, span)
> > WRITE_ONCE()
> > smp_wmb()
> >
> > which should be just enough to ensure that all other CPUs will see
> > the pointer set once we release the mutex
>
> Right, I'm actually wondering if the memory barrier is needed at all ...
> The mutex lock()/unlock() should already ensure the ordering I want, no?
>
> WRITE_ONCE() should prevent load/store tearing with concurrent em_cpu_get(),
> and the release/acquire semantics of mutex lock/unlock should be enough to
> serialize the memory accesses of concurrent em_register_perf_domain() calls
> properly ...
>
> Hmm, let me read memory-barriers.txt again.

Yes, I think it should... but better double check.

> > 2. avoid using PER_CPU variables
> >
> > Apart from the initialization code, i.e. boot time, the em_data is
> > expected to be read only, isn't it?
>
> That's right. It's not only read only, it's also not read very often (in
> the use-cases I have in mind at least). The scheduler for example will
> call em_cpu_get() once when sched domains are built, and keep the
> reference instead of calling it again.
>
> > If that's the case, I think that using PER_CPU variables is not
> > strictly required while it unnecessarily increases the cache pressure.
> >
> > In the worst case we can end up with one cache line for each CPU to
> > host just an 8B pointer, instead of using that single cache line to host
> > up to 8 pointers if we use just an array, i.e.
> >
> > struct em_perf_domain *em_data[NR_CPUS]
> > ____cacheline_aligned_in_smp __read_mostly;
> >
> > Consider also that: up to 8 pointers in a single cache line means
> > also that single cache line can be enough to access the EM from all
> > the CPUs of almost every modern mobile phone SoC.
> >
> > Not entirely sure if PER_CPU uses less overall memory in case you
> > have many fewer CPUs than the compile-time defined NR_CPUS.
> > But still, if the above makes sense, you still have an 8x gain
> > factor between the number of write-allocated .data..percpu sections and
> > the value of NR_CPUS. Meaning that in the worst case we allocate
> > the same amount of memory using NR_CPUS=64 (the default on arm64)
> > while running on an 8-CPU system... but still we should get less
> > cluster-cache pressure at run-time with the array approach: 1
> > cache line vs 4.
>
> Right, using per_cpu() might cause to bring in cache things you don't
> really care about (other non-related per_cpu stuff), but that shouldn't
> waste memory I think. I mean, if my em_data var is the first in a cache
> line, the rest of the cache line will most likely be used by other
> per_cpu variables anyways ...
>
> As you suggested, the alternative would be to have a simple array. I'm
> fine with this TBH. But I would probably allocate it dynamically using
> nr_cpu_ids instead of using a static NR_CPUS-wide thing though -- the
> registration of perf domains usually happens late enough in the boot
> process.
>
> What do you think ?

Sounds all reasonable to me.

> Thanks
> Quentin

Best Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-04 11:00:58

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 13/14] sched/topology: Make Energy Aware Scheduling depend on schedutil

On Monday 20 Aug 2018 at 10:44:19 (+0100), Quentin Perret wrote:
> Energy Aware Scheduling (EAS) is designed with the assumption that
> frequencies of CPUs follow their utilization value. When using a CPUFreq
> governor other than schedutil, the chances of this assumption being true
> are small, if any. When schedutil is being used, EAS' predictions are at
> least consistent with the frequency requests. Although those requests
> have no guarantees to be honored by the hardware, they should at least
> guide DVFS in the right direction and provide some hope with regard to the
> EAS model being accurate.
>
> To make sure EAS is only used in a sane configuration, create a strong
> dependency on schedutil being used. Since having sugov compiled-in does
> not provide that guarantee, extend the existing CPUFreq policy notifier
> with a new case on governor changes. That allows the scheduler to
> register a callback on this notifier to rebuild the scheduling domains
> when governors are changed, and enable/disable EAS accordingly.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Quentin Perret <[email protected]>
>
> ---
> This patch could probably be squashed into another one, but I kept it
> separate to ease the review. Also, it's probably optional as not having
> it will not 'break' things per se.
>
> I went for the smallest possible solution I could find, which has the
> good side of being simple, but it's definitely not the only one.
>
> Another possibility would be to hook things in sugov_start() and
> sugov_stop(), but that requires some more work. In this case, it
> wouldn't be possible to just re-build the sched_domains() from there,
> because when sugov_stop() is called, the 'governor' field of the policy
> hasn't been updated yet, so the condition (if gov == schedutil) in
> build_freq_domains() doesn't work.
>
> To work around the issue we'll need to find a way to pass a cpumask to
> the topology code to specifically say 'sugov has been stopped on these
> CPUs'. That would mean more code to handle that, but that would also
> mean we don't have to mess around with the CPUFreq notifiers ...
>
> Not sure what's best, so all feedback is more than welcome.

Hi,

Does anybody have concerns with this patch ? Is it a reasonable option
to use the CPUFreq notifiers in this case ? If there is anything I can
do to ease the review please let me know.

Also, is there any hope that the 12 first patches could make it in 4.20
on their own ? Or is it already too late ?

Thanks,
Quentin

2018-09-06 06:10:58

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v6 07/14] sched/topology: Introduce sched_energy_present static key

On 08/20/2018 02:44 AM, Quentin Perret wrote:
> In order to ensure a minimal performance impact on non-energy-aware
> systems, introduce a static_key guarding the access to Energy-Aware
> Scheduling (EAS) code.
>
> The static key is set iff all the following conditions are met for at
> least one root domain:
> 1. all online CPUs of the root domain are covered by the Energy
> Model (EM);
> 2. the complexity of the root domain's EM is low enough to keep
> scheduling overheads low;
> 3. the root domain has an asymmetric CPU capacity topology (detected
> by looking for the SD_ASYM_CPUCAPACITY flag in the sched_domain
> hierarchy).

This is pretty much the list (plus whether schedutil is running) of
conditions to set rd->pd != NULL in build_perf_domains().

So when testing 'static_branch_unlikely(&sched_energy_present) &&
rcu_dereference(rd->pd)' don't you test two times the same thing?

Also, let's say somebody wants to run another EM user (e.g. a thermal
governor, like IPA) but not EAS on an asymmetric CPU capacity system.
This can't be achieved with the current static branch approach.

So what about using a (disabled by default ?) sched_feature + rd->pd !=
NULL instead?

[...]

2018-09-06 06:58:34

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v6 04/14] PM / EM: Expose the Energy Model in sysfs

On 08/20/2018 02:44 AM, Quentin Perret wrote:
> Expose the Energy Model (read-only) of all performance domains in sysfs
> for convenience. To do so, add a kobject to the CPU subsystem under the
> umbrella of which a kobject for each performance domain is attached.
>
> The resulting hierarchy is as follows for a platform with two
> performance domains for example:
>
> /sys/devices/system/cpu/energy_model
> ├── pd0
> │   ├── cost
> │   ├── cpus
> │   ├── frequency
> │   └── power

cpus (the cpumask of the perf domain), frequency (the OPPs of the perf
domain) and power (the values at those OPPs) are somewhat easy to grasp;
cost is definitely not.

You have this nice description in em_pd_energy() what cost actually is.
IMHO, might be worth repeating this at least in the patch header here.

[...]

2018-09-06 09:23:31

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v6 13/14] sched/topology: Make Energy Aware Scheduling depend on schedutil

On Tue, Sep 4, 2018 at 12:59 PM Quentin Perret <[email protected]> wrote:
>
> On Monday 20 Aug 2018 at 10:44:19 (+0100), Quentin Perret wrote:
> > Energy Aware Scheduling (EAS) is designed with the assumption that
> > frequencies of CPUs follow their utilization value. When using a CPUFreq
> > governor other than schedutil, the chances of this assumption being true
> > are small, if any. When schedutil is being used, EAS' predictions are at
> > least consistent with the frequency requests. Although those requests
> > have no guarantees to be honored by the hardware, they should at least
> > guide DVFS in the right direction and provide some hope with regard to the
> > EAS model being accurate.
> >
> > To make sure EAS is only used in a sane configuration, create a strong
> > dependency on schedutil being used. Since having sugov compiled-in does
> > not provide that guarantee, extend the existing CPUFreq policy notifier
> > with a new case on governor changes. That allows the scheduler to
> > register a callback on this notifier to rebuild the scheduling domains
> > when governors are changed, and enable/disable EAS accordingly.
> >
> > cc: Ingo Molnar <[email protected]>
> > cc: Peter Zijlstra <[email protected]>
> > Signed-off-by: Quentin Perret <[email protected]>
> >
> > ---
> > This patch could probably be squashed into another one, but I kept it
> > separate to ease the review. Also, it's probably optional as not having
> > it will not 'break' things per se.
> >
> > I went for the smallest possible solution I could find, which has the
> > good side of being simple, but it's definitely not the only one.
> >
> > Another possibility would be to hook things in sugov_start() and
> > sugov_stop(), but that requires some more work. In this case, it
> > wouldn't be possible to just re-build the sched_domains() from there,
> > because when sugov_stop() is called, the 'governor' field of the policy
> > hasn't been updated yet, so the condition (if gov == schedutil) in
> > build_freq_domains() doesn't work.
> >
> > To work around the issue we'll need to find a way to pass a cpumask to
> > the topology code to specifically say 'sugov has been stopped on these
> > CPUs'. That would mean more code to handle that, but that would also
> > mean we don't have to mess around with the CPUFreq notifiers ...
> >
> > Not sure what's best, so all feedback is more than welcome.
>
> Hi,
>
> Does anybody have concerns with this patch ? Is it a reasonable option
> to use the CPUFreq notifiers in this case ? If there is anything I can
> do to ease the review please let me know.

I'm not a particular fan of notifiers to be honest and you don't need
to add an extra chain just in order to be able to register a callback
from a single user. That can be achieved with a single callback
pointer too, but also you could just call a function exported by the
scheduler directly from where in the cpufreq code it needs to be
called.

> Also, is there any hope that the 12 first patches could make it in 4.20
> on their own ? Or is it already too late ?

I'm walking through them right now, albeit somewhat slowly due to
various distractions, so we'll see.

Thanks,
Rafael

2018-09-06 09:32:37

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 07/14] sched/topology: Introduce sched_energy_present static key

Hi Dietmar,

On Wednesday 05 Sep 2018 at 23:06:38 (-0700), Dietmar Eggemann wrote:
> On 08/20/2018 02:44 AM, Quentin Perret wrote:
> > In order to ensure a minimal performance impact on non-energy-aware
> > systems, introduce a static_key guarding the access to Energy-Aware
> > Scheduling (EAS) code.
> >
> > The static key is set iff all the following conditions are met for at
> > least one root domain:
> > 1. all online CPUs of the root domain are covered by the Energy
> > Model (EM);
> > 2. the complexity of the root domain's EM is low enough to keep
> > scheduling overheads low;
> > 3. the root domain has an asymmetric CPU capacity topology (detected
> > by looking for the SD_ASYM_CPUCAPACITY flag in the sched_domain
> > hierarchy).
>
> This is pretty much the list (+ is schedutil running) of conditions to set
> rd->pd != NULL in build_perf_domains().

Yes, exactly. I actually loop over the rds to check if one of them has a
pd != NULL in order to enable/disable the static key. So the check for
those conditions is always done at the rd level.

> So when testing 'static_branch_unlikely(&sched_energy_present) &&
> rcu_dereference(rd->pd)' don't you test two times the same thing?

Well, not exactly. You could have two rds in your system, and only one
of the two has an Energy Model. The static key is just a performance
optimization for !EAS systems really. But I must admit it sort of lost a
bit of its interest over the versions. I mean, it's not that clear now
that a static key is a better option than a sched_feat as you suggest
below.

> Also, if let's say somebody wants to run another EM user (e.g. a thermal
> governor, like IPA) but not EAS on an asymmetric CPU capacity system. This
> can't be achieved with the current static branch approach.

I assume you're talking about IPA once migrated to using the EM
framework ? In this case, I think you're right, we won't have a single
tunable left to enable/disable EAS. On a big.LITTLE platform, if you
want IPA, EAS will always be enabled by side effect ...

That's a very good point actually. I think some people will not be happy
with that. There are big.LITTLE users (not very many of them, but still)
who don't care that much about energy, but do about performance. And
those guys might want to use IPA without EAS. So I guess we really need
a new knob.

> So what about using a (disabled by default ?) sched_feature + rd->pd != NULL
> instead?

Right, that's an option. I could remove the static key and
sched_energy_start() altogether and replace all the
"if (static_branch_unlikely(&sched_energy_present))" by
"if (sched_feat(ENERGY_AWARE))" for example. That should be equivalent
to what I did with the static key from a performance standpoint. But that
would mean users have to manually flip switches to get EAS up and
running ... I assume it's the price to pay for configurability.

Another option would be a KConfig option + static key. I could keep all
of the ifdefery inside an accessor function like the following:

static inline bool sched_energy_aware(void)
{
#ifdef CONFIG_SCHED_ENERGY
return static_branch_likely(&sched_energy_present);
#else
return false;
#endif
}

Now, I understand that scheduler-related KConfig options aren't welcome
in general, so I tend to prefer the sched_feat option.
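For reference, the sched_feat() alternative being discussed would boil down to something like the following (a rough sketch following the kernel/sched/features.h conventions, illustrative only, not a tested patch):

```c
/* kernel/sched/features.h -- disabled by default (sketch) */
SCHED_FEAT(ENERGY_AWARE, false)

/* and at each call site, instead of the static key: */
if (sched_feat(ENERGY_AWARE)) {
	struct perf_domain *pd = rcu_dereference(rd->pd);
	/* ... proceed with EAS only if pd is valid ... */
}
```

With CONFIG_SCHED_DEBUG, sched_feat() compiles down to a static key flipped via /sys/kernel/debug/sched_features, so the performance characteristics would be comparable to the current approach.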

Thoughts?

Thanks,
Quentin

2018-09-06 14:12:00

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 04/14] PM / EM: Expose the Energy Model in sysfs

Hi Dietmar,

On Wednesday 05 Sep 2018 at 23:56:43 (-0700), Dietmar Eggemann wrote:
> On 08/20/2018 02:44 AM, Quentin Perret wrote:
> > Expose the Energy Model (read-only) of all performance domains in sysfs
> > for convenience. To do so, add a kobject to the CPU subsystem under the
> > umbrella of which a kobject for each performance domain is attached.
> >
> > The resulting hierarchy is as follows for a platform with two
> > performance domains for example:
> >
> > /sys/devices/system/cpu/energy_model
> > ├── pd0
> > │   ├── cost
> > │   ├── cpus
> > │   ├── frequency
> > │   └── power
>
> cpus (cpumask of the perf domain), frequency (OPP's of the perf domain) and
> power (values at those OPP's) are somehow easy to grasp, cost is definitely
> not.
>
> You have this nice description in em_pd_energy() what cost actually is.
> IMHO, might be worth repeating this at least in the patch header here.

Hmm, this patch introduces the sysfs interface, not the 'cost' field
itself. As long as 'cost' is documented in the patch that introduces it
we should be good no ? I mean this patch header tells you _where_ the
fields of the structure are exposed. _What_ the structure is all about
is a different story.

But yeah, in any case, a reminder shouldn't hurt I guess, if you really
want one :-)

Thanks,
Quentin

2018-09-06 14:41:23

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 13/14] sched/topology: Make Energy Aware Scheduling depend on schedutil

Hi Rafael,

On Thursday 06 Sep 2018 at 11:18:55 (+0200), Rafael J. Wysocki wrote:
> I'm not a particular fan of notifiers to be honest and you don't need
> to add an extra chain just in order to be able to register a callback
> from a single user.

Right. I agree there are alternatives to using notifiers. I used them
because they're existing infrastructure, and because they let me do what
I want without too much trouble, which are two important points.

> That can be achieved with a single callback
> pointer too, but also you could just call a function exported by the
> scheduler directly from where in the cpufreq code it needs to be
> called.

Are you thinking about something comparable to what is done in
cpufreq_add_update_util_hook() (kernel/sched/cpufreq.c) for example ?
That would probably have the same drawback as my current implementation,
that is that the scheduler is notified of _all_ governor changes, not
only changes to/from sugov although this is the only thing we care about
for EAS.

We could also hook things in sugov_start & sugov_stop directly, and keep
all the changes inside the scheduler ... That is slightly harder to
implement on the scheduler topology side, though.

Thoughts?

> > Also, is there any hope that the first 12 patches could make it into
> > 4.20 on their own? Or is it already too late?
>
> I'm walking through them right now, albeit somewhat slowly due to
> various distractions, so we'll see.

Thanks!
Quentin

2018-09-06 23:51:42

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v6 07/14] sched/topology: Introduce sched_energy_present static key

On 09/06/2018 02:29 AM, Quentin Perret wrote:
> Hi Dietmar,
>
> On Wednesday 05 Sep 2018 at 23:06:38 (-0700), Dietmar Eggemann wrote:
>> On 08/20/2018 02:44 AM, Quentin Perret wrote:
>>> In order to ensure a minimal performance impact on non-energy-aware
>>> systems, introduce a static_key guarding the access to Energy-Aware
>>> Scheduling (EAS) code.
>>>
>>> The static key is set iff all the following conditions are met for at
>>> least one root domain:
>>> 1. all online CPUs of the root domain are covered by the Energy
>>> Model (EM);
>>> 2. the complexity of the root domain's EM is low enough to keep
>>> scheduling overheads low;
>>> 3. the root domain has an asymmetric CPU capacity topology (detected
>>> by looking for the SD_ASYM_CPUCAPACITY flag in the sched_domain
>>> hierarchy).
>>
>> This is pretty much the list (+ is schedutil running) of conditions to set
>> rd->pd != NULL in build_perf_domains().
>
> Yes, exactly. I actually loop over the rds to check if one of them has a
> pd != NULL in order to enable/disable the static key. So the check for
> those conditions is always done at the rd level.
>
>> So when testing 'static_branch_unlikely(&sched_energy_present) &&
>> rcu_dereference(rd->pd)' don't you test two times the same thing?
>
> Well, not exactly. You could have two rds in your system, and only one
> of the two has an Energy Model. The static key is just a performance
> optimization for !EAS systems really. But I must admit it sort of lost a

Ah, that's correct. But the static key could still be replaced by a sched
feature; that's the important bit here.

> bit of its interest over the versions. I mean, it's not that clear now
> that a static key is a better option than a sched_feat as you suggest
> below.
>
>> Also, if let's say somebody wants to run another EM user (e.g. a thermal
>> governor, like IPA) but not EAS on an asymmetric CPU capacity system. This
>> can't be achieved with the current static branch approach.
>
> I assume you're talking about IPA once migrated to using the EM
> framework ? In this case, I think you're right, we won't have a single

Right, in case we have multiple users of the EM in the future.

> tunable left to enable/disable EAS. On a big.LITTLE platform, if you
> want IPA, EAS will always be enabled by side effect ...
>
> That's a very good point actually. I think some people will not be happy
> with that. There are big.LITTLE users (not very many of them, but still)
> who don't care that much about energy, but do about performance. And
> those guys might want to use IPA without EAS. So I guess we really need
> a new knob.

I guess so too.

>> So what about using a (disabled by default ?) sched_feature + rd->pd != NULL
>> instead?
>
> Right, that's an option. I could remove the static key and
> sched_energy_start() altogether and replace all the
> "if (static_branch_unlikely(&sched_energy_present))" by
> "if (sched_feat(ENERGY_AWARE))" for example. That should be equivalent
> to what I did with the static key from a performance standpoint. But that
> would mean users have to manually flip switches to get EAS up and
> running ... I assume it's the price to pay for configurability.
>
> Another option would be a KConfig option + static key. I could keep all
> of the ifdefery inside an accessor function like the following:
>
> static inline bool sched_energy_aware(void)
> {
> #ifdef CONFIG_SCHED_ENERGY
> return static_branch_likely(&sched_energy_present);
> #else
> return false;
> #endif
> }
>
> Now, I understand that scheduler-related KConfig options aren't welcome
> in general, so I tend to prefer the sched_feat option.
>
> Thoughts?

I would prefer a sched_feature. I guess it has to be disabled by default
so that other systems don't have to check rcu_dereference(rd->pd) in the
wakeup path.

But since, at the beginning, EAS will be the only user of the EM, there
is no need to change the static key sched_energy_present right now.

2018-09-07 00:22:43

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v6 04/14] PM / EM: Expose the Energy Model in sysfs

On 09/06/2018 07:09 AM, Quentin Perret wrote:
> Hi Dietmar,
>
> On Wednesday 05 Sep 2018 at 23:56:43 (-0700), Dietmar Eggemann wrote:
>> On 08/20/2018 02:44 AM, Quentin Perret wrote:
>>> Expose the Energy Model (read-only) of all performance domains in sysfs
>>> for convenience. To do so, add a kobject to the CPU subsystem under the
>>> umbrella of which a kobject for each performance domain is attached.
>>>
>>> The resulting hierarchy is as follows for a platform with two
>>> performance domains for example:
>>>
>>> /sys/devices/system/cpu/energy_model
>>> ├── pd0
>>> │   ├── cost
>>> │   ├── cpus
>>> │   ├── frequency
>>> │   └── power
>>
>> cpus (cpumask of the perf domain), frequency (OPP's of the perf domain) and
>> power (values at those OPP's) are somehow easy to grasp, cost is definitely
>> not.
>>
>> You have this nice description in em_pd_energy() what cost actually is.
>> IMHO, might be worth repeating this at least in the patch header here.
>
> Hmm, this patch introduces the sysfs interface, not the 'cost' field
> itself. As long as 'cost' is documented in the patch that introduces it
> we should be good no ? I mean this patch header tells you _where_ the
> fields of the structure are exposed. _What_ the structure is all about
> is a different story.

Mmmh, so maybe an EAS-related documentation file explaining this
interface, which can be introduced later, is the solution here?

I'm just not 100% convinced that those cost values are self-explanatory
like the other three items:

root@h960:~# ls /sys/devices/system/cpu/energy_model/pd0/
cost
cpus
frequency
power

root@h960:~# cat /sys/devices/system/cpu/energy_model/pd0/*
96 129 163 201 245
0-3
533000 999000 1402000 1709000 1844000
28 70 124 187 245

> But yeah, in any case, a reminder shouldn't hurt I guess, if you really
> want one :-)

Nothing that should hold this patch-set back, though.

2018-09-07 08:39:42

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 07/14] sched/topology: Introduce sched_energy_present static key

On Thursday 06 Sep 2018 at 16:49:47 (-0700), Dietmar Eggemann wrote:
> I would prefer a sched_feature. I guess it has to be disabled by default so
> that other systems don't have to check rcu_dereference(rd->pd) in the wakeup
> path.

Right, this is what I had in mind too. I guess downstream kernels can
always carry a patch that changes the default if they want it enabled
without messing around in userspace.

> But since, at the beginning, EAS will be the only user of the EM, there
> is no need to change the static key sched_energy_present right now.

Indeed, I could add a patch introducing this sched_feat in the series
that migrates IPA to using the EM framework (to be posted later). It is
just not required until we have a new user.

However, that IPA-related patchset would then change the default
behaviour for users who used to get EAS enabled automatically, but
wouldn't after updating their kernel (meaning they'd now have to flip
switches by hand whereas it used to "just work"). Not sure if that
qualifies as "breaking users" (cf. Linus' rule #1 of kernel
development) ...

Thanks,
Quentin

2018-09-07 09:00:37

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v6 13/14] sched/topology: Make Energy Aware Scheduling depend on schedutil

On Friday, September 7, 2018 10:52:01 AM CEST Rafael J. Wysocki wrote:
> On Thursday, September 6, 2018 4:38:44 PM CEST Quentin Perret wrote:
> > Hi Rafael,
> >
> > On Thursday 06 Sep 2018 at 11:18:55 (+0200), Rafael J. Wysocki wrote:
> > > I'm not a particular fan of notifiers to be honest and you don't need
> > > to add an extra chain just in order to be able to register a callback
> > > from a single user.
> >
> > Right. I agree there are alternatives to using notifiers. I used them
> > because they're existing infrastructure, and because they let me do what
> > I want without too much trouble, which are two important points.
> >
> > > That can be achieved with a single callback
> > > pointer too, but also you could just call a function exported by the
> > > scheduler directly from where in the cpufreq code it needs to be
> > > called.
> >
> > Are you thinking about something comparable to what is done in
> > cpufreq_add_update_util_hook() (kernel/sched/cpufreq.c) for example ?
> > That would probably have the same drawback as my current implementation,
> > that is that the scheduler is notified of _all_ governor changes, not
> > only changes to/from sugov although this is the only thing we care about
> > for EAS.
>
> Well, why don't you implement it as something like "if the governor changes
> from sugov to something else (or the other way around), call this function
> from the scheduler"?

That said, governors are stopped and started in a few cases other than just
changing the governor, so maybe you want the EAS side to be notified whenever
sugov is stopped and started after all?


2018-09-07 09:58:24

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 13/14] sched/topology: Make Energy Aware Scheduling depend on schedutil

On Friday 07 Sep 2018 at 10:56:12 (+0200), Rafael J. Wysocki wrote:
> On Friday, September 7, 2018 10:52:01 AM CEST Rafael J. Wysocki wrote:
> > On Thursday, September 6, 2018 4:38:44 PM CEST Quentin Perret wrote:
> > > Hi Rafael,
> > >
> > > On Thursday 06 Sep 2018 at 11:18:55 (+0200), Rafael J. Wysocki wrote:
> > > > I'm not a particular fan of notifiers to be honest and you don't need
> > > > to add an extra chain just in order to be able to register a callback
> > > > from a single user.
> > >
> > > Right. I agree there are alternatives to using notifiers. I used them
> > > because they're existing infrastructure, and because they let me do what
> > > I want without too much trouble, which are two important points.
> > >
> > > > That can be achieved with a single callback
> > > > pointer too, but also you could just call a function exported by the
> > > > scheduler directly from where in the cpufreq code it needs to be
> > > > called.
> > >
> > > Are you thinking about something comparable to what is done in
> > > cpufreq_add_update_util_hook() (kernel/sched/cpufreq.c) for example ?
> > > That would probably have the same drawback as my current implementation,
> > > that is that the scheduler is notified of _all_ governor changes, not
> > > only changes to/from sugov although this is the only thing we care about
> > > for EAS.
> >
> > Well, why don't you implement it as something like "if the governor changes
> > from sugov to something else (or the other way around), call this function
> > from the scheduler"?

Yes, that works too ...

> That said, governors are stopped and started in a few cases other than just
> changing the governor, so maybe you want the EAS side to be notified whenever
> sugov is stopped and started after all?

Right, so sugov_start/sugov_stop could be an option in this case ... And
that would leave the CPUFreq core untouched. I'll try to write something :-)

Thanks,
Quentin

2018-09-07 11:46:07

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v6 13/14] sched/topology: Make Energy Aware Scheduling depend on schedutil

On Thursday, September 6, 2018 4:38:44 PM CEST Quentin Perret wrote:
> Hi Rafael,
>
> On Thursday 06 Sep 2018 at 11:18:55 (+0200), Rafael J. Wysocki wrote:
> > I'm not a particular fan of notifiers to be honest and you don't need
> > to add an extra chain just in order to be able to register a callback
> > from a single user.
>
> Right. I agree there are alternatives to using notifiers. I used them
> because they're existing infrastructure, and because they let me do what
> I want without too much trouble, which are two important points.
>
> > That can be achieved with a single callback
> > pointer too, but also you could just call a function exported by the
> > scheduler directly from where in the cpufreq code it needs to be
> > called.
>
> Are you thinking about something comparable to what is done in
> cpufreq_add_update_util_hook() (kernel/sched/cpufreq.c) for example ?
> That would probably have the same drawback as my current implementation,
> that is that the scheduler is notified of _all_ governor changes, not
> only changes to/from sugov although this is the only thing we care about
> for EAS.

Well, why don't you implement it as something like "if the governor changes
from sugov to something else (or the other way around), call this function
from the scheduler"?


2018-09-07 15:55:39

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 13/14] sched/topology: Make Energy Aware Scheduling depend on schedutil

On Friday 07 Sep 2018 at 10:52:01 (+0200), Rafael J. Wysocki wrote:
> On Thursday, September 6, 2018 4:38:44 PM CEST Quentin Perret wrote:
> > Hi Rafael,
> >
> > On Thursday 06 Sep 2018 at 11:18:55 (+0200), Rafael J. Wysocki wrote:
> > > I'm not a particular fan of notifiers to be honest and you don't need
> > > to add an extra chain just in order to be able to register a callback
> > > from a single user.
> >
> > Right. I agree there are alternatives to using notifiers. I used them
> > because they're existing infrastructure, and because they let me do what
> > I want without too much trouble, which are two important points.
> >
> > > That can be achieved with a single callback
> > > pointer too, but also you could just call a function exported by the
> > > scheduler directly from where in the cpufreq code it needs to be
> > > called.
> >
> > Are you thinking about something comparable to what is done in
> > cpufreq_add_update_util_hook() (kernel/sched/cpufreq.c) for example ?
> > That would probably have the same drawback as my current implementation,
> > that is that the scheduler is notified of _all_ governor changes, not
> > only changes to/from sugov although this is the only thing we care about
> > for EAS.
>
> Well, why don't you implement it as something like "if the governor changes
> from sugov to something else (or the other way around), call this function
> from the scheduler"?

I just gave it a try and ended up with the diff below. It's basically
the exact same patch with a direct function call instead of a notifier.
(I also tried the sugov_start/stop thing I keep mentioning but it is
more complex, so let's see if the simplest solution could work first).

What do you think ?

----------8<----------
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index b0dfd3222013..6300668ac67a 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -25,6 +25,7 @@
#include <linux/kernel_stat.h>
#include <linux/module.h>
#include <linux/mutex.h>
+#include <linux/sched/cpufreq.h>
#include <linux/slab.h>
#include <linux/suspend.h>
#include <linux/syscore_ops.h>
@@ -2271,6 +2272,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
ret = cpufreq_start_governor(policy);
if (!ret) {
pr_debug("cpufreq: governor change\n");
+ sched_governor_change(policy, old_gov);
return 0;
}
cpufreq_exit_governor(policy);
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index afa940cd50dc..33b77eed8a41 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -2,6 +2,7 @@
#ifndef _LINUX_SCHED_CPUFREQ_H
#define _LINUX_SCHED_CPUFREQ_H

+#include <linux/cpufreq.h>
#include <linux/types.h>

/*
@@ -28,4 +29,12 @@ static inline unsigned long map_util_freq(unsigned long util,
}
#endif /* CONFIG_CPU_FREQ */

+#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
+void sched_governor_change(struct cpufreq_policy *policy,
+ struct cpufreq_governor *old_gov);
+#else
+static inline void sched_governor_change(struct cpufreq_policy *policy,
+ struct cpufreq_governor *old_gov) { }
+#endif
+
#endif /* _LINUX_SCHED_CPUFREQ_H */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 8356cb0072a6..2ff40b2a8ba0 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -632,7 +632,7 @@ static struct kobj_type sugov_tunables_ktype = {

/********************** cpufreq governor interface *********************/

-static struct cpufreq_governor schedutil_gov;
+struct cpufreq_governor schedutil_gov;

static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
{
@@ -891,7 +891,7 @@ static void sugov_limits(struct cpufreq_policy *policy)
sg_policy->need_freq_update = true;
}

-static struct cpufreq_governor schedutil_gov = {
+struct cpufreq_governor schedutil_gov = {
.name = "schedutil",
.owner = THIS_MODULE,
.dynamic_switching = true,
@@ -914,3 +914,32 @@ static int __init sugov_register(void)
return cpufreq_register_governor(&schedutil_gov);
}
fs_initcall(sugov_register);
+
+#ifdef CONFIG_ENERGY_MODEL
+extern bool sched_energy_update;
+static DEFINE_MUTEX(rebuild_sd_mutex);
+
+static void rebuild_sd_workfn(struct work_struct *work)
+{
+ mutex_lock(&rebuild_sd_mutex);
+ sched_energy_update = true;
+ rebuild_sched_domains();
+ sched_energy_update = false;
+ mutex_unlock(&rebuild_sd_mutex);
+}
+static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
+
+/*
+ * EAS shouldn't be attempted without sugov, so rebuild the sched_domains
+ * on governor changes to make sure the scheduler knows about it.
+ */
+void sched_governor_change(struct cpufreq_policy *policy,
+ struct cpufreq_governor *old_gov)
+{
+ if (old_gov == &schedutil_gov || policy->governor == &schedutil_gov) {
+ /* Sched domains cannot be rebuilt directly from this context */
+ schedule_work(&rebuild_sd_work);
+ }
+
+}
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e594a854977f..915766600568 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2265,10 +2265,8 @@ unsigned long scale_irq_capacity(unsigned long util, unsigned long irq, unsigned
}
#endif

-#ifdef CONFIG_SMP
-#ifdef CONFIG_ENERGY_MODEL
+#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
#define perf_domain_span(pd) (to_cpumask(((pd)->obj->cpus)))
#else
#define perf_domain_span(pd) NULL
#endif
-#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ae329447a082..781f3eba840e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -209,7 +209,9 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
*/
DEFINE_STATIC_KEY_FALSE(sched_energy_present);

-#ifdef CONFIG_ENERGY_MODEL
+#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
+bool sched_energy_update;
+
static void free_pd(struct perf_domain *pd)
{
struct perf_domain *tmp;
@@ -297,12 +299,15 @@ static void destroy_perf_domain_rcu(struct rcu_head *rp)
*/
#define EM_MAX_COMPLEXITY 2048

+extern struct cpufreq_governor schedutil_gov;
static void build_perf_domains(const struct cpumask *cpu_map)
{
int i, nr_pd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map);
struct perf_domain *pd = NULL, *tmp;
int cpu = cpumask_first(cpu_map);
struct root_domain *rd = cpu_rq(cpu)->rd;
+ struct cpufreq_policy *policy;
+ struct cpufreq_governor *gov;

/* EAS is enabled for asymmetric CPU capacity topologies. */
if (!per_cpu(sd_asym_cpucapacity, cpu)) {
@@ -318,6 +323,15 @@ static void build_perf_domains(const struct cpumask *cpu_map)
if (find_pd(pd, i))
continue;

+ /* Do not attempt EAS if schedutil is not being used. */
+ policy = cpufreq_cpu_get(i);
+ if (!policy)
+ goto free;
+ gov = policy->governor;
+ cpufreq_cpu_put(policy);
+ if (gov != &schedutil_gov)
+ goto free;
+
/* Create the new pd and add it to the local list. */
tmp = pd_init(i);
if (!tmp)
@@ -389,7 +403,7 @@ static void sched_energy_start(int ndoms_new, cpumask_var_t doms_new[])
}
#else
static void free_pd(struct perf_domain *pd) { }
-#endif /* CONFIG_ENERGY_MODEL */
+#endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL*/

static void free_rootdomain(struct rcu_head *rcu)
{
@@ -2190,10 +2204,10 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
;
}

-#ifdef CONFIG_ENERGY_MODEL
+#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
/* Build perf. domains: */
for (i = 0; i < ndoms_new; i++) {
- for (j = 0; j < n; j++) {
+ for (j = 0; j < n && !sched_energy_update; j++) {
if (cpumask_equal(doms_new[i], doms_cur[j]) &&
cpu_rq(cpumask_first(doms_cur[j]))->rd->pd)
goto match3;


2018-09-09 20:15:28

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v6 13/14] sched/topology: Make Energy Aware Scheduling depend on schedutil

On Fri, Sep 7, 2018 at 5:29 PM Quentin Perret <[email protected]> wrote:
>
> On Friday 07 Sep 2018 at 10:52:01 (+0200), Rafael J. Wysocki wrote:
> > On Thursday, September 6, 2018 4:38:44 PM CEST Quentin Perret wrote:
> > > Hi Rafael,
> > >
> > > On Thursday 06 Sep 2018 at 11:18:55 (+0200), Rafael J. Wysocki wrote:
> > > > I'm not a particular fan of notifiers to be honest and you don't need
> > > > to add an extra chain just in order to be able to register a callback
> > > > from a single user.
> > >
> > > Right. I agree there are alternatives to using notifiers. I used them
> > > because they're existing infrastructure, and because they let me do what
> > > I want without too much trouble, which are two important points.
> > >
> > > > That can be achieved with a single callback
> > > > pointer too, but also you could just call a function exported by the
> > > > scheduler directly from where in the cpufreq code it needs to be
> > > > called.
> > >
> > > Are you thinking about something comparable to what is done in
> > > cpufreq_add_update_util_hook() (kernel/sched/cpufreq.c) for example ?
> > > That would probably have the same drawback as my current implementation,
> > > that is that the scheduler is notified of _all_ governor changes, not
> > > only changes to/from sugov although this is the only thing we care about
> > > for EAS.
> >
> > Well, why don't you implement it as something like "if the governor changes
> > from sugov to something else (or the other way around), call this function
> > from the scheduler"?
>
> I just gave it a try and ended up with the diff below. It's basically
> the exact same patch with a direct function call instead of a notifier.
> (I also tried the sugov_start/stop thing I keep mentioning but it is
> more complex, so let's see if the simplest solution could work first).
>
> What do you think ?

This generally works for me from the cpufreq perspective, but I would
add "cpufreq" to the name of the new function, that is call it
something like sched_cpufreq_governor_change().

Also do you really need the extra work item? Governor changes are
carried out in process context anyway.

Thanks,
Rafael

2018-09-10 08:26:22

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 13/14] sched/topology: Make Energy Aware Scheduling depend on schedutil

Hi Rafael,

On Sunday 09 Sep 2018 at 22:13:52 (+0200), Rafael J. Wysocki wrote:
> On Fri, Sep 7, 2018 at 5:29 PM Quentin Perret <[email protected]> wrote:
> > On Friday 07 Sep 2018 at 10:52:01 (+0200), Rafael J. Wysocki wrote:
> > > Well, why don't you implement it as something like "if the governor changes
> > > from sugov to something else (or the other way around), call this function
> > > from the scheduler"?
> >
> > I just gave it a try and ended up with the diff below. It's basically
> > the exact same patch with a direct function call instead of a notifier.
> > (I also tried the sugov_start/stop thing I keep mentioning but it is
> > more complex, so let's see if the simplest solution could work first).
> >
> > What do you think ?
>
> This generally works for me from the cpufreq perspective, but I would
> add "cpufreq" to the name of the new function, that is call it
> something like sched_cpufreq_governor_change().

Ok, no problem.

> Also do you really need the extra work item? Governor changes are
> carried out in process context anyway.

Ah, good point, I can remove that. I just tried and got the following
lock warning on boot, though:

[ 2.518684] ============================================
[ 2.523942] WARNING: possible recursive locking detected
[ 2.529200] 4.18.0-rc6-00086-g940e7a9fd5ec #10 Not tainted
[ 2.534630] --------------------------------------------
[ 2.539888] kworker/2:3/1349 is trying to acquire lock:
[ 2.545059] (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: rebuild_sched_domains_locked+0x2c/0x598
[ 2.554559]
[ 2.554559] but task is already holding lock:
[ 2.560332] (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: cpufreq_register_driver+0x80/0x1d0
[ 2.569396]
[ 2.569396] other info that might help us debug this:
[ 2.575858] Possible unsafe locking scenario:
[ 2.575858]
[ 2.581717] CPU0
[ 2.584135] ----
[ 2.586553] lock(cpu_hotplug_lock.rw_sem);
[ 2.590785] lock(cpu_hotplug_lock.rw_sem);
[ 2.595017]
[ 2.595017] *** DEADLOCK ***
[ 2.595017]
[ 2.600877] May be due to missing lock nesting notation

That seems to happen because cpufreq_register_driver() calls
cpus_read_lock(), which is then called again by rebuild_sched_domains()
down the line. So it might just be a missing lock nesting notation as
the warning suggests ?

I'll have a look.

Thanks,
Quentin

2018-09-10 08:57:44

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v6 13/14] sched/topology: Make Energy Aware Scheduling depend on schedutil

On Mon, Sep 10, 2018 at 10:24 AM Quentin Perret <[email protected]> wrote:
>
> Hi Rafael,
>
> On Sunday 09 Sep 2018 at 22:13:52 (+0200), Rafael J. Wysocki wrote:
> > On Fri, Sep 7, 2018 at 5:29 PM Quentin Perret <[email protected]> wrote:
> > > On Friday 07 Sep 2018 at 10:52:01 (+0200), Rafael J. Wysocki wrote:
> > > > Well, why don't you implement it as something like "if the governor changes
> > > > from sugov to something else (or the other way around), call this function
> > > > from the scheduler"?
> > >
> > > I just gave it a try and ended up with the diff below. It's basically
> > > the exact same patch with a direct function call instead of a notifier.
> > > (I also tried the sugov_start/stop thing I keep mentioning but it is
> > > more complex, so let's see if the simplest solution could work first).
> > >
> > > What do you think ?
> >
> > This generally works for me from the cpufreq perspective, but I would
> > add "cpufreq" to the name of the new function, that is call it
> > something like sched_cpufreq_governor_change().
>
> Ok, no problem.
>
> > Also do you really need the extra work item? Governor changes are
> > carried out in process context anyway.
>
> Ah, good point, I can remove that. I just tried and got the following
> lock warning on boot, though:
>
> [ 2.518684] ============================================
> [ 2.523942] WARNING: possible recursive locking detected
> [ 2.529200] 4.18.0-rc6-00086-g940e7a9fd5ec #10 Not tainted
> [ 2.534630] --------------------------------------------
> [ 2.539888] kworker/2:3/1349 is trying to acquire lock:
> [ 2.545059] (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: rebuild_sched_domains_locked+0x2c/0x598
> [ 2.554559]
> [ 2.554559] but task is already holding lock:
> [ 2.560332] (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: cpufreq_register_driver+0x80/0x1d0
> [ 2.569396]
> [ 2.569396] other info that might help us debug this:
> [ 2.575858] Possible unsafe locking scenario:
> [ 2.575858]
> [ 2.581717] CPU0
> [ 2.584135] ----
> [ 2.586553] lock(cpu_hotplug_lock.rw_sem);
> [ 2.590785] lock(cpu_hotplug_lock.rw_sem);
> [ 2.595017]
> [ 2.595017] *** DEADLOCK ***
> [ 2.595017]
> [ 2.600877] May be due to missing lock nesting notation
>
> That seems to happen because cpufreq_register_driver() calls
> cpus_read_lock(), which is then called again by rebuild_sched_domains()
> down the line. So it might just be a missing lock nesting notation as
> the warning suggests ?
>
> I'll have a look.

It is only nested in the _register_driver() code path; otherwise it may not be.

Using the work item may be the most straightforward way to deal with
that, but then I would add a comment to explain what's up.

Thanks,
Rafael

2018-09-10 09:16:48

by Rafael J. Wysocki

Subject: Re: [PATCH v6 00/14] Energy Aware Scheduling

On Monday, August 20, 2018 11:44:06 AM CEST Quentin Perret wrote:
> This patch series introduces Energy Aware Scheduling (EAS) for CFS tasks
> on platforms with asymmetric CPU topologies (e.g. Arm big.LITTLE).
>
> For more details about the ideas behind it and the overall design,
> please refer to the cover letter of version 5 [1].
>
>
> 1. Version History
> ------------------
>
> Changes v5[1]->v6:
> - Rebased on Peter’s sched/core branch (that includes Morten's misfit
> patches [2] and the automatic detection of SD_ASYM_CPUCAPACITY [3])
> - Removed patch 13/14 (not needed with the automatic flag detection)
> - Added patch creating a dependency between sugov and EAS
> - Renamed frequency domains to performance domains to avoid creating too
> deep assumptions in the code about the HW
> - Renamed the sd_ea shortcut to sd_asym_cpucapacity
> - Added comment to explain why new tasks are not accounted when
> detecting the 'overutilized' flag
> - Added comment explaining why forkees don’t go in
> find_energy_efficient_cpu()
>
> Changes v4[4]->v5:
> - Removed the RCU protection of the EM tables and the associated
> need for em_rescale_cpu_capacity().
> - Factorized schedutil’s PELT aggregation function with EAS
> - Improved comments/doc in the EM framework
> - Added check on the uarch of CPUs in one fd in the EM framework
> - Reduced CONFIG_ENERGY_MODEL ifdefery in kernel/sched/topology.c
> - Cleaned-up update_sg_lb_stats parameters
> - Improved comments in compute_energy() to explain the multi-rd
> scenarios
>
> Changes v3[5]->v4:
> - Replaced spinlock in EM framework by smp_store_release/READ_ONCE
> - Fixed missing locks to protect rcu_assign_pointer in EM framework
> - Fixed capacity calculation in EM framework on 32-bit systems
> - Fixed compilation issue for CONFIG_ENERGY_MODEL=n
> - Removed cpumask from struct em_freq_domain, now dynamically allocated
> - Power costs of the EM are specified in milliwatts
> - Added example of CPUFreq driver modification
> - Added doc/comments in the EM framework and better commit header
> - Fixed integration issue with util_est in cpu_util_next()
> - Changed scheduler topology code to have one freq. dom. list per rd
> - Split sched topology patch in smaller patches
> - Added doc/comments explaining the heuristic in the wake-up path
> - Changed energy threshold for migration from 1.5% to 6%
>
> Changes v2[6]->v3:
> - Removed the PM_OPP dependency by implementing a new EM framework
> - Modified the scheduler topology code to take references on the EM data
> structures
> - Simplified the overutilization mechanism into a system-wide flag
> - Reworked the integration in the wake-up path using the sd_ea shortcut
> - Rebased on tip/sched/core (247f2f6f3c70 "sched/core: Don't schedule
> threads on pre-empted vCPUs")
>
> Changes v1[7]->v2:
> - Reworked interface between fair.c and energy.[ch] (Remove #ifdef
> CONFIG_PM_OPP from energy.c) (Greg KH)
> - Fixed licence & header issue in energy.[ch] (Greg KH)
> - Reordered EAS path in select_task_rq_fair() (Joel)
> - Avoid prev_cpu if not allowed in select_task_rq_fair() (Morten/Joel)
> - Refactored compute_energy() (Patrick)
> - Account for RT/IRQ pressure in task_fits() (Patrick)
> - Use UTIL_EST and DL utilization during OPP estimation (Patrick/Juri)
> - Optimize selection of CPU candidates in the energy-aware wake-up path
> - Rebased on top of tip/sched/core (commit b720342849fe “sched/core:
> Update Preempt_notifier_key to modern API”)
>
>
> 2. Test results
> ---------------
>
> Two fundamentally different tests were executed. Firstly the energy test
> case shows the impact on energy consumption this patch-set has using a
> synthetic set of tasks. Secondly the performance test case provides the
> conventional hackbench metric numbers.
>
> The tests run on two arm64 big.LITTLE platforms: Hikey960 (4xA73 +
> 4xA53) and Juno r0 (2xA57 + 4xA53).
>
> Base kernel is tip/sched/core (4.18-rc5), with some Hikey960 and Juno
> specific patches, the SD_ASYM_CPUCAPACITY flag set at DIE sched domain
> level for arm64 and schedutil as cpufreq governor [8].
>
> 2.1 Energy test case
>
> 10 iterations of between 10 and 50 periodic rt-app tasks (16ms period,
> 5% duty-cycle) for 30 seconds with energy measurement. Unit is Joules.
> The goal is to save energy, so lower is better.
>
> 2.1.1 Hikey960
>
> Energy is measured with an ACME Cape on an instrumented board. Numbers
> include consumption of big and little CPUs, LPDDR memory, GPU and most
> of the other small components on the board. They do not include
> consumption of the radio chip (turned-off anyway) and external
> connectors.
>
> +----------+-----------------+-------------------------+
> | | Without patches | With patches |
> +----------+--------+--------+------------------+------+
> | Tasks nb | Mean | RSD* | Mean | RSD* |
> +----------+--------+--------+------------------+------+
> | 10 | 34.33 | 4.8% | 30.51 (-11.13%) | 6.4% |
> | 20 | 52.84 | 1.9% | 44.15 (-16.45%) | 2.0% |
> | 30 | 66.20 | 1.8% | 60.14 (-9.15%) | 4.8% |
> | 40 | 90.83 | 2.5% | 86.91 (-4.32%) | 2.7% |
> | 50 | 136.76 | 4.6% | 108.90 (-20.37%) | 4.7% |
> +----------+--------+--------+------------------+------+
>
> 2.1.2 Juno r0
>
> Energy is measured with the onboard energy meter. Numbers include
> consumption of big and little CPUs.
>
> +----------+-----------------+------------------------+
> | | Without patches | With patches |
> +----------+--------+--------+-----------------+------+
> | Tasks nb | Mean | RSD* | Mean | RSD* |
> +----------+--------+--------+-----------------+------+
> | 10 | 11.48 | 3.2% | 8.09 (-29.53%) | 3.1% |
> | 20 | 20.84 | 3.4% | 14.38 (-31.00%) | 1.1% |
> | 30 | 32.94 | 3.2% | 23.97 (-27.23%) | 1.0% |
> | 40 | 46.05 | 0.5% | 37.82 (-17.87%) | 6.2% |
> | 50 | 57.25 | 0.5% | 55.30 ( -3.41%) | 0.5% |
> +----------+--------+--------+-----------------+------+
>
>
> 2.2 Performance test case
>
> 30 iterations of perf bench sched messaging --pipe --thread --group G
> --loop L with G=[1 2 4 8] and L=50000 (Hikey960)/16000 (Juno r0).
>
> 2.2.1 Hikey960
>
> The impact of thermal capping was mitigated thanks to a heatsink, a
> fan, and a 30 sec delay between two successive executions. IPA is
> disabled to reduce the stddev.
>
> +----------------+-----------------+------------------------+
> | | Without patches | With patches |
> +--------+-------+---------+-------+----------------+-------+
> | Groups | Tasks | Mean | RSD* | Mean | RSD* |
> +--------+-------+---------+-------+----------------+-------+
> | 1 | 40 | 8.04 | 0.88% | 8.22 (+2.31%) | 1.76% |
> | 2 | 80 | 14.78 | 0.67% | 14.83 (+0.35%) | 0.59% |
> | 4 | 160 | 30.92 | 0.57% | 30.95 (+0.09%) | 0.51% |
> | 8 | 320 | 65.54 | 0.32% | 65.57 (+0.04%) | 0.46% |
> +--------+-------+---------+-------+----------------+-------+
>
> 2.2.2 Juno r0
>
> +----------------+-----------------+-----------------------+
> | | Without patches | With patches |
> +--------+-------+---------+-------+---------------+-------+
> | Groups | Tasks | Mean | RSD* | Mean | RSD* |
> +--------+-------+---------+-------+---------------+-------+
> | 1 | 40 | 7.74 | 0.13% | 7.82 (0.01%) | 0.12% |
> | 2 | 80 | 14.27 | 0.15% | 14.27 (0.00%) | 0.14% |
> | 4 | 160 | 27.07 | 0.35% | 26.96 (0.00%) | 0.18% |
> | 8 | 320 | 55.14 | 1.81% | 55.21 (0.00%) | 1.29% |
> +--------+-------+---------+-------+---------------+-------+
>
> *RSD: Relative Standard Deviation (std dev / mean)
>
>
> [1] https://marc.info/?l=linux-pm&m=153243513908731&w=2
> [2] https://marc.info/?l=linux-kernel&m=153069968022982&w=2
> [3] https://marc.info/?l=linux-kernel&m=153209362826476&w=2
> [4] https://marc.info/?l=linux-kernel&m=153018606728533&w=2
> [5] https://marc.info/?l=linux-kernel&m=152691273111941&w=2
> [6] https://marc.info/?l=linux-kernel&m=152302902427143&w=2
> [7] https://marc.info/?l=linux-kernel&m=152153905805048&w=2
> [8] http://www.linux-arm.org/git?p=linux-qp.git;a=shortlog;h=refs/heads/upstream/eas_v6
>
> Morten Rasmussen (1):
> sched: Add over-utilization/tipping point indicator
>
> Quentin Perret (13):
> sched: Relocate arch_scale_cpu_capacity
> sched/cpufreq: Factor out utilization to frequency mapping
> PM: Introduce an Energy Model management framework
> PM / EM: Expose the Energy Model in sysfs
> sched/topology: Reference the Energy Model of CPUs when available
> sched/topology: Lowest CPU asymmetry sched_domain level pointer
> sched/topology: Introduce sched_energy_present static key
> sched/fair: Clean-up update_sg_lb_stats parameters
> sched/cpufreq: Refactor the utilization aggregation method
> sched/fair: Introduce an energy estimation helper function
> sched/fair: Select an energy-efficient CPU on task wake-up
> sched/topology: Make Energy Aware Scheduling depend on schedutil
> OPTIONAL: cpufreq: dt: Register an Energy Model
>
> drivers/cpufreq/cpufreq-dt.c | 45 ++++-
> drivers/cpufreq/cpufreq.c | 4 +
> include/linux/cpufreq.h | 1 +
> include/linux/energy_model.h | 162 +++++++++++++++++
> include/linux/sched/cpufreq.h | 6 +
> include/linux/sched/topology.h | 19 ++
> kernel/power/Kconfig | 15 ++
> kernel/power/Makefile | 2 +
> kernel/power/energy_model.c | 289 +++++++++++++++++++++++++++++
> kernel/sched/cpufreq_schedutil.c | 136 ++++++++++----
> kernel/sched/fair.c | 301 ++++++++++++++++++++++++++++---
> kernel/sched/sched.h | 65 ++++---
> kernel/sched/topology.c | 231 +++++++++++++++++++++++-
> 13 files changed, 1195 insertions(+), 81 deletions(-)
> create mode 100644 include/linux/energy_model.h
> create mode 100644 kernel/power/energy_model.c

I have looked at all of the patches in the series now and I don't really
have any major objections from the cpufreq (and generally PM) perspective.

There are some points of concern here and there, but they are mostly details
and things I would do differently; as a whole this looks mostly OK to me.

I will reply to the individual patches where there are issues in my view.

Thanks,
Rafael



2018-09-10 09:33:47

by Rafael J. Wysocki

Subject: Re: [PATCH v6 02/14] sched/cpufreq: Factor out utilization to frequency mapping

On Monday, August 20, 2018 11:44:08 AM CEST Quentin Perret wrote:
> The schedutil governor maps utilization values to frequencies by applying
> a 25% margin. Since this sort of mapping mechanism can be needed by other
> users (i.e. EAS), factor the utilization-to-frequency mapping code out
> of schedutil and move it to include/linux/sched/cpufreq.h to avoid code
> duplication. The new map_util_freq() function is inlined to avoid
> overheads.

There really is more to this change than it seems at face value IMO,
because the formula in question represents the policy part of the governor.

Namely, it is roughly equivalent to the assumption that the governor will
attempt to "stabilize" the "busy" time of the CPU at 80% of its total time
(if possible for the given "raw" utilization and performance limits).
If you take that formula into EAS, it will cause EAS to implicitly require
the cpufreq governor to follow this assumption.

Now, I know that you want to let EAS only work with the schedutil governor,
which is consistent with the above, but the implicit assumption here should
at least be documented somehow IMO.

> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> include/linux/sched/cpufreq.h | 6 ++++++
> kernel/sched/cpufreq_schedutil.c | 3 ++-
> 2 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index 59667444669f..afa940cd50dc 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -20,6 +20,12 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
> void (*func)(struct update_util_data *data, u64 time,
> unsigned int flags));
> void cpufreq_remove_update_util_hook(int cpu);
> +
> +static inline unsigned long map_util_freq(unsigned long util,
> + unsigned long freq, unsigned long cap)
> +{
> + return (freq + (freq >> 2)) * util / cap;
> +}
> #endif /* CONFIG_CPU_FREQ */
>
> #endif /* _LINUX_SCHED_CPUFREQ_H */
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 3fffad3bc8a8..f5206c950143 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -13,6 +13,7 @@
>
> #include "sched.h"
>
> +#include <linux/sched/cpufreq.h>
> #include <trace/events/power.h>
>
> struct sugov_tunables {
> @@ -167,7 +168,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> unsigned int freq = arch_scale_freq_invariant() ?
> policy->cpuinfo.max_freq : policy->cur;
>
> - freq = (freq + (freq >> 2)) * util / max;
> + freq = map_util_freq(util, freq, max);
>
> if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> return sg_policy->next_freq;
>



2018-09-10 09:45:17

by Quentin Perret

Subject: Re: [PATCH v6 13/14] sched/topology: Make Energy Aware Scheduling depend on schedutil

On Monday 10 Sep 2018 at 10:55:43 (+0200), Rafael J. Wysocki wrote:
> On Mon, Sep 10, 2018 at 10:24 AM Quentin Perret <[email protected]> wrote:
> >
> > Hi Rafael,
> >
> > On Sunday 09 Sep 2018 at 22:13:52 (+0200), Rafael J. Wysocki wrote:
> > > On Fri, Sep 7, 2018 at 5:29 PM Quentin Perret <[email protected]> wrote:
> > > > On Friday 07 Sep 2018 at 10:52:01 (+0200), Rafael J. Wysocki wrote:
> > > > > Well, why don't you implement it as something like "if the governor changes
> > > > > from sugov to something else (or the other way around), call this function
> > > > > from the scheduler"?
> > > >
> > > > I just gave it a try and ended up with the diff below. It's basically
> > > > the exact same patch with a direct function call instead of a notifier.
> > > > (I also tried the sugov_start/stop thing I keep mentioning but it is
> > > > more complex, so let's see if the simplest solution could work first).
> > > >
> > > > What do you think ?
> > >
> > > This generally works for me from the cpufreq perspective, but I would
> > > add "cpufreq" to the name of the new function, that is call it
> > > something like sched_cpufreq_governor_change().
> >
> > Ok, no problem.
> >
> > > Also do you really need the extra work item? Governor changes are
> > > carried out in process context anyway.
> >
> > Ah, good point, I can remove that. I just tried and got the following
> > lock warning on boot, though:
> >
> > [ 2.518684] ============================================
> > [ 2.523942] WARNING: possible recursive locking detected
> > [ 2.529200] 4.18.0-rc6-00086-g940e7a9fd5ec #10 Not tainted
> > [ 2.534630] --------------------------------------------
> > [ 2.539888] kworker/2:3/1349 is trying to acquire lock:
> > [ 2.545059] (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: rebuild_sched_domains_locked+0x2c/0x598
> > [ 2.554559]
> > [ 2.554559] but task is already holding lock:
> > [ 2.560332] (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: cpufreq_register_driver+0x80/0x1d0
> > [ 2.569396]
> > [ 2.569396] other info that might help us debug this:
> > [ 2.575858] Possible unsafe locking scenario:
> > [ 2.575858]
> > [ 2.581717] CPU0
> > [ 2.584135] ----
> > [ 2.586553] lock(cpu_hotplug_lock.rw_sem);
> > [ 2.590785] lock(cpu_hotplug_lock.rw_sem);
> > [ 2.595017]
> > [ 2.595017] *** DEADLOCK ***
> > [ 2.595017]
> > [ 2.600877] May be due to missing lock nesting notation
> >
> > That seems to happen because cpufreq_register_driver() calls
> > cpus_read_lock(), which is then called again by rebuild_sched_domains()
> > down the line. So it might just be a missing lock nesting notation as
> > the warning suggests ?
> >
> > I'll have a look.
>
> It only is nested in the _register_driver() code path, otherwise it may not be.

Right.

> Using the work item may be the most straightforward way to deal with
> that, but then I would add a comment to explain what's up.

Indeed, rw_sems don't seem to be appropriate for nested locking:
https://elixir.bootlin.com/linux/latest/source/include/linux/rwsem.h#L156

I'll stick a comment explaining that for now, unless I find a better
idea than using a work item in the meantime.

Thanks,
Quentin

2018-09-10 09:49:07

by Rafael J. Wysocki

Subject: Re: [PATCH v6 03/14] PM: Introduce an Energy Model management framework

On Monday, August 20, 2018 11:44:09 AM CEST Quentin Perret wrote:
> Several subsystems in the kernel (task scheduler and/or thermal at the
> time of writing) can benefit from knowing about the energy consumed by
> CPUs. Yet, this information can come from different sources (DT or
> firmware for example), in different formats, hence making it hard to
> exploit without a standard API.
>
> As an attempt to address this, introduce a centralized Energy Model
> (EM) management framework which aggregates the power values provided
> by drivers into a table for each performance domain in the system. The
> power cost tables are made available to interested clients (e.g. task
> scheduler or thermal) via platform-agnostic APIs. The overall design
> is represented by the diagram below (focused on Arm-related drivers as
> an example, but applicable to any architecture):
>
> +---------------+ +-----------------+ +-------------+
> | Thermal (IPA) | | Scheduler (EAS) | | Other |
> +---------------+ +-----------------+ +-------------+
> | | em_pd_energy() |
> | | em_cpu_get() |
> +-----------+ | +----------+
> | | |
> v v v
> +---------------------+
> | |
> | Energy Model |
> | |
> | Framework |
> | |
> +---------------------+
> ^ ^ ^
> | | | em_register_perf_domain()
> +----------+ | +---------+
> | | |
> +---------------+ +---------------+ +--------------+
> | cpufreq-dt | | arm_scmi | | Other |
> +---------------+ +---------------+ +--------------+
> ^ ^ ^
> | | |
> +--------------+ +---------------+ +--------------+
> | Device Tree | | Firmware | | ? |
> +--------------+ +---------------+ +--------------+
>
> Drivers (typically, but not limited to, CPUFreq drivers) can register
> data in the EM framework using the em_register_perf_domain() API. The
> calling driver must provide a callback function with a standardized
> signature that will be used by the EM framework to build the power
> cost tables of the performance domain. This design should offer a lot of
> flexibility to calling drivers, which are free to read information
> from any location and to use any technique to compute power costs.
> Moreover, the capacity states registered by drivers in the EM framework
> are not required to match real performance states of the target. This
> is particularly important on targets where the performance states are
> not known by the OS.
>
> The power cost coefficients managed by the EM framework are specified in
> milli-watts. Although the two potential users of those coefficients (IPA
> and EAS) only need relative correctness, IPA specifically needs to
> compare the power of CPUs with the power of other components (GPUs, for
> example), which are still expressed in absolute terms in their
> respective subsystems. Hence, specifying the power of CPUs in
> milli-watts should help transitioning IPA to using the EM framework
> without introducing new problems by keeping units comparable across
> sub-systems.
> In the longer term, the EMs of devices other than CPUs could also be
> managed by the EM framework, which would make it possible to remove the
> absolute unit. However, this is not strictly required as a first step,
> so this extension of the EM framework is left for later.
>
> On the client side, the EM framework offers APIs to access the power
> cost tables of a CPU (em_cpu_get()), and to estimate the energy
> consumed by the CPUs of a performance domain (em_pd_energy()). Clients
> such as the task scheduler can then use these APIs to access the shared
> data structures holding the Energy Model of CPUs.
>
> Cc: Peter Zijlstra <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> include/linux/energy_model.h | 161 ++++++++++++++++++++++++++++
> kernel/power/Kconfig | 15 +++
> kernel/power/Makefile | 2 +
> kernel/power/energy_model.c | 199 +++++++++++++++++++++++++++++++++++
> 4 files changed, 377 insertions(+)
> create mode 100644 include/linux/energy_model.h
> create mode 100644 kernel/power/energy_model.c
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> new file mode 100644
> index 000000000000..b89b5596c976
> --- /dev/null
> +++ b/include/linux/energy_model.h
> @@ -0,0 +1,161 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_ENERGY_MODEL_H
> +#define _LINUX_ENERGY_MODEL_H
> +#include <linux/cpumask.h>
> +#include <linux/jump_label.h>
> +#include <linux/kobject.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched/cpufreq.h>
> +#include <linux/sched/topology.h>
> +#include <linux/types.h>
> +
> +#ifdef CONFIG_ENERGY_MODEL

A kerneldoc comment would be useful here IMO.

> +struct em_cap_state {
> + unsigned long frequency; /* Kilo-hertz */

I wonder if the "frequency" field here could be changed into something a bit
more abstract like "level" or similar?

The reason why is because in some cases we may end up with somewhat artificial
values of "frequency" like when the intel_pstate driver is in use (it uses
abstract "p-state" values internally and only produces "frequency" numbers for
the cpufreq core and the way they are derived from the "p-states" is not always
entirely clean).

The "level" could just be frequency on systems where cpufreq drivers operate on
frequencies directly or something else on the other systems.

> + unsigned long power; /* Milli-watts */
> + unsigned long cost; /* power * max_frequency / frequency */
> +};
> +

Like above, a kerneldoc comment documenting the structure below would be useful.

> +struct em_perf_domain {
> + struct em_cap_state *table; /* Capacity states, in ascending order. */
> + int nr_cap_states;
> + unsigned long cpus[0]; /* CPUs of the frequency domain. */
> +};
> +
> +#define EM_CPU_MAX_POWER 0xFFFF
> +
> +struct em_data_callback {
> + /**
> + * active_power() - Provide power at the next capacity state of a CPU
> + * @power : Active power at the capacity state in mW (modified)
> + * @freq : Frequency at the capacity state in kHz (modified)
> + * @cpu : CPU for which we do this operation
> + *
> + * active_power() must find the lowest capacity state of 'cpu' above
> + * 'freq' and update 'power' and 'freq' to the matching active power
> + * and frequency.
> + *
> + * The power is the one of a single CPU in the domain, expressed in
> + * milli-watts. It is expected to fit in the [0, EM_CPU_MAX_POWER]
> + * range.
> + *
> + * Return 0 on success.
> + */
> + int (*active_power)(unsigned long *power, unsigned long *freq, int cpu);
> +};
> +#define EM_DATA_CB(_active_power_cb) { .active_power = &_active_power_cb }
> +
> +struct em_perf_domain *em_cpu_get(int cpu);
> +int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
> + struct em_data_callback *cb);
> +
> +/**
> + * em_pd_energy() - Estimates the energy consumed by the CPUs of a perf. domain
> + * @pd : performance domain for which energy has to be estimated
> + * @max_util : highest utilization among CPUs of the domain
> + * @sum_util : sum of the utilization of all CPUs in the domain
> + *
> + * Return: the sum of the energy consumed by the CPUs of the domain assuming
> + * a capacity state satisfying the max utilization of the domain.

Well, this confuses energy with power AFAICS. The comment talks about energy,
but the return value is in the units of power.

I guess this assumes constant power over the next scheduling interval, which is
why energy and power can be treated as equivalent here, but that needs to be
clarified as it is somewhat confusing right now.

> + */
> +static inline unsigned long em_pd_energy(struct em_perf_domain *pd,
> + unsigned long max_util, unsigned long sum_util)
> +{
> + unsigned long freq, scale_cpu;
> + struct em_cap_state *cs;
> + int i, cpu;
> +
> + /*
> + * In order to predict the capacity state, map the utilization of the
> + * most utilized CPU of the performance domain to a requested frequency,
> + * like schedutil.
> + */
> + cpu = cpumask_first(to_cpumask(pd->cpus));
> + scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> + cs = &pd->table[pd->nr_cap_states - 1];
> + freq = map_util_freq(max_util, cs->frequency, scale_cpu);
> +
> + /*
> + * Find the lowest capacity state of the Energy Model above the
> + * requested frequency.
> + */
> + for (i = 0; i < pd->nr_cap_states; i++) {
> + cs = &pd->table[i];
> + if (cs->frequency >= freq)
> + break;
> + }
> +
> + /*
> + * The capacity of a CPU in the domain at that capacity state (cs)
> + * can be computed as:
> + *
> + * cs->freq * scale_cpu
> + * cs->cap = -------------------- (1)
> + * cpu_max_freq
> + *
> + * So, the energy consumed by this CPU at that capacity state is:
> + *
> + * cs->power * cpu_util
> + * cpu_nrg = -------------------- (2)
> + * cs->cap
> + *
> + * since 'cpu_util / cs->cap' represents its percentage of busy time.
> + * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
> + * of two terms:
> + *
> + * cs->power * cpu_max_freq cpu_util
> + * cpu_nrg = ------------------------ * --------- (3)
> + * cs->freq scale_cpu
> + *
> + * The first term is static, and is stored in the em_cap_state struct
> + * as 'cs->cost'.
> + *
> + * Since all CPUs of the domain have the same micro-architecture, they
> + * share the same 'cs->cost', and the same CPU capacity. Hence, the
> + * total energy of the domain (which is the simple sum of the energy of
> + * all of its CPUs) can be factorized as:
> + *
> + * cs->cost * \Sum cpu_util
> + * pd_nrg = ------------------------ (4)
> + * scale_cpu
> + */
> + return cs->cost * sum_util / scale_cpu;
> +}


2018-09-10 09:58:00

by Rafael J. Wysocki

Subject: Re: [PATCH v6 10/14] sched/cpufreq: Refactor the utilization aggregation method

On Monday, August 20, 2018 11:44:16 AM CEST Quentin Perret wrote:
> Schedutil aggregates the PELT signals of CFS, RT, DL and IRQ in order
> to decide which frequency to request. Energy Aware Scheduling (EAS)
> needs to be able to predict those requests to assess the energy impact
> of scheduling decisions. However, the PELT signals aggregation is only
> done in schedutil for now, hence making it hard to synchronize it with
> EAS.
>
> To address this issue, introduce schedutil_freq_util() to perform the
> aforementioned aggregation and make it available to other parts of the
> scheduler. Since frequency selection and energy estimation still need
> to deal with RT and DL signals slightly differently, schedutil_freq_util()
> is called with a different 'type' parameter in those two contexts, and
> returns an aggregated utilization signal accordingly.

This is complementary to patch [02/14] IMO.

schedutil_freq_util() and map_util_freq() introduced by that patch should
always be used together as they are two parts of one algorithm in my view.

Would it be possible to make that clearer?

Thanks,
Rafael


2018-09-10 10:09:05

by Quentin Perret

Subject: Re: [PATCH v6 10/14] sched/cpufreq: Refactor the utilization aggregation method

On Monday 10 Sep 2018 at 11:53:58 (+0200), Rafael J. Wysocki wrote:
> On Monday, August 20, 2018 11:44:16 AM CEST Quentin Perret wrote:
> > Schedutil aggregates the PELT signals of CFS, RT, DL and IRQ in order
> > to decide which frequency to request. Energy Aware Scheduling (EAS)
> > needs to be able to predict those requests to assess the energy impact
> > of scheduling decisions. However, the PELT signals aggregation is only
> > done in schedutil for now, hence making it hard to synchronize it with
> > EAS.
> >
> > To address this issue, introduce schedutil_freq_util() to perform the
> > aforementioned aggregation and make it available to other parts of the
> > scheduler. Since frequency selection and energy estimation still need
> > to deal with RT and DL signals slightly differently, schedutil_freq_util()
> > is called with a different 'type' parameter in those two contexts, and
> > returns an aggregated utilization signal accordingly.
>
> This is complementary to patch [02/14] IMO.
>
> schedutil_freq_util() and map_util_freq() introduced by that patch should
> always be used together as they are two parts of one algorithm in my view.

I agree.

> Would it be possible to make that clearer?

I could squash the two at the beginning of the series in a preparatory
patch that refactors schedutil for EAS all in one go, with a clear
mention of what we intend to do (make EAS depend on sugov) in the commit
message.

Would that work ?

Thanks,
Quentin

2018-09-10 10:29:10

by Rafael J. Wysocki

Subject: Re: [PATCH v6 10/14] sched/cpufreq: Refactor the utilization aggregation method

On Monday, September 10, 2018 12:07:21 PM CEST Quentin Perret wrote:
> On Monday 10 Sep 2018 at 11:53:58 (+0200), Rafael J. Wysocki wrote:
> > On Monday, August 20, 2018 11:44:16 AM CEST Quentin Perret wrote:
> > > Schedutil aggregates the PELT signals of CFS, RT, DL and IRQ in order
> > > to decide which frequency to request. Energy Aware Scheduling (EAS)
> > > needs to be able to predict those requests to assess the energy impact
> > > of scheduling decisions. However, the PELT signals aggregation is only
> > > done in schedutil for now, hence making it hard to synchronize it with
> > > EAS.
> > >
> > > To address this issue, introduce schedutil_freq_util() to perform the
> > > aforementioned aggregation and make it available to other parts of the
> > > scheduler. Since frequency selection and energy estimation still need
> > > to deal with RT and DL signals slightly differently, schedutil_freq_util()
> > > is called with a different 'type' parameter in those two contexts, and
> > > returns an aggregated utilization signal accordingly.
> >
> > This is complementary to patch [02/14] IMO.
> >
> > schedutil_freq_util() and map_util_freq() introduced by that patch should
> > always be used together as they are two parts of one algorithm in my view.
>
> I agree.
>
> > Would it be possible to make that clearer?
>
> I could squash the two at the beginning of the series in a preparatory
> patch that refactors schedutil for EAS all in one go, with a clear
> mention of what we intend to do (make EAS depend on sugov) in the commit
> message.
>
> Would that work?

I think so.

Thanks,
Rafael


2018-09-10 10:42:02

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 03/14] PM: Introduce an Energy Model management framework

On Monday 10 Sep 2018 at 11:44:33 (+0200), Rafael J. Wysocki wrote:
> A kerneldoc comment would be useful here IMO.

OK

> > +struct em_cap_state {
> > + unsigned long frequency; /* Kilo-hertz */
>
> I wonder if the "frequency" field here could be changed into something a bit
> more abstract like "level" or similar?
>
> The reason why is because in some cases we may end up with somewhat artificial
> values of "frequency" like when the intel_pstate driver is in use (it uses
> abstract "p-state" values internally and only produces "frequency" numbers for
> the cpufreq core and the way they are derived from the "p-states" is not always
> entirely clean).
>
> The "level" could just be frequency on systems where cpufreq drivers operate on
> frequencies directly or something else on the other systems.

I see your point (and TBH we start to have the same sort of problems on
Arm) but at this stage I would rather keep this field coherent with
what CPUFreq manages, that is, kHz. The only reason for that is because
the thermal subsystem (IPA) will look at this table to apply a max freq
capping on CPUFreq policies, so things need to be aligned.

I agree that even if the unit of this field wasn't specified we could
still build a system that works just fine. However if things are too
loosely specified, problems are allowed to happen, so they will.

Now, if the CPUFreq core is modified to manipulate abstract performance
levels one day, I'll be happy to change the EM framework the same way :-)

>
> > + unsigned long power; /* Milli-watts */
> > + unsigned long cost; /* power * max_frequency / frequency */
> > +};
> > +
>
> Like above, a kerneldoc comment documenting the structure below would be useful.

OK for that one too.

> > +struct em_perf_domain {
> > + struct em_cap_state *table; /* Capacity states, in ascending order. */
> > + int nr_cap_states;
> > + unsigned long cpus[0]; /* CPUs of the frequency domain. */
> > +};
> > +
> > +#define EM_CPU_MAX_POWER 0xFFFF
> > +
> > +struct em_data_callback {
> > + /**
> > + * active_power() - Provide power at the next capacity state of a CPU
> > + * @power : Active power at the capacity state in mW (modified)
> > + * @freq : Frequency at the capacity state in kHz (modified)
> > + * @cpu : CPU for which we do this operation
> > + *
> > + * active_power() must find the lowest capacity state of 'cpu' above
> > + * 'freq' and update 'power' and 'freq' to the matching active power
> > + * and frequency.
> > + *
> > + * The power is the one of a single CPU in the domain, expressed in
> > + * milli-watts. It is expected to fit in the [0, EM_CPU_MAX_POWER]
> > + * range.
> > + *
> > + * Return 0 on success.
> > + */
> > + int (*active_power)(unsigned long *power, unsigned long *freq, int cpu);
> > +};
> > +#define EM_DATA_CB(_active_power_cb) { .active_power = &_active_power_cb }
> > +
> > +struct em_perf_domain *em_cpu_get(int cpu);
> > +int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
> > + struct em_data_callback *cb);
> > +
> > +/**
> > + * em_pd_energy() - Estimates the energy consumed by the CPUs of a perf. domain
> > + * @pd : performance domain for which energy has to be estimated
> > + * @max_util : highest utilization among CPUs of the domain
> > + * @sum_util : sum of the utilization of all CPUs in the domain
> > + *
> > + * Return: the sum of the energy consumed by the CPUs of the domain assuming
> > + * a capacity state satisfying the max utilization of the domain.
>
> Well, this confuses energy with power AFAICS. The comment talks about energy,
> but the return value is in the units of power.
>
> I guess this assumes constant power over the next scheduling interval, which is
> why energy and power can be treated as equivalent here, but that needs to be
> clarified as it is somewhat confusing right now.

That's right, manipulating power or energy is equivalent here (as in, it
leads to the same conclusions).

I can explain that better in the comment.

Thanks,
Quentin
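To make the power-vs-energy point above concrete, here is a simplified sketch of how such an estimate can be computed from the 'cost' field quoted earlier (cost = power * max_frequency / frequency). This is an illustrative sketch under those assumptions, not the actual em_pd_energy() implementation, and the example table values are hypothetical:

```c
#include <assert.h>

struct em_cap_state {
	unsigned long frequency;	/* kHz */
	unsigned long power;		/* mW */
	unsigned long cost;		/* power * max_frequency / frequency */
};

/* Hypothetical two-state domain, max frequency 1000000 kHz:
 * cost[0] = 100 * 1000000 / 500000  = 200
 * cost[1] = 400 * 1000000 / 1000000 = 400
 */
static const struct em_cap_state example_table[] = {
	{  500000, 100, 200 },
	{ 1000000, 400, 400 },
};

/*
 * Pick the lowest capacity state able to serve max_util, then return
 * cost * sum_util / scale_cpu. Since cost = power * max_freq / freq,
 * this is the state's power scaled by the domain's busy time: a power
 * figure proportional to energy over a fixed interval, which is why
 * comparing these values across candidate task placements is
 * equivalent to comparing energies.
 */
static unsigned long pd_energy_sketch(const struct em_cap_state *table,
				      int nr_cap_states,
				      unsigned long max_util,
				      unsigned long sum_util,
				      unsigned long scale_cpu)
{
	unsigned long freq;
	int i;

	/* Map the max utilization to a required frequency. */
	freq = table[nr_cap_states - 1].frequency * max_util / scale_cpu;

	/* Find the lowest capacity state above that frequency. */
	for (i = 0; i < nr_cap_states - 1; i++) {
		if (table[i].frequency >= freq)
			break;
	}

	return table[i].cost * sum_util / scale_cpu;
}
```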

2018-09-10 10:45:18

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v6 03/14] PM: Introduce an Energy Model management framework

On Monday, September 10, 2018 12:38:05 PM CEST Quentin Perret wrote:
> On Monday 10 Sep 2018 at 11:44:33 (+0200), Rafael J. Wysocki wrote:
> > A kerneldoc comment would be useful here IMO.
>
> OK
>
> > > +struct em_cap_state {
> > > + unsigned long frequency; /* Kilo-hertz */
> >
> > I wonder if the "frequency" field here could be changed into something a bit
> > more abstract like "level" or similar?
> >
> > The reason why is because in some cases we may end up with somewhat artificial
> > values of "frequency" like when the intel_pstate driver is in use (it uses
> > abstract "p-state" values internally and only produces "frequency" numbers for
> > the cpufreq core and the way they are derived from the "p-states" is not always
> > entirely clean).
> >
> > The "level" could just be frequency on systems where cpufreq drivers operate on
> > frequencies directly or something else on the other systems.
>
> I see your point (and TBH we start to have the same sort of problems on
> Arm) but at this stage I would rather keep this field coherent with
> what CPUFreq manages, that is, kHz. The only reason for that is because
> the thermal subsystem (IPA) will look at this table to apply a max freq
> capping on CPUFreq policies, so things need to be aligned.
>
> I agree that even if the unit of this field wasn't specified we could
> still build a system that works just fine. However if things are too
> loosely specified, problems are allowed to happen, so they will.

Fair enough.

> Now, if the CPUFreq core is modified to manipulate abstract performance
> levels one day, I'll be happy to change the EM framework the same way :-)

I don't think this is going to happen any time soon, though.


2018-09-11 09:40:04

by Andrea Parri

[permalink] [raw]
Subject: Re: [PATCH v6 03/14] PM: Introduce an Energy Model management framework

Hi Quentin,

> > 1. use of a single memory barrier
> >
> > Since we are already em_pd_mutex protected, i.e. there cannot be a
> > concurrent writers, we can use one single memory barrier after the
> > loop, i.e.
> >
> > for_each_cpu(cpu, span)
> > WRITE_ONCE()
> > smp_wmb()
> >
> > which should be just enough to ensure that all other CPUs will see
> > the pointer set once we release the mutex
>
> Right, I'm actually wondering if the memory barrier is needed at all ...
> The mutex lock()/unlock() should already ensure the ordering I want, no?
>
> WRITE_ONCE() should prevent load/store tearing with concurrent em_cpu_get(),
> and the release/acquire semantics of mutex lock/unlock should be enough to
> serialize the memory accesses of concurrent em_register_perf_domain() calls
> properly ...
>
> Hmm, let me read memory-barriers.txt again.

FYI, the directory "tools/memory-model/" provides an "automated
memory-barriers.txt": in short, you encode your "memory ordering
questions" into "litmus tests" to be passed to the tool/simulator;
the tool will then answer with "Yes/No" (plus other information).

Some preparation is required to set up and learn how to use the
LKMM tools, but once there, I expect them to be more "efficient"
than reading memory-barriers.txt... ;-) Please don't hesitate
to contact me/the LKMM maintainers if you need help with this.

You'd need some info in order to write down a _well-formed litmus
test, e.g., matching barrier/synchronization and interested memory
accesses on the reader side (IAC, the replacement "store-release
-> store-once+smp_wmb" discussed above is suspicious...).

Andrea

2018-09-11 12:33:39

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 03/14] PM: Introduce an Energy Model management framework

Hi Andrea,

On Tuesday 11 Sep 2018 at 11:34:56 (+0200), Andrea Parri wrote:
> FYI, the directory "tools/memory-model/" provides an "automated
> memory-barriers.txt": in short, you encode your "memory ordering
> questions" into "litmus tests" to be passed to the tool/simulator;
> the tool will then answer with "Yes/No" (plus other information).
>
> Some preparation is required to set up and learn how to use the
> LKMM tools, but once there, I expect them to be more "efficient"
> than reading memory-barriers.txt... ;-)

Thanks for pointing this out, I'll have a look.

> Please don't hesitate
> to contact me/the LKMM maintainers if you need help with this.

And thanks for that too.

> You'd need some info in order to write down a _well-formed litmus
> test, e.g., matching barrier/synchronization and interested memory
> accesses on the reader side (IAC, the replacement "store-release
> -> store-once+smp_wmb" discussed above is suspicious...).

Regarding the discussion above, I was actually planning on removing the
smp_wmb entirely and relying on WRITE_ONCE + mutex_{un}lock here. Do you
see something obviously wrong with that?

I guess the LKMM tools should give me the yes/no answer I want, but if
that's a no, I'd also like to understand why ... :-)

Thanks,
Quentin
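The publication pattern under discussion can be sketched roughly as follows. This is a single-threaded user-space sketch with simplified stand-ins for the kernel's WRITE_ONCE()/READ_ONCE(); the names (publish_pd, get_pd, per_cpu_pd) are hypothetical, and the real cross-CPU ordering guarantee would come from the release/acquire semantics of em_pd_mutex, which this sketch does not reproduce:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified stand-ins for the kernel's WRITE_ONCE/READ_ONCE: a
 * volatile access prevents the compiler from tearing or re-fetching
 * the load/store. These are sketches, not the kernel macros.
 */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct em_perf_domain_sketch {
	int nr_cap_states;
};

/* Hypothetical per-CPU pointer slots for the published domains. */
static struct em_perf_domain_sketch *per_cpu_pd[4];

static struct em_perf_domain_sketch example_pd = { .nr_cap_states = 2 };

/*
 * Writer side: the object is fully initialized before publication.
 * In the real code, writers are serialized by em_pd_mutex, and the
 * release semantics of mutex_unlock() order the initialization before
 * the pointer becomes visible to later mutex holders.
 */
static void publish_pd(struct em_perf_domain_sketch *pd, int cpu)
{
	WRITE_ONCE(per_cpu_pd[cpu], pd);
}

/* Reader side (em_cpu_get()-like): a single marked load of the pointer. */
static struct em_perf_domain_sketch *get_pd(int cpu)
{
	return READ_ONCE(per_cpu_pd[cpu]);
}
```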

2018-09-11 13:33:48

by Andrea Parri

[permalink] [raw]
Subject: Re: [PATCH v6 03/14] PM: Introduce an Energy Model management framework

On Tue, Sep 11, 2018 at 01:32:50PM +0100, Quentin Perret wrote:
> Hi Andrea,
>
> On Tuesday 11 Sep 2018 at 11:34:56 (+0200), Andrea Parri wrote:
> > FYI, the directory "tools/memory-model/" provides an "automated
> > memory-barriers.txt": in short, you encode your "memory ordering
> > questions" into "litmus tests" to be passed to the tool/simulator;
> > the tool will then answer with "Yes/No" (plus other information).
> >
> > Some preparation is required to set up and learn how to use the
> > LKMM tools, but once there, I expect them to be more "efficient"
> > than reading memory-barriers.txt... ;-)
>
> Thanks for pointing this out, I'll have a look.
>
> > Please don't hesitate
> > to contact me/the LKMM maintainers if you need help with this.
>
> And thanks for that too.
>
> > You'd need some info in order to write down a _well-formed litmus
> > test, e.g., matching barrier/synchronization and interested memory
> > accesses on the reader side (IAC, the replacement "store-release
> > -> store-once+smp_wmb" discussed above is suspicious...).
>
> Regarding the discussion above, I was actually planning on removing the
> smp_wmb entirely and relying on WRITE_ONCE + mutex_{un}lock here. Do you
> see something obviously wrong with that?

As said in IRC: nothing I can currently see.


>
> I guess the LKMM tools should give me the yes/no answer I want, but if
> that's a no, I'd also like to understand why ... :-)

That answer would be a little bit more involved ... ;-) The file
Documentation/explanation.txt within the above mentioned directory
could be a good starting point; skimming through litmus-tests/ and
Documentation/recipes.txt could also provide some hints.

Andrea


>
> Thanks,
> Quentin