This is RFC v2 of this proposal (changelog at the end).
Several techniques for saving energy through various scheduler
modifications have been proposed in the past; however, most of these
techniques have not been beneficial across all use-cases and
platforms. For example, consolidating tasks on fewer cpus is an
effective way to save energy on some platforms, while it might make
things worse on others.
This proposal, which is inspired by the Ksummit workshop discussions
last year [1], takes a different approach by using a (relatively) simple
platform energy cost model to guide scheduling decisions. Given
platform-specific cost data, the model can provide an estimate of the
energy implications of scheduling decisions. So instead
of blindly applying scheduling techniques that may or may not work for
the current use-case, the scheduler can make informed energy-aware
decisions. We believe this approach provides a methodology that can be
adapted to any platform, including heterogeneous systems such as ARM
big.LITTLE. The model considers cpus only. Model data includes power
consumption at each P-state, C-state power consumption, and wake-up
energy costs. However, the energy model could potentially be extended to
guide performance/energy decisions in other subsystems.
For example, the scheduler can use energy_diff_task(cpu, task) to
estimate the cost of placing a task on a specific cpu and compare energy
costs of different cpus.
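To illustrate, a hypothetical helper built around energy_diff_task()
could compare the estimated cost across a set of candidate cpus as in
the sketch below. pick_min_energy_cpu() and its calling context are
made up for this example and are not part of the series;
energy_diff_task(), tsk_cpus_allowed() and for_each_cpu_and() are used
as in the patches:

/*
 * Illustrative sketch only: return the candidate cpu with the smallest
 * estimated energy cost delta for placing task p, or -1 if none of the
 * candidates is allowed.
 */
static int pick_min_energy_cpu(struct task_struct *p,
			       const struct cpumask *candidates)
{
	int cpu, best_cpu = -1;
	int nrg_diff, min_nrg_diff = INT_MAX;

	for_each_cpu_and(cpu, candidates, tsk_cpus_allowed(p)) {
		/* Estimated energy delta of adding p's utilization here */
		nrg_diff = energy_diff_task(cpu, p);
		if (nrg_diff < min_nrg_diff) {
			min_nrg_diff = nrg_diff;
			best_cpu = cpu;
		}
	}

	return best_cpu;
}

The energy-aware paths added to select_idle_sibling() and
find_idlest_{group, cpu}() later in the series follow this pattern,
combined with the existing load and idleness checks.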
This is an RFC and there are some loose ends that have not been
addressed here or in the code yet. The model and its infrastructure are
in place in the scheduler and are being used for load-balancing
decisions. They are used in the select_task_rq_fair() path for
fork/exec/wake balancing and to guide the selection of the source cpu
for periodic or idle balance. The latter is still in its early days. There
are quite a few dirty hacks in there to tie things together. To mention
a few current limitations:
1. Due to the lack of scale-invariant cpu and task utilization, it
doesn't work properly with frequency scaling or heterogeneous systems
(big.LITTLE).
2. Platform data for the test platform (ARM TC2) has been hardcoded in
arch/arm/ code (a sketch of what such data looks like follows this list).
3. The most likely idle-state is currently hardcoded to be the shallowest
one; cpuidle integration is still missing.
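As a rough illustration of limitation 2, the hardcoded platform data is
a set of sched_group_energy tables shaped as sketched below. The struct
members follow the documentation patch; the capacity, power and wakeup
energy values here are made-up placeholders, not the real TC2 numbers
provided in the arch/arm/ patch:

/* Hypothetical example only - all numbers are placeholders. */
static struct capacity_state cap_states_cluster_a7[] = {
	/* { capacity, busy power }, ordered low -> high capacity */
	{ .cap = 150, .power = 1000, },
	{ .cap = 300, .power = 3000, },
	{ .cap = 430, .power = 5000, },
};

static struct idle_state idle_states_cluster_a7[] = {
	/* { idle power, run->sleep->run wakeup energy } per idle-state */
	{ .power = 100, .wu_energy = 10, },
	{ .power =   0, .wu_energy = 50, },
};

static struct sched_group_energy energy_cluster_a7 = {
	.nr_cap_states	= ARRAY_SIZE(cap_states_cluster_a7),
	.cap_states	= cap_states_cluster_a7,
	.nr_idle_states	= ARRAY_SIZE(idle_states_cluster_a7),
	.idle_states	= idle_states_cluster_a7,
};

One such table is attached to each sched_group level that should be
considered by the energy model (see the documentation patch for the
meaning of the individual members).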
However, the main ideas and the primary focus of this RFC, the energy
model and energy_diff_{load, task, cpu}(), are there.
Due to limitation 1, the ARM TC2 platform (2xA15+3xA7) was set up with
frequency scaling disabled and frequencies chosen to eliminate the
big.LITTLE performance difference. That basically turns TC2 into an SMP
platform where a subset of the cpus is less energy-efficient.
Tests using a synthetic workload with seven short-running periodic
tasks of different sizes and periods, and the sysbench cpu benchmark
with five threads, gave the following results:
cpu energy*     short tasks     sysbench
Mainline        100             100
EA              49              99
* Note that these energy savings are _not_ representative of what can be
achieved on a true SMP platform where all cpus are equally
energy-efficient. There should be a benefit for SMP platforms as well;
however, it will be smaller.
The energy model led to consolidation of the short tasks on the A7
cluster (more energy-efficient), while sysbench made use of all cpus as
the A7s didn't have sufficient compute capacity to handle the five
tasks.
To see how scheduling would happen if all cpus had been A7s, the same
tests were repeated with the A15s' energy model set to that of the A7s
(i.e. lying about the platform to the scheduler energy model). The
scheduling pattern for the short tasks changed to being consolidated on
either the A7 or the A15 cluster instead of just on the A7, which was
expected. There are currently no tools available to easily estimate
energy from a trace using a platform energy model, which could have
quantified the energy benefit. Linaro is currently looking into
extending the idlestat tool [3] to do this.
Testing with more realistic (mobile) use-cases was done using two
previously described Android workloads [2]: Audio playback and Web
browsing. In addition, the combination of the two was measured.
Reported numbers are averages over 20 runs and have been normalized.
The browsing performance score is roughly rendering time (less is better).
                browsing        audio           browsing+audio
Mainline
  A15           51.5            17.7            40.5
  A7            48.5            82.3            59.5
  energy        100.0           100.0           100.0
  perf          100.0                           100.0
EA
  A15           16.3            2.2             13.4
  A7            60.2            80.7            61.1
  energy        76.6            82.9            74.6
  perf          108.9                           108.9
Diff
  energy        -23.4%          -17.1%          -25.4%
  perf          -8.9%                           -8.9%
Energy is saved in all three use-cases. The performance loss is due to
the TC2 fixed-frequency setup: the A15s do not deliver exactly the same
performance as the A7s; they have ~10% more compute capacity
(guesstimate). As with the synthetic tests, these numbers are better than
what should be expected for a true SMP platform.
The latency overhead induced by the energy model in
select_task_rq_fair() for this unoptimized implementation on TC2 is:
                latency avg (depending on cpu)
Mainline        2.3 - 4.9 us
EA              13.3 - 15.8 us
However, it should be possible to reduce this significantly.
Patch 1: Documentation
Patch 2-5: Infrastructure to set up energy model data
Patch 6: ARM TC2 energy model data
Patch 7: Infrastructure
Patch 8-13: Unweighted load tracking
Patch 14-17: Bits and pieces needed for the energy model
Patch 18-23: The energy model and scheduler tweaks
This series is based on a fairly recent tip/sched/core.
[1] http://etherpad.osuosl.org/energy-aware-scheduling-ks-2013 (search
for 'cost')
[2] https://lkml.org/lkml/2014/1/7/355
[3] http://git.linaro.org/power/idlestat.git
[4] https://lkml.org/lkml/2014/4/11/137
Changes
RFC v2:
- Extended documentation:
  - Cover the energy model in greater detail.
  - Recipe for deriving platform energy model.
- Replaced Kconfig with sched feature (jump label).
- Add unweighted load tracking.
- Use unweighted load as task/cpu utilization.
- Support for multiple idle states per sched_group. cpuidle integration
  still missing.
- Changed energy aware functionality in select_idle_sibling().
- Experimental energy aware load-balance support.
Dietmar Eggemann (12):
sched: Introduce energy data structures
sched: Allocate and initialize energy data structures
sched: Add energy procfs interface
arm: topology: Define TC2 energy and provide it to the scheduler
sched: Introduce system-wide sched_energy
sched: Aggregate unweighted load contributed by task entities on
parenting cfs_rq
sched: Maintain the unweighted load contribution of blocked entities
sched: Account for blocked unweighted load waking back up
sched: Introduce an unweighted cpu_load array
sched: Rename weighted_cpuload() to cpu_load()
sched: Introduce weighted/unweighted switch in load related functions
sched: Use energy model in load balance path
Morten Rasmussen (11):
sched: Documentation for scheduler energy cost model
sched: Make energy awareness a sched feature
sched: Introduce SD_SHARE_CAP_STATES sched_domain flag
sched, cpufreq: Introduce current cpu compute capacity into scheduler
sched, cpufreq: Current compute capacity hack for ARM TC2
sched: Likely idle state statistics placeholder
sched: Energy model functions
sched: Task wakeup tracking
sched: Take task wakeups into account in energy estimates
sched: Use energy model in select_idle_sibling
sched: Use energy to guide wakeup task placement
Documentation/scheduler/sched-energy.txt | 439 ++++++++++++++++++++
arch/arm/kernel/topology.c | 126 +++++-
drivers/cpufreq/cpufreq.c | 8 +
include/linux/sched.h | 28 ++
kernel/sched/core.c | 178 +++++++-
kernel/sched/debug.c | 6 +
kernel/sched/fair.c | 646 ++++++++++++++++++++++++++----
kernel/sched/features.h | 6 +
kernel/sched/proc.c | 22 +-
kernel/sched/sched.h | 44 +-
10 files changed, 1416 insertions(+), 87 deletions(-)
create mode 100644 Documentation/scheduler/sched-energy.txt
--
1.7.9.5
This documentation patch provides an overview of the experimental
scheduler energy costing model, associated data structures, and a
reference recipe on how platforms can be characterized to derive energy
models.
Signed-off-by: Morten Rasmussen <[email protected]>
---
Documentation/scheduler/sched-energy.txt | 439 ++++++++++++++++++++++++++++++
1 file changed, 439 insertions(+)
create mode 100644 Documentation/scheduler/sched-energy.txt
diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
new file mode 100644
index 0000000..7c0b4dc
--- /dev/null
+++ b/Documentation/scheduler/sched-energy.txt
@@ -0,0 +1,439 @@
+Energy cost model for energy-aware scheduling (EXPERIMENTAL)
+
+Introduction
+=============
+
+The basic energy model uses platform energy data stored in sched_group_energy
+data structures attached to the sched_groups in the sched_domain hierarchy. The
+energy cost model offers three functions that can be used to guide scheduling
+decisions:
+
+1. static int energy_diff_util(int cpu, int util, int wakeups)
+2. static int energy_diff_task(int cpu, struct task_struct *p)
+3. static int energy_diff_cpu(int dst_cpu, int src_cpu)
+
+All three of them return the energy cost delta caused by adding/removing
+utilization or a task to/from specific cpus. To be more precise:
+
+util: The signed utilization delta, that is, the amount of cpu utilization we
+want to add to or remove from the cpu; basically how much more/less the cpu is
+expected to be busy. The current metric used to represent utilization is the
+actual per-entity runnable time averaged over time using a geometric series.
+Very similar to the existing per-entity load-tracking, but _not_ scaled by task
+priority. energy_diff_util() calculates the energy implications of
+adding/removing a specific amount of utilization (which could be the combined
+utilization of a number of tasks), while energy_diff_task() calculates the
+energy implications of adding a specific task to a specific cpu.
+
+wakeups: Represents the wakeups (task enqueues, not idle exits) caused by the
+utilization we are about to add/remove to/from the cpu. As with utilization,
+the wakeup rate is averaged over time using a geometric series. The energy
+model estimates (in a fairly naive way) the proportion of the wakeups that
+cause cpu wakeups (idle exits). This metric is particularly important for
+short but frequently running tasks as the wakeup cost (energy) for these can be
+substantial if placed on an idle cpu.
+
+Background and Terminology
+===========================
+
+To make it clear from the start:
+
+energy = [joule] (resource like a battery on powered devices)
+power = energy/time = [joule/second] = [watt]
+
+The goal of energy-aware scheduling is to minimize energy, while still getting
+the job done. That is, we want to maximize:
+
+    performance [inst/s]
+    --------------------
+         power [W]
+
+which is equivalent to minimizing:
+
+     energy [J]
+    -----------
+    instruction
+
+while still getting 'good' performance. It is essentially an alternative
+optimization objective to the current performance-only objective for the
+scheduler. This alternative considers two objectives: energy-efficiency and
+performance. Hence, there needs to be a user-controllable knob to switch the
+objective. Since it is early days, this is currently a sched_feature
+(ENERGY_AWARE).
+
+The idea behind introducing an energy cost model is to allow the scheduler to
+evaluate the implications of its decisions rather than blindly applying
+energy-saving techniques that may only benefit some platforms. At
+the same time, the energy cost model must be as simple as possible to minimize
+the scheduler latency impact.
+
+Platform topology
+------------------
+
+The system topology (cpus, caches, and NUMA information, not peripherals) is
+represented in the scheduler by the sched_domain hierarchy which has
+sched_groups attached at each level, each covering one or more cpus (see
+sched-domains.txt for more details). To add energy awareness to the scheduler
+we need to consider power and frequency domains.
+
+Power domain:
+
+A power domain is a part of the system that can be powered on/off
+independently. Power domains are typically organized in a hierarchy where you
+may be able to power down just a cpu or a group of cpus along with any
+associated resources (e.g. shared caches). Powering up a cpu means that all
+power domains it is a part of in the hierarchy must be powered up. Hence, it is
+more expensive to power up the first cpu that belongs to a higher level power
+domain than to power up additional cpus in the same high-level domain. Two
+level power domain hierarchy example:
+
+             Power source
+                +-------------------------------+----...
+per group PD    G                               G
+                |             +----------+      |
+                +---------+---|  Shared  |   (other groups)
+per-cpu PD      G         G   | resource |
+                |         |   +----------+
+            +-------+ +-------+
+            | CPU 0 | | CPU 1 |
+            +-------+ +-------+
+
+Frequency domain:
+
+Frequency domains (P-states) typically cover the same group of cpus as one of
+the power domain levels. That is, there might be several smaller power domains
+sharing the same frequency (P-state) or there might be a power domain spanning
+multiple frequency domains.
+
+From a scheduling point of view there is no need to know the actual frequencies
+[Hz]. All the scheduler cares about is the compute capacity available at the
+current state (P-state) the cpu is in and any other available states. For that
+reason, and to also factor in any cpu micro-architecture differences, compute
+capacity scaling states are called 'capacity states' in this document. For SMP
+systems this is equivalent to P-states. For mixed micro-architecture systems
+(like ARM big.LITTLE) it is P-states scaled according to the micro-architecture
+performance relative to the other cpus in the system.
+
+Energy modelling
+------------------
+
+Due to the hierarchical nature of the power domains, the most obvious way to
+model energy costs is to associate power and energy costs with
+domains (groups of cpus). Energy costs of shared resources are associated with
+the group of cpus that share the resources; only the cost of powering the
+cpu itself and any private resources (e.g. private L1 caches) is associated
+with the per-cpu groups (lowest level).
+
+For example, for an SMP system with per-cpu power domains and a cluster level
+(group of cpus) power domain we get the overall energy costs to be:
+
+ energy = energy_cluster + n * energy_cpu
+
+where 'n' is the number of cpus powered up and energy_cluster is the cost paid
+as soon as any cpu in the cluster is powered up.
+
+The power and frequency domains can naturally be mapped onto the existing
+sched_domain hierarchy and sched_groups by adding the necessary data to the
+existing data structures.
+
+The energy model considers energy consumption from three contributors (shown in
+the illustration below):
+
+1. Busy energy: Energy consumed while a cpu and the higher level groups that it
+belongs to are busy running tasks. Busy energy is associated with the state of
+the cpu, not an event. The time the cpu spends in this state varies. Thus, the
+most obvious platform parameter for this contribution is busy power
+(energy/time).
+
+2. Idle energy: Energy consumed while a cpu and higher level groups that it
+belongs to are idle (in a C-state). Like busy energy, idle energy is associated
+with the state of the cpu. Thus, the platform parameter for this contribution
+is idle power (energy/time).
+
+3. Wakeup energy: Energy consumed for a transition from an idle-state (C-state)
+to a busy state (P-state) and back again, that is, a full run->sleep->run cycle
+(they always come in pairs, transitions between idle-states are not modelled).
+This energy is associated with an event with a fixed duration (at least
+roughly). The most obvious platform parameter for this contribution is
+therefore wakeup energy. Wakeup energy is depicted by the areas under the power
+graph for the transition phases in the illustration.
+
+
+   Power
+       ^
+       |            busy->idle           idle->busy
+       |            transition           transition
+       |
+       |                _                      __
+       |               / \                    /  \__________________
+       |______________/   \                  /
+       |                   \                /
+       |   Busy             \    Idle      /        Busy
+       | low P-state         \____________/        high P-state
+       |
+       +------------------------------------------------------------> time
+
+Busy   |--------------|                          |-----------------|
+
+Wakeup                |------|            |------|
+
+Idle                         |------------|
+
+
+The basic algorithm
+====================
+
+The basic idea is to determine the total energy impact when utilization is
+added or removed by estimating the impact at each level in the sched_domain
+hierarchy starting from the bottom (sched_group contains just a single cpu).
+The energy cost comes from three sources: busy time (sched_group is awake
+because one or more cpus are busy), idle time (in an idle-state), and wakeups
+(idle state exits). Power and energy numbers account for energy costs
+associated with all cpus in the sched_group as a group. In some cases it is
+possible to bail out early without having to go to the top of the hierarchy if the
+additional/removed utilization doesn't affect the busy time of higher levels.
+
+  for_each_domain(cpu, sd) {
+      sg = sched_group_of(cpu)
+      energy_before = curr_util(sg) * busy_power(sg)
+                      + (1-curr_util(sg)) * idle_power(sg)
+      energy_after = new_util(sg) * busy_power(sg)
+                     + (1-new_util(sg)) * idle_power(sg)
+                     + (1-new_util(sg)) * wakeups * wakeup_energy(sg)
+      energy_diff += energy_before - energy_after
+
+      if (energy_before == energy_after)
+          break;
+  }
+
+  return energy_diff
+
+{curr, new}_util: The cpu utilization at the lowest level and the overall
+non-idle time for the entire group for higher levels. Utilization is in the
+range 0.0 to 1.0 in the pseudo-code.
+
+busy_power: The power consumption of the sched_group.
+
+idle_power: The power consumption of the sched_group when idle.
+
+wakeups: Average wakeup rate of the task(s) being added/removed. To predict how
+many of the wakeups cause idle exits, we scale the number by
+the unused utilization (assuming that wakeups are uniformly distributed).
+
+wakeup_energy: The energy consumed for a run->sleep->run cycle for the
+sched_group.
+
+Note: It is a fundamental assumption that the utilization is (roughly) scale
+invariant. Task utilization tracking factors in any frequency scaling and
+performance scaling differences due to different cpu microarchitectures such
+that task utilization can be used across the entire system. This is _not_ in
+place yet.
+
+Platform energy data
+=====================
+
+struct sched_group_energy can be attached to sched_groups in the sched_domain
+hierarchy and has the following members:
+
+cap_states:
+ List of struct capacity_state representing the supported capacity states
+ (P-states). struct capacity_state has two members: cap and power, which
+ represents the compute capacity and the busy_power of the state. The
+ represent the compute capacity and the busy_power of the state. The
+
+nr_cap_states:
+ Number of capacity states in cap_states list.
+
+idle_states:
+ List of struct idle_state containing idle_state related costs for each
+ idle-state supported by the sched_group. struct idle_state has two
+ members: power and wu_energy. The former is the idle-state power
+ consumption, while the latter is the wakeup energy for a
+ run->sleep->run cycle for that particular sched_group idle-state.
+ Note that the list should only contain idle-states whose set of affected
+ cpus matches the members of the sched_group. That is, per-cpu idle
+ states are associated with sched_groups at the lowest level, and
+ package/cluster idle-states are listed for sched_groups further up the
+ hierarchy.
+
+nr_idle_states:
+ Number of idle states in idle_states list.
+
+There are no unit requirements for the energy cost data. Data can be normalized
+with any reference; however, the normalization must be consistent across all
+energy cost data. That is, one bogo-joule/watt must be the same quantity for
+all data, but we don't care what it is.
+
+A recipe for platform characterization
+=======================================
+
+Obtaining the actual model data for a particular platform requires some way of
+measuring power/energy. There isn't a tool to help with this (yet). This
+section provides a recipe for use as reference. It covers the steps used to
+characterize the ARM TC2 development platform. This sort of measurement is
+expected to be done anyway when tuning cpuidle and cpufreq for a given
+platform.
+
+The energy model needs three types of data (struct sched_group_energy holds
+these) for each sched_group where energy costs should be taken into account:
+
+1. Capacity state information
+
+A list containing the compute capacity and power consumption when fully
+utilized attributed to the group as a whole for each available capacity state.
+At the lowest level (group contains just a single cpu) this is the power of the
+cpu alone without including power consumed by resources shared with other cpus.
+It basically needs to fit the basic modelling approach described in the
+"Background and Terminology" section:
+
+ energy_system = energy_shared + n * energy_cpu
+
+for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at
+the lowest level. 'energy_shared' is included at the next level which
+represents the group of cpus among which the resources are shared.
+
+This model is, of course, a simplification of reality. Thus, power/energy
+attributions might not always exactly represent how the hardware is designed.
+Also, busy power is likely to depend on the workload. It is therefore
+recommended to use a representative mix of workloads when characterizing the
+capacity states.
+
+If the group has no capacity scaling support, the list will contain a single
+state where power is the busy power attributed to the group. The capacity
+should be set to a default value (1024).
+
+When frequency domains include multiple power domains, the group representing
+the frequency domain and all child groups share capacity states. This must be
+indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at
+all levels that share the capacity state must have the list of capacity states
+with the power set to the contribution of the individual group.
+
+2. Idle power information
+
+This is the power member of the idle-state entries stored in the idle_states
+list; it is the group idle power consumption. Due to the way the energy
+model is defined, the idle power of the deepest group idle state can
+alternatively be accounted for in the parent group busy power. In that case the
+group idle state power values are offset such that the idle power of the
+deepest state is zero. This is less intuitive, but easier to measure, as the
+idle power consumed by the group and the busy/idle power of the parent group
+cannot be distinguished without per-group measurement points.
+
+3. Wakeup energy information
+
+This is the wu_energy member of the idle-state entries in the idle_states list.
+The wakeup energy is the total energy consumed during the transition from a
+specific idle state to busy (some P-state) and back again. It is not easy to
+measure the two transitions individually, and they always occur in pairs anyway.
+Exiting one idle state and entering a different one is not modelled.
+
+The energy model estimates wakeup energy based on the tracked average wakeup
+rate. Assuming that all task wakeups result in idle exits, the wakeup energy
+consumed per time unit (~ energy rate ~ power) is:
+
+ wakeup_energy_rate = wakeup_energy * wakeup_rate
+
+The wakeup_rate is a geometric series similar to the per-entity load tracking.
+To simplify the math in the scheduler, the wakeup_energy parameter must be
+pre-scaled to take the geometric series into account. A wakeup_rate of
+LOAD_AVG_MAX (=47742) is equivalent to a true wakeup rate of 1000 wakeups per
+second. The wu_energy stored in each struct idle_state in the
+sched_group_energy data structure must therefore be scaled accordingly:
+
+ wakeup_energy = 1000/47742 * true_wakeup_energy
+
+Measuring capacity states and idle power:
+
+The capacity states' capacity and power can be estimated by running a benchmark
+workload at each available capacity state. By restricting the benchmark to run
+on subsets of cpus it is possible to extrapolate the power consumption of
+shared resources.
+
+ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a
+shared L2 cache. TC2 has on-chip energy counters per cluster. Running a
+benchmark workload on just one cpu in a cluster means that power is consumed in
+the cluster (higher level group) and a single cpu (lowest level group). Adding
+another benchmark task to another cpu increases the power consumption by the
+amount consumed by the additional cpu. Hence, it is possible to extrapolate the
+cluster busy power.
+
+For platforms that don't have energy counters or equivalent instrumentation
+built-in, it may be possible to use an external DAQ to acquire similar data.
+
+If the benchmark includes some performance score (for example sysbench cpu
+benchmark), this can be used to record the compute capacity.
+
+Measuring idle power requires insight into the idle state implementation on the
+particular platform, specifically whether it has coupled idle-states (or
+package states). To measure non-coupled per-cpu idle-states it is necessary to
+keep one cpu busy so any shared resources stay alive, isolating the idle power
+of the cpu from idle/busy power of the shared resources. The cpu can be tricked
+into different per-cpu idle states by disabling the other states. Based on
+various combinations of measurements with specific cpus busy and disabling
+idle-states it is possible to extrapolate the idle-state power.
+
+Measuring wakeup energy again requires knowledge about the supported platform
+idle-states, particularly about target residencies. Wakeup energy is a very
+small quantity that might be difficult to distinguish from noise. One way to
+measure it is to use a synthetic test case that periodically wakes up a task on
+a specific cpu. The task should immediately block/go to sleep again. The
+wake-up rate should be as high as the target residency for the idle-state
+allows. Based on cpuidle statistics and knowledge about coupled idle-states the
+wakeup energy can be determined.
+
+The following wakeup generator test was used for the ARM TC2 profiling.
+
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+volatile int busy_loops = 1;
+volatile useconds_t alarm_rate = 100000;
+
+void
+waste_time(int loops)
+{
+	int i;
+	int t = 0;
+	for (i = 0; i < loops; i++)
+		t++;
+}
+
+/* The signal handler runs the busy loop and re-installs itself. */
+void catch_alarm(int sig)
+{
+	waste_time(busy_loops);
+	signal(sig, catch_alarm);
+}
+
+int main(int argc, char **argv)
+{
+	if (argc != 3) {
+		printf("Usage: timerload <alarm_rate> <busy_loops>\n");
+		printf("alarm_rate:\tBusy loop invocation rate [us]. ");
+		printf("(Doesn't work for rate >= 1s)\n");
+		printf("busy_loops:\tbusy loop iterations per invocation.\n");
+		exit(0);
+	}
+
+	if (atoi(argv[1]) >= 1000000) {
+		printf("alarm_rate must be less than 1000000\n");
+	}
+
+	alarm_rate = (useconds_t) atoi(argv[1]);
+	busy_loops = atoi(argv[2]);
+	printf("alarm_rate = %d\nbusy_loops = %d\n", alarm_rate, busy_loops);
+
+	/* Establish a handler for SIGALRM signals. */
+	signal(SIGALRM, catch_alarm);
+
+	/* Arm a periodic alarm. */
+	ualarm(alarm_rate, alarm_rate);
+
+	while (1) {
+		pause();
+	}
+
+	return EXIT_SUCCESS;
+}
--
1.7.9.5
From: Dietmar Eggemann <[email protected]>
Migrate unweighted blocked load of an entity away from the run queue
in case it is migrated to another cpu during wake-up.
This patch is the unweighted counterpart of "sched: Account for blocked
load waking back up" (commit id aff3e4988444).
Note: The unweighted blocked load is not used for energy aware
scheduling yet.
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 9 +++++++--
kernel/sched/sched.h | 2 +-
2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6207f7..93c8dbe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2545,9 +2545,11 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
return;
if (atomic_long_read(&cfs_rq->removed_load)) {
- unsigned long removed_load;
+ unsigned long removed_load, uw_removed_load;
removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
- subtract_blocked_load_contrib(cfs_rq, removed_load, 0);
+ uw_removed_load = atomic_long_xchg(&cfs_rq->uw_removed_load, 0);
+ subtract_blocked_load_contrib(cfs_rq, removed_load,
+ uw_removed_load);
}
if (decays) {
@@ -4606,6 +4608,8 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
se->avg.decay_count = -__synchronize_entity_decay(se);
atomic_long_add(se->avg.load_avg_contrib,
&cfs_rq->removed_load);
+ atomic_long_add(se->avg.uw_load_avg_contrib,
+ &cfs_rq->uw_removed_load);
}
/* We have migrated, no longer consider this task hot */
@@ -7553,6 +7557,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#ifdef CONFIG_SMP
atomic64_set(&cfs_rq->decay_counter, 1);
atomic_long_set(&cfs_rq->removed_load, 0);
+ atomic_long_set(&cfs_rq->uw_removed_load, 0);
#endif
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3f1eeb3..d7d2ee2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -340,7 +340,7 @@ struct cfs_rq {
unsigned long uw_runnable_load_avg, uw_blocked_load_avg;
atomic64_t decay_counter;
u64 last_decay;
- atomic_long_t removed_load;
+ atomic_long_t removed_load, uw_removed_load;
#ifdef CONFIG_FAIR_GROUP_SCHED
/* Required to track per-cpu representation of a task_group */
--
1.7.9.5
Make select_idle_sibling() consider energy when picking an idle cpu.
This implies having to look beyond sd_llc. Otherwise, consolidating
short, frequently running tasks on fewer llc domains will not happen when
that is feasible. The fix is to start select_idle_sibling() at the
highest sched_domain level. A more refined approach causing less
overhead will be considered later. That could be to only look beyond
sd_llc occasionally.
Only idle cpus are still considered. A more aggressive energy conserving
approach could go further and consider partially utilized cpus.
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++----
1 file changed, 37 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aebf3e2..a32d6eb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4747,9 +4747,19 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
*/
static int select_idle_sibling(struct task_struct *p, int target)
{
- struct sched_domain *sd;
+ struct sched_domain *sd = NULL, *tmp;
struct sched_group *sg;
int i = task_cpu(p);
+ int target_nrg;
+ int nrg_min, nrg_cpu = -1;
+
+ if (energy_aware()) {
+ /* When energy-aware, go above sd_llc */
+ for_each_domain(target, tmp)
+ sd = tmp;
+
+ goto loop;
+ }
if (idle_cpu(target))
return target;
@@ -4764,6 +4774,10 @@ static int select_idle_sibling(struct task_struct *p, int target)
* Otherwise, iterate the domains and find an elegible idle cpu.
*/
sd = rcu_dereference(per_cpu(sd_llc, target));
+
+loop:
+ target_nrg = nrg_min = energy_diff_task(target, p);
+
for_each_lower_domain(sd) {
sg = sd->groups;
do {
@@ -4772,16 +4786,35 @@ static int select_idle_sibling(struct task_struct *p, int target)
goto next;
for_each_cpu(i, sched_group_cpus(sg)) {
+ int nrg_diff;
+ if (energy_aware()) {
+ if (!idle_cpu(i))
+ continue;
+
+ nrg_diff = energy_diff_task(i, p);
+ if (nrg_diff < nrg_min) {
+ nrg_min = nrg_diff;
+ nrg_cpu = i;
+ }
+ }
+
if (i == target || !idle_cpu(i))
goto next;
}
- target = cpumask_first_and(sched_group_cpus(sg),
- tsk_cpus_allowed(p));
- goto done;
+ if (!energy_aware()) {
+ target = cpumask_first_and(sched_group_cpus(sg),
+ tsk_cpus_allowed(p));
+ goto done;
+ }
next:
sg = sg->next;
} while (sg != sd->groups);
+
+ if (nrg_cpu >= 0) {
+ target = nrg_cpu;
+ goto done;
+ }
}
done:
return target;
--
1.7.9.5
Attempt to pick the most energy-efficient wakeup cpu in find_idlest_{group,
cpu}(). Finding the optimum target requires an exhaustive search
through all cpus in the groups. Instead, the target group is determined
based on load and probing the energy cost on a single cpu in each group.
The target cpu is the cpu with the lowest energy cost.
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 71 +++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 57 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a32d6eb..2acd45a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4662,25 +4662,27 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
}
/*
- * find_idlest_group finds and returns the least busy CPU group within the
- * domain.
+ * find_target_group finds and returns the least busy/most energy-efficient
+ * CPU group within the domain.
*/
static struct sched_group *
-find_idlest_group(struct sched_domain *sd, struct task_struct *p,
+find_target_group(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int sd_flag)
{
- struct sched_group *idlest = NULL, *group = sd->groups;
+ struct sched_group *idlest = NULL, *group = sd->groups, *energy = NULL;
unsigned long min_load = ULONG_MAX, this_load = 0;
int load_idx = sd->forkexec_idx;
int imbalance = 100 + (sd->imbalance_pct-100)/2;
+ int local_nrg = 0, min_nrg = INT_MAX;
if (sd_flag & SD_BALANCE_WAKE)
load_idx = sd->wake_idx;
do {
- unsigned long load, avg_load;
+ unsigned long load, avg_load, util, probe_util = UINT_MAX;
int local_group;
int i;
+ int probe_cpu, nrg_diff;
/* Skip over this group if it has no CPUs allowed */
if (!cpumask_intersects(sched_group_cpus(group),
@@ -4692,53 +4694,94 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
/* Tally up the load of all CPUs in the group */
avg_load = 0;
+ probe_cpu = cpumask_first(sched_group_cpus(group));
for_each_cpu(i, sched_group_cpus(group)) {
/* Bias balancing toward cpus of our domain */
- if (local_group)
+ if (local_group) {
load = source_load(i, load_idx, 0);
- else
+ util = source_load(i, load_idx, 1);
+ } else {
load = target_load(i, load_idx, 0);
+ util = target_load(i, load_idx, 1);
+ }
avg_load += load;
+
+ if (util < probe_util) {
+ probe_util = util;
+ probe_cpu = i;
+ }
}
/* Adjust by relative CPU capacity of the group */
avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+ /*
+ * Sample energy diff on probe_cpu.
+ * Finding the optimum cpu requires testing all cpus which is
+ * expensive.
+ */
+
+ nrg_diff = energy_diff_task(probe_cpu, p);
+
if (local_group) {
this_load = avg_load;
- } else if (avg_load < min_load) {
- min_load = avg_load;
- idlest = group;
+ local_nrg = nrg_diff;
+ } else {
+ if (avg_load < min_load) {
+ min_load = avg_load;
+ idlest = group;
+ }
+
+ if (nrg_diff < min_nrg) {
+ min_nrg = nrg_diff;
+ energy = group;
+ }
}
} while (group = group->next, group != sd->groups);
+ if (energy_aware()) {
+ if (energy && min_nrg < local_nrg)
+ return energy;
+ return NULL;
+ }
+
if (!idlest || 100*this_load < imbalance*min_load)
return NULL;
return idlest;
}
/*
- * find_idlest_cpu - find the idlest cpu among the cpus in group.
+ * find_target_cpu - find the target cpu among the cpus in group.
*/
static int
-find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
+find_target_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
{
unsigned long load, min_load = ULONG_MAX;
+ int min_nrg = INT_MAX, nrg, least_nrg = -1;
int idlest = -1;
int i;
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
load = cpu_load(i, 0);
+ nrg = energy_diff_task(i, p);
if (load < min_load || (load == min_load && i == this_cpu)) {
min_load = load;
idlest = i;
}
+
+ if (nrg < min_nrg) {
+ min_nrg = nrg;
+ least_nrg = i;
+ }
}
+ if (least_nrg >= 0)
+ return least_nrg;
+
return idlest;
}
@@ -4886,13 +4929,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
continue;
}
- group = find_idlest_group(sd, p, cpu, sd_flag);
+ group = find_target_group(sd, p, cpu, sd_flag);
if (!group) {
sd = sd->child;
continue;
}
- new_cpu = find_idlest_cpu(group, p, cpu);
+ new_cpu = find_target_cpu(group, p, cpu);
if (new_cpu == -1 || new_cpu == cpu) {
/* Now try balancing at a lower domain level of cpu */
sd = sd->child;
--
1.7.9.5
From: Dietmar Eggemann <[email protected]>
Attempt to pick the source cpu which potentially gives the maximum
energy savings when utilization is taken away from it and put on the
destination cpu instead. The amount of utilization moved is the minimum
of the additional utilization the destination cpu can handle and the
current utilization of the source cpu.
Finding the optimum source requires an exhaustive search through all cpus
in the groups. Instead, the source group is determined based on
utilization and probing the energy cost on a single cpu in each group.
This implementation does not provide actual energy-aware load
balancing right now. It only tries to showcase the way to find the
most suitable source queue (cpu) based on the energy-aware data. The
actual load balance is still done based on the calculated load-based
imbalance.
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 83 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2acd45a..1ce3a89 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4549,6 +4549,42 @@ static int energy_diff_task(int cpu, struct task_struct *p)
p->se.avg.wakeup_avg_sum);
}
+static int energy_diff_cpu(int dst_cpu, int src_cpu)
+{
+ int util_diff, dst_nrg_diff, src_nrg_diff;
+ unsigned long src_curr_cap, src_util;
+ unsigned long dst_curr_cap = get_curr_capacity(dst_cpu);
+ unsigned long dst_util = cpu_load(dst_cpu, 1);
+
+ /*
+ * If the destination cpu is already fully or even over-utilized
+ * return error.
+ */
+ if (dst_curr_cap <= dst_util)
+ return INT_MAX;
+
+ src_curr_cap = get_curr_capacity(src_cpu);
+ src_util = cpu_load(src_cpu, 1);
+
+ /*
+ * If the source cpu is over-utilized return the minimum value
+ * to indicate maximum potential energy savings. Performance
+ * is still given priority over pure energy efficiency here.
+ */
+ if (src_curr_cap < src_util)
+ return INT_MIN;
+
+ util_diff = min(dst_curr_cap - dst_util, src_util);
+
+ dst_nrg_diff = energy_diff_util(dst_cpu, util_diff, 0);
+ src_nrg_diff = energy_diff_util(src_cpu, -util_diff, 0);
+
+ if (dst_nrg_diff == INT_MAX || src_nrg_diff == INT_MAX)
+ return INT_MAX;
+
+ return dst_nrg_diff + src_nrg_diff;
+}
+
static int wake_wide(struct task_struct *p)
{
int factor = this_cpu_read(sd_llc_size);
@@ -5488,6 +5524,9 @@ struct lb_env {
unsigned int loop_max;
enum fbq_type fbq_type;
+
+ unsigned int use_ea; /* Use energy aware lb */
+
};
/*
@@ -5957,6 +5996,7 @@ struct sg_lb_stats {
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
#endif
+ int nrg_diff; /* Maximum energy difference btwn dst_cpu and probe_cpu */
};
/*
@@ -5969,9 +6009,11 @@ struct sd_lb_stats {
unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_capacity; /* Total capacity of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */
+ unsigned int use_ea; /* Use energy aware lb */
struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
struct sg_lb_stats local_stat; /* Statistics of the local group */
+
};
static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
@@ -5987,8 +6029,10 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
.local = NULL,
.total_load = 0UL,
.total_capacity = 0UL,
+ .use_ea = 0,
.busiest_stat = {
.avg_load = 0UL,
+ .nrg_diff = INT_MAX,
},
};
}
@@ -6282,20 +6326,32 @@ static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
int local_group, struct sg_lb_stats *sgs)
{
- unsigned long load;
- int i;
+ unsigned long load, probe_util = 0;
+ int i, probe_cpu = cpumask_first(sched_group_cpus(group));
memset(sgs, 0, sizeof(*sgs));
+ sgs->nrg_diff = INT_MAX;
+
for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
struct rq *rq = cpu_rq(i);
/* Bias balancing toward cpus of our domain */
if (local_group)
load = target_load(i, load_idx, 0);
- else
+ else {
load = source_load(i, load_idx, 0);
+ if (energy_aware()) {
+ unsigned long util = source_load(i, load_idx, 1);
+
+ if (probe_util < util) {
+ probe_util = util;
+ probe_cpu = i;
+ }
+ }
+ }
+
sgs->group_load += load;
sgs->sum_nr_running += rq->nr_running;
#ifdef CONFIG_NUMA_BALANCING
@@ -6321,6 +6377,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (sgs->group_capacity_factor > sgs->sum_nr_running)
sgs->group_has_free_capacity = 1;
+
+ if (energy_aware() && !local_group)
+ sgs->nrg_diff = energy_diff_cpu(env->dst_cpu, probe_cpu);
}
/**
@@ -6341,6 +6400,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
struct sched_group *sg,
struct sg_lb_stats *sgs)
{
+ if (energy_aware()) {
+ if (sgs->nrg_diff < sds->busiest_stat.nrg_diff) {
+ sds->use_ea = 1;
+ return true;
+ }
+ sds->use_ea = 0;
+ }
+
if (sgs->avg_load <= sds->busiest_stat.avg_load)
return false;
@@ -6450,6 +6517,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
if (update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
sds->busiest_stat = *sgs;
+ if (energy_aware())
+ env->use_ea = sds->use_ea;
}
next_group:
@@ -6761,7 +6830,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
{
struct rq *busiest = NULL, *rq;
unsigned long busiest_load = 0, busiest_capacity = 1;
- int i;
+ int i, min_nrg = INT_MAX;
for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
unsigned long capacity, capacity_factor, load;
@@ -6807,6 +6876,14 @@ static struct rq *find_busiest_queue(struct lb_env *env,
load > env->imbalance)
continue;
+ if (energy_aware() && env->use_ea) {
+ int nrg = energy_diff_cpu(env->dst_cpu, i);
+
+ if (nrg < min_nrg) {
+ min_nrg = nrg;
+ busiest = rq;
+ }
+ }
/*
* For the load comparisons with the other cpu's, consider
* the cpu_load() scaled with the cpu capacity, so
@@ -6818,7 +6895,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* to: load_i * capacity_j > load_j * capacity_i; where j is
* our previous maximum.
*/
- if (load * busiest_capacity > busiest_load * capacity) {
+ else if (load * busiest_capacity > busiest_load * capacity) {
busiest_load = load;
busiest_capacity = capacity;
busiest = rq;
@@ -6915,6 +6992,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
.fbq_type = all,
+ .use_ea = 0,
};
/*
--
1.7.9.5
The scheduler is currently completely unaware of idle-states. To make
informed decisions using the sched_group_energy idle_states list it
is necessary to know which idle-state a cpu (or group of cpus) is most
likely to be in when it is idle.
For example when migrating a task that wakes up periodically, the wakeup
energy expense depends on the idle-state the destination cpu is most
likely to be in when idle.
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9720f04..353e2d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4254,6 +4254,19 @@ static inline bool energy_aware(void)
return sched_feat(ENERGY_AWARE);
}
+/*
+ * Returns the index of the most likely idle-state that the sched_group is in
+ * when idle. The index can be used to identify the idle-state in the
+ * sched_group_energy idle_states list.
+ *
+ * This is currently just a placeholder. The information needs to come from
+ * cpuidle.
+ */
+static inline int likely_idle_state_idx(struct sched_group *sg)
+{
+ return 0;
+}
+
static int wake_wide(struct task_struct *p)
{
int factor = this_cpu_read(sd_llc_size);
--
1.7.9.5
Introduces energy_diff_util(), which estimates the energy impact of adding
or removing utilization from a specific cpu. The calculation is based on
the energy information provided by the platform through sched_energy
data in the sched_domain hierarchy.
Task and cpu utilization is based on unweighted load tracking
(uw_load_avg_contrib) and unweighted cpu_load(cpu, 1), which are introduced
earlier in the patch set. While it isn't a perfect utilization metric,
it is better than weighted load. There are several other loose ends that
need to be addressed, such as load/utilization invariance and proper
representation of compute capacity. However, the energy model and
unweighted load metrics are there.
The energy cost model only considers utilization (busy time) and idle
energy (remaining time) for now. The basic idea is to determine the
energy cost at each level in the sched_domain hierarchy.
  for_each_domain(cpu, sd) {
      sg = sched_group_of(cpu)
      energy_before = curr_util(sg) * busy_power(sg)
                      + (1-curr_util(sg)) * idle_power(sg)
      energy_after = new_util(sg) * busy_power(sg)
                     + (1-new_util(sg)) * idle_power(sg)
      energy_diff += energy_before - energy_after
  }
Wakeup energy is added later in this series.
Assumptions and the basic algorithm are described in the code comments.
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 251 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 251 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 353e2d0..44ba754 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4249,6 +4249,32 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
#endif
+/*
+ * Energy model for energy-aware scheduling
+ *
+ * Assumptions:
+ *
+ * 1. Task and cpu load/utilization are assumed to be scale invariant. That is,
+ * task utilization is invariant to frequency scaling and cpu microarchitecture
+ * differences. For example, a task utilization of 256 means a cpu with a
+ * capacity of 1024 will be 25% busy running the task, while another cpu with a
+ * capacity of 512 will be 50% busy.
+ *
+ * 2. When capacity states are shared (SD_SHARE_CAP_STATES) the capacity state
+ * tables are equivalent. That is, the same table index can be used across all
+ * tables.
+ *
+ * 3. Only the lowest level in sched_domain hierarchy has SD_SHARE_CAP_STATES
+ * set. This restriction will be removed later.
+ *
+ * 4. No independent higher level capacity states. Cluster/package power states
+ * are either shared with cpus (SD_SHARE_CAP_STATES) or they only have one.
+ * This restriction will be removed later.
+ *
+ * 5. The scheduler doesn't control capacity (frequency) scaling, but assumes
+ * that the controller will adjust the capacity to match the utilization.
+ */
+
static inline bool energy_aware(void)
{
return sched_feat(ENERGY_AWARE);
@@ -4267,6 +4293,231 @@ static inline int likely_idle_state_idx(struct sched_group *sg)
return 0;
}
+/*
+ * Find suitable capacity state for utilization.
+ * If over-utilized, return nr_cap_states.
+ */
+static int energy_match_cap(unsigned long util,
+ struct sched_group_energy *sge)
+{
+ int idx;
+
+ for (idx = 0; idx < sge->nr_cap_states; idx++) {
+ if (sge->cap_states[idx].cap >= util)
+ return idx;
+ }
+
+ return idx;
+}
+
+/*
+ * Find the max cpu utilization in a group of cpus before and after
+ * adding/removing tasks (util) from a specific cpu (cpu).
+ */
+static void find_max_util(const struct cpumask *mask, int cpu, int util,
+ unsigned long *max_util_bef, unsigned long *max_util_aft)
+{
+ int i;
+
+ *max_util_bef = 0;
+ *max_util_aft = 0;
+
+ for_each_cpu(i, mask) {
+ unsigned long cpu_util = cpu_load(i, 1);
+
+ *max_util_bef = max(*max_util_bef, cpu_util);
+
+ if (i == cpu)
+ cpu_util += util;
+
+ *max_util_aft = max(*max_util_aft, cpu_util);
+ }
+}
+
+static inline unsigned long get_curr_capacity(int cpu);
+
+/*
+ * Estimate the energy cost delta caused by adding/removing utilization (util)
+ * from a specific cpu (cpu).
+ *
+ * The basic idea is to determine the energy cost at each level in sched_domain
+ * hierarchy based on utilization:
+ *
+ * for_each_domain(cpu, sd) {
+ * sg = sched_group_of(cpu)
+ * energy_before = curr_util(sg) * busy_power(sg)
+ * + (1-curr_util(sg)) * idle_power(sg)
+ * energy_after = new_util(sg) * busy_power(sg)
+ * + (1-new_util(sg)) * idle_power(sg)
+ * energy_diff += energy_before - energy_after
+ * }
+ *
+ */
+static int energy_diff_util(int cpu, int util)
+{
+ struct sched_domain *sd;
+ int i;
+ int nrg_diff = 0;
+ int curr_cap_idx = -1;
+ int new_cap_idx = -1;
+ unsigned long max_util_bef, max_util_aft, aff_util_bef, aff_util_aft;
+ unsigned long unused_util_bef, unused_util_aft;
+ unsigned long cpu_curr_capacity;
+
+ cpu_curr_capacity = get_curr_capacity(cpu);
+
+ max_util_aft = cpu_load(cpu, 1) + util;
+
+ /* Can't remove more utilization than there is */
+ if (max_util_aft < 0) {
+ max_util_aft = 0;
+ util = -cpu_load(cpu, 1);
+ }
+
+ rcu_read_lock();
+ for_each_domain(cpu, sd) {
+ struct capacity_state *curr_state, *new_state, *cap_table;
+ struct idle_state *is;
+ struct sched_group_energy *sge;
+
+ if (!sd->groups->sge)
+ continue;
+
+ sge = sd->groups->sge;
+ cap_table = sge->cap_states;
+
+ if (curr_cap_idx < 0 || !(sd->flags & SD_SHARE_CAP_STATES)) {
+
+ /* TODO: Fix assumption 2 and 3. */
+ curr_cap_idx = energy_match_cap(cpu_curr_capacity, sge);
+
+ /*
+ * If we remove tasks, i.e. util < 0, we should find
+ * out if the cap state changes as well, but that is
+ * complicated and might not be worth it. It is assumed
+ * that the state won't be lowered for now.
+ *
+ * Also, if the cap state is shared new_cap_state can't
+ * be lower than curr_cap_idx as the utilization on an
+ * other cpu might have higher utilization than this
+ * cpu.
+ */
+
+ if (cap_table[curr_cap_idx].cap < max_util_aft) {
+ new_cap_idx = energy_match_cap(max_util_aft,
+ sge);
+ if (new_cap_idx >= sge->nr_cap_states) {
+ /*
+ * Can't handle the additional
+ * utilization
+ */
+ nrg_diff = INT_MAX;
+ goto unlock;
+ }
+ } else {
+ new_cap_idx = curr_cap_idx;
+ }
+ }
+
+ curr_state = &cap_table[curr_cap_idx];
+ new_state = &cap_table[new_cap_idx];
+ find_max_util(sched_group_cpus(sd->groups), cpu, util,
+ &max_util_bef, &max_util_aft);
+ is = &sge->idle_states[likely_idle_state_idx(sd->groups)];
+
+ if (!sd->child) {
+ /* Lowest level - groups are individual cpus */
+ if (sd->flags & SD_SHARE_CAP_STATES) {
+ int sum_util = 0;
+ for_each_cpu(i, sched_domain_span(sd))
+ sum_util += cpu_load(i, 1);
+ aff_util_bef = sum_util;
+ } else {
+ aff_util_bef = cpu_load(cpu, 1);
+ }
+ aff_util_aft = aff_util_bef + util;
+
+ /* Estimate idle time based on unused utilization */
+ unused_util_bef = curr_state->cap
+ - cpu_load(cpu, 1);
+ unused_util_aft = new_state->cap - cpu_load(cpu, 1)
+ - util;
+ } else {
+ /* Higher level */
+ aff_util_bef = max_util_bef;
+ aff_util_aft = max_util_aft;
+
+ /* Estimate idle time based on unused utilization */
+ unused_util_bef = curr_state->cap
+ - min(aff_util_bef, curr_state->cap);
+ unused_util_aft = new_state->cap
+ - min(aff_util_aft, new_state->cap);
+ }
+
+ /*
+ * The utilization change has no impact at this level (or any
+ * parent level).
+ */
+ if (aff_util_bef == aff_util_aft && curr_cap_idx == new_cap_idx)
+ goto unlock;
+
+ /* Energy before */
+ nrg_diff -= (aff_util_bef * curr_state->power)/curr_state->cap;
+ nrg_diff -= (unused_util_bef * is->power)/curr_state->cap;
+
+ /* Energy after */
+ nrg_diff += (aff_util_aft*new_state->power)/new_state->cap;
+ nrg_diff += (unused_util_aft * is->power)/new_state->cap;
+ }
+
+ /*
+ * We don't have any sched_group covering all cpus in the sched_domain
+ * hierarchy to associate system wide energy with. Treat it specially
+ * for now until it can be folded into the loop above.
+ */
+ if (sse) {
+ struct capacity_state *cap_table = sse->cap_states;
+ struct capacity_state *curr_state, *new_state;
+ struct idle_state *is;
+
+ curr_state = &cap_table[curr_cap_idx];
+ new_state = &cap_table[new_cap_idx];
+
+ find_max_util(cpu_online_mask, cpu, util, &aff_util_bef,
+ &aff_util_aft);
+ is = &sse->idle_states[likely_idle_state_idx(NULL)];
+
+ /* Estimate idle time based on unused utilization */
+ unused_util_bef = curr_state->cap - aff_util_bef;
+ unused_util_aft = new_state->cap - aff_util_aft;
+
+ /* Energy before */
+ nrg_diff -= (aff_util_bef*curr_state->power)/curr_state->cap;
+ nrg_diff -= (unused_util_bef * is->power)/curr_state->cap;
+
+ /* Energy after */
+ nrg_diff += (aff_util_aft*new_state->power)/new_state->cap;
+ nrg_diff += (unused_util_aft * is->power)/new_state->cap;
+ }
+
+unlock:
+ rcu_read_unlock();
+
+ return nrg_diff;
+}
+
+static int energy_diff_task(int cpu, struct task_struct *p)
+{
+ if (!energy_aware())
+ return INT_MAX;
+
+ if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+ return INT_MAX;
+
+ return energy_diff_util(cpu, p->se.avg.uw_load_avg_contrib);
+
+}
+
static int wake_wide(struct task_struct *p)
{
int factor = this_cpu_read(sd_llc_size);
--
1.7.9.5
From: Dietmar Eggemann <[email protected]>
The energy aware algorithm needs system wide energy information on certain
platforms (e.g. a one socket SMP system). Unfortunately, there is no
sched_group that covers all cpus in the system, so there is no place to
attach a system wide sched_group_energy data structure. In such a system,
the energy data is only attached to the sched groups for the individual
cpus in the sched domain (sd) MC level.
This patch adds a _hack_ to provide system-wide energy data via the
sched_domain_topology_level table for such a system.
The problem is that the sched_domain_topology_level table is not an
interface to provide system-wide data but we want to keep the
configuration of all energy related data in one place.
The sched_domain_energy_f of the last entry (the one which is
initialized with {NULL, }) of the sched_domain_topology_level table is
set to cpu_sys_energy(). Since the sched_domain_mask_f of this entry
stays NULL it is still not considered for the existing scheduler set-up
code (see for_each_sd_topology()).
A second call to init_sched_energy() with an sd pointer argument set to
NULL initializes the system-wide energy structure sse.
There is no system-wide power management on the example platform (ARM TC2)
which could potentially interact with the scheduler so struct
sched_group_energy *sse stays NULL.
Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm/kernel/topology.c | 7 ++++++-
kernel/sched/core.c | 34 ++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 2 ++
3 files changed, 38 insertions(+), 5 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index a7d5a6e..70915b1 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -386,6 +386,11 @@ static inline const struct sched_group_energy *cpu_core_energy(int cpu)
&energy_core_a15;
}
+static inline const struct sched_group_energy *cpu_sys_energy(int cpu)
+{
+ return NULL;
+}
+
static inline const int cpu_corepower_flags(void)
{
return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
@@ -396,7 +401,7 @@ static struct sched_domain_topology_level arm_topology[] = {
{ cpu_coregroup_mask, cpu_corepower_flags, cpu_core_energy, SD_INIT_NAME(MC) },
#endif
{ cpu_cpu_mask, 0, cpu_cluster_energy, SD_INIT_NAME(DIE) },
- { NULL, },
+ { NULL, 0, cpu_sys_energy},
};
/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7fecc63..2d7544a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5954,20 +5954,44 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
atomic_set(&sg->sgc->nr_busy_cpus, sg->group_weight);
}
+/* System-wide energy information. */
+struct sched_group_energy *sse;
+
static void init_sched_energy(int cpu, struct sched_domain *sd,
struct sched_domain_topology_level *tl)
{
- struct sched_group *sg = sd->groups;
- struct sched_group_energy *energy = sg->sge;
+ struct sched_group *sg = sd ? sd->groups : NULL;
+ struct sched_group_energy *energy = sd ? sg->sge : sse;
sched_domain_energy_f fn = tl->energy;
- struct cpumask *mask = sched_group_cpus(sg);
+ const struct cpumask *mask = sd ? sched_group_cpus(sg) :
+ cpu_cpu_mask(cpu);
- if (!fn || !fn(cpu))
+ if (!fn || !fn(cpu) || (!sd && energy))
return;
if (cpumask_weight(mask) > 1)
check_sched_energy_data(cpu, fn, mask);
+ if (!sd) {
+ energy = sse = kzalloc(sizeof(struct sched_group_energy) +
+ fn(cpu)->nr_idle_states*
+ sizeof(struct idle_state) +
+ fn(cpu)->nr_cap_states*
+ sizeof(struct capacity_state),
+ GFP_KERNEL);
+ BUG_ON(!energy);
+
+ energy->idle_states = (struct idle_state *)
+ ((void *)&energy->cap_states +
+ sizeof(energy->cap_states));
+
+ energy->cap_states = (struct capacity_state *)
+ ((void *)&energy->cap_states +
+ sizeof(energy->cap_states) +
+ fn(cpu)->nr_idle_states*
+ sizeof(struct idle_state));
+ }
+
energy->nr_idle_states = fn(cpu)->nr_idle_states;
memcpy(energy->idle_states, fn(cpu)->idle_states,
energy->nr_idle_states*sizeof(struct idle_state));
@@ -6655,6 +6679,8 @@ static int build_sched_domains(const struct cpumask *cpu_map,
claim_allocations(i, sd);
init_sched_groups_capacity(i, sd);
}
+
+ init_sched_energy(i, NULL, tl);
}
/* Attach the domains */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1a5f1ee..c971359 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -747,6 +747,8 @@ struct sched_group_capacity {
unsigned long cpumask[0]; /* iteration mask */
};
+extern struct sched_group_energy *sse;
+
struct sched_group {
struct sched_group *next; /* Must be a circular list */
atomic_t ref;
--
1.7.9.5
Track the task wakeup rate in wakeup_avg_sum by counting wakeups. Note that
these are _not_ cpu wakeups (idle exits). Task wakeups only cause cpu
wakeups if the cpu is idle when the task wakeup occurs.
The wakeup rate decays over time at the same rate as used for the
existing entity load tracking. Unlike runnable_avg_sum, wakeup_avg_sum
counts events, not time, and is therefore theoretically unbounded and
should be used with care.
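As a rough illustration (assuming the same per-millisecond decay factor y
with y^32 = 0.5 that the entity load tracking uses), a task waking once
every 32 ms settles around wakeup_avg_sum = 1024 / (1 - y^32) = 2048,
whereas a task waking every millisecond settles around 1024 / (1 - y),
i.e. roughly 48000.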
Signed-off-by: Morten Rasmussen <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/sched/fair.c | 18 ++++++++++++++++++
2 files changed, 21 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index faebd87..5f854b2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1107,6 +1107,9 @@ struct sched_avg {
s64 decay_count;
unsigned long load_avg_contrib;
unsigned long uw_load_avg_contrib;
+
+ unsigned long last_wakeup_update;
+ u32 wakeup_avg_sum;
};
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44ba754..6da8e2b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -679,6 +679,8 @@ void init_task_runnable_average(struct task_struct *p)
p->se.avg.runnable_avg_sum = slice;
p->se.avg.runnable_avg_period = slice;
__update_task_entity_contrib(&p->se);
+
+ p->se.avg.last_wakeup_update = jiffies;
}
#else
void init_task_runnable_average(struct task_struct *p)
@@ -4112,6 +4114,21 @@ static void record_wakee(struct task_struct *p)
}
}
+static void update_wakeup_avg(struct task_struct *p)
+{
+ struct sched_entity *se = &p->se;
+ struct sched_avg *sa = &se->avg;
+ unsigned long now = ACCESS_ONCE(jiffies);
+
+ if (time_after(now, sa->last_wakeup_update)) {
+ sa->wakeup_avg_sum = decay_load(sa->wakeup_avg_sum,
+ jiffies_to_msecs(now - sa->last_wakeup_update));
+ sa->last_wakeup_update = now;
+ }
+
+ sa->wakeup_avg_sum += 1024;
+}
+
static void task_waking_fair(struct task_struct *p)
{
struct sched_entity *se = &p->se;
@@ -4132,6 +4149,7 @@ static void task_waking_fair(struct task_struct *p)
se->vruntime -= min_vruntime;
record_wakee(p);
+ update_wakeup_avg(p);
}
#ifdef CONFIG_FAIR_GROUP_SCHED
--
1.7.9.5
From: Dietmar Eggemann <[email protected]>
This patch is only here to be able to test the provisioning of energy-related
data from an arch topology shim layer to the scheduler. Since there is no
code today which extracts energy-related data from the dtb or acpi and
processes it in the topology shim layer, the contents of the
sched_group_energy structures as well as the idle_state and capacity_state
arrays are hard-coded here.
This patch defines the sched_group_energy structure as well as the
idle_state and capacity_state arrays for the cluster (relating to sched
groups (sg's) at the DIE sched domain (sd) level) and for the core (relating
to sg's at the MC sd level), for a Cortex A7 as well as for a Cortex A15.
It further provides related implementations of the sched_domain_energy_f
functions (cpu_cluster_energy() and cpu_core_energy()).
To be able to propagate this information from the topology shim layer to
the scheduler, the elements of the arm_topology[] table have been
provisioned with the appropriate sched_domain_energy_f functions.
Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm/kernel/topology.c | 116 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 113 insertions(+), 3 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index d42a7db..a7d5a6e 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -275,6 +275,117 @@ void store_cpu_topology(unsigned int cpuid)
cpu_topology[cpuid].socket_id, mpidr);
}
+/*
+ * ARM TC2 specific energy cost model data. There are no unit requirements for
+ * the data. Data can be normalized to any reference point, but the
+ * normalization must be consistent. That is, one bogo-joule/watt must be the
+ * same quantity for all data, but we don't care what it is.
+ */
+static struct idle_state idle_states_cluster_a7[] = {
+ { .power = 10, .wu_energy = 6 /* << 10 */, },
+ };
+
+static struct idle_state idle_states_cluster_a15[] = {
+ { .power = 25, .wu_energy = 210 /* << 10 */, },
+ };
+
+static struct capacity_state cap_states_cluster_a7[] = {
+ /* Cluster only power */
+ { .cap = 358, .power = 2967, }, /* 350 MHz */
+ { .cap = 410, .power = 2792, }, /* 400 MHz */
+ { .cap = 512, .power = 2810, }, /* 500 MHz */
+ { .cap = 614, .power = 2815, }, /* 600 MHz */
+ { .cap = 717, .power = 2919, }, /* 700 MHz */
+ { .cap = 819, .power = 2847, }, /* 800 MHz */
+ { .cap = 922, .power = 3917, }, /* 900 MHz */
+ { .cap = 1024, .power = 4905, }, /* 1000 MHz */
+ };
+
+static struct capacity_state cap_states_cluster_a15[] = {
+ /* Cluster only power */
+ { .cap = 840, .power = 7920, }, /* 500 MHz */
+ { .cap = 1008, .power = 8165, }, /* 600 MHz */
+ { .cap = 1176, .power = 8172, }, /* 700 MHz */
+ { .cap = 1343, .power = 8195, }, /* 800 MHz */
+ { .cap = 1511, .power = 8265, }, /* 900 MHz */
+ { .cap = 1679, .power = 8446, }, /* 1000 MHz */
+ { .cap = 1847, .power = 11426, }, /* 1100 MHz */
+ { .cap = 2015, .power = 15200, }, /* 1200 MHz */
+ };
+
+static struct sched_group_energy energy_cluster_a7 = {
+ .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a7),
+ .idle_states = idle_states_cluster_a7,
+ .nr_cap_states = ARRAY_SIZE(cap_states_cluster_a7),
+ .cap_states = cap_states_cluster_a7,
+};
+
+static struct sched_group_energy energy_cluster_a15 = {
+ .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a15),
+ .idle_states = idle_states_cluster_a15,
+ .nr_cap_states = ARRAY_SIZE(cap_states_cluster_a15),
+ .cap_states = cap_states_cluster_a15,
+};
+
+static struct idle_state idle_states_core_a7[] = {
+ { .power = 0 /* No power gating */, .wu_energy = 0 /* << 10 */, },
+ };
+
+static struct idle_state idle_states_core_a15[] = {
+ { .power = 0 /* No power gating */, .wu_energy = 5 /* << 10 */, },
+ };
+
+static struct capacity_state cap_states_core_a7[] = {
+ /* Power per cpu */
+ { .cap = 358, .power = 187, }, /* 350 MHz */
+ { .cap = 410, .power = 275, }, /* 400 MHz */
+ { .cap = 512, .power = 334, }, /* 500 MHz */
+ { .cap = 614, .power = 407, }, /* 600 MHz */
+ { .cap = 717, .power = 447, }, /* 700 MHz */
+ { .cap = 819, .power = 549, }, /* 800 MHz */
+ { .cap = 922, .power = 761, }, /* 900 MHz */
+ { .cap = 1024, .power = 1024, }, /* 1000 MHz */
+ };
+
+static struct capacity_state cap_states_core_a15[] = {
+ /* Power per cpu */
+ { .cap = 840, .power = 2021, }, /* 500 MHz */
+ { .cap = 1008, .power = 2312, }, /* 600 MHz */
+ { .cap = 1176, .power = 2756, }, /* 700 MHz */
+ { .cap = 1343, .power = 3125, }, /* 800 MHz */
+ { .cap = 1511, .power = 3524, }, /* 900 MHz */
+ { .cap = 1679, .power = 3846, }, /* 1000 MHz */
+ { .cap = 1847, .power = 5177, }, /* 1100 MHz */
+ { .cap = 2015, .power = 6997, }, /* 1200 MHz */
+ };
+
+static struct sched_group_energy energy_core_a7 = {
+ .nr_idle_states = ARRAY_SIZE(idle_states_core_a7),
+ .idle_states = idle_states_core_a7,
+ .nr_cap_states = ARRAY_SIZE(cap_states_core_a7),
+ .cap_states = cap_states_core_a7,
+};
+
+static struct sched_group_energy energy_core_a15 = {
+ .nr_idle_states = ARRAY_SIZE(idle_states_core_a15),
+ .idle_states = idle_states_core_a15,
+ .nr_cap_states = ARRAY_SIZE(cap_states_core_a15),
+ .cap_states = cap_states_core_a15,
+};
+
+/* sd energy functions */
+static inline const struct sched_group_energy *cpu_cluster_energy(int cpu)
+{
+ return cpu_topology[cpu].socket_id ? &energy_cluster_a7 :
+ &energy_cluster_a15;
+}
+
+static inline const struct sched_group_energy *cpu_core_energy(int cpu)
+{
+ return cpu_topology[cpu].socket_id ? &energy_core_a7 :
+ &energy_core_a15;
+}
+
static inline const int cpu_corepower_flags(void)
{
return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
@@ -282,10 +393,9 @@ static inline const int cpu_corepower_flags(void)
static struct sched_domain_topology_level arm_topology[] = {
#ifdef CONFIG_SCHED_MC
- { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
- { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+ { cpu_coregroup_mask, cpu_corepower_flags, cpu_core_energy, SD_INIT_NAME(MC) },
#endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
+ { cpu_cpu_mask, 0, cpu_cluster_energy, SD_INIT_NAME(DIE) },
{ NULL, },
};
--
1.7.9.5
From: Dietmar Eggemann <[email protected]>
The unweighted blocked load on a run queue is maintained alongside the
existing (weighted) blocked load.
This patch is the unweighted counterpart of "sched: Maintain the load
contribution of blocked entities" (commit id 9ee474f55664).
Note: The unweighted blocked load is not used for energy aware
scheduling yet.
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/debug.c | 2 ++
kernel/sched/fair.c | 22 +++++++++++++++++-----
kernel/sched/sched.h | 2 +-
3 files changed, 20 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 78d4151..ffa56a8 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -220,6 +220,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->uw_runnable_load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg",
cfs_rq->blocked_load_avg);
+ SEQ_printf(m, " .%-30s: %ld\n", "uw_blocked_load_avg",
+ cfs_rq->uw_blocked_load_avg);
#ifdef CONFIG_FAIR_GROUP_SCHED
SEQ_printf(m, " .%-30s: %ld\n", "tg_load_contrib",
cfs_rq->tg_load_contrib);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1ee47b3..c6207f7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2481,12 +2481,18 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se,
}
static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
- long load_contrib)
+ long load_contrib,
+ long uw_load_contrib)
{
if (likely(load_contrib < cfs_rq->blocked_load_avg))
cfs_rq->blocked_load_avg -= load_contrib;
else
cfs_rq->blocked_load_avg = 0;
+
+ if (likely(uw_load_contrib < cfs_rq->uw_blocked_load_avg))
+ cfs_rq->uw_blocked_load_avg -= uw_load_contrib;
+ else
+ cfs_rq->uw_blocked_load_avg = 0;
}
static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
@@ -2521,7 +2527,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
cfs_rq->uw_runnable_load_avg += uw_contrib_delta;
}
else
- subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+ subtract_blocked_load_contrib(cfs_rq, -contrib_delta,
+ -uw_contrib_delta);
}
/*
@@ -2540,12 +2547,14 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
if (atomic_long_read(&cfs_rq->removed_load)) {
unsigned long removed_load;
removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
- subtract_blocked_load_contrib(cfs_rq, removed_load);
+ subtract_blocked_load_contrib(cfs_rq, removed_load, 0);
}
if (decays) {
cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
decays);
+ cfs_rq->uw_blocked_load_avg =
+ decay_load(cfs_rq->uw_blocked_load_avg, decays);
atomic64_add(decays, &cfs_rq->decay_counter);
cfs_rq->last_decay = now;
}
@@ -2591,7 +2600,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
/* migrated tasks did not contribute to our blocked load */
if (wakeup) {
- subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+ subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib,
+ se->avg.uw_load_avg_contrib);
update_entity_load_avg(se, 0);
}
@@ -2620,6 +2630,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
if (sleep) {
cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+ cfs_rq->uw_blocked_load_avg += se->avg.uw_load_avg_contrib;
se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
} /* migrations, e.g. sleep=0 leave decay_count == 0 */
}
@@ -7481,7 +7492,8 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
*/
if (se->avg.decay_count) {
__synchronize_entity_decay(se);
- subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+ subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib,
+ se->avg.uw_load_avg_contrib);
}
#endif
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 46cb8bd..3f1eeb3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -337,7 +337,7 @@ struct cfs_rq {
* the FAIR_GROUP_SCHED case).
*/
unsigned long runnable_load_avg, blocked_load_avg;
- unsigned long uw_runnable_load_avg;
+ unsigned long uw_runnable_load_avg, uw_blocked_load_avg;
atomic64_t decay_counter;
u64 last_decay;
atomic_long_t removed_load;
--
1.7.9.5
cpufreq currently keeps it a secret which cpus share a clock
source. The scheduler needs to know about clock domains as well
to become more energy aware. The SD_SHARE_CAP_STATES domain flag
indicates whether cpus belonging to the sched_domain share capacity
states (P-states).
There is no connection with cpufreq (yet). The flag must be set by
the arch specific topology code.
Signed-off-by: Morten Rasmussen <[email protected]>
---
arch/arm/kernel/topology.c | 3 ++-
include/linux/sched.h | 1 +
kernel/sched/core.c | 10 +++++++---
3 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 70915b1..0f9b27a 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -393,7 +393,8 @@ static inline const struct sched_group_energy *cpu_sys_energy(int cpu)
static inline const int cpu_corepower_flags(void)
{
- return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
+ return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN | \
+ SD_SHARE_CAP_STATES;
}
static struct sched_domain_topology_level arm_topology[] = {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b5eeae0..e5d8d57 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -877,6 +877,7 @@ enum cpu_idle_type {
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
#define SD_NUMA 0x4000 /* cross-node balancing */
+#define SD_SHARE_CAP_STATES 0x8000 /* Domain members share capacity state */
#ifdef CONFIG_SCHED_SMT
static inline const int cpu_smt_flags(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d814064..ce43396 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5417,7 +5417,8 @@ static int sd_degenerate(struct sched_domain *sd)
SD_BALANCE_EXEC |
SD_SHARE_CPUCAPACITY |
SD_SHARE_PKG_RESOURCES |
- SD_SHARE_POWERDOMAIN)) {
+ SD_SHARE_POWERDOMAIN |
+ SD_SHARE_CAP_STATES)) {
if (sd->groups != sd->groups->next)
return 0;
}
@@ -5449,7 +5450,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
SD_SHARE_CPUCAPACITY |
SD_SHARE_PKG_RESOURCES |
SD_PREFER_SIBLING |
- SD_SHARE_POWERDOMAIN);
+ SD_SHARE_POWERDOMAIN |
+ SD_SHARE_CAP_STATES);
if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;
}
@@ -6109,6 +6111,7 @@ static int sched_domains_curr_level;
* SD_SHARE_PKG_RESOURCES - describes shared caches
* SD_NUMA - describes NUMA topologies
* SD_SHARE_POWERDOMAIN - describes shared power domain
+ * SD_SHARE_CAP_STATES - describes shared capacity states
*
* Odd one out:
* SD_ASYM_PACKING - describes SMT quirks
@@ -6118,7 +6121,8 @@ static int sched_domains_curr_level;
SD_SHARE_PKG_RESOURCES | \
SD_NUMA | \
SD_ASYM_PACKING | \
- SD_SHARE_POWERDOMAIN)
+ SD_SHARE_POWERDOMAIN | \
+ SD_SHARE_CAP_STATES)
static struct sched_domain *
sd_init(struct sched_domain_topology_level *tl, int cpu)
--
1.7.9.5
The scheduler is currently unaware of frequency changes and the current
compute capacity offered by the cpus. This patch is not the solution.
It is a hack to give us something to experiment with for now.
A proper solution could be based on the frequency invariant load
tracking proposed in the past: https://lkml.org/lkml/2013/4/16/289
The best way to get current compute capacity is likely to be
architecture specific. A potential solution is therefore to let the
architecture implement get_curr_capacity() instead.
This patch should _not_ be considered safe.
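As a worked example of what the hack below does: with policy->max =
1000000 (kHz), a transition to 600000 kHz makes the notifier report a
current capacity of 600000 * 1024 / 1000000 = 614 to the scheduler.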
Signed-off-by: Morten Rasmussen <[email protected]>
---
drivers/cpufreq/cpufreq.c | 2 ++
include/linux/sched.h | 2 ++
kernel/sched/fair.c | 11 +++++++++++
kernel/sched/sched.h | 2 ++
4 files changed, 17 insertions(+)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index abda660..a2b788d 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -28,6 +28,7 @@
#include <linux/slab.h>
#include <linux/suspend.h>
#include <linux/tick.h>
+#include <linux/sched.h>
#include <trace/events/power.h>
/**
@@ -315,6 +316,7 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
pr_debug("FREQ: %lu - CPU: %lu\n",
(unsigned long)freqs->new, (unsigned long)freqs->cpu);
trace_cpu_frequency(freqs->new, freqs->cpu);
+ set_curr_capacity(freqs->cpu, (freqs->new*1024)/policy->max);
srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
CPUFREQ_POSTCHANGE, freqs);
if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e5d8d57..faebd87 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3025,4 +3025,6 @@ static inline unsigned long rlimit_max(unsigned int limit)
return task_rlimit_max(current, limit);
}
+void set_curr_capacity(int cpu, long capacity);
+
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 37e9ea1..9720f04 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7564,9 +7564,20 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
atomic64_set(&cfs_rq->decay_counter, 1);
atomic_long_set(&cfs_rq->removed_load, 0);
atomic_long_set(&cfs_rq->uw_removed_load, 0);
+ atomic_long_set(&cfs_rq->curr_capacity, 1024);
#endif
}
+void set_curr_capacity(int cpu, long capacity)
+{
+ atomic_long_set(&cpu_rq(cpu)->cfs.curr_capacity, capacity);
+}
+
+static inline unsigned long get_curr_capacity(int cpu)
+{
+ return atomic_long_read(&cpu_rq(cpu)->cfs.curr_capacity);
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static void task_move_group_fair(struct task_struct *p, int on_rq)
{
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 455d152..a6d5239 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -342,6 +342,8 @@ struct cfs_rq {
u64 last_decay;
atomic_long_t removed_load, uw_removed_load;
+ atomic_long_t curr_capacity;
+
#ifdef CONFIG_FAIR_GROUP_SCHED
/* Required to track per-cpu representation of a task_group */
u32 tg_runnable_contrib;
--
1.7.9.5
From: Dietmar Eggemann <[email protected]>
Energy-aware scheduling relies on cpu utilization. To be able to
maintain it, we need a per-run-queue signal of the sum of the
unweighted (i.e. not scaled with task priority) load contributions of
runnable task entities.
The unweighted runnable load on a run queue is maintained alongside the
existing (weighted) runnable load.
This patch is the unweighted counterpart of "sched: Aggregate load
contributed by task entities on parenting cfs_rq" (commit id
2dac754e10a5).
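For example, two always-running tasks, one at nice 0 and one at nice -10,
contribute very different weighted loads (1024 vs. 9548 with the default
prio_to_weight table), but both contribute roughly 1024 (NICE_0_LOAD) to
the unweighted signal, which is what matters when estimating how busy a
cpu actually is.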
Signed-off-by: Dietmar Eggemann <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/debug.c | 4 ++++
kernel/sched/fair.c | 26 ++++++++++++++++++++++----
kernel/sched/sched.h | 1 +
4 files changed, 28 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1507390..b5eeae0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1105,6 +1105,7 @@ struct sched_avg {
u64 last_runnable_update;
s64 decay_count;
unsigned long load_avg_contrib;
+ unsigned long uw_load_avg_contrib;
};
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 695f977..78d4151 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -96,6 +96,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
P(se->avg.runnable_avg_sum);
P(se->avg.runnable_avg_period);
P(se->avg.load_avg_contrib);
+ P(se->avg.uw_load_avg_contrib);
P(se->avg.decay_count);
#endif
#undef PN
@@ -215,6 +216,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
#ifdef CONFIG_SMP
SEQ_printf(m, " .%-30s: %ld\n", "runnable_load_avg",
cfs_rq->runnable_load_avg);
+ SEQ_printf(m, " .%-30s: %ld\n", "uw_runnable_load_avg",
+ cfs_rq->uw_runnable_load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg",
cfs_rq->blocked_load_avg);
#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -635,6 +638,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
P(se.avg.runnable_avg_sum);
P(se.avg.runnable_avg_period);
P(se.avg.load_avg_contrib);
+ P(se.avg.uw_load_avg_contrib);
P(se.avg.decay_count);
#endif
P(policy);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 981406e..1ee47b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2345,6 +2345,8 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
return 0;
se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
+ se->avg.uw_load_avg_contrib = decay_load(se->avg.uw_load_avg_contrib,
+ decays);
se->avg.decay_count = 0;
return decays;
@@ -2451,12 +2453,18 @@ static inline void __update_task_entity_contrib(struct sched_entity *se)
contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
contrib /= (se->avg.runnable_avg_period + 1);
se->avg.load_avg_contrib = scale_load(contrib);
+
+ contrib = se->avg.runnable_avg_sum * scale_load_down(NICE_0_LOAD);
+ contrib /= (se->avg.runnable_avg_period + 1);
+ se->avg.uw_load_avg_contrib = scale_load(contrib);
}
/* Compute the current contribution to load_avg by se, return any delta */
-static long __update_entity_load_avg_contrib(struct sched_entity *se)
+static long __update_entity_load_avg_contrib(struct sched_entity *se,
+ long *uw_contrib_delta)
{
long old_contrib = se->avg.load_avg_contrib;
+ long uw_old_contrib = se->avg.uw_load_avg_contrib;
if (entity_is_task(se)) {
__update_task_entity_contrib(se);
@@ -2465,6 +2473,10 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
__update_group_entity_contrib(se);
}
+ if (uw_contrib_delta)
+ *uw_contrib_delta = se->avg.uw_load_avg_contrib -
+ uw_old_contrib;
+
return se->avg.load_avg_contrib - old_contrib;
}
@@ -2484,7 +2496,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
int update_cfs_rq)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
- long contrib_delta;
+ long contrib_delta, uw_contrib_delta;
u64 now;
/*
@@ -2499,13 +2511,15 @@ static inline void update_entity_load_avg(struct sched_entity *se,
if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
return;
- contrib_delta = __update_entity_load_avg_contrib(se);
+ contrib_delta = __update_entity_load_avg_contrib(se, &uw_contrib_delta);
if (!update_cfs_rq)
return;
- if (se->on_rq)
+ if (se->on_rq) {
cfs_rq->runnable_load_avg += contrib_delta;
+ cfs_rq->uw_runnable_load_avg += uw_contrib_delta;
+ }
else
subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
}
@@ -2582,6 +2596,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
}
cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+ cfs_rq->uw_runnable_load_avg += se->avg.uw_load_avg_contrib;
+
/* we force update consideration on load-balancer moves */
update_cfs_rq_blocked_load(cfs_rq, !wakeup);
}
@@ -2600,6 +2616,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
update_cfs_rq_blocked_load(cfs_rq, !sleep);
cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+ cfs_rq->uw_runnable_load_avg -= se->avg.uw_load_avg_contrib;
+
if (sleep) {
cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c971359..46cb8bd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -337,6 +337,7 @@ struct cfs_rq {
* the FAIR_GROUP_SCHED case).
*/
unsigned long runnable_load_avg, blocked_load_avg;
+ unsigned long uw_runnable_load_avg;
atomic64_t decay_counter;
u64 last_decay;
atomic_long_t removed_load;
--
1.7.9.5
From: Dietmar Eggemann <[email protected]>
Add a weighted/unweighted switch as an additional argument to cpu_load(),
source_load(), target_load() and task_h_load() to let the caller ask for
either the weighted or the unweighted load signal. Use 0 (weighted) for
all existing call sites of these functions.
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 65 +++++++++++++++++++++++++++------------------------
1 file changed, 35 insertions(+), 30 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 784fdab..37e9ea1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -665,7 +665,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
}
#ifdef CONFIG_SMP
-static unsigned long task_h_load(struct task_struct *p);
+static unsigned long task_h_load(struct task_struct *p, int uw);
static inline void __update_task_entity_contrib(struct sched_entity *se);
@@ -1014,9 +1014,9 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4);
}
-static unsigned long cpu_load(const int cpu);
-static unsigned long source_load(int cpu, int type);
-static unsigned long target_load(int cpu, int type);
+static unsigned long cpu_load(const int cpu, int uw);
+static unsigned long source_load(int cpu, int type, int uw);
+static unsigned long target_load(int cpu, int type, int uw);
static unsigned long capacity_of(int cpu);
static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
@@ -1045,7 +1045,7 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
struct rq *rq = cpu_rq(cpu);
ns->nr_running += rq->nr_running;
- ns->load += cpu_load(cpu);
+ ns->load += cpu_load(cpu, 0);
ns->compute_capacity += capacity_of(cpu);
cpus++;
@@ -1215,12 +1215,12 @@ balance:
orig_src_load = env->src_stats.load;
/* XXX missing capacity terms */
- load = task_h_load(env->p);
+ load = task_h_load(env->p, 0);
dst_load = orig_dst_load + load;
src_load = orig_src_load - load;
if (cur) {
- load = task_h_load(cur);
+ load = task_h_load(cur, 0);
dst_load -= load;
src_load += load;
}
@@ -4036,9 +4036,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
#ifdef CONFIG_SMP
/* Used instead of source_load when we know the type == 0 */
-static unsigned long cpu_load(const int cpu)
+static unsigned long cpu_load(const int cpu, int uw)
{
- return cpu_rq(cpu)->cfs.runnable_load_avg;
+ return uw ? cpu_rq(cpu)->cfs.uw_runnable_load_avg :
+ cpu_rq(cpu)->cfs.runnable_load_avg;
}
/*
@@ -4048,30 +4049,32 @@ static unsigned long cpu_load(const int cpu)
* We want to under-estimate the load of migration sources, to
* balance conservatively.
*/
-static unsigned long source_load(int cpu, int type)
+static unsigned long source_load(int cpu, int type, int uw)
{
struct rq *rq = cpu_rq(cpu);
- unsigned long total = cpu_load(cpu);
+ unsigned long total = cpu_load(cpu, uw);
if (type == 0 || !sched_feat(LB_BIAS))
return total;
- return min(rq->cpu_load[type-1], total);
+ return uw ? min(rq->uw_cpu_load[type-1], total) :
+ min(rq->cpu_load[type-1], total);
}
/*
* Return a high guess at the load of a migration-target cpu weighted
* according to the scheduling class and "nice" value.
*/
-static unsigned long target_load(int cpu, int type)
+static unsigned long target_load(int cpu, int type, int uw)
{
struct rq *rq = cpu_rq(cpu);
- unsigned long total = cpu_load(cpu);
+ unsigned long total = cpu_load(cpu, uw);
if (type == 0 || !sched_feat(LB_BIAS))
return total;
- return max(rq->cpu_load[type-1], total);
+ return uw ? max(rq->uw_cpu_load[type-1], total) :
+ max(rq->cpu_load[type-1], total);
}
static unsigned long capacity_of(int cpu)
@@ -4292,8 +4295,8 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
- load = source_load(prev_cpu, idx);
- this_load = target_load(this_cpu, idx);
+ load = source_load(prev_cpu, idx, 0);
+ this_load = target_load(this_cpu, idx, 0);
/*
* If sync wakeup then subtract the (maximum possible)
@@ -4349,7 +4352,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
if (balanced ||
(this_load <= load &&
- this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
+ this_load + target_load(prev_cpu, idx, 0) <= tl_per_task)) {
/*
* This domain has SD_WAKE_AFFINE and
* p is cache cold in this domain, and
@@ -4398,9 +4401,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
for_each_cpu(i, sched_group_cpus(group)) {
/* Bias balancing toward cpus of our domain */
if (local_group)
- load = source_load(i, load_idx);
+ load = source_load(i, load_idx, 0);
else
- load = target_load(i, load_idx);
+ load = target_load(i, load_idx, 0);
avg_load += load;
}
@@ -4433,7 +4436,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
- load = cpu_load(i);
+ load = cpu_load(i, 0);
if (load < min_load || (load == min_load && i == this_cpu)) {
min_load = load;
@@ -5402,7 +5405,7 @@ static int move_tasks(struct lb_env *env)
if (!can_migrate_task(p, env))
goto next;
- load = task_h_load(p);
+ load = task_h_load(p, 0);
if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
goto next;
@@ -5542,12 +5545,14 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
}
}
-static unsigned long task_h_load(struct task_struct *p)
+static unsigned long task_h_load(struct task_struct *p, int uw)
{
struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ unsigned long task_load = uw ? p->se.avg.uw_load_avg_contrib
+ : p->se.avg.load_avg_contrib;
update_cfs_rq_h_load(cfs_rq);
- return div64_ul(p->se.avg.load_avg_contrib * cfs_rq->h_load,
+ return div64_ul(task_load * cfs_rq->h_load,
cfs_rq->runnable_load_avg + 1);
}
#else
@@ -5555,9 +5560,9 @@ static inline void update_blocked_averages(int cpu)
{
}
-static unsigned long task_h_load(struct task_struct *p)
+static unsigned long task_h_load(struct task_struct *p, int uw)
{
- return p->se.avg.load_avg_contrib;
+ return uw ? p->se.avg.uw_load_avg_contrib : p->se.avg.load_avg_contrib;
}
#endif
@@ -5916,9 +5921,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
/* Bias balancing toward cpus of our domain */
if (local_group)
- load = target_load(i, load_idx);
+ load = target_load(i, load_idx, 0);
else
- load = source_load(i, load_idx);
+ load = source_load(i, load_idx, 0);
sgs->group_load += load;
sgs->sum_nr_running += rq->nr_running;
@@ -5926,7 +5931,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->nr_numa_running += rq->nr_numa_running;
sgs->nr_preferred_running += rq->nr_preferred_running;
#endif
- sgs->sum_weighted_load += cpu_load(i);
+ sgs->sum_weighted_load += cpu_load(i, 0);
if (idle_cpu(i))
sgs->idle_cpus++;
}
@@ -6421,7 +6426,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
if (!capacity_factor)
capacity_factor = fix_small_capacity(env->sd, group);
- load = cpu_load(i);
+ load = cpu_load(i, 0);
/*
* When comparing with imbalance, use cpu_load()
--
1.7.9.5
From: Dietmar Eggemann <[email protected]>
Maintain an unweighted (uw) cpu_load array as the uw counterpart
of rq.cpu_load[].
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/core.c | 4 +++-
kernel/sched/proc.c | 22 ++++++++++++++++++----
kernel/sched/sched.h | 1 +
3 files changed, 22 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d7544a..d814064 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7114,8 +7114,10 @@ void __init sched_init(void)
init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
#endif
- for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
+ for (j = 0; j < CPU_LOAD_IDX_MAX; j++) {
rq->cpu_load[j] = 0;
+ rq->uw_cpu_load[j] = 0;
+ }
rq->last_load_update_tick = jiffies;
diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index 16f5a30..2260092 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -471,6 +471,7 @@ decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
* every tick. We fix it up based on jiffies.
*/
static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
+ unsigned long uw_this_load,
unsigned long pending_updates)
{
int i, scale;
@@ -479,14 +480,20 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
/* Update our load: */
this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */
+ this_rq->uw_cpu_load[0] = uw_this_load; /* Fasttrack for idx 0 */
for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
- unsigned long old_load, new_load;
+ unsigned long old_load, new_load, uw_old_load, uw_new_load;
/* scale is effectively 1 << i now, and >> i divides by scale */
old_load = this_rq->cpu_load[i];
old_load = decay_load_missed(old_load, pending_updates - 1, i);
new_load = this_load;
+
+ uw_old_load = this_rq->uw_cpu_load[i];
+ uw_old_load = decay_load_missed(uw_old_load,
+ pending_updates - 1, i);
+ uw_new_load = uw_this_load;
/*
* Round up the averaging division if load is increasing. This
* prevents us from getting stuck on 9 if the load is 10, for
@@ -494,8 +501,12 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
*/
if (new_load > old_load)
new_load += scale - 1;
+ if (uw_new_load > uw_old_load)
+ uw_new_load += scale - 1;
this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
+ this_rq->uw_cpu_load[i] = (uw_old_load * (scale - 1) +
+ uw_new_load) >> i;
}
sched_avg_update(this_rq);
@@ -535,6 +546,7 @@ void update_idle_cpu_load(struct rq *this_rq)
{
unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
unsigned long load = get_rq_runnable_load(this_rq);
+ unsigned long uw_load = this_rq->cfs.uw_runnable_load_avg;
unsigned long pending_updates;
/*
@@ -546,7 +558,7 @@ void update_idle_cpu_load(struct rq *this_rq)
pending_updates = curr_jiffies - this_rq->last_load_update_tick;
this_rq->last_load_update_tick = curr_jiffies;
- __update_cpu_load(this_rq, load, pending_updates);
+ __update_cpu_load(this_rq, load, uw_load, pending_updates);
}
/*
@@ -569,7 +581,7 @@ void update_cpu_load_nohz(void)
* We were idle, this means load 0, the current load might be
* !0 due to remote wakeups and the sort.
*/
- __update_cpu_load(this_rq, 0, pending_updates);
+ __update_cpu_load(this_rq, 0, 0, pending_updates);
}
raw_spin_unlock(&this_rq->lock);
}
@@ -581,11 +593,13 @@ void update_cpu_load_nohz(void)
void update_cpu_load_active(struct rq *this_rq)
{
unsigned long load = get_rq_runnable_load(this_rq);
+ unsigned long uw_load = this_rq->cfs.uw_runnable_load_avg;
+
/*
* See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
*/
this_rq->last_load_update_tick = jiffies;
- __update_cpu_load(this_rq, load, 1);
+ __update_cpu_load(this_rq, load, uw_load, 1);
calc_load_account_active(this_rq);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d7d2ee2..455d152 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -521,6 +521,7 @@ struct rq {
#endif
#define CPU_LOAD_IDX_MAX 5
unsigned long cpu_load[CPU_LOAD_IDX_MAX];
+ unsigned long uw_cpu_load[CPU_LOAD_IDX_MAX];
unsigned long last_load_update_tick;
#ifdef CONFIG_NO_HZ_COMMON
u64 nohz_stamp;
--
1.7.9.5
From: Dietmar Eggemann <[email protected]>
The function weighted_cpuload() is the only one in the group of load-related
functions used in the scheduler load-balancing code (weighted_cpuload(),
source_load(), target_load(), task_h_load()) which carries an explicit
'weighted' identifier in its name. Get rid of this 'weighted' identifier
since the following patches will introduce a weighted/unweighted switch as
an argument for these functions.
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 33 +++++++++++++++++----------------
1 file changed, 17 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 93c8dbe..784fdab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1014,7 +1014,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4);
}
-static unsigned long weighted_cpuload(const int cpu);
+static unsigned long cpu_load(const int cpu);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
static unsigned long capacity_of(int cpu);
@@ -1045,7 +1045,7 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
struct rq *rq = cpu_rq(cpu);
ns->nr_running += rq->nr_running;
- ns->load += weighted_cpuload(cpu);
+ ns->load += cpu_load(cpu);
ns->compute_capacity += capacity_of(cpu);
cpus++;
@@ -4036,7 +4036,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
#ifdef CONFIG_SMP
/* Used instead of source_load when we know the type == 0 */
-static unsigned long weighted_cpuload(const int cpu)
+static unsigned long cpu_load(const int cpu)
{
return cpu_rq(cpu)->cfs.runnable_load_avg;
}
@@ -4051,7 +4051,7 @@ static unsigned long weighted_cpuload(const int cpu)
static unsigned long source_load(int cpu, int type)
{
struct rq *rq = cpu_rq(cpu);
- unsigned long total = weighted_cpuload(cpu);
+ unsigned long total = cpu_load(cpu);
if (type == 0 || !sched_feat(LB_BIAS))
return total;
@@ -4066,7 +4066,7 @@ static unsigned long source_load(int cpu, int type)
static unsigned long target_load(int cpu, int type)
{
struct rq *rq = cpu_rq(cpu);
- unsigned long total = weighted_cpuload(cpu);
+ unsigned long total = cpu_load(cpu);
if (type == 0 || !sched_feat(LB_BIAS))
return total;
@@ -4433,7 +4433,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
- load = weighted_cpuload(i);
+ load = cpu_load(i);
if (load < min_load || (load == min_load && i == this_cpu)) {
min_load = load;
@@ -5926,7 +5926,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->nr_numa_running += rq->nr_numa_running;
sgs->nr_preferred_running += rq->nr_preferred_running;
#endif
- sgs->sum_weighted_load += weighted_cpuload(i);
+ sgs->sum_weighted_load += cpu_load(i);
if (idle_cpu(i))
sgs->idle_cpus++;
}
@@ -6388,7 +6388,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
int i;
for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
- unsigned long capacity, capacity_factor, wl;
+ unsigned long capacity, capacity_factor, load;
enum fbq_type rt;
rq = cpu_rq(i);
@@ -6421,28 +6421,29 @@ static struct rq *find_busiest_queue(struct lb_env *env,
if (!capacity_factor)
capacity_factor = fix_small_capacity(env->sd, group);
- wl = weighted_cpuload(i);
+ load = cpu_load(i);
/*
- * When comparing with imbalance, use weighted_cpuload()
+ * When comparing with imbalance, use cpu_load()
* which is not scaled with the cpu capacity.
*/
- if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+ if (capacity_factor && rq->nr_running == 1 &&
+ load > env->imbalance)
continue;
/*
* For the load comparisons with the other cpu's, consider
- * the weighted_cpuload() scaled with the cpu capacity, so
+ * the cpu_load() scaled with the cpu capacity, so
* that the load can be moved away from the cpu that is
* potentially running at a lower capacity.
*
- * Thus we're looking for max(wl_i / capacity_i), crosswise
+ * Thus we're looking for max(load_i / capacity_i), crosswise
* multiplication to rid ourselves of the division works out
- * to: wl_i * capacity_j > wl_j * capacity_i; where j is
+ * to: load_i * capacity_j > load_j * capacity_i; where j is
* our previous maximum.
*/
- if (wl * busiest_capacity > busiest_load * capacity) {
- busiest_load = wl;
+ if (load * busiest_capacity > busiest_load * capacity) {
+ busiest_load = load;
busiest_capacity = capacity;
busiest = rq;
}
--
1.7.9.5
From: Dietmar Eggemann <[email protected]>
This patch makes the energy data available via procfs. The related files
are placed in a sub-directory named 'energy' inside the
/proc/sys/kernel/sched_domain/cpuX/domainY/groupZ directory for those
cpu/domain/group tuples which have energy information.
The following example depicts the contents of
/proc/sys/kernel/sched_domain/cpu0/domain0/group[01] for a system which
has energy information attached to domain level 0.
├── cpu0
│  ├── domain0
│  │  ├── busy_factor
│  │  ├── busy_idx
│  │  ├── cache_nice_tries
│  │  ├── flags
│  │  ├── forkexec_idx
│  │  ├── group0
│  │  │  └── energy
│  │  │     ├── cap_states
│  │  │     ├── idle_states
│  │  │     ├── nr_cap_states
│  │  │     └── nr_idle_states
│  │  ├── group1
│  │  │  └── energy
│  │  │     ├── cap_states
│  │  │     ├── idle_states
│  │  │     ├── nr_cap_states
│  │  │     └── nr_idle_states
│  │  ├── idle_idx
│  │  ├── imbalance_pct
│  │  ├── max_interval
│  │  ├── max_newidle_lb_cost
│  │  ├── min_interval
│  │  ├── name
│  │  ├── newidle_idx
│  │  └── wake_idx
│  └── domain1
│  ├── busy_factor
│  ├── busy_idx
│  ├── cache_nice_tries
│  ├── flags
│  ├── forkexec_idx
│  ├── idle_idx
│  ├── imbalance_pct
│  ├── max_interval
│  ├── max_newidle_lb_cost
│  ├── min_interval
│  ├── name
│  ├── newidle_idx
│  └── wake_idx
The files 'nr_idle_states' and 'nr_cap_states' contain a scalar value,
whereas 'idle_states' contains a vector of (power consumption, wakeup
energy for the run->sleep->run cycle) tuples, one per idle state, and
'cap_states' contains a vector of (compute capacity, power consumption
at this compute capacity) tuples, one per capacity state.
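As an illustration (based on the hard-coded TC2 data introduced earlier,
and assuming the sysctl handler simply dumps the underlying array of
unsigned longs), reading nr_cap_states for an A7 MC-level group would be
expected to return 8, and cap_states the flattened cap/power pairs, i.e.
'358 187 410 275 ...' up to '1024 1024'.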
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/core.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 65 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ecece17..7fecc63 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4931,9 +4931,60 @@ set_table_entry(struct ctl_table *entry,
}
static struct ctl_table *
+sd_alloc_ctl_energy_table(struct sched_group_energy *sge)
+{
+ struct ctl_table *table = sd_alloc_ctl_entry(5);
+
+ if (table == NULL)
+ return NULL;
+
+ set_table_entry(&table[0], "nr_idle_states", &sge->nr_idle_states,
+ sizeof(int), 0644, proc_dointvec_minmax, false);
+ set_table_entry(&table[1], "idle_states", &sge->idle_states[0].power,
+ sge->nr_idle_states*sizeof(struct idle_state), 0644,
+ proc_doulongvec_minmax, false);
+ set_table_entry(&table[2], "nr_cap_states", &sge->nr_cap_states,
+ sizeof(int), 0644, proc_dointvec_minmax, false);
+ set_table_entry(&table[3], "cap_states", &sge->cap_states[0].cap,
+ sge->nr_cap_states*sizeof(struct capacity_state), 0644,
+ proc_doulongvec_minmax, false);
+
+ return table;
+}
+
+static struct ctl_table *
+sd_alloc_ctl_group_table(struct sched_group *sg)
+{
+ struct ctl_table *table = sd_alloc_ctl_entry(2);
+
+ if (table == NULL)
+ return NULL;
+
+ table->procname = kstrdup("energy", GFP_KERNEL);
+ table->mode = 0555;
+ table->child = sd_alloc_ctl_energy_table(sg->sge);
+
+ return table;
+}
+
+static struct ctl_table *
sd_alloc_ctl_domain_table(struct sched_domain *sd)
{
- struct ctl_table *table = sd_alloc_ctl_entry(14);
+ struct ctl_table *table;
+ unsigned int nr_entries = 14;
+
+ int i = 0;
+ struct sched_group *sg = sd->groups;
+
+ if (sg->sge) {
+ int nr_sgs = 0;
+
+ do {} while (nr_sgs++, sg = sg->next, sg != sd->groups);
+
+ nr_entries += nr_sgs;
+ }
+
+ table = sd_alloc_ctl_entry(nr_entries);
if (table == NULL)
return NULL;
@@ -4966,7 +5017,19 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
sizeof(long), 0644, proc_doulongvec_minmax, false);
set_table_entry(&table[12], "name", sd->name,
CORENAME_MAX_SIZE, 0444, proc_dostring, false);
- /* &table[13] is terminator */
+ sg = sd->groups;
+ if (sg->sge) {
+ char buf[32];
+ struct ctl_table *entry = &table[13];
+
+ do {
+ snprintf(buf, 32, "group%d", i);
+ entry->procname = kstrdup(buf, GFP_KERNEL);
+ entry->mode = 0555;
+ entry->child = sd_alloc_ctl_group_table(sg);
+ } while (entry++, i++, sg = sg->next, sg != sd->groups);
+ }
+ /* &table[nr_entries-1] is terminator */
return table;
}
--
1.7.9.5
This patch introduces the ENERGY_AWARE sched feature, which is
implemented using jump labels when SCHED_DEBUG is defined. It is
statically set to false when SCHED_DEBUG is not defined, so energy
awareness cannot be enabled without SCHED_DEBUG. This
sched_feature knob will be replaced later with a more appropriate
control knob when things have matured a bit.
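With SCHED_DEBUG enabled, the feature can then be toggled at runtime like
any other sched_feat knob, e.g. by writing ENERGY_AWARE or NO_ENERGY_AWARE
to /sys/kernel/debug/sched_features (assuming debugfs is mounted in the
usual place).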
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 5 +++++
kernel/sched/features.h | 6 ++++++
2 files changed, 11 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3c73122..981406e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4215,6 +4215,11 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
#endif
+static inline bool energy_aware(void)
+{
+ return sched_feat(ENERGY_AWARE);
+}
+
static int wake_wide(struct task_struct *p)
{
int factor = this_cpu_read(sd_llc_size);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 90284d1..199ee3a 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,3 +83,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
*/
SCHED_FEAT(NUMA_RESIST_LOWER, false)
#endif
+
+/*
+ * Energy aware scheduling. Use platform energy model to guide scheduling
+ * decisions optimizing for energy efficiency.
+ */
+SCHED_FEAT(ENERGY_AWARE, false)
--
1.7.9.5
From: Dietmar Eggemann <[email protected]>
The struct sched_group_energy represents the per-sched_group data
which is needed for energy-aware scheduling. It contains:
(1) atomic reference counter for scheduler internal bookkeeping of
data allocation and freeing
(2) number of elements of the idle state array
(3) pointer to the idle state array which comprises 'power consumption
and wakeup energy for the run->sleep->run cycle' tuples for each
idle state
(4) number of elements of the capacity state array
(5) pointer to the capacity state array which comprises 'compute
capacity and power consumption' tuples for each capacity state
Allocation and freeing of struct sched_group_energy utilizes the existing
infrastructure of the scheduler which is currently used for the other sd
hierarchy data structures (e.g. struct sched_domain) as well. That's why
struct sd_data is provisioned with a per cpu struct sched_group_energy
double pointer.
The struct sched_group obtains a pointer to a struct sched_group_energy.
The function pointer sched_domain_energy_f is introduced into struct
sched_domain_topology_level which will allow the arch to pass a particular
struct sched_group_energy from the topology shim layer into the scheduler
core.
The function pointer sched_domain_energy_f has an 'int cpu' parameter
since the folding of two adjacent sd levels via sd degenerate doesn't work
for all sd levels. I.e. it is not possible for example to use this feature
to provide per-cpu energy in sd level DIE on ARM's TC2 platform.
It was discussed that the folding of sd levels approach is preferable
to the cpu parameter approach, simply because the user (the arch
specifying the sd topology table) can introduce fewer errors. But since
it does not work, the 'int cpu' parameter is the only way out. It's
possible to use the folding of sd levels approach for
sched_domain_flags_f and the cpu parameter approach for the
sched_domain_energy_f at the same time though. With the use of the
'int cpu' parameter, an extra check function has to be provided to make
sure that all cpus spanned by a sched group are provisioned with the same
energy data.
Signed-off-by: Dietmar Eggemann <[email protected]>
---
include/linux/sched.h | 21 +++++++++++++++++++++
kernel/sched/sched.h | 1 +
2 files changed, 22 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4f6bf9..1507390 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -909,6 +909,24 @@ struct sched_domain_attr {
extern int sched_domain_level_max;
+struct capacity_state {
+ unsigned long cap; /* compute capacity */
+ unsigned long power; /* power consumption at this compute capacity */
+};
+
+struct idle_state {
+ unsigned long power; /* power consumption in this idle state */
+ unsigned long wu_energy; /* energy for run->sleep->run cycle (<<10) */
+};
+
+struct sched_group_energy {
+ atomic_t ref;
+ unsigned int nr_idle_states; /* number of idle states */
+ struct idle_state *idle_states; /* ptr to idle state array */
+ unsigned int nr_cap_states; /* number of capacity states */
+ struct capacity_state *cap_states; /* ptr to capacity state array */
+};
+
struct sched_group;
struct sched_domain {
@@ -1007,6 +1025,7 @@ bool cpus_share_cache(int this_cpu, int that_cpu);
typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
typedef const int (*sched_domain_flags_f)(void);
+typedef const struct sched_group_energy *(*sched_domain_energy_f)(int cpu);
#define SDTL_OVERLAP 0x01
@@ -1014,11 +1033,13 @@ struct sd_data {
struct sched_domain **__percpu sd;
struct sched_group **__percpu sg;
struct sched_group_capacity **__percpu sgc;
+ struct sched_group_energy **__percpu sge;
};
struct sched_domain_topology_level {
sched_domain_mask_f mask;
sched_domain_flags_f sd_flags;
+ sched_domain_energy_f energy;
int flags;
int numa_level;
struct sd_data data;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2f86361..d300a64 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -753,6 +753,7 @@ struct sched_group {
unsigned int group_weight;
struct sched_group_capacity *sgc;
+ struct sched_group_energy *sge;
/*
* The CPUs this group covers.
--
1.7.9.5
Hack to report different cpu capacities for big and little cpus.
This is for experimentation on ARM TC2 _only_. A proper solution
has to address this problem.
Signed-off-by: Morten Rasmussen <[email protected]>
---
drivers/cpufreq/cpufreq.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index a2b788d..efec777 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -316,7 +316,13 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
pr_debug("FREQ: %lu - CPU: %lu\n",
(unsigned long)freqs->new, (unsigned long)freqs->cpu);
trace_cpu_frequency(freqs->new, freqs->cpu);
- set_curr_capacity(freqs->cpu, (freqs->new*1024)/policy->max);
+ /* Massive TC2 hack */
+ if (freqs->cpu == 0 || freqs->cpu == 1)
+ /* A15 cpus (max_capacity = 2015) */
+ set_curr_capacity(freqs->cpu, (freqs->new*2015)/1200000);
+ else
+ /* A7 cpus (max_capacity = 1024) */
+ set_curr_capacity(freqs->cpu, (freqs->new*1024)/1000000);
srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
CPUFREQ_POSTCHANGE, freqs);
if (likely(policy) && likely(policy->cpu == freqs->cpu))
--
1.7.9.5
From: Dietmar Eggemann <[email protected]>
The per sched group (sg) sched_group_energy structure plus the related
idle_state and capacity_state arrays are allocated like the other sched
domain (sd) hierarchy data structures. This includes the freeing of
sched_group_energy structures which are not used.
One problem is that the number of elements of the idle_state and the
capacity_state arrays is not fixed and has to be retrieved in
__sdt_alloc() to allocate memory for the sched_group_energy structure and
the two arrays in one chunk. The array pointers (idle_states and
cap_states) are initialized here to point to the correct place inside the
memory chunk.
The new function init_sched_energy() initializes the sched_group_energy
structure and the two arrays in case the sd topology level contains energy
information.
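A rough sketch (not part of the patch; the helper name alloc_sge() and its
signature are made up for illustration) of the single-chunk layout set up in
__sdt_alloc(), with both arrays carved out of the tail of one allocation:
static struct sched_group_energy *alloc_sge(unsigned int nr_idle_states,
					    unsigned int nr_cap_states,
					    int node)
{
	struct sched_group_energy *sge;
	sge = kzalloc_node(sizeof(struct sched_group_energy) +
			   nr_idle_states*sizeof(struct idle_state) +
			   nr_cap_states*sizeof(struct capacity_state),
			   GFP_KERNEL, node);
	if (!sge)
		return NULL;
	/* The idle_state array starts right behind the struct itself. */
	sge->idle_states = (struct idle_state *)(sge + 1);
	/* The capacity_state array follows the idle_state array. */
	sge->cap_states = (struct capacity_state *)
			  (sge->idle_states + nr_idle_states);
	return sge;
}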
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/core.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 35 +++++++++++++++++++++++++
2 files changed, 105 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 54f5722..ecece17 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5539,6 +5539,7 @@ static void free_sched_domain(struct rcu_head *rcu)
free_sched_groups(sd->groups, 1);
} else if (atomic_dec_and_test(&sd->groups->ref)) {
kfree(sd->groups->sgc);
+ kfree(sd->groups->sge);
kfree(sd->groups);
}
kfree(sd);
@@ -5799,6 +5800,8 @@ static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
*sg = *per_cpu_ptr(sdd->sg, cpu);
(*sg)->sgc = *per_cpu_ptr(sdd->sgc, cpu);
atomic_set(&(*sg)->sgc->ref, 1); /* for claim_allocations */
+ (*sg)->sge = *per_cpu_ptr(sdd->sge, cpu);
+ atomic_set(&(*sg)->sge->ref, 1); /* for claim_allocations */
}
return cpu;
@@ -5888,6 +5891,28 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
atomic_set(&sg->sgc->nr_busy_cpus, sg->group_weight);
}
+static void init_sched_energy(int cpu, struct sched_domain *sd,
+ struct sched_domain_topology_level *tl)
+{
+ struct sched_group *sg = sd->groups;
+ struct sched_group_energy *energy = sg->sge;
+ sched_domain_energy_f fn = tl->energy;
+ struct cpumask *mask = sched_group_cpus(sg);
+
+ if (!fn || !fn(cpu))
+ return;
+
+ if (cpumask_weight(mask) > 1)
+ check_sched_energy_data(cpu, fn, mask);
+
+ energy->nr_idle_states = fn(cpu)->nr_idle_states;
+ memcpy(energy->idle_states, fn(cpu)->idle_states,
+ energy->nr_idle_states*sizeof(struct idle_state));
+ energy->nr_cap_states = fn(cpu)->nr_cap_states;
+ memcpy(energy->cap_states, fn(cpu)->cap_states,
+ energy->nr_cap_states*sizeof(struct capacity_state));
+}
+
/*
* Initializers for schedule domains
* Non-inlined to reduce accumulated stack pressure in build_sched_domains()
@@ -5978,6 +6003,9 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
*per_cpu_ptr(sdd->sgc, cpu) = NULL;
+
+ if (atomic_read(&(*per_cpu_ptr(sdd->sge, cpu))->ref))
+ *per_cpu_ptr(sdd->sge, cpu) = NULL;
}
#ifdef CONFIG_NUMA
@@ -6383,10 +6411,24 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
if (!sdd->sgc)
return -ENOMEM;
+ sdd->sge = alloc_percpu(struct sched_group_energy *);
+ if (!sdd->sge)
+ return -ENOMEM;
+
for_each_cpu(j, cpu_map) {
struct sched_domain *sd;
struct sched_group *sg;
struct sched_group_capacity *sgc;
+ struct sched_group_energy *sge;
+ sched_domain_energy_f fn = tl->energy;
+ unsigned int nr_idle_states = 0;
+ unsigned int nr_cap_states = 0;
+
+ if (fn && fn(j)) {
+ nr_idle_states = fn(j)->nr_idle_states;
+ nr_cap_states = fn(j)->nr_cap_states;
+ BUG_ON(!nr_idle_states || !nr_cap_states);
+ }
sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
@@ -6410,6 +6452,26 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
return -ENOMEM;
*per_cpu_ptr(sdd->sgc, j) = sgc;
+
+ sge = kzalloc_node(sizeof(struct sched_group_energy) +
+ nr_idle_states*sizeof(struct idle_state) +
+ nr_cap_states*sizeof(struct capacity_state),
+ GFP_KERNEL, cpu_to_node(j));
+
+ if (!sge)
+ return -ENOMEM;
+
+ sge->idle_states = (struct idle_state *)
+ ((void *)&sge->cap_states +
+ sizeof(sge->cap_states));
+
+ sge->cap_states = (struct capacity_state *)
+ ((void *)&sge->cap_states +
+ sizeof(sge->cap_states) +
+ nr_idle_states*
+ sizeof(struct idle_state));
+
+ *per_cpu_ptr(sdd->sge, j) = sge;
}
}
@@ -6438,6 +6500,8 @@ static void __sdt_free(const struct cpumask *cpu_map)
kfree(*per_cpu_ptr(sdd->sg, j));
if (sdd->sgc)
kfree(*per_cpu_ptr(sdd->sgc, j));
+ if (sdd->sge)
+ kfree(*per_cpu_ptr(sdd->sge, j));
}
free_percpu(sdd->sd);
sdd->sd = NULL;
@@ -6445,6 +6509,8 @@ static void __sdt_free(const struct cpumask *cpu_map)
sdd->sg = NULL;
free_percpu(sdd->sgc);
sdd->sgc = NULL;
+ free_percpu(sdd->sge);
+ sdd->sge = NULL;
}
}
@@ -6516,10 +6582,13 @@ static int build_sched_domains(const struct cpumask *cpu_map,
/* Calculate CPU capacity for physical packages and nodes */
for (i = nr_cpumask_bits-1; i >= 0; i--) {
+ struct sched_domain_topology_level *tl = sched_domain_topology;
+
if (!cpumask_test_cpu(i, cpu_map))
continue;
- for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+ for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent, tl++) {
+ init_sched_energy(i, sd, tl);
claim_allocations(i, sd);
init_sched_groups_capacity(i, sd);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d300a64..1a5f1ee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -790,6 +790,41 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
extern int group_balance_cpu(struct sched_group *sg);
+/*
+ * Check that the per-cpu provided sd energy data is consistent for all cpus
+ * within the mask.
+ */
+static inline void check_sched_energy_data(int cpu, sched_domain_energy_f fn,
+ const struct cpumask *cpumask)
+{
+ struct cpumask mask;
+ int i;
+
+ cpumask_xor(&mask, cpumask, get_cpu_mask(cpu));
+
+ for_each_cpu(i, &mask) {
+ int y;
+
+ BUG_ON(fn(i)->nr_idle_states != fn(cpu)->nr_idle_states);
+
+ for (y = 0; y < (fn(i)->nr_idle_states); y++) {
+ BUG_ON(fn(i)->idle_states[y].power !=
+ fn(cpu)->idle_states[y].power);
+ BUG_ON(fn(i)->idle_states[y].wu_energy !=
+ fn(cpu)->idle_states[y].wu_energy);
+ }
+
+ BUG_ON(fn(i)->nr_cap_states != fn(cpu)->nr_cap_states);
+
+ for (y = 0; y < (fn(i)->nr_cap_states); y++) {
+ BUG_ON(fn(i)->cap_states[y].cap !=
+ fn(cpu)->cap_states[y].cap);
+ BUG_ON(fn(i)->cap_states[y].power !=
+ fn(cpu)->cap_states[y].power);
+ }
+ }
+}
+
#else
static inline void sched_ttwu_pending(void) { }
--
1.7.9.5
The energy cost of waking a cpu and sending it back to sleep can be
quite significant for short-running, frequently waking tasks if they are
placed on an idle cpu in a deep sleep state. By factoring in task wakeups,
such tasks can be placed on cpus where the wakeup energy cost is lower, for
example partly utilized cpus in a shallower idle state, or cpus in a
cluster/die that is already awake.
The current utilization of the target cpu is factored in to estimate how
many task wakeups translate into cpu wakeups (idle exits). It is a
very naive approach, but getting an accurate estimate is virtually
impossible.
wake_energy(task) = unused_util(cpu) * wakeups(task) * wakeup_energy(cpu)
There is no per cpu wakeup tracking, so we can't estimate the energy
savings when removing tasks from a cpu. It is also nearly impossible to
figure out which task is the cause of cpu wakeups if multiple tasks are
scheduled on the same cpu.
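For readers who want the arithmetic spelled out, a rough standalone sketch of
the estimate above follows. The 1024-based utilization scale mirrors the shift
used in the patch below, but the function name and the numbers are made up for
illustration; this is not the kernel implementation.

#include <stdio.h>

/*
 * Energy added by the wakeups of a task that are expected to cause
 * idle exits on the target cpu.
 */
static unsigned long wake_energy(unsigned long unused_util,   /* 0..1024, idle fraction */
				 unsigned long task_wakeups,  /* wakeups in the window */
				 unsigned long wakeup_energy) /* energy per idle exit */
{
	/*
	 * Assume the wakeups are uniformly distributed in time: only the
	 * fraction of time the cpu is idle turns a task wakeup into a
	 * cpu wakeup (idle exit).
	 */
	return (task_wakeups * wakeup_energy * unused_util) >> 10;
}

int main(void)
{
	/* 75% idle cpu, 200 wakeups, 50 energy units per run->sleep->run cycle */
	printf("estimated wakeup energy: %lu\n", wake_energy(768, 200, 50));
	return 0;
}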
wakeup_energy for each idle-state is obtained from the idle_states array.
A prediction of the most likely idle-state is needed. cpuidle is best
placed to provide that. It is not implemented yet.
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6da8e2b..aebf3e2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4367,11 +4367,13 @@ static inline unsigned long get_curr_capacity(int cpu);
* + (1-curr_util(sg)) * idle_power(sg)
* energy_after = new_util(sg) * busy_power(sg)
* + (1-new_util(sg)) * idle_power(sg)
+ * + (1-new_util(sg)) * task_wakeups
+ * * wakeup_energy(sg)
* energy_diff += energy_before - energy_after
* }
*
*/
-static int energy_diff_util(int cpu, int util)
+static int energy_diff_util(int cpu, int util, int wakeups)
{
struct sched_domain *sd;
int i;
@@ -4476,7 +4478,8 @@ static int energy_diff_util(int cpu, int util)
* The utilization change has no impact at this level (or any
* parent level).
*/
- if (aff_util_bef == aff_util_aft && curr_cap_idx == new_cap_idx)
+ if (aff_util_bef == aff_util_aft && curr_cap_idx == new_cap_idx
+ && unused_util_aft < 100)
goto unlock;
/* Energy before */
@@ -4486,6 +4489,14 @@ static int energy_diff_util(int cpu, int util)
/* Energy after */
nrg_diff += (aff_util_aft*new_state->power)/new_state->cap;
nrg_diff += (unused_util_aft * is->power)/new_state->cap;
+
+ /*
+ * Estimate how many of the wakeups happen while the cpu is
+ * idle, assuming they are uniformly distributed. Wakeups caused
+ * by other tasks are ignored.
+ */
+ nrg_diff += (wakeups * is->wu_energy >> 10)
+ * unused_util_aft/new_state->cap;
}
/*
@@ -4516,6 +4527,8 @@ static int energy_diff_util(int cpu, int util)
/* Energy after */
nrg_diff += (aff_util_aft*new_state->power)/new_state->cap;
nrg_diff += (unused_util_aft * is->power)/new_state->cap;
+ nrg_diff += (wakeups * is->wu_energy >> 10)
+ * unused_util_aft/new_state->cap;
}
unlock:
@@ -4532,8 +4545,8 @@ static int energy_diff_task(int cpu, struct task_struct *p)
if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
return INT_MAX;
- return energy_diff_util(cpu, p->se.avg.uw_load_avg_contrib);
-
+ return energy_diff_util(cpu, p->se.avg.uw_load_avg_contrib,
+ p->se.avg.wakeup_avg_sum);
}
static int wake_wide(struct task_struct *p)
--
1.7.9.5
Hi Morten,
On Fri, Jul 04, 2014 at 12:25:47AM +0800, Morten Rasmussen wrote:
> * Note that these energy savings are _not_ representative of what can be
> achieved on a true SMP platform where all cpus are equally
> energy-efficient. There should be benefit for SMP platforms as well,
> however, it will be smaller.
>
> The energy model led to consolidation of the short tasks on the A7
> cluster (more energy-efficient), while sysbench made use of all cpus as
> the A7s didn't have sufficient compute capacity to handle the five
> tasks.
Looks like this patchset is mainly for big.LITTLE? And can the patchset
actually replace Global Task Scheduling?
Thanks,
Yuyang
On Fri, Jul 04, 2014 at 12:25:55AM +0800, Morten Rasmussen wrote:
> From: Dietmar Eggemann <[email protected]>
>
> Energy aware scheduling relies on cpu utilization and to be able to
> maintain it, we need a per run queue signal of the sum of the
> unweighted, i.e. not scaled with task priority, load contribution of
> runnable task entries.
>
> The unweighted runnable load on a run queue is maintained alongside the
> existing (weighted) runnable load.
>
> This patch is the unweighted counterpart of "sched: Aggregate load
> contributed by task entities on parenting cfs_rq" (commit id
> 2dac754e10a5).
>
Hi Dietmar and Morten,
You may or may not have noticed my patch the other day:
sched: Rewrite per entity runnable load average tracking
With this new load average tracking, unweighted load average will be very much
simplifed to do.
Thanks,
Yuyang
Hi Yuyang,
On Fri, Jul 04, 2014 at 12:19:50AM +0100, Yuyang Du wrote:
> Hi Morten,
>
> On Fri, Jul 04, 2014 at 12:25:47AM +0800, Morten Rasmussen wrote:
> > * Note that these energy savings are _not_ representative of what can be
> > achieved on a true SMP platform where all cpus are equally
> > energy-efficient. There should be benefit for SMP platforms as well,
> > however, it will be smaller.
> >
> > The energy model led to consolidation of the short tasks on the A7
> > cluster (more energy-efficient), while sysbench made use of all cpus as
> > the A7s didn't have sufficient compute capacity to handle the five
> > tasks.
>
> Looks like this patchset is mainly for big.LITTLE?
No, not at all. The only big.LITTLE in there is the test platform but
that has been configured to be as close as possible to an SMP platform.
That is, no performance difference between cpus. I would have preferred
a true SMP platform for testing, but this is the only dual-cluster
platform that I have access to with proper mainline kernel support.
The patch set essentially puts tasks where it is most energy-efficient
guided by the platform energy model. That should benefit any platform,
SMP and big.LITTLE. That is at least the goal.
On an SMP platform with two clusters/packages (whatever you call a group
of cpus sharing the same power domain) you get task consolidation on a
single cluster if the energy model says that it is beneficial. Very much
like your previous proposals. It is also what I'm trying to show with
the numbers I have included.
That said, we are of course keeping in mind what would be required to
make this work for big.LITTLE. However, there is nothing big.LITTLE
specific in the patch set. Just the possibility of having different
energy models for different cpus in the system. We will have to add some
tweaks eventually to get the best out of big.LITTLE later. Somewhat
similar to what exists today for better SMT support and other
architecture specialities.
> And can the patchset actually replace Global Task Scheduling?
Global Task Scheduling is (ARM) marketing speak for letting the
scheduler know about all cpus in a big.LITTLE system. It is not an
actual implementation. There is an out-of-tree implementation of GTS
available which is very big.LITTLE specific.
The energy model driven scheduling proposed here is not big.LITTLE
specific, but aims at introducing generic energy-awareness in the
scheduler. Once energy-awareness is in place, most of the support needed
for big.LITTLE will be there too. It is generic energy-aware code that
is capable of making informed decisions based on the platform model,
big.LITTLE or SMP.
The short answer is: Not in its current state, but if we get the
energy-awareness right it should be able to.
Morten
This sounds like a math problem (one for Donald Knuth :) ).
You need to think outside the box; presenting the problem right is just
the first step, and a big one.
Then you need to come up with a formal algorithm to solve it, then prove it.
The next step is to code that algorithm and verify that it works in the real world.
If you don't present the problem right (missing big.LITTLE,
over/down-clocking) you will not get the right algorithm.
With new algorithms, very few people get it right the first time.
Hi Morten,
On Thu, Jul 03, 2014 at 05:25:47PM +0100, Morten Rasmussen wrote:
> This is an RFC and there are some loose ends that have not been
> addressed here or in the code yet. The model and its infrastructure is
> in place in the scheduler and it is being used for load-balancing
> decisions. It is used for the select_task_rq_fair() path for
> fork/exec/wake balancing and to guide the selection of the source cpu
> for periodic or idle balance.
IMHO, the series is on the right direction for addressing the energy
aware scheduling (very complex) problem. But I have some high level
comments below.
> However, the main ideas and the primary focus of this RFC: The energy
> model and energy_diff_{load, task, cpu}() are there.
>
> Due to limitation 1, the ARM TC2 platform (2xA15+3xA7) was setup to
> disable frequency scaling and set frequencies to eliminate the
> big.LITTLE performance difference. That basically turns TC2 into an SMP
> platform where a subset of the cpus are less energy-efficient.
>
> Tests using a synthetic workload with seven short running periodic
> tasks of different size and period, and the sysbench cpu benchmark with
> five threads gave the following results:
>
> cpu energy* short tasks sysbench
> Mainline 100 100
> EA 49 99
>
> * Note that these energy savings are _not_ representative of what can be
> achieved on a true SMP platform where all cpus are equally
> energy-efficient. There should be benefit for SMP platforms as well,
> however, it will be smaller.
My impression (and I may be wrong) is that you get bigger energy saving
on a big.LITTLE vs SMP system exactly because of the asymmetry in power
consumption. The algorithm proposed here ends up packing small tasks on
the little CPUs as they are more energy efficient (which is the correct
thing to do but I wonder what results you would get with 3xA7 vs
2xA7+1xA15).
For a symmetric system where all CPUs have the same energy model you
could end up with several small threads balanced equally across the
system. The only way the scheduler could avoid a CPU is if it somehow
manages to get into a deeper idle state (and energy_diff_task() would
show some asymmetry). But this wouldn't happen without the scheduler
first deciding to leave that CPU idle for longer.
Could this be addressed by making the scheduler more "proactive" and,
rather than just looking at the current energy diff, guesstimate what it
would be if not placing a task at all on the CPU? If for example there
is no other task running on that CPU, could energy_diff_task() take into
account the next deeper C-state rather than just the current one? This
way we may be able to achieve more packing even on fully symmetric
systems and allow CPUs to go into deeper sleep states.
Thanks.
--
Catalin
Hi Morten,
Thanks, got it. Then another question,
On Fri, Jul 04, 2014 at 12:06:13PM +0100, Morten Rasmussen wrote:
> The patch set essentially puts tasks where it is most energy-efficient
> guided by the platform energy model. That should benefit any platform,
> SMP and big.LITTLE. That is at least the goal.
>
I understand energy_diff_* functions are based on the energy model (though I
have not dived into the detail of how you change load balancing based on
energy_diff_*).
Speaking of the energy model, I am not sure why elaborate "imprecise" energy
numbers do a better job than just a general statement: higher freq, more cap,
and more power.
Even for big.LITTLE systems, big and little CPUs each follow that statement.
Then it is just a matter of where to place tasks between them.
In that case, the energy model might be useful, but cpu_power_orig
(from Vincent) would probably still be enough.
Thanks,
Yuyang
Hi Catalin,
On Fri, Jul 04, 2014 at 05:55:52PM +0100, Catalin Marinas wrote:
> Hi Morten,
>
> On Thu, Jul 03, 2014 at 05:25:47PM +0100, Morten Rasmussen wrote:
> > This is an RFC and there are some loose ends that have not been
> > addressed here or in the code yet. The model and its infrastructure is
> > in place in the scheduler and it is being used for load-balancing
> > decisions. It is used for the select_task_rq_fair() path for
> > fork/exec/wake balancing and to guide the selection of the source cpu
> > for periodic or idle balance.
>
> IMHO, the series is on the right direction for addressing the energy
> aware scheduling (very complex) problem. But I have some high level
> comments below.
>
> > However, the main ideas and the primary focus of this RFC: The energy
> > model and energy_diff_{load, task, cpu}() are there.
> >
> > Due to limitation 1, the ARM TC2 platform (2xA15+3xA7) was setup to
> > disable frequency scaling and set frequencies to eliminate the
> > big.LITTLE performance difference. That basically turns TC2 into an SMP
> > platform where a subset of the cpus are less energy-efficient.
> >
> > Tests using a synthetic workload with seven short running periodic
> > tasks of different size and period, and the sysbench cpu benchmark with
> > five threads gave the following results:
> >
> > cpu energy* short tasks sysbench
> > Mainline 100 100
> > EA 49 99
> >
> > * Note that these energy savings are _not_ representative of what can be
> > achieved on a true SMP platform where all cpus are equally
> > energy-efficient. There should be benefit for SMP platforms as well,
> > however, it will be smaller.
>
> My impression (and I may be wrong) is that you get bigger energy saving
> on a big.LITTLE vs SMP system exactly because of the asymmetry in power
> consumption.
That is correct. As said in the note above, the benefit will be smaller
on SMP systems.
> The algorithm proposed here ends up packing small tasks on
> the little CPUs as they are more energy efficient (which is the correct
> thing to do but I wonder what results you would get with 3xA7 vs
> 2xA7+1xA15).
>
> For a symmetric system where all CPUs have the same energy model you
> could end up with several small threads balanced equally across the
> system. The only way the scheduler could avoid a CPU is if it somehow
> manages to get into a deeper idle state (and energy_diff_task() would
> show some asymmetry). But this wouldn't happen without the scheduler
> first deciding to leave that CPU idle for longer.
It is a scenario that could happen with the current use of
energy_diff_task() in the wakeup balancing path. Any 'imbalance' might
make some cpus cheaper and hence attract the other tasks, but it is not
guaranteed to happen.
> Could this be addressed by making the scheduler more "proactive" and,
> rather than just looking at the current energy diff, guesstimate what it
> would be if not placing a task at all on the CPU? If for example there
> is no other task running on that CPU, could energy_diff_task() take into
> account the next deeper C-state rather than just the current one? This
> way we may be able to achieve more packing even on fully symmetric
> systems and allow CPUs to go into deeper sleep states.
I think it would be possible to bias the choice of cpu either by
considering potential energy savings by letting some cpus get into a
deeper C-state, or applying a static bias towards some cpus (lower cpuid
for example). Since it is in the wakeup path it must not be too complex
to figure out though.
I haven't seen the problem in reality yet. When I tried the short tasks
test with all cpus using the same energy model I got tasks consolidated
on either of the clusters. The consolidation cluster sometimes changed
during the test.
There is a lot of tuning to be done, that is for sure. We will have to
make similar decisions for the periodic/idle balance path as well.
Thanks,
Morten
On Sun, Jul 06, 2014 at 08:05:23PM +0100, Yuyang Du wrote:
> Hi Morten,
>
> Thanks, got it. Then another question,
>
> On Fri, Jul 04, 2014 at 12:06:13PM +0100, Morten Rasmussen wrote:
> > The patch set essentially puts tasks where it is most energy-efficient
> > guided by the platform energy model. That should benefit any platform,
> > SMP and big.LITTLE. That is at least the goal.
> >
>
> I understand energy_diff_* functions are based on the energy model (though I
> have not dived into the detail of how you change load balancing based on
> energy_diff_*).
>
> Speaking of the energy model, I am not sure why elaborate "imprecise" energy
> numbers do a better job than just a general statement: higher freq, more cap,
> and more power.
The idea is that the energy model allows the scheduler to estimate the
energy efficiency of the cpus under any load scenario. That way, the
scheduler can estimate the energy implications of every choice it makes.
Whether it is cheaper (in terms of energy) to increase the frequency of the
currently awake cpu instead of waking up more cpus. Which cpu is the cheapest
to wake up if another one is needed. And so on.
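As a purely illustrative, standalone sketch of what such an estimate can look
like (the struct layout, the 1024-based utilization scale and all numbers are
invented and do not match the RFC's data structures):

#include <stdio.h>

struct group_model {
	unsigned long busy_power;	/* power while the group is busy */
	unsigned long idle_power;	/* power while the group is idle */
	unsigned long curr_util;	/* current utilization, 0..1024 */
};

/*
 * Estimated energy delta of adding 'util' to a cpu, walking the group
 * hierarchy bottom-up (index 0 = cpu level, higher = larger groups).
 */
static long energy_diff_estimate(struct group_model *levels, int nr_levels,
				 unsigned long util)
{
	long diff = 0;
	int i;

	for (i = 0; i < nr_levels; i++) {
		struct group_model *g = &levels[i];
		unsigned long new_util = g->curr_util + util;
		long before, after;

		if (new_util > 1024)
			new_util = 1024;

		before = (g->curr_util * g->busy_power +
			  (1024 - g->curr_util) * g->idle_power) >> 10;
		after  = (new_util * g->busy_power +
			  (1024 - new_util) * g->idle_power) >> 10;

		diff += after - before;

		/* Busy time unchanged at this level: parents are unaffected. */
		if (before == after)
			break;
	}
	return diff;	/* positive: placing the work here costs energy */
}

int main(void)
{
	/* cpu-level and cluster-level models, made-up numbers */
	struct group_model levels[] = {
		{ .busy_power = 600, .idle_power = 20,  .curr_util = 256 },
		{ .busy_power = 900, .idle_power = 100, .curr_util = 256 },
	};

	printf("energy diff for +25%% util: %ld\n",
	       energy_diff_estimate(levels, 2, 256));
	return 0;
}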
> Even for big.LITTLE systems, big and little CPUs each follow that statement.
> Then it is just a matter of where to place tasks between them.
> In that case, the energy model might be useful, but cpu_power_orig
> (from Vincent) would probably still be enough.
cpu_power doesn't tell you anything about energy-efficiency. There is no
link with frequency scaling. No representation of power domains. I don't
see how you can make energy-aware decisions without having at least a rough
idea of the impact of those decisions. You need to consider energy
efficiency to get the most out of big.LITTLE. I believe the same is true
to some extent for SMP systems with aggressive cpu power management.
Could you elaborate on what you mean by 'a general statement'?
Thanks,
Morten
On Mon, Jul 07, 2014 at 03:00:18PM +0100, Morten Rasmussen wrote:
> > Could this be addressed by making the scheduler more "proactive" and,
> > rather than just looking at the current energy diff, guesstimate what it
> > would be if not placing a task at all on the CPU? If for example there
> > is no other task running on that CPU, could energy_diff_task() take into
> > account the next deeper C-state rather than just the current one? This
> > way we may be able to achieve more packing even on fully symmetric
> > systems and allow CPUs to go into deeper sleep states.
>
> I think it would be possible to bias the choice of cpu either by
> considering potential energy savings by letting some cpus get into a
> deeper C-state, or applying a static bias towards some cpus (lower cpuid
> for example). Since it is in the wakeup path it must not be too complex
> to figure out though.
>
> I haven't seen the problem in reality yet. When I tried the short tasks
> test with all cpus using the same energy model I got tasks consolidated
> on either of the clusters. The consolidation cluster sometimes changed
> during the test.
>
> There is a lot of tuning to be done, that is for sure. We will have to
> make similar decisions for the periodic/idle balance path as well.
So one of the things I mentioned previously (on IRC, to Morton) is that
we can use the energy numbers (P and C state) to precompute whether or
not race-to-idle makes sense for the platform. Or if it benefits from
packing etc..
So at topology setup time we can statically determine some of these
policies (maybe with a few parameters) and take it from there.
So if the platform benefits from packing, we can set the appropriate
topology bits to do so. If it benefits from race-to-idle, it can select
that, etc.
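A toy sketch of that precompute idea: if running W units of work in a window
of length T at capacity state s costs roughly T*idle_power +
W*(power(s)-idle_power)/cap(s), then the state with the smallest
(power - idle_power)/cap finishes the work most cheaply, and race-to-idle only
pays off when that happens to be the highest state. The struct layout and the
numbers below are invented for illustration:

#include <stdio.h>

struct cap_state { unsigned long cap, power; };

/* Return 1 if the highest capacity state is the cheapest per unit of work. */
static int race_to_idle(const struct cap_state *cs, int nr,
			unsigned long idle_power)
{
	int best = 0, i;

	for (i = 1; i < nr; i++) {
		/*
		 * Compare (power_i - idle)/cap_i against the best so far,
		 * cross-multiplied to avoid division.
		 */
		if ((cs[i].power - idle_power) * cs[best].cap <
		    (cs[best].power - idle_power) * cs[i].cap)
			best = i;
	}
	return best == nr - 1;
}

int main(void)
{
	struct cap_state states[] = {
		{ .cap = 512,  .power = 400  },
		{ .cap = 768,  .power = 800  },
		{ .cap = 1024, .power = 1500 },
	};

	printf("race-to-idle worth it: %s\n",
	       race_to_idle(states, 3, 50) ? "yes" : "no");
	return 0;
}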
Hi Morten,
On Mon, Jul 07, 2014 at 03:16:27PM +0100, Morten Rasmussen wrote:
> Could you elaborate on what you mean by 'a general statement'?
The general statement is: higher freq, more cap, and more power. More specific
numbers are not needed, as they are just instances of this general statement.
> cpu_power doesn't tell you anything about energy-efficiency. There is no
> link with frequency scaling.
In general, more cpu_power, more freq, less energy-efficiency, as you said sometime ago.
> No representation of power domains.
Represented by CPU topology?
Thanks,
Yuyang
On Tue, Jul 08, 2014 at 01:23:43AM +0100, Yuyang Du wrote:
> Hi Morten,
>
> On Mon, Jul 07, 2014 at 03:16:27PM +0100, Morten Rasmussen wrote:
>
> > Could you elaborate on what you mean by 'a general statement'?
>
> The general statement is: higher freq, more cap, and more power. More specific
> numbers are not needed, as they are just instances of this general statement.
I think I understand now. While that statement might be true for SMP
systems, it doesn't tell you the cost of choosing a higher frequency. If
you are optimizing for energy, you really care about energy per work (~
energy/instruction). The additional cost of going to a higher capacity
state is very platform dependent. At least on typical modern ARM platforms,
the highest states are significantly more expensive to use, so you don't
want to use them unless you really have to.
If we don't have any energy cost information, we can't make an informed
decision about whether it is worth running faster (race-to-idle or consolidating
tasks on fewer cpus) or using more cpus (if that is possible).
> > cpu_power doesn't tell you anything about energy-efficiency. There is no
> > link with frequency scaling.
>
> In general, more cpu_power, more freq, less energy-efficiency, as you said sometime ago.
Not in general :) For big.LITTLE it may be more energy efficient to run
a little cpu at a high frequency instead of using a big cpu at a low
frequency. For multi-cluster/package SMP it is not straightforward
either as it is more expensive to run the first cpu in a large power
domain than the additional cpus.
>
> > No representation of power domains.
>
> Represented by CPU topology?
Not really. The sched_domain hierarchy represents the cache hierarchy
(and nodes for NUMA), but you don't necessarily have power domains at
the same levels. But yes, the sched_domain hierarchy can be used for
this purpose as well if we attach the necessary power domain information
to it. That is basically what we do in this patch set.
Morten
Hi Morten,
Sorry for the late response, I've been swamped with other stuff lately.
I have a couple of remarks regarding the terminology and one general concern
(please see below).
On Thursday, July 03, 2014 05:25:48 PM Morten Rasmussen wrote:
> This documentation patch provides an overview of the experimental
> scheduler energy costing model, associated data structures, and a
> reference recipe on how platforms can be characterized to derive energy
> models.
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
[cut]
> +
> +Platform topology
> +------------------
> +
> +The system topology (cpus, caches, and NUMA information, not peripherals) is
> +represented in the scheduler by the sched_domain hierarchy which has
> +sched_groups attached at each level that covers one or more cpus (see
> +sched-domains.txt for more details). To add energy awareness to the scheduler
> +we need to consider power and frequency domains.
> +
> +Power domain:
> +
> +A power domain is a part of the system that can be powered on/off
> +independently. Power domains are typically organized in a hierarchy where you
> +may be able to power down just a cpu or a group of cpus along with any
> +associated resources (e.g. shared caches). Powering up a cpu means that all
> +power domains it is a part of in the hierarchy must be powered up. Hence, it is
> +more expensive to power up the first cpu that belongs to a higher level power
> +domain than powering up additional cpus in the same high level domain. Two
> +level power domain hierarchy example:
> +
> + Power source
> + +-------------------------------+----...
> +per group PD G G
> + | +----------+ |
> + +--------+-------| Shared | (other groups)
> +per-cpu PD G G | resource |
> + | | +----------+
> + +-------+ +-------+
> + | CPU 0 | | CPU 1 |
> + +-------+ +-------+
> +
> +Frequency domain:
> +
> +Frequency domains (P-states) typically cover the same group of cpus as one of
> +the power domain levels. That is, there might be several smaller power domains
> +sharing the same frequency (P-state) or there might be a power domain spanning
> +multiple frequency domains.
> +
> +From a scheduling point of view there is no need to know the actual frequencies
> +[Hz]. All the scheduler cares about is the compute capacity available at the
> +current state (P-state) the cpu is in and any other available states. For that
> +reason, and to also factor in any cpu micro-architecture differences, compute
> +capacity scaling states are called 'capacity states' in this document. For SMP
> +systems this is equivalent to P-states. For mixed micro-architecture systems
> +(like ARM big.LITTLE) it is P-states scaled according to the micro-architecture
> +performance relative to the other cpus in the system.
> +
I am used to slightly different terminology here. Namely, there are voltage
domains (parts sharing a voltage rail or a voltage regulator, such that you
can only apply/remove/change voltage to all of them at the same time) and clock
domains (analogously, but for clocks). A power domain (which in your description
above seems to correspond to a voltage domain) may be a voltage domain, a clock
domain or a combination thereof.
In addition to that, in a voltage domain it may be possible to apply many
different levels of voltage, which case doesn't seem to be covered at all by
the above (or I'm missing something).
Also a P-state is not just a frequency level, but a combination of frequency
and voltage that has to be applied for that frequency to be stable. You may
regard them as Operation Performance Points of the CPU, but that very well may
go beyond frequencies and voltages. Thus it actually is better not to talk
about P-states as "frequencies".
Now, P-states may or may not have to be coordinated between all CPUs in a
package (cluster), by hardware or software, such that all CPUs in a cluster
need to be kept in the same P-state. That you can regard as a "P-state
domain", but it usually means a specific combination of voltage and frequency.
C-states in turn are states in which CPUs don't execute instructions.
That need not mean the removal of voltage or even frequency from them.
Of course, they do mean some sort of power draw reduction, but that may
be achieved in many different ways. Some C-states require coordination
too (for example, a single C-state may apply to a whole package or cluster
at the same time) and you can think about "domains" here too, but there
need not be a direct mapping to physical parameters such as the frequency
or the voltage.
Moreover, P-states and C-states may overlap. That is, a CPU may be in Px
and Cy at the same time, which means that after leaving Cy it will execute
instructions in Px. Things like leakage may depend on x in that case and
the total power draw may depend on the combination of x and y.
> +Energy modelling:
> +------------------
> +
> +Due to the hierarchical nature of the power domains, the most obvious way to
> +model energy costs is therefore to associate power and energy costs with
> +domains (groups of cpus). Energy costs of shared resources are associated with
> +the group of cpus that share the resources, only the cost of powering the
> +cpu itself and any private resources (e.g. private L1 caches) is associated
> +with the per-cpu groups (lowest level).
> +
> +For example, for an SMP system with per-cpu power domains and a cluster level
> +(group of cpus) power domain we get the overall energy costs to be:
> +
> + energy = energy_cluster + n * energy_cpu
> +
> +where 'n' is the number of cpus powered up and energy_cluster is the cost paid
> +as soon as any cpu in the cluster is powered up.
> +
> +The power and frequency domains can naturally be mapped onto the existing
> +sched_domain hierarchy and sched_groups by adding the necessary data to the
> +existing data structures.
> +
> +The energy model considers energy consumption from three contributors (shown in
> +the illustration below):
> +
> +1. Busy energy: Energy consumed while a cpu and the higher level groups that it
> +belongs to are busy running tasks. Busy energy is associated with the state of
> +the cpu, not an event. The time the cpu spends in this state varies. Thus, the
> +most obvious platform parameter for this contribution is busy power
> +(energy/time).
> +
> +2. Idle energy: Energy consumed while a cpu and higher level groups that it
> +belongs to are idle (in a C-state). Like busy energy, idle energy is associated
> +with the state of the cpu. Thus, the platform parameter for this contribution
> +is idle power (energy/time).
> +
> +3. Wakeup energy: Energy consumed for a transition from an idle-state (C-state)
> +to a busy state (P-state) and back again, that is, a full run->sleep->run cycle
> +(they always come in pairs, transitions between idle-states are not modelled).
> +This energy is associated with an event with a fixed duration (at least
> +roughly). The most obvious platform parameter for this contribution is
> +therefore wakeup energy. Wakeup energy is depicted by the areas under the power
> +graph for the transition phases in the illustration.
> +
> +
> + Power
> + ^
> + | busy->idle idle->busy
> + | transition transition
> + |
> + | _ __
> + | / \ / \__________________
> + |______________/ \ /
> + | \ /
> + | Busy \ Idle / Busy
> + | low P-state \____________/ high P-state
> + |
> + +------------------------------------------------------------> time
> +
> +Busy |--------------| |-----------------|
> +
> +Wakeup |------| |------|
> +
> +Idle |------------|
> +
> +
> +The basic algorithm
> +====================
> +
> +The basic idea is to determine the total energy impact when utilization is
> +added or removed by estimating the impact at each level in the sched_domain
> +hierarchy starting from the bottom (sched_group contains just a single cpu).
> +The energy cost comes from three sources: busy time (sched_group is awake
> +because one or more cpus are busy), idle time (in an idle-state), and wakeups
> +(idle state exits). Power and energy numbers account for energy costs
> +associated with all cpus in the sched_group as a group. In some cases it is
> +possible to bail out early without having go to the top of the hierarchy if the
> +additional/removed utilization doesn't affect the busy time of higher levels.
> +
> + for_each_domain(cpu, sd) {
> + sg = sched_group_of(cpu)
> + energy_before = curr_util(sg) * busy_power(sg)
> + + (1-curr_util(sg)) * idle_power(sg)
> + energy_after = new_util(sg) * busy_power(sg)
> + + (1-new_util(sg)) * idle_power(sg)
> + + (1-new_util(sg)) * wakeups * wakeup_energy(sg)
> + energy_diff += energy_before - energy_after
> +
> + if (energy_before == energy_after)
> + break;
> + }
> +
> + return energy_diff
> +
> +{curr, new}_util: The cpu utilization at the lowest level and the overall
> +non-idle time for the entire group for higher levels. Utilization is in the
> +range 0.0 to 1.0 in the pseudo-code.
> +
> +busy_power: The power consumption of the sched_group.
> +
> +idle_power: The power consumption of the sched_group when idle.
> +
> +wakeups: Average wakeup rate of the task(s) being added/removed. To predict how
> +many of the wakeups are wakeups that causes idle exits we scale the number by
> +the unused utilization (assuming that wakeups are uniformly distributed).
> +
> +wakeup_energy: The energy consumed for a run->sleep->run cycle for the
> +sched_group.
The concern is that if a scaling governor is running in parallel with the above
algorithm and it has its own utilization goal (it usually does), it may change
the P-state under you to match that utilization goal and you'll end up with
something different from what you expected.
That may be addressed either by trying to predict what the scaling governor will
do (and good luck with that) or by taking care of P-states by yourself. The
latter would require changes to the algorithm I think, though.
Kind regards,
Rafael
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
On Thu, Jul 24, 2014 at 02:53:20AM +0200, Rafael J. Wysocki wrote:
> I am used to slightly different terminology here. Namely, there are voltage
> domains (parts sharing a voltage rail or a voltage regulator, such that you
> can only apply/remove/change voltage to all of them at the same time) and clock
> domains (analogously, but for clocks). A power domain (which in your description
> above seems to correspond to a voltage domain) may be a voltage domain, a clock
> domain or a combination thereof.
>
> In addition to that, in a voltage domain it may be possible to apply many
> different levels of voltage, which case doesn't seem to be covered at all by
> the above (or I'm missing something).
>
> Also a P-state is not just a frequency level, but a combination of frequency
> and voltage that has to be applied for that frequency to be stable. You may
> regard them as Operation Performance Points of the CPU, but that very well may
> go beyond frequencies and voltages. Thus it actually is better not to talk
> about P-states as "frequencies".
>
> Now, P-states may or may not have to be coordinated between all CPUs in a
> package (cluster), by hardware or software, such that all CPUs in a cluster
> need to be kept in the same P-state. That you can regard as a "P-state
> domain", but it usually means a specific combination of voltage and frequency.
I think Morton is aware of this, but for the sake of sanity dropped the
whole lot into something simpler (while hoping reality would not ruin
his life).
> C-states in turn are states in which CPUs don't execute instructions.
> That need not mean the removal of voltage or even frequency from them.
> Of course, they do mean some sort of power draw reduction, but that may
> be achieved in many different ways. Some C-states require coordination
> too (for example, a single C-state may apply to a whole package or cluster
> at the same time) and you can think about "domains" here too, but there
> need not be a direct mapping to physical parameters such as the frequency
> or the voltage.
One thing that wasn't clear to me is if you allow for C-domain and
P-domain to overlap or if they're always inclusive (where one is wholly
contained in the other).
> Moreover, P-states and C-states may overlap. That is, a CPU may be in Px
> and Cy at the same time, which means that after leaving Cy it will execute
> instructions in Px. Things like leakage may depend on x in that case and
> the total power draw may depend on the combination of x and y.
Right, and I suppose the domain thing makes it impossible to drop to the
lowest P state on going idle. Tricky that.
> The concern is that if a scaling governor is running in parallel with the above
> algorithm and it has its own utilization goal (it usually does), it may change
> the P-state under you to match that utilization goal and you'll end up with
> something different from what you expected.
>
> That may be addressed either by trying to predict what the scaling governor will
> do (and good luck with that) or by taking care of P-states by yourself. The
> latter would require changes to the algorithm I think, though.
The idea was that we'll do P states ourselves based on these utilization
figures. If we find we cannot fit the 'new' task into the current set
without either raising P or waking an idle cpu (if at all available), we
compute the cost of either option and pick the cheapest.
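A back-of-envelope sketch of that comparison, raising the P-state of the busy
cpu versus waking an idle one; the helper, the lumped wakeup cost and all
numbers are assumptions for illustration, not the RFC's implementation:

#include <stdio.h>

struct cap_state { unsigned long cap, power; };

/* Power spent running 'util' (0..cap) worth of work at the given state. */
static unsigned long busy_energy(unsigned long util, const struct cap_state *cs)
{
	return util * cs->power / cs->cap;
}

int main(void)
{
	struct cap_state low  = { .cap =  512, .power =  500 };
	struct cap_state high = { .cap = 1024, .power = 1600 };
	unsigned long curr_util = 400, new_util = 300;
	unsigned long wakeup_cost = 120;	/* idle-exit + extra idle power, lumped */

	/* Option 1: keep one cpu busy, raise its P-state to fit both. */
	unsigned long raise_p = busy_energy(curr_util + new_util, &high);

	/* Option 2: keep the busy cpu at the low state, wake a second cpu. */
	unsigned long wake_idle = busy_energy(curr_util, &low) +
				  busy_energy(new_util, &low) + wakeup_cost;

	printf("raise P-state: %lu, wake idle cpu: %lu -> %s\n",
	       raise_p, wake_idle,
	       raise_p <= wake_idle ? "raise P-state" : "wake idle cpu");
	return 0;
}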
On Thursday, July 24, 2014 09:26:09 AM Peter Zijlstra wrote:
> On Thu, Jul 24, 2014 at 02:53:20AM +0200, Rafael J. Wysocki wrote:
> > I am used to slightly different terminology here. Namely, there are voltage
> > domains (parts sharing a voltage rail or a voltage regulator, such that you
> > can only apply/remove/change voltage to all of them at the same time) and clock
> > domains (analogously, but for clocks). A power domain (which in your description
> > above seems to correspond to a voltage domain) may be a voltage domain, a clock
> > domain or a combination thereof.
> >
> > In addition to that, in a voltage domain it may be possible to apply many
> > different levels of voltage, which case doesn't seem to be covered at all by
> > the above (or I'm missing something).
> >
> > Also a P-state is not just a frequency level, but a combination of frequency
> > and voltage that has to be applied for that frequency to be stable. You may
> > regard them as Operation Performance Points of the CPU, but that very well may
> > go beyond frequencies and voltages. Thus it actually is better not to talk
> > about P-states as "frequencies".
> >
> > Now, P-states may or may not have to be coordinated between all CPUs in a
> > package (cluster), by hardware or software, such that all CPUs in a cluster
> > need to be kept in the same P-state. That you can regard as a "P-state
> > domain", but it usually means a specific combination of voltage and frequency.
>
> I think Morton is aware of this, but for the sake of sanity dropped the
> whole lot into something simpler (while hoping reality would not ruin
> his life).
>
> > C-states in turn are states in which CPUs don't execute instructions.
> > That need not mean the removal of voltage or even frequency from them.
> > Of course, they do mean some sort of power draw reduction, but that may
> > be achieved in many different ways. Some C-states require coordination
> > too (for example, a single C-state may apply to a whole package or cluster
> > at the same time) and you can think about "domains" here too, but there
> > need not be a direct mapping to physical parameters such as the frequency
> > or the voltage.
>
> One thing that wasn't clear to me is if you allow for C-domain and
> P-domain to overlap or if they're always inclusive (where one is wholly
> contained in the other).
On the CPUs I worked with so far they were always inclusive. Previously, the
whole package was a P-state domain. Today some CPUs (Haswell server chips
for example) have per-core P-states.
> > Moreover, P-states and C-states may overlap. That is, a CPU may be in Px
> > and Cy at the same time, which means that after leaving Cy it will execute
> > instructions in Px. Things like leakage may depend on x in that case and
> > the total power draw may depend on the combination of x and y.
>
> Right, and I suppose the domain thing makes it impossible to drop to the
> lowest P state on going idle. Tricky that.
That's the case for older chips. I'm not sure about the newest lot entirely
to be honest, need to ask.
> > The concern is that if a scaling governor is running in parallel with the above
> > algorithm and it has its own utilization goal (it usually does), it may change
> > the P-state under you to match that utilization goal and you'll end up with
> > something different from what you expected.
> >
> > That may be addressed either by trying to predict what the scaling governor will
> > do (and good luck with that) or by taking care of P-states by yourself. The
> > latter would require changes to the algorithm I think, though.
>
> The idea was that we'll do P states ourselves based on these utilization
> figures. If we find we cannot fit the 'new' task into the current set
> without either raising P or waking an idle cpu (if at all available), we
> compute the cost of either option and pick the cheapest.
Yeah. One subtle thing is that ramping up P may affect the other guys
(if the whole chip is a P-domain, for example), but I guess that can be
taken into account.
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
On Thu, Jul 24, 2014 at 03:28:27PM +0100, Rafael J. Wysocki wrote:
> On Thursday, July 24, 2014 09:26:09 AM Peter Zijlstra wrote:
> > On Thu, Jul 24, 2014 at 02:53:20AM +0200, Rafael J. Wysocki wrote:
> > > I am used to slightly different terminology here. Namely, there are voltage
> > > domains (parts sharing a voltage rail or a voltage regulator, such that you
> > > can only apply/remove/change voltage to all of them at the same time) and clock
> > > domains (analogously, but for clocks). A power domain (which in your description
> > > above seems to correspond to a voltage domain) may be a voltage domain, a clock
> > > domain or a combination thereof.
Your terminology is closer how the hardware actually operates, agreed. I
was hoping to keep things a bit simpler if we can get away with it. In
the simplified view a frequency domain is the combination of voltage and
clock domain (using your terminology). Since clock and voltage usually
scale together (DVFS) the assumption is that those domains are
equivalent. Thus, the frequency domain defines the subset of cpus that
scale P-state together.
A power domain (in my terminology) defines a subset of cpus that share
C-states (do-nothing-states at reduced power consumption). The actual
technique applied for the C-state implementation is not considered. It
may be anything from just clock gating up to and including completely
powering the domain off. So it isn't necessarily equivalent to the clock
or voltage domain. For example, on ARM it is quite typical to have clock
gating per cpu and sometimes also per-core power gating, while the
clock/voltage domain covers multiple cpus. It is worth noting that power
gating is often hierarchical, meaning that you can power gate larger
subsets of cpus in one go to save more power, as you can then power
down (some) shared resources as well. I think that is equivalent to
package C-states in Intel terminology.
> > > In addition to that, in a voltage domain it may be possible to apply many
> > > different levels of voltage, which case doesn't seem to be covered at all by
> > > the above (or I'm missing something).
I don't include it explicitly, but it is factored into the capacity
state data (which is really frequency states on SMP, but that is another
story). Each capacity state is represented by a compute capacity
(proportional to frequency on SMP) and the associated power consumption.
The energy-efficiency (work/energy) for the capacity state is basically
the ratio of the two. Hence the voltage is included in the power figure
associated with the P-state. It is assumed that you don't scale voltage
without scaling frequency. I hope that is a valid assumption for Intel
systems as well?
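To make the ratio concrete, a tiny standalone example follows; the field names
mirror the cap/power fields referenced in the patches, but the numbers are
invented:

#include <stdio.h>

struct capacity_state {
	unsigned long cap;	/* compute capacity at this state */
	unsigned long power;	/* power consumption at this state */
};

int main(void)
{
	/* Higher states buy capacity at an increasing energy cost. */
	struct capacity_state states[] = {
		{ .cap =  358, .power =  2967 },
		{ .cap =  512, .power =  4908 },
		{ .cap = 1024, .power = 16000 },
	};
	int i;

	for (i = 0; i < 3; i++) {
		unsigned long eff = 100 * states[i].cap / states[i].power;

		printf("state %d: cap=%4lu power=%5lu cap/power=0.%02lu\n",
		       i, states[i].cap, states[i].power, eff);
	}
	return 0;
}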
> > > Also a P-state is not just a frequency level, but a combination of frequency
> > > and voltage that has to be applied for that frequency to be stable. You may
> > > regard them as Operation Performance Points of the CPU, but that very well may
> > > go beyond frequencies and voltages. Thus it actually is better not to talk
> > > about P-states as "frequencies".
Agreed. In my world voltage and frequency are always linked, so I might
have been a bit sloppy in my definitions. I will fix that to use P-state
instead.
Capacity states are equal to P-states on SMP but not for big.LITTLE as
we also have to factor in performance differences between different
micro-architectures. Any objections to that? It is in line with the
recent renaming of cpu_power to cpu_capacity in fair.c.
> > > Now, P-states may or may not have to be coordinated between all CPUs in a
> > > package (cluster), by hardware or software, such that all CPUs in a cluster
> > > need to be kept in the same P-state. That you can regard as a "P-state
> > > domain", but it usually means a specific combination of voltage and frequency.
> >
> > I think Morton is aware of this, but for the sake of sanity dropped the
> > whole lot into something simpler (while hoping reality would not ruin
> > his life).
Spot on :-) (except for the spelling of my name ;-))
> >
> > > C-states in turn are states in which CPUs don't execute instructions.
> > > That need not mean the removal of voltage or even frequency from them.
> > > Of course, they do mean some sort of power draw reduction, but that may
> > > be achieved in many different ways. Some C-states require coordination
> > > too (for example, a single C-state may apply to a whole package or cluster
> > > at the same time) and you can think about "domains" here too, but there
> > > need not be a direct mapping to physical parameters such as the frequency
> > > or the voltage.
That is "power domains" in my simplified terminology as described above.
> > One thing that wasn't clear to me is if you allow for C-domain and
> > P-domain to overlap or if they're always inclusive (where one is wholly
> > contained in the other).
>
> On the CPUs I worked with so far they were always inclusive. Previously, the
> whole package was a P-state domain. Today some CPUs (Haswell server chips
> for example) have per-core P-states.
I don't know of any design where they overlap. My assumption is that it
won't happen ;-)
> > > Moreover, P-states and C-states may overlap. That is, a CPU may be in Px
> > > and Cy at the same time, which means that after leaving Cy it will execute
> > > instructions in Px. Things like leakage may depend on x in that case and
> > > the total power draw may depend on the combination of x and y.
Right, I have ignored that aspect so far (along with a lot of other
things) hoping that it wouldn't make too much difference. I haven't
investigated it in detail yet. I guess the main difference would be in
the shallowest C-states as you would be power gating in the deeper ones?
It could be factored in but it would mean providing platform data.
> > Right, and I suppose the domain thing makes it impossible to drop to the
> > lowest P state on going idle. Tricky that.
>
> That's the case for older chips. I'm not sure about the newest lot entirely
> to be honest, need to ask.
I think some ARM platforms do the lowest P-state trick before entering idle. But
yeah, it only makes sense if you are the last cpu in the P-state domain
to go down.
> > > The concern is that if a scaling governor is running in parallel with the above
> > > algorithm and it has its own utilization goal (it usually does), it may change
> > > the P-state under you to match that utilization goal and you'll end up with
> > > something different from what you expected.
> > >
> > > That may be addressed either by trying to predict what the scaling governor will
> > > do (and good luck with that) or by taking care of P-states by yourself. The
> > > latter would require changes to the algorithm I think, though.
> >
> > The idea was that we'll do P states ourselves based on these utilization
> > figures. If we find we cannot fit the 'new' task into the current set
> > without either raising P or waking an idle cpu (if at all available), we
> > compute the cost of either option and pick the cheapest.
>
> Yeah. One subtle thing is that ramping up P may affect the other guys
> (if the whole chip is a P-domain, for example), but I guess that can be
> taken into account.
For now I have assumed that the P-state governor will select a
P-state which is sufficient for handling the utilization. But, as Peter
already said, the plan is to at least try to guide the P-state selection
based on the decisions made by the scheduler.
Affected cpus are actually already taken into account when trying to
figure out whether to raise the P-state or wake an idle cpu.
Morten