Several techniques for saving energy through various scheduler
modifications have been proposed in the past; however, most of them have
not been universally beneficial across use-cases and platforms. For
example, consolidating tasks on fewer cpus is an effective way to save
energy on some platforms, while it may make things worse on others.
This proposal, which is inspired by the Ksummit workshop discussions in
2013 [1], takes a different approach by using a (relatively) simple
platform energy cost model to guide scheduling decisions. Fed with
platform-specific cost data, the model can estimate the energy
implications of scheduling decisions. So instead of blindly applying
scheduling techniques that may or may not work for the current use-case,
the scheduler can make informed energy-aware decisions. We believe this
approach provides a methodology that can be adapted to any platform,
including heterogeneous systems such as ARM big.LITTLE. The model
considers cpus only, i.e. no peripherals, GPU or memory. Model data
includes power consumption at each P-state and C-state.
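To make the kind of data the model consumes more concrete, a per-group
cost table might look roughly like the sketch below. The structures are
the ones described in the sched-energy documentation added later in this
series; the member names and all capacity/power numbers here are purely
illustrative, not measured platform data.

	/*
	 * Illustrative sketch only: the cap/power values are invented and
	 * real data has to be measured per platform (see
	 * Documentation/scheduler/sched-energy.txt added by this series).
	 */
	static struct idle_state cluster_idle_states[] = {
		{ .power = 25 },	/* active idle (no C-state entered) */
		{ .power =  5 },	/* cluster-sleep */
	};

	static struct capacity_state cluster_cap_states[] = {
		{ .cap =  430, .power = 100 },	/* lowest P-state */
		{ .cap =  721, .power = 300 },	/* middle P-state */
		{ .cap = 1024, .power = 600 },	/* highest P-state */
	};

	static struct sched_group_energy cluster_energy = {
		.nr_idle_states	= ARRAY_SIZE(cluster_idle_states),
		.idle_states	= cluster_idle_states,
		.nr_cap_states	= ARRAY_SIZE(cluster_cap_states),
		.cap_states	= cluster_cap_states,
	};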
This is an RFC and there are some loose ends that have not been
addressed here or in the code yet. The model and its infrastructure are
in place in the scheduler and are being used for load-balancing
decisions. The energy model data is hardcoded and there are some
limitations still to be addressed. However, the main idea, using an
energy model to guide scheduling decisions, is presented here.
RFCv4 is a consolidation of the latest energy model related patches and
patches adding scale-invariance to the CFS per-entity load-tracking
(PELT) as well as fixing a few issues that have emerged as we use PELT
more extensively for load-balancing. The patches are based on
tip/sched/core. Many of the changes since RFCv3 address issues pointed
out during the review of v3 by Peter, Sai, and Xunlei. However, there are
still a few issues that need fixing. Energy-aware scheduling now strictly
follows the 'tipping point' policy (with one minor exception). That is,
when the system is deemed over-utilized (above the 'tipping point'), all
balancing decisions are made in the normal way, based on priority-scaled
load and spreading of tasks. Below the tipping point, energy-aware
scheduling decisions are active. The rationale is that below the tipping
point we can safely shuffle tasks around without harming throughput. The
focus is more on putting tasks on the right cpus at wake-up and less on
periodic/idle/nohz_idle balancing, as the latter rarely get a chance to
move tasks when the system is below the tipping point, where tasks are
smaller and not always running/runnable. This has simplified the code a
bit.
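For reference, the over-utilization test is conceptually as simple as
checking whether any cpu's usage gets too close to its capacity. The
sketch below illustrates the idea; the 25% headroom and the helper usage
are illustrative assumptions here, the real check is in the tipping point
patch later in the series.

	/*
	 * Sketch only: a cpu is considered over-utilized when its usage
	 * leaves less than ~25% headroom to its capacity. The margin is an
	 * illustrative choice, not necessarily what the patches use.
	 */
	static bool cpu_overutilized_sketch(int cpu)
	{
		return capacity_of(cpu) * 1024 < get_cpu_usage(cpu) * 1280;
	}

If any cpu trips a test like this, the whole system is treated as being
above the tipping point and load balancing falls back to the conventional
spreading-based path.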
The patch set now consists of two main parts, but also contains
independent fixes that will be reposted separately later. The capacity
rework [2] that was included in RFCv3 has been merged in v4.1-rc1, and
[3] has been reworked. The latter is the first part of this patch set.
Patch 01-12: sched: frequency and cpu invariant per-entity load-tracking
and other load-tracking bits.
Patch 13-34: sched: Energy cost model and energy-aware scheduling
features.
Test results for ARM TC2 (2xA15+3xA7) with cpufreq enabled:
sysbench: Single task running for 3 seconds.
rt-app [4]: mp3 playback use-case model
rt-app [4]: 5 ~[6,13,19,25,31,38,44,50]% periodic (2ms) tasks
Note: % is relative to the capacity of the fastest cpu at the highest
frequency, i.e. the more busy ones do not fit on little cpus.
A newer version of rt-app was used which supports a better but slightly
different way of modelling the periodic tasks. Numbers are therefore
_not_ comparable to the RFCv3 numbers.
Average numbers for 20 runs per test (ARM TC2).
Energy           Mainline   EAS    noEAS
sysbench         100        251*   227*
rt-app mp3       100         63    111
rt-app 6%        100         42    102
rt-app 13%       100         58    101
rt-app 19%       100         87    101
rt-app 25%       100         94    104
rt-app 31%       100         93    104
rt-app 38%       100        114    117
rt-app 44%       100        115    118
rt-app 50%       100        125    126
The higher-load rt-app runs show significant variation in the energy
numbers for mainline, as it places tasks more or less arbitrarily due to
its lack of proper compute-capacity awareness; tasks may be scheduled on
LITTLE cpus despite being too big for them.
Early test results for ARM (64-bit) Juno (2xA57+4xA53) with cpufreq
enabled:
Average numbers for 20 runs per test (ARM Juno).
Energy           Mainline   EAS    noEAS
sysbench         100        219    196
rt-app mp3       100         82    120
rt-app 6%        100         65    108
rt-app 13%       100         75    102
rt-app 19%       100         86    104
rt-app 25%       100         84    105
rt-app 31%       100         87    111
rt-app 38%       100        136    132
rt-app 44%       100        141    141
rt-app 50%       100        146    142
* Sensitive to task placement on big.LITTLE. Mainline may put the task
on either a big or a little cpu due to its lack of compute capacity
awareness, while EAS consistently puts heavy tasks on big cpus. The EAS
energy increase came with a 2.06x (TC2)/1.70x (Juno) _increase_ in
performance (throughput) vs Mainline.
[1] http://etherpad.osuosl.org/energy-aware-scheduling-ks-2013 (search
for 'cost')
[2] https://lkml.org/lkml/2015/1/15/136
[3] https://lkml.org/lkml/2014/12/2/328
[4] https://wiki.linaro.org/WorkingGroups/PowerManagement/Resources/Tools/WorkloadGen
Changes:
RFCv4:
(0) Reordering of the whole patch-set:
01-02: Frequency-invariant PELT
03-08: CPU-invariant PELT
09-10: Track blocked usage
11-12: PELT fixes for forked and dying tasks
13-18: Energy model data structures
19-21: Energy model helper functions
22-24: Energy calculation functions
25-26: Tipping point and max cpu capacity
27-29: Idle-state index for energy model
30-34: Energy-aware scheduling
(1) Rework frequency and cpu invariance arch support.
- Remove weak arch functions and replace them with #defines and
cpufreq notifiers.
(2) Changed PELT initialization and immediate removal of dead tasks from
PELT rq signals.
(3) Scheduler energy data setup.
- Clean-up of allocation and initialization of energy data structures.
(4) Fix issue in sched_group_energy() not using correct capacity index.
(5) Rework energy-aware load balancing code.
- Introduce a system-wide over-utilization indicator/tipping point.
- Restrict periodic/idle/nohz_idle load balance to the detection of
over-utilization scenarios.
- Use conventional load-balance path when above tipping point and bail
out when below.
- Made energy-aware wake-up conditional on tipping point (only when
below) and added capacity awareness to wake-ups when above.
RFCv3: https://lkml.org/lkml/2015/2/4/537
Dietmar Eggemann (12):
sched: Make load tracking frequency scale-invariant
arm: vexpress: Add CPU clock-frequencies to TC2 device-tree
sched: Make usage tracking cpu scale-invariant
arm: Cpu invariant scheduler load-tracking support
sched: Get rid of scaling usage by cpu_capacity_orig
sched: Introduce energy data structures
sched: Allocate and initialize energy data structures
arm: topology: Define TC2 energy and provide it to the scheduler
sched: Store system-wide maximum cpu capacity in root domain
sched: Determine the current sched_group idle-state
sched: Consider a not over-utilized energy-aware system as balanced
sched: Enable idle balance to pull single task towards cpu with higher
capacity
Morten Rasmussen (22):
arm: Frequency invariant scheduler load-tracking support
sched: Convert arch_scale_cpu_capacity() from weak function to #define
arm: Update arch_scale_cpu_capacity() to reflect change to define
sched: Track blocked utilization contributions
sched: Include blocked utilization in usage tracking
sched: Remove blocked load and utilization contributions of dying
tasks
sched: Initialize CFS task load and usage before placing task on rq
sched: Documentation for scheduler energy cost model
sched: Make energy awareness a sched feature
sched: Introduce SD_SHARE_CAP_STATES sched_domain flag
sched: Compute cpu capacity available at current frequency
sched: Relocated get_cpu_usage() and change return type
sched: Highest energy aware balancing sched_domain level pointer
sched: Calculate energy consumption of sched_group
sched: Extend sched_group_energy to test load-balancing decisions
sched: Estimate energy impact of scheduling decisions
sched: Add over-utilization/tipping point indicator
sched, cpuidle: Track cpuidle state index in the scheduler
sched: Count number of shallower idle-states in struct
sched_group_energy
sched: Add cpu capacity awareness to wakeup balancing
sched: Energy-aware wake-up task placement
sched: Disable energy-unfriendly nohz kicks
Documentation/scheduler/sched-energy.txt | 363 +++++++++++++++++
arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 5 +
arch/arm/include/asm/topology.h | 11 +
arch/arm/kernel/smp.c | 56 ++-
arch/arm/kernel/topology.c | 204 +++++++---
include/linux/sched.h | 22 +
kernel/sched/core.c | 139 ++++++-
kernel/sched/fair.c | 634 +++++++++++++++++++++++++----
kernel/sched/features.h | 11 +-
kernel/sched/idle.c | 2 +
kernel/sched/sched.h | 81 +++-
11 files changed, 1391 insertions(+), 137 deletions(-)
create mode 100644 Documentation/scheduler/sched-energy.txt
--
1.9.1
From: Morten Rasmussen <[email protected]>
Implements an arch-specific function to provide the scheduler with a
frequency scaling correction factor for more accurate load-tracking.
The factor is:

	(current_freq(cpu) << SCHED_CAPACITY_SHIFT) / max_freq(cpu)
This implementation only provides frequency invariance. No cpu
invariance yet.
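For example, purely to illustrate the formula, a cpu currently running
at 600 MHz with a 1 GHz maximum gets:

	(600000 << 10) / 1000000 = 614

so its load/utilization accrues at roughly 60% (614/1024) of the rate it
would at the highest frequency.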
Cc: Russell King <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
arch/arm/include/asm/topology.h | 7 ++++++
arch/arm/kernel/smp.c | 56 +++++++++++++++++++++++++++++++++++++++--
arch/arm/kernel/topology.c | 17 +++++++++++++
3 files changed, 78 insertions(+), 2 deletions(-)
diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 2fe85ff..4b985dc 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -24,6 +24,13 @@ void init_cpu_topology(void);
void store_cpu_topology(unsigned int cpuid);
const struct cpumask *cpu_coregroup_mask(int cpu);
+#define arch_scale_freq_capacity arm_arch_scale_freq_capacity
+struct sched_domain;
+extern
+unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
+DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity);
+
#else
static inline void init_cpu_topology(void) { }
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index cca5b87..856bb3d 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -677,12 +677,34 @@ static DEFINE_PER_CPU(unsigned long, l_p_j_ref);
static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq);
static unsigned long global_l_p_j_ref;
static unsigned long global_l_p_j_ref_freq;
+static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
+DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);
+
+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling through arch_scale_freq_capacity()
+ * (implemented in topology.c).
+ */
+static inline
+void scale_freq_capacity(int cpu, unsigned long curr, unsigned long max)
+{
+ unsigned long capacity;
+
+ if (!max)
+ return;
+
+ capacity = (curr << SCHED_CAPACITY_SHIFT) / max;
+ atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity);
+}
static int cpufreq_callback(struct notifier_block *nb,
unsigned long val, void *data)
{
struct cpufreq_freqs *freq = data;
int cpu = freq->cpu;
+ unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
if (freq->flags & CPUFREQ_CONST_LOOPS)
return NOTIFY_OK;
@@ -707,6 +729,9 @@ static int cpufreq_callback(struct notifier_block *nb,
per_cpu(l_p_j_ref_freq, cpu),
freq->new);
}
+
+ scale_freq_capacity(cpu, freq->new, max);
+
return NOTIFY_OK;
}
@@ -714,11 +739,38 @@ static struct notifier_block cpufreq_notifier = {
.notifier_call = cpufreq_callback,
};
+static int cpufreq_policy_callback(struct notifier_block *nb,
+ unsigned long val, void *data)
+{
+ struct cpufreq_policy *policy = data;
+ int i;
+
+ if (val != CPUFREQ_NOTIFY)
+ return NOTIFY_OK;
+
+ for_each_cpu(i, policy->cpus) {
+ scale_freq_capacity(i, policy->cur, policy->max);
+ atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block cpufreq_policy_notifier = {
+ .notifier_call = cpufreq_policy_callback,
+};
+
static int __init register_cpufreq_notifier(void)
{
- return cpufreq_register_notifier(&cpufreq_notifier,
+ int ret;
+
+ ret = cpufreq_register_notifier(&cpufreq_notifier,
CPUFREQ_TRANSITION_NOTIFIER);
+ if (ret)
+ return ret;
+
+ return cpufreq_register_notifier(&cpufreq_policy_notifier,
+ CPUFREQ_POLICY_NOTIFIER);
}
core_initcall(register_cpufreq_notifier);
-
#endif
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 08b7847..9c09e6e 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -169,6 +169,23 @@ static void update_cpu_capacity(unsigned int cpu)
cpu, arch_scale_cpu_capacity(NULL, cpu));
}
+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling (arch_scale_freq_capacity()). The scaling
+ * factor is updated in smp.c
+ */
+unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+ unsigned long curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu));
+
+ if (!curr)
+ return SCHED_CAPACITY_SCALE;
+
+ return curr;
+}
+
#else
static inline void parse_dt_topology(void) {}
static inline void update_cpu_capacity(unsigned int cpuid) {}
--
1.9.1
From: Dietmar Eggemann <[email protected]>
Apply the frequency scale-invariance correction factor to load tracking.
Each segment of the sched_avg::runnable_avg_sum geometric series is now
scaled by the current frequency so that the sched_avg::load_avg_contrib
of each entity will be invariant with frequency scaling. As a result,
cfs_rq::runnable_load_avg, which is the sum of sched_avg::load_avg_contrib,
becomes invariant too. So the load level that is returned by
weighted_cpuload() stays relative to the max frequency of the cpu.
Then, we want to keep the load tracking values in a 32-bit type, which
implies that the max value of sched_avg::{runnable|running}_avg_sum must
be lower than 2^32/88761 = 48388 (88761 is the max weight of a task). As
LOAD_AVG_MAX = 47742, arch_scale_freq_capacity() must return a value less
than (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_CAPACITY_SCALE =
1024). So we define the range to [0..SCHED_CAPACITY_SCALE] in order to
avoid overflow.
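Plugging in the numbers as a sanity check: with the factor capped at
SCHED_CAPACITY_SCALE the geometric series sum cannot exceed LOAD_AVG_MAX,
and

	47742 * 88761 = 4237627662 < 2^32 = 4294967296

so the weight-scaled sums still fit in 32 bits.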
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Acked-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 28 ++++++++++++++++------------
1 file changed, 16 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f18ddb7..5eccd63 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2536,9 +2536,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
int runnable,
int running)
{
- u64 delta, periods;
- u32 runnable_contrib;
- int delta_w, decayed = 0;
+ u64 delta, scaled_delta, periods;
+ u32 runnable_contrib, scaled_runnable_contrib;
+ int delta_w, scaled_delta_w, decayed = 0;
unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
delta = now - sa->last_runnable_update;
@@ -2572,11 +2572,12 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
* period and accrue it.
*/
delta_w = 1024 - delta_w;
+ scaled_delta_w = (delta_w * scale_freq) >> SCHED_CAPACITY_SHIFT;
+
if (runnable)
- sa->runnable_avg_sum += delta_w;
+ sa->runnable_avg_sum += scaled_delta_w;
if (running)
- sa->running_avg_sum += delta_w * scale_freq
- >> SCHED_CAPACITY_SHIFT;
+ sa->running_avg_sum += scaled_delta_w;
sa->avg_period += delta_w;
delta -= delta_w;
@@ -2594,20 +2595,23 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
/* Efficiently calculate \sum (1..n_period) 1024*y^i */
runnable_contrib = __compute_runnable_contrib(periods);
+ scaled_runnable_contrib = (runnable_contrib * scale_freq)
+ >> SCHED_CAPACITY_SHIFT;
+
if (runnable)
- sa->runnable_avg_sum += runnable_contrib;
+ sa->runnable_avg_sum += scaled_runnable_contrib;
if (running)
- sa->running_avg_sum += runnable_contrib * scale_freq
- >> SCHED_CAPACITY_SHIFT;
+ sa->running_avg_sum += scaled_runnable_contrib;
sa->avg_period += runnable_contrib;
}
/* Remainder of delta accrued against u_0` */
+ scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
+
if (runnable)
- sa->runnable_avg_sum += delta;
+ sa->runnable_avg_sum += scaled_delta;
if (running)
- sa->running_avg_sum += delta * scale_freq
- >> SCHED_CAPACITY_SHIFT;
+ sa->running_avg_sum += scaled_delta;
sa->avg_period += delta;
return decayed;
--
1.9.1
From: Dietmar Eggemann <[email protected]>
To enable the parsing of clock frequency and cpu efficiency values
inside parse_dt_topology() [arch/arm/kernel/topology.c], which are used
to scale the relative capacity of the cpus, the clock-frequency property
has to be provided within the cpu nodes of the dts file.
The patch is a copy of commit 8f15973ef8c3 ("ARM: vexpress: Add CPU
clock-frequencies to TC2 device-tree") taken from the Linaro Stable
Kernel (LSK), massaged into mainline.
Cc: Jon Medhurst <[email protected]>
Cc: Russell King <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
index 7a2aeac..76613b2 100644
--- a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
+++ b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
@@ -39,6 +39,7 @@
reg = <0>;
cci-control-port = <&cci_control1>;
cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
+ clock-frequency = <1000000000>;
};
cpu1: cpu@1 {
@@ -47,6 +48,7 @@
reg = <1>;
cci-control-port = <&cci_control1>;
cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
+ clock-frequency = <1000000000>;
};
cpu2: cpu@2 {
@@ -55,6 +57,7 @@
reg = <0x100>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ clock-frequency = <800000000>;
};
cpu3: cpu@3 {
@@ -63,6 +66,7 @@
reg = <0x101>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ clock-frequency = <800000000>;
};
cpu4: cpu@4 {
@@ -71,6 +75,7 @@
reg = <0x102>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ clock-frequency = <800000000>;
};
idle-states {
--
1.9.1
Bring arch_scale_cpu_capacity() in line with the recent change of its
arch_scale_freq_capacity() sibling in commit dfbca41f3479 ("sched:
Optimize freq invariant accounting") from weak function to #define to
allow inlining of the function.
While at it, remove the ARCH_CAPACITY sched_feature as well. With the
change to a #define there isn't a straightforward way to allow a runtime
switch between an arch implementation and the default implementation of
arch_scale_cpu_capacity() using a sched_feature. The default was to use
the arch-specific implementation, but only the arm architecture provided
one.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 22 +---------------------
kernel/sched/features.h | 5 -----
kernel/sched/sched.h | 11 +++++++++++
3 files changed, 12 insertions(+), 26 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eccd63..d71d0ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6214,19 +6214,6 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
return load_idx;
}
-static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
-{
- if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
- return sd->smt_gain / sd->span_weight;
-
- return SCHED_CAPACITY_SCALE;
-}
-
-unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
-{
- return default_scale_cpu_capacity(sd, cpu);
-}
-
static unsigned long scale_rt_capacity(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -6256,16 +6243,9 @@ static unsigned long scale_rt_capacity(int cpu)
static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
- unsigned long capacity = SCHED_CAPACITY_SCALE;
+ unsigned long capacity = arch_scale_cpu_capacity(sd, cpu);
struct sched_group *sdg = sd->groups;
- if (sched_feat(ARCH_CAPACITY))
- capacity *= arch_scale_cpu_capacity(sd, cpu);
- else
- capacity *= default_scale_cpu_capacity(sd, cpu);
-
- capacity >>= SCHED_CAPACITY_SHIFT;
-
cpu_rq(cpu)->cpu_capacity_orig = capacity;
capacity *= scale_rt_capacity(cpu);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 91e33cd..03d8072 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -36,11 +36,6 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
*/
SCHED_FEAT(WAKEUP_PREEMPTION, true)
-/*
- * Use arch dependent cpu capacity functions
- */
-SCHED_FEAT(ARCH_CAPACITY, true)
-
SCHED_FEAT(HRTICK, false)
SCHED_FEAT(DOUBLE_TICK, false)
SCHED_FEAT(LB_BIAS, true)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d854555..b422e08 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1400,6 +1400,17 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
}
#endif
+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+{
+ if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+ return sd->smt_gain / sd->span_weight;
+
+ return SCHED_CAPACITY_SCALE;
+}
+#endif
+
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
--
1.9.1
arch_scale_cpu_capacity() is no longer a weak function but a #define
instead. Include the #define in topology.h.
cc: Russell King <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
arch/arm/include/asm/topology.h | 4 ++++
arch/arm/kernel/topology.c | 2 +-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 4b985dc..cf66aca 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -31,6 +31,10 @@ unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity);
+#define arch_scale_cpu_capacity arm_arch_scale_cpu_capacity
+extern
+unsigned long arm_arch_scale_cpu_capacity(struct sched_domain *sd, int cpu);
+
#else
static inline void init_cpu_topology(void) { }
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 9c09e6e..bad267c 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
*/
static DEFINE_PER_CPU(unsigned long, cpu_scale);
-unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+unsigned long arm_arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
return per_cpu(cpu_scale, cpu);
}
--
1.9.1
From: Dietmar Eggemann <[email protected]>
Besides the existing frequency scale-invariance correction factor, apply
a cpu scale-invariance correction factor to usage tracking.
Cpu scale-invariance takes into account cpu performance deviations due to
micro-architectural differences (i.e. instructions per second) between
cpus in HMP systems (e.g. big.LITTLE) as well as differences in the
frequency value of the highest OPP between cpus in SMP systems.
Each segment of the sched_avg::running_avg_sum geometric series is now
scaled by the cpu performance factor too, so the
sched_avg::utilization_avg_contrib of each entity will be invariant with
respect to the particular cpu of the HMP/SMP system it is gathered on.
So the usage level that is returned by get_cpu_usage() stays relative to
the max cpu performance of the system.
In contrast to usage, load (sched_avg::runnable_avg_sum) is currently not
made cpu scale-invariant, because this would have a negative effect on
the existing load-balance code based on s[dg]_lb_stats::avg_load in
overload scenarios.
example: 7 always running tasks
         4 on cluster 0 (2 cpus w/ cpu_capacity=512)
         3 on cluster 1 (1 cpu  w/ cpu_capacity=1024)

                  cluster 0       cluster 1
capacity          1024 (2*512)    1024 (1*1024)
load              4096            3072
cpu-scaled load   2048            3072
Simply using cpu-scaled load in the existing lb code would declare
cluster 1 busier than cluster 0, although the compute capacity budget
for one task is higher on cluster 1 (1024/3 = 341) than on cluster 0
(2*512/4 = 256).
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 13 +++++++++++++
kernel/sched/sched.h | 2 +-
2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d71d0ca..af55982 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2540,6 +2540,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
u32 runnable_contrib, scaled_runnable_contrib;
int delta_w, scaled_delta_w, decayed = 0;
unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
+ unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
delta = now - sa->last_runnable_update;
/*
@@ -2576,6 +2577,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
if (runnable)
sa->runnable_avg_sum += scaled_delta_w;
+
+ scaled_delta_w *= scale_cpu;
+ scaled_delta_w >>= SCHED_CAPACITY_SHIFT;
+
if (running)
sa->running_avg_sum += scaled_delta_w;
sa->avg_period += delta_w;
@@ -2600,6 +2605,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
if (runnable)
sa->runnable_avg_sum += scaled_runnable_contrib;
+
+ scaled_runnable_contrib *= scale_cpu;
+ scaled_runnable_contrib >>= SCHED_CAPACITY_SHIFT;
+
if (running)
sa->running_avg_sum += scaled_runnable_contrib;
sa->avg_period += runnable_contrib;
@@ -2610,6 +2619,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
if (runnable)
sa->runnable_avg_sum += scaled_delta;
+
+ scaled_delta *= scale_cpu;
+ scaled_delta >>= SCHED_CAPACITY_SHIFT;
+
if (running)
sa->running_avg_sum += scaled_delta;
sa->avg_period += delta;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b422e08..3193025 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1404,7 +1404,7 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
- if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+ if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
return sd->smt_gain / sd->span_weight;
return SCHED_CAPACITY_SCALE;
--
1.9.1
From: Dietmar Eggemann <[email protected]>
Reuses the existing infrastructure for cpu_scale to provide the
scheduler with a cpu scaling correction factor for more accurate
load-tracking. This factor comprises a micro-architectural part, which is
based on the cpu efficiency value of a cpu, as well as a platform-wide
max frequency part, which relates to the clock-frequency dtb property of
a cpu node.
The calculation of cpu_scale, the return value of
arch_scale_cpu_capacity(), changes from

	capacity / middle_capacity

with capacity = (clock_frequency >> 20) * cpu_efficiency, to

	SCHED_CAPACITY_SCALE * cpu_perf / max_cpu_perf

The range of the cpu_scale value changes from
[0..3*SCHED_CAPACITY_SCALE/2] to [0..SCHED_CAPACITY_SCALE].
The functionality to calculate the middle_capacity, which corresponds to
an 'average' cpu, has been taken out since the scaling is now done
differently.
In the case that either the cpu efficiency or the clock-frequency value
for a cpu is missing, no cpu scaling is done for any cpu.
The platform-wide max frequency part of the factor should not be confused
with the frequency-invariant scheduler load-tracking support, which deals
with frequency-related scaling due to DVFS functionality on a cpu.
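As a worked example, using the TC2 clock-frequency values added by the
device-tree patch earlier in this series together with the existing
table_efficiency entries (3891 for Cortex-A15, 2048 for Cortex-A7), the
new calculation gives roughly:

	cpu_perf(A15) = (1000000000 >> 20) * 3891 = 953 * 3891 = 3708123
	cpu_perf(A7)  = ( 800000000 >> 20) * 2048 = 762 * 2048 = 1560576

	cpu_scale(A15) = 1024 * 3708123 / 3708123 = 1024
	cpu_scale(A7)  = 1024 * 1560576 / 3708123 ~= 430

The exact values depend on the integer arithmetic in the patch; the
numbers above are just an illustration of the formula.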
Cc: Russell King <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm/kernel/topology.c | 64 +++++++++++++++++-----------------------------
1 file changed, 23 insertions(+), 41 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index bad267c..5867587 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -62,9 +62,7 @@ struct cpu_efficiency {
* Table of relative efficiency of each processors
* The efficiency value must fit in 20bit and the final
* cpu_scale value must be in the range
- * 0 < cpu_scale < 3*SCHED_CAPACITY_SCALE/2
- * in order to return at most 1 when DIV_ROUND_CLOSEST
- * is used to compute the capacity of a CPU.
+ * 0 < cpu_scale < SCHED_CAPACITY_SCALE.
* Processors that are not defined in the table,
* use the default SCHED_CAPACITY_SCALE value for cpu_scale.
*/
@@ -77,24 +75,18 @@ static const struct cpu_efficiency table_efficiency[] = {
static unsigned long *__cpu_capacity;
#define cpu_capacity(cpu) __cpu_capacity[cpu]
-static unsigned long middle_capacity = 1;
+static unsigned long max_cpu_perf;
/*
* Iterate all CPUs' descriptor in DT and compute the efficiency
- * (as per table_efficiency). Also calculate a middle efficiency
- * as close as possible to (max{eff_i} - min{eff_i}) / 2
- * This is later used to scale the cpu_capacity field such that an
- * 'average' CPU is of middle capacity. Also see the comments near
- * table_efficiency[] and update_cpu_capacity().
+ * (as per table_efficiency). Calculate the max cpu performance too.
*/
+
static void __init parse_dt_topology(void)
{
const struct cpu_efficiency *cpu_eff;
struct device_node *cn = NULL;
- unsigned long min_capacity = ULONG_MAX;
- unsigned long max_capacity = 0;
- unsigned long capacity = 0;
- int cpu = 0;
+ int cpu = 0, i = 0;
__cpu_capacity = kcalloc(nr_cpu_ids, sizeof(*__cpu_capacity),
GFP_NOWAIT);
@@ -102,6 +94,7 @@ static void __init parse_dt_topology(void)
for_each_possible_cpu(cpu) {
const u32 *rate;
int len;
+ unsigned long cpu_perf;
/* too early to use cpu->of_node */
cn = of_get_cpu_node(cpu, NULL);
@@ -124,46 +117,35 @@ static void __init parse_dt_topology(void)
continue;
}
- capacity = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency;
-
- /* Save min capacity of the system */
- if (capacity < min_capacity)
- min_capacity = capacity;
-
- /* Save max capacity of the system */
- if (capacity > max_capacity)
- max_capacity = capacity;
-
- cpu_capacity(cpu) = capacity;
+ cpu_perf = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency;
+ cpu_capacity(cpu) = cpu_perf;
+ max_cpu_perf = max(max_cpu_perf, cpu_perf);
+ i++;
}
- /* If min and max capacities are equals, we bypass the update of the
- * cpu_scale because all CPUs have the same capacity. Otherwise, we
- * compute a middle_capacity factor that will ensure that the capacity
- * of an 'average' CPU of the system will be as close as possible to
- * SCHED_CAPACITY_SCALE, which is the default value, but with the
- * constraint explained near table_efficiency[].
- */
- if (4*max_capacity < (3*(max_capacity + min_capacity)))
- middle_capacity = (min_capacity + max_capacity)
- >> (SCHED_CAPACITY_SHIFT+1);
- else
- middle_capacity = ((max_capacity / 3)
- >> (SCHED_CAPACITY_SHIFT-1)) + 1;
-
+ if (i < num_possible_cpus())
+ max_cpu_perf = 0;
}
/*
* Look for a customed capacity of a CPU in the cpu_capacity table during the
* boot. The update of all CPUs is in O(n^2) for heteregeneous system but the
- * function returns directly for SMP system.
+ * function returns directly for SMP systems or if there is no complete set
+ * of cpu efficiency, clock frequency data for each cpu.
*/
static void update_cpu_capacity(unsigned int cpu)
{
- if (!cpu_capacity(cpu))
+ unsigned long capacity = cpu_capacity(cpu);
+
+ if (!capacity || !max_cpu_perf) {
+ cpu_capacity(cpu) = 0;
return;
+ }
+
+ capacity *= SCHED_CAPACITY_SCALE;
+ capacity /= max_cpu_perf;
- set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
+ set_capacity_scale(cpu, capacity);
pr_info("CPU%u: update cpu_capacity %lu\n",
cpu, arch_scale_cpu_capacity(NULL, cpu));
--
1.9.1
From: Dietmar Eggemann <[email protected]>
Since cfs_rq::utilization_load_avg is now cpu (uarch plus max system
frequency) invariant in addition to being frequency invariant, both
frequency and cpu scaling happen as part of the load tracking.
So cfs_rq::utilization_load_avg does not have to be scaled by the
original capacity of the cpu again.
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 34 +++++++++++++++++++++-------------
1 file changed, 21 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index af55982..8420444 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4998,32 +4998,40 @@ static int select_idle_sibling(struct task_struct *p, int target)
done:
return target;
}
+
/*
* get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
* tasks. The unit of the return value must be the one of capacity so we can
* compare the usage with the capacity of the CPU that is available for CFS
* task (ie cpu_capacity).
+ *
* cfs.utilization_load_avg is the sum of running time of runnable tasks on a
* CPU. It represents the amount of utilization of a CPU in the range
- * [0..SCHED_LOAD_SCALE]. The usage of a CPU can't be higher than the full
- * capacity of the CPU because it's about the running time on this CPU.
- * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
- * because of unfortunate rounding in avg_period and running_load_avg or just
- * after migrating tasks until the average stabilizes with the new running
- * time. So we need to check that the usage stays into the range
- * [0..cpu_capacity_orig] and cap if necessary.
- * Without capping the usage, a group could be seen as overloaded (CPU0 usage
- * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity
+ * [0..capacity_orig] where capacity_orig is the cpu_capacity available at the
+ * highest frequency (arch_scale_freq_capacity()). The usage of a CPU converges
+ * towards a sum equal to or less than the current capacity (capacity_curr <=
+ * capacity_orig) of the CPU because it is the running time on this CPU scaled
+ * by capacity_curr. Nevertheless, cfs.utilization_load_avg can be higher than
+ * capacity_curr or even higher than capacity_orig because of unfortunate
+ * rounding in avg_period and running_load_avg or just after migrating tasks
+ * (and new task wakeups) until the average stabilizes with the new running
+ * time. We need to check that the usage stays into the range
+ * [0..capacity_orig] and cap if necessary. Without capping the usage, a group
+ * could be seen as overloaded (CPU0 usage at 121% + CPU1 usage at 80%) whereas
+ * CPU1 has 20% of available capacity. We allow usage to overshoot
+ * capacity_curr (but not capacity_orig) as it is useful for predicting the
+ * capacity required after task migrations (scheduler-driven DVFS).
*/
+
static int get_cpu_usage(int cpu)
{
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
- unsigned long capacity = capacity_orig_of(cpu);
+ unsigned long capacity_orig = capacity_orig_of(cpu);
- if (usage >= SCHED_LOAD_SCALE)
- return capacity;
+ if (usage >= capacity_orig)
+ return capacity_orig;
- return (usage * capacity) >> SCHED_LOAD_SHIFT;
+ return usage;
}
/*
--
1.9.1
Introduces blocked utilization, the utilization counterpart to
cfs_rq->blocked_load_avg. It is the sum of sched_entity utilization
contributions of entities that were recently on the cfs_rq and are
currently blocked. Combined with the sum of utilization of entities
currently on the cfs_rq or currently running
(cfs_rq->utilization_load_avg) this provides a more stable average
view of the cpu usage.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 30 +++++++++++++++++++++++++++++-
kernel/sched/sched.h | 8 ++++++--
2 files changed, 35 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8420444..ad07398 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2803,6 +2803,15 @@ static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
cfs_rq->blocked_load_avg = 0;
}
+static inline void subtract_utilization_blocked_contrib(struct cfs_rq *cfs_rq,
+ long utilization_contrib)
+{
+ if (likely(utilization_contrib < cfs_rq->utilization_blocked_avg))
+ cfs_rq->utilization_blocked_avg -= utilization_contrib;
+ else
+ cfs_rq->utilization_blocked_avg = 0;
+}
+
static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
/* Update a sched_entity's runnable average */
@@ -2838,6 +2847,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
cfs_rq->utilization_load_avg += utilization_delta;
} else {
subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ -utilization_delta);
}
}
@@ -2855,14 +2866,20 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
return;
if (atomic_long_read(&cfs_rq->removed_load)) {
- unsigned long removed_load;
+ unsigned long removed_load, removed_utilization;
removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
+ removed_utilization =
+ atomic_long_xchg(&cfs_rq->removed_utilization, 0);
subtract_blocked_load_contrib(cfs_rq, removed_load);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ removed_utilization);
}
if (decays) {
cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
decays);
+ cfs_rq->utilization_blocked_avg =
+ decay_load(cfs_rq->utilization_blocked_avg, decays);
atomic64_add(decays, &cfs_rq->decay_counter);
cfs_rq->last_decay = now;
}
@@ -2909,6 +2926,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
/* migrated tasks did not contribute to our blocked load */
if (wakeup) {
subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ se->avg.utilization_avg_contrib);
update_entity_load_avg(se, 0);
}
@@ -2935,6 +2954,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
if (sleep) {
cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+ cfs_rq->utilization_blocked_avg +=
+ se->avg.utilization_avg_contrib;
se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
} /* migrations, e.g. sleep=0 leave decay_count == 0 */
}
@@ -5147,6 +5168,8 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
se->avg.decay_count = -__synchronize_entity_decay(se);
atomic_long_add(se->avg.load_avg_contrib,
&cfs_rq->removed_load);
+ atomic_long_add(se->avg.utilization_avg_contrib,
+ &cfs_rq->removed_utilization);
}
/* We have migrated, no longer consider this task hot */
@@ -8129,6 +8152,8 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
if (se->avg.decay_count) {
__synchronize_entity_decay(se);
subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ se->avg.utilization_avg_contrib);
}
#endif
}
@@ -8188,6 +8213,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#ifdef CONFIG_SMP
atomic64_set(&cfs_rq->decay_counter, 1);
atomic_long_set(&cfs_rq->removed_load, 0);
+ atomic_long_set(&cfs_rq->removed_utilization, 0);
#endif
}
@@ -8240,6 +8266,8 @@ static void task_move_group_fair(struct task_struct *p, int queued)
*/
se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+ cfs_rq->utilization_blocked_avg +=
+ se->avg.utilization_avg_contrib;
#endif
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3193025..1070692 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -375,11 +375,15 @@ struct cfs_rq {
* the blocked sched_entities on the rq.
* utilization_load_avg is the sum of the average running time of the
* sched_entities on the rq.
+ * utilization_blocked_avg is the utilization equivalent of
+ * blocked_load_avg, i.e. the sum of running contributions of blocked
+ * sched_entities associated with the rq.
*/
- unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
+ unsigned long runnable_load_avg, blocked_load_avg;
+ unsigned long utilization_load_avg, utilization_blocked_avg;
atomic64_t decay_counter;
u64 last_decay;
- atomic_long_t removed_load;
+ atomic_long_t removed_load, removed_utilization;
#ifdef CONFIG_FAIR_GROUP_SCHED
/* Required to track per-cpu representation of a task_group */
--
1.9.1
Add the blocked utilization contribution to group sched_entity
utilization (se->avg.utilization_avg_contrib) and to get_cpu_usage().
With this change cpu usage now includes recent usage by currently
non-runnable tasks, hence it provides a more stable view of the cpu
usage. It does, however, also mean that the meaning of usage is changed:
A cpu may be momentarily idle while usage is >0. It can no longer be
assumed that cpu usage >0 implies runnable tasks on the rq.
cfs_rq->utilization_load_avg or nr_running should be used instead to get
the current rq status.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ad07398..e40cd88 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2789,7 +2789,8 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
__update_task_entity_utilization(se);
else
se->avg.utilization_avg_contrib =
- group_cfs_rq(se)->utilization_load_avg;
+ group_cfs_rq(se)->utilization_load_avg +
+ group_cfs_rq(se)->utilization_blocked_avg;
return se->avg.utilization_avg_contrib - old_contrib;
}
@@ -5046,13 +5047,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
static int get_cpu_usage(int cpu)
{
+ int sum;
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+ unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
unsigned long capacity_orig = capacity_orig_of(cpu);
- if (usage >= capacity_orig)
+ sum = usage + blocked;
+
+ if (sum >= capacity_orig)
return capacity_orig;
- return usage;
+ return sum;
}
/*
--
1.9.1
Tasks being dequeued for the last time (state == TASK_DEAD) are dequeued
with the DEQUEUE_SLEEP flag, which causes their load and utilization
contributions to be added to the runqueue blocked load and utilization.
Hence the blocked signals will contain load and utilization that has gone
away. The issue only exists for the root cfs_rq, as cgroup_exit() doesn't
set DEQUEUE_SLEEP for task group exits.
If runnable+blocked load is to be used as a better estimate for cpu load,
the dead task contributions need to be removed to prevent load_balance()
(idle_balance() in particular) from over-estimating the cpu load.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e40cd88..d045404 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3202,6 +3202,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
+ if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
+ flags &= ~DEQUEUE_SLEEP;
dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
update_stats_dequeue(cfs_rq, se);
--
1.9.1
Task load or usage is not currently considered in select_task_rq_fair(),
but if we want to use it in the future we should make sure it is not zero
for new tasks.
The load-tracking sums are currently initialized using sched_slice(),
which won't work before the task has been assigned a rq. Initialization
is therefore changed to another semi-arbitrary value, sched_latency,
instead.
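As a ballpark figure, with the default 6 ms sysctl_sched_latency (before
any boot-time scaling for the number of cpus) the initial value becomes
6000000 >> 10 = 5859, i.e. the new task starts out as if it had been
runnable/running for roughly 6 ms.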
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/core.c | 4 ++--
kernel/sched/fair.c | 7 +++----
2 files changed, 5 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 355f953..bceb3a8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2126,6 +2126,8 @@ void wake_up_new_task(struct task_struct *p)
struct rq *rq;
raw_spin_lock_irqsave(&p->pi_lock, flags);
+ /* Initialize new task's runnable average */
+ init_task_runnable_average(p);
#ifdef CONFIG_SMP
/*
* Fork balancing, do it here and not earlier because:
@@ -2135,8 +2137,6 @@ void wake_up_new_task(struct task_struct *p)
set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
- /* Initialize new task's runnable average */
- init_task_runnable_average(p);
rq = __task_rq_lock(p);
activate_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_QUEUED;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d045404..f20fae9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -675,11 +675,10 @@ static inline void __update_task_entity_utilization(struct sched_entity *se);
/* Give new task start runnable values to heavy its load in infant time */
void init_task_runnable_average(struct task_struct *p)
{
- u32 slice;
+ u32 start_load = sysctl_sched_latency >> 10;
- slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
- p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice;
- p->se.avg.avg_period = slice;
+ p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = start_load;
+ p->se.avg.avg_period = start_load;
__update_task_entity_contrib(&p->se);
__update_task_entity_utilization(&p->se);
}
--
1.9.1
This documentation patch provides an overview of the experimental
scheduler energy costing model, associated data structures, and a
reference recipe on how platforms can be characterized to derive energy
models.
Signed-off-by: Morten Rasmussen <[email protected]>
---
Documentation/scheduler/sched-energy.txt | 363 +++++++++++++++++++++++++++++++
1 file changed, 363 insertions(+)
create mode 100644 Documentation/scheduler/sched-energy.txt
diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
new file mode 100644
index 0000000..f2a4c19
--- /dev/null
+++ b/Documentation/scheduler/sched-energy.txt
@@ -0,0 +1,363 @@
+Energy cost model for energy-aware scheduling (EXPERIMENTAL)
+
+Introduction
+=============
+
+The basic energy model uses platform energy data stored in sched_group_energy
+data structures attached to the sched_groups in the sched_domain hierarchy. The
+energy cost model offers two functions that can be used to guide scheduling
+decisions:
+
+1. static unsigned int sched_group_energy(struct energy_env *eenv)
+2. static int energy_diff(struct energy_env *eenv)
+
+sched_group_energy() estimates the energy consumed by all cpus in a specific
+sched_group including any shared resources owned exclusively by this group of
+cpus. Resources shared with other cpus are excluded (e.g. later level caches).
+
+energy_diff() estimates the total energy impact of a utilization change. That
+is, adding, removing, or migrating utilization (tasks).
+
+Both functions use a struct energy_env to specify the scenario to be evaluated:
+
+ struct energy_env {
+ struct sched_group *sg_top;
+ struct sched_group *sg_cap;
+ int cap_idx;
+ int usage_delta;
+ int src_cpu;
+ int dst_cpu;
+ int energy;
+ };
+
+sg_top: sched_group to be evaluated. Not used by energy_diff().
+
+sg_cap: sched_group covering the cpus in the same frequency domain. Set by
+sched_group_energy().
+
+cap_idx: Capacity state to be used for energy calculations. Set by
+find_new_capacity().
+
+usage_delta: Amount of utilization to be added, removed, or migrated.
+
+src_cpu: Source cpu from where 'usage_delta' utilization is removed. Should be
+-1 if no source (e.g. task wake-up).
+
+dst_cpu: Destination cpu where 'usage_delta' utilization is added. Should be -1
+if utilization is removed (e.g. terminating tasks).
+
+energy: Result of sched_group_energy().
+
+The metric used to represent utilization is the actual per-entity running time
+averaged over time using a geometric series. Very similar to the existing
+per-entity load-tracking, but _not_ scaled by task priority and capped by the
+capacity of the cpu. The latter property does mean that utilization may
+underestimate the compute requirements for tasks on fully/over-utilized cpus.
+The greatest potential for energy savings without affecting performance too
+much is in scenarios where the system isn't fully utilized. If the system is
+deemed fully utilized, load-balancing should be done with task load (which
+includes task priority) instead, in the interest of fairness and performance.
+
+
+Background and Terminology
+===========================
+
+To make it clear from the start:
+
+energy = [joule] (resource like a battery on powered devices)
+power = energy/time = [joule/second] = [watt]
+
+The goal of energy-aware scheduling is to minimize energy, while still getting
+the job done. That is, we want to maximize:
+
+ performance [inst/s]
+ --------------------
+ power [W]
+
+which is equivalent to minimizing:
+
+ energy [J]
+ -----------
+ instruction
+
+while still getting 'good' performance. It is essentially an alternative
+optimization objective to the current performance-only objective for the
+scheduler. This alternative considers two objectives: energy-efficiency and
+performance. Hence, there needs to be a user controllable knob to switch the
+objective. Since it is early days, this is currently a sched_feature
+(ENERGY_AWARE).
+
+The idea behind introducing an energy cost model is to allow the scheduler to
+evaluate the implications of its decisions rather than applying energy-saving
+techniques blindly that may only have positive effects on some platforms. At
+the same time, the energy cost model must be as simple as possible to minimize
+the scheduler latency impact.
+
+Platform topology
+------------------
+
+The system topology (cpus, caches, and NUMA information, not peripherals) is
+represented in the scheduler by the sched_domain hierarchy which has
+sched_groups attached at each level that covers one or more cpus (see
+sched-domains.txt for more details). To add energy awareness to the scheduler
+we need to consider power and frequency domains.
+
+Power domain:
+
+A power domain is a part of the system that can be powered on/off
+independently. Power domains are typically organized in a hierarchy where you
+may be able to power down just a cpu or a group of cpus along with any
+associated resources (e.g. shared caches). Powering up a cpu means that all
+power domains it is a part of in the hierarchy must be powered up. Hence, it is
+more expensive to power up the first cpu that belongs to a higher level power
+domain than powering up additional cpus in the same high level domain. Two
+level power domain hierarchy example:
+
+ Power source
+ +-------------------------------+----...
+per group PD G G
+ | +----------+ |
+ +--------+-------| Shared | (other groups)
+per-cpu PD G G | resource |
+ | | +----------+
+ +-------+ +-------+
+ | CPU 0 | | CPU 1 |
+ +-------+ +-------+
+
+Frequency domain:
+
+Frequency domains (P-states) typically cover the same group of cpus as one of
+the power domain levels. That is, there might be several smaller power domains
+sharing the same frequency (P-state) or there might be a power domain spanning
+multiple frequency domains.
+
+From a scheduling point of view there is no need to know the actual frequencies
+[Hz]. All the scheduler cares about is the compute capacity available at the
+current state (P-state) the cpu is in and any other available states. For that
+reason, and to also factor in any cpu micro-architecture differences, compute
+capacity scaling states are called 'capacity states' in this document. For SMP
+systems this is equivalent to P-states. For mixed micro-architecture systems
+(like ARM big.LITTLE) it is P-states scaled according to the micro-architecture
+performance relative to the other cpus in the system.
+
+Energy modelling:
+------------------
+
+Due to the hierarchical nature of the power domains, the most obvious way to
+model energy costs is therefore to associate power and energy costs with
+domains (groups of cpus). Energy costs of shared resources are associated with
+the group of cpus that share the resources; only the cost of powering the
+cpu itself and any private resources (e.g. private L1 caches) is associated
+with the per-cpu groups (lowest level).
+
+For example, for an SMP system with per-cpu power domains and a cluster level
+(group of cpus) power domain we get the overall energy costs to be:
+
+ energy = energy_cluster + n * energy_cpu
+
+where 'n' is the number of cpus powered up and energy_cluster is the cost paid
+as soon as any cpu in the cluster is powered up.
+
+The power and frequency domains can naturally be mapped onto the existing
+sched_domain hierarchy and sched_groups by adding the necessary data to the
+existing data structures.
+
+The energy model considers energy consumption from two contributors (shown in
+the illustration below):
+
+1. Busy energy: Energy consumed while a cpu and the higher level groups that it
+belongs to are busy running tasks. Busy energy is associated with the state of
+the cpu, not an event. The time the cpu spends in this state varies. Thus, the
+most obvious platform parameter for this contribution is busy power
+(energy/time).
+
+2. Idle energy: Energy consumed while a cpu and higher level groups that it
+belongs to are idle (in a C-state). Like busy energy, idle energy is associated
+with the state of the cpu. Thus, the platform parameter for this contribution
+is idle power (energy/time).
+
+Energy consumed during transitions from an idle-state (C-state) to a busy state
+(P-state) or going the other way is ignored by the model to simplify the
+model calculations.
+
+
+ Power
+ ^
+ | busy->idle idle->busy
+ | transition transition
+ |
+ | _ __
+ | / \ / \__________________
+ |______________/ \ /
+ | \ /
+ | Busy \ Idle / Busy
+ | low P-state \____________/ high P-state
+ |
+ +------------------------------------------------------------> time
+
+Busy |--------------| |-----------------|
+
+Wakeup |------| |------|
+
+Idle |------------|
+
+
+The basic algorithm
+====================
+
+The basic idea is to determine the total energy impact when utilization is
+added or removed by estimating the impact at each level in the sched_domain
+hierarchy starting from the bottom (sched_group contains just a single cpu).
+The energy cost comes from busy time (sched_group is awake because one or more
+cpus are busy) and idle time (in an idle-state). Energy model numbers account
+for energy costs associated with all cpus in the sched_group as a group.
+
+ for_each_domain(cpu, sd) {
+ sg = sched_group_of(cpu)
+ energy_before = curr_util(sg) * busy_power(sg)
+ + (1-curr_util(sg)) * idle_power(sg)
+ energy_after = new_util(sg) * busy_power(sg)
+ + (1-new_util(sg)) * idle_power(sg)
+ energy_diff += energy_before - energy_after
+
+ }
+
+ return energy_diff
+
+{curr, new}_util: The cpu utilization at the lowest level and the overall
+non-idle time for the entire group for higher levels. Utilization is in the
+range 0.0 to 1.0 in the pseudo-code.
+
+busy_power: The power consumption of the sched_group.
+
+idle_power: The power consumption of the sched_group when idle.
+
+Note: It is a fundamental assumption that the utilization is (roughly) scale
+invariant. Task utilization tracking factors in any frequency scaling and
+performance scaling differences due to different cpu micro-architectures such
+that task utilization can be used across the entire system.
+
+
+Platform energy data
+=====================
+
+struct sched_group_energy can be attached to sched_groups in the sched_domain
+hierarchy and has the following members:
+
+cap_states:
+ List of struct capacity_state representing the supported capacity states
+ (P-states). struct capacity_state has two members: cap and power, which
+ represents the compute capacity and the busy_power of the state. The
+ list must be ordered by capacity low->high.
+
+nr_cap_states:
+ Number of capacity states in cap_states list.
+
+idle_states:
+ List of struct idle_state containing the idle power cost for each
+ idle-state supported by the sched_group. Note that the energy model
+ calculations will use this table to determine idle power even if no idle
+ state is actually entered by cpuidle. That is, if latency constraints
+ prevent the group from entering a coupled state or no idle-states are
+ supported. Hence, the first entry of the list must be the power when the
+ group is idle but no idle state was actually entered ('active idle').
+ This state may be left out for groups with one cpu if the cpu is
+ guaranteed to enter the state when idle.
+
+nr_idle_states:
+ Number of idle states in idle_states list.
+
+nr_idle_states_below:
+ Number of idle-states below current level. Filled by generic code, not
+ to be provided by the platform.
+
+There are no unit requirements for the energy cost data. Data can be normalized
+to any reference, however, the normalization must be consistent across all
+energy cost data. That is, one bogo-joule/watt must be the same quantity for
+all data, but we don't care what it is.
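+
+As a purely illustrative example (made-up numbers, not real platform data), a
+group with two capacity states and two idle-states could be described as:
+
+  static struct capacity_state example_cap_states[] = {
+          { .cap =  512, .power = 100, },  /* low P-state */
+          { .cap = 1024, .power = 300, },  /* high P-state */
+  };
+
+  static struct idle_state example_idle_states[] = {
+          { .power = 10, },  /* active idle / shallowest state */
+          { .power =  2, },  /* deepest state */
+  };
+
+  static struct sched_group_energy example_energy = {
+          .nr_cap_states  = ARRAY_SIZE(example_cap_states),
+          .cap_states     = example_cap_states,
+          .nr_idle_states = ARRAY_SIZE(example_idle_states),
+          .idle_states    = example_idle_states,
+  };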
+
+A recipe for platform characterization
+=======================================
+
+Obtaining the actual model data for a particular platform requires some way of
+measuring power/energy. There isn't a tool to help with this (yet). This
+section provides a recipe for use as reference. It covers the steps used to
+characterize the ARM TC2 development platform. This sort of measurement is
+expected to be done anyway when tuning cpuidle and cpufreq for a given
+platform.
+
+The energy model needs two types of data (struct sched_group_energy holds
+these) for each sched_group where energy costs should be taken into account:
+
+1. Capacity state information
+
+A list containing, for each available capacity state, the compute capacity and
+the power consumption when fully utilized, attributed to the group as a whole.
+At the lowest level (group contains just a single cpu) this is the power of the
+cpu alone, not including power consumed by resources shared with other cpus.
+It needs to fit the basic modelling approach described in the "Background
+and Terminology" section:
+
+ energy_system = energy_shared + n * energy_cpu
+
+for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at
+the lowest level. 'energy_shared' is included at the next level which
+represents the group of cpus among which the resources are shared.
+
+This model is, of course, a simplification of reality. Thus, power/energy
+attributions might not always exactly represent how the hardware is designed.
+Also, busy power is likely to depend on the workload. It is therefore
+recommended to use a representative mix of workloads when characterizing the
+capacity states.
+
+If the group has no capacity scaling support, the list will contain a single
+state where power is the busy power attributed to the group. The capacity
+should be set to a default value (1024).
+
+When frequency domains include multiple power domains, the group representing
+the frequency domain and all child groups share capacity states. This must be
+indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at
+all levels that share capacity states must have the list of capacity states
+with the power set to the contribution of the individual group.
+
+2. Idle power information
+
+Stored in the idle_states list. The power number is the group idle power
+consumption in each idle state, as well as when the group is idle but has not
+entered an idle-state ('active idle' as mentioned earlier). Due to the way the
+energy model is defined, the idle power of the deepest group idle state can
+alternatively be accounted for in the parent group busy power. In that case the
+group idle state power values are offset such that the idle power of the
+deepest state is zero. This is less intuitive, but easier to measure, as the
+idle power consumed by the group and the busy/idle power of the parent group
+cannot be distinguished without per-group measurement points.
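+
+As a hypothetical example of the offsetting (made-up numbers): if the measured
+group idle power is 40 in the shallowest state and 12 in the deepest state,
+the offset values become
+
+  shallowest state: 40 - 12 = 28
+  deepest state:    12 - 12 =  0
+
+with the remaining 12 accounted for in the parent group power instead.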
+
+Measuring capacity states and idle power:
+
+The capacity states' capacity and power can be estimated by running a benchmark
+workload at each available capacity state. By restricting the benchmark to run
+on subsets of cpus it is possible to extrapolate the power consumption of
+shared resources.
+
+ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a
+shared L2 cache. TC2 has on-chip energy counters per cluster. Running a
+benchmark workload on just one cpu in a cluster means that power is consumed in
+the cluster (higher level group) and a single cpu (lowest level group). Adding
+another benchmark task to another cpu increases the power consumption by the
+amount consumed by the additional cpu. Hence, it is possible to extrapolate the
+cluster busy power.
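+
+A hypothetical worked example (made-up numbers, bogo-watts): if the cluster
+energy counter reports a power of 1100 with one cpu busy and 1500 with two
+cpus busy at the same P-state, then
+
+  energy_cpu    = 1500 - 1100 = 400
+  energy_shared = 1100 - 400  = 700
+
+which is consistent with 'energy_system = energy_shared + n * energy_cpu' for
+n = 1 and n = 2.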
+
+For platforms that don't have energy counters or equivalent instrumentation
+built-in, it may be possible to use an external DAQ to acquire similar data.
+
+If the benchmark includes some performance score (for example, the sysbench cpu
+benchmark), this can be used to record the compute capacity.
+
+Measuring idle power requires insight into the idle state implementation on the
+particular platform, specifically whether the platform has coupled idle-states
+(or package states). To measure non-coupled per-cpu idle-states it is necessary
+to keep one cpu busy so that shared resources stay alive, which isolates the
+idle power of the cpu from the idle/busy power of the shared resources. The cpu
+can be forced into a particular per-cpu idle state by disabling the other
+states. Based on various combinations of measurements with specific cpus busy
+and specific idle-states disabled it is possible to extrapolate the idle-state
+power.
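+
+A hypothetical example (made-up numbers): with cpu0 kept busy at a known
+P-state and cpu1 placed in WFI, the cluster counter might report 820. If the
+busy power of cpu0 plus the cluster busy power is known to be 800 from the
+capacity state measurements, the per-cpu WFI power can be estimated as
+
+  idle_power(WFI) = 820 - 800 = 20
+
+Repeating this with other idle-states disabled gives the remaining entries of
+the idle_states list.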
--
1.9.1
This patch introduces the ENERGY_AWARE sched feature, which is
implemented using jump labels when SCHED_DEBUG is defined. It is
statically set to false when SCHED_DEBUG is not defined, so energy
awareness cannot be enabled without SCHED_DEBUG. This sched_feature
knob will be replaced later with a more appropriate control knob when
things have matured a bit.
ENERGY_AWARE is based on per-entity load-tracking hence FAIR_GROUP_SCHED
must be enabled. This dependency isn't checked at compile time yet.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 6 ++++++
kernel/sched/features.h | 6 ++++++
2 files changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f20fae9..5980bdd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4766,6 +4766,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
return wl;
}
+
#else
static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
@@ -4775,6 +4776,11 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
#endif
+static inline bool energy_aware(void)
+{
+ return sched_feat(ENERGY_AWARE);
+}
+
static int wake_wide(struct task_struct *p)
{
int factor = this_cpu_read(sd_llc_size);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 03d8072..92bc36e 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -91,3 +91,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
*/
SCHED_FEAT(NUMA_RESIST_LOWER, false)
#endif
+
+/*
+ * Energy aware scheduling. Use platform energy model to guide scheduling
+ * decisions optimizing for energy efficiency.
+ */
+SCHED_FEAT(ENERGY_AWARE, false)
--
1.9.1
From: Dietmar Eggemann <[email protected]>
The struct sched_group_energy represents the per sched_group related
data which is needed for energy aware scheduling. It contains:
(1) atomic reference counter for scheduler internal bookkeeping of
data allocation and freeing
(2) number of elements of the idle state array
(3) pointer to the idle state array which comprises 'power consumption'
for each idle state
(4) number of elements of the capacity state array
(5) pointer to the capacity state array which comprises 'compute
capacity and power consumption' tuples for each capacity state
Allocation and freeing of struct sched_group_energy utilizes the existing
infrastructure of the scheduler which is currently used for the other sd
hierarchy data structures (e.g. struct sched_domain) as well. That's why
struct sd_data is provisioned with a per cpu struct sched_group_energy
double pointer.
The struct sched_group obtains a pointer to a struct sched_group_energy.
The function pointer sched_domain_energy_f is introduced into struct
sched_domain_topology_level which will allow the arch to pass a particular
struct sched_group_energy from the topology shim layer into the scheduler
core.
The function pointer sched_domain_energy_f has an 'int cpu' parameter
since the folding of two adjacent sd levels via sd degenerate doesn't work
for all sd levels. For example, it is not possible to use this feature
to provide per-cpu energy at sd level DIE on ARM's TC2 platform.
It was discussed that the folding of sd levels approach is preferable
over the cpu parameter approach, simply because the user (the arch
specifying the sd topology table) can introduce fewer errors. But since
that approach does not work, the 'int cpu' parameter is the only way out. It's
possible to use the folding of sd levels approach for
sched_domain_flags_f and the cpu parameter approach for the
sched_domain_energy_f at the same time though. With the use of the
'int cpu' parameter, an extra check function has to be provided to make
sure that all cpus spanned by a sched group are provisioned with the same
energy data.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
include/linux/sched.h | 20 ++++++++++++++++++++
kernel/sched/sched.h | 1 +
2 files changed, 21 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0eceeec..cf79eef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1013,6 +1013,23 @@ struct sched_domain_attr {
extern int sched_domain_level_max;
+struct capacity_state {
+ unsigned long cap; /* compute capacity */
+ unsigned long power; /* power consumption at this compute capacity */
+};
+
+struct idle_state {
+ unsigned long power; /* power consumption in this idle state */
+};
+
+struct sched_group_energy {
+ atomic_t ref;
+ unsigned int nr_idle_states; /* number of idle states */
+ struct idle_state *idle_states; /* ptr to idle state array */
+ unsigned int nr_cap_states; /* number of capacity states */
+ struct capacity_state *cap_states; /* ptr to capacity state array */
+};
+
struct sched_group;
struct sched_domain {
@@ -1111,6 +1128,7 @@ bool cpus_share_cache(int this_cpu, int that_cpu);
typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
typedef int (*sched_domain_flags_f)(void);
+typedef const struct sched_group_energy *(*sched_domain_energy_f)(int cpu);
#define SDTL_OVERLAP 0x01
@@ -1118,11 +1136,13 @@ struct sd_data {
struct sched_domain **__percpu sd;
struct sched_group **__percpu sg;
struct sched_group_capacity **__percpu sgc;
+ struct sched_group_energy **__percpu sge;
};
struct sched_domain_topology_level {
sched_domain_mask_f mask;
sched_domain_flags_f sd_flags;
+ sched_domain_energy_f energy;
int flags;
int numa_level;
struct sd_data data;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1070692..df20e23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -853,6 +853,7 @@ struct sched_group {
unsigned int group_weight;
struct sched_group_capacity *sgc;
+ struct sched_group_energy *sge;
/*
* The CPUs this group covers.
--
1.9.1
From: Dietmar Eggemann <[email protected]>
The per sched group sched_group_energy structure plus the related
idle_state and capacity_state arrays are allocated like the other sched
domain (sd) hierarchy data structures. This includes the freeing of
sched_group_energy structures which are not used.
Energy-aware scheduling allows a system to have energy model data only
up to a certain sd level (the so-called highest energy-aware balancing sd
level). A check in init_sched_energy() enforces that all sd's below this
sd level contain energy model data.
One problem is that the number of elements of the idle_state and the
capacity_state arrays is not fixed and has to be retrieved in
__sdt_alloc() to allocate memory for the sched_group_energy structure and
the two arrays in one chunk. The array pointers (idle_states and
cap_states) are initialized here to point to the correct place inside the
memory chunk.
The new function init_sched_energy() initializes the sched_group_energy
structure and the two arrays in case the sd topology level contains energy
information.
This patch has been tested with scheduler feature flag FORCE_SD_OVERLAP
enabled as well.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/core.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 33 +++++++++++++++++++
2 files changed, 124 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bceb3a8..ab42515 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5723,6 +5723,9 @@ static void free_sched_groups(struct sched_group *sg, int free_sgc)
if (free_sgc && atomic_dec_and_test(&sg->sgc->ref))
kfree(sg->sgc);
+ if (free_sgc && atomic_dec_and_test(&sg->sge->ref))
+ kfree(sg->sge);
+
kfree(sg);
sg = tmp;
} while (sg != first);
@@ -5740,6 +5743,7 @@ static void free_sched_domain(struct rcu_head *rcu)
free_sched_groups(sd->groups, 1);
} else if (atomic_dec_and_test(&sd->groups->ref)) {
kfree(sd->groups->sgc);
+ kfree(sd->groups->sge);
kfree(sd->groups);
}
kfree(sd);
@@ -5957,6 +5961,8 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
*/
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
+ sg->sge = *per_cpu_ptr(sdd->sge, i);
+
/*
* Make sure the first group of this domain contains the
* canonical balance cpu. Otherwise the sched_domain iteration
@@ -5995,6 +6001,7 @@ static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
*sg = *per_cpu_ptr(sdd->sg, cpu);
(*sg)->sgc = *per_cpu_ptr(sdd->sgc, cpu);
atomic_set(&(*sg)->sgc->ref, 1); /* for claim_allocations */
+ (*sg)->sge = *per_cpu_ptr(sdd->sge, cpu);
}
return cpu;
@@ -6084,6 +6091,45 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
atomic_set(&sg->sgc->nr_busy_cpus, sg->group_weight);
}
+static void init_sched_energy(int cpu, struct sched_domain *sd,
+ struct sched_domain_topology_level *tl)
+{
+ struct sched_group *sg = sd->groups;
+ struct sched_group_energy *energy = sg->sge;
+ sched_domain_energy_f fn = tl->energy;
+ struct cpumask *mask = sched_group_cpus(sg);
+
+ if (fn && sd->child && !sd->child->groups->sge) {
+ pr_err("BUG: EAS setup broken for CPU%d\n", cpu);
+#ifdef CONFIG_SCHED_DEBUG
+ pr_err(" energy data on %s but not on %s domain\n",
+ sd->name, sd->child->name);
+#endif
+ return;
+ }
+
+ if (cpu != group_balance_cpu(sg))
+ return;
+
+ if (!fn || !fn(cpu)) {
+ sg->sge = NULL;
+ return;
+ }
+
+ atomic_set(&sg->sge->ref, 1); /* for claim_allocations */
+
+
+ if (cpumask_weight(mask) > 1)
+ check_sched_energy_data(cpu, fn, mask);
+
+ energy->nr_idle_states = fn(cpu)->nr_idle_states;
+ memcpy(energy->idle_states, fn(cpu)->idle_states,
+ energy->nr_idle_states*sizeof(struct idle_state));
+ energy->nr_cap_states = fn(cpu)->nr_cap_states;
+ memcpy(energy->cap_states, fn(cpu)->cap_states,
+ energy->nr_cap_states*sizeof(struct capacity_state));
+}
+
/*
* Initializers for schedule domains
* Non-inlined to reduce accumulated stack pressure in build_sched_domains()
@@ -6174,6 +6220,9 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
*per_cpu_ptr(sdd->sgc, cpu) = NULL;
+
+ if (atomic_read(&(*per_cpu_ptr(sdd->sge, cpu))->ref))
+ *per_cpu_ptr(sdd->sge, cpu) = NULL;
}
#ifdef CONFIG_NUMA
@@ -6639,10 +6688,24 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
if (!sdd->sgc)
return -ENOMEM;
+ sdd->sge = alloc_percpu(struct sched_group_energy *);
+ if (!sdd->sge)
+ return -ENOMEM;
+
for_each_cpu(j, cpu_map) {
struct sched_domain *sd;
struct sched_group *sg;
struct sched_group_capacity *sgc;
+ struct sched_group_energy *sge;
+ sched_domain_energy_f fn = tl->energy;
+ unsigned int nr_idle_states = 0;
+ unsigned int nr_cap_states = 0;
+
+ if (fn && fn(j)) {
+ nr_idle_states = fn(j)->nr_idle_states;
+ nr_cap_states = fn(j)->nr_cap_states;
+ BUG_ON(!nr_idle_states || !nr_cap_states);
+ }
sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
@@ -6666,6 +6729,26 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
return -ENOMEM;
*per_cpu_ptr(sdd->sgc, j) = sgc;
+
+ sge = kzalloc_node(sizeof(struct sched_group_energy) +
+ nr_idle_states*sizeof(struct idle_state) +
+ nr_cap_states*sizeof(struct capacity_state),
+ GFP_KERNEL, cpu_to_node(j));
+
+ if (!sge)
+ return -ENOMEM;
+
+ sge->idle_states = (struct idle_state *)
+ ((void *)&sge->cap_states +
+ sizeof(sge->cap_states));
+
+ sge->cap_states = (struct capacity_state *)
+ ((void *)&sge->cap_states +
+ sizeof(sge->cap_states) +
+ nr_idle_states*
+ sizeof(struct idle_state));
+
+ *per_cpu_ptr(sdd->sge, j) = sge;
}
}
@@ -6694,6 +6777,8 @@ static void __sdt_free(const struct cpumask *cpu_map)
kfree(*per_cpu_ptr(sdd->sg, j));
if (sdd->sgc)
kfree(*per_cpu_ptr(sdd->sgc, j));
+ if (sdd->sge)
+ kfree(*per_cpu_ptr(sdd->sge, j));
}
free_percpu(sdd->sd);
sdd->sd = NULL;
@@ -6701,6 +6786,8 @@ static void __sdt_free(const struct cpumask *cpu_map)
sdd->sg = NULL;
free_percpu(sdd->sgc);
sdd->sgc = NULL;
+ free_percpu(sdd->sge);
+ sdd->sge = NULL;
}
}
@@ -6786,10 +6873,13 @@ static int build_sched_domains(const struct cpumask *cpu_map,
/* Calculate CPU capacity for physical packages and nodes */
for (i = nr_cpumask_bits-1; i >= 0; i--) {
+ struct sched_domain_topology_level *tl = sched_domain_topology;
+
if (!cpumask_test_cpu(i, cpu_map))
continue;
- for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+ for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent, tl++) {
+ init_sched_energy(i, sd, tl);
claim_allocations(i, sd);
init_sched_groups_capacity(i, sd);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index df20e23..ea32e9e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -890,6 +890,39 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
extern int group_balance_cpu(struct sched_group *sg);
+/*
+ * Check that the per-cpu provided sd energy data is consistent for all cpus
+ * within the mask.
+ */
+static inline void check_sched_energy_data(int cpu, sched_domain_energy_f fn,
+ const struct cpumask *cpumask)
+{
+ struct cpumask mask;
+ int i;
+
+ cpumask_xor(&mask, cpumask, get_cpu_mask(cpu));
+
+ for_each_cpu(i, &mask) {
+ int y;
+
+ BUG_ON(fn(i)->nr_idle_states != fn(cpu)->nr_idle_states);
+
+ for (y = 0; y < (fn(i)->nr_idle_states); y++) {
+ BUG_ON(fn(i)->idle_states[y].power !=
+ fn(cpu)->idle_states[y].power);
+ }
+
+ BUG_ON(fn(i)->nr_cap_states != fn(cpu)->nr_cap_states);
+
+ for (y = 0; y < (fn(i)->nr_cap_states); y++) {
+ BUG_ON(fn(i)->cap_states[y].cap !=
+ fn(cpu)->cap_states[y].cap);
+ BUG_ON(fn(i)->cap_states[y].power !=
+ fn(cpu)->cap_states[y].power);
+ }
+ }
+}
+
#else
static inline void sched_ttwu_pending(void) { }
--
1.9.1
cpufreq is currently keeping it a secret which cpus are sharing a
clock source. The scheduler needs to know about clock domains as well
to become more energy aware. The SD_SHARE_CAP_STATES domain flag
indicates whether cpus belonging to the sched_domain share capacity
states (P-states).
There is no connection with cpufreq (yet). The flag must be set by
the arch specific topology code.
cc: Russell King <[email protected]>
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
arch/arm/kernel/topology.c | 3 ++-
include/linux/sched.h | 1 +
kernel/sched/core.c | 10 +++++++---
3 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 5867587..b35d3e5 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -276,7 +276,8 @@ void store_cpu_topology(unsigned int cpuid)
static inline int cpu_corepower_flags(void)
{
- return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
+ return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN | \
+ SD_SHARE_CAP_STATES;
}
static struct sched_domain_topology_level arm_topology[] = {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cf79eef..fe77e54 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -981,6 +981,7 @@ extern void wake_up_q(struct wake_q_head *head);
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
#define SD_NUMA 0x4000 /* cross-node balancing */
+#define SD_SHARE_CAP_STATES 0x8000 /* Domain members share capacity state */
#ifdef CONFIG_SCHED_SMT
static inline int cpu_smt_flags(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ab42515..5499b2c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5555,7 +5555,8 @@ static int sd_degenerate(struct sched_domain *sd)
SD_BALANCE_EXEC |
SD_SHARE_CPUCAPACITY |
SD_SHARE_PKG_RESOURCES |
- SD_SHARE_POWERDOMAIN)) {
+ SD_SHARE_POWERDOMAIN |
+ SD_SHARE_CAP_STATES)) {
if (sd->groups != sd->groups->next)
return 0;
}
@@ -5587,7 +5588,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
SD_SHARE_CPUCAPACITY |
SD_SHARE_PKG_RESOURCES |
SD_PREFER_SIBLING |
- SD_SHARE_POWERDOMAIN);
+ SD_SHARE_POWERDOMAIN |
+ SD_SHARE_CAP_STATES);
if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;
}
@@ -6241,6 +6243,7 @@ static int sched_domains_curr_level;
* SD_SHARE_PKG_RESOURCES - describes shared caches
* SD_NUMA - describes NUMA topologies
* SD_SHARE_POWERDOMAIN - describes shared power domain
+ * SD_SHARE_CAP_STATES - describes shared capacity states
*
* Odd one out:
* SD_ASYM_PACKING - describes SMT quirks
@@ -6250,7 +6253,8 @@ static int sched_domains_curr_level;
SD_SHARE_PKG_RESOURCES | \
SD_NUMA | \
SD_ASYM_PACKING | \
- SD_SHARE_POWERDOMAIN)
+ SD_SHARE_POWERDOMAIN | \
+ SD_SHARE_CAP_STATES)
static struct sched_domain *
sd_init(struct sched_domain_topology_level *tl, int cpu)
--
1.9.1
From: Dietmar Eggemann <[email protected]>
This patch is only here to be able to test provisioning of energy related
data from an arch topology shim layer to the scheduler. Since there is no
code today which deals with extracting energy related data from the dtb or
acpi and processing it in the topology shim layer, the contents of the
sched_group_energy structures as well as the idle_state and capacity_state
arrays are hard-coded here.
This patch defines the sched_group_energy structure as well as the
idle_state and capacity_state array for the cluster (relates to sched
groups (sgs) in DIE sched domain level) and for the core (relates to sgs
in MC sd level) for a Cortex A7 as well as for a Cortex A15.
It further provides related implementations of the sched_domain_energy_f
functions (cpu_cluster_energy() and cpu_core_energy()).
To be able to propagate this information from the topology shim layer to
the scheduler, the elements of the arm_topology[] table have been
provisioned with the appropriate sched_domain_energy_f functions.
cc: Russell King <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm/kernel/topology.c | 118 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 115 insertions(+), 3 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index b35d3e5..bbe20c7 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -274,6 +274,119 @@ void store_cpu_topology(unsigned int cpuid)
cpu_topology[cpuid].socket_id, mpidr);
}
+/*
+ * ARM TC2 specific energy cost model data. There are no unit requirements for
+ * the data. Data can be normalized to any reference point, but the
+ * normalization must be consistent. That is, one bogo-joule/watt must be the
+ * same quantity for all data, but we don't care what it is.
+ */
+static struct idle_state idle_states_cluster_a7[] = {
+ { .power = 25 }, /* WFI */
+ { .power = 10 }, /* cluster-sleep-l */
+ };
+
+static struct idle_state idle_states_cluster_a15[] = {
+ { .power = 70 }, /* WFI */
+ { .power = 25 }, /* cluster-sleep-b */
+ };
+
+static struct capacity_state cap_states_cluster_a7[] = {
+ /* Cluster only power */
+ { .cap = 150, .power = 2967, }, /* 350 MHz */
+ { .cap = 172, .power = 2792, }, /* 400 MHz */
+ { .cap = 215, .power = 2810, }, /* 500 MHz */
+ { .cap = 258, .power = 2815, }, /* 600 MHz */
+ { .cap = 301, .power = 2919, }, /* 700 MHz */
+ { .cap = 344, .power = 2847, }, /* 800 MHz */
+ { .cap = 387, .power = 3917, }, /* 900 MHz */
+ { .cap = 430, .power = 4905, }, /* 1000 MHz */
+ };
+
+static struct capacity_state cap_states_cluster_a15[] = {
+ /* Cluster only power */
+ { .cap = 426, .power = 7920, }, /* 500 MHz */
+ { .cap = 512, .power = 8165, }, /* 600 MHz */
+ { .cap = 597, .power = 8172, }, /* 700 MHz */
+ { .cap = 682, .power = 8195, }, /* 800 MHz */
+ { .cap = 768, .power = 8265, }, /* 900 MHz */
+ { .cap = 853, .power = 8446, }, /* 1000 MHz */
+ { .cap = 938, .power = 11426, }, /* 1100 MHz */
+ { .cap = 1024, .power = 15200, }, /* 1200 MHz */
+ };
+
+static struct sched_group_energy energy_cluster_a7 = {
+ .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a7),
+ .idle_states = idle_states_cluster_a7,
+ .nr_cap_states = ARRAY_SIZE(cap_states_cluster_a7),
+ .cap_states = cap_states_cluster_a7,
+};
+
+static struct sched_group_energy energy_cluster_a15 = {
+ .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a15),
+ .idle_states = idle_states_cluster_a15,
+ .nr_cap_states = ARRAY_SIZE(cap_states_cluster_a15),
+ .cap_states = cap_states_cluster_a15,
+};
+
+static struct idle_state idle_states_core_a7[] = {
+ { .power = 0 }, /* WFI */
+ };
+
+static struct idle_state idle_states_core_a15[] = {
+ { .power = 0 }, /* WFI */
+ };
+
+static struct capacity_state cap_states_core_a7[] = {
+ /* Power per cpu */
+ { .cap = 150, .power = 187, }, /* 350 MHz */
+ { .cap = 172, .power = 275, }, /* 400 MHz */
+ { .cap = 215, .power = 334, }, /* 500 MHz */
+ { .cap = 258, .power = 407, }, /* 600 MHz */
+ { .cap = 301, .power = 447, }, /* 700 MHz */
+ { .cap = 344, .power = 549, }, /* 800 MHz */
+ { .cap = 387, .power = 761, }, /* 900 MHz */
+ { .cap = 430, .power = 1024, }, /* 1000 MHz */
+ };
+
+static struct capacity_state cap_states_core_a15[] = {
+ /* Power per cpu */
+ { .cap = 426, .power = 2021, }, /* 500 MHz */
+ { .cap = 512, .power = 2312, }, /* 600 MHz */
+ { .cap = 597, .power = 2756, }, /* 700 MHz */
+ { .cap = 682, .power = 3125, }, /* 800 MHz */
+ { .cap = 768, .power = 3524, }, /* 900 MHz */
+ { .cap = 853, .power = 3846, }, /* 1000 MHz */
+ { .cap = 938, .power = 5177, }, /* 1100 MHz */
+ { .cap = 1024, .power = 6997, }, /* 1200 MHz */
+ };
+
+static struct sched_group_energy energy_core_a7 = {
+ .nr_idle_states = ARRAY_SIZE(idle_states_core_a7),
+ .idle_states = idle_states_core_a7,
+ .nr_cap_states = ARRAY_SIZE(cap_states_core_a7),
+ .cap_states = cap_states_core_a7,
+};
+
+static struct sched_group_energy energy_core_a15 = {
+ .nr_idle_states = ARRAY_SIZE(idle_states_core_a15),
+ .idle_states = idle_states_core_a15,
+ .nr_cap_states = ARRAY_SIZE(cap_states_core_a15),
+ .cap_states = cap_states_core_a15,
+};
+
+/* sd energy functions */
+static inline const struct sched_group_energy *cpu_cluster_energy(int cpu)
+{
+ return cpu_topology[cpu].socket_id ? &energy_cluster_a7 :
+ &energy_cluster_a15;
+}
+
+static inline const struct sched_group_energy *cpu_core_energy(int cpu)
+{
+ return cpu_topology[cpu].socket_id ? &energy_core_a7 :
+ &energy_core_a15;
+}
+
static inline int cpu_corepower_flags(void)
{
return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN | \
@@ -282,10 +395,9 @@ static inline int cpu_corepower_flags(void)
static struct sched_domain_topology_level arm_topology[] = {
#ifdef CONFIG_SCHED_MC
- { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
- { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+ { cpu_coregroup_mask, cpu_corepower_flags, cpu_core_energy, SD_INIT_NAME(MC) },
#endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
+ { cpu_cpu_mask, 0, cpu_cluster_energy, SD_INIT_NAME(DIE) },
{ NULL, },
};
--
1.9.1
capacity_orig_of() returns the max available compute capacity of a cpu.
For scale-invariant utilization tracking and energy-aware scheduling
decisions it is useful to know the compute capacity available at the
current OPP of a cpu.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5980bdd..4a8404a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4776,6 +4776,17 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
#endif
+/*
+ * Returns the current capacity of cpu after applying both
+ * cpu and freq scaling.
+ */
+static unsigned long capacity_curr_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_orig *
+ arch_scale_freq_capacity(NULL, cpu)
+ >> SCHED_CAPACITY_SHIFT;
+}
+
static inline bool energy_aware(void)
{
return sched_feat(ENERGY_AWARE);
--
1.9.1
Move get_cpu_usage() to an earlier position in fair.c and change return
type to unsigned long as negative usage doesn't make much sense. All
other load and capacity related functions use unsigned long including
the caller of get_cpu_usage().
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 78 ++++++++++++++++++++++++++---------------------------
1 file changed, 39 insertions(+), 39 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4a8404a..70f2700 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4787,6 +4787,45 @@ static unsigned long capacity_curr_of(int cpu)
>> SCHED_CAPACITY_SHIFT;
}
+/*
+ * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
+ * tasks. The unit of the return value must be the one of capacity so we can
+ * compare the usage with the capacity of the CPU that is available for CFS
+ * task (ie cpu_capacity).
+ *
+ * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
+ * CPU. It represents the amount of utilization of a CPU in the range
+ * [0..capacity_orig] where capacity_orig is the cpu_capacity available at the
+ * highest frequency (arch_scale_freq_capacity()). The usage of a CPU converges
+ * towards a sum equal to or less than the current capacity (capacity_curr <=
+ * capacity_orig) of the CPU because it is the running time on this CPU scaled
+ * by capacity_curr. Nevertheless, cfs.utilization_load_avg can be higher than
+ * capacity_curr or even higher than capacity_orig because of unfortunate
+ * rounding in avg_period and running_load_avg or just after migrating tasks
+ * (and new task wakeups) until the average stabilizes with the new running
+ * time. We need to check that the usage stays into the range
+ * [0..capacity_orig] and cap if necessary. Without capping the usage, a group
+ * could be seen as overloaded (CPU0 usage at 121% + CPU1 usage at 80%) whereas
+ * CPU1 has 20% of available capacity. We allow usage to overshoot
+ * capacity_curr (but not capacity_orig) as it useful for predicting the
+ * capacity required after task migrations (scheduler-driven DVFS).
+ */
+
+static unsigned long get_cpu_usage(int cpu)
+{
+ int sum;
+ unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+ unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
+ unsigned long capacity_orig = capacity_orig_of(cpu);
+
+ sum = usage + blocked;
+
+ if (sum >= capacity_orig)
+ return capacity_orig;
+
+ return sum;
+}
+
static inline bool energy_aware(void)
{
return sched_feat(ENERGY_AWARE);
@@ -5040,45 +5079,6 @@ static int select_idle_sibling(struct task_struct *p, int target)
}
/*
- * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
- * tasks. The unit of the return value must be the one of capacity so we can
- * compare the usage with the capacity of the CPU that is available for CFS
- * task (ie cpu_capacity).
- *
- * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
- * CPU. It represents the amount of utilization of a CPU in the range
- * [0..capacity_orig] where capacity_orig is the cpu_capacity available at the
- * highest frequency (arch_scale_freq_capacity()). The usage of a CPU converges
- * towards a sum equal to or less than the current capacity (capacity_curr <=
- * capacity_orig) of the CPU because it is the running time on this CPU scaled
- * by capacity_curr. Nevertheless, cfs.utilization_load_avg can be higher than
- * capacity_curr or even higher than capacity_orig because of unfortunate
- * rounding in avg_period and running_load_avg or just after migrating tasks
- * (and new task wakeups) until the average stabilizes with the new running
- * time. We need to check that the usage stays into the range
- * [0..capacity_orig] and cap if necessary. Without capping the usage, a group
- * could be seen as overloaded (CPU0 usage at 121% + CPU1 usage at 80%) whereas
- * CPU1 has 20% of available capacity. We allow usage to overshoot
- * capacity_curr (but not capacity_orig) as it useful for predicting the
- * capacity required after task migrations (scheduler-driven DVFS).
- */
-
-static int get_cpu_usage(int cpu)
-{
- int sum;
- unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
- unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
- unsigned long capacity_orig = capacity_orig_of(cpu);
-
- sum = usage + blocked;
-
- if (sum >= capacity_orig)
- return capacity_orig;
-
- return sum;
-}
-
-/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
* SD_BALANCE_FORK, or SD_BALANCE_EXEC.
--
1.9.1
Add another member to the family of per-cpu sched_domain shortcut
pointers. This one, sd_ea, points to the highest level at which an energy
model is provided. At this level and all levels below, all sched_groups
have energy model data attached.
Partial energy model information is possible but restricted to providing
energy model data for lower level sched_domains (sd_ea and below) and
leaving load-balancing on levels above to non-energy-aware
load-balancing. For example, it is possible to apply energy-aware
scheduling within each socket on a multi-socket system and let normal
scheduling handle load-balancing between sockets.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/core.c | 11 ++++++++++-
kernel/sched/sched.h | 1 +
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5499b2c..8818bf0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5777,11 +5777,12 @@ DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
DEFINE_PER_CPU(struct sched_domain *, sd_busy);
DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_PER_CPU(struct sched_domain *, sd_ea);
static void update_top_cache_domain(int cpu)
{
struct sched_domain *sd;
- struct sched_domain *busy_sd = NULL;
+ struct sched_domain *busy_sd = NULL, *ea_sd = NULL;
int id = cpu;
int size = 1;
@@ -5802,6 +5803,14 @@ static void update_top_cache_domain(int cpu)
sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
+
+ for_each_domain(cpu, sd) {
+ if (sd->groups->sge)
+ ea_sd = sd;
+ else
+ break;
+ }
+ rcu_assign_pointer(per_cpu(sd_ea, cpu), ea_sd);
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ea32e9e..b627dfa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -829,6 +829,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
DECLARE_PER_CPU(struct sched_domain *, sd_busy);
DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(struct sched_domain *, sd_ea);
struct sched_group_capacity {
atomic_t ref;
--
1.9.1
For energy-aware load-balancing decisions it is necessary to know the
energy consumption estimates of groups of cpus. This patch introduces a
basic function, sched_group_energy(), which estimates the energy
consumption of the cpus in the group and any resources shared by the
members of the group.
NOTE: The function has five levels of indentation and breaks the 80
character limit. Refactoring is necessary.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 146 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 146 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 70f2700..2677ca6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4831,6 +4831,152 @@ static inline bool energy_aware(void)
return sched_feat(ENERGY_AWARE);
}
+/*
+ * cpu_norm_usage() returns the cpu usage relative to a specific capacity,
+ * i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
+ * energy calculations. Using the scale-invariant usage returned by
+ * get_cpu_usage() and approximating scale-invariant usage by:
+ *
+ * usage ~ (curr_freq/max_freq)*1024 * capacity_orig/1024 * running_time/time
+ *
+ * the normalized usage can be found using the specific capacity.
+ *
+ * capacity = capacity_orig * curr_freq/max_freq
+ *
+ * norm_usage = running_time/time ~ usage/capacity
+ */
+static unsigned long cpu_norm_usage(int cpu, unsigned long capacity)
+{
+ int usage = __get_cpu_usage(cpu);
+
+ if (usage >= capacity)
+ return SCHED_CAPACITY_SCALE;
+
+ return (usage << SCHED_CAPACITY_SHIFT)/capacity;
+}
+
+static unsigned long group_max_usage(struct sched_group *sg)
+{
+ int i;
+ unsigned long max_usage = 0;
+
+ for_each_cpu(i, sched_group_cpus(sg))
+ max_usage = max(max_usage, get_cpu_usage(i));
+
+ return max_usage;
+}
+
+/*
+ * group_norm_usage() returns the approximated group usage relative to it's
+ * current capacity (busy ratio) in the range [0..SCHED_LOAD_SCALE] for use in
+ * energy calculations. Since task executions may or may not overlap in time in
+ * the group the true normalized usage is between max(cpu_norm_usage(i)) and
+ * sum(cpu_norm_usage(i)) when iterating over all cpus in the group, i. The
+ * latter is used as the estimate as it leads to a more pessimistic energy
+ * estimate (more busy).
+ */
+static unsigned long group_norm_usage(struct sched_group *sg, int cap_idx)
+{
+ int i;
+ unsigned long usage_sum = 0;
+ unsigned long capacity = sg->sge->cap_states[cap_idx].cap;
+
+ for_each_cpu(i, sched_group_cpus(sg))
+ usage_sum += cpu_norm_usage(i, capacity);
+
+ if (usage_sum > SCHED_CAPACITY_SCALE)
+ return SCHED_CAPACITY_SCALE;
+ return usage_sum;
+}
+
+static int find_new_capacity(struct sched_group *sg,
+ struct sched_group_energy *sge)
+{
+ int idx;
+ unsigned long util = group_max_usage(sg);
+
+ for (idx = 0; idx < sge->nr_cap_states; idx++) {
+ if (sge->cap_states[idx].cap >= util)
+ return idx;
+ }
+
+ return idx;
+}
+
+/*
+ * sched_group_energy(): Returns absolute energy consumption of cpus belonging
+ * to the sched_group including shared resources shared only by members of the
+ * group. Iterates over all cpus in the hierarchy below the sched_group starting
+ * from the bottom working it's way up before going to the next cpu until all
+ * cpus are covered at all levels. The current implementation is likely to
+ * gather the same usage statistics multiple times. This can probably be done in
+ * a faster but more complex way.
+ */
+static unsigned int sched_group_energy(struct sched_group *sg_top)
+{
+ struct sched_domain *sd;
+ int cpu, total_energy = 0;
+ struct cpumask visit_cpus;
+ struct sched_group *sg;
+
+ WARN_ON(!sg_top->sge);
+
+ cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
+
+ while (!cpumask_empty(&visit_cpus)) {
+ struct sched_group *sg_shared_cap = NULL;
+
+ cpu = cpumask_first(&visit_cpus);
+
+ /*
+ * Is the group utilization affected by cpus outside this
+ * sched_group?
+ */
+ sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
+ if (sd && sd->parent)
+ sg_shared_cap = sd->parent->groups;
+
+ for_each_domain(cpu, sd) {
+ sg = sd->groups;
+
+ /* Has this sched_domain already been visited? */
+ if (sd->child && group_first_cpu(sg) != cpu)
+ break;
+
+ do {
+ struct sched_group *sg_cap_util;
+ unsigned long group_util;
+ int sg_busy_energy, sg_idle_energy, cap_idx;
+
+ if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
+ sg_cap_util = sg_shared_cap;
+ else
+ sg_cap_util = sg;
+
+ cap_idx = find_new_capacity(sg_cap_util, sg->sge);
+ group_util = group_norm_usage(sg, cap_idx);
+ sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
+ >> SCHED_CAPACITY_SHIFT;
+ sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
+ >> SCHED_CAPACITY_SHIFT;
+
+ total_energy += sg_busy_energy + sg_idle_energy;
+
+ if (!sd->child)
+ cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
+
+ if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(sg_top)))
+ goto next_cpu;
+
+ } while (sg = sg->next, sg != sd->groups);
+ }
+next_cpu:
+ continue;
+ }
+
+ return total_energy;
+}
+
static int wake_wide(struct task_struct *p)
{
int factor = this_cpu_read(sd_llc_size);
--
1.9.1
Extended sched_group_energy() to support energy prediction with usage
(tasks) added/removed from a specific cpu or migrated between a pair of
cpus. Useful for load-balancing decision making.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 86 +++++++++++++++++++++++++++++++++++++----------------
1 file changed, 60 insertions(+), 26 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2677ca6..52403e9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4810,15 +4810,17 @@ static unsigned long capacity_curr_of(int cpu)
* capacity_curr (but not capacity_orig) as it useful for predicting the
* capacity required after task migrations (scheduler-driven DVFS).
*/
-
-static unsigned long get_cpu_usage(int cpu)
+static unsigned long __get_cpu_usage(int cpu, int delta)
{
int sum;
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
unsigned long capacity_orig = capacity_orig_of(cpu);
- sum = usage + blocked;
+ sum = usage + blocked + delta;
+
+ if (sum < 0)
+ return 0;
if (sum >= capacity_orig)
return capacity_orig;
@@ -4826,13 +4828,28 @@ static unsigned long get_cpu_usage(int cpu)
return sum;
}
+static unsigned long get_cpu_usage(int cpu)
+{
+ return __get_cpu_usage(cpu, 0);
+}
+
static inline bool energy_aware(void)
{
return sched_feat(ENERGY_AWARE);
}
+struct energy_env {
+ struct sched_group *sg_top;
+ struct sched_group *sg_cap;
+ int cap_idx;
+ int usage_delta;
+ int src_cpu;
+ int dst_cpu;
+ int energy;
+};
+
/*
- * cpu_norm_usage() returns the cpu usage relative to a specific capacity,
+ * __cpu_norm_usage() returns the cpu usage relative to a specific capacity,
* i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
* energy calculations. Using the scale-invariant usage returned by
* get_cpu_usage() and approximating scale-invariant usage by:
@@ -4845,9 +4862,9 @@ static inline bool energy_aware(void)
*
* norm_usage = running_time/time ~ usage/capacity
*/
-static unsigned long cpu_norm_usage(int cpu, unsigned long capacity)
+static unsigned long __cpu_norm_usage(int cpu, unsigned long capacity, int delta)
{
- int usage = __get_cpu_usage(cpu);
+ int usage = __get_cpu_usage(cpu, delta);
if (usage >= capacity)
return SCHED_CAPACITY_SCALE;
@@ -4855,13 +4872,25 @@ static unsigned long cpu_norm_usage(int cpu, unsigned long capacity)
return (usage << SCHED_CAPACITY_SHIFT)/capacity;
}
-static unsigned long group_max_usage(struct sched_group *sg)
+static int calc_usage_delta(struct energy_env *eenv, int cpu)
{
- int i;
+ if (cpu == eenv->src_cpu)
+ return -eenv->usage_delta;
+ if (cpu == eenv->dst_cpu)
+ return eenv->usage_delta;
+ return 0;
+}
+
+static
+unsigned long group_max_usage(struct energy_env *eenv, struct sched_group *sg)
+{
+ int i, delta;
unsigned long max_usage = 0;
- for_each_cpu(i, sched_group_cpus(sg))
- max_usage = max(max_usage, get_cpu_usage(i));
+ for_each_cpu(i, sched_group_cpus(sg)) {
+ delta = calc_usage_delta(eenv, i);
+ max_usage = max(max_usage, __get_cpu_usage(i, delta));
+ }
return max_usage;
}
@@ -4875,31 +4904,36 @@ static unsigned long group_max_usage(struct sched_group *sg)
* latter is used as the estimate as it leads to a more pessimistic energy
* estimate (more busy).
*/
-static unsigned long group_norm_usage(struct sched_group *sg, int cap_idx)
+static unsigned
+long group_norm_usage(struct energy_env *eenv, struct sched_group *sg)
{
- int i;
+ int i, delta;
unsigned long usage_sum = 0;
- unsigned long capacity = sg->sge->cap_states[cap_idx].cap;
+ unsigned long capacity = sg->sge->cap_states[eenv->cap_idx].cap;
- for_each_cpu(i, sched_group_cpus(sg))
- usage_sum += cpu_norm_usage(i, capacity);
+ for_each_cpu(i, sched_group_cpus(sg)) {
+ delta = calc_usage_delta(eenv, i);
+ usage_sum += __cpu_norm_usage(i, capacity, delta);
+ }
if (usage_sum > SCHED_CAPACITY_SCALE)
return SCHED_CAPACITY_SCALE;
return usage_sum;
}
-static int find_new_capacity(struct sched_group *sg,
+static int find_new_capacity(struct energy_env *eenv,
struct sched_group_energy *sge)
{
int idx;
- unsigned long util = group_max_usage(sg);
+ unsigned long util = group_max_usage(eenv, eenv->sg_cap);
for (idx = 0; idx < sge->nr_cap_states; idx++) {
if (sge->cap_states[idx].cap >= util)
return idx;
}
+ eenv->cap_idx = idx;
+
return idx;
}
@@ -4912,16 +4946,16 @@ static int find_new_capacity(struct sched_group *sg,
* gather the same usage statistics multiple times. This can probably be done in
* a faster but more complex way.
*/
-static unsigned int sched_group_energy(struct sched_group *sg_top)
+static unsigned int sched_group_energy(struct energy_env *eenv)
{
struct sched_domain *sd;
int cpu, total_energy = 0;
struct cpumask visit_cpus;
struct sched_group *sg;
- WARN_ON(!sg_top->sge);
+ WARN_ON(!eenv->sg_top->sge);
- cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
+ cpumask_copy(&visit_cpus, sched_group_cpus(eenv->sg_top));
while (!cpumask_empty(&visit_cpus)) {
struct sched_group *sg_shared_cap = NULL;
@@ -4944,17 +4978,16 @@ static unsigned int sched_group_energy(struct sched_group *sg_top)
break;
do {
- struct sched_group *sg_cap_util;
unsigned long group_util;
int sg_busy_energy, sg_idle_energy, cap_idx;
if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
- sg_cap_util = sg_shared_cap;
+ eenv->sg_cap = sg_shared_cap;
else
- sg_cap_util = sg;
+ eenv->sg_cap = sg;
- cap_idx = find_new_capacity(sg_cap_util, sg->sge);
- group_util = group_norm_usage(sg, cap_idx);
+ cap_idx = find_new_capacity(eenv, sg->sge);
+ group_util = group_norm_usage(eenv, sg);
sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
>> SCHED_CAPACITY_SHIFT;
sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
@@ -4965,7 +4998,7 @@ static unsigned int sched_group_energy(struct sched_group *sg_top)
if (!sd->child)
cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
- if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(sg_top)))
+ if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(eenv->sg_top)))
goto next_cpu;
} while (sg = sg->next, sg != sd->groups);
@@ -4974,6 +5007,7 @@ static unsigned int sched_group_energy(struct sched_group *sg_top)
continue;
}
+ eenv->energy = total_energy;
return total_energy;
}
--
1.9.1
Adds a generic energy-aware helper function, energy_diff(), that
calculates the energy impact of adding, removing, and migrating utilization
in the system.
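For illustration, a prospective caller (hypothetical variable names; the
actual call sites are not part of this patch) could evaluate a migration as:

  struct energy_env eenv = {
          .usage_delta = task_usage,   /* usage being moved */
          .src_cpu     = prev_cpu,
          .dst_cpu     = target_cpu,
  };

  if (energy_diff(&eenv) < 0) {
          /* moving the usage is expected to save energy */
  }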
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 52403e9..f36ab2f3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5011,6 +5011,57 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
return total_energy;
}
+/*
+ * energy_diff(): Estimate the energy impact of changing the utilization
+ * distribution. eenv specifies the change: utilisation amount, source, and
+ * destination cpu. Source or destination cpu may be -1 in which case the
+ * utilization is removed from or added to the system (e.g. task wake-up). If
+ * both are specified, the utilization is migrated.
+ */
+static int energy_diff(struct energy_env *eenv)
+{
+ struct sched_domain *sd;
+ struct sched_group *sg;
+ int sd_cpu = -1, energy_before = 0, energy_after = 0;
+
+ struct energy_env eenv_before = {
+ .usage_delta = 0,
+ .src_cpu = eenv->src_cpu,
+ .dst_cpu = eenv->dst_cpu,
+ };
+
+ if (eenv->src_cpu == eenv->dst_cpu)
+ return 0;
+
+ sd_cpu = (eenv->src_cpu != -1) ? eenv->src_cpu : eenv->dst_cpu;
+ sd = rcu_dereference(per_cpu(sd_ea, sd_cpu));
+
+ if (!sd)
+ return 0; /* Error */
+
+ sg = sd->groups;
+ do {
+ if (eenv->src_cpu != -1 && cpumask_test_cpu(eenv->src_cpu,
+ sched_group_cpus(sg))) {
+ eenv_before.sg_top = eenv->sg_top = sg;
+ energy_before += sched_group_energy(&eenv_before);
+ energy_after += sched_group_energy(eenv);
+
+ /* src_cpu and dst_cpu may belong to the same group */
+ continue;
+ }
+
+ if (eenv->dst_cpu != -1 && cpumask_test_cpu(eenv->dst_cpu,
+ sched_group_cpus(sg))) {
+ eenv_before.sg_top = eenv->sg_top = sg;
+ energy_before += sched_group_energy(&eenv_before);
+ energy_after += sched_group_energy(eenv);
+ }
+ } while (sg = sg->next, sg != sd->groups);
+
+ return energy_after-energy_before;
+}
+
static int wake_wide(struct task_struct *p)
{
int factor = this_cpu_read(sd_llc_size);
--
1.9.1
Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
above the tipping point, task placement is done the traditional way,
spreading the tasks across as many cpus as possible based on priority
scaled load to preserve smp_nice.
The over-utilization condition is conservatively chosen to indicate
over-utilization as soon as one cpu is fully utilized at its highest
frequency. We don't consider groups, as lumping usage and capacity
together for a group of cpus may hide the fact that one or more cpus in
the group are over-utilized while group-siblings are partially idle. The
tasks could be served better if moved to another group with completely
idle cpus. This is particularly problematic if some cpus have a
significantly reduced capacity due to RT/IRQ pressure or if the system
has cpus of different capacity (e.g. ARM big.LITTLE).
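As a worked illustration of the condition used in the code below
(capacity_margin = 1280), a cpu is flagged as over-utilized once

  get_cpu_usage(cpu) * 1280 > capacity_of(cpu) * 1024

i.e. once its usage exceeds roughly 80% of its available capacity.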
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 35 +++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 3 +++
2 files changed, 34 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f36ab2f3..5b7bc28 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4266,6 +4266,8 @@ static inline void hrtick_update(struct rq *rq)
}
#endif
+static bool cpu_overutilized(int cpu);
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -4276,6 +4278,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ int task_new = !(flags & ENQUEUE_WAKEUP);
for_each_sched_entity(se) {
if (se->on_rq)
@@ -4310,6 +4313,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se) {
update_rq_runnable_avg(rq, rq->nr_running);
add_nr_running(rq, 1);
+ if (!task_new && !rq->rd->overutilized &&
+ cpu_overutilized(rq->cpu))
+ rq->rd->overutilized = true;
}
hrtick_update(rq);
}
@@ -4937,6 +4943,14 @@ static int find_new_capacity(struct energy_env *eenv,
return idx;
}
+static unsigned int capacity_margin = 1280; /* ~20% margin */
+
+static bool cpu_overutilized(int cpu)
+{
+ return (capacity_of(cpu) * 1024) <
+ (get_cpu_usage(cpu) * capacity_margin);
+}
+
/*
* sched_group_energy(): Returns absolute energy consumption of cpus belonging
* to the sched_group including shared resources shared only by members of the
@@ -6732,11 +6746,12 @@ static enum group_type group_classify(struct lb_env *env,
* @local_group: Does group contain this_cpu.
* @sgs: variable to hold the statistics for this group.
* @overload: Indicate more than one runnable task for any CPU.
+ * @overutilized: Indicate overutilization for any CPU.
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
int local_group, struct sg_lb_stats *sgs,
- bool *overload)
+ bool *overload, bool *overutilized)
{
unsigned long load;
int i;
@@ -6766,6 +6781,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->sum_weighted_load += weighted_cpuload(i);
if (idle_cpu(i))
sgs->idle_cpus++;
+
+ if (cpu_overutilized(i))
+ *overutilized = true;
}
/* Adjust by relative CPU capacity of the group */
@@ -6871,7 +6889,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats tmp_sgs;
int load_idx, prefer_sibling = 0;
- bool overload = false;
+ bool overload = false, overutilized = false;
if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;
@@ -6893,7 +6911,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}
update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
- &overload);
+ &overload, &overutilized);
if (local_group)
goto next_group;
@@ -6935,8 +6953,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* update overload indicator if we are at root domain */
if (env->dst_rq->rd->overload != overload)
env->dst_rq->rd->overload = overload;
- }
+ /* Update over-utilization (tipping point, U >= 0) indicator */
+ if (env->dst_rq->rd->overutilized != overutilized)
+ env->dst_rq->rd->overutilized = overutilized;
+ } else {
+ if (!env->dst_rq->rd->overutilized && overutilized)
+ env->dst_rq->rd->overutilized = true;
+ }
}
/**
@@ -8300,6 +8324,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
task_tick_numa(rq, curr);
update_rq_runnable_avg(rq, 1);
+
+ if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
+ rq->rd->overutilized = true;
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b627dfa..a5d2d69 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -535,6 +535,9 @@ struct root_domain {
/* Indicate more than one runnable task for any CPU */
bool overload;
+ /* Indicate one or more cpus over-utilized (tipping point) */
+ bool overutilized;
+
/*
* The bit corresponding to a CPU gets set here if such CPU has more
* than one runnable -deadline task (as it is below for RT tasks).
--
1.9.1
From: Dietmar Eggemann <[email protected]>
To be able to compare the capacity of the target cpu with the highest
cpu capacity of the system in the wakeup path, store the system-wide
maximum cpu capacity in the root domain.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/core.c | 8 ++++++++
kernel/sched/sched.h | 3 +++
2 files changed, 11 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8818bf0..d307db8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6848,6 +6848,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
enum s_alloc alloc_state;
struct sched_domain *sd;
struct s_data d;
+ struct rq *rq;
int i, ret = -ENOMEM;
alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
@@ -6901,11 +6902,18 @@ static int build_sched_domains(const struct cpumask *cpu_map,
/* Attach the domains */
rcu_read_lock();
for_each_cpu(i, cpu_map) {
+ rq = cpu_rq(i);
sd = *per_cpu_ptr(d.sd, i);
cpu_attach_domain(sd, d.rd, i);
+
+ if (rq->cpu_capacity_orig > rq->rd->max_cpu_capacity)
+ rq->rd->max_cpu_capacity = rq->cpu_capacity_orig;
}
rcu_read_unlock();
+ rq = cpu_rq(cpumask_first(cpu_map));
+ pr_info("Max cpu capacity: %lu\n", rq->rd->max_cpu_capacity);
+
ret = 0;
error:
__free_domain_allocs(&d, alloc_state, cpu_map);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a5d2d69..95c8ff7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -553,6 +553,9 @@ struct root_domain {
*/
cpumask_var_t rto_mask;
struct cpupri cpupri;
+
+ /* Maximum cpu capacity in the system. */
+ unsigned long max_cpu_capacity;
};
extern struct root_domain def_root_domain;
--
1.9.1
The idle-state of each cpu is currently pointed to by rq->idle_state but
there isn't any information in struct cpuidle_state that can be used to
look up the idle-state energy model data stored in struct
sched_group_energy. For this purpose it is necessary to store the
idle-state index as well. Ideally, the idle-state data should be unified.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/idle.c | 2 ++
kernel/sched/sched.h | 21 +++++++++++++++++++++
2 files changed, 23 insertions(+)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index fefcb1f..6832fa1 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -151,6 +151,7 @@ static void cpuidle_idle_call(void)
/* Take note of the planned idle state. */
idle_set_state(this_rq(), &drv->states[next_state]);
+ idle_set_state_idx(this_rq(), next_state);
/*
* Enter the idle state previously returned by the governor decision.
@@ -161,6 +162,7 @@ static void cpuidle_idle_call(void)
/* The cpu is no longer idle or about to enter idle. */
idle_set_state(this_rq(), NULL);
+ idle_set_state_idx(this_rq(), -1);
if (entered_state == -EBUSY)
goto use_default;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 95c8ff7..897730f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -701,6 +701,7 @@ struct rq {
#ifdef CONFIG_CPU_IDLE
/* Must be inspected within a rcu lock section */
struct cpuidle_state *idle_state;
+ int idle_state_idx;
#endif
};
@@ -1316,6 +1317,17 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
WARN_ON(!rcu_read_lock_held());
return rq->idle_state;
}
+
+static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)
+{
+ rq->idle_state_idx = idle_state_idx;
+}
+
+static inline int idle_get_state_idx(struct rq *rq)
+{
+ WARN_ON(!rcu_read_lock_held());
+ return rq->idle_state_idx;
+}
#else
static inline void idle_set_state(struct rq *rq,
struct cpuidle_state *idle_state)
@@ -1326,6 +1338,15 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
{
return NULL;
}
+
+static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)
+{
+}
+
+static inline int idle_get_state_idx(struct rq *rq)
+{
+ return -1;
+}
#endif
extern void sysrq_sched_debug_show(void);
--
1.9.1
cpuidle associates all idle-states with each cpu while the energy model
associates them with the sched_group covering the cpus coordinating
entry to the idle-state. To look up the idle-state power consumption in
the energy model it is therefore necessary to translate from cpuidle
idle-state index to energy model index. For this purpose it is helpful
to know how many idle-states are listed in lower-level sched_groups
(in struct sched_group_energy).
Example: ARMv8 big.LITTLE JUNO (Cortex A57, A53) idle-states:
Idle-state            cpuidle    Energy model table indices
                      index      per-cpu sg    per-cluster sg
WFI                   0          0             (0)
Core power-down       1          1             0*
Cluster power-down    2          (1)           1
For per-cpu sgs no translation is required. If cpuidle reports state
index 0 or 1, the cpu is in WFI or core power-down, respectively. We can
look up the idle-power directly in the sg energy model table. The
cluster power-down idle-state is represented in the per-cluster sg
energy model table as index 1. Index 0* is reserved for the cluster
power consumption when all the cpus are in state 0 or 1 but cpuidle
decided not to go for cluster power-down. Given the index from cpuidle
we can compute the correct index in the energy model tables for the sgs
at each level if we know how many states are in the tables of the child
sgs. The actual translation is implemented in a later patch; a small
worked sketch follows below.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 12 ++++++++++++
2 files changed, 13 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fe77e54..9ea43cd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1027,6 +1027,7 @@ struct sched_group_energy {
atomic_t ref;
unsigned int nr_idle_states; /* number of idle states */
struct idle_state *idle_states; /* ptr to idle state array */
+ unsigned int nr_idle_states_below; /* number idle states in lower groups */
unsigned int nr_cap_states; /* number of capacity states */
struct capacity_state *cap_states; /* ptr to capacity state array */
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d307db8..98a83e4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6109,6 +6109,7 @@ static void init_sched_energy(int cpu, struct sched_domain *sd,
struct sched_group_energy *energy = sg->sge;
sched_domain_energy_f fn = tl->energy;
struct cpumask *mask = sched_group_cpus(sg);
+ int nr_idle_states_below = 0;
if (fn && sd->child && !sd->child->groups->sge) {
pr_err("BUG: EAS setup broken for CPU%d\n", cpu);
@@ -6133,9 +6134,20 @@ static void init_sched_energy(int cpu, struct sched_domain *sd,
if (cpumask_weight(mask) > 1)
check_sched_energy_data(cpu, fn, mask);
+ /* Figure out the number of true cpuidle states below current group */
+ sd = sd->child;
+ for_each_lower_domain(sd) {
+ nr_idle_states_below += sd->groups->sge->nr_idle_states;
+
+ /* Disregard non-cpuidle 'active' idle states */
+ if (sd->child)
+ nr_idle_states_below--;
+ }
+
energy->nr_idle_states = fn(cpu)->nr_idle_states;
memcpy(energy->idle_states, fn(cpu)->idle_states,
energy->nr_idle_states*sizeof(struct idle_state));
+ energy->nr_idle_states_below = nr_idle_states_below;
energy->nr_cap_states = fn(cpu)->nr_cap_states;
memcpy(energy->cap_states, fn(cpu)->cap_states,
energy->nr_cap_states*sizeof(struct capacity_state));
--
1.9.1
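To make the index translation described in this patch concrete, here is a
small standalone sketch (illustrative user-space code, not part of the
series; the JUNO numbers from the table above are assumed). It mirrors the
shift-and-clamp logic that the later group_idle_state() patch implements:

#include <stdio.h>

/*
 * Illustrative only: translate the shallowest cpuidle index seen in a
 * sched_group into an index into that group's idle_states[] table,
 * using nr_idle_states_below as described above.
 */
static int sketch_group_idle_idx(int cpuidle_idx, int nr_idle_states_below,
                                 int nr_idle_states)
{
        int idx = cpuidle_idx;

        if (nr_idle_states_below > 1)
                idx -= nr_idle_states_below - 1;

        /* clamp to the range of this group's idle-state table */
        if (idx < 0)
                idx = 0;
        if (idx > nr_idle_states - 1)
                idx = nr_idle_states - 1;

        return idx;
}

int main(void)
{
        /* JUNO cluster-level sg: WFI and core power-down are handled below */
        int nr_below = 2, nr_states = 2;

        /* cpuidle index 2 (cluster power-down) -> cluster table index 1 */
        printf("%d\n", sketch_group_idle_idx(2, nr_below, nr_states));

        /* cpuidle index 1 (core power-down)    -> cluster table index 0* */
        printf("%d\n", sketch_group_idle_idx(1, nr_below, nr_states));

        return 0;
}

For per-cpu sgs nr_idle_states_below should stay at 0, so their cpuidle
indices pass through unchanged, matching "no translation is required"
above.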
From: Dietmar Eggemann <[email protected]>
To estimate the energy consumption of a sched_group in
sched_group_energy() it is necessary to know which idle-state the group
is in when it is idle. For now, it is assumed that this is the current
idle-state (though it might be wrong). Based on the individual cpu
idle-states group_idle_state() finds the group idle-state.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 27 +++++++++++++++++++++++----
1 file changed, 23 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b7bc28..19601c9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4951,6 +4951,22 @@ static bool cpu_overutilized(int cpu)
(get_cpu_usage(cpu) * capacity_margin);
}
+static int group_idle_state(struct sched_group *sg)
+{
+ int i, state = INT_MAX;
+
+ /* Find the shallowest idle state in the sched group. */
+ for_each_cpu(i, sched_group_cpus(sg))
+ state = min(state, idle_get_state_idx(cpu_rq(i)));
+
+ /* Transform system into sched domain idle state. */
+ if (sg->sge->nr_idle_states_below > 1)
+ state -= sg->sge->nr_idle_states_below - 1;
+
+ /* Clamp state to the range of sched domain idle states. */
+ return clamp_t(int, state, 0, sg->sge->nr_idle_states - 1);
+}
+
/*
* sched_group_energy(): Returns absolute energy consumption of cpus belonging
* to the sched_group including shared resources shared only by members of the
@@ -4993,7 +5009,8 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
do {
unsigned long group_util;
- int sg_busy_energy, sg_idle_energy, cap_idx;
+ int sg_busy_energy, sg_idle_energy;
+ int cap_idx, idle_idx;
if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
eenv->sg_cap = sg_shared_cap;
@@ -5001,11 +5018,13 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
eenv->sg_cap = sg;
cap_idx = find_new_capacity(eenv, sg->sge);
+ idle_idx = group_idle_state(sg);
group_util = group_norm_usage(eenv, sg);
sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
- >> SCHED_CAPACITY_SHIFT;
- sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
- >> SCHED_CAPACITY_SHIFT;
+ >> SCHED_CAPACITY_SHIFT;
+ sg_idle_energy = ((SCHED_LOAD_SCALE-group_util)
+ * sg->sge->idle_states[idle_idx].power)
+ >> SCHED_CAPACITY_SHIFT;
total_energy += sg_busy_energy + sg_idle_energy;
--
1.9.1
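As a rough numerical illustration of the busy/idle energy split computed in
sched_group_energy() above (all numbers are made up and not taken from any
real platform energy model):

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT    10
#define SCHED_LOAD_SCALE        (1 << SCHED_CAPACITY_SHIFT)

int main(void)
{
        /* Assumed example: the group is busy for ~25% of the time */
        unsigned long group_util = 256;         /* out of SCHED_LOAD_SCALE */
        unsigned long busy_power = 600;         /* cap_states[cap_idx].power */
        unsigned long idle_power = 10;          /* idle_states[idle_idx].power */

        unsigned long busy = (group_util * busy_power) >> SCHED_CAPACITY_SHIFT;
        unsigned long idle = ((SCHED_LOAD_SCALE - group_util) * idle_power)
                                                >> SCHED_CAPACITY_SHIFT;

        /* busy = 150, idle = 7, total = 157 (energy model units) */
        printf("busy=%lu idle=%lu total=%lu\n", busy, idle, busy + idle);

        return 0;
}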
Wakeup balancing is completely unaware of cpu capacity, cpu usage and
task utilization. The task is preferably placed on a cpu which is idle
at the instant the wakeup happens. New tasks (SD_BALANCE_{FORK,EXEC}) are
placed on an idle cpu in the idlest group if one can be found, otherwise
they go on the least loaded cpu. Existing tasks (SD_BALANCE_WAKE) are
placed on the previous cpu or an idle cpu sharing the same last level
cache. Hence existing tasks don't get a chance to migrate to a different
group at wakeup in case the current one has reduced cpu capacity (due to
RT/IRQ pressure or a different uarch, e.g. ARM big.LITTLE). They may
eventually get pulled by other cpus doing periodic/idle/nohz_idle
balance, but it may take quite a while before that happens.
This patch adds capacity awareness to find_idlest_{group,queue} (used by
SD_BALANCE_{FORK,EXEC}) such that groups/cpus that can accommodate the
waking task based on task utilization are preferred. In addition, wakeup
of existing tasks (SD_BALANCE_WAKE) is sent through
find_idlest_{group,queue} if the task doesn't fit the capacity of the
previous cpu to allow it to escape (override wake_affine) when
necessary instead of relying on periodic/idle/nohz_idle balance to
eventually sort it out.
The patch doesn't depend on any energy model infrastructure. Although it
is primarily a performance optimization, it is kept behind the
energy_aware() static key as it may increase scheduler overhead slightly.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 62 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98a83e4..ec46248 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6315,7 +6315,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
| 1*SD_BALANCE_NEWIDLE
| 1*SD_BALANCE_EXEC
| 1*SD_BALANCE_FORK
- | 0*SD_BALANCE_WAKE
+ | 1*SD_BALANCE_WAKE
| 1*SD_WAKE_AFFINE
| 0*SD_SHARE_CPUCAPACITY
| 0*SD_SHARE_PKG_RESOURCES
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 19601c9..bb44646 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5190,6 +5190,39 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
return 1;
}
+static inline unsigned long task_utilization(struct task_struct *p)
+{
+ return p->se.avg.utilization_avg_contrib;
+}
+
+static inline bool __task_fits(struct task_struct *p, int cpu, int usage)
+{
+ unsigned long capacity = capacity_of(cpu);
+
+ usage += task_utilization(p);
+
+ return (capacity * 1024) > (usage * capacity_margin);
+}
+
+static inline bool task_fits_capacity(struct task_struct *p, int cpu)
+{
+ unsigned long capacity = capacity_of(cpu);
+ unsigned long max_capacity = cpu_rq(cpu)->rd->max_cpu_capacity;
+
+ if (capacity == max_capacity)
+ return true;
+
+ if (capacity * capacity_margin > max_capacity * 1024)
+ return true;
+
+ return __task_fits(p, cpu, 0);
+}
+
+static inline bool task_fits_cpu(struct task_struct *p, int cpu)
+{
+ return __task_fits(p, cpu, get_cpu_usage(cpu));
+}
+
/*
* find_idlest_group finds and returns the least busy CPU group within the
* domain.
@@ -5199,7 +5232,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int sd_flag)
{
struct sched_group *idlest = NULL, *group = sd->groups;
+ struct sched_group *fit_group = NULL;
unsigned long min_load = ULONG_MAX, this_load = 0;
+ unsigned long fit_capacity = ULONG_MAX;
int load_idx = sd->forkexec_idx;
int imbalance = 100 + (sd->imbalance_pct-100)/2;
@@ -5230,6 +5265,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
load = target_load(i, load_idx);
avg_load += load;
+
+ if (energy_aware() && capacity_of(i) < fit_capacity &&
+ task_fits_cpu(p, i)) {
+ fit_capacity = capacity_of(i);
+ fit_group = group;
+ }
}
/* Adjust by relative CPU capacity of the group */
@@ -5243,6 +5284,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
}
} while (group = group->next, group != sd->groups);
+ if (fit_group)
+ return fit_group;
+
if (!idlest || 100*this_load < imbalance*min_load)
return NULL;
return idlest;
@@ -5263,7 +5307,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
- if (idle_cpu(i)) {
+ if (task_fits_cpu(p, i)) {
struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
@@ -5275,7 +5319,8 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
min_exit_latency = idle->exit_latency;
latest_idle_timestamp = rq->idle_stamp;
shallowest_idle_cpu = i;
- } else if ((!idle || idle->exit_latency == min_exit_latency) &&
+ } else if (idle_cpu(i) &&
+ (!idle || idle->exit_latency == min_exit_latency) &&
rq->idle_stamp > latest_idle_timestamp) {
/*
* If equal or no active idle state, then
@@ -5284,6 +5329,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
*/
latest_idle_timestamp = rq->idle_stamp;
shallowest_idle_cpu = i;
+ } else if (shallowest_idle_cpu == -1) {
+ /*
+ * If we haven't found an idle CPU yet
+ * pick a non-idle one that can fit the task as
+ * fallback.
+ */
+ shallowest_idle_cpu = i;
}
} else if (shallowest_idle_cpu == -1) {
load = weighted_cpuload(i);
@@ -5361,9 +5413,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
int cpu = smp_processor_id();
int new_cpu = cpu;
int want_affine = 0;
+ int want_sibling = true;
int sync = wake_flags & WF_SYNC;
- if (sd_flag & SD_BALANCE_WAKE)
+ /* Check if prev_cpu can fit us ignoring its current usage */
+ if (energy_aware() && !task_fits_capacity(p, prev_cpu))
+ want_sibling = false;
+
+ if (sd_flag & SD_BALANCE_WAKE && want_sibling)
want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
rcu_read_lock();
@@ -5388,7 +5445,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
prev_cpu = cpu;
- if (sd_flag & SD_BALANCE_WAKE) {
+ if (sd_flag & SD_BALANCE_WAKE && want_sibling) {
new_cpu = select_idle_sibling(p, prev_cpu);
goto unlock;
}
--
1.9.1
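As a rough illustration of the fit criterion used by __task_fits() and
task_fits_cpu() above (a standalone sketch with made-up numbers;
capacity_margin = 1280 is taken from the tipping point patch earlier in the
series):

#include <stdbool.h>
#include <stdio.h>

static const unsigned long capacity_margin = 1280;      /* ~20% margin */

/*
 * A cpu "fits" the task if usage plus the task's utilization stays
 * below roughly 80% (1024/1280) of the cpu's capacity.
 */
static bool sketch_fits(unsigned long capacity, unsigned long usage,
                        unsigned long task_util)
{
        return (capacity * 1024) > ((usage + task_util) * capacity_margin);
}

int main(void)
{
        /* little cpu: capacity ~430, current usage 250, task utilization 100 */
        printf("little: %d\n", sketch_fits(430, 250, 100));     /* 0: no fit */

        /* big cpu: capacity 1024, same usage and task */
        printf("big:    %d\n", sketch_fits(1024, 250, 100));    /* 1: fits  */

        return 0;
}

With these numbers find_idlest_group() above would not consider the little
cpu a fit; had the task been smaller, the group containing the
smallest-capacity cpu that still fits would have been preferred.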
Let available compute capacity and estimated energy impact select the
wake-up target cpu when energy-aware scheduling is enabled and the
system is not over-utilized (i.e. below the tipping point).
energy_aware_wake_cpu() attempts to find a group of cpus with sufficient
compute capacity to accommodate the task and then a cpu with enough spare
capacity to handle the task within that group. Preference is given to
cpus with enough spare capacity at the current OPP. Finally, the energy
impact of the new target and the previous task cpu is compared to select
the wake-up target cpu.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 84 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bb44646..fe41e1e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5394,6 +5394,86 @@ static int select_idle_sibling(struct task_struct *p, int target)
return target;
}
+static int energy_aware_wake_cpu(struct task_struct *p)
+{
+ struct sched_domain *sd;
+ struct sched_group *sg, *sg_target;
+ int target_max_cap = INT_MAX;
+ int target_cpu = task_cpu(p);
+ int i;
+
+ sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
+
+ if (!sd)
+ return -1;
+
+ sg = sd->groups;
+ sg_target = sg;
+
+ /*
+ * Find group with sufficient capacity. We only get here if no cpu is
+ * overutilized. We may end up overutilizing a cpu by adding the task,
+ * but that should not be any worse than select_idle_sibling().
+ * load_balance() should sort it out later as we get above the tipping
+ * point.
+ */
+ do {
+ /* Assuming all cpus are the same in group */
+ int max_cap_cpu = group_first_cpu(sg);
+
+ /*
+ * Assume smaller max capacity means more energy-efficient.
+ * Ideally we should query the energy model for the right
+ * answer but it easily ends up in an exhaustive search.
+ */
+ if (capacity_of(max_cap_cpu) < target_max_cap &&
+ task_fits_capacity(p, max_cap_cpu)) {
+ sg_target = sg;
+ target_max_cap = capacity_of(max_cap_cpu);
+ }
+ } while (sg = sg->next, sg != sd->groups);
+
+ /* Find cpu with sufficient capacity */
+ for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
+ /*
+ * p's blocked utilization is still accounted for on prev_cpu
+ * so prev_cpu will receive a negative bias due the double
+ * accouting. However, the blocked utilization may be zero.
+ */
+ int new_usage = get_cpu_usage(i) + task_utilization(p);
+
+ if (new_usage > capacity_orig_of(i))
+ continue;
+
+ if (new_usage < capacity_curr_of(i)) {
+ target_cpu = i;
+ if (cpu_rq(i)->nr_running)
+ break;
+ }
+
+ /* cpu has capacity at higher OPP, keep it as fallback */
+ if (target_cpu == task_cpu(p))
+ target_cpu = i;
+ }
+
+ if (target_cpu != task_cpu(p)) {
+ struct energy_env eenv = {
+ .usage_delta = task_utilization(p),
+ .src_cpu = task_cpu(p),
+ .dst_cpu = target_cpu,
+ };
+
+ /* Not enough spare capacity on previous cpu */
+ if (cpu_overutilized(task_cpu(p)))
+ return target_cpu;
+
+ if (energy_diff(&eenv) >= 0)
+ return task_cpu(p);
+ }
+
+ return target_cpu;
+}
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5446,7 +5526,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
prev_cpu = cpu;
if (sd_flag & SD_BALANCE_WAKE && want_sibling) {
- new_cpu = select_idle_sibling(p, prev_cpu);
+ if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
+ new_cpu = energy_aware_wake_cpu(p);
+ else
+ new_cpu = select_idle_sibling(p, prev_cpu);
goto unlock;
}
--
1.9.1
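To illustrate the "spare capacity at the current OPP" preference in
energy_aware_wake_cpu() above, here is a sketch with assumed numbers;
capacity_curr stands in for capacity_curr_of() and capacity_orig for
capacity_orig_of():

#include <stdio.h>

int main(void)
{
        unsigned long capacity_curr = 400;      /* capacity at current OPP */
        unsigned long capacity_orig = 1024;     /* capacity at highest OPP */
        unsigned long usage = 250, task_util = 100;
        unsigned long new_usage = usage + task_util;

        if (new_usage > capacity_orig)
                printf("skip: does not fit even at the highest OPP\n");
        else if (new_usage < capacity_curr)
                printf("preferred: fits without raising the OPP\n");
        else
                printf("fallback: fits, but only after an OPP increase\n");

        return 0;
}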
From: Dietmar Eggemann <[email protected]>
In case the system operates below the tipping point (the indicator
introduced in "sched: Add over-utilization/tipping point
indicator"), bail out of find_busiest_group() after the dst and src
group statistics have been updated.
There is simply no need to move usage around because all involved
cpus still have spare cycles available.
For an energy-aware system below its tipping point, we rely on the
task placement of the wakeup path. This works well for short-running
tasks.
The existence of long-running tasks on one of the involved cpus pushes
the system above its tipping point. To be able to move such a task
(whose load can't be used to average the load among the cpus) from a
src cpu with lower capacity than the dst_cpu, an additional rule has
to be implemented in need_active_balance().
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe41e1e..83218e0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7340,6 +7340,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
* this level.
*/
update_sd_lb_stats(env, &sds);
+
+ if (energy_aware() && !env->dst_rq->rd->overutilized)
+ goto out_balanced;
+
local = &sds.local_stat;
busiest = &sds.busiest_stat;
--
1.9.1
From: Dietmar Eggemann <[email protected]>
We do not want to miss out on the ability to pull a single remaining
task from a potential source cpu towards an idle destination cpu if the
energy-aware system operates above the tipping point.
Add an extra criterion to need_active_balance() to kick off active load
balance if the source cpu is over-utilized and has lower capacity than
the destination cpu.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 83218e0..8efeb1d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7527,6 +7527,13 @@ static int need_active_balance(struct lb_env *env)
return 1;
}
+ if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&
+ env->src_rq->cfs.h_nr_running == 1 &&
+ cpu_overutilized(env->src_cpu) &&
+ !cpu_overutilized(env->dst_cpu)) {
+ return 1;
+ }
+
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}
@@ -7881,7 +7888,8 @@ static int idle_balance(struct rq *this_rq)
this_rq->idle_stamp = rq_clock(this_rq);
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !this_rq->rd->overload) {
+ (!energy_aware() && !this_rq->rd->overload) ||
+ (energy_aware() && !this_rq->rd->overutilized)) {
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
--
1.9.1
With energy-aware scheduling enabled, nohz_kick_needed() generates many
nohz idle-balance kicks which lead to nothing when multiple tasks get
packed on a single cpu to save energy. This causes unnecessary wake-ups
and hence wastes energy. Make these conditions depend on !energy_aware()
for now until the energy-aware nohz story gets sorted out.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8efeb1d..e6d32e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8382,12 +8382,13 @@ static inline bool nohz_kick_needed(struct rq *rq)
if (time_before(now, nohz.next_balance))
return false;
- if (rq->nr_running >= 2)
+ if (rq->nr_running >= 2 &&
+ (!energy_aware() || cpu_overutilized(cpu)))
return true;
rcu_read_lock();
sd = rcu_dereference(per_cpu(sd_busy, cpu));
- if (sd) {
+ if (sd && !energy_aware()) {
sgc = sd->groups->sgc;
nr_busy = atomic_read(&sgc->nr_busy_cpus);
--
1.9.1
On 05/12/2015 12:38 PM, Morten Rasmussen wrote:
> Test results for ARM TC2 (2xA15+3xA7) with cpufreq enabled:
>
> sysbench: Single task running for 3 seconds.
> rt-app [4]: mp3 playback use-case model
> rt-app [4]: 5 ~[6,13,19,25,31,38,44,50]% periodic (2ms) tasks
>
> Note: % is relative to the capacity of the fastest cpu at the highest
> frequency, i.e. the more busy ones do not fit on little cpus.
>
> A newer version of rt-app was used which supports a better but slightly
> different way of modelling the periodic tasks. Numbers are therefore
> _not_ comparable to the RFCv3 numbers.
>
> Average numbers for 20 runs per test (ARM TC2).
>
> Energy          Mainline   EAS    noEAS
>
> sysbench        100        251*   227*
>
> rt-app mp3      100        63     111
>
> rt-app 6%       100        42     102
> rt-app 13%      100        58     101
> rt-app 19%      100        87     101
> rt-app 25%      100        94     104
> rt-app 31%      100        93     104
> rt-app 38%      100        114    117
> rt-app 44%      100        115    118
> rt-app 50%      100        125    126
Hi Morten,
What is noEAS? From the numbers, noEAS != Mainline?
Maybe also have some perf numbers to show that perf is in fact preserved
while lowering power.
-Sai
On 05/12/2015 12:38 PM, Morten Rasmussen wrote:
> Task being dequeued for the last time (state == TASK_DEAD) are dequeued
> with the DEQUEUE_SLEEP flag which causes their load and utilization
> contributions to be added to the runqueue blocked load and utilization.
> Hence they will contain load or utilization that is gone away. The issue
> only exists for the root cfs_rq as cgroup_exit() doesn't set
> DEQUEUE_SLEEP for task group exits.
>
> If runnable+blocked load is to be used as a better estimate for cpu
> load the dead task contributions need to be removed to prevent
> load_balance() (idle_balance() in particular) from over-estimating the
> cpu load.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e40cd88..d045404 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3202,6 +3202,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> * Update run-time statistics of the 'current'.
> */
> update_curr(cfs_rq);
> + if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
> + flags &= !DEQUEUE_SLEEP;
> dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
>
> update_stats_dequeue(cfs_rq, se);
>
Maybe you could use the sched_class->task_dead() callback instead? I
remember seeing a patch from Yuyang that did that.
Hi Sai,
On Tue, May 12, 2015 at 11:07:26PM +0100, Sai Gurrappadi wrote:
>
> On 05/12/2015 12:38 PM, Morten Rasmussen wrote:
> > Test results for ARM TC2 (2xA15+3xA7) with cpufreq enabled:
> >
> > sysbench: Single task running for 3 seconds.
> > rt-app [4]: mp3 playback use-case model
> > rt-app [4]: 5 ~[6,13,19,25,31,38,44,50]% periodic (2ms) tasks
> >
> > Note: % is relative to the capacity of the fastest cpu at the highest
> > frequency, i.e. the more busy ones do not fit on little cpus.
> >
> > A newer version of rt-app was used which supports a better but slightly
> > different way of modelling the periodic tasks. Numbers are therefore
> > _not_ comparable to the RFCv3 numbers.
> >
> > Average numbers for 20 runs per test (ARM TC2).
> >
> > Energy          Mainline   EAS    noEAS
> >
> > sysbench        100        251*   227*
> >
> > rt-app mp3      100        63     111
> >
> > rt-app 6%       100        42     102
> > rt-app 13%      100        58     101
> > rt-app 19%      100        87     101
> > rt-app 25%      100        94     104
> > rt-app 31%      100        93     104
> > rt-app 38%      100        114    117
> > rt-app 44%      100        115    118
> > rt-app 50%      100        125    126
>
> Hi Morten,
>
> What is noEAS? From the numbers, noEAS != Mainline?
Sorry, that should have been more clear.
Mainline: tip/sched/core (not really mainline yet...)
EAS: tip/sched/core + RFCv4 + EAS enabled.
noEAS: tip/sched/core + RFCv4 + EAS disabled.
The main difference between plain tip/sched/core and EAS disabled is
that PELT is frequency-invariant, which affects the decisions in
periodic/idle/nohz_idle balance.
> Maybe also have some perf numbers to show that perf is in fact preserved
> while lowering power.
Couldn't agree more. Energy numbers on their own do not say much. I
hinted at the sysbench performance in the (trimmed) text further down.
The increase in energy for EAS is due to doing more work (higher
performance). The rt-app runs with task utilization at the lower end
should deliver the same level of performance as none of the cpus are
fully utilized. The little cpus have a capacity of 43% each. At the
higher end I would expect performance to be different. EAS tries its
best to put heavier tasks on the big cpus, whereas mainline may choose a
different task distribution, hence performance is likely to differ as
it does for sysbench.
A performance metric for rt-app is under discussion but not there yet.
We will work on getting that sorted as the next thing so we can see any
performance impact.
Thanks,
Morten
On Wed, May 13, 2015 at 01:33:22AM +0100, Sai Gurrappadi wrote:
> On 05/12/2015 12:38 PM, Morten Rasmussen wrote:
> > Task being dequeued for the last time (state == TASK_DEAD) are dequeued
> > with the DEQUEUE_SLEEP flag which causes their load and utilization
> > contributions to be added to the runqueue blocked load and utilization.
> > Hence they will contain load or utilization that is gone away. The issue
> > only exists for the root cfs_rq as cgroup_exit() doesn't set
> > DEQUEUE_SLEEP for task group exits.
> >
> > If runnable+blocked load is to be used as a better estimate for cpu
> > load the dead task contributions need to be removed to prevent
> > load_balance() (idle_balance() in particular) from over-estimating the
> > cpu load.
> >
> > cc: Ingo Molnar <[email protected]>
> > cc: Peter Zijlstra <[email protected]>
> >
> > Signed-off-by: Morten Rasmussen <[email protected]>
> > ---
> > kernel/sched/fair.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e40cd88..d045404 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3202,6 +3202,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > * Update run-time statistics of the 'current'.
> > */
> > update_curr(cfs_rq);
> > + if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
> > + flags &= !DEQUEUE_SLEEP;
> > dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
> >
> > update_stats_dequeue(cfs_rq, se);
> >
>
> Maybe you could use the sched_class->task_dead() callback instead? I
> remember seeing a patch from Yuyang that did that.
Now that you mention it, I think I remember that thread. I will have a
look again and see if there are any good reasons not to use task_dead().
On 12/05/15 20:39, Morten Rasmussen wrote:
> Let available compute capacity and estimated energy impact select
> wake-up target cpu when energy-aware scheduling is enabled and the
> system in not over-utilized (above the tipping point).
>
> energy_aware_wake_cpu() attempts to find group of cpus with sufficient
> compute capacity to accommodate the task and find a cpu with enough spare
> capacity to handle the task within that group. Preference is given to
> cpus with enough spare capacity at the current OPP. Finally, the energy
> impact of the new target and the previous task cpu is compared to select
> the wake-up target cpu.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
[...]
> /*
> * select_task_rq_fair: Select target runqueue for the waking task in domains
> * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -5446,7 +5526,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> prev_cpu = cpu;
>
> if (sd_flag & SD_BALANCE_WAKE && want_sibling) {
> - new_cpu = select_idle_sibling(p, prev_cpu);
> + if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
> + new_cpu = energy_aware_wake_cpu(p);
If you run RFCv4 on an X86 system w/o energy model, you get a
'BUG: unable to handle kernel paging request at ...' problem after you've enabled
energy awareness (echo ENERGY_AWARE > /sys/kernel/debug/sched_features).
This is related to the fact that cpumask functions like cpumask_test_cpu
(e.g. later in select_task_rq) can't deal with cpu set to -1.
If you enable CONFIG_DEBUG_PER_CPU_MAPS you get the following warning in this case:
WARNING: CPU: 0 PID: 0 at include/linux/cpumask.h:117
cpumask_check.part.79+0x1f/0x30()
We also get the warning on ARM (w/o energy model) but my TC2 system is not crashing
like the X86 box.
Shouldn't we return prev_cpu in case sd_ea is NULL just as select_idle_sibling does
if prev_cpu is idle?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f5897a021f23..8a014fdd6e76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5394,7 +5394,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
return target;
}
-static int energy_aware_wake_cpu(struct task_struct *p)
+static int energy_aware_wake_cpu(struct task_struct *p, int target)
{
struct sched_domain *sd;
struct sched_group *sg, *sg_target;
@@ -5405,7 +5405,7 @@ static int energy_aware_wake_cpu(struct task_struct *p)
sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
if (!sd)
- return -1;
+ return target;
sg = sd->groups;
sg_target = sg;
@@ -5527,7 +5527,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
if (sd_flag & SD_BALANCE_WAKE && want_sibling) {
if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
- new_cpu = energy_aware_wake_cpu(p);
+ new_cpu = energy_aware_wake_cpu(p, prev_cpu);
else
new_cpu = select_idle_sibling(p, prev_cpu);
goto unlock;
> + else
> + new_cpu = select_idle_sibling(p, prev_cpu);
> goto unlock;
> }
On Thu, May 14, 2015 at 10:34:20AM +0100, [email protected] wrote:
> Morten Rasmussen <[email protected]> wrote 2015-05-13 AM 03:39:06:
> > [RFCv4 PATCH 31/34] sched: Energy-aware wake-up task placement
> >
> > Let available compute capacity and estimated energy impact select
> > wake-up target cpu when energy-aware scheduling is enabled and the
> > system in not over-utilized (above the tipping point).
> >
> > energy_aware_wake_cpu() attempts to find group of cpus with sufficient
> > compute capacity to accommodate the task and find a cpu with enough spare
> > capacity to handle the task within that group. Preference is given to
> > cpus with enough spare capacity at the current OPP. Finally, the energy
> > impact of the new target and the previous task cpu is compared to select
> > the wake-up target cpu.
> >
> > cc: Ingo Molnar <[email protected]>
> > cc: Peter Zijlstra <[email protected]>
> >
> > Signed-off-by: Morten Rasmussen <[email protected]>
> > ---
> > kernel/sched/fair.c | 85 ++++++++++++++++++++++++++++++++++++++++++
> > ++++++++++-
> > 1 file changed, 84 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bb44646..fe41e1e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5394,6 +5394,86 @@ static int select_idle_sibling(struct
> > task_struct *p, int target)
> > return target;
> > }
> >
> > +static int energy_aware_wake_cpu(struct task_struct *p)
> > +{
> > + struct sched_domain *sd;
> > + struct sched_group *sg, *sg_target;
> > + int target_max_cap = INT_MAX;
> > + int target_cpu = task_cpu(p);
> > + int i;
> > +
> > + sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
> > +
> > + if (!sd)
> > + return -1;
> > +
> > + sg = sd->groups;
> > + sg_target = sg;
> > +
> > + /*
> > + * Find group with sufficient capacity. We only get here if no cpu is
> > + * overutilized. We may end up overutilizing a cpu by adding the task,
> > + * but that should not be any worse than select_idle_sibling().
> > + * load_balance() should sort it out later as we get above the tipping
> > + * point.
> > + */
> > + do {
> > + /* Assuming all cpus are the same in group */
> > + int max_cap_cpu = group_first_cpu(sg);
> > +
> > + /*
> > + * Assume smaller max capacity means more energy-efficient.
> > + * Ideally we should query the energy model for the right
> > + * answer but it easily ends up in an exhaustive search.
> > + */
> > + if (capacity_of(max_cap_cpu) < target_max_cap &&
> > + task_fits_capacity(p, max_cap_cpu)) {
> > + sg_target = sg;
> > + target_max_cap = capacity_of(max_cap_cpu);
> > + }
> > + } while (sg = sg->next, sg != sd->groups);
> > +
> > + /* Find cpu with sufficient capacity */
> > + for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
> > + /*
> > + * p's blocked utilization is still accounted for on prev_cpu
> > + * so prev_cpu will receive a negative bias due the double
> > + * accouting. However, the blocked utilization may be zero.
> > + */
> > + int new_usage = get_cpu_usage(i) + task_utilization(p);
> > +
> > + if (new_usage > capacity_orig_of(i))
> > + continue;
> > +
> > + if (new_usage < capacity_curr_of(i)) {
> > + target_cpu = i;
> > + if (cpu_rq(i)->nr_running)
> > + break;
> > + }
> > +
> > + /* cpu has capacity at higher OPP, keep it as fallback */
> > + if (target_cpu == task_cpu(p))
> > + target_cpu = i;
> > + }
> > +
> > + if (target_cpu != task_cpu(p)) {
> > + struct energy_env eenv = {
> > + .usage_delta = task_utilization(p),
> > + .src_cpu = task_cpu(p),
> > + .dst_cpu = target_cpu,
> > + };
>
> At this point, p hasn't been queued in src_cpu, but energy_diff() below will
> still subtract its utilization from src_cpu, is that right?
energy_aware_wake_cpu() should only be called for existing tasks, i.e.
SD_BALANCE_WAKE, so p should have been queued on src_cpu in the past.
New tasks (SD_BALANCE_FORK) take the find_idlest_{group, cpu}() route.
Or did I miss something?
Since p was last scheduled on src_cpu its usage should still be
accounted for in the blocked utilization of that cpu. At wake-up we are
effectively turning blocked utilization into runnable utilization. The
cpu usage (get_cpu_usage()) is the sum of the two and this is the basis
for the energy calculations. So if we migrate the task at wake-up we should
remove the task utilization from the previous cpu and add it to dst_cpu.
As Sai has raised previously, it is not the full story. The blocked
utilization contribution of p on the previous cpu may have decayed while
the task utilization stored in p->se.avg has not. It is therefore
misleading to subtract the non-decayed utilization from src_cpu blocked
utilization. It is on the todo-list to fix that issue.
Does that make any sense?
Morten
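A small numerical sketch of the double-accounting issue described above
(all numbers are invented for illustration):

#include <stdio.h>

int main(void)
{
        unsigned long task_util = 100;  /* p->se.avg, not decayed while asleep */
        unsigned long blocked_p = 60;   /* p's decayed blocked contribution on prev cpu */
        unsigned long src_usage = 300;  /* get_cpu_usage(prev cpu), includes blocked_p */

        /* what the energy model assumes prev cpu usage becomes if p migrates */
        unsigned long modelled_src = src_usage - task_util;     /* 200 */

        /* what prev cpu usage would actually become once blocked_p is gone */
        unsigned long actual_src = src_usage - blocked_p;       /* 240 */

        printf("modelled=%lu actual=%lu bias=%lu\n",
               modelled_src, actual_src, actual_src - modelled_src);

        return 0;
}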
On Wed, May 13, 2015 at 02:49:57PM +0100, Morten Rasmussen wrote:
> On Wed, May 13, 2015 at 01:33:22AM +0100, Sai Gurrappadi wrote:
> > On 05/12/2015 12:38 PM, Morten Rasmussen wrote:
> > > Task being dequeued for the last time (state == TASK_DEAD) are dequeued
> > > with the DEQUEUE_SLEEP flag which causes their load and utilization
> > > contributions to be added to the runqueue blocked load and utilization.
> > > Hence they will contain load or utilization that is gone away. The issue
> > > only exists for the root cfs_rq as cgroup_exit() doesn't set
> > > DEQUEUE_SLEEP for task group exits.
> > >
> > > If runnable+blocked load is to be used as a better estimate for cpu
> > > load the dead task contributions need to be removed to prevent
> > > load_balance() (idle_balance() in particular) from over-estimating the
> > > cpu load.
> > >
> > > cc: Ingo Molnar <[email protected]>
> > > cc: Peter Zijlstra <[email protected]>
> > >
> > > Signed-off-by: Morten Rasmussen <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index e40cd88..d045404 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -3202,6 +3202,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > > * Update run-time statistics of the 'current'.
> > > */
> > > update_curr(cfs_rq);
> > > + if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
> > > + flags &= !DEQUEUE_SLEEP;
> > > dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
> > >
> > > update_stats_dequeue(cfs_rq, se);
> > >
> >
> > Maybe you could use the sched_class->task_dead() callback instead? I
> > remember seeing a patch from Yuyang that did that.
>
> Now that you mention it, I remember that thread I think. I will have a
> look again and see if there are any good reasons not to use task_dead().
After having looked at it again I think using task_dead() is another
option, but it isn't the most elegant one. task_dead() is called very
late in the game, after the task has been dequeued and the rq lock has
been released. Doing it there would require re-taking the rq lock to
remove the task contribution from the rq blocked load that was just
added as part of the dequeue operation.
The above patch ensures that the task contribution isn't added to the rq
blocked load at dequeue when we know the task is dead instead of having
to fix it later. The problem seems to arise because schedule() does not
really care whether the previous task is going to die or just sleep. It
passes DEQUEUE_SLEEP to the sched_class in both cases and we therefore
have to distinguish between the two cases in fair.c.
Morten
* Morten Rasmussen <[email protected]> [2015-05-12 20:38:48]:
[...]
> +Energy consumed during transitions from an idle-state (C-state) to a busy state
> +(P-staet) or going the other way is ignored by the model to simplify the energy
Minor, nit pick. Spelling of "P-State".
> +model calculations.
Thanks,
Kamalesh.
On Wed, May 20, 2015 at 05:04:38AM +0100, Kamalesh Babulal wrote:
> * Morten Rasmussen <[email protected]> [2015-05-12 20:38:48]:
>
> [...]
> > +Energy consumed during transitions from an idle-state (C-state) to a busy state
> > +(P-staet) or going the other way is ignored by the model to simplify the energy
>
> Minor, nit pick. Spelling of "P-State".
Thanks, will fix it for next posting.
Morten
* Morten Rasmussen <[email protected]> [2015-05-12 20:38:57]:
[...]
> +/*
> + * cpu_norm_usage() returns the cpu usage relative to a specific capacity,
> + * i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
> + * energy calculations. Using the scale-invariant usage returned by
> + * get_cpu_usage() and approximating scale-invariant usage by:
> + *
> + * usage ~ (curr_freq/max_freq)*1024 * capacity_orig/1024 * running_time/time
> + *
> + * the normalized usage can be found using the specific capacity.
> + *
> + * capacity = capacity_orig * curr_freq/max_freq
> + *
> + * norm_usage = running_time/time ~ usage/capacity
> + */
> +static unsigned long cpu_norm_usage(int cpu, unsigned long capacity)
> +{
> + int usage = __get_cpu_usage(cpu);
__get_cpu_usage() is introduced in the next patch "sched: Extend sched_group_energy
to test load-balancing decisions"; applying the series with this patch as the
topmost patch breaks the build.
kernel/sched/fair.c: In function ‘cpu_norm_usage’:
kernel/sched/fair.c:4830:2: error: implicit declaration of function ‘__get_cpu_usage’ [-Werror=implicit-function-declaration]
int usage = __get_cpu_usage(cpu);
^
Given that __get_cpu_usage() takes an additional parameter, delta, get_cpu_usage()
should have been used here.
Thanks,
Kamalesh
On Thu, May 21, 2015 at 08:57:04AM +0100, Kamalesh Babulal wrote:
> * Morten Rasmussen <[email protected]> [2015-05-12 20:38:57]:
>
> [...]
> > +/*
> > + * cpu_norm_usage() returns the cpu usage relative to a specific capacity,
> > + * i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
> > + * energy calculations. Using the scale-invariant usage returned by
> > + * get_cpu_usage() and approximating scale-invariant usage by:
> > + *
> > + * usage ~ (curr_freq/max_freq)*1024 * capacity_orig/1024 * running_time/time
> > + *
> > + * the normalized usage can be found using the specific capacity.
> > + *
> > + * capacity = capacity_orig * curr_freq/max_freq
> > + *
> > + * norm_usage = running_time/time ~ usage/capacity
> > + */
> > +static unsigned long cpu_norm_usage(int cpu, unsigned long capacity)
> > +{
> > + int usage = __get_cpu_usage(cpu);
>
> __get_cpu_usage is introduced in next patch "sched: Extend sched_group_energy
> to test load-balancing decisions", applying the series with this patch as top
> most patch breaks the build.
>
> kernel/sched/fair.c: In function ‘cpu_norm_usage’:
> kernel/sched/fair.c:4830:2: error: implicit declaration of function ‘__get_cpu_usage’ [-Werror=implicit-function-declaration]
> int usage = __get_cpu_usage(cpu);
> ^
>
> Given that __get_cpu_usage(), take additional parameter - delta. get_cpu_usage()
> should have been used here.
Yes, you are right. Using get_cpu_usage() here instead should be
correct. I think it was right in RFCv3; I must have messed it up when I
moved around some of the patches :(
I will make sure to test that the patch set is bisectable next time :)
Thanks,
Morten
Trivial fixes for machines without SMP.
Signed-off-by: Abel Vesa <[email protected]>
---
kernel/sched/fair.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e6d32e6..dae3db7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -672,6 +672,8 @@ static unsigned long task_h_load(struct task_struct *p);
static inline void __update_task_entity_contrib(struct sched_entity *se);
static inline void __update_task_entity_utilization(struct sched_entity *se);
+static bool cpu_overutilized(int cpu);
+
/* Give new task start runnable values to heavy its load in infant time */
void init_task_runnable_average(struct task_struct *p)
{
@@ -4266,8 +4268,6 @@ static inline void hrtick_update(struct rq *rq)
}
#endif
-static bool cpu_overutilized(int cpu);
-
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -4278,7 +4278,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
- int task_new = !(flags & ENQUEUE_WAKEUP);
for_each_sched_entity(se) {
if (se->on_rq)
@@ -4313,10 +4312,13 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se) {
update_rq_runnable_avg(rq, rq->nr_running);
add_nr_running(rq, 1);
- if (!task_new && !rq->rd->overutilized &&
+#ifdef CONFIG_SMP
+ if ((flags & ENQUEUE_WAKEUP) && !rq->rd->overutilized &&
cpu_overutilized(rq->cpu))
rq->rd->overutilized = true;
+#endif
}
+
hrtick_update(rq);
}
@@ -8497,8 +8499,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
update_rq_runnable_avg(rq, 1);
+#ifdef CONFIG_SMP
if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
rq->rd->overutilized = true;
+#endif
}
/*
--
1.9.1
* Abel Vesa <[email protected]> wrote:
> Trivial fixes forh machines without SMP.
>
> Signed-off-by: Abel Vesa <[email protected]>
> ---
> kernel/sched/fair.c | 12 ++++++++----
> 1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e6d32e6..dae3db7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -672,6 +672,8 @@ static unsigned long task_h_load(struct task_struct *p);
> static inline void __update_task_entity_contrib(struct sched_entity *se);
> static inline void __update_task_entity_utilization(struct sched_entity *se);
>
> +static bool cpu_overutilized(int cpu);
> +
> /* Give new task start runnable values to heavy its load in infant time */
> void init_task_runnable_average(struct task_struct *p)
> {
> @@ -4266,8 +4268,6 @@ static inline void hrtick_update(struct rq *rq)
> }
> #endif
>
> -static bool cpu_overutilized(int cpu);
> -
What tree is this against? Neither the upstream kernel nor
tip:sched/core (the scheduler development tree) has this function.
Thanks,
Ingo
On Sat, May 23, 2015 at 04:52:23PM +0200, Ingo Molnar wrote:
>
> What tree is this against? Neither the upstream kernel nor
> tip:sched/core (the scheduler development tree) has this function.
>
Sorry, I forgot to mention.
This patch applies to:
git://linux-arm.org/linux-power.git energy_model_rfc_v4
Best regards,
Abel
Hi,
So I tried to play around a little bit with this patchset. I did a
checkout from:
git://linux-arm.org/linux-power.git energy_model_rfc_v4
and then, when I tried to enable the ENERGY_AWARE feature from sysfs inside
qemu (x86_64), I got this:
[69452.750245] BUG: unable to handle kernel paging request at ffff88009d3fb958
[69452.750245] IP: [<ffffffff8107b8b5>] try_to_wake_up+0x125/0x310
[69452.750245] PGD 2155067 PUD 0
[69452.750245] Oops: 0000 [#1] SMP
[69452.750245] Modules linked in:
[69452.750245] CPU: 0 PID: 1007 Comm: sh Not tainted 4.1.0-rc2+ #8
[69452.750245] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.8.1-20150318_183358- 04/01/2014
[69452.750245] task: ffff88007c9e5aa0 ti: ffff88007be0c000 task.ti:
ffff88007be0c000
[69452.750245] RIP: 0010:[<ffffffff8107b8b5>] [<ffffffff8107b8b5>]
try_to_wake_up+0x125/0x310
[69452.750245] RSP: 0000:ffff88007fc03d78 EFLAGS: 00000092
[69452.750245] RAX: 00000000ffffffff RBX: 00000000ffffffff RCX: 0000000000015a40
[69452.750245] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88007d005000
[69452.750245] RBP: ffff88007fc03dc8 R08: 0000000000000400 R09: 0000000000000000
[69452.750245] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000015a40
[69452.750245] R13: ffff88007d3fbdaa R14: 0000000000000000 R15: ffff88007d3fb660
[69452.750245] FS: 00007f8a3c9f0700(0000) GS:ffff88007fc00000(0000)
knlGS:0000000000000000
[69452.750245] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[69452.750245] CR2: ffff88009d3fb958 CR3: 000000007c32c000 CR4: 00000000000006f0
[69452.750245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[69452.750245] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
[69452.750245] Stack:
[69452.750245] ffff88007fc15aa8 ffff88007c9e5b08 ffff88007fc15aa8
0000000000000046
[69452.750245] ffff88007fc03e08 ffff88007c83fe60 ffffffff81e3c8a8
ffffffff81e3c890
[69452.750245] 0000000000000000 0000000000000003 ffff88007fc03dd8
ffffffff8107bb8d
[69452.750245] Call Trace:
[69452.750245] <IRQ>
[69452.750245] [<ffffffff8107bb8d>] default_wake_function+0xd/0x10
[69452.750245] [<ffffffff8108ed21>] autoremove_wake_function+0x11/0x40
[69452.750245] [<ffffffff8108e6b5>] __wake_up_common+0x55/0x90
[69452.750245] [<ffffffff8108e728>] __wake_up+0x38/0x60
[69452.750245] [<ffffffff810ab062>] rcu_gp_kthread_wake+0x42/0x50
[69452.750245] [<ffffffff810acd9f>] rcu_process_callbacks+0x2ef/0x5e0
[69452.750245] [<ffffffff81056e0f>] __do_softirq+0x9f/0x280
[69452.750245] [<ffffffff81057145>] irq_exit+0xa5/0xb0
[69452.750245] [<ffffffff81038bd1>] smp_apic_timer_interrupt+0x41/0x50
[69452.750245] [<ffffffff818ae5bb>] apic_timer_interrupt+0x6b/0x70
[69452.750245] <EOI>
[69452.750245] Code: 4c 89 ff ff d0 41 83 bf f8 02 00 00 01 41 8b 5f
48 7e 16 49 8b 47 60 89 de 44 89 f1 ba 10 00 00 00 4c 89 ff ff 50 40
89 c3 89 d8 <49> 0f a3 87 00 03 00 00 19 d2 85 d2 0f 84 59 01 00 00 48
8b 15
[69452.750245] RIP [<ffffffff8107b8b5>] try_to_wake_up+0x125/0x310
[69452.750245] RSP <ffff88007fc03d78>
[69452.750245] CR2: ffff88009d3fb958
[69452.750245] ---[ end trace 9b4570a93c243e98 ]---
[69452.750245] Kernel panic - not syncing: Fatal exception in interrupt
[69452.750245] Kernel Offset: disabled
[69452.750245] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
and then I did a disassemble from kgdb and got this:
0xffffffff8107b8ae <+286>: callq *0x40(%rax)
0xffffffff8107b8b1 <+289>: mov %eax,%ebx
0xffffffff8107b8b3 <+291>: mov %ebx,%eax
0xffffffff8107b8b5 <+293>: bt %rax,0x300(%r15)
0xffffffff8107b8bd <+301>: sbb %edx,%edx
and then I did an objdump and got this:
static inline
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
{
if (p->nr_cpus_allowed > 1)
7dcb: 7e 16 jle 7de3 <try_to_wake_up+0x123>
cpu = p->sched_class->select_task_rq(p, cpu, sd_flags,
wake_flags);
7dcd: 49 8b 47 60 mov 0x60(%r15),%rax
7dd1: 89 de mov %ebx,%esi
7dd3: 44 89 f1 mov %r14d,%ecx
7dd6: ba 10 00 00 00 mov $0x10,%edx
7ddb: 4c 89 ff mov %r15,%rdi
7dde: ff 50 40 callq *0x40(%rax)
7de1: 89 c3 mov %eax,%ebx
7de3: 89 d8 mov %ebx,%eax
7de5: 49 0f a3 87 00 03 00 bt %rax,0x300(%r15)
7dec: 00
7ded: 19 d2 sbb %edx,%edx
* Since this is common to all placement strategies, this lives here.
*
* [ this allows ->select_task() to simply return task_cpu(p) and
* not worry about this generic constraint ]
*/
if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
7def: 85 d2 test %edx,%edx
I wasn't able to determine the cause from the line:
7de5: 49 0f a3 87 00 03 00 bt %rax,0x300(%r15)
Finally, the question I have is: Could this happen because I'm running
it from qemu?
I hope all this info helps.
Thanks,
Abel.
Hi Abel,
Abel Vesa <[email protected]> wrote 2015-06-29 AM 04:26:31:
>
> Re: [RFCv4 PATCH 00/34] sched: Energy cost model for energy-aware
scheduling
>
> Hi,
>
> So I tried to play around a little bit with this patchset. I did a
> checkout from:
>
> git://linux-arm.org/linux-power.git energy_model_rfc_v4
>
> and then, when I tried to enable the ENERGY_AWARE from sysfs inside
> qemu (x86_64) and I got this:
>
[...]
> I wasn't able to determine the cause from the line:
>
> 7de5: 49 0f a3 87 00 03 00 bt %rax,0x300(%r15)
>
> Finally, the question I have is: Could this happen because I'm running
> it from qemu?
>
> I hope all this info helps.
I've met this as well.
You can try to change "return -1;" in energy_aware_wake_cpu() to "return
target_cpu;".
Hope this may help.
-Xunlei
>
> Thanks,
> Abel.
On 29/06/15 10:06, [email protected] wrote:
> Hi Abel,
>
> Abel Vesa <[email protected]> wrote 2015-06-29 AM 04:26:31:
>>
>> Re: [RFCv4 PATCH 00/34] sched: Energy cost model for energy-aware scheduling
[...]
>> I wasn't able to determine the cause from the line:
>>
>> 7de5: 49 0f a3 87 00 03 00 bt %rax,0x300(%r15)
>>
>> Finally, the question I have is: Could this happen because I'm running
>> it from qemu?
>>
>> I hope all this info helps.
>
> I've met this as well.
>
> You can try to change "return -1;" in energy_aware_wake_cpu() to "return
> target_cpu;"
> hope this may help.
Yeah, I also suppose that https://lkml.org/lkml/2015/5/14/330 cures it.
-- Dietmar
[...]
Hi Morten,
Morten Rasmussen <[email protected]> wrote 2015-05-13 AM 03:39:00:
>
> [RFCv4 PATCH 25/34] sched: Add over-utilization/tipping point indicator
>
> Energy-aware scheduling is only meant to be active while the system is
> _not_ over-utilized. That is, there are spare cycles available to shift
> tasks around based on their actual utilization to get a more
> energy-efficient task distribution without depriving any tasks. When
> above the tipping point task placement is done the traditional way,
> spreading the tasks across as many cpus as possible based on priority
> scaled load to preserve smp_nice.
>
> The over-utilization condition is conservatively chosen to indicate
> over-utilization as soon as one cpu is fully utilized at its highest
> frequency. We don't consider groups, as lumping usage and capacity
> together for a group of cpus may hide the fact that one or more cpus in
> the group are over-utilized while group-siblings are partially idle. The
> tasks could be served better if moved to another group with completely
> idle cpus. This is particularly problematic if some cpus have a
> significantly reduced capacity due to RT/IRQ pressure or if the system
> has cpus of different capacity (e.g. ARM big.LITTLE).
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 35 +++++++++++++++++++++++++++++++----
> kernel/sched/sched.h | 3 +++
> 2 files changed, 34 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f36ab2f3..5b7bc28 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4266,6 +4266,8 @@ static inline void hrtick_update(struct rq *rq)
> }
> #endif
>
> +static bool cpu_overutilized(int cpu);
> +
> /*
> * The enqueue_task method is called before nr_running is
> * increased. Here we update the fair scheduling stats and
> @@ -4276,6 +4278,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> {
> struct cfs_rq *cfs_rq;
> struct sched_entity *se = &p->se;
> + int task_new = !(flags & ENQUEUE_WAKEUP);
>
> for_each_sched_entity(se) {
> if (se->on_rq)
> @@ -4310,6 +4313,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (!se) {
> update_rq_runnable_avg(rq, rq->nr_running);
> add_nr_running(rq, 1);
> + if (!task_new && !rq->rd->overutilized &&
> + cpu_overutilized(rq->cpu))
> + rq->rd->overutilized = true;
> }
> hrtick_update(rq);
> }
> @@ -4937,6 +4943,14 @@ static int find_new_capacity(struct energy_env *eenv,
> return idx;
> }
>
> +static unsigned int capacity_margin = 1280; /* ~20% margin */
> +
> +static bool cpu_overutilized(int cpu)
> +{
> + return (capacity_of(cpu) * 1024) <
> + (get_cpu_usage(cpu) * capacity_margin);
> +}
> +
> /*
> * sched_group_energy(): Returns absolute energy consumption of cpus belonging
> * to the sched_group including shared resources shared only by members of the
> @@ -6732,11 +6746,12 @@ static enum group_type group_classify(struct lb_env *env,
> * @local_group: Does group contain this_cpu.
> * @sgs: variable to hold the statistics for this group.
> * @overload: Indicate more than one runnable task for any CPU.
> + * @overutilized: Indicate overutilization for any CPU.
> */
> static inline void update_sg_lb_stats(struct lb_env *env,
> struct sched_group *group, int load_idx,
> int local_group, struct sg_lb_stats *sgs,
> - bool *overload)
> + bool *overload, bool *overutilized)
> {
> unsigned long load;
> int i;
> @@ -6766,6 +6781,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> sgs->sum_weighted_load += weighted_cpuload(i);
> if (idle_cpu(i))
> sgs->idle_cpus++;
> +
> + if (cpu_overutilized(i))
> + *overutilized = true;
> }
>
> /* Adjust by relative CPU capacity of the group */
> @@ -6871,7 +6889,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> struct sched_group *sg = env->sd->groups;
> struct sg_lb_stats tmp_sgs;
> int load_idx, prefer_sibling = 0;
> - bool overload = false;
> + bool overload = false, overutilized = false;
>
> if (child && child->flags & SD_PREFER_SIBLING)
> prefer_sibling = 1;
> @@ -6893,7 +6911,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> }
>
> update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
> - &overload);
> + &overload, &overutilized);
>
> if (local_group)
> goto next_group;
> @@ -6935,8 +6953,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> /* update overload indicator if we are at root domain */
> if (env->dst_rq->rd->overload != overload)
> env->dst_rq->rd->overload = overload;
> - }
>
> + /* Update over-utilization (tipping point, U >= 0) indicator */
> + if (env->dst_rq->rd->overutilized != overutilized)
> + env->dst_rq->rd->overutilized = overutilized;
> + } else {
> + if (!env->dst_rq->rd->overutilized && overutilized)
> + env->dst_rq->rd->overutilized = true;
> + }
> }
>
> /**
> @@ -8300,6 +8324,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> task_tick_numa(rq, curr);
>
> update_rq_runnable_avg(rq, 1);
> +
> + if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
> + rq->rd->overutilized = true;
> }
>
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b627dfa..a5d2d69 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -535,6 +535,9 @@ struct root_domain {
> /* Indicate more than one runnable task for any CPU */
> bool overload;
>
> + /* Indicate one or more cpus over-utilized (tipping point) */
> + bool overutilized;
> +
> /*
> * The bit corresponding to a CPU gets set here if such CPU has more
> * than one runnable -deadline task (as it is below for RT tasks).
> --
> 1.9.1
>
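For reference, with capacity_margin = 1280 the cpu_overutilized() test in
the patch above fires once a cpu's usage exceeds 1024/1280, i.e. roughly
80% of its current capacity, which is where the "~20% margin" comment
comes from. A tiny stand-alone sketch of that arithmetic (the capacity
value is just an example):

#include <assert.h>

int main(void)
{
	unsigned int capacity_margin = 1280;	/* ~20% margin, as in the patch */
	unsigned int capacity = 1024;		/* example: a big cpu at full capacity */

	/* cpu_overutilized(): capacity * 1024 < usage * capacity_margin */
	unsigned int threshold = capacity * 1024 / capacity_margin;

	assert(threshold == 819);		/* usage above ~80% of capacity trips it */
	return 0;
}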
The tipping point idea is great for EAS, but I wonder whether the
behavior below, which I saw during my tests, is an issue:
I used rt-app to emulate the workload while testing the patchset.
First, I added small rt-app tasks gradually until the utilization
was around the tipping point, and EAS worked great, putting the
tasks on the small cores.
Then I added some extra small load to push the system past the
tipping point. The CFS load-balancer immediately took over from EAS
(I could see the big cores picking up tasks), but from that point
on the system fluctuated badly back and forth between the big and
small cores, and the utilization reported by "top" looked a bit odd.
My guess is that once the tipping point is exceeded, CFS takes over
from EAS and migrates tasks from the small cores to the big cores.
As a result, the utilization of the small cores drops back below
the tipping point, which activates EAS again, and so the system
just keeps fluctuating.
-Xunlei
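The cycle described above can be sketched with a small user-space toy
model; it is not kernel code, and the little-cpu capacity and task usage
numbers are made up. Each iteration either packs the extra task onto the
little cpu (EAS active) or spreads it to a big cpu (tipping point
exceeded), so the over-utilization indicator keeps flipping:

#include <stdbool.h>
#include <stdio.h>

static const unsigned int capacity_margin = 1280;	/* ~20% margin */

/* Same condition as cpu_overutilized() in the quoted patch. */
static bool cpu_overutilized(unsigned int capacity, unsigned int usage)
{
	return capacity * 1024 < usage * capacity_margin;
}

int main(void)
{
	unsigned int little_cap = 430;		/* made-up little-cpu capacity */
	unsigned int little_usage = 300;	/* load already packed on the little cpu */
	unsigned int extra_task = 100;		/* the task that breaks the tipping point */
	bool overutilized = false;
	int tick;

	for (tick = 0; tick < 6; tick++) {
		if (!overutilized)
			little_usage += extra_task;	/* EAS packs it on the little cpu */
		else
			little_usage -= extra_task;	/* CFS spreads it to a big cpu */

		overutilized = cpu_overutilized(little_cap, little_usage);
		printf("tick %d: little usage=%u overutilized=%d\n",
		       tick, little_usage, overutilized);
	}
	return 0;
}

With these made-up numbers the printed overutilized flag simply
alternates between 1 and 0, which mirrors the back-and-forth migration
behavior described above.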