2015-07-07 18:22:03

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 00/46] sched: Energy cost model for energy-aware scheduling

Several techniques for saving energy through various scheduler
modifications have been proposed in the past, however most of the
techniques have not been universally beneficial for all use-cases and
platforms. For example, consolidating tasks on fewer cpus is an
effective way to save energy on some platforms, while it might make
things worse on others. At the same time there has been a demand for
scheduler driven power management given the scheduler's position to
judge performance requirements for the near future [1].

This proposal, which is inspired by [1] and the Ksummit workshop
discussions in 2013 [2], takes a different approach by using a
(relatively) simple platform energy cost model to guide scheduling
decisions. By providing the model with platform specific costing data
the model can provide an estimate of the energy implications of
scheduling decisions. So instead of blindly applying scheduling
techniques that may or may not work for the current use-case, the
scheduler can make informed energy-aware decisions. We believe this
approach provides a methodology that can be adapted to any platform,
including heterogeneous systems such as ARM big.LITTLE. The model
considers cpus only, i.e. no peripherals, GPU or memory. Model data
includes power consumption at each P-state and C-state. Furthermore a
natural extension of this proposal is to drive P-state selection from
the scheduler given its awareness of changes in cpu utilization.

This is an RFC but contains most of the essential features. The model
and its infrastructure is in place in the scheduler and it is being used
for load-balancing decisions. The energy model data is hardcoded and
there are some limitations still to be addressed. However, the main
ideas are presented here, which is the use of an energy model for
scheduling decisions and scheduler-driven DVFS.

RFCv5 is a consolidation of the latest energy model related patches and
patches adding scale-invariance to the CFS per-entity load-tracking
(PELT) as well as fixing a few issues that have emerged as we use PELT
more extensively for load-balancing. The main additions to v5 are the
inclusion of Mike's previously posted patches that enable
scheduler-driven DVFS [3] (please post comments regarding those in the
original thread) and Juri's patches that drive DVFS from the scheduler.

The patches are based on tip/sched/core. Many of the changes since RFCv4
are addressing issues pointed out during the review of v4. Energy-aware
scheduling is strictly following the 'tipping point' policy (with one
minor exception). That is, when the system is deemed over-utilized
(above the 'tipping point') all balancing decisions are made the normal
way based on priority scaled load and spreading of tasks. When below the
tipping point energy-aware scheduling decisions are active. The
rationale being that when below the tipping point we can safely shuffle
tasks around to save energy without harming throughput. The focus is
more on putting tasks on the right cpus at wake-up and less on
periodic/idle/nohz_idle as the latter are less likely to have a chance
of balancing tasks when below the tipping point as tasks are smaller and
not always running/runnable.

The patch set now consists of four main parts. The first two parts are
largely unchanged since v4, only bug fixes and smaller improvements. The
latter two parts are Mike's DVFS patches and Juri's scheduler-driven
DVFS building on top of Mike's patches.

Patch 01-12: sched: frequency and cpu invariant per-entity load-tracking
and other load-tracking bits.

Patch 13-36: sched: Energy cost model and energy-aware scheduling
features.

Patch 37-38: sched, cpufreq: Scheduler/DVFS integration (repost Mike
Turquette's patches [3])

Patch 39-46: sched: Juri's additions to Mike's patches driving DVFS from
the scheduler.

Test results for ARM TC2 (2xA15+3xA7) with cpufreq enabled:

sysbench: Single task running for 30s.
rt-app [4]: mp3 playback use-case model
rt-app [4]: 5 ~[6,13,19,25,31,38,44,50]% periodic (2ms) tasks for 30s.

Note: % is relative to the capacity of the fastest cpu at the highest
frequency, i.e. the more busy ones do not fit on little cpus.

The numbers are normalized against mainline for comparison except the
rt-app performance numbers. Mainline is however a somewhat random
reference point for big.LITTLE systems due to lack of capacity
awareness. noEAS (ENERGY_AWARE sched_feature disabled) has capacity
awareness and delivers consistent performance for big.LITTLE but does
not consider energy efficiency.

We have added an experimental performance metric to rt-app (based on
Linaro's repo [5]) which basically expresses the average time left from
completion of the run period until the next activation normalized to
best case: 100 is best case (not achievable in practice), the busy
period ended as fast as possible, 0 means on average we just finished in
time before the next activation, negative means we continued running
past the next activation.

Average numbers for 20 runs per test (ARM TC2). ndm = cpufreq ondemand
governor with 20ms sampling rate, sched = scheduler driven DVFS.

Energy Mainline (ndm) noEAS (ndm) EAS (ndm) EAS (sched)
nrg prf nrg prf nrg prf nrg prf
sysbench 100 100 107 105 108 105 107 105

rt-app mp3 100 n.a. 101 n.a. 45 n.a. 43 n.a.

rt-app 6% 100 85 103 85 31 60 33 59
rt-app 13% 100 76 102 76 39 46 41 50
rt-app 19% 100 64 102 64 93 54 93 54
rt-app 25% 100 53 102 53 93 43 96 45
rt-app 31% 100 44 102 43 115 35 145 43
rt-app 38% 100 35 116 32 113 2 140 29
rt-app 44% 100 -40k 142 -9k 141 -9k 145 -1k
rt-app 50% 100 -100k 133 -21k 131 -22k 131 -4k

sysbench performs slightly better on all EAS kernels with or without EAS
enabled as the task is always scheduled on a big cpu. rt-app mp3 energy
consumption is reduced dramatically with EAS enabled as it is scheduled
on little cpus.

The rt-app periodic tests range from lightly utilized to over-utilized.
At low utilization EAS reduces energy significantly, while the
performance metric is slightly lower due to packing of the tasks on the
little cpus. As the utilization increases the performance metric
decreases as the cpus get closer to over-utilization. 38% is about the
point where little cpus are no longer capable of finishing each period
in time and saturation effects start to kick in. For the two last cases,
the system is over-utilized. EAS consumes more energy than mainline but
has reduced performance degradation (less negative performance metric).
Scheduler driven DVFS generally delivers better performance than
ondemand, which is also why we see a higher energy consumption.

Compile tested and boot tested on x86_64, but doesn't do anything as we
haven't got an energy model for x86_64 yet.

[1] http://article.gmane.org/gmane.linux.kernel/1499836
[2] http://etherpad.osuosl.org/energy-aware-scheduling-ks-2013 (search
for 'cost')
[3] https://lkml.org/lkml/2015/6/26/620
[4] https://github.com/scheduler-tools/rt-app.git exp/eas_v5
[5] https://wiki.linaro.org/WorkingGroups/PowerManagement/Resources/Tools/WorkloadGen

Changes:

RFCv4:

(0) Added better capacity awareness to wake-up path.

(1) Minor cleanups.

(2) Added of two of Mike's DVFS patches.

(3) Added scheduler driven DVFS.

RFCv4: https://lkml.org/lkml/2015/5/12/728

Dietmar Eggemann (12):
sched: Make load tracking frequency scale-invariant
arm: vexpress: Add CPU clock-frequencies to TC2 device-tree
sched: Make usage tracking cpu scale-invariant
arm: Cpu invariant scheduler load-tracking support
sched: Get rid of scaling usage by cpu_capacity_orig
sched: Introduce energy data structures
sched: Allocate and initialize energy data structures
arm: topology: Define TC2 energy and provide it to the scheduler
sched: Store system-wide maximum cpu capacity in root domain
sched: Determine the current sched_group idle-state
sched: Consider a not over-utilized energy-aware system as balanced
sched: Enable idle balance to pull single task towards cpu with higher
capacity

Juri Lelli (8):
sched/cpufreq_sched: use static key for cpu frequency selection
sched/cpufreq_sched: compute freq_new based on capacity_orig_of()
sched/fair: add triggers for OPP change requests
sched/{core,fair}: trigger OPP change request on fork()
sched/{fair,cpufreq_sched}: add reset_capacity interface
sched/fair: jump to max OPP when crossing UP threshold
sched/cpufreq_sched: modify pcpu_capacity handling
sched/fair: cpufreq_sched triggers for load balancing

Michael Turquette (2):
cpufreq: introduce cpufreq_driver_might_sleep
sched: scheduler-driven cpu frequency selection

Morten Rasmussen (24):
arm: Frequency invariant scheduler load-tracking support
sched: Convert arch_scale_cpu_capacity() from weak function to #define
arm: Update arch_scale_cpu_capacity() to reflect change to define
sched: Track blocked utilization contributions
sched: Include blocked utilization in usage tracking
sched: Remove blocked load and utilization contributions of dying
tasks
sched: Initialize CFS task load and usage before placing task on rq
sched: Documentation for scheduler energy cost model
sched: Make energy awareness a sched feature
sched: Introduce SD_SHARE_CAP_STATES sched_domain flag
sched: Compute cpu capacity available at current frequency
sched: Relocated get_cpu_usage() and change return type
sched: Highest energy aware balancing sched_domain level pointer
sched: Calculate energy consumption of sched_group
sched: Extend sched_group_energy to test load-balancing decisions
sched: Estimate energy impact of scheduling decisions
sched: Add over-utilization/tipping point indicator
sched, cpuidle: Track cpuidle state index in the scheduler
sched: Count number of shallower idle-states in struct
sched_group_energy
sched: Add cpu capacity awareness to wakeup balancing
sched: Consider spare cpu capacity at task wake-up
sched: Energy-aware wake-up task placement
sched: Disable energy-unfriendly nohz kicks
sched: Prevent unnecessary active balance of single task in sched
group

Documentation/scheduler/sched-energy.txt | 363 +++++++++++++
arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 5 +
arch/arm/include/asm/topology.h | 11 +
arch/arm/kernel/smp.c | 57 ++-
arch/arm/kernel/topology.c | 204 ++++++--
drivers/cpufreq/Kconfig | 24 +
drivers/cpufreq/cpufreq.c | 6 +
include/linux/cpufreq.h | 12 +
include/linux/sched.h | 22 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 138 ++++-
kernel/sched/cpufreq_sched.c | 334 ++++++++++++
kernel/sched/fair.c | 786 ++++++++++++++++++++++++++---
kernel/sched/features.h | 11 +-
kernel/sched/idle.c | 2 +
kernel/sched/sched.h | 101 +++-
16 files changed, 1934 insertions(+), 143 deletions(-)
create mode 100644 Documentation/scheduler/sched-energy.txt
create mode 100644 kernel/sched/cpufreq_sched.c

--
1.9.1


2015-07-07 18:22:20

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 01/46] arm: Frequency invariant scheduler load-tracking support

From: Morten Rasmussen <[email protected]>

Implements arch-specific function to provide the scheduler with a
frequency scaling correction factor for more accurate load-tracking.
The factor is:

current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)

This implementation only provides frequency invariance. No cpu
invariance yet.

Cc: Russell King <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
arch/arm/include/asm/topology.h | 7 +++++
arch/arm/kernel/smp.c | 57 +++++++++++++++++++++++++++++++++++++++--
arch/arm/kernel/topology.c | 17 ++++++++++++
3 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 370f7a7..c31096f 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -24,6 +24,13 @@ void init_cpu_topology(void);
void store_cpu_topology(unsigned int cpuid);
const struct cpumask *cpu_coregroup_mask(int cpu);

+#define arch_scale_freq_capacity arm_arch_scale_freq_capacity
+struct sched_domain;
+extern
+unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
+DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity);
+
#else

static inline void init_cpu_topology(void) { }
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index cca5b87..a32539c 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -677,12 +677,34 @@ static DEFINE_PER_CPU(unsigned long, l_p_j_ref);
static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq);
static unsigned long global_l_p_j_ref;
static unsigned long global_l_p_j_ref_freq;
+static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
+DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);
+
+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling through arch_scale_freq_capacity()
+ * (implemented in topology.c).
+ */
+static inline
+void scale_freq_capacity(int cpu, unsigned long curr, unsigned long max)
+{
+ unsigned long capacity;
+
+ if (!max)
+ return;
+
+ capacity = (curr << SCHED_CAPACITY_SHIFT) / max;
+ atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity);
+}

static int cpufreq_callback(struct notifier_block *nb,
unsigned long val, void *data)
{
struct cpufreq_freqs *freq = data;
int cpu = freq->cpu;
+ unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));

if (freq->flags & CPUFREQ_CONST_LOOPS)
return NOTIFY_OK;
@@ -707,6 +729,10 @@ static int cpufreq_callback(struct notifier_block *nb,
per_cpu(l_p_j_ref_freq, cpu),
freq->new);
}
+
+ if (val == CPUFREQ_PRECHANGE)
+ scale_freq_capacity(cpu, freq->new, max);
+
return NOTIFY_OK;
}

@@ -714,11 +740,38 @@ static struct notifier_block cpufreq_notifier = {
.notifier_call = cpufreq_callback,
};

+static int cpufreq_policy_callback(struct notifier_block *nb,
+ unsigned long val, void *data)
+{
+ struct cpufreq_policy *policy = data;
+ int i;
+
+ if (val != CPUFREQ_NOTIFY)
+ return NOTIFY_OK;
+
+ for_each_cpu(i, policy->cpus) {
+ scale_freq_capacity(i, policy->cur, policy->max);
+ atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block cpufreq_policy_notifier = {
+ .notifier_call = cpufreq_policy_callback,
+};
+
static int __init register_cpufreq_notifier(void)
{
- return cpufreq_register_notifier(&cpufreq_notifier,
+ int ret;
+
+ ret = cpufreq_register_notifier(&cpufreq_notifier,
CPUFREQ_TRANSITION_NOTIFIER);
+ if (ret)
+ return ret;
+
+ return cpufreq_register_notifier(&cpufreq_policy_notifier,
+ CPUFREQ_POLICY_NOTIFIER);
}
core_initcall(register_cpufreq_notifier);
-
#endif
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 08b7847..9c09e6e 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -169,6 +169,23 @@ static void update_cpu_capacity(unsigned int cpu)
cpu, arch_scale_cpu_capacity(NULL, cpu));
}

+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling (arch_scale_freq_capacity()). The scaling
+ * factor is updated in smp.c
+ */
+unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+ unsigned long curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu));
+
+ if (!curr)
+ return SCHED_CAPACITY_SCALE;
+
+ return curr;
+}
+
#else
static inline void parse_dt_topology(void) {}
static inline void update_cpu_capacity(unsigned int cpuid) {}
--
1.9.1

2015-07-07 18:22:28

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 02/46] sched: Make load tracking frequency scale-invariant

From: Dietmar Eggemann <[email protected]>

Apply frequency scale-invariance correction factor to load tracking.
Each segment of the sched_avg::runnable_avg_sum geometric series is now
scaled by the current frequency so the sched_avg::load_avg_contrib of each
entity will be invariant with frequency scaling. As a result,
cfs_rq::runnable_load_avg which is the sum of sched_avg::load_avg_contrib,
becomes invariant too. So the load level that is returned by
weighted_cpuload, stays relative to the max frequency of the cpu.

Then, we want the keep the load tracking values in a 32bits type, which
implies that the max value of sched_avg::{runnable|running}_avg_sum must
be lower than 2^32/88761=48388 (88761 is the max weight of a task). As
LOAD_AVG_MAX = 47742, arch_scale_freq_capacity must return a value less
than (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_SCALE_CAPACITY =
1024). So we define the range to [0..SCHED_SCALE_CAPACITY] in order to
avoid overflow.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

Signed-off-by: Dietmar Eggemann <[email protected]>
Acked-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 28 ++++++++++++++++------------
1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4b6e5f6..376ec75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2551,9 +2551,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
int runnable,
int running)
{
- u64 delta, periods;
- u32 runnable_contrib;
- int delta_w, decayed = 0;
+ u64 delta, scaled_delta, periods;
+ u32 runnable_contrib, scaled_runnable_contrib;
+ int delta_w, scaled_delta_w, decayed = 0;
unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);

delta = now - sa->last_runnable_update;
@@ -2587,11 +2587,12 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
* period and accrue it.
*/
delta_w = 1024 - delta_w;
+ scaled_delta_w = (delta_w * scale_freq) >> SCHED_CAPACITY_SHIFT;
+
if (runnable)
- sa->runnable_avg_sum += delta_w;
+ sa->runnable_avg_sum += scaled_delta_w;
if (running)
- sa->running_avg_sum += delta_w * scale_freq
- >> SCHED_CAPACITY_SHIFT;
+ sa->running_avg_sum += scaled_delta_w;
sa->avg_period += delta_w;

delta -= delta_w;
@@ -2609,20 +2610,23 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,

/* Efficiently calculate \sum (1..n_period) 1024*y^i */
runnable_contrib = __compute_runnable_contrib(periods);
+ scaled_runnable_contrib = (runnable_contrib * scale_freq)
+ >> SCHED_CAPACITY_SHIFT;
+
if (runnable)
- sa->runnable_avg_sum += runnable_contrib;
+ sa->runnable_avg_sum += scaled_runnable_contrib;
if (running)
- sa->running_avg_sum += runnable_contrib * scale_freq
- >> SCHED_CAPACITY_SHIFT;
+ sa->running_avg_sum += scaled_runnable_contrib;
sa->avg_period += runnable_contrib;
}

/* Remainder of delta accrued against u_0` */
+ scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
+
if (runnable)
- sa->runnable_avg_sum += delta;
+ sa->runnable_avg_sum += scaled_delta;
if (running)
- sa->running_avg_sum += delta * scale_freq
- >> SCHED_CAPACITY_SHIFT;
+ sa->running_avg_sum += scaled_delta;
sa->avg_period += delta;

return decayed;
--
1.9.1

2015-07-07 18:22:40

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 03/46] arm: vexpress: Add CPU clock-frequencies to TC2 device-tree

From: Dietmar Eggemann <[email protected]>

To enable the parsing of clock frequency and cpu efficiency values
inside parse_dt_topology [arch/arm/kernel/topology.c] to scale the
relative capacity of the cpus, this property has to be provided within
the cpu nodes of the dts file.

The patch is a copy of commit 8f15973ef8c3 ("ARM: vexpress: Add CPU
clock-frequencies to TC2 device-tree") taken from Linaro Stable Kernel
(LSK) massaged into mainline.

Cc: Jon Medhurst <[email protected]>
Cc: Russell King <[email protected]>

Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
index 107395c..a596d45 100644
--- a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
+++ b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
@@ -39,6 +39,7 @@
reg = <0>;
cci-control-port = <&cci_control1>;
cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
+ clock-frequency = <1000000000>;
};

cpu1: cpu@1 {
@@ -47,6 +48,7 @@
reg = <1>;
cci-control-port = <&cci_control1>;
cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
+ clock-frequency = <1000000000>;
};

cpu2: cpu@2 {
@@ -55,6 +57,7 @@
reg = <0x100>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ clock-frequency = <800000000>;
};

cpu3: cpu@3 {
@@ -63,6 +66,7 @@
reg = <0x101>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ clock-frequency = <800000000>;
};

cpu4: cpu@4 {
@@ -71,6 +75,7 @@
reg = <0x102>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ clock-frequency = <800000000>;
};

idle-states {
--
1.9.1

2015-07-07 18:23:00

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 04/46] sched: Convert arch_scale_cpu_capacity() from weak function to #define

Bring arch_scale_cpu_capacity() in line with the recent change of its
arch_scale_freq_capacity() sibling in commit dfbca41f3479 ("sched:
Optimize freq invariant accounting") from weak function to #define to
allow inlining of the function.

While at it, remove the ARCH_CAPACITY sched_feature as well. With the
change to #define there isn't a straightforward way to allow runtime
switch between an arch implementation and the default implementation of
arch_scale_cpu_capacity() using sched_feature. The default was to use
the arch-specific implementation, but only the arm architecture provided
one.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 22 +---------------------
kernel/sched/features.h | 5 -----
kernel/sched/sched.h | 11 +++++++++++
3 files changed, 12 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 376ec75..5b4e8c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6238,19 +6238,6 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
return load_idx;
}

-static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
-{
- if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
- return sd->smt_gain / sd->span_weight;
-
- return SCHED_CAPACITY_SCALE;
-}
-
-unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
-{
- return default_scale_cpu_capacity(sd, cpu);
-}
-
static unsigned long scale_rt_capacity(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -6280,16 +6267,9 @@ static unsigned long scale_rt_capacity(int cpu)

static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
- unsigned long capacity = SCHED_CAPACITY_SCALE;
+ unsigned long capacity = arch_scale_cpu_capacity(sd, cpu);
struct sched_group *sdg = sd->groups;

- if (sched_feat(ARCH_CAPACITY))
- capacity *= arch_scale_cpu_capacity(sd, cpu);
- else
- capacity *= default_scale_cpu_capacity(sd, cpu);
-
- capacity >>= SCHED_CAPACITY_SHIFT;
-
cpu_rq(cpu)->cpu_capacity_orig = capacity;

capacity *= scale_rt_capacity(cpu);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 91e33cd..03d8072 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -36,11 +36,6 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
*/
SCHED_FEAT(WAKEUP_PREEMPTION, true)

-/*
- * Use arch dependent cpu capacity functions
- */
-SCHED_FEAT(ARCH_CAPACITY, true)
-
SCHED_FEAT(HRTICK, false)
SCHED_FEAT(DOUBLE_TICK, false)
SCHED_FEAT(LB_BIAS, true)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d62b288..af08a82 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1399,6 +1399,17 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
}
#endif

+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+{
+ if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+ return sd->smt_gain / sd->span_weight;
+
+ return SCHED_CAPACITY_SCALE;
+}
+#endif
+
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
--
1.9.1

2015-07-07 18:22:51

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 05/46] arm: Update arch_scale_cpu_capacity() to reflect change to define

arch_scale_cpu_capacity() is no longer a weak function but a #define
instead. Include the #define in topology.h.

cc: Russell King <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
arch/arm/include/asm/topology.h | 4 ++++
arch/arm/kernel/topology.c | 2 +-
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index c31096f..1b8902c 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -31,6 +31,10 @@ unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);

DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity);

+#define arch_scale_cpu_capacity arm_arch_scale_cpu_capacity
+extern
+unsigned long arm_arch_scale_cpu_capacity(struct sched_domain *sd, int cpu);
+
#else

static inline void init_cpu_topology(void) { }
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 9c09e6e..bad267c 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
*/
static DEFINE_PER_CPU(unsigned long, cpu_scale);

-unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+unsigned long arm_arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
return per_cpu(cpu_scale, cpu);
}
--
1.9.1

2015-07-07 18:23:06

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 06/46] sched: Make usage tracking cpu scale-invariant

From: Dietmar Eggemann <[email protected]>

Besides the existing frequency scale-invariance correction factor, apply
cpu scale-invariance correction factor to usage tracking.

Cpu scale-invariance takes cpu performance deviations due to
micro-architectural differences (i.e. instructions per seconds) between
cpus in HMP systems (e.g. big.LITTLE) and differences in the frequency
value of the highest OPP between cpus in SMP systems into consideration.

Each segment of the sched_avg::running_avg_sum geometric series is now
scaled by the cpu performance factor too so the
sched_avg::utilization_avg_contrib of each entity will be invariant from
the particular cpu of the HMP/SMP system it is gathered on.

So the usage level that is returned by get_cpu_usage stays relative to
the max cpu performance of the system.

In contrast to usage, load (sched_avg::runnable_avg_sum) is currently
not considered to be made cpu scale-invariant because this will have a
negative effect on the the existing load balance code based on
s[dg]_lb_stats::avg_load in overload scenarios.

example: 7 always running tasks
4 on cluster 0 (2 cpus w/ cpu_capacity=512)
3 on cluster 1 (1 cpu w/ cpu_capacity=1024)

cluster 0 cluster 1

capacity 1024 (2*512) 1024 (1*1024)
load 4096 3072
cpu-scaled load 2048 3072

Simply using cpu-scaled load in the existing lb code would declare
cluster 1 busier than cluster 0, although the compute capacity budget
for one task is higher on cluster 1 (1024/3 = 341) than on cluster 0
(2*512/4 = 256).

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 13 +++++++++++++
kernel/sched/sched.h | 2 +-
2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b4e8c1..39871a4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2555,6 +2555,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
u32 runnable_contrib, scaled_runnable_contrib;
int delta_w, scaled_delta_w, decayed = 0;
unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
+ unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);

delta = now - sa->last_runnable_update;
/*
@@ -2591,6 +2592,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,

if (runnable)
sa->runnable_avg_sum += scaled_delta_w;
+
+ scaled_delta_w *= scale_cpu;
+ scaled_delta_w >>= SCHED_CAPACITY_SHIFT;
+
if (running)
sa->running_avg_sum += scaled_delta_w;
sa->avg_period += delta_w;
@@ -2615,6 +2620,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,

if (runnable)
sa->runnable_avg_sum += scaled_runnable_contrib;
+
+ scaled_runnable_contrib *= scale_cpu;
+ scaled_runnable_contrib >>= SCHED_CAPACITY_SHIFT;
+
if (running)
sa->running_avg_sum += scaled_runnable_contrib;
sa->avg_period += runnable_contrib;
@@ -2625,6 +2634,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,

if (runnable)
sa->runnable_avg_sum += scaled_delta;
+
+ scaled_delta *= scale_cpu;
+ scaled_delta >>= SCHED_CAPACITY_SHIFT;
+
if (running)
sa->running_avg_sum += scaled_delta;
sa->avg_period += delta;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index af08a82..4e7bce7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1403,7 +1403,7 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
- if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+ if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
return sd->smt_gain / sd->span_weight;

return SCHED_CAPACITY_SCALE;
--
1.9.1

2015-07-07 19:02:31

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 07/46] arm: Cpu invariant scheduler load-tracking support

From: Dietmar Eggemann <[email protected]>

Reuses the existing infrastructure for cpu_scale to provide the scheduler
with a cpu scaling correction factor for more accurate load-tracking.
This factor comprises a micro-architectural part, which is based on the
cpu efficiency value of a cpu as well as a platform-wide max frequency
part, which relates to the dtb property clock-frequency of a cpu node.

The calculation of cpu_scale, return value of arch_scale_cpu_capacity,
changes from

capacity / middle_capacity

with capacity = (clock_frequency >> 20) * cpu_efficiency

to

SCHED_CAPACITY_SCALE * cpu_perf / max_cpu_perf

The range of the cpu_scale value changes from
[0..3*SCHED_CAPACITY_SCALE/2] to [0..SCHED_CAPACITY_SCALE].

The functionality to calculate the middle_capacity which corresponds to an
'average' cpu has been taken out since the scaling is now done
differently.

In the case that either the cpu efficiency or the clock-frequency value
for a cpu is missing, no cpu scaling is done for any cpu.

The platform-wide max frequency part of the factor should not be confused
with the frequency invariant scheduler load-tracking support which deals
with frequency related scaling due to DFVS functionality on a cpu.

Cc: Russell King <[email protected]>

Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm/kernel/topology.c | 64 +++++++++++++++++-----------------------------
1 file changed, 23 insertions(+), 41 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index bad267c..5867587 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -62,9 +62,7 @@ struct cpu_efficiency {
* Table of relative efficiency of each processors
* The efficiency value must fit in 20bit and the final
* cpu_scale value must be in the range
- * 0 < cpu_scale < 3*SCHED_CAPACITY_SCALE/2
- * in order to return at most 1 when DIV_ROUND_CLOSEST
- * is used to compute the capacity of a CPU.
+ * 0 < cpu_scale < SCHED_CAPACITY_SCALE.
* Processors that are not defined in the table,
* use the default SCHED_CAPACITY_SCALE value for cpu_scale.
*/
@@ -77,24 +75,18 @@ static const struct cpu_efficiency table_efficiency[] = {
static unsigned long *__cpu_capacity;
#define cpu_capacity(cpu) __cpu_capacity[cpu]

-static unsigned long middle_capacity = 1;
+static unsigned long max_cpu_perf;

/*
* Iterate all CPUs' descriptor in DT and compute the efficiency
- * (as per table_efficiency). Also calculate a middle efficiency
- * as close as possible to (max{eff_i} - min{eff_i}) / 2
- * This is later used to scale the cpu_capacity field such that an
- * 'average' CPU is of middle capacity. Also see the comments near
- * table_efficiency[] and update_cpu_capacity().
+ * (as per table_efficiency). Calculate the max cpu performance too.
*/
+
static void __init parse_dt_topology(void)
{
const struct cpu_efficiency *cpu_eff;
struct device_node *cn = NULL;
- unsigned long min_capacity = ULONG_MAX;
- unsigned long max_capacity = 0;
- unsigned long capacity = 0;
- int cpu = 0;
+ int cpu = 0, i = 0;

__cpu_capacity = kcalloc(nr_cpu_ids, sizeof(*__cpu_capacity),
GFP_NOWAIT);
@@ -102,6 +94,7 @@ static void __init parse_dt_topology(void)
for_each_possible_cpu(cpu) {
const u32 *rate;
int len;
+ unsigned long cpu_perf;

/* too early to use cpu->of_node */
cn = of_get_cpu_node(cpu, NULL);
@@ -124,46 +117,35 @@ static void __init parse_dt_topology(void)
continue;
}

- capacity = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency;
-
- /* Save min capacity of the system */
- if (capacity < min_capacity)
- min_capacity = capacity;
-
- /* Save max capacity of the system */
- if (capacity > max_capacity)
- max_capacity = capacity;
-
- cpu_capacity(cpu) = capacity;
+ cpu_perf = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency;
+ cpu_capacity(cpu) = cpu_perf;
+ max_cpu_perf = max(max_cpu_perf, cpu_perf);
+ i++;
}

- /* If min and max capacities are equals, we bypass the update of the
- * cpu_scale because all CPUs have the same capacity. Otherwise, we
- * compute a middle_capacity factor that will ensure that the capacity
- * of an 'average' CPU of the system will be as close as possible to
- * SCHED_CAPACITY_SCALE, which is the default value, but with the
- * constraint explained near table_efficiency[].
- */
- if (4*max_capacity < (3*(max_capacity + min_capacity)))
- middle_capacity = (min_capacity + max_capacity)
- >> (SCHED_CAPACITY_SHIFT+1);
- else
- middle_capacity = ((max_capacity / 3)
- >> (SCHED_CAPACITY_SHIFT-1)) + 1;
-
+ if (i < num_possible_cpus())
+ max_cpu_perf = 0;
}

/*
* Look for a customed capacity of a CPU in the cpu_capacity table during the
* boot. The update of all CPUs is in O(n^2) for heteregeneous system but the
- * function returns directly for SMP system.
+ * function returns directly for SMP systems or if there is no complete set
+ * of cpu efficiency, clock frequency data for each cpu.
*/
static void update_cpu_capacity(unsigned int cpu)
{
- if (!cpu_capacity(cpu))
+ unsigned long capacity = cpu_capacity(cpu);
+
+ if (!capacity || !max_cpu_perf) {
+ cpu_capacity(cpu) = 0;
return;
+ }
+
+ capacity *= SCHED_CAPACITY_SCALE;
+ capacity /= max_cpu_perf;

- set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
+ set_capacity_scale(cpu, capacity);

pr_info("CPU%u: update cpu_capacity %lu\n",
cpu, arch_scale_cpu_capacity(NULL, cpu));
--
1.9.1

2015-07-07 18:23:15

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 08/46] sched: Get rid of scaling usage by cpu_capacity_orig

From: Dietmar Eggemann <[email protected]>

Since now we have besides frequency invariant also cpu (uarch plus max
system frequency) invariant cfs_rq::utilization_load_avg both, frequency
and cpu scaling happens as part of the load tracking.
So cfs_rq::utilization_load_avg does not have to be scaled by the original
capacity of the cpu again.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 34 +++++++++++++++++++++-------------
1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39871a4..0c08cff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5013,32 +5013,40 @@ static int select_idle_sibling(struct task_struct *p, int target)
done:
return target;
}
+
/*
* get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
* tasks. The unit of the return value must be the one of capacity so we can
* compare the usage with the capacity of the CPU that is available for CFS
* task (ie cpu_capacity).
+ *
* cfs.utilization_load_avg is the sum of running time of runnable tasks on a
* CPU. It represents the amount of utilization of a CPU in the range
- * [0..SCHED_LOAD_SCALE]. The usage of a CPU can't be higher than the full
- * capacity of the CPU because it's about the running time on this CPU.
- * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
- * because of unfortunate rounding in avg_period and running_load_avg or just
- * after migrating tasks until the average stabilizes with the new running
- * time. So we need to check that the usage stays into the range
- * [0..cpu_capacity_orig] and cap if necessary.
- * Without capping the usage, a group could be seen as overloaded (CPU0 usage
- * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity
+ * [0..capacity_orig] where capacity_orig is the cpu_capacity available at the
+ * highest frequency (arch_scale_freq_capacity()). The usage of a CPU converges
+ * towards a sum equal to or less than the current capacity (capacity_curr <=
+ * capacity_orig) of the CPU because it is the running time on this CPU scaled
+ * by capacity_curr. Nevertheless, cfs.utilization_load_avg can be higher than
+ * capacity_curr or even higher than capacity_orig because of unfortunate
+ * rounding in avg_period and running_load_avg or just after migrating tasks
+ * (and new task wakeups) until the average stabilizes with the new running
+ * time. We need to check that the usage stays into the range
+ * [0..capacity_orig] and cap if necessary. Without capping the usage, a group
+ * could be seen as overloaded (CPU0 usage at 121% + CPU1 usage at 80%) whereas
+ * CPU1 has 20% of available capacity. We allow usage to overshoot
+ * capacity_curr (but not capacity_orig) as it useful for predicting the
+ * capacity required after task migrations (scheduler-driven DVFS).
*/
+
static int get_cpu_usage(int cpu)
{
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
- unsigned long capacity = capacity_orig_of(cpu);
+ unsigned long capacity_orig = capacity_orig_of(cpu);

- if (usage >= SCHED_LOAD_SCALE)
- return capacity;
+ if (usage >= capacity_orig)
+ return capacity_orig;

- return (usage * capacity) >> SCHED_LOAD_SHIFT;
+ return usage;
}

/*
--
1.9.1

2015-07-07 19:02:21

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 09/46] sched: Track blocked utilization contributions

Introduces the blocked utilization, the utilization counter-part to
cfs_rq->blocked_load_avg. It is the sum of sched_entity utilization
contributions of entities that were recently on the cfs_rq and are
currently blocked. Combined with the sum of utilization of entities
currently on the cfs_rq or currently running
(cfs_rq->utilization_load_avg) this provides a more stable average
view of the cpu usage.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 30 +++++++++++++++++++++++++++++-
kernel/sched/sched.h | 8 ++++++--
2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c08cff..c26980f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2818,6 +2818,15 @@ static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
cfs_rq->blocked_load_avg = 0;
}

+static inline void subtract_utilization_blocked_contrib(struct cfs_rq *cfs_rq,
+ long utilization_contrib)
+{
+ if (likely(utilization_contrib < cfs_rq->utilization_blocked_avg))
+ cfs_rq->utilization_blocked_avg -= utilization_contrib;
+ else
+ cfs_rq->utilization_blocked_avg = 0;
+}
+
static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);

/* Update a sched_entity's runnable average */
@@ -2853,6 +2862,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
cfs_rq->utilization_load_avg += utilization_delta;
} else {
subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ -utilization_delta);
}
}

@@ -2870,14 +2881,20 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
return;

if (atomic_long_read(&cfs_rq->removed_load)) {
- unsigned long removed_load;
+ unsigned long removed_load, removed_utilization;
removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
+ removed_utilization =
+ atomic_long_xchg(&cfs_rq->removed_utilization, 0);
subtract_blocked_load_contrib(cfs_rq, removed_load);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ removed_utilization);
}

if (decays) {
cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
decays);
+ cfs_rq->utilization_blocked_avg =
+ decay_load(cfs_rq->utilization_blocked_avg, decays);
atomic64_add(decays, &cfs_rq->decay_counter);
cfs_rq->last_decay = now;
}
@@ -2924,6 +2941,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
/* migrated tasks did not contribute to our blocked load */
if (wakeup) {
subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ se->avg.utilization_avg_contrib);
update_entity_load_avg(se, 0);
}

@@ -2950,6 +2969,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
if (sleep) {
cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+ cfs_rq->utilization_blocked_avg +=
+ se->avg.utilization_avg_contrib;
se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
} /* migrations, e.g. sleep=0 leave decay_count == 0 */
}
@@ -5162,6 +5183,8 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
se->avg.decay_count = -__synchronize_entity_decay(se);
atomic_long_add(se->avg.load_avg_contrib,
&cfs_rq->removed_load);
+ atomic_long_add(se->avg.utilization_avg_contrib,
+ &cfs_rq->removed_utilization);
}

/* We have migrated, no longer consider this task hot */
@@ -8153,6 +8176,8 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
if (se->avg.decay_count) {
__synchronize_entity_decay(se);
subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ se->avg.utilization_avg_contrib);
}
#endif
}
@@ -8212,6 +8237,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#ifdef CONFIG_SMP
atomic64_set(&cfs_rq->decay_counter, 1);
atomic_long_set(&cfs_rq->removed_load, 0);
+ atomic_long_set(&cfs_rq->removed_utilization, 0);
#endif
}

@@ -8264,6 +8290,8 @@ static void task_move_group_fair(struct task_struct *p, int queued)
*/
se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+ cfs_rq->utilization_blocked_avg +=
+ se->avg.utilization_avg_contrib;
#endif
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4e7bce7..a22cf89 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -375,11 +375,15 @@ struct cfs_rq {
* the blocked sched_entities on the rq.
* utilization_load_avg is the sum of the average running time of the
* sched_entities on the rq.
+ * utilization_blocked_avg is the utilization equivalent of
+ * blocked_load_avg, i.e. the sum of running contributions of blocked
+ * sched_entities associated with the rq.
*/
- unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
+ unsigned long runnable_load_avg, blocked_load_avg;
+ unsigned long utilization_load_avg, utilization_blocked_avg;
atomic64_t decay_counter;
u64 last_decay;
- atomic_long_t removed_load;
+ atomic_long_t removed_load, removed_utilization;

#ifdef CONFIG_FAIR_GROUP_SCHED
/* Required to track per-cpu representation of a task_group */
--
1.9.1

2015-07-07 18:23:40

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 10/46] sched: Include blocked utilization in usage tracking

Add the blocked utilization contribution to group sched_entity
utilization (se->avg.utilization_avg_contrib) and to get_cpu_usage().
With this change cpu usage now includes recent usage by currently
non-runnable tasks, hence it provides a more stable view of the cpu
usage. It does, however, also mean that the meaning of usage is changed:
A cpu may be momentarily idle while usage is >0. It can no longer be
assumed that cpu usage >0 implies runnable tasks on the rq.
cfs_rq->utilization_load_avg or nr_running should be used instead to get
the current rq status.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c26980f..775b0c7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2804,7 +2804,8 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
__update_task_entity_utilization(se);
else
se->avg.utilization_avg_contrib =
- group_cfs_rq(se)->utilization_load_avg;
+ group_cfs_rq(se)->utilization_load_avg +
+ group_cfs_rq(se)->utilization_blocked_avg;

return se->avg.utilization_avg_contrib - old_contrib;
}
@@ -5061,13 +5062,17 @@ static int select_idle_sibling(struct task_struct *p, int target)

static int get_cpu_usage(int cpu)
{
+ int sum;
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+ unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
unsigned long capacity_orig = capacity_orig_of(cpu);

- if (usage >= capacity_orig)
+ sum = usage + blocked;
+
+ if (sum >= capacity_orig)
return capacity_orig;

- return usage;
+ return sum;
}

/*
--
1.9.1

2015-07-07 18:23:27

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 11/46] sched: Remove blocked load and utilization contributions of dying tasks

Tasks being dequeued for the last time (state == TASK_DEAD) are dequeued
with the DEQUEUE_SLEEP flag which causes their load and utilization
contributions to be added to the runqueue blocked load and utilization.
Hence they will contain load or utilization that is gone away. The issue
only exists for the root cfs_rq as cgroup_exit() doesn't set
DEQUEUE_SLEEP for task group exits.

If runnable+blocked load is to be used as a better estimate for cpu
load the dead task contributions need to be removed to prevent
load_balance() (idle_balance() in particular) from over-estimating the
cpu load.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 775b0c7..fa12ce5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3217,6 +3217,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
+ if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
+ flags &= !DEQUEUE_SLEEP;
dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);

update_stats_dequeue(cfs_rq, se);
--
1.9.1

2015-07-07 18:23:52

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 12/46] sched: Initialize CFS task load and usage before placing task on rq

Task load or usage is not currently considered in select_task_rq_fair(),
but if we want that in the future we should make sure it is not zero for
new tasks.

The load-tracking sums are currently initialized using sched_slice(),
that won't work before the task has been assigned a rq. Initialization
is therefore changed to another semi-arbitrary value, sched_latency,
instead.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/core.c | 4 ++--
kernel/sched/fair.c | 7 +++----
2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 10338ce..6a06fe5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2126,6 +2126,8 @@ void wake_up_new_task(struct task_struct *p)
struct rq *rq;

raw_spin_lock_irqsave(&p->pi_lock, flags);
+ /* Initialize new task's runnable average */
+ init_task_runnable_average(p);
#ifdef CONFIG_SMP
/*
* Fork balancing, do it here and not earlier because:
@@ -2135,8 +2137,6 @@ void wake_up_new_task(struct task_struct *p)
set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif

- /* Initialize new task's runnable average */
- init_task_runnable_average(p);
rq = __task_rq_lock(p);
activate_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_QUEUED;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fa12ce5..d0df937 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -675,11 +675,10 @@ static inline void __update_task_entity_utilization(struct sched_entity *se);
/* Give new task start runnable values to heavy its load in infant time */
void init_task_runnable_average(struct task_struct *p)
{
- u32 slice;
+ u32 start_load = sysctl_sched_latency >> 10;

- slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
- p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice;
- p->se.avg.avg_period = slice;
+ p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = start_load;
+ p->se.avg.avg_period = start_load;
__update_task_entity_contrib(&p->se);
__update_task_entity_utilization(&p->se);
}
--
1.9.1

2015-07-07 19:02:07

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 13/46] sched: Documentation for scheduler energy cost model

This documentation patch provides an overview of the experimental
scheduler energy costing model, associated data structures, and a
reference recipe on how platforms can be characterized to derive energy
models.

Signed-off-by: Morten Rasmussen <[email protected]>
---
Documentation/scheduler/sched-energy.txt | 363 +++++++++++++++++++++++++++++++
1 file changed, 363 insertions(+)
create mode 100644 Documentation/scheduler/sched-energy.txt

diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
new file mode 100644
index 0000000..37be110
--- /dev/null
+++ b/Documentation/scheduler/sched-energy.txt
@@ -0,0 +1,363 @@
+Energy cost model for energy-aware scheduling (EXPERIMENTAL)
+
+Introduction
+=============
+
+The basic energy model uses platform energy data stored in sched_group_energy
+data structures attached to the sched_groups in the sched_domain hierarchy. The
+energy cost model offers two functions that can be used to guide scheduling
+decisions:
+
+1. static unsigned int sched_group_energy(struct energy_env *eenv)
+2. static int energy_diff(struct energy_env *eenv)
+
+sched_group_energy() estimates the energy consumed by all cpus in a specific
+sched_group including any shared resources owned exclusively by this group of
+cpus. Resources shared with other cpus are excluded (e.g. later level caches).
+
+energy_diff() estimates the total energy impact of a utilization change. That
+is, adding, removing, or migrating utilization (tasks).
+
+Both functions use a struct energy_env to specify the scenario to be evaluated:
+
+ struct energy_env {
+ struct sched_group *sg_top;
+ struct sched_group *sg_cap;
+ int cap_idx;
+ int usage_delta;
+ int src_cpu;
+ int dst_cpu;
+ int energy;
+ };
+
+sg_top: sched_group to be evaluated. Not used by energy_diff().
+
+sg_cap: sched_group covering the cpus in the same frequency domain. Set by
+sched_group_energy().
+
+cap_idx: Capacity state to be used for energy calculations. Set by
+find_new_capacity().
+
+usage_delta: Amount of utilization to be added, removed, or migrated.
+
+src_cpu: Source cpu from where 'usage_delta' utilization is removed. Should be
+-1 if no source (e.g. task wake-up).
+
+dst_cpu: Destination cpu where 'usage_delta' utilization is added. Should be -1
+if utilization is removed (e.g. terminating tasks).
+
+energy: Result of sched_group_energy().
+
+The metric used to represent utilization is the actual per-entity running time
+averaged over time using a geometric series. Very similar to the existing
+per-entity load-tracking, but _not_ scaled by task priority and capped by the
+capacity of the cpu. The latter property does mean that utilization may
+underestimate the compute requirements for task on fully/over utilized cpus.
+The greatest potential for energy savings without affecting performance too much
+is scenarios where the system isn't fully utilized. If the system is deemed
+fully utilized load-balancing should be done with task load (includes task
+priority) instead in the interest of fairness and performance.
+
+
+Background and Terminology
+===========================
+
+To make it clear from the start:
+
+energy = [joule] (resource like a battery on powered devices)
+power = energy/time = [joule/second] = [watt]
+
+The goal of energy-aware scheduling is to minimize energy, while still getting
+the job done. That is, we want to maximize:
+
+ performance [inst/s]
+ --------------------
+ power [W]
+
+which is equivalent to minimizing:
+
+ energy [J]
+ -----------
+ instruction
+
+while still getting 'good' performance. It is essentially an alternative
+optimization objective to the current performance-only objective for the
+scheduler. This alternative considers two objectives: energy-efficiency and
+performance. Hence, there needs to be a user controllable knob to switch the
+objective. Since it is early days, this is currently a sched_feature
+(ENERGY_AWARE).
+
+The idea behind introducing an energy cost model is to allow the scheduler to
+evaluate the implications of its decisions rather than applying energy-saving
+techniques blindly that may only have positive effects on some platforms. At
+the same time, the energy cost model must be as simple as possible to minimize
+the scheduler latency impact.
+
+Platform topology
+------------------
+
+The system topology (cpus, caches, and NUMA information, not peripherals) is
+represented in the scheduler by the sched_domain hierarchy which has
+sched_groups attached at each level that covers one or more cpus (see
+sched-domains.txt for more details). To add energy awareness to the scheduler
+we need to consider power and frequency domains.
+
+Power domain:
+
+A power domain is a part of the system that can be powered on/off
+independently. Power domains are typically organized in a hierarchy where you
+may be able to power down just a cpu or a group of cpus along with any
+associated resources (e.g. shared caches). Powering up a cpu means that all
+power domains it is a part of in the hierarchy must be powered up. Hence, it is
+more expensive to power up the first cpu that belongs to a higher level power
+domain than powering up additional cpus in the same high level domain. Two
+level power domain hierarchy example:
+
+ Power source
+ +-------------------------------+----...
+per group PD G G
+ | +----------+ |
+ +--------+-------| Shared | (other groups)
+per-cpu PD G G | resource |
+ | | +----------+
+ +-------+ +-------+
+ | CPU 0 | | CPU 1 |
+ +-------+ +-------+
+
+Frequency domain:
+
+Frequency domains (P-states) typically cover the same group of cpus as one of
+the power domain levels. That is, there might be several smaller power domains
+sharing the same frequency (P-state) or there might be a power domain spanning
+multiple frequency domains.
+
+From a scheduling point of view there is no need to know the actual frequencies
+[Hz]. All the scheduler cares about is the compute capacity available at the
+current state (P-state) the cpu is in and any other available states. For that
+reason, and to also factor in any cpu micro-architecture differences, compute
+capacity scaling states are called 'capacity states' in this document. For SMP
+systems this is equivalent to P-states. For mixed micro-architecture systems
+(like ARM big.LITTLE) it is P-states scaled according to the micro-architecture
+performance relative to the other cpus in the system.
+
+Energy modelling:
+------------------
+
+Due to the hierarchical nature of the power domains, the most obvious way to
+model energy costs is therefore to associate power and energy costs with
+domains (groups of cpus). Energy costs of shared resources are associated with
+the group of cpus that share the resources, only the cost of powering the
+cpu itself and any private resources (e.g. private L1 caches) is associated
+with the per-cpu groups (lowest level).
+
+For example, for an SMP system with per-cpu power domains and a cluster level
+(group of cpus) power domain we get the overall energy costs to be:
+
+ energy = energy_cluster + n * energy_cpu
+
+where 'n' is the number of cpus powered up and energy_cluster is the cost paid
+as soon as any cpu in the cluster is powered up.
+
+The power and frequency domains can naturally be mapped onto the existing
+sched_domain hierarchy and sched_groups by adding the necessary data to the
+existing data structures.
+
+The energy model considers energy consumption from two contributors (shown in
+the illustration below):
+
+1. Busy energy: Energy consumed while a cpu and the higher level groups that it
+belongs to are busy running tasks. Busy energy is associated with the state of
+the cpu, not an event. The time the cpu spends in this state varies. Thus, the
+most obvious platform parameter for this contribution is busy power
+(energy/time).
+
+2. Idle energy: Energy consumed while a cpu and higher level groups that it
+belongs to are idle (in a C-state). Like busy energy, idle energy is associated
+with the state of the cpu. Thus, the platform parameter for this contribution
+is idle power (energy/time).
+
+Energy consumed during transitions from an idle-state (C-state) to a busy state
+(P-state) or going the other way is ignored by the model to simplify the energy
+model calculations.
+
+
+ Power
+ ^
+ | busy->idle idle->busy
+ | transition transition
+ |
+ | _ __
+ | / \ / \__________________
+ |______________/ \ /
+ | \ /
+ | Busy \ Idle / Busy
+ | low P-state \____________/ high P-state
+ |
+ +------------------------------------------------------------> time
+
+Busy |--------------| |-----------------|
+
+Wakeup |------| |------|
+
+Idle |------------|
+
+
+The basic algorithm
+====================
+
+The basic idea is to determine the total energy impact when utilization is
+added or removed by estimating the impact at each level in the sched_domain
+hierarchy starting from the bottom (sched_group contains just a single cpu).
+The energy cost comes from busy time (sched_group is awake because one or more
+cpus are busy) and idle time (in an idle-state). Energy model numbers account
+for energy costs associated with all cpus in the sched_group as a group.
+
+ for_each_domain(cpu, sd) {
+ sg = sched_group_of(cpu)
+ energy_before = curr_util(sg) * busy_power(sg)
+ + (1-curr_util(sg)) * idle_power(sg)
+ energy_after = new_util(sg) * busy_power(sg)
+ + (1-new_util(sg)) * idle_power(sg)
+ energy_diff += energy_before - energy_after
+
+ }
+
+ return energy_diff
+
+{curr, new}_util: The cpu utilization at the lowest level and the overall
+non-idle time for the entire group for higher levels. Utilization is in the
+range 0.0 to 1.0 in the pseudo-code.
+
+busy_power: The power consumption of the sched_group.
+
+idle_power: The power consumption of the sched_group when idle.
+
+Note: It is a fundamental assumption that the utilization is (roughly) scale
+invariant. Task utilization tracking factors in any frequency scaling and
+performance scaling differences due to difference cpu microarchitectures such
+that task utilization can be used across the entire system.
+
+
+Platform energy data
+=====================
+
+struct sched_group_energy can be attached to sched_groups in the sched_domain
+hierarchy and has the following members:
+
+cap_states:
+ List of struct capacity_state representing the supported capacity states
+ (P-states). struct capacity_state has two members: cap and power, which
+ represents the compute capacity and the busy_power of the state. The
+ list must be ordered by capacity low->high.
+
+nr_cap_states:
+ Number of capacity states in cap_states list.
+
+idle_states:
+ List of struct idle_state containing idle_state power cost for each
+ idle-state support by the sched_group. Note that the energy model
+ calculations will use this table to determine idle power even if no idle
+ state is actually entered by cpuidle. That is, if latency constraints
+ prevents that the group enters a coupled state or no idle-states are
+ supported. Hence, the first entry of the list must be the idle power
+ when idle, but no idle state was actually entered ('active idle'). This
+ state may be left out groups with one cpu if the cpu is guaranteed to
+ enter the state when idle.
+
+nr_idle_states:
+ Number of idle states in idle_states list.
+
+nr_idle_states_below:
+ Number of idle-states below current level. Filled by generic code, not
+ to be provided by the platform.
+
+There are no unit requirements for the energy cost data. Data can be normalized
+with any reference, however, the normalization must be consistent across all
+energy cost data. That is, one bogo-joule/watt must be the same quantity for
+data, but we don't care what it is.
+
+A recipe for platform characterization
+=======================================
+
+Obtaining the actual model data for a particular platform requires some way of
+measuring power/energy. There isn't a tool to help with this (yet). This
+section provides a recipe for use as reference. It covers the steps used to
+characterize the ARM TC2 development platform. This sort of measurements is
+expected to be done anyway when tuning cpuidle and cpufreq for a given
+platform.
+
+The energy model needs two types of data (struct sched_group_energy holds
+these) for each sched_group where energy costs should be taken into account:
+
+1. Capacity state information
+
+A list containing the compute capacity and power consumption when fully
+utilized attributed to the group as a whole for each available capacity state.
+At the lowest level (group contains just a single cpu) this is the power of the
+cpu alone without including power consumed by resources shared with other cpus.
+It basically needs to fit the basic modelling approach described in "Background
+and Terminology" section:
+
+ energy_system = energy_shared + n * energy_cpu
+
+for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at
+the lowest level. 'energy_shared' is included at the next level which
+represents the group of cpus among which the resources are shared.
+
+This model is, of course, a simplification of reality. Thus, power/energy
+attributions might not always exactly represent how the hardware is designed.
+Also, busy power is likely to depend on the workload. It is therefore
+recommended to use a representative mix of workloads when characterizing the
+capacity states.
+
+If the group has no capacity scaling support, the list will contain a single
+state where power is the busy power attributed to the group. The capacity
+should be set to a default value (1024).
+
+When frequency domains include multiple power domains, the group representing
+the frequency domain and all child groups share capacity states. This must be
+indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at
+all levels that share the capacity state must have the list of capacity states
+with the power set to the contribution of the individual group.
+
+2. Idle power information
+
+Stored in the idle_states list. The power number is the group idle power
+consumption in each idle state as well when the group is idle but has not
+entered an idle-state ('active idle' as mentioned earlier). Due to the way the
+energy model is defined, the idle power of the deepest group idle state can
+alternatively be accounted for in the parent group busy power. In that case the
+group idle state power values are offset such that the idle power of the
+deepest state is zero. It is less intuitive, but it is easier to measure as
+idle power consumed by the group and the busy/idle power of the parent group
+cannot be distinguished without per group measurement points.
+
+Measuring capacity states and idle power:
+
+The capacity states' capacity and power can be estimated by running a benchmark
+workload at each available capacity state. By restricting the benchmark to run
+on subsets of cpus it is possible to extrapolate the power consumption of
+shared resources.
+
+ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a
+shared L2 cache. TC2 has on-chip energy counters per cluster. Running a
+benchmark workload on just one cpu in a cluster means that power is consumed in
+the cluster (higher level group) and a single cpu (lowest level group). Adding
+another benchmark task to another cpu increases the power consumption by the
+amount consumed by the additional cpu. Hence, it is possible to extrapolate the
+cluster busy power.
+
+For platforms that don't have energy counters or equivalent instrumentation
+built-in, it may be possible to use an external DAQ to acquire similar data.
+
+If the benchmark includes some performance score (for example sysbench cpu
+benchmark), this can be used to record the compute capacity.
+
+Measuring idle power requires insight into the idle state implementation on the
+particular platform. Specifically, if the platform has coupled idle-states (or
+package states). To measure non-coupled per-cpu idle-states it is necessary to
+keep one cpu busy to keep any shared resources alive to isolate the idle power
+of the cpu from idle/busy power of the shared resources. The cpu can be tricked
+into different per-cpu idle states by disabling the other states. Based on
+various combinations of measurements with specific cpus busy and disabling
+idle-states it is possible to extrapolate the idle-state power.
--
1.9.1

2015-07-07 19:01:44

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 14/46] sched: Make energy awareness a sched feature

This patch introduces the ENERGY_AWARE sched feature, which is
implemented using jump labels when SCHED_DEBUG is defined. It is
statically set false when SCHED_DEBUG is not defined. Hence this doesn't
allow energy awareness to be enabled without SCHED_DEBUG. This
sched_feature knob will be replaced later with a more appropriate
control knob when things have matured a bit.

ENERGY_AWARE is based on per-entity load-tracking hence FAIR_GROUP_SCHED
must be enable. This dependency isn't checked at compile time yet.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 6 ++++++
kernel/sched/features.h | 6 ++++++
2 files changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0df937..a95dad9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4781,6 +4781,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)

return wl;
}
+
#else

static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
@@ -4790,6 +4791,11 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)

#endif

+static inline bool energy_aware(void)
+{
+ return sched_feat(ENERGY_AWARE);
+}
+
static int wake_wide(struct task_struct *p)
{
int factor = this_cpu_read(sd_llc_size);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 03d8072..92bc36e 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -91,3 +91,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
*/
SCHED_FEAT(NUMA_RESIST_LOWER, false)
#endif
+
+/*
+ * Energy aware scheduling. Use platform energy model to guide scheduling
+ * decisions optimizing for energy efficiency.
+ */
+SCHED_FEAT(ENERGY_AWARE, false)
--
1.9.1

2015-07-07 18:24:16

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 15/46] sched: Introduce energy data structures

From: Dietmar Eggemann <[email protected]>

The struct sched_group_energy represents the per sched_group related
data which is needed for energy aware scheduling. It contains:

(1) atomic reference counter for scheduler internal bookkeeping of
data allocation and freeing
(2) number of elements of the idle state array
(3) pointer to the idle state array which comprises 'power consumption'
for each idle state
(4) number of elements of the capacity state array
(5) pointer to the capacity state array which comprises 'compute
capacity and power consumption' tuples for each capacity state

Allocation and freeing of struct sched_group_energy utilizes the existing
infrastructure of the scheduler which is currently used for the other sd
hierarchy data structures (e.g. struct sched_domain) as well. That's why
struct sd_data is provisioned with a per cpu struct sched_group_energy
double pointer.

The struct sched_group obtains a pointer to a struct sched_group_energy.

The function pointer sched_domain_energy_f is introduced into struct
sched_domain_topology_level which will allow the arch to pass a particular
struct sched_group_energy from the topology shim layer into the scheduler
core.

The function pointer sched_domain_energy_f has an 'int cpu' parameter
since the folding of two adjacent sd levels via sd degenerate doesn't work
for all sd levels. I.e. it is not possible for example to use this feature
to provide per-cpu energy in sd level DIE on ARM's TC2 platform.

It was discussed that the folding of sd levels approach is preferable
over the cpu parameter approach, simply because the user (the arch
specifying the sd topology table) can introduce less errors. But since
it is not working, the 'int cpu' parameter is the only way out. It's
possible to use the folding of sd levels approach for
sched_domain_flags_f and the cpu parameter approach for the
sched_domain_energy_f at the same time though. With the use of the
'int cpu' parameter, an extra check function has to be provided to make
sure that all cpus spanned by a sched group are provisioned with the same
energy data.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Dietmar Eggemann <[email protected]>
---
include/linux/sched.h | 20 ++++++++++++++++++++
kernel/sched/sched.h | 1 +
2 files changed, 21 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7de815c..c4d0e88 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1017,6 +1017,23 @@ struct sched_domain_attr {

extern int sched_domain_level_max;

+struct capacity_state {
+ unsigned long cap; /* compute capacity */
+ unsigned long power; /* power consumption at this compute capacity */
+};
+
+struct idle_state {
+ unsigned long power; /* power consumption in this idle state */
+};
+
+struct sched_group_energy {
+ atomic_t ref;
+ unsigned int nr_idle_states; /* number of idle states */
+ struct idle_state *idle_states; /* ptr to idle state array */
+ unsigned int nr_cap_states; /* number of capacity states */
+ struct capacity_state *cap_states; /* ptr to capacity state array */
+};
+
struct sched_group;

struct sched_domain {
@@ -1115,6 +1132,7 @@ bool cpus_share_cache(int this_cpu, int that_cpu);

typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
typedef int (*sched_domain_flags_f)(void);
+typedef const struct sched_group_energy *(*sched_domain_energy_f)(int cpu);

#define SDTL_OVERLAP 0x01

@@ -1122,11 +1140,13 @@ struct sd_data {
struct sched_domain **__percpu sd;
struct sched_group **__percpu sg;
struct sched_group_capacity **__percpu sgc;
+ struct sched_group_energy **__percpu sge;
};

struct sched_domain_topology_level {
sched_domain_mask_f mask;
sched_domain_flags_f sd_flags;
+ sched_domain_energy_f energy;
int flags;
int numa_level;
struct sd_data data;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a22cf89..7b687c6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -853,6 +853,7 @@ struct sched_group {

unsigned int group_weight;
struct sched_group_capacity *sgc;
+ struct sched_group_energy *sge;

/*
* The CPUs this group covers.
--
1.9.1

2015-07-07 18:24:05

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 16/46] sched: Allocate and initialize energy data structures

From: Dietmar Eggemann <[email protected]>

The per sched group sched_group_energy structure plus the related
idle_state and capacity_state arrays are allocated like the other sched
domain (sd) hierarchy data structures. This includes the freeing of
sched_group_energy structures which are not used.

Energy-aware scheduling allows that a system only has energy model data
up to a certain sd level (so called highest energy aware balancing sd
level). A check in init_sched_energy enforces that all sd's below this
sd level contain energy model data.

One problem is that the number of elements of the idle_state and the
capacity_state arrays is not fixed and has to be retrieved in
__sdt_alloc() to allocate memory for the sched_group_energy structure and
the two arrays in one chunk. The array pointers (idle_states and
cap_states) are initialized here to point to the correct place inside the
memory chunk.

The new function init_sched_energy() initializes the sched_group_energy
structure and the two arrays in case the sd topology level contains energy
information.

This patch has been tested with scheduler feature flag FORCE_SD_OVERLAP
enabled as well.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/core.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 33 +++++++++++++++++++
2 files changed, 121 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6a06fe5..e09ded5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5731,6 +5731,9 @@ static void free_sched_groups(struct sched_group *sg, int free_sgc)
if (free_sgc && atomic_dec_and_test(&sg->sgc->ref))
kfree(sg->sgc);

+ if (free_sgc && atomic_dec_and_test(&sg->sge->ref))
+ kfree(sg->sge);
+
kfree(sg);
sg = tmp;
} while (sg != first);
@@ -5748,6 +5751,7 @@ static void free_sched_domain(struct rcu_head *rcu)
free_sched_groups(sd->groups, 1);
} else if (atomic_dec_and_test(&sd->groups->ref)) {
kfree(sd->groups->sgc);
+ kfree(sd->groups->sge);
kfree(sd->groups);
}
kfree(sd);
@@ -5965,6 +5969,8 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
*/
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);

+ sg->sge = *per_cpu_ptr(sdd->sge, i);
+
/*
* Make sure the first group of this domain contains the
* canonical balance cpu. Otherwise the sched_domain iteration
@@ -6003,6 +6009,7 @@ static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
*sg = *per_cpu_ptr(sdd->sg, cpu);
(*sg)->sgc = *per_cpu_ptr(sdd->sgc, cpu);
atomic_set(&(*sg)->sgc->ref, 1); /* for claim_allocations */
+ (*sg)->sge = *per_cpu_ptr(sdd->sge, cpu);
}

return cpu;
@@ -6092,6 +6099,52 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
atomic_set(&sg->sgc->nr_busy_cpus, sg->group_weight);
}

+static void init_sched_energy(int cpu, struct sched_domain *sd,
+ struct sched_domain_topology_level *tl)
+{
+ struct sched_group *sg = sd->groups;
+ struct sched_group_energy *sge = sg->sge;
+ sched_domain_energy_f fn = tl->energy;
+ struct cpumask *mask = sched_group_cpus(sg);
+
+ if (fn && sd->child && !sd->child->groups->sge) {
+ pr_err("BUG: EAS setup broken for CPU%d\n", cpu);
+#ifdef CONFIG_SCHED_DEBUG
+ pr_err(" energy data on %s but not on %s domain\n",
+ sd->name, sd->child->name);
+#endif
+ return;
+ }
+
+ if (cpu != group_balance_cpu(sg))
+ return;
+
+ if (!fn || !fn(cpu)) {
+ sg->sge = NULL;
+ return;
+ }
+
+ atomic_set(&sg->sge->ref, 1); /* for claim_allocations */
+
+ if (cpumask_weight(mask) > 1)
+ check_sched_energy_data(cpu, fn, mask);
+
+ sge->nr_idle_states = fn(cpu)->nr_idle_states;
+ sge->nr_cap_states = fn(cpu)->nr_cap_states;
+ sge->idle_states = (struct idle_state *)
+ ((void *)&sge->cap_states +
+ sizeof(sge->cap_states));
+ sge->cap_states = (struct capacity_state *)
+ ((void *)&sge->cap_states +
+ sizeof(sge->cap_states) +
+ sge->nr_idle_states *
+ sizeof(struct idle_state));
+ memcpy(sge->idle_states, fn(cpu)->idle_states,
+ sge->nr_idle_states*sizeof(struct idle_state));
+ memcpy(sge->cap_states, fn(cpu)->cap_states,
+ sge->nr_cap_states*sizeof(struct capacity_state));
+}
+
/*
* Initializers for schedule domains
* Non-inlined to reduce accumulated stack pressure in build_sched_domains()
@@ -6182,6 +6235,9 @@ static void claim_allocations(int cpu, struct sched_domain *sd)

if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
*per_cpu_ptr(sdd->sgc, cpu) = NULL;
+
+ if (atomic_read(&(*per_cpu_ptr(sdd->sge, cpu))->ref))
+ *per_cpu_ptr(sdd->sge, cpu) = NULL;
}

#ifdef CONFIG_NUMA
@@ -6647,10 +6703,24 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
if (!sdd->sgc)
return -ENOMEM;

+ sdd->sge = alloc_percpu(struct sched_group_energy *);
+ if (!sdd->sge)
+ return -ENOMEM;
+
for_each_cpu(j, cpu_map) {
struct sched_domain *sd;
struct sched_group *sg;
struct sched_group_capacity *sgc;
+ struct sched_group_energy *sge;
+ sched_domain_energy_f fn = tl->energy;
+ unsigned int nr_idle_states = 0;
+ unsigned int nr_cap_states = 0;
+
+ if (fn && fn(j)) {
+ nr_idle_states = fn(j)->nr_idle_states;
+ nr_cap_states = fn(j)->nr_cap_states;
+ BUG_ON(!nr_idle_states || !nr_cap_states);
+ }

sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
@@ -6674,6 +6744,16 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
return -ENOMEM;

*per_cpu_ptr(sdd->sgc, j) = sgc;
+
+ sge = kzalloc_node(sizeof(struct sched_group_energy) +
+ nr_idle_states*sizeof(struct idle_state) +
+ nr_cap_states*sizeof(struct capacity_state),
+ GFP_KERNEL, cpu_to_node(j));
+
+ if (!sge)
+ return -ENOMEM;
+
+ *per_cpu_ptr(sdd->sge, j) = sge;
}
}

@@ -6702,6 +6782,8 @@ static void __sdt_free(const struct cpumask *cpu_map)
kfree(*per_cpu_ptr(sdd->sg, j));
if (sdd->sgc)
kfree(*per_cpu_ptr(sdd->sgc, j));
+ if (sdd->sge)
+ kfree(*per_cpu_ptr(sdd->sge, j));
}
free_percpu(sdd->sd);
sdd->sd = NULL;
@@ -6709,6 +6791,8 @@ static void __sdt_free(const struct cpumask *cpu_map)
sdd->sg = NULL;
free_percpu(sdd->sgc);
sdd->sgc = NULL;
+ free_percpu(sdd->sge);
+ sdd->sge = NULL;
}
}

@@ -6794,10 +6878,13 @@ static int build_sched_domains(const struct cpumask *cpu_map,

/* Calculate CPU capacity for physical packages and nodes */
for (i = nr_cpumask_bits-1; i >= 0; i--) {
+ struct sched_domain_topology_level *tl = sched_domain_topology;
+
if (!cpumask_test_cpu(i, cpu_map))
continue;

- for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+ for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent, tl++) {
+ init_sched_energy(i, sd, tl);
claim_allocations(i, sd);
init_sched_groups_capacity(i, sd);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7b687c6..b9d7057 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -890,6 +890,39 @@ static inline unsigned int group_first_cpu(struct sched_group *group)

extern int group_balance_cpu(struct sched_group *sg);

+/*
+ * Check that the per-cpu provided sd energy data is consistent for all cpus
+ * within the mask.
+ */
+static inline void check_sched_energy_data(int cpu, sched_domain_energy_f fn,
+ const struct cpumask *cpumask)
+{
+ struct cpumask mask;
+ int i;
+
+ cpumask_xor(&mask, cpumask, get_cpu_mask(cpu));
+
+ for_each_cpu(i, &mask) {
+ int y;
+
+ BUG_ON(fn(i)->nr_idle_states != fn(cpu)->nr_idle_states);
+
+ for (y = 0; y < (fn(i)->nr_idle_states); y++) {
+ BUG_ON(fn(i)->idle_states[y].power !=
+ fn(cpu)->idle_states[y].power);
+ }
+
+ BUG_ON(fn(i)->nr_cap_states != fn(cpu)->nr_cap_states);
+
+ for (y = 0; y < (fn(i)->nr_cap_states); y++) {
+ BUG_ON(fn(i)->cap_states[y].cap !=
+ fn(cpu)->cap_states[y].cap);
+ BUG_ON(fn(i)->cap_states[y].power !=
+ fn(cpu)->cap_states[y].power);
+ }
+ }
+}
+
#else

static inline void sched_ttwu_pending(void) { }
--
1.9.1

2015-07-07 19:01:29

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 17/46] sched: Introduce SD_SHARE_CAP_STATES sched_domain flag

cpufreq is currently keeping it a secret which cpus are sharing
clock source. The scheduler needs to know about clock domains as well
to become more energy aware. The SD_SHARE_CAP_STATES domain flag
indicates whether cpus belonging to the sched_domain share capacity
states (P-states).

There is no connection with cpufreq (yet). The flag must be set by
the arch specific topology code.

cc: Russell King <[email protected]>
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
arch/arm/kernel/topology.c | 3 ++-
include/linux/sched.h | 1 +
kernel/sched/core.c | 10 +++++++---
3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 5867587..b35d3e5 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -276,7 +276,8 @@ void store_cpu_topology(unsigned int cpuid)

static inline int cpu_corepower_flags(void)
{
- return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
+ return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN | \
+ SD_SHARE_CAP_STATES;
}

static struct sched_domain_topology_level arm_topology[] = {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c4d0e88..8ac2db8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -985,6 +985,7 @@ extern void wake_up_q(struct wake_q_head *head);
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
#define SD_NUMA 0x4000 /* cross-node balancing */
+#define SD_SHARE_CAP_STATES 0x8000 /* Domain members share capacity state */

#ifdef CONFIG_SCHED_SMT
static inline int cpu_smt_flags(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e09ded5..b70a5a7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5563,7 +5563,8 @@ static int sd_degenerate(struct sched_domain *sd)
SD_BALANCE_EXEC |
SD_SHARE_CPUCAPACITY |
SD_SHARE_PKG_RESOURCES |
- SD_SHARE_POWERDOMAIN)) {
+ SD_SHARE_POWERDOMAIN |
+ SD_SHARE_CAP_STATES)) {
if (sd->groups != sd->groups->next)
return 0;
}
@@ -5595,7 +5596,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
SD_SHARE_CPUCAPACITY |
SD_SHARE_PKG_RESOURCES |
SD_PREFER_SIBLING |
- SD_SHARE_POWERDOMAIN);
+ SD_SHARE_POWERDOMAIN |
+ SD_SHARE_CAP_STATES);
if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;
}
@@ -6256,6 +6258,7 @@ static int sched_domains_curr_level;
* SD_SHARE_PKG_RESOURCES - describes shared caches
* SD_NUMA - describes NUMA topologies
* SD_SHARE_POWERDOMAIN - describes shared power domain
+ * SD_SHARE_CAP_STATES - describes shared capacity states
*
* Odd one out:
* SD_ASYM_PACKING - describes SMT quirks
@@ -6265,7 +6268,8 @@ static int sched_domains_curr_level;
SD_SHARE_PKG_RESOURCES | \
SD_NUMA | \
SD_ASYM_PACKING | \
- SD_SHARE_POWERDOMAIN)
+ SD_SHARE_POWERDOMAIN | \
+ SD_SHARE_CAP_STATES)

static struct sched_domain *
sd_init(struct sched_domain_topology_level *tl, int cpu)
--
1.9.1

2015-07-07 18:24:24

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 18/46] arm: topology: Define TC2 energy and provide it to the scheduler

From: Dietmar Eggemann <[email protected]>

This patch is only here to be able to test provisioning of energy related
data from an arch topology shim layer to the scheduler. Since there is no
code today which deals with extracting energy related data from the dtb or
acpi, and process it in the topology shim layer, the content of the
sched_group_energy structures as well as the idle_state and capacity_state
arrays are hard-coded here.

This patch defines the sched_group_energy structure as well as the
idle_state and capacity_state array for the cluster (relates to sched
groups (sgs) in DIE sched domain level) and for the core (relates to sgs
in MC sd level) for a Cortex A7 as well as for a Cortex A15.
It further provides related implementations of the sched_domain_energy_f
functions (cpu_cluster_energy() and cpu_core_energy()).

To be able to propagate this information from the topology shim layer to
the scheduler, the elements of the arm_topology[] table have been
provisioned with the appropriate sched_domain_energy_f functions.

cc: Russell King <[email protected]>

Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm/kernel/topology.c | 118 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 115 insertions(+), 3 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index b35d3e5..bbe20c7 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -274,6 +274,119 @@ void store_cpu_topology(unsigned int cpuid)
cpu_topology[cpuid].socket_id, mpidr);
}

+/*
+ * ARM TC2 specific energy cost model data. There are no unit requirements for
+ * the data. Data can be normalized to any reference point, but the
+ * normalization must be consistent. That is, one bogo-joule/watt must be the
+ * same quantity for all data, but we don't care what it is.
+ */
+static struct idle_state idle_states_cluster_a7[] = {
+ { .power = 25 }, /* WFI */
+ { .power = 10 }, /* cluster-sleep-l */
+ };
+
+static struct idle_state idle_states_cluster_a15[] = {
+ { .power = 70 }, /* WFI */
+ { .power = 25 }, /* cluster-sleep-b */
+ };
+
+static struct capacity_state cap_states_cluster_a7[] = {
+ /* Cluster only power */
+ { .cap = 150, .power = 2967, }, /* 350 MHz */
+ { .cap = 172, .power = 2792, }, /* 400 MHz */
+ { .cap = 215, .power = 2810, }, /* 500 MHz */
+ { .cap = 258, .power = 2815, }, /* 600 MHz */
+ { .cap = 301, .power = 2919, }, /* 700 MHz */
+ { .cap = 344, .power = 2847, }, /* 800 MHz */
+ { .cap = 387, .power = 3917, }, /* 900 MHz */
+ { .cap = 430, .power = 4905, }, /* 1000 MHz */
+ };
+
+static struct capacity_state cap_states_cluster_a15[] = {
+ /* Cluster only power */
+ { .cap = 426, .power = 7920, }, /* 500 MHz */
+ { .cap = 512, .power = 8165, }, /* 600 MHz */
+ { .cap = 597, .power = 8172, }, /* 700 MHz */
+ { .cap = 682, .power = 8195, }, /* 800 MHz */
+ { .cap = 768, .power = 8265, }, /* 900 MHz */
+ { .cap = 853, .power = 8446, }, /* 1000 MHz */
+ { .cap = 938, .power = 11426, }, /* 1100 MHz */
+ { .cap = 1024, .power = 15200, }, /* 1200 MHz */
+ };
+
+static struct sched_group_energy energy_cluster_a7 = {
+ .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a7),
+ .idle_states = idle_states_cluster_a7,
+ .nr_cap_states = ARRAY_SIZE(cap_states_cluster_a7),
+ .cap_states = cap_states_cluster_a7,
+};
+
+static struct sched_group_energy energy_cluster_a15 = {
+ .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a15),
+ .idle_states = idle_states_cluster_a15,
+ .nr_cap_states = ARRAY_SIZE(cap_states_cluster_a15),
+ .cap_states = cap_states_cluster_a15,
+};
+
+static struct idle_state idle_states_core_a7[] = {
+ { .power = 0 }, /* WFI */
+ };
+
+static struct idle_state idle_states_core_a15[] = {
+ { .power = 0 }, /* WFI */
+ };
+
+static struct capacity_state cap_states_core_a7[] = {
+ /* Power per cpu */
+ { .cap = 150, .power = 187, }, /* 350 MHz */
+ { .cap = 172, .power = 275, }, /* 400 MHz */
+ { .cap = 215, .power = 334, }, /* 500 MHz */
+ { .cap = 258, .power = 407, }, /* 600 MHz */
+ { .cap = 301, .power = 447, }, /* 700 MHz */
+ { .cap = 344, .power = 549, }, /* 800 MHz */
+ { .cap = 387, .power = 761, }, /* 900 MHz */
+ { .cap = 430, .power = 1024, }, /* 1000 MHz */
+ };
+
+static struct capacity_state cap_states_core_a15[] = {
+ /* Power per cpu */
+ { .cap = 426, .power = 2021, }, /* 500 MHz */
+ { .cap = 512, .power = 2312, }, /* 600 MHz */
+ { .cap = 597, .power = 2756, }, /* 700 MHz */
+ { .cap = 682, .power = 3125, }, /* 800 MHz */
+ { .cap = 768, .power = 3524, }, /* 900 MHz */
+ { .cap = 853, .power = 3846, }, /* 1000 MHz */
+ { .cap = 938, .power = 5177, }, /* 1100 MHz */
+ { .cap = 1024, .power = 6997, }, /* 1200 MHz */
+ };
+
+static struct sched_group_energy energy_core_a7 = {
+ .nr_idle_states = ARRAY_SIZE(idle_states_core_a7),
+ .idle_states = idle_states_core_a7,
+ .nr_cap_states = ARRAY_SIZE(cap_states_core_a7),
+ .cap_states = cap_states_core_a7,
+};
+
+static struct sched_group_energy energy_core_a15 = {
+ .nr_idle_states = ARRAY_SIZE(idle_states_core_a15),
+ .idle_states = idle_states_core_a15,
+ .nr_cap_states = ARRAY_SIZE(cap_states_core_a15),
+ .cap_states = cap_states_core_a15,
+};
+
+/* sd energy functions */
+static inline const struct sched_group_energy *cpu_cluster_energy(int cpu)
+{
+ return cpu_topology[cpu].socket_id ? &energy_cluster_a7 :
+ &energy_cluster_a15;
+}
+
+static inline const struct sched_group_energy *cpu_core_energy(int cpu)
+{
+ return cpu_topology[cpu].socket_id ? &energy_core_a7 :
+ &energy_core_a15;
+}
+
static inline int cpu_corepower_flags(void)
{
return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN | \
@@ -282,10 +395,9 @@ static inline int cpu_corepower_flags(void)

static struct sched_domain_topology_level arm_topology[] = {
#ifdef CONFIG_SCHED_MC
- { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
- { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+ { cpu_coregroup_mask, cpu_corepower_flags, cpu_core_energy, SD_INIT_NAME(MC) },
#endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
+ { cpu_cpu_mask, 0, cpu_cluster_energy, SD_INIT_NAME(DIE) },
{ NULL, },
};

--
1.9.1

2015-07-07 19:01:08

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 19/46] sched: Compute cpu capacity available at current frequency

capacity_orig_of() returns the max available compute capacity of a cpu.
For scale-invariant utilization tracking and energy-aware scheduling
decisions it is useful to know the compute capacity available at the
current OPP of a cpu.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a95dad9..70f81fc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4791,6 +4791,17 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)

#endif

+/*
+ * Returns the current capacity of cpu after applying both
+ * cpu and freq scaling.
+ */
+static unsigned long capacity_curr_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_orig *
+ arch_scale_freq_capacity(NULL, cpu)
+ >> SCHED_CAPACITY_SHIFT;
+}
+
static inline bool energy_aware(void)
{
return sched_feat(ENERGY_AWARE);
--
1.9.1

2015-07-07 18:24:35

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 20/46] sched: Relocated get_cpu_usage() and change return type

Move get_cpu_usage() to an earlier position in fair.c and change return
type to unsigned long as negative usage doesn't make much sense. All
other load and capacity related functions use unsigned long including
the caller of get_cpu_usage().

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 78 ++++++++++++++++++++++++++---------------------------
1 file changed, 39 insertions(+), 39 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 70f81fc..78d3081 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4802,6 +4802,45 @@ static unsigned long capacity_curr_of(int cpu)
>> SCHED_CAPACITY_SHIFT;
}

+/*
+ * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
+ * tasks. The unit of the return value must be the one of capacity so we can
+ * compare the usage with the capacity of the CPU that is available for CFS
+ * task (ie cpu_capacity).
+ *
+ * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
+ * CPU. It represents the amount of utilization of a CPU in the range
+ * [0..capacity_orig] where capacity_orig is the cpu_capacity available at the
+ * highest frequency (arch_scale_freq_capacity()). The usage of a CPU converges
+ * towards a sum equal to or less than the current capacity (capacity_curr <=
+ * capacity_orig) of the CPU because it is the running time on this CPU scaled
+ * by capacity_curr. Nevertheless, cfs.utilization_load_avg can be higher than
+ * capacity_curr or even higher than capacity_orig because of unfortunate
+ * rounding in avg_period and running_load_avg or just after migrating tasks
+ * (and new task wakeups) until the average stabilizes with the new running
+ * time. We need to check that the usage stays into the range
+ * [0..capacity_orig] and cap if necessary. Without capping the usage, a group
+ * could be seen as overloaded (CPU0 usage at 121% + CPU1 usage at 80%) whereas
+ * CPU1 has 20% of available capacity. We allow usage to overshoot
+ * capacity_curr (but not capacity_orig) as it useful for predicting the
+ * capacity required after task migrations (scheduler-driven DVFS).
+ */
+
+static unsigned long get_cpu_usage(int cpu)
+{
+ int sum;
+ unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+ unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
+ unsigned long capacity_orig = capacity_orig_of(cpu);
+
+ sum = usage + blocked;
+
+ if (sum >= capacity_orig)
+ return capacity_orig;
+
+ return sum;
+}
+
static inline bool energy_aware(void)
{
return sched_feat(ENERGY_AWARE);
@@ -5055,45 +5094,6 @@ static int select_idle_sibling(struct task_struct *p, int target)
}

/*
- * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
- * tasks. The unit of the return value must be the one of capacity so we can
- * compare the usage with the capacity of the CPU that is available for CFS
- * task (ie cpu_capacity).
- *
- * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
- * CPU. It represents the amount of utilization of a CPU in the range
- * [0..capacity_orig] where capacity_orig is the cpu_capacity available at the
- * highest frequency (arch_scale_freq_capacity()). The usage of a CPU converges
- * towards a sum equal to or less than the current capacity (capacity_curr <=
- * capacity_orig) of the CPU because it is the running time on this CPU scaled
- * by capacity_curr. Nevertheless, cfs.utilization_load_avg can be higher than
- * capacity_curr or even higher than capacity_orig because of unfortunate
- * rounding in avg_period and running_load_avg or just after migrating tasks
- * (and new task wakeups) until the average stabilizes with the new running
- * time. We need to check that the usage stays into the range
- * [0..capacity_orig] and cap if necessary. Without capping the usage, a group
- * could be seen as overloaded (CPU0 usage at 121% + CPU1 usage at 80%) whereas
- * CPU1 has 20% of available capacity. We allow usage to overshoot
- * capacity_curr (but not capacity_orig) as it useful for predicting the
- * capacity required after task migrations (scheduler-driven DVFS).
- */
-
-static int get_cpu_usage(int cpu)
-{
- int sum;
- unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
- unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
- unsigned long capacity_orig = capacity_orig_of(cpu);
-
- sum = usage + blocked;
-
- if (sum >= capacity_orig)
- return capacity_orig;
-
- return sum;
-}
-
-/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
* SD_BALANCE_FORK, or SD_BALANCE_EXEC.
--
1.9.1

2015-07-07 18:24:50

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 21/46] sched: Highest energy aware balancing sched_domain level pointer

Add another member to the family of per-cpu sched_domain shortcut
pointers. This one, sd_ea, points to the highest level at which energy
model is provided. At this level and all levels below all sched_groups
have energy model data attached.

Partial energy model information is possible but restricted to providing
energy model data for lower level sched_domains (sd_ea and below) and
leaving load-balancing on levels above to non-energy-aware
load-balancing. For example, it is possible to apply energy-aware
scheduling within each socket on a multi-socket system and let normal
scheduling handle load-balancing between sockets.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/core.c | 11 ++++++++++-
kernel/sched/sched.h | 1 +
2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b70a5a7..c7fdd07 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5785,11 +5785,12 @@ DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
DEFINE_PER_CPU(struct sched_domain *, sd_busy);
DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_PER_CPU(struct sched_domain *, sd_ea);

static void update_top_cache_domain(int cpu)
{
struct sched_domain *sd;
- struct sched_domain *busy_sd = NULL;
+ struct sched_domain *busy_sd = NULL, *ea_sd = NULL;
int id = cpu;
int size = 1;

@@ -5810,6 +5811,14 @@ static void update_top_cache_domain(int cpu)

sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
+
+ for_each_domain(cpu, sd) {
+ if (sd->groups->sge)
+ ea_sd = sd;
+ else
+ break;
+ }
+ rcu_assign_pointer(per_cpu(sd_ea, cpu), ea_sd);
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b9d7057..8a51692 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -829,6 +829,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
DECLARE_PER_CPU(struct sched_domain *, sd_busy);
DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(struct sched_domain *, sd_ea);

struct sched_group_capacity {
atomic_t ref;
--
1.9.1

2015-07-07 19:00:45

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 22/46] sched: Calculate energy consumption of sched_group

For energy-aware load-balancing decisions it is necessary to know the
energy consumption estimates of groups of cpus. This patch introduces a
basic function, sched_group_energy(), which estimates the energy
consumption of the cpus in the group and any resources shared by the
members of the group.

NOTE: The function has five levels of identation and breaks the 80
character limit. Refactoring is necessary.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 146 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 146 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 78d3081..bd0be9d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4846,6 +4846,152 @@ static inline bool energy_aware(void)
return sched_feat(ENERGY_AWARE);
}

+/*
+ * cpu_norm_usage() returns the cpu usage relative to a specific capacity,
+ * i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
+ * energy calculations. Using the scale-invariant usage returned by
+ * get_cpu_usage() and approximating scale-invariant usage by:
+ *
+ * usage ~ (curr_freq/max_freq)*1024 * capacity_orig/1024 * running_time/time
+ *
+ * the normalized usage can be found using the specific capacity.
+ *
+ * capacity = capacity_orig * curr_freq/max_freq
+ *
+ * norm_usage = running_time/time ~ usage/capacity
+ */
+static unsigned long cpu_norm_usage(int cpu, unsigned long capacity)
+{
+ int usage = __get_cpu_usage(cpu);
+
+ if (usage >= capacity)
+ return SCHED_CAPACITY_SCALE;
+
+ return (usage << SCHED_CAPACITY_SHIFT)/capacity;
+}
+
+static unsigned long group_max_usage(struct sched_group *sg)
+{
+ int i;
+ unsigned long max_usage = 0;
+
+ for_each_cpu(i, sched_group_cpus(sg))
+ max_usage = max(max_usage, get_cpu_usage(i));
+
+ return max_usage;
+}
+
+/*
+ * group_norm_usage() returns the approximated group usage relative to it's
+ * current capacity (busy ratio) in the range [0..SCHED_LOAD_SCALE] for use in
+ * energy calculations. Since task executions may or may not overlap in time in
+ * the group the true normalized usage is between max(cpu_norm_usage(i)) and
+ * sum(cpu_norm_usage(i)) when iterating over all cpus in the group, i. The
+ * latter is used as the estimate as it leads to a more pessimistic energy
+ * estimate (more busy).
+ */
+static unsigned long group_norm_usage(struct sched_group *sg, int cap_idx)
+{
+ int i;
+ unsigned long usage_sum = 0;
+ unsigned long capacity = sg->sge->cap_states[cap_idx].cap;
+
+ for_each_cpu(i, sched_group_cpus(sg))
+ usage_sum += cpu_norm_usage(i, capacity);
+
+ if (usage_sum > SCHED_CAPACITY_SCALE)
+ return SCHED_CAPACITY_SCALE;
+ return usage_sum;
+}
+
+static int find_new_capacity(struct sched_group *sg,
+ struct sched_group_energy *sge)
+{
+ int idx;
+ unsigned long util = group_max_usage(sg);
+
+ for (idx = 0; idx < sge->nr_cap_states; idx++) {
+ if (sge->cap_states[idx].cap >= util)
+ return idx;
+ }
+
+ return idx;
+}
+
+/*
+ * sched_group_energy(): Returns absolute energy consumption of cpus belonging
+ * to the sched_group including shared resources shared only by members of the
+ * group. Iterates over all cpus in the hierarchy below the sched_group starting
+ * from the bottom working it's way up before going to the next cpu until all
+ * cpus are covered at all levels. The current implementation is likely to
+ * gather the same usage statistics multiple times. This can probably be done in
+ * a faster but more complex way.
+ */
+static unsigned int sched_group_energy(struct sched_group *sg_top)
+{
+ struct sched_domain *sd;
+ int cpu, total_energy = 0;
+ struct cpumask visit_cpus;
+ struct sched_group *sg;
+
+ WARN_ON(!sg_top->sge);
+
+ cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
+
+ while (!cpumask_empty(&visit_cpus)) {
+ struct sched_group *sg_shared_cap = NULL;
+
+ cpu = cpumask_first(&visit_cpus);
+
+ /*
+ * Is the group utilization affected by cpus outside this
+ * sched_group?
+ */
+ sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
+ if (sd && sd->parent)
+ sg_shared_cap = sd->parent->groups;
+
+ for_each_domain(cpu, sd) {
+ sg = sd->groups;
+
+ /* Has this sched_domain already been visited? */
+ if (sd->child && group_first_cpu(sg) != cpu)
+ break;
+
+ do {
+ struct sched_group *sg_cap_util;
+ unsigned long group_util;
+ int sg_busy_energy, sg_idle_energy, cap_idx;
+
+ if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
+ sg_cap_util = sg_shared_cap;
+ else
+ sg_cap_util = sg;
+
+ cap_idx = find_new_capacity(sg_cap_util, sg->sge);
+ group_util = group_norm_usage(sg, cap_idx);
+ sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
+ >> SCHED_CAPACITY_SHIFT;
+ sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
+ >> SCHED_CAPACITY_SHIFT;
+
+ total_energy += sg_busy_energy + sg_idle_energy;
+
+ if (!sd->child)
+ cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
+
+ if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(sg_top)))
+ goto next_cpu;
+
+ } while (sg = sg->next, sg != sd->groups);
+ }
+next_cpu:
+ continue;
+ }
+
+ return total_energy;
+}
+
static int wake_wide(struct task_struct *p)
{
int factor = this_cpu_read(sd_llc_size);
--
1.9.1

2015-07-07 19:00:22

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 23/46] sched: Extend sched_group_energy to test load-balancing decisions

Extended sched_group_energy() to support energy prediction with usage
(tasks) added/removed from a specific cpu or migrated between a pair of
cpus. Useful for load-balancing decision making.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 86 +++++++++++++++++++++++++++++++++++++----------------
1 file changed, 60 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd0be9d..362e33b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4825,15 +4825,17 @@ static unsigned long capacity_curr_of(int cpu)
* capacity_curr (but not capacity_orig) as it useful for predicting the
* capacity required after task migrations (scheduler-driven DVFS).
*/
-
-static unsigned long get_cpu_usage(int cpu)
+static unsigned long __get_cpu_usage(int cpu, int delta)
{
int sum;
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
unsigned long capacity_orig = capacity_orig_of(cpu);

- sum = usage + blocked;
+ sum = usage + blocked + delta;
+
+ if (sum < 0)
+ return 0;

if (sum >= capacity_orig)
return capacity_orig;
@@ -4841,13 +4843,28 @@ static unsigned long get_cpu_usage(int cpu)
return sum;
}

+static unsigned long get_cpu_usage(int cpu)
+{
+ return __get_cpu_usage(cpu, 0);
+}
+
static inline bool energy_aware(void)
{
return sched_feat(ENERGY_AWARE);
}

+struct energy_env {
+ struct sched_group *sg_top;
+ struct sched_group *sg_cap;
+ int cap_idx;
+ int usage_delta;
+ int src_cpu;
+ int dst_cpu;
+ int energy;
+};
+
/*
- * cpu_norm_usage() returns the cpu usage relative to a specific capacity,
+ * __cpu_norm_usage() returns the cpu usage relative to a specific capacity,
* i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
* energy calculations. Using the scale-invariant usage returned by
* get_cpu_usage() and approximating scale-invariant usage by:
@@ -4860,9 +4877,9 @@ static inline bool energy_aware(void)
*
* norm_usage = running_time/time ~ usage/capacity
*/
-static unsigned long cpu_norm_usage(int cpu, unsigned long capacity)
+static unsigned long __cpu_norm_usage(int cpu, unsigned long capacity, int delta)
{
- int usage = __get_cpu_usage(cpu);
+ int usage = __get_cpu_usage(cpu, delta);

if (usage >= capacity)
return SCHED_CAPACITY_SCALE;
@@ -4870,13 +4887,25 @@ static unsigned long cpu_norm_usage(int cpu, unsigned long capacity)
return (usage << SCHED_CAPACITY_SHIFT)/capacity;
}

-static unsigned long group_max_usage(struct sched_group *sg)
+static int calc_usage_delta(struct energy_env *eenv, int cpu)
{
- int i;
+ if (cpu == eenv->src_cpu)
+ return -eenv->usage_delta;
+ if (cpu == eenv->dst_cpu)
+ return eenv->usage_delta;
+ return 0;
+}
+
+static
+unsigned long group_max_usage(struct energy_env *eenv)
+{
+ int i, delta;
unsigned long max_usage = 0;

- for_each_cpu(i, sched_group_cpus(sg))
- max_usage = max(max_usage, get_cpu_usage(i));
+ for_each_cpu(i, sched_group_cpus(eenv->sg_cap)) {
+ delta = calc_usage_delta(eenv, i);
+ max_usage = max(max_usage, __get_cpu_usage(i, delta));
+ }

return max_usage;
}
@@ -4890,31 +4919,36 @@ static unsigned long group_max_usage(struct sched_group *sg)
* latter is used as the estimate as it leads to a more pessimistic energy
* estimate (more busy).
*/
-static unsigned long group_norm_usage(struct sched_group *sg, int cap_idx)
+static unsigned
+long group_norm_usage(struct energy_env *eenv, struct sched_group *sg)
{
- int i;
+ int i, delta;
unsigned long usage_sum = 0;
- unsigned long capacity = sg->sge->cap_states[cap_idx].cap;
+ unsigned long capacity = sg->sge->cap_states[eenv->cap_idx].cap;

- for_each_cpu(i, sched_group_cpus(sg))
- usage_sum += cpu_norm_usage(i, capacity);
+ for_each_cpu(i, sched_group_cpus(sg)) {
+ delta = calc_usage_delta(eenv, i);
+ usage_sum += __cpu_norm_usage(i, capacity, delta);
+ }

if (usage_sum > SCHED_CAPACITY_SCALE)
return SCHED_CAPACITY_SCALE;
return usage_sum;
}

-static int find_new_capacity(struct sched_group *sg,
+static int find_new_capacity(struct energy_env *eenv,
struct sched_group_energy *sge)
{
int idx;
- unsigned long util = group_max_usage(sg);
+ unsigned long util = group_max_usage(eenv);

for (idx = 0; idx < sge->nr_cap_states; idx++) {
if (sge->cap_states[idx].cap >= util)
return idx;
}

+ eenv->cap_idx = idx;
+
return idx;
}

@@ -4927,16 +4961,16 @@ static int find_new_capacity(struct sched_group *sg,
* gather the same usage statistics multiple times. This can probably be done in
* a faster but more complex way.
*/
-static unsigned int sched_group_energy(struct sched_group *sg_top)
+static unsigned int sched_group_energy(struct energy_env *eenv)
{
struct sched_domain *sd;
int cpu, total_energy = 0;
struct cpumask visit_cpus;
struct sched_group *sg;

- WARN_ON(!sg_top->sge);
+ WARN_ON(!eenv->sg_top->sge);

- cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
+ cpumask_copy(&visit_cpus, sched_group_cpus(eenv->sg_top));

while (!cpumask_empty(&visit_cpus)) {
struct sched_group *sg_shared_cap = NULL;
@@ -4959,17 +4993,16 @@ static unsigned int sched_group_energy(struct sched_group *sg_top)
break;

do {
- struct sched_group *sg_cap_util;
unsigned long group_util;
int sg_busy_energy, sg_idle_energy, cap_idx;

if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
- sg_cap_util = sg_shared_cap;
+ eenv->sg_cap = sg_shared_cap;
else
- sg_cap_util = sg;
+ eenv->sg_cap = sg;

- cap_idx = find_new_capacity(sg_cap_util, sg->sge);
- group_util = group_norm_usage(sg, cap_idx);
+ cap_idx = find_new_capacity(eenv, sg->sge);
+ group_util = group_norm_usage(eenv, sg);
sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
>> SCHED_CAPACITY_SHIFT;
sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
@@ -4980,7 +5013,7 @@ static unsigned int sched_group_energy(struct sched_group *sg_top)
if (!sd->child)
cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));

- if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(sg_top)))
+ if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(eenv->sg_top)))
goto next_cpu;

} while (sg = sg->next, sg != sd->groups);
@@ -4989,6 +5022,7 @@ static unsigned int sched_group_energy(struct sched_group *sg_top)
continue;
}

+ eenv->energy = total_energy;
return total_energy;
}

--
1.9.1

2015-07-07 19:00:12

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 24/46] sched: Estimate energy impact of scheduling decisions

Adds a generic energy-aware helper function, energy_diff(), that
calculates energy impact of adding, removing, and migrating utilization
in the system.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 362e33b..bf1d34c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5026,6 +5026,57 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
return total_energy;
}

+/*
+ * energy_diff(): Estimate the energy impact of changing the utilization
+ * distribution. eenv specifies the change: utilisation amount, source, and
+ * destination cpu. Source or destination cpu may be -1 in which case the
+ * utilization is removed from or added to the system (e.g. task wake-up). If
+ * both are specified, the utilization is migrated.
+ */
+static int energy_diff(struct energy_env *eenv)
+{
+ struct sched_domain *sd;
+ struct sched_group *sg;
+ int sd_cpu = -1, energy_before = 0, energy_after = 0;
+
+ struct energy_env eenv_before = {
+ .usage_delta = 0,
+ .src_cpu = eenv->src_cpu,
+ .dst_cpu = eenv->dst_cpu,
+ };
+
+ if (eenv->src_cpu == eenv->dst_cpu)
+ return 0;
+
+ sd_cpu = (eenv->src_cpu != -1) ? eenv->src_cpu : eenv->dst_cpu;
+ sd = rcu_dereference(per_cpu(sd_ea, sd_cpu));
+
+ if (!sd)
+ return 0; /* Error */
+
+ sg = sd->groups;
+ do {
+ if (eenv->src_cpu != -1 && cpumask_test_cpu(eenv->src_cpu,
+ sched_group_cpus(sg))) {
+ eenv_before.sg_top = eenv->sg_top = sg;
+ energy_before += sched_group_energy(&eenv_before);
+ energy_after += sched_group_energy(eenv);
+
+ /* src_cpu and dst_cpu may belong to the same group */
+ continue;
+ }
+
+ if (eenv->dst_cpu != -1 && cpumask_test_cpu(eenv->dst_cpu,
+ sched_group_cpus(sg))) {
+ eenv_before.sg_top = eenv->sg_top = sg;
+ energy_before += sched_group_energy(&eenv_before);
+ energy_after += sched_group_energy(eenv);
+ }
+ } while (sg = sg->next, sg != sd->groups);
+
+ return energy_after-energy_before;
+}
+
static int wake_wide(struct task_struct *p)
{
int factor = this_cpu_read(sd_llc_size);
--
1.9.1

2015-07-07 18:24:57

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 25/46] sched: Add over-utilization/tipping point indicator

Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
above the tipping point task placement is done the traditional way,
spreading the tasks across as many cpus as possible based on priority
scaled load to preserve smp_nice.

The over-utilization condition is conservatively chosen to indicate
over-utilization as soon as one cpu is fully utilized at it's highest
frequency. We don't consider groups as lumping usage and capacity
together for a group of cpus may hide the fact that one or more cpus in
the group are over-utilized while group-siblings are partially idle. The
tasks could be served better if moved to another group with completely
idle cpus. This is particularly problematic if some cpus have a
significantly reduced capacity due to RT/IRQ pressure or if the system
has cpus of different capacity (e.g. ARM big.LITTLE).

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 35 +++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 3 +++
2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf1d34c..99e43ee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4281,6 +4281,8 @@ static inline void hrtick_update(struct rq *rq)
}
#endif

+static bool cpu_overutilized(int cpu);
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -4291,6 +4293,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ int task_new = !(flags & ENQUEUE_WAKEUP);

for_each_sched_entity(se) {
if (se->on_rq)
@@ -4325,6 +4328,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se) {
update_rq_runnable_avg(rq, rq->nr_running);
add_nr_running(rq, 1);
+ if (!task_new && !rq->rd->overutilized &&
+ cpu_overutilized(rq->cpu))
+ rq->rd->overutilized = true;
}
hrtick_update(rq);
}
@@ -4952,6 +4958,14 @@ static int find_new_capacity(struct energy_env *eenv,
return idx;
}

+static unsigned int capacity_margin = 1280; /* ~20% margin */
+
+static bool cpu_overutilized(int cpu)
+{
+ return (capacity_of(cpu) * 1024) <
+ (get_cpu_usage(cpu) * capacity_margin);
+}
+
/*
* sched_group_energy(): Returns absolute energy consumption of cpus belonging
* to the sched_group including shared resources shared only by members of the
@@ -6756,11 +6770,12 @@ static enum group_type group_classify(struct lb_env *env,
* @local_group: Does group contain this_cpu.
* @sgs: variable to hold the statistics for this group.
* @overload: Indicate more than one runnable task for any CPU.
+ * @overutilized: Indicate overutilization for any CPU.
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
int local_group, struct sg_lb_stats *sgs,
- bool *overload)
+ bool *overload, bool *overutilized)
{
unsigned long load;
int i;
@@ -6790,6 +6805,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->sum_weighted_load += weighted_cpuload(i);
if (idle_cpu(i))
sgs->idle_cpus++;
+
+ if (cpu_overutilized(i))
+ *overutilized = true;
}

/* Adjust by relative CPU capacity of the group */
@@ -6895,7 +6913,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats tmp_sgs;
int load_idx, prefer_sibling = 0;
- bool overload = false;
+ bool overload = false, overutilized = false;

if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;
@@ -6917,7 +6935,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}

update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
- &overload);
+ &overload, &overutilized);

if (local_group)
goto next_group;
@@ -6959,8 +6977,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* update overload indicator if we are at root domain */
if (env->dst_rq->rd->overload != overload)
env->dst_rq->rd->overload = overload;
- }

+ /* Update over-utilization (tipping point, U >= 0) indicator */
+ if (env->dst_rq->rd->overutilized != overutilized)
+ env->dst_rq->rd->overutilized = overutilized;
+ } else {
+ if (!env->dst_rq->rd->overutilized && overutilized)
+ env->dst_rq->rd->overutilized = true;
+ }
}

/**
@@ -8324,6 +8348,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
task_tick_numa(rq, curr);

update_rq_runnable_avg(rq, 1);
+
+ if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
+ rq->rd->overutilized = true;
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8a51692..fbe2da0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -535,6 +535,9 @@ struct root_domain {
/* Indicate more than one runnable task for any CPU */
bool overload;

+ /* Indicate one or more cpus over-utilized (tipping point) */
+ bool overutilized;
+
/*
* The bit corresponding to a CPU gets set here if such CPU has more
* than one runnable -deadline task (as it is below for RT tasks).
--
1.9.1

2015-07-07 18:25:07

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 26/46] sched: Store system-wide maximum cpu capacity in root domain

From: Dietmar Eggemann <[email protected]>

To be able to compare the capacity of the target cpu with the highest
cpu capacity of the system in the wakeup path, store the system-wide
maximum cpu capacity in the root domain.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/core.c | 8 ++++++++
kernel/sched/sched.h | 3 +++
2 files changed, 11 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c7fdd07..fe4b361 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6853,6 +6853,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
enum s_alloc alloc_state;
struct sched_domain *sd;
struct s_data d;
+ struct rq *rq;
int i, ret = -ENOMEM;

alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
@@ -6906,11 +6907,18 @@ static int build_sched_domains(const struct cpumask *cpu_map,
/* Attach the domains */
rcu_read_lock();
for_each_cpu(i, cpu_map) {
+ rq = cpu_rq(i);
sd = *per_cpu_ptr(d.sd, i);
cpu_attach_domain(sd, d.rd, i);
+
+ if (rq->cpu_capacity_orig > rq->rd->max_cpu_capacity)
+ rq->rd->max_cpu_capacity = rq->cpu_capacity_orig;
}
rcu_read_unlock();

+ rq = cpu_rq(cpumask_first(cpu_map));
+ pr_info("Max cpu capacity: %lu\n", rq->rd->max_cpu_capacity);
+
ret = 0;
error:
__free_domain_allocs(&d, alloc_state, cpu_map);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fbe2da0..9589d9e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -553,6 +553,9 @@ struct root_domain {
*/
cpumask_var_t rto_mask;
struct cpupri cpupri;
+
+ /* Maximum cpu capacity in the system. */
+ unsigned long max_cpu_capacity;
};

extern struct root_domain def_root_domain;
--
1.9.1

2015-07-07 18:25:17

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 27/46] sched, cpuidle: Track cpuidle state index in the scheduler

The idle-state of each cpu is currently pointed to by rq->idle_state but
there isn't any information in the struct cpuidle_state that can used to
look up the idle-state energy model data stored in struct
sched_group_energy. For this purpose is necessary to store the idle
state index as well. Ideally, the idle-state data should be unified.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/idle.c | 2 ++
kernel/sched/sched.h | 21 +++++++++++++++++++++
2 files changed, 23 insertions(+)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index fefcb1f..6832fa1 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -151,6 +151,7 @@ static void cpuidle_idle_call(void)

/* Take note of the planned idle state. */
idle_set_state(this_rq(), &drv->states[next_state]);
+ idle_set_state_idx(this_rq(), next_state);

/*
* Enter the idle state previously returned by the governor decision.
@@ -161,6 +162,7 @@ static void cpuidle_idle_call(void)

/* The cpu is no longer idle or about to enter idle. */
idle_set_state(this_rq(), NULL);
+ idle_set_state_idx(this_rq(), -1);

if (entered_state == -EBUSY)
goto use_default;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9589d9e..c395559 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -701,6 +701,7 @@ struct rq {
#ifdef CONFIG_CPU_IDLE
/* Must be inspected within a rcu lock section */
struct cpuidle_state *idle_state;
+ int idle_state_idx;
#endif
};

@@ -1316,6 +1317,17 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
WARN_ON(!rcu_read_lock_held());
return rq->idle_state;
}
+
+static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)
+{
+ rq->idle_state_idx = idle_state_idx;
+}
+
+static inline int idle_get_state_idx(struct rq *rq)
+{
+ WARN_ON(!rcu_read_lock_held());
+ return rq->idle_state_idx;
+}
#else
static inline void idle_set_state(struct rq *rq,
struct cpuidle_state *idle_state)
@@ -1326,6 +1338,15 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
{
return NULL;
}
+
+static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)
+{
+}
+
+static inline int idle_get_state_idx(struct rq *rq)
+{
+ return -1;
+}
#endif

extern void sysrq_sched_debug_show(void);
--
1.9.1

2015-07-07 18:59:43

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 28/46] sched: Count number of shallower idle-states in struct sched_group_energy

cpuidle associates all idle-states with each cpu while the energy model
associates them with the sched_group covering the cpus coordinating
entry to the idle-state. To look up the idle-state power consumption in
the energy model it is therefore necessary to translate from cpuidle
idle-state index to energy model index. For this purpose it is helpful
to know how many idle-states that are listed in lower level sched_groups
(in struct sched_group_energy).

Example: ARMv8 big.LITTLE JUNO (Cortex A57, A53) idle-states:
Idle-state cpuidle Energy model table indices
index per-cpu sg per-cluster sg
WFI 0 0 (0)
Core power-down 1 1 0*
Cluster power-down 2 (1) 1

For per-cpu sgs no translation is required. If cpuidle reports state
index 0 or 1, the cpu is in WFI or core power-down, respectively. We can
look the idle-power up directly in the sg energy model table. Idle-state
cluster power-down, is represented in the per-cluster sg energy model
table as index 1. Index 0* is reserved for cluster power consumption
when the cpus all are in state 0 or 1, but cpuidle decided not to go for
cluster power-down. Given the index from cpuidle we can compute the
correct index in the energy model tables for the sgs at each level if we
know how many states are in the tables in the child sgs. The actual
translation is implemented in a later patch.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 12 ++++++++++++
2 files changed, 13 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8ac2db8..6b5e44d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1031,6 +1031,7 @@ struct sched_group_energy {
atomic_t ref;
unsigned int nr_idle_states; /* number of idle states */
struct idle_state *idle_states; /* ptr to idle state array */
+ unsigned int nr_idle_states_below; /* number idle states in lower groups */
unsigned int nr_cap_states; /* number of capacity states */
struct capacity_state *cap_states; /* ptr to capacity state array */
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe4b361..c13fa9c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6117,6 +6117,7 @@ static void init_sched_energy(int cpu, struct sched_domain *sd,
struct sched_group_energy *sge = sg->sge;
sched_domain_energy_f fn = tl->energy;
struct cpumask *mask = sched_group_cpus(sg);
+ int nr_idle_states_below = 0;

if (fn && sd->child && !sd->child->groups->sge) {
pr_err("BUG: EAS setup broken for CPU%d\n", cpu);
@@ -6140,7 +6141,18 @@ static void init_sched_energy(int cpu, struct sched_domain *sd,
if (cpumask_weight(mask) > 1)
check_sched_energy_data(cpu, fn, mask);

+ /* Figure out the number of true cpuidle states below current group */
+ sd = sd->child;
+ for_each_lower_domain(sd) {
+ nr_idle_states_below += sd->groups->sge->nr_idle_states;
+
+ /* Disregard non-cpuidle 'active' idle states */
+ if (sd->child)
+ nr_idle_states_below--;
+ }
+
sge->nr_idle_states = fn(cpu)->nr_idle_states;
+ sge->nr_idle_states_below = nr_idle_states_below;
sge->nr_cap_states = fn(cpu)->nr_cap_states;
sge->idle_states = (struct idle_state *)
((void *)&sge->cap_states +
--
1.9.1

2015-07-07 18:59:09

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 29/46] sched: Determine the current sched_group idle-state

From: Dietmar Eggemann <[email protected]>

To estimate the energy consumption of a sched_group in
sched_group_energy() it is necessary to know which idle-state the group
is in when it is idle. For now, it is assumed that this is the current
idle-state (though it might be wrong). Based on the individual cpu
idle-states group_idle_state() finds the group idle-state.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 27 +++++++++++++++++++++++----
1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 99e43ee..a134028 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4966,6 +4966,22 @@ static bool cpu_overutilized(int cpu)
(get_cpu_usage(cpu) * capacity_margin);
}

+static int group_idle_state(struct sched_group *sg)
+{
+ int i, state = INT_MAX;
+
+ /* Find the shallowest idle state in the sched group. */
+ for_each_cpu(i, sched_group_cpus(sg))
+ state = min(state, idle_get_state_idx(cpu_rq(i)));
+
+ /* Transform system into sched domain idle state. */
+ if (sg->sge->nr_idle_states_below > 1)
+ state -= sg->sge->nr_idle_states_below - 1;
+
+ /* Clamp state to the range of sched domain idle states. */
+ return clamp_t(int, state, 0, sg->sge->nr_idle_states - 1);
+}
+
/*
* sched_group_energy(): Returns absolute energy consumption of cpus belonging
* to the sched_group including shared resources shared only by members of the
@@ -5008,7 +5024,8 @@ static unsigned int sched_group_energy(struct energy_env *eenv)

do {
unsigned long group_util;
- int sg_busy_energy, sg_idle_energy, cap_idx;
+ int sg_busy_energy, sg_idle_energy;
+ int cap_idx, idle_idx;

if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
eenv->sg_cap = sg_shared_cap;
@@ -5016,11 +5033,13 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
eenv->sg_cap = sg;

cap_idx = find_new_capacity(eenv, sg->sge);
+ idle_idx = group_idle_state(sg);
group_util = group_norm_usage(eenv, sg);
sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
- >> SCHED_CAPACITY_SHIFT;
- sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
- >> SCHED_CAPACITY_SHIFT;
+ >> SCHED_CAPACITY_SHIFT;
+ sg_idle_energy = ((SCHED_LOAD_SCALE-group_util)
+ * sg->sge->idle_states[idle_idx].power)
+ >> SCHED_CAPACITY_SHIFT;

total_energy += sg_busy_energy + sg_idle_energy;

--
1.9.1

2015-07-07 18:59:05

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 30/46] sched: Add cpu capacity awareness to wakeup balancing

Wakeup balancing is completely unaware of cpu capacity, cpu usage and
task utilization. The task is preferably placed on a cpu which is idle
in the instant the wakeup happens. New tasks (SD_BALANCE_{FORK,EXEC} are
placed on an idle cpu in the idlest group if such can be found, otherwise
it goes on the least loaded one. Existing tasks (SD_BALANCE_WAKE) are
placed on the previous cpu or an idle cpu sharing the same last level
cache. Hence existing tasks don't get a chance to migrate to a different
group at wakeup in case the current one has reduced cpu capacity (due
RT/IRQ pressure or different uarch e.g. ARM big.LITTLE). They may
eventually get pulled by other cpus doing periodic/idle/nohz_idle
balance, but it may take quite a while before it happens.

This patch adds capacity awareness to find_idlest_{group,queue} (used by
SD_BALANCE_{FORK,EXEC}) such that groups/cpus that can accommodate the
waking task based on task utilization are preferred. In addition, wakeup
of existing tasks (SD_BALANCE_WAKE) is sent through
find_idlest_{group,queue} if the task doesn't fit the capacity of the
previous cpu to allow it to escape (override wake_affine) when
necessary instead of relying on periodic/idle/nohz_idle balance to
eventually sort it out.

The patch doesn't depend on any energy model infrastructure, but it is
kept behind the energy_aware() static key despite being primarily a
performance optimization as it may increase scheduler overhead slightly.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c13fa9c..a41bb32 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6330,7 +6330,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
| 1*SD_BALANCE_NEWIDLE
| 1*SD_BALANCE_EXEC
| 1*SD_BALANCE_FORK
- | 0*SD_BALANCE_WAKE
+ | 1*SD_BALANCE_WAKE
| 1*SD_WAKE_AFFINE
| 0*SD_SHARE_CPUCAPACITY
| 0*SD_SHARE_PKG_RESOURCES
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a134028..b0294f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5205,6 +5205,39 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
return 1;
}

+static inline unsigned long task_utilization(struct task_struct *p)
+{
+ return p->se.avg.utilization_avg_contrib;
+}
+
+static inline bool __task_fits(struct task_struct *p, int cpu, int usage)
+{
+ unsigned long capacity = capacity_of(cpu);
+
+ usage += task_utilization(p);
+
+ return (capacity * 1024) > (usage * capacity_margin);
+}
+
+static inline bool task_fits_capacity(struct task_struct *p, int cpu)
+{
+ unsigned long capacity = capacity_of(cpu);
+ unsigned long max_capacity = cpu_rq(cpu)->rd->max_cpu_capacity;
+
+ if (capacity == max_capacity)
+ return true;
+
+ if (capacity * capacity_margin > max_capacity * 1024)
+ return true;
+
+ return __task_fits(p, cpu, 0);
+}
+
+static inline bool task_fits_cpu(struct task_struct *p, int cpu)
+{
+ return __task_fits(p, cpu, get_cpu_usage(cpu));
+}
+
/*
* find_idlest_group finds and returns the least busy CPU group within the
* domain.
@@ -5214,7 +5247,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int sd_flag)
{
struct sched_group *idlest = NULL, *group = sd->groups;
+ struct sched_group *fit_group = NULL;
unsigned long min_load = ULONG_MAX, this_load = 0;
+ unsigned long fit_capacity = ULONG_MAX;
int load_idx = sd->forkexec_idx;
int imbalance = 100 + (sd->imbalance_pct-100)/2;

@@ -5245,6 +5280,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
load = target_load(i, load_idx);

avg_load += load;
+
+ /*
+ * Look for most energy-efficient group that can fit
+ * that can fit the task.
+ */
+ if (energy_aware() && capacity_of(i) < fit_capacity &&
+ task_fits_cpu(p, i)) {
+ fit_capacity = capacity_of(i);
+ fit_group = group;
+ }
}

/* Adjust by relative CPU capacity of the group */
@@ -5258,6 +5303,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
}
} while (group = group->next, group != sd->groups);

+ if (fit_group)
+ return fit_group;
+
if (!idlest || 100*this_load < imbalance*min_load)
return NULL;
return idlest;
@@ -5278,7 +5326,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)

/* Traverse only the allowed CPUs */
for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
- if (idle_cpu(i)) {
+ if (task_fits_cpu(p, i)) {
struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
@@ -5290,7 +5338,8 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
min_exit_latency = idle->exit_latency;
latest_idle_timestamp = rq->idle_stamp;
shallowest_idle_cpu = i;
- } else if ((!idle || idle->exit_latency == min_exit_latency) &&
+ } else if (idle_cpu(i) &&
+ (!idle || idle->exit_latency == min_exit_latency) &&
rq->idle_stamp > latest_idle_timestamp) {
/*
* If equal or no active idle state, then
@@ -5299,6 +5348,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
*/
latest_idle_timestamp = rq->idle_stamp;
shallowest_idle_cpu = i;
+ } else if (shallowest_idle_cpu == -1) {
+ /*
+ * If we haven't found an idle CPU yet
+ * pick a non-idle one that can fit the task as
+ * fallback.
+ */
+ shallowest_idle_cpu = i;
}
} else if (shallowest_idle_cpu == -1) {
load = weighted_cpuload(i);
@@ -5376,9 +5432,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
int cpu = smp_processor_id();
int new_cpu = cpu;
int want_affine = 0;
+ int want_sibling = true;
int sync = wake_flags & WF_SYNC;

- if (sd_flag & SD_BALANCE_WAKE)
+ /* Check if prev_cpu can fit us ignoring its current usage */
+ if (energy_aware() && !task_fits_capacity(p, prev_cpu))
+ want_sibling = false;
+
+ if (sd_flag & SD_BALANCE_WAKE && want_sibling)
want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));

rcu_read_lock();
@@ -5403,7 +5464,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
prev_cpu = cpu;

- if (sd_flag & SD_BALANCE_WAKE) {
+ if (sd_flag & SD_BALANCE_WAKE && want_sibling) {
new_cpu = select_idle_sibling(p, prev_cpu);
goto unlock;
}
--
1.9.1

2015-07-07 18:58:47

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 31/46] sched: Consider spare cpu capacity at task wake-up

In mainline find_idlest_group() selects the wake-up target group purely
based on group load which leads to suboptimal choices in low load
scenarios. An idle group with reduced capacity (due to RT tasks or
different cpu type) isn't necessarily a better target than a lightly
loaded group with higher capacity.

The patch adds spare capacity as an additional group selection
parameter. The target group is now selected based on the following
criteria listed by highest priority first:

1. If energy-aware scheduling is enabled the group with the lowest
capacity containing a cpu with enough spare capacity to accommodate the
task (with a bit to spare) is selected if such exists.

2. Return the group with the cpu with most spare capacity and this
capacity is significant if such group exists. Significant spare capacity
is currently at least 20% to spare.

3. Return the group with the lowest load, unless it is the local group
in which case NULL is returned and the search is continued at the next
(lower) level.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b0294f0..0f7dbda4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5247,9 +5247,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int sd_flag)
{
struct sched_group *idlest = NULL, *group = sd->groups;
- struct sched_group *fit_group = NULL;
+ struct sched_group *fit_group = NULL, *spare_group = NULL;
unsigned long min_load = ULONG_MAX, this_load = 0;
unsigned long fit_capacity = ULONG_MAX;
+ unsigned long max_spare_capacity = capacity_margin - SCHED_LOAD_SCALE;
int load_idx = sd->forkexec_idx;
int imbalance = 100 + (sd->imbalance_pct-100)/2;

@@ -5257,7 +5258,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
load_idx = sd->wake_idx;

do {
- unsigned long load, avg_load;
+ unsigned long load, avg_load, spare_capacity;
int local_group;
int i;

@@ -5290,6 +5291,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
fit_capacity = capacity_of(i);
fit_group = group;
}
+
+ /*
+ * Look for group which has most spare capacity on a
+ * single cpu.
+ */
+ spare_capacity = capacity_of(i) - get_cpu_usage(i);
+ if (spare_capacity > max_spare_capacity) {
+ max_spare_capacity = spare_capacity;
+ spare_group = group;
+ }
}

/* Adjust by relative CPU capacity of the group */
@@ -5306,6 +5317,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
if (fit_group)
return fit_group;

+ if (spare_group)
+ return spare_group;
+
if (!idlest || 100*this_load < imbalance*min_load)
return NULL;
return idlest;
--
1.9.1

2015-07-07 18:26:05

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 32/46] sched: Energy-aware wake-up task placement

Let available compute capacity and estimated energy impact select
wake-up target cpu when energy-aware scheduling is enabled and the
system in not over-utilized (above the tipping point).

energy_aware_wake_cpu() attempts to find group of cpus with sufficient
compute capacity to accommodate the task and find a cpu with enough spare
capacity to handle the task within that group. Preference is given to
cpus with enough spare capacity at the current OPP. Finally, the energy
impact of the new target and the previous task cpu is compared to select
the wake-up target cpu.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 84 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f7dbda4..01f7337 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5427,6 +5427,86 @@ static int select_idle_sibling(struct task_struct *p, int target)
return target;
}

+static int energy_aware_wake_cpu(struct task_struct *p, int target)
+{
+ struct sched_domain *sd;
+ struct sched_group *sg, *sg_target;
+ int target_max_cap = INT_MAX;
+ int target_cpu = task_cpu(p);
+ int i;
+
+ sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
+
+ if (!sd)
+ return target;
+
+ sg = sd->groups;
+ sg_target = sg;
+
+ /*
+ * Find group with sufficient capacity. We only get here if no cpu is
+ * overutilized. We may end up overutilizing a cpu by adding the task,
+ * but that should not be any worse than select_idle_sibling().
+ * load_balance() should sort it out later as we get above the tipping
+ * point.
+ */
+ do {
+ /* Assuming all cpus are the same in group */
+ int max_cap_cpu = group_first_cpu(sg);
+
+ /*
+ * Assume smaller max capacity means more energy-efficient.
+ * Ideally we should query the energy model for the right
+ * answer but it easily ends up in an exhaustive search.
+ */
+ if (capacity_of(max_cap_cpu) < target_max_cap &&
+ task_fits_capacity(p, max_cap_cpu)) {
+ sg_target = sg;
+ target_max_cap = capacity_of(max_cap_cpu);
+ }
+ } while (sg = sg->next, sg != sd->groups);
+
+ /* Find cpu with sufficient capacity */
+ for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
+ /*
+ * p's blocked utilization is still accounted for on prev_cpu
+ * so prev_cpu will receive a negative bias due the double
+ * accouting. However, the blocked utilization may be zero.
+ */
+ int new_usage = get_cpu_usage(i) + task_utilization(p);
+
+ if (new_usage > capacity_orig_of(i))
+ continue;
+
+ if (new_usage < capacity_curr_of(i)) {
+ target_cpu = i;
+ if (cpu_rq(i)->nr_running)
+ break;
+ }
+
+ /* cpu has capacity at higher OPP, keep it as fallback */
+ if (target_cpu == task_cpu(p))
+ target_cpu = i;
+ }
+
+ if (target_cpu != task_cpu(p)) {
+ struct energy_env eenv = {
+ .usage_delta = task_utilization(p),
+ .src_cpu = task_cpu(p),
+ .dst_cpu = target_cpu,
+ };
+
+ /* Not enough spare capacity on previous cpu */
+ if (cpu_overutilized(task_cpu(p)))
+ return target_cpu;
+
+ if (energy_diff(&eenv) >= 0)
+ return task_cpu(p);
+ }
+
+ return target_cpu;
+}
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5479,7 +5559,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
prev_cpu = cpu;

if (sd_flag & SD_BALANCE_WAKE && want_sibling) {
- new_cpu = select_idle_sibling(p, prev_cpu);
+ if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
+ new_cpu = energy_aware_wake_cpu(p, prev_cpu);
+ else
+ new_cpu = select_idle_sibling(p, prev_cpu);
goto unlock;
}

--
1.9.1

2015-07-07 18:25:24

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 33/46] sched: Consider a not over-utilized energy-aware system as balanced

From: Dietmar Eggemann <[email protected]>

In case the system operates below the tipping point indicator,
introduced in ("sched: Add over-utilization/tipping point
indicator"), bail out in find_busiest_group after the dst and src
group statistics have been checked.

There is simply no need to move usage around because all involved
cpus still have spare cycles available.

For an energy-aware system below its tipping point, we rely on the
task placement of the wakeup path. This works well for short running
tasks.

The existence of long running tasks on one of the involved cpus lets
the system operate over its tipping point. To be able to move such
a task (whose load can't be used to average the load among the cpus)
from a src cpu with lower capacity than the dst_cpu, an additional
rule has to be implemented in need_active_balance.

Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01f7337..48ecf02 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7382,6 +7382,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
* this level.
*/
update_sd_lb_stats(env, &sds);
+
+ if (energy_aware() && !env->dst_rq->rd->overutilized)
+ goto out_balanced;
+
local = &sds.local_stat;
busiest = &sds.busiest_stat;

--
1.9.1

2015-07-07 18:25:38

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 34/46] sched: Enable idle balance to pull single task towards cpu with higher capacity

From: Dietmar Eggemann <[email protected]>

We do not want to miss out on the ability to pull a single remaining
task from a potential source cpu towards an idle destination cpu if the
energy aware system operates above the tipping point.

Add an extra criteria to need_active_balance() to kick off active load
balance if the source cpu is over-utilized and has lower capacity than
the destination cpu.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48ecf02..97eb83e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7569,6 +7569,13 @@ static int need_active_balance(struct lb_env *env)
return 1;
}

+ if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&
+ env->src_rq->cfs.h_nr_running == 1 &&
+ cpu_overutilized(env->src_cpu) &&
+ !cpu_overutilized(env->dst_cpu)) {
+ return 1;
+ }
+
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}

@@ -7923,7 +7930,8 @@ static int idle_balance(struct rq *this_rq)
this_rq->idle_stamp = rq_clock(this_rq);

if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !this_rq->rd->overload) {
+ (!energy_aware() && !this_rq->rd->overload) ||
+ (energy_aware() && !this_rq->rd->overutilized)) {
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
--
1.9.1

2015-07-07 18:25:49

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 35/46] sched: Disable energy-unfriendly nohz kicks

With energy-aware scheduling enabled nohz_kick_needed() generates many
nohz idle-balance kicks which lead to nothing when multiple tasks get
packed on a single cpu to save energy. This causes unnecessary wake-ups
and hence wastes energy. Make these conditions depend on !energy_aware()
for now until the energy-aware nohz story gets sorted out.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 97eb83e..8e0cbd4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8424,12 +8424,13 @@ static inline bool nohz_kick_needed(struct rq *rq)
if (time_before(now, nohz.next_balance))
return false;

- if (rq->nr_running >= 2)
+ if (rq->nr_running >= 2 &&
+ (!energy_aware() || cpu_overutilized(cpu)))
return true;

rcu_read_lock();
sd = rcu_dereference(per_cpu(sd_busy, cpu));
- if (sd) {
+ if (sd && !energy_aware()) {
sgc = sd->groups->sgc;
nr_busy = atomic_read(&sgc->nr_busy_cpus);

--
1.9.1

2015-07-07 18:58:41

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 36/46] sched: Prevent unnecessary active balance of single task in sched group

Scenarios with the busiest group having just one task and the local
being idle on topologies with sched groups with different numbers of
cpus manage to dodge all load-balance bailout conditions resulting the
nr_balance_failed counter to be incremented. This eventually causes an
pointless active migration of the task. This patch prevents this by not
incrementing the counter when the busiest group only has one task.
ASYM_PACKING migrations and migrations due to reduced capacity should
still take place as these are explicitly captured by
need_active_balance().

A better solution would be to not attempt the load-balance in the first
place, but that requires significant changes to the order of bailout
conditions and statistics gathering.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e0cbd4..f7cb6c9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6127,6 +6127,7 @@ struct lb_env {
int new_dst_cpu;
enum cpu_idle_type idle;
long imbalance;
+ unsigned int src_grp_nr_running;
/* The set of CPUs under consideration for load-balancing */
struct cpumask *cpus;

@@ -7150,6 +7151,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
if (env->sd->flags & SD_NUMA)
env->fbq_type = fbq_classify_group(&sds->busiest_stat);

+ env->src_grp_nr_running = sds->busiest_stat.sum_nr_running;
+
if (!env->sd->parent) {
/* update overload indicator if we are at root domain */
if (env->dst_rq->rd->overload != overload)
@@ -7788,7 +7791,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
* excessive cache_hot migrations and active balances.
*/
if (idle != CPU_NEWLY_IDLE)
- sd->nr_balance_failed++;
+ if (env.src_grp_nr_running > 1)
+ sd->nr_balance_failed++;

if (need_active_balance(&env)) {
raw_spin_lock_irqsave(&busiest->lock, flags);
--
1.9.1

2015-07-07 18:58:15

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 37/46] cpufreq: introduce cpufreq_driver_might_sleep

From: Michael Turquette <[email protected]>

Some architectures and platforms perform CPU frequency transitions
through a non-blocking method, while some might block or sleep. This
distinction is important when trying to change frequency from interrupt
context or in any other non-interruptable context, such as from the
Linux scheduler.

Describe this distinction with a cpufreq driver flag,
CPUFREQ_DRIVER_WILL_NOT_SLEEP. The default is to not have this flag set,
thus erring on the side of caution.

cpufreq_driver_might_sleep() is also introduced in this patch. Setting
the above flag will allow this function to return false.

Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Michael Turquette <[email protected]>
---
drivers/cpufreq/cpufreq.c | 6 ++++++
include/linux/cpufreq.h | 9 +++++++++
2 files changed, 15 insertions(+)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 8ae655c..5753cb6 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -112,6 +112,12 @@ bool have_governor_per_policy(void)
}
EXPORT_SYMBOL_GPL(have_governor_per_policy);

+bool cpufreq_driver_might_sleep(void)
+{
+ return !(cpufreq_driver->flags & CPUFREQ_DRIVER_WILL_NOT_SLEEP);
+}
+EXPORT_SYMBOL_GPL(cpufreq_driver_might_sleep);
+
struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy)
{
if (have_governor_per_policy())
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 2ee4888..1f2c9a1 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -157,6 +157,7 @@ u64 get_cpu_idle_time(unsigned int cpu, u64 *wall, int io_busy);
int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu);
int cpufreq_update_policy(unsigned int cpu);
bool have_governor_per_policy(void);
+bool cpufreq_driver_might_sleep(void);
struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
#else
static inline unsigned int cpufreq_get(unsigned int cpu)
@@ -314,6 +315,14 @@ struct cpufreq_driver {
*/
#define CPUFREQ_NEED_INITIAL_FREQ_CHECK (1 << 5)

+/*
+ * Set by drivers that will never block or sleep during their frequency
+ * transition. Used to indicate when it is safe to call cpufreq_driver_target
+ * from non-interruptable context. Drivers must opt-in to this flag, as the
+ * safe default is that they might sleep.
+ */
+#define CPUFREQ_DRIVER_WILL_NOT_SLEEP (1 << 6)
+
int cpufreq_register_driver(struct cpufreq_driver *driver_data);
int cpufreq_unregister_driver(struct cpufreq_driver *driver_data);

--
1.9.1

2015-07-07 18:58:06

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 38/46] sched: scheduler-driven cpu frequency selection

From: Michael Turquette <[email protected]>

Scheduler-driven cpu frequency selection is desirable as part of the
on-going effort to make the scheduler better aware of energy
consumption. No piece of the Linux kernel has a better view of the
factors that affect a cpu frequency selection policy than the
scheduler[0], and this patch is an attempt to converge on an initial
solution.

This patch implements a simple shim layer between the Linux scheduler
and the cpufreq subsystem. This interface accepts a capacity request
from the Completely Fair Scheduler and honors the max request from all
cpus in the same frequency domain.

The policy magic comes from choosing the cpu capacity request from cfs
and is not contained in this cpufreq governor. This code is
intentionally dumb.

Note that this "governor" is event-driven. There is no polling loop to
check cpu idle time nor any other method which is unsynchronized with
the scheduler.

Thanks to Juri Lelli <[email protected]> for contributing design ideas,
code and test results.

[0] http://article.gmane.org/gmane.linux.kernel/1499836

Signed-off-by: Michael Turquette <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
drivers/cpufreq/Kconfig | 24 ++++
include/linux/cpufreq.h | 3 +
kernel/sched/Makefile | 1 +
kernel/sched/cpufreq_sched.c | 308 +++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 8 ++
5 files changed, 344 insertions(+)
create mode 100644 kernel/sched/cpufreq_sched.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 659879a..9bbf44c 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
Be aware that not all cpufreq drivers support the conservative
governor. If unsure have a look at the help section of the
driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_SCHED
+ bool "sched"
+ select CPU_FREQ_GOV_SCHED
+ select CPU_FREQ_GOV_PERFORMANCE
+ help
+ Use the CPUfreq governor 'sched' as default. This scales
+ cpu frequency from the scheduler as per-entity load tracking
+ statistics are updated.
endchoice

config CPU_FREQ_GOV_PERFORMANCE
@@ -183,6 +192,21 @@ config CPU_FREQ_GOV_CONSERVATIVE

If in doubt, say N.

+config CPU_FREQ_GOV_SCHED
+ tristate "'sched' cpufreq governor"
+ depends on CPU_FREQ
+ select CPU_FREQ_GOV_COMMON
+ help
+ 'sched' - this governor scales cpu frequency from the
+ scheduler as a function of cpu capacity utilization. It does
+ not evaluate utilization on a periodic basis (as ondemand
+ does) but instead is invoked from the completely fair
+ scheduler when updating per-entity load tracking statistics.
+ Latency to respond to changes in load is improved over polling
+ governors due to its event-driven design.
+
+ If in doubt, say N.
+
comment "CPU frequency scaling drivers"

config CPUFREQ_DT
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 1f2c9a1..30241c9 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -494,6 +494,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
extern struct cpufreq_governor cpufreq_gov_conservative;
#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED_GOV)
+extern struct cpufreq_governor cpufreq_gov_sched_gov;
+#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_sched)
#endif

/*********************************************************************
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 6768797..90ed832 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o
diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
new file mode 100644
index 0000000..5020f24
--- /dev/null
+++ b/kernel/sched/cpufreq_sched.c
@@ -0,0 +1,308 @@
+/*
+ * Copyright (C) 2015 Michael Turquette <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
+#include <linux/irq_work.h>
+
+#include "sched.h"
+
+#define THROTTLE_NSEC 50000000 /* 50ms default */
+
+static DEFINE_PER_CPU(unsigned long, pcpu_capacity);
+static DEFINE_PER_CPU(struct cpufreq_policy *, pcpu_policy);
+
+/**
+ * gov_data - per-policy data internal to the governor
+ * @throttle: next throttling period expiry. Derived from throttle_nsec
+ * @throttle_nsec: throttle period length in nanoseconds
+ * @task: worker thread for dvfs transition that may block/sleep
+ * @irq_work: callback used to wake up worker thread
+ * @freq: new frequency stored in *_sched_update_cpu and used in *_sched_thread
+ *
+ * struct gov_data is the per-policy cpufreq_sched-specific data structure. A
+ * per-policy instance of it is created when the cpufreq_sched governor receives
+ * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
+ * member of struct cpufreq_policy.
+ *
+ * Readers of this data must call down_read(policy->rwsem). Writers must
+ * call down_write(policy->rwsem).
+ */
+struct gov_data {
+ ktime_t throttle;
+ unsigned int throttle_nsec;
+ struct task_struct *task;
+ struct irq_work irq_work;
+ struct cpufreq_policy *policy;
+ unsigned int freq;
+};
+
+static void cpufreq_sched_try_driver_target(struct cpufreq_policy *policy, unsigned int freq)
+{
+ struct gov_data *gd = policy->governor_data;
+
+ /* avoid race with cpufreq_sched_stop */
+ if (!down_write_trylock(&policy->rwsem))
+ return;
+
+ __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
+
+ gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
+ up_write(&policy->rwsem);
+}
+
+/*
+ * we pass in struct cpufreq_policy. This is safe because changing out the
+ * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
+ * which tears down all of the data structures and __cpufreq_governor(policy,
+ * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
+ * new policy pointer
+ */
+static int cpufreq_sched_thread(void *data)
+{
+ struct sched_param param;
+ struct cpufreq_policy *policy;
+ struct gov_data *gd;
+ int ret;
+
+ policy = (struct cpufreq_policy *) data;
+ if (!policy) {
+ pr_warn("%s: missing policy\n", __func__);
+ do_exit(-EINVAL);
+ }
+
+ gd = policy->governor_data;
+ if (!gd) {
+ pr_warn("%s: missing governor data\n", __func__);
+ do_exit(-EINVAL);
+ }
+
+ param.sched_priority = 50;
+ ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
+ if (ret) {
+ pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
+ do_exit(-EINVAL);
+ } else {
+ pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
+ __func__, gd->task->pid);
+ }
+
+ ret = set_cpus_allowed_ptr(gd->task, policy->related_cpus);
+ if (ret) {
+ pr_warn("%s: failed to set allowed ptr\n", __func__);
+ do_exit(-EINVAL);
+ }
+
+ /* main loop of the per-policy kthread */
+ do {
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule();
+ if (kthread_should_stop())
+ break;
+
+ cpufreq_sched_try_driver_target(policy, gd->freq);
+ } while (!kthread_should_stop());
+
+ do_exit(0);
+}
+
+static void cpufreq_sched_irq_work(struct irq_work *irq_work)
+{
+ struct gov_data *gd;
+
+ gd = container_of(irq_work, struct gov_data, irq_work);
+ if (!gd) {
+ return;
+ }
+
+ wake_up_process(gd->task);
+}
+
+/**
+ * cpufreq_sched_set_capacity - interface to scheduler for changing capacity values
+ * @cpu: cpu whose capacity utilization has recently changed
+ * @capacity: the new capacity requested by cpu
+ *
+ * cpufreq_sched_sched_capacity is an interface exposed to the scheduler so
+ * that the scheduler may inform the governor of updates to capacity
+ * utilization and make changes to cpu frequency. Currently this interface is
+ * designed around PELT values in CFS. It can be expanded to other scheduling
+ * classes in the future if needed.
+ *
+ * cpufreq_sched_set_capacity raises an IPI. The irq_work handler for that IPI
+ * wakes up the thread that does the actual work, cpufreq_sched_thread.
+ *
+ * This functions bails out early if either condition is true:
+ * 1) this cpu did not the new maximum capacity for its frequency domain
+ * 2) no change in cpu frequency is necessary to meet the new capacity request
+ */
+void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
+{
+ unsigned int freq_new, cpu_tmp;
+ struct cpufreq_policy *policy;
+ struct gov_data *gd;
+ unsigned long capacity_max = 0;
+
+ /* update per-cpu capacity request */
+ __this_cpu_write(pcpu_capacity, capacity);
+
+ policy = cpufreq_cpu_get(cpu);
+ if (IS_ERR_OR_NULL(policy)) {
+ return;
+ }
+
+ if (!policy->governor_data)
+ goto out;
+
+ gd = policy->governor_data;
+
+ /* bail early if we are throttled */
+ if (ktime_before(ktime_get(), gd->throttle))
+ goto out;
+
+ /* find max capacity requested by cpus in this policy */
+ for_each_cpu(cpu_tmp, policy->cpus)
+ capacity_max = max(capacity_max, per_cpu(pcpu_capacity, cpu_tmp));
+
+ /*
+ * We only change frequency if this cpu's capacity request represents a
+ * new max. If another cpu has requested a capacity greater than the
+ * previous max then we rely on that cpu to hit this code path and make
+ * the change. IOW, the cpu with the new max capacity is responsible
+ * for setting the new capacity/frequency.
+ *
+ * If this cpu is not the new maximum then bail
+ */
+ if (capacity_max > capacity)
+ goto out;
+
+ /* Convert the new maximum capacity request into a cpu frequency */
+ freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
+
+ /* No change in frequency? Bail and return current capacity. */
+ if (freq_new == policy->cur)
+ goto out;
+
+ /* store the new frequency and perform the transition */
+ gd->freq = freq_new;
+
+ if (cpufreq_driver_might_sleep())
+ irq_work_queue_on(&gd->irq_work, cpu);
+ else
+ cpufreq_sched_try_driver_target(policy, freq_new);
+
+out:
+ cpufreq_cpu_put(policy);
+ return;
+}
+
+static int cpufreq_sched_start(struct cpufreq_policy *policy)
+{
+ struct gov_data *gd;
+ int cpu;
+
+ /* prepare per-policy private data */
+ gd = kzalloc(sizeof(*gd), GFP_KERNEL);
+ if (!gd) {
+ pr_debug("%s: failed to allocate private data\n", __func__);
+ return -ENOMEM;
+ }
+
+ /* initialize per-cpu data */
+ for_each_cpu(cpu, policy->cpus) {
+ per_cpu(pcpu_capacity, cpu) = 0;
+ per_cpu(pcpu_policy, cpu) = policy;
+ }
+
+ /*
+ * Don't ask for freq changes at an higher rate than what
+ * the driver advertises as transition latency.
+ */
+ gd->throttle_nsec = policy->cpuinfo.transition_latency ?
+ policy->cpuinfo.transition_latency :
+ THROTTLE_NSEC;
+ pr_debug("%s: throttle threshold = %u [ns]\n",
+ __func__, gd->throttle_nsec);
+
+ if (cpufreq_driver_might_sleep()) {
+ /* init per-policy kthread */
+ gd->task = kthread_run(cpufreq_sched_thread, policy, "kcpufreq_sched_task");
+ if (IS_ERR_OR_NULL(gd->task)) {
+ pr_err("%s: failed to create kcpufreq_sched_task thread\n", __func__);
+ goto err;
+ }
+ init_irq_work(&gd->irq_work, cpufreq_sched_irq_work);
+ }
+
+ policy->governor_data = gd;
+ gd->policy = policy;
+ return 0;
+
+err:
+ kfree(gd);
+ return -ENOMEM;
+}
+
+static int cpufreq_sched_stop(struct cpufreq_policy *policy)
+{
+ struct gov_data *gd = policy->governor_data;
+
+ if (cpufreq_driver_might_sleep()) {
+ kthread_stop(gd->task);
+ }
+
+ policy->governor_data = NULL;
+
+ /* FIXME replace with devm counterparts? */
+ kfree(gd);
+ return 0;
+}
+
+static int cpufreq_sched_setup(struct cpufreq_policy *policy, unsigned int event)
+{
+ switch (event) {
+ case CPUFREQ_GOV_START:
+ /* Start managing the frequency */
+ return cpufreq_sched_start(policy);
+
+ case CPUFREQ_GOV_STOP:
+ return cpufreq_sched_stop(policy);
+
+ case CPUFREQ_GOV_LIMITS: /* unused */
+ case CPUFREQ_GOV_POLICY_INIT: /* unused */
+ case CPUFREQ_GOV_POLICY_EXIT: /* unused */
+ break;
+ }
+ return 0;
+}
+
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
+static
+#endif
+struct cpufreq_governor cpufreq_gov_sched = {
+ .name = "sched",
+ .governor = cpufreq_sched_setup,
+ .owner = THIS_MODULE,
+};
+
+static int __init cpufreq_sched_init(void)
+{
+ return cpufreq_register_governor(&cpufreq_gov_sched);
+}
+
+static void __exit cpufreq_sched_exit(void)
+{
+ cpufreq_unregister_governor(&cpufreq_gov_sched);
+}
+
+/* Try to make this the default governor */
+fs_initcall(cpufreq_sched_init);
+
+MODULE_LICENSE("GPL v2");
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c395559..30aa0c4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1476,6 +1476,13 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
}
#endif

+#ifdef CONFIG_CPU_FREQ_GOV_SCHED
+void cpufreq_sched_set_cap(int cpu, unsigned long util);
+#else
+static inline void cpufreq_sched_set_cap(int cpu, unsigned long util)
+{ }
+#endif
+
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
@@ -1484,6 +1491,7 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
#else
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
static inline void sched_avg_update(struct rq *rq) { }
+static inline void gov_cfs_update_cpu(int cpu) {}
#endif

extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period);
--
1.9.1

2015-07-07 18:56:19

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 39/46] sched/cpufreq_sched: use static key for cpu frequency selection

From: Juri Lelli <[email protected]>

Introduce a static key to only affect scheduler hot paths when sched
governor is enabled.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/cpufreq_sched.c | 14 ++++++++++++++
kernel/sched/fair.c | 1 +
kernel/sched/sched.h | 6 ++++++
3 files changed, 21 insertions(+)

diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
index 5020f24..2968f3a 100644
--- a/kernel/sched/cpufreq_sched.c
+++ b/kernel/sched/cpufreq_sched.c
@@ -203,6 +203,18 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
return;
}

+static inline void set_sched_energy_freq(void)
+{
+ if (!sched_energy_freq())
+ static_key_slow_inc(&__sched_energy_freq);
+}
+
+static inline void clear_sched_energy_freq(void)
+{
+ if (sched_energy_freq())
+ static_key_slow_dec(&__sched_energy_freq);
+}
+
static int cpufreq_sched_start(struct cpufreq_policy *policy)
{
struct gov_data *gd;
@@ -243,6 +255,7 @@ static int cpufreq_sched_start(struct cpufreq_policy *policy)

policy->governor_data = gd;
gd->policy = policy;
+ set_sched_energy_freq();
return 0;

err:
@@ -254,6 +267,7 @@ static int cpufreq_sched_stop(struct cpufreq_policy *policy)
{
struct gov_data *gd = policy->governor_data;

+ clear_sched_energy_freq();
if (cpufreq_driver_might_sleep()) {
kthread_stop(gd->task);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f7cb6c9..d395bc9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4282,6 +4282,7 @@ static inline void hrtick_update(struct rq *rq)
#endif

static bool cpu_overutilized(int cpu);
+struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;

/*
* The enqueue_task method is called before nr_running is
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 30aa0c4..b5e27d9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1476,6 +1476,12 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
}
#endif

+extern struct static_key __sched_energy_freq;
+static inline bool sched_energy_freq(void)
+{
+ return static_key_false(&__sched_energy_freq);
+}
+
#ifdef CONFIG_CPU_FREQ_GOV_SCHED
void cpufreq_sched_set_cap(int cpu, unsigned long util);
#else
--
1.9.1

2015-07-07 18:56:13

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 40/46] sched/cpufreq_sched: compute freq_new based on capacity_orig_of()

From: Juri Lelli <[email protected]>

capacity is both cpu and freq scaled with EAS. We thus need to compute
freq_new using capacity_orig_of(), so that we properly scale back the thing
on heterogeneous architectures. In fact, capacity_orig_of() returns only
the cpu scaled part of capacity (see update_cpu_capacity()). So, to scale
freq_new correctly, we multiply policy->max by capacity/capacity_orig_of()
instead of capacity/1024.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/cpufreq_sched.c | 2 +-
kernel/sched/fair.c | 2 +-
kernel/sched/sched.h | 2 ++
3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
index 2968f3a..7071528 100644
--- a/kernel/sched/cpufreq_sched.c
+++ b/kernel/sched/cpufreq_sched.c
@@ -184,7 +184,7 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
goto out;

/* Convert the new maximum capacity request into a cpu frequency */
- freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
+ freq_new = (capacity * policy->max) / capacity_orig_of(cpu);

/* No change in frequency? Bail and return current capacity. */
if (freq_new == policy->cur)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d395bc9..f74e9d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4625,7 +4625,7 @@ static unsigned long capacity_of(int cpu)
return cpu_rq(cpu)->cpu_capacity;
}

-static unsigned long capacity_orig_of(int cpu)
+unsigned long capacity_orig_of(int cpu)
{
return cpu_rq(cpu)->cpu_capacity_orig;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b5e27d9..1327dc7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1476,6 +1476,8 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
}
#endif

+unsigned long capacity_orig_of(int cpu);
+
extern struct static_key __sched_energy_freq;
static inline bool sched_energy_freq(void)
{
--
1.9.1

2015-07-07 18:56:07

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

From: Juri Lelli <[email protected]>

Each time a task is {en,de}queued we might need to adapt the current
frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
this purpose. Only trigger a freq request if we are effectively waking up
or going to sleep. Filter out load balancing related calls to reduce the
number of triggers.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++++++++++++--
1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f74e9d2..b8627c6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4281,7 +4281,10 @@ static inline void hrtick_update(struct rq *rq)
}
#endif

+static unsigned int capacity_margin = 1280; /* ~20% margin */
+
static bool cpu_overutilized(int cpu);
+static unsigned long get_cpu_usage(int cpu);
struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;

/*
@@ -4332,6 +4335,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!task_new && !rq->rd->overutilized &&
cpu_overutilized(rq->cpu))
rq->rd->overutilized = true;
+ /*
+ * We want to trigger a freq switch request only for tasks that
+ * are waking up; this is because we get here also during
+ * load balancing, but in these cases it seems wise to trigger
+ * as single request after load balancing is done.
+ *
+ * XXX: how about fork()? Do we need a special flag/something
+ * to tell if we are here after a fork() (wakeup_task_new)?
+ *
+ * Also, we add a margin (same ~20% used for the tipping point)
+ * to our request to provide some head room if p's utilization
+ * further increases.
+ */
+ if (sched_energy_freq() && !task_new) {
+ unsigned long req_cap = get_cpu_usage(cpu_of(rq));
+
+ req_cap = req_cap * capacity_margin
+ >> SCHED_CAPACITY_SHIFT;
+ cpufreq_sched_set_cap(cpu_of(rq), req_cap);
+ }
}
hrtick_update(rq);
}
@@ -4393,6 +4416,23 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se) {
sub_nr_running(rq, 1);
update_rq_runnable_avg(rq, 1);
+ /*
+ * We want to trigger a freq switch request only for tasks that
+ * are going to sleep; this is because we get here also during
+ * load balancing, but in these cases it seems wise to trigger
+ * as single request after load balancing is done.
+ *
+ * Also, we add a margin (same ~20% used for the tipping point)
+ * to our request to provide some head room if p's utilization
+ * further increases.
+ */
+ if (sched_energy_freq() && task_sleep) {
+ unsigned long req_cap = get_cpu_usage(cpu_of(rq));
+
+ req_cap = req_cap * capacity_margin
+ >> SCHED_CAPACITY_SHIFT;
+ cpufreq_sched_set_cap(cpu_of(rq), req_cap);
+ }
}
hrtick_update(rq);
}
@@ -4959,8 +4999,6 @@ static int find_new_capacity(struct energy_env *eenv,
return idx;
}

-static unsigned int capacity_margin = 1280; /* ~20% margin */
-
static bool cpu_overutilized(int cpu)
{
return (capacity_of(cpu) * 1024) <
--
1.9.1

2015-07-07 18:55:50

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 42/46] sched/{core,fair}: trigger OPP change request on fork()

From: Juri Lelli <[email protected]>

Patch "sched/fair: add triggers for OPP change requests" introduced OPP
change triggers for enqueue_task_fair(), but the trigger was operating only
for wakeups. Fact is that it makes sense to consider wakeup_new also (i.e.,
fork()), as we don't know anything about a newly created task and thus we
most certainly want to jump to max OPP to not harm performance too much.

However, it is not currently possible (or at least it wasn't evident to me
how to do so :/) to tell new wakeups from other (non wakeup) operations.

This patch introduces an additional flag in sched.h that is only set at
fork() time and it is then consumed in enqueue_task_fair() for our purpose.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 8 +++-----
kernel/sched/sched.h | 1 +
3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a41bb32..57c9eb5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2138,7 +2138,7 @@ void wake_up_new_task(struct task_struct *p)
#endif

rq = __task_rq_lock(p);
- activate_task(rq, p, 0);
+ activate_task(rq, p, ENQUEUE_WAKEUP_NEW);
p->on_rq = TASK_ON_RQ_QUEUED;
trace_sched_wakeup_new(p, true);
check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b8627c6..bb49499 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4297,7 +4297,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
- int task_new = !(flags & ENQUEUE_WAKEUP);
+ int task_new = flags & ENQUEUE_WAKEUP_NEW;
+ int task_wakeup = flags & ENQUEUE_WAKEUP;

for_each_sched_entity(se) {
if (se->on_rq)
@@ -4341,14 +4342,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
* load balancing, but in these cases it seems wise to trigger
* as single request after load balancing is done.
*
- * XXX: how about fork()? Do we need a special flag/something
- * to tell if we are here after a fork() (wakeup_task_new)?
- *
* Also, we add a margin (same ~20% used for the tipping point)
* to our request to provide some head room if p's utilization
* further increases.
*/
- if (sched_energy_freq() && !task_new) {
+ if (sched_energy_freq() && (task_new || task_wakeup)) {
unsigned long req_cap = get_cpu_usage(cpu_of(rq));

req_cap = req_cap * capacity_margin
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1327dc7..85be5d8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1206,6 +1206,7 @@ static const u32 prio_to_wmult[40] = {
#define ENQUEUE_WAKING 0
#endif
#define ENQUEUE_REPLENISH 8
+#define ENQUEUE_WAKEUP_NEW 16

#define DEQUEUE_SLEEP 1

--
1.9.1

2015-07-07 18:55:10

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 43/46] sched/{fair,cpufreq_sched}: add reset_capacity interface

From: Juri Lelli <[email protected]>

When a CPU is going idle it is pointless to ask for an OPP update as we
would wake up another task only to ask for the same capacity we are already
running at (utilization gets moved to blocked_utilization). We thus add
cpufreq_sched_reset_capacity() interface to just reset our current capacity
request without triggering any real update. At wakeup we will use the
decayed utilization to select an appropriate OPP.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/cpufreq_sched.c | 12 ++++++++++++
kernel/sched/fair.c | 10 +++++++---
kernel/sched/sched.h | 3 +++
3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
index 7071528..06ff183 100644
--- a/kernel/sched/cpufreq_sched.c
+++ b/kernel/sched/cpufreq_sched.c
@@ -203,6 +203,18 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
return;
}

+/**
+ * cpufreq_sched_reset_capacity - interface to scheduler for resetting capacity
+ * requests
+ * @cpu: cpu whose capacity request has to be reset
+ *
+ * This _wont trigger_ any capacity update.
+ */
+void cpufreq_sched_reset_cap(int cpu)
+{
+ per_cpu(pcpu_capacity, cpu) = 0;
+}
+
static inline void set_sched_energy_freq(void)
{
if (!sched_energy_freq())
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bb49499..323331f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4427,9 +4427,13 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (sched_energy_freq() && task_sleep) {
unsigned long req_cap = get_cpu_usage(cpu_of(rq));

- req_cap = req_cap * capacity_margin
- >> SCHED_CAPACITY_SHIFT;
- cpufreq_sched_set_cap(cpu_of(rq), req_cap);
+ if (rq->cfs.nr_running) {
+ req_cap = req_cap * capacity_margin
+ >> SCHED_CAPACITY_SHIFT;
+ cpufreq_sched_set_cap(cpu_of(rq), req_cap);
+ } else {
+ cpufreq_sched_reset_cap(cpu_of(rq));
+ }
}
}
hrtick_update(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 85be5d8..f1ff5bb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1487,9 +1487,12 @@ static inline bool sched_energy_freq(void)

#ifdef CONFIG_CPU_FREQ_GOV_SCHED
void cpufreq_sched_set_cap(int cpu, unsigned long util);
+void cpufreq_sched_reset_cap(int cpu);
#else
static inline void cpufreq_sched_set_cap(int cpu, unsigned long util)
{ }
+static inline void cpufreq_sched_reset_cap(int cpu)
+{ }
#endif

static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
--
1.9.1

2015-07-07 18:54:24

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 44/46] sched/fair: jump to max OPP when crossing UP threshold

From: Juri Lelli <[email protected]>

Since the true utilization of a long running task is not detectable while
it is running and might be bigger than the current cpu capacity, create the
maximum cpu capacity head room by requesting the maximum cpu capacity once
the cpu usage plus the capacity margin exceeds the current capacity. This
is also done to try to harm the performance of a task the least.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/fair.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 323331f..c2d6de4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8586,6 +8586,25 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
rq->rd->overutilized = true;
+
+ /*
+ * To make free room for a task that is building up its "real"
+ * utilization and to harm its performance the least, request a
+ * jump to max OPP as soon as get_cpu_usage() crosses the UP
+ * threshold. The UP threshold is built relative to the current
+ * capacity (OPP), by using same margin used to tell if a cpu
+ * is overutilized (capacity_margin).
+ */
+ if (sched_energy_freq()) {
+ int cpu = cpu_of(rq);
+ unsigned long capacity_orig = capacity_orig_of(cpu);
+ unsigned long capacity_curr = capacity_curr_of(cpu);
+
+ if (capacity_curr < capacity_orig &&
+ (capacity_curr * SCHED_LOAD_SCALE) <
+ (get_cpu_usage(cpu) * capacity_margin))
+ cpufreq_sched_set_cap(cpu, capacity_orig);
+ }
}

/*
--
1.9.1

2015-07-07 18:50:37

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 45/46] sched/cpufreq_sched: modify pcpu_capacity handling

From: Juri Lelli <[email protected]>

Use the cpu argument of cpufreq_sched_set_cap() to handle per_cpu writes,
as the thing can be called remotely (e.g., from load balacing code).

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/cpufreq_sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
index 06ff183..b81ac779 100644
--- a/kernel/sched/cpufreq_sched.c
+++ b/kernel/sched/cpufreq_sched.c
@@ -151,7 +151,7 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
unsigned long capacity_max = 0;

/* update per-cpu capacity request */
- __this_cpu_write(pcpu_capacity, capacity);
+ per_cpu(pcpu_capacity, cpu) = capacity;

policy = cpufreq_cpu_get(cpu);
if (IS_ERR_OR_NULL(policy)) {
--
1.9.1

2015-07-07 18:50:07

by Morten Rasmussen

[permalink] [raw]
Subject: [RFCv5 PATCH 46/46] sched/fair: cpufreq_sched triggers for load balancing

From: Juri Lelli <[email protected]>

As we don't trigger freq changes from {en,de}queue_task_fair() during load
balancing, we need to do explicitly so on load balancing paths.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 62 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2d6de4..c513b19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7741,6 +7741,20 @@ static int load_balance(int this_cpu, struct rq *this_rq,
* ld_moved - cumulative load moved across iterations
*/
cur_ld_moved = detach_tasks(&env);
+ /*
+ * We want to potentially update env.src_cpu's OPP.
+ *
+ * Add a margin (same ~20% used for the tipping point)
+ * to our request to provide some head room for the remaining
+ * tasks.
+ */
+ if (sched_energy_freq() && cur_ld_moved) {
+ unsigned long req_cap = get_cpu_usage(env.src_cpu);
+
+ req_cap = req_cap * capacity_margin
+ >> SCHED_CAPACITY_SHIFT;
+ cpufreq_sched_set_cap(env.src_cpu, req_cap);
+ }

/*
* We've detached some tasks from busiest_rq. Every
@@ -7755,6 +7769,21 @@ static int load_balance(int this_cpu, struct rq *this_rq,
if (cur_ld_moved) {
attach_tasks(&env);
ld_moved += cur_ld_moved;
+ /*
+ * We want to potentially update env.dst_cpu's OPP.
+ *
+ * Add a margin (same ~20% used for the tipping point)
+ * to our request to provide some head room if p's
+ * utilization further increases.
+ */
+ if (sched_energy_freq()) {
+ unsigned long req_cap =
+ get_cpu_usage(env.dst_cpu);
+
+ req_cap = req_cap * capacity_margin
+ >> SCHED_CAPACITY_SHIFT;
+ cpufreq_sched_set_cap(env.dst_cpu, req_cap);
+ }
}

local_irq_restore(flags);
@@ -8114,8 +8143,24 @@ static int active_load_balance_cpu_stop(void *data)
schedstat_inc(sd, alb_count);

p = detach_one_task(&env);
- if (p)
+ if (p) {
schedstat_inc(sd, alb_pushed);
+ /*
+ * We want to potentially update env.src_cpu's OPP.
+ *
+ * Add a margin (same ~20% used for the tipping point)
+ * to our request to provide some head room for the
+ * remaining task.
+ */
+ if (sched_energy_freq()) {
+ unsigned long req_cap =
+ get_cpu_usage(env.src_cpu);
+
+ req_cap = req_cap * capacity_margin
+ >> SCHED_CAPACITY_SHIFT;
+ cpufreq_sched_set_cap(env.src_cpu, req_cap);
+ }
+ }
else
schedstat_inc(sd, alb_failed);
}
@@ -8124,8 +8169,23 @@ static int active_load_balance_cpu_stop(void *data)
busiest_rq->active_balance = 0;
raw_spin_unlock(&busiest_rq->lock);

- if (p)
+ if (p) {
attach_one_task(target_rq, p);
+ /*
+ * We want to potentially update target_cpu's OPP.
+ *
+ * Add a margin (same ~20% used for the tipping point)
+ * to our request to provide some head room if p's utilization
+ * further increases.
+ */
+ if (sched_energy_freq()) {
+ unsigned long req_cap = get_cpu_usage(target_cpu);
+
+ req_cap = req_cap * capacity_margin
+ >> SCHED_CAPACITY_SHIFT;
+ cpufreq_sched_set_cap(target_cpu, req_cap);
+ }
+ }

local_irq_enable();

--
1.9.1

2015-07-08 12:36:37

by Jon Medhurst (Tixy)

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 03/46] arm: vexpress: Add CPU clock-frequencies to TC2 device-tree

On Tue, 2015-07-07 at 19:23 +0100, Morten Rasmussen wrote:
> From: Dietmar Eggemann <[email protected]>
>
> To enable the parsing of clock frequency and cpu efficiency values
> inside parse_dt_topology [arch/arm/kernel/topology.c] to scale the
> relative capacity of the cpus, this property has to be provided within
> the cpu nodes of the dts file.
>
> The patch is a copy of commit 8f15973ef8c3 ("ARM: vexpress: Add CPU
> clock-frequencies to TC2 device-tree") taken from Linaro Stable Kernel
> (LSK) massaged into mainline.

Not sure you really need to mention commit hashes from outside of the
mainline Linux tree and the values I added to that commit were probably
copied from some patch originating from ARM or elsewhere anyway. So the
second paragraph is this commit message is probably superfluous. But
this is nitpicking I guess :-)

--
Tixy

2015-07-08 15:09:37

by Michael Turquette

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 37/46] cpufreq: introduce cpufreq_driver_might_sleep

Quoting Morten Rasmussen (2015-07-07 11:24:20)
> From: Michael Turquette <[email protected]>
>
> Some architectures and platforms perform CPU frequency transitions
> through a non-blocking method, while some might block or sleep. This
> distinction is important when trying to change frequency from interrupt
> context or in any other non-interruptable context, such as from the
> Linux scheduler.
>
> Describe this distinction with a cpufreq driver flag,
> CPUFREQ_DRIVER_WILL_NOT_SLEEP. The default is to not have this flag set,
> thus erring on the side of caution.
>
> cpufreq_driver_might_sleep() is also introduced in this patch. Setting
> the above flag will allow this function to return false.
>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Signed-off-by: Michael Turquette <[email protected]>

Hi Morten,

I believe your sign-off is needed here as well.

Regards,
Mike

> ---
> drivers/cpufreq/cpufreq.c | 6 ++++++
> include/linux/cpufreq.h | 9 +++++++++
> 2 files changed, 15 insertions(+)
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 8ae655c..5753cb6 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -112,6 +112,12 @@ bool have_governor_per_policy(void)
> }
> EXPORT_SYMBOL_GPL(have_governor_per_policy);
>
> +bool cpufreq_driver_might_sleep(void)
> +{
> + return !(cpufreq_driver->flags & CPUFREQ_DRIVER_WILL_NOT_SLEEP);
> +}
> +EXPORT_SYMBOL_GPL(cpufreq_driver_might_sleep);
> +
> struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy)
> {
> if (have_governor_per_policy())
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 2ee4888..1f2c9a1 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -157,6 +157,7 @@ u64 get_cpu_idle_time(unsigned int cpu, u64 *wall, int io_busy);
> int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu);
> int cpufreq_update_policy(unsigned int cpu);
> bool have_governor_per_policy(void);
> +bool cpufreq_driver_might_sleep(void);
> struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
> #else
> static inline unsigned int cpufreq_get(unsigned int cpu)
> @@ -314,6 +315,14 @@ struct cpufreq_driver {
> */
> #define CPUFREQ_NEED_INITIAL_FREQ_CHECK (1 << 5)
>
> +/*
> + * Set by drivers that will never block or sleep during their frequency
> + * transition. Used to indicate when it is safe to call cpufreq_driver_target
> + * from non-interruptable context. Drivers must opt-in to this flag, as the
> + * safe default is that they might sleep.
> + */
> +#define CPUFREQ_DRIVER_WILL_NOT_SLEEP (1 << 6)
> +
> int cpufreq_register_driver(struct cpufreq_driver *driver_data);
> int cpufreq_unregister_driver(struct cpufreq_driver *driver_data);
>
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2015-07-08 15:09:54

by Michael Turquette

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 38/46] sched: scheduler-driven cpu frequency selection

Quoting Morten Rasmussen (2015-07-07 11:24:21)
> From: Michael Turquette <[email protected]>
>
> Scheduler-driven cpu frequency selection is desirable as part of the
> on-going effort to make the scheduler better aware of energy
> consumption. No piece of the Linux kernel has a better view of the
> factors that affect a cpu frequency selection policy than the
> scheduler[0], and this patch is an attempt to converge on an initial
> solution.
>
> This patch implements a simple shim layer between the Linux scheduler
> and the cpufreq subsystem. This interface accepts a capacity request
> from the Completely Fair Scheduler and honors the max request from all
> cpus in the same frequency domain.
>
> The policy magic comes from choosing the cpu capacity request from cfs
> and is not contained in this cpufreq governor. This code is
> intentionally dumb.
>
> Note that this "governor" is event-driven. There is no polling loop to
> check cpu idle time nor any other method which is unsynchronized with
> the scheduler.
>
> Thanks to Juri Lelli <[email protected]> for contributing design ideas,
> code and test results.
>
> [0] http://article.gmane.org/gmane.linux.kernel/1499836
>
> Signed-off-by: Michael Turquette <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>

Hi Morten,

I believe your sign-off is needed here as well.

Regards,
Mike

> ---
> drivers/cpufreq/Kconfig | 24 ++++
> include/linux/cpufreq.h | 3 +
> kernel/sched/Makefile | 1 +
> kernel/sched/cpufreq_sched.c | 308 +++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 8 ++
> 5 files changed, 344 insertions(+)
> create mode 100644 kernel/sched/cpufreq_sched.c
>
> diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> index 659879a..9bbf44c 100644
> --- a/drivers/cpufreq/Kconfig
> +++ b/drivers/cpufreq/Kconfig
> @@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
> Be aware that not all cpufreq drivers support the conservative
> governor. If unsure have a look at the help section of the
> driver. Fallback governor will be the performance governor.
> +
> +config CPU_FREQ_DEFAULT_GOV_SCHED
> + bool "sched"
> + select CPU_FREQ_GOV_SCHED
> + select CPU_FREQ_GOV_PERFORMANCE
> + help
> + Use the CPUfreq governor 'sched' as default. This scales
> + cpu frequency from the scheduler as per-entity load tracking
> + statistics are updated.
> endchoice
>
> config CPU_FREQ_GOV_PERFORMANCE
> @@ -183,6 +192,21 @@ config CPU_FREQ_GOV_CONSERVATIVE
>
> If in doubt, say N.
>
> +config CPU_FREQ_GOV_SCHED
> + tristate "'sched' cpufreq governor"
> + depends on CPU_FREQ
> + select CPU_FREQ_GOV_COMMON
> + help
> + 'sched' - this governor scales cpu frequency from the
> + scheduler as a function of cpu capacity utilization. It does
> + not evaluate utilization on a periodic basis (as ondemand
> + does) but instead is invoked from the completely fair
> + scheduler when updating per-entity load tracking statistics.
> + Latency to respond to changes in load is improved over polling
> + governors due to its event-driven design.
> +
> + If in doubt, say N.
> +
> comment "CPU frequency scaling drivers"
>
> config CPUFREQ_DT
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 1f2c9a1..30241c9 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -494,6 +494,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
> #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
> extern struct cpufreq_governor cpufreq_gov_conservative;
> #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
> +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED_GOV)
> +extern struct cpufreq_governor cpufreq_gov_sched_gov;
> +#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_sched)
> #endif
>
> /*********************************************************************
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 6768797..90ed832 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
> obj-$(CONFIG_SCHEDSTATS) += stats.o
> obj-$(CONFIG_SCHED_DEBUG) += debug.o
> obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
> +obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o
> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
> new file mode 100644
> index 0000000..5020f24
> --- /dev/null
> +++ b/kernel/sched/cpufreq_sched.c
> @@ -0,0 +1,308 @@
> +/*
> + * Copyright (C) 2015 Michael Turquette <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/cpufreq.h>
> +#include <linux/module.h>
> +#include <linux/kthread.h>
> +#include <linux/percpu.h>
> +#include <linux/irq_work.h>
> +
> +#include "sched.h"
> +
> +#define THROTTLE_NSEC 50000000 /* 50ms default */
> +
> +static DEFINE_PER_CPU(unsigned long, pcpu_capacity);
> +static DEFINE_PER_CPU(struct cpufreq_policy *, pcpu_policy);
> +
> +/**
> + * gov_data - per-policy data internal to the governor
> + * @throttle: next throttling period expiry. Derived from throttle_nsec
> + * @throttle_nsec: throttle period length in nanoseconds
> + * @task: worker thread for dvfs transition that may block/sleep
> + * @irq_work: callback used to wake up worker thread
> + * @freq: new frequency stored in *_sched_update_cpu and used in *_sched_thread
> + *
> + * struct gov_data is the per-policy cpufreq_sched-specific data structure. A
> + * per-policy instance of it is created when the cpufreq_sched governor receives
> + * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
> + * member of struct cpufreq_policy.
> + *
> + * Readers of this data must call down_read(policy->rwsem). Writers must
> + * call down_write(policy->rwsem).
> + */
> +struct gov_data {
> + ktime_t throttle;
> + unsigned int throttle_nsec;
> + struct task_struct *task;
> + struct irq_work irq_work;
> + struct cpufreq_policy *policy;
> + unsigned int freq;
> +};
> +
> +static void cpufreq_sched_try_driver_target(struct cpufreq_policy *policy, unsigned int freq)
> +{
> + struct gov_data *gd = policy->governor_data;
> +
> + /* avoid race with cpufreq_sched_stop */
> + if (!down_write_trylock(&policy->rwsem))
> + return;
> +
> + __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
> +
> + gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
> + up_write(&policy->rwsem);
> +}
> +
> +/*
> + * we pass in struct cpufreq_policy. This is safe because changing out the
> + * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
> + * which tears down all of the data structures and __cpufreq_governor(policy,
> + * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
> + * new policy pointer
> + */
> +static int cpufreq_sched_thread(void *data)
> +{
> + struct sched_param param;
> + struct cpufreq_policy *policy;
> + struct gov_data *gd;
> + int ret;
> +
> + policy = (struct cpufreq_policy *) data;
> + if (!policy) {
> + pr_warn("%s: missing policy\n", __func__);
> + do_exit(-EINVAL);
> + }
> +
> + gd = policy->governor_data;
> + if (!gd) {
> + pr_warn("%s: missing governor data\n", __func__);
> + do_exit(-EINVAL);
> + }
> +
> + param.sched_priority = 50;
> + ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
> + if (ret) {
> + pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
> + do_exit(-EINVAL);
> + } else {
> + pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
> + __func__, gd->task->pid);
> + }
> +
> + ret = set_cpus_allowed_ptr(gd->task, policy->related_cpus);
> + if (ret) {
> + pr_warn("%s: failed to set allowed ptr\n", __func__);
> + do_exit(-EINVAL);
> + }
> +
> + /* main loop of the per-policy kthread */
> + do {
> + set_current_state(TASK_INTERRUPTIBLE);
> + schedule();
> + if (kthread_should_stop())
> + break;
> +
> + cpufreq_sched_try_driver_target(policy, gd->freq);
> + } while (!kthread_should_stop());
> +
> + do_exit(0);
> +}
> +
> +static void cpufreq_sched_irq_work(struct irq_work *irq_work)
> +{
> + struct gov_data *gd;
> +
> + gd = container_of(irq_work, struct gov_data, irq_work);
> + if (!gd) {
> + return;
> + }
> +
> + wake_up_process(gd->task);
> +}
> +
> +/**
> + * cpufreq_sched_set_capacity - interface to scheduler for changing capacity values
> + * @cpu: cpu whose capacity utilization has recently changed
> + * @capacity: the new capacity requested by cpu
> + *
> + * cpufreq_sched_sched_capacity is an interface exposed to the scheduler so
> + * that the scheduler may inform the governor of updates to capacity
> + * utilization and make changes to cpu frequency. Currently this interface is
> + * designed around PELT values in CFS. It can be expanded to other scheduling
> + * classes in the future if needed.
> + *
> + * cpufreq_sched_set_capacity raises an IPI. The irq_work handler for that IPI
> + * wakes up the thread that does the actual work, cpufreq_sched_thread.
> + *
> + * This functions bails out early if either condition is true:
> + * 1) this cpu did not the new maximum capacity for its frequency domain
> + * 2) no change in cpu frequency is necessary to meet the new capacity request
> + */
> +void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
> +{
> + unsigned int freq_new, cpu_tmp;
> + struct cpufreq_policy *policy;
> + struct gov_data *gd;
> + unsigned long capacity_max = 0;
> +
> + /* update per-cpu capacity request */
> + __this_cpu_write(pcpu_capacity, capacity);
> +
> + policy = cpufreq_cpu_get(cpu);
> + if (IS_ERR_OR_NULL(policy)) {
> + return;
> + }
> +
> + if (!policy->governor_data)
> + goto out;
> +
> + gd = policy->governor_data;
> +
> + /* bail early if we are throttled */
> + if (ktime_before(ktime_get(), gd->throttle))
> + goto out;
> +
> + /* find max capacity requested by cpus in this policy */
> + for_each_cpu(cpu_tmp, policy->cpus)
> + capacity_max = max(capacity_max, per_cpu(pcpu_capacity, cpu_tmp));
> +
> + /*
> + * We only change frequency if this cpu's capacity request represents a
> + * new max. If another cpu has requested a capacity greater than the
> + * previous max then we rely on that cpu to hit this code path and make
> + * the change. IOW, the cpu with the new max capacity is responsible
> + * for setting the new capacity/frequency.
> + *
> + * If this cpu is not the new maximum then bail
> + */
> + if (capacity_max > capacity)
> + goto out;
> +
> + /* Convert the new maximum capacity request into a cpu frequency */
> + freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
> +
> + /* No change in frequency? Bail and return current capacity. */
> + if (freq_new == policy->cur)
> + goto out;
> +
> + /* store the new frequency and perform the transition */
> + gd->freq = freq_new;
> +
> + if (cpufreq_driver_might_sleep())
> + irq_work_queue_on(&gd->irq_work, cpu);
> + else
> + cpufreq_sched_try_driver_target(policy, freq_new);
> +
> +out:
> + cpufreq_cpu_put(policy);
> + return;
> +}
> +
> +static int cpufreq_sched_start(struct cpufreq_policy *policy)
> +{
> + struct gov_data *gd;
> + int cpu;
> +
> + /* prepare per-policy private data */
> + gd = kzalloc(sizeof(*gd), GFP_KERNEL);
> + if (!gd) {
> + pr_debug("%s: failed to allocate private data\n", __func__);
> + return -ENOMEM;
> + }
> +
> + /* initialize per-cpu data */
> + for_each_cpu(cpu, policy->cpus) {
> + per_cpu(pcpu_capacity, cpu) = 0;
> + per_cpu(pcpu_policy, cpu) = policy;
> + }
> +
> + /*
> + * Don't ask for freq changes at an higher rate than what
> + * the driver advertises as transition latency.
> + */
> + gd->throttle_nsec = policy->cpuinfo.transition_latency ?
> + policy->cpuinfo.transition_latency :
> + THROTTLE_NSEC;
> + pr_debug("%s: throttle threshold = %u [ns]\n",
> + __func__, gd->throttle_nsec);
> +
> + if (cpufreq_driver_might_sleep()) {
> + /* init per-policy kthread */
> + gd->task = kthread_run(cpufreq_sched_thread, policy, "kcpufreq_sched_task");
> + if (IS_ERR_OR_NULL(gd->task)) {
> + pr_err("%s: failed to create kcpufreq_sched_task thread\n", __func__);
> + goto err;
> + }
> + init_irq_work(&gd->irq_work, cpufreq_sched_irq_work);
> + }
> +
> + policy->governor_data = gd;
> + gd->policy = policy;
> + return 0;
> +
> +err:
> + kfree(gd);
> + return -ENOMEM;
> +}
> +
> +static int cpufreq_sched_stop(struct cpufreq_policy *policy)
> +{
> + struct gov_data *gd = policy->governor_data;
> +
> + if (cpufreq_driver_might_sleep()) {
> + kthread_stop(gd->task);
> + }
> +
> + policy->governor_data = NULL;
> +
> + /* FIXME replace with devm counterparts? */
> + kfree(gd);
> + return 0;
> +}
> +
> +static int cpufreq_sched_setup(struct cpufreq_policy *policy, unsigned int event)
> +{
> + switch (event) {
> + case CPUFREQ_GOV_START:
> + /* Start managing the frequency */
> + return cpufreq_sched_start(policy);
> +
> + case CPUFREQ_GOV_STOP:
> + return cpufreq_sched_stop(policy);
> +
> + case CPUFREQ_GOV_LIMITS: /* unused */
> + case CPUFREQ_GOV_POLICY_INIT: /* unused */
> + case CPUFREQ_GOV_POLICY_EXIT: /* unused */
> + break;
> + }
> + return 0;
> +}
> +
> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
> +static
> +#endif
> +struct cpufreq_governor cpufreq_gov_sched = {
> + .name = "sched",
> + .governor = cpufreq_sched_setup,
> + .owner = THIS_MODULE,
> +};
> +
> +static int __init cpufreq_sched_init(void)
> +{
> + return cpufreq_register_governor(&cpufreq_gov_sched);
> +}
> +
> +static void __exit cpufreq_sched_exit(void)
> +{
> + cpufreq_unregister_governor(&cpufreq_gov_sched);
> +}
> +
> +/* Try to make this the default governor */
> +fs_initcall(cpufreq_sched_init);
> +
> +MODULE_LICENSE("GPL v2");
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c395559..30aa0c4 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1476,6 +1476,13 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> }
> #endif
>
> +#ifdef CONFIG_CPU_FREQ_GOV_SCHED
> +void cpufreq_sched_set_cap(int cpu, unsigned long util);
> +#else
> +static inline void cpufreq_sched_set_cap(int cpu, unsigned long util)
> +{ }
> +#endif
> +
> static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
> {
> rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
> @@ -1484,6 +1491,7 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
> #else
> static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
> static inline void sched_avg_update(struct rq *rq) { }
> +static inline void gov_cfs_update_cpu(int cpu) {}
> #endif
>
> extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period);
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-07-08 15:20:23

by Michael Turquette

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 39/46] sched/cpufreq_sched: use static key for cpu frequency selection

Quoting Morten Rasmussen (2015-07-07 11:24:22)
> From: Juri Lelli <[email protected]>
>
> Introduce a static key to only affect scheduler hot paths when sched
> governor is enabled.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Juri Lelli <[email protected]>
> ---
> kernel/sched/cpufreq_sched.c | 14 ++++++++++++++
> kernel/sched/fair.c | 1 +
> kernel/sched/sched.h | 6 ++++++
> 3 files changed, 21 insertions(+)
>
> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
> index 5020f24..2968f3a 100644
> --- a/kernel/sched/cpufreq_sched.c
> +++ b/kernel/sched/cpufreq_sched.c
> @@ -203,6 +203,18 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
> return;
> }
>
> +static inline void set_sched_energy_freq(void)
> +{
> + if (!sched_energy_freq())
> + static_key_slow_inc(&__sched_energy_freq);
> +}
> +
> +static inline void clear_sched_energy_freq(void)
> +{
> + if (sched_energy_freq())
> + static_key_slow_dec(&__sched_energy_freq);
> +}
> +
> static int cpufreq_sched_start(struct cpufreq_policy *policy)
> {
> struct gov_data *gd;
> @@ -243,6 +255,7 @@ static int cpufreq_sched_start(struct cpufreq_policy *policy)
>
> policy->governor_data = gd;
> gd->policy = policy;
> + set_sched_energy_freq();
> return 0;
>
> err:
> @@ -254,6 +267,7 @@ static int cpufreq_sched_stop(struct cpufreq_policy *policy)
> {
> struct gov_data *gd = policy->governor_data;
>
> + clear_sched_energy_freq();

<paranoia>

These controls are exposed to userspace via cpufreq sysfs knobs. Should
we use a struct static_key_deferred and static_key_slow_dec_deferred()
instead? This helps avoid a possible attack vector for slowing down the
system.

</paranoia>

I don't really know what a sane default rate limit would be in that case
though. Otherwise feel free to add:

Reviewed-by: Michael Turquette <[email protected]>

Regards,
Mike

> if (cpufreq_driver_might_sleep()) {
> kthread_stop(gd->task);
> }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f7cb6c9..d395bc9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4282,6 +4282,7 @@ static inline void hrtick_update(struct rq *rq)
> #endif
>
> static bool cpu_overutilized(int cpu);
> +struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;
>
> /*
> * The enqueue_task method is called before nr_running is
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 30aa0c4..b5e27d9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1476,6 +1476,12 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> }
> #endif
>
> +extern struct static_key __sched_energy_freq;
> +static inline bool sched_energy_freq(void)
> +{
> + return static_key_false(&__sched_energy_freq);
> +}
> +
> #ifdef CONFIG_CPU_FREQ_GOV_SCHED
> void cpufreq_sched_set_cap(int cpu, unsigned long util);
> #else
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-07-08 15:22:48

by Michael Turquette

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 40/46] sched/cpufreq_sched: compute freq_new based on capacity_orig_of()

Quoting Morten Rasmussen (2015-07-07 11:24:23)
> From: Juri Lelli <[email protected]>
>
> capacity is both cpu and freq scaled with EAS. We thus need to compute
> freq_new using capacity_orig_of(), so that we properly scale back the thing
> on heterogeneous architectures. In fact, capacity_orig_of() returns only
> the cpu scaled part of capacity (see update_cpu_capacity()). So, to scale
> freq_new correctly, we multiply policy->max by capacity/capacity_orig_of()
> instead of capacity/1024.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Juri Lelli <[email protected]>

Looks good to me. Please feel free to add my Reviewed-by or Acked-by as
appropriate.

Regards,
Mike

> ---
> kernel/sched/cpufreq_sched.c | 2 +-
> kernel/sched/fair.c | 2 +-
> kernel/sched/sched.h | 2 ++
> 3 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
> index 2968f3a..7071528 100644
> --- a/kernel/sched/cpufreq_sched.c
> +++ b/kernel/sched/cpufreq_sched.c
> @@ -184,7 +184,7 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
> goto out;
>
> /* Convert the new maximum capacity request into a cpu frequency */
> - freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
> + freq_new = (capacity * policy->max) / capacity_orig_of(cpu);
>
> /* No change in frequency? Bail and return current capacity. */
> if (freq_new == policy->cur)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d395bc9..f74e9d2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4625,7 +4625,7 @@ static unsigned long capacity_of(int cpu)
> return cpu_rq(cpu)->cpu_capacity;
> }
>
> -static unsigned long capacity_orig_of(int cpu)
> +unsigned long capacity_orig_of(int cpu)
> {
> return cpu_rq(cpu)->cpu_capacity_orig;
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b5e27d9..1327dc7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1476,6 +1476,8 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> }
> #endif
>
> +unsigned long capacity_orig_of(int cpu);
> +
> extern struct static_key __sched_energy_freq;
> static inline bool sched_energy_freq(void)
> {
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2015-07-08 15:42:40

by Michael Turquette

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

Hi Juri,

Quoting Morten Rasmussen (2015-07-07 11:24:24)
> From: Juri Lelli <[email protected]>
>
> Each time a task is {en,de}queued we might need to adapt the current
> frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
> this purpose. Only trigger a freq request if we are effectively waking up
> or going to sleep. Filter out load balancing related calls to reduce the
> number of triggers.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Juri Lelli <[email protected]>
> ---
> kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f74e9d2..b8627c6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4281,7 +4281,10 @@ static inline void hrtick_update(struct rq *rq)
> }
> #endif
>
> +static unsigned int capacity_margin = 1280; /* ~20% margin */

This is a 25% margin. Calling it ~20% is a bit misleading :)

Should margin be scaled for cpus that do not have max capacity == 1024?
In other words, should margin be dynamically calculated to be 20% of
*this* cpu's max capacity?

I'm imagining a corner case where a heterogeneous cpu system is set up
in such a way that adding margin that is hard-coded to 25% of 1024
almost always puts req_cap to the highest frequency, skipping some
reasonable capacity states in between.

> +
> static bool cpu_overutilized(int cpu);
> +static unsigned long get_cpu_usage(int cpu);
> struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;
>
> /*
> @@ -4332,6 +4335,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (!task_new && !rq->rd->overutilized &&
> cpu_overutilized(rq->cpu))
> rq->rd->overutilized = true;
> + /*
> + * We want to trigger a freq switch request only for tasks that
> + * are waking up; this is because we get here also during
> + * load balancing, but in these cases it seems wise to trigger
> + * as single request after load balancing is done.
> + *
> + * XXX: how about fork()? Do we need a special flag/something
> + * to tell if we are here after a fork() (wakeup_task_new)?
> + *
> + * Also, we add a margin (same ~20% used for the tipping point)
> + * to our request to provide some head room if p's utilization
> + * further increases.
> + */
> + if (sched_energy_freq() && !task_new) {
> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
> +
> + req_cap = req_cap * capacity_margin
> + >> SCHED_CAPACITY_SHIFT;

Probably a dumb question:

Can we "cheat" here and just assume that capacity and load use the same
units? That would avoid the multiplication and change your code to the
following:

#define capacity_margin SCHED_CAPACITY_SCALE >> 2; /* 25% */
req_cap += SCHED_CAPACITY_SCALE;

> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
> + }
> }
> hrtick_update(rq);
> }
> @@ -4393,6 +4416,23 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (!se) {
> sub_nr_running(rq, 1);
> update_rq_runnable_avg(rq, 1);
> + /*
> + * We want to trigger a freq switch request only for tasks that
> + * are going to sleep; this is because we get here also during
> + * load balancing, but in these cases it seems wise to trigger
> + * as single request after load balancing is done.
> + *
> + * Also, we add a margin (same ~20% used for the tipping point)
> + * to our request to provide some head room if p's utilization
> + * further increases.
> + */
> + if (sched_energy_freq() && task_sleep) {
> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
> +
> + req_cap = req_cap * capacity_margin
> + >> SCHED_CAPACITY_SHIFT;
> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);

Filtering out the load_balance bits is neat.

Regards,
Mike

> + }
> }
> hrtick_update(rq);
> }
> @@ -4959,8 +4999,6 @@ static int find_new_capacity(struct energy_env *eenv,
> return idx;
> }
>
> -static unsigned int capacity_margin = 1280; /* ~20% margin */
> -
> static bool cpu_overutilized(int cpu)
> {
> return (capacity_of(cpu) * 1024) <
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-07-08 16:40:56

by Michael Turquette

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 44/46] sched/fair: jump to max OPP when crossing UP threshold

Quoting Morten Rasmussen (2015-07-07 11:24:27)
> From: Juri Lelli <[email protected]>
>
> Since the true utilization of a long running task is not detectable while
> it is running and might be bigger than the current cpu capacity, create the
> maximum cpu capacity head room by requesting the maximum cpu capacity once
> the cpu usage plus the capacity margin exceeds the current capacity. This
> is also done to try to harm the performance of a task the least.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Juri Lelli <[email protected]>
> ---
> kernel/sched/fair.c | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 323331f..c2d6de4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8586,6 +8586,25 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
> if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
> rq->rd->overutilized = true;
> +
> + /*
> + * To make free room for a task that is building up its "real"
> + * utilization and to harm its performance the least, request a
> + * jump to max OPP as soon as get_cpu_usage() crosses the UP
> + * threshold. The UP threshold is built relative to the current
> + * capacity (OPP), by using same margin used to tell if a cpu
> + * is overutilized (capacity_margin).
> + */
> + if (sched_energy_freq()) {
> + int cpu = cpu_of(rq);
> + unsigned long capacity_orig = capacity_orig_of(cpu);
> + unsigned long capacity_curr = capacity_curr_of(cpu);
> +
> + if (capacity_curr < capacity_orig &&
> + (capacity_curr * SCHED_LOAD_SCALE) <
> + (get_cpu_usage(cpu) * capacity_margin))

As I stated in a previous patch, I wonder if the multiplications can be
removed by assuming equivalent units for load and capacity and simply
adding the 256 (25%) margin to the valued returned by get_cpu_usage?

Regards,
Mike

> + cpufreq_sched_set_cap(cpu, capacity_orig);
> + }
> }
>
> /*
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2015-07-08 16:42:56

by Michael Turquette

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 45/46] sched/cpufreq_sched: modify pcpu_capacity handling

Quoting Morten Rasmussen (2015-07-07 11:24:28)
> From: Juri Lelli <[email protected]>
>
> Use the cpu argument of cpufreq_sched_set_cap() to handle per_cpu writes,
> as the thing can be called remotely (e.g., from load balacing code).
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Juri Lelli <[email protected]>

Looks good to me. Feel free to add my Reviewed-by or Acked-by as
appropriate.

Regards,
Mike

> ---
> kernel/sched/cpufreq_sched.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
> index 06ff183..b81ac779 100644
> --- a/kernel/sched/cpufreq_sched.c
> +++ b/kernel/sched/cpufreq_sched.c
> @@ -151,7 +151,7 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
> unsigned long capacity_max = 0;
>
> /* update per-cpu capacity request */
> - __this_cpu_write(pcpu_capacity, capacity);
> + per_cpu(pcpu_capacity, cpu) = capacity;
>
> policy = cpufreq_cpu_get(cpu);
> if (IS_ERR_OR_NULL(policy)) {
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-07-08 16:48:09

by Michael Turquette

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 44/46] sched/fair: jump to max OPP when crossing UP threshold

Quoting Morten Rasmussen (2015-07-07 11:24:27)
> From: Juri Lelli <[email protected]>
>
> Since the true utilization of a long running task is not detectable while
> it is running and might be bigger than the current cpu capacity, create the
> maximum cpu capacity head room by requesting the maximum cpu capacity once
> the cpu usage plus the capacity margin exceeds the current capacity. This
> is also done to try to harm the performance of a task the least.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Juri Lelli <[email protected]>
> ---
> kernel/sched/fair.c | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 323331f..c2d6de4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8586,6 +8586,25 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
> if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
> rq->rd->overutilized = true;
> +
> + /*
> + * To make free room for a task that is building up its "real"
> + * utilization and to harm its performance the least, request a
> + * jump to max OPP as soon as get_cpu_usage() crosses the UP
> + * threshold. The UP threshold is built relative to the current
> + * capacity (OPP), by using same margin used to tell if a cpu
> + * is overutilized (capacity_margin).
> + */
> + if (sched_energy_freq()) {
> + int cpu = cpu_of(rq);
> + unsigned long capacity_orig = capacity_orig_of(cpu);
> + unsigned long capacity_curr = capacity_curr_of(cpu);
> +
> + if (capacity_curr < capacity_orig &&
> + (capacity_curr * SCHED_LOAD_SCALE) <
> + (get_cpu_usage(cpu) * capacity_margin))
> + cpufreq_sched_set_cap(cpu, capacity_orig);

I'm sure that at some point the Product People are going to want to tune
the capacity value that is requested. Hard-coding the max
capacity/frequency in is a reasonable start, but at some point it would
be nice to fetch an intermediate capacity defined by the cpufreq driver
for this particular cpu. We have already seen that a lot in Android
devices using the interactive governor and it could be done from
cpufreq_sched_start().

Regards,
Mike

> + }
> }
>
> /*
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2015-07-09 16:21:14

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 40/46] sched/cpufreq_sched: compute freq_new based on capacity_orig_of()

Hi Mike,

On 08/07/15 16:22, Michael Turquette wrote:
> Quoting Morten Rasmussen (2015-07-07 11:24:23)
>> From: Juri Lelli <[email protected]>
>>
>> capacity is both cpu and freq scaled with EAS. We thus need to compute
>> freq_new using capacity_orig_of(), so that we properly scale back the thing
>> on heterogeneous architectures. In fact, capacity_orig_of() returns only
>> the cpu scaled part of capacity (see update_cpu_capacity()). So, to scale
>> freq_new correctly, we multiply policy->max by capacity/capacity_orig_of()
>> instead of capacity/1024.
>>
>> cc: Ingo Molnar <[email protected]>
>> cc: Peter Zijlstra <[email protected]>
>>
>> Signed-off-by: Juri Lelli <[email protected]>
>
> Looks good to me. Please feel free to add my Reviewed-by or Acked-by as
> appropriate.
>

Thanks for reviewing it!

Best,

- Juri

> Regards,
> Mike
>
>> ---
>> kernel/sched/cpufreq_sched.c | 2 +-
>> kernel/sched/fair.c | 2 +-
>> kernel/sched/sched.h | 2 ++
>> 3 files changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
>> index 2968f3a..7071528 100644
>> --- a/kernel/sched/cpufreq_sched.c
>> +++ b/kernel/sched/cpufreq_sched.c
>> @@ -184,7 +184,7 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
>> goto out;
>>
>> /* Convert the new maximum capacity request into a cpu frequency */
>> - freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
>> + freq_new = (capacity * policy->max) / capacity_orig_of(cpu);
>>
>> /* No change in frequency? Bail and return current capacity. */
>> if (freq_new == policy->cur)
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d395bc9..f74e9d2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4625,7 +4625,7 @@ static unsigned long capacity_of(int cpu)
>> return cpu_rq(cpu)->cpu_capacity;
>> }
>>
>> -static unsigned long capacity_orig_of(int cpu)
>> +unsigned long capacity_orig_of(int cpu)
>> {
>> return cpu_rq(cpu)->cpu_capacity_orig;
>> }
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index b5e27d9..1327dc7 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1476,6 +1476,8 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>> }
>> #endif
>>
>> +unsigned long capacity_orig_of(int cpu);
>> +
>> extern struct static_key __sched_energy_freq;
>> static inline bool sched_energy_freq(void)
>> {
>> --
>> 1.9.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2015-07-09 16:52:51

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

Hi Mike,

On 08/07/15 16:42, Michael Turquette wrote:
> Hi Juri,
>
> Quoting Morten Rasmussen (2015-07-07 11:24:24)
>> From: Juri Lelli <[email protected]>
>>
>> Each time a task is {en,de}queued we might need to adapt the current
>> frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
>> this purpose. Only trigger a freq request if we are effectively waking up
>> or going to sleep. Filter out load balancing related calls to reduce the
>> number of triggers.
>>
>> cc: Ingo Molnar <[email protected]>
>> cc: Peter Zijlstra <[email protected]>
>>
>> Signed-off-by: Juri Lelli <[email protected]>
>> ---
>> kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 40 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f74e9d2..b8627c6 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4281,7 +4281,10 @@ static inline void hrtick_update(struct rq *rq)
>> }
>> #endif
>>
>> +static unsigned int capacity_margin = 1280; /* ~20% margin */
>
> This is a 25% margin. Calling it ~20% is a bit misleading :)
>

Well, 1024 is what you get if your remove 20% to 1280. But, I
confess it wasn't clear to me too at first sight ;). Anyway,
you are right that the way I use it below, you end up adding
25% to req_cap. It is just because I didn't want to add another
margin I guess. :)

> Should margin be scaled for cpus that do not have max capacity == 1024?
> In other words, should margin be dynamically calculated to be 20% of
> *this* cpu's max capacity?
>
> I'm imagining a corner case where a heterogeneous cpu system is set up
> in such a way that adding margin that is hard-coded to 25% of 1024
> almost always puts req_cap to the highest frequency, skipping some
> reasonable capacity states in between.
>

But, what below should actually ask for a 25% more related to the
current cpu usage. So, if you have let's say a usage of 300 (this
is both cpu and freq scaled) when you do what below you get:

300 * 1280 / 1024 = 375

and 375 is 300 + 25%. It is the ratio between capacity_margin and
SCHED_CAPACITY_SCALE that gives you a percentage relative to cpu usage.
Or did I get it wrong?

>> +
>> static bool cpu_overutilized(int cpu);
>> +static unsigned long get_cpu_usage(int cpu);
>> struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;
>>
>> /*
>> @@ -4332,6 +4335,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> if (!task_new && !rq->rd->overutilized &&
>> cpu_overutilized(rq->cpu))
>> rq->rd->overutilized = true;
>> + /*
>> + * We want to trigger a freq switch request only for tasks that
>> + * are waking up; this is because we get here also during
>> + * load balancing, but in these cases it seems wise to trigger
>> + * as single request after load balancing is done.
>> + *
>> + * XXX: how about fork()? Do we need a special flag/something
>> + * to tell if we are here after a fork() (wakeup_task_new)?
>> + *
>> + * Also, we add a margin (same ~20% used for the tipping point)
>> + * to our request to provide some head room if p's utilization
>> + * further increases.
>> + */
>> + if (sched_energy_freq() && !task_new) {
>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>> +
>> + req_cap = req_cap * capacity_margin
>> + >> SCHED_CAPACITY_SHIFT;
>
> Probably a dumb question:
>
> Can we "cheat" here and just assume that capacity and load use the same
> units? That would avoid the multiplication and change your code to the
> following:
>
> #define capacity_margin SCHED_CAPACITY_SCALE >> 2; /* 25% */
> req_cap += SCHED_CAPACITY_SCALE;
>

I'd rather stick with an increase relative to the current usage
as opposed to adding 256 to every request. I fear that the latter
would end up cutting out some OPPs entirely, as you were saying above.

>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>> + }
>> }
>> hrtick_update(rq);
>> }
>> @@ -4393,6 +4416,23 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> if (!se) {
>> sub_nr_running(rq, 1);
>> update_rq_runnable_avg(rq, 1);
>> + /*
>> + * We want to trigger a freq switch request only for tasks that
>> + * are going to sleep; this is because we get here also during
>> + * load balancing, but in these cases it seems wise to trigger
>> + * as single request after load balancing is done.
>> + *
>> + * Also, we add a margin (same ~20% used for the tipping point)
>> + * to our request to provide some head room if p's utilization
>> + * further increases.
>> + */
>> + if (sched_energy_freq() && task_sleep) {
>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>> +
>> + req_cap = req_cap * capacity_margin
>> + >> SCHED_CAPACITY_SHIFT;
>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>
> Filtering out the load_balance bits is neat.
>

Also, I guess we need to do that because we still have some rate
limit to the frequency at which we can issue requests. If we move
more that one task when load balacing, we could miss some requests.

Thanks,

- Juri

> Regards,
> Mike
>
>> + }
>> }
>> hrtick_update(rq);
>> }
>> @@ -4959,8 +4999,6 @@ static int find_new_capacity(struct energy_env *eenv,
>> return idx;
>> }
>>
>> -static unsigned int capacity_margin = 1280; /* ~20% margin */
>> -
>> static bool cpu_overutilized(int cpu)
>> {
>> return (capacity_of(cpu) * 1024) <
>> --
>> 1.9.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>

2015-07-09 16:54:58

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 45/46] sched/cpufreq_sched: modify pcpu_capacity handling

Hi Mike,

On 08/07/15 17:42, Michael Turquette wrote:
> Quoting Morten Rasmussen (2015-07-07 11:24:28)
>> From: Juri Lelli <[email protected]>
>>
>> Use the cpu argument of cpufreq_sched_set_cap() to handle per_cpu writes,
>> as the thing can be called remotely (e.g., from load balacing code).
>>
>> cc: Ingo Molnar <[email protected]>
>> cc: Peter Zijlstra <[email protected]>
>>
>> Signed-off-by: Juri Lelli <[email protected]>
>
> Looks good to me. Feel free to add my Reviewed-by or Acked-by as
> appropriate.
>

Thanks for reviewing it! :)

Best,

- Juri

> Regards,
> Mike
>
>> ---
>> kernel/sched/cpufreq_sched.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
>> index 06ff183..b81ac779 100644
>> --- a/kernel/sched/cpufreq_sched.c
>> +++ b/kernel/sched/cpufreq_sched.c
>> @@ -151,7 +151,7 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
>> unsigned long capacity_max = 0;
>>
>> /* update per-cpu capacity request */
>> - __this_cpu_write(pcpu_capacity, capacity);
>> + per_cpu(pcpu_capacity, cpu) = capacity;
>>
>> policy = cpufreq_cpu_get(cpu);
>> if (IS_ERR_OR_NULL(policy)) {
>> --
>> 1.9.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>

2015-07-10 09:50:04

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 39/46] sched/cpufreq_sched: use static key for cpu frequency selection

Hi Mike,

On 08/07/15 16:19, Michael Turquette wrote:
> Quoting Morten Rasmussen (2015-07-07 11:24:22)
>> From: Juri Lelli <[email protected]>
>>
>> Introduce a static key to only affect scheduler hot paths when sched
>> governor is enabled.
>>
>> cc: Ingo Molnar <[email protected]>
>> cc: Peter Zijlstra <[email protected]>
>>
>> Signed-off-by: Juri Lelli <[email protected]>
>> ---
>> kernel/sched/cpufreq_sched.c | 14 ++++++++++++++
>> kernel/sched/fair.c | 1 +
>> kernel/sched/sched.h | 6 ++++++
>> 3 files changed, 21 insertions(+)
>>
>> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
>> index 5020f24..2968f3a 100644
>> --- a/kernel/sched/cpufreq_sched.c
>> +++ b/kernel/sched/cpufreq_sched.c
>> @@ -203,6 +203,18 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
>> return;
>> }
>>
>> +static inline void set_sched_energy_freq(void)
>> +{
>> + if (!sched_energy_freq())
>> + static_key_slow_inc(&__sched_energy_freq);
>> +}
>> +
>> +static inline void clear_sched_energy_freq(void)
>> +{
>> + if (sched_energy_freq())
>> + static_key_slow_dec(&__sched_energy_freq);
>> +}
>> +
>> static int cpufreq_sched_start(struct cpufreq_policy *policy)
>> {
>> struct gov_data *gd;
>> @@ -243,6 +255,7 @@ static int cpufreq_sched_start(struct cpufreq_policy *policy)
>>
>> policy->governor_data = gd;
>> gd->policy = policy;
>> + set_sched_energy_freq();
>> return 0;
>>
>> err:
>> @@ -254,6 +267,7 @@ static int cpufreq_sched_stop(struct cpufreq_policy *policy)
>> {
>> struct gov_data *gd = policy->governor_data;
>>
>> + clear_sched_energy_freq();
>
> <paranoia>
>
> These controls are exposed to userspace via cpufreq sysfs knobs. Should
> we use a struct static_key_deferred and static_key_slow_dec_deferred()
> instead? This helps avoid a possible attack vector for slowing down the
> system.
>
> </paranoia>
>
> I don't really know what a sane default rate limit would be in that case
> though.

I guess we could go with HZ, as it seems to be the common default.

> Otherwise feel free to add:
>
> Reviewed-by: Michael Turquette <[email protected]>
>

Thanks,

- Juri

> Regards,
> Mike
>
>> if (cpufreq_driver_might_sleep()) {
>> kthread_stop(gd->task);
>> }
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f7cb6c9..d395bc9 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4282,6 +4282,7 @@ static inline void hrtick_update(struct rq *rq)
>> #endif
>>
>> static bool cpu_overutilized(int cpu);
>> +struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;
>>
>> /*
>> * The enqueue_task method is called before nr_running is
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 30aa0c4..b5e27d9 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1476,6 +1476,12 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>> }
>> #endif
>>
>> +extern struct static_key __sched_energy_freq;
>> +static inline bool sched_energy_freq(void)
>> +{
>> + return static_key_false(&__sched_energy_freq);
>> +}
>> +
>> #ifdef CONFIG_CPU_FREQ_GOV_SCHED
>> void cpufreq_sched_set_cap(int cpu, unsigned long util);
>> #else
>> --
>> 1.9.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>

2015-07-10 10:17:50

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 44/46] sched/fair: jump to max OPP when crossing UP threshold

Hi Mike,

On 08/07/15 17:47, Michael Turquette wrote:
> Quoting Morten Rasmussen (2015-07-07 11:24:27)
>> From: Juri Lelli <[email protected]>
>>
>> Since the true utilization of a long running task is not detectable while
>> it is running and might be bigger than the current cpu capacity, create the
>> maximum cpu capacity head room by requesting the maximum cpu capacity once
>> the cpu usage plus the capacity margin exceeds the current capacity. This
>> is also done to try to harm the performance of a task the least.
>>
>> cc: Ingo Molnar <[email protected]>
>> cc: Peter Zijlstra <[email protected]>
>>
>> Signed-off-by: Juri Lelli <[email protected]>
>> ---
>> kernel/sched/fair.c | 19 +++++++++++++++++++
>> 1 file changed, 19 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 323331f..c2d6de4 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8586,6 +8586,25 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>>
>> if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
>> rq->rd->overutilized = true;
>> +
>> + /*
>> + * To make free room for a task that is building up its "real"
>> + * utilization and to harm its performance the least, request a
>> + * jump to max OPP as soon as get_cpu_usage() crosses the UP
>> + * threshold. The UP threshold is built relative to the current
>> + * capacity (OPP), by using same margin used to tell if a cpu
>> + * is overutilized (capacity_margin).
>> + */
>> + if (sched_energy_freq()) {
>> + int cpu = cpu_of(rq);
>> + unsigned long capacity_orig = capacity_orig_of(cpu);
>> + unsigned long capacity_curr = capacity_curr_of(cpu);
>> +
>> + if (capacity_curr < capacity_orig &&
>> + (capacity_curr * SCHED_LOAD_SCALE) <
>> + (get_cpu_usage(cpu) * capacity_margin))
>> + cpufreq_sched_set_cap(cpu, capacity_orig);
>
> I'm sure that at some point the Product People are going to want to tune
> the capacity value that is requested. Hard-coding the max
> capacity/frequency in is a reasonable start, but at some point it would
> be nice to fetch an intermediate capacity defined by the cpufreq driver
> for this particular cpu. We have already seen that a lot in Android
> devices using the interactive governor and it could be done from
> cpufreq_sched_start().
>

Yeah, right, this bit is subject to change. The thing you are proposing
is one possible way to please Product People. However, we are going to
experiment with a couple of alternatives. The point is that we might
don't want to start exposing tuning knobs from the beginning. I'm
saying this because, IMHO, we should try hard to reduce the number of
tuning knobs to a minimum, so that we don't end up with what other
governors have. The whole thing should "just work" on most
configurations, ideally. :)

So, our current thoughts are around:

- try to derive this "jump to" point by looking at the energy
model; if we can spot an OPP that is particularly energy
efficient and it also gives enough computing capacity, maybe
it is the right place to settle for a bit before going to max;
isn't this what you would tune the system to do anyway?

- we have a prototype (that we should release as an RFC somewhat
soon) infrastructure to let users tune both scheduling decisions
and OPP selection; this "jump to" point might be related in
some way to the tuning infrastructure; I'd say that we could
wait for that RFC to happen and we continue this discussion :)

Thanks,

- Juri

> Regards,
> Mike
>
>> + }
>> }
>>
>> /*
>> --
>> 1.9.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2015-07-10 13:35:45

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 03/46] arm: vexpress: Add CPU clock-frequencies to TC2 device-tree

Hi Tixy,

On 08/07/15 13:36, Jon Medhurst (Tixy) wrote:
> On Tue, 2015-07-07 at 19:23 +0100, Morten Rasmussen wrote:
>> From: Dietmar Eggemann <[email protected]>
>>
>> To enable the parsing of clock frequency and cpu efficiency values
>> inside parse_dt_topology [arch/arm/kernel/topology.c] to scale the
>> relative capacity of the cpus, this property has to be provided within
>> the cpu nodes of the dts file.
>>
>> The patch is a copy of commit 8f15973ef8c3 ("ARM: vexpress: Add CPU
>> clock-frequencies to TC2 device-tree") taken from Linaro Stable Kernel
>> (LSK) massaged into mainline.
>
> Not sure you really need to mention commit hashes from outside of the
> mainline Linux tree and the values I added to that commit were probably
> copied from some patch originating from ARM or elsewhere anyway. So the
> second paragraph is this commit message is probably superfluous. But
> this is nitpicking I guess :-)

Agreed. In the meantime we agreed that we don't have to use the
clock-frequency property for eas. Instead, we can derive the cpu
capacity values directly from the energy model. This will free us from
the dependency towards this dtb property and the cpu_efficiency for the
arm arch.
This patch won't be part of the next posting any more.

-- Dietmar

>

2015-07-17 00:12:13

by Sai Gurrappadi

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 32/46] sched: Energy-aware wake-up task placement

Hi Morten,

On 07/07/2015 11:24 AM, Morten Rasmussen wrote:
> ---

> +static int energy_aware_wake_cpu(struct task_struct *p, int target)
> +{
> + struct sched_domain *sd;
> + struct sched_group *sg, *sg_target;
> + int target_max_cap = INT_MAX;
> + int target_cpu = task_cpu(p);
> + int i;
> +
> + sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
> +
> + if (!sd)
> + return target;
> +
> + sg = sd->groups;
> + sg_target = sg;
> +
> + /*
> + * Find group with sufficient capacity. We only get here if no cpu is
> + * overutilized. We may end up overutilizing a cpu by adding the task,
> + * but that should not be any worse than select_idle_sibling().
> + * load_balance() should sort it out later as we get above the tipping
> + * point.
> + */
> + do {
> + /* Assuming all cpus are the same in group */
> + int max_cap_cpu = group_first_cpu(sg);
> +
> + /*
> + * Assume smaller max capacity means more energy-efficient.
> + * Ideally we should query the energy model for the right
> + * answer but it easily ends up in an exhaustive search.
> + */
> + if (capacity_of(max_cap_cpu) < target_max_cap &&
> + task_fits_capacity(p, max_cap_cpu)) {
> + sg_target = sg;
> + target_max_cap = capacity_of(max_cap_cpu);
> + }
> + } while (sg = sg->next, sg != sd->groups);

Should be capacity_orig_of(max_cap_cpu) right? Might select a suboptimal
sg_target if max_cap_cpu has a significant amount of IRQ/RT activity.

> +
> + /* Find cpu with sufficient capacity */
> + for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
> + /*
> + * p's blocked utilization is still accounted for on prev_cpu
> + * so prev_cpu will receive a negative bias due the double
> + * accouting. However, the blocked utilization may be zero.
> + */
> + int new_usage = get_cpu_usage(i) + task_utilization(p);
> +
> + if (new_usage > capacity_orig_of(i))
> + continue;

Is this supposed to be capacity_of(i) instead?

> +
> + if (new_usage < capacity_curr_of(i)) {
> + target_cpu = i;
> + if (cpu_rq(i)->nr_running)
> + break;
> + }
> +
> + /* cpu has capacity at higher OPP, keep it as fallback */
> + if (target_cpu == task_cpu(p))
> + target_cpu = i;
> + }
> +
> + if (target_cpu != task_cpu(p)) {
> + struct energy_env eenv = {
> + .usage_delta = task_utilization(p),
> + .src_cpu = task_cpu(p),
> + .dst_cpu = target_cpu,
> + };
> +
> + /* Not enough spare capacity on previous cpu */
> + if (cpu_overutilized(task_cpu(p)))
> + return target_cpu;
> +
> + if (energy_diff(&eenv) >= 0)
> + return task_cpu(p);
> + }
> +
> + return target_cpu;
> +}
> +

-Sai

2015-07-20 15:35:40

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 32/46] sched: Energy-aware wake-up task placement

On Thu, Jul 16, 2015 at 05:10:52PM -0700, Sai Gurrappadi wrote:
> Hi Morten,
>
> On 07/07/2015 11:24 AM, Morten Rasmussen wrote:
> > ---
>
> > +static int energy_aware_wake_cpu(struct task_struct *p, int target)
> > +{
> > + struct sched_domain *sd;
> > + struct sched_group *sg, *sg_target;
> > + int target_max_cap = INT_MAX;
> > + int target_cpu = task_cpu(p);
> > + int i;
> > +
> > + sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
> > +
> > + if (!sd)
> > + return target;
> > +
> > + sg = sd->groups;
> > + sg_target = sg;
> > +
> > + /*
> > + * Find group with sufficient capacity. We only get here if no cpu is
> > + * overutilized. We may end up overutilizing a cpu by adding the task,
> > + * but that should not be any worse than select_idle_sibling().
> > + * load_balance() should sort it out later as we get above the tipping
> > + * point.
> > + */
> > + do {
> > + /* Assuming all cpus are the same in group */
> > + int max_cap_cpu = group_first_cpu(sg);
> > +
> > + /*
> > + * Assume smaller max capacity means more energy-efficient.
> > + * Ideally we should query the energy model for the right
> > + * answer but it easily ends up in an exhaustive search.
> > + */
> > + if (capacity_of(max_cap_cpu) < target_max_cap &&
> > + task_fits_capacity(p, max_cap_cpu)) {
> > + sg_target = sg;
> > + target_max_cap = capacity_of(max_cap_cpu);
> > + }
> > + } while (sg = sg->next, sg != sd->groups);
>
> Should be capacity_orig_of(max_cap_cpu) right? Might select a suboptimal
> sg_target if max_cap_cpu has a significant amount of IRQ/RT activity.

Right, this heuristic isn't as good as I had hoped for.
task_fits_capacity() is using capacity_of() to check if we have
available capacity after subtracting RT/IRQ activity which should be
right but I only check the first cpu. So I might discard a group due to
RT/IRQ activity on cpu the first cpu while one the sibling cpus could be
fine. Then going for lowest capacity_of() means preferring group with
most RT/IRQ activity that still has enough capacity to fit the task.

Using capacity_orig_of() we would ignore RT/IRQ activity but is likely
to be better as we can try to avoid RQ/IRQ activity later. I will use
capacity_orig_of() here instead. Thanks.

> > +
> > + /* Find cpu with sufficient capacity */
> > + for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
> > + /*
> > + * p's blocked utilization is still accounted for on prev_cpu
> > + * so prev_cpu will receive a negative bias due the double
> > + * accouting. However, the blocked utilization may be zero.
> > + */
> > + int new_usage = get_cpu_usage(i) + task_utilization(p);
> > +
> > + if (new_usage > capacity_orig_of(i))
> > + continue;
>
> Is this supposed to be capacity_of(i) instead?

Yes, we should skip cpus with too much RT/IRQ activity here. Thanks.

Morten

2015-07-21 00:38:50

by Sai Gurrappadi

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 31/46] sched: Consider spare cpu capacity at task wake-up

Hi Morten,

On 07/07/2015 11:24 AM, Morten Rasmussen wrote:
> In mainline find_idlest_group() selects the wake-up target group purely
> based on group load which leads to suboptimal choices in low load
> scenarios. An idle group with reduced capacity (due to RT tasks or
> different cpu type) isn't necessarily a better target than a lightly
> loaded group with higher capacity.
>
> The patch adds spare capacity as an additional group selection
> parameter. The target group is now selected based on the following
> criteria listed by highest priority first:
>
> 1. If energy-aware scheduling is enabled the group with the lowest
> capacity containing a cpu with enough spare capacity to accommodate the
> task (with a bit to spare) is selected if such exists.
>
> 2. Return the group with the cpu with most spare capacity and this
> capacity is significant if such group exists. Significant spare capacity
> is currently at least 20% to spare.
>
> 3. Return the group with the lowest load, unless it is the local group
> in which case NULL is returned and the search is continued at the next
> (lower) level.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b0294f0..0f7dbda4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5247,9 +5247,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> int this_cpu, int sd_flag)
> {
> struct sched_group *idlest = NULL, *group = sd->groups;
> - struct sched_group *fit_group = NULL;
> + struct sched_group *fit_group = NULL, *spare_group = NULL;
> unsigned long min_load = ULONG_MAX, this_load = 0;
> unsigned long fit_capacity = ULONG_MAX;
> + unsigned long max_spare_capacity = capacity_margin - SCHED_LOAD_SCALE;
> int load_idx = sd->forkexec_idx;
> int imbalance = 100 + (sd->imbalance_pct-100)/2;
>
> @@ -5257,7 +5258,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> load_idx = sd->wake_idx;
>
> do {
> - unsigned long load, avg_load;
> + unsigned long load, avg_load, spare_capacity;
> int local_group;
> int i;
>
> @@ -5290,6 +5291,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> fit_capacity = capacity_of(i);
> fit_group = group;
> }
> +
> + /*
> + * Look for group which has most spare capacity on a
> + * single cpu.
> + */
> + spare_capacity = capacity_of(i) - get_cpu_usage(i);
> + if (spare_capacity > max_spare_capacity) {
> + max_spare_capacity = spare_capacity;
> + spare_group = group;
> + }

Another minor buglet: get_cpu_usage(i) here could be > capacity_of(i)
because usage is bounded by capacity_orig_of(i). Should it be bounded by
capacity_of() instead?

> }
>
> /* Adjust by relative CPU capacity of the group */
> @@ -5306,6 +5317,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> if (fit_group)
> return fit_group;
>
> + if (spare_group)
> + return spare_group;
> +
> if (!idlest || 100*this_load < imbalance*min_load)
> return NULL;
> return idlest;
>

Thanks,
-Sai

2015-07-21 06:41:20

by Leo Yan

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 27/46] sched, cpuidle: Track cpuidle state index in the scheduler

Hi Morten,

On Tue, Jul 07, 2015 at 07:24:10PM +0100, Morten Rasmussen wrote:
> The idle-state of each cpu is currently pointed to by rq->idle_state but
> there isn't any information in the struct cpuidle_state that can used to
> look up the idle-state energy model data stored in struct
> sched_group_energy. For this purpose is necessary to store the idle
> state index as well. Ideally, the idle-state data should be unified.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>

This patch should re-base with latest kernel; otherwise it will have
conflict with below commits:

827a5ae sched / idle: Call default_idle_call() from cpuidle_enter_state()
faad384 sched / idle: Call idle_set_state() from cpuidle_enter_state()
bcf6ad8 sched / idle: Eliminate the "reflect" check from cpuidle_idle_call()
82f6632 sched / idle: Move the default idle call code to a separate function

Thanks,
Leo Yan

2015-07-21 15:09:32

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 31/46] sched: Consider spare cpu capacity at task wake-up

On Mon, Jul 20, 2015 at 05:37:20PM -0700, Sai Gurrappadi wrote:
> Hi Morten,
>
> On 07/07/2015 11:24 AM, Morten Rasmussen wrote:

[...]

> > @@ -5290,6 +5291,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> > fit_capacity = capacity_of(i);
> > fit_group = group;
> > }
> > +
> > + /*
> > + * Look for group which has most spare capacity on a
> > + * single cpu.
> > + */
> > + spare_capacity = capacity_of(i) - get_cpu_usage(i);
> > + if (spare_capacity > max_spare_capacity) {
> > + max_spare_capacity = spare_capacity;
> > + spare_group = group;
> > + }
>
> Another minor buglet: get_cpu_usage(i) here could be > capacity_of(i)
> because usage is bounded by capacity_orig_of(i). Should it be bounded by
> capacity_of() instead?

Yes, that code is clearly broken. For this use of get_cpu_usage() it
makes more sense to cap it by capacity_of(). However, I think we
actually need two versions of get_cpu_usage(): One that reports
CFS utilization which as capped by CFS capacity (capacity_of()), and one
that reports total utilization (all sched_classes and IRQ) which is
capped by capacity_orig_of(). The former for use in CFS scheduling
decisions like the one above, and the latter for energy estimates and
selecting DVFS frequencies where we should include all utilization, not
just CFS tasks.

I will fix get_cpu_usage() as you propose.

Thanks,
Morten

2015-07-21 15:13:29

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 27/46] sched, cpuidle: Track cpuidle state index in the scheduler

Hi Leo Yan,

On Tue, Jul 21, 2015 at 02:41:10PM +0800, Leo Yan wrote:
> Hi Morten,
>
> On Tue, Jul 07, 2015 at 07:24:10PM +0100, Morten Rasmussen wrote:
> > The idle-state of each cpu is currently pointed to by rq->idle_state but
> > there isn't any information in the struct cpuidle_state that can used to
> > look up the idle-state energy model data stored in struct
> > sched_group_energy. For this purpose is necessary to store the idle
> > state index as well. Ideally, the idle-state data should be unified.
> >
> > cc: Ingo Molnar <[email protected]>
> > cc: Peter Zijlstra <[email protected]>
> >
> > Signed-off-by: Morten Rasmussen <[email protected]>
>
> This patch should re-base with latest kernel; otherwise it will have
> conflict with below commits:
>
> 827a5ae sched / idle: Call default_idle_call() from cpuidle_enter_state()
> faad384 sched / idle: Call idle_set_state() from cpuidle_enter_state()
> bcf6ad8 sched / idle: Eliminate the "reflect" check from cpuidle_idle_call()
> 82f6632 sched / idle: Move the default idle call code to a separate function

I will make sure to rebase the patches and hopefully make an updated
branch available for those who want test the patches.

Thanks,
Morten

2015-07-21 15:41:58

by Leo Yan

[permalink] [raw]
Subject: Re: [RFCv5, 01/46] arm: Frequency invariant scheduler load-tracking support

Hi Morten,

On Tue, Jul 07, 2015 at 07:23:44PM +0100, Morten Rasmussen wrote:
> From: Morten Rasmussen <[email protected]>
>
> Implements arch-specific function to provide the scheduler with a
> frequency scaling correction factor for more accurate load-tracking.
> The factor is:
>
> current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)
>
> This implementation only provides frequency invariance. No cpu
> invariance yet.
>
> Cc: Russell King <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
>
> ---
> arch/arm/include/asm/topology.h | 7 +++++
> arch/arm/kernel/smp.c | 57 +++++++++++++++++++++++++++++++++++++++--
> arch/arm/kernel/topology.c | 17 ++++++++++++
> 3 files changed, 79 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> index 370f7a7..c31096f 100644
> --- a/arch/arm/include/asm/topology.h
> +++ b/arch/arm/include/asm/topology.h
> @@ -24,6 +24,13 @@ void init_cpu_topology(void);
> void store_cpu_topology(unsigned int cpuid);
> const struct cpumask *cpu_coregroup_mask(int cpu);
>
> +#define arch_scale_freq_capacity arm_arch_scale_freq_capacity
> +struct sched_domain;
> +extern
> +unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
> +
> +DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity);
> +
> #else
>
> static inline void init_cpu_topology(void) { }
> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> index cca5b87..a32539c 100644
> --- a/arch/arm/kernel/smp.c
> +++ b/arch/arm/kernel/smp.c
> @@ -677,12 +677,34 @@ static DEFINE_PER_CPU(unsigned long, l_p_j_ref);
> static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq);
> static unsigned long global_l_p_j_ref;
> static unsigned long global_l_p_j_ref_freq;
> +static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
> +DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);
> +
> +/*
> + * Scheduler load-tracking scale-invariance
> + *
> + * Provides the scheduler with a scale-invariance correction factor that
> + * compensates for frequency scaling through arch_scale_freq_capacity()
> + * (implemented in topology.c).
> + */
> +static inline
> +void scale_freq_capacity(int cpu, unsigned long curr, unsigned long max)
> +{
> + unsigned long capacity;
> +
> + if (!max)
> + return;
> +
> + capacity = (curr << SCHED_CAPACITY_SHIFT) / max;
> + atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity);
> +}
>
> static int cpufreq_callback(struct notifier_block *nb,
> unsigned long val, void *data)
> {
> struct cpufreq_freqs *freq = data;
> int cpu = freq->cpu;
> + unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
>
> if (freq->flags & CPUFREQ_CONST_LOOPS)
> return NOTIFY_OK;
> @@ -707,6 +729,10 @@ static int cpufreq_callback(struct notifier_block *nb,
> per_cpu(l_p_j_ref_freq, cpu),
> freq->new);
> }
> +
> + if (val == CPUFREQ_PRECHANGE)
> + scale_freq_capacity(cpu, freq->new, max);
> +
> return NOTIFY_OK;
> }
>
> @@ -714,11 +740,38 @@ static struct notifier_block cpufreq_notifier = {
> .notifier_call = cpufreq_callback,
> };
>
> +static int cpufreq_policy_callback(struct notifier_block *nb,
> + unsigned long val, void *data)
> +{
> + struct cpufreq_policy *policy = data;
> + int i;
> +
> + if (val != CPUFREQ_NOTIFY)
> + return NOTIFY_OK;
> +
> + for_each_cpu(i, policy->cpus) {
> + scale_freq_capacity(i, policy->cur, policy->max);
> + atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
> + }
> +
> + return NOTIFY_OK;
> +}
> +
> +static struct notifier_block cpufreq_policy_notifier = {
> + .notifier_call = cpufreq_policy_callback,
> +};
> +
> static int __init register_cpufreq_notifier(void)
> {
> - return cpufreq_register_notifier(&cpufreq_notifier,
> + int ret;
> +
> + ret = cpufreq_register_notifier(&cpufreq_notifier,
> CPUFREQ_TRANSITION_NOTIFIER);
> + if (ret)
> + return ret;
> +
> + return cpufreq_register_notifier(&cpufreq_policy_notifier,
> + CPUFREQ_POLICY_NOTIFIER);
> }
> core_initcall(register_cpufreq_notifier);

For "cpu_freq_capacity" structure, could move it into driver/cpufreq
so that it can be shared by all architectures? Otherwise, every
architecture's smp.c need register notifier for themselves.

Thanks,
Leo Yan

> -
> #endif
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index 08b7847..9c09e6e 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -169,6 +169,23 @@ static void update_cpu_capacity(unsigned int cpu)
> cpu, arch_scale_cpu_capacity(NULL, cpu));
> }
>
> +/*
> + * Scheduler load-tracking scale-invariance
> + *
> + * Provides the scheduler with a scale-invariance correction factor that
> + * compensates for frequency scaling (arch_scale_freq_capacity()). The scaling
> + * factor is updated in smp.c
> + */
> +unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> +{
> + unsigned long curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu));
> +
> + if (!curr)
> + return SCHED_CAPACITY_SCALE;
> +
> + return curr;
> +}
> +
> #else
> static inline void parse_dt_topology(void) {}
> static inline void update_cpu_capacity(unsigned int cpuid) {}

2015-07-22 06:51:13

by Leo Yan

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 11/46] sched: Remove blocked load and utilization contributions of dying tasks

On Tue, Jul 07, 2015 at 07:23:54PM +0100, Morten Rasmussen wrote:
> Tasks being dequeued for the last time (state == TASK_DEAD) are dequeued
> with the DEQUEUE_SLEEP flag which causes their load and utilization
> contributions to be added to the runqueue blocked load and utilization.
> Hence they will contain load or utilization that is gone away. The issue
> only exists for the root cfs_rq as cgroup_exit() doesn't set
> DEQUEUE_SLEEP for task group exits.
>
> If runnable+blocked load is to be used as a better estimate for cpu
> load the dead task contributions need to be removed to prevent
> load_balance() (idle_balance() in particular) from over-estimating the
> cpu load.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 775b0c7..fa12ce5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3217,6 +3217,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> * Update run-time statistics of the 'current'.
> */
> update_curr(cfs_rq);
> + if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
> + flags &= !DEQUEUE_SLEEP;

So flags will be set to zero? Could be replaced by "flags &= ~DEQUEUE_SLEEP"?

> dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
>
> update_stats_dequeue(cfs_rq, se);
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-07-22 13:28:26

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFCv5, 01/46] arm: Frequency invariant scheduler load-tracking support

On Tue, Jul 21, 2015 at 11:41:45PM +0800, Leo Yan wrote:
> Hi Morten,
>
> On Tue, Jul 07, 2015 at 07:23:44PM +0100, Morten Rasmussen wrote:
> > From: Morten Rasmussen <[email protected]>
> >
> > Implements arch-specific function to provide the scheduler with a
> > frequency scaling correction factor for more accurate load-tracking.
> > The factor is:
> >
> > current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)
> >
> > This implementation only provides frequency invariance. No cpu
> > invariance yet.
> >
> > Cc: Russell King <[email protected]>
> >
> > Signed-off-by: Morten Rasmussen <[email protected]>
> >
> > ---
> > arch/arm/include/asm/topology.h | 7 +++++
> > arch/arm/kernel/smp.c | 57 +++++++++++++++++++++++++++++++++++++++--
> > arch/arm/kernel/topology.c | 17 ++++++++++++
> > 3 files changed, 79 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> > index 370f7a7..c31096f 100644
> > --- a/arch/arm/include/asm/topology.h
> > +++ b/arch/arm/include/asm/topology.h
> > @@ -24,6 +24,13 @@ void init_cpu_topology(void);
> > void store_cpu_topology(unsigned int cpuid);
> > const struct cpumask *cpu_coregroup_mask(int cpu);
> >
> > +#define arch_scale_freq_capacity arm_arch_scale_freq_capacity
> > +struct sched_domain;
> > +extern
> > +unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
> > +
> > +DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity);
> > +
> > #else
> >
> > static inline void init_cpu_topology(void) { }
> > diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> > index cca5b87..a32539c 100644
> > --- a/arch/arm/kernel/smp.c
> > +++ b/arch/arm/kernel/smp.c
> > @@ -677,12 +677,34 @@ static DEFINE_PER_CPU(unsigned long, l_p_j_ref);
> > static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq);
> > static unsigned long global_l_p_j_ref;
> > static unsigned long global_l_p_j_ref_freq;
> > +static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
> > +DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);
> > +
> > +/*
> > + * Scheduler load-tracking scale-invariance
> > + *
> > + * Provides the scheduler with a scale-invariance correction factor that
> > + * compensates for frequency scaling through arch_scale_freq_capacity()
> > + * (implemented in topology.c).
> > + */
> > +static inline
> > +void scale_freq_capacity(int cpu, unsigned long curr, unsigned long max)
> > +{
> > + unsigned long capacity;
> > +
> > + if (!max)
> > + return;
> > +
> > + capacity = (curr << SCHED_CAPACITY_SHIFT) / max;
> > + atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity);
> > +}
> >
> > static int cpufreq_callback(struct notifier_block *nb,
> > unsigned long val, void *data)
> > {
> > struct cpufreq_freqs *freq = data;
> > int cpu = freq->cpu;
> > + unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
> >
> > if (freq->flags & CPUFREQ_CONST_LOOPS)
> > return NOTIFY_OK;
> > @@ -707,6 +729,10 @@ static int cpufreq_callback(struct notifier_block *nb,
> > per_cpu(l_p_j_ref_freq, cpu),
> > freq->new);
> > }
> > +
> > + if (val == CPUFREQ_PRECHANGE)
> > + scale_freq_capacity(cpu, freq->new, max);
> > +
> > return NOTIFY_OK;
> > }
> >
> > @@ -714,11 +740,38 @@ static struct notifier_block cpufreq_notifier = {
> > .notifier_call = cpufreq_callback,
> > };
> >
> > +static int cpufreq_policy_callback(struct notifier_block *nb,
> > + unsigned long val, void *data)
> > +{
> > + struct cpufreq_policy *policy = data;
> > + int i;
> > +
> > + if (val != CPUFREQ_NOTIFY)
> > + return NOTIFY_OK;
> > +
> > + for_each_cpu(i, policy->cpus) {
> > + scale_freq_capacity(i, policy->cur, policy->max);
> > + atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
> > + }
> > +
> > + return NOTIFY_OK;
> > +}
> > +
> > +static struct notifier_block cpufreq_policy_notifier = {
> > + .notifier_call = cpufreq_policy_callback,
> > +};
> > +
> > static int __init register_cpufreq_notifier(void)
> > {
> > - return cpufreq_register_notifier(&cpufreq_notifier,
> > + int ret;
> > +
> > + ret = cpufreq_register_notifier(&cpufreq_notifier,
> > CPUFREQ_TRANSITION_NOTIFIER);
> > + if (ret)
> > + return ret;
> > +
> > + return cpufreq_register_notifier(&cpufreq_policy_notifier,
> > + CPUFREQ_POLICY_NOTIFIER);
> > }
> > core_initcall(register_cpufreq_notifier);
>
> For "cpu_freq_capacity" structure, could move it into driver/cpufreq
> so that it can be shared by all architectures? Otherwise, every
> architecture's smp.c need register notifier for themselves.

We could, but I put it in arch/arm/* as not all architectures might want
this notifier. The frequency scaling factor could be provided based on
architecture specific performance counters instead. AFAIK, the Intel
p-state driver does not even fire the notifiers so the notifier
solution would be redundant code for those platforms.

That said, the above solution is not handling changes to policy->max
very well. Basically, we don't inform the scheduler if it has changed
which means that the OPP represented by "100%" might change. We need
cpufreq to keep track of the true max frequency when policy->max is
changed to work out the correct scaling factor instead of having it
relative to policy->max.

Morten

2015-07-22 13:42:32

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 11/46] sched: Remove blocked load and utilization contributions of dying tasks

On Wed, Jul 22, 2015 at 02:51:01PM +0800, Leo Yan wrote:
> On Tue, Jul 07, 2015 at 07:23:54PM +0100, Morten Rasmussen wrote:
> > Tasks being dequeued for the last time (state == TASK_DEAD) are dequeued
> > with the DEQUEUE_SLEEP flag which causes their load and utilization
> > contributions to be added to the runqueue blocked load and utilization.
> > Hence they will contain load or utilization that is gone away. The issue
> > only exists for the root cfs_rq as cgroup_exit() doesn't set
> > DEQUEUE_SLEEP for task group exits.
> >
> > If runnable+blocked load is to be used as a better estimate for cpu
> > load the dead task contributions need to be removed to prevent
> > load_balance() (idle_balance() in particular) from over-estimating the
> > cpu load.
> >
> > cc: Ingo Molnar <[email protected]>
> > cc: Peter Zijlstra <[email protected]>
> >
> > Signed-off-by: Morten Rasmussen <[email protected]>
> > ---
> > kernel/sched/fair.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 775b0c7..fa12ce5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3217,6 +3217,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > * Update run-time statistics of the 'current'.
> > */
> > update_curr(cfs_rq);
> > + if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
> > + flags &= !DEQUEUE_SLEEP;
>
> So flags will be set to zero? Could be replaced by "flags &= ~DEQUEUE_SLEEP"?

Not could, should :)

I meant to clear the flag, but used the wrong operator. We only have
DEQUEUE_SLEEP and 0 at the moment so it doesn't matter, but it might
later.

Thanks,
Morten

2015-07-22 14:59:15

by Leo Yan

[permalink] [raw]
Subject: Re: [RFCv5, 01/46] arm: Frequency invariant scheduler load-tracking support

On Wed, Jul 22, 2015 at 02:31:04PM +0100, Morten Rasmussen wrote:
> On Tue, Jul 21, 2015 at 11:41:45PM +0800, Leo Yan wrote:
> > Hi Morten,
> >
> > On Tue, Jul 07, 2015 at 07:23:44PM +0100, Morten Rasmussen wrote:
> > > From: Morten Rasmussen <[email protected]>
> > >
> > > Implements arch-specific function to provide the scheduler with a
> > > frequency scaling correction factor for more accurate load-tracking.
> > > The factor is:
> > >
> > > current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)
> > >
> > > This implementation only provides frequency invariance. No cpu
> > > invariance yet.
> > >
> > > Cc: Russell King <[email protected]>
> > >
> > > Signed-off-by: Morten Rasmussen <[email protected]>
> > >
> > > ---
> > > arch/arm/include/asm/topology.h | 7 +++++
> > > arch/arm/kernel/smp.c | 57 +++++++++++++++++++++++++++++++++++++++--
> > > arch/arm/kernel/topology.c | 17 ++++++++++++
> > > 3 files changed, 79 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> > > index 370f7a7..c31096f 100644
> > > --- a/arch/arm/include/asm/topology.h
> > > +++ b/arch/arm/include/asm/topology.h
> > > @@ -24,6 +24,13 @@ void init_cpu_topology(void);
> > > void store_cpu_topology(unsigned int cpuid);
> > > const struct cpumask *cpu_coregroup_mask(int cpu);
> > >
> > > +#define arch_scale_freq_capacity arm_arch_scale_freq_capacity
> > > +struct sched_domain;
> > > +extern
> > > +unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
> > > +
> > > +DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity);
> > > +
> > > #else
> > >
> > > static inline void init_cpu_topology(void) { }
> > > diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> > > index cca5b87..a32539c 100644
> > > --- a/arch/arm/kernel/smp.c
> > > +++ b/arch/arm/kernel/smp.c
> > > @@ -677,12 +677,34 @@ static DEFINE_PER_CPU(unsigned long, l_p_j_ref);
> > > static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq);
> > > static unsigned long global_l_p_j_ref;
> > > static unsigned long global_l_p_j_ref_freq;
> > > +static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
> > > +DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);
> > > +
> > > +/*
> > > + * Scheduler load-tracking scale-invariance
> > > + *
> > > + * Provides the scheduler with a scale-invariance correction factor that
> > > + * compensates for frequency scaling through arch_scale_freq_capacity()
> > > + * (implemented in topology.c).
> > > + */
> > > +static inline
> > > +void scale_freq_capacity(int cpu, unsigned long curr, unsigned long max)
> > > +{
> > > + unsigned long capacity;
> > > +
> > > + if (!max)
> > > + return;
> > > +
> > > + capacity = (curr << SCHED_CAPACITY_SHIFT) / max;
> > > + atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity);
> > > +}
> > >
> > > static int cpufreq_callback(struct notifier_block *nb,
> > > unsigned long val, void *data)
> > > {
> > > struct cpufreq_freqs *freq = data;
> > > int cpu = freq->cpu;
> > > + unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
> > >
> > > if (freq->flags & CPUFREQ_CONST_LOOPS)
> > > return NOTIFY_OK;
> > > @@ -707,6 +729,10 @@ static int cpufreq_callback(struct notifier_block *nb,
> > > per_cpu(l_p_j_ref_freq, cpu),
> > > freq->new);
> > > }
> > > +
> > > + if (val == CPUFREQ_PRECHANGE)
> > > + scale_freq_capacity(cpu, freq->new, max);
> > > +
> > > return NOTIFY_OK;
> > > }
> > >
> > > @@ -714,11 +740,38 @@ static struct notifier_block cpufreq_notifier = {
> > > .notifier_call = cpufreq_callback,
> > > };
> > >
> > > +static int cpufreq_policy_callback(struct notifier_block *nb,
> > > + unsigned long val, void *data)
> > > +{
> > > + struct cpufreq_policy *policy = data;
> > > + int i;
> > > +
> > > + if (val != CPUFREQ_NOTIFY)
> > > + return NOTIFY_OK;
> > > +
> > > + for_each_cpu(i, policy->cpus) {
> > > + scale_freq_capacity(i, policy->cur, policy->max);
> > > + atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
> > > + }
> > > +
> > > + return NOTIFY_OK;
> > > +}
> > > +
> > > +static struct notifier_block cpufreq_policy_notifier = {
> > > + .notifier_call = cpufreq_policy_callback,
> > > +};
> > > +
> > > static int __init register_cpufreq_notifier(void)
> > > {
> > > - return cpufreq_register_notifier(&cpufreq_notifier,
> > > + int ret;
> > > +
> > > + ret = cpufreq_register_notifier(&cpufreq_notifier,
> > > CPUFREQ_TRANSITION_NOTIFIER);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + return cpufreq_register_notifier(&cpufreq_policy_notifier,
> > > + CPUFREQ_POLICY_NOTIFIER);
> > > }
> > > core_initcall(register_cpufreq_notifier);
> >
> > For "cpu_freq_capacity" structure, could move it into driver/cpufreq
> > so that it can be shared by all architectures? Otherwise, every
> > architecture's smp.c need register notifier for themselves.
>
> We could, but I put it in arch/arm/* as not all architectures might want
> this notifier. The frequency scaling factor could be provided based on
> architecture specific performance counters instead. AFAIK, the Intel
> p-state driver does not even fire the notifiers so the notifier
> solution would be redundant code for those platforms.

When i tried to enable EAS on Hikey, i found it's absent related code
for arm64; actually this code section can also be reused by arm64,
so just brought up this question.

Just now roughly went through the driver
"drivers/cpufreq/intel_pstate.c"; that's true it has different
implementation comparing to usual ARM SoCs. So i'd like to ask this
question with another way: should cpufreq framework provides helper
functions for getting related cpu frequency scaling info? If the
architecture has specific performance counters then it can ignore
these helper functions.

> That said, the above solution is not handling changes to policy->max
> very well. Basically, we don't inform the scheduler if it has changed
> which means that the OPP represented by "100%" might change. We need
> cpufreq to keep track of the true max frequency when policy->max is
> changed to work out the correct scaling factor instead of having it
> relative to policy->max.

i'm not sure understand correctly here. For example, when thermal
framework limits the cpu frequency, it will update the value for
policy->max, so scheduler will get the correct scaling factor, right?
So i don't know what's the issue at here.

Further more, i noticed in the later patches for
arch_scale_cpu_capacity(); the cpu capacity is calculated by the
property passed by DT, so it's a static value. In some cases, system
may constraint the maximum frequency for CPUs, so in this case, will
scheduler get misknowledge from arch_scale_cpu_capacity after system
has imposed constraint for maximum frequency?

Sorry if these questions have been discussed before :)

Thanks,
Leo Yan

2015-07-23 11:03:43

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFCv5, 01/46] arm: Frequency invariant scheduler load-tracking support

On Wed, Jul 22, 2015 at 10:59:04PM +0800, Leo Yan wrote:
> On Wed, Jul 22, 2015 at 02:31:04PM +0100, Morten Rasmussen wrote:
> > On Tue, Jul 21, 2015 at 11:41:45PM +0800, Leo Yan wrote:
> > > Hi Morten,
> > >
> > > On Tue, Jul 07, 2015 at 07:23:44PM +0100, Morten Rasmussen wrote:
> > > > From: Morten Rasmussen <[email protected]>
> > > >
> > > > Implements arch-specific function to provide the scheduler with a
> > > > frequency scaling correction factor for more accurate load-tracking.
> > > > The factor is:
> > > >
> > > > current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)
> > > >
> > > > This implementation only provides frequency invariance. No cpu
> > > > invariance yet.
> > > >
> > > > Cc: Russell King <[email protected]>
> > > >
> > > > Signed-off-by: Morten Rasmussen <[email protected]>
> > > >
> > > > ---
> > > > arch/arm/include/asm/topology.h | 7 +++++
> > > > arch/arm/kernel/smp.c | 57 +++++++++++++++++++++++++++++++++++++++--
> > > > arch/arm/kernel/topology.c | 17 ++++++++++++
> > > > 3 files changed, 79 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> > > > index 370f7a7..c31096f 100644
> > > > --- a/arch/arm/include/asm/topology.h
> > > > +++ b/arch/arm/include/asm/topology.h
> > > > @@ -24,6 +24,13 @@ void init_cpu_topology(void);
> > > > void store_cpu_topology(unsigned int cpuid);
> > > > const struct cpumask *cpu_coregroup_mask(int cpu);
> > > >
> > > > +#define arch_scale_freq_capacity arm_arch_scale_freq_capacity
> > > > +struct sched_domain;
> > > > +extern
> > > > +unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
> > > > +
> > > > +DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity);
> > > > +
> > > > #else
> > > >
> > > > static inline void init_cpu_topology(void) { }
> > > > diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> > > > index cca5b87..a32539c 100644
> > > > --- a/arch/arm/kernel/smp.c
> > > > +++ b/arch/arm/kernel/smp.c
> > > > @@ -677,12 +677,34 @@ static DEFINE_PER_CPU(unsigned long, l_p_j_ref);
> > > > static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq);
> > > > static unsigned long global_l_p_j_ref;
> > > > static unsigned long global_l_p_j_ref_freq;
> > > > +static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
> > > > +DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);
> > > > +
> > > > +/*
> > > > + * Scheduler load-tracking scale-invariance
> > > > + *
> > > > + * Provides the scheduler with a scale-invariance correction factor that
> > > > + * compensates for frequency scaling through arch_scale_freq_capacity()
> > > > + * (implemented in topology.c).
> > > > + */
> > > > +static inline
> > > > +void scale_freq_capacity(int cpu, unsigned long curr, unsigned long max)
> > > > +{
> > > > + unsigned long capacity;
> > > > +
> > > > + if (!max)
> > > > + return;
> > > > +
> > > > + capacity = (curr << SCHED_CAPACITY_SHIFT) / max;
> > > > + atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity);
> > > > +}
> > > >
> > > > static int cpufreq_callback(struct notifier_block *nb,
> > > > unsigned long val, void *data)
> > > > {
> > > > struct cpufreq_freqs *freq = data;
> > > > int cpu = freq->cpu;
> > > > + unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
> > > >
> > > > if (freq->flags & CPUFREQ_CONST_LOOPS)
> > > > return NOTIFY_OK;
> > > > @@ -707,6 +729,10 @@ static int cpufreq_callback(struct notifier_block *nb,
> > > > per_cpu(l_p_j_ref_freq, cpu),
> > > > freq->new);
> > > > }
> > > > +
> > > > + if (val == CPUFREQ_PRECHANGE)
> > > > + scale_freq_capacity(cpu, freq->new, max);
> > > > +
> > > > return NOTIFY_OK;
> > > > }
> > > >
> > > > @@ -714,11 +740,38 @@ static struct notifier_block cpufreq_notifier = {
> > > > .notifier_call = cpufreq_callback,
> > > > };
> > > >
> > > > +static int cpufreq_policy_callback(struct notifier_block *nb,
> > > > + unsigned long val, void *data)
> > > > +{
> > > > + struct cpufreq_policy *policy = data;
> > > > + int i;
> > > > +
> > > > + if (val != CPUFREQ_NOTIFY)
> > > > + return NOTIFY_OK;
> > > > +
> > > > + for_each_cpu(i, policy->cpus) {
> > > > + scale_freq_capacity(i, policy->cur, policy->max);
> > > > + atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
> > > > + }
> > > > +
> > > > + return NOTIFY_OK;
> > > > +}
> > > > +
> > > > +static struct notifier_block cpufreq_policy_notifier = {
> > > > + .notifier_call = cpufreq_policy_callback,
> > > > +};
> > > > +
> > > > static int __init register_cpufreq_notifier(void)
> > > > {
> > > > - return cpufreq_register_notifier(&cpufreq_notifier,
> > > > + int ret;
> > > > +
> > > > + ret = cpufreq_register_notifier(&cpufreq_notifier,
> > > > CPUFREQ_TRANSITION_NOTIFIER);
> > > > + if (ret)
> > > > + return ret;
> > > > +
> > > > + return cpufreq_register_notifier(&cpufreq_policy_notifier,
> > > > + CPUFREQ_POLICY_NOTIFIER);
> > > > }
> > > > core_initcall(register_cpufreq_notifier);
> > >
> > > For "cpu_freq_capacity" structure, could move it into driver/cpufreq
> > > so that it can be shared by all architectures? Otherwise, every
> > > architecture's smp.c need register notifier for themselves.
> >
> > We could, but I put it in arch/arm/* as not all architectures might want
> > this notifier. The frequency scaling factor could be provided based on
> > architecture specific performance counters instead. AFAIK, the Intel
> > p-state driver does not even fire the notifiers so the notifier
> > solution would be redundant code for those platforms.
>
> When i tried to enable EAS on Hikey, i found it's absent related code
> for arm64; actually this code section can also be reused by arm64,
> so just brought up this question.

Yes. We have patches for arm64 if you are interested. We are using them
for the Juno platforms.

> Just now roughly went through the driver
> "drivers/cpufreq/intel_pstate.c"; that's true it has different
> implementation comparing to usual ARM SoCs. So i'd like to ask this
> question with another way: should cpufreq framework provides helper
> functions for getting related cpu frequency scaling info? If the
> architecture has specific performance counters then it can ignore
> these helper functions.

That is the idea with the notifiers. If the architecture code a specific
architecture wants to be poked by cpufreq when the frequency is changed
it should have a way to subscribe to those. Another way of implementing
it is to let the architecture code call a helper function in cpufreq
every time the scheduler calls into the architecture code to get the
scaling factor (arch_scale_freq_capacity()). We actually did it that way
a couple of versions back using weak functions. It wasn't as clean as
using the notifiers, but if we make the necessary changes to cpufreq to
let the architecture code call into cpufreq that could be even better.

>
> > That said, the above solution is not handling changes to policy->max
> > very well. Basically, we don't inform the scheduler if it has changed
> > which means that the OPP represented by "100%" might change. We need
> > cpufreq to keep track of the true max frequency when policy->max is
> > changed to work out the correct scaling factor instead of having it
> > relative to policy->max.
>
> i'm not sure understand correctly here. For example, when thermal
> framework limits the cpu frequency, it will update the value for
> policy->max, so scheduler will get the correct scaling factor, right?
> So i don't know what's the issue at here.
>
> Further more, i noticed in the later patches for
> arch_scale_cpu_capacity(); the cpu capacity is calculated by the
> property passed by DT, so it's a static value. In some cases, system
> may constraint the maximum frequency for CPUs, so in this case, will
> scheduler get misknowledge from arch_scale_cpu_capacity after system
> has imposed constraint for maximum frequency?

The issue is first of all to define what 100% means. Is it
policy->cur/policy->max or policy->cur/uncapped_max? Where uncapped max
is the max frequency supported by the hardware when not capped in any
way by governors or thermal framework.

If we choose the first definition then we have to recalculate the cpu
capacity scaling factor (arch_scale_cpu_capacity()) too whenever
policy->max changes such that capacity_orig is updated appropriately.

The scale-invariance code in the scheduler assumes:

arch_scale_cpu_capacity()*arch_scale_freq_capacity() = current capacity

...and that capacity_orig = arch_scale_cpu_capacity() is the max
available capacity. If we cap the frequency to say, 50%, by setting
policy->max then we have to reduce arch_scale_cpu_capacity() to 50% to
still get the right current capacity using the expression above.

Using the second definition arch_scale_cpu_capacity() can be a static
value and arch_scale_freq_capacity() is always relative to uncapped_max.
It seems simpler, but capacity_orig could then be an unavailable
capacity and hence we would need to introduce a third capacity to track
the current max capacity and use that for scheduling decisions.

As you have already discovered the current code is a combination of both
which is broken when policy->max is reduced.

Thinking more about it, I would suggest to go with the first definition.
The scheduler doesn't need to know about currently unavailable compute
capacity it should balance based on the current situation, so it seems
to make sense to let capacity_orig reflect the current max capacity.

I would suggest that we fix arch_scale_cpu_capacity() to take
policy->max changes into account. We need to know the uncapped max
frequency somehow to do that. I haven't looked into if we can get that
from cpufreq. Also, we need to make sure that no load-balance code
assumes that cpus have a capacity of 1024.

> Sorry if these questions have been discussed before :)

No problem. I don't think we have discussed it to this detail before and
it is very valid points.

Thanks,
Morten

2015-07-23 14:22:32

by Leo Yan

[permalink] [raw]
Subject: Re: [RFCv5, 01/46] arm: Frequency invariant scheduler load-tracking support

On Thu, Jul 23, 2015 at 12:06:26PM +0100, Morten Rasmussen wrote:
> On Wed, Jul 22, 2015 at 10:59:04PM +0800, Leo Yan wrote:
> > On Wed, Jul 22, 2015 at 02:31:04PM +0100, Morten Rasmussen wrote:
> > > On Tue, Jul 21, 2015 at 11:41:45PM +0800, Leo Yan wrote:
> > > > Hi Morten,
> > > >
> > > > On Tue, Jul 07, 2015 at 07:23:44PM +0100, Morten Rasmussen wrote:
> > > > > From: Morten Rasmussen <[email protected]>
> > > > >
> > > > > Implements arch-specific function to provide the scheduler with a
> > > > > frequency scaling correction factor for more accurate load-tracking.
> > > > > The factor is:
> > > > >
> > > > > current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)
> > > > >
> > > > > This implementation only provides frequency invariance. No cpu
> > > > > invariance yet.
> > > > >
> > > > > Cc: Russell King <[email protected]>
> > > > >
> > > > > Signed-off-by: Morten Rasmussen <[email protected]>
> > > > >
> > > > > ---
> > > > > arch/arm/include/asm/topology.h | 7 +++++
> > > > > arch/arm/kernel/smp.c | 57 +++++++++++++++++++++++++++++++++++++++--
> > > > > arch/arm/kernel/topology.c | 17 ++++++++++++
> > > > > 3 files changed, 79 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> > > > > index 370f7a7..c31096f 100644
> > > > > --- a/arch/arm/include/asm/topology.h
> > > > > +++ b/arch/arm/include/asm/topology.h
> > > > > @@ -24,6 +24,13 @@ void init_cpu_topology(void);
> > > > > void store_cpu_topology(unsigned int cpuid);
> > > > > const struct cpumask *cpu_coregroup_mask(int cpu);
> > > > >
> > > > > +#define arch_scale_freq_capacity arm_arch_scale_freq_capacity
> > > > > +struct sched_domain;
> > > > > +extern
> > > > > +unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
> > > > > +
> > > > > +DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity);
> > > > > +
> > > > > #else
> > > > >
> > > > > static inline void init_cpu_topology(void) { }
> > > > > diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> > > > > index cca5b87..a32539c 100644
> > > > > --- a/arch/arm/kernel/smp.c
> > > > > +++ b/arch/arm/kernel/smp.c
> > > > > @@ -677,12 +677,34 @@ static DEFINE_PER_CPU(unsigned long, l_p_j_ref);
> > > > > static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq);
> > > > > static unsigned long global_l_p_j_ref;
> > > > > static unsigned long global_l_p_j_ref_freq;
> > > > > +static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
> > > > > +DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);
> > > > > +
> > > > > +/*
> > > > > + * Scheduler load-tracking scale-invariance
> > > > > + *
> > > > > + * Provides the scheduler with a scale-invariance correction factor that
> > > > > + * compensates for frequency scaling through arch_scale_freq_capacity()
> > > > > + * (implemented in topology.c).
> > > > > + */
> > > > > +static inline
> > > > > +void scale_freq_capacity(int cpu, unsigned long curr, unsigned long max)
> > > > > +{
> > > > > + unsigned long capacity;
> > > > > +
> > > > > + if (!max)
> > > > > + return;
> > > > > +
> > > > > + capacity = (curr << SCHED_CAPACITY_SHIFT) / max;
> > > > > + atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity);
> > > > > +}
> > > > >
> > > > > static int cpufreq_callback(struct notifier_block *nb,
> > > > > unsigned long val, void *data)
> > > > > {
> > > > > struct cpufreq_freqs *freq = data;
> > > > > int cpu = freq->cpu;
> > > > > + unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
> > > > >
> > > > > if (freq->flags & CPUFREQ_CONST_LOOPS)
> > > > > return NOTIFY_OK;
> > > > > @@ -707,6 +729,10 @@ static int cpufreq_callback(struct notifier_block *nb,
> > > > > per_cpu(l_p_j_ref_freq, cpu),
> > > > > freq->new);
> > > > > }
> > > > > +
> > > > > + if (val == CPUFREQ_PRECHANGE)
> > > > > + scale_freq_capacity(cpu, freq->new, max);
> > > > > +
> > > > > return NOTIFY_OK;
> > > > > }
> > > > >
> > > > > @@ -714,11 +740,38 @@ static struct notifier_block cpufreq_notifier = {
> > > > > .notifier_call = cpufreq_callback,
> > > > > };
> > > > >
> > > > > +static int cpufreq_policy_callback(struct notifier_block *nb,
> > > > > + unsigned long val, void *data)
> > > > > +{
> > > > > + struct cpufreq_policy *policy = data;
> > > > > + int i;
> > > > > +
> > > > > + if (val != CPUFREQ_NOTIFY)
> > > > > + return NOTIFY_OK;
> > > > > +
> > > > > + for_each_cpu(i, policy->cpus) {
> > > > > + scale_freq_capacity(i, policy->cur, policy->max);
> > > > > + atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
> > > > > + }
> > > > > +
> > > > > + return NOTIFY_OK;
> > > > > +}
> > > > > +
> > > > > +static struct notifier_block cpufreq_policy_notifier = {
> > > > > + .notifier_call = cpufreq_policy_callback,
> > > > > +};
> > > > > +
> > > > > static int __init register_cpufreq_notifier(void)
> > > > > {
> > > > > - return cpufreq_register_notifier(&cpufreq_notifier,
> > > > > + int ret;
> > > > > +
> > > > > + ret = cpufreq_register_notifier(&cpufreq_notifier,
> > > > > CPUFREQ_TRANSITION_NOTIFIER);
> > > > > + if (ret)
> > > > > + return ret;
> > > > > +
> > > > > + return cpufreq_register_notifier(&cpufreq_policy_notifier,
> > > > > + CPUFREQ_POLICY_NOTIFIER);
> > > > > }
> > > > > core_initcall(register_cpufreq_notifier);
> > > >
> > > > For "cpu_freq_capacity" structure, could move it into driver/cpufreq
> > > > so that it can be shared by all architectures? Otherwise, every
> > > > architecture's smp.c need register notifier for themselves.
> > >
> > > We could, but I put it in arch/arm/* as not all architectures might want
> > > this notifier. The frequency scaling factor could be provided based on
> > > architecture specific performance counters instead. AFAIK, the Intel
> > > p-state driver does not even fire the notifiers so the notifier
> > > solution would be redundant code for those platforms.
> >
> > When i tried to enable EAS on Hikey, i found it's absent related code
> > for arm64; actually this code section can also be reused by arm64,
> > so just brought up this question.
>
> Yes. We have patches for arm64 if you are interested. We are using them
> for the Juno platforms.

If convenience, please share with me related patches, so i can
directly apply them and do some profiling works.

> > Just now roughly went through the driver
> > "drivers/cpufreq/intel_pstate.c"; that's true it has different
> > implementation comparing to usual ARM SoCs. So i'd like to ask this
> > question with another way: should cpufreq framework provides helper
> > functions for getting related cpu frequency scaling info? If the
> > architecture has specific performance counters then it can ignore
> > these helper functions.
>
> That is the idea with the notifiers. If the architecture code a specific
> architecture wants to be poked by cpufreq when the frequency is changed
> it should have a way to subscribe to those. Another way of implementing
> it is to let the architecture code call a helper function in cpufreq
> every time the scheduler calls into the architecture code to get the
> scaling factor (arch_scale_freq_capacity()). We actually did it that way
> a couple of versions back using weak functions. It wasn't as clean as
> using the notifiers, but if we make the necessary changes to cpufreq to
> let the architecture code call into cpufreq that could be even better.
>
> >
> > > That said, the above solution is not handling changes to policy->max
> > > very well. Basically, we don't inform the scheduler if it has changed
> > > which means that the OPP represented by "100%" might change. We need
> > > cpufreq to keep track of the true max frequency when policy->max is
> > > changed to work out the correct scaling factor instead of having it
> > > relative to policy->max.
> >
> > i'm not sure understand correctly here. For example, when thermal
> > framework limits the cpu frequency, it will update the value for
> > policy->max, so scheduler will get the correct scaling factor, right?
> > So i don't know what's the issue at here.
> >
> > Further more, i noticed in the later patches for
> > arch_scale_cpu_capacity(); the cpu capacity is calculated by the
> > property passed by DT, so it's a static value. In some cases, system
> > may constraint the maximum frequency for CPUs, so in this case, will
> > scheduler get misknowledge from arch_scale_cpu_capacity after system
> > has imposed constraint for maximum frequency?
>
> The issue is first of all to define what 100% means. Is it
> policy->cur/policy->max or policy->cur/uncapped_max? Where uncapped max
> is the max frequency supported by the hardware when not capped in any
> way by governors or thermal framework.
>
> If we choose the first definition then we have to recalculate the cpu
> capacity scaling factor (arch_scale_cpu_capacity()) too whenever
> policy->max changes such that capacity_orig is updated appropriately.
>
> The scale-invariance code in the scheduler assumes:
>
> arch_scale_cpu_capacity()*arch_scale_freq_capacity() = current capacity

This is an important concept, thanks for the explaining.

> ...and that capacity_orig = arch_scale_cpu_capacity() is the max
> available capacity. If we cap the frequency to say, 50%, by setting
> policy->max then we have to reduce arch_scale_cpu_capacity() to 50% to
> still get the right current capacity using the expression above.
>
> Using the second definition arch_scale_cpu_capacity() can be a static
> value and arch_scale_freq_capacity() is always relative to uncapped_max.
> It seems simpler, but capacity_orig could then be an unavailable
> capacity and hence we would need to introduce a third capacity to track
> the current max capacity and use that for scheduling decisions.
> As you have already discovered the current code is a combination of both
> which is broken when policy->max is reduced.
>
> Thinking more about it, I would suggest to go with the first definition.
> The scheduler doesn't need to know about currently unavailable compute
> capacity it should balance based on the current situation, so it seems
> to make sense to let capacity_orig reflect the current max capacity.

Agree.

> I would suggest that we fix arch_scale_cpu_capacity() to take
> policy->max changes into account. We need to know the uncapped max
> frequency somehow to do that. I haven't looked into if we can get that
> from cpufreq. Also, we need to make sure that no load-balance code
> assumes that cpus have a capacity of 1024.

Cpufreq framework provides API *cpufreq_quick_get_max()* and
*cpufreq_quick_get()* for inquiry current frequency and max frequency,
but i'm curious if these two functions can be directly called by
scheduler, due they acquire and release locks internally.

Thanks,
Leo Yan

2015-07-24 09:40:36

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFCv5, 01/46] arm: Frequency invariant scheduler load-tracking support

On Thu, Jul 23, 2015 at 10:22:16PM +0800, Leo Yan wrote:
> On Thu, Jul 23, 2015 at 12:06:26PM +0100, Morten Rasmussen wrote:
> > Yes. We have patches for arm64 if you are interested. We are using them
> > for the Juno platforms.
>
> If convenience, please share with me related patches, so i can
> directly apply them and do some profiling works.

Will do.

>
> > > Just now roughly went through the driver
> > > "drivers/cpufreq/intel_pstate.c"; that's true it has different
> > > implementation comparing to usual ARM SoCs. So i'd like to ask this
> > > question with another way: should cpufreq framework provides helper
> > > functions for getting related cpu frequency scaling info? If the
> > > architecture has specific performance counters then it can ignore
> > > these helper functions.
> >
> > That is the idea with the notifiers. If the architecture code a specific
> > architecture wants to be poked by cpufreq when the frequency is changed
> > it should have a way to subscribe to those. Another way of implementing
> > it is to let the architecture code call a helper function in cpufreq
> > every time the scheduler calls into the architecture code to get the
> > scaling factor (arch_scale_freq_capacity()). We actually did it that way
> > a couple of versions back using weak functions. It wasn't as clean as
> > using the notifiers, but if we make the necessary changes to cpufreq to
> > let the architecture code call into cpufreq that could be even better.
> >
> > >
> > > > That said, the above solution is not handling changes to policy->max
> > > > very well. Basically, we don't inform the scheduler if it has changed
> > > > which means that the OPP represented by "100%" might change. We need
> > > > cpufreq to keep track of the true max frequency when policy->max is
> > > > changed to work out the correct scaling factor instead of having it
> > > > relative to policy->max.
> > >
> > > i'm not sure understand correctly here. For example, when thermal
> > > framework limits the cpu frequency, it will update the value for
> > > policy->max, so scheduler will get the correct scaling factor, right?
> > > So i don't know what's the issue at here.
> > >
> > > Further more, i noticed in the later patches for
> > > arch_scale_cpu_capacity(); the cpu capacity is calculated by the
> > > property passed by DT, so it's a static value. In some cases, system
> > > may constraint the maximum frequency for CPUs, so in this case, will
> > > scheduler get misknowledge from arch_scale_cpu_capacity after system
> > > has imposed constraint for maximum frequency?
> >
> > The issue is first of all to define what 100% means. Is it
> > policy->cur/policy->max or policy->cur/uncapped_max? Where uncapped max
> > is the max frequency supported by the hardware when not capped in any
> > way by governors or thermal framework.
> >
> > If we choose the first definition then we have to recalculate the cpu
> > capacity scaling factor (arch_scale_cpu_capacity()) too whenever
> > policy->max changes such that capacity_orig is updated appropriately.
> >
> > The scale-invariance code in the scheduler assumes:
> >
> > arch_scale_cpu_capacity()*arch_scale_freq_capacity() = current capacity
>
> This is an important concept, thanks for the explaining.

No problem, thanks for reviewing the patches.

> > ...and that capacity_orig = arch_scale_cpu_capacity() is the max
> > available capacity. If we cap the frequency to say, 50%, by setting
> > policy->max then we have to reduce arch_scale_cpu_capacity() to 50% to
> > still get the right current capacity using the expression above.
> >
> > Using the second definition arch_scale_cpu_capacity() can be a static
> > value and arch_scale_freq_capacity() is always relative to uncapped_max.
> > It seems simpler, but capacity_orig could then be an unavailable
> > capacity and hence we would need to introduce a third capacity to track
> > the current max capacity and use that for scheduling decisions.
> > As you have already discovered the current code is a combination of both
> > which is broken when policy->max is reduced.
> >
> > Thinking more about it, I would suggest to go with the first definition.
> > The scheduler doesn't need to know about currently unavailable compute
> > capacity it should balance based on the current situation, so it seems
> > to make sense to let capacity_orig reflect the current max capacity.
>
> Agree.
>
> > I would suggest that we fix arch_scale_cpu_capacity() to take
> > policy->max changes into account. We need to know the uncapped max
> > frequency somehow to do that. I haven't looked into if we can get that
> > from cpufreq. Also, we need to make sure that no load-balance code
> > assumes that cpus have a capacity of 1024.
>
> Cpufreq framework provides API *cpufreq_quick_get_max()* and
> *cpufreq_quick_get()* for inquiry current frequency and max frequency,
> but i'm curious if these two functions can be directly called by
> scheduler, due they acquire and release locks internally.

The arch_scale_{cpu,freq}_capacity() functions are called from contexts
where blocking/sleeping is not allowed, so that rules out calling
function that takes locks. We currently avoid that by using atomics.

However, even if we had non-sleeping functions to call into cpufreq, we
would still need some code in arch/* to make that call so it is only the
variables storing the current frequencies that we can move into cpufreq.
But it would naturally belong there, so I guess it is worth it.

2015-08-03 09:23:14

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 01/46] arm: Frequency invariant scheduler load-tracking support

Hi Morten,


On 7 July 2015 at 20:23, Morten Rasmussen <[email protected]> wrote:
> From: Morten Rasmussen <[email protected]>
>

[snip]

> -
> #endif
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index 08b7847..9c09e6e 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -169,6 +169,23 @@ static void update_cpu_capacity(unsigned int cpu)
> cpu, arch_scale_cpu_capacity(NULL, cpu));
> }
>
> +/*
> + * Scheduler load-tracking scale-invariance
> + *
> + * Provides the scheduler with a scale-invariance correction factor that
> + * compensates for frequency scaling (arch_scale_freq_capacity()). The scaling
> + * factor is updated in smp.c
> + */
> +unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> +{
> + unsigned long curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu));

access to cpu_freq_capacity to should be put under #ifdef CONFIG_CPU_FREQ.

Why haven't you moved arm_arch_scale_freq_capacity in smp.c as
everything else for frequency in-variance is already in this file ?
This should also enable you to remove
DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity); from topology.h

Vincent

> +
> + if (!curr)
> + return SCHED_CAPACITY_SCALE;
> +
> + return curr;
> +}
> +
> #else
> static inline void parse_dt_topology(void) {}
> static inline void update_cpu_capacity(unsigned int cpuid) {}
> --
> 1.9.1
>

2015-08-04 13:41:58

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

Hi Juri,

On 7 July 2015 at 20:24, Morten Rasmussen <[email protected]> wrote:
> From: Juri Lelli <[email protected]>
>
> Each time a task is {en,de}queued we might need to adapt the current
> frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
> this purpose. Only trigger a freq request if we are effectively waking up
> or going to sleep. Filter out load balancing related calls to reduce the
> number of triggers.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Juri Lelli <[email protected]>
> ---
> kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f74e9d2..b8627c6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4281,7 +4281,10 @@ static inline void hrtick_update(struct rq *rq)
> }
> #endif
>
> +static unsigned int capacity_margin = 1280; /* ~20% margin */
> +
> static bool cpu_overutilized(int cpu);
> +static unsigned long get_cpu_usage(int cpu);
> struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;
>
> /*
> @@ -4332,6 +4335,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (!task_new && !rq->rd->overutilized &&
> cpu_overutilized(rq->cpu))
> rq->rd->overutilized = true;
> + /*
> + * We want to trigger a freq switch request only for tasks that
> + * are waking up; this is because we get here also during
> + * load balancing, but in these cases it seems wise to trigger
> + * as single request after load balancing is done.
> + *
> + * XXX: how about fork()? Do we need a special flag/something
> + * to tell if we are here after a fork() (wakeup_task_new)?
> + *
> + * Also, we add a margin (same ~20% used for the tipping point)
> + * to our request to provide some head room if p's utilization
> + * further increases.
> + */
> + if (sched_energy_freq() && !task_new) {
> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
> +
> + req_cap = req_cap * capacity_margin
> + >> SCHED_CAPACITY_SHIFT;
> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
> + }
> }
> hrtick_update(rq);
> }
> @@ -4393,6 +4416,23 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (!se) {
> sub_nr_running(rq, 1);
> update_rq_runnable_avg(rq, 1);
> + /*
> + * We want to trigger a freq switch request only for tasks that
> + * are going to sleep; this is because we get here also during
> + * load balancing, but in these cases it seems wise to trigger
> + * as single request after load balancing is done.
> + *
> + * Also, we add a margin (same ~20% used for the tipping point)
> + * to our request to provide some head room if p's utilization
> + * further increases.
> + */
> + if (sched_energy_freq() && task_sleep) {
> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
> +
> + req_cap = req_cap * capacity_margin
> + >> SCHED_CAPACITY_SHIFT;
> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);

Could you clarify why you want to trig a freq switch for tasks that
are going to sleep ?
The cpu_usage should not changed that much as the se_utilization of
the entity moves from utilization_load_avg to utilization_blocked_avg
of the rq and the usage and the freq are updated periodically.
It should be the same for the wake up of a task in enqueue_task_fair
above, even if it's less obvious for this latter use case because the
cpu might wake up from a long idle phase during which its
utilization_blocked_avg has not been updated. Nevertheless, a trig of
the freq switch at wake up of the cpu once its usage has been updated
should do the job.

So tick, migration of tasks, new tasks, entering/leaving idle state of
cpu should be enough to trig freq switch

Regards,
Vincent


> + }
> }
> hrtick_update(rq);
> }
> @@ -4959,8 +4999,6 @@ static int find_new_capacity(struct energy_env *eenv,
> return idx;
> }
>
> -static unsigned int capacity_margin = 1280; /* ~20% margin */
> -
> static bool cpu_overutilized(int cpu)
> {
> return (capacity_of(cpu) * 1024) <
> --
> 1.9.1
>

2015-08-10 13:43:18

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

Hi Vincent,

On 04/08/15 14:41, Vincent Guittot wrote:
> Hi Juri,
>
> On 7 July 2015 at 20:24, Morten Rasmussen <[email protected]> wrote:
>> From: Juri Lelli <[email protected]>
>>
>> Each time a task is {en,de}queued we might need to adapt the current
>> frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
>> this purpose. Only trigger a freq request if we are effectively waking up
>> or going to sleep. Filter out load balancing related calls to reduce the
>> number of triggers.
>>
>> cc: Ingo Molnar <[email protected]>
>> cc: Peter Zijlstra <[email protected]>
>>
>> Signed-off-by: Juri Lelli <[email protected]>
>> ---
>> kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 40 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f74e9d2..b8627c6 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4281,7 +4281,10 @@ static inline void hrtick_update(struct rq *rq)
>> }
>> #endif
>>
>> +static unsigned int capacity_margin = 1280; /* ~20% margin */
>> +
>> static bool cpu_overutilized(int cpu);
>> +static unsigned long get_cpu_usage(int cpu);
>> struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;
>>
>> /*
>> @@ -4332,6 +4335,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> if (!task_new && !rq->rd->overutilized &&
>> cpu_overutilized(rq->cpu))
>> rq->rd->overutilized = true;
>> + /*
>> + * We want to trigger a freq switch request only for tasks that
>> + * are waking up; this is because we get here also during
>> + * load balancing, but in these cases it seems wise to trigger
>> + * as single request after load balancing is done.
>> + *
>> + * XXX: how about fork()? Do we need a special flag/something
>> + * to tell if we are here after a fork() (wakeup_task_new)?
>> + *
>> + * Also, we add a margin (same ~20% used for the tipping point)
>> + * to our request to provide some head room if p's utilization
>> + * further increases.
>> + */
>> + if (sched_energy_freq() && !task_new) {
>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>> +
>> + req_cap = req_cap * capacity_margin
>> + >> SCHED_CAPACITY_SHIFT;
>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>> + }
>> }
>> hrtick_update(rq);
>> }
>> @@ -4393,6 +4416,23 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> if (!se) {
>> sub_nr_running(rq, 1);
>> update_rq_runnable_avg(rq, 1);
>> + /*
>> + * We want to trigger a freq switch request only for tasks that
>> + * are going to sleep; this is because we get here also during
>> + * load balancing, but in these cases it seems wise to trigger
>> + * as single request after load balancing is done.
>> + *
>> + * Also, we add a margin (same ~20% used for the tipping point)
>> + * to our request to provide some head room if p's utilization
>> + * further increases.
>> + */
>> + if (sched_energy_freq() && task_sleep) {
>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>> +
>> + req_cap = req_cap * capacity_margin
>> + >> SCHED_CAPACITY_SHIFT;
>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>
> Could you clarify why you want to trig a freq switch for tasks that
> are going to sleep ?
> The cpu_usage should not changed that much as the se_utilization of
> the entity moves from utilization_load_avg to utilization_blocked_avg
> of the rq and the usage and the freq are updated periodically.

I think we still need to cover multiple back-to-back dequeues. Suppose
that you have, let's say, 3 tasks that get enqueued at the same time.
After some time the first one goes to sleep and its utilization, as you
say, gets moved to utilization_blocked_avg. So, nothing changes, and
the trigger is superfluous (even if no freq change I guess will be
issued as we are already servicing enough capacity). However, after a
while, the second task goes to sleep. Now we still use get_cpu_usage()
and the first task contribution in utilization_blocked_avg should have
been decayed by this time. Same thing may than happen for the third task
as well. So, if we don't check if we need to scale down in
dequeue_task_fair, it seems to me that we might miss some opportunities,
as blocked contribution of other tasks could have been successively
decayed.

What you think?

Thanks,

- Juri

> It should be the same for the wake up of a task in enqueue_task_fair
> above, even if it's less obvious for this latter use case because the
> cpu might wake up from a long idle phase during which its
> utilization_blocked_avg has not been updated. Nevertheless, a trig of
> the freq switch at wake up of the cpu once its usage has been updated
> should do the job.
>
> So tick, migration of tasks, new tasks, entering/leaving idle state of
> cpu should be enough to trig freq switch
>
> Regards,
> Vincent
>
>
>> + }
>> }
>> hrtick_update(rq);
>> }
>> @@ -4959,8 +4999,6 @@ static int find_new_capacity(struct energy_env *eenv,
>> return idx;
>> }
>>
>> -static unsigned int capacity_margin = 1280; /* ~20% margin */
>> -
>> static bool cpu_overutilized(int cpu)
>> {
>> return (capacity_of(cpu) * 1024) <
>> --
>> 1.9.1
>>
>

2015-08-10 15:07:33

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

On 10 August 2015 at 15:43, Juri Lelli <[email protected]> wrote:
>
> Hi Vincent,
>
> On 04/08/15 14:41, Vincent Guittot wrote:
> > Hi Juri,
> >
> > On 7 July 2015 at 20:24, Morten Rasmussen <[email protected]> wrote:
> >> From: Juri Lelli <[email protected]>
> >>
> >> Each time a task is {en,de}queued we might need to adapt the current
> >> frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
> >> this purpose. Only trigger a freq request if we are effectively waking up
> >> or going to sleep. Filter out load balancing related calls to reduce the
> >> number of triggers.
> >>
> >> cc: Ingo Molnar <[email protected]>
> >> cc: Peter Zijlstra <[email protected]>
> >>
> >> Signed-off-by: Juri Lelli <[email protected]>
> >> ---
> >> kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++++++++++++--
> >> 1 file changed, 40 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index f74e9d2..b8627c6 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -4281,7 +4281,10 @@ static inline void hrtick_update(struct rq *rq)
> >> }
> >> #endif
> >>
> >> +static unsigned int capacity_margin = 1280; /* ~20% margin */
> >> +
> >> static bool cpu_overutilized(int cpu);
> >> +static unsigned long get_cpu_usage(int cpu);
> >> struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;
> >>
> >> /*
> >> @@ -4332,6 +4335,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >> if (!task_new && !rq->rd->overutilized &&
> >> cpu_overutilized(rq->cpu))
> >> rq->rd->overutilized = true;
> >> + /*
> >> + * We want to trigger a freq switch request only for tasks that
> >> + * are waking up; this is because we get here also during
> >> + * load balancing, but in these cases it seems wise to trigger
> >> + * as single request after load balancing is done.
> >> + *
> >> + * XXX: how about fork()? Do we need a special flag/something
> >> + * to tell if we are here after a fork() (wakeup_task_new)?
> >> + *
> >> + * Also, we add a margin (same ~20% used for the tipping point)
> >> + * to our request to provide some head room if p's utilization
> >> + * further increases.
> >> + */
> >> + if (sched_energy_freq() && !task_new) {
> >> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
> >> +
> >> + req_cap = req_cap * capacity_margin
> >> + >> SCHED_CAPACITY_SHIFT;
> >> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
> >> + }
> >> }
> >> hrtick_update(rq);
> >> }
> >> @@ -4393,6 +4416,23 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >> if (!se) {
> >> sub_nr_running(rq, 1);
> >> update_rq_runnable_avg(rq, 1);
> >> + /*
> >> + * We want to trigger a freq switch request only for tasks that
> >> + * are going to sleep; this is because we get here also during
> >> + * load balancing, but in these cases it seems wise to trigger
> >> + * as single request after load balancing is done.
> >> + *
> >> + * Also, we add a margin (same ~20% used for the tipping point)
> >> + * to our request to provide some head room if p's utilization
> >> + * further increases.
> >> + */
> >> + if (sched_energy_freq() && task_sleep) {
> >> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
> >> +
> >> + req_cap = req_cap * capacity_margin
> >> + >> SCHED_CAPACITY_SHIFT;
> >> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
> >
> > Could you clarify why you want to trig a freq switch for tasks that
> > are going to sleep ?
> > The cpu_usage should not changed that much as the se_utilization of
> > the entity moves from utilization_load_avg to utilization_blocked_avg
> > of the rq and the usage and the freq are updated periodically.
>
> I think we still need to cover multiple back-to-back dequeues. Suppose
> that you have, let's say, 3 tasks that get enqueued at the same time.
> After some time the first one goes to sleep and its utilization, as you
> say, gets moved to utilization_blocked_avg. So, nothing changes, and
> the trigger is superfluous (even if no freq change I guess will be
> issued as we are already servicing enough capacity). However, after a
> while, the second task goes to sleep. Now we still use get_cpu_usage()
> and the first task contribution in utilization_blocked_avg should have
> been decayed by this time. Same thing may than happen for the third task
> as well. So, if we don't check if we need to scale down in
> dequeue_task_fair, it seems to me that we might miss some opportunities,
> as blocked contribution of other tasks could have been successively
> decayed.
>
> What you think?

The tick is used to monitor such variation of the usage (in both way,
decay of the usage of sleeping tasks and increase of the usage of
running tasks). So in your example, if the duration between the sleep
of the 2 tasks is significant enough, the tick will handle this
variation

Regards,
Vincent
>
> Thanks,
>
> - Juri
>
> > It should be the same for the wake up of a task in enqueue_task_fair
> > above, even if it's less obvious for this latter use case because the
> > cpu might wake up from a long idle phase during which its
> > utilization_blocked_avg has not been updated. Nevertheless, a trig of
> > the freq switch at wake up of the cpu once its usage has been updated
> > should do the job.
> >
> > So tick, migration of tasks, new tasks, entering/leaving idle state of
> > cpu should be enough to trig freq switch
> >
> > Regards,
> > Vincent
> >
> >
> >> + }
> >> }
> >> hrtick_update(rq);
> >> }
> >> @@ -4959,8 +4999,6 @@ static int find_new_capacity(struct energy_env *eenv,
> >> return idx;
> >> }
> >>
> >> -static unsigned int capacity_margin = 1280; /* ~20% margin */
> >> -
> >> static bool cpu_overutilized(int cpu)
> >> {
> >> return (capacity_of(cpu) * 1024) <
> >> --
> >> 1.9.1
> >>
> >
>

2015-08-11 02:14:27

by Leo Yan

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 38/46] sched: scheduler-driven cpu frequency selection

On Tue, Jul 07, 2015 at 07:24:21PM +0100, Morten Rasmussen wrote:
> From: Michael Turquette <[email protected]>
>
> Scheduler-driven cpu frequency selection is desirable as part of the
> on-going effort to make the scheduler better aware of energy
> consumption. No piece of the Linux kernel has a better view of the
> factors that affect a cpu frequency selection policy than the
> scheduler[0], and this patch is an attempt to converge on an initial
> solution.
>
> This patch implements a simple shim layer between the Linux scheduler
> and the cpufreq subsystem. This interface accepts a capacity request
> from the Completely Fair Scheduler and honors the max request from all
> cpus in the same frequency domain.
>
> The policy magic comes from choosing the cpu capacity request from cfs
> and is not contained in this cpufreq governor. This code is
> intentionally dumb.
>
> Note that this "governor" is event-driven. There is no polling loop to
> check cpu idle time nor any other method which is unsynchronized with
> the scheduler.
>
> Thanks to Juri Lelli <[email protected]> for contributing design ideas,
> code and test results.
>
> [0] http://article.gmane.org/gmane.linux.kernel/1499836
>
> Signed-off-by: Michael Turquette <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> ---
> drivers/cpufreq/Kconfig | 24 ++++
> include/linux/cpufreq.h | 3 +
> kernel/sched/Makefile | 1 +
> kernel/sched/cpufreq_sched.c | 308 +++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 8 ++
> 5 files changed, 344 insertions(+)
> create mode 100644 kernel/sched/cpufreq_sched.c
>
> diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> index 659879a..9bbf44c 100644
> --- a/drivers/cpufreq/Kconfig
> +++ b/drivers/cpufreq/Kconfig
> @@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
> Be aware that not all cpufreq drivers support the conservative
> governor. If unsure have a look at the help section of the
> driver. Fallback governor will be the performance governor.
> +
> +config CPU_FREQ_DEFAULT_GOV_SCHED
> + bool "sched"
> + select CPU_FREQ_GOV_SCHED
> + select CPU_FREQ_GOV_PERFORMANCE
> + help
> + Use the CPUfreq governor 'sched' as default. This scales
> + cpu frequency from the scheduler as per-entity load tracking
> + statistics are updated.
> endchoice
>
> config CPU_FREQ_GOV_PERFORMANCE
> @@ -183,6 +192,21 @@ config CPU_FREQ_GOV_CONSERVATIVE
>
> If in doubt, say N.
>
> +config CPU_FREQ_GOV_SCHED
> + tristate "'sched' cpufreq governor"
> + depends on CPU_FREQ
> + select CPU_FREQ_GOV_COMMON
> + help
> + 'sched' - this governor scales cpu frequency from the
> + scheduler as a function of cpu capacity utilization. It does
> + not evaluate utilization on a periodic basis (as ondemand
> + does) but instead is invoked from the completely fair
> + scheduler when updating per-entity load tracking statistics.
> + Latency to respond to changes in load is improved over polling
> + governors due to its event-driven design.
> +
> + If in doubt, say N.
> +
> comment "CPU frequency scaling drivers"
>
> config CPUFREQ_DT
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 1f2c9a1..30241c9 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -494,6 +494,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
> #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
> extern struct cpufreq_governor cpufreq_gov_conservative;
> #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
> +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED_GOV)

s/CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED_GOV/CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED/

> +extern struct cpufreq_governor cpufreq_gov_sched_gov;

s/cpufreq_gov_sched_gov/cpufreq_gov_sched/

> +#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_sched)
> #endif
>
> /*********************************************************************
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 6768797..90ed832 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
> obj-$(CONFIG_SCHEDSTATS) += stats.o
> obj-$(CONFIG_SCHED_DEBUG) += debug.o
> obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
> +obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o
> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
> new file mode 100644
> index 0000000..5020f24
> --- /dev/null
> +++ b/kernel/sched/cpufreq_sched.c
> @@ -0,0 +1,308 @@
> +/*
> + * Copyright (C) 2015 Michael Turquette <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/cpufreq.h>
> +#include <linux/module.h>
> +#include <linux/kthread.h>
> +#include <linux/percpu.h>
> +#include <linux/irq_work.h>
> +
> +#include "sched.h"
> +
> +#define THROTTLE_NSEC 50000000 /* 50ms default */
> +
> +static DEFINE_PER_CPU(unsigned long, pcpu_capacity);
> +static DEFINE_PER_CPU(struct cpufreq_policy *, pcpu_policy);
> +
> +/**
> + * gov_data - per-policy data internal to the governor
> + * @throttle: next throttling period expiry. Derived from throttle_nsec
> + * @throttle_nsec: throttle period length in nanoseconds
> + * @task: worker thread for dvfs transition that may block/sleep
> + * @irq_work: callback used to wake up worker thread
> + * @freq: new frequency stored in *_sched_update_cpu and used in *_sched_thread
> + *
> + * struct gov_data is the per-policy cpufreq_sched-specific data structure. A
> + * per-policy instance of it is created when the cpufreq_sched governor receives
> + * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
> + * member of struct cpufreq_policy.
> + *
> + * Readers of this data must call down_read(policy->rwsem). Writers must
> + * call down_write(policy->rwsem).
> + */
> +struct gov_data {
> + ktime_t throttle;
> + unsigned int throttle_nsec;
> + struct task_struct *task;
> + struct irq_work irq_work;
> + struct cpufreq_policy *policy;
> + unsigned int freq;
> +};
> +
> +static void cpufreq_sched_try_driver_target(struct cpufreq_policy *policy, unsigned int freq)
> +{
> + struct gov_data *gd = policy->governor_data;
> +
> + /* avoid race with cpufreq_sched_stop */
> + if (!down_write_trylock(&policy->rwsem))
> + return;
> +
> + __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
> +
> + gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
> + up_write(&policy->rwsem);
> +}
> +
> +/*
> + * we pass in struct cpufreq_policy. This is safe because changing out the
> + * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
> + * which tears down all of the data structures and __cpufreq_governor(policy,
> + * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
> + * new policy pointer
> + */
> +static int cpufreq_sched_thread(void *data)
> +{
> + struct sched_param param;
> + struct cpufreq_policy *policy;
> + struct gov_data *gd;
> + int ret;
> +
> + policy = (struct cpufreq_policy *) data;
> + if (!policy) {
> + pr_warn("%s: missing policy\n", __func__);
> + do_exit(-EINVAL);
> + }
> +
> + gd = policy->governor_data;
> + if (!gd) {
> + pr_warn("%s: missing governor data\n", __func__);
> + do_exit(-EINVAL);
> + }
> +
> + param.sched_priority = 50;
> + ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
> + if (ret) {
> + pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
> + do_exit(-EINVAL);
> + } else {
> + pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
> + __func__, gd->task->pid);
> + }
> +
> + ret = set_cpus_allowed_ptr(gd->task, policy->related_cpus);
> + if (ret) {
> + pr_warn("%s: failed to set allowed ptr\n", __func__);
> + do_exit(-EINVAL);
> + }
> +
> + /* main loop of the per-policy kthread */
> + do {
> + set_current_state(TASK_INTERRUPTIBLE);
> + schedule();
> + if (kthread_should_stop())
> + break;
> +
> + cpufreq_sched_try_driver_target(policy, gd->freq);
> + } while (!kthread_should_stop());
> +
> + do_exit(0);
> +}
> +
> +static void cpufreq_sched_irq_work(struct irq_work *irq_work)
> +{
> + struct gov_data *gd;
> +
> + gd = container_of(irq_work, struct gov_data, irq_work);
> + if (!gd) {
> + return;
> + }
> +
> + wake_up_process(gd->task);
> +}
> +
> +/**
> + * cpufreq_sched_set_capacity - interface to scheduler for changing capacity values
> + * @cpu: cpu whose capacity utilization has recently changed
> + * @capacity: the new capacity requested by cpu
> + *
> + * cpufreq_sched_sched_capacity is an interface exposed to the scheduler so
> + * that the scheduler may inform the governor of updates to capacity
> + * utilization and make changes to cpu frequency. Currently this interface is
> + * designed around PELT values in CFS. It can be expanded to other scheduling
> + * classes in the future if needed.
> + *
> + * cpufreq_sched_set_capacity raises an IPI. The irq_work handler for that IPI
> + * wakes up the thread that does the actual work, cpufreq_sched_thread.
> + *
> + * This functions bails out early if either condition is true:
> + * 1) this cpu did not the new maximum capacity for its frequency domain
> + * 2) no change in cpu frequency is necessary to meet the new capacity request
> + */
> +void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
> +{
> + unsigned int freq_new, cpu_tmp;
> + struct cpufreq_policy *policy;
> + struct gov_data *gd;
> + unsigned long capacity_max = 0;
> +
> + /* update per-cpu capacity request */
> + __this_cpu_write(pcpu_capacity, capacity);
> +
> + policy = cpufreq_cpu_get(cpu);
> + if (IS_ERR_OR_NULL(policy)) {
> + return;
> + }
> +
> + if (!policy->governor_data)
> + goto out;
> +
> + gd = policy->governor_data;
> +
> + /* bail early if we are throttled */
> + if (ktime_before(ktime_get(), gd->throttle))
> + goto out;
> +
> + /* find max capacity requested by cpus in this policy */
> + for_each_cpu(cpu_tmp, policy->cpus)
> + capacity_max = max(capacity_max, per_cpu(pcpu_capacity, cpu_tmp));
> +
> + /*
> + * We only change frequency if this cpu's capacity request represents a
> + * new max. If another cpu has requested a capacity greater than the
> + * previous max then we rely on that cpu to hit this code path and make
> + * the change. IOW, the cpu with the new max capacity is responsible
> + * for setting the new capacity/frequency.
> + *
> + * If this cpu is not the new maximum then bail
> + */
> + if (capacity_max > capacity)
> + goto out;
> +
> + /* Convert the new maximum capacity request into a cpu frequency */
> + freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
> +
> + /* No change in frequency? Bail and return current capacity. */
> + if (freq_new == policy->cur)
> + goto out;
> +
> + /* store the new frequency and perform the transition */
> + gd->freq = freq_new;
> +
> + if (cpufreq_driver_might_sleep())
> + irq_work_queue_on(&gd->irq_work, cpu);
> + else
> + cpufreq_sched_try_driver_target(policy, freq_new);
> +
> +out:
> + cpufreq_cpu_put(policy);
> + return;
> +}
> +
> +static int cpufreq_sched_start(struct cpufreq_policy *policy)
> +{
> + struct gov_data *gd;
> + int cpu;
> +
> + /* prepare per-policy private data */
> + gd = kzalloc(sizeof(*gd), GFP_KERNEL);
> + if (!gd) {
> + pr_debug("%s: failed to allocate private data\n", __func__);
> + return -ENOMEM;
> + }
> +
> + /* initialize per-cpu data */
> + for_each_cpu(cpu, policy->cpus) {
> + per_cpu(pcpu_capacity, cpu) = 0;
> + per_cpu(pcpu_policy, cpu) = policy;
> + }
> +
> + /*
> + * Don't ask for freq changes at an higher rate than what
> + * the driver advertises as transition latency.
> + */
> + gd->throttle_nsec = policy->cpuinfo.transition_latency ?
> + policy->cpuinfo.transition_latency :
> + THROTTLE_NSEC;
> + pr_debug("%s: throttle threshold = %u [ns]\n",
> + __func__, gd->throttle_nsec);
> +
> + if (cpufreq_driver_might_sleep()) {
> + /* init per-policy kthread */
> + gd->task = kthread_run(cpufreq_sched_thread, policy, "kcpufreq_sched_task");
> + if (IS_ERR_OR_NULL(gd->task)) {
> + pr_err("%s: failed to create kcpufreq_sched_task thread\n", __func__);
> + goto err;
> + }
> + init_irq_work(&gd->irq_work, cpufreq_sched_irq_work);
> + }
> +
> + policy->governor_data = gd;
> + gd->policy = policy;
> + return 0;
> +
> +err:
> + kfree(gd);
> + return -ENOMEM;
> +}
> +
> +static int cpufreq_sched_stop(struct cpufreq_policy *policy)
> +{
> + struct gov_data *gd = policy->governor_data;
> +
> + if (cpufreq_driver_might_sleep()) {
> + kthread_stop(gd->task);
> + }
> +
> + policy->governor_data = NULL;
> +
> + /* FIXME replace with devm counterparts? */
> + kfree(gd);
> + return 0;
> +}
> +
> +static int cpufreq_sched_setup(struct cpufreq_policy *policy, unsigned int event)
> +{
> + switch (event) {
> + case CPUFREQ_GOV_START:
> + /* Start managing the frequency */
> + return cpufreq_sched_start(policy);
> +
> + case CPUFREQ_GOV_STOP:
> + return cpufreq_sched_stop(policy);
> +
> + case CPUFREQ_GOV_LIMITS: /* unused */
> + case CPUFREQ_GOV_POLICY_INIT: /* unused */
> + case CPUFREQ_GOV_POLICY_EXIT: /* unused */
> + break;
> + }
> + return 0;
> +}
> +
> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
> +static
> +#endif
> +struct cpufreq_governor cpufreq_gov_sched = {
> + .name = "sched",
> + .governor = cpufreq_sched_setup,
> + .owner = THIS_MODULE,
> +};
> +
> +static int __init cpufreq_sched_init(void)
> +{
> + return cpufreq_register_governor(&cpufreq_gov_sched);
> +}
> +
> +static void __exit cpufreq_sched_exit(void)
> +{
> + cpufreq_unregister_governor(&cpufreq_gov_sched);
> +}
> +
> +/* Try to make this the default governor */
> +fs_initcall(cpufreq_sched_init);
> +
> +MODULE_LICENSE("GPL v2");
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c395559..30aa0c4 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1476,6 +1476,13 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> }
> #endif
>
> +#ifdef CONFIG_CPU_FREQ_GOV_SCHED
> +void cpufreq_sched_set_cap(int cpu, unsigned long util);
> +#else
> +static inline void cpufreq_sched_set_cap(int cpu, unsigned long util)
> +{ }
> +#endif
> +
> static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
> {
> rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
> @@ -1484,6 +1491,7 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
> #else
> static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
> static inline void sched_avg_update(struct rq *rq) { }
> +static inline void gov_cfs_update_cpu(int cpu) {}
> #endif
>
> extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period);
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-08-11 08:59:20

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 38/46] sched: scheduler-driven cpu frequency selection

Hi,

On 11/08/15 03:14, Leo Yan wrote:
> On Tue, Jul 07, 2015 at 07:24:21PM +0100, Morten Rasmussen wrote:
>> From: Michael Turquette <[email protected]>
>>
>> Scheduler-driven cpu frequency selection is desirable as part of the
>> on-going effort to make the scheduler better aware of energy
>> consumption. No piece of the Linux kernel has a better view of the
>> factors that affect a cpu frequency selection policy than the
>> scheduler[0], and this patch is an attempt to converge on an initial
>> solution.
>>
>> This patch implements a simple shim layer between the Linux scheduler
>> and the cpufreq subsystem. This interface accepts a capacity request
>> from the Completely Fair Scheduler and honors the max request from all
>> cpus in the same frequency domain.
>>
>> The policy magic comes from choosing the cpu capacity request from cfs
>> and is not contained in this cpufreq governor. This code is
>> intentionally dumb.
>>
>> Note that this "governor" is event-driven. There is no polling loop to
>> check cpu idle time nor any other method which is unsynchronized with
>> the scheduler.
>>
>> Thanks to Juri Lelli <[email protected]> for contributing design ideas,
>> code and test results.
>>
>> [0] http://article.gmane.org/gmane.linux.kernel/1499836
>>
>> Signed-off-by: Michael Turquette <[email protected]>
>> Signed-off-by: Juri Lelli <[email protected]>
>> ---
>> drivers/cpufreq/Kconfig | 24 ++++
>> include/linux/cpufreq.h | 3 +
>> kernel/sched/Makefile | 1 +
>> kernel/sched/cpufreq_sched.c | 308 +++++++++++++++++++++++++++++++++++++++++++
>> kernel/sched/sched.h | 8 ++
>> 5 files changed, 344 insertions(+)
>> create mode 100644 kernel/sched/cpufreq_sched.c
>>
>> diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
>> index 659879a..9bbf44c 100644
>> --- a/drivers/cpufreq/Kconfig
>> +++ b/drivers/cpufreq/Kconfig
>> @@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
>> Be aware that not all cpufreq drivers support the conservative
>> governor. If unsure have a look at the help section of the
>> driver. Fallback governor will be the performance governor.
>> +
>> +config CPU_FREQ_DEFAULT_GOV_SCHED
>> + bool "sched"
>> + select CPU_FREQ_GOV_SCHED
>> + select CPU_FREQ_GOV_PERFORMANCE
>> + help
>> + Use the CPUfreq governor 'sched' as default. This scales
>> + cpu frequency from the scheduler as per-entity load tracking
>> + statistics are updated.
>> endchoice
>>
>> config CPU_FREQ_GOV_PERFORMANCE
>> @@ -183,6 +192,21 @@ config CPU_FREQ_GOV_CONSERVATIVE
>>
>> If in doubt, say N.
>>
>> +config CPU_FREQ_GOV_SCHED
>> + tristate "'sched' cpufreq governor"
>> + depends on CPU_FREQ
>> + select CPU_FREQ_GOV_COMMON
>> + help
>> + 'sched' - this governor scales cpu frequency from the
>> + scheduler as a function of cpu capacity utilization. It does
>> + not evaluate utilization on a periodic basis (as ondemand
>> + does) but instead is invoked from the completely fair
>> + scheduler when updating per-entity load tracking statistics.
>> + Latency to respond to changes in load is improved over polling
>> + governors due to its event-driven design.
>> +
>> + If in doubt, say N.
>> +
>> comment "CPU frequency scaling drivers"
>>
>> config CPUFREQ_DT
>> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
>> index 1f2c9a1..30241c9 100644
>> --- a/include/linux/cpufreq.h
>> +++ b/include/linux/cpufreq.h
>> @@ -494,6 +494,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
>> #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
>> extern struct cpufreq_governor cpufreq_gov_conservative;
>> #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
>> +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED_GOV)
>
> s/CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED_GOV/CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED/
>
>> +extern struct cpufreq_governor cpufreq_gov_sched_gov;
>
> s/cpufreq_gov_sched_gov/cpufreq_gov_sched/
>

Yes, right. Dietmar pointed out the same problem in reply to Mike's
original posting. I guess Mike is going to squash the fix in his
next posting.

Thanks a lot anyway! :)

Best,

- Juri

>> +#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_sched)
>> #endif
>>
>> /*********************************************************************
>> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
>> index 6768797..90ed832 100644
>> --- a/kernel/sched/Makefile
>> +++ b/kernel/sched/Makefile
>> @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
>> obj-$(CONFIG_SCHEDSTATS) += stats.o
>> obj-$(CONFIG_SCHED_DEBUG) += debug.o
>> obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
>> +obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o
>> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
>> new file mode 100644
>> index 0000000..5020f24
>> --- /dev/null
>> +++ b/kernel/sched/cpufreq_sched.c
>> @@ -0,0 +1,308 @@
>> +/*
>> + * Copyright (C) 2015 Michael Turquette <[email protected]>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + */
>> +
>> +#include <linux/cpufreq.h>
>> +#include <linux/module.h>
>> +#include <linux/kthread.h>
>> +#include <linux/percpu.h>
>> +#include <linux/irq_work.h>
>> +
>> +#include "sched.h"
>> +
>> +#define THROTTLE_NSEC 50000000 /* 50ms default */
>> +
>> +static DEFINE_PER_CPU(unsigned long, pcpu_capacity);
>> +static DEFINE_PER_CPU(struct cpufreq_policy *, pcpu_policy);
>> +
>> +/**
>> + * gov_data - per-policy data internal to the governor
>> + * @throttle: next throttling period expiry. Derived from throttle_nsec
>> + * @throttle_nsec: throttle period length in nanoseconds
>> + * @task: worker thread for dvfs transition that may block/sleep
>> + * @irq_work: callback used to wake up worker thread
>> + * @freq: new frequency stored in *_sched_update_cpu and used in *_sched_thread
>> + *
>> + * struct gov_data is the per-policy cpufreq_sched-specific data structure. A
>> + * per-policy instance of it is created when the cpufreq_sched governor receives
>> + * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
>> + * member of struct cpufreq_policy.
>> + *
>> + * Readers of this data must call down_read(policy->rwsem). Writers must
>> + * call down_write(policy->rwsem).
>> + */
>> +struct gov_data {
>> + ktime_t throttle;
>> + unsigned int throttle_nsec;
>> + struct task_struct *task;
>> + struct irq_work irq_work;
>> + struct cpufreq_policy *policy;
>> + unsigned int freq;
>> +};
>> +
>> +static void cpufreq_sched_try_driver_target(struct cpufreq_policy *policy, unsigned int freq)
>> +{
>> + struct gov_data *gd = policy->governor_data;
>> +
>> + /* avoid race with cpufreq_sched_stop */
>> + if (!down_write_trylock(&policy->rwsem))
>> + return;
>> +
>> + __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
>> +
>> + gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
>> + up_write(&policy->rwsem);
>> +}
>> +
>> +/*
>> + * we pass in struct cpufreq_policy. This is safe because changing out the
>> + * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
>> + * which tears down all of the data structures and __cpufreq_governor(policy,
>> + * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
>> + * new policy pointer
>> + */
>> +static int cpufreq_sched_thread(void *data)
>> +{
>> + struct sched_param param;
>> + struct cpufreq_policy *policy;
>> + struct gov_data *gd;
>> + int ret;
>> +
>> + policy = (struct cpufreq_policy *) data;
>> + if (!policy) {
>> + pr_warn("%s: missing policy\n", __func__);
>> + do_exit(-EINVAL);
>> + }
>> +
>> + gd = policy->governor_data;
>> + if (!gd) {
>> + pr_warn("%s: missing governor data\n", __func__);
>> + do_exit(-EINVAL);
>> + }
>> +
>> + param.sched_priority = 50;
>> + ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
>> + if (ret) {
>> + pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
>> + do_exit(-EINVAL);
>> + } else {
>> + pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
>> + __func__, gd->task->pid);
>> + }
>> +
>> + ret = set_cpus_allowed_ptr(gd->task, policy->related_cpus);
>> + if (ret) {
>> + pr_warn("%s: failed to set allowed ptr\n", __func__);
>> + do_exit(-EINVAL);
>> + }
>> +
>> + /* main loop of the per-policy kthread */
>> + do {
>> + set_current_state(TASK_INTERRUPTIBLE);
>> + schedule();
>> + if (kthread_should_stop())
>> + break;
>> +
>> + cpufreq_sched_try_driver_target(policy, gd->freq);
>> + } while (!kthread_should_stop());
>> +
>> + do_exit(0);
>> +}
>> +
>> +static void cpufreq_sched_irq_work(struct irq_work *irq_work)
>> +{
>> + struct gov_data *gd;
>> +
>> + gd = container_of(irq_work, struct gov_data, irq_work);
>> + if (!gd) {
>> + return;
>> + }
>> +
>> + wake_up_process(gd->task);
>> +}
>> +
>> +/**
>> + * cpufreq_sched_set_capacity - interface to scheduler for changing capacity values
>> + * @cpu: cpu whose capacity utilization has recently changed
>> + * @capacity: the new capacity requested by cpu
>> + *
>> + * cpufreq_sched_sched_capacity is an interface exposed to the scheduler so
>> + * that the scheduler may inform the governor of updates to capacity
>> + * utilization and make changes to cpu frequency. Currently this interface is
>> + * designed around PELT values in CFS. It can be expanded to other scheduling
>> + * classes in the future if needed.
>> + *
>> + * cpufreq_sched_set_capacity raises an IPI. The irq_work handler for that IPI
>> + * wakes up the thread that does the actual work, cpufreq_sched_thread.
>> + *
>> + * This functions bails out early if either condition is true:
>> + * 1) this cpu did not the new maximum capacity for its frequency domain
>> + * 2) no change in cpu frequency is necessary to meet the new capacity request
>> + */
>> +void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
>> +{
>> + unsigned int freq_new, cpu_tmp;
>> + struct cpufreq_policy *policy;
>> + struct gov_data *gd;
>> + unsigned long capacity_max = 0;
>> +
>> + /* update per-cpu capacity request */
>> + __this_cpu_write(pcpu_capacity, capacity);
>> +
>> + policy = cpufreq_cpu_get(cpu);
>> + if (IS_ERR_OR_NULL(policy)) {
>> + return;
>> + }
>> +
>> + if (!policy->governor_data)
>> + goto out;
>> +
>> + gd = policy->governor_data;
>> +
>> + /* bail early if we are throttled */
>> + if (ktime_before(ktime_get(), gd->throttle))
>> + goto out;
>> +
>> + /* find max capacity requested by cpus in this policy */
>> + for_each_cpu(cpu_tmp, policy->cpus)
>> + capacity_max = max(capacity_max, per_cpu(pcpu_capacity, cpu_tmp));
>> +
>> + /*
>> + * We only change frequency if this cpu's capacity request represents a
>> + * new max. If another cpu has requested a capacity greater than the
>> + * previous max then we rely on that cpu to hit this code path and make
>> + * the change. IOW, the cpu with the new max capacity is responsible
>> + * for setting the new capacity/frequency.
>> + *
>> + * If this cpu is not the new maximum then bail
>> + */
>> + if (capacity_max > capacity)
>> + goto out;
>> +
>> + /* Convert the new maximum capacity request into a cpu frequency */
>> + freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
>> +
>> + /* No change in frequency? Bail and return current capacity. */
>> + if (freq_new == policy->cur)
>> + goto out;
>> +
>> + /* store the new frequency and perform the transition */
>> + gd->freq = freq_new;
>> +
>> + if (cpufreq_driver_might_sleep())
>> + irq_work_queue_on(&gd->irq_work, cpu);
>> + else
>> + cpufreq_sched_try_driver_target(policy, freq_new);
>> +
>> +out:
>> + cpufreq_cpu_put(policy);
>> + return;
>> +}
>> +
>> +static int cpufreq_sched_start(struct cpufreq_policy *policy)
>> +{
>> + struct gov_data *gd;
>> + int cpu;
>> +
>> + /* prepare per-policy private data */
>> + gd = kzalloc(sizeof(*gd), GFP_KERNEL);
>> + if (!gd) {
>> + pr_debug("%s: failed to allocate private data\n", __func__);
>> + return -ENOMEM;
>> + }
>> +
>> + /* initialize per-cpu data */
>> + for_each_cpu(cpu, policy->cpus) {
>> + per_cpu(pcpu_capacity, cpu) = 0;
>> + per_cpu(pcpu_policy, cpu) = policy;
>> + }
>> +
>> + /*
>> + * Don't ask for freq changes at an higher rate than what
>> + * the driver advertises as transition latency.
>> + */
>> + gd->throttle_nsec = policy->cpuinfo.transition_latency ?
>> + policy->cpuinfo.transition_latency :
>> + THROTTLE_NSEC;
>> + pr_debug("%s: throttle threshold = %u [ns]\n",
>> + __func__, gd->throttle_nsec);
>> +
>> + if (cpufreq_driver_might_sleep()) {
>> + /* init per-policy kthread */
>> + gd->task = kthread_run(cpufreq_sched_thread, policy, "kcpufreq_sched_task");
>> + if (IS_ERR_OR_NULL(gd->task)) {
>> + pr_err("%s: failed to create kcpufreq_sched_task thread\n", __func__);
>> + goto err;
>> + }
>> + init_irq_work(&gd->irq_work, cpufreq_sched_irq_work);
>> + }
>> +
>> + policy->governor_data = gd;
>> + gd->policy = policy;
>> + return 0;
>> +
>> +err:
>> + kfree(gd);
>> + return -ENOMEM;
>> +}
>> +
>> +static int cpufreq_sched_stop(struct cpufreq_policy *policy)
>> +{
>> + struct gov_data *gd = policy->governor_data;
>> +
>> + if (cpufreq_driver_might_sleep()) {
>> + kthread_stop(gd->task);
>> + }
>> +
>> + policy->governor_data = NULL;
>> +
>> + /* FIXME replace with devm counterparts? */
>> + kfree(gd);
>> + return 0;
>> +}
>> +
>> +static int cpufreq_sched_setup(struct cpufreq_policy *policy, unsigned int event)
>> +{
>> + switch (event) {
>> + case CPUFREQ_GOV_START:
>> + /* Start managing the frequency */
>> + return cpufreq_sched_start(policy);
>> +
>> + case CPUFREQ_GOV_STOP:
>> + return cpufreq_sched_stop(policy);
>> +
>> + case CPUFREQ_GOV_LIMITS: /* unused */
>> + case CPUFREQ_GOV_POLICY_INIT: /* unused */
>> + case CPUFREQ_GOV_POLICY_EXIT: /* unused */
>> + break;
>> + }
>> + return 0;
>> +}
>> +
>> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
>> +static
>> +#endif
>> +struct cpufreq_governor cpufreq_gov_sched = {
>> + .name = "sched",
>> + .governor = cpufreq_sched_setup,
>> + .owner = THIS_MODULE,
>> +};
>> +
>> +static int __init cpufreq_sched_init(void)
>> +{
>> + return cpufreq_register_governor(&cpufreq_gov_sched);
>> +}
>> +
>> +static void __exit cpufreq_sched_exit(void)
>> +{
>> + cpufreq_unregister_governor(&cpufreq_gov_sched);
>> +}
>> +
>> +/* Try to make this the default governor */
>> +fs_initcall(cpufreq_sched_init);
>> +
>> +MODULE_LICENSE("GPL v2");
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index c395559..30aa0c4 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1476,6 +1476,13 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>> }
>> #endif
>>
>> +#ifdef CONFIG_CPU_FREQ_GOV_SCHED
>> +void cpufreq_sched_set_cap(int cpu, unsigned long util);
>> +#else
>> +static inline void cpufreq_sched_set_cap(int cpu, unsigned long util)
>> +{ }
>> +#endif
>> +
>> static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
>> {
>> rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
>> @@ -1484,6 +1491,7 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
>> #else
>> static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
>> static inline void sched_avg_update(struct rq *rq) { }
>> +static inline void gov_cfs_update_cpu(int cpu) {}
>> #endif
>>
>> extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period);
>> --
>> 1.9.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>

2015-08-11 09:07:36

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

On 10/08/15 16:07, Vincent Guittot wrote:
> On 10 August 2015 at 15:43, Juri Lelli <[email protected]> wrote:
>>
>> Hi Vincent,
>>
>> On 04/08/15 14:41, Vincent Guittot wrote:
>>> Hi Juri,
>>>
>>> On 7 July 2015 at 20:24, Morten Rasmussen <[email protected]> wrote:
>>>> From: Juri Lelli <[email protected]>
>>>>
>>>> Each time a task is {en,de}queued we might need to adapt the current
>>>> frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
>>>> this purpose. Only trigger a freq request if we are effectively waking up
>>>> or going to sleep. Filter out load balancing related calls to reduce the
>>>> number of triggers.
>>>>
>>>> cc: Ingo Molnar <[email protected]>
>>>> cc: Peter Zijlstra <[email protected]>
>>>>
>>>> Signed-off-by: Juri Lelli <[email protected]>
>>>> ---
>>>> kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>>>> 1 file changed, 40 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index f74e9d2..b8627c6 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -4281,7 +4281,10 @@ static inline void hrtick_update(struct rq *rq)
>>>> }
>>>> #endif
>>>>
>>>> +static unsigned int capacity_margin = 1280; /* ~20% margin */
>>>> +
>>>> static bool cpu_overutilized(int cpu);
>>>> +static unsigned long get_cpu_usage(int cpu);
>>>> struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;
>>>>
>>>> /*
>>>> @@ -4332,6 +4335,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>> if (!task_new && !rq->rd->overutilized &&
>>>> cpu_overutilized(rq->cpu))
>>>> rq->rd->overutilized = true;
>>>> + /*
>>>> + * We want to trigger a freq switch request only for tasks that
>>>> + * are waking up; this is because we get here also during
>>>> + * load balancing, but in these cases it seems wise to trigger
>>>> + * as single request after load balancing is done.
>>>> + *
>>>> + * XXX: how about fork()? Do we need a special flag/something
>>>> + * to tell if we are here after a fork() (wakeup_task_new)?
>>>> + *
>>>> + * Also, we add a margin (same ~20% used for the tipping point)
>>>> + * to our request to provide some head room if p's utilization
>>>> + * further increases.
>>>> + */
>>>> + if (sched_energy_freq() && !task_new) {
>>>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>>>> +
>>>> + req_cap = req_cap * capacity_margin
>>>> + >> SCHED_CAPACITY_SHIFT;
>>>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>>>> + }
>>>> }
>>>> hrtick_update(rq);
>>>> }
>>>> @@ -4393,6 +4416,23 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>> if (!se) {
>>>> sub_nr_running(rq, 1);
>>>> update_rq_runnable_avg(rq, 1);
>>>> + /*
>>>> + * We want to trigger a freq switch request only for tasks that
>>>> + * are going to sleep; this is because we get here also during
>>>> + * load balancing, but in these cases it seems wise to trigger
>>>> + * as single request after load balancing is done.
>>>> + *
>>>> + * Also, we add a margin (same ~20% used for the tipping point)
>>>> + * to our request to provide some head room if p's utilization
>>>> + * further increases.
>>>> + */
>>>> + if (sched_energy_freq() && task_sleep) {
>>>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>>>> +
>>>> + req_cap = req_cap * capacity_margin
>>>> + >> SCHED_CAPACITY_SHIFT;
>>>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>>>
>>> Could you clarify why you want to trig a freq switch for tasks that
>>> are going to sleep ?
>>> The cpu_usage should not changed that much as the se_utilization of
>>> the entity moves from utilization_load_avg to utilization_blocked_avg
>>> of the rq and the usage and the freq are updated periodically.
>>
>> I think we still need to cover multiple back-to-back dequeues. Suppose
>> that you have, let's say, 3 tasks that get enqueued at the same time.
>> After some time the first one goes to sleep and its utilization, as you
>> say, gets moved to utilization_blocked_avg. So, nothing changes, and
>> the trigger is superfluous (even if no freq change I guess will be
>> issued as we are already servicing enough capacity). However, after a
>> while, the second task goes to sleep. Now we still use get_cpu_usage()
>> and the first task contribution in utilization_blocked_avg should have
>> been decayed by this time. Same thing may than happen for the third task
>> as well. So, if we don't check if we need to scale down in
>> dequeue_task_fair, it seems to me that we might miss some opportunities,
>> as blocked contribution of other tasks could have been successively
>> decayed.
>>
>> What you think?
>
> The tick is used to monitor such variation of the usage (in both way,
> decay of the usage of sleeping tasks and increase of the usage of
> running tasks). So in your example, if the duration between the sleep
> of the 2 tasks is significant enough, the tick will handle this
> variation
>

The tick is used to decide if we need to scale up (to max OPP for the
time being), but we don't scale down. It makes more logical sense to
scale down at task deactivation, or wakeup after a long time, IMHO.

Best,

- Juri

> Regards,
> Vincent
>>
>> Thanks,
>>
>> - Juri
>>
>>> It should be the same for the wake up of a task in enqueue_task_fair
>>> above, even if it's less obvious for this latter use case because the
>>> cpu might wake up from a long idle phase during which its
>>> utilization_blocked_avg has not been updated. Nevertheless, a trig of
>>> the freq switch at wake up of the cpu once its usage has been updated
>>> should do the job.
>>>
>>> So tick, migration of tasks, new tasks, entering/leaving idle state of
>>> cpu should be enough to trig freq switch
>>>
>>> Regards,
>>> Vincent
>>>
>>>
>>>> + }
>>>> }
>>>> hrtick_update(rq);
>>>> }
>>>> @@ -4959,8 +4999,6 @@ static int find_new_capacity(struct energy_env *eenv,
>>>> return idx;
>>>> }
>>>>
>>>> -static unsigned int capacity_margin = 1280; /* ~20% margin */
>>>> -
>>>> static bool cpu_overutilized(int cpu)
>>>> {
>>>> return (capacity_of(cpu) * 1024) <
>>>> --
>>>> 1.9.1
>>>>
>>>
>>
>

2015-08-11 09:28:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 01/46] arm: Frequency invariant scheduler load-tracking support

On Tue, Jul 07, 2015 at 07:23:44PM +0100, Morten Rasmussen wrote:
> +static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
> +DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);

> + atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity);
> + unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
> + atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
> + unsigned long curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu));

The use of atomic_long_t here is entirely pointless.

In fact (and someone needs to go fix this), its worse than
WRITE_ONCE()/READ_ONCE().

2015-08-11 11:39:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 11/46] sched: Remove blocked load and utilization contributions of dying tasks

On Tue, Jul 07, 2015 at 07:23:54PM +0100, Morten Rasmussen wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 775b0c7..fa12ce5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3217,6 +3217,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> * Update run-time statistics of the 'current'.
> */
> update_curr(cfs_rq);
> + if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
> + flags &= !DEQUEUE_SLEEP;
> dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
>
> update_stats_dequeue(cfs_rq, se);

I know this is entirely redundant at this point (we took Yuyang's
patches), but this is the wrong way to go about doing this.

You add extra code the hot dequeue path for something that 'never'
happens. We have the sched_class::task_dead call for that.

2015-08-11 11:41:36

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

On 11 August 2015 at 11:08, Juri Lelli <[email protected]> wrote:
> On 10/08/15 16:07, Vincent Guittot wrote:
>> On 10 August 2015 at 15:43, Juri Lelli <[email protected]> wrote:
>>>
>>> Hi Vincent,
>>>
>>> On 04/08/15 14:41, Vincent Guittot wrote:
>>>> Hi Juri,
>>>>
>>>> On 7 July 2015 at 20:24, Morten Rasmussen <[email protected]> wrote:
>>>>> From: Juri Lelli <[email protected]>
>>>>>
>>>>> Each time a task is {en,de}queued we might need to adapt the current
>>>>> frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
>>>>> this purpose. Only trigger a freq request if we are effectively waking up
>>>>> or going to sleep. Filter out load balancing related calls to reduce the
>>>>> number of triggers.
>>>>>
>>>>> cc: Ingo Molnar <[email protected]>
>>>>> cc: Peter Zijlstra <[email protected]>
>>>>>
>>>>> Signed-off-by: Juri Lelli <[email protected]>
>>>>> ---
>>>>> kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>>>>> 1 file changed, 40 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index f74e9d2..b8627c6 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -4281,7 +4281,10 @@ static inline void hrtick_update(struct rq *rq)
>>>>> }
>>>>> #endif
>>>>>
>>>>> +static unsigned int capacity_margin = 1280; /* ~20% margin */
>>>>> +
>>>>> static bool cpu_overutilized(int cpu);
>>>>> +static unsigned long get_cpu_usage(int cpu);
>>>>> struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;
>>>>>
>>>>> /*
>>>>> @@ -4332,6 +4335,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>> if (!task_new && !rq->rd->overutilized &&
>>>>> cpu_overutilized(rq->cpu))
>>>>> rq->rd->overutilized = true;
>>>>> + /*
>>>>> + * We want to trigger a freq switch request only for tasks that
>>>>> + * are waking up; this is because we get here also during
>>>>> + * load balancing, but in these cases it seems wise to trigger
>>>>> + * as single request after load balancing is done.
>>>>> + *
>>>>> + * XXX: how about fork()? Do we need a special flag/something
>>>>> + * to tell if we are here after a fork() (wakeup_task_new)?
>>>>> + *
>>>>> + * Also, we add a margin (same ~20% used for the tipping point)
>>>>> + * to our request to provide some head room if p's utilization
>>>>> + * further increases.
>>>>> + */
>>>>> + if (sched_energy_freq() && !task_new) {
>>>>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>>>>> +
>>>>> + req_cap = req_cap * capacity_margin
>>>>> + >> SCHED_CAPACITY_SHIFT;
>>>>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>>>>> + }
>>>>> }
>>>>> hrtick_update(rq);
>>>>> }
>>>>> @@ -4393,6 +4416,23 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>> if (!se) {
>>>>> sub_nr_running(rq, 1);
>>>>> update_rq_runnable_avg(rq, 1);
>>>>> + /*
>>>>> + * We want to trigger a freq switch request only for tasks that
>>>>> + * are going to sleep; this is because we get here also during
>>>>> + * load balancing, but in these cases it seems wise to trigger
>>>>> + * as single request after load balancing is done.
>>>>> + *
>>>>> + * Also, we add a margin (same ~20% used for the tipping point)
>>>>> + * to our request to provide some head room if p's utilization
>>>>> + * further increases.
>>>>> + */
>>>>> + if (sched_energy_freq() && task_sleep) {
>>>>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>>>>> +
>>>>> + req_cap = req_cap * capacity_margin
>>>>> + >> SCHED_CAPACITY_SHIFT;
>>>>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>>>>
>>>> Could you clarify why you want to trig a freq switch for tasks that
>>>> are going to sleep ?
>>>> The cpu_usage should not changed that much as the se_utilization of
>>>> the entity moves from utilization_load_avg to utilization_blocked_avg
>>>> of the rq and the usage and the freq are updated periodically.
>>>
>>> I think we still need to cover multiple back-to-back dequeues. Suppose
>>> that you have, let's say, 3 tasks that get enqueued at the same time.
>>> After some time the first one goes to sleep and its utilization, as you
>>> say, gets moved to utilization_blocked_avg. So, nothing changes, and
>>> the trigger is superfluous (even if no freq change I guess will be
>>> issued as we are already servicing enough capacity). However, after a
>>> while, the second task goes to sleep. Now we still use get_cpu_usage()
>>> and the first task contribution in utilization_blocked_avg should have
>>> been decayed by this time. Same thing may than happen for the third task
>>> as well. So, if we don't check if we need to scale down in
>>> dequeue_task_fair, it seems to me that we might miss some opportunities,
>>> as blocked contribution of other tasks could have been successively
>>> decayed.
>>>
>>> What you think?
>>
>> The tick is used to monitor such variation of the usage (in both way,
>> decay of the usage of sleeping tasks and increase of the usage of
>> running tasks). So in your example, if the duration between the sleep
>> of the 2 tasks is significant enough, the tick will handle this
>> variation
>>
>
> The tick is used to decide if we need to scale up (to max OPP for the
> time being), but we don't scale down. It makes more logical sense to

why don't you want to check if you need to scale down ?

> scale down at task deactivation, or wakeup after a long time, IMHO.

But waking up or going to sleep don't have any impact on the usage of
a cpu. The only events that impact the cpu usage are:
-task migration,
-new task
-time that elapse which can be monitored by periodically checking the usage.
-and for nohz system when cpu enter or leave idle state

waking up and going to sleep events doesn't give any useful
information and using them to trig the monitoring of the usage
variation doesn't give you a predictable/periodic update of it whereas
the tick will

Regards,
Vincent

>
> Best,
>
> - Juri
>
>> Regards,
>> Vincent
>>>
>>> Thanks,
>>>
>>> - Juri
>>>
>>>> It should be the same for the wake up of a task in enqueue_task_fair
>>>> above, even if it's less obvious for this latter use case because the
>>>> cpu might wake up from a long idle phase during which its
>>>> utilization_blocked_avg has not been updated. Nevertheless, a trig of
>>>> the freq switch at wake up of the cpu once its usage has been updated
>>>> should do the job.
>>>>
>>>> So tick, migration of tasks, new tasks, entering/leaving idle state of
>>>> cpu should be enough to trig freq switch
>>>>
>>>> Regards,
>>>> Vincent
>>>>
>>>>
>>>>> + }
>>>>> }
>>>>> hrtick_update(rq);
>>>>> }
>>>>> @@ -4959,8 +4999,6 @@ static int find_new_capacity(struct energy_env *eenv,
>>>>> return idx;
>>>>> }
>>>>>
>>>>> -static unsigned int capacity_margin = 1280; /* ~20% margin */
>>>>> -
>>>>> static bool cpu_overutilized(int cpu)
>>>>> {
>>>>> return (capacity_of(cpu) * 1024) <
>>>>> --
>>>>> 1.9.1
>>>>>
>>>>
>>>
>>
>

2015-08-11 14:55:46

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 11/46] sched: Remove blocked load and utilization contributions of dying tasks

On Tue, Aug 11, 2015 at 01:39:27PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:23:54PM +0100, Morten Rasmussen wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 775b0c7..fa12ce5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3217,6 +3217,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > * Update run-time statistics of the 'current'.
> > */
> > update_curr(cfs_rq);
> > + if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
> > + flags &= !DEQUEUE_SLEEP;
> > dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
> >
> > update_stats_dequeue(cfs_rq, se);
>
> I know this is entirely redundant at this point (we took Yuyang's
> patches), but this is the wrong way to go about doing this.

Yes, I'm still working my way through Yuyang's changes.

> You add extra code the hot dequeue path for something that 'never'
> happens. We have the sched_class::task_dead call for that.

I don't mind using sched_class::task_dead() instead. The reason why I
didn't go that way is that we have to retake the rq->lock or mess with
cfs_rq::removed_load instead of just not adding the utilization in
the first place when we have the rq->lock.

Anyway, it is probably redundant by now. I will check Yuyang's code to
see if he already fixed this problem.

2015-08-11 15:07:14

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

Hi Vincent,

On 11/08/15 12:41, Vincent Guittot wrote:
> On 11 August 2015 at 11:08, Juri Lelli <[email protected]> wrote:
>> On 10/08/15 16:07, Vincent Guittot wrote:
>>> On 10 August 2015 at 15:43, Juri Lelli <[email protected]> wrote:
>>>>
>>>> Hi Vincent,
>>>>
>>>> On 04/08/15 14:41, Vincent Guittot wrote:
>>>>> Hi Juri,
>>>>>
>>>>> On 7 July 2015 at 20:24, Morten Rasmussen <[email protected]> wrote:
>>>>>> From: Juri Lelli <[email protected]>
>>>>>>
>>>>>> Each time a task is {en,de}queued we might need to adapt the current
>>>>>> frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
>>>>>> this purpose. Only trigger a freq request if we are effectively waking up
>>>>>> or going to sleep. Filter out load balancing related calls to reduce the
>>>>>> number of triggers.
>>>>>>
>>>>>> cc: Ingo Molnar <[email protected]>
>>>>>> cc: Peter Zijlstra <[email protected]>
>>>>>>
>>>>>> Signed-off-by: Juri Lelli <[email protected]>
>>>>>> ---
>>>>>> kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>>>>>> 1 file changed, 40 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index f74e9d2..b8627c6 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -4281,7 +4281,10 @@ static inline void hrtick_update(struct rq *rq)
>>>>>> }
>>>>>> #endif
>>>>>>
>>>>>> +static unsigned int capacity_margin = 1280; /* ~20% margin */
>>>>>> +
>>>>>> static bool cpu_overutilized(int cpu);
>>>>>> +static unsigned long get_cpu_usage(int cpu);
>>>>>> struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;
>>>>>>
>>>>>> /*
>>>>>> @@ -4332,6 +4335,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>>> if (!task_new && !rq->rd->overutilized &&
>>>>>> cpu_overutilized(rq->cpu))
>>>>>> rq->rd->overutilized = true;
>>>>>> + /*
>>>>>> + * We want to trigger a freq switch request only for tasks that
>>>>>> + * are waking up; this is because we get here also during
>>>>>> + * load balancing, but in these cases it seems wise to trigger
>>>>>> + * as single request after load balancing is done.
>>>>>> + *
>>>>>> + * XXX: how about fork()? Do we need a special flag/something
>>>>>> + * to tell if we are here after a fork() (wakeup_task_new)?
>>>>>> + *
>>>>>> + * Also, we add a margin (same ~20% used for the tipping point)
>>>>>> + * to our request to provide some head room if p's utilization
>>>>>> + * further increases.
>>>>>> + */
>>>>>> + if (sched_energy_freq() && !task_new) {
>>>>>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>>>>>> +
>>>>>> + req_cap = req_cap * capacity_margin
>>>>>> + >> SCHED_CAPACITY_SHIFT;
>>>>>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>>>>>> + }
>>>>>> }
>>>>>> hrtick_update(rq);
>>>>>> }
>>>>>> @@ -4393,6 +4416,23 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>>> if (!se) {
>>>>>> sub_nr_running(rq, 1);
>>>>>> update_rq_runnable_avg(rq, 1);
>>>>>> + /*
>>>>>> + * We want to trigger a freq switch request only for tasks that
>>>>>> + * are going to sleep; this is because we get here also during
>>>>>> + * load balancing, but in these cases it seems wise to trigger
>>>>>> + * as single request after load balancing is done.
>>>>>> + *
>>>>>> + * Also, we add a margin (same ~20% used for the tipping point)
>>>>>> + * to our request to provide some head room if p's utilization
>>>>>> + * further increases.
>>>>>> + */
>>>>>> + if (sched_energy_freq() && task_sleep) {
>>>>>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>>>>>> +
>>>>>> + req_cap = req_cap * capacity_margin
>>>>>> + >> SCHED_CAPACITY_SHIFT;
>>>>>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>>>>>
>>>>> Could you clarify why you want to trig a freq switch for tasks that
>>>>> are going to sleep ?
>>>>> The cpu_usage should not changed that much as the se_utilization of
>>>>> the entity moves from utilization_load_avg to utilization_blocked_avg
>>>>> of the rq and the usage and the freq are updated periodically.
>>>>
>>>> I think we still need to cover multiple back-to-back dequeues. Suppose
>>>> that you have, let's say, 3 tasks that get enqueued at the same time.
>>>> After some time the first one goes to sleep and its utilization, as you
>>>> say, gets moved to utilization_blocked_avg. So, nothing changes, and
>>>> the trigger is superfluous (even if no freq change I guess will be
>>>> issued as we are already servicing enough capacity). However, after a
>>>> while, the second task goes to sleep. Now we still use get_cpu_usage()
>>>> and the first task contribution in utilization_blocked_avg should have
>>>> been decayed by this time. Same thing may than happen for the third task
>>>> as well. So, if we don't check if we need to scale down in
>>>> dequeue_task_fair, it seems to me that we might miss some opportunities,
>>>> as blocked contribution of other tasks could have been successively
>>>> decayed.
>>>>
>>>> What you think?
>>>
>>> The tick is used to monitor such variation of the usage (in both way,
>>> decay of the usage of sleeping tasks and increase of the usage of
>>> running tasks). So in your example, if the duration between the sleep
>>> of the 2 tasks is significant enough, the tick will handle this
>>> variation
>>>
>>
>> The tick is used to decide if we need to scale up (to max OPP for the
>> time being), but we don't scale down. It makes more logical sense to
>
> why don't you want to check if you need to scale down ?
>

Well, because if I'm still executing something the cpu usage is only
subject to raise.

>> scale down at task deactivation, or wakeup after a long time, IMHO.
>
> But waking up or going to sleep don't have any impact on the usage of
> a cpu. The only events that impact the cpu usage are:
> -task migration,

We explicitly cover this on load balancing paths.

> -new task

We cover this in enqueue_task_fair(), introducing a new flag.

> -time that elapse which can be monitored by periodically checking the usage.

Do you mean when a task utilization crosses some threshold
related to the current OPP? If that is the case, we have a
check in task_tick_fair().

> -and for nohz system when cpu enter or leave idle state
>

We address this in dequeue_task_fair(). In particular, if
the cpu is going to be idle we don't trigger any change as
it seems not always wise to wake up a thread to just change
the OPP and the go idle; some platforms might require this
behaviour anyway, but it probably more cpuidle/fw related?

I would also add:

- task is going to die

We address this in dequeue as well, as its contribution is
removed from usage (mod Yuyang's patches).

> waking up and going to sleep events doesn't give any useful
> information and using them to trig the monitoring of the usage
> variation doesn't give you a predictable/periodic update of it whereas
> the tick will
>

So, one key point of this solution is to get away as much
as we can from periodic updates/sampling and move towards a
(fully) event driven approach. The event logically associated
to task_tick_fair() is when we realize that a task is going
to saturate the current capacity; in this case we trigger a
freq switch to an higher capacity. Also, if we never react
to normal wakeups (as I understand you are proposing) we might
miss some chances to adapt quickly enough. As an example, if
you have a big task that suddenly goes to sleep, and sleeps
until its decayed utilization goes almost to zero; when it
wakes up, if we don't have a trigger in enqueue_task_fair(),
we'll have to wait until the next tick to select an appropriate
(low) OPP.

Best,

- Juri

>>
>> Best,
>>
>> - Juri
>>
>>> Regards,
>>> Vincent
>>>>
>>>> Thanks,
>>>>
>>>> - Juri
>>>>
>>>>> It should be the same for the wake up of a task in enqueue_task_fair
>>>>> above, even if it's less obvious for this latter use case because the
>>>>> cpu might wake up from a long idle phase during which its
>>>>> utilization_blocked_avg has not been updated. Nevertheless, a trig of
>>>>> the freq switch at wake up of the cpu once its usage has been updated
>>>>> should do the job.
>>>>>
>>>>> So tick, migration of tasks, new tasks, entering/leaving idle state of
>>>>> cpu should be enough to trig freq switch
>>>>>
>>>>> Regards,
>>>>> Vincent
>>>>>
>>>>>
>>>>>> + }
>>>>>> }
>>>>>> hrtick_update(rq);
>>>>>> }
>>>>>> @@ -4959,8 +4999,6 @@ static int find_new_capacity(struct energy_env *eenv,
>>>>>> return idx;
>>>>>> }
>>>>>>
>>>>>> -static unsigned int capacity_margin = 1280; /* ~20% margin */
>>>>>> -
>>>>>> static bool cpu_overutilized(int cpu)
>>>>>> {
>>>>>> return (capacity_of(cpu) * 1024) <
>>>>>> --
>>>>>> 1.9.1
>>>>>>
>>>>>
>>>>
>>>
>>
>

2015-08-11 16:37:36

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

On 11 August 2015 at 17:07, Juri Lelli <[email protected]> wrote:
> Hi Vincent,
>
> On 11/08/15 12:41, Vincent Guittot wrote:
>> On 11 August 2015 at 11:08, Juri Lelli <[email protected]> wrote:
>>> On 10/08/15 16:07, Vincent Guittot wrote:
>>>> On 10 August 2015 at 15:43, Juri Lelli <[email protected]> wrote:
>>>>>
>>>>> Hi Vincent,
>>>>>
>>>>> On 04/08/15 14:41, Vincent Guittot wrote:
>>>>>> Hi Juri,
>>>>>>
>>>>>> On 7 July 2015 at 20:24, Morten Rasmussen <[email protected]> wrote:
>>>>>>> From: Juri Lelli <[email protected]>
>>>>>>>
>>>>>>> Each time a task is {en,de}queued we might need to adapt the current
>>>>>>> frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
>>>>>>> this purpose. Only trigger a freq request if we are effectively waking up
>>>>>>> or going to sleep. Filter out load balancing related calls to reduce the
>>>>>>> number of triggers.
>>>>>>>
>>>>>>> cc: Ingo Molnar <[email protected]>
>>>>>>> cc: Peter Zijlstra <[email protected]>
>>>>>>>
>>>>>>> Signed-off-by: Juri Lelli <[email protected]>
>>>>>>> ---
>>>>>>> kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>>>>>>> 1 file changed, 40 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>> index f74e9d2..b8627c6 100644
>>>>>>> --- a/kernel/sched/fair.c
>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>> @@ -4281,7 +4281,10 @@ static inline void hrtick_update(struct rq *rq)
>>>>>>> }
>>>>>>> #endif
>>>>>>>
>>>>>>> +static unsigned int capacity_margin = 1280; /* ~20% margin */
>>>>>>> +
>>>>>>> static bool cpu_overutilized(int cpu);
>>>>>>> +static unsigned long get_cpu_usage(int cpu);
>>>>>>> struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;
>>>>>>>
>>>>>>> /*
>>>>>>> @@ -4332,6 +4335,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>>>> if (!task_new && !rq->rd->overutilized &&
>>>>>>> cpu_overutilized(rq->cpu))
>>>>>>> rq->rd->overutilized = true;
>>>>>>> + /*
>>>>>>> + * We want to trigger a freq switch request only for tasks that
>>>>>>> + * are waking up; this is because we get here also during
>>>>>>> + * load balancing, but in these cases it seems wise to trigger
>>>>>>> + * as single request after load balancing is done.
>>>>>>> + *
>>>>>>> + * XXX: how about fork()? Do we need a special flag/something
>>>>>>> + * to tell if we are here after a fork() (wakeup_task_new)?
>>>>>>> + *
>>>>>>> + * Also, we add a margin (same ~20% used for the tipping point)
>>>>>>> + * to our request to provide some head room if p's utilization
>>>>>>> + * further increases.
>>>>>>> + */
>>>>>>> + if (sched_energy_freq() && !task_new) {
>>>>>>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>>>>>>> +
>>>>>>> + req_cap = req_cap * capacity_margin
>>>>>>> + >> SCHED_CAPACITY_SHIFT;
>>>>>>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>>>>>>> + }
>>>>>>> }
>>>>>>> hrtick_update(rq);
>>>>>>> }
>>>>>>> @@ -4393,6 +4416,23 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>>>> if (!se) {
>>>>>>> sub_nr_running(rq, 1);
>>>>>>> update_rq_runnable_avg(rq, 1);
>>>>>>> + /*
>>>>>>> + * We want to trigger a freq switch request only for tasks that
>>>>>>> + * are going to sleep; this is because we get here also during
>>>>>>> + * load balancing, but in these cases it seems wise to trigger
>>>>>>> + * as single request after load balancing is done.
>>>>>>> + *
>>>>>>> + * Also, we add a margin (same ~20% used for the tipping point)
>>>>>>> + * to our request to provide some head room if p's utilization
>>>>>>> + * further increases.
>>>>>>> + */
>>>>>>> + if (sched_energy_freq() && task_sleep) {
>>>>>>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>>>>>>> +
>>>>>>> + req_cap = req_cap * capacity_margin
>>>>>>> + >> SCHED_CAPACITY_SHIFT;
>>>>>>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>>>>>>
>>>>>> Could you clarify why you want to trig a freq switch for tasks that
>>>>>> are going to sleep ?
>>>>>> The cpu_usage should not changed that much as the se_utilization of
>>>>>> the entity moves from utilization_load_avg to utilization_blocked_avg
>>>>>> of the rq and the usage and the freq are updated periodically.
>>>>>
>>>>> I think we still need to cover multiple back-to-back dequeues. Suppose
>>>>> that you have, let's say, 3 tasks that get enqueued at the same time.
>>>>> After some time the first one goes to sleep and its utilization, as you
>>>>> say, gets moved to utilization_blocked_avg. So, nothing changes, and
>>>>> the trigger is superfluous (even if no freq change I guess will be
>>>>> issued as we are already servicing enough capacity). However, after a
>>>>> while, the second task goes to sleep. Now we still use get_cpu_usage()
>>>>> and the first task contribution in utilization_blocked_avg should have
>>>>> been decayed by this time. Same thing may than happen for the third task
>>>>> as well. So, if we don't check if we need to scale down in
>>>>> dequeue_task_fair, it seems to me that we might miss some opportunities,
>>>>> as blocked contribution of other tasks could have been successively
>>>>> decayed.
>>>>>
>>>>> What you think?
>>>>
>>>> The tick is used to monitor such variation of the usage (in both way,
>>>> decay of the usage of sleeping tasks and increase of the usage of
>>>> running tasks). So in your example, if the duration between the sleep
>>>> of the 2 tasks is significant enough, the tick will handle this
>>>> variation
>>>>
>>>
>>> The tick is used to decide if we need to scale up (to max OPP for the
>>> time being), but we don't scale down. It makes more logical sense to
>>
>> why don't you want to check if you need to scale down ?
>>
>
> Well, because if I'm still executing something the cpu usage is only
> subject to raise.

This is only true for system with NO HZ idle

>
>>> scale down at task deactivation, or wakeup after a long time, IMHO.
>>
>> But waking up or going to sleep don't have any impact on the usage of
>> a cpu. The only events that impact the cpu usage are:
>> -task migration,
>
> We explicitly cover this on load balancing paths.
>
>> -new task
>
> We cover this in enqueue_task_fair(), introducing a new flag.
>
>> -time that elapse which can be monitored by periodically checking the usage.
>
> Do you mean when a task utilization crosses some threshold
> related to the current OPP? If that is the case, we have a
> check in task_tick_fair().
>
>> -and for nohz system when cpu enter or leave idle state
>>
>
> We address this in dequeue_task_fair(). In particular, if
> the cpu is going to be idle we don't trigger any change as
> it seems not always wise to wake up a thread to just change
> the OPP and the go idle; some platforms might require this
> behaviour anyway, but it probably more cpuidle/fw related?

I would say that it's interesting to notifiy sched-dvfs that a cpu
becomes idle because we could decrease the opp of a cluster of cpus
that share the same clock if this cpu is the one that requires the max
capacity of the cluster (and other cpus are still running).

>
> I would also add:
>
> - task is going to die
>
> We address this in dequeue as well, as its contribution is
> removed from usage (mod Yuyang's patches).
>
>> waking up and going to sleep events doesn't give any useful
>> information and using them to trig the monitoring of the usage
>> variation doesn't give you a predictable/periodic update of it whereas
>> the tick will
>>
>
> So, one key point of this solution is to get away as much
> as we can from periodic updates/sampling and move towards a
> (fully) event driven approach. The event logically associated
> to task_tick_fair() is when we realize that a task is going
> to saturate the current capacity; in this case we trigger a
> freq switch to an higher capacity. Also, if we never react
> to normal wakeups (as I understand you are proposing) we might
> miss some chances to adapt quickly enough. As an example, if
> you have a big task that suddenly goes to sleep, and sleeps
> until its decayed utilization goes almost to zero; when it
> wakes up, if we don't have a trigger in enqueue_task_fair(),
> we'll have to wait until the next tick to select an appropriate
> (low) OPP.

I assume that the cpu is idle in this case. This situation only
happens on Nohz idle system because tick is disable and you have to
update statistics when leaving idle as it is done for the jiffies or
the cpu_load array. So you should track cpu enter/leave idle (for nohz
system only) instead of tracking all tasks wake up/sleep events.

So you can either use update_cpu_load_nohz like it is already done for
cpu_load array
or you should use some conditions like below if you want to stay in
enqueue/dequeue_task_fair but task wake up or sleep event are not the
right condition
if (!(flags & ENQUEUE_WAKEUP) || rq->nr_running == 1 ) in enqueue_task_fair
and
if (!task_sleep || rq->nr_running == 0) in dequeue_task_fair

We can probably optimized by using rq->cfs.h_nr_running instead of
rq->nr_running as only cfs tasks really modifies the usage

Regards,
Vincent

>
> Best,
>
> - Juri
>
>>>
>>> Best,
>>>
>>> - Juri
>>>
>>>> Regards,
>>>> Vincent
>>>>>
>>>>> Thanks,
>>>>>
>>>>> - Juri
>>>>>
>>>>>> It should be the same for the wake up of a task in enqueue_task_fair
>>>>>> above, even if it's less obvious for this latter use case because the
>>>>>> cpu might wake up from a long idle phase during which its
>>>>>> utilization_blocked_avg has not been updated. Nevertheless, a trig of
>>>>>> the freq switch at wake up of the cpu once its usage has been updated
>>>>>> should do the job.
>>>>>>
>>>>>> So tick, migration of tasks, new tasks, entering/leaving idle state of
>>>>>> cpu should be enough to trig freq switch
>>>>>>
>>>>>> Regards,
>>>>>> Vincent
>>>>>>
>>>>>>
>>>>>>> + }
>>>>>>> }
>>>>>>> hrtick_update(rq);
>>>>>>> }
>>>>>>> @@ -4959,8 +4999,6 @@ static int find_new_capacity(struct energy_env *eenv,
>>>>>>> return idx;
>>>>>>> }
>>>>>>>
>>>>>>> -static unsigned int capacity_margin = 1280; /* ~20% margin */
>>>>>>> -
>>>>>>> static bool cpu_overutilized(int cpu)
>>>>>>> {
>>>>>>> return (capacity_of(cpu) * 1024) <
>>>>>>> --
>>>>>>> 1.9.1
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

2015-08-11 17:23:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 11/46] sched: Remove blocked load and utilization contributions of dying tasks

On Tue, Aug 11, 2015 at 03:58:48PM +0100, Morten Rasmussen wrote:
> On Tue, Aug 11, 2015 at 01:39:27PM +0200, Peter Zijlstra wrote:

> > You add extra code the hot dequeue path for something that 'never'
> > happens. We have the sched_class::task_dead call for that.
>
> I don't mind using sched_class::task_dead() instead. The reason why I
> didn't go that way is that we have to retake the rq->lock or mess with
> cfs_rq::removed_load instead of just not adding the utilization in
> the first place when we have the rq->lock.
>
> Anyway, it is probably redundant by now. I will check Yuyang's code to
> see if he already fixed this problem.

He did, he used the removed_load stuff, same as migration does.

2015-08-12 09:04:58

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 11/46] sched: Remove blocked load and utilization contributions of dying tasks

On Tue, Aug 11, 2015 at 07:23:44PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 11, 2015 at 03:58:48PM +0100, Morten Rasmussen wrote:
> > On Tue, Aug 11, 2015 at 01:39:27PM +0200, Peter Zijlstra wrote:
>
> > > You add extra code the hot dequeue path for something that 'never'
> > > happens. We have the sched_class::task_dead call for that.
> >
> > I don't mind using sched_class::task_dead() instead. The reason why I
> > didn't go that way is that we have to retake the rq->lock or mess with
> > cfs_rq::removed_load instead of just not adding the utilization in
> > the first place when we have the rq->lock.
> >
> > Anyway, it is probably redundant by now. I will check Yuyang's code to
> > see if he already fixed this problem.
>
> He did, he used the removed_load stuff, same as migration does.

Nice. One less patch to worry about :)

2015-08-12 10:04:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 16/46] sched: Allocate and initialize energy data structures

On Tue, Jul 07, 2015 at 07:23:59PM +0100, Morten Rasmussen wrote:
> +
> + sge->nr_idle_states = fn(cpu)->nr_idle_states;
> + sge->nr_cap_states = fn(cpu)->nr_cap_states;
> + memcpy(sge->idle_states, fn(cpu)->idle_states,
> + sge->nr_idle_states*sizeof(struct idle_state));
> + memcpy(sge->cap_states, fn(cpu)->cap_states,
> + sge->nr_cap_states*sizeof(struct capacity_state));

> + if (fn && fn(j)) {
> + nr_idle_states = fn(j)->nr_idle_states;
> + nr_cap_states = fn(j)->nr_cap_states;
> + BUG_ON(!nr_idle_states || !nr_cap_states);
> + }

> + for_each_cpu(i, &mask) {
> + int y;
> +
> + BUG_ON(fn(i)->nr_idle_states != fn(cpu)->nr_idle_states);
> +
> + for (y = 0; y < (fn(i)->nr_idle_states); y++) {
> + BUG_ON(fn(i)->idle_states[y].power !=
> + fn(cpu)->idle_states[y].power);
> + }
> +
> + BUG_ON(fn(i)->nr_cap_states != fn(cpu)->nr_cap_states);
> +
> + for (y = 0; y < (fn(i)->nr_cap_states); y++) {
> + BUG_ON(fn(i)->cap_states[y].cap !=
> + fn(cpu)->cap_states[y].cap);
> + BUG_ON(fn(i)->cap_states[y].power !=
> + fn(cpu)->cap_states[y].power);
> + }
> + }
> +}

Might it not make more sense to have:

const struct blah *const blah = fn();

and use blah afterwards, instead of the repeated invocation of fn()?

2015-08-12 10:17:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 16/46] sched: Allocate and initialize energy data structures

On Tue, Jul 07, 2015 at 07:23:59PM +0100, Morten Rasmussen wrote:
> @@ -6647,10 +6703,24 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
> if (!sdd->sgc)
> return -ENOMEM;
>
> + sdd->sge = alloc_percpu(struct sched_group_energy *);
> + if (!sdd->sge)
> + return -ENOMEM;
> +
> for_each_cpu(j, cpu_map) {
> struct sched_domain *sd;
> struct sched_group *sg;
> struct sched_group_capacity *sgc;
> + struct sched_group_energy *sge;
> + sched_domain_energy_f fn = tl->energy;
> + unsigned int nr_idle_states = 0;
> + unsigned int nr_cap_states = 0;
> +
> + if (fn && fn(j)) {
> + nr_idle_states = fn(j)->nr_idle_states;
> + nr_cap_states = fn(j)->nr_cap_states;
> + BUG_ON(!nr_idle_states || !nr_cap_states);
> + }
>
> sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
> GFP_KERNEL, cpu_to_node(j));
> @@ -6674,6 +6744,16 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
> return -ENOMEM;
>
> *per_cpu_ptr(sdd->sgc, j) = sgc;
> +
> + sge = kzalloc_node(sizeof(struct sched_group_energy) +
> + nr_idle_states*sizeof(struct idle_state) +
> + nr_cap_states*sizeof(struct capacity_state),
> + GFP_KERNEL, cpu_to_node(j));
> +
> + if (!sge)
> + return -ENOMEM;
> +
> + *per_cpu_ptr(sdd->sge, j) = sge;
> }
> }
>

One more question, if fn() returns a full structure, why are we
allocating and copying the thing? Its all const read only data, right?

2015-08-12 10:33:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 18/46] arm: topology: Define TC2 energy and provide it to the scheduler

On Tue, Jul 07, 2015 at 07:24:01PM +0100, Morten Rasmussen wrote:
> +static struct capacity_state cap_states_cluster_a7[] = {
> + /* Cluster only power */
> + { .cap = 150, .power = 2967, }, /* 350 MHz */
> + { .cap = 172, .power = 2792, }, /* 400 MHz */
> + { .cap = 215, .power = 2810, }, /* 500 MHz */
> + { .cap = 258, .power = 2815, }, /* 600 MHz */
> + { .cap = 301, .power = 2919, }, /* 700 MHz */
> + { .cap = 344, .power = 2847, }, /* 800 MHz */
> + { .cap = 387, .power = 3917, }, /* 900 MHz */
> + { .cap = 430, .power = 4905, }, /* 1000 MHz */
> + };

So can I suggest a SCHED_DEBUG validation of the data provided?

Given the above table, it _never_ makes sense to run at .cap=150, it
equally also doesn't make sense to run at .cap = 301.

So please add a SCHED_DEBUG test on domain creation that validates that
not only is the .cap monotonically increasing, but the .power is too.

2015-08-12 10:59:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 20/46] sched: Relocated get_cpu_usage() and change return type

On Tue, Jul 07, 2015 at 07:24:03PM +0100, Morten Rasmussen wrote:
> +static unsigned long get_cpu_usage(int cpu)
> +{
> + int sum;
> + unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
> + unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
> + unsigned long capacity_orig = capacity_orig_of(cpu);
> +
> + sum = usage + blocked;
> +
> + if (sum >= capacity_orig)
> + return capacity_orig;
> +
> + return sum;
> +}

Should sum then not also be unsigned long?

2015-08-12 14:36:53

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 20/46] sched: Relocated get_cpu_usage() and change return type

On Wed, Aug 12, 2015 at 12:59:23PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:24:03PM +0100, Morten Rasmussen wrote:
> > +static unsigned long get_cpu_usage(int cpu)
> > +{
> > + int sum;
> > + unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
> > + unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
> > + unsigned long capacity_orig = capacity_orig_of(cpu);
> > +
> > + sum = usage + blocked;
> > +
> > + if (sum >= capacity_orig)
> > + return capacity_orig;
> > +
> > + return sum;
> > +}
>
> Should sum then not also be unsigned long?

Yes it should. So much for my attempt at fixing the types :(

2015-08-12 15:14:54

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

On 11/08/15 17:37, Vincent Guittot wrote:
> On 11 August 2015 at 17:07, Juri Lelli <[email protected]> wrote:
>> Hi Vincent,
>>
>> On 11/08/15 12:41, Vincent Guittot wrote:
>>> On 11 August 2015 at 11:08, Juri Lelli <[email protected]> wrote:
>>>> On 10/08/15 16:07, Vincent Guittot wrote:
>>>>> On 10 August 2015 at 15:43, Juri Lelli <[email protected]> wrote:
>>>>>>
>>>>>> Hi Vincent,
>>>>>>
>>>>>> On 04/08/15 14:41, Vincent Guittot wrote:
>>>>>>> Hi Juri,
>>>>>>>
>>>>>>> On 7 July 2015 at 20:24, Morten Rasmussen <[email protected]> wrote:
>>>>>>>> From: Juri Lelli <[email protected]>
>>>>>>>>
>>>>>>>> Each time a task is {en,de}queued we might need to adapt the current
>>>>>>>> frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
>>>>>>>> this purpose. Only trigger a freq request if we are effectively waking up
>>>>>>>> or going to sleep. Filter out load balancing related calls to reduce the
>>>>>>>> number of triggers.
>>>>>>>>
>>>>>>>> cc: Ingo Molnar <[email protected]>
>>>>>>>> cc: Peter Zijlstra <[email protected]>
>>>>>>>>
>>>>>>>> Signed-off-by: Juri Lelli <[email protected]>
>>>>>>>> ---
>>>>>>>> kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>>>>>>>> 1 file changed, 40 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>>> index f74e9d2..b8627c6 100644
>>>>>>>> --- a/kernel/sched/fair.c
>>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>>> @@ -4281,7 +4281,10 @@ static inline void hrtick_update(struct rq *rq)
>>>>>>>> }
>>>>>>>> #endif
>>>>>>>>
>>>>>>>> +static unsigned int capacity_margin = 1280; /* ~20% margin */
>>>>>>>> +
>>>>>>>> static bool cpu_overutilized(int cpu);
>>>>>>>> +static unsigned long get_cpu_usage(int cpu);
>>>>>>>> struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;
>>>>>>>>
>>>>>>>> /*
>>>>>>>> @@ -4332,6 +4335,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>>>>> if (!task_new && !rq->rd->overutilized &&
>>>>>>>> cpu_overutilized(rq->cpu))
>>>>>>>> rq->rd->overutilized = true;
>>>>>>>> + /*
>>>>>>>> + * We want to trigger a freq switch request only for tasks that
>>>>>>>> + * are waking up; this is because we get here also during
>>>>>>>> + * load balancing, but in these cases it seems wise to trigger
>>>>>>>> + * as single request after load balancing is done.
>>>>>>>> + *
>>>>>>>> + * XXX: how about fork()? Do we need a special flag/something
>>>>>>>> + * to tell if we are here after a fork() (wakeup_task_new)?
>>>>>>>> + *
>>>>>>>> + * Also, we add a margin (same ~20% used for the tipping point)
>>>>>>>> + * to our request to provide some head room if p's utilization
>>>>>>>> + * further increases.
>>>>>>>> + */
>>>>>>>> + if (sched_energy_freq() && !task_new) {
>>>>>>>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>>>>>>>> +
>>>>>>>> + req_cap = req_cap * capacity_margin
>>>>>>>> + >> SCHED_CAPACITY_SHIFT;
>>>>>>>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>>>>>>>> + }
>>>>>>>> }
>>>>>>>> hrtick_update(rq);
>>>>>>>> }
>>>>>>>> @@ -4393,6 +4416,23 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>>>>> if (!se) {
>>>>>>>> sub_nr_running(rq, 1);
>>>>>>>> update_rq_runnable_avg(rq, 1);
>>>>>>>> + /*
>>>>>>>> + * We want to trigger a freq switch request only for tasks that
>>>>>>>> + * are going to sleep; this is because we get here also during
>>>>>>>> + * load balancing, but in these cases it seems wise to trigger
>>>>>>>> + * as single request after load balancing is done.
>>>>>>>> + *
>>>>>>>> + * Also, we add a margin (same ~20% used for the tipping point)
>>>>>>>> + * to our request to provide some head room if p's utilization
>>>>>>>> + * further increases.
>>>>>>>> + */
>>>>>>>> + if (sched_energy_freq() && task_sleep) {
>>>>>>>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>>>>>>>> +
>>>>>>>> + req_cap = req_cap * capacity_margin
>>>>>>>> + >> SCHED_CAPACITY_SHIFT;
>>>>>>>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>>>>>>>
>>>>>>> Could you clarify why you want to trig a freq switch for tasks that
>>>>>>> are going to sleep ?
>>>>>>> The cpu_usage should not changed that much as the se_utilization of
>>>>>>> the entity moves from utilization_load_avg to utilization_blocked_avg
>>>>>>> of the rq and the usage and the freq are updated periodically.
>>>>>>
>>>>>> I think we still need to cover multiple back-to-back dequeues. Suppose
>>>>>> that you have, let's say, 3 tasks that get enqueued at the same time.
>>>>>> After some time the first one goes to sleep and its utilization, as you
>>>>>> say, gets moved to utilization_blocked_avg. So, nothing changes, and
>>>>>> the trigger is superfluous (even if no freq change I guess will be
>>>>>> issued as we are already servicing enough capacity). However, after a
>>>>>> while, the second task goes to sleep. Now we still use get_cpu_usage()
>>>>>> and the first task contribution in utilization_blocked_avg should have
>>>>>> been decayed by this time. Same thing may than happen for the third task
>>>>>> as well. So, if we don't check if we need to scale down in
>>>>>> dequeue_task_fair, it seems to me that we might miss some opportunities,
>>>>>> as blocked contribution of other tasks could have been successively
>>>>>> decayed.
>>>>>>
>>>>>> What you think?
>>>>>
>>>>> The tick is used to monitor such variation of the usage (in both way,
>>>>> decay of the usage of sleeping tasks and increase of the usage of
>>>>> running tasks). So in your example, if the duration between the sleep
>>>>> of the 2 tasks is significant enough, the tick will handle this
>>>>> variation
>>>>>
>>>>
>>>> The tick is used to decide if we need to scale up (to max OPP for the
>>>> time being), but we don't scale down. It makes more logical sense to
>>>
>>> why don't you want to check if you need to scale down ?
>>>
>>
>> Well, because if I'm still executing something the cpu usage is only
>> subject to raise.
>
> This is only true for system with NO HZ idle
>

Well, even with !NO_HZ_IDLE usage only decreases when cpu is idle. But,
I think I got your point; for !NO_HZ_IDLE configurations we might end
up not scaling down frequency even if we have the tick running and
the cpu is idle. I might need some more time to think this through, but
it seems to me that we are still fine without an explicit trigger in
task_tick_fair(); if we are running a !NO_HZ_IDLE system we are probably
not so much concerned about power savings and still we react
to tasks waking up, sleeping, leaving or moving around (which seems the
real important events to me); OTOH, we might add that trigger, but this
will generate unconditional checks at tick time for NO_HZ_IDLE
configurations, for a benefit that it seems to be still not completely
clear.

>>
>>>> scale down at task deactivation, or wakeup after a long time, IMHO.
>>>
>>> But waking up or going to sleep don't have any impact on the usage of
>>> a cpu. The only events that impact the cpu usage are:
>>> -task migration,
>>
>> We explicitly cover this on load balancing paths.
>>
>>> -new task
>>
>> We cover this in enqueue_task_fair(), introducing a new flag.
>>
>>> -time that elapse which can be monitored by periodically checking the usage.
>>
>> Do you mean when a task utilization crosses some threshold
>> related to the current OPP? If that is the case, we have a
>> check in task_tick_fair().
>>
>>> -and for nohz system when cpu enter or leave idle state
>>>
>>
>> We address this in dequeue_task_fair(). In particular, if
>> the cpu is going to be idle we don't trigger any change as
>> it seems not always wise to wake up a thread to just change
>> the OPP and the go idle; some platforms might require this
>> behaviour anyway, but it probably more cpuidle/fw related?
>
> I would say that it's interesting to notifiy sched-dvfs that a cpu
> becomes idle because we could decrease the opp of a cluster of cpus
> that share the same clock if this cpu is the one that requires the max
> capacity of the cluster (and other cpus are still running).
>

Well, we reset the capacity request of the cpu that is going idle.
The idea is that the next event on one of the other related cpus
will update the cluster freq correctly. If any other cpu in the
cluster is running something we keep the same frequency until
the task running on that cpu goes to sleep; this seems fine to
me because that task might end up being heavy and we saved a
back to back lower to higher OPP switch; if the task is instead
light it will probably be dequeued pretty soon, and at that time
we switch to a lower OPP (since we cleared the idle cpu request
before). Also, if the other cpus in the cluster are all idle
we'll most probably enter an idle state, so no freq switch is
most likely required.

>>
>> I would also add:
>>
>> - task is going to die
>>
>> We address this in dequeue as well, as its contribution is
>> removed from usage (mod Yuyang's patches).
>>
>>> waking up and going to sleep events doesn't give any useful
>>> information and using them to trig the monitoring of the usage
>>> variation doesn't give you a predictable/periodic update of it whereas
>>> the tick will
>>>
>>
>> So, one key point of this solution is to get away as much
>> as we can from periodic updates/sampling and move towards a
>> (fully) event driven approach. The event logically associated
>> to task_tick_fair() is when we realize that a task is going
>> to saturate the current capacity; in this case we trigger a
>> freq switch to an higher capacity. Also, if we never react
>> to normal wakeups (as I understand you are proposing) we might
>> miss some chances to adapt quickly enough. As an example, if
>> you have a big task that suddenly goes to sleep, and sleeps
>> until its decayed utilization goes almost to zero; when it
>> wakes up, if we don't have a trigger in enqueue_task_fair(),
>> we'll have to wait until the next tick to select an appropriate
>> (low) OPP.
>
> I assume that the cpu is idle in this case. This situation only
> happens on Nohz idle system because tick is disable and you have to
> update statistics when leaving idle as it is done for the jiffies or
> the cpu_load array. So you should track cpu enter/leave idle (for nohz
> system only) instead of tracking all tasks wake up/sleep events.
>

I think I already replied to this in what above. Did I? :)

> So you can either use update_cpu_load_nohz like it is already done for
> cpu_load array
> or you should use some conditions like below if you want to stay in
> enqueue/dequeue_task_fair but task wake up or sleep event are not the
> right condition
> if (!(flags & ENQUEUE_WAKEUP) || rq->nr_running == 1 ) in enqueue_task_fair
> and
> if (!task_sleep || rq->nr_running == 0) in dequeue_task_fair
>
> We can probably optimized by using rq->cfs.h_nr_running instead of
> rq->nr_running as only cfs tasks really modifies the usage
>

I already filter out enqueues/dequeues that comes from load balancing;
and I use cfs.nr_running because, as you say, we currently work with CFS
tasks only.

Thanks,

- Juri

> Regards,
> Vincent
>
>>
>> Best,
>>
>> - Juri
>>
>>>>
>>>> Best,
>>>>
>>>> - Juri
>>>>
>>>>> Regards,
>>>>> Vincent
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> - Juri
>>>>>>
>>>>>>> It should be the same for the wake up of a task in enqueue_task_fair
>>>>>>> above, even if it's less obvious for this latter use case because the
>>>>>>> cpu might wake up from a long idle phase during which its
>>>>>>> utilization_blocked_avg has not been updated. Nevertheless, a trig of
>>>>>>> the freq switch at wake up of the cpu once its usage has been updated
>>>>>>> should do the job.
>>>>>>>
>>>>>>> So tick, migration of tasks, new tasks, entering/leaving idle state of
>>>>>>> cpu should be enough to trig freq switch
>>>>>>>
>>>>>>> Regards,
>>>>>>> Vincent
>>>>>>>
>>>>>>>
>>>>>>>> + }
>>>>>>>> }
>>>>>>>> hrtick_update(rq);
>>>>>>>> }
>>>>>>>> @@ -4959,8 +4999,6 @@ static int find_new_capacity(struct energy_env *eenv,
>>>>>>>> return idx;
>>>>>>>> }
>>>>>>>>
>>>>>>>> -static unsigned int capacity_margin = 1280; /* ~20% margin */
>>>>>>>> -
>>>>>>>> static bool cpu_overutilized(int cpu)
>>>>>>>> {
>>>>>>>> return (capacity_of(cpu) * 1024) <
>>>>>>>> --
>>>>>>>> 1.9.1
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

2015-08-12 17:08:56

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 16/46] sched: Allocate and initialize energy data structures

On 12/08/15 11:04, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:23:59PM +0100, Morten Rasmussen wrote:
>> +
>> + sge->nr_idle_states = fn(cpu)->nr_idle_states;
>> + sge->nr_cap_states = fn(cpu)->nr_cap_states;
>> + memcpy(sge->idle_states, fn(cpu)->idle_states,
>> + sge->nr_idle_states*sizeof(struct idle_state));
>> + memcpy(sge->cap_states, fn(cpu)->cap_states,
>> + sge->nr_cap_states*sizeof(struct capacity_state));
>
>> + if (fn && fn(j)) {
>> + nr_idle_states = fn(j)->nr_idle_states;
>> + nr_cap_states = fn(j)->nr_cap_states;
>> + BUG_ON(!nr_idle_states || !nr_cap_states);
>> + }
>
>> + for_each_cpu(i, &mask) {
>> + int y;
>> +
>> + BUG_ON(fn(i)->nr_idle_states != fn(cpu)->nr_idle_states);
>> +
>> + for (y = 0; y < (fn(i)->nr_idle_states); y++) {
>> + BUG_ON(fn(i)->idle_states[y].power !=
>> + fn(cpu)->idle_states[y].power);
>> + }
>> +
>> + BUG_ON(fn(i)->nr_cap_states != fn(cpu)->nr_cap_states);
>> +
>> + for (y = 0; y < (fn(i)->nr_cap_states); y++) {
>> + BUG_ON(fn(i)->cap_states[y].cap !=
>> + fn(cpu)->cap_states[y].cap);
>> + BUG_ON(fn(i)->cap_states[y].power !=
>> + fn(cpu)->cap_states[y].power);
>> + }
>> + }
>> +}
>
> Might it not make more sense to have:
>
> const struct blah *const blah = fn();
>
> and use blah afterwards, instead of the repeated invocation of fn()?

Absolutely! I can change this in the next release.

2015-08-12 17:10:09

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 16/46] sched: Allocate and initialize energy data structures

On 12/08/15 11:17, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:23:59PM +0100, Morten Rasmussen wrote:
>> @@ -6647,10 +6703,24 @@ static int __sdt_alloc(const struct cpumask *cpu_map)

[...]

>> @@ -6674,6 +6744,16 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>> return -ENOMEM;
>>
>> *per_cpu_ptr(sdd->sgc, j) = sgc;
>> +
>> + sge = kzalloc_node(sizeof(struct sched_group_energy) +
>> + nr_idle_states*sizeof(struct idle_state) +
>> + nr_cap_states*sizeof(struct capacity_state),
>> + GFP_KERNEL, cpu_to_node(j));
>> +
>> + if (!sge)
>> + return -ENOMEM;
>> +
>> + *per_cpu_ptr(sdd->sge, j) = sge;
>> }
>> }
>>
>
> One more question, if fn() returns a full structure, why are we
> allocating and copying the thing? Its all const read only data, right?
>

Yeah, that's not strictly necessary. I could get rid of all the
allocation/copying/ and freeing code and just simply set sd->groups->sge
= fn(cpu) in init_sched_energy(). Plus delete the atomic_t ref in struct
sched_group_energy.

In this case, should I still keep the check_sched_energy_data() function
to verify that the scheduler got valid data via the struct
sched_domain_topology_level table from the arch, i.e. to make sure that
the per-cpu provided sd energy data is consistent for all cpus within
the sched group cpumask?

2015-08-12 17:23:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 16/46] sched: Allocate and initialize energy data structures

On Wed, Aug 12, 2015 at 06:09:59PM +0100, Dietmar Eggemann wrote:
> > One more question, if fn() returns a full structure, why are we
> > allocating and copying the thing? Its all const read only data, right?
> >
>
> Yeah, that's not strictly necessary. I could get rid of all the
> allocation/copying/ and freeing code and just simply set sd->groups->sge
> = fn(cpu) in init_sched_energy(). Plus delete the atomic_t ref in struct
> sched_group_energy.
>
> In this case, should I still keep the check_sched_energy_data() function
> to verify that the scheduler got valid data via the struct
> sched_domain_topology_level table from the arch, i.e. to make sure that
> the per-cpu provided sd energy data is consistent for all cpus within
> the sched group cpumask?

Oh yes very much. We want sanity checking of the data handed.

2015-08-12 18:47:59

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 18/46] arm: topology: Define TC2 energy and provide it to the scheduler

On 12/08/15 11:33, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:24:01PM +0100, Morten Rasmussen wrote:
>> +static struct capacity_state cap_states_cluster_a7[] = {
>> + /* Cluster only power */
>> + { .cap = 150, .power = 2967, }, /* 350 MHz */
>> + { .cap = 172, .power = 2792, }, /* 400 MHz */
>> + { .cap = 215, .power = 2810, }, /* 500 MHz */
>> + { .cap = 258, .power = 2815, }, /* 600 MHz */
>> + { .cap = 301, .power = 2919, }, /* 700 MHz */
>> + { .cap = 344, .power = 2847, }, /* 800 MHz */
>> + { .cap = 387, .power = 3917, }, /* 900 MHz */
>> + { .cap = 430, .power = 4905, }, /* 1000 MHz */
>> + };
>
> So can I suggest a SCHED_DEBUG validation of the data provided?

Yes we can do that.

>
> Given the above table, it _never_ makes sense to run at .cap=150, it
> equally also doesn't make sense to run at .cap = 301.
>

Absolutely right.


> So please add a SCHED_DEBUG test on domain creation that validates that
> not only is the .cap monotonically increasing, but the .power is too.

The requirement for current EAS code to work is even higher. We're not
only requiring monotonically increasing values for .cap and .power but
that the energy efficiency (.cap/.power) is monotonically decreasing.
Otherwise we can't stop the search for a new appropriate OPP in
find_new_capacity() in case .cap >= current 'max. group usage' because
we can't assume that this OPP will be the most energy efficient one.

For the example above we get .cap/.power = [0.05 0.06 0.08 0.09 0.1 0.12
0.1 0.09] so only the last 3 OPPs [800, 900, 1000 Mhz] make sense from
this perspective on our TC2 test chip platform.

So we should check for monotonically decreasing (.cap/.power) values.

2015-08-13 12:08:53

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

On 12 August 2015 at 17:15, Juri Lelli <[email protected]> wrote:
> On 11/08/15 17:37, Vincent Guittot wrote:
>> On 11 August 2015 at 17:07, Juri Lelli <[email protected]> wrote:
>>> Hi Vincent,
>>>
>>> On 11/08/15 12:41, Vincent Guittot wrote:
>>>> On 11 August 2015 at 11:08, Juri Lelli <[email protected]> wrote:
>>>>> On 10/08/15 16:07, Vincent Guittot wrote:
>>>>>> On 10 August 2015 at 15:43, Juri Lelli <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Vincent,
>>>>>>>
>>>>>>> On 04/08/15 14:41, Vincent Guittot wrote:
>>>>>>>> Hi Juri,
>>>>>>>>
>>>>>>>> On 7 July 2015 at 20:24, Morten Rasmussen <[email protected]> wrote:
>>>>>>>>> From: Juri Lelli <[email protected]>

[snip]

>>>>>>>>> }
>>>>>>>>> @@ -4393,6 +4416,23 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>>>>>> if (!se) {
>>>>>>>>> sub_nr_running(rq, 1);
>>>>>>>>> update_rq_runnable_avg(rq, 1);
>>>>>>>>> + /*
>>>>>>>>> + * We want to trigger a freq switch request only for tasks that
>>>>>>>>> + * are going to sleep; this is because we get here also during
>>>>>>>>> + * load balancing, but in these cases it seems wise to trigger
>>>>>>>>> + * as single request after load balancing is done.
>>>>>>>>> + *
>>>>>>>>> + * Also, we add a margin (same ~20% used for the tipping point)
>>>>>>>>> + * to our request to provide some head room if p's utilization
>>>>>>>>> + * further increases.
>>>>>>>>> + */
>>>>>>>>> + if (sched_energy_freq() && task_sleep) {
>>>>>>>>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>>>>>>>>> +
>>>>>>>>> + req_cap = req_cap * capacity_margin
>>>>>>>>> + >> SCHED_CAPACITY_SHIFT;
>>>>>>>>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>>>>>>>>
>>>>>>>> Could you clarify why you want to trig a freq switch for tasks that
>>>>>>>> are going to sleep ?
>>>>>>>> The cpu_usage should not changed that much as the se_utilization of
>>>>>>>> the entity moves from utilization_load_avg to utilization_blocked_avg
>>>>>>>> of the rq and the usage and the freq are updated periodically.
>>>>>>>
>>>>>>> I think we still need to cover multiple back-to-back dequeues. Suppose
>>>>>>> that you have, let's say, 3 tasks that get enqueued at the same time.
>>>>>>> After some time the first one goes to sleep and its utilization, as you
>>>>>>> say, gets moved to utilization_blocked_avg. So, nothing changes, and
>>>>>>> the trigger is superfluous (even if no freq change I guess will be
>>>>>>> issued as we are already servicing enough capacity). However, after a
>>>>>>> while, the second task goes to sleep. Now we still use get_cpu_usage()
>>>>>>> and the first task contribution in utilization_blocked_avg should have
>>>>>>> been decayed by this time. Same thing may than happen for the third task
>>>>>>> as well. So, if we don't check if we need to scale down in
>>>>>>> dequeue_task_fair, it seems to me that we might miss some opportunities,
>>>>>>> as blocked contribution of other tasks could have been successively
>>>>>>> decayed.
>>>>>>>
>>>>>>> What you think?
>>>>>>
>>>>>> The tick is used to monitor such variation of the usage (in both way,
>>>>>> decay of the usage of sleeping tasks and increase of the usage of
>>>>>> running tasks). So in your example, if the duration between the sleep
>>>>>> of the 2 tasks is significant enough, the tick will handle this
>>>>>> variation
>>>>>>
>>>>>
>>>>> The tick is used to decide if we need to scale up (to max OPP for the
>>>>> time being), but we don't scale down. It makes more logical sense to
>>>>
>>>> why don't you want to check if you need to scale down ?
>>>>
>>>
>>> Well, because if I'm still executing something the cpu usage is only
>>> subject to raise.
>>
>> This is only true for system with NO HZ idle
>>
>
> Well, even with !NO_HZ_IDLE usage only decreases when cpu is idle. But,

Well, thanks for this obvious statement that usage only decreases when
cpu is idle but my question has never been about usage variation of
idle/running cpu but about the tick.

> I think I got your point; for !NO_HZ_IDLE configurations we might end
> up not scaling down frequency even if we have the tick running and
> the cpu is idle. I might need some more time to think this through, but
> it seems to me that we are still fine without an explicit trigger in
> task_tick_fair(); if we are running a !NO_HZ_IDLE system we are probably
> not so much concerned about power savings and still we react
> to tasks waking up, sleeping, leaving or moving around (which seems the
> real important events to me); OTOH, we might add that trigger, but this
> will generate unconditional checks at tick time for NO_HZ_IDLE

That will be far less critical than unconditionally check during all
task wake up or sleep. A task that wakes up every 200us will generate
much more check in the wake up hot path of the cpu that is already
busy with another task

> configurations, for a benefit that it seems to be still not completely
> clear.
>
>>>
>>>>> scale down at task deactivation, or wakeup after a long time, IMHO.
>>>>
>>>> But waking up or going to sleep don't have any impact on the usage of
>>>> a cpu. The only events that impact the cpu usage are:
>>>> -task migration,
>>>
>>> We explicitly cover this on load balancing paths.

But task can migrate out of the load balancing; At wake up for example
and AFAICT, you don't use this event to notify the decrease of the
usage of the cpu and check if a new OPP will fit better with the new
usage.

>>>
>>>> -new task
>>>
>>> We cover this in enqueue_task_fair(), introducing a new flag.
>>>
>>>> -time that elapse which can be monitored by periodically checking the usage.
>>>
>>> Do you mean when a task utilization crosses some threshold
>>> related to the current OPP? If that is the case, we have a
>>> check in task_tick_fair().
>>>
>>>> -and for nohz system when cpu enter or leave idle state
>>>>
>>>
>>> We address this in dequeue_task_fair(). In particular, if
>>> the cpu is going to be idle we don't trigger any change as
>>> it seems not always wise to wake up a thread to just change
>>> the OPP and the go idle; some platforms might require this
>>> behaviour anyway, but it probably more cpuidle/fw related?
>>
>> I would say that it's interesting to notifiy sched-dvfs that a cpu
>> becomes idle because we could decrease the opp of a cluster of cpus
>> that share the same clock if this cpu is the one that requires the max
>> capacity of the cluster (and other cpus are still running).
>>
>
> Well, we reset the capacity request of the cpu that is going idle.

And i'm fine with the fact that you use the cpu idle event

> The idea is that the next event on one of the other related cpus
> will update the cluster freq correctly. If any other cpu in the
> cluster is running something we keep the same frequency until
> the task running on that cpu goes to sleep; this seems fine to
> me because that task might end up being heavy and we saved a
> back to back lower to higher OPP switch; if the task is instead
> light it will probably be dequeued pretty soon, and at that time
> we switch to a lower OPP (since we cleared the idle cpu request
> before). Also, if the other cpus in the cluster are all idle
> we'll most probably enter an idle state, so no freq switch is
> most likely required.
>
>>>
>>> I would also add:
>>>
>>> - task is going to die
>>>
>>> We address this in dequeue as well, as its contribution is
>>> removed from usage (mod Yuyang's patches).
>>>
>>>> waking up and going to sleep events doesn't give any useful
>>>> information and using them to trig the monitoring of the usage
>>>> variation doesn't give you a predictable/periodic update of it whereas
>>>> the tick will
>>>>
>>>
>>> So, one key point of this solution is to get away as much
>>> as we can from periodic updates/sampling and move towards a
>>> (fully) event driven approach. The event logically associated
>>> to task_tick_fair() is when we realize that a task is going
>>> to saturate the current capacity; in this case we trigger a
>>> freq switch to an higher capacity. Also, if we never react
>>> to normal wakeups (as I understand you are proposing) we might
>>> miss some chances to adapt quickly enough. As an example, if
>>> you have a big task that suddenly goes to sleep, and sleeps
>>> until its decayed utilization goes almost to zero; when it
>>> wakes up, if we don't have a trigger in enqueue_task_fair(),

I'm not against having a trigger in enqueue, i'm against bindly
checking all task wake up in order to be sure to catch the useful
event like the cpu leave idle event of your example

>>> we'll have to wait until the next tick to select an appropriate
>>> (low) OPP.
>>
>> I assume that the cpu is idle in this case. This situation only
>> happens on Nohz idle system because tick is disable and you have to
>> update statistics when leaving idle as it is done for the jiffies or
>> the cpu_load array. So you should track cpu enter/leave idle (for nohz
>> system only) instead of tracking all tasks wake up/sleep events.
>>
>
> I think I already replied to this in what above. Did I? :)

In fact, It was not a question, i just state that using all wake up /
sleep events to be sure to trig the check of cpu capacity when the cpu
leave an idle phase (and especially a long one like in your example
above), is wrong. You have to use the leave idle event instead of
checking all wake up events to be sure to catch the right one. You say
that you want to get away as much as possible the periodic
updates/sampling but it's exactly what you do with these 2 events.
Using them only enables you to periodically check if the capacity has
changed since the last time like the tick already does. But instead of
using a periodic and controlled event, you use these random (with
regards to capacity evolution) and uncontrolled events in order to
catch useful change. As an example, If a cpu run a task and there is a
short running task that wakes up every 100us for running 100us, you
will call cpufreq_sched_set_cap, 5000 times per second for no good
reason as you already have the tick to periodically check the
evolution of the usage.

>
>> So you can either use update_cpu_load_nohz like it is already done for
>> cpu_load array
>> or you should use some conditions like below if you want to stay in
>> enqueue/dequeue_task_fair but task wake up or sleep event are not the
>> right condition
>> if (!(flags & ENQUEUE_WAKEUP) || rq->nr_running == 1 ) in enqueue_task_fair
>> and
>> if (!task_sleep || rq->nr_running == 0) in dequeue_task_fair
>>
>> We can probably optimized by using rq->cfs.h_nr_running instead of
>> rq->nr_running as only cfs tasks really modifies the usage
>>
>
> I already filter out enqueues/dequeues that comes from load balancing;
> and I use cfs.nr_running because, as you say, we currently work with CFS
> tasks only.

But not for the enqueue where you should use it instead of all wake up events.

Just to be clear: Using all enqueue/dequeue events (including task
wake up and sleep) to check a change of the usage of a cpu was doing a
lot of sense when mike has sent his v3 of the scheduler driven cpu
frequency selection because the usage was not accounting the blocked
tasks at that time so it was changing for all enqueue/dequeue events.
But this doesn't make sense in this patchset that account the blocked
tasks in the usage of the cpu and more generally now that yuyang's
patch set has been accepted.

Regards,
Vincent

>
> Thanks,
>
> - Juri
>
>> Regards,
>> Vincent
>>
>>>
>>> Best,
>>>
>>> - Juri
>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> - Juri
>>>>>
>>>>>> Regards,
>>>>>> Vincent
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> - Juri
>>>>>>>
>>>>>>>> It should be the same for the wake up of a task in enqueue_task_fair
>>>>>>>> above, even if it's less obvious for this latter use case because the
>>>>>>>> cpu might wake up from a long idle phase during which its
>>>>>>>> utilization_blocked_avg has not been updated. Nevertheless, a trig of
>>>>>>>> the freq switch at wake up of the cpu once its usage has been updated
>>>>>>>> should do the job.
>>>>>>>>
>>>>>>>> So tick, migration of tasks, new tasks, entering/leaving idle state of
>>>>>>>> cpu should be enough to trig freq switch
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Vincent
>>>>>>>>
>>>>>>>>
>>>>>>>>> + }
>>>>>>>>> }
>>>>>>>>> hrtick_update(rq);
>>>>>>>>> }
>>>>>>>>> @@ -4959,8 +4999,6 @@ static int find_new_capacity(struct energy_env *eenv,
>>>>>>>>> return idx;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> -static unsigned int capacity_margin = 1280; /* ~20% margin */
>>>>>>>>> -
>>>>>>>>> static bool cpu_overutilized(int cpu)
>>>>>>>>> {
>>>>>>>>> return (capacity_of(cpu) * 1024) <
>>>>>>>>> --
>>>>>>>>> 1.9.1
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

2015-08-13 15:34:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 22/46] sched: Calculate energy consumption of sched_group

On Tue, Jul 07, 2015 at 07:24:05PM +0100, Morten Rasmussen wrote:
> +static unsigned int sched_group_energy(struct sched_group *sg_top)
> +{
> + struct sched_domain *sd;
> + int cpu, total_energy = 0;
> + struct cpumask visit_cpus;
> + struct sched_group *sg;
> +
> + WARN_ON(!sg_top->sge);
> +
> + cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
> +
> + while (!cpumask_empty(&visit_cpus)) {
> + struct sched_group *sg_shared_cap = NULL;
> +
> + cpu = cpumask_first(&visit_cpus);
> +
> + /*
> + * Is the group utilization affected by cpus outside this
> + * sched_group?
> + */
> + sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
> + if (sd && sd->parent)
> + sg_shared_cap = sd->parent->groups;
> +
> + for_each_domain(cpu, sd) {
> + sg = sd->groups;
> +
> + /* Has this sched_domain already been visited? */
> + if (sd->child && group_first_cpu(sg) != cpu)
> + break;
> +
> + do {
> + struct sched_group *sg_cap_util;
> + unsigned long group_util;
> + int sg_busy_energy, sg_idle_energy, cap_idx;
> +
> + if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
> + sg_cap_util = sg_shared_cap;
> + else
> + sg_cap_util = sg;
> +
> + cap_idx = find_new_capacity(sg_cap_util, sg->sge);

So here its not really 'new' capacity is it, most like the current
capacity?

So in the case of coupled P states, you look for the CPU with highest
utilization, as that is the on that determines the required P state.

> + group_util = group_norm_usage(sg, cap_idx);
> + sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
> + >> SCHED_CAPACITY_SHIFT;
> + sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
> + >> SCHED_CAPACITY_SHIFT;
> +
> + total_energy += sg_busy_energy + sg_idle_energy;
> +
> + if (!sd->child)
> + cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
> +
> + if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(sg_top)))
> + goto next_cpu;
> +
> + } while (sg = sg->next, sg != sd->groups);
> + }
> +next_cpu:
> + continue;
> + }
> +
> + return total_energy;
> +}

2015-08-13 17:35:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 25/46] sched: Add over-utilization/tipping point indicator

On Tue, Jul 07, 2015 at 07:24:08PM +0100, Morten Rasmussen wrote:
> Energy-aware scheduling is only meant to be active while the system is
> _not_ over-utilized. That is, there are spare cycles available to shift
> tasks around based on their actual utilization to get a more
> energy-efficient task distribution without depriving any tasks. When
> above the tipping point task placement is done the traditional way,
> spreading the tasks across as many cpus as possible based on priority
> scaled load to preserve smp_nice.
>
> The over-utilization condition is conservatively chosen to indicate
> over-utilization as soon as one cpu is fully utilized at it's highest
> frequency. We don't consider groups as lumping usage and capacity
> together for a group of cpus may hide the fact that one or more cpus in
> the group are over-utilized while group-siblings are partially idle. The
> tasks could be served better if moved to another group with completely
> idle cpus. This is particularly problematic if some cpus have a
> significantly reduced capacity due to RT/IRQ pressure or if the system
> has cpus of different capacity (e.g. ARM big.LITTLE).

I might be tired, but I'm having a very hard time deciphering this
second paragraph.

2015-08-13 18:10:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 28/46] sched: Count number of shallower idle-states in struct sched_group_energy

On Tue, Jul 07, 2015 at 07:24:11PM +0100, Morten Rasmussen wrote:
> cpuidle associates all idle-states with each cpu while the energy model
> associates them with the sched_group covering the cpus coordinating
> entry to the idle-state. To look up the idle-state power consumption in
> the energy model it is therefore necessary to translate from cpuidle
> idle-state index to energy model index. For this purpose it is helpful
> to know how many idle-states that are listed in lower level sched_groups
> (in struct sched_group_energy).
>
> Example: ARMv8 big.LITTLE JUNO (Cortex A57, A53) idle-states:
> Idle-state cpuidle Energy model table indices
> index per-cpu sg per-cluster sg
> WFI 0 0 (0)
> Core power-down 1 1 0*
> Cluster power-down 2 (1) 1
>
> For per-cpu sgs no translation is required. If cpuidle reports state
> index 0 or 1, the cpu is in WFI or core power-down, respectively. We can
> look the idle-power up directly in the sg energy model table.

OK..

> Idle-state
> cluster power-down, is represented in the per-cluster sg energy model
> table as index 1. Index 0* is reserved for cluster power consumption
> when the cpus all are in state 0 or 1, but cpuidle decided not to go for
> cluster power-down.

0* is not an integer.

> Given the index from cpuidle we can compute the
> correct index in the energy model tables for the sgs at each level if we
> know how many states are in the tables in the child sgs. The actual
> translation is implemented in a later patch.

And you've lost me... I've looked at that later patch (its the next one)
and I cannot say I'm less confused.

2015-08-13 18:24:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 30/46] sched: Add cpu capacity awareness to wakeup balancing

On Tue, Jul 07, 2015 at 07:24:13PM +0100, Morten Rasmussen wrote:
>
> This patch adds capacity awareness to find_idlest_{group,queue} (used by
> SD_BALANCE_{FORK,EXEC}) such that groups/cpus that can accommodate the
> waking task based on task utilization are preferred. In addition, wakeup
> of existing tasks (SD_BALANCE_WAKE) is sent through
> find_idlest_{group,queue} if the task doesn't fit the capacity of the
> previous cpu to allow it to escape (override wake_affine) when
> necessary instead of relying on periodic/idle/nohz_idle balance to
> eventually sort it out.
>

That's policy not guided by the energy model.. Also we need something
clever for the wakeup balancing, the current stuff all stinks :/

2015-08-14 10:25:10

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 22/46] sched: Calculate energy consumption of sched_group

On Thu, Aug 13, 2015 at 05:34:17PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:24:05PM +0100, Morten Rasmussen wrote:
> > +static unsigned int sched_group_energy(struct sched_group *sg_top)
> > +{
> > + struct sched_domain *sd;
> > + int cpu, total_energy = 0;
> > + struct cpumask visit_cpus;
> > + struct sched_group *sg;
> > +
> > + WARN_ON(!sg_top->sge);
> > +
> > + cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
> > +
> > + while (!cpumask_empty(&visit_cpus)) {
> > + struct sched_group *sg_shared_cap = NULL;
> > +
> > + cpu = cpumask_first(&visit_cpus);
> > +
> > + /*
> > + * Is the group utilization affected by cpus outside this
> > + * sched_group?
> > + */
> > + sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
> > + if (sd && sd->parent)
> > + sg_shared_cap = sd->parent->groups;
> > +
> > + for_each_domain(cpu, sd) {
> > + sg = sd->groups;
> > +
> > + /* Has this sched_domain already been visited? */
> > + if (sd->child && group_first_cpu(sg) != cpu)
> > + break;
> > +
> > + do {
> > + struct sched_group *sg_cap_util;
> > + unsigned long group_util;
> > + int sg_busy_energy, sg_idle_energy, cap_idx;
> > +
> > + if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
> > + sg_cap_util = sg_shared_cap;
> > + else
> > + sg_cap_util = sg;
> > +
> > + cap_idx = find_new_capacity(sg_cap_util, sg->sge);
>
> So here its not really 'new' capacity is it, most like the current
> capacity?

Yes, sort of. It is what the current capacity (P-state) should be to
accommodate the current utilization. Using a sane cpufreq governor it is
most likely not far off.

I could rename it to find_capacity() instead. It is extended in a
subsequent patch to figure out the 'new' capacity in cases were we
consider putting more utilization into the group.

> So in the case of coupled P states, you look for the CPU with highest
> utilization, as that is the on that determines the required P state.

Yes. That is why we need the SD_SHARE_CAP_STATES flag and we use
group_max_usage() in find_new_capacity().

2015-08-14 11:38:34

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

Hi vincent,

On 13/08/15 13:08, Vincent Guittot wrote:
> On 12 August 2015 at 17:15, Juri Lelli <[email protected]> wrote:
>> On 11/08/15 17:37, Vincent Guittot wrote:
>>> On 11 August 2015 at 17:07, Juri Lelli <[email protected]> wrote:
>>>> Hi Vincent,
>>>>
>>>> On 11/08/15 12:41, Vincent Guittot wrote:
>>>>> On 11 August 2015 at 11:08, Juri Lelli <[email protected]> wrote:
>>>>>> On 10/08/15 16:07, Vincent Guittot wrote:
>>>>>>> On 10 August 2015 at 15:43, Juri Lelli <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi Vincent,
>>>>>>>>
>>>>>>>> On 04/08/15 14:41, Vincent Guittot wrote:
>>>>>>>>> Hi Juri,
>>>>>>>>>
>>>>>>>>> On 7 July 2015 at 20:24, Morten Rasmussen <[email protected]> wrote:
>>>>>>>>>> From: Juri Lelli <[email protected]>
>
> [snip]
>
>>>>>>>>>> }
>>>>>>>>>> @@ -4393,6 +4416,23 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>>>>>>> if (!se) {
>>>>>>>>>> sub_nr_running(rq, 1);
>>>>>>>>>> update_rq_runnable_avg(rq, 1);
>>>>>>>>>> + /*
>>>>>>>>>> + * We want to trigger a freq switch request only for tasks that
>>>>>>>>>> + * are going to sleep; this is because we get here also during
>>>>>>>>>> + * load balancing, but in these cases it seems wise to trigger
>>>>>>>>>> + * as single request after load balancing is done.
>>>>>>>>>> + *
>>>>>>>>>> + * Also, we add a margin (same ~20% used for the tipping point)
>>>>>>>>>> + * to our request to provide some head room if p's utilization
>>>>>>>>>> + * further increases.
>>>>>>>>>> + */
>>>>>>>>>> + if (sched_energy_freq() && task_sleep) {
>>>>>>>>>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>>>>>>>>>> +
>>>>>>>>>> + req_cap = req_cap * capacity_margin
>>>>>>>>>> + >> SCHED_CAPACITY_SHIFT;
>>>>>>>>>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>>>>>>>>>
>>>>>>>>> Could you clarify why you want to trig a freq switch for tasks that
>>>>>>>>> are going to sleep ?
>>>>>>>>> The cpu_usage should not changed that much as the se_utilization of
>>>>>>>>> the entity moves from utilization_load_avg to utilization_blocked_avg
>>>>>>>>> of the rq and the usage and the freq are updated periodically.
>>>>>>>>
>>>>>>>> I think we still need to cover multiple back-to-back dequeues. Suppose
>>>>>>>> that you have, let's say, 3 tasks that get enqueued at the same time.
>>>>>>>> After some time the first one goes to sleep and its utilization, as you
>>>>>>>> say, gets moved to utilization_blocked_avg. So, nothing changes, and
>>>>>>>> the trigger is superfluous (even if no freq change I guess will be
>>>>>>>> issued as we are already servicing enough capacity). However, after a
>>>>>>>> while, the second task goes to sleep. Now we still use get_cpu_usage()
>>>>>>>> and the first task contribution in utilization_blocked_avg should have
>>>>>>>> been decayed by this time. Same thing may than happen for the third task
>>>>>>>> as well. So, if we don't check if we need to scale down in
>>>>>>>> dequeue_task_fair, it seems to me that we might miss some opportunities,
>>>>>>>> as blocked contribution of other tasks could have been successively
>>>>>>>> decayed.
>>>>>>>>
>>>>>>>> What you think?
>>>>>>>
>>>>>>> The tick is used to monitor such variation of the usage (in both way,
>>>>>>> decay of the usage of sleeping tasks and increase of the usage of
>>>>>>> running tasks). So in your example, if the duration between the sleep
>>>>>>> of the 2 tasks is significant enough, the tick will handle this
>>>>>>> variation
>>>>>>>
>>>>>>
>>>>>> The tick is used to decide if we need to scale up (to max OPP for the
>>>>>> time being), but we don't scale down. It makes more logical sense to
>>>>>
>>>>> why don't you want to check if you need to scale down ?
>>>>>
>>>>
>>>> Well, because if I'm still executing something the cpu usage is only
>>>> subject to raise.
>>>
>>> This is only true for system with NO HZ idle
>>>
>>
>> Well, even with !NO_HZ_IDLE usage only decreases when cpu is idle. But,
>
> Well, thanks for this obvious statement that usage only decreases when
> cpu is idle but my question has never been about usage variation of
> idle/running cpu but about the tick.
>

I'm sorry if I sounded haughty to you, of course I didn't want too and
I apologize for that. I just wanted to state the obvious to confirm
myself that I understood your point, as I say below. :)

>> I think I got your point; for !NO_HZ_IDLE configurations we might end
>> up not scaling down frequency even if we have the tick running and
>> the cpu is idle. I might need some more time to think this through, but
>> it seems to me that we are still fine without an explicit trigger in
>> task_tick_fair(); if we are running a !NO_HZ_IDLE system we are probably
>> not so much concerned about power savings and still we react
>> to tasks waking up, sleeping, leaving or moving around (which seems the
>> real important events to me); OTOH, we might add that trigger, but this
>> will generate unconditional checks at tick time for NO_HZ_IDLE
>
> That will be far less critical than unconditionally check during all
> task wake up or sleep. A task that wakes up every 200us will generate
> much more check in the wake up hot path of the cpu that is already
> busy with another task
>

We have a throttling threshold for this kind of problem, which is the
same as the transition latency exported by cpufreq drivers. Now, we
still do some operations before checking that threshold, and the check
itself might be too expensive. I guess I'll go back and profile it.

>> configurations, for a benefit that it seems to be still not completely
>> clear.
>>
>>>>
>>>>>> scale down at task deactivation, or wakeup after a long time, IMHO.
>>>>>
>>>>> But waking up or going to sleep don't have any impact on the usage of
>>>>> a cpu. The only events that impact the cpu usage are:
>>>>> -task migration,
>>>>
>>>> We explicitly cover this on load balancing paths.
>
> But task can migrate out of the load balancing; At wake up for example
> and AFAICT, you don't use this event to notify the decrease of the
> usage of the cpu and check if a new OPP will fit better with the new
> usage.
>

If the task gets wakeup migrated its request will be issued as part
of the enqueue on the new cpu. If the cpu it was previously running
on is idle, it has already cleared its request, so it shouldn't
need any notification.

>>>>
>>>>> -new task
>>>>
>>>> We cover this in enqueue_task_fair(), introducing a new flag.
>>>>
>>>>> -time that elapse which can be monitored by periodically checking the usage.
>>>>
>>>> Do you mean when a task utilization crosses some threshold
>>>> related to the current OPP? If that is the case, we have a
>>>> check in task_tick_fair().
>>>>
>>>>> -and for nohz system when cpu enter or leave idle state
>>>>>
>>>>
>>>> We address this in dequeue_task_fair(). In particular, if
>>>> the cpu is going to be idle we don't trigger any change as
>>>> it seems not always wise to wake up a thread to just change
>>>> the OPP and the go idle; some platforms might require this
>>>> behaviour anyway, but it probably more cpuidle/fw related?
>>>
>>> I would say that it's interesting to notifiy sched-dvfs that a cpu
>>> becomes idle because we could decrease the opp of a cluster of cpus
>>> that share the same clock if this cpu is the one that requires the max
>>> capacity of the cluster (and other cpus are still running).
>>>
>>
>> Well, we reset the capacity request of the cpu that is going idle.
>
> And i'm fine with the fact that you use the cpu idle event
>

Ok :).

>> The idea is that the next event on one of the other related cpus
>> will update the cluster freq correctly. If any other cpu in the
>> cluster is running something we keep the same frequency until
>> the task running on that cpu goes to sleep; this seems fine to
>> me because that task might end up being heavy and we saved a
>> back to back lower to higher OPP switch; if the task is instead
>> light it will probably be dequeued pretty soon, and at that time
>> we switch to a lower OPP (since we cleared the idle cpu request
>> before). Also, if the other cpus in the cluster are all idle
>> we'll most probably enter an idle state, so no freq switch is
>> most likely required.
>>
>>>>
>>>> I would also add:
>>>>
>>>> - task is going to die
>>>>
>>>> We address this in dequeue as well, as its contribution is
>>>> removed from usage (mod Yuyang's patches).
>>>>
>>>>> waking up and going to sleep events doesn't give any useful
>>>>> information and using them to trig the monitoring of the usage
>>>>> variation doesn't give you a predictable/periodic update of it whereas
>>>>> the tick will
>>>>>
>>>>
>>>> So, one key point of this solution is to get away as much
>>>> as we can from periodic updates/sampling and move towards a
>>>> (fully) event driven approach. The event logically associated
>>>> to task_tick_fair() is when we realize that a task is going
>>>> to saturate the current capacity; in this case we trigger a
>>>> freq switch to an higher capacity. Also, if we never react
>>>> to normal wakeups (as I understand you are proposing) we might
>>>> miss some chances to adapt quickly enough. As an example, if
>>>> you have a big task that suddenly goes to sleep, and sleeps
>>>> until its decayed utilization goes almost to zero; when it
>>>> wakes up, if we don't have a trigger in enqueue_task_fair(),
>
> I'm not against having a trigger in enqueue, i'm against bindly
> checking all task wake up in order to be sure to catch the useful
> event like the cpu leave idle event of your example
>

Right, agreed.

>>>> we'll have to wait until the next tick to select an appropriate
>>>> (low) OPP.
>>>
>>> I assume that the cpu is idle in this case. This situation only
>>> happens on Nohz idle system because tick is disable and you have to
>>> update statistics when leaving idle as it is done for the jiffies or
>>> the cpu_load array. So you should track cpu enter/leave idle (for nohz
>>> system only) instead of tracking all tasks wake up/sleep events.
>>>
>>
>> I think I already replied to this in what above. Did I? :)
>
> In fact, It was not a question, i just state that using all wake up /
> sleep events to be sure to trig the check of cpu capacity when the cpu
> leave an idle phase (and especially a long one like in your example
> above), is wrong. You have to use the leave idle event instead of
> checking all wake up events to be sure to catch the right one.

It seems to me that we need to catch enqueue events even for cpus
that are already busy. Don't we have to see if we have to scale
up in case a task is enqueued after a wakeup or fork on a cpu that is
already running some other task?

Then, I agree that some other conditions might be added to check
that we are not over triggering the thing. I need to think about
those more.

> You say
> that you want to get away as much as possible the periodic
> updates/sampling but it's exactly what you do with these 2 events.
> Using them only enables you to periodically check if the capacity has
> changed since the last time like the tick already does. But instead of
> using a periodic and controlled event, you use these random (with
> regards to capacity evolution) and uncontrolled events in order to
> catch useful change. As an example, If a cpu run a task and there is a
> short running task that wakes up every 100us for running 100us, you
> will call cpufreq_sched_set_cap, 5000 times per second for no good
> reason as you already have the tick to periodically check the
> evolution of the usage.
>

In this case it's the workload that is inherently periodic, so we
have periodic checks. Anyway, this is most probably one bailout
condition we need to add, since if both task are already running on
the same cpu, I guess, its utilization shouldn't change that much.

>>
>>> So you can either use update_cpu_load_nohz like it is already done for
>>> cpu_load array
>>> or you should use some conditions like below if you want to stay in
>>> enqueue/dequeue_task_fair but task wake up or sleep event are not the
>>> right condition
>>> if (!(flags & ENQUEUE_WAKEUP) || rq->nr_running == 1 ) in enqueue_task_fair
>>> and
>>> if (!task_sleep || rq->nr_running == 0) in dequeue_task_fair
>>>
>>> We can probably optimized by using rq->cfs.h_nr_running instead of
>>> rq->nr_running as only cfs tasks really modifies the usage
>>>
>>
>> I already filter out enqueues/dequeues that comes from load balancing;
>> and I use cfs.nr_running because, as you say, we currently work with CFS
>> tasks only.
>
> But not for the enqueue where you should use it instead of all wake up events.
>
> Just to be clear: Using all enqueue/dequeue events (including task
> wake up and sleep) to check a change of the usage of a cpu was doing a
> lot of sense when mike has sent his v3 of the scheduler driven cpu
> frequency selection because the usage was not accounting the blocked
> tasks at that time so it was changing for all enqueue/dequeue events.
> But this doesn't make sense in this patchset that account the blocked
> tasks in the usage of the cpu and more generally now that yuyang's
> patch set has been accepted.
>

As above, I agree that we are most probably missing some optimizations;
that's what I'm going to look at again.

Thanks,

- Juri

> Regards,
> Vincent
>
>>
>> Thanks,
>>
>> - Juri
>>
>>> Regards,
>>> Vincent
>>>
>>>>
>>>> Best,
>>>>
>>>> - Juri
>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> - Juri
>>>>>>
>>>>>>> Regards,
>>>>>>> Vincent
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> - Juri
>>>>>>>>
>>>>>>>>> It should be the same for the wake up of a task in enqueue_task_fair
>>>>>>>>> above, even if it's less obvious for this latter use case because the
>>>>>>>>> cpu might wake up from a long idle phase during which its
>>>>>>>>> utilization_blocked_avg has not been updated. Nevertheless, a trig of
>>>>>>>>> the freq switch at wake up of the cpu once its usage has been updated
>>>>>>>>> should do the job.
>>>>>>>>>
>>>>>>>>> So tick, migration of tasks, new tasks, entering/leaving idle state of
>>>>>>>>> cpu should be enough to trig freq switch
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Vincent
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> + }
>>>>>>>>>> }
>>>>>>>>>> hrtick_update(rq);
>>>>>>>>>> }
>>>>>>>>>> @@ -4959,8 +4999,6 @@ static int find_new_capacity(struct energy_env *eenv,
>>>>>>>>>> return idx;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> -static unsigned int capacity_margin = 1280; /* ~20% margin */
>>>>>>>>>> -
>>>>>>>>>> static bool cpu_overutilized(int cpu)
>>>>>>>>>> {
>>>>>>>>>> return (capacity_of(cpu) * 1024) <
>>>>>>>>>> --
>>>>>>>>>> 1.9.1
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

2015-08-14 12:59:28

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 25/46] sched: Add over-utilization/tipping point indicator

On Thu, Aug 13, 2015 at 07:35:33PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:24:08PM +0100, Morten Rasmussen wrote:
> > Energy-aware scheduling is only meant to be active while the system is
> > _not_ over-utilized. That is, there are spare cycles available to shift
> > tasks around based on their actual utilization to get a more
> > energy-efficient task distribution without depriving any tasks. When
> > above the tipping point task placement is done the traditional way,
> > spreading the tasks across as many cpus as possible based on priority
> > scaled load to preserve smp_nice.
> >
> > The over-utilization condition is conservatively chosen to indicate
> > over-utilization as soon as one cpu is fully utilized at it's highest
> > frequency. We don't consider groups as lumping usage and capacity
> > together for a group of cpus may hide the fact that one or more cpus in
> > the group are over-utilized while group-siblings are partially idle. The
> > tasks could be served better if moved to another group with completely
> > idle cpus. This is particularly problematic if some cpus have a
> > significantly reduced capacity due to RT/IRQ pressure or if the system
> > has cpus of different capacity (e.g. ARM big.LITTLE).
>
> I might be tired, but I'm having a very hard time deciphering this
> second paragraph.

I can see why, let me try again :-)

It is essentially about when do we make balancing decisions based on
load_avg and util_avg (using the new names in Yuyang's rewrite). As you
mentioned in another thread recently, we want to use util_avg until the
system is over-utilized and then switch to load_avg. We need to define
the conditions that determine the switch.

The util_avg for each cpu converges towards 100% (1024) regardless of
how many task additional task we may put on it. If we define
over-utilized as being something like:

sum_{cpus}(rq::cfs::avg::util_avg) + margin > sum_{cpus}(rq::capacity)

some individual cpus may be over-utilized running multiple tasks even
when the above condition is false. That should be okay as long as we try
to spread the tasks out to avoid per-cpu over-utilization as much as
possible and if all tasks have the _same_ priority. If the latter isn't
true, we have to consider priority to preserve smp_nice.

For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60%. Balancing based on util_avg we are likely
to end up with nice=-10 sharing cpus and nice=0 getting their own as we
1.5*n_cpus tasks in total and 55%+55% is less over-utilized than 55%+60%
for those cpus that have to be shared. The system utilization is only
85% of the system capacity, but we are breaking smp_nice.

To be sure not to break smp_nice, we have defined over-utilization as
when:

cpu_rq(any)::cfs::avg::util_avg + margin > cpu_rq(any)::capacity

is true for any cpu in the system. IOW, as soon as one cpu is (nearly)
100% utilized, we switch to load_avg to factor in priority.

Now with this definition, we can skip periodic load-balance as no cpu
has an always-running task when the system is not over-utilized. All
tasks will be periodic and we can balance them at wake-up. This
conservative condition does however mean that some scenarios that could
benefit from energy-aware decisions even if one cpu is fully utilized
would not get those benefits.

For system where some cpus might have reduced capacity on some cpus
(RT-pressure and/or big.LITTLE), we want periodic load-balance checks as
soon a just a single cpu is fully utilized as it might one of those with
reduced capacity and in that case we want to migrate it.

I haven't found any reasonably easy-to-track conditions that would work
better. Suggestions are very welcome.

2015-08-14 16:05:10

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 01/46] arm: Frequency invariant scheduler load-tracking support

On Tue, Aug 11, 2015 at 11:27:54AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:23:44PM +0100, Morten Rasmussen wrote:
> > +static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
> > +DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);
>
> > + atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity);
> > + unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
> > + atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
> > + unsigned long curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu));
>
> The use of atomic_long_t here is entirely pointless.

Yes, I will get rid of those.

2015-08-14 16:17:15

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 30/46] sched: Add cpu capacity awareness to wakeup balancing

On Thu, Aug 13, 2015 at 08:24:09PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:24:13PM +0100, Morten Rasmussen wrote:
> >
> > This patch adds capacity awareness to find_idlest_{group,queue} (used by
> > SD_BALANCE_{FORK,EXEC}) such that groups/cpus that can accommodate the
> > waking task based on task utilization are preferred. In addition, wakeup
> > of existing tasks (SD_BALANCE_WAKE) is sent through
> > find_idlest_{group,queue} if the task doesn't fit the capacity of the
> > previous cpu to allow it to escape (override wake_affine) when
> > necessary instead of relying on periodic/idle/nohz_idle balance to
> > eventually sort it out.
> >
>
> That's policy not guided by the energy model.. Also we need something
> clever for the wakeup balancing, the current stuff all stinks :/

Yes, it is a bit of a mess. This patch just attempting to make things
better for systems where cpu capacity differ. Use of the energy model is
introduced in later patch.

The current code only considers idle cpus and when no such can be found,
the least loaded cpu. This is one of the main reasons why mainline
performance is random on big.LITTLE.

I can write you a longer story, or we can discuss it in SEA next week
;-)

2015-08-14 19:09:58

by Sai Gurrappadi

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 28/46] sched: Count number of shallower idle-states in struct sched_group_energy


On 08/13/2015 11:10 AM, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:24:11PM +0100, Morten Rasmussen wrote:
>> cpuidle associates all idle-states with each cpu while the energy model
>> associates them with the sched_group covering the cpus coordinating
>> entry to the idle-state. To look up the idle-state power consumption in
>> the energy model it is therefore necessary to translate from cpuidle
>> idle-state index to energy model index. For this purpose it is helpful
>> to know how many idle-states that are listed in lower level sched_groups
>> (in struct sched_group_energy).
>>
>> Example: ARMv8 big.LITTLE JUNO (Cortex A57, A53) idle-states:
>> Idle-state cpuidle Energy model table indices
>> index per-cpu sg per-cluster sg
>> WFI 0 0 (0)
>> Core power-down 1 1 0*
>> Cluster power-down 2 (1) 1
>>
>> For per-cpu sgs no translation is required. If cpuidle reports state
>> index 0 or 1, the cpu is in WFI or core power-down, respectively. We can
>> look the idle-power up directly in the sg energy model table.
>
> OK..
>
>> Idle-state
>> cluster power-down, is represented in the per-cluster sg energy model
>> table as index 1. Index 0* is reserved for cluster power consumption
>> when the cpus all are in state 0 or 1, but cpuidle decided not to go for
>> cluster power-down.
>
> 0* is not an integer.
>
>> Given the index from cpuidle we can compute the
>> correct index in the energy model tables for the sgs at each level if we
>> know how many states are in the tables in the child sgs. The actual
>> translation is implemented in a later patch.
>
> And you've lost me... I've looked at that later patch (its the next one)
> and I cannot say I'm less confused.
>

I think I understand this roughly but I don't understand why this isn't as simple as describing the power consumption at core and cluster level for each cpuidle state. If a particular cpuidle state has no real impact at the cluster level, then can't we just describe that in the power model? Sorry if this has already been discussed.

Every cpuidle state is essentially a Cx/CCx combination. For the WFI/core pd case above, the cluster state turns out to be CC0 (cluster 'active').

Thanks,
-Sai

2015-08-15 19:51:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 34/46] sched: Enable idle balance to pull single task towards cpu with higher capacity

On Tue, Jul 07, 2015 at 07:24:17PM +0100, Morten Rasmussen wrote:
> +++ b/kernel/sched/fair.c
> @@ -7569,6 +7569,13 @@ static int need_active_balance(struct lb_env *env)
> return 1;
> }
>
> + if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&
> + env->src_rq->cfs.h_nr_running == 1 &&
> + cpu_overutilized(env->src_cpu) &&
> + !cpu_overutilized(env->dst_cpu)) {
> + return 1;
> + }
> +
> return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
> }

Doesn't this allow for a nice game of ping-pong? Where if a task runs on
CPU X and generates interrupts there, its capacity will lower and we'll
migrate it over to CPU Y because that isn't receiving interrupts.

Now the task is running on Y, will generate interrupts there, and sees X
as a more attractive destination.

goto 1

2015-08-15 19:51:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 35/46] sched: Disable energy-unfriendly nohz kicks

On Tue, Jul 07, 2015 at 07:24:18PM +0100, Morten Rasmussen wrote:
> With energy-aware scheduling enabled nohz_kick_needed() generates many
> nohz idle-balance kicks which lead to nothing when multiple tasks get
> packed on a single cpu to save energy. This causes unnecessary wake-ups
> and hence wastes energy. Make these conditions depend on !energy_aware()
> for now until the energy-aware nohz story gets sorted out.

The patch does slightly more; it also allows the kick if over utilized.

But disabling this will allow getting 'stuck' in certain over loaded
situations because we're not kicking the balancer.

I think you need more justification for doing this.

2015-08-15 19:52:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 36/46] sched: Prevent unnecessary active balance of single task in sched group

On Tue, Jul 07, 2015 at 07:24:19PM +0100, Morten Rasmussen wrote:
> Scenarios with the busiest group having just one task and the local
> being idle on topologies with sched groups with different numbers of
> cpus manage to dodge all load-balance bailout conditions resulting the
> nr_balance_failed counter to be incremented. This eventually causes an
> pointless active migration of the task. This patch prevents this by not
> incrementing the counter when the busiest group only has one task.
> ASYM_PACKING migrations and migrations due to reduced capacity should
> still take place as these are explicitly captured by
> need_active_balance().
>
> A better solution would be to not attempt the load-balance in the first
> place, but that requires significant changes to the order of bailout
> conditions and statistics gathering.

*groan*, and this is of course triggered by your 2+3 core TC2 thingy.

Yes, asymmetric groups like that are a pain.

2015-08-15 19:54:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 38/46] sched: scheduler-driven cpu frequency selection

On Tue, Jul 07, 2015 at 07:24:21PM +0100, Morten Rasmussen wrote:
> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
> new file mode 100644
> index 0000000..5020f24
> --- /dev/null
> +++ b/kernel/sched/cpufreq_sched.c
> @@ -0,0 +1,308 @@
> +/*
> + * Copyright (C) 2015 Michael Turquette <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/cpufreq.h>
> +#include <linux/module.h>
> +#include <linux/kthread.h>
> +#include <linux/percpu.h>
> +#include <linux/irq_work.h>
> +
> +#include "sched.h"
> +
> +#define THROTTLE_NSEC 50000000 /* 50ms default */
> +
> +static DEFINE_PER_CPU(unsigned long, pcpu_capacity);
> +static DEFINE_PER_CPU(struct cpufreq_policy *, pcpu_policy);
> +
> +/**
> + * gov_data - per-policy data internal to the governor
> + * @throttle: next throttling period expiry. Derived from throttle_nsec
> + * @throttle_nsec: throttle period length in nanoseconds
> + * @task: worker thread for dvfs transition that may block/sleep
> + * @irq_work: callback used to wake up worker thread
> + * @freq: new frequency stored in *_sched_update_cpu and used in *_sched_thread
> + *
> + * struct gov_data is the per-policy cpufreq_sched-specific data structure. A
> + * per-policy instance of it is created when the cpufreq_sched governor receives
> + * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
> + * member of struct cpufreq_policy.
> + *
> + * Readers of this data must call down_read(policy->rwsem). Writers must
> + * call down_write(policy->rwsem).
> + */
> +struct gov_data {
> + ktime_t throttle;
> + unsigned int throttle_nsec;
> + struct task_struct *task;
> + struct irq_work irq_work;
> + struct cpufreq_policy *policy;
> + unsigned int freq;
> +};
> +
> +static void cpufreq_sched_try_driver_target(struct cpufreq_policy *policy, unsigned int freq)
> +{
> + struct gov_data *gd = policy->governor_data;
> +
> + /* avoid race with cpufreq_sched_stop */
> + if (!down_write_trylock(&policy->rwsem))
> + return;
> +
> + __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
> +
> + gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
> + up_write(&policy->rwsem);
> +}

That locking truly is disgusting.. why can't we change that?

> +static int cpufreq_sched_thread(void *data)
> +{

> +
> + ret = set_cpus_allowed_ptr(gd->task, policy->related_cpus);

That's not sufficient, you really want to have called kthread_bind() on
these threads, otherwise userspace can change affinity on you.

> +
> + do_exit(0);

I thought kthreads only needed to return...

> +}

> +void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
> +{
> + unsigned int freq_new, cpu_tmp;
> + struct cpufreq_policy *policy;
> + struct gov_data *gd;
> + unsigned long capacity_max = 0;
> +
> + /* update per-cpu capacity request */
> + __this_cpu_write(pcpu_capacity, capacity);
> +
> + policy = cpufreq_cpu_get(cpu);

So this does a down_read_trylock(&cpufreq_rwsem) and a
read_lock_irqsave(&cpufreq_driver_lock), all while holding scheduler
locks.

> + if (cpufreq_driver_might_sleep())
> + irq_work_queue_on(&gd->irq_work, cpu);
> + else
> + cpufreq_sched_try_driver_target(policy, freq_new);

This will then do a down_write_trylock(&policy->rwsem)

> +
> +out:
> + cpufreq_cpu_put(policy);

> + return;
> +}

That is just insane... surely we can replace all that with a wee bit of
RCU logic.

So something like:

DEFINE_MUTEX(cpufreq_mutex);
struct cpufreq_driver *cpufreq_driver;

struct cpufreq_policy *cpufreq_cpu_get(unsigned int cpu)
{
struct cpufreq_driver *driver;
struct cpufreq_policy *policy;

rcu_read_lock();
driver = rcu_dereference(cpufreq_driver);
if (!driver)
goto err;

policy = per_cpu_ptr(driver->policy, cpu);
if (!policy)
goto err;

return policy;

err:
rcu_read_unlock();
return NULL;
}


void cpufreq_cpu_put(struct cpufreq_policy *policy)
{
rcu_read_unlock();
}



void cpufreq_set_driver(struct cpufreq_driver *driver)
{
mutex_lock(&cpufreq_mutex);

rcu_assign_pointer(cpufreq_driver, NULL);

/*
* Wait for everyone to observe the lack of driver; iow. until
* its unused.
*/
synchronize_rcu();

/*
* Now that ye olde driver be gone, install a new one.
*/
if (driver)
rcu_assign_pointer(cpufreq_driver, driver);

mutex_unlock(&cpufreq_mutex);
}


No need for cpufreq_rwsem or cpufreq_driver_lock..


Hmm?

2015-08-15 19:51:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 39/46] sched/cpufreq_sched: use static key for cpu frequency selection

On Wed, Jul 08, 2015 at 08:19:56AM -0700, Michael Turquette wrote:
> > @@ -254,6 +267,7 @@ static int cpufreq_sched_stop(struct cpufreq_policy *policy)
> > {
> > struct gov_data *gd = policy->governor_data;
> >
> > + clear_sched_energy_freq();
>
> <paranoia>
>
> These controls are exposed to userspace via cpufreq sysfs knobs. Should
> we use a struct static_key_deferred and static_key_slow_dec_deferred()
> instead? This helps avoid a possible attack vector for slowing down the
> system.
>
> </paranoia>
>
> I don't really know what a sane default rate limit would be in that case
> though. Otherwise feel free to add:

Exposed through being able to change the policy, right? No other new
knobs, right?

IIRC the policy is only writable by root, in which case deferred isn't
really needed, root can kill the machine many other ways.

2015-08-15 19:52:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 40/46] sched/cpufreq_sched: compute freq_new based on capacity_orig_of()

On Tue, Jul 07, 2015 at 07:24:23PM +0100, Morten Rasmussen wrote:
> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
> index 2968f3a..7071528 100644
> --- a/kernel/sched/cpufreq_sched.c
> +++ b/kernel/sched/cpufreq_sched.c
> @@ -184,7 +184,7 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
> goto out;
>
> /* Convert the new maximum capacity request into a cpu frequency */
> - freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
> + freq_new = (capacity * policy->max) / capacity_orig_of(cpu);
>
> /* No change in frequency? Bail and return current capacity. */
> if (freq_new == policy->cur)

Can't we avoid exporting that lot by simply passing in the right values
to begin with?

2015-08-15 19:51:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests



So this OPP thing, I think that got mentioned once earlier in this patch
set, wth is that?

2015-08-15 19:51:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 38/46] sched: scheduler-driven cpu frequency selection

On Tue, Jul 07, 2015 at 07:24:21PM +0100, Morten Rasmussen wrote:
> +void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
> +{
> + unsigned int freq_new, cpu_tmp;
> + struct cpufreq_policy *policy;
> + struct gov_data *gd;
> + unsigned long capacity_max = 0;
> +
> + /* update per-cpu capacity request */
> + __this_cpu_write(pcpu_capacity, capacity);
> +
> + policy = cpufreq_cpu_get(cpu);
> + if (IS_ERR_OR_NULL(policy)) {
> + return;
> + }
> +
> + if (!policy->governor_data)
> + goto out;
> +
> + gd = policy->governor_data;
> +
> + /* bail early if we are throttled */
> + if (ktime_before(ktime_get(), gd->throttle))
> + goto out;

Isn't this the wrong place to throttle? Suppose you're getting multiple
new tasks placed on this CPU, the first one would trigger this callback
and start increasing freq..

While we're still changing freq. (and therefore throttled), another task
comes in which would again raise the freq.

With this scheme you loose the latter freq. change and will not
re-evaluate.

Any scheme that limits the callbacks to the actual hardware will have to
buffer requests and once the hardware returns (be it through an
interrupt or timeout) issue the latest request.

2015-08-16 03:51:14

by Michael Turquette

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

Quoting Peter Zijlstra (2015-08-15 05:48:17)
>
>
> So this OPP thing, I think that got mentioned once earlier in this patch
> set, wth is that?

OPP == OPerating Point == P-state

In System-on-chip Land OPP is a very common term, roughly defined as a
frequency & voltage pair that makes up a performance state.

In other words, OPP is the P-state of the non-ACPI world.

Similarly DVFS is sometimes confused as a brand new file system, but it
is also a very standardized acronym amongst SoC vendors meaning
frequency and voltage scaling.

Regards,
Mike

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-08-16 04:03:50

by Michael Turquette

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 40/46] sched/cpufreq_sched: compute freq_new based on capacity_orig_of()

Quoting Peter Zijlstra (2015-08-15 05:46:38)
> On Tue, Jul 07, 2015 at 07:24:23PM +0100, Morten Rasmussen wrote:
> > diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
> > index 2968f3a..7071528 100644
> > --- a/kernel/sched/cpufreq_sched.c
> > +++ b/kernel/sched/cpufreq_sched.c
> > @@ -184,7 +184,7 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
> > goto out;
> >
> > /* Convert the new maximum capacity request into a cpu frequency */
> > - freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
> > + freq_new = (capacity * policy->max) / capacity_orig_of(cpu);
> >
> > /* No change in frequency? Bail and return current capacity. */
> > if (freq_new == policy->cur)
>
> Can't we avoid exporting that lot by simply passing in the right values
> to begin with?

By "right value" do you mean, "pass the frequency from cfs"?

If that is what you mean, then the answer is "yes". But it also means
that cfs will need access to either:

1) the cpu frequncy-domain topology described in struct cpufreq.cpus
OR
2) duplicate that frequency-domain knowledge, perhaps in sched_domain

If that isn't what you mean by "right value" then let me know.

Regards,
Mike

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-08-16 22:02:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 40/46] sched/cpufreq_sched: compute freq_new based on capacity_orig_of()

On Sat, Aug 15, 2015 at 09:03:33PM -0700, Michael Turquette wrote:
> Quoting Peter Zijlstra (2015-08-15 05:46:38)
> > On Tue, Jul 07, 2015 at 07:24:23PM +0100, Morten Rasmussen wrote:
> > > diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
> > > index 2968f3a..7071528 100644
> > > --- a/kernel/sched/cpufreq_sched.c
> > > +++ b/kernel/sched/cpufreq_sched.c
> > > @@ -184,7 +184,7 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
> > > goto out;
> > >
> > > /* Convert the new maximum capacity request into a cpu frequency */
> > > - freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
> > > + freq_new = (capacity * policy->max) / capacity_orig_of(cpu);
> > >
> > > /* No change in frequency? Bail and return current capacity. */
> > > if (freq_new == policy->cur)
> >
> > Can't we avoid exporting that lot by simply passing in the right values
> > to begin with?
>
> By "right value" do you mean, "pass the frequency from cfs"?

Nah, just maybe: (capacity << SCHED_CAPACITY_SHIFT) / capacity_orig_of()
such that you don't have to export that knowledge to this thing.

2015-08-16 22:02:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 45/46] sched/cpufreq_sched: modify pcpu_capacity handling

On Tue, Jul 07, 2015 at 07:24:28PM +0100, Morten Rasmussen wrote:
> From: Juri Lelli <[email protected]>
>
> Use the cpu argument of cpufreq_sched_set_cap() to handle per_cpu writes,
> as the thing can be called remotely (e.g., from load balacing code).
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Juri Lelli <[email protected]>
> ---
> kernel/sched/cpufreq_sched.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
> index 06ff183..b81ac779 100644
> --- a/kernel/sched/cpufreq_sched.c
> +++ b/kernel/sched/cpufreq_sched.c
> @@ -151,7 +151,7 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
> unsigned long capacity_max = 0;
>
> /* update per-cpu capacity request */
> - __this_cpu_write(pcpu_capacity, capacity);
> + per_cpu(pcpu_capacity, cpu) = capacity;
>
> policy = cpufreq_cpu_get(cpu);
> if (IS_ERR_OR_NULL(policy)) {

Uhm,.. this function seems to hard assume its called for the local CPU.
It will only use the irq_queue_work_on() if the cpufreq thing requires
the thread, otherwise it will call the method directly on the calling
cpu.

2015-08-17 09:19:54

by Leo Yan

[permalink] [raw]
Subject: Re: [RFCv5, 18/46] arm: topology: Define TC2 energy and provide it to the scheduler

Hi Morten,

On Tue, Jul 07, 2015 at 07:24:01PM +0100, Morten Rasmussen wrote:
> From: Dietmar Eggemann <[email protected]>
>
> This patch is only here to be able to test provisioning of energy related
> data from an arch topology shim layer to the scheduler. Since there is no
> code today which deals with extracting energy related data from the dtb or
> acpi, and process it in the topology shim layer, the content of the
> sched_group_energy structures as well as the idle_state and capacity_state
> arrays are hard-coded here.
>
> This patch defines the sched_group_energy structure as well as the
> idle_state and capacity_state array for the cluster (relates to sched
> groups (sgs) in DIE sched domain level) and for the core (relates to sgs
> in MC sd level) for a Cortex A7 as well as for a Cortex A15.
> It further provides related implementations of the sched_domain_energy_f
> functions (cpu_cluster_energy() and cpu_core_energy()).
>
> To be able to propagate this information from the topology shim layer to
> the scheduler, the elements of the arm_topology[] table have been
> provisioned with the appropriate sched_domain_energy_f functions.
>
> cc: Russell King <[email protected]>
>
> Signed-off-by: Dietmar Eggemann <[email protected]>
>
> ---
> arch/arm/kernel/topology.c | 118 +++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 115 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index b35d3e5..bbe20c7 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -274,6 +274,119 @@ void store_cpu_topology(unsigned int cpuid)
> cpu_topology[cpuid].socket_id, mpidr);
> }
>
> +/*
> + * ARM TC2 specific energy cost model data. There are no unit requirements for
> + * the data. Data can be normalized to any reference point, but the
> + * normalization must be consistent. That is, one bogo-joule/watt must be the
> + * same quantity for all data, but we don't care what it is.
> + */
> +static struct idle_state idle_states_cluster_a7[] = {
> + { .power = 25 }, /* WFI */

This state is confused. Is this state corresponding to all CPUs have been
powered off but L2 cache RAM array and SCU are still power on?

> + { .power = 10 }, /* cluster-sleep-l */

Is this status means all CPU and cluster have been powered off, if so
then it will have no power consumption anymore...

> + };
> +
> +static struct idle_state idle_states_cluster_a15[] = {
> + { .power = 70 }, /* WFI */
> + { .power = 25 }, /* cluster-sleep-b */
> + };
> +
> +static struct capacity_state cap_states_cluster_a7[] = {
> + /* Cluster only power */
> + { .cap = 150, .power = 2967, }, /* 350 MHz */

For cluster level's capacity, does it mean need run benchmark on all
CPUs within cluster?

> + { .cap = 172, .power = 2792, }, /* 400 MHz */
> + { .cap = 215, .power = 2810, }, /* 500 MHz */
> + { .cap = 258, .power = 2815, }, /* 600 MHz */
> + { .cap = 301, .power = 2919, }, /* 700 MHz */
> + { .cap = 344, .power = 2847, }, /* 800 MHz */
> + { .cap = 387, .power = 3917, }, /* 900 MHz */
> + { .cap = 430, .power = 4905, }, /* 1000 MHz */
> + };
> +
> +static struct capacity_state cap_states_cluster_a15[] = {
> + /* Cluster only power */
> + { .cap = 426, .power = 7920, }, /* 500 MHz */
> + { .cap = 512, .power = 8165, }, /* 600 MHz */
> + { .cap = 597, .power = 8172, }, /* 700 MHz */
> + { .cap = 682, .power = 8195, }, /* 800 MHz */
> + { .cap = 768, .power = 8265, }, /* 900 MHz */
> + { .cap = 853, .power = 8446, }, /* 1000 MHz */
> + { .cap = 938, .power = 11426, }, /* 1100 MHz */
> + { .cap = 1024, .power = 15200, }, /* 1200 MHz */
> + };
> +
> +static struct sched_group_energy energy_cluster_a7 = {
> + .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a7),
> + .idle_states = idle_states_cluster_a7,
> + .nr_cap_states = ARRAY_SIZE(cap_states_cluster_a7),
> + .cap_states = cap_states_cluster_a7,
> +};
> +
> +static struct sched_group_energy energy_cluster_a15 = {
> + .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a15),
> + .idle_states = idle_states_cluster_a15,
> + .nr_cap_states = ARRAY_SIZE(cap_states_cluster_a15),
> + .cap_states = cap_states_cluster_a15,
> +};
> +
> +static struct idle_state idle_states_core_a7[] = {
> + { .power = 0 }, /* WFI */

Should have two idle states for CPU level (WFI and CPU's power off)?

> + };
> +
> +static struct idle_state idle_states_core_a15[] = {
> + { .power = 0 }, /* WFI */
> + };
> +
> +static struct capacity_state cap_states_core_a7[] = {
> + /* Power per cpu */
> + { .cap = 150, .power = 187, }, /* 350 MHz */
> + { .cap = 172, .power = 275, }, /* 400 MHz */
> + { .cap = 215, .power = 334, }, /* 500 MHz */
> + { .cap = 258, .power = 407, }, /* 600 MHz */
> + { .cap = 301, .power = 447, }, /* 700 MHz */
> + { .cap = 344, .power = 549, }, /* 800 MHz */
> + { .cap = 387, .power = 761, }, /* 900 MHz */
> + { .cap = 430, .power = 1024, }, /* 1000 MHz */
> + };
> +
> +static struct capacity_state cap_states_core_a15[] = {
> + /* Power per cpu */
> + { .cap = 426, .power = 2021, }, /* 500 MHz */
> + { .cap = 512, .power = 2312, }, /* 600 MHz */
> + { .cap = 597, .power = 2756, }, /* 700 MHz */
> + { .cap = 682, .power = 3125, }, /* 800 MHz */
> + { .cap = 768, .power = 3524, }, /* 900 MHz */
> + { .cap = 853, .power = 3846, }, /* 1000 MHz */
> + { .cap = 938, .power = 5177, }, /* 1100 MHz */
> + { .cap = 1024, .power = 6997, }, /* 1200 MHz */
> + };
> +
> +static struct sched_group_energy energy_core_a7 = {
> + .nr_idle_states = ARRAY_SIZE(idle_states_core_a7),
> + .idle_states = idle_states_core_a7,
> + .nr_cap_states = ARRAY_SIZE(cap_states_core_a7),
> + .cap_states = cap_states_core_a7,
> +};
> +
> +static struct sched_group_energy energy_core_a15 = {
> + .nr_idle_states = ARRAY_SIZE(idle_states_core_a15),
> + .idle_states = idle_states_core_a15,
> + .nr_cap_states = ARRAY_SIZE(cap_states_core_a15),
> + .cap_states = cap_states_core_a15,
> +};
> +
> +/* sd energy functions */
> +static inline const struct sched_group_energy *cpu_cluster_energy(int cpu)
> +{
> + return cpu_topology[cpu].socket_id ? &energy_cluster_a7 :
> + &energy_cluster_a15;
> +}
> +
> +static inline const struct sched_group_energy *cpu_core_energy(int cpu)
> +{
> + return cpu_topology[cpu].socket_id ? &energy_core_a7 :
> + &energy_core_a15;
> +}
> +
> static inline int cpu_corepower_flags(void)
> {
> return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN | \
> @@ -282,10 +395,9 @@ static inline int cpu_corepower_flags(void)
>
> static struct sched_domain_topology_level arm_topology[] = {
> #ifdef CONFIG_SCHED_MC
> - { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
> - { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
> + { cpu_coregroup_mask, cpu_corepower_flags, cpu_core_energy, SD_INIT_NAME(MC) },
> #endif
> - { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> + { cpu_cpu_mask, 0, cpu_cluster_energy, SD_INIT_NAME(DIE) },
> { NULL, },
> };
>

2015-08-17 09:43:38

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

On 14 August 2015 at 13:39, Juri Lelli <[email protected]> wrote:
> Hi vincent,
>
> On 13/08/15 13:08, Vincent Guittot wrote:
>> On 12 August 2015 at 17:15, Juri Lelli <[email protected]> wrote:
>>> On 11/08/15 17:37, Vincent Guittot wrote:
>>>> On 11 August 2015 at 17:07, Juri Lelli <[email protected]> wrote:
>>>>> Hi Vincent,
>>>>>
>>>>> On 11/08/15 12:41, Vincent Guittot wrote:
>>>>>> On 11 August 2015 at 11:08, Juri Lelli <[email protected]> wrote:
>>>>>>> On 10/08/15 16:07, Vincent Guittot wrote:
>>>>>>>> On 10 August 2015 at 15:43, Juri Lelli <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Vincent,
>>>>>>>>>
>>>>>>>>> On 04/08/15 14:41, Vincent Guittot wrote:
>>>>>>>>>> Hi Juri,
>>>>>>>>>>
>>>>>>>>>> On 7 July 2015 at 20:24, Morten Rasmussen <[email protected]> wrote:
>>>>>>>>>>> From: Juri Lelli <[email protected]>
>>
>> [snip]
>>
>>>>>>>>>>> }
>>>>>>>>>>> @@ -4393,6 +4416,23 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>>>>>>>> if (!se) {
>>>>>>>>>>> sub_nr_running(rq, 1);
>>>>>>>>>>> update_rq_runnable_avg(rq, 1);
>>>>>>>>>>> + /*
>>>>>>>>>>> + * We want to trigger a freq switch request only for tasks that
>>>>>>>>>>> + * are going to sleep; this is because we get here also during
>>>>>>>>>>> + * load balancing, but in these cases it seems wise to trigger
>>>>>>>>>>> + * as single request after load balancing is done.
>>>>>>>>>>> + *
>>>>>>>>>>> + * Also, we add a margin (same ~20% used for the tipping point)
>>>>>>>>>>> + * to our request to provide some head room if p's utilization
>>>>>>>>>>> + * further increases.
>>>>>>>>>>> + */
>>>>>>>>>>> + if (sched_energy_freq() && task_sleep) {
>>>>>>>>>>> + unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>>>>>>>>>>> +
>>>>>>>>>>> + req_cap = req_cap * capacity_margin
>>>>>>>>>>> + >> SCHED_CAPACITY_SHIFT;
>>>>>>>>>>> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>>>>>>>>>>
>>>>>>>>>> Could you clarify why you want to trig a freq switch for tasks that
>>>>>>>>>> are going to sleep ?
>>>>>>>>>> The cpu_usage should not changed that much as the se_utilization of
>>>>>>>>>> the entity moves from utilization_load_avg to utilization_blocked_avg
>>>>>>>>>> of the rq and the usage and the freq are updated periodically.
>>>>>>>>>
>>>>>>>>> I think we still need to cover multiple back-to-back dequeues. Suppose
>>>>>>>>> that you have, let's say, 3 tasks that get enqueued at the same time.
>>>>>>>>> After some time the first one goes to sleep and its utilization, as you
>>>>>>>>> say, gets moved to utilization_blocked_avg. So, nothing changes, and
>>>>>>>>> the trigger is superfluous (even if no freq change I guess will be
>>>>>>>>> issued as we are already servicing enough capacity). However, after a
>>>>>>>>> while, the second task goes to sleep. Now we still use get_cpu_usage()
>>>>>>>>> and the first task contribution in utilization_blocked_avg should have
>>>>>>>>> been decayed by this time. Same thing may than happen for the third task
>>>>>>>>> as well. So, if we don't check if we need to scale down in
>>>>>>>>> dequeue_task_fair, it seems to me that we might miss some opportunities,
>>>>>>>>> as blocked contribution of other tasks could have been successively
>>>>>>>>> decayed.
>>>>>>>>>
>>>>>>>>> What you think?
>>>>>>>>
>>>>>>>> The tick is used to monitor such variation of the usage (in both way,
>>>>>>>> decay of the usage of sleeping tasks and increase of the usage of
>>>>>>>> running tasks). So in your example, if the duration between the sleep
>>>>>>>> of the 2 tasks is significant enough, the tick will handle this
>>>>>>>> variation
>>>>>>>>
>>>>>>>
>>>>>>> The tick is used to decide if we need to scale up (to max OPP for the
>>>>>>> time being), but we don't scale down. It makes more logical sense to
>>>>>>
>>>>>> why don't you want to check if you need to scale down ?
>>>>>>
>>>>>
>>>>> Well, because if I'm still executing something the cpu usage is only
>>>>> subject to raise.
>>>>
>>>> This is only true for system with NO HZ idle
>>>>
>>>
>>> Well, even with !NO_HZ_IDLE usage only decreases when cpu is idle. But,
>>
>> Well, thanks for this obvious statement that usage only decreases when
>> cpu is idle but my question has never been about usage variation of
>> idle/running cpu but about the tick.
>>
>
> I'm sorry if I sounded haughty to you, of course I didn't want too and
> I apologize for that. I just wanted to state the obvious to confirm
> myself that I understood your point, as I say below. :)
>
>>> I think I got your point; for !NO_HZ_IDLE configurations we might end
>>> up not scaling down frequency even if we have the tick running and
>>> the cpu is idle. I might need some more time to think this through, but
>>> it seems to me that we are still fine without an explicit trigger in
>>> task_tick_fair(); if we are running a !NO_HZ_IDLE system we are probably
>>> not so much concerned about power savings and still we react
>>> to tasks waking up, sleeping, leaving or moving around (which seems the
>>> real important events to me); OTOH, we might add that trigger, but this
>>> will generate unconditional checks at tick time for NO_HZ_IDLE
>>
>> That will be far less critical than unconditionally check during all
>> task wake up or sleep. A task that wakes up every 200us will generate
>> much more check in the wake up hot path of the cpu that is already
>> busy with another task
>>
>
> We have a throttling threshold for this kind of problem, which is the
> same as the transition latency exported by cpufreq drivers. Now, we
> still do some operations before checking that threshold, and the check
> itself might be too expensive. I guess I'll go back and profile it.

So, you take the cpufreq driver transition latency into account but
you should take the scheduler and its statistic responsiveness as the
transition latency can be small (less than fews hundreds of us) . As
an example, the variation of the usage of a cpu is always less 20%
over a period of 10ms and the variation is less than 10% if the usage
is already above 50% ( This is also weighted by the current
frequency). This variation is in the same range or lower that the
margin you take when setting the current usage of a cpu. As most of
system have a tick period less or equal to 10ms, i'm not sure that
it's worth to ensure a periodic check of the usage that is shorter
than a tick (or the transition latency is the latter is greater than
the tick)

>
>>> configurations, for a benefit that it seems to be still not completely
>>> clear.
>>>
>>>>>
>>>>>>> scale down at task deactivation, or wakeup after a long time, IMHO.
>>>>>>
>>>>>> But waking up or going to sleep don't have any impact on the usage of
>>>>>> a cpu. The only events that impact the cpu usage are:
>>>>>> -task migration,
>>>>>
>>>>> We explicitly cover this on load balancing paths.
>>
>> But task can migrate out of the load balancing; At wake up for example
>> and AFAICT, you don't use this event to notify the decrease of the
>> usage of the cpu and check if a new OPP will fit better with the new
>> usage.
>>
>
> If the task gets wakeup migrated its request will be issued as part
> of the enqueue on the new cpu. If the cpu it was previously running
> on is idle, it has already cleared its request, so it shouldn't
> need any notification.

Such migration occurs most of the time when the source CPU is already
used by another task so scheduler looks for an idle sibling cpu. In
this case, you can't use the idle event of the source cpu but you have
to use the migration event. I agree that you will catch the increase
of usage of the dest cpu in the enqueue but you don't catch the remove
of usage of the source cpu whereas this event is a good one IMHO to
check if we can change the OPP as the usage will have a abrupt change

>
>>>>>
>>>>>> -new task
>>>>>
>>>>> We cover this in enqueue_task_fair(), introducing a new flag.
>>>>>
>>>>>> -time that elapse which can be monitored by periodically checking the usage.
>>>>>
>>>>> Do you mean when a task utilization crosses some threshold
>>>>> related to the current OPP? If that is the case, we have a
>>>>> check in task_tick_fair().
>>>>>
>>>>>> -and for nohz system when cpu enter or leave idle state
>>>>>>
>>>>>
>>>>> We address this in dequeue_task_fair(). In particular, if
>>>>> the cpu is going to be idle we don't trigger any change as
>>>>> it seems not always wise to wake up a thread to just change
>>>>> the OPP and the go idle; some platforms might require this
>>>>> behaviour anyway, but it probably more cpuidle/fw related?
>>>>
>>>> I would say that it's interesting to notifiy sched-dvfs that a cpu
>>>> becomes idle because we could decrease the opp of a cluster of cpus
>>>> that share the same clock if this cpu is the one that requires the max
>>>> capacity of the cluster (and other cpus are still running).
>>>>
>>>
>>> Well, we reset the capacity request of the cpu that is going idle.
>>
>> And i'm fine with the fact that you use the cpu idle event
>>
>
> Ok :).
>
>>> The idea is that the next event on one of the other related cpus
>>> will update the cluster freq correctly. If any other cpu in the
>>> cluster is running something we keep the same frequency until
>>> the task running on that cpu goes to sleep; this seems fine to
>>> me because that task might end up being heavy and we saved a
>>> back to back lower to higher OPP switch; if the task is instead
>>> light it will probably be dequeued pretty soon, and at that time
>>> we switch to a lower OPP (since we cleared the idle cpu request
>>> before). Also, if the other cpus in the cluster are all idle
>>> we'll most probably enter an idle state, so no freq switch is
>>> most likely required.
>>>
>>>>>
>>>>> I would also add:
>>>>>
>>>>> - task is going to die
>>>>>
>>>>> We address this in dequeue as well, as its contribution is
>>>>> removed from usage (mod Yuyang's patches).
>>>>>
>>>>>> waking up and going to sleep events doesn't give any useful
>>>>>> information and using them to trig the monitoring of the usage
>>>>>> variation doesn't give you a predictable/periodic update of it whereas
>>>>>> the tick will
>>>>>>
>>>>>
>>>>> So, one key point of this solution is to get away as much
>>>>> as we can from periodic updates/sampling and move towards a
>>>>> (fully) event driven approach. The event logically associated
>>>>> to task_tick_fair() is when we realize that a task is going
>>>>> to saturate the current capacity; in this case we trigger a
>>>>> freq switch to an higher capacity. Also, if we never react
>>>>> to normal wakeups (as I understand you are proposing) we might
>>>>> miss some chances to adapt quickly enough. As an example, if
>>>>> you have a big task that suddenly goes to sleep, and sleeps
>>>>> until its decayed utilization goes almost to zero; when it
>>>>> wakes up, if we don't have a trigger in enqueue_task_fair(),
>>
>> I'm not against having a trigger in enqueue, i'm against bindly
>> checking all task wake up in order to be sure to catch the useful
>> event like the cpu leave idle event of your example
>>
>
> Right, agreed.
>
>>>>> we'll have to wait until the next tick to select an appropriate
>>>>> (low) OPP.
>>>>
>>>> I assume that the cpu is idle in this case. This situation only
>>>> happens on Nohz idle system because tick is disable and you have to
>>>> update statistics when leaving idle as it is done for the jiffies or
>>>> the cpu_load array. So you should track cpu enter/leave idle (for nohz
>>>> system only) instead of tracking all tasks wake up/sleep events.
>>>>
>>>
>>> I think I already replied to this in what above. Did I? :)
>>
>> In fact, It was not a question, i just state that using all wake up /
>> sleep events to be sure to trig the check of cpu capacity when the cpu
>> leave an idle phase (and especially a long one like in your example
>> above), is wrong. You have to use the leave idle event instead of
>> checking all wake up events to be sure to catch the right one.
>
> It seems to me that we need to catch enqueue events even for cpus
> that are already busy. Don't we have to see if we have to scale
> up in case a task is enqueued after a wakeup or fork on a cpu that is
> already running some other task?

The enqueue of a waking up task doesn't change the usage of a cpu so
there is no reason to use this event instead of a periodic one. At the
opposite, fork, migrated and new tasks will change the usage so it's
worth to use them

>
> Then, I agree that some other conditions might be added to check
> that we are not over triggering the thing. I need to think about
> those more.

ok

>
>> You say
>> that you want to get away as much as possible the periodic
>> updates/sampling but it's exactly what you do with these 2 events.
>> Using them only enables you to periodically check if the capacity has
>> changed since the last time like the tick already does. But instead of
>> using a periodic and controlled event, you use these random (with
>> regards to capacity evolution) and uncontrolled events in order to
>> catch useful change. As an example, If a cpu run a task and there is a
>> short running task that wakes up every 100us for running 100us, you
>> will call cpufreq_sched_set_cap, 5000 times per second for no good
>> reason as you already have the tick to periodically check the
>> evolution of the usage.
>>
>
> In this case it's the workload that is inherently periodic, so we
> have periodic checks. Anyway, this is most probably one bailout
> condition we need to add, since if both task are already running on
> the same cpu, I guess, its utilization shouldn't change that much.

IMHO, the tick should be enough to periodically check if the change of
the usage value require an OPP update with regards of the rate of
change of the usage.

>
>>>
>>>> So you can either use update_cpu_load_nohz like it is already done for
>>>> cpu_load array
>>>> or you should use some conditions like below if you want to stay in
>>>> enqueue/dequeue_task_fair but task wake up or sleep event are not the
>>>> right condition
>>>> if (!(flags & ENQUEUE_WAKEUP) || rq->nr_running == 1 ) in enqueue_task_fair
>>>> and
>>>> if (!task_sleep || rq->nr_running == 0) in dequeue_task_fair
>>>>
>>>> We can probably optimized by using rq->cfs.h_nr_running instead of
>>>> rq->nr_running as only cfs tasks really modifies the usage
>>>>
>>>
>>> I already filter out enqueues/dequeues that comes from load balancing;
>>> and I use cfs.nr_running because, as you say, we currently work with CFS
>>> tasks only.
>>
>> But not for the enqueue where you should use it instead of all wake up events.
>>
>> Just to be clear: Using all enqueue/dequeue events (including task
>> wake up and sleep) to check a change of the usage of a cpu was doing a
>> lot of sense when mike has sent his v3 of the scheduler driven cpu
>> frequency selection because the usage was not accounting the blocked
>> tasks at that time so it was changing for all enqueue/dequeue events.
>> But this doesn't make sense in this patchset that account the blocked
>> tasks in the usage of the cpu and more generally now that yuyang's
>> patch set has been accepted.
>>
>
> As above, I agree that we are most probably missing some optimizations;
> that's what I'm going to look at again.

Thanks,
Vincent

>
> Thanks,
>
> - Juri
>
>> Regards,
>> Vincent
>>
>>>
>>> Thanks,
>>>
>>> - Juri
>>>
>>>> Regards,
>>>> Vincent
>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> - Juri
>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> - Juri
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Vincent
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> - Juri
>>>>>>>>>
>>>>>>>>>> It should be the same for the wake up of a task in enqueue_task_fair
>>>>>>>>>> above, even if it's less obvious for this latter use case because the
>>>>>>>>>> cpu might wake up from a long idle phase during which its
>>>>>>>>>> utilization_blocked_avg has not been updated. Nevertheless, a trig of
>>>>>>>>>> the freq switch at wake up of the cpu once its usage has been updated
>>>>>>>>>> should do the job.
>>>>>>>>>>
>>>>>>>>>> So tick, migration of tasks, new tasks, entering/leaving idle state of
>>>>>>>>>> cpu should be enough to trig freq switch
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Vincent
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> + }
>>>>>>>>>>> }
>>>>>>>>>>> hrtick_update(rq);
>>>>>>>>>>> }
>>>>>>>>>>> @@ -4959,8 +4999,6 @@ static int find_new_capacity(struct energy_env *eenv,
>>>>>>>>>>> return idx;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> -static unsigned int capacity_margin = 1280; /* ~20% margin */
>>>>>>>>>>> -
>>>>>>>>>>> static bool cpu_overutilized(int cpu)
>>>>>>>>>>> {
>>>>>>>>>>> return (capacity_of(cpu) * 1024) <
>>>>>>>>>>> --
>>>>>>>>>>> 1.9.1
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

2015-08-17 11:15:32

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 45/46] sched/cpufreq_sched: modify pcpu_capacity handling

Hi Peter,

On 16/08/15 21:35, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:24:28PM +0100, Morten Rasmussen wrote:
>> From: Juri Lelli <[email protected]>
>>
>> Use the cpu argument of cpufreq_sched_set_cap() to handle per_cpu writes,
>> as the thing can be called remotely (e.g., from load balacing code).
>>
>> cc: Ingo Molnar <[email protected]>
>> cc: Peter Zijlstra <[email protected]>
>>
>> Signed-off-by: Juri Lelli <[email protected]>
>> ---
>> kernel/sched/cpufreq_sched.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
>> index 06ff183..b81ac779 100644
>> --- a/kernel/sched/cpufreq_sched.c
>> +++ b/kernel/sched/cpufreq_sched.c
>> @@ -151,7 +151,7 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
>> unsigned long capacity_max = 0;
>>
>> /* update per-cpu capacity request */
>> - __this_cpu_write(pcpu_capacity, capacity);
>> + per_cpu(pcpu_capacity, cpu) = capacity;
>>
>> policy = cpufreq_cpu_get(cpu);
>> if (IS_ERR_OR_NULL(policy)) {
>
> Uhm,.. this function seems to hard assume its called for the local CPU.
> It will only use the irq_queue_work_on() if the cpufreq thing requires
> the thread, otherwise it will call the method directly on the calling
> cpu.
>

True, but we still retrieve policy from cpu passed as argument; and
then we use policy to request a freq transition, that should end up
updating the right cpu.

Thanks,

- Juri

2015-08-17 12:19:18

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 40/46] sched/cpufreq_sched: compute freq_new based on capacity_orig_of()

On 16/08/15 21:24, Peter Zijlstra wrote:
> On Sat, Aug 15, 2015 at 09:03:33PM -0700, Michael Turquette wrote:
>> Quoting Peter Zijlstra (2015-08-15 05:46:38)
>>> On Tue, Jul 07, 2015 at 07:24:23PM +0100, Morten Rasmussen wrote:
>>>> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
>>>> index 2968f3a..7071528 100644
>>>> --- a/kernel/sched/cpufreq_sched.c
>>>> +++ b/kernel/sched/cpufreq_sched.c
>>>> @@ -184,7 +184,7 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
>>>> goto out;
>>>>
>>>> /* Convert the new maximum capacity request into a cpu frequency */
>>>> - freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
>>>> + freq_new = (capacity * policy->max) / capacity_orig_of(cpu);
>>>>
>>>> /* No change in frequency? Bail and return current capacity. */
>>>> if (freq_new == policy->cur)
>>>
>>> Can't we avoid exporting that lot by simply passing in the right values
>>> to begin with?
>>
>> By "right value" do you mean, "pass the frequency from cfs"?
>
> Nah, just maybe: (capacity << SCHED_CAPACITY_SHIFT) / capacity_orig_of()
> such that you don't have to export that knowledge to this thing.
>

Oh, right. I guess we can just go with something like:

req_cap = get_cpu_usage(cpu) * capacity_margin / capacity_orig_of(cpu);

on fair.c side and switch back to

freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;

on cpufreq_sched.c side. That saves us exporting capacity_orig_of().

Thanks,

- Juri

2015-08-17 13:10:55

by Leo Yan

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 25/46] sched: Add over-utilization/tipping point indicator

On Tue, Jul 07, 2015 at 07:24:08PM +0100, Morten Rasmussen wrote:
> Energy-aware scheduling is only meant to be active while the system is
> _not_ over-utilized. That is, there are spare cycles available to shift
> tasks around based on their actual utilization to get a more
> energy-efficient task distribution without depriving any tasks. When
> above the tipping point task placement is done the traditional way,
> spreading the tasks across as many cpus as possible based on priority
> scaled load to preserve smp_nice.
>
> The over-utilization condition is conservatively chosen to indicate
> over-utilization as soon as one cpu is fully utilized at it's highest
> frequency. We don't consider groups as lumping usage and capacity
> together for a group of cpus may hide the fact that one or more cpus in
> the group are over-utilized while group-siblings are partially idle. The
> tasks could be served better if moved to another group with completely
> idle cpus. This is particularly problematic if some cpus have a
> significantly reduced capacity due to RT/IRQ pressure or if the system
> has cpus of different capacity (e.g. ARM big.LITTLE).
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 35 +++++++++++++++++++++++++++++++----
> kernel/sched/sched.h | 3 +++
> 2 files changed, 34 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bf1d34c..99e43ee 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4281,6 +4281,8 @@ static inline void hrtick_update(struct rq *rq)
> }
> #endif
>
> +static bool cpu_overutilized(int cpu);
> +
> /*
> * The enqueue_task method is called before nr_running is
> * increased. Here we update the fair scheduling stats and
> @@ -4291,6 +4293,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> {
> struct cfs_rq *cfs_rq;
> struct sched_entity *se = &p->se;
> + int task_new = !(flags & ENQUEUE_WAKEUP);
>
> for_each_sched_entity(se) {
> if (se->on_rq)
> @@ -4325,6 +4328,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (!se) {
> update_rq_runnable_avg(rq, rq->nr_running);
> add_nr_running(rq, 1);
> + if (!task_new && !rq->rd->overutilized &&
> + cpu_overutilized(rq->cpu))
> + rq->rd->overutilized = true;

Maybe this is a stupid question, the root domain's overutilized value is
shared by all CPUs; so just curious if need lock to protect this
variable or use atmomic type for it?

[...]

Thanks,
Leo Yan

2015-08-17 15:59:37

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 01/46] arm: Frequency invariant scheduler load-tracking support

Hi Vincent,

On 03/08/15 10:22, Vincent Guittot wrote:
> Hi Morten,
>
>
> On 7 July 2015 at 20:23, Morten Rasmussen <[email protected]> wrote:
>> From: Morten Rasmussen <[email protected]>
>>
>
> [snip]
>
>> -
>> #endif
>> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
>> index 08b7847..9c09e6e 100644
>> --- a/arch/arm/kernel/topology.c
>> +++ b/arch/arm/kernel/topology.c
>> @@ -169,6 +169,23 @@ static void update_cpu_capacity(unsigned int cpu)
>> cpu, arch_scale_cpu_capacity(NULL, cpu));
>> }
>>
>> +/*
>> + * Scheduler load-tracking scale-invariance
>> + *
>> + * Provides the scheduler with a scale-invariance correction factor that
>> + * compensates for frequency scaling (arch_scale_freq_capacity()). The scaling
>> + * factor is updated in smp.c
>> + */
>> +unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
>> +{
>> + unsigned long curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu));
>
> access to cpu_freq_capacity to should be put under #ifdef CONFIG_CPU_FREQ.

True, in case we keep everything related to frequency scaling under
CONFIG_CPU_FREQ, then we should also put the
#define arch_scale_freq_capacity arm_arch_scale_freq_capacity
in topology.h under CONFIG_CPU_FREQ.

> Why haven't you moved arm_arch_scale_freq_capacity in smp.c as
> everything else for frequency in-variance is already in this file ?
> This should also enable you to remove
> DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity); from topology.h

True, arm_arch_scale_freq_capacity() should be in smp.c so we don't have
to export cpu_freq_capacity which btw. does not have to be a atomic.

OTHA, we could also put the whole 'Frequency Invariance Engine' into
cpufreq.c as cpufreq_scale_freq_capacity() and let the interested ARCH
do a
#include <linux/cpufreq.h>
#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
This would allow us to implement this 'Frequency Invariance Engine'
only once and not couple of times for different ARCHs.

Something like this: (only compile tested on ARM64 w/ and w/o
CONFIG_CPU_FREQ)

diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index a8a69a1f3c4c..f25341170749 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -4,6 +4,7 @@
#ifdef CONFIG_SMP

#include <linux/cpumask.h>
+#include <linux/cpufreq.h>

struct cpu_topology {
int thread_id;
@@ -26,9 +27,7 @@ const struct cpumask *cpu_coregroup_mask(int cpu);

struct sched_domain;
#ifdef CONFIG_CPU_FREQ
-#define arch_scale_freq_capacity arm_arch_scale_freq_capacity
-extern
-unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
#endif

#define arch_scale_cpu_capacity arm_arch_scale_cpu_capacity
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index b612411655f9..97f4391d0f55 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2400,6 +2400,12 @@ static int cpufreq_cpu_callback(struct notifier_block *nfb,
return NOTIFY_OK;
}

+unsigned long cpufreq_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+ return 1024; /* Implementation missing !!! */
+}
+EXPORT_SYMBOL(cpufreq_scale_freq_capacity);
+
static struct notifier_block __refdata cpufreq_cpu_notifier = {
.notifier_call = cpufreq_cpu_callback,
};
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 29ad97c34fd5..a4516d53a3cc 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -604,4 +604,11 @@ unsigned int cpufreq_generic_get(unsigned int cpu);
int cpufreq_generic_init(struct cpufreq_policy *policy,
struct cpufreq_frequency_table *table,
unsigned int transition_latency);
+/*
+ * In case we delete unused sd pointer in arch_scale_freq_capacity()
+ * [kernel/sched/sched.h] this can become:
+ * unsigned long cpufreq_scale_freq_capacity(int cpu);
+ */
+struct sched_domain;
+unsigned long cpufreq_scale_freq_capacity(struct sched_domain *sd, int cpu);

[...]

2015-08-17 16:24:04

by Leo Yan

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 32/46] sched: Energy-aware wake-up task placement

Hi Morten,

On Tue, Jul 07, 2015 at 07:24:15PM +0100, Morten Rasmussen wrote:
> Let available compute capacity and estimated energy impact select
> wake-up target cpu when energy-aware scheduling is enabled and the
> system in not over-utilized (above the tipping point).
>
> energy_aware_wake_cpu() attempts to find group of cpus with sufficient
> compute capacity to accommodate the task and find a cpu with enough spare
> capacity to handle the task within that group. Preference is given to
> cpus with enough spare capacity at the current OPP. Finally, the energy
> impact of the new target and the previous task cpu is compared to select
> the wake-up target cpu.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 84 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0f7dbda4..01f7337 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5427,6 +5427,86 @@ static int select_idle_sibling(struct task_struct *p, int target)
> return target;
> }
>
> +static int energy_aware_wake_cpu(struct task_struct *p, int target)
> +{
> + struct sched_domain *sd;
> + struct sched_group *sg, *sg_target;
> + int target_max_cap = INT_MAX;
> + int target_cpu = task_cpu(p);
> + int i;
> +
> + sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
> +
> + if (!sd)
> + return target;
> +
> + sg = sd->groups;
> + sg_target = sg;
> +
> + /*
> + * Find group with sufficient capacity. We only get here if no cpu is
> + * overutilized. We may end up overutilizing a cpu by adding the task,
> + * but that should not be any worse than select_idle_sibling().
> + * load_balance() should sort it out later as we get above the tipping
> + * point.
> + */
> + do {
> + /* Assuming all cpus are the same in group */
> + int max_cap_cpu = group_first_cpu(sg);
> +
> + /*
> + * Assume smaller max capacity means more energy-efficient.
> + * Ideally we should query the energy model for the right
> + * answer but it easily ends up in an exhaustive search.
> + */
> + if (capacity_of(max_cap_cpu) < target_max_cap &&
> + task_fits_capacity(p, max_cap_cpu)) {
> + sg_target = sg;
> + target_max_cap = capacity_of(max_cap_cpu);
> + }
> + } while (sg = sg->next, sg != sd->groups);
> +
> + /* Find cpu with sufficient capacity */
> + for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
> + /*
> + * p's blocked utilization is still accounted for on prev_cpu
> + * so prev_cpu will receive a negative bias due the double
> + * accouting. However, the blocked utilization may be zero.
> + */
> + int new_usage = get_cpu_usage(i) + task_utilization(p);
> +
> + if (new_usage > capacity_orig_of(i))
> + continue;
> +
> + if (new_usage < capacity_curr_of(i)) {
> + target_cpu = i;
> + if (cpu_rq(i)->nr_running)
> + break;
> + }
> +
> + /* cpu has capacity at higher OPP, keep it as fallback */
> + if (target_cpu == task_cpu(p))
> + target_cpu = i;

If CPU's current capacity cannot meet requirement, why not stay
task on prev CPU so have chance to use hot cache? Or the purpose
is to place tasks on the first CPU in schedule group as possible?

> + }
> +
> + if (target_cpu != task_cpu(p)) {
> + struct energy_env eenv = {
> + .usage_delta = task_utilization(p),
> + .src_cpu = task_cpu(p),
> + .dst_cpu = target_cpu,
> + };
> +
> + /* Not enough spare capacity on previous cpu */
> + if (cpu_overutilized(task_cpu(p)))
> + return target_cpu;
> +
> + if (energy_diff(&eenv) >= 0)
> + return task_cpu(p);
> + }
> +
> + return target_cpu;
> +}
> +
> /*
> * select_task_rq_fair: Select target runqueue for the waking task in domains
> * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -5479,7 +5559,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> prev_cpu = cpu;
>
> if (sd_flag & SD_BALANCE_WAKE && want_sibling) {
> - new_cpu = select_idle_sibling(p, prev_cpu);
> + if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
> + new_cpu = energy_aware_wake_cpu(p, prev_cpu);
> + else
> + new_cpu = select_idle_sibling(p, prev_cpu);
> goto unlock;
> }
>
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-08-17 17:55:19

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

On Saturday, August 15, 2015 02:48:17 PM Peter Zijlstra wrote:
>
> So this OPP thing, I think that got mentioned once earlier in this patch
> set, wth is that?

OPP stands for Operating Performance Points. It is a library for representing
working clock-voltage combinations.

Described in Documentation/power/opp.txt (but may be outdated as there's some
work on it going on).

CC Viresh who's working on it right now.

Thanks,
Rafael

2015-08-20 19:19:39

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFCv5, 18/46] arm: topology: Define TC2 energy and provide it to the scheduler

Hi Leo,

On 08/17/2015 02:19 AM, Leo Yan wrote:

[...]

>> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
>> index b35d3e5..bbe20c7 100644
>> --- a/arch/arm/kernel/topology.c
>> +++ b/arch/arm/kernel/topology.c
>> @@ -274,6 +274,119 @@ void store_cpu_topology(unsigned int cpuid)
>> cpu_topology[cpuid].socket_id, mpidr);
>> }
>>
>> +/*
>> + * ARM TC2 specific energy cost model data. There are no unit requirements for
>> + * the data. Data can be normalized to any reference point, but the
>> + * normalization must be consistent. That is, one bogo-joule/watt must be the
>> + * same quantity for all data, but we don't care what it is.
>> + */
>> +static struct idle_state idle_states_cluster_a7[] = {
>> + { .power = 25 }, /* WFI */
>
> This state is confused. Is this state corresponding to all CPUs have been
> powered off but L2 cache RAM array and SCU are still power on?

This is what we refer to as 'active idle'. All cpus of the cluster are
in WFI but the cluster is not in cluster-sleep yet. We measure the
corresponding energy value by disabling the 'cluster-sleep-[b,l]' state
and let the cpus do nothing for a specific time period.
>
>> + { .power = 10 }, /* cluster-sleep-l */
>
> Is this status means all CPU and cluster have been powered off, if so
> then it will have no power consumption anymore...

The cluster is in cluster-sleep but there is still some peripheral
related to the cluster active which explains this power value we
calculated from the pre/post energy value diff (by reading the vexpress
energy counter for this cluster) and the time period we were idling on
this cluster.

>
>> + };
>> +
>> +static struct idle_state idle_states_cluster_a15[] = {
>> + { .power = 70 }, /* WFI */
>> + { .power = 25 }, /* cluster-sleep-b */
>> + };
>> +
>> +static struct capacity_state cap_states_cluster_a7[] = {
>> + /* Cluster only power */
>> + { .cap = 150, .power = 2967, }, /* 350 MHz */
>
> For cluster level's capacity, does it mean need run benchmark on all
> CPUs within cluster?

We run an 'always running thread per cpu' workload on {n, n-1, ..., 1}
cpus of a cluster (hotplug-out the other cpus) for a specific time
period. Then we calculate the cluster power value by extrapolating from
the power values for the {n, n-1, ... 1} test runs and use the delta
between a n and n+1 test run value as core power value.

[...]

>> +static struct idle_state idle_states_core_a7[] = {
>> + { .power = 0 }, /* WFI */
>
> Should have two idle states for CPU level (WFI and CPU's power off)?

The ARM TC2 platform has only 2 idle states, there is no 'cpu power off':

# cat /sys/devices/system/cpu/cpu[0,2]/cpuidle/state*/name
WFI
cluster-sleep-b
WFI
cluster-sleep-l

[...]

2015-08-25 10:45:04

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFCv5 PATCH 38/46] sched: scheduler-driven cpu frequency selection

Hi Peter,

On 15/08/15 14:05, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:24:21PM +0100, Morten Rasmussen wrote:
>> +void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
>> +{
>> + unsigned int freq_new, cpu_tmp;
>> + struct cpufreq_policy *policy;
>> + struct gov_data *gd;
>> + unsigned long capacity_max = 0;
>> +
>> + /* update per-cpu capacity request */
>> + __this_cpu_write(pcpu_capacity, capacity);
>> +
>> + policy = cpufreq_cpu_get(cpu);
>> + if (IS_ERR_OR_NULL(policy)) {
>> + return;
>> + }
>> +
>> + if (!policy->governor_data)
>> + goto out;
>> +
>> + gd = policy->governor_data;
>> +
>> + /* bail early if we are throttled */
>> + if (ktime_before(ktime_get(), gd->throttle))
>> + goto out;
>
> Isn't this the wrong place to throttle? Suppose you're getting multiple
> new tasks placed on this CPU, the first one would trigger this callback
> and start increasing freq..
>
> While we're still changing freq. (and therefore throttled), another task
> comes in which would again raise the freq.
>
> With this scheme you loose the latter freq. change and will not
> re-evaluate.
>

The way the policy is implemented, you should not have this problem.
For new tasks you actually jump to max freq, as new tasks util gets
initialized to 1024. For load balancing migrations we wait until
all the tasks are migrated and then we trigger an update.

> Any scheme that limits the callbacks to the actual hardware will have to
> buffer requests and once the hardware returns (be it through an
> interrupt or timeout) issue the latest request.
>

But, it is true that if the above events happened the other way around
(we trigger an update after load balancing and a new task arrives), we
may miss the opportunity to jump to max with the new task. In my mind
this is probably not a big deal, as we'll have a tick pretty soon that
will fix things anyway (saving us some complexity in the backend).

What you think?

Thanks,

- Juri