Several techniques for saving energy through various scheduler
modifications have been proposed in the past; however, most of these
techniques have not been universally beneficial for all use-cases and
platforms. For example, consolidating tasks on fewer cpus is an
effective way to save energy on some platforms, while it may make
things worse on others.
This proposal, which is inspired by the Ksummit workshop discussions in
2013 [1], takes a different approach by using a (relatively) simple
platform energy cost model to guide scheduling decisions. Provided with
platform-specific cost data, the model can give an estimate of the
energy implications of scheduling decisions. So instead of blindly
applying scheduling techniques that may or may not work for the current
use-case, the scheduler can make informed energy-aware decisions. We
believe this approach provides a methodology that can be adapted to any
platform, including heterogeneous systems such as ARM big.LITTLE. The
model considers cpus only, i.e. no peripherals, GPU or memory. Model
data includes power consumption at each P-state and C-state.
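To make the idea concrete, here is a deliberately naive userspace sketch
of such an estimate. The struct names, fields and numbers are
illustrative assumptions only, not the model data or data structures
introduced later in this series:

/*
 * Illustration only (made-up numbers, not TC2 data): energy for one cpu
 * over a window is roughly busy time at the P-state's busy power plus
 * idle time at the C-state's idle power.
 */
#include <stdio.h>

struct capacity_state { int cap; int power; }; /* P-state: capacity, busy power (mW) */
struct idle_state { int power; };              /* C-state: idle power (mW) */

static int cpu_energy_uj(const struct capacity_state *cs,
                         const struct idle_state *is,
                         int busy_ms, int idle_ms)
{
	return cs->power * busy_ms + is->power * idle_ms; /* mW * ms = uJ */
}

int main(void)
{
	struct capacity_state cs = { .cap = 1024, .power = 600 };
	struct idle_state is = { .power = 5 };

	/* 40ms busy, 60ms idle in a 100ms window. */
	printf("estimated energy: %d uJ\n", cpu_energy_uj(&cs, &is, 40, 60));
	return 0;
}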
This is an RFC and there are some loose ends that have not been
addressed here or in the code yet. The model and its infrastructure are
in place in the scheduler and are being used for load-balancing
decisions. The energy model data is hardcoded, the load-balancing
heuristics are still under development, and there are some limitations
still to be addressed. However, the main idea is presented here: the
use of an energy model for scheduling decisions.
RFCv3 is a consolidation of the latest energy model related patches and
the previously posted patch sets related to capacity and utilization
tracking [2][3] to show where we are heading. [2] and [3] have been
rebased onto v3.19-rc7 with a few minor modifications. Large parts of
the energy model code and its use in the scheduler have been rewritten
and simplified. The patch set consists of three main parts (more
details further down):
Patch 1-11: sched: consolidation of CPU capacity and usage [2] (rebase)
Patch 12-19: sched: frequency and cpu invariant per-entity load-tracking
and other load-tracking bits [3] (rebase)
Patch 20-48: sched: Energy cost model for energy-aware scheduling (RFCv3)
Test results for ARM TC2 (2xA15+3xA7) with cpufreq enabled:

sysbench: Single task running for 3 seconds.
rt-app [4]: 5 medium (~50%) periodic tasks
rt-app [4]: 2 light (~10%) periodic tasks

Average numbers for 20 runs per test.

Energy          sysbench    rt-app medium    rt-app light
Mainline        100*        100              100
EA              279         88               63
* Sensitive to task placement on big.LITTLE. Mainline may put it on
either cpu due to its lack of compute capacity awareness, while EA
consistently puts heavy tasks on big cpus. The EA energy increase came
with a 2.65x _increase_ in performance (throughput).
[1] http://etherpad.osuosl.org/energy-aware-scheduling-ks-2013 (search
for 'cost')
[2] https://lkml.org/lkml/2015/1/15/136
[3] https://lkml.org/lkml/2014/12/2/328
[4] https://wiki.linaro.org/WorkingGroups/PowerManagement/Resources/Tools/WorkloadGen
Changes:
RFCv3:
'sched: Energy cost model for energy-aware scheduling' changes:
RFCv2->RFCv3:
(1) Remove frequency- and cpu-invariant load/utilization patches since
this is now provided by [2] and [3].
(2) Remove system-wide sched_energy to make the code easier to
understand, i.e. single socket systems are not supported (yet).
(3) Remove wake-up energy. Extra complexity that wasn't fully justified.
Idle-state awareness introduced recently in mainline may be
sufficient.
(4) Remove procfs interface for energy data to make the patch-set
smaller.
(5) Rework energy-aware load balancing code.
In RFCv2 we only attempted to pick the source cpu in an energy-aware
fashion. In addition to finding the most energy-inefficient source
cpu during the load-balancing action, RFCv3 also introduces
energy-aware moving of tasks between cpus, as well as support for
managing the 'tipping point' - the threshold at which we switch away
from energy-model-based load balancing to conventional load
balancing.
'sched: frequency and cpu invariant per-entity load-tracking and other
load-tracking bits' [3]
(1) Remove blocked load from load tracking.
(2) Remove cpu-invariant load tracking.
Both (1) and (2) require changes to the existing load-balance code
which haven't been done yet. These are therefore left out until that
has been addressed.
(3) One patch renamed.
'sched: consolidation of CPU capacity and usage' [2]
(1) Fixed conflict when rebasing to v3.19-rc7.
(2) One patch subject changed slightly.
RFC v2:
- Extended documentation:
- Cover the energy model in greater detail.
- Recipe for deriving platform energy model.
- Replaced Kconfig with sched feature (jump label).
- Add unweighted load tracking.
- Use unweighted load as task/cpu utilization.
- Support for multiple idle states per sched_group. cpuidle integration
still missing.
- Changed energy aware functionality in select_idle_sibling().
- Experimental energy aware load-balance support.
Dietmar Eggemann (17):
sched: Make load tracking frequency scale-invariant
sched: Make usage tracking cpu scale-invariant
arm: vexpress: Add CPU clock-frequencies to TC2 device-tree
arm: Cpu invariant scheduler load-tracking support
sched: Get rid of scaling usage by cpu_capacity_orig
sched: Introduce energy data structures
sched: Allocate and initialize energy data structures
arm: topology: Define TC2 energy and provide it to the scheduler
sched: Infrastructure to query if load balancing is energy-aware
sched: Introduce energy awareness into update_sg_lb_stats
sched: Introduce energy awareness into update_sd_lb_stats
sched: Introduce energy awareness into find_busiest_group
sched: Introduce energy awareness into find_busiest_queue
sched: Introduce energy awareness into detach_tasks
sched: Tipping point from energy-aware to conventional load balancing
sched: Skip cpu as lb src which has one task and capacity gte the dst
cpu
sched: Turn off fast idling of cpus on a partially loaded system
Morten Rasmussen (23):
sched: Track group sched_entity usage contributions
sched: Make sched entity usage tracking frequency-invariant
cpufreq: Architecture specific callback for frequency changes
arm: Frequency invariant scheduler load-tracking support
sched: Track blocked utilization contributions
sched: Include blocked utilization in usage tracking
sched: Documentation for scheduler energy cost model
sched: Make energy awareness a sched feature
sched: Introduce SD_SHARE_CAP_STATES sched_domain flag
sched: Compute cpu capacity available at current frequency
sched: Relocated get_cpu_usage()
sched: Use capacity_curr to cap utilization in get_cpu_usage()
sched: Highest energy aware balancing sched_domain level pointer
sched: Calculate energy consumption of sched_group
sched: Extend sched_group_energy to test load-balancing decisions
sched: Estimate energy impact of scheduling decisions
sched: Energy-aware wake-up task placement
sched: Bias new task wakeups towards higher capacity cpus
sched, cpuidle: Track cpuidle state index in the scheduler
sched: Count number of shallower idle-states in struct
sched_group_energy
sched: Determine the current sched_group idle-state
sched: Enable active migration for cpus of lower capacity
sched: Disable energy-unfriendly nohz kicks
Vincent Guittot (8):
sched: add utilization_avg_contrib
sched: remove frequency scaling from cpu_capacity
sched: make scale_rt invariant with frequency
sched: add per rq cpu_capacity_orig
sched: get CPU's usage statistic
sched: replace capacity_factor by usage
sched: add SD_PREFER_SIBLING for SMT level
sched: move cfs task on a CPU with higher capacity
Documentation/scheduler/sched-energy.txt | 359 +++++++++++
arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 5 +
arch/arm/kernel/topology.c | 218 +++++--
drivers/cpufreq/cpufreq.c | 10 +-
include/linux/sched.h | 43 +-
kernel/sched/core.c | 119 +++-
kernel/sched/debug.c | 12 +-
kernel/sched/fair.c | 935 ++++++++++++++++++++++++-----
kernel/sched/features.h | 6 +
kernel/sched/idle.c | 2 +
kernel/sched/sched.h | 75 ++-
11 files changed, 1559 insertions(+), 225 deletions(-)
create mode 100644 Documentation/scheduler/sched-energy.txt
--
1.9.1
From: Vincent Guittot <[email protected]>
Add new statistics which reflect the average time a task is running on the CPU
and the sum of these running times for the tasks on a runqueue. The latter is
named utilization_load_avg.

This patch is based on the usage metric that was proposed in the first
versions of the per-entity load tracking patchset by Paul Turner
<[email protected]> but was removed afterwards. This version differs from
the original one in that it is not linked to task_group.

The rq's utilization_load_avg will be used to check whether a rq is overloaded
or not, instead of trying to compute how many tasks a group of CPUs can handle.

Rename runnable_avg_period into avg_period as it is now used with both
runnable_avg_sum and running_avg_sum.

Add some descriptions of the variables to explain their differences.
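For illustration only, a userspace sketch of how the two contributions
relate to the tracked sums (simplified integer math and hypothetical
numbers; not the kernel code):

#include <stdio.h>

#define SCHED_LOAD_SCALE 1024

/* Simplified view of the two contributions derived from the tracked sums. */
struct sched_avg_sketch {
	unsigned int runnable_avg_sum; /* time runnable (running + waiting on rq) */
	unsigned int running_avg_sum;  /* time actually running on the cpu */
	unsigned int avg_period;       /* total tracked period */
};

static unsigned long load_contrib(const struct sched_avg_sketch *sa,
				  unsigned long weight)
{
	return sa->runnable_avg_sum * weight / (sa->avg_period + 1);
}

static unsigned long utilization_contrib(const struct sched_avg_sketch *sa)
{
	return sa->running_avg_sum * SCHED_LOAD_SCALE / (sa->avg_period + 1);
}

int main(void)
{
	/* A task runnable 60% of the time but actually running only 30%. */
	struct sched_avg_sketch sa = { .runnable_avg_sum = 600,
				       .running_avg_sum = 300,
				       .avg_period = 1000 };

	printf("load_avg_contrib (weight 1024): %lu\n", load_contrib(&sa, 1024));
	printf("utilization_avg_contrib:        %lu\n", utilization_contrib(&sa));
	return 0;
}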
cc: Paul Turner <[email protected]>
cc: Ben Segall <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
Acked-by: Morten Rasmussen <[email protected]>
---
include/linux/sched.h | 21 ++++++++++++---
kernel/sched/debug.c | 10 ++++---
kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++-----------
kernel/sched/sched.h | 8 +++++-
4 files changed, 89 insertions(+), 24 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef..e220a91 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1111,15 +1111,28 @@ struct load_weight {
};
struct sched_avg {
+ u64 last_runnable_update;
+ s64 decay_count;
+ /*
+ * utilization_avg_contrib describes the amount of time that a
+ * sched_entity is running on a CPU. It is based on running_avg_sum
+ * and is scaled in the range [0..SCHED_LOAD_SCALE].
+ * load_avg_contrib described the amount of time that a sched_entity
+ * is runnable on a rq. It is based on both runnable_avg_sum and the
+ * weight of the task.
+ */
+ unsigned long load_avg_contrib, utilization_avg_contrib;
/*
* These sums represent an infinite geometric series and so are bound
* above by 1024/(1-y). Thus we only need a u32 to store them for all
* choices of y < 1-2^(-32)*1024.
+ * running_avg_sum reflects the time that the sched_entity is
+ * effectively running on the CPU.
+ * runnable_avg_sum represents the amount of time a sched_entity is on
+ * a runqueue which includes the running time that is monitored by
+ * running_avg_sum.
*/
- u32 runnable_avg_sum, runnable_avg_period;
- u64 last_runnable_update;
- s64 decay_count;
- unsigned long load_avg_contrib;
+ u32 runnable_avg_sum, avg_period, running_avg_sum;
};
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 92cc520..3033aaa 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -71,7 +71,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
if (!se) {
struct sched_avg *avg = &cpu_rq(cpu)->avg;
P(avg->runnable_avg_sum);
- P(avg->runnable_avg_period);
+ P(avg->avg_period);
return;
}
@@ -94,7 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
P(se->load.weight);
#ifdef CONFIG_SMP
P(se->avg.runnable_avg_sum);
- P(se->avg.runnable_avg_period);
+ P(se->avg.avg_period);
P(se->avg.load_avg_contrib);
P(se->avg.decay_count);
#endif
@@ -214,6 +214,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->runnable_load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg",
cfs_rq->blocked_load_avg);
+ SEQ_printf(m, " .%-30s: %ld\n", "utilization_load_avg",
+ cfs_rq->utilization_load_avg);
#ifdef CONFIG_FAIR_GROUP_SCHED
SEQ_printf(m, " .%-30s: %ld\n", "tg_load_contrib",
cfs_rq->tg_load_contrib);
@@ -635,8 +637,10 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
P(se.load.weight);
#ifdef CONFIG_SMP
P(se.avg.runnable_avg_sum);
- P(se.avg.runnable_avg_period);
+ P(se.avg.running_avg_sum);
+ P(se.avg.avg_period);
P(se.avg.load_avg_contrib);
+ P(se.avg.utilization_avg_contrib);
P(se.avg.decay_count);
#endif
P(policy);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40667cb..29adcbb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -670,6 +670,7 @@ static int select_idle_sibling(struct task_struct *p, int cpu);
static unsigned long task_h_load(struct task_struct *p);
static inline void __update_task_entity_contrib(struct sched_entity *se);
+static inline void __update_task_entity_utilization(struct sched_entity *se);
/* Give new task start runnable values to heavy its load in infant time */
void init_task_runnable_average(struct task_struct *p)
@@ -678,9 +679,10 @@ void init_task_runnable_average(struct task_struct *p)
p->se.avg.decay_count = 0;
slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
- p->se.avg.runnable_avg_sum = slice;
- p->se.avg.runnable_avg_period = slice;
+ p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice;
+ p->se.avg.avg_period = slice;
__update_task_entity_contrib(&p->se);
+ __update_task_entity_utilization(&p->se);
}
#else
void init_task_runnable_average(struct task_struct *p)
@@ -1674,7 +1676,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
*period = now - p->last_task_numa_placement;
} else {
delta = p->se.avg.runnable_avg_sum;
- *period = p->se.avg.runnable_avg_period;
+ *period = p->se.avg.avg_period;
}
p->last_sum_exec_runtime = runtime;
@@ -2500,7 +2502,8 @@ static u32 __compute_runnable_contrib(u64 n)
*/
static __always_inline int __update_entity_runnable_avg(u64 now,
struct sched_avg *sa,
- int runnable)
+ int runnable,
+ int running)
{
u64 delta, periods;
u32 runnable_contrib;
@@ -2526,7 +2529,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
sa->last_runnable_update = now;
/* delta_w is the amount already accumulated against our next period */
- delta_w = sa->runnable_avg_period % 1024;
+ delta_w = sa->avg_period % 1024;
if (delta + delta_w >= 1024) {
/* period roll-over */
decayed = 1;
@@ -2539,7 +2542,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
delta_w = 1024 - delta_w;
if (runnable)
sa->runnable_avg_sum += delta_w;
- sa->runnable_avg_period += delta_w;
+ if (running)
+ sa->running_avg_sum += delta_w;
+ sa->avg_period += delta_w;
delta -= delta_w;
@@ -2549,20 +2554,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
periods + 1);
- sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
+ sa->running_avg_sum = decay_load(sa->running_avg_sum,
+ periods + 1);
+ sa->avg_period = decay_load(sa->avg_period,
periods + 1);
/* Efficiently calculate \sum (1..n_period) 1024*y^i */
runnable_contrib = __compute_runnable_contrib(periods);
if (runnable)
sa->runnable_avg_sum += runnable_contrib;
- sa->runnable_avg_period += runnable_contrib;
+ if (running)
+ sa->running_avg_sum += runnable_contrib;
+ sa->avg_period += runnable_contrib;
}
/* Remainder of delta accrued against u_0` */
if (runnable)
sa->runnable_avg_sum += delta;
- sa->runnable_avg_period += delta;
+ if (running)
+ sa->running_avg_sum += delta;
+ sa->avg_period += delta;
return decayed;
}
@@ -2578,6 +2589,8 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
return 0;
se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
+ se->avg.utilization_avg_contrib =
+ decay_load(se->avg.utilization_avg_contrib, decays);
se->avg.decay_count = 0;
return decays;
@@ -2614,7 +2627,7 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
/* The fraction of a cpu used by this cfs_rq */
contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
- sa->runnable_avg_period + 1);
+ sa->avg_period + 1);
contrib -= cfs_rq->tg_runnable_contrib;
if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
@@ -2667,7 +2680,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
- __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
+ __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
+ runnable);
__update_tg_runnable_avg(&rq->avg, &rq->cfs);
}
#else /* CONFIG_FAIR_GROUP_SCHED */
@@ -2685,7 +2699,7 @@ static inline void __update_task_entity_contrib(struct sched_entity *se)
/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
- contrib /= (se->avg.runnable_avg_period + 1);
+ contrib /= (se->avg.avg_period + 1);
se->avg.load_avg_contrib = scale_load(contrib);
}
@@ -2704,6 +2718,27 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
return se->avg.load_avg_contrib - old_contrib;
}
+
+static inline void __update_task_entity_utilization(struct sched_entity *se)
+{
+ u32 contrib;
+
+ /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
+ contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
+ contrib /= (se->avg.avg_period + 1);
+ se->avg.utilization_avg_contrib = scale_load(contrib);
+}
+
+static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
+{
+ long old_contrib = se->avg.utilization_avg_contrib;
+
+ if (entity_is_task(se))
+ __update_task_entity_utilization(se);
+
+ return se->avg.utilization_avg_contrib - old_contrib;
+}
+
static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
long load_contrib)
{
@@ -2720,7 +2755,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
int update_cfs_rq)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
- long contrib_delta;
+ long contrib_delta, utilization_delta;
u64 now;
/*
@@ -2732,18 +2767,22 @@ static inline void update_entity_load_avg(struct sched_entity *se,
else
now = cfs_rq_clock_task(group_cfs_rq(se));
- if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
+ if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
+ cfs_rq->curr == se))
return;
contrib_delta = __update_entity_load_avg_contrib(se);
+ utilization_delta = __update_entity_utilization_avg_contrib(se);
if (!update_cfs_rq)
return;
- if (se->on_rq)
+ if (se->on_rq) {
cfs_rq->runnable_load_avg += contrib_delta;
- else
+ cfs_rq->utilization_load_avg += utilization_delta;
+ } else {
subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+ }
}
/*
@@ -2818,6 +2857,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
}
cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+ cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib;
/* we force update consideration on load-balancer moves */
update_cfs_rq_blocked_load(cfs_rq, !wakeup);
}
@@ -2836,6 +2876,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
update_cfs_rq_blocked_load(cfs_rq, !sleep);
cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+ cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
if (sleep) {
cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
@@ -3173,6 +3214,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
update_stats_wait_end(cfs_rq, se);
__dequeue_entity(cfs_rq, se);
+ update_entity_load_avg(se, 1);
}
update_stats_curr_start(cfs_rq, se);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9a2a45c..17a3b6b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -362,8 +362,14 @@ struct cfs_rq {
* Under CFS, load is tracked on a per-entity basis and aggregated up.
* This allows for the description of both thread and group usage (in
* the FAIR_GROUP_SCHED case).
+ * runnable_load_avg is the sum of the load_avg_contrib of the
+ * sched_entities on the rq.
+ * blocked_load_avg is similar to runnable_load_avg except that its
+ * the blocked sched_entities on the rq.
+ * utilization_load_avg is the sum of the average running time of the
+ * sched_entities on the rq.
*/
- unsigned long runnable_load_avg, blocked_load_avg;
+ unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
atomic64_t decay_counter;
u64 last_decay;
atomic_long_t removed_load;
--
1.9.1
Adds usage contribution tracking for group entities. Unlike
se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
entities is the sum of se->avg.utilization_avg_contrib for all entities on the
group runqueue. It is _not_ influenced in any way by the task group
h_load. Hence it represents the actual cpu usage of the group, not
its intended load contribution, which may differ significantly from the
utilization on lightly utilized systems.
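As a rough sketch with hypothetical numbers: a group runqueue with two
tasks at ~300 and ~200 out of 1024 yields a group utilization_avg_contrib
of ~500, regardless of the group's shares:

/* Illustration only: the group entity's utilization is the plain sum of
 * its children's utilization, with no h_load / shares weighting involved. */
static unsigned long group_utilization_sketch(const unsigned long *child_util,
					      int nr_children)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i < nr_children; i++)
		sum += child_util[i];	/* e.g. 300 + 200 -> 500 */

	return sum;
}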
cc: Paul Turner <[email protected]>
cc: Ben Segall <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/debug.c | 2 ++
kernel/sched/fair.c | 3 +++
2 files changed, 5 insertions(+)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 3033aaa..9dce8b5 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -94,8 +94,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
P(se->load.weight);
#ifdef CONFIG_SMP
P(se->avg.runnable_avg_sum);
+ P(se->avg.running_avg_sum);
P(se->avg.avg_period);
P(se->avg.load_avg_contrib);
+ P(se->avg.utilization_avg_contrib);
P(se->avg.decay_count);
#endif
#undef PN
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 29adcbb..fad93d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2735,6 +2735,9 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
if (entity_is_task(se))
__update_task_entity_utilization(se);
+ else
+ se->avg.utilization_avg_contrib =
+ group_cfs_rq(se)->utilization_load_avg;
return se->avg.utilization_avg_contrib - old_contrib;
}
--
1.9.1
From: Vincent Guittot <[email protected]>
Now that arch_scale_cpu_capacity has been introduced to scale the original
capacity, arch_scale_freq_capacity is no longer used (it was
previously used by the ARM arch). Remove arch_scale_freq_capacity from the
computation of cpu_capacity. The frequency invariance will be handled in the
load tracking instead of in the CPU capacity. arch_scale_freq_capacity will be
revisited for scaling load with the current frequency of the CPUs in a later
patch.
Signed-off-by: Vincent Guittot <[email protected]>
Acked-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fad93d8..35fd296 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6030,13 +6030,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
sdg->sgc->capacity_orig = capacity;
- if (sched_feat(ARCH_CAPACITY))
- capacity *= arch_scale_freq_capacity(sd, cpu);
- else
- capacity *= default_scale_capacity(sd, cpu);
-
- capacity >>= SCHED_CAPACITY_SHIFT;
-
capacity *= scale_rt_capacity(cpu);
capacity >>= SCHED_CAPACITY_SHIFT;
--
1.9.1
Apply the frequency scale-invariance correction factor to usage tracking.

Each segment of the running_avg_sum geometric series is now scaled by the
current frequency so the utilization_avg_contrib of each entity will be
invariant with frequency scaling. As a result, utilization_load_avg, which is
the sum of utilization_avg_contrib, becomes invariant too. So the usage level
that is returned by get_cpu_usage stays relative to the max frequency, like
the cpu_capacity it is compared against.

Then, we want to keep the load tracking values in a 32-bit type, which implies
that the max value of {runnable|running}_avg_sum must be lower than
2^32/88761 = 48388 (88761 is the max weight of a task). As LOAD_AVG_MAX = 47742,
arch_scale_freq_capacity must return a value less than
(48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_SCALE_CAPACITY = 1024).
So we define the range to [0..SCHED_SCALE_CAPACITY] in order to avoid overflow.
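A compact userspace sketch of both the per-segment scaling and the
overflow bound quoted above (constants as in the text, the segment sizes
are made-up inputs):

#include <stdio.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1 << SCHED_CAPACITY_SHIFT)
#define LOAD_AVG_MAX		47742
#define MAX_TASK_WEIGHT		88761

/* Each accrued segment of running time is scaled by the frequency factor. */
static uint32_t scale_segment(uint32_t delta, unsigned long scale_freq)
{
	return (uint32_t)((delta * scale_freq) >> SCHED_CAPACITY_SHIFT);
}

int main(void)
{
	/* At 50% of max frequency a 1024-unit segment accrues as 512. */
	printf("scaled segment: %u\n", scale_segment(1024, 512));

	/* Bound on the scale factor so that sum * max weight fits in 32 bit:
	 * ((2^32 / 88761) / LOAD_AVG_MAX) << SCHED_CAPACITY_SHIFT ~= 1037. */
	unsigned long long bound = (UINT32_MAX / MAX_TASK_WEIGHT) *
				   (unsigned long long)SCHED_CAPACITY_SCALE /
				   LOAD_AVG_MAX;
	printf("max scale factor: %llu\n", bound);
	return 0;
}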
cc: Paul Turner <[email protected]>
cc: Ben Segall <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 21 ++++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 35fd296..b6fb7c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2472,6 +2472,8 @@ static u32 __compute_runnable_contrib(u64 n)
return contrib + runnable_avg_yN_sum[n];
}
+unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
/*
* We can represent the historical contribution to runnable average as the
* coefficients of a geometric series. To do this we sub-divide our runnable
@@ -2500,7 +2502,7 @@ static u32 __compute_runnable_contrib(u64 n)
* load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
* = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
*/
-static __always_inline int __update_entity_runnable_avg(u64 now,
+static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
struct sched_avg *sa,
int runnable,
int running)
@@ -2508,6 +2510,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
u64 delta, periods;
u32 runnable_contrib;
int delta_w, decayed = 0;
+ unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
delta = now - sa->last_runnable_update;
/*
@@ -2543,7 +2546,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
if (runnable)
sa->runnable_avg_sum += delta_w;
if (running)
- sa->running_avg_sum += delta_w;
+ sa->running_avg_sum += delta_w * scale_freq
+ >> SCHED_CAPACITY_SHIFT;
sa->avg_period += delta_w;
delta -= delta_w;
@@ -2564,7 +2568,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
if (runnable)
sa->runnable_avg_sum += runnable_contrib;
if (running)
- sa->running_avg_sum += runnable_contrib;
+ sa->running_avg_sum += runnable_contrib * scale_freq
+ >> SCHED_CAPACITY_SHIFT;
sa->avg_period += runnable_contrib;
}
@@ -2572,7 +2577,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
if (runnable)
sa->runnable_avg_sum += delta;
if (running)
- sa->running_avg_sum += delta;
+ sa->running_avg_sum += delta * scale_freq
+ >> SCHED_CAPACITY_SHIFT;
sa->avg_period += delta;
return decayed;
@@ -2680,8 +2686,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
- __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
- runnable);
+ __update_entity_runnable_avg(rq_clock_task(rq), cpu_of(rq), &rq->avg,
+ runnable, runnable);
__update_tg_runnable_avg(&rq->avg, &rq->cfs);
}
#else /* CONFIG_FAIR_GROUP_SCHED */
@@ -2759,6 +2765,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
long contrib_delta, utilization_delta;
+ int cpu = cpu_of(rq_of(cfs_rq));
u64 now;
/*
@@ -2770,7 +2777,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
else
now = cfs_rq_clock_task(group_cfs_rq(se));
- if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
+ if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq,
cfs_rq->curr == se))
return;
--
1.9.1
From: Vincent Guittot <[email protected]>
The average running time of RT tasks is used to estimate the remaining compute
capacity for CFS tasks. This remaining capacity is the original capacity scaled
down by a factor (aka scale_rt_capacity). This estimation of available capacity
must also be invariant with frequency scaling.

A frequency scaling factor is applied on the running time of the RT tasks for
computing scale_rt_capacity.

In sched_rt_avg_update, we scale the RT execution time like below:
  rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT

Then, scale_rt_capacity can be summarized by:
  scale_rt_capacity = SCHED_CAPACITY_SCALE -
    ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / period)

We can optimize by removing the right and left shifts in the computation of
rq->rt_avg and scale_rt_capacity.

The call to arch_scale_freq_capacity in the rt scheduling path might be a
concern for RT folks, because it is not clear whether we can rely on
arch_scale_freq_capacity being short and efficient.
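Rounding aside, a quick userspace sketch showing that the two shift forms
agree (made-up numbers, not kernel code):

#include <stdio.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1 << SCHED_CAPACITY_SHIFT)

int main(void)
{
	uint64_t rt_delta = 2000, period = 10000;	/* made-up numbers */
	unsigned long scale_freq = 512;			/* running at 50% fmax */

	/* With explicit shifts, as written in the changelog formulas. */
	uint64_t rt_avg_a = (rt_delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
	uint64_t scale_rt_a = SCHED_CAPACITY_SCALE -
			      ((rt_avg_a << SCHED_CAPACITY_SHIFT) / period);

	/* With both shifts removed, as implemented in the patch. */
	uint64_t rt_avg_b = rt_delta * scale_freq;
	uint64_t scale_rt_b = SCHED_CAPACITY_SCALE - rt_avg_b / period;

	printf("with shifts:    %llu\n", (unsigned long long)scale_rt_a);
	printf("without shifts: %llu\n", (unsigned long long)scale_rt_b);
	return 0;
}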
Signed-off-by: Vincent Guittot <[email protected]>
Acked-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 17 +++++------------
kernel/sched/sched.h | 4 +++-
2 files changed, 8 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b6fb7c4..cfe3aea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5992,7 +5992,7 @@ unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
static unsigned long scale_rt_capacity(int cpu)
{
struct rq *rq = cpu_rq(cpu);
- u64 total, available, age_stamp, avg;
+ u64 total, used, age_stamp, avg;
s64 delta;
/*
@@ -6008,19 +6008,12 @@ static unsigned long scale_rt_capacity(int cpu)
total = sched_avg_period() + delta;
- if (unlikely(total < avg)) {
- /* Ensures that capacity won't end up being negative */
- available = 0;
- } else {
- available = total - avg;
- }
+ used = div_u64(avg, total);
- if (unlikely((s64)total < SCHED_CAPACITY_SCALE))
- total = SCHED_CAPACITY_SCALE;
+ if (likely(used < SCHED_CAPACITY_SCALE))
+ return SCHED_CAPACITY_SCALE - used;
- total >>= SCHED_CAPACITY_SHIFT;
-
- return div_u64(available, total);
+ return 1;
}
static void update_cpu_capacity(struct sched_domain *sd, int cpu)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 17a3b6b..e61f00e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1356,9 +1356,11 @@ static inline int hrtick_enabled(struct rq *rq)
#ifdef CONFIG_SMP
extern void sched_avg_update(struct rq *rq);
+extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
- rq->rt_avg += rt_delta;
+ rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
sched_avg_update(rq);
}
#else
--
1.9.1
From: Vincent Guittot <[email protected]>
This new field cpu_capacity_orig reflects the original capacity of a CPU
before being altered by rt tasks and/or IRQs.

cpu_capacity_orig will be used:
- to detect when the capacity of a CPU has been noticeably reduced so we can
  trigger a load balance to look for a CPU with better capacity. As an example,
  we can detect when a CPU handles a significant amount of irq time
  (with CONFIG_IRQ_TIME_ACCOUNTING) but is still seen as an idle CPU by the
  scheduler whereas CPUs which are really idle are available.
- to evaluate the capacity available for CFS tasks.
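A minimal sketch of the kind of detection this enables (the actual
threshold used later in the series is sd->imbalance_pct; 125 here is just
an example value):

/* Returns non-zero if cpu_capacity has dropped noticeably below
 * cpu_capacity_orig; imbalance_pct = 125 means "more than ~20% lost". */
static int capacity_noticeably_reduced(unsigned long cpu_capacity,
				       unsigned long cpu_capacity_orig,
				       unsigned int imbalance_pct)
{
	return cpu_capacity * imbalance_pct < cpu_capacity_orig * 100;
}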
Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Kamalesh Babulal <[email protected]>
Acked-by: Morten Rasmussen <[email protected]>
---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 8 +++++++-
kernel/sched/sched.h | 1 +
3 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e628cb1..48f9053 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7212,7 +7212,7 @@ void __init sched_init(void)
#ifdef CONFIG_SMP
rq->sd = NULL;
rq->rd = NULL;
- rq->cpu_capacity = SCHED_CAPACITY_SCALE;
+ rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
rq->post_schedule = 0;
rq->active_balance = 0;
rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cfe3aea..3fdad38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4351,6 +4351,11 @@ static unsigned long capacity_of(int cpu)
return cpu_rq(cpu)->cpu_capacity;
}
+static unsigned long capacity_orig_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_orig;
+}
+
static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -6028,6 +6033,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
capacity >>= SCHED_CAPACITY_SHIFT;
+ cpu_rq(cpu)->cpu_capacity_orig = capacity;
sdg->sgc->capacity_orig = capacity;
capacity *= scale_rt_capacity(cpu);
@@ -6082,7 +6088,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
* Runtime updates will correct capacity_orig.
*/
if (unlikely(!rq->sd)) {
- capacity_orig += capacity_of(cpu);
+ capacity_orig += capacity_orig_of(cpu);
capacity += capacity_of(cpu);
continue;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e61f00e..09bb18b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -604,6 +604,7 @@ struct rq {
struct sched_domain *sd;
unsigned long cpu_capacity;
+ unsigned long cpu_capacity_orig;
unsigned char idle_balance;
/* For active balancing */
--
1.9.1
From: Vincent Guittot <[email protected]>
Monitor the usage level of each group at each sched_domain level. The usage is
the portion of cpu_capacity_orig that is currently used on a CPU or group of
CPUs. We use utilization_load_avg to evaluate the usage level of each
group.

utilization_load_avg only takes into account the running time of the CFS
tasks on a CPU, with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
utilized. Nevertheless, we must cap utilization_load_avg, which can temporarily
be greater than SCHED_LOAD_SCALE after the migration of a task onto a CPU and
until the metrics stabilize.

utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
running load on the CPU, whereas the capacity available for CFS tasks is in
the range [0..cpu_capacity_orig]. In order to test whether a CPU is fully
utilized by CFS tasks, we have to scale the utilization into the
cpu_capacity_orig range of the CPU to get its usage. The usage can then be
compared with the available capacity (i.e. cpu_capacity) to deduce the usage
level of a CPU.

Frequency scaling invariance of the usage is not taken into account in this
patch; it will be solved in another patch which deals with frequency scaling
invariance of the running time tracking.
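A worked example of the scaling as a userspace sketch (made-up capacity
numbers, not the kernel code): a cpu with cpu_capacity_orig = 430 and
utilization_load_avg = 512 reports a usage of 215, i.e. half of its own
capacity:

#include <stdio.h>

#define SCHED_LOAD_SHIFT 10
#define SCHED_LOAD_SCALE (1 << SCHED_LOAD_SHIFT)

/* Mirrors the scaling idea of get_cpu_usage(), illustration only. */
static unsigned long usage_sketch(unsigned long util, unsigned long capacity_orig)
{
	if (util >= SCHED_LOAD_SCALE)
		return capacity_orig;	/* cap transient overshoot */
	return (util * capacity_orig) >> SCHED_LOAD_SHIFT;
}

int main(void)
{
	/* A little cpu (capacity_orig 430) at 50% utilization. */
	printf("usage: %lu\n", usage_sketch(512, 430));
	return 0;
}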
Signed-off-by: Vincent Guittot <[email protected]>
Acked-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3fdad38..7ec48db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4769,6 +4769,33 @@ static int select_idle_sibling(struct task_struct *p, int target)
done:
return target;
}
+/*
+ * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
+ * tasks. The unit of the return value must capacity so we can compare the
+ * usage with the capacity of the CPU that is available for CFS task (ie
+ * cpu_capacity).
+ * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
+ * CPU. It represents the amount of utilization of a CPU in the range
+ * [0..SCHED_LOAD_SCALE]. The usage of a CPU can't be higher than the full
+ * capacity of the CPU because it's about the running time on this CPU.
+ * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
+ * because of unfortunate rounding in avg_period and running_load_avg or just
+ * after migrating tasks until the average stabilizes with the new running
+ * time. So we need to check that the usage stays into the range
+ * [0..cpu_capacity_orig] and cap if necessary.
+ * Without capping the usage, a group could be seen as overloaded (CPU0 usage
+ * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity/
+ */
+static int get_cpu_usage(int cpu)
+{
+ unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+ unsigned long capacity = capacity_orig_of(cpu);
+
+ if (usage >= SCHED_LOAD_SCALE)
+ return capacity;
+
+ return (usage * capacity) >> SCHED_LOAD_SHIFT;
+}
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
@@ -5895,6 +5922,7 @@ struct sg_lb_stats {
unsigned long sum_weighted_load; /* Weighted load of group's tasks */
unsigned long load_per_task;
unsigned long group_capacity;
+ unsigned long group_usage; /* Total usage of the group */
unsigned int sum_nr_running; /* Nr tasks running in the group */
unsigned int group_capacity_factor;
unsigned int idle_cpus;
@@ -6243,6 +6271,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
load = source_load(i, load_idx);
sgs->group_load += load;
+ sgs->group_usage += get_cpu_usage(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;
if (rq->nr_running > 1)
--
1.9.1
From: Vincent Guittot <[email protected]>
The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
SCHED_CAPACITY_SCALE. group_capacity_factor divides the capacity of the group
by SCHED_LOAD_SCALE to estimate how many tasks can run in the group. Then, it
compares this value with the sum of nr_running to decide whether the group is
overloaded or not. But group_capacity_factor hardly works for SMT systems: it
sometimes works for big cores but fails to do the right thing for little cores.

Below are two examples to illustrate the problem that this patch solves:

1 - If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
(640 as an example), a group of 3 CPUs will have a max capacity_factor of 2
(div_round_closest(3x640/1024) = 2), which means that it will be seen as
overloaded even if we have only one task per CPU.

2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
(1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
(at max, thanks to the fix [0] for SMT systems that prevents the appearance
of ghost CPUs), but if one CPU is fully used by rt tasks (and its capacity is
reduced to nearly nothing), the capacity factor of the group will still be 4
(div_round_closest(3*1512/1024) = 5, which is capped to 4 by [0]).

So, this patch tries to solve this issue by removing capacity_factor and
replacing it with the following two metrics:
- The CPU capacity available for CFS tasks, which is already used by
  load_balance.
- The usage of the CPU by CFS tasks. For the latter, utilization_avg_contrib
  has been re-introduced to compute the usage of a CPU by CFS tasks.

group_capacity_factor and group_has_free_capacity have been removed and
replaced by group_no_capacity. We compare the number of tasks with the number
of CPUs and we evaluate the level of utilization of the CPUs to decide whether
a group is overloaded or has capacity to handle more tasks.

For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task
so that it will be selected in priority (among the overloaded groups). Since
[1], SD_PREFER_SIBLING is no longer involved in the computation of
load_above_capacity because local is not overloaded.

Finally, sched_group->sched_group_capacity->capacity_orig has been removed
because it is no longer used during load balancing.
[1] https://lkml.org/lkml/2014/8/12/295
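A sketch of the overload check with example numbers (imbalance_pct = 120
here is an arbitrary illustration, not the scheduler's default):

/* Illustration with made-up numbers: a 2-cpu group with capacity 2048,
 * usage 1800 and 3 running tasks:
 *   3 > 2 and 2048 * 100 < 1800 * 120 (204800 < 216000)
 *   -> flagged group_no_capacity. */
static int group_is_overloaded_sketch(unsigned int sum_nr_running,
				      unsigned int group_weight,
				      unsigned long group_capacity,
				      unsigned long group_usage,
				      unsigned int imbalance_pct)
{
	if (sum_nr_running <= group_weight)
		return 0;
	return group_capacity * 100 < group_usage * imbalance_pct;
}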
Signed-off-by: Vincent Guittot <[email protected]>
[Fixed merge conflict on v3.19-rc6: Morten Rasmussen
<[email protected]>]
---
kernel/sched/core.c | 12 ----
kernel/sched/fair.c | 152 +++++++++++++++++++++++++--------------------------
kernel/sched/sched.h | 2 +-
3 files changed, 75 insertions(+), 91 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 48f9053..252011d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5442,17 +5442,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
break;
}
- /*
- * Even though we initialize ->capacity to something semi-sane,
- * we leave capacity_orig unset. This allows us to detect if
- * domain iteration is still funny without causing /0 traps.
- */
- if (!group->sgc->capacity_orig) {
- printk(KERN_CONT "\n");
- printk(KERN_ERR "ERROR: domain->cpu_capacity not set\n");
- break;
- }
-
if (!cpumask_weight(sched_group_cpus(group))) {
printk(KERN_CONT "\n");
printk(KERN_ERR "ERROR: empty group\n");
@@ -5937,7 +5926,6 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
* die on a /0 trap.
*/
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
- sg->sgc->capacity_orig = sg->sgc->capacity;
/*
* Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ec48db..52c494f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5924,11 +5924,10 @@ struct sg_lb_stats {
unsigned long group_capacity;
unsigned long group_usage; /* Total usage of the group */
unsigned int sum_nr_running; /* Nr tasks running in the group */
- unsigned int group_capacity_factor;
unsigned int idle_cpus;
unsigned int group_weight;
enum group_type group_type;
- int group_has_free_capacity;
+ int group_no_capacity;
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -6062,7 +6061,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
capacity >>= SCHED_CAPACITY_SHIFT;
cpu_rq(cpu)->cpu_capacity_orig = capacity;
- sdg->sgc->capacity_orig = capacity;
capacity *= scale_rt_capacity(cpu);
capacity >>= SCHED_CAPACITY_SHIFT;
@@ -6078,7 +6076,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
- unsigned long capacity, capacity_orig;
+ unsigned long capacity;
unsigned long interval;
interval = msecs_to_jiffies(sd->balance_interval);
@@ -6090,7 +6088,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
return;
}
- capacity_orig = capacity = 0;
+ capacity = 0;
if (child->flags & SD_OVERLAP) {
/*
@@ -6110,19 +6108,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
* Use capacity_of(), which is set irrespective of domains
* in update_cpu_capacity().
*
- * This avoids capacity/capacity_orig from being 0 and
- * causing divide-by-zero issues on boot.
- *
- * Runtime updates will correct capacity_orig.
+ * This avoids capacity from being 0 and causing
+ * divide-by-zero issues on boot.
*/
if (unlikely(!rq->sd)) {
- capacity_orig += capacity_orig_of(cpu);
capacity += capacity_of(cpu);
continue;
}
sgc = rq->sd->groups->sgc;
- capacity_orig += sgc->capacity_orig;
capacity += sgc->capacity;
}
} else {
@@ -6133,39 +6127,24 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
group = child->groups;
do {
- capacity_orig += group->sgc->capacity_orig;
capacity += group->sgc->capacity;
group = group->next;
} while (group != child->groups);
}
- sdg->sgc->capacity_orig = capacity_orig;
sdg->sgc->capacity = capacity;
}
/*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
+ * Check whether the capacity of the rq has been noticeably reduced by side
+ * activity. The imbalance_pct is used for the threshold.
+ * Return true is the capacity is reduced
*/
static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
+check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
{
- /*
- * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
- */
- if (!(sd->flags & SD_SHARE_CPUCAPACITY))
- return 0;
-
- /*
- * If ~90% of the cpu_capacity is still there, we're good.
- */
- if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
- return 1;
-
- return 0;
+ return ((rq->cpu_capacity * sd->imbalance_pct) <
+ (rq->cpu_capacity_orig * 100));
}
/*
@@ -6203,37 +6182,54 @@ static inline int sg_imbalanced(struct sched_group *group)
}
/*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
+ * group_has_capacity returns true if the group has spare capacity that could
+ * be used by some tasks. We consider that a group has spare capacity if the
+ * number of task is smaller than the number of CPUs or if the usage is lower
+ * than the available capacity for CFS tasks. For the latter, we use a
+ * threshold to stabilize the state, to take into account the variance of the
+ * tasks' load and to return true if the available capacity in meaningful for
+ * the load balancer. As an example, an available capacity of 1% can appear
+ * but it doesn't make any benefit for the load balance.
*/
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
{
- unsigned int capacity_factor, smt, cpus;
- unsigned int capacity, capacity_orig;
+ if ((sgs->group_capacity * 100) >
+ (sgs->group_usage * env->sd->imbalance_pct))
+ return true;
- capacity = group->sgc->capacity;
- capacity_orig = group->sgc->capacity_orig;
- cpus = group->group_weight;
+ if (sgs->sum_nr_running < sgs->group_weight)
+ return true;
- /* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
- smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
- capacity_factor = cpus / smt; /* cores */
+ return false;
+}
- capacity_factor = min_t(unsigned,
- capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
- if (!capacity_factor)
- capacity_factor = fix_small_capacity(env->sd, group);
+/*
+ * group_is_overloaded returns true if the group has more tasks than it can
+ * handle. We consider that a group is overloaded if the number of tasks is
+ * greater than the number of CPUs and the tasks already use all available
+ * capacity for CFS tasks. For the latter, we use a threshold to stabilize
+ * the state, to take into account the variance of tasks' load and to return
+ * true if available capacity is no more meaningful for load balancer
+ */
+static inline bool
+group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
+{
+ if (sgs->sum_nr_running <= sgs->group_weight)
+ return false;
- return capacity_factor;
+ if ((sgs->group_capacity * 100) <
+ (sgs->group_usage * env->sd->imbalance_pct))
+ return true;
+
+ return false;
}
-static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+static enum group_type group_classify(struct lb_env *env,
+ struct sched_group *group,
+ struct sg_lb_stats *sgs)
{
- if (sgs->sum_nr_running > sgs->group_capacity_factor)
+ if (sgs->group_no_capacity)
return group_overloaded;
if (sg_imbalanced(group))
@@ -6294,11 +6290,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
sgs->group_weight = group->group_weight;
- sgs->group_capacity_factor = sg_capacity_factor(env, group);
- sgs->group_type = group_classify(group, sgs);
- if (sgs->group_capacity_factor > sgs->sum_nr_running)
- sgs->group_has_free_capacity = 1;
+ sgs->group_no_capacity = group_is_overloaded(env, sgs);
+ sgs->group_type = group_classify(env, group, sgs);
}
/**
@@ -6420,18 +6414,19 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/*
* In case the child domain prefers tasks go to siblings
- * first, lower the sg capacity factor to one so that we'll try
+ * first, lower the sg capacity to one so that we'll try
* and move all the excess tasks away. We lower the capacity
* of a group only if the local group has the capacity to fit
- * these excess tasks, i.e. nr_running < group_capacity_factor. The
- * extra check prevents the case where you always pull from the
- * heaviest group when it is already under-utilized (possible
- * with a large weight task outweighs the tasks on the system).
+ * these excess tasks. The extra check prevents the case where
+ * you always pull from the heaviest group when it is already
+ * under-utilized (possible with a large weight task outweighs
+ * the tasks on the system).
*/
if (prefer_sibling && sds->local &&
- sds->local_stat.group_has_free_capacity) {
- sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
- sgs->group_type = group_classify(sg, sgs);
+ group_has_capacity(env, &sds->local_stat) &&
+ (sgs->sum_nr_running > 1)) {
+ sgs->group_no_capacity = 1;
+ sgs->group_type = group_overloaded;
}
if (update_sd_pick_busiest(env, sds, sg, sgs)) {
@@ -6611,11 +6606,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
*/
if (busiest->group_type == group_overloaded &&
local->group_type == group_overloaded) {
- load_above_capacity =
- (busiest->sum_nr_running - busiest->group_capacity_factor);
-
- load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
- load_above_capacity /= busiest->group_capacity;
+ load_above_capacity = busiest->sum_nr_running *
+ SCHED_LOAD_SCALE;
+ if (load_above_capacity > busiest->group_capacity)
+ load_above_capacity -= busiest->group_capacity;
+ else
+ load_above_capacity = ~0UL;
}
/*
@@ -6678,6 +6674,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
local = &sds.local_stat;
busiest = &sds.busiest_stat;
+ /* ASYM feature bypasses nice load balance check */
if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
check_asym_packing(env, &sds))
return sds.busiest;
@@ -6698,8 +6695,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto force_balance;
/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
- if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
- !busiest->group_has_free_capacity)
+ if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
+ busiest->group_no_capacity)
goto force_balance;
/*
@@ -6758,7 +6755,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
int i;
for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
- unsigned long capacity, capacity_factor, wl;
+ unsigned long capacity, wl;
enum fbq_type rt;
rq = cpu_rq(i);
@@ -6787,9 +6784,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
continue;
capacity = capacity_of(i);
- capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
- if (!capacity_factor)
- capacity_factor = fix_small_capacity(env->sd, group);
wl = weighted_cpuload(i);
@@ -6797,7 +6791,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* When comparing with imbalance, use weighted_cpuload()
* which is not scaled with the cpu capacity.
*/
- if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+
+ if (rq->nr_running == 1 && wl > env->imbalance &&
+ !check_cpu_capacity(rq, env->sd))
continue;
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 09bb18b..e402133 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -796,7 +796,7 @@ struct sched_group_capacity {
* CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
* for a single CPU.
*/
- unsigned int capacity, capacity_orig;
+ unsigned int capacity;
unsigned long next_update;
int imbalance; /* XXX unrelated to capacity but shared group state */
/*
--
1.9.1
From: Vincent Guittot <[email protected]>
Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
the scheduler will put at least 1 task per core.
Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Preeti U. Murthy <[email protected]>
---
kernel/sched/core.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 252011d..a00a4c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6236,6 +6236,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
*/
if (sd->flags & SD_SHARE_CPUCAPACITY) {
+ sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 110;
sd->smt_gain = 1178; /* ~15% */
--
1.9.1
From: Vincent Guittot <[email protected]>
When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining
capacity for CFS tasks can be significantly reduced. Once we detect such a
situation by comparing cpu_capacity_orig and cpu_capacity, we trigger an idle
load balance to check if it's worth moving its tasks to an idle CPU.

Once the idle load balance has selected the busiest CPU, it will look for an
active load balance in only two cases:
- there is only 1 task on the busiest CPU.
- we haven't been able to move a task off the busiest rq.

A CPU with reduced capacity is included in the 1st case, and it's worth
actively migrating its task if the idle CPU has got full capacity. This test
has been added in need_active_balance.

As a sidenote, this will not generate more spurious ilb because we already
trigger an ilb if there is more than 1 busy cpu. If this cpu is the only one
that has a task, we will trigger the ilb once to migrate the task.

The nohz_kick_needed function has been cleaned up a bit while adding the new
test.

env.src_cpu and env.src_rq must be set unconditionally because they are used
in need_active_balance, which is called even if busiest->nr_running equals 1.
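Worked example of the new test as a sketch (assumed numbers: src capacity
cut to 600/1024 by IRQs, dst fully available, imbalance_pct 125):

/* Mirrors the comparison added to need_active_balance(), illustration only:
 *   src_eff = imbalance_pct * capacity(src) * capacity_orig(dst)
 *           = 125 * 600 * 1024  =  76800000
 *   dst_eff = 100 * capacity(dst) * capacity_orig(src)
 *           = 100 * 1024 * 1024 = 104857600
 * src_eff < dst_eff, so the single task is worth migrating actively. */
static int worth_active_migration(unsigned long src_cap, unsigned long src_cap_orig,
				  unsigned long dst_cap, unsigned long dst_cap_orig,
				  unsigned int imbalance_pct)
{
	unsigned long src_eff = imbalance_pct * src_cap * dst_cap_orig;
	unsigned long dst_eff = 100 * dst_cap * src_cap_orig;

	return src_eff < dst_eff;
}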
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 53 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 52c494f..bd73f26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6841,6 +6841,28 @@ static int need_active_balance(struct lb_env *env)
return 1;
}
+ /*
+ * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.
+ * It's worth migrating the task if the src_cpu's capacity is reduced
+ * because of other sched_class or IRQs whereas capacity stays
+ * available on dst_cpu.
+ */
+ if ((env->idle != CPU_NOT_IDLE) &&
+ (env->src_rq->cfs.h_nr_running == 1)) {
+ unsigned long src_eff_capacity, dst_eff_capacity;
+
+ dst_eff_capacity = 100;
+ dst_eff_capacity *= capacity_of(env->dst_cpu);
+ dst_eff_capacity *= capacity_orig_of(env->src_cpu);
+
+ src_eff_capacity = sd->imbalance_pct;
+ src_eff_capacity *= capacity_of(env->src_cpu);
+ src_eff_capacity *= capacity_orig_of(env->dst_cpu);
+
+ if (src_eff_capacity < dst_eff_capacity)
+ return 1;
+ }
+
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}
@@ -6940,6 +6962,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
schedstat_add(sd, lb_imbalance[idle], env.imbalance);
+ env.src_cpu = busiest->cpu;
+ env.src_rq = busiest;
+
ld_moved = 0;
if (busiest->nr_running > 1) {
/*
@@ -6949,8 +6974,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
* correctly treated as an imbalance.
*/
env.flags |= LBF_ALL_PINNED;
- env.src_cpu = busiest->cpu;
- env.src_rq = busiest;
env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
more_balance:
@@ -7650,22 +7673,25 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
/*
* Current heuristic for kicking the idle load balancer in the presence
- * of an idle cpu is the system.
+ * of an idle cpu in the system.
* - This rq has more than one task.
- * - At any scheduler domain level, this cpu's scheduler group has multiple
- * busy cpu's exceeding the group's capacity.
+ * - This rq has at least one CFS task and the capacity of the CPU is
+ * significantly reduced because of RT tasks or IRQs.
+ * - At parent of LLC scheduler domain level, this cpu's scheduler group has
+ * multiple busy cpu.
* - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
* domain span are idle.
*/
-static inline int nohz_kick_needed(struct rq *rq)
+static inline bool nohz_kick_needed(struct rq *rq)
{
unsigned long now = jiffies;
struct sched_domain *sd;
struct sched_group_capacity *sgc;
int nr_busy, cpu = rq->cpu;
+ bool kick = false;
if (unlikely(rq->idle_balance))
- return 0;
+ return false;
/*
* We may be recently in ticked or tickless idle mode. At the first
@@ -7679,38 +7705,44 @@ static inline int nohz_kick_needed(struct rq *rq)
* balancing.
*/
if (likely(!atomic_read(&nohz.nr_cpus)))
- return 0;
+ return false;
if (time_before(now, nohz.next_balance))
- return 0;
+ return false;
if (rq->nr_running >= 2)
- goto need_kick;
+ return true;
rcu_read_lock();
sd = rcu_dereference(per_cpu(sd_busy, cpu));
-
if (sd) {
sgc = sd->groups->sgc;
nr_busy = atomic_read(&sgc->nr_busy_cpus);
- if (nr_busy > 1)
- goto need_kick_unlock;
+ if (nr_busy > 1) {
+ kick = true;
+ goto unlock;
+ }
+
}
- sd = rcu_dereference(per_cpu(sd_asym, cpu));
+ sd = rcu_dereference(rq->sd);
+ if (sd) {
+ if ((rq->cfs.h_nr_running >= 1) &&
+ check_cpu_capacity(rq, sd)) {
+ kick = true;
+ goto unlock;
+ }
+ }
+ sd = rcu_dereference(per_cpu(sd_asym, cpu));
if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
sched_domain_span(sd)) < cpu))
- goto need_kick_unlock;
+ kick = true;
+unlock:
rcu_read_unlock();
- return 0;
-
-need_kick_unlock:
- rcu_read_unlock();
-need_kick:
- return 1;
+ return kick;
}
#else
static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
--
1.9.1
From: Dietmar Eggemann <[email protected]>
Apply frequency scale-invariance correction factor to load tracking.
Each segment of the sched_avg::runnable_avg_sum geometric series is now
scaled by the current frequency so the sched_avg::load_avg_contrib of each
entity will be invariant with frequency scaling. As a result,
cfs_rq::runnable_load_avg, which is the sum of sched_avg::load_avg_contrib,
becomes invariant too. So the load level that is returned by
weighted_cpuload stays relative to the max frequency of the cpu.
Then, we want to keep the load tracking values in a 32-bit type, which
implies that the max value of sched_avg::{runnable|running}_avg_sum must
be lower than 2^32/88761 = 48388 (88761 is the max weight of a task). As
LOAD_AVG_MAX = 47742, arch_scale_freq_capacity must return a value less
than (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_CAPACITY_SCALE =
1024). So we define the range to [0..SCHED_CAPACITY_SCALE] in order to
avoid overflow.
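As a quick sanity check of the bound above, the arithmetic can be reproduced
with a small standalone user-space program (not part of the patch; the
constants are copied from the commit message):

#include <stdio.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define LOAD_AVG_MAX		47742ULL	/* max value of the geometric series */
#define MAX_TASK_WEIGHT		88761ULL	/* weight of a nice -20 task */

int main(void)
{
	/* largest avg sum that still fits in 32 bits once weighted */
	uint64_t max_sum = UINT32_MAX / MAX_TASK_WEIGHT;	/* 48388 */
	/* largest scale factor that keeps the scaled sum below max_sum */
	uint64_t max_scale = (max_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;

	printf("max sum %llu, max scale %llu\n",
	       (unsigned long long)max_sum, (unsigned long long)max_scale);
	/* prints 48388 and 1037, so returning at most 1024 is safe */
	return 0;
}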
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Acked-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 28 ++++++++++++++++------------
1 file changed, 16 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd73f26..e9a26b1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2507,9 +2507,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
int runnable,
int running)
{
- u64 delta, periods;
- u32 runnable_contrib;
- int delta_w, decayed = 0;
+ u64 delta, scaled_delta, periods;
+ u32 runnable_contrib, scaled_runnable_contrib;
+ int delta_w, scaled_delta_w, decayed = 0;
unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
delta = now - sa->last_runnable_update;
@@ -2543,11 +2543,12 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
* period and accrue it.
*/
delta_w = 1024 - delta_w;
+ scaled_delta_w = (delta_w * scale_freq) >> SCHED_CAPACITY_SHIFT;
+
if (runnable)
- sa->runnable_avg_sum += delta_w;
+ sa->runnable_avg_sum += scaled_delta_w;
if (running)
- sa->running_avg_sum += delta_w * scale_freq
- >> SCHED_CAPACITY_SHIFT;
+ sa->running_avg_sum += scaled_delta_w;
sa->avg_period += delta_w;
delta -= delta_w;
@@ -2565,20 +2566,23 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
/* Efficiently calculate \sum (1..n_period) 1024*y^i */
runnable_contrib = __compute_runnable_contrib(periods);
+ scaled_runnable_contrib = (runnable_contrib * scale_freq)
+ >> SCHED_CAPACITY_SHIFT;
+
if (runnable)
- sa->runnable_avg_sum += runnable_contrib;
+ sa->runnable_avg_sum += scaled_runnable_contrib;
if (running)
- sa->running_avg_sum += runnable_contrib * scale_freq
- >> SCHED_CAPACITY_SHIFT;
+ sa->running_avg_sum += scaled_runnable_contrib;
sa->avg_period += runnable_contrib;
}
/* Remainder of delta accrued against u_0` */
+ scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
+
if (runnable)
- sa->runnable_avg_sum += delta;
+ sa->runnable_avg_sum += scaled_delta;
if (running)
- sa->running_avg_sum += delta * scale_freq
- >> SCHED_CAPACITY_SHIFT;
+ sa->running_avg_sum += scaled_delta;
sa->avg_period += delta;
return decayed;
--
1.9.1
From: Dietmar Eggemann <[email protected]>
Besides the existing frequency scale-invariance correction factor, apply
cpu scale-invariance correction factor to usage tracking.
Cpu scale-invariance takes into consideration cpu performance deviations
due to micro-architectural differences (i.e. instructions per second)
between cpus in HMP systems (e.g. big.LITTLE) and differences in the
frequency value of the highest OPP between cpus in SMP systems.
Each segment of the sched_avg::running_avg_sum geometric series is now
scaled by the cpu performance factor too, so the
sched_avg::utilization_avg_contrib of each entity will be invariant with
respect to the particular cpu of the HMP/SMP system it is gathered on.
So the usage level that is returned by get_cpu_usage stays relative to
the max cpu performance of the system.
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e9a26b1..5375ab1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2473,6 +2473,7 @@ static u32 __compute_runnable_contrib(u64 n)
}
unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu);
/*
* We can represent the historical contribution to runnable average as the
@@ -2511,6 +2512,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
u32 runnable_contrib, scaled_runnable_contrib;
int delta_w, scaled_delta_w, decayed = 0;
unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
+ unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
delta = now - sa->last_runnable_update;
/*
@@ -2547,6 +2549,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
if (runnable)
sa->runnable_avg_sum += scaled_delta_w;
+
+ scaled_delta_w *= scale_cpu;
+ scaled_delta_w >>= SCHED_CAPACITY_SHIFT;
+
if (running)
sa->running_avg_sum += scaled_delta_w;
sa->avg_period += delta_w;
@@ -2571,6 +2577,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
if (runnable)
sa->runnable_avg_sum += scaled_runnable_contrib;
+
+ scaled_runnable_contrib *= scale_cpu;
+ scaled_runnable_contrib >>= SCHED_CAPACITY_SHIFT;
+
if (running)
sa->running_avg_sum += scaled_runnable_contrib;
sa->avg_period += runnable_contrib;
@@ -2581,6 +2591,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
if (runnable)
sa->runnable_avg_sum += scaled_delta;
+
+ scaled_delta *= scale_cpu;
+ scaled_delta >>= SCHED_CAPACITY_SHIFT;
+
if (running)
sa->running_avg_sum += scaled_delta;
sa->avg_period += delta;
@@ -6014,7 +6028,7 @@ unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
- if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+ if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
return sd->smt_gain / sd->span_weight;
return SCHED_CAPACITY_SCALE;
--
1.9.1
From: Morten Rasmussen <[email protected]>
Architectures that don't have any other means for tracking cpu frequency
changes need a callback from cpufreq to implement a scaling factor to
enable scale-invariant per-entity load-tracking in the scheduler.
To compute the scale invariance correction factor the architecture would
need to know both the max frequency and the current frequency. This
patch defines weak functions for setting both from cpufreq.
Related architecture specific functions use weak function definitions.
The same approach is followed here.
These callbacks can be used to implement frequency scaling of cpu
capacity later.
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
drivers/cpufreq/cpufreq.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 46bed4f..951df85 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -278,6 +278,10 @@ static inline void adjust_jiffies(unsigned long val, struct cpufreq_freqs *ci)
}
#endif
+void __weak arch_scale_set_curr_freq(int cpu, unsigned long freq) {}
+
+void __weak arch_scale_set_max_freq(int cpu, unsigned long freq) {}
+
static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, unsigned int state)
{
@@ -315,6 +319,7 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
pr_debug("FREQ: %lu - CPU: %lu\n",
(unsigned long)freqs->new, (unsigned long)freqs->cpu);
trace_cpu_frequency(freqs->new, freqs->cpu);
+ arch_scale_set_curr_freq(freqs->cpu, freqs->new);
srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
CPUFREQ_POSTCHANGE, freqs);
if (likely(policy) && likely(policy->cpu == freqs->cpu))
@@ -2178,7 +2183,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
struct cpufreq_policy *new_policy)
{
struct cpufreq_governor *old_gov;
- int ret;
+ int ret, cpu;
pr_debug("setting new policy for CPU %u: %u - %u kHz\n",
new_policy->cpu, new_policy->min, new_policy->max);
@@ -2216,6 +2221,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
policy->min = new_policy->min;
policy->max = new_policy->max;
+ for_each_cpu(cpu, policy->cpus)
+ arch_scale_set_max_freq(cpu, policy->max);
+
pr_debug("new min and max freqs are %u - %u kHz\n",
policy->min, policy->max);
--
1.9.1
From: Morten Rasmussen <[email protected]>
Implements arch-specific function to provide the scheduler with a
frequency scaling correction factor for more accurate load-tracking. The
factor is:
current_freq(cpu) * SCHED_CAPACITY_SCALE / max_freq(cpu)
This implementation only provides frequency invariance. No
micro-architecture invariance yet.
Cc: Russell King <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
arch/arm/kernel/topology.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 08b7847..a1274e6 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -169,6 +169,39 @@ static void update_cpu_capacity(unsigned int cpu)
cpu, arch_scale_cpu_capacity(NULL, cpu));
}
+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling.
+ */
+
+static DEFINE_PER_CPU(atomic_long_t, cpu_curr_freq);
+static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
+
+/* cpufreq callback function setting current cpu frequency */
+void arch_scale_set_curr_freq(int cpu, unsigned long freq)
+{
+ atomic_long_set(&per_cpu(cpu_curr_freq, cpu), freq);
+}
+
+/* cpufreq callback function setting max cpu frequency */
+void arch_scale_set_max_freq(int cpu, unsigned long freq)
+{
+ atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
+}
+
+unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+ unsigned long curr = atomic_long_read(&per_cpu(cpu_curr_freq, cpu));
+ unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
+
+ if (!curr || !max)
+ return SCHED_CAPACITY_SCALE;
+
+ return (curr * SCHED_CAPACITY_SCALE) / max;
+}
+
#else
static inline void parse_dt_topology(void) {}
static inline void update_cpu_capacity(unsigned int cpuid) {}
--
1.9.1
From: Dietmar Eggemann <[email protected]>
To enable the parsing of clock frequency and cpu efficiency values
inside parse_dt_topology [arch/arm/kernel/topology.c] to scale the
relative capacity of the cpus, the clock-frequency property has to be
provided within the cpu nodes of the dts file.
The patch is a copy of commit 8f15973ef8c3 ("ARM: vexpress: Add CPU
clock-frequencies to TC2 device-tree") taken from Linaro Stable Kernel
(LSK) massaged into mainline.
Cc: Jon Medhurst <[email protected]>
Cc: Russell King <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
index 33920df..43841b5 100644
--- a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
+++ b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
@@ -39,6 +39,7 @@
reg = <0>;
cci-control-port = <&cci_control1>;
cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
+ clock-frequency = <1000000000>;
};
cpu1: cpu@1 {
@@ -47,6 +48,7 @@
reg = <1>;
cci-control-port = <&cci_control1>;
cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
+ clock-frequency = <1000000000>;
};
cpu2: cpu@2 {
@@ -55,6 +57,7 @@
reg = <0x100>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ clock-frequency = <800000000>;
};
cpu3: cpu@3 {
@@ -63,6 +66,7 @@
reg = <0x101>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ clock-frequency = <800000000>;
};
cpu4: cpu@4 {
@@ -71,6 +75,7 @@
reg = <0x102>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ clock-frequency = <800000000>;
};
idle-states {
--
1.9.1
From: Dietmar Eggemann <[email protected]>
Reuses the existing infrastructure for cpu_scale to provide the scheduler
with a cpu scaling correction factor for more accurate load-tracking.
This factor comprises a micro-architectural part, which is based on the
cpu efficiency value of a cpu as well as a platform-wide max frequency
part, which relates to the dtb property clock-frequency of a cpu node.
The calculation of cpu_scale, return value of arch_scale_cpu_capacity,
changes from:
capacity / middle_capacity
with capacity = (clock_frequency >> 20) * cpu_efficiency
to:
SCHED_CAPACITY_SCALE * cpu_perf / max_cpu_perf
The range of the cpu_scale value changes from
[0..3*SCHED_CAPACITY_SCALE/2] to [0..SCHED_CAPACITY_SCALE].
The functionality to calculate the middle_capacity which corresponds to an
'average' cpu has been taken out since the scaling is now done
differently.
In the case that either the cpu efficiency or the clock-frequency value
for a cpu is missing, no cpu scaling is done for any cpu.
The platform-wide max frequency part of the factor should not be confused
with the frequency-invariant scheduler load-tracking support, which deals
with frequency-related scaling due to DVFS functionality on a cpu.
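As an illustrative calculation only (assuming the mainline ARM
table_efficiency values, 3891 for Cortex-A15 and 2048 for Cortex-A7, and the
TC2 clock-frequency values added by the dts patch earlier in this series),
the new cpu_scale comes out at 1024 for the A15s and roughly 430 for the A7s:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL

/* mirrors capacity = (clock_frequency >> 20) * cpu_efficiency */
static unsigned long cpu_perf(unsigned long freq_hz, unsigned long eff)
{
	return (freq_hz >> 20) * eff;
}

int main(void)
{
	unsigned long a15 = cpu_perf(1000000000UL, 3891);	/* big */
	unsigned long a7 = cpu_perf(800000000UL, 2048);		/* LITTLE */
	unsigned long max = a15 > a7 ? a15 : a7;

	printf("A15 cpu_scale: %lu\n", a15 * SCHED_CAPACITY_SCALE / max);
	printf("A7 cpu_scale: %lu\n", a7 * SCHED_CAPACITY_SCALE / max);
	return 0;
}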
Cc: Russell King <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm/kernel/topology.c | 64 +++++++++++++++++-----------------------------
1 file changed, 23 insertions(+), 41 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index a1274e6..34ecbdc 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -62,9 +62,7 @@ struct cpu_efficiency {
* Table of relative efficiency of each processors
* The efficiency value must fit in 20bit and the final
* cpu_scale value must be in the range
- * 0 < cpu_scale < 3*SCHED_CAPACITY_SCALE/2
- * in order to return at most 1 when DIV_ROUND_CLOSEST
- * is used to compute the capacity of a CPU.
+ * 0 < cpu_scale < SCHED_CAPACITY_SCALE.
* Processors that are not defined in the table,
* use the default SCHED_CAPACITY_SCALE value for cpu_scale.
*/
@@ -77,24 +75,18 @@ static const struct cpu_efficiency table_efficiency[] = {
static unsigned long *__cpu_capacity;
#define cpu_capacity(cpu) __cpu_capacity[cpu]
-static unsigned long middle_capacity = 1;
+static unsigned long max_cpu_perf;
/*
* Iterate all CPUs' descriptor in DT and compute the efficiency
- * (as per table_efficiency). Also calculate a middle efficiency
- * as close as possible to (max{eff_i} - min{eff_i}) / 2
- * This is later used to scale the cpu_capacity field such that an
- * 'average' CPU is of middle capacity. Also see the comments near
- * table_efficiency[] and update_cpu_capacity().
+ * (as per table_efficiency). Calculate the max cpu performance too.
*/
+
static void __init parse_dt_topology(void)
{
const struct cpu_efficiency *cpu_eff;
struct device_node *cn = NULL;
- unsigned long min_capacity = ULONG_MAX;
- unsigned long max_capacity = 0;
- unsigned long capacity = 0;
- int cpu = 0;
+ int cpu = 0, i = 0;
__cpu_capacity = kcalloc(nr_cpu_ids, sizeof(*__cpu_capacity),
GFP_NOWAIT);
@@ -102,6 +94,7 @@ static void __init parse_dt_topology(void)
for_each_possible_cpu(cpu) {
const u32 *rate;
int len;
+ unsigned long cpu_perf;
/* too early to use cpu->of_node */
cn = of_get_cpu_node(cpu, NULL);
@@ -124,46 +117,35 @@ static void __init parse_dt_topology(void)
continue;
}
- capacity = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency;
-
- /* Save min capacity of the system */
- if (capacity < min_capacity)
- min_capacity = capacity;
-
- /* Save max capacity of the system */
- if (capacity > max_capacity)
- max_capacity = capacity;
-
- cpu_capacity(cpu) = capacity;
+ cpu_perf = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency;
+ cpu_capacity(cpu) = cpu_perf;
+ max_cpu_perf = max(max_cpu_perf, cpu_perf);
+ i++;
}
- /* If min and max capacities are equals, we bypass the update of the
- * cpu_scale because all CPUs have the same capacity. Otherwise, we
- * compute a middle_capacity factor that will ensure that the capacity
- * of an 'average' CPU of the system will be as close as possible to
- * SCHED_CAPACITY_SCALE, which is the default value, but with the
- * constraint explained near table_efficiency[].
- */
- if (4*max_capacity < (3*(max_capacity + min_capacity)))
- middle_capacity = (min_capacity + max_capacity)
- >> (SCHED_CAPACITY_SHIFT+1);
- else
- middle_capacity = ((max_capacity / 3)
- >> (SCHED_CAPACITY_SHIFT-1)) + 1;
-
+ if (i < num_possible_cpus())
+ max_cpu_perf = 0;
}
/*
* Look for a customed capacity of a CPU in the cpu_capacity table during the
* boot. The update of all CPUs is in O(n^2) for heteregeneous system but the
- * function returns directly for SMP system.
+ * function returns directly for SMP systems or if there is no complete set
+ * of cpu efficiency, clock frequency data for each cpu.
*/
static void update_cpu_capacity(unsigned int cpu)
{
- if (!cpu_capacity(cpu))
+ unsigned long capacity = cpu_capacity(cpu);
+
+ if (!capacity || !max_cpu_perf) {
+ cpu_capacity(cpu) = 0;
return;
+ }
+
+ capacity *= SCHED_CAPACITY_SCALE;
+ capacity /= max_cpu_perf;
- set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
+ set_capacity_scale(cpu, capacity);
pr_info("CPU%u: update cpu_capacity %lu\n",
cpu, arch_scale_cpu_capacity(NULL, cpu));
--
1.9.1
From: Dietmar Eggemann <[email protected]>
Now that cfs_rq::utilization_load_avg is both frequency invariant and cpu
(uarch plus max system frequency) invariant, frequency and cpu scaling
already happen as part of the load tracking.
So cfs_rq::utilization_load_avg does not have to be scaled by the original
capacity of the cpu again.
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5375ab1..a85c34b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4807,12 +4807,11 @@ static int select_idle_sibling(struct task_struct *p, int target)
static int get_cpu_usage(int cpu)
{
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
- unsigned long capacity = capacity_orig_of(cpu);
if (usage >= SCHED_LOAD_SCALE)
- return capacity;
+ return capacity_orig_of(cpu);
- return (usage * capacity) >> SCHED_LOAD_SHIFT;
+ return usage;
}
/*
--
1.9.1
Introduces the blocked utilization, the utilization counter-part to
cfs_rq->blocked_load_avg. It is the sum of the sched_entity utilization
contributions of entities that were recently on the cfs_rq and are
currently blocked. Combined with the sum of contributions of entities
currently on the cfs_rq or currently running
(cfs_rq->utilization_load_avg) this can provide a more stable average
view of the cpu usage.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 30 +++++++++++++++++++++++++++++-
kernel/sched/sched.h | 8 ++++++--
2 files changed, 35 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a85c34b..0fc8963 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2775,6 +2775,15 @@ static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
cfs_rq->blocked_load_avg = 0;
}
+static inline void subtract_utilization_blocked_contrib(struct cfs_rq *cfs_rq,
+ long utilization_contrib)
+{
+ if (likely(utilization_contrib < cfs_rq->utilization_blocked_avg))
+ cfs_rq->utilization_blocked_avg -= utilization_contrib;
+ else
+ cfs_rq->utilization_blocked_avg = 0;
+}
+
static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
/* Update a sched_entity's runnable average */
@@ -2810,6 +2819,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
cfs_rq->utilization_load_avg += utilization_delta;
} else {
subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ -utilization_delta);
}
}
@@ -2827,14 +2838,20 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
return;
if (atomic_long_read(&cfs_rq->removed_load)) {
- unsigned long removed_load;
+ unsigned long removed_load, removed_utilization;
removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
+ removed_utilization =
+ atomic_long_xchg(&cfs_rq->removed_utilization, 0);
subtract_blocked_load_contrib(cfs_rq, removed_load);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ removed_utilization);
}
if (decays) {
cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
decays);
+ cfs_rq->utilization_blocked_avg =
+ decay_load(cfs_rq->utilization_blocked_avg, decays);
atomic64_add(decays, &cfs_rq->decay_counter);
cfs_rq->last_decay = now;
}
@@ -2881,6 +2898,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
/* migrated tasks did not contribute to our blocked load */
if (wakeup) {
subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ se->avg.utilization_avg_contrib);
update_entity_load_avg(se, 0);
}
@@ -2907,6 +2926,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
if (sleep) {
cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+ cfs_rq->utilization_blocked_avg +=
+ se->avg.utilization_avg_contrib;
se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
} /* migrations, e.g. sleep=0 leave decay_count == 0 */
}
@@ -4927,6 +4948,8 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
se->avg.decay_count = -__synchronize_entity_decay(se);
atomic_long_add(se->avg.load_avg_contrib,
&cfs_rq->removed_load);
+ atomic_long_add(se->avg.utilization_avg_contrib,
+ &cfs_rq->removed_utilization);
}
/* We have migrated, no longer consider this task hot */
@@ -7942,6 +7965,8 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
if (se->avg.decay_count) {
__synchronize_entity_decay(se);
subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ se->avg.utilization_avg_contrib);
}
#endif
}
@@ -8001,6 +8026,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#ifdef CONFIG_SMP
atomic64_set(&cfs_rq->decay_counter, 1);
atomic_long_set(&cfs_rq->removed_load, 0);
+ atomic_long_set(&cfs_rq->removed_utilization, 0);
#endif
}
@@ -8053,6 +8079,8 @@ static void task_move_group_fair(struct task_struct *p, int queued)
*/
se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+ cfs_rq->utilization_blocked_avg +=
+ se->avg.utilization_avg_contrib;
#endif
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e402133..208237f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -368,11 +368,15 @@ struct cfs_rq {
* the blocked sched_entities on the rq.
* utilization_load_avg is the sum of the average running time of the
* sched_entities on the rq.
+ * utilization_blocked_avg is the utilization equivalent of
+ * blocked_load_avg, i.e. the sum of running contributions of blocked
+ * sched_entities associated with the rq.
*/
- unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
+ unsigned long runnable_load_avg, blocked_load_avg;
+ unsigned long utilization_load_avg, utilization_blocked_avg;
atomic64_t decay_counter;
u64 last_decay;
- atomic_long_t removed_load;
+ atomic_long_t removed_load, removed_utilization;
#ifdef CONFIG_FAIR_GROUP_SCHED
/* Required to track per-cpu representation of a task_group */
--
1.9.1
Add the blocked utilization contribution to group sched_entity
utilization (se->avg.utilization_avg_contrib) and to get_cpu_usage().
With this change cpu usage now includes recent usage by currently
non-runnable tasks, hence it provides a more stable view of the cpu
usage. It does, however, also mean that the meaning of usage is changed:
A cpu may be momentarily idle while usage >0. It can no longer be
assumed that cpu usage >0 implies runnable tasks on the rq.
cfs_rq->utilization_load_avg or nr_running should be used instead to get
the current rq status.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fc8963..33d3d81 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2761,7 +2761,8 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
__update_task_entity_utilization(se);
else
se->avg.utilization_avg_contrib =
- group_cfs_rq(se)->utilization_load_avg;
+ group_cfs_rq(se)->utilization_load_avg +
+ group_cfs_rq(se)->utilization_blocked_avg;
return se->avg.utilization_avg_contrib - old_contrib;
}
@@ -4828,11 +4829,12 @@ static int select_idle_sibling(struct task_struct *p, int target)
static int get_cpu_usage(int cpu)
{
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+ unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
- if (usage >= SCHED_LOAD_SCALE)
+ if (usage + blocked >= SCHED_LOAD_SCALE)
return capacity_orig_of(cpu);
- return usage;
+ return usage + blocked;
}
/*
--
1.9.1
This documentation patch provides an overview of the experimental
scheduler energy costing model, associated data structures, and a
reference recipe on how platforms can be characterized to derive energy
models.
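For orientation, the per-level estimation loop described in the 'basic
algorithm' section of the document below can be mirrored with a small
standalone sketch. The two-level group array and the power numbers are made
up for illustration; the real implementation walks the sched_domain hierarchy
and uses struct sched_group_energy data:

#include <stdio.h>

struct group {
	double busy_power;
	double idle_power;
};

/* utilization is in the range [0.0, 1.0], as in the pseudo-code */
static double group_energy(const struct group *g, double util)
{
	return util * g->busy_power + (1.0 - util) * g->idle_power;
}

int main(void)
{
	/* hypothetical per-cpu group and cluster group */
	struct group levels[2] = {
		{ .busy_power = 500.0, .idle_power = 5.0 },	/* cpu */
		{ .busy_power = 800.0, .idle_power = 50.0 },	/* cluster */
	};
	double diff = 0.0;
	int i;

	for (i = 0; i < 2; i++) {
		double before = group_energy(&levels[i], 0.6);
		double after = group_energy(&levels[i], 0.4);

		/* before - after, as in the pseudo-code: positive = savings */
		diff += before - after;
	}
	printf("energy_diff: %.1f bogo-units\n", diff);
	return 0;
}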
Signed-off-by: Morten Rasmussen <[email protected]>
---
Documentation/scheduler/sched-energy.txt | 359 +++++++++++++++++++++++++++++++
1 file changed, 359 insertions(+)
create mode 100644 Documentation/scheduler/sched-energy.txt
diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
new file mode 100644
index 0000000..c179df0
--- /dev/null
+++ b/Documentation/scheduler/sched-energy.txt
@@ -0,0 +1,359 @@
+Energy cost model for energy-aware scheduling (EXPERIMENTAL)
+
+Introduction
+=============
+
+The basic energy model uses platform energy data stored in sched_group_energy
+data structures attached to the sched_groups in the sched_domain hierarchy. The
+energy cost model offers two functions that can be used to guide scheduling
+decisions:
+
+1. static unsigned int sched_group_energy(struct energy_env *eenv)
+2. static int energy_diff(struct energy_env *eenv)
+
+sched_group_energy() estimates the energy consumed by all cpus in a specific
+sched_group including any shared resources owned exclusively by this group of
+cpus. Resources shared with other cpus are excluded (e.g. later level caches).
+
+energy_diff() estimates the total energy impact of a utilization change. That
+is, adding, removing, or migrating utilization (tasks).
+
+Both functions use a struct energy_env to specify the scenario to be evaluated:
+
+ struct energy_env {
+ struct sched_group *sg_top;
+ struct sched_group *sg_cap;
+ int usage_delta;
+ int src_cpu;
+ int dst_cpu;
+ int energy;
+ };
+
+sg_top: sched_group to be evaluated. Not used by energy_diff().
+
+sg_cap: sched_group covering the cpus in the same frequency domain. Set by
+sched_group_energy().
+
+usage_delta: Amount of utilization to be added, removed, or migrated.
+
+src_cpu: Source cpu from where 'usage_delta' utilization is removed. Should be
+-1 if no source (e.g. task wake-up).
+
+dst_cpu: Destination cpu where 'usage_delta' utilization is added. Should be -1
+if utilization is removed (e.g. terminating tasks).
+
+energy: Result of sched_group_energy().
+
+The metric used to represent utilization is the actual per-entity running time
+averaged over time using a geometric series. Very similar to the existing
+per-entity load-tracking, but _not_ scaled by task priority and capped by the
+capacity of the cpu. The latter property does mean that utilization may
+underestimate the compute requirements for tasks on fully/over-utilized cpus.
+The greatest potential for energy savings without affecting performance too
+much is in scenarios where the system isn't fully utilized. If the system is
+deemed fully utilized, load-balancing should be done with task load (which
+includes task priority) instead, in the interest of fairness and performance.
+
+
+Background and Terminology
+===========================
+
+To make it clear from the start:
+
+energy = [joule] (resource like a battery on powered devices)
+power = energy/time = [joule/second] = [watt]
+
+The goal of energy-aware scheduling is to minimize energy, while still getting
+the job done. That is, we want to maximize:
+
+ performance [inst/s]
+ --------------------
+ power [W]
+
+which is equivalent to minimizing:
+
+ energy [J]
+ -----------
+ instruction
+
+while still getting 'good' performance. It is essentially an alternative
+optimization objective to the current performance-only objective for the
+scheduler. This alternative considers two objectives: energy-efficiency and
+performance. Hence, there needs to be a user controllable knob to switch the
+objective. Since it is early days, this is currently a sched_feature
+(ENERGY_AWARE).
+
+The idea behind introducing an energy cost model is to allow the scheduler to
+evaluate the implications of its decisions rather than applying energy-saving
+techniques blindly that may only have positive effects on some platforms. At
+the same time, the energy cost model must be as simple as possible to minimize
+the scheduler latency impact.
+
+Platform topology
+------------------
+
+The system topology (cpus, caches, and NUMA information, not peripherals) is
+represented in the scheduler by the sched_domain hierarchy which has
+sched_groups attached at each level that covers one or more cpus (see
+sched-domains.txt for more details). To add energy awareness to the scheduler
+we need to consider power and frequency domains.
+
+Power domain:
+
+A power domain is a part of the system that can be powered on/off
+independently. Power domains are typically organized in a hierarchy where you
+may be able to power down just a cpu or a group of cpus along with any
+associated resources (e.g. shared caches). Powering up a cpu means that all
+power domains it is a part of in the hierarchy must be powered up. Hence, it is
+more expensive to power up the first cpu that belongs to a higher level power
+domain than powering up additional cpus in the same high level domain. Two
+level power domain hierarchy example:
+
+ Power source
+ +-------------------------------+----...
+per group PD G G
+ | +----------+ |
+ +--------+-------| Shared | (other groups)
+per-cpu PD G G | resource |
+ | | +----------+
+ +-------+ +-------+
+ | CPU 0 | | CPU 1 |
+ +-------+ +-------+
+
+Frequency domain:
+
+Frequency domains (P-states) typically cover the same group of cpus as one of
+the power domain levels. That is, there might be several smaller power domains
+sharing the same frequency (P-state) or there might be a power domain spanning
+multiple frequency domains.
+
+From a scheduling point of view there is no need to know the actual frequencies
+[Hz]. All the scheduler cares about is the compute capacity available at the
+current state (P-state) the cpu is in and any other available states. For that
+reason, and to also factor in any cpu micro-architecture differences, compute
+capacity scaling states are called 'capacity states' in this document. For SMP
+systems this is equivalent to P-states. For mixed micro-architecture systems
+(like ARM big.LITTLE) it is P-states scaled according to the micro-architecture
+performance relative to the other cpus in the system.
+
+Energy modelling:
+------------------
+
+Due to the hierarchical nature of the power domains, the most obvious way to
+model energy costs is therefore to associate power and energy costs with
+domains (groups of cpus). Energy costs of shared resources are associated with
+the group of cpus that share the resources; only the cost of powering the
+cpu itself and any private resources (e.g. private L1 caches) is associated
+with the per-cpu groups (lowest level).
+
+For example, for an SMP system with per-cpu power domains and a cluster level
+(group of cpus) power domain we get the overall energy costs to be:
+
+ energy = energy_cluster + n * energy_cpu
+
+where 'n' is the number of cpus powered up and energy_cluster is the cost paid
+as soon as any cpu in the cluster is powered up.
+
+The power and frequency domains can naturally be mapped onto the existing
+sched_domain hierarchy and sched_groups by adding the necessary data to the
+existing data structures.
+
+The energy model considers energy consumption from two contributors (shown in
+the illustration below):
+
+1. Busy energy: Energy consumed while a cpu and the higher level groups that it
+belongs to are busy running tasks. Busy energy is associated with the state of
+the cpu, not an event. The time the cpu spends in this state varies. Thus, the
+most obvious platform parameter for this contribution is busy power
+(energy/time).
+
+2. Idle energy: Energy consumed while a cpu and higher level groups that it
+belongs to are idle (in a C-state). Like busy energy, idle energy is associated
+with the state of the cpu. Thus, the platform parameter for this contribution
+is idle power (energy/time).
+
+Energy consumed during transitions from an idle-state (C-state) to a busy state
+(P-state) or going the other way is ignored by the model to simplify the energy
+model calculations.
+
+
+ Power
+ ^
+ | busy->idle idle->busy
+ | transition transition
+ |
+ | _ __
+ | / \ / \__________________
+ |______________/ \ /
+ | \ /
+ | Busy \ Idle / Busy
+ | low P-state \____________/ high P-state
+ |
+ +------------------------------------------------------------> time
+
+Busy |--------------| |-----------------|
+
+Wakeup |------| |------|
+
+Idle |------------|
+
+
+The basic algorithm
+====================
+
+The basic idea is to determine the total energy impact when utilization is
+added or removed by estimating the impact at each level in the sched_domain
+hierarchy starting from the bottom (sched_group contains just a single cpu).
+The energy cost comes from busy time (sched_group is awake because one or more
+cpus are busy) and idle time (in an idle-state). Energy model numbers account
+for energy costs associated with all cpus in the sched_group as a group.
+
+ for_each_domain(cpu, sd) {
+ sg = sched_group_of(cpu)
+ energy_before = curr_util(sg) * busy_power(sg)
+ + (1-curr_util(sg)) * idle_power(sg)
+ energy_after = new_util(sg) * busy_power(sg)
+ + (1-new_util(sg)) * idle_power(sg)
+ energy_diff += energy_before - energy_after
+
+ }
+
+ return energy_diff
+
+{curr, new}_util: The cpu utilization at the lowest level and the overall
+non-idle time for the entire group for higher levels. Utilization is in the
+range 0.0 to 1.0 in the pseudo-code.
+
+busy_power: The power consumption of the sched_group.
+
+idle_power: The power consumption of the sched_group when idle.
+
+Note: It is a fundamental assumption that the utilization is (roughly) scale
+invariant. Task utilization tracking factors in any frequency scaling and
+performance scaling differences due to different cpu micro-architectures such
+that task utilization can be used across the entire system.
+
+
+Platform energy data
+=====================
+
+struct sched_group_energy can be attached to sched_groups in the sched_domain
+hierarchy and has the following members:
+
+cap_states:
+ List of struct capacity_state representing the supported capacity states
+ (P-states). struct capacity_state has two members: cap and power, which
+ represents the compute capacity and the busy_power of the state. The
+ list must be ordered by capacity low->high.
+
+nr_cap_states:
+ Number of capacity states in cap_states list.
+
+idle_states:
+ List of struct idle_state containing the idle-state power cost for each
+ idle-state supported by the sched_group. Note that the energy model
+ calculations will use this table to determine idle power even if no idle
+ state is actually entered by cpuidle, e.g. if latency constraints
+ prevent the group from entering a coupled state or no idle-states are
+ supported. Hence, the first entry of the list must be the idle power
+ when the group is idle but no idle state has actually been entered
+ ('active idle'). This entry may be left out for groups with one cpu if
+ the cpu is guaranteed to enter an idle state when idle.
+
+nr_idle_states:
+ Number of idle states in idle_states list.
+
+idle_states_below:
+ Number of idle-states below current level. Filled by generic code, not
+ to be provided by the platform.
+
+There are no unit requirements for the energy cost data. Data can be normalized
+with any reference, however, the normalization must be consistent across all
+energy cost data. That is, one bogo-joule/watt must be the same quantity for
+all data, but we don't care what it is.
+
+A recipe for platform characterization
+=======================================
+
+Obtaining the actual model data for a particular platform requires some way of
+measuring power/energy. There isn't a tool to help with this (yet). This
+section provides a recipe for use as reference. It covers the steps used to
+characterize the ARM TC2 development platform. This sort of measurement is
+expected to be done anyway when tuning cpuidle and cpufreq for a given
+platform.
+
+The energy model needs two types of data (struct sched_group_energy holds
+these) for each sched_group where energy costs should be taken into account:
+
+1. Capacity state information
+
+A list containing the compute capacity and power consumption when fully
+utilized attributed to the group as a whole for each available capacity state.
+At the lowest level (group contains just a single cpu) this is the power of the
+cpu alone without including power consumed by resources shared with other cpus.
+It basically needs to fit the basic modelling approach described in "Background
+and Terminology" section:
+
+ energy_system = energy_shared + n * energy_cpu
+
+for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at
+the lowest level. 'energy_shared' is included at the next level which
+represents the group of cpus among which the resources are shared.
+
+This model is, of course, a simplification of reality. Thus, power/energy
+attributions might not always exactly represent how the hardware is designed.
+Also, busy power is likely to depend on the workload. It is therefore
+recommended to use a representative mix of workloads when characterizing the
+capacity states.
+
+If the group has no capacity scaling support, the list will contain a single
+state where power is the busy power attributed to the group. The capacity
+should be set to a default value (1024).
+
+When frequency domains include multiple power domains, the group representing
+the frequency domain and all child groups share capacity states. This must be
+indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at
+all levels that share the capacity state must have the list of capacity states
+with the power set to the contribution of the individual group.
+
+2. Idle power information
+
+Stored in the idle_states list. The power number is the group idle power
+consumption in each idle state, as well as when the group is idle but has not
+entered an idle-state ('active idle' as mentioned earlier). Due to the way the
+energy model is defined, the idle power of the deepest group idle state can
+alternatively be accounted for in the parent group busy power. In that case the
+group idle state power values are offset such that the idle power of the
+deepest state is zero. It is less intuitive, but it is easier to measure, as
+the idle power consumed by the group and the busy/idle power of the parent
+group cannot be distinguished without per-group measurement points.
+
+Measuring capacity states and idle power:
+
+The capacity states' capacity and power can be estimated by running a benchmark
+workload at each available capacity state. By restricting the benchmark to run
+on subsets of cpus it is possible to extrapolate the power consumption of
+shared resources.
+
+ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a
+shared L2 cache. TC2 has on-chip energy counters per cluster. Running a
+benchmark workload on just one cpu in a cluster means that power is consumed in
+the cluster (higher level group) and a single cpu (lowest level group). Adding
+another benchmark task to another cpu increases the power consumption by the
+amount consumed by the additional cpu. Hence, it is possible to extrapolate the
+cluster busy power.
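+
+As a simple illustration (using made-up numbers, not TC2 measurements): if the
+benchmark consumes 700mW when running on one cpu in the cluster and 1200mW
+when running on two cpus, then:
+
+ cpu busy power = 1200mW - 700mW = 500mW
+ cluster busy power = 700mW - 500mW = 200mW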
+
+For platforms that don't have energy counters or equivalent instrumentation
+built-in, it may be possible to use an external DAQ to acquire similar data.
+
+If the benchmark includes some performance score (for example sysbench cpu
+benchmark), this can be used to record the compute capacity.
+
+Measuring idle power requires insight into the idle state implementation on the
+particular platform. Specifically, if the platform has coupled idle-states (or
+package states). To measure non-coupled per-cpu idle-states it is necessary to
+keep one cpu busy so that any shared resources stay alive, in order to isolate
+the idle power of the cpu from the idle/busy power of the shared resources.
+The cpu can be tricked
+into different per-cpu idle states by disabling the other states. Based on
+various combinations of measurements with specific cpus busy and disabling
+idle-states it is possible to extrapolate the idle-state power.
--
1.9.1
This patch introduces the ENERGY_AWARE sched feature, which is
implemented using jump labels when SCHED_DEBUG is defined. It is
statically set to false when SCHED_DEBUG is not defined, hence energy
awareness cannot be enabled without SCHED_DEBUG. This
sched_feature knob will be replaced later with a more appropriate
control knob when things have matured a bit.
ENERGY_AWARE is based on per-entity load-tracking, hence FAIR_GROUP_SCHED
must be enabled. This dependency isn't checked at compile time yet.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 6 ++++++
kernel/sched/features.h | 6 ++++++
2 files changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33d3d81..2557774 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4554,6 +4554,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
return wl;
}
+
#else
static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
@@ -4563,6 +4564,11 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
#endif
+static inline bool energy_aware(void)
+{
+ return sched_feat(ENERGY_AWARE);
+}
+
static int wake_wide(struct task_struct *p)
{
int factor = this_cpu_read(sd_llc_size);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 90284d1..199ee3a 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,3 +83,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
*/
SCHED_FEAT(NUMA_RESIST_LOWER, false)
#endif
+
+/*
+ * Energy aware scheduling. Use platform energy model to guide scheduling
+ * decisions optimizing for energy efficiency.
+ */
+SCHED_FEAT(ENERGY_AWARE, false)
--
1.9.1
From: Dietmar Eggemann <[email protected]>
The struct sched_group_energy represents the per sched_group related
data which is needed for energy aware scheduling. It contains:
(1) atomic reference counter for scheduler internal bookkeeping of
data allocation and freeing
(2) number of elements of the idle state array
(3) pointer to the idle state array which comprises 'power consumption'
for each idle state
(4) number of elements of the capacity state array
(5) pointer to the capacity state array which comprises 'compute
capacity and power consumption' tuples for each capacity state
Allocation and freeing of struct sched_group_energy utilizes the existing
infrastructure of the scheduler which is currently used for the other sd
hierarchy data structures (e.g. struct sched_domain) as well. That's why
struct sd_data is provisioned with a per cpu struct sched_group_energy
double pointer.
The struct sched_group obtains a pointer to a struct sched_group_energy.
The function pointer sched_domain_energy_f is introduced into struct
sched_domain_topology_level which will allow the arch to pass a particular
struct sched_group_energy from the topology shim layer into the scheduler
core.
The function pointer sched_domain_energy_f has an 'int cpu' parameter
since the folding of two adjacent sd levels via sd degenerate doesn't work
for all sd levels. I.e. it is not possible for example to use this feature
to provide per-cpu energy in sd level DIE on ARM's TC2 platform.
It was discussed that the folding of sd levels approach is preferable
to the cpu parameter approach, simply because the user (the arch
specifying the sd topology table) can introduce fewer errors. But since
it is not working, the 'int cpu' parameter is the only way out. It's
possible to use the folding of sd levels approach for
sched_domain_flags_f and the cpu parameter approach for the
sched_domain_energy_f at the same time though. With the use of the
'int cpu' parameter, an extra check function has to be provided to make
sure that all cpus spanned by a sched group are provisioned with the same
energy data.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
include/linux/sched.h | 20 ++++++++++++++++++++
kernel/sched/sched.h | 1 +
2 files changed, 21 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e220a91..2ea93fb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -944,6 +944,23 @@ struct sched_domain_attr {
extern int sched_domain_level_max;
+struct capacity_state {
+ unsigned long cap; /* compute capacity */
+ unsigned long power; /* power consumption at this compute capacity */
+};
+
+struct idle_state {
+ unsigned long power; /* power consumption in this idle state */
+};
+
+struct sched_group_energy {
+ atomic_t ref;
+ unsigned int nr_idle_states; /* number of idle states */
+ struct idle_state *idle_states; /* ptr to idle state array */
+ unsigned int nr_cap_states; /* number of capacity states */
+ struct capacity_state *cap_states; /* ptr to capacity state array */
+};
+
struct sched_group;
struct sched_domain {
@@ -1042,6 +1059,7 @@ bool cpus_share_cache(int this_cpu, int that_cpu);
typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
typedef int (*sched_domain_flags_f)(void);
+typedef const struct sched_group_energy *(*sched_domain_energy_f)(int cpu);
#define SDTL_OVERLAP 0x01
@@ -1049,11 +1067,13 @@ struct sd_data {
struct sched_domain **__percpu sd;
struct sched_group **__percpu sg;
struct sched_group_capacity **__percpu sgc;
+ struct sched_group_energy **__percpu sge;
};
struct sched_domain_topology_level {
sched_domain_mask_f mask;
sched_domain_flags_f sd_flags;
+ sched_domain_energy_f energy;
int flags;
int numa_level;
struct sd_data data;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 208237f..0e9dcc6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -817,6 +817,7 @@ struct sched_group {
unsigned int group_weight;
struct sched_group_capacity *sgc;
+ struct sched_group_energy *sge;
/*
* The CPUs this group covers.
--
1.9.1
From: Dietmar Eggemann <[email protected]>
The per sched group sched_group_energy structure plus the related
idle_state and capacity_state arrays are allocated like the other sched
domain (sd) hierarchy data structures. This includes the freeing of
sched_group_energy structures which are not used.
One problem is that the number of elements of the idle_state and the
capacity_state arrays is not fixed and has to be retrieved in
__sdt_alloc() to allocate memory for the sched_group_energy structure and
the two arrays in one chunk. The array pointers (idle_states and
cap_states) are initialized here to point to the correct place inside the
memory chunk.
The new function init_sched_energy() initializes the sched_group_energy
structure and the two arrays in case the sd topology level contains energy
information.
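The single-chunk layout can be illustrated with a simplified user-space sketch
of the same allocation pattern (calloc instead of kzalloc_node, field types
reduced to what is needed here):

#include <stdlib.h>

struct idle_state { unsigned long power; };
struct capacity_state { unsigned long cap, power; };

struct sched_group_energy {
	int ref;
	unsigned int nr_idle_states;
	struct idle_state *idle_states;
	unsigned int nr_cap_states;
	struct capacity_state *cap_states;
};

struct sched_group_energy *alloc_sge(unsigned int nr_idle, unsigned int nr_cap)
{
	struct sched_group_energy *sge;

	sge = calloc(1, sizeof(*sge) +
			nr_idle * sizeof(struct idle_state) +
			nr_cap * sizeof(struct capacity_state));
	if (!sge)
		return NULL;

	/* both arrays live directly behind the struct in the same chunk */
	sge->nr_idle_states = nr_idle;
	sge->idle_states = (struct idle_state *)(sge + 1);
	sge->nr_cap_states = nr_cap;
	sge->cap_states = (struct capacity_state *)(sge->idle_states + nr_idle);

	return sge;
}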
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/core.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 33 ++++++++++++++++++++++++
2 files changed, 103 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a00a4c3..031ea48 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5707,6 +5707,7 @@ static void free_sched_domain(struct rcu_head *rcu)
free_sched_groups(sd->groups, 1);
} else if (atomic_dec_and_test(&sd->groups->ref)) {
kfree(sd->groups->sgc);
+ kfree(sd->groups->sge);
kfree(sd->groups);
}
kfree(sd);
@@ -5965,6 +5966,8 @@ static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
*sg = *per_cpu_ptr(sdd->sg, cpu);
(*sg)->sgc = *per_cpu_ptr(sdd->sgc, cpu);
atomic_set(&(*sg)->sgc->ref, 1); /* for claim_allocations */
+ (*sg)->sge = *per_cpu_ptr(sdd->sge, cpu);
+ atomic_set(&(*sg)->sge->ref, 1); /* for claim_allocations */
}
return cpu;
@@ -6054,6 +6057,28 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
atomic_set(&sg->sgc->nr_busy_cpus, sg->group_weight);
}
+static void init_sched_energy(int cpu, struct sched_domain *sd,
+ struct sched_domain_topology_level *tl)
+{
+ struct sched_group *sg = sd->groups;
+ struct sched_group_energy *energy = sg->sge;
+ sched_domain_energy_f fn = tl->energy;
+ struct cpumask *mask = sched_group_cpus(sg);
+
+ if (!fn || !fn(cpu))
+ return;
+
+ if (cpumask_weight(mask) > 1)
+ check_sched_energy_data(cpu, fn, mask);
+
+ energy->nr_idle_states = fn(cpu)->nr_idle_states;
+ memcpy(energy->idle_states, fn(cpu)->idle_states,
+ energy->nr_idle_states*sizeof(struct idle_state));
+ energy->nr_cap_states = fn(cpu)->nr_cap_states;
+ memcpy(energy->cap_states, fn(cpu)->cap_states,
+ energy->nr_cap_states*sizeof(struct capacity_state));
+}
+
/*
* Initializers for schedule domains
* Non-inlined to reduce accumulated stack pressure in build_sched_domains()
@@ -6144,6 +6169,9 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
*per_cpu_ptr(sdd->sgc, cpu) = NULL;
+
+ if (atomic_read(&(*per_cpu_ptr(sdd->sge, cpu))->ref))
+ *per_cpu_ptr(sdd->sge, cpu) = NULL;
}
#ifdef CONFIG_NUMA
@@ -6609,10 +6637,24 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
if (!sdd->sgc)
return -ENOMEM;
+ sdd->sge = alloc_percpu(struct sched_group_energy *);
+ if (!sdd->sge)
+ return -ENOMEM;
+
for_each_cpu(j, cpu_map) {
struct sched_domain *sd;
struct sched_group *sg;
struct sched_group_capacity *sgc;
+ struct sched_group_energy *sge;
+ sched_domain_energy_f fn = tl->energy;
+ unsigned int nr_idle_states = 0;
+ unsigned int nr_cap_states = 0;
+
+ if (fn && fn(j)) {
+ nr_idle_states = fn(j)->nr_idle_states;
+ nr_cap_states = fn(j)->nr_cap_states;
+ BUG_ON(!nr_idle_states || !nr_cap_states);
+ }
sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
@@ -6636,6 +6678,26 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
return -ENOMEM;
*per_cpu_ptr(sdd->sgc, j) = sgc;
+
+ sge = kzalloc_node(sizeof(struct sched_group_energy) +
+ nr_idle_states*sizeof(struct idle_state) +
+ nr_cap_states*sizeof(struct capacity_state),
+ GFP_KERNEL, cpu_to_node(j));
+
+ if (!sge)
+ return -ENOMEM;
+
+ sge->idle_states = (struct idle_state *)
+ ((void *)&sge->cap_states +
+ sizeof(sge->cap_states));
+
+ sge->cap_states = (struct capacity_state *)
+ ((void *)&sge->cap_states +
+ sizeof(sge->cap_states) +
+ nr_idle_states*
+ sizeof(struct idle_state));
+
+ *per_cpu_ptr(sdd->sge, j) = sge;
}
}
@@ -6664,6 +6726,8 @@ static void __sdt_free(const struct cpumask *cpu_map)
kfree(*per_cpu_ptr(sdd->sg, j));
if (sdd->sgc)
kfree(*per_cpu_ptr(sdd->sgc, j));
+ if (sdd->sge)
+ kfree(*per_cpu_ptr(sdd->sge, j));
}
free_percpu(sdd->sd);
sdd->sd = NULL;
@@ -6671,6 +6735,8 @@ static void __sdt_free(const struct cpumask *cpu_map)
sdd->sg = NULL;
free_percpu(sdd->sgc);
sdd->sgc = NULL;
+ free_percpu(sdd->sge);
+ sdd->sge = NULL;
}
}
@@ -6756,10 +6822,13 @@ static int build_sched_domains(const struct cpumask *cpu_map,
/* Calculate CPU capacity for physical packages and nodes */
for (i = nr_cpumask_bits-1; i >= 0; i--) {
+ struct sched_domain_topology_level *tl = sched_domain_topology;
+
if (!cpumask_test_cpu(i, cpu_map))
continue;
- for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+ for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent, tl++) {
+ init_sched_energy(i, sd, tl);
claim_allocations(i, sd);
init_sched_groups_capacity(i, sd);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0e9dcc6..86cf6b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -854,6 +854,39 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
extern int group_balance_cpu(struct sched_group *sg);
+/*
+ * Check that the per-cpu provided sd energy data is consistent for all cpus
+ * within the mask.
+ */
+static inline void check_sched_energy_data(int cpu, sched_domain_energy_f fn,
+ const struct cpumask *cpumask)
+{
+ struct cpumask mask;
+ int i;
+
+ cpumask_xor(&mask, cpumask, get_cpu_mask(cpu));
+
+ for_each_cpu(i, &mask) {
+ int y;
+
+ BUG_ON(fn(i)->nr_idle_states != fn(cpu)->nr_idle_states);
+
+ for (y = 0; y < (fn(i)->nr_idle_states); y++) {
+ BUG_ON(fn(i)->idle_states[y].power !=
+ fn(cpu)->idle_states[y].power);
+ }
+
+ BUG_ON(fn(i)->nr_cap_states != fn(cpu)->nr_cap_states);
+
+ for (y = 0; y < (fn(i)->nr_cap_states); y++) {
+ BUG_ON(fn(i)->cap_states[y].cap !=
+ fn(cpu)->cap_states[y].cap);
+ BUG_ON(fn(i)->cap_states[y].power !=
+ fn(cpu)->cap_states[y].power);
+ }
+ }
+}
+
#else
static inline void sched_ttwu_pending(void) { }
--
1.9.1
cpufreq is currently keeping it a secret which cpus are sharing a
clock source. The scheduler needs to know about clock domains as well
to become more energy aware. The SD_SHARE_CAP_STATES domain flag
indicates whether cpus belonging to the sched_domain share capacity
states (P-states).
There is no connection with cpufreq (yet). The flag must be set by
the arch specific topology code.
cc: Russell King <[email protected]>
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
arch/arm/kernel/topology.c | 3 ++-
include/linux/sched.h | 1 +
kernel/sched/core.c | 10 +++++++---
3 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 34ecbdc..fdbe784 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -292,7 +292,8 @@ void store_cpu_topology(unsigned int cpuid)
static inline int cpu_corepower_flags(void)
{
- return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
+ return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN | \
+ SD_SHARE_CAP_STATES;
}
static struct sched_domain_topology_level arm_topology[] = {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2ea93fb..78b6eb7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -912,6 +912,7 @@ enum cpu_idle_type {
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
#define SD_NUMA 0x4000 /* cross-node balancing */
+#define SD_SHARE_CAP_STATES 0x8000 /* Domain members share capacity state */
#ifdef CONFIG_SCHED_SMT
static inline int cpu_smt_flags(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 031ea48..c49f3ee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5522,7 +5522,8 @@ static int sd_degenerate(struct sched_domain *sd)
SD_BALANCE_EXEC |
SD_SHARE_CPUCAPACITY |
SD_SHARE_PKG_RESOURCES |
- SD_SHARE_POWERDOMAIN)) {
+ SD_SHARE_POWERDOMAIN |
+ SD_SHARE_CAP_STATES)) {
if (sd->groups != sd->groups->next)
return 0;
}
@@ -5554,7 +5555,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
SD_SHARE_CPUCAPACITY |
SD_SHARE_PKG_RESOURCES |
SD_PREFER_SIBLING |
- SD_SHARE_POWERDOMAIN);
+ SD_SHARE_POWERDOMAIN |
+ SD_SHARE_CAP_STATES);
if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;
}
@@ -6190,6 +6192,7 @@ static int sched_domains_curr_level;
* SD_SHARE_PKG_RESOURCES - describes shared caches
* SD_NUMA - describes NUMA topologies
* SD_SHARE_POWERDOMAIN - describes shared power domain
+ * SD_SHARE_CAP_STATES - describes shared capacity states
*
* Odd one out:
* SD_ASYM_PACKING - describes SMT quirks
@@ -6199,7 +6202,8 @@ static int sched_domains_curr_level;
SD_SHARE_PKG_RESOURCES | \
SD_NUMA | \
SD_ASYM_PACKING | \
- SD_SHARE_POWERDOMAIN)
+ SD_SHARE_POWERDOMAIN | \
+ SD_SHARE_CAP_STATES)
static struct sched_domain *
sd_init(struct sched_domain_topology_level *tl, int cpu)
--
1.9.1
From: Dietmar Eggemann <[email protected]>
This patch is only here to be able to test provisioning of energy-related
data from an arch topology shim layer to the scheduler. Since there is no
code today which extracts energy-related data from the dtb or acpi and
processes it in the topology shim layer, the contents of the
sched_group_energy structures as well as the idle_state and capacity_state
arrays are hard-coded here.
This patch defines the sched_group_energy structure as well as the
idle_state and capacity_state arrays for the cluster (relating to sched
groups (sgs) at DIE sched domain level) and for the core (relating to sgs
at MC sd level) for a Cortex A7 as well as for a Cortex A15.
It further provides related implementations of the sched_domain_energy_f
functions (cpu_cluster_energy() and cpu_core_energy()).
To be able to propagate this information from the topology shim layer to
the scheduler, the elements of the arm_topology[] table have been
provisioned with the appropriate sched_domain_energy_f functions.
cc: Russell King <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm/kernel/topology.c | 118 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 115 insertions(+), 3 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index fdbe784..d3811e5 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -290,6 +290,119 @@ void store_cpu_topology(unsigned int cpuid)
cpu_topology[cpuid].socket_id, mpidr);
}
+/*
+ * ARM TC2 specific energy cost model data. There are no unit requirements for
+ * the data. Data can be normalized to any reference point, but the
+ * normalization must be consistent. That is, one bogo-joule/watt must be the
+ * same quantity for all data, but we don't care what it is.
+ */
+static struct idle_state idle_states_cluster_a7[] = {
+ { .power = 25 }, /* WFI */
+ { .power = 10 }, /* cluster-sleep-l */
+ };
+
+static struct idle_state idle_states_cluster_a15[] = {
+ { .power = 70 }, /* WFI */
+ { .power = 25 }, /* cluster-sleep-b */
+ };
+
+static struct capacity_state cap_states_cluster_a7[] = {
+ /* Cluster only power */
+ { .cap = 150, .power = 2967, }, /* 350 MHz */
+ { .cap = 172, .power = 2792, }, /* 400 MHz */
+ { .cap = 215, .power = 2810, }, /* 500 MHz */
+ { .cap = 258, .power = 2815, }, /* 600 MHz */
+ { .cap = 301, .power = 2919, }, /* 700 MHz */
+ { .cap = 344, .power = 2847, }, /* 800 MHz */
+ { .cap = 387, .power = 3917, }, /* 900 MHz */
+ { .cap = 430, .power = 4905, }, /* 1000 MHz */
+ };
+
+static struct capacity_state cap_states_cluster_a15[] = {
+ /* Cluster only power */
+ { .cap = 426, .power = 7920, }, /* 500 MHz */
+ { .cap = 512, .power = 8165, }, /* 600 MHz */
+ { .cap = 597, .power = 8172, }, /* 700 MHz */
+ { .cap = 682, .power = 8195, }, /* 800 MHz */
+ { .cap = 768, .power = 8265, }, /* 900 MHz */
+ { .cap = 853, .power = 8446, }, /* 1000 MHz */
+ { .cap = 938, .power = 11426, }, /* 1100 MHz */
+ { .cap = 1024, .power = 15200, }, /* 1200 MHz */
+ };
+
+static struct sched_group_energy energy_cluster_a7 = {
+ .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a7),
+ .idle_states = idle_states_cluster_a7,
+ .nr_cap_states = ARRAY_SIZE(cap_states_cluster_a7),
+ .cap_states = cap_states_cluster_a7,
+};
+
+static struct sched_group_energy energy_cluster_a15 = {
+ .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a15),
+ .idle_states = idle_states_cluster_a15,
+ .nr_cap_states = ARRAY_SIZE(cap_states_cluster_a15),
+ .cap_states = cap_states_cluster_a15,
+};
+
+static struct idle_state idle_states_core_a7[] = {
+ { .power = 0 }, /* WFI */
+ };
+
+static struct idle_state idle_states_core_a15[] = {
+ { .power = 0 }, /* WFI */
+ };
+
+static struct capacity_state cap_states_core_a7[] = {
+ /* Power per cpu */
+ { .cap = 150, .power = 187, }, /* 350 MHz */
+ { .cap = 172, .power = 275, }, /* 400 MHz */
+ { .cap = 215, .power = 334, }, /* 500 MHz */
+ { .cap = 258, .power = 407, }, /* 600 MHz */
+ { .cap = 301, .power = 447, }, /* 700 MHz */
+ { .cap = 344, .power = 549, }, /* 800 MHz */
+ { .cap = 387, .power = 761, }, /* 900 MHz */
+ { .cap = 430, .power = 1024, }, /* 1000 MHz */
+ };
+
+static struct capacity_state cap_states_core_a15[] = {
+ /* Power per cpu */
+ { .cap = 426, .power = 2021, }, /* 500 MHz */
+ { .cap = 512, .power = 2312, }, /* 600 MHz */
+ { .cap = 597, .power = 2756, }, /* 700 MHz */
+ { .cap = 682, .power = 3125, }, /* 800 MHz */
+ { .cap = 768, .power = 3524, }, /* 900 MHz */
+ { .cap = 853, .power = 3846, }, /* 1000 MHz */
+ { .cap = 938, .power = 5177, }, /* 1100 MHz */
+ { .cap = 1024, .power = 6997, }, /* 1200 MHz */
+ };
+
+static struct sched_group_energy energy_core_a7 = {
+ .nr_idle_states = ARRAY_SIZE(idle_states_core_a7),
+ .idle_states = idle_states_core_a7,
+ .nr_cap_states = ARRAY_SIZE(cap_states_core_a7),
+ .cap_states = cap_states_core_a7,
+};
+
+static struct sched_group_energy energy_core_a15 = {
+ .nr_idle_states = ARRAY_SIZE(idle_states_core_a15),
+ .idle_states = idle_states_core_a15,
+ .nr_cap_states = ARRAY_SIZE(cap_states_core_a15),
+ .cap_states = cap_states_core_a15,
+};
+
+/* sd energy functions */
+static inline const struct sched_group_energy *cpu_cluster_energy(int cpu)
+{
+ return cpu_topology[cpu].socket_id ? &energy_cluster_a7 :
+ &energy_cluster_a15;
+}
+
+static inline const struct sched_group_energy *cpu_core_energy(int cpu)
+{
+ return cpu_topology[cpu].socket_id ? &energy_core_a7 :
+ &energy_core_a15;
+}
+
static inline int cpu_corepower_flags(void)
{
return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN | \
@@ -298,10 +411,9 @@ static inline int cpu_corepower_flags(void)
static struct sched_domain_topology_level arm_topology[] = {
#ifdef CONFIG_SCHED_MC
- { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
- { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+ { cpu_coregroup_mask, cpu_corepower_flags, cpu_core_energy, SD_INIT_NAME(MC) },
#endif
- { cpu_cpu_mask, SD_INIT_NAME(DIE) },
+ { cpu_cpu_mask, 0, cpu_cluster_energy, SD_INIT_NAME(DIE) },
{ NULL, },
};
--
1.9.1
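For illustration only, here is a minimal standalone (userspace) sketch of
how such a capacity table is meant to be consulted: given a utilization
value, pick the lowest capacity state that can serve it and read off its
power. It mirrors the find_new_capacity() logic introduced later in the
series; the table values are copied from the patch above and the chosen
utilization is hypothetical.

#include <stdio.h>

/* Simplified mirror of the kernel struct, enough for a standalone sketch. */
struct capacity_state {
	unsigned long cap;
	unsigned long power;
};

/* Per-cpu A7 table, values copied from the patch above (bogo-units). */
static struct capacity_state cap_states_core_a7[] = {
	{ 150,  187 }, { 172,  275 }, { 215,  334 }, { 258,  407 },
	{ 301,  447 }, { 344,  549 }, { 387,  761 }, { 430, 1024 },
};

/* Pick the lowest capacity state that can serve 'util' (saturate at top). */
static int pick_cap_state(const struct capacity_state *cs, int n,
			  unsigned long util)
{
	int i;

	for (i = 0; i < n; i++)
		if (cs[i].cap >= util)
			return i;
	return n - 1;
}

int main(void)
{
	int n = sizeof(cap_states_core_a7) / sizeof(cap_states_core_a7[0]);
	unsigned long util = 300;	/* hypothetical cpu usage */
	int idx = pick_cap_state(cap_states_core_a7, n, util);

	/* For util=300 this selects cap=301 (700 MHz) with power 447. */
	printf("util=%lu -> cap=%lu power=%lu\n",
	       util, cap_states_core_a7[idx].cap, cap_states_core_a7[idx].power);
	return 0;
}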
capacity_orig_of() returns the max available compute capacity of a cpu.
For scale-invariant utilization tracking and energy-aware scheduling
decisions it is useful to know the compute capacity available at the
current OPP of a cpu.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2557774..70acb4c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4564,6 +4564,17 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
#endif
+/*
+ * Returns the current capacity of cpu after applying both
+ * cpu and freq scaling.
+ */
+static unsigned long capacity_curr_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_orig *
+ arch_scale_freq_capacity(NULL, cpu)
+ >> SCHED_CAPACITY_SHIFT;
+}
+
static inline bool energy_aware(void)
{
return sched_feat(ENERGY_AWARE);
--
1.9.1
Move get_cpu_usage() to an earlier position in fair.c.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 56 ++++++++++++++++++++++++++---------------------------
1 file changed, 28 insertions(+), 28 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 70acb4c..071310a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4575,6 +4575,34 @@ static unsigned long capacity_curr_of(int cpu)
>> SCHED_CAPACITY_SHIFT;
}
+/*
+ * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
+ * tasks. The unit of the return value must capacity so we can compare the
+ * usage with the capacity of the CPU that is available for CFS task (ie
+ * cpu_capacity).
+ * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
+ * CPU. It represents the amount of utilization of a CPU in the range
+ * [0..SCHED_LOAD_SCALE]. The usage of a CPU can't be higher than the full
+ * capacity of the CPU because it's about the running time on this CPU.
+ * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
+ * because of unfortunate rounding in avg_period and running_load_avg or just
+ * after migrating tasks until the average stabilizes with the new running
+ * time. So we need to check that the usage stays into the range
+ * [0..cpu_capacity_orig] and cap if necessary.
+ * Without capping the usage, a group could be seen as overloaded (CPU0 usage
+ * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity/
+ */
+static int get_cpu_usage(int cpu)
+{
+ unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+ unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
+
+ if (usage + blocked >= SCHED_LOAD_SCALE)
+ return capacity_orig_of(cpu);
+
+ return usage + blocked;
+}
+
static inline bool energy_aware(void)
{
return sched_feat(ENERGY_AWARE);
@@ -4827,34 +4855,6 @@ static int select_idle_sibling(struct task_struct *p, int target)
return target;
}
/*
- * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
- * tasks. The unit of the return value must capacity so we can compare the
- * usage with the capacity of the CPU that is available for CFS task (ie
- * cpu_capacity).
- * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
- * CPU. It represents the amount of utilization of a CPU in the range
- * [0..SCHED_LOAD_SCALE]. The usage of a CPU can't be higher than the full
- * capacity of the CPU because it's about the running time on this CPU.
- * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
- * because of unfortunate rounding in avg_period and running_load_avg or just
- * after migrating tasks until the average stabilizes with the new running
- * time. So we need to check that the usage stays into the range
- * [0..cpu_capacity_orig] and cap if necessary.
- * Without capping the usage, a group could be seen as overloaded (CPU0 usage
- * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity/
- */
-static int get_cpu_usage(int cpu)
-{
- unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
- unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
-
- if (usage + blocked >= SCHED_LOAD_SCALE)
- return capacity_orig_of(cpu);
-
- return usage + blocked;
-}
-
-/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
* SD_BALANCE_FORK, or SD_BALANCE_EXEC.
--
1.9.1
With scale-invariant usage tracking get_cpu_usage() should never return
a usage above the current compute capacity of the cpu (capacity_curr).
The scaling of the utilization tracking contributions should generally
cause the cpu utilization to saturate at capacity_curr, but it may
temporarily exceed this value in certain situations. This patch changes
the cap from capacity_orig to capacity_curr.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 071310a..872ae0e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4582,13 +4582,13 @@ static unsigned long capacity_curr_of(int cpu)
* cpu_capacity).
* cfs.utilization_load_avg is the sum of running time of runnable tasks on a
* CPU. It represents the amount of utilization of a CPU in the range
- * [0..SCHED_LOAD_SCALE]. The usage of a CPU can't be higher than the full
+ * [0..capacity_curr]. The usage of a CPU can't be higher than the current
* capacity of the CPU because it's about the running time on this CPU.
- * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
+ * Nevertheless, cfs.utilization_load_avg can be higher than capacity_curr
* because of unfortunate rounding in avg_period and running_load_avg or just
* after migrating tasks until the average stabilizes with the new running
* time. So we need to check that the usage stays into the range
- * [0..cpu_capacity_orig] and cap if necessary.
+ * [0..cpu_capacity_curr] and cap if necessary.
* Without capping the usage, a group could be seen as overloaded (CPU0 usage
* at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity/
*/
@@ -4596,9 +4596,10 @@ static int get_cpu_usage(int cpu)
{
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
+ unsigned long capacity_curr = capacity_curr_of(cpu);
- if (usage + blocked >= SCHED_LOAD_SCALE)
- return capacity_orig_of(cpu);
+ if (usage + blocked >= capacity_curr)
+ return capacity_curr;
return usage + blocked;
}
--
1.9.1
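As a rough standalone sketch of the arithmetic behind the capping above
(not part of the patch; the frequency ratio and usage values are
hypothetical): capacity_curr is capacity_orig scaled by the current/max
frequency ratio expressed on the 1024 scale, and the reported usage is
clamped to it.

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10

/* capacity_curr = capacity_orig * (curr_freq/max_freq expressed on 1024). */
static unsigned long capacity_curr(unsigned long capacity_orig,
				   unsigned long freq_scale)
{
	return capacity_orig * freq_scale >> SCHED_CAPACITY_SHIFT;
}

/* Usage is clamped to the capacity available at the current OPP. */
static unsigned long capped_usage(unsigned long usage, unsigned long cap_curr)
{
	return usage >= cap_curr ? cap_curr : usage;
}

int main(void)
{
	/* Hypothetical A7 (capacity_orig 430) running at 800 of 1000 MHz. */
	unsigned long cap = capacity_curr(430, 800 * 1024 / 1000);

	printf("capacity_curr=%lu, usage 500 is capped to %lu\n",
	       cap, capped_usage(500, cap));
	return 0;
}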
Add another member to the family of per-cpu sched_domain shortcut
pointers. This one, sd_ea, points to the highest level at which an energy
model is provided. At this level and at all levels below, all sched_groups
have energy model data attached.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/core.c | 11 ++++++++++-
kernel/sched/sched.h | 1 +
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c49f3ee..e47febf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5741,11 +5741,12 @@ DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
DEFINE_PER_CPU(struct sched_domain *, sd_busy);
DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_PER_CPU(struct sched_domain *, sd_ea);
static void update_top_cache_domain(int cpu)
{
struct sched_domain *sd;
- struct sched_domain *busy_sd = NULL;
+ struct sched_domain *busy_sd = NULL, *ea_sd = NULL;
int id = cpu;
int size = 1;
@@ -5766,6 +5767,14 @@ static void update_top_cache_domain(int cpu)
sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
+
+ for_each_domain(cpu, sd) {
+ if (sd->groups->sge)
+ ea_sd = sd;
+ else
+ break;
+ }
+ rcu_assign_pointer(per_cpu(sd_ea, cpu), ea_sd);
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 86cf6b2..dedf0ec 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -793,6 +793,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
DECLARE_PER_CPU(struct sched_domain *, sd_busy);
DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(struct sched_domain *, sd_ea);
struct sched_group_capacity {
atomic_t ref;
--
1.9.1
For energy-aware load-balancing decisions it is necessary to know the
energy consumption estimates of groups of cpus. This patch introduces a
basic function, sched_group_energy(), which estimates the energy
consumption of the cpus in the group and any resources shared by the
members of the group.
NOTE: The function has five levels of indentation and breaks the 80
character limit. Refactoring is necessary.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 143 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 872ae0e..d12aa63 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4609,6 +4609,149 @@ static inline bool energy_aware(void)
return sched_feat(ENERGY_AWARE);
}
+/*
+ * cpu_norm_usage() returns the cpu usage relative to it's current capacity,
+ * i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
+ * energy calculations. Using the scale-invariant usage returned by
+ * get_cpu_usage() and approximating scale-invariant usage by:
+ *
+ * usage ~ (curr_freq/max_freq)*1024 * capacity_orig/1024 * running_time/time
+ *
+ * the normalized usage can be found using capacity_curr.
+ *
+ * capacity_curr = capacity_orig * curr_freq/max_freq
+ *
+ * norm_usage = running_time/time ~ usage/capacity_curr
+ */
+static inline unsigned long cpu_norm_usage(int cpu)
+{
+ unsigned long capacity_curr = capacity_curr_of(cpu);
+
+ return (get_cpu_usage(cpu) << SCHED_CAPACITY_SHIFT)/capacity_curr;
+}
+
+static unsigned group_max_usage(struct sched_group *sg)
+{
+ int i;
+ int max_usage = 0;
+
+ for_each_cpu(i, sched_group_cpus(sg))
+ max_usage = max(max_usage, get_cpu_usage(i));
+
+ return max_usage;
+}
+
+/*
+ * group_norm_usage() returns the approximated group usage relative to it's
+ * current capacity (busy ratio) in the range [0..SCHED_LOAD_SCALE] for use in
+ * energy calculations. Since task executions may or may not overlap in time in
+ * the group the true normalized usage is between max(cpu_norm_usage(i)) and
+ * sum(cpu_norm_usage(i)) when iterating over all cpus in the group, i. The
+ * latter is used as the estimate as it leads to a more pessimistic energy
+ * estimate (more busy).
+ */
+static unsigned group_norm_usage(struct sched_group *sg)
+{
+ int i;
+ unsigned long usage_sum = 0;
+
+ for_each_cpu(i, sched_group_cpus(sg))
+ usage_sum += cpu_norm_usage(i);
+
+ if (usage_sum > SCHED_CAPACITY_SCALE)
+ return SCHED_CAPACITY_SCALE;
+ return usage_sum;
+}
+
+static int find_new_capacity(struct sched_group *sg,
+ struct sched_group_energy *sge)
+{
+ int idx;
+ unsigned long util = group_max_usage(sg);
+
+ for (idx = 0; idx < sge->nr_cap_states; idx++) {
+ if (sge->cap_states[idx].cap >= util)
+ return idx;
+ }
+
+ return idx;
+}
+
+/*
+ * sched_group_energy(): Returns absolute energy consumption of cpus belonging
+ * to the sched_group including shared resources shared only by members of the
+ * group. Iterates over all cpus in the hierarchy below the sched_group starting
+ * from the bottom working it's way up before going to the next cpu until all
+ * cpus are covered at all levels. The current implementation is likely to
+ * gather the same usage statistics multiple times. This can probably be done in
+ * a faster but more complex way.
+ */
+static unsigned int sched_group_energy(struct sched_group *sg_top)
+{
+ struct sched_domain *sd;
+ int cpu, total_energy = 0;
+ struct cpumask visit_cpus;
+ struct sched_group *sg;
+
+ WARN_ON(!sg_top->sge);
+
+ cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
+
+ while (!cpumask_empty(&visit_cpus)) {
+ struct sched_group *sg_shared_cap = NULL;
+
+ cpu = cpumask_first(&visit_cpus);
+
+ /*
+ * Is the group utilization affected by cpus outside this
+ * sched_group?
+ */
+ sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
+ if (sd && sd->parent)
+ sg_shared_cap = sd->parent->groups;
+
+ for_each_domain(cpu, sd) {
+ sg = sd->groups;
+
+ /* Has this sched_domain already been visited? */
+ if (sd->child && cpumask_first(sched_group_cpus(sg)) != cpu)
+ break;
+
+ do {
+ struct sched_group *sg_cap_util;
+ unsigned group_util;
+ int sg_busy_energy, sg_idle_energy;
+ int cap_idx;
+
+ if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
+ sg_cap_util = sg_shared_cap;
+ else
+ sg_cap_util = sg;
+
+ cap_idx = find_new_capacity(sg_cap_util, sg->sge);
+ group_util = group_norm_usage(sg);
+ sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
+ >> SCHED_CAPACITY_SHIFT;
+ sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
+ >> SCHED_CAPACITY_SHIFT;
+
+ total_energy += sg_busy_energy + sg_idle_energy;
+
+ if (!sd->child)
+ cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
+
+ if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(sg_top)))
+ goto next_cpu;
+
+ } while (sg = sg->next, sg != sd->groups);
+ }
+next_cpu:
+ continue;
+ }
+
+ return total_energy;
+}
+
static int wake_wide(struct task_struct *p)
{
int factor = this_cpu_read(sd_llc_size);
--
1.9.1
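A minimal standalone sketch of the busy/idle energy split computed per
sched_group above (not part of the patch; the power numbers are taken from
the TC2 tables earlier in the series and the busy ratio is hypothetical):

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/*
 * Per-group estimate as in sched_group_energy(): busy for group_util/1024
 * of the time at the busy power of the chosen capacity state, idle for the
 * remainder at the idle-state power (all in the same bogo-units).
 */
static unsigned long group_energy(unsigned long group_util,
				  unsigned long busy_power,
				  unsigned long idle_power)
{
	unsigned long busy = (group_util * busy_power) >> SCHED_CAPACITY_SHIFT;
	unsigned long idle = ((SCHED_CAPACITY_SCALE - group_util) * idle_power)
						>> SCHED_CAPACITY_SHIFT;

	return busy + idle;
}

int main(void)
{
	/*
	 * Hypothetical A7 core at its 1000 MHz state (power 1024 in the TC2
	 * tables), 50% busy, WFI idle power 0: the estimate is 512.
	 */
	printf("estimated group energy: %lu\n", group_energy(512, 1024, 0));
	return 0;
}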
Extend sched_group_energy() to support energy prediction with usage
(tasks) added to/removed from a specific cpu or migrated between a pair
of cpus. This is useful for load-balancing decision making.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 90 +++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 66 insertions(+), 24 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d12aa63..07c84af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4592,23 +4592,44 @@ static unsigned long capacity_curr_of(int cpu)
* Without capping the usage, a group could be seen as overloaded (CPU0 usage
* at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity/
*/
-static int get_cpu_usage(int cpu)
+static int __get_cpu_usage(int cpu, int delta)
{
+ int sum;
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
unsigned long capacity_curr = capacity_curr_of(cpu);
- if (usage + blocked >= capacity_curr)
+ sum = usage + blocked + delta;
+
+ if (sum < 0)
+ return 0;
+
+ if (sum >= capacity_curr)
return capacity_curr;
- return usage + blocked;
+ return sum;
}
+static int get_cpu_usage(int cpu)
+{
+ return __get_cpu_usage(cpu, 0);
+}
+
+
static inline bool energy_aware(void)
{
return sched_feat(ENERGY_AWARE);
}
+struct energy_env {
+ struct sched_group *sg_top;
+ struct sched_group *sg_cap;
+ int usage_delta;
+ int src_cpu;
+ int dst_cpu;
+ int energy;
+};
+
/*
* cpu_norm_usage() returns the cpu usage relative to it's current capacity,
* i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
@@ -4623,20 +4644,38 @@ static inline bool energy_aware(void)
*
* norm_usage = running_time/time ~ usage/capacity_curr
*/
-static inline unsigned long cpu_norm_usage(int cpu)
+static inline unsigned long __cpu_norm_usage(int cpu, int delta)
{
unsigned long capacity_curr = capacity_curr_of(cpu);
- return (get_cpu_usage(cpu) << SCHED_CAPACITY_SHIFT)/capacity_curr;
+ return (__get_cpu_usage(cpu, delta) << SCHED_CAPACITY_SHIFT)
+ /capacity_curr;
}
-static unsigned group_max_usage(struct sched_group *sg)
+static inline unsigned long cpu_norm_usage(int cpu)
{
- int i;
+ return __cpu_norm_usage(cpu, 0);
+}
+
+static inline int calc_usage_delta(struct energy_env *eenv, int cpu)
+{
+ if (cpu == eenv->src_cpu)
+ return -eenv->usage_delta;
+ if (cpu == eenv->dst_cpu)
+ return eenv->usage_delta;
+ return 0;
+}
+
+static unsigned group_max_usage(struct energy_env *eenv,
+ struct sched_group *sg)
+{
+ int i, delta;
int max_usage = 0;
- for_each_cpu(i, sched_group_cpus(sg))
- max_usage = max(max_usage, get_cpu_usage(i));
+ for_each_cpu(i, sched_group_cpus(sg)) {
+ delta = calc_usage_delta(eenv, i);
+ max_usage = max(max_usage, __get_cpu_usage(i, delta));
+ }
return max_usage;
}
@@ -4650,24 +4689,27 @@ static unsigned group_max_usage(struct sched_group *sg)
* latter is used as the estimate as it leads to a more pessimistic energy
* estimate (more busy).
*/
-static unsigned group_norm_usage(struct sched_group *sg)
+static unsigned group_norm_usage(struct energy_env *eenv,
+ struct sched_group *sg)
{
- int i;
+ int i, delta;
unsigned long usage_sum = 0;
- for_each_cpu(i, sched_group_cpus(sg))
- usage_sum += cpu_norm_usage(i);
+ for_each_cpu(i, sched_group_cpus(sg)) {
+ delta = calc_usage_delta(eenv, i);
+ usage_sum += __cpu_norm_usage(i, delta);
+ }
if (usage_sum > SCHED_CAPACITY_SCALE)
return SCHED_CAPACITY_SCALE;
return usage_sum;
}
-static int find_new_capacity(struct sched_group *sg,
+static int find_new_capacity(struct energy_env *eenv,
struct sched_group_energy *sge)
{
int idx;
- unsigned long util = group_max_usage(sg);
+ unsigned long util = group_max_usage(eenv, eenv->sg_cap);
for (idx = 0; idx < sge->nr_cap_states; idx++) {
if (sge->cap_states[idx].cap >= util)
@@ -4686,16 +4728,16 @@ static int find_new_capacity(struct sched_group *sg,
* gather the same usage statistics multiple times. This can probably be done in
* a faster but more complex way.
*/
-static unsigned int sched_group_energy(struct sched_group *sg_top)
+static unsigned int sched_group_energy(struct energy_env *eenv)
{
struct sched_domain *sd;
int cpu, total_energy = 0;
struct cpumask visit_cpus;
struct sched_group *sg;
- WARN_ON(!sg_top->sge);
+ WARN_ON(!eenv->sg_top->sge);
- cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
+ cpumask_copy(&visit_cpus, sched_group_cpus(eenv->sg_top));
while (!cpumask_empty(&visit_cpus)) {
struct sched_group *sg_shared_cap = NULL;
@@ -4718,18 +4760,17 @@ static unsigned int sched_group_energy(struct sched_group *sg_top)
break;
do {
- struct sched_group *sg_cap_util;
unsigned group_util;
int sg_busy_energy, sg_idle_energy;
int cap_idx;
if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
- sg_cap_util = sg_shared_cap;
+ eenv->sg_cap = sg_shared_cap;
else
- sg_cap_util = sg;
+ eenv->sg_cap = sg;
- cap_idx = find_new_capacity(sg_cap_util, sg->sge);
- group_util = group_norm_usage(sg);
+ cap_idx = find_new_capacity(eenv, sg->sge);
+ group_util = group_norm_usage(eenv, sg);
sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
>> SCHED_CAPACITY_SHIFT;
sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
@@ -4740,7 +4781,7 @@ static unsigned int sched_group_energy(struct sched_group *sg_top)
if (!sd->child)
cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
- if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(sg_top)))
+ if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(eenv->sg_top)))
goto next_cpu;
} while (sg = sg->next, sg != sd->groups);
@@ -4749,6 +4790,7 @@ static unsigned int sched_group_energy(struct sched_group *sg_top)
continue;
}
+ eenv->energy = total_energy;
return total_energy;
}
--
1.9.1
Add a generic energy-aware helper function, energy_diff(), that
calculates the energy impact of adding, removing, and migrating
utilization in the system.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 07c84af..b371f32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4794,6 +4794,57 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
return total_energy;
}
+/*
+ * energy_diff(): Estimate the energy impact of changing the utilization
+ * distribution. eenv specifies the change: utilisation amount, source, and
+ * destination cpu. Source or destination cpu may be -1 in which case the
+ * utilization is removed from or added to the system (e.g. task wake-up). If
+ * both are specified, the utilization is migrated.
+ */
+static int energy_diff(struct energy_env *eenv)
+{
+ struct sched_domain *sd;
+ struct sched_group *sg;
+ int sd_cpu = -1, energy_before = 0, energy_after = 0;
+
+ struct energy_env eenv_before = {
+ .usage_delta = 0,
+ .src_cpu = eenv->src_cpu,
+ .dst_cpu = eenv->dst_cpu,
+ };
+
+ if (eenv->src_cpu == eenv->dst_cpu)
+ return 0;
+
+ sd_cpu = (eenv->src_cpu != -1) ? eenv->src_cpu : eenv->dst_cpu;
+ sd = rcu_dereference(per_cpu(sd_ea, sd_cpu));
+
+ if (!sd)
+ return 0; /* Error */
+
+ sg = sd->groups;
+ do {
+ if (eenv->src_cpu != -1 && cpumask_test_cpu(eenv->src_cpu,
+ sched_group_cpus(sg))) {
+ eenv_before.sg_top = eenv->sg_top = sg;
+ energy_before += sched_group_energy(&eenv_before);
+ energy_after += sched_group_energy(eenv);
+
+ /* src_cpu and dst_cpu may belong to the same group */
+ continue;
+ }
+
+ if (eenv->dst_cpu != -1 && cpumask_test_cpu(eenv->dst_cpu,
+ sched_group_cpus(sg))) {
+ eenv_before.sg_top = eenv->sg_top = sg;
+ energy_before += sched_group_energy(&eenv_before);
+ energy_after += sched_group_energy(eenv);
+ }
+ } while (sg = sg->next, sg != sd->groups);
+
+ return energy_after-energy_before;
+}
+
static int wake_wide(struct task_struct *p)
{
int factor = this_cpu_read(sd_llc_size);
--
1.9.1
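To illustrate how the sign of energy_diff() is meant to be read, here is a
much-simplified standalone sketch that only models two cpus and ignores
the cluster-level terms; the utilization and power numbers are
hypothetical.

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/* Busy/idle energy split for a single cpu, same formula as above. */
static long cpu_energy(unsigned long util, unsigned long busy_power,
		       unsigned long idle_power)
{
	return (long)((util * busy_power) >> SCHED_CAPACITY_SHIFT) +
	       (long)(((SCHED_CAPACITY_SCALE - util) * idle_power)
						>> SCHED_CAPACITY_SHIFT);
}

int main(void)
{
	/*
	 * Hypothetical migration of 200 units of utilization from a big cpu
	 * (busy power 2021) to a little cpu (busy power 187), idle power 0
	 * on both; cluster-level contributions are ignored in this sketch.
	 */
	long before = cpu_energy(200, 2021, 0) + cpu_energy(0, 187, 0);
	long after  = cpu_energy(0, 2021, 0) + cpu_energy(200, 187, 0);

	printf("energy_diff ~ %ld (negative: the move saves energy)\n",
	       after - before);
	return 0;
}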
Let available compute capacity and estimated energy impact select the
wake-up target cpu when energy-aware scheduling is enabled.
energy_aware_wake_cpu() attempts to find a group of cpus with sufficient
compute capacity to accommodate the task and, within that group, a cpu
with enough spare capacity to handle the task. Preference is given to
cpus with enough spare capacity at the current OPP. Finally, the energy
impact of the new target and of the previous task cpu is compared to
select the wake-up target cpu.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 90 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b371f32..8713310 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5091,6 +5091,92 @@ static int select_idle_sibling(struct task_struct *p, int target)
done:
return target;
}
+
+static unsigned long group_max_capacity(struct sched_group *sg)
+{
+ int max_idx;
+
+ if (!sg->sge)
+ return 0;
+
+ max_idx = sg->sge->nr_cap_states-1;
+
+ return sg->sge->cap_states[max_idx].cap;
+}
+
+static inline unsigned long task_utilization(struct task_struct *p)
+{
+ return p->se.avg.utilization_avg_contrib;
+}
+
+static int cpu_overutilized(int cpu, struct sched_domain *sd)
+{
+ return (capacity_orig_of(cpu) * 100) <
+ (get_cpu_usage(cpu) * sd->imbalance_pct);
+}
+
+static int energy_aware_wake_cpu(struct task_struct *p)
+{
+ struct sched_domain *sd;
+ struct sched_group *sg, *sg_target;
+ int target_max_cap = SCHED_CAPACITY_SCALE;
+ int target_cpu = task_cpu(p);
+ int i;
+
+ sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
+
+ if (!sd)
+ return -1;
+
+ sg = sd->groups;
+ sg_target = sg;
+ /* Find group with sufficient capacity */
+ do {
+ int sg_max_capacity = group_max_capacity(sg);
+
+ if (sg_max_capacity >= task_utilization(p) &&
+ sg_max_capacity <= target_max_cap) {
+ sg_target = sg;
+ target_max_cap = sg_max_capacity;
+ }
+ } while (sg = sg->next, sg != sd->groups);
+
+ /* Find cpu with sufficient capacity */
+ for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
+ int new_usage = get_cpu_usage(i) + task_utilization(p);
+
+ if (new_usage > capacity_orig_of(i))
+ continue;
+
+ if (new_usage < capacity_curr_of(i)) {
+ target_cpu = i;
+ if (!cpu_rq(i)->nr_running)
+ break;
+ }
+
+ /* cpu has capacity at higher OPP, keep it as fallback */
+ if (target_cpu == task_cpu(p))
+ target_cpu = i;
+ }
+
+ if (target_cpu != task_cpu(p)) {
+ struct energy_env eenv = {
+ .usage_delta = task_utilization(p),
+ .src_cpu = task_cpu(p),
+ .dst_cpu = target_cpu,
+ };
+
+ /* Not enough spare capacity on previous cpu */
+ if (cpu_overutilized(task_cpu(p), sd))
+ return target_cpu;
+
+ if (energy_diff(&eenv) >= 0)
+ return task_cpu(p);
+ }
+
+ return target_cpu;
+}
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5138,6 +5224,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
prev_cpu = cpu;
if (sd_flag & SD_BALANCE_WAKE) {
+ if (energy_aware()) {
+ new_cpu = energy_aware_wake_cpu(p);
+ goto unlock;
+ }
new_cpu = select_idle_sibling(p, prev_cpu);
goto unlock;
}
--
1.9.1
Make wake-ups of new tasks (find_idlest_group) aware of any differences
in cpu compute capacity so new tasks don't get handed off to a cpu with
lower capacity.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8713310..4251e75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4950,6 +4950,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
{
struct sched_group *idlest = NULL, *group = sd->groups;
unsigned long min_load = ULONG_MAX, this_load = 0;
+ unsigned long this_cpu_cap = 0, idlest_cpu_cap = 0;
int load_idx = sd->forkexec_idx;
int imbalance = 100 + (sd->imbalance_pct-100)/2;
@@ -4957,7 +4958,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
load_idx = sd->wake_idx;
do {
- unsigned long load, avg_load;
+ unsigned long load, avg_load, cpu_capacity;
int local_group;
int i;
@@ -4971,6 +4972,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
/* Tally up the load of all CPUs in the group */
avg_load = 0;
+ cpu_capacity = 0;
for_each_cpu(i, sched_group_cpus(group)) {
/* Bias balancing toward cpus of our domain */
@@ -4980,6 +4982,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
load = target_load(i, load_idx);
avg_load += load;
+
+ if (cpu_capacity < capacity_of(i))
+ cpu_capacity = capacity_of(i);
}
/* Adjust by relative CPU capacity of the group */
@@ -4987,14 +4992,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
if (local_group) {
this_load = avg_load;
+ this_cpu_cap = cpu_capacity;
} else if (avg_load < min_load) {
min_load = avg_load;
idlest = group;
+ idlest_cpu_cap = cpu_capacity;
}
} while (group = group->next, group != sd->groups);
- if (!idlest || 100*this_load < imbalance*min_load)
+ if (!idlest)
+ return NULL;
+
+ if (100*this_load < imbalance*min_load && this_cpu_cap >= idlest_cpu_cap)
return NULL;
+
return idlest;
}
--
1.9.1
The idle-state of each cpu is currently pointed to by rq->idle_state, but
there isn't any information in struct cpuidle_state that can be used to
look up the idle-state energy model data stored in struct
sched_group_energy. For this purpose it is necessary to store the idle
state index as well. Ideally, the idle-state data should be unified.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/idle.c | 2 ++
kernel/sched/sched.h | 21 +++++++++++++++++++++
2 files changed, 23 insertions(+)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c47fce7..e46c85c 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -149,6 +149,7 @@ static void cpuidle_idle_call(void)
/* Take note of the planned idle state. */
idle_set_state(this_rq(), &drv->states[next_state]);
+ idle_set_state_idx(this_rq(), next_state);
/*
* Enter the idle state previously returned by the governor decision.
@@ -159,6 +160,7 @@ static void cpuidle_idle_call(void)
/* The cpu is no longer idle or about to enter idle. */
idle_set_state(this_rq(), NULL);
+ idle_set_state_idx(this_rq(), -1);
if (broadcast)
clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dedf0ec..107f478 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -678,6 +678,7 @@ struct rq {
#ifdef CONFIG_CPU_IDLE
/* Must be inspected within a rcu lock section */
struct cpuidle_state *idle_state;
+ int idle_state_idx;
#endif
};
@@ -1274,6 +1275,17 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
WARN_ON(!rcu_read_lock_held());
return rq->idle_state;
}
+
+static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)
+{
+ rq->idle_state_idx = idle_state_idx;
+}
+
+static inline int idle_get_state_idx(struct rq *rq)
+{
+ WARN_ON(!rcu_read_lock_held());
+ return rq->idle_state_idx;
+}
#else
static inline void idle_set_state(struct rq *rq,
struct cpuidle_state *idle_state)
@@ -1284,6 +1296,15 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
{
return NULL;
}
+
+static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)
+{
+}
+
+static inline int idle_get_state_idx(struct rq *rq)
+{
+ return -1;
+}
#endif
extern void sysrq_sched_debug_show(void);
--
1.9.1
cpuidle associates all idle-states with each cpu, while the energy model
associates them with the sched_group covering the cpus coordinating
entry to the idle-state. To get the idle-state power consumption it is
therefore necessary to translate from the cpuidle idle-state index to the
energy model index. For this purpose it is helpful to know how many
idle-states are listed in lower-level sched_groups (in struct
sched_group_energy).
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 12 ++++++++++++
2 files changed, 13 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78b6eb7..f984b4e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -958,6 +958,7 @@ struct sched_group_energy {
atomic_t ref;
unsigned int nr_idle_states; /* number of idle states */
struct idle_state *idle_states; /* ptr to idle state array */
+ unsigned int idle_states_below; /* Number idle states in lower groups */
unsigned int nr_cap_states; /* number of capacity states */
struct capacity_state *cap_states; /* ptr to capacity state array */
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e47febf..4f52c2e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6075,6 +6075,7 @@ static void init_sched_energy(int cpu, struct sched_domain *sd,
struct sched_group_energy *energy = sg->sge;
sched_domain_energy_f fn = tl->energy;
struct cpumask *mask = sched_group_cpus(sg);
+ int idle_states_below = 0;
if (!fn || !fn(cpu))
return;
@@ -6082,9 +6083,20 @@ static void init_sched_energy(int cpu, struct sched_domain *sd,
if (cpumask_weight(mask) > 1)
check_sched_energy_data(cpu, fn, mask);
+ /* Figure out the number of true cpuidle states below current group */
+ sd = sd->child;
+ for_each_lower_domain(sd) {
+ idle_states_below += sd->groups->sge->nr_idle_states;
+
+ /* Disregard non-cpuidle 'active' idle states */
+ if (sd->child)
+ idle_states_below--;
+ }
+
energy->nr_idle_states = fn(cpu)->nr_idle_states;
memcpy(energy->idle_states, fn(cpu)->idle_states,
energy->nr_idle_states*sizeof(struct idle_state));
+ energy->idle_states_below = idle_states_below;
energy->nr_cap_states = fn(cpu)->nr_cap_states;
memcpy(energy->cap_states, fn(cpu)->cap_states,
energy->nr_cap_states*sizeof(struct capacity_state));
--
1.9.1
To estimate the energy consumption of a sched_group in
sched_group_energy() it is necessary to know which idle-state the group
is in when it is idle. For now, it is assumed that this is the current
idle-state (though it might be wrong). Based on the individual cpu
idle-states, group_idle_state() finds the group idle-state.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 32 ++++++++++++++++++++++++++++----
1 file changed, 28 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4251e75..0e95eb5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4719,6 +4719,28 @@ static int find_new_capacity(struct energy_env *eenv,
return idx;
}
+static int group_idle_state(struct sched_group *sg)
+{
+ struct sched_group_energy *sge = sg->sge;
+ int shallowest_state = sge->idle_states_below + sge->nr_idle_states;
+ int i;
+
+ for_each_cpu(i, sched_group_cpus(sg)) {
+ int cpuidle_idx = idle_get_state_idx(cpu_rq(i));
+ int group_idx = cpuidle_idx - sge->idle_states_below + 1;
+
+ if (group_idx <= 0)
+ return 0;
+
+ shallowest_state = min(shallowest_state, group_idx);
+ }
+
+ if (shallowest_state >= sge->nr_idle_states)
+ return sge->nr_idle_states - 1;
+
+ return shallowest_state;
+}
+
/*
* sched_group_energy(): Returns absolute energy consumption of cpus belonging
* to the sched_group including shared resources shared only by members of the
@@ -4762,7 +4784,7 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
do {
unsigned group_util;
int sg_busy_energy, sg_idle_energy;
- int cap_idx;
+ int cap_idx, idle_idx;
if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
eenv->sg_cap = sg_shared_cap;
@@ -4770,11 +4792,13 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
eenv->sg_cap = sg;
cap_idx = find_new_capacity(eenv, sg->sge);
+ idle_idx = group_idle_state(sg);
group_util = group_norm_usage(eenv, sg);
sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
- >> SCHED_CAPACITY_SHIFT;
- sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
- >> SCHED_CAPACITY_SHIFT;
+ >> SCHED_CAPACITY_SHIFT;
+ sg_idle_energy = ((SCHED_LOAD_SCALE-group_util)
+ * sg->sge->idle_states[idle_idx].power)
+ >> SCHED_CAPACITY_SHIFT;
total_energy += sg_busy_energy + sg_idle_energy;
--
1.9.1
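A small standalone sketch of the index translation performed by
group_idle_state() above, assuming the TC2-like layout used earlier in the
series (one WFI state listed at core level, so idle_states_below = 1 at
cluster level); the mapping shown is illustrative only.

#include <stdio.h>

/*
 * Translate a per-cpu cpuidle index into the group's energy model index,
 * as in group_idle_state(): subtract the number of idle-states listed at
 * lower levels; a result <= 0 is clamped to 0, the shallowest state
 * listed at this level.
 */
static int group_idle_idx(int cpuidle_idx, int idle_states_below)
{
	int idx = cpuidle_idx - idle_states_below + 1;

	return idx <= 0 ? 0 : idx;
}

int main(void)
{
	/*
	 * TC2-like layout: cpuidle idx 0 is WFI (listed at core level, so
	 * idle_states_below = 1 for the cluster), idx 1 is cluster-sleep.
	 */
	printf("cpuidle WFI           -> cluster idx %d\n", group_idle_idx(0, 1));
	printf("cpuidle cluster-sleep -> cluster idx %d\n", group_idle_idx(1, 1));
	return 0;
}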
From: Dietmar Eggemann <[email protected]>
Energy-aware load balancing should only happen if the ENERGY_AWARE feature
is turned on and the sched domain on which the load balancing is performed
contains energy data.
There is also a need, during a load balance action, to be able to query
whether we should continue to load balance energy-aware or whether we have
reached the tipping point which forces us to fall back to the conventional
load balancing functionality.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0e95eb5..45c784f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5836,6 +5836,7 @@ struct lb_env {
enum fbq_type fbq_type;
struct list_head tasks;
+ bool use_ea;
};
/*
@@ -7348,6 +7349,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.cpus = cpus,
.fbq_type = all,
.tasks = LIST_HEAD_INIT(env.tasks),
+ .use_ea = (energy_aware() && sd->groups->sge) ? true : false,
};
/*
--
1.9.1
From: Dietmar Eggemann <[email protected]>
To be able to identify the least efficient (costliest) sched group,
introduce group_eff, the efficiency of the sched group, into sg_lb_stats.
The group efficiency is defined as the ratio between the group usage and
the group energy consumption.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 45c784f..bfa335e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6345,6 +6345,7 @@ struct sg_lb_stats {
unsigned long load_per_task;
unsigned long group_capacity;
unsigned long group_usage; /* Total usage of the group */
+ unsigned long group_eff;
unsigned int sum_nr_running; /* Nr tasks running in the group */
unsigned int idle_cpus;
unsigned int group_weight;
@@ -6715,6 +6716,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_no_capacity = group_is_overloaded(env, sgs);
sgs->group_type = group_classify(env, group, sgs);
+
+ if (env->use_ea) {
+ struct energy_env eenv = {
+ .sg_top = group,
+ .usage_delta = 0,
+ .src_cpu = -1,
+ .dst_cpu = -1,
+ };
+ unsigned long group_energy = sched_group_energy(&eenv);
+
+ if (group_energy)
+ sgs->group_eff = 1024*sgs->group_usage/group_energy;
+ else
+ sgs->group_eff = ULONG_MAX;
+ }
}
/**
--
1.9.1
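A minimal standalone sketch of the efficiency metric introduced above
(usage per unit of estimated energy, scaled by 1024); the usage and energy
numbers are hypothetical.

#include <stdio.h>
#include <limits.h>

/*
 * Efficiency of a group as introduced above: usage per unit of estimated
 * energy, scaled by 1024. A lower value means less useful work per
 * bogo-Watt, i.e. a better candidate to unload.
 */
static unsigned long group_eff(unsigned long group_usage,
			       unsigned long group_energy)
{
	return group_energy ? 1024UL * group_usage / group_energy : ULONG_MAX;
}

int main(void)
{
	/* Hypothetical big vs. little group carrying the same usage. */
	printf("big:    group_eff=%lu\n", group_eff(400, 2500));
	printf("little: group_eff=%lu\n", group_eff(400, 600));
	return 0;
}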
From: Dietmar Eggemann <[email protected]>
Energy-aware load balancing has to work alongside the conventional load
based functionality. This includes the tipping point feature, i.e. being
able to fall back from energy aware to the conventional load based
functionality during an ongoing load balancing action.
That is why this patch introduces an additional reference to hold the
least efficient sched group (costliest) as well as its statistics in the
form of an extra sg_lb_stats structure (costliest_stat).
The function update_sd_pick_costliest is used to assign the least
efficient sched group in parallel with the existing update_sd_pick_busiest.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfa335e..36f3c77 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6363,12 +6363,14 @@ struct sg_lb_stats {
*/
struct sd_lb_stats {
struct sched_group *busiest; /* Busiest group in this sd */
+ struct sched_group *costliest; /* Least efficient group in this sd */
struct sched_group *local; /* Local group in this sd */
unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_capacity; /* Total capacity of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */
struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
+ struct sg_lb_stats costliest_stat;/* Statistics of the least efficient group */
struct sg_lb_stats local_stat; /* Statistics of the local group */
};
@@ -6390,6 +6392,9 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
.sum_nr_running = 0,
.group_type = group_other,
},
+ .costliest_stat = {
+ .group_eff = ULONG_MAX,
+ },
};
}
@@ -6782,6 +6787,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
return false;
}
+static noinline bool update_sd_pick_costliest(struct sd_lb_stats *sds,
+ struct sg_lb_stats *sgs)
+{
+ struct sg_lb_stats *costliest = &sds->costliest_stat;
+
+ if (sgs->group_eff < costliest->group_eff)
+ return true;
+
+ return false;
+}
+
#ifdef CONFIG_NUMA_BALANCING
static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs)
{
@@ -6872,6 +6888,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
sds->busiest_stat = *sgs;
}
+ if (env->use_ea && update_sd_pick_costliest(sds, sgs)) {
+ sds->costliest = sg;
+ sds->costliest_stat = *sgs;
+ }
+
next_group:
/* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
--
1.9.1
From: Dietmar Eggemann <[email protected]>
If, after gathering the sched domain statistics, the current load
balancing operation is still in energy-aware mode, just return the least
efficient (costliest) reference. That implies the system is considered
to be balanced in case no least efficient sched group was found.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 36f3c77..199ffff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7133,6 +7133,9 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
local = &sds.local_stat;
busiest = &sds.busiest_stat;
+ if (env->use_ea)
+ return sds.costliest;
+
/* ASYM feature bypasses nice load balance check */
if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
check_asym_packing(env, &sds))
--
1.9.1
From: Dietmar Eggemann <[email protected]>
If, after gathering the sched domain statistics, the current load
balancing operation is still in energy-aware mode and a least efficient
sched group has been found, detect the least efficient cpu by comparing
the cpu efficiency (the ratio between cpu usage and cpu energy
consumption) among all cpus of the least efficient sched group.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 199ffff..48cd5b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7216,6 +7216,37 @@ static struct rq *find_busiest_queue(struct lb_env *env,
unsigned long busiest_load = 0, busiest_capacity = 1;
int i;
+ if (env->use_ea) {
+ struct rq *costliest = NULL;
+ unsigned long costliest_usage = 1024, costliest_energy = 1;
+
+ for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
+ unsigned long usage = get_cpu_usage(i);
+ struct rq *rq = cpu_rq(i);
+ struct sched_domain *sd = rcu_dereference(rq->sd);
+ struct energy_env eenv = {
+ .sg_top = sd->groups,
+ .usage_delta = 0,
+ .src_cpu = -1,
+ .dst_cpu = -1,
+ };
+ unsigned long energy = sched_group_energy(&eenv);
+
+ /*
+ * We're looking for the minimal cpu efficiency
+ * min(u_i / e_i), crosswise multiplication leads to
+ * u_i * e_j < u_j * e_i with j as previous minimum.
+ */
+ if (usage * costliest_energy < costliest_usage * energy) {
+ costliest_usage = usage;
+ costliest_energy = energy;
+ costliest = rq;
+ }
+ }
+
+ return costliest;
+ }
+
for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
unsigned long capacity, wl;
enum fbq_type rt;
--
1.9.1
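A standalone sketch of the cross-multiplication used above to find the
minimal u_i/e_i without integer division; the candidate usage/energy pairs
are hypothetical.

#include <stdio.h>

/*
 * Finding the cpu with minimal efficiency u/e without integer division:
 * u_i * e_min < u_min * e_i is equivalent to u_i/e_i < u_min/e_min, which
 * is the comparison used in find_busiest_queue() above.
 */
struct cand {
	unsigned long usage;
	unsigned long energy;
};

int main(void)
{
	struct cand cpus[] = { { 300, 900 }, { 200, 1200 }, { 500, 1000 } };
	unsigned long min_usage = 1024, min_energy = 1;
	int i, costliest = -1;

	for (i = 0; i < 3; i++) {
		if (cpus[i].usage * min_energy < min_usage * cpus[i].energy) {
			min_usage = cpus[i].usage;
			min_energy = cpus[i].energy;
			costliest = i;
		}
	}

	/* cpu 1 has the lowest usage/energy ratio (200/1200). */
	printf("costliest cpu index: %d\n", costliest);
	return 0;
}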
From: Dietmar Eggemann <[email protected]>
Energy-aware load balancing does not rely on env->imbalance; instead it
evaluates the system-wide energy difference of potentially moving each
task on the src rq to the dst rq. If this energy difference is less than
zero the task is actually moved from the src to the dst rq.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48cd5b5..6b79603 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6095,12 +6095,12 @@ static int detach_tasks(struct lb_env *env)
{
struct list_head *tasks = &env->src_rq->cfs_tasks;
struct task_struct *p;
- unsigned long load;
+ unsigned long load = 0;
int detached = 0;
lockdep_assert_held(&env->src_rq->lock);
- if (env->imbalance <= 0)
+ if (!env->use_ea && env->imbalance <= 0)
return 0;
while (!list_empty(tasks)) {
@@ -6121,6 +6121,20 @@ static int detach_tasks(struct lb_env *env)
if (!can_migrate_task(p, env))
goto next;
+ if (env->use_ea) {
+ struct energy_env eenv = {
+ .src_cpu = env->src_cpu,
+ .dst_cpu = env->dst_cpu,
+ .usage_delta = task_utilization(p),
+ };
+ int e_diff = energy_diff(&eenv);
+
+ if (e_diff >= 0)
+ goto next;
+
+ goto detach;
+ }
+
load = task_h_load(p);
if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
@@ -6129,6 +6143,7 @@ static int detach_tasks(struct lb_env *env)
if ((load / 2) > env->imbalance)
goto next;
+detach:
detach_task(p, env);
list_add(&p->se.group_node, &env->tasks);
@@ -6149,7 +6164,7 @@ static int detach_tasks(struct lb_env *env)
* We only want to steal up to the prescribed amount of
* weighted load.
*/
- if (env->imbalance <= 0)
+ if (!env->use_ea && env->imbalance <= 0)
break;
continue;
--
1.9.1
From: Dietmar Eggemann <[email protected]>
Energy-aware load balancing is based on cpu usage, so the upper bound of
its operational range is a fully utilized cpu. Above this tipping point it
makes more sense to use weighted_cpuload to preserve smp_nice.
This patch implements the tipping point detection in update_sg_lb_stats:
if one cpu is over-utilized, the current energy-aware load balance
operation falls back to the conventional weighted-load based one.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b79603..4849bad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6723,6 +6723,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->sum_weighted_load += weighted_cpuload(i);
if (idle_cpu(i))
sgs->idle_cpus++;
+
+ /* If cpu is over-utilized, bail out of ea */
+ if (env->use_ea && cpu_overutilized(i, env->sd))
+ env->use_ea = false;
}
/* Adjust by relative CPU capacity of the group */
--
1.9.1
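A standalone sketch of the tipping-point test (cpu_overutilized()) this
patch relies on: the cpu counts as over-utilized when its usage, inflated
by the sched domain's imbalance_pct (e.g. 125, i.e. a 25% margin), exceeds
its original capacity; the capacity and usage numbers below are
hypothetical.

#include <stdio.h>

/*
 * Tipping-point test as in cpu_overutilized(): over-utilized when usage
 * scaled by imbalance_pct exceeds the original capacity scaled by 100.
 */
static int cpu_overutilized(unsigned long capacity_orig, unsigned long usage,
			    unsigned int imbalance_pct)
{
	return capacity_orig * 100 < usage * imbalance_pct;
}

int main(void)
{
	/* Hypothetical little cpu (capacity 430) with imbalance_pct 125. */
	printf("usage 300 over-utilized: %d\n", cpu_overutilized(430, 300, 125));
	printf("usage 360 over-utilized: %d\n", cpu_overutilized(430, 360, 125));
	return 0;
}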
From: Dietmar Eggemann <[email protected]>
Skip a cpu as a potential src (costliest) in case it has only one task
running and its original capacity is greater than or equal to the
original capacity of the dst cpu.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4849bad..b6e2e92 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7251,6 +7251,10 @@ static struct rq *find_busiest_queue(struct lb_env *env,
};
unsigned long energy = sched_group_energy(&eenv);
+ if (rq->nr_running == 1 && capacity_orig_of(i) >=
+ capacity_orig_of(env->dst_cpu))
+ continue;
+
/*
* We're looking for the minimal cpu efficiency
* min(u_i / e_i), crosswise multiplication leads to
--
1.9.1
From: Dietmar Eggemann <[email protected]>
We do not want to miss out on the ability to do energy-aware idle load
balancing if the system is only partially loaded since the operational
range of energy-aware scheduling corresponds to a partially loaded
system. We might want to pull a single remaining task from a potential
src cpu towards an idle destination cpu if the energy model tells us
this is worth doing to save energy.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b6e2e92..92fd1d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7734,7 +7734,7 @@ static int idle_balance(struct rq *this_rq)
this_rq->idle_stamp = rq_clock(this_rq);
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !this_rq->rd->overload) {
+ (!energy_aware() && !this_rq->rd->overload)) {
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
--
1.9.1
Add an extra criteria to need_active_balance() to kick off active load
balance if the source cpu is overutilized and has lower capacity than
the destination cpus.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 92fd1d8..1c248f8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7379,6 +7379,13 @@ static int need_active_balance(struct lb_env *env)
return 1;
}
+ if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&
+ env->src_rq->cfs.h_nr_running == 1 &&
+ cpu_overutilized(env->src_cpu, env->sd) &&
+ !cpu_overutilized(env->dst_cpu, env->sd)) {
+ return 1;
+ }
+
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}
--
1.9.1
With energy-aware scheduling enabled nohz_kick_needed() generates many
nohz idle-balance kicks which lead to nothing when multiple tasks get
packed on a single cpu to save energy. This causes unnecessary wake-ups
and hence wastes energy. Make these conditions depend on !energy_aware()
for now until the energy-aware nohz story gets sorted out.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c248f8..cfe65ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8195,6 +8195,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
}
+static int cpu_overutilized(int cpu, struct sched_domain *sd);
+
/*
* Current heuristic for kicking the idle load balancer in the presence
* of an idle cpu in the system.
@@ -8234,12 +8236,13 @@ static inline bool nohz_kick_needed(struct rq *rq)
if (time_before(now, nohz.next_balance))
return false;
- if (rq->nr_running >= 2)
+ sd = rcu_dereference(rq->sd);
+ if (rq->nr_running >= 2 && (!energy_aware() || cpu_overutilized(cpu, sd)))
return true;
rcu_read_lock();
sd = rcu_dereference(per_cpu(sd_busy, cpu));
- if (sd) {
+ if (sd && !energy_aware()) {
sgc = sd->groups->sgc;
nr_busy = atomic_read(&sgc->nr_busy_cpus);
--
1.9.1
On 02/05/2015 12:00 AM, Morten Rasmussen wrote:
> From: Vincent Guittot <[email protected]>
>
> Add new statistics which reflect the average time a task is running on the CPU
> and the sum of these running time of the tasks on a runqueue. The latter is
> named utilization_load_avg.
>
> This patch is based on the usage metric that was proposed in the 1st
> versions of the per-entity load tracking patchset by Paul Turner
> <[email protected]> but that has be removed afterwards. This version differs from
> the original one in the sense that it's not linked to task_group.
>
> The rq's utilization_load_avg will be used to check if a rq is overloaded or
> not instead of trying to compute how many tasks a group of CPUs can handle.
>
> Rename runnable_avg_period into avg_period as it is now used with both
> runnable_avg_sum and running_avg_sum
>
> Add some descriptions of the variables to explain their differences
>
> cc: Paul Turner <[email protected]>
> cc: Ben Segall <[email protected]>
>
> Signed-off-by: Vincent Guittot <[email protected]>
> Acked-by: Morten Rasmussen <[email protected]>
> ---
> include/linux/sched.h | 21 ++++++++++++---
> kernel/sched/debug.c | 10 ++++---
> kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++-----------
> kernel/sched/sched.h | 8 +++++-
> 4 files changed, 89 insertions(+), 24 deletions(-)
> +static inline void __update_task_entity_utilization(struct sched_entity *se)
> +{
> + u32 contrib;
> +
> + /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
> + contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
> + contrib /= (se->avg.avg_period + 1);
> + se->avg.utilization_avg_contrib = scale_load(contrib);
> +}
> +
> +static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
> +{
> + long old_contrib = se->avg.utilization_avg_contrib;
> +
> + if (entity_is_task(se))
> + __update_task_entity_utilization(se);
> +
> + return se->avg.utilization_avg_contrib - old_contrib;
When the entity is not a task, shouldn't the utilization_avg_contrib be
updated like this :
se->avg.utilization_avg_contrib = group_cfs_rq(se)->utilization_load_avg
? and then the delta with old_contrib be passed ?
Or is this being updated someplace that I have missed ?
Regards
Preeti U Murthy
On 11 February 2015 at 09:50, Preeti U Murthy <[email protected]> wrote:
> On 02/05/2015 12:00 AM, Morten Rasmussen wrote:
>> From: Vincent Guittot <[email protected]>
>>
>> Add new statistics which reflect the average time a task is running on the CPU
>> and the sum of these running time of the tasks on a runqueue. The latter is
>> named utilization_load_avg.
>>
>> This patch is based on the usage metric that was proposed in the 1st
>> versions of the per-entity load tracking patchset by Paul Turner
>> <[email protected]> but that has be removed afterwards. This version differs from
>> the original one in the sense that it's not linked to task_group.
>>
>> The rq's utilization_load_avg will be used to check if a rq is overloaded or
>> not instead of trying to compute how many tasks a group of CPUs can handle.
>>
>> Rename runnable_avg_period into avg_period as it is now used with both
>> runnable_avg_sum and running_avg_sum
>>
>> Add some descriptions of the variables to explain their differences
>>
>> cc: Paul Turner <[email protected]>
>> cc: Ben Segall <[email protected]>
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> Acked-by: Morten Rasmussen <[email protected]>
>> ---
>> include/linux/sched.h | 21 ++++++++++++---
>> kernel/sched/debug.c | 10 ++++---
>> kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++-----------
>> kernel/sched/sched.h | 8 +++++-
>> 4 files changed, 89 insertions(+), 24 deletions(-)
>
>> +static inline void __update_task_entity_utilization(struct sched_entity *se)
>> +{
>> + u32 contrib;
>> +
>> + /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
>> + contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
>> + contrib /= (se->avg.avg_period + 1);
>> + se->avg.utilization_avg_contrib = scale_load(contrib);
>> +}
>> +
>> +static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
>> +{
>> + long old_contrib = se->avg.utilization_avg_contrib;
>> +
>> + if (entity_is_task(se))
>> + __update_task_entity_utilization(se);
>> +
>> + return se->avg.utilization_avg_contrib - old_contrib;
>
> When the entity is not a task, shouldn't the utilization_avg_contrib be
> updated like this :
>
> se->avg.utilization_avg_contrib = group_cfs_rq(se)->utilization_load_avg
> ? and then the delta with old_contrib be passed ?
>
> Or is this being updated someplace that I have missed ?
Patch 02 handles the contribution of entities which are not tasks.
Regards,
Vincent
>
> Regards
> Preeti U Murthy
>
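For completeness: patch 02 ('sched: Track group sched_entity usage
contributions') extends the helper quoted above for non-task entities,
along the lines Preeti suggests. A sketch based on the names in the
quoted patch (the actual patch may differ in detail):

static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
{
	long old_contrib = se->avg.utilization_avg_contrib;

	if (entity_is_task(se))
		__update_task_entity_utilization(se);
	else
		/* group entity: mirror the usage of the cfs_rq it owns */
		se->avg.utilization_avg_contrib =
				group_cfs_rq(se)->utilization_load_avg;

	return se->avg.utilization_avg_contrib - old_contrib;
}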
Hi Morten,
On 04/02/15 18:31, Morten Rasmussen wrote:
> With energy-aware scheduling enabled nohz_kick_needed() generates many
> nohz idle-balance kicks which lead to nothing when multiple tasks get
> packed on a single cpu to save energy. This causes unnecessary wake-ups
> and hence wastes energy. Make these conditions depend on !energy_aware()
> for now until the energy-aware nohz story gets sorted out.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1c248f8..cfe65ae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8195,6 +8195,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
> }
>
> +static int cpu_overutilized(int cpu, struct sched_domain *sd);
> +
> /*
> * Current heuristic for kicking the idle load balancer in the presence
> * of an idle cpu in the system.
> @@ -8234,12 +8236,13 @@ static inline bool nohz_kick_needed(struct rq *rq)
> if (time_before(now, nohz.next_balance))
> return false;
>
> - if (rq->nr_running >= 2)
> + sd = rcu_dereference(rq->sd);
> + if (rq->nr_running >= 2 && (!energy_aware() || cpu_overutilized(cpu, sd)))
> return true;
CONFIG_PROVE_RCU checking revealed this one:
[ 3.814454] ===============================
[ 3.826989] [ INFO: suspicious RCU usage. ]
[ 3.839526] 3.19.0-rc7+ #10 Not tainted
[ 3.851018] -------------------------------
[ 3.863554] kernel/sched/fair.c:8239 suspicious
rcu_dereference_check() usage!
[ 3.885216]
[ 3.885216] other info that might help us debug this:
[ 3.885216]
[ 3.909236]
[ 3.909236] rcu_scheduler_active = 1, debug_locks = 1
[ 3.928817] no locks held by kthreadd/437.
The RCU read-side critical section has to be extended to incorporate
this sd = rcu_dereference(rq->sd):
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cfe65aec3237..145360ee6e4a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8236,11 +8236,13 @@ static inline bool nohz_kick_needed(struct rq *rq)
if (time_before(now, nohz.next_balance))
return false;
+ rcu_read_lock();
sd = rcu_dereference(rq->sd);
- if (rq->nr_running >= 2 && (!energy_aware() || cpu_overutilized(cpu, sd)))
- return true;
+ if (rq->nr_running >= 2 && (!energy_aware() || cpu_overutilized(cpu, sd))) {
+ kick = true;
+ goto unlock;
+ }
- rcu_read_lock();
sd = rcu_dereference(per_cpu(sd_busy, cpu));
if (sd && !energy_aware()) {
sgc = sd->groups->sgc;
-- Dietmar
>
> rcu_read_lock();
> sd = rcu_dereference(per_cpu(sd_busy, cpu));
> - if (sd) {
> + if (sd && !energy_aware()) {
> sgc = sd->groups->sgc;
> nr_busy = atomic_read(&sgc->nr_busy_cpus);
>
>
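Putting Dietmar's fix together: the function also needs a local 'kick'
flag and an 'unlock' label so every path leaves the RCU read-side
critical section. The resulting shape would be roughly as follows (a
sketch of the intent, not a posted patch):

static inline bool nohz_kick_needed(struct rq *rq)
{
	bool kick = false;
	/* ... existing nohz/next_balance early-exit checks ... */

	rcu_read_lock();
	sd = rcu_dereference(rq->sd);
	if (rq->nr_running >= 2 &&
	    (!energy_aware() || cpu_overutilized(cpu, sd))) {
		kick = true;
		goto unlock;
	}

	sd = rcu_dereference(per_cpu(sd_busy, cpu));
	if (sd && !energy_aware()) {
		/* ... nr_busy_cpus check, may set kick ... */
	}

	/* ... remaining sd_asym checks, may set kick ... */
unlock:
	rcu_read_unlock();
	return kick;
}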
On 4 February 2015 at 19:30, Morten Rasmussen <[email protected]> wrote:
> Several techniques for saving energy through various scheduler
> modifications have been proposed in the past, however most of the
> techniques have not been universally beneficial for all use-cases and
> platforms. For example, consolidating tasks on fewer cpus is an
> effective way to save energy on some platforms, while it might make
> things worse on others.
>
> This proposal, which is inspired by the Ksummit workshop discussions in
> 2013 [1], takes a different approach by using a (relatively) simple
> platform energy cost model to guide scheduling decisions. By providing
> the model with platform specific costing data the model can provide a
> estimate of the energy implications of scheduling decisions. So instead
> of blindly applying scheduling techniques that may or may not work for
> the current use-case, the scheduler can make informed energy-aware
> decisions. We believe this approach provides a methodology that can be
> adapted to any platform, including heterogeneous systems such as ARM
> big.LITTLE. The model considers cpus only, i.e. no peripherals, GPU or
> memory. Model data includes power consumption at each P-state and
> C-state.
>
> This is an RFC and there are some loose ends that have not been
> addressed here or in the code yet. The model and its infrastructure is
> in place in the scheduler and it is being used for load-balancing
> decisions. The energy model data is hardcoded, the load-balancing
> heuristics are still under development, and there are some limitations
> still to be addressed. However, the main idea is presented here, which
> is the use of an energy model for scheduling decisions.
>
> RFCv3 is a consolidation of the latest energy model related patches and
> previously posted patch sets related to capacity and utilization
> tracking [2][3] to show where we are heading. [2] and [3] have been
> rebased onto v3.19-rc7 with a few minor modifications. Large parts of
> the energy model code and use of the energy model in the scheduler has
> been rewritten and simplified. The patch set consists of three main
> parts (more details further down):
>
> Patch 1-11: sched: consolidation of CPU capacity and usage [2] (rebase)
>
> Patch 12-19: sched: frequency and cpu invariant per-entity load-tracking
> and other load-tracking bits [3] (rebase)
>
> Patch 20-48: sched: Energy cost model for energy-aware scheduling (RFCv3)
Hi Morten,
48 patches is a big number and when I look into your patchset, some
features are quite self-contained. IMHO it would be worth splitting it
into smaller patchsets in order to ease the review and the regression
testing.
From a first look at your patchset, I have found:
- patches 11, 13, 14 and 15 are only linked to frequency scaling invariance
- patches 12, 17 and 17 are only about adding cpu scaling invariance
- patches 18 and 19 are about tracking and adding the blocked
  utilization in the CPU usage
- patches 20 to the end are linked to EAS
Regards,
Vincent
>
> Test results for ARM TC2 (2xA15+3xA7) with cpufreq enabled:
>
> sysbench: Single task running for 3 seconds.
> rt-app [4]: 5 medium (~50%) periodic tasks
> rt-app [4]: 2 light (~10%) periodic tasks
>
> Average numbers for 20 runs per test.
>
> Energy sysbench rt-app medium rt-app light
> Mainline 100* 100 100
> EA 279 88 63
>
> * Sensitive to task placement on big.LITTLE. Mainline may put it on
> either cpu due to it's lack of compute capacity awareness, while EA
> consistently puts heavy tasks on big cpus. The EA energy increase came
> with a 2.65x _increase_ in performance (throughput).
>
> [1] http://etherpad.osuosl.org/energy-aware-scheduling-ks-2013 (search
> for 'cost')
> [2] https://lkml.org/lkml/2015/1/15/136
> [3] https://lkml.org/lkml/2014/12/2/328
> [4] https://wiki.linaro.org/WorkingGroups/PowerManagement/Resources/Tools/WorkloadGen
>
> Changes:
>
> RFCv3:
>
> 'sched: Energy cost model for energy-aware scheduling' changes:
> RFCv2->RFCv3:
>
> (1) Remove frequency- and cpu-invariant load/utilization patches since
> this is now provided by [2] and [3].
>
> (2) Remove system-wide sched_energy to make the code easier to
> understand, i.e. single socket systems are not supported (yet).
>
> (3) Remove wake-up energy. Extra complexity that wasn't fully justified.
> Idle-state awareness introduced recently in mainline may be
> sufficient.
>
> (4) Remove procfs interface for energy data to make the patch-set
> smaller.
>
> (5) Rework energy-aware load balancing code.
>
> In RFCv2 we only attempted to pick the source cpu in an energy-aware
> fashion. In addition to support for finding the most energy
> inefficient source CPU during the load-balancing action, RFCv3 also
> introduces the energy-aware based moving of tasks between cpus as
> well as support for managing the 'tipping point' - the threshold
> where we switch away from energy model based load balancing to
> conventional load balancing.
>
> 'sched: frequency and cpu invariant per-entity load-tracking and other
> load-tracking bits' [3]
>
> (1) Remove blocked load from load tracking.
>
> (2) Remove cpu-invariant load tracking.
>
> Both (1) and (2) require changes to the existing load-balance code
> which haven't been done yet. These are therefore left out until that
> has been addressed.
>
> (3) One patch renamed.
>
> 'sched: consolidation of CPU capacity and usage' [2]
>
> (1) Fixed conflict when rebasing to v3.19-rc7.
>
> (2) One patch subject changed slightly.
>
>
> RFC v2:
> - Extended documentation:
> - Cover the energy model in greater detail.
> - Recipe for deriving platform energy model.
> - Replaced Kconfig with sched feature (jump label).
> - Add unweighted load tracking.
> - Use unweighted load as task/cpu utilization.
> - Support for multiple idle states per sched_group. cpuidle integration
> still missing.
> - Changed energy aware functionality in select_idle_sibling().
> - Experimental energy aware load-balance support.
>
>
> Dietmar Eggemann (17):
> sched: Make load tracking frequency scale-invariant
> sched: Make usage tracking cpu scale-invariant
> arm: vexpress: Add CPU clock-frequencies to TC2 device-tree
> arm: Cpu invariant scheduler load-tracking support
> sched: Get rid of scaling usage by cpu_capacity_orig
> sched: Introduce energy data structures
> sched: Allocate and initialize energy data structures
> arm: topology: Define TC2 energy and provide it to the scheduler
> sched: Infrastructure to query if load balancing is energy-aware
> sched: Introduce energy awareness into update_sg_lb_stats
> sched: Introduce energy awareness into update_sd_lb_stats
> sched: Introduce energy awareness into find_busiest_group
> sched: Introduce energy awareness into find_busiest_queue
> sched: Introduce energy awareness into detach_tasks
> sched: Tipping point from energy-aware to conventional load balancing
> sched: Skip cpu as lb src which has one task and capacity gte the dst
> cpu
> sched: Turn off fast idling of cpus on a partially loaded system
>
> Morten Rasmussen (23):
> sched: Track group sched_entity usage contributions
> sched: Make sched entity usage tracking frequency-invariant
> cpufreq: Architecture specific callback for frequency changes
> arm: Frequency invariant scheduler load-tracking support
> sched: Track blocked utilization contributions
> sched: Include blocked utilization in usage tracking
> sched: Documentation for scheduler energy cost model
> sched: Make energy awareness a sched feature
> sched: Introduce SD_SHARE_CAP_STATES sched_domain flag
> sched: Compute cpu capacity available at current frequency
> sched: Relocated get_cpu_usage()
> sched: Use capacity_curr to cap utilization in get_cpu_usage()
> sched: Highest energy aware balancing sched_domain level pointer
> sched: Calculate energy consumption of sched_group
> sched: Extend sched_group_energy to test load-balancing decisions
> sched: Estimate energy impact of scheduling decisions
> sched: Energy-aware wake-up task placement
> sched: Bias new task wakeups towards higher capacity cpus
> sched, cpuidle: Track cpuidle state index in the scheduler
> sched: Count number of shallower idle-states in struct
> sched_group_energy
> sched: Determine the current sched_group idle-state
> sched: Enable active migration for cpus of lower capacity
> sched: Disable energy-unfriendly nohz kicks
>
> Vincent Guittot (8):
> sched: add utilization_avg_contrib
> sched: remove frequency scaling from cpu_capacity
> sched: make scale_rt invariant with frequency
> sched: add per rq cpu_capacity_orig
> sched: get CPU's usage statistic
> sched: replace capacity_factor by usage
> sched: add SD_PREFER_SIBLING for SMT level
> sched: move cfs task on a CPU with higher capacity
>
> Documentation/scheduler/sched-energy.txt | 359 +++++++++++
> arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 5 +
> arch/arm/kernel/topology.c | 218 +++++--
> drivers/cpufreq/cpufreq.c | 10 +-
> include/linux/sched.h | 43 +-
> kernel/sched/core.c | 119 +++-
> kernel/sched/debug.c | 12 +-
> kernel/sched/fair.c | 935 ++++++++++++++++++++++++-----
> kernel/sched/features.h | 6 +
> kernel/sched/idle.c | 2 +
> kernel/sched/sched.h | 75 ++-
> 11 files changed, 1559 insertions(+), 225 deletions(-)
> create mode 100644 Documentation/scheduler/sched-energy.txt
>
> --
> 1.9.1
>
Hi Vincent,
On Thu, Apr 02, 2015 at 01:43:31PM +0100, Vincent Guittot wrote:
> On 4 February 2015 at 19:30, Morten Rasmussen <[email protected]> wrote:
> > RFCv3 is a consolidation of the latest energy model related patches and
> > previously posted patch sets related to capacity and utilization
> > tracking [2][3] to show where we are heading. [2] and [3] have been
> > rebased onto v3.19-rc7 with a few minor modifications. Large parts of
> > the energy model code and use of the energy model in the scheduler has
> > been rewritten and simplified. The patch set consists of three main
> > parts (more details further down):
> >
> > Patch 1-11: sched: consolidation of CPU capacity and usage [2] (rebase)
> >
> > Patch 12-19: sched: frequency and cpu invariant per-entity load-tracking
> > and other load-tracking bits [3] (rebase)
> >
> > Patch 20-48: sched: Energy cost model for energy-aware scheduling (RFCv3)
>
>
> Hi Morten,
>
> 48 patches is a big number of patches and when i look into your
> patchset, some feature are quite self contained. IMHO it would be
> worth splitting it in smaller patchsets in order to ease the review
> and the regression test.
> From a 1st look at your patchset , i have found
> -patches 11,13,14 and 15 are only linked to frequency scaling invariance
> -patches 12, 17 and 17 are only about adding cpu scaling invariance
> -patches 18 and 19 are about tracking and adding the blocked
> utilization in the CPU usage
> -patches 20 to the end is linked the EAS
I agree it makes sense to regroup the patches as you suggest. A better
logical ordering should make the reviewing a less daunting task. I'm a
bit hesitant to float many small sets of patches as their role in the
bigger picture would be less clear and hence risk losing the 'why'.
IMHO, it should be as easy (if not easier) to review and pick patches in
a larger set as it is for multiple smaller sets. However, I guess that
is individual, and for automated testing it would be easier to have them
split out.
How about focusing on one (or two) of these smaller patch sets at a
time to minimize the potential confusion and post them separately?
I would still include them in updated mega-postings that include all
the dependencies so the full story would still be available for those
who are interested. I would of course make it clear which patches are
also posted separately.
Thanks,
Morten
On 8 April 2015 at 15:33, Morten Rasmussen <[email protected]> wrote:
> Hi Vincent,
>
> On Thu, Apr 02, 2015 at 01:43:31PM +0100, Vincent Guittot wrote:
>> On 4 February 2015 at 19:30, Morten Rasmussen <[email protected]> wrote:
>> > RFCv3 is a consolidation of the latest energy model related patches and
>> > previously posted patch sets related to capacity and utilization
>> > tracking [2][3] to show where we are heading. [2] and [3] have been
>> > rebased onto v3.19-rc7 with a few minor modifications. Large parts of
>> > the energy model code and use of the energy model in the scheduler has
>> > been rewritten and simplified. The patch set consists of three main
>> > parts (more details further down):
>> >
>> > Patch 1-11: sched: consolidation of CPU capacity and usage [2] (rebase)
>> >
>> > Patch 12-19: sched: frequency and cpu invariant per-entity load-tracking
>> > and other load-tracking bits [3] (rebase)
>> >
>> > Patch 20-48: sched: Energy cost model for energy-aware scheduling (RFCv3)
>>
>>
>> Hi Morten,
>>
>> 48 patches is a big number of patches and when i look into your
>> patchset, some feature are quite self contained. IMHO it would be
>> worth splitting it in smaller patchsets in order to ease the review
>> and the regression test.
>> From a 1st look at your patchset , i have found
>> -patches 11,13,14 and 15 are only linked to frequency scaling invariance
>> -patches 12, 17 and 17 are only about adding cpu scaling invariance
>> -patches 18 and 19 are about tracking and adding the blocked
>> utilization in the CPU usage
>> -patches 20 to the end is linked the EAS
>
> I agree it makes sense to regroup the patches as you suggest. A better
> logical ordering should make the reviewing a less daunting task. I'm a
> bit hesitant to float many small sets of patches as their role in the
> bigger picture would be less clear and hence risk loosing the 'why'.
> IMHO, it should be as easy (if not easier) to review and pick patches in
> a larger set as it is for multiple smaller sets. However, I guess that
Having self-contained patchsets merged into a larger set can create
some useless dependencies between them, as they modify the same area
but for different goals.
> is individual and for automated testing it would be easier to have them
> split out.
>
> How about focusing on one (or two) of these smaller patch sets at the
> time to minimize the potential confusion and post them separately?
I'm fine with your proposal to start with 1 or 2 smaller patchsets. The
2 following patchsets are, IMHO, the most self-contained and
straightforward:
- patches 11, 13, 14 and 15 are only linked to frequency scaling invariance
- patches 18 and 19 are about tracking and adding the blocked
  utilization in the CPU usage
Maybe we can start with them?
Regards,
Vincent
>
> I would still include them in updated mega-postings that includes all
> the dependencies so the full story would still available for those who
> are interested. I would of course make it clear which patches that are
> also posted separately.
that's fair enough
>
> Thanks,
> Morten
On Thu, Apr 09, 2015 at 08:41:34AM +0100, Vincent Guittot wrote:
> On 8 April 2015 at 15:33, Morten Rasmussen <[email protected]> wrote:
> > Hi Vincent,
> >
> > On Thu, Apr 02, 2015 at 01:43:31PM +0100, Vincent Guittot wrote:
> >> On 4 February 2015 at 19:30, Morten Rasmussen <[email protected]> wrote:
> >> > RFCv3 is a consolidation of the latest energy model related patches and
> >> > previously posted patch sets related to capacity and utilization
> >> > tracking [2][3] to show where we are heading. [2] and [3] have been
> >> > rebased onto v3.19-rc7 with a few minor modifications. Large parts of
> >> > the energy model code and use of the energy model in the scheduler has
> >> > been rewritten and simplified. The patch set consists of three main
> >> > parts (more details further down):
> >> >
> >> > Patch 1-11: sched: consolidation of CPU capacity and usage [2] (rebase)
> >> >
> >> > Patch 12-19: sched: frequency and cpu invariant per-entity load-tracking
> >> > and other load-tracking bits [3] (rebase)
> >> >
> >> > Patch 20-48: sched: Energy cost model for energy-aware scheduling (RFCv3)
> >>
> >>
> >> Hi Morten,
> >>
> >> 48 patches is a big number of patches and when i look into your
> >> patchset, some feature are quite self contained. IMHO it would be
> >> worth splitting it in smaller patchsets in order to ease the review
> >> and the regression test.
> >> From a 1st look at your patchset , i have found
> >> -patches 11,13,14 and 15 are only linked to frequency scaling invariance
> >> -patches 12, 17 and 17 are only about adding cpu scaling invariance
> >> -patches 18 and 19 are about tracking and adding the blocked
> >> utilization in the CPU usage
> >> -patches 20 to the end is linked the EAS
> >
> > I agree it makes sense to regroup the patches as you suggest. A better
> > logical ordering should make the reviewing a less daunting task. I'm a
> > bit hesitant to float many small sets of patches as their role in the
> > bigger picture would be less clear and hence risk loosing the 'why'.
> > IMHO, it should be as easy (if not easier) to review and pick patches in
> > a larger set as it is for multiple smaller sets. However, I guess that
>
> Having self contained patchset merged in a larger set can create so
> useless dependency between them as they modify same area but for
> different goal
>
> > is individual and for automated testing it would be easier to have them
> > split out.
> >
> > How about focusing on one (or two) of these smaller patch sets at the
> > time to minimize the potential confusion and post them separately?
>
> I'm fine with your proposal to start with 1 or 2 smaller patchset. The
> 2 following patchset are, IMHO, the ones the most self contained and
> straight forward:
> - patches 11,13,14 and 15 are only linked to frequency scaling invariance
> - patches 18 and 19 are about tracking and adding the blocked
> utilization in the CPU usage
>
> May be we can start with them ?
Agreed. Those two would form meaningful patch sets. I will fix them and
split them out.
Thanks,
Morten
Quoting Peter Zijlstra (2015-03-26 03:41:50)
> On Thu, Mar 26, 2015 at 10:21:24AM +0000, Juri Lelli wrote:
> > - what about other sched classes? I know that this is very premature,
> > but I can help but thinking that we'll need to do some sort of
> > aggregation of requests, and if we put triggers in very specialized
> > points we might lose some of the sched classes separation
>
> So for deadline we can do P state selection (as you're well aware) based
> on the requested utilization. Not sure what to do for fifo/rr though,
> they lack much useful information (as always).
>
> Now if we also look ahead to things like the ACPI CPPC stuff we'll see
> that CFS and DL place different requirements on the hints. Where CFS
> would like to hint a max perf (the hardware going slower due to the code
> consisting of mostly stalls is always fine from a best effort energy
> pov), the DL stuff would like to hint a min perf, seeing how it 'needs'
> to provide a QoS.
>
> So we either need to carry this information along in a 'generic' way
> between the various classes or put the hinting in every class.
>
> But yes, food for thought for sure.
I am a fan of putting the hints in every class. One idea I've been
considering is that each sched class could have a small, simple cpufreq
governor that expresses its constraints (max for cfs, min qos for dl)
and then the cpufreq core Does The Right Thing.
This would be a multi-governor approach, which requires some surgery to
cpufreq core code, but I like the modularity and maintainability of it
more than having one big super governor that has to satisfy every need.
Regards,
Mike
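Purely as an illustration of the aggregation Mike and Peter are
discussing (hypothetical types and names, not from any posted patch):
CFS would contribute a best-effort request that may be lowered, while
DL contributes a floor that must be honoured, and the core combines
them:

/* hypothetical per-class frequency hints and their aggregation */
struct freq_hints {
	unsigned long cfs_request;	/* best effort: may be lowered */
	unsigned long dl_min;		/* QoS floor: must be honoured */
};

static unsigned long aggregate_freq_request(struct freq_hints *h,
					    unsigned long max_freq)
{
	unsigned long req = max(h->cfs_request, h->dl_min);

	return min(req, max_freq);
}

Whether this combination lives in one governor or in per-class
mini-governors is exactly the open question in this sub-thread.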
On Mon, Apr 27, 2015 at 09:01:13AM -0700, Michael Turquette wrote:
> Quoting Peter Zijlstra (2015-03-26 03:41:50)
> > On Thu, Mar 26, 2015 at 10:21:24AM +0000, Juri Lelli wrote:
> > > - what about other sched classes? I know that this is very premature,
> > > but I can help but thinking that we'll need to do some sort of
> > > aggregation of requests, and if we put triggers in very specialized
> > > points we might lose some of the sched classes separation
> >
> > So for deadline we can do P state selection (as you're well aware) based
> > on the requested utilization. Not sure what to do for fifo/rr though,
> > they lack much useful information (as always).
> >
> > Now if we also look ahead to things like the ACPI CPPC stuff we'll see
> > that CFS and DL place different requirements on the hints. Where CFS
> > would like to hint a max perf (the hardware going slower due to the code
> > consisting of mostly stalls is always fine from a best effort energy
> > pov), the DL stuff would like to hint a min perf, seeing how it 'needs'
> > to provide a QoS.
> >
> > So we either need to carry this information along in a 'generic' way
> > between the various classes or put the hinting in every class.
> >
> > But yes, food for thought for sure.
>
> I am a fan of putting the hints in every class. One idea I've been
> considering is that each sched class could have a small, simple cpufreq
> governor that expresses its constraints (max for cfs, min qos for dl)
> and then the cpufreq core Does The Right Thing.
>
> This would be a multi-governor approach, which requires some surgery to
> cpufreq core code, but I like the modularity and maintainability of it
> more than having one big super governor that has to satisfy every need.
Well, at that point we really don't need cpufreq anymore do we? All
you need is the hardware driver (ACPI P-state, ACPI CPPC etc.).
Because as I understand it, cpufreq currently is mostly the governor
thing (which we'll replace) and some infra for dealing with these head
cases that require scheduling for changing P states (which we can leave
on cpufreq proper for the time being).
Would it not be easier to just start from scratch and convert the (few)
drivers we need to prototype this? Instead of trying to drag the
entirety of cpufreq along just to keep all the drivers?
On 28/04/15 08:12, [email protected] wrote:
> Morten Rasmussen <[email protected]> wrote 2015-02-05 AM 02:30:54:
>> From: Dietmar Eggemann <[email protected]>
>>
>> Since now we have besides frequency invariant also cpu (uarch plus max
>> system frequency) invariant cfs_rq::utilization_load_avg both, frequency
>> and cpu scaling happens as part of the load tracking.
>> So cfs_rq::utilization_load_avg does not have to be scaled by the original
>> capacity of the cpu again.
>>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Signed-off-by: Dietmar Eggemann <[email protected]>
>> ---
>> kernel/sched/fair.c | 5 ++---
>> 1 file changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 5375ab1..a85c34b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4807,12 +4807,11 @@ static int select_idle_sibling(struct
>> task_struct *p, int target)
>> static int get_cpu_usage(int cpu)
>> {
>> unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
>> - unsigned long capacity = capacity_orig_of(cpu);
>>
>> if (usage >= SCHED_LOAD_SCALE)
>
> Since utilization_load_avg has been applied all the scaling,
> it can't exceed its original capacity. SCHED_LOAD_SCALE is
> the max original capacity of all, right?
>
> So, shouldn't this be "if(usage >= orig_capacity)"?
Absolutely, you're right. The usage on cpus which have a smaller orig
capacity (<1024) than the cpus with the highest orig capacity (1024)
has to be limited to their orig capacity.
There is another patch in this series, '[RFCv3 PATCH 28/48] sched: Use
capacity_curr to cap utilization in get_cpu_usage()', which changes the
upper bound from SCHED_LOAD_SCALE to capacity_curr_of(cpu) and returns
this current capacity in case the usage (running and blocked) exceeds it.
Our testing in the meantime has shown that this is the wrong approach in
some cases, e.g. adding more tasks to a cpu and deciding the new
capacity state (OPP) based on get_cpu_usage(). We are likely to change
this to capacity_orig_of(cpu) in the next version.
[...]
On 28/04/15 09:58, [email protected] wrote:
> Morten Rasmussen <[email protected]> wrote 2015-02-05 AM 02:31:00:
>> [RFCv3 PATCH 23/48] sched: Allocate and initialize energy data structures
>>
>> From: Dietmar Eggemann <[email protected]>
>>
>> The per sched group sched_group_energy structure plus the related
>> idle_state and capacity_state arrays are allocated like the other sched
>> domain (sd) hierarchy data structures. This includes the freeing of
>> sched_group_energy structures which are not used.
>>
>> One problem is that the number of elements of the idle_state and the
>> capacity_state arrays is not fixed and has to be retrieved in
>> __sdt_alloc() to allocate memory for the sched_group_energy structure and
>> the two arrays in one chunk. The array pointers (idle_states and
>> cap_states) are initialized here to point to the correct place inside the
>> memory chunk.
>>
>> The new function init_sched_energy() initializes the sched_group_energy
>> structure and the two arrays in case the sd topology level contains energy
>> information.
[...]
>>
>> +static void init_sched_energy(int cpu, struct sched_domain *sd,
>> + struct sched_domain_topology_level *tl)
>> +{
>> + struct sched_group *sg = sd->groups;
>> + struct sched_group_energy *energy = sg->sge;
>> + sched_domain_energy_f fn = tl->energy;
>> + struct cpumask *mask = sched_group_cpus(sg);
>> +
>> + if (!fn || !fn(cpu))
>> + return;
>
> Maybe if there's no valid fn(), we can dec the sched_group_energy's
> reference count, so that it can be freed if no one uses it.
Good catch! We actually want sg->sge to be NULL if there is no
energy function fn or if fn returns NULL. We never noticed that this is
not the case since, so far, we have only tested the whole patch-set with
energy functions available for each sched domain (sd).
All sd's at or below 'struct sched_domain *ea_sd' (the highest level
at which the energy model is provided) have to provide a valid energy
function fn; that check is currently missing as well.
Instead of decrementing the ref count, I could defer the increment from
get_group() to init_sched_energy().
>
> Also, this function may enter several times for the shared sge,
> there is no need to do the duplicate operation below. Adding
> this would be better?
>
> if (cpu != group_balance_cpu(sg))
> return;
>
That's true.
This snippet gives the functionality on top of this patch (tested on
a two-cluster ARM system with fn set to NULL on MC or DIE sd level, or
both, in arm_topology[] (arch/arm/kernel/topology.c)):
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c49f3ee928b8..6d9b5327a2b6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5969,7 +5969,6 @@ static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
(*sg)->sgc = *per_cpu_ptr(sdd->sgc, cpu);
atomic_set(&(*sg)->sgc->ref, 1); /* for claim_allocations */
(*sg)->sge = *per_cpu_ptr(sdd->sge, cpu);
- atomic_set(&(*sg)->sge->ref, 1); /* for claim_allocations */
}
return cpu;
@@ -6067,8 +6066,16 @@ static void init_sched_energy(int cpu, struct sched_domain *sd,
sched_domain_energy_f fn = tl->energy;
struct cpumask *mask = sched_group_cpus(sg);
- if (!fn || !fn(cpu))
+ if (cpu != group_balance_cpu(sg))
+ return;
+
+ if (!fn || !fn(cpu)) {
+ sg->sge = NULL;
return;
+ }
+
+ atomic_set(&sg->sge->ref, 1); /* for claim_allocations */
+
[...]
On 28/04/15 10:09, [email protected] wrote:
> Morten Rasmussen <[email protected]> wrote 2015-02-05 AM 02:31:06:
>
>> Morten Rasmussen <[email protected]>
>> [RFCv3 PATCH 29/48] sched: Highest energy aware balancing
>> sched_domain level pointer
>>
>> Add another member to the family of per-cpu sched_domain shortcut
>> pointers. This one, sd_ea, points to the highest level at which energy
>> model is provided. At this level and all levels below all sched_groups
>> have energy model data attached.
[...]
>> @@ -5766,6 +5767,14 @@ static void update_top_cache_domain(int cpu)
>>
>> sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
>> rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
>> +
>> + for_each_domain(cpu, sd) {
>> + if (sd->groups->sge)
>
> sge is like sgc, I think the test will always return true, no?
True for the current implementation. This code will make more sense once
we integrate the changes you proposed for '[RFCv3 PATCH 23/48] sched:
Allocate and initialize energy data structures' into the next version.
[...]
On 30/04/15 06:12, [email protected] wrote:
> [email protected] wrote 2015-02-05 AM 02:31:14:
>
>> Morten Rasmussen <[email protected]>
>>
>> [RFCv3 PATCH 37/48] sched: Determine the current sched_group idle-state
>>
>> To estimate the energy consumption of a sched_group in
>> sched_group_energy() it is necessary to know which idle-state the group
>> is in when it is idle. For now, it is assumed that this is the current
>> idle-state (though it might be wrong). Based on the individual cpu
>> idle-states group_idle_state() finds the group idle-state.
[...]
>> +static int group_idle_state(struct sched_group *sg)
>> +{
>> + struct sched_group_energy *sge = sg->sge;
>> + int shallowest_state = sge->idle_states_below + sge->nr_idle_states;
>> + int i;
>> +
>> + for_each_cpu(i, sched_group_cpus(sg)) {
>> + int cpuidle_idx = idle_get_state_idx(cpu_rq(i));
>> + int group_idx = cpuidle_idx - sge->idle_states_below + 1;
>
> So cpuidle_idx==0 for core-level(2 C-States for example) groups, it
> returns 1?
Yes.
>
> What does this mean?
So for ARM TC2 and CPU0 (Cortex A15, big) for example:
# cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
WFI [cpuidle_idx=0]
cluster-sleep-b [cpuidle_idx=1]
group_idx is the system C-state re-indexed to the sd (e.g. on MC level
WFI is group idx 1 and on DIE level it's group idx 0).
For MC (or core-) level groups:
cpu=0 sg_mask=0x1 group_idx=1 cpuidle_idx=0 sge->idle_states_below=0
shallowest_state=1 sge->nr_idle_states=1
group_idle_state() returns 0 in this case because of (shallowest_state=1
>= sge->nr_idle_states=1).
This value is then used to index into the sg->sge->idle_states[] to get
the idle power value for the topology bit represented by sg (i.e. for
CPU0). The idle power value for CPU0 is originally specified in the
energy model:
static struct idle_state idle_states_core_a15[] = { { .power = 0 }, };
[arch/arm/kernel/topology.c]
Another example would be to find the idle power value for the LITTLE
cluster if it is in 'cluster-sleep-l' [cpuidle_idx=1]:
cpu=2 sg_mask=0x1c group_idx=1 cpuidle_idx=1 sge->idle_states_below=1
shallowest_state=3 sge->nr_idle_states=2
cpu=3 sg_mask=0x1c group_idx=1 cpuidle_idx=1 sge->idle_states_below=1
shallowest_state=1 sge->nr_idle_states=2
cpu=3 sg_mask=0x1c group_idx=1 cpuidle_idx=1 sge->idle_states_below=1
shallowest_state=1 sge->nr_idle_states=2
group_idle_state() returns 1 (shallowest_state=1) and the idle power
value is 10.
static struct idle_state idle_states_cluster_a7[] = { ..., { .power = 10 }};
To sum it up, group_idle_state() is necessary to transform the C state
values (-1, 0, 1) into sg->sge->idle_states[] indexes for the different
sg's (MC, DIE).
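Condensing the two walked-through cases into the index arithmetic
(numbers as given above):

  group_idx = cpuidle_idx - sge->idle_states_below + 1

  A15 core group (MC level):    group_idx = 0 - 0 + 1 = 1; shallowest_state
      ends up 1, which is >= nr_idle_states = 1, so group_idle_state()
      returns 0 (core WFI, power 0).
  A7 cluster group (DIE level): group_idx = 1 - 1 + 1 = 1; shallowest_state
      ends up 1, which is < nr_idle_states = 2, so group_idle_state()
      returns 1 (cluster-sleep-l, power 10).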
[...]
On 30/04/15 08:00, [email protected] wrote:
> [email protected] wrote 2015-02-05 AM 02:31:08:
[...]
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d12aa63..07c84af 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4592,23 +4592,44 @@ static unsigned long capacity_curr_of(int cpu)
>> * Without capping the usage, a group could be seen as overloaded
> (CPU0 usage
>> * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available
> capacity/
>> */
>> -static int get_cpu_usage(int cpu)
>> +static int __get_cpu_usage(int cpu, int delta)
>> {
>> + int sum;
>> unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
>> unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
>> unsigned long capacity_curr = capacity_curr_of(cpu);
>>
>> - if (usage + blocked >= capacity_curr)
>> + sum = usage + blocked + delta;
>> +
>> + if (sum < 0)
>> + return 0;
>> +
>> + if (sum >= capacity_curr)
>> return capacity_curr;
>
> So if the added delta exceeds the curr capacity not its orignal capacity
> which I think would be quite often cases, I guess it should be better if
> it's allowed to increase its freq and calculate the right energy diff.
Yes, as I mentioned in my answer for [RFCv3 PATCH 17/48], our testing
in the meantime has shown that this capping by capacity_curr is the
wrong approach in some cases and that we are likely to change this to
capacity_orig_of(cpu) in the next version.
[...]
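The change referred to would, assuming the names from the quoted hunk,
turn the helper into something like the following sketch (the posted
next version may differ):

static int __get_cpu_usage(int cpu, int delta)
{
	int sum;
	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
	unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
	unsigned long capacity_orig = capacity_orig_of(cpu);

	sum = usage + blocked + delta;

	if (sum < 0)
		return 0;

	/* cap by the cpu's original capacity instead of its current one,
	 * so that added usage can still suggest a higher OPP */
	if (sum >= capacity_orig)
		return capacity_orig;

	return sum;
}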
On 30/04/15 08:12, [email protected] wrote:
> [email protected] wrote 2015-03-27 PM 11:03:24:
>> Re: [RFCv3 PATCH 43/48] sched: Introduce energy awareness into
> detach_tasks
[...]
>> > Hi Morten, Dietmar,
>> >
>> > Wouldn't the above energy_diff() use the 'old' value of dst_cpu's util?
>> > Tasks are detached/dequeued in this loop so they have their util
>> > contrib. removed from src_cpu but their contrib. hasn't been added to
>> > dst_cpu yet (happens in attach_tasks).
>>
>> You're absolutely right Sai. Thanks for pointing this out! I guess I
> rather
>> have to accumulate the usage of tasks I've detached and add this to the
>> eenv::usage_delta of the energy_diff() call for the next task.
>>
>> Something like this (only slightly tested):
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8d4cc72f4778..d0d0e965fd0c 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6097,6 +6097,7 @@ static int detach_tasks(struct lb_env *env)
>> struct task_struct *p;
>> unsigned long load = 0;
>> int detached = 0;
>> + int usage_delta = 0;
>>
>> lockdep_assert_held(&env->src_rq->lock);
>>
>> @@ -6122,16 +6123,19 @@ static int detach_tasks(struct lb_env *env)
>> goto next;
>>
>> if (env->use_ea) {
>> + int util = task_utilization(p);
>> struct energy_env eenv = {
>> .src_cpu = env->src_cpu,
>> .dst_cpu = env->dst_cpu,
>> - .usage_delta = task_utilization(p),
>> + .usage_delta = usage_delta + util,
>> };
>> int e_diff = energy_diff(&eenv);
>
> If any or total utilization of tasks detached exceeds the orig capacity,
> of src_cpu, should we bail out in case of performance plus avoiding
> reaching the tipping point easily(because in such cases, src_cpu tends
> to be ended up overutilized)?
Yes, correct. We have to limit the dst cpu to pulling no more than its
remaining capacity's worth of usage. I have already integrated a check
that the dst cpu is not over-utilized when taking on the usage_delta.
> Also like what I just replied to "[RFCv4 PATCH 31/48]", when doing
> energy_diff() it should be allowed to exceed cur capacity if not
> reaching its original capacity(is capacity_of(cpu) better?).
True.
[...]
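A purely illustrative way of expressing the guard described here,
reusing names from the quoted snippet (hypothetical, not the integrated
code):

	/* do not let dst_cpu accumulate more usage than it has capacity for */
	if (__get_cpu_usage(env->dst_cpu, usage_delta + util) >=
	    capacity_orig_of(env->dst_cpu))
		goto next;

i.e. a task is skipped once pulling it, on top of what has already been
detached in this round, would over-utilize the destination cpu.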
On 01/05/15 10:56, [email protected] wrote:
> Hi Dietmar,
>
> Dietmar Eggemann <[email protected]> wrote 2015-05-01 AM 04:17:51:
>>
>> Re: [RFCv3 PATCH 37/48] sched: Determine the current sched_group
> idle-state
>>
>> On 30/04/15 06:12, [email protected] wrote:
>> > [email protected] wrote 2015-02-05 AM 02:31:14:
[...]
> Thanks for explaining this in graphic detail.
>
> From what I understood, let's just assume ARM TC2 has an extra
> MC-level C-States SPC(assuming its power is 40 for the big).
>
> Take the power value from "RFCv3 PATCH 25/48":
> static struct idle_state idle_states_cluster_a15[] = {
> { .power = 70 }, /* WFI */
> { .power = 25 }, /* cluster-sleep-b */
> };
>
> static struct idle_state idle_states_core_a15[] = {
> { .power = 0 }, /* WFI */
> };
>
> Then we will get the following idle energy table?
> static struct idle_state idle_states_core_a15[] = {
> { .power = 70 }, /* WFI */
> { .power = 0 }, /* SPC*/
> };
>
> static struct idle_state idle_states_cluster_a15[] = {
> { .power = 40 }, /* SPC */
> { .power = 25 }, /* cluster-sleep-b */
> };
>
> Is this correct?
Yes. It's the same C-state configuration we have on our ARMv8 big.LITTLE
JUNO (Cortex A57, A53) board.
# cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
WFI
cpu-sleep-0
cluster-sleep-0
> If this is right, there may be a bug.
>
> For MC-level CPU0:
> sg_mask=0x1 sge->nr_idle_states=2 sge->idle_states_below=0
> cpuidle_idx=0 group_idx=1 shallowest_state=1
>
> See, group_idle_state() finally returns 1 as CPU0's MC-level
> idle engery model index, and this is obviously wrong.
Yes, I can see this problem too.
> So, I think for
> "int group_idx = cpuidle_idx - sge->idle_states_below + 1;"
>
> Maybe we shouldn't add the extra 1 for lowest levels?
Maybe. In any case we will have to resolve this issue in the next
version. Thanks for pointing this out!
[...]
On 30/04/15 08:46, [email protected] wrote:
> [email protected] wrote 2015-03-26 AM 02:44:48:
>
>> Dietmar Eggemann <[email protected]>
>>
>> Re: [RFCv3 PATCH 45/48] sched: Skip cpu as lb src which has one task
>> and capacity gte the dst cpu
>>
>> On 24/03/15 15:27, Peter Zijlstra wrote:
>> > On Wed, Feb 04, 2015 at 06:31:22PM +0000, Morten Rasmussen wrote:
>> >> From: Dietmar Eggemann <[email protected]>
>> >>
>> >> Skip cpu as a potential src (costliest) in case it has only one task
>> >> running and its original capacity is greater than or equal to the
>> >> original capacity of the dst cpu.
>> >
>> > Again, that's what, but is lacking a why.
>> >
>>
>> You're right, the 'why' is completely missing.
>>
>> This is one of our heterogeneous (big.LITTLE) cpu related patches. We
>> don't want to end up migrating this single task from a big to a little
>> cpu, hence the use of capacity_orig_of(cpu). Our cpu topology makes sure
>> that this rule is only active on DIE sd level.
>
> Hi Dietmar,
>
> Could you tell me the reason why don't want to end up migrating this single
> task from a big to a little cpu?
>
> Like what I just replied to "[RFCv3 PATCH 47/48]", if the task is a
> small one,
> why couldn't we migrate it to the little cpu to save energy especially when
> the cluster has shared freq, the saving may be appreciable?
If it's a big (always running) task, it should stay on the cpu with the
higher capacity. If it is a small task it will eventually go to sleep
and the wakeup path will take care of placing it onto the right cpu.
[...]
On 03/05/15 07:27, [email protected] wrote:
> Hi Dietmar,
>
> Dietmar Eggemann <[email protected]> wrote 2015-03-24 AM 03:19:41:
>>
>> Re: [RFCv3 PATCH 12/48] sched: Make usage tracking cpu scale-invariant
[...]
>> In the previous patch-set https://lkml.org/lkml/2014/12/2/332we
>> cpu-scaled both (sched_avg::runnable_avg_sum (load) and
>> sched_avg::running_avg_sum (utilization)) but during the review Vincent
>> pointed out that a cpu-scaled invariant load signal messes up
>> load-balancing based on s[dg]_lb_stats::avg_load in overload scenarios.
>>
>> avg_load = load/capacity and load can't be simply replaced here by
>> 'cpu-scale invariant load' (which is load*capacity).
>
> I can't see why it shouldn't.
>
> For "avg_load = load/capacity", "avg_load" stands for how busy the cpu
> works,
> it is actually a value relative to its capacity. The system is seen
> balanced
> for the case that a task runs on a 512-capacity cpu contributing 50% usage,
> and two the same tasks run on the 1024-capacity cpu contributing 50% usage.
> "capacity" in this formula contains uarch capacity, "load" in this formula
> must be an absolute real load, not relative.
>
> But with current kernel implementation, "load" computed without this patch
> is a relative value. For example, one task (1024 weight) runs on a 1024
> capacity CPU, it gets 256 load contribution(25% on this CPU). When it runs
> on a 512 capacity CPU, it will get the 512 load contribution(50% on ths
> CPU).
> See, currently runnable "load" is relative, so "avg_load" is actually wrong
> and its value equals that of "load". So I think the runnable load should be
> made cpu scale-invariant as well.
>
> Please point me out if I was wrong.
Cpu-scaled load leads to wrong lb decisions in overload scenarios:
(1) Overload example taken from email thread between Vincent and Morten:
https://lkml.org/lkml/2014/12/30/114
7 always running tasks, 4 on cluster 0, 3 on cluster 1:
              cluster 0        cluster 1
capacity      1024 (2*512)     1024 (1*1024)
load          4096             3072
scale_load    2048             3072
Simply using cpu-scaled load in the existing lb code would declare
cluster 1 busier than cluster 0, although the compute capacity budget
for one task is higher on cluster 1 (1024/3 = 341) than on cluster 0
(2*512/4 = 256).
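Spelling out the avg_load arithmetic for example (1):

  avg_load (load / capacity):          cluster 0: 4096 / 1024 = 4.0   cluster 1: 3072 / 1024 = 3.0
                                       -> cluster 0 found busier, matching the per-task budget (256 vs 341)
  same ratio using cpu-scaled load:    cluster 0: 2048 / 1024 = 2.0   cluster 1: 3072 / 1024 = 3.0
                                       -> cluster 1 wrongly declared busier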
(2) A non-overload example does not show this problem:
7 12.5% (scaled to 1024) tasks, 4 on cluster 0, 3 on cluster 1:
              cluster 0        cluster 1
capacity      1024 (2*512)     1024 (1*1024)
load          1024             384
scale_load    512              384
Here cluster 0 is busier taking load or cpu-scaled load.
We should continue to use avg_load based on load (maybe calculated out
of scaled load once introduced?) for overload scenarios and use
scale_load for non-overload scenarios. Since this hasn't been
implemented yet, we got rid of cpu-scaled load in
this RFC.
[...]