2015-08-14 16:20:09

by Morten Rasmussen

Subject: [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking

Per-entity load tracking currently compensates for frequency scaling only in
the utilization signal. This patch set extends this compensation to load as well,
and adds compute capacity (different microarchitectures and/or max
frequency/P-state) invariance to utilization. The former prevents suboptimal
load-balancing decisions when cpus run at different frequencies, while the
latter ensures that utilization (sched_avg.util_avg) can be compared across
cpus and that utilization can be compared directly to cpu capacity to determine
if the cpu is overloaded.
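
As a rough sketch of the idea (illustrative only, not code from this series;
1024 is the kernel's SCHED_CAPACITY_SCALE, and the two factors come from the
arch_scale_{freq,cpu}_capacity() hooks), each accrued time segment is scaled
as follows:

#define SCHED_CAPACITY_SHIFT	10

/*
 * Illustrative: make a time segment (delta, in us) scale-invariant.
 * Both factors are in [0..1024], 1024 meaning "fastest cpu in the
 * system running at its maximum frequency".
 */
static unsigned long scale_segment(unsigned long delta,
				   unsigned long scale_freq,
				   unsigned long scale_cpu)
{
	/* frequency invariance: applied to both load and utilization */
	delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;

	/* cpu (microarch/max-frequency) invariance: utilization only */
	return (delta * scale_cpu) >> SCHED_CAPACITY_SHIFT;
}

E.g. 1000 us on a cpu at half of its maximum frequency contributes ~500, so a
task no longer looks heavier simply because its cpu runs slower.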

Note that this posting only contains the scheduler patches; the architecture-
specific implementations of arch_scale_{freq,cpu}_capacity() will be posted
separately later.

The patches have been posted several times before, most recently as part of the
energy-model driven scheduling RFCv5 patch set [1] (patches #2, 4, 6, 8-12).
That RFC also contains patches for the architecture-specific side. In this
posting the commit messages have been updated and the patches have been rebased
on a more recent tip/sched/core that includes Yuyang's rewrite, which made some
of the previously posted patches redundant.

Target: ARM TC2 A7-only (x3)
Test: hackbench -g 25 --threads -l 10000

Before          After
315.545         313.408         -0.68%

Target: Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz
Test: hackbench -g 25 --threads -l 1000 (avg of 10)

Before          After
6.4643          6.395           -1.07%

[1] http://www.kernelhub.org/?p=2&msg=787634

Dietmar Eggemann (4):
sched/fair: Make load tracking frequency scale-invariant
sched/fair: Make utilization tracking cpu scale-invariant
sched/fair: Name utilization related data and functions consistently
sched/fair: Get rid of scaling utilization by capacity_orig

Morten Rasmussen (2):
sched/fair: Convert arch_scale_cpu_capacity() from weak function to
#define
sched/fair: Initialize task load and utilization before placing task
on rq

include/linux/sched.h | 8 ++--
kernel/sched/core.c | 4 +-
kernel/sched/fair.c | 109 +++++++++++++++++++++++-------------------------
kernel/sched/features.h | 5 ---
kernel/sched/sched.h | 11 +++++
5 files changed, 69 insertions(+), 68 deletions(-)

--
1.9.1


2015-08-14 16:21:39

by Morten Rasmussen

Subject: [PATCH 1/6] sched/fair: Make load tracking frequency scale-invariant

From: Dietmar Eggemann <[email protected]>

Apply frequency scaling correction factor to per-entity load tracking to
make it frequency invariant. Currently, load appears bigger when the cpu
is running slower which affects load-balancing decisions.

Each segment of the sched_avg.load_sum geometric series is now scaled by
the current frequency so that the sched_avg.load_avg of each sched entity
will be invariant from frequency scaling.

Moreover, cfs_rq.runnable_load_sum is scaled by the current frequency as
well.
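
For illustration (not part of the patch): with SCHED_CAPACITY_SCALE == 1024, a
cpu running at 50% of its maximum frequency reports
arch_scale_freq_capacity() == 512, so a full 1024 us segment contributes only
half of its wall-clock length once passed through the scale() helper
introduced below. Schematically, per segment of length delta:

/* scale(v, s) is ((v)*(s) >> SCHED_CAPACITY_SHIFT), with SHIFT == 10 */
unsigned long scaled_delta = scale(delta, scale_freq);	/* e.g. scale(1024, 512) == 512 */

sa->load_sum += weight * scaled_delta;
if (cfs_rq)
	cfs_rq->runnable_load_sum += weight * scaled_delta;
sa->util_sum += scaled_delta;	/* utilization was already scaled this way */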

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Acked-by: Vincent Guittot <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
include/linux/sched.h | 6 +++---
kernel/sched/fair.c | 27 +++++++++++++++++----------
2 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 44dca5b..a153051 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1177,9 +1177,9 @@ struct load_weight {

/*
* The load_avg/util_avg accumulates an infinite geometric series.
- * 1) load_avg factors the amount of time that a sched_entity is
- * runnable on a rq into its weight. For cfs_rq, it is the aggregated
- * such weights of all runnable and blocked sched_entities.
+ * 1) load_avg factors frequency scaling into the amount of time that a
+ * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
+ * aggregated such weights of all runnable and blocked sched_entities.
* 2) util_avg factors frequency scaling into the amount of time
* that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
* For cfs_rq, it is the aggregated such times of all runnable and
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 858b94a..1626410 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2515,6 +2515,8 @@ static u32 __compute_runnable_contrib(u64 n)
return contrib + runnable_avg_yN_sum[n];
}

+#define scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+
/*
* We can represent the historical contribution to runnable average as the
* coefficients of a geometric series. To do this we sub-divide our runnable
@@ -2547,9 +2549,9 @@ static __always_inline int
__update_load_avg(u64 now, int cpu, struct sched_avg *sa,
unsigned long weight, int running, struct cfs_rq *cfs_rq)
{
- u64 delta, periods;
+ u64 delta, scaled_delta, periods;
u32 contrib;
- int delta_w, decayed = 0;
+ int delta_w, scaled_delta_w, decayed = 0;
unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);

delta = now - sa->last_update_time;
@@ -2585,13 +2587,16 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
* period and accrue it.
*/
delta_w = 1024 - delta_w;
+ scaled_delta_w = scale(delta_w, scale_freq);
if (weight) {
- sa->load_sum += weight * delta_w;
- if (cfs_rq)
- cfs_rq->runnable_load_sum += weight * delta_w;
+ sa->load_sum += weight * scaled_delta_w;
+ if (cfs_rq) {
+ cfs_rq->runnable_load_sum +=
+ weight * scaled_delta_w;
+ }
}
if (running)
- sa->util_sum += delta_w * scale_freq >> SCHED_CAPACITY_SHIFT;
+ sa->util_sum += scaled_delta_w;

delta -= delta_w;

@@ -2608,23 +2613,25 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,

/* Efficiently calculate \sum (1..n_period) 1024*y^i */
contrib = __compute_runnable_contrib(periods);
+ contrib = scale(contrib, scale_freq);
if (weight) {
sa->load_sum += weight * contrib;
if (cfs_rq)
cfs_rq->runnable_load_sum += weight * contrib;
}
if (running)
- sa->util_sum += contrib * scale_freq >> SCHED_CAPACITY_SHIFT;
+ sa->util_sum += contrib;
}

/* Remainder of delta accrued against u_0` */
+ scaled_delta = scale(delta, scale_freq);
if (weight) {
- sa->load_sum += weight * delta;
+ sa->load_sum += weight * scaled_delta;
if (cfs_rq)
- cfs_rq->runnable_load_sum += weight * delta;
+ cfs_rq->runnable_load_sum += weight * scaled_delta;
}
if (running)
- sa->util_sum += delta * scale_freq >> SCHED_CAPACITY_SHIFT;
+ sa->util_sum += scaled_delta;

sa->period_contrib += delta;

--
1.9.1

2015-08-14 16:20:13

by Morten Rasmussen

Subject: [PATCH 2/6] sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define

Bring arch_scale_cpu_capacity() in line with the recent change of its
arch_scale_freq_capacity() sibling in commit dfbca41f3479 ("sched:
Optimize freq invariant accounting") from weak function to #define to
allow inlining of the function.

While at it, remove the ARCH_CAPACITY sched_feature as well. With the
change to #define there isn't a straightforward way to allow runtime
switch between an arch implementation and the default implementation of
arch_scale_cpu_capacity() using sched_feature. The default was to use
the arch-specific implementation, but only the arm architecture provides
one and that is essentially equivalent to the default implementation.
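
For illustration, an architecture that wants its own implementation now
provides it at compile time instead of overriding a weak symbol. A hedged
sketch (the function name and header below are made up; the actual arch-side
patches are posted separately):

/* e.g. in the arch's asm/topology.h; illustrative names only: */
unsigned long my_arch_scale_cpu_capacity(struct sched_domain *sd, int cpu);
#define arch_scale_cpu_capacity my_arch_scale_cpu_capacity

/*
 * kernel/sched/sched.h then only provides the default when the arch
 * has not defined the macro:
 *
 *	#ifndef arch_scale_cpu_capacity
 *	...default implementation...
 *	#endif
 */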

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 22 +---------------------
kernel/sched/features.h | 5 -----
kernel/sched/sched.h | 11 +++++++++++
3 files changed, 12 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1626410..c72223a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6016,19 +6016,6 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
return load_idx;
}

-static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
-{
- if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
- return sd->smt_gain / sd->span_weight;
-
- return SCHED_CAPACITY_SCALE;
-}
-
-unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
-{
- return default_scale_cpu_capacity(sd, cpu);
-}
-
static unsigned long scale_rt_capacity(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -6058,16 +6045,9 @@ static unsigned long scale_rt_capacity(int cpu)

static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
- unsigned long capacity = SCHED_CAPACITY_SCALE;
+ unsigned long capacity = arch_scale_cpu_capacity(sd, cpu);
struct sched_group *sdg = sd->groups;

- if (sched_feat(ARCH_CAPACITY))
- capacity *= arch_scale_cpu_capacity(sd, cpu);
- else
- capacity *= default_scale_cpu_capacity(sd, cpu);
-
- capacity >>= SCHED_CAPACITY_SHIFT;
-
cpu_rq(cpu)->cpu_capacity_orig = capacity;

capacity *= scale_rt_capacity(cpu);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 83a50e7..6565eac 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -36,11 +36,6 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
*/
SCHED_FEAT(WAKEUP_PREEMPTION, true)

-/*
- * Use arch dependent cpu capacity functions
- */
-SCHED_FEAT(ARCH_CAPACITY, true)
-
SCHED_FEAT(HRTICK, false)
SCHED_FEAT(DOUBLE_TICK, false)
SCHED_FEAT(LB_BIAS, true)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 22ccc55..7e6f250 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1402,6 +1402,17 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
}
#endif

+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+{
+ if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+ return sd->smt_gain / sd->span_weight;
+
+ return SCHED_CAPACITY_SCALE;
+}
+#endif
+
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
--
1.9.1

2015-08-14 16:20:18

by Morten Rasmussen

Subject: [PATCH 3/6] sched/fair: Make utilization tracking cpu scale-invariant

From: Dietmar Eggemann <[email protected]>

Besides the existing frequency scale-invariance correction factor, apply
cpu scale-invariance correction factor to utilization tracking to
compensate for any differences in compute capacity. This could be due to
micro-architectural differences (i.e. instructions per second) between
cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current
maximum frequency supported by individual cpus in SMP systems. In the
existing implementation utilization isn't comparable between cpus as it
is relative to the capacity of each individual cpu.

Each segment of the sched_avg.util_sum geometric series is now scaled
by the cpu performance factor too so the sched_avg.util_avg of each
sched entity will be invariant from the particular cpu of the HMP/SMP
system on which the sched entity is scheduled.

With this patch, the utilization of a cpu stays relative to the max cpu
performance of the fastest cpu in the system.

In contrast to utilization (sched_avg.util_sum), load
(sched_avg.load_sum) should not be scaled by compute capacity. The
utilization metric is based on running time which only makes sense when
cpus are _not_ fully utilized (utilization cannot go beyond 100% even if
more tasks are added), whereas load is runnable time which isn't limited
by the capacity of the cpu and therefore is a better metric for
overloaded scenarios. If we run two nice-0 busy loops on two cpus with
different compute capacity their load should be similar since their
compute demands are the same. We have to assume that the compute demand
of any task running on a fully utilized cpu (no spare cycles = 100%
utilization) is high and the same regardless of the compute capacity of
its current cpu, hence we shouldn't scale load by cpu capacity.
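
A hedged numerical sketch of the combined scaling, reusing the scale() helper
from the earlier patch (the capacity values are made up; 1024 ==
SCHED_CAPACITY_SCALE):

/*
 * Illustrative: a little cpu with arch_scale_cpu_capacity() == 430,
 * running at 50% of its maximum frequency
 * (arch_scale_freq_capacity() == 512).
 */
static unsigned long util_segment_contrib(unsigned long delta,
					  unsigned long scale_freq,
					  unsigned long scale_cpu)
{
	/* frequency invariance first (also applied to load) ... */
	unsigned long scaled_delta = scale(delta, scale_freq);

	/* ... then cpu invariance, for utilization only */
	return scale(scaled_delta, scale_cpu);
}

/* util_segment_contrib(1024, 512, 430) == scale(512, 430) == 215, whereas   */
/* the load contribution for the same segment stays weight * 512 (freq only) */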

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/sched/fair.c | 7 ++++---
kernel/sched/sched.h | 2 +-
3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a153051..78a93d7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1180,7 +1180,7 @@ struct load_weight {
* 1) load_avg factors frequency scaling into the amount of time that a
* sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
* aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency scaling into the amount of time
+ * 2) util_avg factors frequency and cpu scaling into the amount of time
* that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
* For cfs_rq, it is the aggregated such times of all runnable and
* blocked sched_entities.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c72223a..63be5a5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2553,6 +2553,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
u32 contrib;
int delta_w, scaled_delta_w, decayed = 0;
unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
+ unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);

delta = now - sa->last_update_time;
/*
@@ -2596,7 +2597,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
}
}
if (running)
- sa->util_sum += scaled_delta_w;
+ sa->util_sum = scale(scaled_delta_w, scale_cpu);

delta -= delta_w;

@@ -2620,7 +2621,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
cfs_rq->runnable_load_sum += weight * contrib;
}
if (running)
- sa->util_sum += contrib;
+ sa->util_sum += scale(contrib, scale_cpu);
}

/* Remainder of delta accrued against u_0` */
@@ -2631,7 +2632,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
cfs_rq->runnable_load_sum += weight * scaled_delta;
}
if (running)
- sa->util_sum += scaled_delta;
+ sa->util_sum += scale(scaled_delta, scale_cpu);

sa->period_contrib += delta;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7e6f250..50836a9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1406,7 +1406,7 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
- if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+ if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
return sd->smt_gain / sd->span_weight;

return SCHED_CAPACITY_SCALE;
--
1.9.1

2015-08-14 16:21:10

by Morten Rasmussen

Subject: [PATCH 4/6] sched/fair: Name utilization related data and functions consistently

From: Dietmar Eggemann <[email protected]>

Take advantage of the per-entity load-tracking rewrite to streamline the
naming of utilization-related data and functions by using
{prefix_}util{_suffix} consistently. Moreover, call both signals
({se,cfs}.avg.util_avg) utilization.

Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 37 +++++++++++++++++++------------------
1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63be5a5..4cc3050 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4825,31 +4825,32 @@ static int select_idle_sibling(struct task_struct *p, int target)
return target;
}
/*
- * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
+ * cpu_util returns the amount of capacity of a CPU that is used by CFS
* tasks. The unit of the return value must be the one of capacity so we can
- * compare the usage with the capacity of the CPU that is available for CFS
- * task (ie cpu_capacity).
+ * compare the utilization with the capacity of the CPU that is available for
+ * CFS task (ie cpu_capacity).
* cfs.avg.util_avg is the sum of running time of runnable tasks on a
* CPU. It represents the amount of utilization of a CPU in the range
- * [0..SCHED_LOAD_SCALE]. The usage of a CPU can't be higher than the full
- * capacity of the CPU because it's about the running time on this CPU.
+ * [0..SCHED_LOAD_SCALE]. The utilization of a CPU can't be higher than the
+ * full capacity of the CPU because it's about the running time on this CPU.
* Nevertheless, cfs.avg.util_avg can be higher than SCHED_LOAD_SCALE
* because of unfortunate rounding in util_avg or just
* after migrating tasks until the average stabilizes with the new running
- * time. So we need to check that the usage stays into the range
+ * time. So we need to check that the utilization stays into the range
* [0..cpu_capacity_orig] and cap if necessary.
- * Without capping the usage, a group could be seen as overloaded (CPU0 usage
- * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity
+ * Without capping the utilization, a group could be seen as overloaded (CPU0
+ * utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
+ * available capacity.
*/
-static int get_cpu_usage(int cpu)
+static int cpu_util(int cpu)
{
- unsigned long usage = cpu_rq(cpu)->cfs.avg.util_avg;
+ unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
unsigned long capacity = capacity_orig_of(cpu);

- if (usage >= SCHED_LOAD_SCALE)
+ if (util >= SCHED_LOAD_SCALE)
return capacity;

- return (usage * capacity) >> SCHED_LOAD_SHIFT;
+ return (util * capacity) >> SCHED_LOAD_SHIFT;
}

/*
@@ -5941,7 +5942,7 @@ struct sg_lb_stats {
unsigned long sum_weighted_load; /* Weighted load of group's tasks */
unsigned long load_per_task;
unsigned long group_capacity;
- unsigned long group_usage; /* Total usage of the group */
+ unsigned long group_util; /* Total utilization of the group */
unsigned int sum_nr_running; /* Nr tasks running in the group */
unsigned int idle_cpus;
unsigned int group_weight;
@@ -6174,8 +6175,8 @@ static inline int sg_imbalanced(struct sched_group *group)
* group_has_capacity returns true if the group has spare capacity that could
* be used by some tasks.
* We consider that a group has spare capacity if the * number of task is
- * smaller than the number of CPUs or if the usage is lower than the available
- * capacity for CFS tasks.
+ * smaller than the number of CPUs or if the utilization is lower than the
+ * available capacity for CFS tasks.
* For the latter, we use a threshold to stabilize the state, to take into
* account the variance of the tasks' load and to return true if the available
* capacity in meaningful for the load balancer.
@@ -6189,7 +6190,7 @@ group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
return true;

if ((sgs->group_capacity * 100) >
- (sgs->group_usage * env->sd->imbalance_pct))
+ (sgs->group_util * env->sd->imbalance_pct))
return true;

return false;
@@ -6210,7 +6211,7 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
return false;

if ((sgs->group_capacity * 100) <
- (sgs->group_usage * env->sd->imbalance_pct))
+ (sgs->group_util * env->sd->imbalance_pct))
return true;

return false;
@@ -6258,7 +6259,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
load = source_load(i, load_idx);

sgs->group_load += load;
- sgs->group_usage += get_cpu_usage(i);
+ sgs->group_util += cpu_util(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;

if (rq->nr_running > 1)
--
1.9.1

2015-08-14 16:20:24

by Morten Rasmussen

Subject: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig

From: Dietmar Eggemann <[email protected]>

Utilization is currently scaled by capacity_orig, but since we now have
frequency and cpu invariant cfs_rq.avg.util_avg, frequency and cpu scaling
now happens as part of the utilization tracking itself.
So cfs_rq.avg.util_avg should no longer be scaled in cpu_util().
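
With utilization expressed in capacity units it can be compared directly
against a cpu's capacity, e.g. for overload detection. An illustrative sketch
only; the helper name and the ~20% margin are assumptions, not part of this
series:

/* Treat a cpu as overutilized if less than ~20% of capacity_orig is left. */
static bool cpu_overutilized(int cpu)
{
	return cpu_util(cpu) * 1280 > capacity_orig_of(cpu) * 1024;
}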

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 38 ++++++++++++++++++++++----------------
1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4cc3050..34e24181 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4824,33 +4824,39 @@ static int select_idle_sibling(struct task_struct *p, int target)
done:
return target;
}
+
/*
* cpu_util returns the amount of capacity of a CPU that is used by CFS
* tasks. The unit of the return value must be the one of capacity so we can
* compare the utilization with the capacity of the CPU that is available for
* CFS task (ie cpu_capacity).
- * cfs.avg.util_avg is the sum of running time of runnable tasks on a
- * CPU. It represents the amount of utilization of a CPU in the range
- * [0..SCHED_LOAD_SCALE]. The utilization of a CPU can't be higher than the
- * full capacity of the CPU because it's about the running time on this CPU.
- * Nevertheless, cfs.avg.util_avg can be higher than SCHED_LOAD_SCALE
- * because of unfortunate rounding in util_avg or just
- * after migrating tasks until the average stabilizes with the new running
- * time. So we need to check that the utilization stays into the range
- * [0..cpu_capacity_orig] and cap if necessary.
- * Without capping the utilization, a group could be seen as overloaded (CPU0
- * utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
- * available capacity.
+ *
+ * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
+ * recent utilization of currently non-runnable tasks on a CPU. It represents
+ * the amount of utilization of a CPU in the range [0..capacity_orig] where
+ * capacity_orig is the cpu_capacity available at the highest frequency
+ * (arch_scale_freq_capacity()).
+ * The utilization of a CPU converges towards a sum equal to or less than the
+ * current capacity (capacity_curr <= capacity_orig) of the CPU because it is
+ * the running time on this CPU scaled by capacity_curr.
+ *
+ * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
+ * higher than capacity_orig because of unfortunate rounding in
+ * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
+ * the average stabilizes with the new running time. We need to check that the
+ * utilization stays within the range of [0..capacity_orig] and cap it if
+ * necessary. Without utilization capping, a group could be seen as overloaded
+ * (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
+ * available capacity. We allow utilization to overshoot capacity_curr (but not
+ * capacity_orig) as it is useful for predicting the capacity required after task
+ * migrations (scheduler-driven DVFS).
*/
static int cpu_util(int cpu)
{
unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
unsigned long capacity = capacity_orig_of(cpu);

- if (util >= SCHED_LOAD_SCALE)
- return capacity;
-
- return (util * capacity) >> SCHED_LOAD_SHIFT;
+ return (util >= capacity) ? capacity : util;
}

/*
--
1.9.1

2015-08-14 16:20:26

by Morten Rasmussen

Subject: [PATCH 6/6] sched/fair: Initialize task load and utilization before placing task on rq

Task load or utilization is not currently considered in
select_task_rq_fair(), but if we want that in the future we should make
sure it is not zero for new tasks.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b11f624..ae9fca32 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2294,6 +2294,8 @@ void wake_up_new_task(struct task_struct *p)
struct rq *rq;

raw_spin_lock_irqsave(&p->pi_lock, flags);
+ /* Initialize new task's runnable average */
+ init_entity_runnable_average(&p->se);
#ifdef CONFIG_SMP
/*
* Fork balancing, do it here and not earlier because:
@@ -2303,8 +2305,6 @@ void wake_up_new_task(struct task_struct *p)
set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif

- /* Initialize new task's runnable average */
- init_entity_runnable_average(&p->se);
rq = __task_rq_lock(p);
activate_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_QUEUED;
--
1.9.1

2015-08-14 23:04:47

by Dietmar Eggemann

Subject: Re: [PATCH 3/6] sched/fair: Make utilization tracking cpu scale-invariant

On 14/08/15 17:23, Morten Rasmussen wrote:
> From: Dietmar Eggemann <[email protected]>

[...]

> @@ -2596,7 +2597,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> }
> }
> if (running)
> - sa->util_sum += scaled_delta_w;
> + sa->util_sum = scale(scaled_delta_w, scale_cpu);


There is a small issue (using = instead of +=) with fatal consequences
for the utilization signal.
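
For illustration of the consequence (not code from the patch): with '=' the
accumulated, decayed history is overwritten on every update, so util_sum never
grows beyond the last partial-period contribution and util_avg (roughly
util_sum scaled down by LOAD_AVG_MAX) collapses towards zero even for
always-running tasks:

/* intended: sa->util_sum += scale(scaled_delta_w, scale_cpu);	accumulate */
/* broken:   sa->util_sum  = scale(scaled_delta_w, scale_cpu);	overwrite  */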

-- >8 --

Subject: [PATCH] sched/fair: Make utilization tracking cpu scale-invariant

Besides the existing frequency scale-invariance correction factor, apply
cpu scale-invariance correction factor to utilization tracking to
compensate for any differences in compute capacity. This could be due to
micro-architectural differences (i.e. instructions per second) between
cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current
maximum frequency supported by individual cpus in SMP systems. In the
existing implementation utilization isn't comparable between cpus as it
is relative to the capacity of each individual cpu.

Each segment of the sched_avg.util_sum geometric series is now scaled
by the cpu performance factor too so the sched_avg.util_avg of each
sched entity will be invariant from the particular cpu of the HMP/SMP
system on which the sched entity is scheduled.

With this patch, the utilization of a cpu stays relative to the max cpu
performance of the fastest cpu in the system.

In contrast to utilization (sched_avg.util_sum), load
(sched_avg.load_sum) should not be scaled by compute capacity. The
utilization metric is based on running time which only makes sense when
cpus are _not_ fully utilized (utilization cannot go beyond 100% even if
more tasks are added), whereas load is runnable time which isn't limited
by the capacity of the cpu and therefore is a better metric for
overloaded scenarios. If we run two nice-0 busy loops on two cpus with
different compute capacity their load should be similar since their
compute demands are the same. We have to assume that the compute demand
of any task running on a fully utilized cpu (no spare cycles = 100%
utilization) is high and the same regardless of the compute capacity of
its current cpu, hence we shouldn't scale load by cpu capacity.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/sched/fair.c | 7 ++++---
kernel/sched/sched.h | 2 +-
3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a15305117ace..78a93d716fcb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1180,7 +1180,7 @@ struct load_weight {
* 1) load_avg factors frequency scaling into the amount of time that a
* sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
* aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency scaling into the amount of time
+ * 2) util_avg factors frequency and cpu scaling into the amount of time
* that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
* For cfs_rq, it is the aggregated such times of all runnable and
* blocked sched_entities.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c72223a299a8..3321eb13e422 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2553,6 +2553,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
u32 contrib;
int delta_w, scaled_delta_w, decayed = 0;
unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
+ unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);

delta = now - sa->last_update_time;
/*
@@ -2596,7 +2597,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
}
}
if (running)
- sa->util_sum += scaled_delta_w;
+ sa->util_sum += scale(scaled_delta_w, scale_cpu);

delta -= delta_w;

@@ -2620,7 +2621,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
cfs_rq->runnable_load_sum += weight * contrib;
}
if (running)
- sa->util_sum += contrib;
+ sa->util_sum += scale(contrib, scale_cpu);
}

/* Remainder of delta accrued against u_0` */
@@ -2631,7 +2632,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
cfs_rq->runnable_load_sum += weight * scaled_delta;
}
if (running)
- sa->util_sum += scaled_delta;
+ sa->util_sum += scale(scaled_delta, scale_cpu);

sa->period_contrib += delta;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7e6f2506a402..50836a9301f9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1406,7 +1406,7 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
- if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+ if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
return sd->smt_gain / sd->span_weight;

return SCHED_CAPACITY_SCALE;
--
1.9.1

2015-08-16 22:01:47

by Peter Zijlstra

Subject: Re: [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking

On Fri, Aug 14, 2015 at 05:23:08PM +0100, Morten Rasmussen wrote:
> Target: ARM TC2 A7-only (x3)
> Test: hackbench -g 25 --threads -l 10000
>
> Before After
> 315.545 313.408 -0.68%
>
> Target: Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz
> Test: hackbench -g 25 --threads -l 1000 (avg of 10)
>
> Before After
> 6.4643 6.395 -1.07%

Yeah, so that is a problem.

I'm taking it some of the new scaling stuff doesn't compile away, can we
look at fixing that?

2015-08-17 11:26:32

by Morten Rasmussen

Subject: Re: [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking

On Sun, Aug 16, 2015 at 10:46:05PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 14, 2015 at 05:23:08PM +0100, Morten Rasmussen wrote:
> > Target: ARM TC2 A7-only (x3)
> > Test: hackbench -g 25 --threads -l 10000
> >
> > Before After
> > 315.545 313.408 -0.68%
> >
> > Target: Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz
> > Test: hackbench -g 25 --threads -l 1000 (avg of 10)
> >
> > Before After
> > 6.4643 6.395 -1.07%
>
> Yeah, so that is a problem.

Maybe I'm totally wrong, but doesn't hackbench report execution time, so less
is better? In that case -1.07% means we are doing better with the
patches applied (after time < before time). In any case, I should have
indicated whether the change is good or bad for performance.

> I'm taking it some of the new scaling stuff doesn't compile away, can we
> look at fixing that?

I will double-check that the stuff goes away as expected. I'm pretty
sure it does on ARM.
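
For reference, the intent is that in the default case (no arch hooks) both
factors are the compile-time constant SCHED_CAPACITY_SCALE, so the added
scaling should reduce to a no-op once the constant propagates through the
inlines; whether the compiler really eliminates it is what needs
double-checking (illustrative reasoning, not a claim about generated code):

/* default: arch_scale_freq_capacity() == SCHED_CAPACITY_SCALE == 1024   */
/* scale(delta, 1024) == (delta * 1024) >> 10, i.e. nominally just delta */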

2015-08-17 11:48:51

by Peter Zijlstra

Subject: Re: [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking

On Mon, Aug 17, 2015 at 12:29:51PM +0100, Morten Rasmussen wrote:
> On Sun, Aug 16, 2015 at 10:46:05PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 14, 2015 at 05:23:08PM +0100, Morten Rasmussen wrote:
> > > Target: ARM TC2 A7-only (x3)
> > > Test: hackbench -g 25 --threads -l 10000
> > >
> > > Before After
> > > 315.545 313.408 -0.68%
> > >
> > > Target: Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz
> > > Test: hackbench -g 25 --threads -l 1000 (avg of 10)
> > >
> > > Before After
> > > 6.4643 6.395 -1.07%
> >
> > Yeah, so that is a problem.
>
> Maybe I'm totally wrong, but doesn't hackbench report execution time, so less
> is better? In that case -1.07% means we are doing better with the
> patches applied (after time < before time). In any case, I should have
> indicated whether the change is good or bad for performance.
>
> > I'm taking it some of the new scaling stuff doesn't compile away, can we
> > look at fixing that?
>
> I will double-check that the stuff goes away as expected. I'm pretty
> sure it does on ARM.

Ah, uhm.. you have a point there ;-) I'll run the numbers when I'm back
home again.

2015-08-31 09:25:01

by Peter Zijlstra

Subject: Re: [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking

On Fri, Aug 14, 2015 at 05:23:08PM +0100, Morten Rasmussen wrote:
> Target: ARM TC2 A7-only (x3)
> Test: hackbench -g 25 --threads -l 10000
>
> Before After
> 315.545 313.408 -0.68%
>
> Target: Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz
> Test: hackbench -g 25 --threads -l 1000 (avg of 10)
>
> Before After
> 6.4643 6.395 -1.07%
>

A quick run here gives:

IVB-EP (2*20*2):

perf stat --null --repeat 10 -- perf bench sched messaging -g 50 -l 5000

Before: After:
5.484170711 ( +- 0.74% ) 5.590001145 ( +- 0.45% )

Which is an almost 2% slowdown :/

I've yet to look at what happens.