2014-12-02 14:07:01

by Morten Rasmussen

Subject: [RFC PATCH 00/10] sched: frequency and cpu invariant per-entity load-tracking and other load-tracking bits

Numerous proposals for changes to per-entity load-tracking have been
posted in the past to introduce various forms of scale invariance,
utilization tracking, and accounting for blocked load. This RFC patch
set attempts to merge the features of these proposals.

The patch set is based on top of Vincent's cpu capacity and usage patch
set [1], which introduces the scheduler side of frequency-invariant
utilization tracking. This patch set adds:

Patches:

1: Adds frequency scale invariance to runnable load tracking
(weighted_cpuload() and friends). Like utilization in [1], load is now
also scaled with frequency relative to the max frequency.

2: Adds cpu scale invariance to both load and utilization. This second
scaling compensates for differences in max performance between cpus,
e.g. cpus with different max OPPs or different cpu uarchs (ARM
big.LITTLE).

3-5: ARM arch implementation of arch_scale_freq_capacity() to enable
frequency invariance of both utilization and load.

6: Update ARM arch implementation of arch_scale_cpu_capacity().

7: Remove scaling of cpu usage by capacity_orig. With the introduction of
cpu scale-invariance, the per-entity tracking already does this scaling.

Experimental patches:

8: Add tracking of blocked utilization (usage) to the utilization
introduced in [1].

9: Change get_cpu_usage() to include blocked utilization (usage).

10: Change weighted_cpuload() to include blocked load. The implications of
this change need further testing and very likely more changes.


The last three patches are quite likely to cause some trouble and
require some modifications to the users of get_cpu_usage() and
weighted_cpuload(). An audit of the load-balance code is needed. The
blocked load/utilization patches should be considered experimental, but
they are part of what is needed to add invariance and blocked
load/utilization to per-entity load-tracking.

The purpose of this whole exercise is to get more accurate load and
utilization tracking for systems with frequency scaling and/or cpus with
different uarchs.
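
As an illustrative sketch of the combined correction (mirroring
contrib_scale_factor()/scale_contrib() introduced in patch 2; the helper
name below is made up for the example), every accrued delta of the
geometric series is scaled by both hooks before it is added to the
running/runnable sums:

	static u32 scale_invariant_delta(u32 delta, int cpu)
	{
		unsigned long scale;

		scale = arch_scale_freq_capacity(NULL, cpu);	/* curr freq vs max freq */
		scale *= arch_scale_cpu_capacity(NULL, cpu);	/* uarch and max OPP     */
		scale >>= SCHED_CAPACITY_SHIFT;

		return (delta * scale) >> SCHED_CAPACITY_SHIFT;
	}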

[1] https://lkml.org/lkml/2014/11/3/535

Dietmar Eggemann (5):
sched: Make load tracking frequency scale-invariant
sched: Make usage and load tracking cpu scale-invariant
ARM: vexpress: Add CPU clock-frequencies to TC2 device-tree
arm: Cpu invariant scheduler load-tracking support
sched: Get rid of scaling usage by cpu_capacity_orig

Morten Rasmussen (5):
cpufreq: Architecture specific callback for frequency changes
arm: Frequency invariant scheduler load-tracking support
sched: Track blocked utilization contributions
sched: Include blocked utilization in usage tracking
sched: Include blocked load in weighted_cpuload

arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 5 ++
arch/arm/kernel/topology.c | 97 +++++++++++++++++-------------
drivers/cpufreq/cpufreq.c | 10 ++-
kernel/sched/fair.c | 91 ++++++++++++++++++++++------
kernel/sched/sched.h | 8 ++-
5 files changed, 147 insertions(+), 64 deletions(-)

--
1.9.1


2014-12-02 14:07:24

by Morten Rasmussen

Subject: [RFC PATCH 07/10] sched: Get rid of scaling usage by cpu_capacity_orig

From: Dietmar Eggemann <[email protected]>

Now that cfs_rq::utilization_load_avg is not only frequency invariant but
also cpu (uarch plus max system frequency) invariant, both frequency and
cpu scaling happen as part of the load tracking itself. Hence
cfs_rq::utilization_load_avg no longer has to be scaled by the original
capacity of the cpu.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c4c989..090223f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4806,12 +4806,11 @@ static int select_idle_sibling(struct task_struct *p, int target)
static int get_cpu_usage(int cpu)
{
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
- unsigned long capacity = capacity_orig_of(cpu);

if (usage >= SCHED_LOAD_SCALE)
- return capacity;
+ return capacity_orig_of(cpu);

- return (usage * capacity) >> SCHED_LOAD_SHIFT;
+ return usage;
}

/*
--
1.9.1

2014-12-02 14:08:03

by Morten Rasmussen

Subject: [RFC PATCH 06/10] arm: Cpu invariant scheduler load-tracking support

From: Dietmar Eggemann <[email protected]>

Reuses the existing cpu_scale infrastructure to provide the scheduler
with a cpu scaling correction factor for more accurate load-tracking.
This factor comprises a micro-architectural part, based on the cpu
efficiency value of a cpu, and a platform-wide max frequency part,
derived from the clock-frequency property of the cpu's device-tree node.

The calculation of cpu_scale, the return value of arch_scale_cpu_capacity(),
changes from:

capacity / middle_capacity

with capacity = (clock_frequency >> 20) * cpu_efficiency

to:

SCHED_CAPACITY_SCALE * cpu_perf / max_cpu_perf

The range of the cpu_scale value changes from
[0..3*SCHED_CAPACITY_SCALE/2] to [0..SCHED_CAPACITY_SCALE].
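
As a worked example (illustrative only, assuming the mainline
table_efficiency values of 3891 for Cortex-A15 and 2048 for Cortex-A7,
and the TC2 clock-frequency values from the later device-tree patch):

	a15_perf = (1000000000 >> 20) * 3891 = 953 * 3891 = 3708123
	a7_perf  = ( 800000000 >> 20) * 2048 = 762 * 2048 = 1560576
	max_cpu_perf = a15_perf

	cpu_scale(A15) = 1024 * 3708123 / 3708123 = 1024
	cpu_scale(A7)  = 1024 * 1560576 / 3708123 = 430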

The functionality to calculate middle_capacity, which corresponds to an
'average' cpu, has been taken out since the scaling is now done
differently.

If either the cpu efficiency or the clock-frequency value is missing for
a cpu, no cpu scaling is done for any cpu.

The platform-wide max frequency part of the factor should not be confused
with the frequency-invariant scheduler load-tracking support, which deals
with frequency-related scaling due to DVFS on a cpu.

Cc: Russell King <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm/kernel/topology.c | 64 +++++++++++++++++-----------------------------
1 file changed, 23 insertions(+), 41 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 73ef337..e4c59fa 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -62,9 +62,7 @@ struct cpu_efficiency {
* Table of relative efficiency of each processors
* The efficiency value must fit in 20bit and the final
* cpu_scale value must be in the range
- * 0 < cpu_scale < 3*SCHED_CAPACITY_SCALE/2
- * in order to return at most 1 when DIV_ROUND_CLOSEST
- * is used to compute the capacity of a CPU.
+ * 0 < cpu_scale < SCHED_CAPACITY_SCALE.
* Processors that are not defined in the table,
* use the default SCHED_CAPACITY_SCALE value for cpu_scale.
*/
@@ -77,24 +75,18 @@ static const struct cpu_efficiency table_efficiency[] = {
static unsigned long *__cpu_capacity;
#define cpu_capacity(cpu) __cpu_capacity[cpu]

-static unsigned long middle_capacity = 1;
+static unsigned long max_cpu_perf;

/*
* Iterate all CPUs' descriptor in DT and compute the efficiency
- * (as per table_efficiency). Also calculate a middle efficiency
- * as close as possible to (max{eff_i} - min{eff_i}) / 2
- * This is later used to scale the cpu_capacity field such that an
- * 'average' CPU is of middle capacity. Also see the comments near
- * table_efficiency[] and update_cpu_capacity().
+ * (as per table_efficiency). Calculate the max cpu performance too.
*/
+
static void __init parse_dt_topology(void)
{
const struct cpu_efficiency *cpu_eff;
struct device_node *cn = NULL;
- unsigned long min_capacity = ULONG_MAX;
- unsigned long max_capacity = 0;
- unsigned long capacity = 0;
- int cpu = 0;
+ int cpu = 0, i = 0;

__cpu_capacity = kcalloc(nr_cpu_ids, sizeof(*__cpu_capacity),
GFP_NOWAIT);
@@ -102,6 +94,7 @@ static void __init parse_dt_topology(void)
for_each_possible_cpu(cpu) {
const u32 *rate;
int len;
+ unsigned long cpu_perf;

/* too early to use cpu->of_node */
cn = of_get_cpu_node(cpu, NULL);
@@ -124,46 +117,35 @@ static void __init parse_dt_topology(void)
continue;
}

- capacity = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency;
-
- /* Save min capacity of the system */
- if (capacity < min_capacity)
- min_capacity = capacity;
-
- /* Save max capacity of the system */
- if (capacity > max_capacity)
- max_capacity = capacity;
-
- cpu_capacity(cpu) = capacity;
+ cpu_perf = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency;
+ cpu_capacity(cpu) = cpu_perf;
+ max_cpu_perf = max(max_cpu_perf, cpu_perf);
+ i++;
}

- /* If min and max capacities are equals, we bypass the update of the
- * cpu_scale because all CPUs have the same capacity. Otherwise, we
- * compute a middle_capacity factor that will ensure that the capacity
- * of an 'average' CPU of the system will be as close as possible to
- * SCHED_CAPACITY_SCALE, which is the default value, but with the
- * constraint explained near table_efficiency[].
- */
- if (4*max_capacity < (3*(max_capacity + min_capacity)))
- middle_capacity = (min_capacity + max_capacity)
- >> (SCHED_CAPACITY_SHIFT+1);
- else
- middle_capacity = ((max_capacity / 3)
- >> (SCHED_CAPACITY_SHIFT-1)) + 1;
-
+ if (i < num_possible_cpus())
+ max_cpu_perf = 0;
}

/*
* Look for a customed capacity of a CPU in the cpu_capacity table during the
* boot. The update of all CPUs is in O(n^2) for heteregeneous system but the
- * function returns directly for SMP system.
+ * function returns directly for SMP systems or if there is no complete set
+ * of cpu efficiency, clock frequency data for each cpu.
*/
static void update_cpu_capacity(unsigned int cpu)
{
- if (!cpu_capacity(cpu))
+ unsigned long capacity = cpu_capacity(cpu);
+
+ if (!capacity || !max_cpu_perf) {
+ cpu_capacity(cpu) = 0;
return;
+ }
+
+ capacity *= SCHED_CAPACITY_SCALE;
+ capacity /= max_cpu_perf;

- set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
+ set_capacity_scale(cpu, capacity);

printk(KERN_INFO "CPU%u: update cpu_capacity %lu\n",
cpu, arch_scale_cpu_capacity(NULL, cpu));
--
1.9.1

2014-12-02 14:07:05

by Morten Rasmussen

Subject: [RFC PATCH 03/10] cpufreq: Architecture specific callback for frequency changes

From: Morten Rasmussen <[email protected]>

Architectures that don't have any other means for tracking cpu frequency
changes need a callback from cpufreq to implement a scaling factor to
enable scale-invariant per-entity load-tracking in the scheduler.

To compute the scale-invariance correction factor, the architecture needs
to know both the max frequency and the current frequency. This patch
defines weak functions for setting both from cpufreq.

Related architecture specific functions use weak function definitions.
The same approach is followed here.

These callbacks can be used to implement frequency scaling of cpu
capacity later.

Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
drivers/cpufreq/cpufreq.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 644b54e..1b17608 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -278,6 +278,10 @@ static inline void adjust_jiffies(unsigned long val, struct cpufreq_freqs *ci)
}
#endif

+void __weak arch_scale_set_curr_freq(int cpu, unsigned long freq) {}
+
+void __weak arch_scale_set_max_freq(int cpu, unsigned long freq) {}
+
static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, unsigned int state)
{
@@ -315,6 +319,7 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
pr_debug("FREQ: %lu - CPU: %lu\n",
(unsigned long)freqs->new, (unsigned long)freqs->cpu);
trace_cpu_frequency(freqs->new, freqs->cpu);
+ arch_scale_set_curr_freq(freqs->cpu, freqs->new);
srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
CPUFREQ_POSTCHANGE, freqs);
if (likely(policy) && likely(policy->cpu == freqs->cpu))
@@ -2164,7 +2169,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
struct cpufreq_policy *new_policy)
{
struct cpufreq_governor *old_gov;
- int ret;
+ int ret, cpu;

pr_debug("setting new policy for CPU %u: %u - %u kHz\n",
new_policy->cpu, new_policy->min, new_policy->max);
@@ -2202,6 +2207,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
policy->min = new_policy->min;
policy->max = new_policy->max;

+ for_each_cpu(cpu, policy->cpus)
+ arch_scale_set_max_freq(cpu, policy->max);
+
pr_debug("new min and max freqs are %u - %u kHz\n",
policy->min, policy->max);

--
1.9.1

2014-12-02 14:08:52

by Morten Rasmussen

Subject: [RFC PATCH 08/10] sched: Track blocked utilization contributions

Introduces the blocked utilization, the blocked counterpart of
cfs_rq->utilization_load_avg. It is the sum of the utilization
contributions of sched_entities that were recently on the cfs_rq and are
currently blocked. Combined with the sum of the contributions of entities
currently on the cfs_rq or currently running
(cfs_rq->utilization_load_avg), this can provide a more stable average
view of the cpu usage.
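
As a sketch of the bookkeeping added below (illustration only, not part
of the diff): an entity's utilization_avg_contrib simply moves between
the two sums on dequeue/enqueue, so apart from decay, migrations and new
tasks their total stays stable across sleep/wakeup cycles:

	dequeue (sleep):  cfs_rq->utilization_load_avg    -= contrib;
	                  cfs_rq->utilization_blocked_avg += contrib;
	enqueue (wakeup): cfs_rq->utilization_blocked_avg -= contrib;
	                  cfs_rq->utilization_load_avg    += contrib;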

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 30 +++++++++++++++++++++++++++++-
kernel/sched/sched.h | 8 ++++++--
2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 090223f..adf64df 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2778,6 +2778,15 @@ static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
cfs_rq->blocked_load_avg = 0;
}

+static inline void subtract_utilization_blocked_contrib(struct cfs_rq *cfs_rq,
+ long utilization_contrib)
+{
+ if (likely(utilization_contrib < cfs_rq->utilization_blocked_avg))
+ cfs_rq->utilization_blocked_avg -= utilization_contrib;
+ else
+ cfs_rq->utilization_blocked_avg = 0;
+}
+
static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);

/* Update a sched_entity's runnable average */
@@ -2813,6 +2822,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
cfs_rq->utilization_load_avg += utilization_delta;
} else {
subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ -utilization_delta);
}
}

@@ -2830,14 +2841,20 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
return;

if (atomic_long_read(&cfs_rq->removed_load)) {
- unsigned long removed_load;
+ unsigned long removed_load, removed_utilization;
removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
+ removed_utilization =
+ atomic_long_xchg(&cfs_rq->removed_utilization, 0);
subtract_blocked_load_contrib(cfs_rq, removed_load);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ removed_utilization);
}

if (decays) {
cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
decays);
+ cfs_rq->utilization_blocked_avg =
+ decay_load(cfs_rq->utilization_blocked_avg, decays);
atomic64_add(decays, &cfs_rq->decay_counter);
cfs_rq->last_decay = now;
}
@@ -2884,6 +2901,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
/* migrated tasks did not contribute to our blocked load */
if (wakeup) {
subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ se->avg.utilization_avg_contrib);
update_entity_load_avg(se, 0);
}

@@ -2910,6 +2929,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
if (sleep) {
cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+ cfs_rq->utilization_blocked_avg +=
+ se->avg.utilization_avg_contrib;
se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
} /* migrations, e.g. sleep=0 leave decay_count == 0 */
}
@@ -4929,6 +4950,8 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
se->avg.decay_count = -__synchronize_entity_decay(se);
atomic_long_add(se->avg.load_avg_contrib,
&cfs_rq->removed_load);
+ atomic_long_add(se->avg.utilization_avg_contrib,
+ &cfs_rq->removed_utilization);
}

/* We have migrated, no longer consider this task hot */
@@ -7944,6 +7967,8 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
if (se->avg.decay_count) {
__synchronize_entity_decay(se);
subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+ subtract_utilization_blocked_contrib(cfs_rq,
+ se->avg.utilization_avg_contrib);
}
#endif
}
@@ -8003,6 +8028,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#ifdef CONFIG_SMP
atomic64_set(&cfs_rq->decay_counter, 1);
atomic_long_set(&cfs_rq->removed_load, 0);
+ atomic_long_set(&cfs_rq->removed_utilization, 0);
#endif
}

@@ -8055,6 +8081,8 @@ static void task_move_group_fair(struct task_struct *p, int queued)
*/
se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+ cfs_rq->utilization_blocked_avg +=
+ se->avg.utilization_avg_contrib;
#endif
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e402133..208237f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -368,11 +368,15 @@ struct cfs_rq {
* the blocked sched_entities on the rq.
* utilization_load_avg is the sum of the average running time of the
* sched_entities on the rq.
+ * utilization_blocked_avg is the utilization equivalent of
+ * blocked_load_avg, i.e. the sum of running contributions of blocked
+ * sched_entities associated with the rq.
*/
- unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
+ unsigned long runnable_load_avg, blocked_load_avg;
+ unsigned long utilization_load_avg, utilization_blocked_avg;
atomic64_t decay_counter;
u64 last_decay;
- atomic_long_t removed_load;
+ atomic_long_t removed_load, removed_utilization;

#ifdef CONFIG_FAIR_GROUP_SCHED
/* Required to track per-cpu representation of a task_group */
--
1.9.1

2014-12-02 14:08:51

by Morten Rasmussen

Subject: [RFC PATCH 09/10] sched: Include blocked utilization in usage tracking

Add the blocked utilization contribution to group sched_entity
utilization (se->avg.utilization_avg_contrib) and to get_cpu_usage().
With this change, cpu usage now includes recent usage by currently
non-runnable tasks, hence it provides a more stable view of the cpu
usage. It does, however, also mean that the meaning of usage changes:
a cpu may be momentarily idle while usage > 0. It can no longer be
assumed that cpu usage > 0 implies runnable tasks on the rq.
cfs_rq->utilization_load_avg or nr_running should be used instead to get
the current rq status.
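
A minimal sketch of the distinction (hypothetical helper, not part of
the patch):

	/*
	 * Hypothetical: does the cpu currently have runnable cfs tasks?
	 * get_cpu_usage() can no longer answer this once blocked
	 * contributions are included.
	 */
	static inline bool cfs_rq_busy(int cpu)
	{
		return cpu_rq(cpu)->cfs.nr_running > 0;
	}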

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index adf64df..bd950b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2764,7 +2764,8 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
__update_task_entity_utilization(se);
else
se->avg.utilization_avg_contrib =
- group_cfs_rq(se)->utilization_load_avg;
+ group_cfs_rq(se)->utilization_load_avg +
+ group_cfs_rq(se)->utilization_blocked_avg;

return se->avg.utilization_avg_contrib - old_contrib;
}
@@ -4827,11 +4828,12 @@ static int select_idle_sibling(struct task_struct *p, int target)
static int get_cpu_usage(int cpu)
{
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+ unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;

- if (usage >= SCHED_LOAD_SCALE)
+ if (usage + blocked >= SCHED_LOAD_SCALE)
return capacity_orig_of(cpu);

- return usage;
+ return usage + blocked;
}

/*
--
1.9.1

2014-12-02 14:08:49

by Morten Rasmussen

Subject: [RFC PATCH 10/10] sched: Include blocked load in weighted_cpuload

Adds blocked_load_avg to weighted_cpuload() to take recently runnable
tasks into account in load-balancing decisions. This changes the nature
of weighted_cpuload() as it may be > 0 while there are currently no
runnable tasks on the cpu rq. Hence care must be taken in the
load-balance code to use cfs_rq->runnable_load_avg or nr_running when
the current rq status is needed.

This patch is highly experimental and will probably require additional
updates to the users of weighted_cpuload().

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>

Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd950b2..ad0ebb7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4349,7 +4349,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Used instead of source_load when we know the type == 0 */
static unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->cfs.runnable_load_avg;
+ return cpu_rq(cpu)->cfs.runnable_load_avg
+ + cpu_rq(cpu)->cfs.blocked_load_avg;
}

/*
--
1.9.1

2014-12-02 14:07:00

by Morten Rasmussen

Subject: [RFC PATCH 02/10] sched: Make usage and load tracking cpu scale-invariant

From: Dietmar Eggemann <[email protected]>

Besides the existing frequency scale-invariance correction factor, apply
a cpu scale-invariance correction factor to usage and load tracking.

Cpu scale-invariance takes into consideration cpu performance deviations
due to micro-architectural differences (i.e. instructions per second)
between cpus in HMP systems (e.g. big.LITTLE) and differences in the
frequency of the highest OPP between cpus in SMP systems.

Each segment of the sched_avg::{running_avg_sum, runnable_avg_sum}
geometric series is now scaled by the cpu performance factor too, so the
sched_avg::{utilization_avg_contrib, load_avg_contrib} of each entity will
be invariant with respect to the particular cpu of the HMP/SMP system it
is gathered on. As a result, cfs_rq::runnable_load_avg, which is the sum
of sched_avg::load_avg_contrib, becomes cpu scale-invariant too.

So the {usage, load} level that is returned by {get_cpu_usage,
weighted_cpuload} stays relative to the max cpu performance of the system.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 27 ++++++++++++++++++++++-----
1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b41f03d..5c4c989 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2473,6 +2473,21 @@ static u32 __compute_runnable_contrib(u64 n)
}

unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu);
+
+static unsigned long contrib_scale_factor(int cpu)
+{
+ unsigned long scale_factor;
+
+ scale_factor = arch_scale_freq_capacity(NULL, cpu);
+ scale_factor *= arch_scale_cpu_capacity(NULL, cpu);
+ scale_factor >>= SCHED_CAPACITY_SHIFT;
+
+ return scale_factor;
+}
+
+#define scale_contrib(contrib, scale_factor) \
+ ((contrib * scale_factor) >> SCHED_CAPACITY_SHIFT)

/*
* We can represent the historical contribution to runnable average as the
@@ -2510,7 +2525,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
u64 delta, scaled_delta, periods;
u32 runnable_contrib, scaled_runnable_contrib;
int delta_w, scaled_delta_w, decayed = 0;
- unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
+ unsigned long scale_factor;

delta = now - sa->last_runnable_update;
/*
@@ -2531,6 +2546,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
return 0;
sa->last_runnable_update = now;

+ scale_factor = contrib_scale_factor(cpu);
+
/* delta_w is the amount already accumulated against our next period */
delta_w = sa->avg_period % 1024;
if (delta + delta_w >= 1024) {
@@ -2543,7 +2560,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
* period and accrue it.
*/
delta_w = 1024 - delta_w;
- scaled_delta_w = (delta_w * scale_freq) >> SCHED_CAPACITY_SHIFT;
+ scaled_delta_w = scale_contrib(delta_w, scale_factor);

if (runnable)
sa->runnable_avg_sum += scaled_delta_w;
@@ -2566,8 +2583,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,

/* Efficiently calculate \sum (1..n_period) 1024*y^i */
runnable_contrib = __compute_runnable_contrib(periods);
- scaled_runnable_contrib = (runnable_contrib * scale_freq)
- >> SCHED_CAPACITY_SHIFT;
+ scaled_runnable_contrib =
+ scale_contrib(runnable_contrib, scale_factor);

if (runnable)
sa->runnable_avg_sum += scaled_runnable_contrib;
@@ -2577,7 +2594,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
}

/* Remainder of delta accrued against u_0` */
- scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
+ scaled_delta = scale_contrib(delta, scale_factor);

if (runnable)
sa->runnable_avg_sum += scaled_delta;
--
1.9.1

2014-12-02 14:10:13

by Morten Rasmussen

Subject: [RFC PATCH 05/10] ARM: vexpress: Add CPU clock-frequencies to TC2 device-tree

From: Dietmar Eggemann <[email protected]>

To enable parse_dt_topology() [arch/arm/kernel/topology.c] to parse the
clock frequency and cpu efficiency values used to scale the relative
capacity of the cpus, the clock-frequency property has to be provided
within the cpu nodes of the dts file.

The patch is a copy of 'ARM: vexpress: Add CPU clock-frequencies to TC2
device-tree' (commit 8f15973ef8c3) taken from the Linaro Stable Kernel
(LSK) and massaged to apply to mainline.

Cc: Jon Medhurst <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Pawel Moll <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Ian Campbell <[email protected]>
Cc: Kumar Gala <[email protected]>
Cc: Russell King <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
index 322fd15..62f89e2 100644
--- a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
+++ b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
@@ -39,6 +39,7 @@
reg = <0>;
cci-control-port = <&cci_control1>;
cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
+ clock-frequency = <1000000000>;
};

cpu1: cpu@1 {
@@ -47,6 +48,7 @@
reg = <1>;
cci-control-port = <&cci_control1>;
cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
+ clock-frequency = <1000000000>;
};

cpu2: cpu@2 {
@@ -55,6 +57,7 @@
reg = <0x100>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ clock-frequency = <800000000>;
};

cpu3: cpu@3 {
@@ -63,6 +66,7 @@
reg = <0x101>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ clock-frequency = <800000000>;
};

cpu4: cpu@4 {
@@ -71,6 +75,7 @@
reg = <0x102>;
cci-control-port = <&cci_control2>;
cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+ clock-frequency = <800000000>;
};

idle-states {
--
1.9.1

2014-12-02 14:11:16

by Morten Rasmussen

Subject: [RFC PATCH 01/10] sched: Make load tracking frequency scale-invariant

From: Dietmar Eggemann <[email protected]>

Apply a frequency scale-invariance correction factor to load tracking.
Each segment of the sched_avg::runnable_avg_sum geometric series is now
scaled by the current frequency so the sched_avg::load_avg_contrib of each
entity will be invariant with frequency scaling. As a result,
cfs_rq::runnable_load_avg, which is the sum of sched_avg::load_avg_contrib,
becomes invariant too. So the load level that is returned by
weighted_cpuload() stays relative to the max frequency of the cpu.

Then, we want to keep the load tracking values in a 32-bit type, which
implies that the max value of sched_avg::{runnable|running}_avg_sum must
be lower than 2^32/88761 = 48388 (88761 is the max weight of a task). As
LOAD_AVG_MAX = 47742, arch_scale_freq_capacity must return a value less
than (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_CAPACITY_SCALE =
1024). So we define the range to [0..SCHED_CAPACITY_SCALE] in order to
avoid overflow.
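
Spelled out (illustrative arithmetic only):

	max allowed sum  = 2^32 / 88761         = 48388
	LOAD_AVG_MAX     = 47742
	max scale factor = 48388 * 1024 / 47742 = 1037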

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 28 ++++++++++++++++------------
1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee76d52..b41f03d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2507,9 +2507,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
int runnable,
int running)
{
- u64 delta, periods;
- u32 runnable_contrib;
- int delta_w, decayed = 0;
+ u64 delta, scaled_delta, periods;
+ u32 runnable_contrib, scaled_runnable_contrib;
+ int delta_w, scaled_delta_w, decayed = 0;
unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);

delta = now - sa->last_runnable_update;
@@ -2543,11 +2543,12 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
* period and accrue it.
*/
delta_w = 1024 - delta_w;
+ scaled_delta_w = (delta_w * scale_freq) >> SCHED_CAPACITY_SHIFT;
+
if (runnable)
- sa->runnable_avg_sum += delta_w;
+ sa->runnable_avg_sum += scaled_delta_w;
if (running)
- sa->running_avg_sum += delta_w * scale_freq
- >> SCHED_CAPACITY_SHIFT;
+ sa->running_avg_sum += scaled_delta_w;
sa->avg_period += delta_w;

delta -= delta_w;
@@ -2565,20 +2566,23 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,

/* Efficiently calculate \sum (1..n_period) 1024*y^i */
runnable_contrib = __compute_runnable_contrib(periods);
+ scaled_runnable_contrib = (runnable_contrib * scale_freq)
+ >> SCHED_CAPACITY_SHIFT;
+
if (runnable)
- sa->runnable_avg_sum += runnable_contrib;
+ sa->runnable_avg_sum += scaled_runnable_contrib;
if (running)
- sa->running_avg_sum += runnable_contrib * scale_freq
- >> SCHED_CAPACITY_SHIFT;
+ sa->running_avg_sum += scaled_runnable_contrib;
sa->avg_period += runnable_contrib;
}

/* Remainder of delta accrued against u_0` */
+ scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
+
if (runnable)
- sa->runnable_avg_sum += delta;
+ sa->runnable_avg_sum += scaled_delta;
if (running)
- sa->running_avg_sum += delta * scale_freq
- >> SCHED_CAPACITY_SHIFT;
+ sa->running_avg_sum += scaled_delta;
sa->avg_period += delta;

return decayed;
--
1.9.1

2014-12-02 14:11:46

by Morten Rasmussen

Subject: [RFC PATCH 04/10] arm: Frequency invariant scheduler load-tracking support

From: Morten Rasmussen <[email protected]>

Implements an arch-specific function to provide the scheduler with a
frequency scaling correction factor for more accurate load-tracking. The
factor is:

current_freq(cpu) * SCHED_CAPACITY_SCALE / max_freq(cpu)

This implementation only provides frequency invariance. No
micro-architecture invariance yet.
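
For example (illustrative numbers only): with a max frequency of
1000000 kHz and a transition to 600000 kHz, the factor returned is

	600000 * 1024 / 1000000 = 614

i.e. busy time accrues at roughly 60% of the rate it would at the max
frequency.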

Cc: Russell King <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
arch/arm/kernel/topology.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 89cfdd6..73ef337 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -169,6 +169,39 @@ static void update_cpu_capacity(unsigned int cpu)
cpu, arch_scale_cpu_capacity(NULL, cpu));
}

+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling.
+ */
+
+static DEFINE_PER_CPU(atomic_long_t, cpu_curr_freq);
+static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
+
+/* cpufreq callback function setting current cpu frequency */
+void arch_scale_set_curr_freq(int cpu, unsigned long freq)
+{
+ atomic_long_set(&per_cpu(cpu_curr_freq, cpu), freq);
+}
+
+/* cpufreq callback function setting max cpu frequency */
+void arch_scale_set_max_freq(int cpu, unsigned long freq)
+{
+ atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
+}
+
+unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+ unsigned long curr = atomic_long_read(&per_cpu(cpu_curr_freq, cpu));
+ unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
+
+ if (!max)
+ return SCHED_CAPACITY_SCALE;
+
+ return (curr * SCHED_CAPACITY_SCALE) / max;
+}
+
#else
static inline void parse_dt_topology(void) {}
static inline void update_cpu_capacity(unsigned int cpuid) {}
--
1.9.1

2014-12-17 07:56:37

by Vincent Guittot

Subject: Re: [RFC PATCH 03/10] cpufreq: Architecture specific callback for frequency changes

On 2 December 2014 at 15:06, Morten Rasmussen <[email protected]> wrote:
> From: Morten Rasmussen <[email protected]>
>
> Architectures that don't have any other means for tracking cpu frequency
> changes need a callback from cpufreq to implement a scaling factor to
> enable scale-invariant per-entity load-tracking in the scheduler.
>
> To compute the scale invariance correction factor the architecture would
> need to know both the max frequency and the current frequency. This
> patch defines weak functions for setting both from cpufreq.
>
> Related architecture specific functions use weak function definitions.
> The same approach is followed here.
>
> These callbacks can be used to implement frequency scaling of cpu
> capacity later.
>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> drivers/cpufreq/cpufreq.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 644b54e..1b17608 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -278,6 +278,10 @@ static inline void adjust_jiffies(unsigned long val, struct cpufreq_freqs *ci)
> }
> #endif
>
> +void __weak arch_scale_set_curr_freq(int cpu, unsigned long freq) {}
> +
> +void __weak arch_scale_set_max_freq(int cpu, unsigned long freq) {}
> +
> static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
> struct cpufreq_freqs *freqs, unsigned int state)
> {
> @@ -315,6 +319,7 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
> pr_debug("FREQ: %lu - CPU: %lu\n",
> (unsigned long)freqs->new, (unsigned long)freqs->cpu);
> trace_cpu_frequency(freqs->new, freqs->cpu);
> + arch_scale_set_curr_freq(freqs->cpu, freqs->new);

You must ensure that arch_scale_set_curr_freq() is called at least
once per CPU in order to ensure that you have a defined current
frequency.

If cpufreq doesn't have to change the frequency of a CPU, you will never
set the current frequency and the notification handler will never be
called. A typical example is a CPU booting at max frequency while we use
the performance governor: the CPU frequency is already the right one,
so cpufreq will not change it and you will never call
arch_scale_set_curr_freq(). The current frequency will then be either
undefined or set to 0 for the ARM platform (according to patch 04), and
arch_scale_freq_capacity() will return 0.

Regards,
Vincent

> srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
> CPUFREQ_POSTCHANGE, freqs);
> if (likely(policy) && likely(policy->cpu == freqs->cpu))
> @@ -2164,7 +2169,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
> struct cpufreq_policy *new_policy)
> {
> struct cpufreq_governor *old_gov;
> - int ret;
> + int ret, cpu;
>
> pr_debug("setting new policy for CPU %u: %u - %u kHz\n",
> new_policy->cpu, new_policy->min, new_policy->max);
> @@ -2202,6 +2207,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
> policy->min = new_policy->min;
> policy->max = new_policy->max;
>
> + for_each_cpu(cpu, policy->cpus)
> + arch_scale_set_max_freq(cpu, policy->max);
> +
> pr_debug("new min and max freqs are %u - %u kHz\n",
> policy->min, policy->max);
>
> --
> 1.9.1
>
>

2014-12-17 08:13:00

by Vincent Guittot

Subject: Re: [RFC PATCH 08/10] sched: Track blocked utilization contributions

On 2 December 2014 at 15:06, Morten Rasmussen <[email protected]> wrote:
> Introduces the blocked utilization, the utilization counter-part to
> cfs_rq->utilization_load_avg. It is the sum of sched_entity utilization
> contributions of entities that were recently on the cfs_rq that are
> currently blocked. Combined with sum of contributions of entities
> currently on the cfs_rq or currently running
> (cfs_rq->utilization_load_avg) this can provide a more stable average
> view of the cpu usage.

I'm fully aligned with the interest of adding blocked tasks to the
CPU's utilization. Now, instead of adding one more atomic variable and
its manipulation, it might be worth moving to [email protected]'s
patchset, which rewrites per-entity load tracking and includes blocked
load.

>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++++-
> kernel/sched/sched.h | 8 ++++++--
> 2 files changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 090223f..adf64df 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2778,6 +2778,15 @@ static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
> cfs_rq->blocked_load_avg = 0;
> }
>
> +static inline void subtract_utilization_blocked_contrib(struct cfs_rq *cfs_rq,
> + long utilization_contrib)
> +{
> + if (likely(utilization_contrib < cfs_rq->utilization_blocked_avg))
> + cfs_rq->utilization_blocked_avg -= utilization_contrib;
> + else
> + cfs_rq->utilization_blocked_avg = 0;
> +}
> +
> static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
>
> /* Update a sched_entity's runnable average */
> @@ -2813,6 +2822,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
> cfs_rq->utilization_load_avg += utilization_delta;
> } else {
> subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
> + subtract_utilization_blocked_contrib(cfs_rq,
> + -utilization_delta);
> }
> }
>
> @@ -2830,14 +2841,20 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
> return;
>
> if (atomic_long_read(&cfs_rq->removed_load)) {
> - unsigned long removed_load;
> + unsigned long removed_load, removed_utilization;
> removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
> + removed_utilization =
> + atomic_long_xchg(&cfs_rq->removed_utilization, 0);
> subtract_blocked_load_contrib(cfs_rq, removed_load);
> + subtract_utilization_blocked_contrib(cfs_rq,
> + removed_utilization);
> }
>
> if (decays) {
> cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
> decays);
> + cfs_rq->utilization_blocked_avg =
> + decay_load(cfs_rq->utilization_blocked_avg, decays);
> atomic64_add(decays, &cfs_rq->decay_counter);
> cfs_rq->last_decay = now;
> }
> @@ -2884,6 +2901,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
> /* migrated tasks did not contribute to our blocked load */
> if (wakeup) {
> subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
> + subtract_utilization_blocked_contrib(cfs_rq,
> + se->avg.utilization_avg_contrib);
> update_entity_load_avg(se, 0);
> }
>
> @@ -2910,6 +2929,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
> cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
> if (sleep) {
> cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
> + cfs_rq->utilization_blocked_avg +=
> + se->avg.utilization_avg_contrib;
> se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
> } /* migrations, e.g. sleep=0 leave decay_count == 0 */
> }
> @@ -4929,6 +4950,8 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
> se->avg.decay_count = -__synchronize_entity_decay(se);
> atomic_long_add(se->avg.load_avg_contrib,
> &cfs_rq->removed_load);
> + atomic_long_add(se->avg.utilization_avg_contrib,
> + &cfs_rq->removed_utilization);
> }
>
> /* We have migrated, no longer consider this task hot */
> @@ -7944,6 +7967,8 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
> if (se->avg.decay_count) {
> __synchronize_entity_decay(se);
> subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
> + subtract_utilization_blocked_contrib(cfs_rq,
> + se->avg.utilization_avg_contrib);
> }
> #endif
> }
> @@ -8003,6 +8028,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
> #ifdef CONFIG_SMP
> atomic64_set(&cfs_rq->decay_counter, 1);
> atomic_long_set(&cfs_rq->removed_load, 0);
> + atomic_long_set(&cfs_rq->removed_utilization, 0);
> #endif
> }
>
> @@ -8055,6 +8081,8 @@ static void task_move_group_fair(struct task_struct *p, int queued)
> */
> se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
> cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
> + cfs_rq->utilization_blocked_avg +=
> + se->avg.utilization_avg_contrib;
> #endif
> }
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e402133..208237f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -368,11 +368,15 @@ struct cfs_rq {
> * the blocked sched_entities on the rq.
> * utilization_load_avg is the sum of the average running time of the
> * sched_entities on the rq.
> + * utilization_blocked_avg is the utilization equivalent of
> + * blocked_load_avg, i.e. the sum of running contributions of blocked
> + * sched_entities associated with the rq.
> */
> - unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
> + unsigned long runnable_load_avg, blocked_load_avg;
> + unsigned long utilization_load_avg, utilization_blocked_avg;
> atomic64_t decay_counter;
> u64 last_decay;
> - atomic_long_t removed_load;
> + atomic_long_t removed_load, removed_utilization;
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> /* Required to track per-cpu representation of a task_group */
> --
> 1.9.1
>
>

2014-12-17 08:22:53

by Vincent Guittot

Subject: Re: [RFC PATCH 09/10] sched: Include blocked utilization in usage tracking

On 2 December 2014 at 15:06, Morten Rasmussen <[email protected]> wrote:
> Add the blocked utilization contribution to group sched_entity
> utilization (se->avg.utilization_avg_contrib) and to get_cpu_usage().
> With this change cpu usage now includes recent usage by currently
> non-runnable tasks, hence it provides a more stable view of the cpu
> usage. It does, however, also mean that the meaning of usage is changed:
> A cpu may be momentarily idle while usage >0. It can no longer be
> assumed that cpu usage >0 implies runnable tasks on the rq.
> cfs_rq->utilization_load_avg or nr_running should be used instead to get
> the current rq status.

If CONFIG_FAIR_GROUP_SCHED is not set, the blocked utilization of idle
CPUs will never be updated and their utilization will stay at the last
value from just before going idle. So you can have a CPU which became
idle a long time ago but whose utilization remains high.

You have to periodically decay and update the blocked utilization of idle CPUs

>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index adf64df..bd950b2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2764,7 +2764,8 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
> __update_task_entity_utilization(se);
> else
> se->avg.utilization_avg_contrib =
> - group_cfs_rq(se)->utilization_load_avg;
> + group_cfs_rq(se)->utilization_load_avg +
> + group_cfs_rq(se)->utilization_blocked_avg;
>
> return se->avg.utilization_avg_contrib - old_contrib;
> }
> @@ -4827,11 +4828,12 @@ static int select_idle_sibling(struct task_struct *p, int target)
> static int get_cpu_usage(int cpu)
> {
> unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
> + unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
>
> - if (usage >= SCHED_LOAD_SCALE)
> + if (usage + blocked >= SCHED_LOAD_SCALE)
> return capacity_orig_of(cpu);
>
> - return usage;
> + return usage + blocked;
> }
>
> /*
> --
> 1.9.1
>
>

2014-12-17 08:28:57

by Vincent Guittot

Subject: Re: [RFC PATCH 01/10] sched: Make load tracking frequency scale-invariant

On 2 December 2014 at 15:06, Morten Rasmussen <[email protected]> wrote:
> From: Dietmar Eggemann <[email protected]>
>
> Apply frequency scale-invariance correction factor to load tracking.
> Each segment of the sched_avg::runnable_avg_sum geometric series is now
> scaled by the current frequency so the sched_avg::load_avg_contrib of each
> entity will be invariant with frequency scaling. As a result,
> cfs_rq::runnable_load_avg which is the sum of sched_avg::load_avg_contrib,
> becomes invariant too. So the load level that is returned by
> weighted_cpuload, stays relative to the max frequency of the cpu.
>
> Then, we want the keep the load tracking values in a 32bits type, which
> implies that the max value of sched_avg::{runnable|running}_avg_sum must
> be lower than 2^32/88761=48388 (88761 is the max weight of a task). As
> LOAD_AVG_MAX = 47742, arch_scale_freq_capacity must return a value less
> than (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_SCALE_CAPACITY =
> 1024). So we define the range to [0..SCHED_SCALE_CAPACITY] in order to
> avoid overflow.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> kernel/sched/fair.c | 28 ++++++++++++++++------------
> 1 file changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ee76d52..b41f03d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2507,9 +2507,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> int runnable,
> int running)
> {
> - u64 delta, periods;
> - u32 runnable_contrib;
> - int delta_w, decayed = 0;
> + u64 delta, scaled_delta, periods;
> + u32 runnable_contrib, scaled_runnable_contrib;
> + int delta_w, scaled_delta_w, decayed = 0;
> unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
>
> delta = now - sa->last_runnable_update;
> @@ -2543,11 +2543,12 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> * period and accrue it.
> */
> delta_w = 1024 - delta_w;
> + scaled_delta_w = (delta_w * scale_freq) >> SCHED_CAPACITY_SHIFT;
> +
> if (runnable)
> - sa->runnable_avg_sum += delta_w;
> + sa->runnable_avg_sum += scaled_delta_w;
> if (running)
> - sa->running_avg_sum += delta_w * scale_freq
> - >> SCHED_CAPACITY_SHIFT;
> + sa->running_avg_sum += scaled_delta_w;
> sa->avg_period += delta_w;
>
> delta -= delta_w;
> @@ -2565,20 +2566,23 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
>
> /* Efficiently calculate \sum (1..n_period) 1024*y^i */
> runnable_contrib = __compute_runnable_contrib(periods);
> + scaled_runnable_contrib = (runnable_contrib * scale_freq)
> + >> SCHED_CAPACITY_SHIFT;
> +
> if (runnable)
> - sa->runnable_avg_sum += runnable_contrib;
> + sa->runnable_avg_sum += scaled_runnable_contrib;
> if (running)
> - sa->running_avg_sum += runnable_contrib * scale_freq
> - >> SCHED_CAPACITY_SHIFT;
> + sa->running_avg_sum += scaled_runnable_contrib;
> sa->avg_period += runnable_contrib;
> }
>
> /* Remainder of delta accrued against u_0` */
> + scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
> +
> if (runnable)
> - sa->runnable_avg_sum += delta;
> + sa->runnable_avg_sum += scaled_delta;
> if (running)
> - sa->running_avg_sum += delta * scale_freq
> - >> SCHED_CAPACITY_SHIFT;
> + sa->running_avg_sum += scaled_delta;
> sa->avg_period += delta;
>
> return decayed;

Acked-by: Vincent Guittot <[email protected]>

> --
> 1.9.1
>
>

2014-12-18 09:42:15

by Vincent Guittot

Subject: Re: [RFC PATCH 02/10] sched: Make usage and load tracking cpu scale-invariant

On 2 December 2014 at 15:06, Morten Rasmussen <[email protected]> wrote:
> From: Dietmar Eggemann <[email protected]>
>
> Besides the existing frequency scale-invariance correction factor, apply
> cpu scale-invariance correction factor to usage and load tracking.
>
> Cpu scale-invariance takes cpu performance deviations due to
> micro-architectural differences (i.e. instructions per seconds) between
> cpus in HMP systems (e.g. big.LITTLE) and differences in the frequency
> value of the highest OPP between cpus in SMP systems into consideration.
>
> Each segment of the sched_avg::{running_avg_sum, runnable_avg_sum}
> geometric series is now scaled by the cpu performance factor too so the
> sched_avg::{utilization_avg_contrib, load_avg_contrib} of each entity will
> be invariant from the particular cpu of the HMP/SMP system it is gathered
> on. As a result, cfs_rq::runnable_load_avg which is the sum of
> sched_avg::load_avg_contrib, becomes cpu scale-invariant too.
>
> So the {usage, load} level that is returned by {get_cpu_usage,
> weighted_cpuload} stays relative to the max cpu performance of the system.

Having a load/utilization that is invariant across the system is a
good thing, but your patch only does part of the job. The load is
invariant so it can be directly compared across the system, but you
haven't updated the load-balance code, which also scales the load with
capacity.

Then, the task load is now capped by the max capacity of the CPU on which
it runs. Let's use an example made of 3 CPUs with the following
topology:
-CPU0 and CPU1 are in the same cluster 0 (they share a cache) and have a
capacity of 512 each
-CPU2 is in its own cluster (no cache shared with the others) and has a
capacity of 1024
Each cluster has the same compute capacity of 1024

Then, let's consider that we have 7 always-running tasks with the
following placement:
-tasks A and B on CPU0
-tasks C and D on CPU1
-tasks F, G and H on CPU2

At cluster level we have the following statistics:
-On cluster 0, the compute capacity budget for each task is 256 (2 * 512 /
4) and the cluster load is 4096 with the current implementation and 2048
with cpu-invariant load tracking
-On cluster 1, the compute capacity budget for each task is 341 (1024 / 3)
and the cluster load is 3072 with both implementations

Cluster 0 is more loaded than cluster 1 as the compute capacity
available for each task is lower than on cluster 1. The trend is
similar with the current implementation of load tracking, as we have a
load of 4096 for cluster 0 vs 3072 for cluster 1, but cpu-invariant load
tracking shows a different trend with a load of 2048 for cluster 0 vs
3072 for cluster 1.

Considering that adding cpu invariance to the load tracking implies
more modifications to the load-balance code, it might be worth reordering
your patchset and moving this patch to the end instead of the beginning,
so other patches can be merged while the load balance is being fixed.

Regards,
Vincent

>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> kernel/sched/fair.c | 27 ++++++++++++++++++++++-----
> 1 file changed, 22 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b41f03d..5c4c989 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2473,6 +2473,21 @@ static u32 __compute_runnable_contrib(u64 n)
> }
>
> unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
> +unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu);
> +
> +static unsigned long contrib_scale_factor(int cpu)
> +{
> + unsigned long scale_factor;
> +
> + scale_factor = arch_scale_freq_capacity(NULL, cpu);
> + scale_factor *= arch_scale_cpu_capacity(NULL, cpu);
> + scale_factor >>= SCHED_CAPACITY_SHIFT;
> +
> + return scale_factor;
> +}
> +
> +#define scale_contrib(contrib, scale_factor) \
> + ((contrib * scale_factor) >> SCHED_CAPACITY_SHIFT)
>
> /*
> * We can represent the historical contribution to runnable average as the
> @@ -2510,7 +2525,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> u64 delta, scaled_delta, periods;
> u32 runnable_contrib, scaled_runnable_contrib;
> int delta_w, scaled_delta_w, decayed = 0;
> - unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
> + unsigned long scale_factor;
>
> delta = now - sa->last_runnable_update;
> /*
> @@ -2531,6 +2546,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> return 0;
> sa->last_runnable_update = now;
>
> + scale_factor = contrib_scale_factor(cpu);
> +
> /* delta_w is the amount already accumulated against our next period */
> delta_w = sa->avg_period % 1024;
> if (delta + delta_w >= 1024) {
> @@ -2543,7 +2560,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> * period and accrue it.
> */
> delta_w = 1024 - delta_w;
> - scaled_delta_w = (delta_w * scale_freq) >> SCHED_CAPACITY_SHIFT;
> + scaled_delta_w = scale_contrib(delta_w, scale_factor);
>
> if (runnable)
> sa->runnable_avg_sum += scaled_delta_w;
> @@ -2566,8 +2583,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
>
> /* Efficiently calculate \sum (1..n_period) 1024*y^i */
> runnable_contrib = __compute_runnable_contrib(periods);
> - scaled_runnable_contrib = (runnable_contrib * scale_freq)
> - >> SCHED_CAPACITY_SHIFT;
> + scaled_runnable_contrib =
> + scale_contrib(runnable_contrib, scale_factor);
>
> if (runnable)
> sa->runnable_avg_sum += scaled_runnable_contrib;
> @@ -2577,7 +2594,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> }
>
> /* Remainder of delta accrued against u_0` */
> - scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
> + scaled_delta = scale_contrib(delta, scale_factor);
>
> if (runnable)
> sa->runnable_avg_sum += scaled_delta;
> --
> 1.9.1
>
>

2014-12-22 09:43:13

by Yuyang Du

Subject: RE: [RFC PATCH 08/10] sched: Track blocked utilization contributions

Thanks, Vincent. Then let me again ping Peter, PJT, and Ben for the
rewrite patch v6, which has not been reviewed.

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Vincent Guittot
Sent: Wednesday, December 17, 2014 4:13 PM
To: Morten Rasmussen
Cc: Peter Zijlstra; [email protected]; Dietmar Eggemann; Paul Turner; Benjamin Segall; Michael Turquette; linux-kernel; [email protected]
Subject: Re: [RFC PATCH 08/10] sched: Track blocked utilization contributions

On 2 December 2014 at 15:06, Morten Rasmussen <[email protected]> wrote:
> Introduces the blocked utilization, the utilization counter-part to
> cfs_rq->utilization_load_avg. It is the sum of sched_entity
> utilization contributions of entities that were recently on the cfs_rq
> that are currently blocked. Combined with sum of contributions of
> entities currently on the cfs_rq or currently running
> (cfs_rq->utilization_load_avg) this can provide a more stable average
> view of the cpu usage.

I'm fully aligned with the interest of adding blocked tasks in the CPU's utilization. Now, instead of adding one more atomic data and it's manipulation, it might be worth to move on [email protected]
patchset: rewrites the per entity load tracking which includes blocked load.

>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++++-
> kernel/sched/sched.h | 8 ++++++--
> 2 files changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index
> 090223f..adf64df 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2778,6 +2778,15 @@ static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
> cfs_rq->blocked_load_avg = 0; }
>
> +static inline void subtract_utilization_blocked_contrib(struct cfs_rq *cfs_rq,
> + long
> +utilization_contrib) {
> + if (likely(utilization_contrib < cfs_rq->utilization_blocked_avg))
> + cfs_rq->utilization_blocked_avg -= utilization_contrib;
> + else
> + cfs_rq->utilization_blocked_avg = 0; }
> +
> static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
>
> /* Update a sched_entity's runnable average */ @@ -2813,6 +2822,8 @@
> static inline void update_entity_load_avg(struct sched_entity *se,
> cfs_rq->utilization_load_avg += utilization_delta;
> } else {
> subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
> + subtract_utilization_blocked_contrib(cfs_rq,
> +
> + -utilization_delta);
> }
> }
>
> @@ -2830,14 +2841,20 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>                 return;
>
>         if (atomic_long_read(&cfs_rq->removed_load)) {
> -               unsigned long removed_load;
> +               unsigned long removed_load, removed_utilization;
>                 removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
> +               removed_utilization =
> +                       atomic_long_xchg(&cfs_rq->removed_utilization, 0);
>                 subtract_blocked_load_contrib(cfs_rq, removed_load);
> +               subtract_utilization_blocked_contrib(cfs_rq,
> +                                               removed_utilization);
>         }
>
>         if (decays) {
>                 cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
>                                                       decays);
> +               cfs_rq->utilization_blocked_avg =
> +                       decay_load(cfs_rq->utilization_blocked_avg, decays);
>                 atomic64_add(decays, &cfs_rq->decay_counter);
>                 cfs_rq->last_decay = now;
>         }
> @@ -2884,6 +2901,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
>         /* migrated tasks did not contribute to our blocked load */
>         if (wakeup) {
>                 subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
> +               subtract_utilization_blocked_contrib(cfs_rq,
> +                                               se->avg.utilization_avg_contrib);
>                 update_entity_load_avg(se, 0);
>         }
>
> @@ -2910,6 +2929,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
>         cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
>         if (sleep) {
>                 cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
> +               cfs_rq->utilization_blocked_avg +=
> +                                               se->avg.utilization_avg_contrib;
>                 se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
>         } /* migrations, e.g. sleep=0 leave decay_count == 0 */
> }
> @@ -4929,6 +4950,8 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
>                 se->avg.decay_count = -__synchronize_entity_decay(se);
>                 atomic_long_add(se->avg.load_avg_contrib,
>                                                 &cfs_rq->removed_load);
> +               atomic_long_add(se->avg.utilization_avg_contrib,
> +                                               &cfs_rq->removed_utilization);
>         }
>
>         /* We have migrated, no longer consider this task hot */
> @@ -7944,6 +7967,8 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
>         if (se->avg.decay_count) {
>                 __synchronize_entity_decay(se);
>                 subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
> +               subtract_utilization_blocked_contrib(cfs_rq,
> +                                               se->avg.utilization_avg_contrib);
>         }
> #endif
> }
> @@ -8003,6 +8028,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
> #ifdef CONFIG_SMP
>         atomic64_set(&cfs_rq->decay_counter, 1);
>         atomic_long_set(&cfs_rq->removed_load, 0);
> +       atomic_long_set(&cfs_rq->removed_utilization, 0);
> #endif
> }
>
> @@ -8055,6 +8081,8 @@ static void task_move_group_fair(struct task_struct *p, int queued)
>                  */
>                 se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
>                 cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
> +               cfs_rq->utilization_blocked_avg +=
> +                                               se->avg.utilization_avg_contrib;
> #endif
>         }
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e402133..208237f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -368,11 +368,15 @@ struct cfs_rq {
> * the blocked sched_entities on the rq.
> * utilization_load_avg is the sum of the average running time of the
> * sched_entities on the rq.
> + * utilization_blocked_avg is the utilization equivalent of
> + * blocked_load_avg, i.e. the sum of running contributions of blocked
> + * sched_entities associated with the rq.
> */
> - unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
> + unsigned long runnable_load_avg, blocked_load_avg;
> + unsigned long utilization_load_avg, utilization_blocked_avg;
> atomic64_t decay_counter;
> u64 last_decay;
> - atomic_long_t removed_load;
> + atomic_long_t removed_load, removed_utilization;
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> /* Required to track per-cpu representation of a task_group */
> --
> 1.9.1
>
>

2014-12-30 15:05:51

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [RFC PATCH 02/10] sched: Make usage and load tracking cpu scale-invariant

On Thu, Dec 18, 2014 at 09:41:51AM +0000, Vincent Guittot wrote:
> On 2 December 2014 at 15:06, Morten Rasmussen <[email protected]> wrote:
> > From: Dietmar Eggemann <[email protected]>
> >
> > Besides the existing frequency scale-invariance correction factor, apply
> > a cpu scale-invariance correction factor to usage and load tracking.
> >
> > Cpu scale-invariance takes cpu performance deviations due to
> > micro-architectural differences (i.e. instructions per second) between
> > cpus in HMP systems (e.g. big.LITTLE) and differences in the frequency
> > value of the highest OPP between cpus in SMP systems into consideration.
> >
> > Each segment of the sched_avg::{running_avg_sum, runnable_avg_sum}
> > geometric series is now scaled by the cpu performance factor too so the
> > sched_avg::{utilization_avg_contrib, load_avg_contrib} of each entity will
> > be invariant from the particular cpu of the HMP/SMP system it is gathered
> > on. As a result, cfs_rq::runnable_load_avg which is the sum of
> > sched_avg::load_avg_contrib, becomes cpu scale-invariant too.
> >
> > So the {usage, load} level that is returned by {get_cpu_usage,
> > weighted_cpuload} stays relative to the max cpu performance of the system.
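
For illustration, a stand-alone sketch of the double scaling described above,
mirroring the contrib_scale_factor()/scale_contrib() helpers in the hunks
quoted further down; the capacity values used here are made-up examples:

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT 10

/* Same shape as the scale_contrib() macro in the quoted patch. */
static unsigned long scale_contrib(unsigned long contrib,
                                   unsigned long scale_factor)
{
        return (contrib * scale_factor) >> SCHED_CAPACITY_SHIFT;
}

int main(void)
{
        unsigned long freq_capacity = 512;      /* running at 50% of max freq */
        unsigned long cpu_capacity  = 512;      /* e.g. a LITTLE cpu */
        unsigned long segment       = 1024;     /* one fully busy segment */

        /* Combine both factors, as contrib_scale_factor() does. */
        unsigned long scale_factor =
                (freq_capacity * cpu_capacity) >> SCHED_CAPACITY_SHIFT;

        /* 1024 * 512/1024 * 512/1024 = 256: the busy segment counts as a
         * quarter of what it would on the fastest cpu at max frequency. */
        printf("scaled contribution: %lu\n",
               scale_contrib(segment, scale_factor));
        return 0;
}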
>
> Having a load/utilization that is invariant across the system is a
> good thing, but your patch only does part of the job. The load is
> invariant so it can be directly compared across the system, but you
> haven't updated the load-balance code, which also scales the load with
> capacity.
>
> Then, the task load is now capped by the max capacity of the CPU on which
> it runs. Let's use an example made of 3 CPUs with the following
> topology:
> -CPU0 and CPU1 are in the same cluster 0 (they share a cache) and have a
> capacity of 512 each
> -CPU2 is in its own cluster (it doesn't share a cache with the others) and
> has a capacity of 1024
> Each cluster has the same compute capacity of 1024
>
> Then, let's consider that we have 7 always-running tasks with the
> following placement:
> -tasks A and B on CPU0
> -tasks C and D on CPU1
> -tasks F, G and H on CPU2
>
> At the cluster level we have the following statistics:
> -On cluster 0, the compute capacity budget for each task is 256 (2 * 512 /
> 4) and the cluster load is 4096 with the current implementation and 2048
> with cpu-invariant load tracking
> -On cluster 1, the compute capacity budget for each task is 341 (1024 / 3)
> and the cluster load is 3072 with both implementations
>
> Cluster 0 is more loaded than cluster 1, as the compute capacity
> available for each task is lower than on cluster 1. The trend is
> similar with the current implementation of load tracking, as we have a
> load of 4096 for cluster 0 vs 3072 for cluster 1, but cpu-invariant load
> tracking shows a different trend, with a load of 2048 for cluster 0 vs
> 3072 for cluster 1.
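
To make the comparison concrete, the example above works out as follows (a
stand-alone sketch, assuming an always-running task's contribution saturates
at 1024 with the current tracking and at the capacity of the cpu it runs on
with the cpu-invariant tracking):

#include <stdio.h>

int main(void)
{
        unsigned long cpu_cap[3] = { 512, 512, 1024 }; /* CPU0, CPU1, CPU2 */
        unsigned long ntasks[3]  = { 2, 2, 3 };        /* A,B / C,D / F,G,H */
        unsigned long cur[2] = { 0, 0 };               /* current tracking */
        unsigned long inv[2] = { 0, 0 };               /* cpu-invariant tracking */

        for (int cpu = 0; cpu < 3; cpu++) {
                int cluster = (cpu == 2);              /* CPU0+CPU1 form cluster 0 */

                cur[cluster] += ntasks[cpu] * 1024;
                inv[cluster] += ntasks[cpu] * cpu_cap[cpu];
        }

        /* current:       cluster0=4096 cluster1=3072
         * cpu-invariant: cluster0=2048 cluster1=3072 */
        printf("current:       cluster0=%lu cluster1=%lu\n", cur[0], cur[1]);
        printf("cpu-invariant: cluster0=%lu cluster1=%lu\n", inv[0], inv[1]);
        return 0;
}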

Very good point. This patch can't go in without the necessary modifications
to the load-balance code. I think we have even discussed it in the past :-/

I have been pondering the solution for a while. The obvious solution,
which would leave everything as it is, is to derive the 'old' load from
the new scale-invariant load, but that sort of defeats the whole purpose
of introducing it. We need to figure out when we need the 'old' load and
when we need the 'new' one.

Let's call the current load implementation 'norm_load' as it is
basically:

norm_load = busy_time/wall_time * prio_scale

The new scale-invariant load, 'scale_load', is then:

scale_load = busy_time/wall_time * prio_scale * capacity

Hence, we can compute norm_load as:

norm_load = scale_load/capacity

For overload scenarios, such as your example, norm_load can tell us to
which degree the group is overloaded, but not relative to its capacity.
Hence, it doesn't tell us anything about the capacity per task. (If you
change the CPU0 and CPU1 capacity to 1024 in your example the cluster
load remains unchanged, but the capacity per task increases). I agree
that capacity per task is important for fairness in overloaded
scenarios, but I think the right metric should be based on load instead
of number of tasks to factor in priority. Instead of norm_load, we
should compare capacity per load (or the inverse) of the clusters:

capacity_per_load = capacity/norm_load

This is exactly the inverse of avg_load already used in the current
load-balance code. For your example avg_load happens to be equal to the
cluster norm_load. If you change the CPU0 and CPU1 capacities to 1024,
the cluster 0 avg_load decreases to 2048 to correctly indicate that the
load per capacity is now lower than cluster 1, while norm_load is
unchanged.

All that said, we can maintain the current behaviour if we just make
sure to compute norm_load from scale_load before we compute avg_load.
This is all good for overloaded scenarios, but for non-overloaded
scenarios avg_load gives us a wrong picture. When none of the cpus are
overloaded we don't care about the load per capacity as all tasks get
the cpu cycles they need.

If you replace the tasks in your example with tasks that each use 128
capacity you get:

                cluster 0       cluster 1
capacity        1024            1024
scale_load      512             384
avg_load        1024            384

Here avg_load exaggerates the load per capacity due to the fact that
avg_load uses norm_load, not scale_load. This is current behaviour. It
seems that we should continue using avg_load (or some re-expression
thereof based on scale_load) for overloaded scenarios and use scale_load
for non-overloaded scenarios to improve things over what we currently
have.
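
To see this in numbers, here is a stand-alone sketch of the 128-per-task case
above, deriving norm_load per cpu as scale_load/capacity and then computing
avg_load per cluster the way the load balancer does today (the variable names
are only for illustration):

#include <stdio.h>

#define SCALE 1024UL

int main(void)
{
        unsigned long cpu_cap[3] = { 512, 512, 1024 }; /* CPU0, CPU1, CPU2 */
        unsigned long ntasks[3]  = { 2, 2, 3 };
        unsigned long task_usage = 128;                /* per-task demand */
        unsigned long scale_load[2]  = { 0, 0 };
        unsigned long norm_load[2]   = { 0, 0 };
        unsigned long cluster_cap[2] = { 0, 0 };

        for (int cpu = 0; cpu < 3; cpu++) {
                int cl = (cpu == 2);
                unsigned long sl = ntasks[cpu] * task_usage; /* scale-invariant */

                scale_load[cl]  += sl;
                norm_load[cl]   += sl * SCALE / cpu_cap[cpu]; /* 'old' load */
                cluster_cap[cl] += cpu_cap[cpu];
        }

        for (int cl = 0; cl < 2; cl++) {
                unsigned long avg_load = norm_load[cl] * SCALE / cluster_cap[cl];

                /* cluster0: scale_load=512 avg_load=1024
                 * cluster1: scale_load=384 avg_load=384 */
                printf("cluster%d: scale_load=%lu avg_load=%lu\n",
                       cl, scale_load[cl], avg_load);
        }
        return 0;
}

Both clusters have spare capacity, yet avg_load makes cluster 0 look almost
three times busier than cluster 1, while scale_load (512 vs 384) reflects the
actual demand.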

> Considering that adding cpu invariance to the load tracking implies
> more modifications to the load-balance code, it might be worth reordering
> your patch set and moving this patch to the end instead of the beginning,
> so the other patches can be merged while the load balance is being fixed.

Indeed. I will move this patch further back in the set to the other more
experimental patches while working on patches to fix the load-balance
code.

Morten

>
> Regards,
> Vincent
>
> >
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Signed-off-by: Dietmar Eggemann <[email protected]>
> > ---
> > kernel/sched/fair.c | 27 ++++++++++++++++++++++-----
> > 1 file changed, 22 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b41f03d..5c4c989 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2473,6 +2473,21 @@ static u32 __compute_runnable_contrib(u64 n)
> > }
> >
> > unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
> > +unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu);
> > +
> > +static unsigned long contrib_scale_factor(int cpu)
> > +{
> > + unsigned long scale_factor;
> > +
> > + scale_factor = arch_scale_freq_capacity(NULL, cpu);
> > + scale_factor *= arch_scale_cpu_capacity(NULL, cpu);
> > + scale_factor >>= SCHED_CAPACITY_SHIFT;
> > +
> > + return scale_factor;
> > +}
> > +
> > +#define scale_contrib(contrib, scale_factor) \
> > + ((contrib * scale_factor) >> SCHED_CAPACITY_SHIFT)
> >
> > /*
> > * We can represent the historical contribution to runnable average as the
> > @@ -2510,7 +2525,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> > u64 delta, scaled_delta, periods;
> > u32 runnable_contrib, scaled_runnable_contrib;
> > int delta_w, scaled_delta_w, decayed = 0;
> > - unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > + unsigned long scale_factor;
> >
> > delta = now - sa->last_runnable_update;
> > /*
> > @@ -2531,6 +2546,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> > return 0;
> > sa->last_runnable_update = now;
> >
> > + scale_factor = contrib_scale_factor(cpu);
> > +
> > /* delta_w is the amount already accumulated against our next period */
> > delta_w = sa->avg_period % 1024;
> > if (delta + delta_w >= 1024) {
> > @@ -2543,7 +2560,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> > * period and accrue it.
> > */
> > delta_w = 1024 - delta_w;
> > - scaled_delta_w = (delta_w * scale_freq) >> SCHED_CAPACITY_SHIFT;
> > + scaled_delta_w = scale_contrib(delta_w, scale_factor);
> >
> > if (runnable)
> > sa->runnable_avg_sum += scaled_delta_w;
> > @@ -2566,8 +2583,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> >
> > /* Efficiently calculate \sum (1..n_period) 1024*y^i */
> > runnable_contrib = __compute_runnable_contrib(periods);
> > - scaled_runnable_contrib = (runnable_contrib * scale_freq)
> > - >> SCHED_CAPACITY_SHIFT;
> > + scaled_runnable_contrib =
> > + scale_contrib(runnable_contrib, scale_factor);
> >
> > if (runnable)
> > sa->runnable_avg_sum += scaled_runnable_contrib;
> > @@ -2577,7 +2594,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> > }
> >
> > /* Remainder of delta accrued against u_0` */
> > - scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
> > + scaled_delta = scale_contrib(delta, scale_factor);
> >
> > if (runnable)
> > sa->runnable_avg_sum += scaled_delta;
> > --
> > 1.9.1
> >
> >