Hi,
The scheduler currently does not do much to help performance on systems with
asymmetric compute capacities (read: ARM big.LITTLE). This series improves the
situation with a few tweaks, mainly to the task wake-up path, so that compute
capacity is considered at wake-up and not just whether a cpu is idle on these
systems. This gives us consistent, and potentially higher, throughput in
partially utilized scenarios. SMP behaviour and performance should be
unaffected.
Test 0:
for i in `seq 1 10`; \
do sysbench --test=cpu --max-time=3 --num-threads=1 run; \
done \
| awk '{if ($4=="events:") {print $5; sum +=$5; runs +=1}} \
END {print "Average events: " sum/runs}'
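(The awk script averages the 'total number of events' reported by sysbench
over the ten runs.)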
Target: ARM TC2 (2xA15+3xA7)
(Higher is better)
tip: Average events: 116.3
patch: Average events: 217.9
Target: ARM Juno (2xA57+4xA53)
(Higher is better)
tip: Average events: 2063.2
patch: Average events: 2684.1
Test 1:
perf stat --null --repeat 10 -- \
perf bench sched messaging -g 50 -l 5000
Target: Intel IVB-EP (2*10*2)
tip: 4.815292358 seconds time elapsed ( +- 0.77% )
patch: 4.855237141 seconds time elapsed ( +- 1.00% )
Target: ARM TC2 A7-only (3xA7) (-l 1000)
tip: 63.888583172 seconds time elapsed ( +- 0.08% )
patch: 63.841030289 seconds time elapsed ( +- 0.23% )
Target: ARM Juno A53-only (4xA53) (-l 1000)
tip: 37.252267738 seconds time elapsed ( +- 0.24% )
patch: 37.480712902 seconds time elapsed ( +- 0.26% )
Notes:
Active migration of tasks away from small capacity cpus isn't addressed
in this set although it is necessary for consistent throughput in other
scenarios on asymmetric cpu capacity systems.
The infrastructure to enable capacity awareness for arm64 and arm is not
provided here but will be based on Juri's DT bindings patch set [1]. A
combined preview branch is available [2]. The test results above are based on
[2].
[1] https://lkml.org/lkml/2016/7/19/419
[2] git://linux-arm.org/linux-power.git capacity_awareness_v5_arm64_v1
Patch 1: Fix task utilization for wake-up decisions.
Patches 2-5: Improve capacity awareness.
Patch 6: Comment fix.
Tested-by: Koan-Sin Tan <[email protected]>
Tested-by: Keita Kobayashi <[email protected]>
v5:
- Changed peak utilization tracking to only update when tasks are
dequeued to sleep as suggested by Patrick Bellasi.
- Fixed wrong use of task_util_peak() in cpu_util_wake().
- Added comment fix for previously merged patch.
v4: https://lkml.org/lkml/2016/8/31/292
- Removed patches already in tip/sched/core.
- Fixed wrong use of capacity_of() instead of capacity_orig_of() as
reported by Wanpeng Li.
- Re-implemented the fix for task wake-up utilization. Instead of
estimating the utilization, it is now computed and updated correctly.
- Introduced peak utilization tracking to compensate for decay in
wake-up placement decisions.
- Removed pointless spare capacity selection criteria in
find_idlest_group() as pointed out by Vincent and added a comment
describing when we use spare capacity instead of least load.
v3: https://lkml.org/lkml/2016/7/25/245
- Changed SD_ASYM_CPUCAPACITY sched_domain flag semantics as suggested
by PeterZ.
- Dropped arm specific patches for setting cpu capacity as these are
superseded by Juri's patches [2].
- Changed capacity-aware pulling during load-balance to use sched_group
min capacity instead of max as suggested by Sai.
v2: https://lkml.org/lkml/2016/6/22/614
- Dropped patch ignoring wakee_flips for pid=0 for now as we cannot
distinguish cpu time spent processing irqs from idle time.
- Dropped disabling WAKE_AFFINE as suggested by Vincent to allow more
scenarios to use fast-path (select_idle_sibling()). Asymmetric wake
conditions adjusted accordingly.
- Changed use of new SD_ASYM_CPUCAPACITY slightly. Now enables
SD_BALANCE_WAKE.
- Minor clean-ups and rebased to more recent tip/sched/core.
v1: https://lkml.org/lkml/2014/5/23/621
Morten Rasmussen (6):
sched/fair: Compute task/cpu utilization at wake-up correctly
sched/fair: Consider spare capacity in find_idlest_group()
sched: Add per-cpu min capacity to sched_group_capacity
sched/fair: Avoid pulling tasks from non-overloaded higher capacity
groups
sched/fair: Track peak per-entity utilization
sched/fair: Fix wrong comment for capacity_margin
include/linux/sched.h | 2 +-
kernel/sched/core.c | 3 +-
kernel/sched/fair.c | 146 ++++++++++++++++++++++++++++++++++++++++++++------
kernel/sched/sched.h | 3 +-
4 files changed, 135 insertions(+), 19 deletions(-)
--
2.7.4
For asymmetric cpu capacity systems it is counter-productive for
throughput if low capacity cpus are pulling tasks from non-overloaded
cpus with higher capacity. The assumption is that higher cpu capacity is
preferred over running alone in a group with lower cpu capacity.
This patch rejects higher cpu capacity groups with no more than one task
per cpu as potential busiest groups, which could otherwise lead to a
series of failing load-balancing attempts ending in a forced migration.
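As a purely illustrative example (hypothetical numbers, not taken from this
patch): with the default capacity_margin of 1280, a local little group with
min_capacity = 446 compared against a candidate big group with
min_capacity = 1024 gives 446 * 1280 = 570880 < 1024 * 1024 = 1048576, so
group_smaller_cpu_capacity() is true and the big group is not picked as
busiest as long as it runs no more than one task per cpu.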
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 28e42cb41d7b..a5efafda23ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7069,6 +7069,17 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
return false;
}
+/*
+ * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-cpu capacity than sched_group ref.
+ */
+static inline bool
+group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+ return sg->sgc->min_capacity * capacity_margin <
+ ref->sgc->min_capacity * 1024;
+}
+
static inline enum
group_type group_classify(struct sched_group *group,
struct sg_lb_stats *sgs)
@@ -7172,6 +7183,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
if (sgs->avg_load <= busiest->avg_load)
return false;
+ if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
+ goto asym_packing;
+
+ /*
+ * Candidate sg has no more than one task per cpu and
+ * has higher per-cpu capacity. Migrating tasks to less
+ * capable cpus may harm throughput. Maximize throughput,
+ * power/energy consequences are not considered.
+ */
+ if (sgs->sum_nr_running <= sgs->group_weight &&
+ group_smaller_cpu_capacity(sds->local, sg))
+ return false;
+
+asym_packing:
/* This is the busiest node in its class. */
if (!(env->sd->flags & SD_ASYM_PACKING))
return true;
--
2.7.4
When using PELT (per-entity load tracking) utilization to place tasks at
wake-up, the decayed utilization (due to sleep) leads to under-estimation
of the true utilization of the task. This could mean putting the task on
a cpu with less available capacity than is actually needed. This issue
can be mitigated by using 'peak' utilization instead of the decayed
utilization for placement decisions, e.g. at task wake-up.
The 'peak' utilization metric, util_peak, tracks util_avg when the task
is running and retains its previous value while the task is
blocked/waiting on the rq. It is instantly updated to track util_avg
again as soon as the task is running again.
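As a hypothetical example: a task with util_avg ~400 at dequeue that sleeps
long enough for its utilization to decay to ~100 would, with the decayed
value, look like it fits comfortably on a small cpu at wake-up; using
util_peak (still ~400) for the placement decision avoids that
under-estimation.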
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/sched/fair.c | 23 +++++++++++++++++------
2 files changed, 18 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad51978ff15e..988d7f48604e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1294,7 +1294,7 @@ struct load_weight {
struct sched_avg {
u64 last_update_time, load_sum;
u32 util_sum, period_contrib;
- unsigned long load_avg, util_avg;
+ unsigned long load_avg, util_avg, util_peak;
};
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a5efafda23ef..e0abff77764f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -696,6 +696,7 @@ void init_entity_runnable_average(struct sched_entity *se)
* At this point, util_avg won't be used in select_task_rq_fair anyway
*/
sa->util_avg = 0;
+ sa->util_peak = 0;
sa->util_sum = 0;
/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
}
@@ -747,6 +748,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
} else {
sa->util_avg = cap;
}
+ sa->util_peak = sa->util_avg;
sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
}
@@ -3515,6 +3517,10 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
*/
if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) == DEQUEUE_SAVE)
update_min_vruntime(cfs_rq);
+
+ /* Save peak PELT utilization for task to help wake-up decisions */
+ if (flags & DEQUEUE_SLEEP && entity_is_task(se))
+ se->avg.util_peak = se->avg.util_avg;
}
/*
@@ -5203,7 +5209,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return 1;
}
-static inline int task_util(struct task_struct *p);
+static inline int task_util_peak(struct task_struct *p);
static int cpu_util_wake(int cpu, struct task_struct *p);
static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -5286,14 +5292,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
/*
* The cross-over point between using spare capacity or least load
* is too conservative for high utilization tasks on partially
- * utilized systems if we require spare_capacity > task_util(p),
+ * utilized systems if we require spare_capacity > task_util_peak(p),
* so we allow for some task stuffing by using
- * spare_capacity > task_util(p)/2.
+ * spare_capacity > task_util_peak(p)/2.
*/
- if (this_spare > task_util(p) / 2 &&
+ if (this_spare > task_util_peak(p) / 2 &&
imbalance*this_spare > 100*most_spare)
return NULL;
- else if (most_spare > task_util(p) / 2)
+ else if (most_spare > task_util_peak(p) / 2)
return most_spare_sg;
if (!idlest || 100*this_load < imbalance*min_load)
@@ -5628,6 +5634,11 @@ static inline int task_util(struct task_struct *p)
return p->se.avg.util_avg;
}
+static inline int task_util_peak(struct task_struct *p)
+{
+ return p->se.avg.util_peak;
+}
+
/*
* cpu_util_wake: Compute cpu utilization with any contributions from
* the waking task p removed.
@@ -5667,7 +5678,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
/* Bring task utilization in sync with prev_cpu */
sync_entity_load_avg(&p->se);
- return min_cap * 1024 < task_util(p) * capacity_margin;
+ return min_cap * 1024 < task_util_peak(p) * capacity_margin;
}
/*
--
2.7.4
The comment for capacity_margin introduced in "sched/fair: Let
asymmetric cpu configurations balance at wake-up" got its usage the
wrong way round.
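For example, with the default capacity_margin = 1280 the corrected
expression util * 1280 < capacity * 1024 requires utilization to stay below
1024/1280 = 80% of the capacity, which matches the '~20%' margin noted next
to the definition.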
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e0abff77764f..08408d6405a5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -116,7 +116,7 @@ unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
/*
* The margin used when comparing utilization with CPU capacity:
- * util * 1024 < capacity * margin
+ * util * margin < capacity * 1024
*/
unsigned int capacity_margin = 1280; /* ~20% */
--
2.7.4
struct sched_group_capacity currently represents the compute capacity
sum of all cpus in the sched_group. Unless it is divided by the
group_weight to get the average capacity per cpu, it hides differences in
cpu capacity for mixed capacity systems (e.g. high RT/IRQ utilization or
ARM big.LITTLE). But even the average may not be sufficient if the group
covers cpus of different capacities. Instead, by extending struct
sched_group_capacity to indicate the min per-cpu capacity in the group, a
suitable group for a given task utilization can more easily be found,
such that cpus with reduced capacity can be avoided for tasks with high
utilization (not implemented by this patch).
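As an illustration of the kind of check min_capacity enables (hypothetical
helper, not part of this patch; it assumes task_util() and capacity_margin
as used elsewhere in kernel/sched/fair.c):

static inline bool
task_fits_group(struct task_struct *p, struct sched_group *sg)
{
	/*
	 * Illustration only: the task's utilization must fit the smallest
	 * cpu in the group, with the usual ~20% capacity margin applied.
	 */
	return task_util(p) * capacity_margin < sg->sgc->min_capacity * 1024;
}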
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/core.c | 3 ++-
kernel/sched/fair.c | 17 ++++++++++++-----
kernel/sched/sched.h | 3 ++-
3 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aae08cedd75e..50526a20ab94 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5703,7 +5703,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
printk(KERN_CONT " %*pbl",
cpumask_pr_args(sched_group_cpus(group)));
if (group->sgc->capacity != SCHED_CAPACITY_SCALE) {
- printk(KERN_CONT " (cpu_capacity = %d)",
+ printk(KERN_CONT " (cpu_capacity = %lu)",
group->sgc->capacity);
}
@@ -6180,6 +6180,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
* die on a /0 trap.
*/
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
+ sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
/*
* Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cf84e37b3b4a..28e42cb41d7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6905,13 +6905,14 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
cpu_rq(cpu)->cpu_capacity = capacity;
sdg->sgc->capacity = capacity;
+ sdg->sgc->min_capacity = capacity;
}
void update_group_capacity(struct sched_domain *sd, int cpu)
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
- unsigned long capacity;
+ unsigned long capacity, min_capacity;
unsigned long interval;
interval = msecs_to_jiffies(sd->balance_interval);
@@ -6924,6 +6925,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
}
capacity = 0;
+ min_capacity = ULONG_MAX;
if (child->flags & SD_OVERLAP) {
/*
@@ -6948,11 +6950,12 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
*/
if (unlikely(!rq->sd)) {
capacity += capacity_of(cpu);
- continue;
+ } else {
+ sgc = rq->sd->groups->sgc;
+ capacity += sgc->capacity;
}
- sgc = rq->sd->groups->sgc;
- capacity += sgc->capacity;
+ min_capacity = min(capacity, min_capacity);
}
} else {
/*
@@ -6962,12 +6965,16 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
group = child->groups;
do {
- capacity += group->sgc->capacity;
+ struct sched_group_capacity *sgc = group->sgc;
+
+ capacity += sgc->capacity;
+ min_capacity = min(sgc->min_capacity, min_capacity);
group = group->next;
} while (group != child->groups);
}
sdg->sgc->capacity = capacity;
+ sdg->sgc->min_capacity = min_capacity;
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 19b99869809d..b3b3ecbbb494 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -892,7 +892,8 @@ struct sched_group_capacity {
* CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
* for a single CPU.
*/
- unsigned int capacity;
+ unsigned long capacity;
+ unsigned long min_capacity; /* Min per-cpu capacity in group */
unsigned long next_update;
int imbalance; /* XXX unrelated to capacity but shared group state */
--
2.7.4
In low-utilization scenarios comparing relative loads in
find_idlest_group() doesn't always lead to the optimal choice.
Systems with groups containing different numbers of cpus and/or cpus of
different compute capacity are significantly better off when considering
spare capacity rather than relative load in those scenarios.
In addition to the existing load-based search, an alternative
spare-capacity-based candidate sched_group is found and selected instead
if sufficient spare capacity exists. If not, the existing behaviour is
preserved.
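As a hypothetical illustration: for a waking task with util ~200, the local
group is kept (NULL is returned) when its most-spare cpu has more than ~100
spare capacity and that spare is not significantly smaller than the best
remote group's (scaled by the domain's imbalance percentage); otherwise the
group with the most spare capacity is chosen if it has more than ~100 to
offer, and only failing that does the least-load comparison decide.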
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 50 +++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 45 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f4abdbdeab50..cf84e37b3b4a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5203,6 +5203,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return 1;
}
+static inline int task_util(struct task_struct *p);
+static int cpu_util_wake(int cpu, struct task_struct *p);
+
+static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
+{
+ return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
+}
+
/*
* find_idlest_group finds and returns the least busy CPU group within the
* domain.
@@ -5212,7 +5220,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int sd_flag)
{
struct sched_group *idlest = NULL, *group = sd->groups;
+ struct sched_group *most_spare_sg = NULL;
unsigned long min_load = ULONG_MAX, this_load = 0;
+ unsigned long most_spare = 0, this_spare = 0;
int load_idx = sd->forkexec_idx;
int imbalance = 100 + (sd->imbalance_pct-100)/2;
@@ -5220,7 +5230,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
load_idx = sd->wake_idx;
do {
- unsigned long load, avg_load;
+ unsigned long load, avg_load, spare_cap, max_spare_cap;
int local_group;
int i;
@@ -5232,8 +5242,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
local_group = cpumask_test_cpu(this_cpu,
sched_group_cpus(group));
- /* Tally up the load of all CPUs in the group */
+ /*
+ * Tally up the load of all CPUs in the group and find
+ * the group containing the cpu with most spare capacity.
+ */
avg_load = 0;
+ max_spare_cap = 0;
for_each_cpu(i, sched_group_cpus(group)) {
/* Bias balancing toward cpus of our domain */
@@ -5243,6 +5257,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
load = target_load(i, load_idx);
avg_load += load;
+
+ spare_cap = capacity_spare_wake(i, p);
+
+ if (spare_cap > max_spare_cap)
+ max_spare_cap = spare_cap;
}
/* Adjust by relative CPU capacity of the group */
@@ -5250,12 +5269,33 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
if (local_group) {
this_load = avg_load;
- } else if (avg_load < min_load) {
- min_load = avg_load;
- idlest = group;
+ this_spare = max_spare_cap;
+ } else {
+ if (avg_load < min_load) {
+ min_load = avg_load;
+ idlest = group;
+ }
+
+ if (most_spare < max_spare_cap) {
+ most_spare = max_spare_cap;
+ most_spare_sg = group;
+ }
}
} while (group = group->next, group != sd->groups);
+ /*
+ * The cross-over point between using spare capacity or least load
+ * is too conservative for high utilization tasks on partially
+ * utilized systems if we require spare_capacity > task_util(p),
+ * so we allow for some task stuffing by using
+ * spare_capacity > task_util(p)/2.
+ */
+ if (this_spare > task_util(p) / 2 &&
+ imbalance*this_spare > 100*most_spare)
+ return NULL;
+ else if (most_spare > task_util(p) / 2)
+ return most_spare_sg;
+
if (!idlest || 100*this_load < imbalance*min_load)
return NULL;
return idlest;
--
2.7.4
At task wake-up load-tracking isn't updated until the task is enqueued.
The task's own view of its utilization contribution may therefore not be
aligned with its contribution to the cfs_rq load-tracking which may have
been updated in the meantime. Basically, the task's own utilization
hasn't yet accounted for the sleep decay, while the cfs_rq may have
(partially). Estimating the cfs_rq utilization in case the task is
migrated at wake-up as task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg
is therefore incorrect as the two load-tracking signals aren't time
synchronized (different last update).
To solve this problem, this patch synchronizes the task utilization with
its previous rq before the task utilization is used in the wake-up path.
Currently the update/synchronization is done _after_ the task has been
placed by select_task_rq_fair(). The synchronization is done without
having to take the rq lock using the existing mechanism used in
remove_entity_load_avg().
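As a hypothetical illustration of the problem: suppose the cfs_rq signal was
last updated half-way through the task's sleep while p->se.avg still holds
the pre-sleep value; subtracting the un-decayed p->se.avg.util_avg from the
partially decayed cfs_rq utilization then over-subtracts the task's
contribution and under-estimates the utilization left on prev_cpu (clamped
at zero in cpu_util_wake()).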
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++----
1 file changed, 35 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 543b2f291152..f4abdbdeab50 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3205,13 +3205,25 @@ static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
#endif
/*
+ * Synchronize entity load avg of dequeued entity without locking
+ * the previous rq.
+ */
+void sync_entity_load_avg(struct sched_entity *se)
+{
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ u64 last_update_time;
+
+ last_update_time = cfs_rq_last_update_time(cfs_rq);
+ __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
+}
+
+/*
* Task first catches up with cfs_rq, and then subtract
* itself from the cfs_rq (task must be off the queue now).
*/
void remove_entity_load_avg(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
- u64 last_update_time;
/*
* tasks cannot exit without having gone through wake_up_new_task() ->
@@ -3223,9 +3235,7 @@ void remove_entity_load_avg(struct sched_entity *se)
* calls this.
*/
- last_update_time = cfs_rq_last_update_time(cfs_rq);
-
- __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
+ sync_entity_load_avg(se);
atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
}
@@ -5579,6 +5589,24 @@ static inline int task_util(struct task_struct *p)
}
/*
+ * cpu_util_wake: Compute cpu utilization with any contributions from
+ * the waking task p removed.
+ */
+static int cpu_util_wake(int cpu, struct task_struct *p)
+{
+ unsigned long util, capacity;
+
+ /* Task has no contribution or is new */
+ if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
+ return cpu_util(cpu);
+
+ capacity = capacity_orig_of(cpu);
+ util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
+
+ return (util >= capacity) ? capacity : util;
+}
+
+/*
* Disable WAKE_AFFINE in the case where task @p doesn't fit in the
* capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
*
@@ -5596,6 +5624,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
if (max_cap - min_cap < max_cap >> 3)
return 0;
+ /* Bring task utilization in sync with prev_cpu */
+ sync_entity_load_avg(&p->se);
+
return min_cap * 1024 < task_util(p) * capacity_margin;
}
--
2.7.4
On Fri, Oct 14, 2016 at 02:41:11PM +0100, Morten Rasmussen wrote:
> @@ -3515,6 +3517,10 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> */
> if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) == DEQUEUE_SAVE)
> update_min_vruntime(cfs_rq);
> +
> + /* Save peak PELT utilization for task to help wake-up decisions */
> + if (flags & DEQUEUE_SLEEP && entity_is_task(se))
> + se->avg.util_peak = se->avg.util_avg;
> }
>
> /*
The friendly kbuild robot swiftly pointed out that this doesn't build
for !CONFIG_SMP. The replacement patch below moves this bit inside
dequeue_entity_load_avg(), which should be equivalent and not break
!CONFIG_SMP.
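(Presumably the breakage comes from touching se->avg, which only exists
when CONFIG_SMP is set, directly from dequeue_entity(); the
dequeue_entity_load_avg() path already compiles away for !CONFIG_SMP.)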
----8<---
From 36966c83cc3493332d92dcadb795eebc8c300558 Mon Sep 17 00:00:00 2001
From: Morten Rasmussen <[email protected]>
Date: Wed, 17 Aug 2016 15:30:43 +0100
Subject: [PATCH v5 5/6] sched/fair: Track peak per-entity utilization
When using PELT (per-entity load tracking) utilization to place tasks at
wake-up, the decayed utilization (due to sleep) leads to under-estimation
of the true utilization of the task. This could mean putting the task on
a cpu with less available capacity than is actually needed. This issue
can be mitigated by using 'peak' utilization instead of the decayed
utilization for placement decisions, e.g. at task wake-up.
The 'peak' utilization metric, util_peak, tracks util_avg when the task
is running and retains its previous value while the task is
blocked/waiting on the rq. It is instantly updated to track util_avg
again as soon as the task is running again.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/sched/fair.c | 23 +++++++++++++++++------
2 files changed, 18 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad51978ff15e..988d7f48604e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1294,7 +1294,7 @@ struct load_weight {
struct sched_avg {
u64 last_update_time, load_sum;
u32 util_sum, period_contrib;
- unsigned long load_avg, util_avg;
+ unsigned long load_avg, util_avg, util_peak;
};
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a5efafda23ef..84b767399d61 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -696,6 +696,7 @@ void init_entity_runnable_average(struct sched_entity *se)
* At this point, util_avg won't be used in select_task_rq_fair anyway
*/
sa->util_avg = 0;
+ sa->util_peak = 0;
sa->util_sum = 0;
/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
}
@@ -747,6 +748,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
} else {
sa->util_avg = cap;
}
+ sa->util_peak = sa->util_avg;
sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
}
@@ -3181,6 +3183,10 @@ dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
cfs_rq->runnable_load_sum =
max_t(s64, cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
+
+ /* Save peak PELT utilization for task to help wake-up decisions */
+ if (entity_is_task(se))
+ se->avg.util_peak = se->avg.util_avg;
}
#ifndef CONFIG_64BIT
@@ -5203,7 +5209,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return 1;
}
-static inline int task_util(struct task_struct *p);
+static inline int task_util_peak(struct task_struct *p);
static int cpu_util_wake(int cpu, struct task_struct *p);
static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -5286,14 +5292,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
/*
* The cross-over point between using spare capacity or least load
* is too conservative for high utilization tasks on partially
- * utilized systems if we require spare_capacity > task_util(p),
+ * utilized systems if we require spare_capacity > task_util_peak(p),
* so we allow for some task stuffing by using
- * spare_capacity > task_util(p)/2.
+ * spare_capacity > task_util_peak(p)/2.
*/
- if (this_spare > task_util(p) / 2 &&
+ if (this_spare > task_util_peak(p) / 2 &&
imbalance*this_spare > 100*most_spare)
return NULL;
- else if (most_spare > task_util(p) / 2)
+ else if (most_spare > task_util_peak(p) / 2)
return most_spare_sg;
if (!idlest || 100*this_load < imbalance*min_load)
@@ -5628,6 +5634,11 @@ static inline int task_util(struct task_struct *p)
return p->se.avg.util_avg;
}
+static inline int task_util_peak(struct task_struct *p)
+{
+ return p->se.avg.util_peak;
+}
+
/*
* cpu_util_wake: Compute cpu utilization with any contributions from
* the waking task p removed.
@@ -5667,7 +5678,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
/* Bring task utilization in sync with prev_cpu */
sync_entity_load_avg(&p->se);
- return min_cap * 1024 < task_util(p) * capacity_margin;
+ return min_cap * 1024 < task_util_peak(p) * capacity_margin;
}
/*
--
2.7.4
Commit-ID: 104cb16d9eb684f071d5bf3aa87c0d01af259b7c
Gitweb: http://git.kernel.org/tip/104cb16d9eb684f071d5bf3aa87c0d01af259b7c
Author: Morten Rasmussen <[email protected]>
AuthorDate: Fri, 14 Oct 2016 14:41:07 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 16 Nov 2016 10:29:05 +0100
sched/fair: Compute task/cpu utilization at wake-up correctly
At task wake-up load-tracking isn't updated until the task is enqueued.
The task's own view of its utilization contribution may therefore not be
aligned with its contribution to the cfs_rq load-tracking which may have
been updated in the meantime. Basically, the task's own utilization
hasn't yet accounted for the sleep decay, while the cfs_rq may have
(partially). Estimating the cfs_rq utilization in case the task is
migrated at wake-up as task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg
is therefore incorrect as the two load-tracking signals aren't time
synchronized (different last update).
To solve this problem, this patch synchronizes the task utilization with
its previous rq before the task utilization is used in the wake-up path.
Currently the update/synchronization is done _after_ the task has been
placed by select_task_rq_fair(). The synchronization is done without
having to take the rq lock using the existing mechanism used in
remove_entity_load_avg().
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++----
1 file changed, 35 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3cf446c..b05d691 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3199,13 +3199,25 @@ static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
#endif
/*
+ * Synchronize entity load avg of dequeued entity without locking
+ * the previous rq.
+ */
+void sync_entity_load_avg(struct sched_entity *se)
+{
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ u64 last_update_time;
+
+ last_update_time = cfs_rq_last_update_time(cfs_rq);
+ __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
+}
+
+/*
* Task first catches up with cfs_rq, and then subtract
* itself from the cfs_rq (task must be off the queue now).
*/
void remove_entity_load_avg(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
- u64 last_update_time;
/*
* tasks cannot exit without having gone through wake_up_new_task() ->
@@ -3217,9 +3229,7 @@ void remove_entity_load_avg(struct sched_entity *se)
* calls this.
*/
- last_update_time = cfs_rq_last_update_time(cfs_rq);
-
- __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
+ sync_entity_load_avg(se);
atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
}
@@ -5583,6 +5593,24 @@ static inline int task_util(struct task_struct *p)
}
/*
+ * cpu_util_wake: Compute cpu utilization with any contributions from
+ * the waking task p removed.
+ */
+static int cpu_util_wake(int cpu, struct task_struct *p)
+{
+ unsigned long util, capacity;
+
+ /* Task has no contribution or is new */
+ if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
+ return cpu_util(cpu);
+
+ capacity = capacity_orig_of(cpu);
+ util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
+
+ return (util >= capacity) ? capacity : util;
+}
+
+/*
* Disable WAKE_AFFINE in the case where task @p doesn't fit in the
* capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
*
@@ -5600,6 +5628,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
if (max_cap - min_cap < max_cap >> 3)
return 0;
+ /* Bring task utilization in sync with prev_cpu */
+ sync_entity_load_avg(&p->se);
+
return min_cap * 1024 < task_util(p) * capacity_margin;
}
Commit-ID: 6a0b19c0f39a7a7b7fb77d3867a733136ff059a3
Gitweb: http://git.kernel.org/tip/6a0b19c0f39a7a7b7fb77d3867a733136ff059a3
Author: Morten Rasmussen <[email protected]>
AuthorDate: Fri, 14 Oct 2016 14:41:08 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 16 Nov 2016 10:29:05 +0100
sched/fair: Consider spare capacity in find_idlest_group()
In low-utilization scenarios comparing relative loads in
find_idlest_group() doesn't always lead to the optimal choice.
Systems with groups containing different numbers of cpus and/or cpus of
different compute capacity are significantly better off when considering
spare capacity rather than relative load in those scenarios.
In addition to the existing load-based search, an alternative
spare-capacity-based candidate sched_group is found and selected instead
if sufficient spare capacity exists. If not, the existing behaviour is
preserved.
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 50 +++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 45 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b05d691..1ad3706 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5202,6 +5202,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return 1;
}
+static inline int task_util(struct task_struct *p);
+static int cpu_util_wake(int cpu, struct task_struct *p);
+
+static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
+{
+ return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
+}
+
/*
* find_idlest_group finds and returns the least busy CPU group within the
* domain.
@@ -5211,7 +5219,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int sd_flag)
{
struct sched_group *idlest = NULL, *group = sd->groups;
+ struct sched_group *most_spare_sg = NULL;
unsigned long min_load = ULONG_MAX, this_load = 0;
+ unsigned long most_spare = 0, this_spare = 0;
int load_idx = sd->forkexec_idx;
int imbalance = 100 + (sd->imbalance_pct-100)/2;
@@ -5219,7 +5229,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
load_idx = sd->wake_idx;
do {
- unsigned long load, avg_load;
+ unsigned long load, avg_load, spare_cap, max_spare_cap;
int local_group;
int i;
@@ -5231,8 +5241,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
local_group = cpumask_test_cpu(this_cpu,
sched_group_cpus(group));
- /* Tally up the load of all CPUs in the group */
+ /*
+ * Tally up the load of all CPUs in the group and find
+ * the group containing the CPU with most spare capacity.
+ */
avg_load = 0;
+ max_spare_cap = 0;
for_each_cpu(i, sched_group_cpus(group)) {
/* Bias balancing toward cpus of our domain */
@@ -5242,6 +5256,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
load = target_load(i, load_idx);
avg_load += load;
+
+ spare_cap = capacity_spare_wake(i, p);
+
+ if (spare_cap > max_spare_cap)
+ max_spare_cap = spare_cap;
}
/* Adjust by relative CPU capacity of the group */
@@ -5249,12 +5268,33 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
if (local_group) {
this_load = avg_load;
- } else if (avg_load < min_load) {
- min_load = avg_load;
- idlest = group;
+ this_spare = max_spare_cap;
+ } else {
+ if (avg_load < min_load) {
+ min_load = avg_load;
+ idlest = group;
+ }
+
+ if (most_spare < max_spare_cap) {
+ most_spare = max_spare_cap;
+ most_spare_sg = group;
+ }
}
} while (group = group->next, group != sd->groups);
+ /*
+ * The cross-over point between using spare capacity or least load
+ * is too conservative for high utilization tasks on partially
+ * utilized systems if we require spare_capacity > task_util(p),
+ * so we allow for some task stuffing by using
+ * spare_capacity > task_util(p)/2.
+ */
+ if (this_spare > task_util(p) / 2 &&
+ imbalance*this_spare > 100*most_spare)
+ return NULL;
+ else if (most_spare > task_util(p) / 2)
+ return most_spare_sg;
+
if (!idlest || 100*this_load < imbalance*min_load)
return NULL;
return idlest;
Commit-ID: bf475ce0a3dd75b5d1df6c6c14ae25168caa15ac
Gitweb: http://git.kernel.org/tip/bf475ce0a3dd75b5d1df6c6c14ae25168caa15ac
Author: Morten Rasmussen <[email protected]>
AuthorDate: Fri, 14 Oct 2016 14:41:09 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 16 Nov 2016 10:29:06 +0100
sched/fair: Add per-CPU min capacity to sched_group_capacity
struct sched_group_capacity currently represents the compute capacity
sum of all CPUs in the sched_group.
Unless it is divided by the group_weight to get the average capacity
per CPU, it hides differences in CPU capacity for mixed capacity systems
(e.g. high RT/IRQ utilization or ARM big.LITTLE).
But even the average may not be sufficient if the group covers CPUs of
different capacities.
Instead, by extending struct sched_group_capacity to indicate min per-CPU
capacity in the group a suitable group for a given task utilization can
more easily be found such that CPUs with reduced capacity can be avoided
for tasks with high utilization (not implemented by this patch).
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 3 ++-
kernel/sched/fair.c | 17 ++++++++++++-----
kernel/sched/sched.h | 3 ++-
3 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f3cfa0d..6bf1fd3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5708,7 +5708,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
printk(KERN_CONT " %*pbl",
cpumask_pr_args(sched_group_cpus(group)));
if (group->sgc->capacity != SCHED_CAPACITY_SCALE) {
- printk(KERN_CONT " (cpu_capacity = %d)",
+ printk(KERN_CONT " (cpu_capacity = %lu)",
group->sgc->capacity);
}
@@ -6185,6 +6185,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
* die on a /0 trap.
*/
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
+ sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
/*
* Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1ad3706..faf8f18 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6909,13 +6909,14 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
cpu_rq(cpu)->cpu_capacity = capacity;
sdg->sgc->capacity = capacity;
+ sdg->sgc->min_capacity = capacity;
}
void update_group_capacity(struct sched_domain *sd, int cpu)
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
- unsigned long capacity;
+ unsigned long capacity, min_capacity;
unsigned long interval;
interval = msecs_to_jiffies(sd->balance_interval);
@@ -6928,6 +6929,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
}
capacity = 0;
+ min_capacity = ULONG_MAX;
if (child->flags & SD_OVERLAP) {
/*
@@ -6952,11 +6954,12 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
*/
if (unlikely(!rq->sd)) {
capacity += capacity_of(cpu);
- continue;
+ } else {
+ sgc = rq->sd->groups->sgc;
+ capacity += sgc->capacity;
}
- sgc = rq->sd->groups->sgc;
- capacity += sgc->capacity;
+ min_capacity = min(capacity, min_capacity);
}
} else {
/*
@@ -6966,12 +6969,16 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
group = child->groups;
do {
- capacity += group->sgc->capacity;
+ struct sched_group_capacity *sgc = group->sgc;
+
+ capacity += sgc->capacity;
+ min_capacity = min(sgc->min_capacity, min_capacity);
group = group->next;
} while (group != child->groups);
}
sdg->sgc->capacity = capacity;
+ sdg->sgc->min_capacity = min_capacity;
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 055f935..345c1cc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -892,7 +892,8 @@ struct sched_group_capacity {
* CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
* for a single CPU.
*/
- unsigned int capacity;
+ unsigned long capacity;
+ unsigned long min_capacity; /* Min per-CPU capacity in group */
unsigned long next_update;
int imbalance; /* XXX unrelated to capacity but shared group state */
Commit-ID: 9e0994c0a1c1f82c705f1f66388e1bcffcee8bb9
Gitweb: http://git.kernel.org/tip/9e0994c0a1c1f82c705f1f66388e1bcffcee8bb9
Author: Morten Rasmussen <[email protected]>
AuthorDate: Fri, 14 Oct 2016 14:41:10 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 16 Nov 2016 10:29:06 +0100
sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
For asymmetric CPU capacity systems it is counter-productive for
throughput if low capacity CPUs are pulling tasks from non-overloaded
CPUs with higher capacity. The assumption is that higher CPU capacity is
preferred over running alone in a group with lower CPU capacity.
This patch rejects higher CPU capacity groups with no more than one task
per CPU as potential busiest groups, which could otherwise lead to a
series of failing load-balancing attempts ending in a forced migration.
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index faf8f18..ee39bfd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7073,6 +7073,17 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
return false;
}
+/*
+ * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-CPU capacity than sched_group ref.
+ */
+static inline bool
+group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+ return sg->sgc->min_capacity * capacity_margin <
+ ref->sgc->min_capacity * 1024;
+}
+
static inline enum
group_type group_classify(struct sched_group *group,
struct sg_lb_stats *sgs)
@@ -7176,6 +7187,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
if (sgs->avg_load <= busiest->avg_load)
return false;
+ if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
+ goto asym_packing;
+
+ /*
+ * Candidate sg has no more than one task per CPU and
+ * has higher per-CPU capacity. Migrating tasks to less
+ * capable CPUs may harm throughput. Maximize throughput,
+ * power/energy consequences are not considered.
+ */
+ if (sgs->sum_nr_running <= sgs->group_weight &&
+ group_smaller_cpu_capacity(sds->local, sg))
+ return false;
+
+asym_packing:
/* This is the busiest node in its class. */
if (!(env->sd->flags & SD_ASYM_PACKING))
return true;
Commit-ID: 893c5d2279041afeb593f1fa8edd9d02edf5b7cb
Gitweb: http://git.kernel.org/tip/893c5d2279041afeb593f1fa8edd9d02edf5b7cb
Author: Morten Rasmussen <[email protected]>
AuthorDate: Fri, 14 Oct 2016 14:41:12 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 16 Nov 2016 10:29:07 +0100
sched/fair: Fix incorrect comment for capacity_margin
The comment for capacity_margin introduced in:
3273163c6775 ("sched/fair: Let asymmetric CPU configurations balance at wake-up")
... got its usage the wrong way round - fix it.
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee39bfd..5e6c00a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -109,7 +109,7 @@ unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
/*
* The margin used when comparing utilization with CPU capacity:
- * util * 1024 < capacity * margin
+ * util * margin < capacity * 1024
*/
unsigned int capacity_margin = 1280; /* ~20% */