In multiprocessor systems containing cpus with different compute capabilities,
it is essential for performance that heavy tasks are scheduled on the most
capable cpus. The current scheduler does not handle such
performance-heterogeneous systems optimally. This patch set proposes a small
set of changes that significantly improve performance on these systems.
Looking at the current scheduler design, the most obvious way to represent the
compute capability of each individual cpu is to use cpu_power, as this is
already used for load-balancing. The recently included entity load-tracking
adds the infrastructure to distinguish between heavy and light tasks.
The proposed changes move heavy tasks to cpus with higher cpu_power to get
better performance, and fix the load-balancing issues caused by the cpu_power
difference when there is one heavy task per cpu.
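To make the heavy/light distinction concrete: the patches compare the tracked
runnable load of a cpu directly against its cpu_power. Below is a rough,
stand-alone sketch of that test (simplified types and names for illustration;
the actual checks are in the fair.c hunks further down):

/*
 * Rough stand-alone sketch, not the actual kernel code: with entity
 * load-tracking each runqueue carries a runnable_load_avg that can be
 * compared directly against the cpu's cpu_power.
 */
struct cpu_load_view {
	unsigned long runnable_load_avg;	/* sum of task load contributions */
	unsigned long cpu_power;		/* 1024 == default capability */
};

/*
 * The tracked load exceeds what this cpu can provide: a candidate for
 * moving work to a higher cpu_power cpu.
 */
static int load_exceeds_capacity(const struct cpu_load_view *v)
{
	return v->runnable_load_avg > v->cpu_power;
}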
The patches require load-balancing to be based on entity load-tracking and
therefore use Alex Shi's patch set as the starting point:
https://lkml.org/lkml/2013/1/25/767
The patches are based on 3.9-rc2 and have been tested on an ARM vexpress TC2
big.LITTLE test chip containing five cpus: 2xCortex-A15 + 3xCortex-A7.
Additional testing and refinements might be needed later as more sophisticated
platforms become available.
cpu_power A15: 1441
cpu_power A7: 606
Benchmarks:
cyclictest: cyclictest -a -t 2 -n -D 10
hackbench: hackbench (default settings)
sysbench_1t: sysbench --test=cpu --num-threads=1 --max-requests=1000 run
sysbench_2t: sysbench --test=cpu --num-threads=2 --max-requests=1000 run
sysbench_5t: sysbench --test=cpu --num-threads=5 --max-requests=1000 run
Mixed cpu_power:
Average times over 20 runs normalized to 3.9-rc2 (lower is better):
        3.9-rc2   +shi    +shi+patches  Improvement
cyclictest
  AVG   74.9      74.5    75.75         -1.13%
  MIN   69        69      69
  MAX   88        88      94
hackbench
  AVG   2.17      2.09    2.09          3.90%
  MIN   2.10      1.95    2.02
  MAX   2.25      2.48    2.17
sysbench_1t
  AVG   25.13*    16.47'  16.48         34.43%
  MIN   16.47     16.47   16.47
  MAX   33.78     16.48   16.54
sysbench_2t
  AVG   19.32     18.19   16.51         14.55%
  MIN   16.48     16.47   16.47
  MAX   22.15     22.19   16.61
sysbench_5t
  AVG   27.22     27.71   24.14         11.31%
  MIN   25.42     27.66   24.04
  MAX   27.75     27.86   24.31
* The unpatched 3.9-rc2 scheduler gives inconsistent performance as tasks may
randomly be placed on either A7 or A15 cores. The max/min values reflect this
behaviour. The A15 and A7 times are ~16.5 and ~33.5, respectively.
' While Alex Shi's patches appear to solve the performance inconsistency for
sysbench_1t, this is not the full picture for all workloads, as can be seen
for sysbench_2t.
To ensure that the proposed changes do not affect normal SMP systems, the
same benchmarks have been run on a 2xCortex-A15 configuration as well:
SMP:
Average times over 20 runs normalized to 3.9-rc2 (lower is better):
        3.9-rc2   +shi    +shi+patches  Improvement
cyclictest
  AVG   78.6      75.3    77.6          1.34%
  MIN   69        69      69
  MAX   135       98      125
hackbench
  AVG   3.55      3.54    3.55          0.06%
  MIN   3.51      3.48    3.49
  MAX   3.66      3.65    3.67
sysbench_1t
  AVG   16.48     16.48   16.48         -0.03%
  MIN   16.47     16.48   16.48
  MAX   16.49     16.48   16.48
sysbench_2t
  AVG   16.53     16.53   16.54         -0.05%
  MIN   16.47     16.47   16.48
  MAX   16.59     16.57   16.59
sysbench_5t
  AVG   41.16     41.15   41.15         0.04%
  MIN   41.14     41.13   41.11
  MAX   41.35     41.19   41.17
Note:
The cpu_power setup code is already present in 3.9-rc2, but the device
tree provided for ARM vexpress TC2 is missing frequency information. Adding
it gives the cpu_power values listed above.
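For reference, a stand-alone sketch of the general idea behind that setup
code: roughly speaking, per-cpu capacity is derived from a per-core
efficiency factor scaled by the clock-frequency value from the device tree,
and then normalized so that the midpoint of the range maps to the default
cpu_power of 1024. The efficiency and frequency numbers below are
illustrative assumptions, not taken from the TC2 device tree, and the code is
not the actual arch/arm/kernel/topology.c implementation:

/*
 * Simplified, stand-alone sketch of deriving cpu_power from a per-core
 * efficiency factor and the DT clock-frequency.  Not the actual
 * arch/arm/kernel/topology.c code; all numbers are assumptions.
 */
#include <stdio.h>

#define SCHED_POWER_SCALE	1024UL	/* default cpu_power */

struct cpu_desc {
	const char *name;
	unsigned long efficiency;	/* assumed relative per-MHz throughput */
	unsigned long freq_mhz;		/* assumed DT clock-frequency (MHz) */
};

int main(void)
{
	struct cpu_desc cpus[] = {
		{ "Cortex-A15", 3891, 1000 },
		{ "Cortex-A7",  2048,  800 },
	};
	unsigned long cap[2], min, max, mid;
	int i;

	for (i = 0; i < 2; i++)
		cap[i] = cpus[i].efficiency * cpus[i].freq_mhz;

	min = cap[0] < cap[1] ? cap[0] : cap[1];
	max = cap[0] > cap[1] ? cap[0] : cap[1];
	mid = (min + max) / 2;		/* midpoint maps to 1024 */

	for (i = 0; i < 2; i++)		/* prints ~1441 and ~606 */
		printf("%s: cpu_power ~%lu\n", cpus[i].name,
		       cap[i] * SCHED_POWER_SCALE / mid);

	return 0;
}

With these assumed inputs the sketch happens to print ~1441 for the A15 and
~606 for the A7, i.e. the values quoted at the top of this cover letter.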
Morten
Morten Rasmussen (1):
sched: Pull tasks from cpus with multiple tasks when idle
Vincent Guittot (1):
sched: Force migration on a better cpu
kernel/sched/fair.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 53 insertions(+), 4 deletions(-)
--
1.7.9.5
If a cpu is idle and another cpu has more than one runnable task,
pull one of them without considering the cpu_power of the source or the
target cpu. This allows low cpu_power cpus to offload potentially
oversubscribed high cpu_power cpus.
In heterogeneous systems containing cpus with different cpu_power,
the load-balancer will put more tasks on sched_domains with high
(above default) cpu_power cpus and fewer on sched_domains with low
cpu_power cpus. Hence, if the number of running tasks is equal to
the number of cpus, the load-balancer may decide to leave low cpu_power
cpus idle and place more than one task on each high cpu_power cpu.
This is not an optimal use of the available compute resources.
Placing one task on each cpu before adding more to any of the high
cpu_power cpus should generally give a better overall throughput
regardless of the cpu_power of the cpus.
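As a concrete illustration of how this happens on the TC2 configuration used
for testing (2xA15 at cpu_power 1441, 3xA7 at 606): group capacity is
computed as DIV_ROUND_CLOSEST(group power, SCHED_POWER_SCALE), so the 2-cpu
A15 group is given a capacity of 3 tasks while the 3-cpu A7 group is given a
capacity of 2. With five running tasks the balancer can therefore be
satisfied while one A7 sits idle. A minimal stand-alone sketch of that
arithmetic (assuming SCHED_POWER_SCALE == 1024 and a simplified
DIV_ROUND_CLOSEST() for non-negative values; not kernel code):

#include <stdio.h>

#define SCHED_POWER_SCALE	1024UL
/* simplified rounding division for non-negative values */
#define DIV_ROUND_CLOSEST(x, d)	(((x) + ((d) / 2)) / (d))

int main(void)
{
	unsigned long a15_group_power = 2 * 1441;	/* 2882 */
	unsigned long a7_group_power  = 3 * 606;	/* 1818 */

	/* group_capacity: number of tasks the group is deemed able to carry */
	printf("A15 group (2 cpus): capacity %lu\n",
	       DIV_ROUND_CLOSEST(a15_group_power, SCHED_POWER_SCALE));	/* 3 */
	printf("A7  group (3 cpus): capacity %lu\n",
	       DIV_ROUND_CLOSEST(a7_group_power, SCHED_POWER_SCALE));	/* 2 */

	return 0;
}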
Signed-off-by: Morten Rasmussen <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4781cdd..095885c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4039,7 +4039,8 @@ static int move_tasks(struct lb_env *env)
if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
goto next;
- if ((load / 2) > env->imbalance)
+ if ((load / 2) > env->imbalance &&
+ (env->idle != CPU_IDLE && env->idle != CPU_NEWLY_IDLE))
goto next;
if (!can_migrate_task(p, env))
@@ -4539,6 +4540,15 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (overloaded_cpu)
sgs->group_imb = 1;
+ /*
+ * When idle balancing pull tasks if more than one task per cpu
+ * in group
+ */
+ if (env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) {
+ if (group->group_weight < sgs->sum_nr_running)
+ sgs->group_imb = 1;
+ }
+
sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
SCHED_POWER_SCALE);
if (!sgs->group_capacity)
@@ -4766,8 +4776,13 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
min(sds->this_load_per_task, sds->this_load + tmp);
pwr_move /= SCHED_POWER_SCALE;
- /* Move if we gain throughput */
- if (pwr_move > pwr_now)
+ /*
+ * Move if we gain throughput, or if we have cpus idling while others
+ * are running more than one task.
+ */
+ if ((pwr_move > pwr_now) ||
+ (sds->busiest_group_weight < sds->busiest_nr_running &&
+ (env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE)))
env->imbalance = sds->busiest_load_per_task;
}
--
1.7.9.5
From: Vincent Guittot <[email protected]>
In a system with different cpu_power between cpus, we can end up in a
situation where a heavy task runs on a cpu with lower cpu_power, which by
definition means lower compute capacity and lower performance. We can detect
this scenario and force the task to migrate to a cpu with higher compute
capacity to improve performance for demanding tasks.
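The condition used below in find_busiest_queue() can be summarised as: the
source cpu runs a single task whose tracked load exceeds the cpu's cpu_power,
the destination cpu currently carries less tracked load, and the destination
is more capable than the source by more than the sched_domain imbalance_pct
margin. A stand-alone restatement of that predicate (simplified types and
names for illustration, not the actual fair.c code):

struct cpu_view {
	unsigned int nr_running;
	unsigned long runnable_load_avg;	/* entity load-tracking sum */
	unsigned long cpu_power;
};

static int should_force_migration(const struct cpu_view *src,
				  const struct cpu_view *dst,
				  unsigned int imbalance_pct)
{
	return src->nr_running == 1 &&
	       /* the single task does not fit the source cpu's capacity */
	       src->runnable_load_avg > src->cpu_power &&
	       /* the destination is currently less loaded */
	       src->runnable_load_avg > dst->runnable_load_avg &&
	       /* and more capable by more than the imbalance_pct margin */
	       src->cpu_power * imbalance_pct < dst->cpu_power * 100;
}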
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 36 +++++++++++++++++++++++++++++++++++-
1 file changed, 35 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4243143..4781cdd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4444,7 +4444,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
{
unsigned long nr_running, max_nr_running, min_nr_running;
unsigned long load, max_cpu_load, min_cpu_load;
- unsigned int balance_cpu = -1, first_idle_cpu = 0;
+ unsigned int balance_cpu = -1, first_idle_cpu = 0, overloaded_cpu = 0;
unsigned long avg_load_per_task = 0;
int i;
@@ -4482,6 +4482,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
max_nr_running = nr_running;
if (min_nr_running > nr_running)
min_nr_running = nr_running;
+
+ if ((load > rq->cpu_power)
+ && ((rq->cpu_power*env->sd->imbalance_pct) < (env->dst_rq->cpu_power*100))
+ && (load > target_load(env->dst_cpu, load_idx)))
+ overloaded_cpu = 1;
}
sgs->group_load += load;
@@ -4527,6 +4532,13 @@ static inline void update_sg_lb_stats(struct lb_env *env,
(max_nr_running - min_nr_running) > 1)
sgs->group_imb = 1;
+ /*
+ * The load contrib of a CPU exceeds its capacity, we should try to
+ * find a better CPU with more capacity
+ */
+ if (overloaded_cpu)
+ sgs->group_imb = 1;
+
sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
SCHED_POWER_SCALE);
if (!sgs->group_capacity)
@@ -4940,6 +4952,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
struct sched_group *group)
{
struct rq *busiest = NULL, *rq;
+ struct rq *overloaded = NULL, *dst_rq = cpu_rq(env->dst_cpu);
unsigned long max_load = 0;
int i;
@@ -4959,6 +4972,17 @@ static struct rq *find_busiest_queue(struct lb_env *env,
wl = weighted_cpuload(i);
/*
+ * If the task requires more power than the current CPU
+ * capacity and the dst_cpu has more capacity, keep the
+ * dst_cpu in mind
+ */
+ if ((rq->nr_running == 1)
+ && (rq->cfs.runnable_load_avg > rq->cpu_power)
+ && (rq->cfs.runnable_load_avg > dst_rq->cfs.runnable_load_avg)
+ && ((rq->cpu_power*env->sd->imbalance_pct) < (dst_rq->cpu_power*100)))
+ overloaded = rq;
+
+ /*
* When comparing with imbalance, use weighted_cpuload()
* which is not scaled with the cpu power.
*/
@@ -4979,6 +5003,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
}
}
+ if (!busiest)
+ busiest = overloaded;
+
return busiest;
}
@@ -5006,6 +5033,9 @@ static int need_active_balance(struct lb_env *env)
return 1;
}
+ if ((power_of(env->src_cpu)*sd->imbalance_pct) < (power_of(env->dst_cpu)*100))
+ return 1;
+
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}
@@ -5650,6 +5680,10 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
if (rq->nr_running >= 2)
goto need_kick;
+ /* load contrib is higher than cpu capacity */
+ if (rq->cfs.runnable_load_avg > rq->cpu_power)
+ goto need_kick;
+
rcu_read_lock();
for_each_domain(cpu, sd) {
struct sched_group *sg = sd->groups;
--
1.7.9.5