On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
for performance that cpu intensive tasks are aggressively migrated to
high capacity cpus as soon as those become available. The capacity
awareness tweaks already in the wake-up path can't handle this as such
tasks might run or be runnable forever. If they happen to be placed on a
low capacity cpu from the beginning they are stuck there forever while
high capacity cpus may have become available in the meantime.
To address this issue this patch set introduces a new "misfit"
load-balancing scenario in periodic/nohz/newly idle balance which tweaks
the load-balance conditions to ignore load per capacity in certain
cases. Since misfit tasks are commonly running alone on a cpu, more
aggressive active load-balancing is needed too.
The fundamental idea of this patch set has been in Android kernels for a
long time and is absolutely essential for consistent performance on
asymmetric cpu capacity systems.
The patches have been tested on:
1. Arm Juno (r0): 2+4 Cortex A57/A53
2. Hikey960: 4+4 Cortex A73/A53
Test case:
Big cpus are always kept busy. Pin a shorter running sysbench tasks to
big cpus, while creating a longer running set of unpinned sysbench
tasks.
REQUESTS=1000
BIGS="1 2"
LITTLES="0 3 4 5"
# Don't care about the score for those, just keep the bigs busy
for i in $BIGS; do
taskset -c $i sysbench --max-requests=$((REQUESTS / 4)) \
--test=cpu run &>/dev/null &
done
for i in $LITTLES; do
sysbench --max-requests=$REQUESTS --test=cpu run \
| grep "total time:" &
done
wait
Results:
Single runs with completion time of each task
Juno (tip)
total time: 1.2608s
total time: 1.2995s
total time: 1.5954s
total time: 1.7463s
Juno (misfit)
total time: 1.2575s
total time: 1.3004s
total time: 1.5860s
total time: 1.5871s
Hikey960 (tip)
total time: 1.7431s
total time: 2.2914s
total time: 2.5976s
total time: 1.7280s
Hikey960 (misfit)
total time: 1.7866s
total time: 1.7513s
total time: 1.6918s
total time: 1.6965s
10 run summary (tracking longest running task for each run)
Juno Hikey960
avg max avg max
tip 1.7465 1.7469 2.5997 2.6131
misfit 1.6016 1.6192 1.8506 1.9666
Changelog:
v4
- Added check for empty cpu_map in sd_init().
- Added patch to disable SD_ASYM_CPUCAPACITY for root_domains that don't
observe capacity asymmetry if the system as a whole is asymmetric.
- Added patch to disable SD_PREFER_SIBLING on the sched_domain level below
SD_ASYM_CPUCAPACITY.
- Rebased against tip/sched/core.
- Fixed uninitialised variable introduced in update_sd_lb_stats.
- Added patch to do a slight variable initialisation cleanup in update_sd_lb_stats.
- Removed superfluous type changes for temp variables assigned to root_domain->overload.
- Reworded commit for the patch setting rq->rd->overload when misfit.
- v3 Tested-by: Gaku Inami <[email protected]>
v3
- Fixed locking around static_key.
- Changed group per-cpu capacity comparison to be based on max rather
than min capacity.
- Added patch to prevent occasional pointless high->low capacity
migrations.
- Changed type of group_misfit_task_load and misfit_task_load to
unsigned long.
- Changed fbq() to pick the cpu with highest misfit_task_load rather
than breaking when the first is found.
- Rebased against tip/sched/core.
- v2 Tested-by: Gaku Inami <[email protected]>
v2
- Removed redudant condition in static_key enablement.
- Fixed logic flaw in patch #2 reported by Yi Yao <[email protected]>
- Dropped patch #4 as although the patch seems to make sense no benefit
has been proven.
- Dropped root_domain->overload renaming
- Changed type of root_domain->overload to int
- Wrapped accesses of rq->rd->overload with READ/WRITE_ONCE
- v1 Tested-by: Gaku Inami <[email protected]>
Chris Redpath (1):
sched/fair: Don't move tasks to lower capacity cpus unless necessary
Morten Rasmussen (6):
sched: Add static_key for asymmetric cpu capacity optimizations
sched/fair: Add group_misfit_task load-balance type
sched: Add sched_group per-cpu max capacity
sched/fair: Consider misfit tasks when load-balancing
sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without
asymmetry
sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity
domains
Valentin Schneider (5):
sched/fair: Kick nohz balance if rq->misfit_task_load
sched/fair: Change prefer_sibling type to bool
sched: Change root_domain->overload type to int
sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE
sched/fair: Set rq->rd->overload when misfit
kernel/sched/fair.c | 161 +++++++++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 16 +++--
kernel/sched/topology.c | 53 ++++++++++++++--
3 files changed, 199 insertions(+), 31 deletions(-)
--
2.7.4
The 'prefer sibling' sched_domain flag is intended to encourage
spreading tasks to sibling sched_domain to take advantage of more caches
and core for SMT systems. It has recently been changed to be on all
non-NUMA topology level. However, spreading across domains with cpu
capacity asymmetry isn't desirable, e.g. spreading from high capacity to
low capacity cpus even if high capacity cpus aren't overutilized might
give access to more cache but the cpu will be slower and possibly lead
to worse overall throughput.
To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
level immediately below SD_ASYM_CPUCAPACITY.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/topology.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 29c186961345..00c7a08c7f77 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1140,7 +1140,7 @@ sd_init(struct sched_domain_topology_level *tl,
| 0*SD_SHARE_CPUCAPACITY
| 0*SD_SHARE_PKG_RESOURCES
| 0*SD_SERIALIZE
- | 0*SD_PREFER_SIBLING
+ | 1*SD_PREFER_SIBLING
| 0*SD_NUMA
| sd_flags
,
@@ -1186,17 +1186,21 @@ sd_init(struct sched_domain_topology_level *tl,
if (sd->flags & SD_ASYM_CPUCAPACITY) {
struct sched_domain *t = sd;
+ /*
+ * Don't attempt to spread across cpus of different capacities.
+ */
+ if (sd->child)
+ sd->child->flags &= ~SD_PREFER_SIBLING;
+
for_each_lower_domain(t)
t->flags |= SD_BALANCE_WAKE;
}
if (sd->flags & SD_SHARE_CPUCAPACITY) {
- sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 110;
sd->smt_gain = 1178; /* ~15% */
} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
- sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 117;
sd->cache_nice_tries = 1;
sd->busy_idx = 2;
@@ -1207,6 +1211,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd->busy_idx = 3;
sd->idle_idx = 2;
+ sd->flags &= ~SD_PREFER_SIBLING;
sd->flags |= SD_SERIALIZE;
if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
sd->flags &= ~(SD_BALANCE_EXEC |
@@ -1216,7 +1221,6 @@ sd_init(struct sched_domain_topology_level *tl,
#endif
} else {
- sd->flags |= SD_PREFER_SIBLING;
sd->cache_nice_tries = 1;
sd->busy_idx = 2;
sd->idle_idx = 1;
--
2.7.4
When hotplugging cpus out or creating exclusive cpusets (disabling
sched_load_balance) systems which were asymmetric at boot might become
symmetric. In this case leaving the flag set might lead to suboptimal
scheduling decisions.
The arch-code proving the flag doesn't have visibility of the cpuset
configuration so it must either be told by passing a cpumask or the
generic topology code has to verify if the flag should still be set
when taking the actual sched_domain_span() into account. This patch
implements the latter approach.
We need to detect capacity based on calling arch_scale_cpu_capacity()
directly as rq->cpu_capacity_orig hasn't been set yet early in the boot
process.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/topology.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 71330e0e41db..29c186961345 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1160,6 +1160,26 @@ sd_init(struct sched_domain_topology_level *tl,
sd_id = cpumask_first(sched_domain_span(sd));
/*
+ * Check if cpu_map eclipses cpu capacity asymmetry.
+ */
+
+ if (sd->flags & SD_ASYM_CPUCAPACITY) {
+ int i;
+ bool disable = true;
+ long capacity = arch_scale_cpu_capacity(NULL, sd_id);
+
+ for_each_cpu(i, sched_domain_span(sd)) {
+ if (capacity != arch_scale_cpu_capacity(NULL, i)) {
+ disable = false;
+ break;
+ }
+ }
+
+ if (disable)
+ sd->flags &= ~SD_ASYM_CPUCAPACITY;
+ }
+
+ /*
* Convert topological properties into behaviour.
*/
--
2.7.4
From: Chris Redpath <[email protected]>
When lower capacity CPUs are load balancing and considering to pull
something from a higher capacity group, we should not pull tasks from a
cpu with only one task running as this is guaranteed to impede progress
for that task. If there is more than one task running, load balance in
the higher capacity group would have already made any possible moves to
resolve imbalance and we should make better use of system compute
capacity by moving a task if we still have more than one running.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Chris Redpath <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de84f5a9a65a..06beefa02420 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8793,6 +8793,17 @@ static struct rq *find_busiest_queue(struct lb_env *env,
capacity = capacity_of(i);
+ /*
+ * For ASYM_CPUCAPACITY domains, don't pick a cpu that could
+ * eventually lead to active_balancing high->low capacity.
+ * Higher per-cpu capacity is considered better than balancing
+ * average load.
+ */
+ if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ capacity_of(env->dst_cpu) < capacity &&
+ rq->nr_running == 1)
+ continue;
+
wl = weighted_cpuload(rq);
/*
--
2.7.4
From: Valentin Schneider <[email protected]>
This variable can be read and set locklessly within update_sd_lb_stats().
As such, READ/WRITE_ONCE are added to make sure nothing terribly wrong
can happen because of the compiler.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 6 +++---
kernel/sched/sched.h | 4 ++--
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee26eeb188ef..d0641ba7bea1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8428,8 +8428,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
if (!env->sd->parent) {
/* update overload indicator if we are at root domain */
- if (env->dst_rq->rd->overload != overload)
- env->dst_rq->rd->overload = overload;
+ if (READ_ONCE(env->dst_rq->rd->overload) != overload)
+ WRITE_ONCE(env->dst_rq->rd->overload, overload);
}
}
@@ -9872,7 +9872,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
rq_unpin_lock(this_rq, rf);
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !this_rq->rd->overload) {
+ !READ_ONCE(this_rq->rd->overload)) {
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 648224b23287..e1dc85d1bfdd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1672,8 +1672,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
if (prev_nr < 2 && rq->nr_running >= 2) {
#ifdef CONFIG_SMP
- if (!rq->rd->overload)
- rq->rd->overload = 1;
+ if (!READ_ONCE(rq->rd->overload))
+ WRITE_ONCE(rq->rd->overload, 1);
#endif
}
--
2.7.4
From: Valentin Schneider <[email protected]>
Idle balance is a great opportunity to pull a misfit task. However,
there are scenarios where misfit tasks are present but idle balance is
prevented by the overload flag.
A good example of this is a workload of n identical tasks. Let's suppose
we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly
CPU-intensive tasks - for the sake of simplicity let's say they are just
CPU hogs, even when running on big CPUs.
They are identical tasks, so on an SMP system they should all end at
(roughly) the same time. However, in our case the LITTLE CPUs are less
performing than the big CPUs, so tasks running on the LITTLEs will have
a longer completion time.
This means that the big CPUs will complete their work earlier, at which
point they should pull the tasks from the LITTLEs. What we want to
happen is summarized as follows:
a,b,c,d are our CPU-hogging tasks
_ signifies idling
LITTLE_0 | a a a a _ _
LITTLE_1 | b b b b _ _
---------|-------------
big_0 | c c c c a a
big_1 | d d d d b b
^
^
Tasks end on the big CPUs, idle balance happens
and the misfit tasks are pulled straight away
This however won't happen, because currently the overload flag is only
set when there is any CPU that has more than one runnable task - which
may very well not be the case here if our CPU-hogging workload is all
there is to run.
As such, this commit sets the overload flag in update_sg_lb_stats when
a group is flagged as having a misfit task.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 6 ++++--
kernel/sched/sched.h | 6 +++++-
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0641ba7bea1..de84f5a9a65a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8163,7 +8163,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
* @load_idx: Load index of sched_domain of this_cpu for load calc.
* @local_group: Does group contain this_cpu.
* @sgs: variable to hold the statistics for this group.
- * @overload: Indicate more than one runnable task for any CPU.
+ * @overload: Indicate pullable load (e.g. >1 runnable task).
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
@@ -8207,8 +8207,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->idle_cpus++;
if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
- sgs->group_misfit_task_load < rq->misfit_task_load)
+ sgs->group_misfit_task_load < rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
+ *overload = 1;
+ }
}
/* Adjust by relative CPU capacity of the group */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e1dc85d1bfdd..377545b5aa15 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -695,7 +695,11 @@ struct root_domain {
cpumask_var_t span;
cpumask_var_t online;
- /* Indicate more than one runnable task for any CPU */
+ /*
+ * Indicate pullable load on at least one CPU, e.g:
+ * - More than one runnable task
+ * - Running task is misfit
+ */
int overload;
/*
--
2.7.4
From: Valentin Schneider <[email protected]>
sizeof(_Bool) is implementation defined, so let's just go with 'int' as
is done for other structures e.g. sched_domain_shared->has_idle_cores.
The local 'overload' variable used in update_sd_lb_stats can remain
bool, as it won't impact any struct layout and can be assigned to the
root_domain field.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/sched.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6c39a07e8a68..648224b23287 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -696,7 +696,7 @@ struct root_domain {
cpumask_var_t online;
/* Indicate more than one runnable task for any CPU */
- bool overload;
+ int overload;
/*
* The bit corresponding to a CPU gets set here if such CPU has more
@@ -1673,7 +1673,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
if (prev_nr < 2 && rq->nr_running >= 2) {
#ifdef CONFIG_SMP
if (!rq->rd->overload)
- rq->rd->overload = true;
+ rq->rd->overload = 1;
#endif
}
--
2.7.4
On asymmetric cpu capacity systems load intensive tasks can end up on
cpus that don't suit their compute demand. In this scenarios 'misfit'
tasks should be migrated to cpus with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.
Misfit balancing only makes sense between a source group of lower
per-cpu capacity and destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has lowest
priority so any imbalance due to overload is dealt with first.
The modifications are:
1. Only pick a group containing misfit tasks as the busiest group if the
destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
to be pulled ignoring average load.
4. Pick the cpu with the highest misfit load as the source cpu.
5. If the misfit task is alone on the source cpu, go for active
balancing.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 49 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09ede4321a3d..6e885d92fad2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7285,6 +7285,7 @@ struct lb_env {
unsigned int loop_max;
enum fbq_type fbq_type;
+ enum group_type src_grp_type;
struct list_head tasks;
};
@@ -8243,6 +8244,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
{
struct sg_lb_stats *busiest = &sds->busiest_stat;
+ /*
+ * Don't try to pull misfit tasks we can't help.
+ * We can use max_capacity here as reduction in capacity on some
+ * cpus in the group should either be possible to resolve
+ * internally or be covered by avg_load imbalance (eventually).
+ */
+ if (sgs->group_type == group_misfit_task &&
+ (!group_smaller_max_cpu_capacity(sg, sds->local) ||
+ !group_has_capacity(env, &sds->local_stat)))
+ return false;
+
if (sgs->group_type > busiest->group_type)
return true;
@@ -8265,6 +8277,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
group_smaller_min_cpu_capacity(sds->local, sg))
return false;
+ /*
+ * If we have more than one misfit sg go with the biggest misfit.
+ */
+ if (sgs->group_type == group_misfit_task &&
+ sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+ return false;
+
asym_packing:
/* This is the busiest node in its class. */
if (!(env->sd->flags & SD_ASYM_PACKING))
@@ -8562,8 +8581,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* factors in sg capacity and sgs with smaller group_type are
* skipped when updating the busiest sg:
*/
- if (busiest->avg_load <= sds->avg_load ||
- local->avg_load >= sds->avg_load) {
+ if (busiest->group_type != group_misfit_task &&
+ (busiest->avg_load <= sds->avg_load ||
+ local->avg_load >= sds->avg_load)) {
env->imbalance = 0;
return fix_small_imbalance(env, sds);
}
@@ -8597,6 +8617,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
(sds->avg_load - local->avg_load) * local->group_capacity
) / SCHED_CAPACITY_SCALE;
+ /* Boost imbalance to allow misfit task to be balanced. */
+ if (busiest->group_type == group_misfit_task) {
+ env->imbalance = max_t(long, env->imbalance,
+ busiest->group_misfit_task_load);
+ }
+
/*
* if *imbalance is less than the average load per runnable task
* there is no guarantee that any tasks will be moved so we'll have
@@ -8663,6 +8689,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
busiest->group_no_capacity)
goto force_balance;
+ /* Misfit tasks should be dealt with regardless of the avg load */
+ if (busiest->group_type == group_misfit_task)
+ goto force_balance;
+
/*
* If the local group is busier than the selected busiest group
* don't try and pull any tasks.
@@ -8700,6 +8730,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
force_balance:
/* Looks like there is an imbalance. Compute it */
+ env->src_grp_type = busiest->group_type;
calculate_imbalance(env, &sds);
return sds.busiest;
@@ -8747,6 +8778,19 @@ static struct rq *find_busiest_queue(struct lb_env *env,
if (rt > env->fbq_type)
continue;
+ /*
+ * For ASYM_CPUCAPACITY domains with misfit tasks we simply
+ * seek the "biggest" misfit task.
+ */
+ if (env->src_grp_type == group_misfit_task) {
+ if (rq->misfit_task_load > busiest_load) {
+ busiest_load = rq->misfit_task_load;
+ busiest = rq;
+ }
+
+ continue;
+ }
+
capacity = capacity_of(i);
wl = weighted_cpuload(rq);
@@ -8816,6 +8860,9 @@ static int need_active_balance(struct lb_env *env)
return 1;
}
+ if (env->src_grp_type == group_misfit_task)
+ return 1;
+
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}
--
2.7.4
From: Valentin Schneider <[email protected]>
There already are a few conditions in nohz_kick_needed() to ensure
a nohz kick is triggered, but they are not enough for some misfit
task scenarios. Excluding asym packing, those are:
* rq->nr_running >=2: Not relevant here because we are running a
misfit task, it needs to be migrated regardless and potentially through
active balance.
* sds->nr_busy_cpus > 1: If there is only the misfit task being run
on a group of low capacity cpus, this will be evaluated to False.
* rq->cfs.h_nr_running >=1 && check_cpu_capacity(): Not relevant here,
misfit task needs to be migrated regardless of rt/IRQ pressure
As such, this commit adds an rq->misfit_task_load condition to trigger a
nohz kick.
The idea to kick a nohz balance for misfit tasks originally came from
Leo Yan <[email protected]>, and a similar patch was submitted for
the Android Common Kernel - see [1].
[1]: https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e885d92fad2..acec93e1dc51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9505,7 +9505,7 @@ static void nohz_balancer_kick(struct rq *rq)
if (time_before(now, nohz.next_balance))
goto out;
- if (rq->nr_running >= 2) {
+ if (rq->nr_running >= 2 || rq->misfit_task_load) {
flags = NOHZ_KICK_MASK;
goto out;
}
--
2.7.4
From: Valentin Schneider <[email protected]>
This variable is entirely local to update_sd_lb_stats, so we can
safely change its type and slightly clean up its initialisation.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index acec93e1dc51..ee26eeb188ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8352,11 +8352,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
- int load_idx, prefer_sibling = 0;
+ int load_idx;
bool overload = false;
-
- if (child && child->flags & SD_PREFER_SIBLING)
- prefer_sibling = 1;
+ bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
#ifdef CONFIG_NO_HZ_COMMON
if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
--
2.7.4
The existing asymmetric cpu capacity code should cause minimal overhead
for others. Putting it behind a static_key, it has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 3 +++
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 19 +++++++++++++++++++
3 files changed, 23 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 321cd5dcf2e8..85fb7e8ff5c8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6583,6 +6583,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
{
long min_cap, max_cap;
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return 0;
+
min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7742dcc136c..35ce218f0157 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1160,6 +1160,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+extern struct static_key_false sched_asym_cpucapacity;
struct sched_group_capacity {
atomic_t ref;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 05a831427bc7..0cfdeff669fe 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -398,6 +398,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
static void update_top_cache_domain(int cpu)
{
@@ -425,6 +426,21 @@ static void update_top_cache_domain(int cpu)
rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
}
+static void update_asym_cpucapacity(int cpu)
+{
+ int enable = false;
+
+ rcu_read_lock();
+ if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
+ enable = true;
+ rcu_read_unlock();
+
+ if (enable) {
+ /* This expects to be hotplug-safe */
+ static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
+ }
+}
+
/*
* Attach the domain 'sd' to 'cpu' as its base domain. Callers must
* hold the hotplug lock.
@@ -1707,6 +1723,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
}
rcu_read_unlock();
+ if (!cpumask_empty(cpu_map))
+ update_asym_cpucapacity(cpumask_first(cpu_map));
+
if (rq && sched_debug_enabled) {
pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
--
2.7.4
The current sg->min_capacity tracks the lowest per-cpu compute capacity
available in the sched_group when rt/irq pressure is taken into account.
Minimum capacity isn't the ideal metric for tracking if a sched_group
needs offloading to another sched_group for some scenarios, e.g. a
sched_group with multiple cpus if only one is under heavy pressure.
Tracking maximum capacity isn't perfect either but a better choice for
some situations as it indicates that the sched_group definitely compute
capacity constrained either due to rt/irq pressure on all cpus or
asymmetric cpu capacities (e.g. big.LITTLE).
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 24 ++++++++++++++++++++----
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 2 ++
3 files changed, 23 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e05e5202a1d2..09ede4321a3d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7927,13 +7927,14 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
cpu_rq(cpu)->cpu_capacity = capacity;
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = capacity;
+ sdg->sgc->max_capacity = capacity;
}
void update_group_capacity(struct sched_domain *sd, int cpu)
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
- unsigned long capacity, min_capacity;
+ unsigned long capacity, min_capacity, max_capacity;
unsigned long interval;
interval = msecs_to_jiffies(sd->balance_interval);
@@ -7947,6 +7948,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
capacity = 0;
min_capacity = ULONG_MAX;
+ max_capacity = 0;
if (child->flags & SD_OVERLAP) {
/*
@@ -7977,6 +7979,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
}
min_capacity = min(capacity, min_capacity);
+ max_capacity = max(capacity, max_capacity);
}
} else {
/*
@@ -7990,12 +7993,14 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
capacity += sgc->capacity;
min_capacity = min(sgc->min_capacity, min_capacity);
+ max_capacity = max(sgc->max_capacity, max_capacity);
group = group->next;
} while (group != child->groups);
}
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = min_capacity;
+ sdg->sgc->max_capacity = max_capacity;
}
/*
@@ -8091,16 +8096,27 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
}
/*
- * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
* per-CPU capacity than sched_group ref.
*/
static inline bool
-group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
{
return sg->sgc->min_capacity * capacity_margin <
ref->sgc->min_capacity * 1024;
}
+/*
+ * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-CPU capacity_orig than sched_group ref.
+ */
+static inline bool
+group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+ return sg->sgc->max_capacity * capacity_margin <
+ ref->sgc->max_capacity * 1024;
+}
+
static inline enum
group_type group_classify(struct sched_group *group,
struct sg_lb_stats *sgs)
@@ -8246,7 +8262,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* power/energy consequences are not considered.
*/
if (sgs->sum_nr_running <= sgs->group_weight &&
- group_smaller_cpu_capacity(sds->local, sg))
+ group_smaller_min_cpu_capacity(sds->local, sg))
return false;
asym_packing:
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3376bacab712..6c39a07e8a68 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1172,6 +1172,7 @@ struct sched_group_capacity {
*/
unsigned long capacity;
unsigned long min_capacity; /* Min per-CPU capacity in group */
+ unsigned long max_capacity; /* Max per-CPU capacity in group */
unsigned long next_update;
int imbalance; /* XXX unrelated to capacity but shared group state */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 0cfdeff669fe..71330e0e41db 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -708,6 +708,7 @@ static void init_overlap_sched_group(struct sched_domain *sd,
sg_span = sched_group_span(sg);
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+ sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
}
static int
@@ -867,6 +868,7 @@ static struct sched_group *get_group(int cpu, struct sd_data *sdd)
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_span(sg));
sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+ sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
return sg;
}
--
2.7.4
To maximize throughput in systems with asymmetric cpu capacities (e.g.
ARM big.LITTLE) load-balancing has to consider task and cpu utilization
as well as per-cpu compute capacity when load-balancing in addition to
the current average load based load-balancing policy. Tasks with high
utilization that are scheduled on a lower capacity cpu need to be
identified and migrated to a higher capacity cpu if possible to maximize
throughput.
To implement this additional policy an additional group_type
(load-balance scenario) is added: group_misfit_task. This represents
scenarios where a sched_group has one or more tasks that are not
suitable for its per-cpu capacity. group_misfit_task is only considered
if the system is not overloaded or imbalanced (group_imbalanced or
group_overloaded).
Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each cpu is responsible for tracking misfit tasks themselves and update
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on sched_tick.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++++--------
kernel/sched/sched.h | 2 ++
2 files changed, 48 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 85fb7e8ff5c8..e05e5202a1d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -697,6 +697,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
static unsigned long task_h_load(struct task_struct *p);
+static unsigned long capacity_of(int cpu);
/* Give new sched_entity start runnable values to heavy its load in infant time */
void init_entity_runnable_average(struct sched_entity *se)
@@ -1448,7 +1449,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
static unsigned long weighted_cpuload(struct rq *rq);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
-static unsigned long capacity_of(int cpu);
/* Cached statistics for all CPUs within a node */
struct numa_stats {
@@ -4035,6 +4035,29 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
WRITE_ONCE(p->se.avg.util_est, ue);
}
+static inline int task_fits_capacity(struct task_struct *p, long capacity)
+{
+ return capacity * 1024 > task_util_est(p) * capacity_margin;
+}
+
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
+{
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return;
+
+ if (!p) {
+ rq->misfit_task_load = 0;
+ return;
+ }
+
+ if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
+ rq->misfit_task_load = 0;
+ return;
+ }
+
+ rq->misfit_task_load = task_h_load(p);
+}
+
#else /* CONFIG_SMP */
static inline int
@@ -4070,6 +4093,7 @@ util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
static inline void
util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
bool task_sleep) {}
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
#endif /* CONFIG_SMP */
@@ -6596,7 +6620,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
/* Bring task utilization in sync with prev_cpu */
sync_entity_load_avg(&p->se);
- return min_cap * 1024 < task_util(p) * capacity_margin;
+ return !task_fits_capacity(p, min_cap);
}
/*
@@ -7013,9 +7037,12 @@ done: __maybe_unused;
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);
+ update_misfit_status(p, rq);
+
return p;
idle:
+ update_misfit_status(NULL, rq);
new_tasks = idle_balance(rq, rf);
/*
@@ -7221,6 +7248,13 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
enum fbq_type { regular, remote, all };
+enum group_type {
+ group_other = 0,
+ group_misfit_task,
+ group_imbalanced,
+ group_overloaded,
+};
+
#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
#define LBF_DST_PINNED 0x04
@@ -7762,12 +7796,6 @@ static unsigned long task_h_load(struct task_struct *p)
/********** Helpers for find_busiest_group ************************/
-enum group_type {
- group_other = 0,
- group_imbalanced,
- group_overloaded,
-};
-
/*
* sg_lb_stats - stats of a sched_group required for load_balancing
*/
@@ -7783,6 +7811,7 @@ struct sg_lb_stats {
unsigned int group_weight;
enum group_type group_type;
int group_no_capacity;
+ unsigned long group_misfit_task_load; /* A cpu has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -8082,6 +8111,9 @@ group_type group_classify(struct sched_group *group,
if (sg_imbalanced(group))
return group_imbalanced;
+ if (sgs->group_misfit_task_load)
+ return group_misfit_task;
+
return group_other;
}
@@ -8156,6 +8188,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
*/
if (!nr_running && idle_cpu(i))
sgs->idle_cpus++;
+
+ if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ sgs->group_misfit_task_load < rq->misfit_task_load)
+ sgs->group_misfit_task_load = rq->misfit_task_load;
}
/* Adjust by relative CPU capacity of the group */
@@ -9937,6 +9973,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
+
+ update_misfit_status(curr, rq);
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 35ce218f0157..3376bacab712 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -822,6 +822,8 @@ struct rq {
unsigned char idle_balance;
+ unsigned long misfit_task_load;
+
/* For active balancing */
int active_balance;
int push_cpu;
--
2.7.4
Hi Morten,
On Wednesday 04 Jul 2018 at 11:17:49 (+0100), Morten Rasmussen wrote:
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 71330e0e41db..29c186961345 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1160,6 +1160,26 @@ sd_init(struct sched_domain_topology_level *tl,
> sd_id = cpumask_first(sched_domain_span(sd));
>
> /*
> + * Check if cpu_map eclipses cpu capacity asymmetry.
> + */
> +
> + if (sd->flags & SD_ASYM_CPUCAPACITY) {
> + int i;
> + bool disable = true;
> + long capacity = arch_scale_cpu_capacity(NULL, sd_id);
> +
> + for_each_cpu(i, sched_domain_span(sd)) {
> + if (capacity != arch_scale_cpu_capacity(NULL, i)) {
> + disable = false;
> + break;
> + }
> + }
> +
> + if (disable)
> + sd->flags &= ~SD_ASYM_CPUCAPACITY;
> + }
> +
> + /*
> * Convert topological properties into behaviour.
> */
If SD_ASYM_CPUCAPACITY means that some CPUs have different
arch_scale_cpu_capacity() values, we could also automatically _set_
the flag in sd_init() no ? Why should we let the arch set it and just
correct it later ?
I understand the moment at which we know the capacities of CPUs varies
from arch to arch, but the arch code could just call
rebuild_sched_domain when the capacities of CPUs change and let the
scheduler detect things automatically. I mean, even if the arch code
sets the flag in its topology level table, it will have to rebuild
the sched domains anyway ...
What do you think ?
Thanks,
Quentin
On Thu, Jul 05, 2018 at 02:31:43PM +0100, Quentin Perret wrote:
> Hi Morten,
>
> On Wednesday 04 Jul 2018 at 11:17:49 (+0100), Morten Rasmussen wrote:
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 71330e0e41db..29c186961345 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1160,6 +1160,26 @@ sd_init(struct sched_domain_topology_level *tl,
> > sd_id = cpumask_first(sched_domain_span(sd));
> >
> > /*
> > + * Check if cpu_map eclipses cpu capacity asymmetry.
> > + */
> > +
> > + if (sd->flags & SD_ASYM_CPUCAPACITY) {
> > + int i;
> > + bool disable = true;
> > + long capacity = arch_scale_cpu_capacity(NULL, sd_id);
> > +
> > + for_each_cpu(i, sched_domain_span(sd)) {
> > + if (capacity != arch_scale_cpu_capacity(NULL, i)) {
> > + disable = false;
> > + break;
> > + }
> > + }
> > +
> > + if (disable)
> > + sd->flags &= ~SD_ASYM_CPUCAPACITY;
> > + }
> > +
> > + /*
> > * Convert topological properties into behaviour.
> > */
>
> If SD_ASYM_CPUCAPACITY means that some CPUs have different
> arch_scale_cpu_capacity() values, we could also automatically _set_
> the flag in sd_init() no ? Why should we let the arch set it and just
> correct it later ?
>
> I understand the moment at which we know the capacities of CPUs varies
> from arch to arch, but the arch code could just call
> rebuild_sched_domain when the capacities of CPUs change and let the
> scheduler detect things automatically. I mean, even if the arch code
> sets the flag in its topology level table, it will have to rebuild
> the sched domains anyway ...
>
> What do you think ?
We could as well set the flag here so the architecture doesn't have to
do it. It is a bit more complicated though for few reasons:
1. Detecting when to disable the flag is a lot simpler than checking
which level is should be set on. You basically have to work you way up
from the lowest topology level until you get to a level spanning all the
capacities available in the system to figure out where the flag should
be set. I don't think this fits easily with how we build the
sched_domain hierarchy. It can of course be done.
2. As you say, we still need the arch code (or cpufreq?) to rebuild the
whole thing once we know that the capacities have been determined. That
currently implies implementing arch_update_cpu_topology() which is
arch-specific. So we would need some arch code to make rebuild happen at
the right point in time. If the rebuild should be triggering the rebuild
we need another way to force a full rebuild. This can also be done.
3. Detecting the flag in generic kernel/sched/* code means that all
architectures will pay the for the overhead when building/rebuilding the
sched_domain hierarchy, and all architectures that sets the cpu
capacities to asymmetric will set the flag whether they like it or not.
I'm not sure if this is a problem.
In the end it is really about how much of this we want in generic code
and how much we hide in arch/, and if we dare to touch the sched_domain
build code ;-)
Morten
On Thursday 05 Jul 2018 at 15:13:49 (+0100), Morten Rasmussen wrote:
> On Thu, Jul 05, 2018 at 02:31:43PM +0100, Quentin Perret wrote:
> > If SD_ASYM_CPUCAPACITY means that some CPUs have different
> > arch_scale_cpu_capacity() values, we could also automatically _set_
> > the flag in sd_init() no ? Why should we let the arch set it and just
> > correct it later ?
> >
> > I understand the moment at which we know the capacities of CPUs varies
> > from arch to arch, but the arch code could just call
> > rebuild_sched_domain when the capacities of CPUs change and let the
> > scheduler detect things automatically. I mean, even if the arch code
> > sets the flag in its topology level table, it will have to rebuild
> > the sched domains anyway ...
> >
> > What do you think ?
>
> We could as well set the flag here so the architecture doesn't have to
> do it. It is a bit more complicated though for few reasons:
>
> 1. Detecting when to disable the flag is a lot simpler than checking
> which level is should be set on. You basically have to work you way up
> from the lowest topology level until you get to a level spanning all the
> capacities available in the system to figure out where the flag should
> be set. I don't think this fits easily with how we build the
> sched_domain hierarchy. It can of course be done.
Ah, that is something I missed. I see in 1f6e6c7cb9bc ("sched/core:
Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag") that this
flag should be set only at the _lowest_ level at which there is
asymmetry. I had the wrong impression that the flag was supposed to be
set at _all_ level where there is some asymmetry. And actually having
'some asymmetry' isn't enough, we want to see the full range of CPU
capacities. Hmmm, that is indeed more complex than I thought ... :/
>
> 2. As you say, we still need the arch code (or cpufreq?) to rebuild the
> whole thing once we know that the capacities have been determined. That
> currently implies implementing arch_update_cpu_topology() which is
> arch-specific. So we would need some arch code to make rebuild happen at
> the right point in time. If the rebuild should be triggering the rebuild
> we need another way to force a full rebuild. This can also be done.
Yeah, with just this patch the arch code will have to:
1. update the arch_scale_cpu_capacity() of the CPUs;
2. detect the asymmetry to set the flag in the topology table;
3. rebuild the sched domains to let the scheduler know about the new
topology information.
By doing what I suggest we would just save 2 from the arch side and do
it in the scheduler. So actually, I really start to wonder if it's worth
it given the added complexity ...
> 3. Detecting the flag in generic kernel/sched/* code means that all
> architectures will pay the for the overhead when building/rebuilding the
> sched_domain hierarchy, and all architectures that sets the cpu
> capacities to asymmetric will set the flag whether they like it or not.
> I'm not sure if this is a problem.
That is true as well ...
>
> In the end it is really about how much of this we want in generic code
> and how much we hide in arch/, and if we dare to touch the sched_domain
> build code ;-)
Right so you can argue that the arch code is here to give you a
system-level information, and that if the scheduler wants to virtually
split that system, then it's its job to make sure that happens properly.
That is exactly what your patch does (IIUC), and I now think that this
is a very sensible middle-ground option. But this is debatable so I'm
interested to see what others think :-)
Thanks,
Quentin
On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <[email protected]> wrote:
>
> The 'prefer sibling' sched_domain flag is intended to encourage
> spreading tasks to sibling sched_domain to take advantage of more caches
> and core for SMT systems. It has recently been changed to be on all
> non-NUMA topology level. However, spreading across domains with cpu
> capacity asymmetry isn't desirable, e.g. spreading from high capacity to
> low capacity cpus even if high capacity cpus aren't overutilized might
> give access to more cache but the cpu will be slower and possibly lead
> to worse overall throughput.
>
> To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
> level immediately below SD_ASYM_CPUCAPACITY.
This makes sense. Nevertheless, this patch also raises a scheduling
problem and break the 1 task per CPU policy that is enforced by
SD_PREFER_SIBLING. When running the tests of your cover letter, 1 long
running task is often co scheduled on a big core whereas short pinned
tasks are still running and a little core is idle which is not an
optimal scheduling decision
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/topology.c | 12 ++++++++----
> 1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 29c186961345..00c7a08c7f77 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1140,7 +1140,7 @@ sd_init(struct sched_domain_topology_level *tl,
> | 0*SD_SHARE_CPUCAPACITY
> | 0*SD_SHARE_PKG_RESOURCES
> | 0*SD_SERIALIZE
> - | 0*SD_PREFER_SIBLING
> + | 1*SD_PREFER_SIBLING
> | 0*SD_NUMA
> | sd_flags
> ,
> @@ -1186,17 +1186,21 @@ sd_init(struct sched_domain_topology_level *tl,
> if (sd->flags & SD_ASYM_CPUCAPACITY) {
> struct sched_domain *t = sd;
>
> + /*
> + * Don't attempt to spread across cpus of different capacities.
> + */
> + if (sd->child)
> + sd->child->flags &= ~SD_PREFER_SIBLING;
> +
> for_each_lower_domain(t)
> t->flags |= SD_BALANCE_WAKE;
> }
>
> if (sd->flags & SD_SHARE_CPUCAPACITY) {
> - sd->flags |= SD_PREFER_SIBLING;
> sd->imbalance_pct = 110;
> sd->smt_gain = 1178; /* ~15% */
>
> } else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
> - sd->flags |= SD_PREFER_SIBLING;
> sd->imbalance_pct = 117;
> sd->cache_nice_tries = 1;
> sd->busy_idx = 2;
> @@ -1207,6 +1211,7 @@ sd_init(struct sched_domain_topology_level *tl,
> sd->busy_idx = 3;
> sd->idle_idx = 2;
>
> + sd->flags &= ~SD_PREFER_SIBLING;
> sd->flags |= SD_SERIALIZE;
> if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
> sd->flags &= ~(SD_BALANCE_EXEC |
> @@ -1216,7 +1221,6 @@ sd_init(struct sched_domain_topology_level *tl,
>
> #endif
> } else {
> - sd->flags |= SD_PREFER_SIBLING;
> sd->cache_nice_tries = 1;
> sd->busy_idx = 2;
> sd->idle_idx = 1;
> --
> 2.7.4
>
Hi Morten,
On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <[email protected]> wrote:
>
> On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
> for performance that cpu intensive tasks are aggressively migrated to
> high capacity cpus as soon as those become available. The capacity
> awareness tweaks already in the wake-up path can't handle this as such
> tasks might run or be runnable forever. If they happen to be placed on a
> low capacity cpu from the beginning they are stuck there forever while
> high capacity cpus may have become available in the meantime.
>
> To address this issue this patch set introduces a new "misfit"
> load-balancing scenario in periodic/nohz/newly idle balance which tweaks
> the load-balance conditions to ignore load per capacity in certain
> cases. Since misfit tasks are commonly running alone on a cpu, more
> aggressive active load-balancing is needed too.
>
> The fundamental idea of this patch set has been in Android kernels for a
> long time and is absolutely essential for consistent performance on
> asymmetric cpu capacity systems.
>
As already said , I'm not convinced by the proposal which seems quite
complex and also adds some kind of arbitrary and fixed power
management policy by deciding which tasks can or not go on big cores
whereas there are other frameworks to take such decision like EAS or
cgroups. Furthermore, there is already something similar in the kernel
with SD_ASYM_PACKING and IMO, it would be better to improve this
feature (if needed) instead of adding a new one which often do similar
things.
I have rerun your tests and got same results than misfit task patchset
on my hikey960 with SD_ASYM_PACKING feature for legacy b.L topology
and fake dynamiQ topology. And it give better performance when the
pinned tasks are short and scheduler has to wait for the task to
increase their utilization before getting a chance to migrate on big
core.
Then, I have tested SD_ASYM_PACKING with EAS patchset and they work
together for b/L and dynamiQ topology
> The patches have been tested on:
> 1. Arm Juno (r0): 2+4 Cortex A57/A53
> 2. Hikey960: 4+4 Cortex A73/A53
>
> Test case:
> Big cpus are always kept busy. Pin a shorter running sysbench tasks to
> big cpus, while creating a longer running set of unpinned sysbench
> tasks.
>
> REQUESTS=1000
> BIGS="1 2"
> LITTLES="0 3 4 5"
>
> # Don't care about the score for those, just keep the bigs busy
> for i in $BIGS; do
> taskset -c $i sysbench --max-requests=$((REQUESTS / 4)) \
> --test=cpu run &>/dev/null &
> done
>
> for i in $LITTLES; do
> sysbench --max-requests=$REQUESTS --test=cpu run \
> | grep "total time:" &
> done
>
> wait
>
> Results:
> Single runs with completion time of each task
> Juno (tip)
> total time: 1.2608s
> total time: 1.2995s
> total time: 1.5954s
> total time: 1.7463s
>
> Juno (misfit)
> total time: 1.2575s
> total time: 1.3004s
> total time: 1.5860s
> total time: 1.5871s
>
> Hikey960 (tip)
> total time: 1.7431s
> total time: 2.2914s
> total time: 2.5976s
> total time: 1.7280s
>
> Hikey960 (misfit)
> total time: 1.7866s
> total time: 1.7513s
> total time: 1.6918s
> total time: 1.6965s
>
> 10 run summary (tracking longest running task for each run)
> Juno Hikey960
> avg max avg max
> tip 1.7465 1.7469 2.5997 2.6131
> misfit 1.6016 1.6192 1.8506 1.9666
>
> Changelog:
> v4
> - Added check for empty cpu_map in sd_init().
> - Added patch to disable SD_ASYM_CPUCAPACITY for root_domains that don't
> observe capacity asymmetry if the system as a whole is asymmetric.
> - Added patch to disable SD_PREFER_SIBLING on the sched_domain level below
> SD_ASYM_CPUCAPACITY.
> - Rebased against tip/sched/core.
> - Fixed uninitialised variable introduced in update_sd_lb_stats.
> - Added patch to do a slight variable initialisation cleanup in update_sd_lb_stats.
> - Removed superfluous type changes for temp variables assigned to root_domain->overload.
> - Reworded commit for the patch setting rq->rd->overload when misfit.
> - v3 Tested-by: Gaku Inami <[email protected]>
>
> v3
> - Fixed locking around static_key.
> - Changed group per-cpu capacity comparison to be based on max rather
> than min capacity.
> - Added patch to prevent occasional pointless high->low capacity
> migrations.
> - Changed type of group_misfit_task_load and misfit_task_load to
> unsigned long.
> - Changed fbq() to pick the cpu with highest misfit_task_load rather
> than breaking when the first is found.
> - Rebased against tip/sched/core.
> - v2 Tested-by: Gaku Inami <[email protected]>
>
> v2
> - Removed redudant condition in static_key enablement.
> - Fixed logic flaw in patch #2 reported by Yi Yao <[email protected]>
> - Dropped patch #4 as although the patch seems to make sense no benefit
> has been proven.
> - Dropped root_domain->overload renaming
> - Changed type of root_domain->overload to int
> - Wrapped accesses of rq->rd->overload with READ/WRITE_ONCE
> - v1 Tested-by: Gaku Inami <[email protected]>
>
> Chris Redpath (1):
> sched/fair: Don't move tasks to lower capacity cpus unless necessary
>
> Morten Rasmussen (6):
> sched: Add static_key for asymmetric cpu capacity optimizations
> sched/fair: Add group_misfit_task load-balance type
> sched: Add sched_group per-cpu max capacity
> sched/fair: Consider misfit tasks when load-balancing
> sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without
> asymmetry
> sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity
> domains
>
> Valentin Schneider (5):
> sched/fair: Kick nohz balance if rq->misfit_task_load
> sched/fair: Change prefer_sibling type to bool
> sched: Change root_domain->overload type to int
> sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE
> sched/fair: Set rq->rd->overload when misfit
>
> kernel/sched/fair.c | 161 +++++++++++++++++++++++++++++++++++++++++-------
> kernel/sched/sched.h | 16 +++--
> kernel/sched/topology.c | 53 ++++++++++++++--
> 3 files changed, 199 insertions(+), 31 deletions(-)
>
> --
> 2.7.4
>
On Fri, Jul 06, 2018 at 12:18:17PM +0200, Vincent Guittot wrote:
> On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <[email protected]> wrote:
> >
> > The 'prefer sibling' sched_domain flag is intended to encourage
> > spreading tasks to sibling sched_domain to take advantage of more caches
> > and core for SMT systems. It has recently been changed to be on all
> > non-NUMA topology level. However, spreading across domains with cpu
> > capacity asymmetry isn't desirable, e.g. spreading from high capacity to
> > low capacity cpus even if high capacity cpus aren't overutilized might
> > give access to more cache but the cpu will be slower and possibly lead
> > to worse overall throughput.
> >
> > To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
> > level immediately below SD_ASYM_CPUCAPACITY.
>
> This makes sense. Nevertheless, this patch also raises a scheduling
> problem and break the 1 task per CPU policy that is enforced by
> SD_PREFER_SIBLING.
Scheduling one task per cpu when n_task == n_cpus on asymmetric
topologies is generally broken already and this patch set doesn't fix
that problem.
SD_PREFER_SIBLING might seem to help in very specific cases:
n_litte_cpus == n_big_cpus. In that case the little group might
classified as overloaded. It doesn't guarantee that anything gets pulled
as the grp_load/grp_capacity in the imbalance calculation on some system
still says the little cpus are more loaded than the bigs despite one of
them being idle. That depends on the little cpu capacities.
On systems where n_little_cpus != n_big_cpus SD_PREFER_SIBLING is broken
as it assumes the group_weight to be the same. This is the case on Juno
and several other platforms.
IMHO, SD_PREFER_SIBLING isn't the solution to this problem. It might
help for a limited subset of topologies/capacities but the right
solution is to change the imbalance calculation. As the name says, it is
meant to spread tasks and does so unconditionally. For asymmetric
systems we would like to consider cpu capacity before migrating tasks.
> When running the tests of your cover letter, 1 long
> running task is often co scheduled on a big core whereas short pinned
> tasks are still running and a little core is idle which is not an
> optimal scheduling decision
This can easily happen with SD_PREFER_SIBLING enabled too so I wouldn't
say that this patch breaks anything that isn't broken already. In fact
we this happening with and without this patch applied.
Morten
On Fri, Jul 06, 2018 at 12:18:27PM +0200, Vincent Guittot wrote:
> Hi Morten,
>
> On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <[email protected]> wrote:
> >
> > On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
> > for performance that cpu intensive tasks are aggressively migrated to
> > high capacity cpus as soon as those become available. The capacity
> > awareness tweaks already in the wake-up path can't handle this as such
> > tasks might run or be runnable forever. If they happen to be placed on a
> > low capacity cpu from the beginning they are stuck there forever while
> > high capacity cpus may have become available in the meantime.
> >
> > To address this issue this patch set introduces a new "misfit"
> > load-balancing scenario in periodic/nohz/newly idle balance which tweaks
> > the load-balance conditions to ignore load per capacity in certain
> > cases. Since misfit tasks are commonly running alone on a cpu, more
> > aggressive active load-balancing is needed too.
> >
> > The fundamental idea of this patch set has been in Android kernels for a
> > long time and is absolutely essential for consistent performance on
> > asymmetric cpu capacity systems.
> >
>
> As already said , I'm not convinced by the proposal which seems quite
> complex and also adds some kind of arbitrary and fixed power
> management policy by deciding which tasks can or not go on big cores
> whereas there are other frameworks to take such decision like EAS or
> cgroups.
The misfit patches are a crucial part of the EAS solution but they also
make sense for some users on their own without an energy model. This is
why they are posted separately.
We have already discussed at length why the patches are needed and why
the look like they do here in this thread:
https://lore.kernel.org/lkml/CAKfTPtD4skW_3SAk--vBEC5-m1Ua48bjOQYS0pDqW3nPSpsENg@mail.gmail.com/
> Furthermore, there is already something similar in the kernel
> with SD_ASYM_PACKING and IMO, it would be better to improve this
> feature (if needed) instead of adding a new one which often do similar
> things.
As said in the previous thread, while it might look similar it isn't.
SD_ASYM_PACKING isn't utilization-based which is the key metric used for
EAS, schedutil, util_est, and util_clamp. SD_ASYM_PACKING serves a
different purpose (see previous thread for details).
> I have rerun your tests and got same results than misfit task patchset
> on my hikey960 with SD_ASYM_PACKING feature for legacy b.L topology
> and fake dynamiQ topology. And it give better performance when the
> pinned tasks are short and scheduler has to wait for the task to
> increase their utilization before getting a chance to migrate on big
> core.
Right, the test cases are quite simple and could be served better by
SD_ASYM_PACKING. As we already discussed in that thread, that is due to
the PELT lag but this the cost we have to pay if we don't have
additional information about the requirements of the task and we don't
want to default to big-first with all its implications.
We have covered all this in the thread in early April.
> Then, I have tested SD_ASYM_PACKING with EAS patchset and they work
> together for b/L and dynamiQ topology
Could you provide some more details about your evaluation? It probably
works well for some use-cases but it isn't really designed for what we
need for EAS.
Morten
On Thu, Jul 05, 2018 at 04:03:11PM +0100, Quentin Perret wrote:
> On Thursday 05 Jul 2018 at 15:13:49 (+0100), Morten Rasmussen wrote:
> > 3. Detecting the flag in generic kernel/sched/* code means that all
> > architectures will pay the for the overhead when building/rebuilding the
> > sched_domain hierarchy, and all architectures that sets the cpu
> > capacities to asymmetric will set the flag whether they like it or not.
> > I'm not sure if this is a problem.
>
> That is true as well ...
>
> >
> > In the end it is really about how much of this we want in generic code
> > and how much we hide in arch/, and if we dare to touch the sched_domain
> > build code ;-)
>
> Right so you can argue that the arch code is here to give you a
> system-level information, and that if the scheduler wants to virtually
> split that system, then it's its job to make sure that happens properly.
> That is exactly what your patch does (IIUC), and I now think that this
> is a very sensible middle-ground option. But this is debatable so I'm
> interested to see what others think :-)
I went ahead an hacked up some patches that sets the flag automatically
as part of the sched_domain build process. I posted them so people can
have a look: [email protected]
With those patches this patch has to be reverted/dropped.
Morten
Hi,
On 09/07/18 16:08, Morten Rasmussen wrote:
> On Fri, Jul 06, 2018 at 12:18:27PM +0200, Vincent Guittot wrote:
>> Hi Morten,
>>
>> On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <[email protected]> wrote:
>>> [...]
>> As already said , I'm not convinced by the proposal which seems quite
>> complex and also adds some kind of arbitrary and fixed power
>> management policy by deciding which tasks can or not go on big cores
>> whereas there are other frameworks to take such decision like EAS or
>> cgroups.
>
> The misfit patches are a crucial part of the EAS solution but they also
> make sense for some users on their own without an energy model. This is
> why they are posted separately.
>
> We have already discussed at length why the patches are needed and why
> the look like they do here in this thread:
>
> https://lore.kernel.org/lkml/CAKfTPtD4skW_3SAk--vBEC5-m1Ua48bjOQYS0pDqW3nPSpsENg@mail.gmail.com/
>
>> Furthermore, there is already something similar in the kernel
>> with SD_ASYM_PACKING and IMO, it would be better to improve this
>> feature (if needed) instead of adding a new one which often do similar
>> things.
>
> As said in the previous thread, while it might look similar it isn't.
> SD_ASYM_PACKING isn't utilization-based which is the key metric used for
> EAS, schedutil, util_est, and util_clamp. SD_ASYM_PACKING serves a
> different purpose (see previous thread for details).
>
>> I have rerun your tests and got same results than misfit task patchset
>> on my hikey960 with SD_ASYM_PACKING feature for legacy b.L topology
>> and fake dynamiQ topology. And it give better performance when the
>> pinned tasks are short and scheduler has to wait for the task to
>> increase their utilization before getting a chance to migrate on big
>> core.
>
> Right, the test cases are quite simple and could be served better by
> SD_ASYM_PACKING. As we already discussed in that thread, that is due to
> the PELT lag but this the cost we have to pay if we don't have
> additional information about the requirements of the task and we don't
> want to default to big-first with all its implications.
>
I played around with SD_ASYM_PACKING & lmbench on my HiKey960, and I think I
can bring a few more arguments to the table.
In terms of setup, I took Vincent's approach ([1]) which is to define
CPU priority as CPU capacity. As for the sched_domain flags, I initially
added SD_ASYM_PACKING to the DIE sched_domain, since it's the only level where
we want to do any ASYM_PACKING.
That causes a problem with nohz kicks, because the per-CPU sd_asym cache
(which is used to determine if we can pack stuff into nohz CPUs) is defined
as:
highest_flag_domain(cpu, SD_ASYM_PACKING);
which returns NULL because the first domain (MC) doesn't have it. That makes
sense to me as AFAICT that feature is mostly used to pack stuff on SMT levels.
It is only set at MC level in Vincent's patch, but that doesn't work for
regular big.LITTLE, so I had to set it for both MC and DIE. This does add a few
useless ops though, since all of the CPUs in a given MC domain have the same
priority.
With that out of the way, I did some lmbench runs:
> lat_mem_rd 10 1024
With ASYM_PACKING, I still see lmbench tasks remaining on LITTLE CPUs while
bigs are free, because ASYM_PACKING only does explicit active balancing on
CPU_NEWLY_IDLE balancing - otherwise it'll rely on the nr_balance_failed counter.
However, that counter can be reset before it reaches the threshold at which
active balance is done, which can lead to huge upmigration delays (almost a
full second). I also see the same kind of issues on Juno r0.
This could be resolved by extending ASYM_PACKING active balancing to
non NEWLY_IDLE cases, but then we'd be thrashing everything. That's another
argument for basing upmigration on task load-tracking signals, as we can
determine which tasks need active balancing much faster than the
nr_balance_failed counter way while not active balancing the world.
---
[1]: https://lore.kernel.org/lkml/[email protected]/
---
lmbench results are meant to be plotted, I've added some pointers to show the
big obvious anomalies. I don't really care for the actual scores, as the
resulting traces are interesting on their own, but I've included them for
the sake of completeness.
(lat_mem_rd 10 1024) with ASYM_PACKING:
0.00098 1.275
0.00195 1.274
0.00293 1.274
0.00391 1.274
0.00586 1.274
0.00781 1.275
0.00977 1.274
0.01172 1.275
0.01367 1.274
0.01562 1.274
0.01758 1.274
0.01953 1.274
0.02148 1.274
0.02344 1.274
0.02539 1.274
0.02734 1.275
0.0293 1.275
0.03125 1.275
0.03516 1.275
0.03906 1.275
0.04297 1.274
0.04688 1.275
0.05078 1.275
0.05469 1.275
0.05859 1.275
0.0625 1.275
0.07031 3.153
0.07812 4.035
0.08594 4.164
0.09375 4.237
0.10156 4.172
0.10938 4.1
0.11719 4.121
0.125 4.171
0.14062 4.073
0.15625 4.051
0.17188 4.026
0.1875 4.002
0.20312 3.973
0.21875 3.948
0.23438 3.927
0.25 3.904
0.28125 3.869
0.3125 3.86
0.34375 3.824
0.375 3.803
0.40625 3.798
0.4375 3.768
0.46875 3.784
0.5 3.753
0.5625 3.73
0.625 3.739
0.6875 3.703
0.75 3.69
0.8125 3.693
0.875 3.679
0.9375 3.686
1.0 3.664
1.125 3.656
1.25 3.658
1.375 3.638
1.5 3.635
1.625 3.628
1.75 4.274
1.875 4.579
2.0 4.651
2.25 5.313
2.5 6.314
2.75 7.585
3.0 8.457
3.25 9.045
3.5 9.532
3.75 9.909
4.0 148.66 <-----
4.5 10.191
5.0 10.222
5.5 10.208
6.0 10.21
6.5 10.21
7.0 10.199
7.5 10.203
8.0 154.354 <-----
9.0 10.163
10.0 10.138
(lat_mem_rd 10 1024) with misfit patches:
0.00098 1.273
0.00195 1.273
0.00293 1.273
0.00391 1.273
0.00586 1.274
0.00781 1.273
0.00977 1.273
0.01172 1.273
0.01367 1.273
0.01562 1.273
0.01758 1.273
0.01953 1.273
0.02148 1.273
0.02344 1.274
0.02539 1.273
0.02734 1.273
0.0293 1.273
0.03125 1.273
0.03516 1.273
0.03906 1.273
0.04297 1.273
0.04688 1.274
0.05078 1.274
0.05469 1.274
0.05859 1.274
0.0625 1.274
0.07031 3.323
0.07812 4.074
0.08594 4.171
0.09375 4.254
0.10156 4.166
0.10938 4.084
0.11719 4.088
0.125 4.112
0.14062 4.127
0.15625 4.132
0.17188 4.132
0.1875 4.131
0.20312 4.187
0.21875 4.17
0.23438 4.153
0.25 4.138
0.28125 4.102
0.3125 4.081
0.34375 4.075
0.375 4.011
0.40625 4.033
0.4375 4.021
0.46875 3.937
0.5 3.99
0.5625 3.901
0.625 3.995
0.6875 3.89
0.75 3.863
0.8125 3.903
0.875 3.883
0.9375 3.82
1.0 3.945
1.125 3.85
1.25 3.884
1.375 3.833
1.5 4.89
1.625 4.834
1.75 5.041
1.875 5.054
2.0 5.38
2.25 5.752
2.5 6.805
2.75 7.516
3.0 8.545
3.25 9.115
3.5 9.525
3.75 9.871
4.0 10.017
4.5 10.177
5.0 10.201
5.5 10.209
6.0 10.204
6.5 10.18
7.0 10.19
7.5 10.171
8.0 10.166
9.0 10.164
10.0 10.166
On 07/26/2018 07:14 PM, Valentin Schneider wrote:
> Hi,
>
> On 09/07/18 16:08, Morten Rasmussen wrote:
>> On Fri, Jul 06, 2018 at 12:18:27PM +0200, Vincent Guittot wrote:
>>> Hi Morten,
>>>
>>> On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <[email protected]> wrote:
[...]
> With that out of the way, I did some lmbench runs:
>> lat_mem_rd 10 1024
>
> With ASYM_PACKING, I still see lmbench tasks remaining on LITTLE CPUs while
> bigs are free, because ASYM_PACKING only does explicit active balancing on
> CPU_NEWLY_IDLE balancing - otherwise it'll rely on the nr_balance_failed counter.
>
> However, that counter can be reset before it reaches the threshold at which
> active balance is done, which can lead to huge upmigration delays (almost a
> full second). I also see the same kind of issues on Juno r0.
>
> This could be resolved by extending ASYM_PACKING active balancing to
> non NEWLY_IDLE cases, but then we'd be thrashing everything. That's another
> argument for basing upmigration on task load-tracking signals, as we can
> determine which tasks need active balancing much faster than the
> nr_balance_failed counter way while not active balancing the world.
The task layout of the test looks like n=85 always running tasks (each
for ~ 1.25ms on big or little) and they all get created and run one
after the other. So on a big cpu, their util values go from 512 to 1024
and from 223 to 446 on little cpu (Juno board). Latter thanks to
Quentin's 'sched/fair: Fix util_avg of new tasks for asymmetric systems'.
root@juno:~# cat /sys/devices/system/cpu/cpu[01]/cpu_capacity
446
1024
> (lat_mem_rd 10 1024) with ASYM_PACKING:
...
> 4.0 148.66 <-----
> 4.5 10.191
...
> 7.5 10.203
> 8.0 154.354 <-----
I ran the test affine to big, little and all cpus on tip/sched/core w/o
ASYM_PACKING or Misfit:
cputype: big little all
cpumask: 0x06 0x39 0xff
mem size <---- latency ---->
0.00098 3.668 3.595 3.669
0.00195 3.668 3.594 3.594
0.00293 3.668 3.593 3.595
0.00391 3.669 3.596 3.595
...
3.75000 58.687 121.934 122.293
4.00000 57.054 121.771 120.489
4.50000 56.914 121.851 56.729
5.00000 57.347 121.777 56.975
5.50000 57.705 121.738 68.981
6.00000 57.935 121.728 57.542
6.50000 58.119 121.694 121.799
7.00000 58.194 121.502 57.844
7.50000 58.258 121.684 58.050
8.00000 58.293 121.725 58.030
9.00000 58.309 121.793 58.188
10.00000 58.561 122.252 122.078
There is no diff between big and little cpus with small memory sizes,
just with the MB range.
If I look into the trace for 'all' it turns out that their are cases in
which, even if the task only run for ~15% of the time on big, the
latency value is printed as when it was running affine to big. So using
the latency value as an indicator where the task was scheduled is IMHO
not really possible.
Combined with that SD_ASYM.. rework I ended up with the below.
Holler if you want it changed :-)
---
Subject: sched: Add static_key for asymmetric cpu capacity optimizations
From: Morten Rasmussen <[email protected]>
Date: Wed, 4 Jul 2018 11:17:39 +0100
The existing asymmetric cpu capacity code should cause minimal overhead
for others. Putting it behind a static_key, it has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 3 +++
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 9 ++++++++-
3 files changed, 12 insertions(+), 1 deletion(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6186,6 +6186,9 @@ static int wake_cap(struct task_struct *
{
long min_cap, max_cap;
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return 0;
+
min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1185,6 +1185,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+extern struct static_key_false sched_asym_cpucapacity;
struct sched_group_capacity {
atomic_t ref;
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -398,6 +398,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
static void update_top_cache_domain(int cpu)
{
@@ -1708,6 +1709,7 @@ build_sched_domains(const struct cpumask
struct rq *rq = NULL;
int i, ret = -ENOMEM;
struct sched_domain_topology_level *tl_asym;
+ bool has_asym = false;
alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
if (alloc_state != sa_rootdomain)
@@ -1723,8 +1725,10 @@ build_sched_domains(const struct cpumask
for_each_sd_topology(tl) {
int dflags = 0;
- if (tl == tl_asym)
+ if (tl == tl_asym) {
dflags |= SD_ASYM_CPUCAPACITY;
+ has_asym = true;
+ }
sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);
@@ -1776,6 +1780,9 @@ build_sched_domains(const struct cpumask
}
rcu_read_unlock();
+ if (has_asym)
+ static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
+
if (rq && sched_debug_enabled) {
pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
Aside from the first patch, which I posted the change on, I've picked up
until 10. I think that other SD_ASYM patch-set replaces 11 and 12,
right?
Hi Peter,
On 31/07/18 13:00, Peter Zijlstra wrote:
>
>
> Aside from the first patch, which I posted the change on, I've picked up
> until 10. I think that other SD_ASYM patch-set replaces 11 and 12,
> right?
>
11 is no longer needed, but AFAICT we still need 12 - we don't want
PREFER_SIBLING to interfere with asymmetric systems.
Also, I've been playing around with a patch to modify 5 (the nohz kicks) that
first scans the capacity of nohz CPUs before doing any kick (pretty much like
what asym packing is doing with nohz CPU priority), do you want me to toss
it your way? I haven't done run any tests with it yet...
Cheers,
Valentin
On Mon, 9 Jul 2018 at 17:08, Morten Rasmussen <[email protected]> wrote:
>
> On Fri, Jul 06, 2018 at 12:18:27PM +0200, Vincent Guittot wrote:
> > Hi Morten,
> >
> > On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <[email protected]> wrote:
> > >
> > > On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
> > > for performance that cpu intensive tasks are aggressively migrated to
> > > high capacity cpus as soon as those become available. The capacity
> > > awareness tweaks already in the wake-up path can't handle this as such
> > > tasks might run or be runnable forever. If they happen to be placed on a
> > > low capacity cpu from the beginning they are stuck there forever while
> > > high capacity cpus may have become available in the meantime.
> > >
> > > To address this issue this patch set introduces a new "misfit"
> > > load-balancing scenario in periodic/nohz/newly idle balance which tweaks
> > > the load-balance conditions to ignore load per capacity in certain
> > > cases. Since misfit tasks are commonly running alone on a cpu, more
> > > aggressive active load-balancing is needed too.
> > >
> > > The fundamental idea of this patch set has been in Android kernels for a
> > > long time and is absolutely essential for consistent performance on
> > > asymmetric cpu capacity systems.
> > >
> >
> > As already said , I'm not convinced by the proposal which seems quite
> > complex and also adds some kind of arbitrary and fixed power
> > management policy by deciding which tasks can or not go on big cores
> > whereas there are other frameworks to take such decision like EAS or
> > cgroups.
>
> The misfit patches are a crucial part of the EAS solution but they also
EAS needs the scheduler to move long running task on big core
(especially when overloaded), and the misfit task is just one proposal
> make sense for some users on their own without an energy model. This is
> why they are posted separately.
>
> We have already discussed at length why the patches are needed and why
> the look like they do here in this thread:
>
> https://lore.kernel.org/lkml/CAKfTPtD4skW_3SAk--vBEC5-m1Ua48bjOQYS0pDqW3nPSpsENg@mail.gmail.com/
>
> > Furthermore, there is already something similar in the kernel
> > with SD_ASYM_PACKING and IMO, it would be better to improve this
> > feature (if needed) instead of adding a new one which often do similar
> > things.
>
> As said in the previous thread, while it might look similar it isn't.
> SD_ASYM_PACKING isn't utilization-based which is the key metric used for
> EAS, schedutil, util_est, and util_clamp. SD_ASYM_PACKING serves a
> different purpose (see previous thread for details).
>
> > I have rerun your tests and got same results than misfit task patchset
> > on my hikey960 with SD_ASYM_PACKING feature for legacy b.L topology
> > and fake dynamiQ topology. And it give better performance when the
> > pinned tasks are short and scheduler has to wait for the task to
> > increase their utilization before getting a chance to migrate on big
> > core.
>
> Right, the test cases are quite simple and could be served better by
> SD_ASYM_PACKING. As we already discussed in that thread, that is due to
> the PELT lag but this the cost we have to pay if we don't have
> additional information about the requirements of the task and we don't
> want to default to big-first with all its implications.
>
> We have covered all this in the thread in early April.
>
> > Then, I have tested SD_ASYM_PACKING with EAS patchset and they work
> > together for b/L and dynamiQ topology
>
> Could you provide some more details about your evaluation? It probably
> works well for some use-cases but it isn't really designed for what we
> need for EAS.
>
> Morten
On Mon, 30 Jul 2018 at 16:30, Dietmar Eggemann <[email protected]> wrote:
>
> On 07/26/2018 07:14 PM, Valentin Schneider wrote:
> > Hi,
> >
> > On 09/07/18 16:08, Morten Rasmussen wrote:
> >> On Fri, Jul 06, 2018 at 12:18:27PM +0200, Vincent Guittot wrote:
> >>> Hi Morten,
> >>>
> >>> On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <[email protected]> wrote:
>
> [...]
>
> > With that out of the way, I did some lmbench runs:
> >> lat_mem_rd 10 1024
> >
> > With ASYM_PACKING, I still see lmbench tasks remaining on LITTLE CPUs while
> > bigs are free, because ASYM_PACKING only does explicit active balancing on
> > CPU_NEWLY_IDLE balancing - otherwise it'll rely on the nr_balance_failed counter.
> >
> > However, that counter can be reset before it reaches the threshold at which
> > active balance is done, which can lead to huge upmigration delays (almost a
> > full second). I also see the same kind of issues on Juno r0.
> >
> > This could be resolved by extending ASYM_PACKING active balancing to
> > non NEWLY_IDLE cases, but then we'd be thrashing everything. That's another
> > argument for basing upmigration on task load-tracking signals, as we can
> > determine which tasks need active balancing much faster than the
> > nr_balance_failed counter way while not active balancing the world.
>
> The task layout of the test looks like n=85 always running tasks (each
> for ~ 1.25ms on big or little) and they all get created and run one
How mistfit task can make a difference for a benchmark which uses 1.25ms tasks ?
> after the other. So on a big cpu, their util values go from 512 to 1024
> and from 223 to 446 on little cpu (Juno board). Latter thanks to
> Quentin's 'sched/fair: Fix util_avg of new tasks for asymmetric systems'.
>
> root@juno:~# cat /sys/devices/system/cpu/cpu[01]/cpu_capacity
> 446
> 1024
>
> > (lat_mem_rd 10 1024) with ASYM_PACKING:
>
> ...
> > 4.0 148.66 <-----
> > 4.5 10.191
> ...
> > 7.5 10.203
> > 8.0 154.354 <-----
>
> I ran the test affine to big, little and all cpus on tip/sched/core w/o
> ASYM_PACKING or Misfit:
>
> cputype: big little all
> cpumask: 0x06 0x39 0xff
>
> mem size <---- latency ---->
>
> 0.00098 3.668 3.595 3.669
> 0.00195 3.668 3.594 3.594
> 0.00293 3.668 3.593 3.595
> 0.00391 3.669 3.596 3.595
> ...
> 3.75000 58.687 121.934 122.293
> 4.00000 57.054 121.771 120.489
> 4.50000 56.914 121.851 56.729
> 5.00000 57.347 121.777 56.975
> 5.50000 57.705 121.738 68.981
> 6.00000 57.935 121.728 57.542
> 6.50000 58.119 121.694 121.799
> 7.00000 58.194 121.502 57.844
> 7.50000 58.258 121.684 58.050
> 8.00000 58.293 121.725 58.030
> 9.00000 58.309 121.793 58.188
> 10.00000 58.561 122.252 122.078
>
> There is no diff between big and little cpus with small memory sizes,
> just with the MB range.
> If I look into the trace for 'all' it turns out that their are cases in
> which, even if the task only run for ~15% of the time on big, the
> latency value is printed as when it was running affine to big. So using
> the latency value as an indicator where the task was scheduled is IMHO
> not really possible.
On 07/31/2018 02:13 PM, Vincent Guittot wrote:
> On Mon, 30 Jul 2018 at 16:30, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 07/26/2018 07:14 PM, Valentin Schneider wrote:
[...]
>> The task layout of the test looks like n=85 always running tasks (each
>> for ~ 1.25ms on big or little) and they all get created and run one
>
> How mistfit task can make a difference for a benchmark which uses 1.25ms tasks ?
Ah, sorry! This was a typo. It should be ~ 1.25s.
On Fri, 6 Jul 2018 at 16:31, Morten Rasmussen <[email protected]> wrote:
>
> On Fri, Jul 06, 2018 at 12:18:17PM +0200, Vincent Guittot wrote:
> > On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <[email protected]> wrote:
> > >
> > > The 'prefer sibling' sched_domain flag is intended to encourage
> > > spreading tasks to sibling sched_domain to take advantage of more caches
> > > and core for SMT systems. It has recently been changed to be on all
> > > non-NUMA topology level. However, spreading across domains with cpu
> > > capacity asymmetry isn't desirable, e.g. spreading from high capacity to
> > > low capacity cpus even if high capacity cpus aren't overutilized might
> > > give access to more cache but the cpu will be slower and possibly lead
> > > to worse overall throughput.
> > >
> > > To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
> > > level immediately below SD_ASYM_CPUCAPACITY.
> >
> > This makes sense. Nevertheless, this patch also raises a scheduling
> > problem and break the 1 task per CPU policy that is enforced by
> > SD_PREFER_SIBLING.
>
> Scheduling one task per cpu when n_task == n_cpus on asymmetric
> topologies is generally broken already and this patch set doesn't fix
> that problem.
>
> SD_PREFER_SIBLING might seem to help in very specific cases:
> n_litte_cpus == n_big_cpus. In that case the little group might
> classified as overloaded. It doesn't guarantee that anything gets pulled
> as the grp_load/grp_capacity in the imbalance calculation on some system
> still says the little cpus are more loaded than the bigs despite one of
> them being idle. That depends on the little cpu capacities.
>
> On systems where n_little_cpus != n_big_cpus SD_PREFER_SIBLING is broken
> as it assumes the group_weight to be the same. This is the case on Juno
> and several other platforms.
>
> IMHO, SD_PREFER_SIBLING isn't the solution to this problem. It might
I agree but this patchset creates a regression in the scheduling behavior
> help for a limited subset of topologies/capacities but the right
> solution is to change the imbalance calculation. As the name says, it is
Yes that what does the prototype that I came with.
> meant to spread tasks and does so unconditionally. For asymmetric
> systems we would like to consider cpu capacity before migrating tasks.
>
> > When running the tests of your cover letter, 1 long
> > running task is often co scheduled on a big core whereas short pinned
> > tasks are still running and a little core is idle which is not an
> > optimal scheduling decision
>
> This can easily happen with SD_PREFER_SIBLING enabled too so I wouldn't
> say that this patch breaks anything that isn't broken already. In fact
> we this happening with and without this patch applied.
At least for the use case above, this doesn't happen when
SD_PREFER_SIBLING is set
>
> Morten
Hi,
On 31/07/18 13:17, Vincent Guittot wrote:
> On Fri, 6 Jul 2018 at 16:31, Morten Rasmussen <[email protected]> wrote:
>>
>> On Fri, Jul 06, 2018 at 12:18:17PM +0200, Vincent Guittot wrote:
>>> [...]
>>
>> Scheduling one task per cpu when n_task == n_cpus on asymmetric
>> topologies is generally broken already and this patch set doesn't fix
>> that problem.
>>
>> SD_PREFER_SIBLING might seem to help in very specific cases:
>> n_litte_cpus == n_big_cpus. In that case the little group might
>> classified as overloaded. It doesn't guarantee that anything gets pulled
>> as the grp_load/grp_capacity in the imbalance calculation on some system
>> still says the little cpus are more loaded than the bigs despite one of
>> them being idle. That depends on the little cpu capacities.
>>
>> On systems where n_little_cpus != n_big_cpus SD_PREFER_SIBLING is broken
>> as it assumes the group_weight to be the same. This is the case on Juno
>> and several other platforms.
>>
>> IMHO, SD_PREFER_SIBLING isn't the solution to this problem. It might
>
> I agree but this patchset creates a regression in the scheduling behavior
>
>> help for a limited subset of topologies/capacities but the right
>> solution is to change the imbalance calculation. As the name says, it is
>
> Yes that what does the prototype that I came with.
>
>> meant to spread tasks and does so unconditionally. For asymmetric
>> systems we would like to consider cpu capacity before migrating tasks.
>>
>>> When running the tests of your cover letter, 1 long
>>> running task is often co scheduled on a big core whereas short pinned
>>> tasks are still running and a little core is idle which is not an
>>> optimal scheduling decision
>>
>> This can easily happen with SD_PREFER_SIBLING enabled too so I wouldn't
>> say that this patch breaks anything that isn't broken already. In fact
>> we this happening with and without this patch applied.
>
> At least for the use case above, this doesn't happen when
> SD_PREFER_SIBLING is set
>
On my HiKey960 I can see coscheduling on a big CPU while a LITTLE is free
with **and** without SD_PREFER_SIBLING. Having it set only means that in
some cases the imbalance will be re-classified as group_overloaded instead
of group_misfit_task, so we'll skip the misfit logic when we shouldn't (this
happens on Juno for instance).
It does nothing for the "1 task per CPU" problem that Morten described above.
When you have this little amount of tasks, load isn't very relevant, but it
skews the load-balancer into thinking the LITTLE CPUs are more busy than
the bigs even though there's an idle one in the lot.
>>
>> Morten
On Tue, Jul 31, 2018 at 01:10:49PM +0100, Valentin Schneider wrote:
> Hi Peter,
>
> On 31/07/18 13:00, Peter Zijlstra wrote:
> >
> >
> > Aside from the first patch, which I posted the change on, I've picked up
> > until 10. I think that other SD_ASYM patch-set replaces 11 and 12,
> > right?
> >
> 11 is no longer needed, but AFAICT we still need 12 - we don't want
> PREFER_SIBLING to interfere with asymmetric systems.
Yes, we still want patch 12 if possible.
On Tue, Jul 31, 2018 at 12:59:16PM +0200, Peter Zijlstra wrote:
>
> Combined with that SD_ASYM.. rework I ended up with the below.
>
> Holler if you want it changed :-)
Looks good to me.
Thanks,
Morten
Hi Valentin,
On Tue, 31 Jul 2018 at 14:33, Valentin Schneider
<[email protected]> wrote:
>
> Hi,
>
> On 31/07/18 13:17, Vincent Guittot wrote:
> > On Fri, 6 Jul 2018 at 16:31, Morten Rasmussen <[email protected]> wrote:
> >>
> >> On Fri, Jul 06, 2018 at 12:18:17PM +0200, Vincent Guittot wrote:
> >>> [...]
> >>
> >> Scheduling one task per cpu when n_task == n_cpus on asymmetric
> >> topologies is generally broken already and this patch set doesn't fix
> >> that problem.
> >>
> >> SD_PREFER_SIBLING might seem to help in very specific cases:
> >> n_litte_cpus == n_big_cpus. In that case the little group might
> >> classified as overloaded. It doesn't guarantee that anything gets pulled
> >> as the grp_load/grp_capacity in the imbalance calculation on some system
> >> still says the little cpus are more loaded than the bigs despite one of
> >> them being idle. That depends on the little cpu capacities.
> >>
> >> On systems where n_little_cpus != n_big_cpus SD_PREFER_SIBLING is broken
> >> as it assumes the group_weight to be the same. This is the case on Juno
> >> and several other platforms.
> >>
> >> IMHO, SD_PREFER_SIBLING isn't the solution to this problem. It might
> >
> > I agree but this patchset creates a regression in the scheduling behavior
> >
> >> help for a limited subset of topologies/capacities but the right
> >> solution is to change the imbalance calculation. As the name says, it is
> >
> > Yes that what does the prototype that I came with.
> >
> >> meant to spread tasks and does so unconditionally. For asymmetric
> >> systems we would like to consider cpu capacity before migrating tasks.
> >>
> >>> When running the tests of your cover letter, 1 long
> >>> running task is often co scheduled on a big core whereas short pinned
> >>> tasks are still running and a little core is idle which is not an
> >>> optimal scheduling decision
> >>
> >> This can easily happen with SD_PREFER_SIBLING enabled too so I wouldn't
> >> say that this patch breaks anything that isn't broken already. In fact
> >> we this happening with and without this patch applied.
> >
> > At least for the use case above, this doesn't happen when
> > SD_PREFER_SIBLING is set
> >
>
> On my HiKey960 I can see coscheduling on a big CPU while a LITTLE is free
> with **and** without SD_PREFER_SIBLING. Having it set only means that in
> some cases the imbalance will be re-classified as group_overloaded instead
> of group_misfit_task, so we'll skip the misfit logic when we shouldn't (this
> happens on Juno for instance).
Can you give more details about your test case ?
>
> It does nothing for the "1 task per CPU" problem that Morten described above.
> When you have this little amount of tasks, load isn't very relevant, but it
> skews the load-balancer into thinking the LITTLE CPUs are more busy than
> the bigs even though there's an idle one in the lot.
>
> >>
> >> Morten
Hi,
On 06/08/18 11:20, Vincent Guittot wrote:
> Hi Valentin,
>
> On Tue, 31 Jul 2018 at 14:33, Valentin Schneider
> <[email protected]> wrote:
>>
>> Hi,
>>
>> On 31/07/18 13:17, Vincent Guittot wrote:
>> [...]
>>>>
>>>> This can easily happen with SD_PREFER_SIBLING enabled too so I wouldn't
>>>> say that this patch breaks anything that isn't broken already. In fact
>>>> we this happening with and without this patch applied.
>>>
>>> At least for the use case above, this doesn't happen when
>>> SD_PREFER_SIBLING is set
>>>
>>
>> On my HiKey960 I can see coscheduling on a big CPU while a LITTLE is free
>> with **and** without SD_PREFER_SIBLING. Having it set only means that in
>> some cases the imbalance will be re-classified as group_overloaded instead
>> of group_misfit_task, so we'll skip the misfit logic when we shouldn't (this
>> happens on Juno for instance).
>
> Can you give more details about your test case ?
>
I've been running the same test case as presented in the cover letter on
my HiKey960 but with sched_switch tracing and with no tasksets. I've just
re-run the testcase with tasksets and I get similar results (i.e. a big
with coscheduling while a LITTLE is free) with or without the flag.
>>
>> It does nothing for the "1 task per CPU" problem that Morten described above.
>> When you have this little amount of tasks, load isn't very relevant, but it
>> skews the load-balancer into thinking the LITTLE CPUs are more busy than
>> the bigs even though there's an idle one in the lot.
>>
>>>>
>>>> Morten
On Mon, 6 Aug 2018 at 12:53, Valentin Schneider
<[email protected]> wrote:
>
> Hi,
>
> On 06/08/18 11:20, Vincent Guittot wrote:
> > Hi Valentin,
> >
> > On Tue, 31 Jul 2018 at 14:33, Valentin Schneider
> > <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> On 31/07/18 13:17, Vincent Guittot wrote:
> >> [...]
> >>>>
> >>>> This can easily happen with SD_PREFER_SIBLING enabled too so I wouldn't
> >>>> say that this patch breaks anything that isn't broken already. In fact
> >>>> we this happening with and without this patch applied.
> >>>
> >>> At least for the use case above, this doesn't happen when
> >>> SD_PREFER_SIBLING is set
> >>>
> >>
> >> On my HiKey960 I can see coscheduling on a big CPU while a LITTLE is free
> >> with **and** without SD_PREFER_SIBLING. Having it set only means that in
> >> some cases the imbalance will be re-classified as group_overloaded instead
> >> of group_misfit_task, so we'll skip the misfit logic when we shouldn't (this
> >> happens on Juno for instance).
> >
> > Can you give more details about your test case ?
> >
>
> I've been running the same test case as presented in the cover letter on
> my HiKey960 but with sched_switch tracing and with no tasksets. I've just
> re-run the testcase with tasksets and I get similar results (i.e. a big
> with coscheduling while a LITTLE is free) with or without the flag.
>
> >>
> >> It does nothing for the "1 task per CPU" problem that Morten described above.
> >> When you have this little amount of tasks, load isn't very relevant, but it
> >> skews the load-balancer into thinking the LITTLE CPUs are more busy than
> >> the bigs even though there's an idle one in the lot.
ok. I meant that without misfit patch, SD_PREFER_SIBLING ensures 1
task per CPU and no co scheduling. But co scheduling appears with
misfit task patchset
Vincent
> >>
> >>>>
> >>>> Morten
On Wed, Jul 04, 2018 at 11:17:38AM +0100, Morten Rasmussen wrote:
> On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
> for performance that cpu intensive tasks are aggressively migrated to
> high capacity cpus as soon as those become available. The capacity
> awareness tweaks already in the wake-up path can't handle this as such
> tasks might run or be runnable forever. If they happen to be placed on a
> low capacity cpu from the beginning they are stuck there forever while
> high capacity cpus may have become available in the meantime.
>
> To address this issue this patch set introduces a new "misfit"
> load-balancing scenario in periodic/nohz/newly idle balance which tweaks
> the load-balance conditions to ignore load per capacity in certain
> cases. Since misfit tasks are commonly running alone on a cpu, more
> aggressive active load-balancing is needed too.
>
> The fundamental idea of this patch set has been in Android kernels for a
> long time and is absolutely essential for consistent performance on
> asymmetric cpu capacity systems.
>
> The patches have been tested on:
> 1. Arm Juno (r0): 2+4 Cortex A57/A53
> 2. Hikey960: 4+4 Cortex A73/A53
>
Thanks for posting these, we have been carrying it in the Android Common
Kernel for some time. They have been useful for Android systems.
Reviewed-by: Joel Fernandes (Google) <[email protected]>
thanks,
- Joel
Hi,
> -----Original Message-----
> From: Morten Rasmussen <[email protected]>
> Sent: Wednesday, July 4, 2018 7:18 PM
> To: [email protected]; [email protected]
> Cc: [email protected]; [email protected]; [email protected]; Gaku Inami
> <[email protected]>; [email protected]; Morten Rasmussen <[email protected]>
> Subject: [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
[snip]
> The patches have been tested on:
> 1. Arm Juno (r0): 2+4 Cortex A57/A53
> 2. Hikey960: 4+4 Cortex A73/A53
>
> Test case:
> Big cpus are always kept busy. Pin a shorter running sysbench tasks to
> big cpus, while creating a longer running set of unpinned sysbench
> tasks.
I have tested v4 patches on Renesas SoC again. It looks fine.
You can add:
Tested-by: Gaku Inami <[email protected]>
The patches have been tested on:
3. Renesas R-Car H3 : 4+4 Cortex A57/A53
Results:
Single runs with completion time of each task
R-Car H3 (tip)
total time: 0.9488s
total time: 0.9766s
total time: 1.3243s
total time: 1.6740s
R-Car H3 (misfit)
total time: 0.9313s
total time: 0.9214s
total time: 0.9493s
total time: 0.9606s
10 run summary (tracking longest running task for each run)
R-Car H3
avg max
tip 1.6744 1.6780
misfit 0.9616 1.0290
Also, I confirmed that these patches bring performance improvement
to some benchmarks(UnixBench, LMbench). So they have very useful
for Renesas SoC.
[snip]
Regards,
Inami
Commit-ID: df054e8445a4011e3d693c2268129c0456108663
Gitweb: https://git.kernel.org/tip/df054e8445a4011e3d693c2268129c0456108663
Author: Morten Rasmussen <[email protected]>
AuthorDate: Wed, 4 Jul 2018 11:17:39 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 10 Sep 2018 11:05:48 +0200
sched/topology: Add static_key for asymmetric CPU capacity optimizations
The existing asymmetric CPU capacity code should cause minimal overhead
for others. Putting it behind a static_key, it has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 3 +++
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 9 ++++++++-
3 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f808ddf2a868..3e5071aeb117 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,6 +6188,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
{
long min_cap, max_cap;
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return 0;
+
min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a2e8cae63c4..0f36adc31ba5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1185,6 +1185,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+extern struct static_key_false sched_asym_cpucapacity;
struct sched_group_capacity {
atomic_t ref;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5c4d583d53ee..b0cdf5e95bda 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -398,6 +398,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
static void update_top_cache_domain(int cpu)
{
@@ -1705,6 +1706,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
struct rq *rq = NULL;
int i, ret = -ENOMEM;
struct sched_domain_topology_level *tl_asym;
+ bool has_asym = false;
alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
if (alloc_state != sa_rootdomain)
@@ -1720,8 +1722,10 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
for_each_sd_topology(tl) {
int dflags = 0;
- if (tl == tl_asym)
+ if (tl == tl_asym) {
dflags |= SD_ASYM_CPUCAPACITY;
+ has_asym = true;
+ }
sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);
@@ -1773,6 +1777,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
}
rcu_read_unlock();
+ if (has_asym)
+ static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
+
if (rq && sched_debug_enabled) {
pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
Commit-ID: e3d6d0cb66f2351cbfd09fbae04eb9804afe9577
Gitweb: https://git.kernel.org/tip/e3d6d0cb66f2351cbfd09fbae04eb9804afe9577
Author: Morten Rasmussen <[email protected]>
AuthorDate: Wed, 4 Jul 2018 11:17:41 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 10 Sep 2018 11:05:49 +0200
sched/fair: Add sched_group per-CPU max capacity
The current sg->min_capacity tracks the lowest per-CPU compute capacity
available in the sched_group when rt/irq pressure is taken into account.
Minimum capacity isn't the ideal metric for tracking if a sched_group
needs offloading to another sched_group for some scenarios, e.g. a
sched_group with multiple CPUs if only one is under heavy pressure.
Tracking maximum capacity isn't perfect either but a better choice for
some situations as it indicates that the sched_group definitely compute
capacity constrained either due to rt/irq pressure on all CPUs or
asymmetric CPU capacities (e.g. big.LITTLE).
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 24 ++++++++++++++++++++----
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 2 ++
3 files changed, 23 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e04bea5b11a..fe04315d57b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7557,13 +7557,14 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
cpu_rq(cpu)->cpu_capacity = capacity;
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = capacity;
+ sdg->sgc->max_capacity = capacity;
}
void update_group_capacity(struct sched_domain *sd, int cpu)
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
- unsigned long capacity, min_capacity;
+ unsigned long capacity, min_capacity, max_capacity;
unsigned long interval;
interval = msecs_to_jiffies(sd->balance_interval);
@@ -7577,6 +7578,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
capacity = 0;
min_capacity = ULONG_MAX;
+ max_capacity = 0;
if (child->flags & SD_OVERLAP) {
/*
@@ -7607,6 +7609,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
}
min_capacity = min(capacity, min_capacity);
+ max_capacity = max(capacity, max_capacity);
}
} else {
/*
@@ -7620,12 +7623,14 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
capacity += sgc->capacity;
min_capacity = min(sgc->min_capacity, min_capacity);
+ max_capacity = max(sgc->max_capacity, max_capacity);
group = group->next;
} while (group != child->groups);
}
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = min_capacity;
+ sdg->sgc->max_capacity = max_capacity;
}
/*
@@ -7721,16 +7726,27 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
}
/*
- * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
* per-CPU capacity than sched_group ref.
*/
static inline bool
-group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
{
return sg->sgc->min_capacity * capacity_margin <
ref->sgc->min_capacity * 1024;
}
+/*
+ * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-CPU capacity_orig than sched_group ref.
+ */
+static inline bool
+group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+ return sg->sgc->max_capacity * capacity_margin <
+ ref->sgc->max_capacity * 1024;
+}
+
static inline enum
group_type group_classify(struct sched_group *group,
struct sg_lb_stats *sgs)
@@ -7876,7 +7892,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* power/energy consequences are not considered.
*/
if (sgs->sum_nr_running <= sgs->group_weight &&
- group_smaller_cpu_capacity(sds->local, sg))
+ group_smaller_min_cpu_capacity(sds->local, sg))
return false;
asym_packing:
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7dbf67d147a2..fe17e0be2d7b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1197,6 +1197,7 @@ struct sched_group_capacity {
*/
unsigned long capacity;
unsigned long min_capacity; /* Min per-CPU capacity in group */
+ unsigned long max_capacity; /* Max per-CPU capacity in group */
unsigned long next_update;
int imbalance; /* XXX unrelated to capacity but shared group state */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b0cdf5e95bda..2536e1b938f9 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -693,6 +693,7 @@ static void init_overlap_sched_group(struct sched_domain *sd,
sg_span = sched_group_span(sg);
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+ sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
}
static int
@@ -852,6 +853,7 @@ static struct sched_group *get_group(int cpu, struct sd_data *sdd)
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_span(sg));
sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+ sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
return sg;
}
Commit-ID: 3b1baa6496e6b7ad016342a9d256bdfb072ce902
Gitweb: https://git.kernel.org/tip/3b1baa6496e6b7ad016342a9d256bdfb072ce902
Author: Morten Rasmussen <[email protected]>
AuthorDate: Wed, 4 Jul 2018 11:17:40 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 10 Sep 2018 11:05:49 +0200
sched/fair: Add 'group_misfit_task' load-balance type
To maximize throughput in systems with asymmetric CPU capacities (e.g.
ARM big.LITTLE) load-balancing has to consider task and CPU utilization
as well as per-CPU compute capacity when load-balancing in addition to
the current average load based load-balancing policy. Tasks with high
utilization that are scheduled on a lower capacity CPU need to be
identified and migrated to a higher capacity CPU if possible to maximize
throughput.
To implement this additional policy an additional group_type
(load-balance scenario) is added: 'group_misfit_task'. This represents
scenarios where a sched_group has one or more tasks that are not
suitable for its per-CPU capacity. 'group_misfit_task' is only considered
if the system is not overloaded or imbalanced ('group_imbalanced' or
'group_overloaded').
Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each CPU is responsible for tracking misfit tasks themselves and update
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on sched_tick.
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++++--------
kernel/sched/sched.h | 2 ++
2 files changed, 48 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e5071aeb117..6e04bea5b11a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -693,6 +693,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
static unsigned long task_h_load(struct task_struct *p);
+static unsigned long capacity_of(int cpu);
/* Give new sched_entity start runnable values to heavy its load in infant time */
void init_entity_runnable_average(struct sched_entity *se)
@@ -1446,7 +1447,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
static unsigned long weighted_cpuload(struct rq *rq);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
-static unsigned long capacity_of(int cpu);
/* Cached statistics for all CPUs within a node */
struct numa_stats {
@@ -3647,6 +3647,29 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
WRITE_ONCE(p->se.avg.util_est, ue);
}
+static inline int task_fits_capacity(struct task_struct *p, long capacity)
+{
+ return capacity * 1024 > task_util_est(p) * capacity_margin;
+}
+
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
+{
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return;
+
+ if (!p) {
+ rq->misfit_task_load = 0;
+ return;
+ }
+
+ if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
+ rq->misfit_task_load = 0;
+ return;
+ }
+
+ rq->misfit_task_load = task_h_load(p);
+}
+
#else /* CONFIG_SMP */
#define UPDATE_TG 0x0
@@ -3676,6 +3699,7 @@ util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
static inline void
util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
bool task_sleep) {}
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
#endif /* CONFIG_SMP */
@@ -6201,7 +6225,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
/* Bring task utilization in sync with prev_cpu */
sync_entity_load_avg(&p->se);
- return min_cap * 1024 < task_util(p) * capacity_margin;
+ return !task_fits_capacity(p, min_cap);
}
/*
@@ -6618,9 +6642,12 @@ done: __maybe_unused;
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);
+ update_misfit_status(p, rq);
+
return p;
idle:
+ update_misfit_status(NULL, rq);
new_tasks = idle_balance(rq, rf);
/*
@@ -6826,6 +6853,13 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
enum fbq_type { regular, remote, all };
+enum group_type {
+ group_other = 0,
+ group_misfit_task,
+ group_imbalanced,
+ group_overloaded,
+};
+
#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
#define LBF_DST_PINNED 0x04
@@ -7399,12 +7433,6 @@ static unsigned long task_h_load(struct task_struct *p)
/********** Helpers for find_busiest_group ************************/
-enum group_type {
- group_other = 0,
- group_imbalanced,
- group_overloaded,
-};
-
/*
* sg_lb_stats - stats of a sched_group required for load_balancing
*/
@@ -7420,6 +7448,7 @@ struct sg_lb_stats {
unsigned int group_weight;
enum group_type group_type;
int group_no_capacity;
+ unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -7712,6 +7741,9 @@ group_type group_classify(struct sched_group *group,
if (sg_imbalanced(group))
return group_imbalanced;
+ if (sgs->group_misfit_task_load)
+ return group_misfit_task;
+
return group_other;
}
@@ -7786,6 +7818,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
*/
if (!nr_running && idle_cpu(i))
sgs->idle_cpus++;
+
+ if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ sgs->group_misfit_task_load < rq->misfit_task_load)
+ sgs->group_misfit_task_load = rq->misfit_task_load;
}
/* Adjust by relative CPU capacity of the group */
@@ -9567,6 +9603,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
+
+ update_misfit_status(curr, rq);
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0f36adc31ba5..7dbf67d147a2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -842,6 +842,8 @@ struct rq {
unsigned char idle_balance;
+ unsigned long misfit_task_load;
+
/* For active balancing */
int active_balance;
int push_cpu;
Commit-ID: cad68e552e7774b68ae6a2c5fedb792936098b72
Gitweb: https://git.kernel.org/tip/cad68e552e7774b68ae6a2c5fedb792936098b72
Author: Morten Rasmussen <[email protected]>
AuthorDate: Wed, 4 Jul 2018 11:17:42 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 10 Sep 2018 11:05:50 +0200
sched/fair: Consider misfit tasks when load-balancing
On asymmetric CPU capacity systems load intensive tasks can end up on
CPUs that don't suit their compute demand. In this scenarios 'misfit'
tasks should be migrated to CPUs with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.
Misfit balancing only makes sense between a source group of lower
per-CPU capacity and destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has lowest
priority so any imbalance due to overload is dealt with first.
The modifications are:
1. Only pick a group containing misfit tasks as the busiest group if the
destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
to be pulled ignoring average load.
4. Pick the CPU with the highest misfit load as the source CPU.
5. If the misfit task is alone on the source CPU, go for active
balancing.
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 49 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe04315d57b3..24fe39e57bc3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6890,6 +6890,7 @@ struct lb_env {
unsigned int loop_max;
enum fbq_type fbq_type;
+ enum group_type src_grp_type;
struct list_head tasks;
};
@@ -7873,6 +7874,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
{
struct sg_lb_stats *busiest = &sds->busiest_stat;
+ /*
+ * Don't try to pull misfit tasks we can't help.
+ * We can use max_capacity here as reduction in capacity on some
+ * CPUs in the group should either be possible to resolve
+ * internally or be covered by avg_load imbalance (eventually).
+ */
+ if (sgs->group_type == group_misfit_task &&
+ (!group_smaller_max_cpu_capacity(sg, sds->local) ||
+ !group_has_capacity(env, &sds->local_stat)))
+ return false;
+
if (sgs->group_type > busiest->group_type)
return true;
@@ -7895,6 +7907,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
group_smaller_min_cpu_capacity(sds->local, sg))
return false;
+ /*
+ * If we have more than one misfit sg go with the biggest misfit.
+ */
+ if (sgs->group_type == group_misfit_task &&
+ sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+ return false;
+
asym_packing:
/* This is the busiest node in its class. */
if (!(env->sd->flags & SD_ASYM_PACKING))
@@ -8192,8 +8211,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* factors in sg capacity and sgs with smaller group_type are
* skipped when updating the busiest sg:
*/
- if (busiest->avg_load <= sds->avg_load ||
- local->avg_load >= sds->avg_load) {
+ if (busiest->group_type != group_misfit_task &&
+ (busiest->avg_load <= sds->avg_load ||
+ local->avg_load >= sds->avg_load)) {
env->imbalance = 0;
return fix_small_imbalance(env, sds);
}
@@ -8227,6 +8247,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
(sds->avg_load - local->avg_load) * local->group_capacity
) / SCHED_CAPACITY_SCALE;
+ /* Boost imbalance to allow misfit task to be balanced. */
+ if (busiest->group_type == group_misfit_task) {
+ env->imbalance = max_t(long, env->imbalance,
+ busiest->group_misfit_task_load);
+ }
+
/*
* if *imbalance is less than the average load per runnable task
* there is no guarantee that any tasks will be moved so we'll have
@@ -8293,6 +8319,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
busiest->group_no_capacity)
goto force_balance;
+ /* Misfit tasks should be dealt with regardless of the avg load */
+ if (busiest->group_type == group_misfit_task)
+ goto force_balance;
+
/*
* If the local group is busier than the selected busiest group
* don't try and pull any tasks.
@@ -8330,6 +8360,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
force_balance:
/* Looks like there is an imbalance. Compute it */
+ env->src_grp_type = busiest->group_type;
calculate_imbalance(env, &sds);
return env->imbalance ? sds.busiest : NULL;
@@ -8377,6 +8408,19 @@ static struct rq *find_busiest_queue(struct lb_env *env,
if (rt > env->fbq_type)
continue;
+ /*
+ * For ASYM_CPUCAPACITY domains with misfit tasks we simply
+ * seek the "biggest" misfit task.
+ */
+ if (env->src_grp_type == group_misfit_task) {
+ if (rq->misfit_task_load > busiest_load) {
+ busiest_load = rq->misfit_task_load;
+ busiest = rq;
+ }
+
+ continue;
+ }
+
capacity = capacity_of(i);
wl = weighted_cpuload(rq);
@@ -8446,6 +8490,9 @@ static int need_active_balance(struct lb_env *env)
return 1;
}
+ if (env->src_grp_type == group_misfit_task)
+ return 1;
+
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}
Commit-ID: 5fbdfae5221a5208ed8e7653fc1c4b31de420f74
Gitweb: https://git.kernel.org/tip/5fbdfae5221a5208ed8e7653fc1c4b31de420f74
Author: Valentin Schneider <[email protected]>
AuthorDate: Wed, 4 Jul 2018 11:17:43 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 10 Sep 2018 11:05:51 +0200
sched/fair: Kick nohz balance if rq->misfit_task_load
There already are a few conditions in nohz_kick_needed() to ensure
a nohz kick is triggered, but they are not enough for some misfit
task scenarios. Excluding asym packing, those are:
- rq->nr_running >=2: Not relevant here because we are running a
misfit task, it needs to be migrated regardless and potentially through
active balance.
- sds->nr_busy_cpus > 1: If there is only the misfit task being run
on a group of low capacity CPUs, this will be evaluated to False.
- rq->cfs.h_nr_running >=1 && check_cpu_capacity(): Not relevant here,
misfit task needs to be migrated regardless of rt/IRQ pressure
As such, this commit adds an rq->misfit_task_load condition to trigger a
nohz kick.
The idea to kick a nohz balance for misfit tasks originally came from
Leo Yan <[email protected]>, and a similar patch was submitted for
the Android Common Kernel - see:
https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 24fe39e57bc3..e08287d3806f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9135,7 +9135,7 @@ static void nohz_balancer_kick(struct rq *rq)
if (time_before(now, nohz.next_balance))
goto out;
- if (rq->nr_running >= 2) {
+ if (rq->nr_running >= 2 || rq->misfit_task_load) {
flags = NOHZ_KICK_MASK;
goto out;
}
Commit-ID: dbbad719449e06d73db21598d6eee178f7a54b3b
Gitweb: https://git.kernel.org/tip/dbbad719449e06d73db21598d6eee178f7a54b3b
Author: Valentin Schneider <[email protected]>
AuthorDate: Wed, 4 Jul 2018 11:17:44 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 10 Sep 2018 11:05:51 +0200
sched/fair: Change 'prefer_sibling' type to bool
This variable is entirely local to update_sd_lb_stats, so we can
safely change its type and slightly clean up its initialisation.
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e08287d3806f..23017939ecab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7982,11 +7982,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
- int load_idx, prefer_sibling = 0;
+ int load_idx;
bool overload = false;
-
- if (child && child->flags & SD_PREFER_SIBLING)
- prefer_sibling = 1;
+ bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
#ifdef CONFIG_NO_HZ_COMMON
if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
Commit-ID: 575638d1047eb057a5cdf95cc0b3c084e1279508
Gitweb: https://git.kernel.org/tip/575638d1047eb057a5cdf95cc0b3c084e1279508
Author: Valentin Schneider <[email protected]>
AuthorDate: Wed, 4 Jul 2018 11:17:45 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 10 Sep 2018 11:05:52 +0200
sched/core: Change root_domain->overload type to int
sizeof(_Bool) is implementation defined, so let's just go with 'int' as
is done for other structures e.g. sched_domain_shared->has_idle_cores.
The local 'overload' variable used in update_sd_lb_stats can remain
bool, as it won't impact any struct layout and can be assigned to the
root_domain field.
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/sched.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fe17e0be2d7b..4d181478c5b8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -716,7 +716,7 @@ struct root_domain {
cpumask_var_t online;
/* Indicate more than one runnable task for any CPU */
- bool overload;
+ int overload;
/*
* The bit corresponding to a CPU gets set here if such CPU has more
@@ -1698,7 +1698,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
if (prev_nr < 2 && rq->nr_running >= 2) {
#ifdef CONFIG_SMP
if (!rq->rd->overload)
- rq->rd->overload = true;
+ rq->rd->overload = 1;
#endif
}
Commit-ID: e90c8fe15a3bf93a23088bcf1a56a0fa391d4e50
Gitweb: https://git.kernel.org/tip/e90c8fe15a3bf93a23088bcf1a56a0fa391d4e50
Author: Valentin Schneider <[email protected]>
AuthorDate: Wed, 4 Jul 2018 11:17:46 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 10 Sep 2018 11:05:52 +0200
sched/fair: Wrap rq->rd->overload accesses with READ/WRITE_ONCE()
This variable can be read and set locklessly within update_sd_lb_stats().
As such, READ/WRITE_ONCE() are added to make sure nothing terribly wrong
can happen because of the compiler.
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 6 +++---
kernel/sched/sched.h | 4 ++--
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23017939ecab..d9c4e97bfebd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8058,8 +8058,8 @@ next_group:
if (!env->sd->parent) {
/* update overload indicator if we are at root domain */
- if (env->dst_rq->rd->overload != overload)
- env->dst_rq->rd->overload = overload;
+ if (READ_ONCE(env->dst_rq->rd->overload) != overload)
+ WRITE_ONCE(env->dst_rq->rd->overload, overload);
}
}
@@ -9502,7 +9502,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
rq_unpin_lock(this_rq, rf);
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !this_rq->rd->overload) {
+ !READ_ONCE(this_rq->rd->overload)) {
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4d181478c5b8..938063639793 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1697,8 +1697,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
if (prev_nr < 2 && rq->nr_running >= 2) {
#ifdef CONFIG_SMP
- if (!rq->rd->overload)
- rq->rd->overload = 1;
+ if (!READ_ONCE(rq->rd->overload))
+ WRITE_ONCE(rq->rd->overload, 1);
#endif
}
Commit-ID: 757ffdd705ee942fc8150b17942d968601d2a15b
Gitweb: https://git.kernel.org/tip/757ffdd705ee942fc8150b17942d968601d2a15b
Author: Valentin Schneider <[email protected]>
AuthorDate: Wed, 4 Jul 2018 11:17:47 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 10 Sep 2018 11:05:53 +0200
sched/fair: Set rq->rd->overload when misfit
Idle balance is a great opportunity to pull a misfit task. However,
there are scenarios where misfit tasks are present but idle balance is
prevented by the overload flag.
A good example of this is a workload of n identical tasks. Let's suppose
we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly
CPU-intensive tasks - for the sake of simplicity let's say they are just
CPU hogs, even when running on big CPUs.
They are identical tasks, so on an SMP system they should all end at
(roughly) the same time. However, in our case the LITTLE CPUs are less
performing than the big CPUs, so tasks running on the LITTLEs will have
a longer completion time.
This means that the big CPUs will complete their work earlier, at which
point they should pull the tasks from the LITTLEs. What we want to
happen is summarized as follows:
a,b,c,d are our CPU-hogging tasks _ signifies idling
LITTLE_0 | a a a a _ _
LITTLE_1 | b b b b _ _
---------|-------------
big_0 | c c c c a a
big_1 | d d d d b b
^
^
Tasks end on the big CPUs, idle balance happens
and the misfit tasks are pulled straight away
This however won't happen, because currently the overload flag is only
set when there is any CPU that has more than one runnable task - which
may very well not be the case here if our CPU-hogging workload is all
there is to run.
As such, this commit sets the overload flag in update_sg_lb_stats when
a group is flagged as having a misfit task.
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 6 ++++--
kernel/sched/sched.h | 6 +++++-
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d9c4e97bfebd..8b228c5b3eb4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7793,7 +7793,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
* @load_idx: Load index of sched_domain of this_cpu for load calc.
* @local_group: Does group contain this_cpu.
* @sgs: variable to hold the statistics for this group.
- * @overload: Indicate more than one runnable task for any CPU.
+ * @overload: Indicate pullable load (e.g. >1 runnable task).
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
@@ -7837,8 +7837,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->idle_cpus++;
if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
- sgs->group_misfit_task_load < rq->misfit_task_load)
+ sgs->group_misfit_task_load < rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
+ *overload = 1;
+ }
}
/* Adjust by relative CPU capacity of the group */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 938063639793..85b3a2bf6c2b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -715,7 +715,11 @@ struct root_domain {
cpumask_var_t span;
cpumask_var_t online;
- /* Indicate more than one runnable task for any CPU */
+ /*
+ * Indicate pullable load on at least one CPU, e.g:
+ * - More than one runnable task
+ * - Running task is misfit
+ */
int overload;
/*
Commit-ID: 4ad3831a9d4af5e36da5d44a3b9c6522d0353cee
Gitweb: https://git.kernel.org/tip/4ad3831a9d4af5e36da5d44a3b9c6522d0353cee
Author: Chris Redpath <[email protected]>
AuthorDate: Wed, 4 Jul 2018 11:17:48 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 10 Sep 2018 11:05:53 +0200
sched/fair: Don't move tasks to lower capacity CPUs unless necessary
When lower capacity CPUs are load balancing and considering to pull
something from a higher capacity group, we should not pull tasks from a
CPU with only one task running as this is guaranteed to impede progress
for that task. If there is more than one task running, load balance in
the higher capacity group would have already made any possible moves to
resolve imbalance and we should make better use of system compute
capacity by moving a task if we still have more than one running.
Signed-off-by: Chris Redpath <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b228c5b3eb4..06ff75f4ac7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8423,6 +8423,17 @@ static struct rq *find_busiest_queue(struct lb_env *env,
capacity = capacity_of(i);
+ /*
+ * For ASYM_CPUCAPACITY domains, don't pick a CPU that could
+ * eventually lead to active_balancing high->low capacity.
+ * Higher per-CPU capacity is considered better than balancing
+ * average load.
+ */
+ if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ capacity_of(env->dst_cpu) < capacity &&
+ rq->nr_running == 1)
+ continue;
+
wl = weighted_cpuload(rq);
/*
Commit-ID: 9c63e84db29bcf584040931ad97c2edd11e35f6c
Gitweb: https://git.kernel.org/tip/9c63e84db29bcf584040931ad97c2edd11e35f6c
Author: Morten Rasmussen <[email protected]>
AuthorDate: Wed, 4 Jul 2018 11:17:50 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 10 Sep 2018 11:05:54 +0200
sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains
The 'prefer sibling' sched_domain flag is intended to encourage
spreading tasks to sibling sched_domain to take advantage of more caches
and core for SMT systems. It has recently been changed to be on all
non-NUMA topology level. However, spreading across domains with CPU
capacity asymmetry isn't desirable, e.g. spreading from high capacity to
low capacity CPUs even if high capacity CPUs aren't overutilized might
give access to more cache but the CPU will be slower and possibly lead
to worse overall throughput.
To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
level immediately below SD_ASYM_CPUCAPACITY.
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/topology.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 2536e1b938f9..7ffad0d3a4eb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1126,7 +1126,7 @@ sd_init(struct sched_domain_topology_level *tl,
| 0*SD_SHARE_CPUCAPACITY
| 0*SD_SHARE_PKG_RESOURCES
| 0*SD_SERIALIZE
- | 0*SD_PREFER_SIBLING
+ | 1*SD_PREFER_SIBLING
| 0*SD_NUMA
| sd_flags
,
@@ -1152,17 +1152,21 @@ sd_init(struct sched_domain_topology_level *tl,
if (sd->flags & SD_ASYM_CPUCAPACITY) {
struct sched_domain *t = sd;
+ /*
+ * Don't attempt to spread across CPUs of different capacities.
+ */
+ if (sd->child)
+ sd->child->flags &= ~SD_PREFER_SIBLING;
+
for_each_lower_domain(t)
t->flags |= SD_BALANCE_WAKE;
}
if (sd->flags & SD_SHARE_CPUCAPACITY) {
- sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 110;
sd->smt_gain = 1178; /* ~15% */
} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
- sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 117;
sd->cache_nice_tries = 1;
sd->busy_idx = 2;
@@ -1173,6 +1177,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd->busy_idx = 3;
sd->idle_idx = 2;
+ sd->flags &= ~SD_PREFER_SIBLING;
sd->flags |= SD_SERIALIZE;
if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
sd->flags &= ~(SD_BALANCE_EXEC |
@@ -1182,7 +1187,6 @@ sd_init(struct sched_domain_topology_level *tl,
#endif
} else {
- sd->flags |= SD_PREFER_SIBLING;
sd->cache_nice_tries = 1;
sd->busy_idx = 2;
sd->idle_idx = 1;