On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
for performance that cpu intensive tasks are aggressively migrated to
high capacity cpus as soon as those become available. The capacity
awareness tweaks already in the wake-up path can't handle this as such
tasks might run or be runnable forever. If they happen to be placed on a
low capacity cpu from the beginning they are stuck there forever while
high capacity cpus may have become available in the meantime.
To address this issue this patch set introduces a new "misfit"
load-balancing scenario in periodic/nohz/newly idle balance which tweaks
the load-balance conditions to ignore load per capacity in certain
cases. Since misfit tasks are commonly running alone on a cpu, more
aggressive active load-balancing is needed too.
The fundamental idea of this patch set has been in Android kernels for a
long time and is absolutely essential for consistent performance on
asymmetric cpu capacity systems.
The patches have been tested on:
1. Arm Juno (r0): 2+4 Cortex A57/A53
2. Hikey960: 4+4 Cortex A73/A53
Test case:
Big cpus are always kept busy. Pin a shorter running sysbench tasks to
big cpus, while creating a longer running set of unpinned sysbench
tasks.
REQUESTS=1000
BIGS="1 2"
LITTLES="0 3 4 5"
# Don't care about the score for those, just keep the bigs busy
for i in $BIGS; do
taskset -c $i sysbench --max-requests=$((REQUESTS / 4)) \
--test=cpu run &>/dev/null &
done
for i in $LITTLES; do
sysbench --max-requests=$REQUESTS --test=cpu run \
| grep "total time:" &
done
wait
Results:
Single runs with completion time of each task
Juno (tip)
total time: 1.2608s
total time: 1.2995s
total time: 1.5954s
total time: 1.7463s
Juno (misfit)
total time: 1.2575s
total time: 1.3004s
total time: 1.5860s
total time: 1.5871s
Hikey960 (tip)
total time: 1.7431s
total time: 2.2914s
total time: 2.5976s
total time: 1.7280s
Hikey960 (misfit)
total time: 1.7866s
total time: 1.7513s
total time: 1.6918s
total time: 1.6965s
10 run summary (tracking longest running task for each run)
Juno Hikey960
avg max avg max
tip 1.7465 1.7469 2.5997 2.6131
misfit 1.6016 1.6192 1.8506 1.9666
Changelog:
v2
- Removed redudant condition in static_key enablement.
- Fixed logic flaw in patch #2 reported by Yi Yao <[email protected]>
- Dropped patch #4 as although the patch seems to make sense no benefit
has been proven.
- Dropped root_domain->overload renaming
- Changed type of root_domain->overload to int
- Wrapped accesses of rq->rd->overload with READ/WRITE_ONCE
- v1 Tested-by: Gaku Inami <[email protected]>
Morten Rasmussen (3):
sched: Add static_key for asymmetric cpu capacity optimizations
sched/fair: Add group_misfit_task load-balance type
sched/fair: Consider misfit tasks when load-balancing
Valentin Schneider (4):
sched/fair: Kick nohz balance if rq->misfit_task
sched: Change root_domain->overload type to int
sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE
sched/fair: Set sd->overload when misfit
kernel/sched/fair.c | 115 +++++++++++++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 15 +++++--
kernel/sched/topology.c | 9 ++++
3 files changed, 115 insertions(+), 24 deletions(-)
--
2.7.4
From: Valentin Schneider <[email protected]>
Idle balance is a great opportunity to pull a misfit task. However,
there are scenarios where misfit tasks are present but idle balance is
prevented by the overload flag.
A good example of this is a workload of n identical tasks. Let's suppose
we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly
CPU-intensive tasks - for the sake of simplicity let's say they are just
CPU hogs, even when running on big CPUs.
They are identical tasks, so on an SMP system they should all end at
(roughly) the same time. However, in our case the LITTLE CPUs are less
performing than the big CPUs, so tasks running on the LITTLEs will have
a longer completion time.
This means that the big CPUs will complete their work earlier, at which
point they should pull the tasks from the LITTLEs. What we want to
happen is summarized as follows:
a,b,c,d are our CPU-hogging tasks
_ signifies idling
LITTLE_0 | a a a a _ _
LITTLE_1 | b b b b _ _
---------|-------------
big_0 | c c c c a a
big_1 | d d d d b b
^
^
Tasks end on the big CPUs, idle balance happens
and the misfit tasks are pulled straight away
This however won't happen, because currently the overload flag is only
set when there is any CPU that has more than one runnable task - which
may very well not be the case here if our CPU-hogging workload is all
there is to run.
As such, this commit sets the overload flag in update_sg_lb_stats when
a group is flagged as having a misfit task.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 6 ++++--
kernel/sched/sched.h | 6 +++++-
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e79ec8760be..7952e9491713 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7964,7 +7964,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
* @load_idx: Load index of sched_domain of this_cpu for load calc.
* @local_group: Does group contain this_cpu.
* @sgs: variable to hold the statistics for this group.
- * @overload: Indicate more than one runnable task for any CPU.
+ * @overload: Indicate pullable load (e.g. >1 runnable task).
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
@@ -8008,8 +8008,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->idle_cpus++;
if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
- !sgs->group_misfit_task_load && rq->misfit_task_load)
+ !sgs->group_misfit_task_load && rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
+ *overload = 1;
+ }
}
/* Adjust by relative CPU capacity of the group */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 31847b7f7d47..2981a162d9c0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -688,7 +688,11 @@ struct root_domain {
cpumask_var_t span;
cpumask_var_t online;
- /* Indicate more than one runnable task for any CPU */
+ /*
+ * Indicate pullable load on at least one CPU, e.g:
+ * - More than one runnable task
+ * - Running task is misfit
+ */
int overload;
/*
--
2.7.4
From: Valentin Schneider <[email protected]>
There already are a few conditions in nohz_kick_needed() to ensure
a nohz kick is triggered, but they are not enough for some misfit
task scenarios. Excluding asym packing, those are:
* rq->nr_running >=2: Not relevant here because we are running a
misfit task, it needs to be migrated regardless and potentially through
active balance.
* sds->nr_busy_cpus > 1: If there is only the misfit task being run
on a group of low capacity cpus, this will be evaluated to False.
* rq->cfs.h_nr_running >=1 && check_cpu_capacity(): Not relevant here,
misfit task needs to be migrated regardless of rt/IRQ pressure
As such, this commit adds an rq->misfit_task condition to trigger a
nohz kick.
The idea to kick a nohz balance for misfit tasks originally came from
Leo Yan <[email protected]>, and a similar patch was submitted for
the Android Common Kernel - see [1].
[1]: https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 496062860733..c791dd7ac9a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9291,7 +9291,7 @@ static void nohz_balancer_kick(struct rq *rq)
if (time_before(now, nohz.next_balance))
goto out;
- if (rq->nr_running >= 2) {
+ if (rq->nr_running >= 2 || rq->misfit_task_load) {
flags = NOHZ_KICK_MASK;
goto out;
}
--
2.7.4
From: Valentin Schneider <[email protected]>
This variable can be read and set locklessly within update_sd_lb_stats().
As such, READ/WRITE_ONCE are added to make sure nothing terribly wrong
can happen because of the compiler.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 6 +++---
kernel/sched/sched.h | 4 ++--
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe4824a37302..4e79ec8760be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8220,8 +8220,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
if (!env->sd->parent) {
/* update overload indicator if we are at root domain */
- if (env->dst_rq->rd->overload != overload)
- env->dst_rq->rd->overload = overload;
+ if (READ_ONCE(env->dst_rq->rd->overload) != overload)
+ WRITE_ONCE(env->dst_rq->rd->overload, overload);
}
}
@@ -9659,7 +9659,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
rq_unpin_lock(this_rq, rf);
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !this_rq->rd->overload) {
+ !READ_ONCE(this_rq->rd->overload)) {
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5379f647016d..31847b7f7d47 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1651,8 +1651,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
if (prev_nr < 2 && rq->nr_running >= 2) {
#ifdef CONFIG_SMP
- if (!rq->rd->overload)
- rq->rd->overload = 1;
+ if (!READ_ONCE(rq->rd->overload))
+ WRITE_ONCE(rq->rd->overload, 1);
#endif
}
--
2.7.4
From: Valentin Schneider <[email protected]>
sizeof(_Bool) is implementation defined, so let's just go with 'int' as
is done for other structures e.g. sched_domain_shared->has_idle_cores.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 7 +++----
kernel/sched/sched.h | 4 ++--
2 files changed, 5 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c791dd7ac9a8..fe4824a37302 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7969,7 +7969,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
int local_group, struct sg_lb_stats *sgs,
- bool *overload)
+ int *overload)
{
unsigned long load;
int i, nr_running;
@@ -7994,7 +7994,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
nr_running = rq->nr_running;
if (nr_running > 1)
- *overload = true;
+ *overload = 1;
#ifdef CONFIG_NUMA_BALANCING
sgs->nr_numa_running += rq->nr_numa_running;
@@ -8143,8 +8143,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
- int load_idx, prefer_sibling = 0;
- bool overload = false;
+ int load_idx, prefer_sibling, overload = 0;
if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 22177dfc1f04..5379f647016d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -689,7 +689,7 @@ struct root_domain {
cpumask_var_t online;
/* Indicate more than one runnable task for any CPU */
- bool overload;
+ int overload;
/*
* The bit corresponding to a CPU gets set here if such CPU has more
@@ -1652,7 +1652,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
if (prev_nr < 2 && rq->nr_running >= 2) {
#ifdef CONFIG_SMP
if (!rq->rd->overload)
- rq->rd->overload = true;
+ rq->rd->overload = 1;
#endif
}
--
2.7.4
On asymmetric cpu capacity systems load intensive tasks can end up on
cpus that don't suit their compute demand. In this scenarios 'misfit'
tasks should be migrated to cpus with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.
Misfit balancing only makes sense between a source group of lower
per-cpu capacity and destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has lowest
priority so any imbalance due to overload is dealt with first.
The modifications are:
1. Only pick a group containing misfit tasks as the busiest group if the
destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
to be pulled ignoring average load.
4. Pick the first cpu with the rq->misfit flag raised as the source cpu.
5. If the misfit task is alone on the source cpu, go for active
balancing.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++++--
1 file changed, 34 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e06c722bc2e..496062860733 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7102,6 +7102,7 @@ struct lb_env {
unsigned int loop_max;
enum fbq_type fbq_type;
+ enum group_type src_grp_type;
struct list_head tasks;
};
@@ -8044,6 +8045,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
{
struct sg_lb_stats *busiest = &sds->busiest_stat;
+ /*
+ * Don't try to pull misfit tasks we can't help.
+ */
+ if (sgs->group_type == group_misfit_task &&
+ (!group_smaller_cpu_capacity(sg, sds->local) ||
+ !group_has_capacity(env, &sds->local_stat)))
+ return false;
+
if (sgs->group_type > busiest->group_type)
return true;
@@ -8363,8 +8372,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* factors in sg capacity and sgs with smaller group_type are
* skipped when updating the busiest sg:
*/
- if (busiest->avg_load <= sds->avg_load ||
- local->avg_load >= sds->avg_load) {
+ if (busiest->group_type != group_misfit_task &&
+ (busiest->avg_load <= sds->avg_load ||
+ local->avg_load >= sds->avg_load)) {
env->imbalance = 0;
return fix_small_imbalance(env, sds);
}
@@ -8398,6 +8408,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
(sds->avg_load - local->avg_load) * local->group_capacity
) / SCHED_CAPACITY_SCALE;
+ /* Boost imbalance to allow misfit task to be balanced. */
+ if (busiest->group_type == group_misfit_task) {
+ env->imbalance = max_t(long, env->imbalance,
+ busiest->group_misfit_task_load);
+ }
+
/*
* if *imbalance is less than the average load per runnable task
* there is no guarantee that any tasks will be moved so we'll have
@@ -8464,6 +8480,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
busiest->group_no_capacity)
goto force_balance;
+ /* Misfit tasks should be dealt with regardless of the avg load */
+ if (busiest->group_type == group_misfit_task)
+ goto force_balance;
+
/*
* If the local group is busier than the selected busiest group
* don't try and pull any tasks.
@@ -8501,6 +8521,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
force_balance:
/* Looks like there is an imbalance. Compute it */
+ env->src_grp_type = busiest->group_type;
calculate_imbalance(env, &sds);
return sds.busiest;
@@ -8548,6 +8569,14 @@ static struct rq *find_busiest_queue(struct lb_env *env,
if (rt > env->fbq_type)
continue;
+ /*
+ * For ASYM_CPUCAPACITY domains with misfit tasks we ignore
+ * load.
+ */
+ if (env->src_grp_type == group_misfit_task &&
+ rq->misfit_task_load)
+ return rq;
+
capacity = capacity_of(i);
wl = weighted_cpuload(rq);
@@ -8617,6 +8646,9 @@ static int need_active_balance(struct lb_env *env)
return 1;
}
+ if (env->src_grp_type == group_misfit_task)
+ return 1;
+
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}
--
2.7.4
The existing asymmetric cpu capacity code should cause minimal overhead
for others. Putting it behind a static_key, it has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 3 +++
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 9 +++++++++
3 files changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3582117e1580..e4ce572113a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6385,6 +6385,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
{
long min_cap, max_cap;
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return 0;
+
min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 22909ffc04fb..017f62069e0b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1140,6 +1140,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+extern struct static_key_false sched_asym_cpucapacity;
struct sched_group_capacity {
atomic_t ref;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 64cc564f5255..b023a5b3665a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -398,6 +398,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
static void update_top_cache_domain(int cpu)
{
@@ -425,6 +426,12 @@ static void update_top_cache_domain(int cpu)
rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
}
+static void update_asym_cpucapacity(int cpu)
+{
+ if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
+ static_branch_enable(&sched_asym_cpucapacity);
+}
+
/*
* Attach the domain 'sd' to 'cpu' as its base domain. Callers must
* hold the hotplug lock.
@@ -1705,6 +1712,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
cpu_attach_domain(sd, d.rd, i);
}
+
+ update_asym_cpucapacity(cpumask_first(cpu_map));
rcu_read_unlock();
if (rq && sched_debug_enabled) {
--
2.7.4
To maximize throughput in systems with asymmetric cpu capacities (e.g.
ARM big.LITTLE) load-balancing has to consider task and cpu utilization
as well as per-cpu compute capacity when load-balancing in addition to
the current average load based load-balancing policy. Tasks with high
utilization that are scheduled on a lower capacity cpu need to be
identified and migrated to a higher capacity cpu if possible to maximize
throughput.
To implement this additional policy an additional group_type
(load-balance scenario) is added: group_misfit_task. This represents
scenarios where a sched_group has one or more tasks that are not
suitable for its per-cpu capacity. group_misfit_task is only considered
if the system is not overloaded or imbalanced (group_imbalanced or
group_overloaded).
Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each cpu is responsible for tracking misfit tasks themselves and update
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on sched_tick.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 57 +++++++++++++++++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 2 ++
2 files changed, 50 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4ce572113a9..1e06c722bc2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -697,6 +697,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
static unsigned long task_h_load(struct task_struct *p);
+static unsigned long capacity_of(int cpu);
/* Give new sched_entity start runnable values to heavy its load in infant time */
void init_entity_runnable_average(struct sched_entity *se)
@@ -1407,7 +1408,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
static unsigned long weighted_cpuload(struct rq *rq);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
-static unsigned long capacity_of(int cpu);
/* Cached statistics for all CPUs within a node */
struct numa_stats {
@@ -3873,6 +3873,30 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
static int idle_balance(struct rq *this_rq, struct rq_flags *rf);
+static inline unsigned long task_util(struct task_struct *p);
+static inline int task_fits_capacity(struct task_struct *p, long capacity)
+{
+ return capacity * 1024 > task_util(p) * capacity_margin;
+}
+
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
+{
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return;
+
+ if (!p) {
+ rq->misfit_task_load = 0;
+ return;
+ }
+
+ if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
+ rq->misfit_task_load = 0;
+ return;
+ }
+
+ rq->misfit_task_load = task_h_load(p);
+}
+
#else /* CONFIG_SMP */
static inline int
@@ -3902,6 +3926,8 @@ static inline int idle_balance(struct rq *rq, struct rq_flags *rf)
return 0;
}
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
+
#endif /* CONFIG_SMP */
static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -5835,7 +5861,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return target;
}
-static inline unsigned long task_util(struct task_struct *p);
static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -6398,7 +6423,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
/* Bring task utilization in sync with prev_cpu */
sync_entity_load_avg(&p->se);
- return min_cap * 1024 < task_util(p) * capacity_margin;
+ return !task_fits_capacity(p, min_cap);
}
/*
@@ -6829,9 +6854,12 @@ done: __maybe_unused;
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);
+ update_misfit_status(p, rq);
+
return p;
idle:
+ update_misfit_status(NULL, rq);
new_tasks = idle_balance(rq, rf);
/*
@@ -7037,6 +7065,13 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
enum fbq_type { regular, remote, all };
+enum group_type {
+ group_other = 0,
+ group_misfit_task,
+ group_imbalanced,
+ group_overloaded,
+};
+
#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
#define LBF_DST_PINNED 0x04
@@ -7578,12 +7613,6 @@ static unsigned long task_h_load(struct task_struct *p)
/********** Helpers for find_busiest_group ************************/
-enum group_type {
- group_other = 0,
- group_imbalanced,
- group_overloaded,
-};
-
/*
* sg_lb_stats - stats of a sched_group required for load_balancing
*/
@@ -7599,6 +7628,7 @@ struct sg_lb_stats {
unsigned int group_weight;
enum group_type group_type;
int group_no_capacity;
+ int group_misfit_task_load; /* A cpu has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -7898,6 +7928,9 @@ group_type group_classify(struct sched_group *group,
if (sg_imbalanced(group))
return group_imbalanced;
+ if (sgs->group_misfit_task_load)
+ return group_misfit_task;
+
return group_other;
}
@@ -7972,6 +8005,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
*/
if (!nr_running && idle_cpu(i))
sgs->idle_cpus++;
+
+ if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ !sgs->group_misfit_task_load && rq->misfit_task_load)
+ sgs->group_misfit_task_load = rq->misfit_task_load;
}
/* Adjust by relative CPU capacity of the group */
@@ -9753,6 +9790,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
+
+ update_misfit_status(curr, rq);
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 017f62069e0b..22177dfc1f04 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -815,6 +815,8 @@ struct rq {
unsigned char idle_balance;
+ unsigned int misfit_task_load;
+
/* For active balancing */
int active_balance;
int push_cpu;
--
2.7.4
Hi
> -----Original Message-----
> From: Morten Rasmussen [mailto:[email protected]]
> Sent: Thursday, March 15, 2018 11:47 PM
> To: [email protected]; [email protected]
> Cc: [email protected]; [email protected]; [email protected]; Gaku Inami
> <[email protected]>; [email protected]; Morten Rasmussen <[email protected]>
> Subject: [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
[snip]
>
> The patches have been tested on:
> 1. Arm Juno (r0): 2+4 Cortex A57/A53
> 2. Hikey960: 4+4 Cortex A73/A53
>
I have tested this on our R-Car again and it works well. In addition,
I confirmed that this patch-set also brings performance improvement to
other benchmarks(e.g. memory load latency in LMbench). You can add:
Tested-by: Gaku Inami <[email protected]>
The patches have been tested on:
3. Renesas R-Car H3 : 4+4 Cortex A57/A53
Results:
Single runs with completion time of each task
R-Car H3 (tip)
total time: 0.9435s
total time: 0.9952s
total time: 1.3511s
total time: 1.6747s
R-Car H3 (misfit)
total time: 0.9387s
total time: 0.9280s
total time: 0.9616s
total time: 0.9934s
10 run summary (tracking longest running task for each run)
R-Car H3
avg max
tip 1.6737 1.6758
misfit 0.9980 1.0409
Regards,
Inami
[snip]
Hi,
On Tue, Mar 20, 2018 at 05:30:00AM +0000, Gaku Inami wrote:
> Hi
>
> > -----Original Message-----
> > From: Morten Rasmussen [mailto:[email protected]]
> > Sent: Thursday, March 15, 2018 11:47 PM
> > To: [email protected]; [email protected]
> > Cc: [email protected]; [email protected]; [email protected]; Gaku Inami
> > <[email protected]>; [email protected]; Morten Rasmussen <[email protected]>
> > Subject: [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
> [snip]
> >
> > The patches have been tested on:
> > 1. Arm Juno (r0): 2+4 Cortex A57/A53
> > 2. Hikey960: 4+4 Cortex A73/A53
> >
>
> I have tested this on our R-Car again and it works well. In addition,
> I confirmed that this patch-set also brings performance improvement to
> other benchmarks(e.g. memory load latency in LMbench). You can add:
>
> Tested-by: Gaku Inami <[email protected]>
>
> The patches have been tested on:
> 3. Renesas R-Car H3 : 4+4 Cortex A57/A53
>
> Results:
> Single runs with completion time of each task
> R-Car H3 (tip)
> total time: 0.9435s
> total time: 0.9952s
> total time: 1.3511s
> total time: 1.6747s
>
> R-Car H3 (misfit)
> total time: 0.9387s
> total time: 0.9280s
> total time: 0.9616s
> total time: 0.9934s
>
> 10 run summary (tracking longest running task for each run)
> R-Car H3
> avg max
> tip 1.6737 1.6758
> misfit 0.9980 1.0409
Thanks a lot for testing again.
Morten
On Thu, Mar 15, 2018 at 02:46:59PM +0000, Morten Rasmussen wrote:
> +static inline unsigned long task_util(struct task_struct *p);
> +static inline int task_fits_capacity(struct task_struct *p, long capacity)
> +{
> + return capacity * 1024 > task_util(p) * capacity_margin;
> +}
> +
> +static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> +{
> + if (!static_branch_unlikely(&sched_asym_cpucapacity))
> + return;
> +
> + if (!p) {
> + rq->misfit_task_load = 0;
> + return;
> + }
> +
> + if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
> + rq->misfit_task_load = 0;
> + return;
> + }
> +
> + rq->misfit_task_load = task_h_load(p);
> +}
So RT/IRQ pressure can also cause misfit..
> @@ -7972,6 +8005,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> */
> if (!nr_running && idle_cpu(i))
> sgs->idle_cpus++;
> +
> + if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> + !sgs->group_misfit_task_load && rq->misfit_task_load)
> + sgs->group_misfit_task_load = rq->misfit_task_load;
> }
Should we not look for the biggest misfit instead of the first?
>
> /* Adjust by relative CPU capacity of the group */
On Thu, Mar 15, 2018 at 02:47:00PM +0000, Morten Rasmussen wrote:
> @@ -8548,6 +8569,14 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> if (rt > env->fbq_type)
> continue;
>
> + /*
> + * For ASYM_CPUCAPACITY domains with misfit tasks we ignore
> + * load.
> + */
> + if (env->src_grp_type == group_misfit_task &&
> + rq->misfit_task_load)
> + return rq;
> +
> capacity = capacity_of(i);
>
> wl = weighted_cpuload(rq);
Similarly, should we not return worst misfit instead of the first?
On Thu, Mar 15, 2018 at 02:47:00PM +0000, Morten Rasmussen wrote:
> + /*
> + * Don't try to pull misfit tasks we can't help.
> + */
> + if (sgs->group_type == group_misfit_task &&
> + (!group_smaller_cpu_capacity(sg, sds->local) ||
> + !group_has_capacity(env, &sds->local_stat)))
> + return false;
> +
I think that has the same 'problem' as last time, right? Since
group_smaller_cpu_capacity() is affected by RT/IRQ pressure.
We should have a comment stating this and preferably explaining why that
is ok.
Hi
with this patch, if enable CONFIG_DEBUG_ATOMIC_SLEEP=y,
then I am getting following BUG report during early startup
Backtrace caused by [1] during early kernel startup:
[ 5.325288] CPU: All CPU(s) started at EL2
[ 5.325700] alternatives: patching kernel code
[ 5.329255] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34
[ 5.329525] in_atomic(): 0, irqs_disabled(): 0, pid: 1, name: swapper/0
[ 5.329657] 2 locks held by swapper/0/1:
[ 5.329744] #0: (sched_domains_mutex){+.+.}, at: [<ffff20000957f244>] sched_init_smp+0x88/0x158
[ 5.329993] #1: (rcu_read_lock){....}, at: [<ffff200008159794>] build_sched_domains+0x9cc/0x2f08
[ 5.330233] Preemption disabled at:
[ 5.330256] [<ffff200008157b5c>] rq_attach_root+0x28/0x1d8
[ 5.330511] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.17+ #123
[ 5.330635] Hardware name: Renesas Salvator-X board based on r8a7795 ES2.0+ (DT)
[ 5.330779] Call trace:
[ 5.330853] [<ffff20000808fe88>] dump_backtrace+0x0/0x364
[ 5.330968] [<ffff200008090200>] show_stack+0x14/0x1c
[ 5.331080] [<ffff200008f365d8>] dump_stack+0x108/0x174
[ 5.331194] [<ffff200008113170>] ___might_sleep+0x43c/0x44c
[ 5.331310] [<ffff2000081132e4>] __might_sleep+0x164/0x178
[ 5.331429] [<ffff2000080b2338>] cpus_read_lock+0x38/0x12c
[ 5.331547] [<ffff2000082d5860>] static_key_enable+0x14/0x2c
[ 5.331665] [<ffff20000815bcac>] build_sched_domains+0x2ee4/0x2f08
[ 5.331789] [<ffff20000815cc9c>] sched_init_domains+0xcc/0xe8
[ 5.331908] [<ffff20000957f250>] sched_init_smp+0x94/0x158
[ 5.332026] [<ffff200009571560>] kernel_init_freeable+0x1ec/0x4c4
[ 5.332153] [<ffff200008f593a8>] kernel_init+0x10/0x128
[ 5.332264] [<ffff2000080865d4>] ret_from_fork+0x10/0x18
[ 5.343400] devtmpfs: initialized
Thanks,
Jiada
Hi,
On 27/04/18 15:04, Jiada Wang wrote:
> Hi
>
> with this patch, if enable CONFIG_DEBUG_ATOMIC_SLEEP=y,
> then I am getting following BUG report during early startup
>
Thanks for bringing that up.
> Backtrace caused by [1] during early kernel startup:
> [ 5.325288] CPU: All CPU(s) started at EL2
> [ 5.325700] alternatives: patching kernel code
> [ 5.329255] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34
> [ 5.329525] in_atomic(): 0, irqs_disabled(): 0, pid: 1, name: swapper/0
> [ 5.329657] 2 locks held by swapper/0/1:
> [ 5.329744] #0: (sched_domains_mutex){+.+.}, at: [<ffff20000957f244>] sched_init_smp+0x88/0x158
> [ 5.329993] #1: (rcu_read_lock){....}, at: [<ffff200008159794>] build_sched_domains+0x9cc/0x2f08
> [ 5.330233] Preemption disabled at:
> [ 5.330256] [<ffff200008157b5c>] rq_attach_root+0x28/0x1d8
> [ 5.330511] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.17+ #123
> [ 5.330635] Hardware name: Renesas Salvator-X board based on r8a7795 ES2.0+ (DT)
> [ 5.330779] Call trace:
> [ 5.330853] [<ffff20000808fe88>] dump_backtrace+0x0/0x364
> [ 5.330968] [<ffff200008090200>] show_stack+0x14/0x1c
> [ 5.331080] [<ffff200008f365d8>] dump_stack+0x108/0x174
> [ 5.331194] [<ffff200008113170>] ___might_sleep+0x43c/0x44c
> [ 5.331310] [<ffff2000081132e4>] __might_sleep+0x164/0x178
> [ 5.331429] [<ffff2000080b2338>] cpus_read_lock+0x38/0x12c
> [ 5.331547] [<ffff2000082d5860>] static_key_enable+0x14/0x2c
> [ 5.331665] [<ffff20000815bcac>] build_sched_domains+0x2ee4/0x2f08
> [ 5.331789] [<ffff20000815cc9c>] sched_init_domains+0xcc/0xe8
> [ 5.331908] [<ffff20000957f250>] sched_init_smp+0x94/0x158
> [ 5.332026] [<ffff200009571560>] kernel_init_freeable+0x1ec/0x4c4
> [ 5.332153] [<ffff200008f593a8>] kernel_init+0x10/0x128
> [ 5.332264] [<ffff2000080865d4>] ret_from_fork+0x10/0x18
> [ 5.343400] devtmpfs: initialized
>
I tried reproducing this on my HiKey960, and I do get a BUG pointing at a
static_key_enable but at a completely different place...
[ 0.158072] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:238
[ 0.158074] in_atomic(): 1, irqs_disabled(): 128, pid: 0, name: swapper/4
[ 0.158080] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G S 4.16.0-linaro-hikey960 #4
[ 0.158081] Hardware name: HiKey960 (DT)
[ 0.158083] Call trace:
[ 0.158098] dump_backtrace+0x0/0x188
[ 0.158102] show_stack+0x14/0x20
[ 0.158108] dump_stack+0x98/0xbc
[ 0.158113] ___might_sleep+0xf0/0x118
[ 0.158115] __might_sleep+0x50/0x88
[ 0.158118] mutex_lock+0x24/0x60
[ 0.158124] static_key_enable_cpuslocked+0x50/0xc0
[ 0.158130] arch_timer_check_ool_workaround+0x1ac/0x228
[ 0.158133] arch_timer_starting_cpu+0xfc/0x2e8
[ 0.158137] cpuhp_invoke_callback+0xa0/0x228
[ 0.158140] notify_cpu_starting+0x70/0x90
[ 0.158143] secondary_start_kernel+0x128/0x1c8
I went and had a look at the documentation for the static keys, and it mentions
fun stuff can happen with hotplug. I gave it a try and got this:
root@linaro-developer:~# echo 0 > /sys/devices/system/cpu/cpu4/online
[ 1893.765366] CPU4: shutdown
[ 1893.768077] psci: CPU4 killed.
[ 1893.771890] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34
[ 1893.773361] crct10dif_ce: Unknown symbol _mcount (err 0)
[ 1893.777754] in_atomic(): 0, irqs_disabled(): 0, pid: 3392, name: kworker/4:0
[ 1893.799136] CPU: 0 PID: 3392 Comm: kworker/4:0 Tainted: G S W 4.16.0-linaro-hikey960 #4
[ 1893.808180] Hardware name: HiKey960 (DT)
[ 1893.812110] Workqueue: events cpuset_hotplug_workfn
[ 1893.816984] Call trace:
[ 1893.819430] dump_backtrace+0x0/0x188
[ 1893.823088] show_stack+0x14/0x20
[ 1893.826401] dump_stack+0x98/0xbc
[ 1893.829712] ___might_sleep+0xf0/0x118
[ 1893.833455] __might_sleep+0x50/0x88
[ 1893.837028] cpus_read_lock+0x1c/0x90
[ 1893.840689] static_key_enable+0x14/0x30
[ 1893.844608] build_sched_domains+0xe4c/0xf00
[ 1893.848874] partition_sched_domains+0x2c8/0x410
[ 1893.853486] rebuild_sched_domains_locked+0xe4/0x430
[ 1893.858446] rebuild_sched_domains+0x20/0x38
[ 1893.862712] cpuset_hotplug_workfn+0x28c/0x6b8
[ 1893.867153] process_one_work+0x114/0x330
[ 1893.871158] worker_thread+0x130/0x470
[ 1893.874903] kthread+0x104/0x130
[ 1893.878126] ret_from_fork+0x10/0x18
This seems to be complaining about holding 'sched_domains_mutex' while
taking the hotplug lock before flipping the static key.
Thing is, both callers of build_sched_domains():
- sched_init_domains()
- partition_sched_domains()
mention that they must be called with the hotplug lock held, so I figured
we could use that information to change the static key call (see snippet
below).
It does suppress the warning, and I *think* it's not completely insane -
assuming the comments about the hotplug lock are still up to date, and with
the exception of sched_init_smp() which doesn't care about hotplugs it seems
to be the case.
SMT also uses a static key but avoids this issue by being enabled outside of
sched_init_domains(), and from what I see it's just set once and for all.
I'm not sure we can use the same approach since we might not always be able
to detect asymmetry this early.
> Thanks,
> Jiada
>
Cheers,
Valentin
--->8---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b023a5b..89e502e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -428,8 +428,10 @@ static void update_top_cache_domain(int cpu)
static void update_asym_cpucapacity(int cpu)
{
- if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
- static_branch_enable(&sched_asym_cpucapacity);
+ if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY)) {
+ /* This expects the hotplug lock to be held */
+ static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
+ }
}
/*